Closed Bug 1465420 Opened 7 years ago Closed 7 years ago

Some intermittent bugs not suggested any more for failures

Categories

(Tree Management :: Treeherder, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aryx, Assigned: emorley)

Details

I've just cleared the production Redis cache (which is where the bug suggestions are cached), and the link in comment 1 now returns the correct suggestions for me. (And the other links look improved now too) Do things look better for you?
(Clearing the Redis cache also logs out all users, in case you were thinking bug 1465355 was back)
Thank you, suggestions are back to expected ones.
Severity: blocker → normal
I don't understand how the cache entries could have been affected, since there were no New Relic errors during the deployment, and the cache result should only be stored in case of success: https://github.com/mozilla/treeherder/blob/df5cb3fcf400c0da06098e6cac88e5f8c02a06e4/treeherder/model/error_summary.py#L22-L39
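To illustrate the "only stored in case of success" behaviour described above, here is a minimal sketch in Python 3. It is not the Treeherder code: `DictCache` is a hypothetical in-memory stand-in for the Redis-backed cache, and `get_suggestions` and `run_query` are illustrative names.

```python
class DictCache:
    """Hypothetical in-memory stand-in for a Redis-backed cache."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value


def get_suggestions(cache, key, run_query):
    """Return cached suggestions, or compute them and cache on success.

    `run_query` is an assumed zero-argument callable that performs the lookup.
    """
    cached = cache.get(key)
    if cached is not None:
        return cached

    # If run_query() raises, the exception propagates and nothing is cached.
    result = run_query()

    # Reached only when the query did not raise.
    cache.set(key, result)
    return result
```

With this shape a raised exception never reaches the cache, but a query that succeeds against incomplete data is cached like any other success and keeps being served until the entry expires or the cache is cleared.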
To give more context on comment 6: I was initially thinking that if logs were parsed whilst the migration was running, the queries might have failed. However:
* the job in comment 1 completed at 12:12:40 UTC, which is well after the 08:25 UTC deploy
* the cache shouldn't be saved in the case of a failed query
* if the query had failed we should have seen it in New Relic
This seems suspiciously close in timing to the finish time of that job (I've adjusted the timestamp to show UTC):
May 30 12:15:30 treeherder-prod app/worker_default.2: [2018-05-30 12:15:30,110: ERROR/MainProcess] Hard time limit (930s) exceeded for fetch-bugs[b1aa8125-d0fc-4b2b-95eb-cb55dd23c8cf]
Looking at the bugscache population code, it seems at first glance that this shouldn't be a problem, since only old bugs are removed from the table (rather than truncating the table and then re-populating it, as it used to do a few years ago, which is prone to race conditions and leaves the table empty if the whole task fails). However, perhaps this is buggy and not doing what we think?
https://github.com/mozilla/treeherder/blob/df5cb3fcf400c0da06098e6cac88e5f8c02a06e4/treeherder/etl/bugzilla.py#L47-L49
ie:
* instead of deleting just the old bugs, all bugs are deleted
* normally the bugs table is then re-populated soon after (though that would still leave a window where some bugs were missing)
* in this case, since the fetch-bugs task timed out, the table was left with lots of bugs missing, until it ran again the next hour
Hmm seems to work fine:
"""
$ ths run ./manage.py shell
Running ./manage.py shell on ⬢ treeherder-stage... up, run.7771 (Standard-1X)
...
>>> from treeherder.model.models import Bugscache
>>> bugs_stored = set(Bugscache.objects.values_list('id', flat=True))
>>> len(bugs_stored)
22080
>>> Bugscache.objects.first().id
473680L
>>> bug_list = [{'id': 1}, {'id': 2}, {'id': 473680L}]
>>> old_bugs = bugs_stored.difference(set(bug['id'] for bug in bug_list))
>>> len(old_bugs)
22079
"""
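For reference, the scenario being ruled out here (everything deleted, then a timeout before the table is repopulated) can be sketched as below. This is a rough Python 3 model, not the real `fetch-bugs` task: the table is a plain dict, and `refresh`, `TaskTimeout`, `delete_all` and `fail_after` are hypothetical names used only to simulate an interruption partway through.

```python
class TaskTimeout(Exception):
    """Hypothetical stand-in for a worker hitting its hard time limit."""


def refresh(table, fetched, delete_all=False, fail_after=None):
    """Refresh `table` (id -> summary) from `fetched` (list of bug dicts).

    delete_all=False removes only ids absent from `fetched` (the intended
    behaviour); delete_all=True models the suspected bug of removing
    everything first. fail_after=N simulates a timeout after N upserts.
    """
    fetched_ids = {bug["id"] for bug in fetched}
    stale = set(table) if delete_all else set(table) - fetched_ids

    for bug_id in stale:
        del table[bug_id]

    for count, bug in enumerate(fetched):
        if fail_after is not None and count >= fail_after:
            raise TaskTimeout()
        table[bug["id"]] = bug["summary"]


if __name__ == "__main__":
    fetched = [
        {"id": 2, "summary": "b2"},
        {"id": 3, "summary": "c2"},
        {"id": 4, "summary": "d"},
    ]
    for delete_all in (False, True):
        table = {1: "a", 2: "b", 3: "c"}
        try:
            refresh(table, fetched, delete_all=delete_all, fail_after=1)
        except TaskTimeout:
            pass
        print(delete_all, sorted(table))
```

In this model, an interrupted run with `delete_all=False` still leaves the not-yet-updated bugs in place, whereas `delete_all=True` leaves only the rows written before the timeout. The shell session above suggests the set difference behaves like the first case.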
Clearing the Redis cache fixed this, and unless it occurs again I'm out of low-hanging fruit to investigate.
Assignee: nobody → emorley
Status: NEW → RESOLVED
Closed: 7 years ago
Component: Treeherder → Treeherder: Log Parsing & Classification
Flags: needinfo?(ghickman)
Priority: -- → P1
Resolution: --- → FIXED
Summary: some intermittent bugs not suggested anymore for failures → Some intermittent bugs not suggested any more for failures
Component: Treeherder: Log Parsing & Classification → TreeHerder