Closed Bug 1085682 Opened 10 years ago Closed 10 years ago

Analyse cases where we had to use the pushlog ingestion fallback

Categories

(Tree Management :: Treeherder: Data Ingestion, defect, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

References

Details

Bug 1077136 added a workaround for missing pushes: if we find a job that refers to a push that is unknown to treeherder, then manually try and ingest that push again.

This solves the initial urgency of the problem and was the best first step for us to do, however:
1) It's a workaround for the real problem.
2) Until a result set has a job, it still doesn't appear (so there's an additional lag).
3) Some result sets may never get a job (eg due to DONTBUILD or coalescing) and so never appear.
4) Even when we hit this fallback, the UI (at least currently) doesn't update, so the user has to manually refresh.

In this bug it would be good to:
a) Add additional logging to the initial pushlog ingestion process.
b) Analyse the log output added as part of bug 1077136 and cross reference with that from #a, to see if we can find the exact times pushes were missed, which might lead to the root cause (eg see if they always occur when we've done a prod push, or at peak load times of day).
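As a rough illustration of the analysis in (b), the sketch below buckets fallback warnings by hour of day so that clusters around prod pushes or peak load stand out. It assumes the celery log line format quoted later in this bug ("Found builds4h jobs with missing resultsets"); the function name and regex are hypothetical and not part of treeherder.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed line format, based on the celery warning quoted later in this bug:
# [2015-01-23 15:52:08,143: WARNING/Worker-4] Found builds4h jobs with missing resultsets. ...
FALLBACK_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+: WARNING/[^\]]+\] "
    r"Found builds4h jobs with missing resultsets"
)


def fallback_hits_by_hour(lines):
    """Count fallback warnings per hour of day (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = FALLBACK_RE.match(line)
        if match is None:
            continue  # not a fallback warning, ignore
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
        counts[ts.hour] += 1
    return counts
```

The lines would come from reading the worker log, for example `fallback_hits_by_hour(open(path))`. The resulting counts could then be compared against deploy times or the extra logging from (a).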
Blocks: 1080757
No longer blocks: treeherder-dev-transition
Blocks: 1090289
Blocks: 1090441
Note to self: /var/log/celery/celery_worker_pushlog.log on treeherder-etl[12]
Assignee: nobody → emorley
No longer blocks: 1080757
Component: Treeherder → Treeherder: Data Ingestion
I think it's only worth looking into this once bug 1090441 is fixed - since without that, at least a proportion of the pushlog ingestion fallback cases will just be bad luck + races between the normal scheduled pushlog ingestion and the builds-pending.js ingestion.
Assignee: emorley → nobody
No longer blocks: 1090441
Depends on: 1090441
Though that said, some initial observations:
1) All of the fallback cases I could see were for Try - I'm guessing we're more likely to time out fetching the pushlog / get a 500 for it, and so not get the push before the job is ingested. Perhaps we need to increase the original pushlog ingestion timeout?
2) There were many many duplicate revisions being requested - it seems as though we queue up hundreds of the same revision in the fetch-missing-push-logs task queue.
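To illustrate observation 2, one way to avoid queueing the same revision hundreds of times would be to collapse requests before scheduling them. This is only a sketch under assumptions: the (repository, revision) tuple shape and the function name are hypothetical, and it is not necessarily how treeherder addresses the duplicates.

```python
def dedupe_missing_revisions(requests):
    """Collapse repeated (repository, revision) pairs, keeping first-seen order.

    `requests` is assumed to be an iterable of (repository, revision) tuples;
    that shape is an illustration, not treeherder's actual task payload.
    """
    seen = set()
    unique = []
    for repository, revision in requests:
        key = (repository, revision)
        if key in seen:
            continue  # already scheduled for re-fetch
        seen.add(key)
        unique.append(key)
    return unique
```

Deduplicating only within one batch would not stop the same revision being queued again by a later batch, so some shared "already requested" state would probably also be needed.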
Depends on: 1118068
(In reply to Ed Morley [:edmorley] from comment #3)
> 2) There were many many duplicate revisions being requested - it seems as
> though we queue up hundreds of the same revision in the
> fetch-missing-push-logs task queue.

Filed bug 1118068. The noise in the logs will be much lower once these two additional deps are fixed, so let's hold off here until then.
Saw a few of these whilst debugging bug 1125410:

[2015-01-23 15:52:08,143: WARNING/Worker-4] Found builds4h jobs with missing resultsets. Scheduling re-fetch: defaultdict(<type 'set'>, {'mozilla-aurora': ['dca8fe9d9425']})

The revision doesn't exist on mozilla-aurora. The l10n jobs lie and put the wrong revision in builds-4hr:

{
  "builder_id": 239152,
  "buildnumber": 555,
  "endtime": 1422045230,
  "id": 57291749,
  "master_id": 124,
  "properties": {
    "app": "browser",
    "appName": "Firefox",
    "appVersion": "37.0a2",
    "aws_ami_id": "ami-36502b5e",
    "aws_instance_id": "i-8d849d61",
    "aws_instance_type": "r3.xlarge",
    "basedir": "/builds/slave/m-aurora-l64-l10n-dep-00000000",
    "branch": "mozilla-aurora",
    "builddir": "m-aurora-l64-l10n-dep-00000000",
    "buildername": "Firefox mozilla-aurora linux64 l10n dep",
    "buildid": "20150123004028",
    "buildnumber": 555,
    "commit_titles": [
      "update Punjabi Translation: merge"
    ],
    "en_revision": "default",
    "forced_clobber": false,
    "fx_revision": "79fcabb42355",
    "hashType": "sha512",
    "inipath": "dist/l10n-stage/firefox/application.ini",
    "l10n_revision": "dca8fe9d9425",
    "locale": "pa-IN",
    "log_url": "http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-aurora-l10n/mozilla-aurora-linux64-l10n-dep-pa-IN-bm71-build1-build555.txt.gz",
    "master": "http://buildbot-master71.srv.releng.use1.mozilla.com:8001/",
    "packageUrl": "http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-aurora-l10n/firefox-37.0a2.pa-IN.linux-x86_64.tar.bz2",
    "periodic_clobber": false,
    "placement/availability_zone": "us-east-1a",
    "platform": "linux64",
    "product": "firefox",
    "project": "",
    "purge_actual": "28.19GB",
    "purge_target": "3GB",
    "purged_clobber": true,
    "repository": "",
    "request_ids": [
      60056556
    ],
    "request_times": {
      "60056556": 1422044215
    },
    "revision": "dca8fe9d94258eb617529c22107e0e2c7c222025",
    "scheduler": "mozilla-aurora l10n",
    "slavebuilddir": "m-aurora-l64-l10n-dep-00000000",
    "slavename": "bld-linux64-spot-1019",
    "stage_platform": "linux64",
    "toolsdir": "/builds/slave/m-aurora-l64-l10n-dep-00000000/tools",
    "tree": "fxaurora"
  },
  "reason": "scheduler",
  "request_ids": [
    60056556
  ],
  "requesttime": 1422044215,
  "result": 0,
  "slave_id": 8387,
  "starttime": 1422044283
},
Depends on: 1125433
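In the builds-4hr entry quoted above, the "revision" property starts with the "l10n_revision" value rather than "fx_revision", which is why the push can't be found on mozilla-aurora. A minimal sketch of a check for that pattern follows; the function name is hypothetical, and whether such jobs should be skipped or remapped is left open here.

```python
def revision_is_l10n_revision(properties):
    """Return True if a job's "revision" looks like its l10n repo revision.

    `properties` is the "properties" dict of a builds-4hr entry. This is a
    hypothetical helper for illustration, not existing treeherder code.
    """
    revision = properties.get("revision") or ""
    l10n_revision = properties.get("l10n_revision") or ""
    # The short l10n hash is a prefix of the full 40-character revision.
    return bool(l10n_revision) and revision.startswith(l10n_revision)
```

For the entry above this returns True, since "dca8fe9d94258eb617529c22107e0e2c7c222025" begins with "dca8fe9d9425".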
Assignee: nobody → emorley
Priority: P2 → P3
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED