Analyse cases where we had to use the pushlog ingestion fallback

RESOLVED FIXED

Status

P3
normal
RESOLVED FIXED
4 years ago
3 years ago

People

(Reporter: emorley, Assigned: emorley)

(Assignee)

Description

4 years ago
Bug 1077136 added a workaround for missing pushes: if we find a job that refers to a push that is unknown to Treeherder, we explicitly try to ingest that push again.

This addressed the immediate urgency of the problem and was the best first step for us to take, however:
1) It's a workaround rather than a fix for the real problem.
2) Until a result set has a job, it still doesn't appear (so there's an additional lag).
3) Some result sets may never get a job (e.g. due to DONTBUILD or coalescing) and so never appear.
4) Even when we hit this fallback, the UI (at least currently) doesn't update, so the user has to manually refresh.

In this bug it would be good to:
a) Add additional logging to the initial pushlog ingestion process.
b) Analyse the log output added as part of bug 1077136 and cross-reference it with that from (a), to see if we can find the exact times pushes were missed. That might lead to the root cause (e.g. see if misses always occur when we've done a prod push, or at peak load times of day).
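As a starting point for (b), a minimal sketch of bucketing the fallback warnings by hour so they can be compared against prod push times or load peaks. It assumes the log lines look like the celery worker warning "[2015-01-23 15:52:08,143: WARNING/Worker-4] Found builds4h jobs with missing resultsets."; the function name and the regex are illustrative, not existing Treeherder code.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed line shape (celery worker warning), e.g.:
# [2015-01-23 15:52:08,143: WARNING/Worker-4] Found builds4h jobs with missing resultsets. ...
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+: WARNING/[^\]]+\] "
    r"Found \S+ jobs with missing resultsets"
)


def fallback_hits_per_hour(lines):
    """Count fallback warnings per 'YYYY-MM-DD HH:00' bucket (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # not a fallback warning
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
        counts[ts.strftime("%Y-%m-%d %H:00")] += 1
    return counts
```

Feeding it the lines of the pushlog worker log and sorting the result by count should show whether the misses cluster around particular times.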
(Assignee)

Updated

4 years ago
Blocks: 1080757
No longer blocks: 1059400
(Assignee)

Updated

4 years ago
Blocks: 1090289
(Assignee)

Updated

4 years ago
Blocks: 1090441
(Assignee)

Comment 1

4 years ago
Note to self:
/var/log/celery/celery_worker_pushlog.log on treeherder-etl[12]
(Assignee)

Updated

4 years ago
Assignee: nobody → emorley
(Assignee)

Updated

4 years ago
No longer blocks: 1080757
Component: Treeherder → Treeherder: Data Ingestion
(Assignee)

Comment 2

4 years ago
I think it's only worth looking into this once bug 1090441 is fixed, since without that at least a proportion of the pushlog ingestion fallback cases will just be bad luck and races between the normal scheduled pushlog ingestion and the builds-pending.js ingestion.
Assignee: emorley → nobody
No longer blocks: 1090441
Depends on: 1090441
(Assignee)

Comment 3

4 years ago
Though that said, some initial observations:
1) All of the fallback cases I could see were for Try. I'm guessing that for Try we're more likely to time out fetching the pushlog (or get a 500 for it), and so not have the push by the time the job is ingested. Perhaps we need to increase the original pushlog ingestion timeout?
2) There were many many duplicate revisions being requested - it seems as though we queue up hundreds of the same revision in the fetch-missing-push-logs task queue.
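To illustrate observation 2, a rough sketch of collapsing duplicates before anything is queued, so each missing revision is requested once per repository. The job field names ("repository", "revision") and the helper itself are assumptions for illustration; this is not necessarily how the real task should be changed.

```python
from collections import defaultdict


def collect_missing_revisions(jobs, known_revisions):
    """Group unknown revisions by repository, collapsing duplicates.

    `jobs` is an iterable of dicts with hypothetical "repository" and
    "revision" keys; `known_revisions` maps repository -> set of revisions
    that have already been ingested.
    """
    missing = defaultdict(set)
    for job in jobs:
        repo, revision = job["repository"], job["revision"]
        if revision not in known_revisions.get(repo, set()):
            # A set means hundreds of jobs on one push yield one re-fetch.
            missing[repo].add(revision)
    return missing
```

Using a set per repository keeps the number of re-fetch requests proportional to the number of distinct missing pushes rather than the number of jobs.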
(Assignee)

Updated

4 years ago
Depends on: 1118068
(Assignee)

Comment 4

4 years ago
(In reply to Ed Morley [:edmorley] from comment #3)
> 2) There were many many duplicate revisions being requested - it seems as
> though we queue up hundreds of the same revision in the
> fetch-missing-push-logs task queue.

Filed bug 1118068.

The noise in the logs will be much lower once these two additional deps are fixed, so let's hold off here until then.
(Assignee)

Comment 5

4 years ago
Saw a few of these whilst debugging bug 1125410:

[2015-01-23 15:52:08,143: WARNING/Worker-4] Found builds4h jobs with missing resultsets.  Scheduling re-fetch: defaultdict(<type 'set'>, {'mozilla-aurora': ['dca8fe9d9425']})

The revision doesn't exist on mozilla-aurora.

The l10n jobs lie and put the wrong revision in builds-4hr. In the entry below, "revision" matches "l10n_revision" rather than "fx_revision":

    {
      "builder_id": 239152, 
      "buildnumber": 555, 
      "endtime": 1422045230, 
      "id": 57291749, 
      "master_id": 124, 
      "properties": {
        "app": "browser", 
        "appName": "Firefox", 
        "appVersion": "37.0a2", 
        "aws_ami_id": "ami-36502b5e", 
        "aws_instance_id": "i-8d849d61", 
        "aws_instance_type": "r3.xlarge", 
        "basedir": "/builds/slave/m-aurora-l64-l10n-dep-00000000", 
        "branch": "mozilla-aurora", 
        "builddir": "m-aurora-l64-l10n-dep-00000000", 
        "buildername": "Firefox mozilla-aurora linux64 l10n dep", 
        "buildid": "20150123004028", 
        "buildnumber": 555, 
        "commit_titles": [
          "update Punjabi Translation: merge"
        ], 
        "en_revision": "default", 
        "forced_clobber": false, 
        "fx_revision": "79fcabb42355", 
        "hashType": "sha512", 
        "inipath": "dist/l10n-stage/firefox/application.ini", 
        "l10n_revision": "dca8fe9d9425", 
        "locale": "pa-IN", 
        "log_url": "http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-aurora-l10n/mozilla-aurora-linux64-l10n-dep-pa-IN-bm71-build1-build555.txt.gz", 
        "master": "http://buildbot-master71.srv.releng.use1.mozilla.com:8001/", 
        "packageUrl": "http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-aurora-l10n/firefox-37.0a2.pa-IN.linux-x86_64.tar.bz2", 
        "periodic_clobber": false, 
        "placement/availability_zone": "us-east-1a", 
        "platform": "linux64", 
        "product": "firefox", 
        "project": "", 
        "purge_actual": "28.19GB", 
        "purge_target": "3GB", 
        "purged_clobber": true, 
        "repository": "", 
        "request_ids": [
          60056556
        ], 
        "request_times": {
          "60056556": 1422044215
        }, 
        "revision": "dca8fe9d94258eb617529c22107e0e2c7c222025", 
        "scheduler": "mozilla-aurora l10n", 
        "slavebuilddir": "m-aurora-l64-l10n-dep-00000000", 
        "slavename": "bld-linux64-spot-1019", 
        "stage_platform": "linux64", 
        "toolsdir": "/builds/slave/m-aurora-l64-l10n-dep-00000000/tools", 
        "tree": "fxaurora"
      }, 
      "reason": "scheduler", 
      "request_ids": [
        60056556
      ], 
      "requesttime": 1422044215, 
      "result": 0, 
      "slave_id": 8387, 
      "starttime": 1422044283
    },
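One possible way to spot these entries before scheduling a re-fetch is to compare the two properties directly. This is only a sketch under the assumption that such jobs always carry an "l10n_revision" property that is a prefix of "revision", as in the example above; the helper name is made up.

```python
def revision_is_l10n(properties):
    """Return True when `revision` appears to be the l10n changeset.

    `properties` is the "properties" dict of a builds-4hr entry. The check
    assumes "l10n_revision" is a short form (prefix) of the full hash.
    """
    revision = properties.get("revision") or ""
    l10n_revision = properties.get("l10n_revision") or ""
    return bool(l10n_revision) and revision.startswith(l10n_revision)
```

For the entry above this returns True, so the revision would not be looked up on mozilla-aurora.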
(Assignee)

Updated

4 years ago
Depends on: 1125433
(Assignee)

Updated

4 years ago
Assignee: nobody → emorley
(Assignee)

Updated

4 years ago
Priority: P2 → P3
(Assignee)

Updated

3 years ago
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED