1133455 - Hundreds of duplicated running jobs on stage

Assignee

Description

•

10 years ago

IMO this blocks the deploy, until we can confirm whether this is as a result of the changes that are on stage, but not yet on prod. Compare these three: https://treeherder.allizom.org/#/jobs?repo=mozilla-central&revision=e0cb32a0b1aa https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=e0cb32a0b1aa https://tbpl.mozilla.org/?rev=e0cb32a0b1aa treeherder-stage has hundreds of jobs in the running state, whereas TBPL and treeherder-prod do not. Also see the buildapi page for this push, which confirms there are no running/pending builds: https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-central/rev/e0cb32a0b1aa I'm worried this may be fallout from bug 1093743 (though we've changed enough else recently, so this isn't guaranteed).

Ed Morley [:emorley]

Assignee

Comment 1

•

10 years ago

Another example: https://treeherder.allizom.org/#/jobs?repo=mozilla-inbound&revision=9f40d0612b71 https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=9f40d0612b71 https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=9f40d0612b71 https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/rev/9f40d0612b71 I'm struggling to find examples on Try, whereas I can easily find them elsewhere - something to do with coalescing?

Ed Morley [:emorley]

Assignee

Comment 2

•

10 years ago

We're hitting this exception during builds-4hr ingestion on stage only: https://rpm.newrelic.com/accounts/677903/applications/5585473/traced_errors/3112710518/similar_errors?original_error_id=3112710518 Traceback (most recent call last): File "/usr/lib64/python2.7/site-packages/newrelic-2.44.0.36/newrelic/hooks/application_celery.py", line 66, in wrapper return wrapped(*args, **kwargs) File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 437, in __protected_call__ return self.run(*args, **kwargs) File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/etl/tasks/buildapi_tasks.py", line 41, in fetch_buildapi_build4h Builds4hJobsProcess().run() File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/etl/buildapi.py", line 546, in run filter_to_project=filter_to_project) File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/etl/buildapi.py", line 172, in transform job_guid_data = self.find_job_guid(build) File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/etl/buildapi.py", line 66, in find_job_guid request_time = request_times[str(request_id)] TypeError: 'int' object has no attribute '__getitem__' ...which could mean we're missing completed jobs, and thus leaving them stranded in the running state. This code was changed in bug 1093743: https://github.com/mozilla/treeherder-service/pull/359/files#diff-433f17205677cdc81c28f202c4006d24L65

Ed Morley [:emorley]

Assignee

Comment 3

•

10 years ago

So the problem is that for some time, we've done: request_times = build['properties'].get('request_times', build['requesttime']) Earlier in the file this is explained with: # request_id and request_time are mandatory # and they can be found in a couple of different places Unfortunately that means depending on whether we had to fall back to build['requesttime'] or not, request_times could be a dict or an int (!!). This is why the convoluted logic (that bug 1093743 removed) was necessary. However we should just fix this so that we don't mix types within the same variable. I'd also be keen to see if we really do need to look in both places for the value.

Ed Morley [:emorley]

Assignee

Comment 4

•

10 years ago

Attached file Fix the exception when falling back to requesttimes — Details

This should give us the benefit of bug 1133455 without any downsides, even if the overall handling still isn't perfect (analysis today showed that the requesttimes is actually wrong some of the time, but we were using it before already).

Assignee: nobody → emorley

Status: NEW → ASSIGNED

Attachment #8565130 - Flags: review?(mdoglio)

Mauro Doglio [:mdoglio]

Updated

•

10 years ago

Attachment #8565130 - Flags: review?(mdoglio) → review+

Treeherder GitHub Bugbot

Comment 5

•

10 years ago

Commit pushed to master at https://github.com/mozilla/treeherder-service https://github.com/mozilla/treeherder-service/commit/178c7c572f43ce64861f85e9f8ab1cde6c59ff77 Bug 1133455 - Fix the exception when falling back to requesttimes In bug 1093743, the handling of request times was simplified (since we now use a single value from request_times rather than all of them). Unfortunately in the process, this broke the handling of cases where request_times does not exist, and we have to fall back to requesttime. The latter is an int, whereas the former a dict, so in addition to fixing the exception, we now no longer try to store both in the same variable, to avoid future confusion.

Ed Morley [:emorley]

Assignee

Comment 6

•

10 years ago

Deployed to stage, pretty clear improvement on new relic: https://rpm.newrelic.com/accounts/677903/applications/5585473/traced_errors?tw[end]=1424123699&tw[start]=1424120099 Will need to wait a while to check that everything is fine wrt orphan pending/running jobs on https://treeherder.allizom.org/ after bug 1093743 / this.

Ed Morley [:emorley]

Assignee

Comment 7

•

10 years ago

The exceptions are gone, but I can still see duplicated jobs - eg compare: https://treeherder.allizom.org/#/jobs?repo=mozilla-inbound&revision=dad2b3245457 https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=dad2b3245457

Ed Morley [:emorley]

Assignee

Comment 8

•

10 years ago

One of the duplicated jobs: in running state: https://treeherder.allizom.org/api/project/mozilla-inbound/jobs/2062342/ https://treeherder.allizom.org/api/project/mozilla-inbound/artifact/?job_id=2062342&type=json&name__in=buildapi_running,buildapi_pending,buildapi_complete [{ "blob": { "submitted_at": 1424144498, "buildername": "Ubuntu VM 12.04 mozilla-inbound opt test jittest-2", "start_time": 1424144573, "number": 30, "claimed_by_name": "buildbot-master01.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-linux32/master", "request_ids": [62075685, 62075976], "last_heartbeat": 1424144572, "id": 63211341, "revision": "dad2b32454574aabe1b66d5ee36ecd18a3bd3de7" }, "type": "json", "id": 10150606, "name": "buildapi_running", "job_id": 2062342 }] and it's completed dupe: https://treeherder.allizom.org/api/project/mozilla-inbound/jobs/2064100/ https://treeherder.allizom.org/api/project/mozilla-inbound/artifact/?job_id=2064100&type=json&name__in=buildapi_running,buildapi_pending,buildapi_complete [{ "blob": { "slave_id": 7589, "builder_id": 354850, "request_ids": [62075685, 62075976], "buildnumber": 30, "reason": "scheduler", "result": 0, "id": 59151690, "master_id": 149, "properties": { "script_repo_revision": "04723e955a9e", "request_times": { "62075976": 1424144558, "62075685": 1424144498 }, "buildid": "20150216192531", "stage_platform": "linux", "basedir": "/builds/slave/test", "script_repo_url": "https://hg.mozilla.org/build/mozharness", "request_ids": [62075685, 62075976], "buildnumber": 30, "platform": "linux", "master": "http://buildbot-master01.bb.releng.use1.mozilla.com:8201/", "scheduler": "tests-mozilla-inbound-ubuntu32_vm-opt-unittest", "pgo_build": "False", "builduid": "691757d2e6754eb1943553236282d4d5", "repo_path": "integration/mozilla-inbound" } }, "type": "json", "id": 10161914, "name": "buildapi_complete", "job_id": 2064100 }] Note how whilst the last element in the request_ids array is the same (62075976), the corresponding request time in request_times in the buildapi_complete blob (1424144558), does not match what was listed in buildapi_running under submitted at (1424144498). I wonder if we can avoid this problem for now by not using the request time in job guid generation at all? Mauro said that catlee said it would be needed for uniqueness, but I can't see where? (For buildbot retry jobs even using request time as well doesn't make the guid unique, we have to use endtime as well)

Ed Morley [:emorley]

Assignee

Comment 9

•

10 years ago

I think the best way forwards might be to revert https://github.com/mozilla/treeherder-service/pull/359 (and the followup PR in this bug that fixed the exception with it) so we can deploy all the other changes - since we'll be no worse off than we were before. tl;dr the PR does make us do the "right" thing, but with both the new way, we're hitting inconsistencies with builds-4hr, that mean we have problems with running jobs, whereas before we only had issues with pending/coalesced, which were much fewer in number and so less obvious in the UI. I then think the best bet is getting release engineering to add a single id in a field called "request_id" which is the correct id from request_ids (to we don't have to guess), and if the request time is somehow definitely required too, then a new field that contains the correct request_time (the current top level field 'requesttime' is missing some of the time, and is often wrong).

Treeherder GitHub Bugbot

Comment 10

•

10 years ago

Commit pushed to master at https://github.com/mozilla/treeherder-service https://github.com/mozilla/treeherder-service/commit/20bb725e70d0dbca491f1231c4fe494596667567 Revert "Bug 1133455 - Fix the exception when falling back to requesttimes" This reverts commit 178c7c572f43ce64861f85e9f8ab1cde6c59ff77.

Ed Morley [:emorley]

Assignee

Comment 11

•

10 years ago

Marking this bug fixed, since bug 1093743 has been reverted for now, unblocking deployment. I'll continue the discussion over there :-)

Status: ASSIGNED → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Bugzilla

Hundreds of duplicated running jobs on stage

Categories

(Tree Management :: Treeherder: Data Ingestion, defect, P1)

Tracking

(Not tracked)

People

(Reporter: emorley, Assigned: emorley)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(1 file)

Description

Comment 1

Comment 2

Comment 3

Comment 4

Updated

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Comment 11

Attachment

General

Description

File Name

Content Type