Closed Bug 1133455 Opened 9 years ago Closed 9 years ago

Hundreds of duplicated running jobs on stage

Categories

(Tree Management :: Treeherder: Data Ingestion, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

References

Details

Attachments

(1 file)

IMO this blocks the deploy, until we can confirm whether this is as a result of the changes that are on stage, but not yet on prod.

Compare these three:
https://treeherder.allizom.org/#/jobs?repo=mozilla-central&revision=e0cb32a0b1aa
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=e0cb32a0b1aa
https://tbpl.mozilla.org/?rev=e0cb32a0b1aa

treeherder-stage has hundreds of jobs in the running state, whereas TBPL and treeherder-prod do not.

Also see the buildapi page for this push, which confirms there are no running/pending builds:
https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-central/rev/e0cb32a0b1aa

I'm worried this may be fallout from bug 1093743 (though we've changed enough else recently, so this isn't guaranteed).
We're hitting this exception during builds-4hr ingestion on stage only:
https://rpm.newrelic.com/accounts/677903/applications/5585473/traced_errors/3112710518/similar_errors?original_error_id=3112710518

Traceback (most recent call last):
File "/usr/lib64/python2.7/site-packages/newrelic-2.44.0.36/newrelic/hooks/application_celery.py", line 66, in wrapper return wrapped(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 437, in __protected_call__ return self.run(*args, **kwargs)
File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/etl/tasks/buildapi_tasks.py", line 41, in fetch_buildapi_build4h Builds4hJobsProcess().run()
File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/etl/buildapi.py", line 546, in run filter_to_project=filter_to_project)
File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/etl/buildapi.py", line 172, in transform job_guid_data = self.find_job_guid(build)
File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/etl/buildapi.py", line 66, in find_job_guid request_time = request_times[str(request_id)]
TypeError: 'int' object has no attribute '__getitem__'

...which could mean we're missing completed jobs, and thus leaving them stranded in the running state.

This code was changed in bug 1093743:
https://github.com/mozilla/treeherder-service/pull/359/files#diff-433f17205677cdc81c28f202c4006d24L65
So the problem is that for some time, we've done:

request_times = build['properties'].get('request_times',
                                        build['requesttime'])

Earlier in the file this is explained with:
         # request_id and request_time are mandatory
         # and they can be found in a couple of different places

Unfortunately that means depending on whether we had to fall back to build['requesttime'] or not, request_times could be a dict or an int (!!).

This is why the convoluted logic (that bug 1093743 removed) was necessary. However we should just fix this so that we don't mix types within the same variable.

I'd also be keen to see if we really do need to look in both places for the value.
This should give us the benefit of bug 1133455 without any downsides, even if the overall handling still isn't perfect (analysis today showed that the requesttimes is actually wrong some of the time, but we were using it before already).
Assignee: nobody → emorley
Status: NEW → ASSIGNED
Attachment #8565130 - Flags: review?(mdoglio)
Attachment #8565130 - Flags: review?(mdoglio) → review+
Commit pushed to master at https://github.com/mozilla/treeherder-service

https://github.com/mozilla/treeherder-service/commit/178c7c572f43ce64861f85e9f8ab1cde6c59ff77
Bug 1133455 - Fix the exception when falling back to requesttimes

In bug 1093743, the handling of request times was simplified (since we
now use a single value from request_times rather than all of them).
Unfortunately in the process, this broke the handling of cases where
request_times does not exist, and we have to fall back to requesttime.
The latter is an int, whereas the former a dict, so in addition to
fixing the exception, we now no longer try to store both in the same
variable, to avoid future confusion.
Deployed to stage, pretty clear improvement on new relic:
https://rpm.newrelic.com/accounts/677903/applications/5585473/traced_errors?tw[end]=1424123699&tw[start]=1424120099

Will need to wait a while to check that everything is fine wrt orphan pending/running jobs on https://treeherder.allizom.org/ after bug 1093743 / this.
One of the duplicated jobs:

in running state:
https://treeherder.allizom.org/api/project/mozilla-inbound/jobs/2062342/
https://treeherder.allizom.org/api/project/mozilla-inbound/artifact/?job_id=2062342&type=json&name__in=buildapi_running,buildapi_pending,buildapi_complete
[{
	"blob": {
		"submitted_at": 1424144498,
		"buildername": "Ubuntu VM 12.04 mozilla-inbound opt test jittest-2",
		"start_time": 1424144573,
		"number": 30,
		"claimed_by_name": "buildbot-master01.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-linux32/master",
		"request_ids": [62075685,
		62075976],
		"last_heartbeat": 1424144572,
		"id": 63211341,
		"revision": "dad2b32454574aabe1b66d5ee36ecd18a3bd3de7"
	},
	"type": "json",
	"id": 10150606,
	"name": "buildapi_running",
	"job_id": 2062342
}]


and it's completed dupe:
https://treeherder.allizom.org/api/project/mozilla-inbound/jobs/2064100/
https://treeherder.allizom.org/api/project/mozilla-inbound/artifact/?job_id=2064100&type=json&name__in=buildapi_running,buildapi_pending,buildapi_complete

[{
	"blob": {
		"slave_id": 7589,
		"builder_id": 354850,
		"request_ids": [62075685,
		62075976],
		"buildnumber": 30,
		"reason": "scheduler",
		"result": 0,
		"id": 59151690,
		"master_id": 149,
		"properties": {
			"script_repo_revision": "04723e955a9e",
			"request_times": {
				"62075976": 1424144558,
				"62075685": 1424144498
			},
			"buildid": "20150216192531",
			"stage_platform": "linux",
			"basedir": "/builds/slave/test",
			"script_repo_url": "https://hg.mozilla.org/build/mozharness",
			"request_ids": [62075685,
			62075976],
			"buildnumber": 30,
			"platform": "linux",
			"master": "http://buildbot-master01.bb.releng.use1.mozilla.com:8201/",
			"scheduler": "tests-mozilla-inbound-ubuntu32_vm-opt-unittest",
			"pgo_build": "False",
			"builduid": "691757d2e6754eb1943553236282d4d5",
			"repo_path": "integration/mozilla-inbound"
		}
	},
	"type": "json",
	"id": 10161914,
	"name": "buildapi_complete",
	"job_id": 2064100
}]

Note how whilst the last element in the request_ids array is the same (62075976), the corresponding request time in request_times in the buildapi_complete blob (1424144558), does not match what was listed in buildapi_running under submitted at (1424144498).

I wonder if we can avoid this problem for now by not using the request time in job guid generation at all? Mauro said that catlee said it would be needed for uniqueness, but I can't see where? (For buildbot retry jobs even using request time as well doesn't make the guid unique, we have to use endtime as well)
I think the best way forwards might be to revert https://github.com/mozilla/treeherder-service/pull/359 (and the followup PR in this bug that fixed the exception with it) so we can deploy all the other changes - since we'll be no worse off than we were before.

tl;dr the PR does make us do the "right" thing, but with both the new way, we're hitting inconsistencies with builds-4hr, that mean we have problems with running jobs, whereas before we only had issues with pending/coalesced, which were much fewer in number and so less obvious in the UI.

I then think the best bet is getting release engineering to add a single id in a field called "request_id" which is the correct id from request_ids (to we don't have to guess), and if the request time is somehow definitely required too, then a new field that contains the correct request_time (the current top level field 'requesttime' is missing some of the time, and is often wrong).
Commit pushed to master at https://github.com/mozilla/treeherder-service

https://github.com/mozilla/treeherder-service/commit/20bb725e70d0dbca491f1231c4fe494596667567
Revert "Bug 1133455 - Fix the exception when falling back to requesttimes"

This reverts commit 178c7c572f43ce64861f85e9f8ab1cde6c59ff77.
Marking this bug fixed, since bug 1093743 has been reverted for now, unblocking deployment.
I'll continue the discussion over there :-)
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: