Closed
Bug 1287930
Opened 8 years ago
Closed 8 years ago
Treeherder sometimes doesn't generate an entry in the central jobs database
Categories
(Tree Management :: Treeherder: API, defect, P1)
Tree Management
Treeherder: API
Tracking
(Not tracked)
RESOLVED
DUPLICATE
of bug 1318474
People
(Reporter: emorley, Unassigned)
References
(Blocks 1 open bug)
Details
For example:
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/c24652da-4df0-11e6-9e2c-c81f66b8ceca_21484_26282
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/f6da525a-4df0-11e6-9e2c-c81f66b8ceca_5495_10103
Traceback (most recent call last):
...
File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/rest_framework/views.py", line 463, in dispatch
File "/data/www/treeherder.mozilla.org/treeherder-service/treeherder/webapp/api/note.py", line 57, in create
File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/django/db/models/manager.py", line 127, in manager_method
File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/django/db/models/query.py", line 334, in get
Example job that does this:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=3996c602754f1653b9affb07485ad641189cc771&filter-searchStr=2dcad453160d2179bb5d56d1e355e608a35f3401&selectedJob=32107692
Wes said it seemed to only occur on jobs that claimed they didn't have a parsed log.
Will, I'm presuming this is due to bug 1266062? I don't suppose you could take a look?
Flags: needinfo?(wlachance)
When clicking that linked job, I get this response to GETting https://treeherder.mozilla.org/api/project/mozilla-inbound/note/?job_id=32107692
HTTP/1.1 500 INTERNAL SERVER ERROR
Server: gunicorn/19.6.0
X-Backend-Server: treeherder1.webapp.scl3.mozilla.com
Vary: Cookie, Accept-Encoding
Content-Type: text/html; charset=UTF-8
Content-Encoding: gzip
Strict-Transport-Security: max-age=31536000
Date: Tue, 19 Jul 2016 21:04:40 GMT
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Connection: close
X-FRAME-OPTIONS: DENY
If I try to classify that job, I get this response to POSTing to https://treeherder.mozilla.org/api/project/mozilla-inbound/note/
HTTP/1.1 500 INTERNAL SERVER ERROR
Server: gunicorn/19.6.0
X-Backend-Server: treeherder3.webapp.scl3.mozilla.com
Vary: Cookie, Accept-Encoding
Content-Type: text/html; charset=UTF-8
Content-Encoding: gzip
Strict-Transport-Security: max-age=31536000
Date: Tue, 19 Jul 2016 21:04:51 GMT
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Connection: close
X-FRAME-OPTIONS: DENY
And also this response to GETting https://treeherder.mozilla.org/api/project/mozilla-inbound/note/?job_id=32107692
HTTP/1.1 500 INTERNAL SERVER ERROR
Server: gunicorn/19.6.0
X-Backend-Server: treeherder1.webapp.scl3.mozilla.com
Vary: Cookie, Accept-Encoding
Content-Type: text/html; charset=UTF-8
Content-Encoding: gzip
Strict-Transport-Security: max-age=31536000
Date: Tue, 19 Jul 2016 21:04:51 GMT
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Connection: close
X-FRAME-OPTIONS: DENY
Comment 2•8 years ago
|
||
So I think the problem here is that we have the same guid for the job, one in 'mozilla-inbound', one in 'autoland'. But the intermediate jobs table (which the job note api depends on) can only have one entry per guid.
mysql> select job_guid from job where id=32107692;
+------------------------------------------+
| job_guid |
+------------------------------------------+
| 62714aa01cf0e7bebc20f7e42cf52ceacf306414 |
+------------------------------------------+
1 row in set (0.09 sec)
mysql> use treeherder;
Database changed
mysql> select * from job where guid='d6eb3d8e22cdcab5cce5b45e4eb73b6d259ba0c3';
+----------+------------------------------------------+---------------------+---------------+
| id | guid | project_specific_id | repository_id |
+----------+------------------------------------------+---------------------+---------------+
| 32248479 | d6eb3d8e22cdcab5cce5b45e4eb73b6d259ba0c3 | 736080 | 77 |
+----------+------------------------------------------+---------------------+---------------+
1 row in set (0.09 sec)
You can see an earlier exception in new relic when we tried to insert the entry into the intermediate job's table for inbound (when the guid for inbound already existed):
https://rpm.newrelic.com/accounts/677903/applications/4180461/filterable_errors#/show/b517e7bb-4dfb-11e6-9e2c-c81f66b8ceca_0_8994/stack_trace?top_facet=transactionUiName&primary_facet=error.class&barchart=barchart&_k=s81pkv
Looking at the code in etl/buildapi.py, it appears that we're only using request_id and buildername to generate the guid, which could result in this error. I think we're gonna need to incorporate the repository name when generating the job guid; though I'm not sure how to deploy that change without generating a bunch of duplicated information. Maybe we need to do some kind of quick check in the etl layer to see if an "old style" guid exists first for the job?
Flags: needinfo?(wlachance)
Reporter | ||
Comment 3•8 years ago
|
||
Ah good spot.
As part of bug 1093743 / bug 1141603 I believe I was told that the request_id was unique across all repositories. (ignoring result==5 retries)
Chris, can you confirm whether this is indeed the case?
That said, "buildbot request_id and buildername" should typically be unique, since the buildername tends to contain the repository name too. Looking at:
https://treeherder.mozilla.org/api/project/mozilla-inbound/jobs/?job_guid=62714aa01cf0e7bebc20f7e42cf52ceacf306414
I thought I could find the buildapi artifact (and thus the buildername) using:
https://treeherder.mozilla.org/api/project/mozilla-inbound/artifact/?job_id=32107692
...however that gets zero results?
Flags: needinfo?(catlee)
Reporter | ||
Comment 4•8 years ago
|
||
The job in question:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=3996c602754f1653b9affb07485ad641189cc771&selectedJob=32107692
Buildapi for that push:
https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/rev/3996c602754f1653b9affb07485ad641189cc771
Job details from buildapi (given the buildapi artifact is missing in our DB):
https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/build/118494035
The buildnername:
"Windows 7 32-bit mozilla-inbound opt test mochitest-browser-chrome-1"
And request_id:
117318918
This request doesn't exist for the autoland repo:
https://secure.pub.build.mozilla.org/buildapi/self-serve/autoland/request/117318918
So I'm struggling to see how the collisions are happening, unless something in the scheduling/automation for the autoland repo is submitting jobs with the wrong metadata?
Comment 5•8 years ago
|
||
each request_id is tied directly to a buildername, which has the repo name as part of it.
what factors are used to generate the job guid?
Flags: needinfo?(catlee)
Comment 6•8 years ago
|
||
(In reply to Chris AtLee [:catlee] from comment #5)
> each request_id is tied directly to a buildername, which has the repo name
> as part of it.
>
> what factors are used to generate the job guid?
request_id and buildername:
https://github.com/mozilla/treeherder/blob/b040bb4/treeherder/etl/buildapi.py#L329
https://github.com/mozilla/treeherder/blob/b040bb4/treeherder/etl/common.py#L142
Summary: NoteViewSet create/list API exceptions "models:DoesNotExist: Job matching query does not exist." → Treeherder generated two jobs with the same guid in the inbound and autoland repositories
Seeing another set of these jobs that claim to have no logs available and the 500 errors on the note api.
Comment 8•8 years ago
|
||
Catlee, could you take another look at this? (see comment 6 for the answer to your earlier question)
Flags: needinfo?(catlee)
Comment 9•8 years ago
|
||
What are the two revisions / repositories that are having the guid collision?
Flags: needinfo?(catlee)
Comment 10•8 years ago
|
||
On further investigation I think I got this wrong (I mixed up job guids in comment 2). The problem seems to be (in the example above) that there is no corresponding entry in the central 'treeherder' job database corresponding to the one in 'mozilla-inbound'). That is to say, there is no job in treeherder with the project specific id '32107692' or the guid '62714aa01cf0e7bebc20f7e42cf52ceacf306414'.
I'm not sure how this happened. Wes pointed out some other instances of similar problems today but I didn't see any exceptions in new relic that would indicate we failed to ingest this data along with jobs. For example:
https://treeherder.mozilla.org/#/jobs?repo=fx-team&fromchange=a878df4314faa5a620aca4b04fbae422f68392d3&group_state=expanded&bugfiler&filter-searchStr=7a4fda4564c85d103178c81fc01c876af8cab0fd&selectedJob=10804736
Summary: Treeherder generated two jobs with the same guid in the inbound and autoland repositories → Treeherder sometimes doesn't generate an entry in the central jobs database
Comment 11•8 years ago
|
||
The code in question is here:
https://github.com/mozilla/treeherder/blob/82918ec/treeherder/model/derived/jobs.py#L1584
I still don't see how we would ingest one type of data and not the other. Storing the same data in two places (and assuming that the representations are in sync) is probably always going to be inherently fragile, bug 1178641 can't come soon enough.
Comment 12•8 years ago
|
||
This should be a non-issue once we only have one representation to worry about.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → DUPLICATE
You need to log in
before you can comment on or make changes to this bug.
Description
•