Closed Bug 1287930 Opened 8 years ago Closed 7 years ago

Treeherder sometimes doesn't generate an entry in the central jobs database

Categories

(Tree Management :: Treeherder: API, defect, P1)

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1318474

People

(Reporter: emorley, Unassigned)

References

(Blocks 1 open bug)

Details

For example:
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/c24652da-4df0-11e6-9e2c-c81f66b8ceca_21484_26282
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/f6da525a-4df0-11e6-9e2c-c81f66b8ceca_5495_10103

Traceback (most recent call last):
...
File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/rest_framework/views.py", line 463, in dispatch
File "/data/www/treeherder.mozilla.org/treeherder-service/treeherder/webapp/api/note.py", line 57, in create
File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/django/db/models/manager.py", line 127, in manager_method
File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/django/db/models/query.py", line 334, in get

Example job that does this:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=3996c602754f1653b9affb07485ad641189cc771&filter-searchStr=2dcad453160d2179bb5d56d1e355e608a35f3401&selectedJob=32107692

Wes said it seemed to only occur on jobs that claimed they didn't have a parsed log.

Will, I'm presuming this is due to bug 1266062? I don't suppose you could take a look?
Flags: needinfo?(wlachance)
When clicking that linked job, I get this response to GETting https://treeherder.mozilla.org/api/project/mozilla-inbound/note/?job_id=32107692
HTTP/1.1 500 INTERNAL SERVER ERROR
Server: gunicorn/19.6.0
X-Backend-Server: treeherder1.webapp.scl3.mozilla.com
Vary: Cookie, Accept-Encoding
Content-Type: text/html; charset=UTF-8
Content-Encoding: gzip
Strict-Transport-Security: max-age=31536000
Date: Tue, 19 Jul 2016 21:04:40 GMT
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Connection: close
X-FRAME-OPTIONS: DENY


If I try to classify that job, I get this response to POSTing to https://treeherder.mozilla.org/api/project/mozilla-inbound/note/
HTTP/1.1 500 INTERNAL SERVER ERROR
Server: gunicorn/19.6.0
X-Backend-Server: treeherder3.webapp.scl3.mozilla.com
Vary: Cookie, Accept-Encoding
Content-Type: text/html; charset=UTF-8
Content-Encoding: gzip
Strict-Transport-Security: max-age=31536000
Date: Tue, 19 Jul 2016 21:04:51 GMT
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Connection: close
X-FRAME-OPTIONS: DENY

And also this response to GETting https://treeherder.mozilla.org/api/project/mozilla-inbound/note/?job_id=32107692
HTTP/1.1 500 INTERNAL SERVER ERROR
Server: gunicorn/19.6.0
X-Backend-Server: treeherder1.webapp.scl3.mozilla.com
Vary: Cookie, Accept-Encoding
Content-Type: text/html; charset=UTF-8
Content-Encoding: gzip
Strict-Transport-Security: max-age=31536000
Date: Tue, 19 Jul 2016 21:04:51 GMT
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Connection: close
X-FRAME-OPTIONS: DENY
So I think the problem here is that we have two jobs with the same guid, one in 'mozilla-inbound' and one in 'autoland'. But the intermediate jobs table (which the job note API depends on) can only have one entry per guid.
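
For reference, a rough Django-model sketch of that intermediate jobs table (field names inferred from the MySQL output below; the unique constraint on guid is assumed, and reflects the "one entry per guid" behaviour described above):

from django.db import models

class Job(models.Model):
    # Sketch only: columns taken from the `select *` output below.
    guid = models.CharField(max_length=50, unique=True)
    project_specific_id = models.PositiveIntegerField()
    repository = models.ForeignKey('Repository')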

mysql> select job_guid from job where id=32107692;
+------------------------------------------+
| job_guid                                 |
+------------------------------------------+
| 62714aa01cf0e7bebc20f7e42cf52ceacf306414 |
+------------------------------------------+
1 row in set (0.09 sec)

mysql> use treeherder;
Database changed
mysql> select * from job where guid='d6eb3d8e22cdcab5cce5b45e4eb73b6d259ba0c3';
+----------+------------------------------------------+---------------------+---------------+
| id       | guid                                     | project_specific_id | repository_id |
+----------+------------------------------------------+---------------------+---------------+
| 32248479 | d6eb3d8e22cdcab5cce5b45e4eb73b6d259ba0c3 |              736080 |            77 |
+----------+------------------------------------------+---------------------+---------------+
1 row in set (0.09 sec)

You can see an earlier exception in New Relic from when we tried to insert the entry into the intermediate jobs table for inbound (the guid already existed):

https://rpm.newrelic.com/accounts/677903/applications/4180461/filterable_errors#/show/b517e7bb-4dfb-11e6-9e2c-c81f66b8ceca_0_8994/stack_trace?top_facet=transactionUiName&primary_facet=error.class&barchart=barchart&_k=s81pkv

Looking at the code in etl/buildapi.py, it appears that we're only using request_id and buildername to generate the guid, which could result in this error. I think we're gonna need to incorporate the repository name when generating the job guid; though I'm not sure how to deploy that change without generating a bunch of duplicated information. Maybe we need to do some kind of quick check in the etl layer to see if an "old style" guid exists first for the job?
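
For illustration, roughly what the current scheme looks like and what a repo-aware variant might be. This is a sketch only; the real helper lives in etl/common.py (linked in a later comment) and its exact signature may differ:

import hashlib

def generate_job_guid(request_id, buildername):
    # Sketch of the existing scheme: only request_id and buildername
    # feed the hash, so an identical pair in two repos would collide.
    return hashlib.sha1(
        "{0}{1}".format(request_id, buildername).encode("utf-8")).hexdigest()

def generate_job_guid_with_repo(request_id, buildername, repository):
    # Hypothetical variant: mixing in the repository name keeps guids
    # distinct across repos, but would change every guid on deploy (the
    # duplication concern mentioned above).
    return hashlib.sha1(
        "{0}{1}{2}".format(request_id, buildername, repository)
        .encode("utf-8")).hexdigest()
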
Flags: needinfo?(wlachance)
Ah good spot.

As part of bug 1093743 / bug 1141603 I believe I was told that the request_id was unique across all repositories (ignoring result==5 retries).

Chris, can you confirm whether this is indeed the case?

That said, "buildbot request_id and buildername" should typically be unique, since the buildername tends to contain the repository name too. Looking at:
https://treeherder.mozilla.org/api/project/mozilla-inbound/jobs/?job_guid=62714aa01cf0e7bebc20f7e42cf52ceacf306414

I thought I could find the buildapi artifact (and thus the buildername) using:
https://treeherder.mozilla.org/api/project/mozilla-inbound/artifact/?job_id=32107692
...however, that returns zero results?
Flags: needinfo?(catlee)
The job in question:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=3996c602754f1653b9affb07485ad641189cc771&selectedJob=32107692

Buildapi for that push:
https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/rev/3996c602754f1653b9affb07485ad641189cc771

Job details from buildapi (given the buildapi artifact is missing in our DB):
https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/build/118494035

The buildername:
"Windows 7 32-bit mozilla-inbound opt test mochitest-browser-chrome-1"

And request_id:
117318918

This request doesn't exist for the autoland repo:
https://secure.pub.build.mozilla.org/buildapi/self-serve/autoland/request/117318918

So I'm struggling to see how the collisions are happening, unless something in the scheduling/automation for the autoland repo is submitting jobs with the wrong metadata?
each request_id is tied directly to a buildername, which has the repo name as part of it.

what factors are used to generate the job guid?
Flags: needinfo?(catlee)
(In reply to Chris AtLee [:catlee] from comment #5)
> each request_id is tied directly to a buildername, which has the repo name
> as part of it.
> 
> what factors are used to generate the job guid?

request_id and buildername:

https://github.com/mozilla/treeherder/blob/b040bb4/treeherder/etl/buildapi.py#L329
https://github.com/mozilla/treeherder/blob/b040bb4/treeherder/etl/common.py#L142
Summary: NoteViewSet create/list API exceptions "models:DoesNotExist: Job matching query does not exist." → Treeherder generated two jobs with the same guid in the inbound and autoland repositories
Seeing another set of these jobs that claim to have no logs available, along with 500 errors from the note API.
Catlee, could you take another look at this? (see comment 6 for the answer to your earlier question)
Flags: needinfo?(catlee)
What are the two revisions / repositories that are having the guid collision?
Flags: needinfo?(catlee)
On further investigation I think I got this wrong (I mixed up job guids in comment 2). The problem seems to be (in the example above) that there is no entry in the central 'treeherder' job database corresponding to the one in 'mozilla-inbound'. That is to say, there is no job in treeherder with the project-specific id '32107692' or the guid '62714aa01cf0e7bebc20f7e42cf52ceacf306414'.

I'm not sure how this happened. Wes pointed out some other instances of similar problems today, but I didn't see any exceptions in New Relic that would indicate we failed to ingest this data alongside the jobs. For example:

https://treeherder.mozilla.org/#/jobs?repo=fx-team&fromchange=a878df4314faa5a620aca4b04fbae422f68392d3&group_state=expanded&bugfiler&filter-searchStr=7a4fda4564c85d103178c81fc01c876af8cab0fd&selectedJob=10804736
Summary: Treeherder generated two jobs with the same guid in the inbound and autoland repositories → Treeherder sometimes doesn't generate an entry in the central jobs database
The code in question is here:

https://github.com/mozilla/treeherder/blob/82918ec/treeherder/model/derived/jobs.py#L1584

I still don't see how we would ingest one type of data and not the other. Storing the same data in two places (and assuming the representations stay in sync) is probably always going to be inherently fragile; bug 1178641 can't come soon enough.
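
Until then, a one-off check along these lines (names and module path assumed, not existing Treeherder code) could at least flag per-project jobs that never made it into the central table:

from treeherder.model.models import Job  # intermediate jobs model (path assumed)

def missing_central_rows(project_job_ids, repository_name):
    # Return the per-project job ids that have no row in the central
    # jobs table for the given repository.
    present = set(Job.objects.filter(
        repository__name=repository_name,
        project_specific_id__in=project_job_ids,
    ).values_list("project_specific_id", flat=True))
    return sorted(set(project_job_ids) - present)
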
This should be a non-issue once we only have one representation to worry about.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → DUPLICATE