Closed Bug 1035129 Opened 10 years ago Closed 10 years ago

Some job signatures are missing from the reference data signature table

Categories

(Tree Management :: Treeherder, defect, P1)

x86
macOS
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mdoglio, Assigned: mdoglio)

Attachments

(2 files)

I found the attached trace in the gunicorn logs on production. The error is caused by a job signature that is present in the job table but missing from the reference_data_signatures table.
Blocks: 1033107
mdoglio observed this in the cedar_jobs_1 database. At this time, the cedar_jobs_1.job table contains 8417 jobs. Every entry has a job signature, but 366 signatures in the job table are not found in treeherder.reference_data_signatures. The missing signatures were identified with the following SQL:

select signature, reason, result, state from cedar_jobs_1.job where signature not in ( select signature from treeherder.reference_data_signatures );
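The same check can be written as an anti-join, which also makes it easy to count distinct missing signatures. This is only a minimal sketch: it uses Python's sqlite3 with both tables in a single in-memory database, whereas the real tables live in separate MySQL databases, and the sample rows and the `find_missing_signatures` helper are made up for illustration.

```python
import sqlite3


def find_missing_signatures(conn):
    """Return distinct job signatures that have no row in the reference table."""
    rows = conn.execute(
        """
        SELECT DISTINCT j.signature
        FROM job AS j
        LEFT JOIN reference_data_signatures AS r
               ON r.signature = j.signature
        WHERE r.signature IS NULL
        ORDER BY j.signature
        """
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical sample data: 'sig-b' is used by two jobs but was never
# written to the reference table.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE reference_data_signatures (signature TEXT);
    CREATE TABLE job (id INTEGER PRIMARY KEY, signature TEXT, state TEXT);
    INSERT INTO reference_data_signatures VALUES ('sig-a');
    INSERT INTO job (signature, state) VALUES
        ('sig-a', 'completed'), ('sig-b', 'pending'), ('sig-b', 'running');
    """
)
missing = find_missing_signatures(conn)
print(missing)
```

Unlike `NOT IN`, the `LEFT JOIN ... IS NULL` form still returns rows if the reference table ever contains a NULL signature.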

I repeated this test on mozilla_inbound_jobs_1 and found that, out of 460764 jobs with signatures, 20429 were missing from the treeherder.reference_data_signatures table.

I was unable to find any correlation that distinguishes the jobs whose signatures are missing from treeherder.reference_data_signatures from the ones whose signatures are present, so at this point the failures appear to be random.

Given that the signatures are stored in treeherder.reference_data_signatures before they are written to the *_jobs_1.job table, there must be an unaccounted-for execution path or a bug involving:

https://github.com/mozilla/treeherder-service/blob/master/treeherder/model/derived/jobs.py#L1572
https://github.com/mozilla/treeherder-service/blob/master/treeherder/model/derived/refdata.py#L211
A progress update: I tried to run data ingestion locally and found 2 jobs on cedar in pending state with the same problem. I haven't noticed any failure in the worker or gunicorn. I'm writing some automated tests to check that those 2 jobs are ingested correctly. Hopefully they will reveal what is causing this issue.
No longer blocks: 1033107
This is no longer a blocker of bug 1033107, as I changed the way the build system type is detected.
Priority: -- → P1
Assignee: nobody → mdoglio
Status: NEW → ASSIGNED
I found out why this is happening: if treeherder ingests a job with a known buildername but a different reference data signature associated with it (i.e. different properties such as platform, job_type, etc.), the new signature is not inserted into the reference data signature table.
This can happen, for example, when we update the regular expressions for buildernames.
We should probably update the unique constraint on that table to include the signature, and make sure that our insertion query is changed accordingly.
To apply this on production, we need to drop and recreate the UNIQUE index on the treeherder.reference_data_signatures table.
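The effect of the constraint can be reproduced in isolation. The following is a rough sketch, not the actual treeherder schema or insertion query: it assumes the existing unique constraint covers only a buildername column (called `name` here), uses sqlite3 with `INSERT OR IGNORE` as a stand-in for whatever the real MySQL query does, and all values are hypothetical.

```python
import sqlite3


def ingest(conn, name, signature):
    # Hypothetical insertion query: the row is silently skipped when it
    # violates the table's UNIQUE constraint.
    conn.execute(
        "INSERT OR IGNORE INTO reference_data_signatures (name, signature) "
        "VALUES (?, ?)",
        (name, signature),
    )


def stored_signatures(unique_columns):
    """Ingest one buildername with two signatures; return what was stored."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE reference_data_signatures ("
        "name TEXT, signature TEXT, "
        f"UNIQUE ({unique_columns}))"
    )
    ingest(conn, "builder-1", "sig-old")
    # Same buildername, different properties, therefore a new signature.
    ingest(conn, "builder-1", "sig-new")
    rows = conn.execute(
        "SELECT signature FROM reference_data_signatures ORDER BY signature"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


name_only = stored_signatures("name")
name_and_signature = stored_signatures("name, signature")
print(name_only)           # the new signature is dropped
print(name_and_signature)  # both signatures are kept
```

With the constraint on the name alone, the second insert is ignored and any job carrying the new signature ends up without a matching reference row. Including the signature in the constraint lets both rows coexist.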
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED