Closed Bug 1368989 Opened 7 years ago Closed 7 years ago

Job ingestion "IntegrityError: (1062, "Duplicate entry '155bcddb-df77-4ee6-bec2-2532a4786b27/0' for key 'guid'")"

Categories

(Tree Management :: Treeherder: Data Ingestion, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: camd)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

This occurs during the {store-pulse-jobs, builds-pending, builds-running} tasks and is the second most common exception seen in New Relic (it has been occurring for some time):

https://rpm.newrelic.com/accounts/677903/applications/14179757/filterable_errors?tw%5Bend%5D=1496228939&tw%5Bstart%5D=1495624139#/table?top_facet=transactionUiName&primary_facet=error.class&barchart=barchart&filters=%5B%7B%22key%22%3A%22error.class%22%2C%22value%22%3A%22django.db.utils%3AIntegrityError%22%2C%22like%22%3Afalse%7D%5D&_k=rm8qkp

Cameron, I don't suppose you could take a look?
Flags: needinfo?(cdawson)
Yes, I'll definitely take a look.  Trying to wrap up a couple things now, but will look into it tomorrow for sure.  :)
So this indicates that we already have a job with the same guid, of course.  We check whether an existing job with the same ``guid`` exists just prior to doing this ``create``.  Nevertheless, there is a duplicate by the time the create function is called on ``jobs``.

And in every case I checked (maybe 30), the affected jobs were ALL either pending or running.  Not a single completed job errored out this way in my sampling.  When I look in the DB, those jobs show as completed.

Still investigating this and trying a couple approaches.  But it kind of SEEMS like we're ending up in the right place.  Even so, we shouldn't have all these exceptions.  So I'm investigating the right way to handle this gracefully.
I've been experimenting with this a bit, and I just can't reproduce the problem locally.  I wonder if what is happening is this:

We have 5 dynos processing ``store_pulse_jobs``.  Could more than one be getting the same message and ingesting the same job, because the first worker hasn't had a chance to "ack" the message before the second worker gets it as well?  Then we would have two workers with the same message (job).  And sometimes the timing is just right that the call to ``.exists()`` shows no job, but by the next line of code, where it does the ``.create``, the job has already been created.

If that's the case, then the solution here would be either to get trickier with our workers/dynos so they can't get the same job, or to add a failsafe: use ``get_or_create``, or catch the ``IntegrityError`` and return gracefully without logging to New Relic.
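The suspected race can be sketched in plain Python (this is an illustration, not the actual Treeherder code; the ``FakeJobTable`` class and ``DuplicateKeyError`` stand-in are hypothetical, modeling MySQL error 1062 / Django's ``IntegrityError``):

```python
import threading


class DuplicateKeyError(Exception):
    """Stand-in for MySQL error 1062 / Django's IntegrityError."""


class FakeJobTable:
    """A dict guarded by a lock, with a unique constraint on guid."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def exists(self, guid):
        with self._lock:
            return guid in self._rows

    def create(self, guid, state):
        with self._lock:
            if guid in self._rows:
                raise DuplicateKeyError(
                    "Duplicate entry '%s' for key 'guid'" % guid)
            self._rows[guid] = state


jobs = FakeJobTable()
guid = "155bcddb-df77-4ee6-bec2-2532a4786b27/0"

# Both workers pass the exists() check before either one creates:
worker_a_sees = jobs.exists(guid)   # worker A checks: no row yet
worker_b_sees = jobs.exists(guid)   # worker B checks: still no row
jobs.create(guid, "pending")        # worker A wins the race

caught = None
try:
    jobs.create(guid, "running")    # worker B hits the unique constraint
except DuplicateKeyError as exc:
    caught = exc
```

The check-then-create window is the whole problem: ``exists()`` and ``create()`` are two separate statements, so nothing stops another worker from inserting between them.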

Ed-- Curious if you have any thoughts about this.  :)
Flags: needinfo?(cdawson) → needinfo?(emorley)
I'm pretty sure Celery guarantees that tasks aren't duplicated.

Another explanation that has just occurred to me: the pending and running pulse messages may be arriving in such quick succession that two different dynos execute them at the same time, and our ingestion code simply has a race condition between the steps.
Flags: needinfo?(emorley)
Assignee: nobody → cdawson
Comment on attachment 8885474 [details] [review]
[treeherder] mozilla:integrity-error > mozilla:master

Back over to you, Ed.  Made the changes we talked about.  I hope the comments are clear enough.  Thanks!  :)
Attachment #8885474 - Flags: review?(emorley)
Comment on attachment 8885474 [details] [review]
[treeherder] mozilla:integrity-error > mozilla:master

Awesome - thank you :-)
Attachment #8885474 - Flags: review?(emorley) → review+
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/cf715762738d343d6864056f756b6efcf159bb56
Bug 1368989 - Fix integrity error during pulse job ingestion

Instead of hitting an exception and logging it when we get two
jobs with the same guid during ingestion, we now use
get_or_create to try to create the job, and do not throw an
exception when we can't because it has JUST been created right
after the check.

This case looks to only happen when a pending and running job
with the same guid are being processed by different workers/dynos
simultaneously.  So the worst that would happen would be that a
job would look pending longer and then become complete immediately.
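A minimal sketch of the ``get_or_create`` semantics the fix relies on (the ``InMemoryJobs`` class is illustrative, not the real Django ORM; note that Django's actual ``get_or_create`` additionally retries the lookup if the insert hits an IntegrityError):

```python
import threading


class InMemoryJobs:
    """Toy stand-in for a table with a unique guid column."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def get_or_create(self, guid, defaults):
        """Return (row, created).

        Never raises on a duplicate guid: if another worker created
        the row first, the existing row is returned instead.
        """
        with self._lock:
            if guid in self._rows:
                return self._rows[guid], False
            row = {"guid": guid}
            row.update(defaults)
            self._rows[guid] = row
            return row, True


jobs = InMemoryJobs()
# Two workers race on the same guid; the loser quietly gets the
# winner's row back instead of an IntegrityError:
row1, created1 = jobs.get_or_create("abc/0", defaults={"state": "pending"})
row2, created2 = jobs.get_or_create("abc/0", defaults={"state": "running"})
```

Because the loser's ``defaults`` are discarded, the row keeps the winner's state until the later "completed" message updates it, which matches the observed behavior: the job looks pending a bit longer, then becomes complete.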
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
