Closed Bug 1368989 Opened 7 years ago Closed 7 years ago

Job ingestion "IntegrityError: (1062, "Duplicate entry '155bcddb-df77-4ee6-bec2-2532a4786b27/0' for key 'guid'")"

Categories

(Tree Management :: Treeherder: Data Ingestion, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: camd)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

This occurs during the {store-pulse-jobs, builds-pending, builds-running} tasks and is the second most common exception seen in New Relic (it has been occurring for some time):

https://rpm.newrelic.com/accounts/677903/applications/14179757/filterable_errors?tw%5Bend%5D=1496228939&tw%5Bstart%5D=1495624139#/table?top_facet=transactionUiName&primary_facet=error.class&barchart=barchart&filters=%5B%7B%22key%22%3A%22error.class%22%2C%22value%22%3A%22django.db.utils%3AIntegrityError%22%2C%22like%22%3Afalse%7D%5D&_k=rm8qkp

Cameron, I don't suppose you could take a look?
Flags: needinfo?(cdawson)
Yes, I'll definitely take a look.  Trying to wrap up a couple things now, but will look into it tomorrow for sure.  :)
So this indicates that we already have a job with the same guid, of course.  We check whether an existing job with the same ``guid`` exists just prior to doing this ``create``.  Nevertheless, there is a duplicate by the time the create function is called on ``jobs``.

And in every case I checked (maybe 30), the affected jobs were ALL either pending or running.  Not a single completed job errored out this way in my sampling.  When I look in the DB, those jobs show as completed.

Still investigating this and trying a couple approaches.  But it kind of SEEMS like we're ending up in the right place.  Even so, we shouldn't have all these exceptions.  So I'm investigating the right way to handle this gracefully.
I've been experimenting with this a bit, and I just can't reproduce the problem locally.  I wonder if what is happening is this:

We have 5 dynos processing ``store_pulse_jobs``.  Could more than one be getting the same message and ingesting the same job, because the first worker hasn't had a chance to "ack" the message before the second worker gets it as well?  Then we would have two workers with the same message (job).  And sometimes the timing is just right that the call to ``.exists()`` shows no job, but by the next line of code, where it does the ``.create``, the job has already been created.

If that's the case, then the solution here would be either to get trickier with our workers/dynos so they can't get the same job, or to add a failsafe: use ``get_or_create``, or catch the ``IntegrityError`` and return gracefully without logging to New Relic.
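The suspected race can be sketched in plain Python (this is an illustration, not the actual Treeherder code; the ``FakeJobTable`` class and ``DuplicateKeyError`` stand-in are hypothetical, modeling MySQL error 1062 / Django's ``IntegrityError``):

```python
import threading


class DuplicateKeyError(Exception):
    """Stand-in for MySQL error 1062 / Django's IntegrityError."""


class FakeJobTable:
    """A dict guarded by a lock, with a unique constraint on guid."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def exists(self, guid):
        with self._lock:
            return guid in self._rows

    def create(self, guid, state):
        with self._lock:
            if guid in self._rows:
                raise DuplicateKeyError(
                    "Duplicate entry '%s' for key 'guid'" % guid)
            self._rows[guid] = state


jobs = FakeJobTable()
guid = "155bcddb-df77-4ee6-bec2-2532a4786b27/0"

# Both workers pass the exists() check before either one creates:
worker_a_sees = jobs.exists(guid)   # worker A checks: no row yet
worker_b_sees = jobs.exists(guid)   # worker B checks: still no row
jobs.create(guid, "pending")        # worker A wins the race

caught = None
try:
    jobs.create(guid, "running")    # worker B hits the unique constraint
except DuplicateKeyError as exc:
    caught = exc
```

The check-then-create window is the whole problem: ``exists()`` and ``create()`` are two separate statements, so nothing stops another worker from inserting between them.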

Ed-- Curious if you have any thoughts about this.  :)
Flags: needinfo?(cdawson) → needinfo?(emorley)
I'm pretty sure Celery guarantees that tasks aren't duplicated.

Another explanation that has just occurred to me: the pending and running pulse messages may be arriving in such quick succession that two different dynos execute them at the same time, and our ingestion code simply has a race condition between the steps.
Flags: needinfo?(emorley)
Assignee: nobody → cdawson
Comment on attachment 8885474 [details] [review]
[treeherder] mozilla:integrity-error > mozilla:master

Back over to you, Ed.  Made the changes we talked about.  I hope the comments are clear enough.  Thanks!  :)
Attachment #8885474 - Flags: review?(emorley)
Comment on attachment 8885474 [details] [review]
[treeherder] mozilla:integrity-error > mozilla:master

Awesome - thank you :-)
Attachment #8885474 - Flags: review?(emorley) → review+
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/cf715762738d343d6864056f756b6efcf159bb56
Bug 1368989 - Fix integrity error during pulse job ingestion

Instead of hitting an exception and logging it when we get two
jobs with the same guid during ingestion, we now use
get_or_create to try to create the job, and do not throw an
exception when we can't because it has JUST been created right
after the check.

This case looks to only happen when a pending and running job
with the same guid are being processed by different workers/dynos
simultaneously.  So the worst that would happen would be that a
job would look pending longer and then become complete immediately.
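A minimal sketch of the ``get_or_create`` semantics the fix relies on (the ``InMemoryJobs`` class is illustrative, not the real Django ORM; note that Django's actual ``get_or_create`` additionally retries the lookup if the insert hits an IntegrityError):

```python
import threading


class InMemoryJobs:
    """Toy stand-in for a table with a unique guid column."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def get_or_create(self, guid, defaults):
        """Return (row, created).

        Never raises on a duplicate guid: if another worker created
        the row first, the existing row is returned instead.
        """
        with self._lock:
            if guid in self._rows:
                return self._rows[guid], False
            row = {"guid": guid}
            row.update(defaults)
            self._rows[guid] = row
            return row, True


jobs = InMemoryJobs()
# Two workers race on the same guid; the loser quietly gets the
# winner's row back instead of an IntegrityError:
row1, created1 = jobs.get_or_create("abc/0", defaults={"state": "pending"})
row2, created2 = jobs.get_or_create("abc/0", defaults={"state": "running"})
```

Because the loser's ``defaults`` are discarded, the row keeps the winner's state until the later "completed" message updates it, which matches the observed behavior: the job looks pending a bit longer, then becomes complete.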
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
