Bug 1306578 (Closed) - Opened 8 years ago, Closed 8 years ago

In the case of MissingResultsetException .retry() gets called twice per task

Categories

(Tree Management :: Treeherder: Data Ingestion, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: camd)

References

Details

Attachments

(1 file)

(In reply to Ed Morley [:emorley] from bug 1286578 comment #19)
> I wonder as a follow-up, whether we should add a new queue for these
> "waiting for their push" jobs. It's just I got an alert on Heroku stage
> about queue backlog for the `store_pulse_jobs` queue after this landed.
>
> With a separate queue we could set different alert thresholds, plus more
> easily monitor whether any backlog was due to missing lots of pushes, or
> just because standard ingestion was getting behind.

(In reply to Cameron Dawson [:camd] from bug 1286578 comment #20)
> This has been deployed to both stage and production now. The rabbitmq
> queues for SCL3 and Heroku stage are hovering at zero, so they're emptying
> the whole way, atm. Should we see if those seem to grow or we get more
> alerts? Not sure if that was a one-time thing.

The alerts are still occurring:

Name                 treeherder-stage
Queue                store_pulse_jobs
Current # messages   1460
Alarm queue regexp   .*
Alarm threshold      500

Name                 treeherder-stage
Queue                store_pulse_jobs
Current # messages   4743
Alarm queue regexp   .*
Alarm threshold      500

Name                 treeherder-stage
Queue                store_pulse_jobs
Current # messages   26008
Alarm queue regexp   .*
Alarm threshold      500

Also, it seems like there may be a problem somewhere: 26000 jobs are waiting in the queue?
Flags: needinfo?(cdawson)
Those 26000 jobs are in the un-acked state, not the 'ready' state. I'm not sure whether that is expected (since perhaps they are in the 3-minute waiting period) or whether there's some other bug. (Some preliminary reading seemed to imply that un-acked messages actually indicate a bug in the RabbitMQ client's behaviour, but those comments were about plain RabbitMQ, not Celery.)
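For anyone wanting to check this themselves, here is a minimal sketch (not part of the bug) of reading the ready vs un-acked counts from the RabbitMQ management HTTP API; the endpoint, credentials, and vhost below are placeholder assumptions:

```
# Sketch only: query RabbitMQ's management API for a queue's ready vs
# unacknowledged message counts. Assumes the management plugin is enabled;
# the host, credentials and vhost are placeholders.
import requests

MGMT_URL = "http://localhost:15672"   # hypothetical management endpoint
AUTH = ("guest", "guest")             # hypothetical credentials
VHOST = "%2F"                         # URL-encoded default vhost "/"
QUEUE = "store_pulse_jobs"

resp = requests.get("{}/api/queues/{}/{}".format(MGMT_URL, VHOST, QUEUE), auth=AUTH)
resp.raise_for_status()
info = resp.json()

print("ready:", info.get("messages_ready"))
print("un-acked:", info.get("messages_unacknowledged"))
```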
Ah, I see the issue. Calling .retry() is the equivalent of:

```
<schedule task to run again and mark original task as retry>
raise celery.exceptions.Retry
```

See:
http://docs.celeryproject.org/en/latest/reference/celery.app.task.html#celery.app.task.Task.retry

Hence:
https://github.com/mozilla/treeherder/commit/9faf9d97ae0b050c1bdc3fda3c4bb7a4e6a15c34

So here:

```
try:
    JobLoader().process_job_list(job_list)
except MissingResultsetException as exc:
    store_pulse_jobs.retry(exc=exc)
```

...we're really doing the equivalent of:

```
# ...
except MissingResultsetException as exc:
    <schedule task to run again and mark original task as retry>
    raise celery.exceptions.Retry
```

And that exception then propagates to the retryable_task wrapper:

```
def inner(*args, **kwargs):
    try:
        return f(*args, **kwargs)
    except self.NON_RETRYABLE_EXCEPTIONS:
        raise
    except Exception as e:
        # ...
        newrelic.agent.record_exception(params=params)
        # ...
        raise task_func.retry(exc=e, countdown=timeout)
```

...which does the equivalent of:

```
# ...
except Exception as e:
    # ...
    newrelic.agent.record_exception(<celery.exceptions.Retry>)
    # ...
    <schedule task to run again and mark original task as retry>
    raise celery.exceptions.Retry
```

i.e. we double-call the `<schedule task to run again and mark original task as retry>` part.

If we wanted to nest retry logic like this, we'd have to add `celery.exceptions.Retry` to `NON_RETRYABLE_EXCEPTIONS`. However I think it's much cleaner to just not nest things, and use retryable_task for all retry handling :-)
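For illustration, a minimal sketch of that non-nested approach, using the `retryable_task` decorator and `JobLoader` from the snippets above; the import paths and decorator arguments are assumptions rather than the exact code that landed:

```
# Sketch only (import paths and decorator arguments are assumptions): let the
# retryable_task wrapper own all retry handling instead of calling .retry()
# manually inside the task body.
from treeherder.etl.job_loader import JobLoader      # assumed import path
from treeherder.workers.task import retryable_task   # assumed import path

@retryable_task()
def store_pulse_jobs(job_list):
    # If the push hasn't been ingested yet, MissingResultsetException simply
    # propagates; the wrapper records it in New Relic and schedules exactly
    # one retry, instead of the double scheduling described above.
    JobLoader().process_job_list(job_list)
```

This keeps the retry policy in a single place, which matches the direction taken in the attached patch.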
Summary: Add a separate queue for retried pulse jobs that are waiting for their push to be ingested → In the case of MissingResultsetException .retry() gets called twice per task
Flags: needinfo?(cdawson)
Attachment #8797293 - Flags: review?(emorley)
Assignee: nobody → cdawson
Status: NEW → ASSIGNED
Attachment #8797293 - Flags: review?(emorley) → review+
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/61d5fe0889290b346b3bd80e0ed8c092e36e6cb9
Bug 1306578 - Better handling for unknown project repo

Raises a DatasetNotfoundError if the repo isn't found.

https://github.com/mozilla/treeherder/commit/f3261ad9790ce6afc6cf9377de6676cd4fdd0c35
Bug 1306578 - Use retryable_task for pulse job ingestion

Was calling retry manually, but that was causing repeated retries that were
blowing up the celery queue. This just uses our normal retryable_task for
retries and adds some extra logging in New Relic to help identify the
product of a pulse job that failed.
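For the first commit, a hypothetical illustration of raising DatasetNotfoundError for an unknown project repo; the model lookup and import paths are assumptions, not the actual treeherder change:

```
# Hypothetical sketch of "better handling for unknown project repo": resolve
# the repository by name and raise DatasetNotfoundError when it doesn't
# exist. Import paths are assumptions.
from treeherder.model.models import Repository             # assumed import path
from treeherder.model.derived import DatasetNotfoundError  # assumed import path

def get_repository_or_raise(project):
    """Return the Repository for `project`, or raise DatasetNotfoundError."""
    try:
        return Repository.objects.get(name=project)
    except Repository.DoesNotExist:
        raise DatasetNotfoundError(
            "No dataset found for project {!r}".format(project))
```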
Depends on: 1307289
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED