Bug 1306578 (Closed) - Opened 8 years ago, Closed 8 years ago

In the case of MissingResultsetException .retry() gets called twice per task

Categories

(Tree Management :: Treeherder: Data Ingestion, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: camd)

References

Details

Attachments

(1 file)

(In reply to Ed Morley [:emorley] from bug 1286578 comment #19)
> I wonder as a follow-up, whether we should add a new queue for these
> "waiting for their push" jobs. It's just I got an alert on Heroku stage
> about queue backlog for the `store_pulse_jobs` queue after this landed.
>
> With a separate queue we could set different alert thresholds, plus more
> easily monitor whether any backlog was due to missing lots of pushes, or
> just because standard ingestion was getting behind.

(In reply to Cameron Dawson [:camd] from bug 1286578 comment #20)
> This has been deployed to both stage and production now. The rabbitmq
> queues for SCL3 and Heroku stage are hovering at zero, so they're emptying
> the whole way, atm. Should we see if those seem to grow or we get more
> alerts? Not sure if that was a one-time thing.

The alerts are still occurring:

Name                 treeherder-stage
Queue                store_pulse_jobs
Current # messages   1460
Alarm queue regexp   .*
Alarm threshold      500

Name                 treeherder-stage
Queue                store_pulse_jobs
Current # messages   4743
Alarm queue regexp   .*
Alarm threshold      500

Name                 treeherder-stage
Queue                store_pulse_jobs
Current # messages   26008
Alarm queue regexp   .*
Alarm threshold      500

Also, it seems like there may be a problem somewhere: 26000 jobs are waiting in the queue?
Flags: needinfo?(cdawson)
Those 26000 jobs are in the un-acked state, not the 'ready' state. I'm not sure whether that is expected (since perhaps they are in the 3-minute waiting period) or whether there's some other bug. (Some preliminary reading seemed to imply that un-acked messages actually indicate a bug in the RabbitMQ client's behaviour, but those comments were about plain RabbitMQ, not Celery.)
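For anyone wanting to check this themselves, here is a minimal sketch (not part of the bug) of reading the ready vs un-acked counts from the RabbitMQ management HTTP API; the endpoint, credentials, and vhost below are placeholder assumptions:

```
# Sketch only: query RabbitMQ's management API for a queue's ready vs
# unacknowledged message counts. Assumes the management plugin is enabled;
# the host, credentials and vhost are placeholders.
import requests

MGMT_URL = "http://localhost:15672"   # hypothetical management endpoint
AUTH = ("guest", "guest")             # hypothetical credentials
VHOST = "%2F"                         # URL-encoded default vhost "/"
QUEUE = "store_pulse_jobs"

resp = requests.get("{}/api/queues/{}/{}".format(MGMT_URL, VHOST, QUEUE), auth=AUTH)
resp.raise_for_status()
info = resp.json()

print("ready:", info.get("messages_ready"))
print("un-acked:", info.get("messages_unacknowledged"))
```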
Ah, I see the issue. Calling .retry() is the equivalent of:

```
<schedule task to run again and mark original task as retry>
raise celery.exceptions.Retry
```

See:
http://docs.celeryproject.org/en/latest/reference/celery.app.task.html#celery.app.task.Task.retry

Hence:
https://github.com/mozilla/treeherder/commit/9faf9d97ae0b050c1bdc3fda3c4bb7a4e6a15c34

So here:

```
try:
    JobLoader().process_job_list(job_list)
except MissingResultsetException as exc:
    store_pulse_jobs.retry(exc=exc)
```

...we're really doing the equivalent of:

```
# ...
except MissingResultsetException as exc:
    <schedule task to run again and mark original task as retry>
    raise celery.exceptions.Retry
```

And that exception then propagates to the retryable_task wrapper:

```
def inner(*args, **kwargs):
    try:
        return f(*args, **kwargs)
    except self.NON_RETRYABLE_EXCEPTIONS:
        raise
    except Exception as e:
        # ...
        newrelic.agent.record_exception(params=params)
        # ...
        raise task_func.retry(exc=e, countdown=timeout)
```

...which does the equivalent of:

```
# ...
except Exception as e:
    # ...
    newrelic.agent.record_exception(<celery.exceptions.Retry>)
    # ...
    <schedule task to run again and mark original task as retry>
    raise celery.exceptions.Retry
```

i.e. we double-call the `<schedule task to run again and mark original task as retry>` part.

If we wanted to nest retry logic like this, we'd have to add `celery.exceptions.Retry` to `NON_RETRYABLE_EXCEPTIONS`. However I think it's much cleaner to just not nest things, and use retryable_task for all retry handling :-)
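For illustration, a minimal sketch of that non-nested approach, using the `retryable_task` decorator and `JobLoader` from the snippets above; the import paths and decorator arguments are assumptions rather than the exact code that landed:

```
# Sketch only (import paths and decorator arguments are assumptions): let the
# retryable_task wrapper own all retry handling instead of calling .retry()
# manually inside the task body.
from treeherder.etl.job_loader import JobLoader      # assumed import path
from treeherder.workers.task import retryable_task   # assumed import path

@retryable_task()
def store_pulse_jobs(job_list):
    # If the push hasn't been ingested yet, MissingResultsetException simply
    # propagates; the wrapper records it in New Relic and schedules exactly
    # one retry, instead of the double scheduling described above.
    JobLoader().process_job_list(job_list)
```

This keeps the retry policy in a single place, which matches the direction taken in the attached patch.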
Summary: Add a separate queue for retried pulse jobs that are waiting for their push to be ingested → In the case of MissingResultsetException .retry() gets called twice per task
Flags: needinfo?(cdawson)
Attachment #8797293 - Flags: review?(emorley)
Assignee: nobody → cdawson
Status: NEW → ASSIGNED
Attachment #8797293 - Flags: review?(emorley) → review+
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/61d5fe0889290b346b3bd80e0ed8c092e36e6cb9
Bug 1306578 - Better handling for unknown project repo

Raises a DatasetNotfoundError if the repo isn't found.

https://github.com/mozilla/treeherder/commit/f3261ad9790ce6afc6cf9377de6676cd4fdd0c35
Bug 1306578 - Use retryable_task for pulse job ingestion

Was calling retry manually, but that was causing repeated retries that were
blowing up the celery queue. This just uses our normal retryable_task for
retries and adds some extra logging in New Relic to help identify the
product of a pulse job that failed.
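For the first commit, a hypothetical illustration of raising DatasetNotfoundError for an unknown project repo; the model lookup and import paths are assumptions, not the actual treeherder change:

```
# Hypothetical sketch of "better handling for unknown project repo": resolve
# the repository by name and raise DatasetNotfoundError when it doesn't
# exist. Import paths are assumptions.
from treeherder.model.models import Repository             # assumed import path
from treeherder.model.derived import DatasetNotfoundError  # assumed import path

def get_repository_or_raise(project):
    """Return the Repository for `project`, or raise DatasetNotfoundError."""
    try:
        return Repository.objects.get(name=project)
    except Repository.DoesNotExist:
        raise DatasetNotfoundError(
            "No dataset found for project {!r}".format(project))
```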
Depends on: 1307289
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED