Closed
Bug 1306578
Opened 8 years ago
Closed 8 years ago
In the case of MissingResultsetException .retry() gets called twice per task
Categories
(Tree Management :: Treeherder: Data Ingestion, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: emorley, Assigned: camd)
Attachments
(1 file)
(In reply to Ed Morley [:emorley] from bug 1286578 comment #19)
> I wonder as a follow-up, whether we should add a new queue for these
> "waiting for their push" jobs. It's just I got an alert on Heroku stage
> about queue backlog for the `store_pulse_jobs` queue after this landed.
>
> With a separate queue we could set different alert thresholds, plus more
> easily monitor whether any backlog was due to missing lots of pushes, or
> just because standard ingestion was getting behind.
(In reply to Cameron Dawson [:camd] from bug 1286578 comment #20)
> This has been deployed to both stage and production now. The rabbitmq
> queues for SCL3 and Heroku stage are hovering at zero, so they're emptying
> the whole way, atm. Should we see if those seem to grow or we get more
> alerts? Not sure if that was a one-time thing.
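(For reference, the separate-queue idea above would just be a routing change at the retry site, since `Task.retry()` forwards extra options to `apply_async()`. A sketch with a hypothetical queue name:)
```
try:
    JobLoader().process_job_list(job_list)
except MissingResultsetException as exc:
    # Hypothetical queue name: retrying onto a dedicated queue would let its
    # backlog be alerted on separately from normal ingestion.
    store_pulse_jobs.retry(exc=exc, queue="store_pulse_jobs_retry")
```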
The alerts are still occurring:
```
Name              Queue             Current # messages  Alarm queue regexp  Alarm threshold
treeherder-stage  store_pulse_jobs  1460                .*                  500
treeherder-stage  store_pulse_jobs  4743                .*                  500
treeherder-stage  store_pulse_jobs  26008               .*                  500
```
Also, it seems like there may be a problem somewhere: 26000 jobs are waiting in the queue?
Flags: needinfo?(cdawson)
Reporter
Comment 1 • 8 years ago
Those 26000 jobs are in the un-acked state, not the 'ready' state.
I'm not sure whether that is expected (perhaps they are within the 3 minute waiting period), or whether there's some other bug. (Some preliminary reading seemed to imply that un-acked messages indicate a bug in RabbitMQ client behaviour, but those comments were about plain RabbitMQ, not Celery.)
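For anyone checking this, the ready vs. un-acked split is visible via the RabbitMQ management HTTP API. A minimal sketch, assuming the management plugin is enabled (host and credentials are placeholders):
```
import requests

# Placeholders: host and credentials are not the real deployment's.
url = "https://rabbitmq.example.com/api/queues/%2F/store_pulse_jobs"
queue = requests.get(url, auth=("guest", "guest")).json()

# 'messages_ready' are deliverable now; 'messages_unacknowledged' have been
# delivered to a worker but not yet acked (Celery holds retried tasks with a
# countdown/ETA un-acked until the ETA passes).
print(queue["messages_ready"], queue["messages_unacknowledged"])
```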
Reporter
Comment 2 • 8 years ago
Ah, I see the issue.
Calling `.retry()` is the equivalent of:
```
<schedule task to run again and mark original task as retry>
raise celery.exceptions.Retry
```
See:
http://docs.celeryproject.org/en/latest/reference/celery.app.task.html#celery.app.task.Task.retry
Hence:
https://github.com/mozilla/treeherder/commit/9faf9d97ae0b050c1bdc3fda3c4bb7a4e6a15c34
So here:
```
try:
    JobLoader().process_job_list(job_list)
except MissingResultsetException as exc:
    store_pulse_jobs.retry(exc=exc)
```
We're really doing the equivalent of:
```
# ...
except MissingResultsetException as exc:
    <schedule task to run again and mark original task as retry>
    raise celery.exceptions.Retry
```
And that exception then propagates to the retryable_task wrapper:
```
def inner(*args, **kwargs):
    try:
        return f(*args, **kwargs)
    except self.NON_RETRYABLE_EXCEPTIONS:
        raise
    except Exception as e:
        # NB: celery.exceptions.Retry subclasses Exception, so the Retry
        # raised by the task's own .retry() call is caught here too.
        # ...
        newrelic.agent.record_exception(params=params)
        # ...
        raise task_func.retry(exc=e, countdown=timeout)
```
...which does the equivalent of:
```
# ...
except Exception as e:
    # ...
    newrelic.agent.record_exception(<celery.exceptions.Retry>)
    # ...
    <schedule task to run again and mark original task as retry>
    raise celery.exceptions.Retry
```
i.e. we call the `<schedule task to run again and mark original task as retry>` part twice.
If we wanted to nest retry logic like this, we'd have to add `celery.exceptions.Retry` to `NON_RETRYABLE_EXCEPTIONS`.
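For illustration, that alternative would be a one-line change to the wrapper (a sketch; the tuple's existing members are whatever the real wrapper already lists):
```
from celery.exceptions import Retry

# Sketch: with Retry in the tuple, the wrapper's first handler re-raises an
# in-flight retry instead of scheduling a second one.
NON_RETRYABLE_EXCEPTIONS = (Retry,)  # plus the wrapper's existing entries
```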
However I think it's much cleaner to just not nest things, and use retryable_task for all retry handling :-)
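For reference, the non-nested shape would be roughly this (a sketch: the decorator invocation here is an assumption, not the exact treeherder signature):
```
# Sketch: no try/except in the task body, so retryable_task's wrapper is the
# only place .retry() ever runs; one failure means one re-schedule.
@retryable_task()
def store_pulse_jobs(job_list):
    JobLoader().process_job_list(job_list)
```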
Summary: Add a separate queue for retried pulse jobs that are waiting for their push to be ingested → In the case of MissingResultsetException .retry() gets called twice per task
Comment 3 • 8 years ago
Assignee
Updated • 8 years ago
Flags: needinfo?(cdawson)
Attachment #8797293 - Flags: review?(emorley)
Reporter
Updated • 8 years ago
Assignee: nobody → cdawson
Status: NEW → ASSIGNED
Reporter
Updated • 8 years ago
Attachment #8797293 - Flags: review?(emorley) → review+
Comment 4 • 8 years ago
Commits pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/61d5fe0889290b346b3bd80e0ed8c092e36e6cb9
Bug 1306578 - Better handling for unknown project repo
Raises a DatasetNotfoundError if the repo isn't found
https://github.com/mozilla/treeherder/commit/f3261ad9790ce6afc6cf9377de6676cd4fdd0c35
Bug 1306578 - Use retryable_task for pulse job ingestion
Was calling retry manually, but that was causing repeated
retries that were blowing up the celery queue. This just
uses our normal retryable_task for retries and adds some
extra logging in new relic to help identify the product
of a pulse job that failed.
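As an aside, the first commit's change presumably has roughly this shape (a sketch only: the exception name comes from the commit message, while the model and lookup are assumptions):
```
# Hypothetical sketch: fail fast on an unknown repository instead of
# retrying; Repository and the Django-style lookup are assumptions.
def get_repository(project):
    try:
        return Repository.objects.get(name=project)
    except Repository.DoesNotExist:
        raise DatasetNotfoundError("unknown project repo: %s" % project)
```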
Reporter
Updated • 8 years ago
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED