Closed Bug 1330354 Opened 7 years ago Closed 7 years ago

SETA - failures fixed by commit endpoint times out

Categories

(Tree Management Graveyard :: Treeherder: SETA, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: rwood)

References

Details

Attachments

(2 files)

emorley, how can we see what is going wrong?
How can we simulate an environment like this?
The number of JobNotes on production is probably far larger than on stage.

Jan 11 11:35:32 treeherder-prod heroku/router: at=error code=H12 desc="Request timeout" method=GET path="/api/seta/failures-fixed-by-commit/" host=treeherder.mozilla.org request_id=9c28532c-84a8-4579-9b68-0f60c6701a2a fwd="66.207.193.22" dyno=web.2 connect=0ms service=30001ms status=503 bytes=0 

Code:
* https://github.com/mozilla/treeherder/blob/master/treeherder/webapp/api/seta.py#L42-L51
* https://github.com/mozilla/treeherder/blob/master/treeherder/seta/analyze_failures.py#L36-L108
I'd look at the New Relic profiles - check for reported slow transactions.
I loaded the API (so it falls within 30 minutes) and I found it under 'transactions' [1]
That shows that it took 49.9 seconds. Most of it is spent in MySQL operations.

Is that too long?



[1] https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions#id=5b225765625472616e73616374696f6e2f46756e6374696f6e2f747265656865726465722e7765626170702e6170692e736574613a536574614661696c7572657346697865644279436f6d6d69742e6c697374222c22225d
(In reply to Armen Zambrano - Soon to be gone until end of March [:armenzg] (EDT/UTC-4) from comment #2)
> I loaded the API (so it falls within 30 minutes) and I found it under
> 'transactions' [1]
> That shows that it took 49.9 seconds. Most of it is spent in MySQL operations.

Yeah, we have a limit of 30 seconds for web transactions.

Options:

1. Speed up the SQL query
2. Pre-generate the data somehow and store it in memcache
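As a rough illustration of option 2, a pre-generate-and-cache layer might look something like this. This is a plain-Python sketch, not Treeherder code: the helper name, cache key, and TTL are made up for illustration, and the in-process dict stands in for memcache:

```python
import time

# Hypothetical in-process stand-in for memcache: pre-generate the
# expensive result once and serve cached copies until a TTL expires.
_cache = {}

def get_cached(key, compute, ttl_seconds=300, _now=time.monotonic):
    """Return a cached value for `key`, recomputing it after `ttl_seconds`."""
    entry = _cache.get(key)
    now = _now()
    if entry is not None and now - entry[0] < ttl_seconds:
        return entry[1]
    value = compute()            # the slow part, e.g. the SETA SQL query
    _cache[key] = (now, value)
    return value
```

The web request would then only ever hit the cache, and a periodic task (or the first request after expiry) would pay the cost of recomputation.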
(In reply to Armen Zambrano - Soon to be gone until end of March [:armenzg] (EDT/UTC-4) from comment #2)
> I loaded the API (so it falls within 30 minutes) and I found it under
> 'transactions' [1]
> That shows that it took 49.9 seconds. Most of it is spent in MySQL operations.

You can view older transactions by adjusting the time window on the menu at the top.

The trace in question is:
https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions#id=5b225765625472616e73616374696f6e2f46756e6374696f6e2f747265656865726465722e7765626170702e6170692e736574613a536574614661696c7572657346697865644279436f6d6d69742e6c697374222c22225d

The total transaction duration was 30s -- the 49.9 was, from what I can tell, actually "49.9%", the profile's percentage breakdown of the Python parts.

I see that there are ~2800 SELECTs made against each of 4 different tables (i.e. over 10000 SELECTs in total).

I'd start by trying in Vagrant, and loading the API in the browser to see what the debug toolbar SQL analyzer says. Even though the query time will be quicker locally, you should still be able to identify cases where we should be using e.g. select_related().
Ah yes there's no select_related() being used:

    for job_note in models.JobNote.objects.filter(failure_classification=2):

So Django lazy-loads the additional fields for each usage, rather than fetching all in one shot using a JOIN.

Something like this should work:

    job_notes = models.JobNote.objects.filter(failure_classification=2).select_related(
        'job', 'job__signature', 'job__job_type', 'job__option_collection')

(I can never decide the best indentation strategy for these)
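The N+1 pattern being fixed here can be sketched without Django at all. This is a toy illustration only (the class and function names are made up, and the counters simulate queries rather than issuing any): lazily loading a related row per iteration costs one query per note, while a single JOINed query fetches everything up front:

```python
# Toy model of the N+1 query problem that select_related() avoids.
class FakeORM:
    def __init__(self):
        self.query_count = 0

    def fetch_notes(self, n):
        self.query_count += 1           # one query for the note rows
        return list(range(n))

    def fetch_job_for(self, note):
        self.query_count += 1           # one extra query per note (lazy load)
        return {"note": note}

def lazy_walk(orm, n):
    """Per-row lazy loading: 1 query for the notes + 1 per note."""
    return [orm.fetch_job_for(note) for note in orm.fetch_notes(n)]

def joined_walk(orm, n):
    """Eager loading: a single JOINed query fetches notes and jobs together."""
    orm.query_count += 1
    return [{"note": note} for note in range(n)]
```

For 2800 JobNotes joined against 4 related tables, the lazy version is where the "over 10000 SELECTs" figure above comes from.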
Component: Treeherder → Treeherder: SETA
Blocks: 1330728
Blocks: 1326102
Assignee: nobody → rwood
Status: NEW → ASSIGNED
Attachment #8836800 - Flags: review?(emorley)
Comment on attachment 8836888 [details] [review]
[treeherder] rwood-moz:bug1330354-optimize > mozilla:master

Another pull request for a small optimization suggested by :wlach
Attachment #8836888 - Flags: review?(emorley)
Attachment #8836888 - Flags: review?(emorley) → review+
Comment on attachment 8836800 [details] [review]
[treeherder] rwood-moz:bug1330354 > mozilla:master

:-)
Attachment #8836800 - Flags: review?(emorley) → review+
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/42ee3c1ce8d185b1830d033152bacb9c5337a2e3
Bug 1330354 - Use select_related in get_failures_fixed_by_commit() (#2162)

To prevent thousands of MySQL requests that cause both the API endpoint
and seta-analyse-failures task to time out.
Leaving this bug open until both of these patches are deployed to production, and we can confirm the API/analyze_failures no longer times out there.
After the patches were deployed, I ran analyze_failures on production and it appears to have finished successfully. I didn't see any exceptions on treeherder-prod in New Relic. I also didn't see any SETA APIs in the list of most time-consuming transactions on New Relic. I was trying to see the total run time of analyze_failures but couldn't find it listed. I ran the script via heroku run .. "./manage.py analyze_failures".

:emorley, is that sufficient to verify that analyze_failures runs in < 30s now? Is there any way to get an actual number? I dug around in New Relic but couldn't find it.
Flags: needinfo?(emorley)
(In reply to Robert Wood [:rwood] from comment #15)
> I ran the script via heroku run .. "./manage.py analyze_failures".
> 
> :emorley, is that sufficient to verify that analyze_failures runs in < 30s
> now? Is there any way to get an actual number? I dug around in New Relic but
> couldn't find it.

In order to show in New Relic, the agent has to wrap the command being run (like the commands listed in `Procfile`), eg:

newrelic-admin run-program ./manage.py analyze_failures
Flags: needinfo?(emorley)
> In order to show in New Relic, the agent has to wrap the command being run
> (like the commands listed in `Procfile`), eg:
> 
> newrelic-admin run-program ./manage.py analyze_failures

Thanks :emorley. So I would run that command in a heroku dyno? Actually, will it show up in New Relic when it is run automatically as the regularly scheduled celery task?
Flags: needinfo?(emorley)
Yes, eg:

$ thp run newrelic-admin run-program ./manage.py analyze_failures

(Using the helper aliases like: `alias thp="HEROKU_APP=treeherder-prod heroku"`)

It will show in New Relic when run automatically, since the worker that handles the job also uses the wrapper (every command in the Procfile uses the wrapper for this reason).
Flags: needinfo?(emorley)
(In reply to Ed Morley [:emorley] from comment #18)
> Yes, eg:
> 
> $ thp run newrelic-admin run-program ./manage.py analyze_failures

Hmm I tried that and it didn't appear in New Relic. I think because it's being run outside of the celery task framework and/or a Django web request, the New Relic agent doesn't know how to categorise it.

A better way of ensuring the task appears is to schedule an async task (like happens for the celerybeat scheduled periodic tasks), using:

$ thp run ./manage.py shell
...
from treeherder.seta.tasks import seta_analyze_failures
seta_analyze_failures.apply_async(time_limit=15*60, soft_time_limit=10*60)


(The `./manage.py shell` command doesn't need to be run with the New Relic wrapper script itself, since it's only adding the task to the queue, not running the task. The task will shortly after run on the worker_default dyno, not the one-off dyno.)
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Product: Tree Management → Tree Management Graveyard