SETA - failures fixed by commit endpoint times out

RESOLVED FIXED

Status

Product: Tree Management
Component: Treeherder: SETA
Resolution: RESOLVED FIXED
Opened: 11 months ago
Closed: 9 months ago

People

(Reporter: armenzg, Assigned: rwood)

Tracking

(Blocks: 1 bug)

Details

Attachments

(2 attachments)

(Reporter)

Description

11 months ago
emorley, how can we see what is going wrong?
How can we simulate an environment like this?
The number of JobNotes on production is probably immense compared to stage.

Jan 11 11:35:32 treeherder-prod heroku/router: at=error code=H12 desc="Request timeout" method=GET path="/api/seta/failures-fixed-by-commit/" host=treeherder.mozilla.org request_id=9c28532c-84a8-4579-9b68-0f60c6701a2a fwd="66.207.193.22" dyno=web.2 connect=0ms service=30001ms status=503 bytes=0 

Code:
* https://github.com/mozilla/treeherder/blob/master/treeherder/webapp/api/seta.py#L42-L51
* https://github.com/mozilla/treeherder/blob/master/treeherder/seta/analyze_failures.py#L36-L108

Comment 1

11 months ago
I'd look at the New Relic profiles - check for reported slow transactions.
(Reporter)

Comment 2

10 months ago
I loaded the API (so it falls within 30 minutes) and I found it under 'transactions' [1]
That shows that it took 49.9 seconds. Most of it goes into MySql operations.

Is that too long?



[1] https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions#id=5b225765625472616e73616374696f6e2f46756e6374696f6e2f747265656865726465722e7765626170702e6170692e736574613a536574614661696c7572657346697865644279436f6d6d69742e6c697374222c22225d

Comment 3

10 months ago
(In reply to Armen Zambrano - Soon to be gone until end of March [:armenzg] (EDT/UTC-4) from comment #2)
> I loaded the API (so it falls within 30 minutes) and I found it under
> 'transactions' [1]
> That shows that it took 49.9 seconds. Most of it goes into MySql operations.

Yeah, we have a limit of 30 seconds for web transactions.

Options:

1. Speed up the SQL query
2. Pre-generate the data somehow and store it in memcache
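Option 2 could look roughly like the sketch below. This is a minimal illustration using a plain in-process dict as a stand-in for memcache; in Treeherder the real mechanism would be Django's cache framework, and the key name and function signatures here are hypothetical:

```python
import time

# Toy stand-in for memcache: key -> (expires_at, value).
_cache = {}

def cache_set(key, value, ttl=3600):
    _cache[key] = (time.time() + ttl, value)

def cache_get(key):
    entry = _cache.get(key)
    if entry is None or entry[0] < time.time():
        _cache.pop(key, None)
        return None
    return entry[1]

def failures_fixed_by_commit(compute):
    """Serve precomputed data from the cache.

    `compute` is the expensive query; under this scheme it would normally
    run in a periodic background task that refreshes the cache, so web
    requests only ever read (the miss path here is just a fallback).
    """
    data = cache_get("seta:failures-fixed-by-commit")
    if data is None:
        data = compute()
        cache_set("seta:failures-fixed-by-commit", data)
    return data
```

The point is that the 30-second Heroku request limit stops mattering, because the slow work happens outside the request/response cycle.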

Comment 4

10 months ago
(In reply to Armen Zambrano - Soon to be gone until end of March [:armenzg] (EDT/UTC-4) from comment #2)
> I loaded the API (so it falls within 30 minutes) and I found it under
> 'transactions' [1]
> That shows that it took 49.9 seconds. Most of it goes into MySql operations.

You can view older transactions by adjusting the time window on the menu at the top.

The trace in question is:
https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions#id=5b225765625472616e73616374696f6e2f46756e6374696f6e2f747265656865726465722e7765626170702e6170692e736574613a536574614661696c7572657346697865644279436f6d6d69742e6c697374222c22225d

The total transaction duration was 30s -- the 49.9 from what I can tell was actually "49.9%" for profile percentage breakdown of the Python parts.

I see that there are ~2800 SELECTs made against each of 4 different tables (i.e. over 10000 SELECTs in total).

I'd start by trying in Vagrant, and loading the API in the browser to see what the debug toolbar's SQL analyzer says. Even though the query time will be quicker locally, you should still be able to identify cases where we should be using e.g. select_related().

Comment 5

10 months ago
Ah yes there's no select_related() being used:

    for job_note in models.JobNote.objects.filter(failure_classification=2):

So Django lazy-loads the additional fields for each usage, rather than fetching all in one shot using a JOIN.

Something like this should work:

    job_notes = models.JobNote.objects.filter(failure_classification=2).select_related(
        'job', 'job__signature', 'job__job_type', 'job__option_collection')

(I can never decide the best indentation strategy for these)
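The effect of select_related() at the SQL level can be sketched with plain sqlite3 (the table names below are simplified stand-ins for the Treeherder schema): without it, the ORM issues one extra SELECT per row (the N+1 pattern behind the ~2800-SELECT trace above); with it, a single JOIN fetches everything.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
queries = []
conn.set_trace_callback(lambda sql: queries.append(sql))  # count each executed statement

conn.executescript("""
    CREATE TABLE job (id INTEGER PRIMARY KEY, job_type_id INTEGER);
    CREATE TABLE job_type (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO job_type VALUES (1, 'mochitest'), (2, 'reftest');
    INSERT INTO job VALUES (1, 1), (2, 2), (3, 1);
""")
queries.clear()

# N+1 pattern: one SELECT for the jobs, then one per row for the related
# table -- what the ORM does when each related field is lazy-loaded.
rows = conn.execute("SELECT id, job_type_id FROM job").fetchall()
lazy = [conn.execute("SELECT name FROM job_type WHERE id = ?", (jt,)).fetchone()[0]
        for _, jt in rows]
n_plus_one = len(queries)

queries.clear()
# select_related() equivalent: one JOIN fetches the same data in a single query.
joined = [name for _, name in conn.execute(
    "SELECT job.id, job_type.name FROM job "
    "JOIN job_type ON job.job_type_id = job_type.id")]
one_query = len(queries)

print(n_plus_one, one_query)  # prints: 4 1
```

With 2800 JobNotes and 4 related tables per note, the same ratio is what turns one logical query into over 10000 SELECTs.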

Comment 6

10 months ago
Meant to include these links :-)

https://docs.djangoproject.com/en/1.10/ref/models/querysets/#select-related
http://www.yilmazhuseyin.com/blog/dev/django-orm-performance-tips-part-2/

Updated

10 months ago
Component: Treeherder → Treeherder: SETA

Updated

10 months ago
Blocks: 1330728
(Assignee)

Updated

10 months ago
Blocks: 1326102
(Assignee)

Updated

9 months ago
Duplicate of this bug: 1338570
(Assignee)

Updated

9 months ago
Assignee: nobody → rwood
Status: NEW → ASSIGNED

Comment 8

9 months ago
Created attachment 8836800 [details] [review]
[treeherder] rwood-moz:bug1330354 > mozilla:master
(Assignee)

Updated

9 months ago
Attachment #8836800 - Flags: review?(emorley)

Comment 9

9 months ago
Created attachment 8836888 [details] [review]
[treeherder] rwood-moz:bug1330354-optimize > mozilla:master
(Assignee)

Comment 10

9 months ago
Comment on attachment 8836888 [details] [review]
[treeherder] rwood-moz:bug1330354-optimize > mozilla:master

Another pull request for a small optimization suggested by :wlach
Attachment #8836888 - Flags: review?(emorley)

Updated

9 months ago
Attachment #8836888 - Flags: review?(emorley) → review+

Comment 11

9 months ago
Comment on attachment 8836800 [details] [review]
[treeherder] rwood-moz:bug1330354 > mozilla:master

:-)
Attachment #8836800 - Flags: review?(emorley) → review+

Comment 12

9 months ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/42ee3c1ce8d185b1830d033152bacb9c5337a2e3
Bug 1330354 - Use select_related in get_failures_fixed_by_commit() (#2162)

To prevent thousands of MySQL requests that cause both the API endpoint
and seta-analyse-failures task to time out.

Comment 13

9 months ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/2af1cefb827c022c56172adb9952f403b696582e
Bug 1330354 - Small optimization for Job.get_platform_option (#2164)
(Assignee)

Comment 14

9 months ago
Leaving this bug open until both of these patches are deployed to production, and we can confirm the API/analyze_failures no longer times out there.
(Assignee)

Comment 15

9 months ago
After the patches were deployed, I ran analyze_failures on production and it appears to have finished successfully. I didn't see any exceptions on treeherder-prod in New Relic. I also didn't see any SETA APIs in the list of most time-consuming transactions in New Relic. I was trying to see the total run time of analyze_failures but couldn't find it listed. I ran the script via heroku run .. "./manage.py analyze_failures".

:emorley, is that sufficient to verify that analyze_failures runs in < 30s now? Is there any way to get an actual number? I dug around in New Relic but couldn't find it.
Flags: needinfo?(emorley)

Comment 16

9 months ago
(In reply to Robert Wood [:rwood] from comment #15)
> I ran the script via heroku run .. "./manage.py analyze_failures".
> 
> :emorley, is that sufficient to verify that analyze_failures runs in < 30s
> now? Is there any way to get an actual number? I dug around in New Relic but
> couldn't find it.

In order to show in New Relic, the agent has to wrap the command being run (like the commands listed in `Procfile`), eg:

newrelic-admin run-program ./manage.py analyze_failures
Flags: needinfo?(emorley)
(Assignee)

Comment 17

9 months ago
> In order to show in New Relic, the agent has to wrap the command being run
> (like the commands listed in `Procfile`), eg:
> 
> newrelic-admin run-program ./manage.py analyze_failures

Thanks :emorley. So I would run that command in a Heroku dyno? Actually, will it show up in New Relic when it is run automatically as the regularly scheduled celery task?
Flags: needinfo?(emorley)

Comment 18

9 months ago
Yes, eg:

$ thp run newrelic-admin run-program ./manage.py analyze_failures

(Using the helper aliases like: `alias thp="HEROKU_APP=treeherder-prod heroku"`)

It will show in New Relic when run automatically, since the worker that handles the job also uses the wrapper (every command in the Procfile uses the wrapper for this reason).
Flags: needinfo?(emorley)
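As an illustration, Procfile entries wrapped this way look roughly like the following (the specific command lines are hypothetical; Treeherder's actual Procfile may differ):

```
web: newrelic-admin run-program gunicorn treeherder.config.wsgi
worker_default: newrelic-admin run-program celery -A treeherder worker --queues default
```

`newrelic-admin run-program` initializes the New Relic agent and then hands off to the real command, so every process launched this way reports its transactions automatically.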

Comment 19

9 months ago
(In reply to Ed Morley [:emorley] from comment #18)
> Yes, eg:
> 
> $ thp run newrelic-admin run-program ./manage.py analyze_failures

Hmm I tried that and it didn't appear in New Relic. I think because it's being run outside of the celery task framework and/or a Django web request, the New Relic agent doesn't know how to categorise it.

A better way of ensuring the task appears is to schedule an async task (like happens for the celerybeat scheduled periodic tasks), using:

$ thp run ./manage.py shell
...
from treeherder.seta.tasks import seta_analyze_failures
seta_analyze_failures.apply_async(time_limit=15*60, soft_time_limit=10*60)


(The `./manage.py shell` command doesn't need to be run with the New Relic wrapper script itself, since it's only adding the task to the queue, not running the task. The task will run shortly afterwards on the worker_default dyno, not the one-off dyno.)

Updated

9 months ago
Status: ASSIGNED → RESOLVED
Last Resolved: 9 months ago
Resolution: --- → FIXED