SETA - failures fixed by commit endpoint times out

RESOLVED FIXED

Status

Product: Tree Management
Component: Treeherder: SETA
Resolution: RESOLVED FIXED
Opened: 11 months ago
Closed: 9 months ago

People

(Reporter: armenzg, Assigned: rwood)

Tracking

(Blocks: 1 bug)

Details

Attachments

(2 attachments)

(Reporter)

Description

11 months ago
emorley, how can we see what is going wrong?
How can we simulate an environment like this?
The number of JobNotes on production is probably immense compared to stage.

Jan 11 11:35:32 treeherder-prod heroku/router: at=error code=H12 desc="Request timeout" method=GET path="/api/seta/failures-fixed-by-commit/" host=treeherder.mozilla.org request_id=9c28532c-84a8-4579-9b68-0f60c6701a2a fwd="66.207.193.22" dyno=web.2 connect=0ms service=30001ms status=503 bytes=0 

Code:
* https://github.com/mozilla/treeherder/blob/master/treeherder/webapp/api/seta.py#L42-L51
* https://github.com/mozilla/treeherder/blob/master/treeherder/seta/analyze_failures.py#L36-L108

Comment 1

11 months ago
I'd look at the New Relic profiles - check for reported slow transactions.
(Reporter)

Comment 2

10 months ago
I loaded the API (so it falls within 30 minutes) and I found it under 'transactions' [1]
That shows that it took 49.9 seconds. Most of it goes into MySql operations.

Is that too long?



[1] https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions#id=5b225765625472616e73616374696f6e2f46756e6374696f6e2f747265656865726465722e7765626170702e6170692e736574613a536574614661696c7572657346697865644279436f6d6d69742e6c697374222c22225d

Comment 3

10 months ago
(In reply to Armen Zambrano - Soon to be gone until end of March [:armenzg] (EDT/UTC-4) from comment #2)
> I loaded the API (so it falls within 30 minutes) and I found it under
> 'transactions' [1]
> That shows that it took 49.9 seconds. Most of it goes into MySql operations.

Yeah, we have a limit of 30 seconds for web transactions.

Options:

1. Speed up the SQL query
2. Pre-generate the data somehow and store it in memcache
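Option 2 could look roughly like the sketch below. This is a minimal illustration using a plain in-process dict as a stand-in for memcache; in Treeherder the real mechanism would be Django's cache framework, and the key name and function signatures here are hypothetical:

```python
import time

# Toy stand-in for memcache: key -> (expires_at, value).
_cache = {}

def cache_set(key, value, ttl=3600):
    _cache[key] = (time.time() + ttl, value)

def cache_get(key):
    entry = _cache.get(key)
    if entry is None or entry[0] < time.time():
        _cache.pop(key, None)
        return None
    return entry[1]

def failures_fixed_by_commit(compute):
    """Serve precomputed data from the cache.

    `compute` is the expensive query; under this scheme it would normally
    run in a periodic background task that refreshes the cache, so web
    requests only ever read (the miss path here is just a fallback).
    """
    data = cache_get("seta:failures-fixed-by-commit")
    if data is None:
        data = compute()
        cache_set("seta:failures-fixed-by-commit", data)
    return data
```

The point is that the 30-second Heroku request limit stops mattering, because the slow work happens outside the request/response cycle.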

Comment 4

10 months ago
(In reply to Armen Zambrano - Soon to be gone until end of March [:armenzg] (EDT/UTC-4) from comment #2)
> I loaded the API (so it falls within 30 minutes) and I found it under
> 'transactions' [1]
> That shows that it took 49.9 seconds. Most of it goes into MySql operations.

You can view older transactions by adjusting the time window on the menu at the top.

The trace in question is:
https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions#id=5b225765625472616e73616374696f6e2f46756e6374696f6e2f747265656865726465722e7765626170702e6170692e736574613a536574614661696c7572657346697865644279436f6d6d69742e6c697374222c22225d

The total transaction duration was 30s -- the 49.9 from what I can tell was actually "49.9%" for profile percentage breakdown of the Python parts.

I see that there are ~2800 SELECTs made against each of 4 different tables (i.e. over 10000 SELECTs in total).

I'd start by trying in Vagrant, and loading the API in the browser to see what the debug toolbar's SQL analyzer says. Even though the query time will be quicker locally, you should still be able to identify cases where we should be using e.g. select_related().

Comment 5

10 months ago
Ah yes there's no select_related() being used:

    for job_note in models.JobNote.objects.filter(failure_classification=2):

So Django lazy-loads the additional fields for each usage, rather than fetching all in one shot using a JOIN.

Something like this should work:

    job_notes = models.JobNote.objects.filter(failure_classification=2).select_related(
        'job', 'job__signature', 'job__job_type', 'job__option_collection')

(I can never decide the best indentation strategy for these)
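The effect of select_related() at the SQL level can be sketched with plain sqlite3 (the table names below are simplified stand-ins for the Treeherder schema): without it, the ORM issues one extra SELECT per row (the N+1 pattern behind the ~2800-SELECT trace above); with it, a single JOIN fetches everything.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
queries = []
conn.set_trace_callback(lambda sql: queries.append(sql))  # count each executed statement

conn.executescript("""
    CREATE TABLE job (id INTEGER PRIMARY KEY, job_type_id INTEGER);
    CREATE TABLE job_type (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO job_type VALUES (1, 'mochitest'), (2, 'reftest');
    INSERT INTO job VALUES (1, 1), (2, 2), (3, 1);
""")
queries.clear()

# N+1 pattern: one SELECT for the jobs, then one per row for the related
# table -- what the ORM does when each related field is lazy-loaded.
rows = conn.execute("SELECT id, job_type_id FROM job").fetchall()
lazy = [conn.execute("SELECT name FROM job_type WHERE id = ?", (jt,)).fetchone()[0]
        for _, jt in rows]
n_plus_one = len(queries)

queries.clear()
# select_related() equivalent: one JOIN fetches the same data in a single query.
joined = [name for _, name in conn.execute(
    "SELECT job.id, job_type.name FROM job "
    "JOIN job_type ON job.job_type_id = job_type.id")]
one_query = len(queries)

print(n_plus_one, one_query)  # prints: 4 1
```

With 2800 JobNotes and 4 related tables per note, the same ratio is what turns one logical query into over 10000 SELECTs.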

Comment 6

10 months ago
Meant to include these links :-)

https://docs.djangoproject.com/en/1.10/ref/models/querysets/#select-related
http://www.yilmazhuseyin.com/blog/dev/django-orm-performance-tips-part-2/

Updated

10 months ago
Component: Treeherder → Treeherder: SETA

Updated

10 months ago
Blocks: 1330728
(Assignee)

Updated

10 months ago
Blocks: 1326102
(Assignee)

Updated

9 months ago
Duplicate of this bug: 1338570
(Assignee)

Updated

9 months ago
Assignee: nobody → rwood
Status: NEW → ASSIGNED

Comment 8

9 months ago
Created attachment 8836800 [details] [review]
[treeherder] rwood-moz:bug1330354 > mozilla:master
(Assignee)

Updated

9 months ago
Attachment #8836800 - Flags: review?(emorley)

Comment 9

9 months ago
Created attachment 8836888 [details] [review]
[treeherder] rwood-moz:bug1330354-optimize > mozilla:master
(Assignee)

Comment 10

9 months ago
Comment on attachment 8836888 [details] [review]
[treeherder] rwood-moz:bug1330354-optimize > mozilla:master

Another pull request for a small optimization suggested by :wlach
Attachment #8836888 - Flags: review?(emorley)

Updated

9 months ago
Attachment #8836888 - Flags: review?(emorley) → review+

Comment 11

9 months ago
Comment on attachment 8836800 [details] [review]
[treeherder] rwood-moz:bug1330354 > mozilla:master

:-)
Attachment #8836800 - Flags: review?(emorley) → review+

Comment 12

9 months ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/42ee3c1ce8d185b1830d033152bacb9c5337a2e3
Bug 1330354 - Use select_related in get_failures_fixed_by_commit() (#2162)

To prevent thousands of MySQL requests that cause both the API endpoint
and seta-analyse-failures task to time out.

Comment 13

9 months ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/2af1cefb827c022c56172adb9952f403b696582e
Bug 1330354 - Small optimization for Job.get_platform_option (#2164)
(Assignee)

Comment 14

9 months ago
Leaving this bug open until both of these patches are deployed to production, and we can confirm the API/analyze_failures no longer times out there.
(Assignee)

Comment 15

9 months ago
After the patches were deployed, I ran analyze_failures on production and it appears to have finished successfully. I didn't see any exceptions on treeherder-prod in New Relic. I also didn't see any SETA APIs in the list of most time-consuming transactions in New Relic. I was trying to see the total run time of analyze_failures but couldn't find it listed. I ran the script via heroku run .. "./manage.py analyze_failures".

:emorley, is that sufficient to verify that analyze_failures runs in < 30s now? Is there any way to get an actual number? I dug around in New Relic but couldn't find it.
Flags: needinfo?(emorley)

Comment 16

9 months ago
(In reply to Robert Wood [:rwood] from comment #15)
> I ran the script via heroku run .. "./manage.py analyze_failures".
> 
> :emorley, is that sufficient to verify that analyze_failures runs in < 30s
> now? Is there any way to get an actual number? I dug around in New Relic but
> couldn't find it.

In order to show in New Relic, the agent has to wrap the command being run (like the commands listed in `Procfile`), eg:

newrelic-admin run-program ./manage.py analyze_failures
Flags: needinfo?(emorley)
(Assignee)

Comment 17

9 months ago
> In order to show in New Relic, the agent has to wrap the command being run
> (like the commands listed in `Procfile`), eg:
> 
> newrelic-admin run-program ./manage.py analyze_failures

Thanks :emorley. So I would run that command in a Heroku dyno? Actually, will it show up in New Relic when it is run automatically as the regularly scheduled celery task?
Flags: needinfo?(emorley)

Comment 18

9 months ago
Yes, eg:

$ thp run newrelic-admin run-program ./manage.py analyze_failures

(Using the helper aliases like: `alias thp="HEROKU_APP=treeherder-prod heroku"`)

It will show in New Relic when run automatically, since the worker that handles the job also uses the wrapper (every command in the Procfile uses the wrapper for this reason).
Flags: needinfo?(emorley)
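As an illustration, Procfile entries wrapped this way look roughly like the following (the specific command lines are hypothetical; Treeherder's actual Procfile may differ):

```
web: newrelic-admin run-program gunicorn treeherder.config.wsgi
worker_default: newrelic-admin run-program celery -A treeherder worker --queues default
```

`newrelic-admin run-program` initializes the New Relic agent and then hands off to the real command, so every process launched this way reports its transactions automatically.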

Comment 19

9 months ago
(In reply to Ed Morley [:emorley] from comment #18)
> Yes, eg:
> 
> $ thp run newrelic-admin run-program ./manage.py analyze_failures

Hmm I tried that and it didn't appear in New Relic. I think because it's being run outside of the celery task framework and/or a Django web request, the New Relic agent doesn't know how to categorise it.

A better way of ensuring the task appears is to schedule an async task (like happens for the celerybeat scheduled periodic tasks), using:

$ thp run ./manage.py shell
...
from treeherder.seta.tasks import seta_analyze_failures
seta_analyze_failures.apply_async(time_limit=15*60, soft_time_limit=10*60)


(The `./manage.py shell` command doesn't need to be run with the New Relic wrapper script itself, since it's only adding the task to the queue, not running the task. The task will run shortly afterwards on the worker_default dyno, not the one-off dyno.)

Updated

9 months ago
Status: ASSIGNED → RESOLVED
Last Resolved: 9 months ago
Resolution: --- → FIXED