1368982 - seta-analyze-failures task is timing out (SoftTimeLimitExceeded)

The link in comment 0 isn't an error, but profiles (the error is just "this thing took too long so timed out"), so not easily pasted. I've sent an invite to the New Relic account :-)

Flags: needinfo?(emorley)

Joel Maher ( :jmaher ) (UTC -8)

Comment 5

•

8 years ago

all of SETA is in treeherder now (although windows buildbot still uses ouija, although ouija is not accessible anymore!)

Robert Wood [:rwood]

Comment 6

•

8 years ago

Thanks :ahal, fyi: https://github.com/rwood-moz/treeherder/blob/master/treeherder/seta/analyze_failures.py

Robert Wood [:rwood]

Comment 7

•

8 years ago

er not my repo, hah, https://github.com/mozilla/treeherder/blob/master/treeherder/seta/analyze_failures.py

Andrew Halberstadt [:ahal]

Comment 8

•

8 years ago

Thanks. Might be a few days until I have a chance to dig into this, but I'll see if I can figure it out.

Assignee: nobody → ahalberstadt

Status: NEW → ASSIGNED

Andrew Halberstadt [:ahal]

Comment 9

•

8 years ago

Ok, I'm probably stating the obvious, but just want to make sure I vaguely understand what needs to happen here.. It looks like celery has a time limit of 300 seconds per task, and the seta-analyze-failures task is averaging 276 seconds (which is presumably close enough to the limit that it intermittently times out, or is it perma failing?). The bulk of which time is spent in a single query to the job_notes table. So basically that query needs to be more efficient somehow, or we need to bump up the timeout for this task. Preferably the former.

Joel Maher ( :jmaher ) (UTC -8)

Comment 10

•

8 years ago

that seems like the right way to look at this problem.

Robert Wood [:rwood]

Comment 11

•

8 years ago

(In reply to Andrew Halberstadt [:ahal] from comment #9) FYI I made a similar optimization in the past: https://github.com/mozilla/treeherder/commit/9c4a9c76384f68e1f2611eb13c9d0a50b582e393

Andrew Halberstadt [:ahal]

Comment 12

•

8 years ago

Thanks! Rob, do you have any tips for how you tested this? So far I've ingested pushes from inbound, manually tagged failures as fixed-by-commit and ran analyze_failures. Stuff happens, but I haven't been able to make heads or tails of it yet. Though, it doesn't take anywhere near ~5min to run, I guess due to less data? Please bear with me while I attempt to grok everything..

Andrew Halberstadt [:ahal]

Comment 13

•

8 years ago

The good news here is that it is intermittent: https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions?type=other#id=5b224f746865725472616e73616374696f6e2f43656c6572792f736574612d616e616c797a652d6661696c75726573222c22225d So at least the seta data won't get *that* far out of date, it just won't update as frequently as we'd like.

Robert Wood [:rwood]

Comment 14

•

8 years ago

(In reply to Andrew Halberstadt [:ahal] from comment #12) > Thanks! Rob, do you have any tips for how you tested this? > > So far I've ingested pushes from inbound, manually tagged failures as > fixed-by-commit and ran analyze_failures. Stuff happens, but I haven't been > able to make heads or tails of it yet. Though, it doesn't take anywhere near > ~5min to run, I guess due to less data? Yes that's all you can really do, ingest a few pushes and hit the API locally with your changes, just to verify the API works. Also if you hit the API in your browser on localhost you can also look at the django toolbar (right side of browser) and click on SQL and you'll see the actual SQL query that is being generated, if you want to verify at that level. Otherwise all you can really do for real test data is deploy your PR/branch to treeherder-prototype or treeherder-stage, and you can run the analyze failures script on there via a heroku worker i.e. "heroku run --app treeherder-stage "./manage.py analyze_failures".

Ed Morley [:emorley]

Reporter

Comment 15

•

8 years ago

Once you have the SELECT SQL being run (I'd check out the traces in New Relic) you can run against stage/prod to see what can be done to optimise them.

Andrew Halberstadt [:ahal]

Comment 16

•

8 years ago

Thanks, great idea! I figured out how to get valid SQL out of the django ORM, so this should be a relatively easy way to test out performance with live data. Though now I have the opposite problem.. Running the query on redash takes a lot longer than ~5min.. Would that be expected?

Ed Morley [:emorley]

Reporter

Comment 17

•

8 years ago

The RDS instance used by redash is an m4.xlarge compared to the read-write RDS instance which is an m4.2xlarge, so half the RAM/vCPUs, though lower load. So long as you can get the query to complete / can experiment with adjusting it to make it more efficient (might be worth starting with an EXPLAIN), then should be fine.

Ed Morley [:emorley]

Reporter

Comment 18

•

8 years ago

I can also give out credentials to access MySQL on the RDS instances directly, if you don't already have them.

Andrew Halberstadt [:ahal]

Comment 19

•

8 years ago

Should probably unassign myself. The last I had done here was trying (and failing) to improve the perf of the query. Keep in mind I've hardly ever written a SQL query in my life, so I'm by no means saying that it can't be improved.. Just that I didn't see anything obviously wrong. We noticed that it was only timing out part of the time, so the SETA data was still being updated every other day or so. Because it's not a huge deal to have data ~1 day stale, this bug isn't the biggest priority from the SETA point of view.. though of course if performance got worse, we could definitely run into trouble. I'd be tempted to just bump up the timeout for this task (if possible), it runs overnight PST so it doesn't seem terribly bad to give it a bit of extra time. Especially if there's a re-write of SETA being planned for sometime in Q4/Q1.

Assignee: ahalberstadt → nobody

Status: ASSIGNED → NEW

Ed Morley [:emorley]

Reporter

Comment 20

•

8 years ago

The query that is taking 3-4 minutes to run is the one that searches job_note: https://emorley.pastebin.mozilla.org/9029280

Ed Morley [:emorley]

Reporter

Comment 21

•

7 years ago

The pastebin in comment 20 has expired, here's a new go... Looking at: https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions?show_browser=false&tw%5Bend%5D=1505229772&tw%5Bstart%5D=1505143372&type=other#id=5b224f746865725472616e73616374696f6e2f43656c6572792f736574612d616e616c797a652d6661696c75726573222c22225d&app_trace_id=b45a2d51-974e-11e7-86ce-0242ac110011_39982_42495 Traceback: ... File "/app/treeherder/seta/tasks.py", line 8, in seta_analyze_failures File "/app/treeherder/seta/analyze_failures.py", line 27, in run File "/app/treeherder/seta/analyze_failures.py", line 82, in get_failures_fixed_by_commit The DB query that was taking too long is (I fetched an un-obfuscated query from the AWS slow query log): # Time: 2017-09-12T00:09:52.674933Z # Query_time: 90.000682 Lock_time: 0.000299 Rows_sent: 30331 Rows_examined: 18114891 SET timestamp=1505174992; SELECT `job_note`.`id`, `job_note`.`job_id`, `job_note`.`failure_classification_id`, `job_note`.`user_id`, `job_note`.`text`, `job_note`.`created`, `job`.`id`, `job`.`repository_id`, `job`.`guid`, `job`.`project_specific_id`, `job`.`autoclassify_status`, `job`.`coalesced_to_guid`, `job`.`signature_id`, `job`.`build_platform_id`, `job`.`machine_platform_id`, `job`.`machine_id`, `job`.`option_collection_hash`, `job`.`job_type_id`, `job`.`job_group_id`, `job`.`product_id`, `job`.`failure_classification_id`, `job`.`who`, `job`.`reason`, `job`.`result`, `job`.`state`, `job`.`submit_time`, `job`.`start_time`, `job`.`end_time`, `job`.`last_modified`, `job`.`running_eta`, `job`.`tier`, `job`.`push_id`, `reference_data_signatures`.`id`, `reference_data_signatures`.`name`, `reference_data_signatures`.`signature`, `reference_data_signatures`.`build_os_name`, `reference_data_signatures`.`build_platform`, `reference_data_signatures`.`build_architecture`, `reference_data_signatures`.`machine_os_name`, `reference_data_signatures`.`machine_platform`, `reference_data_signatures`.`machine_architecture`, `reference_data_signatures`.`job_group_name`, `reference_data_signatures`.`job_group_symbol`, `reference_data_signatures`.`job_type_name`, `reference_data_signatures`.`job_type_symbol`, `reference_data_signatures`.`option_collection_hash`, `reference_data_signatures`.`build_system_type`, `reference_data_signatures`.`repository`, `reference_data_signatures`.`first_submission_timestamp`, `job_type`.`id`, `job_type`.`symbol`, `job_type`.`name`, `job_type`.`description` FROM `job_note` INNER JOIN `job` ON (`job_note`.`job_id` = `job`.`id`) INNER JOIN `repository` ON (`job`.`repository_id` = `repository`.`id`) INNER JOIN `reference_data_signatures` ON (`job`.`signature_id` = `reference_data_signatures`.`id`) INNER JOIN `job_type` ON (`job`.`job_type_id` = `job_type`.`id`) WHERE (`job_note`.`created` > '2017-06-14 00:08:13.816394' AND `job_note`.`failure_classification_id` = 2 AND `job_note`.`text` IS NOT NULL AND `repository`.`name` IN ('mozilla-central', 'autoland', 'mozilla-inbound') AND NOT (`reference_data_signatures`.`build_platform` IN ('osx-10-9', 'windows8-32', 'windowsxp', 'linux64-stylo', 'osx-10-7', 'other', 'android-5-0-armv8-api15', 'android-6-0-armv8-api15', 'android-4-2-armv7-api15', 'Win 6.3.9600 x86_64', 'windows7-64', 'osx-10-11', 'android-4-4-armv7-api15', 'taskcluster-images', 'android-5-1-armv7-api15')) AND NOT (`job_note`.`text` = '')); An explain of that query: https://emorley.pastebin.mozilla.org/9032227

Ed Morley [:emorley]

Reporter

Comment 22

•

7 years ago

The query examined 18 million rows before it got cut off by the 90s query timeout, which seems pretty broken.

Armen [:armenzg]

Updated

•

7 years ago

Priority: P1 → P3

Armen [:armenzg]

Comment 23

•

5 years ago

We're moving away from SETA wontfixing.

Status: NEW → RESOLVED

Closed: 5 years ago

Resolution: --- → WONTFIX

BMO Automation

Updated

•

4 years ago

Product: Tree Management → Tree Management Graveyard