Closed Bug 1368982 Opened 7 years ago Closed 4 years ago

seta-analyze-failures task is timing out (SoftTimeLimitExceeded)

Categories

(Tree Management Graveyard :: Treeherder: SETA, defect, P3)

defect

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: emorley, Unassigned)

References

(Blocks 1 open bug)

Details

Thanks Ed, I believe :ahal may be looking into this
Flags: needinfo?(rwood) → needinfo?(ahalberstadt)
Heh, I can try.. Rob, is this the right repo?
https://github.com/mozilla/ouija

Ed, I can't access those.. Could you paste the errors to the bug?
Flags: needinfo?(ahalberstadt) → needinfo?(emorley)
The link in comment 0 isn't an error but a set of transaction profiles (the error itself is just "this took too long, so it timed out"), so it's not easily pasted. I've sent an invite to the New Relic account :-)
Flags: needinfo?(emorley)
All of SETA is in Treeherder now (Windows Buildbot still uses Ouija, though Ouija is not accessible anymore!)
Thanks. Might be a few days until I have a chance to dig into this, but I'll see if I can figure it out.
Assignee: nobody → ahalberstadt
Status: NEW → ASSIGNED
Ok, I'm probably stating the obvious, but just want to make sure I vaguely understand what needs to happen here..

It looks like celery has a time limit of 300 seconds per task, and the seta-analyze-failures task is averaging 276 seconds, which is presumably close enough to the limit that it intermittently times out (or is it perma-failing?). The bulk of that time is spent in a single query against the job_note table. So basically that query needs to be made more efficient somehow, or we need to bump up the timeout for this task. Preferably the former.
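
For reference, a rough sketch of what bumping the timeout could look like (the module path, class name, and decorator arguments below are assumptions based on the Treeherder source, not a tested patch):

# treeherder/seta/tasks.py -- hypothetical tweak, not a tested patch
from celery import shared_task

from treeherder.seta.analyze_failures import AnalyzeFailures  # class name assumed


# Raise the soft limit above the observed ~276s average so the slower runs
# no longer raise SoftTimeLimitExceeded.
@shared_task(name='seta-analyze-failures', soft_time_limit=600)
def seta_analyze_failures():
    AnalyzeFailures().run()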
that seems like the right way to look at this problem.
(In reply to Andrew Halberstadt [:ahal] from comment #9)

FYI I made a similar optimization in the past:

https://github.com/mozilla/treeherder/commit/9c4a9c76384f68e1f2611eb13c9d0a50b582e393
Thanks! Rob, do you have any tips for how you tested this?

So far I've ingested pushes from inbound, manually tagged failures as fixed-by-commit and ran analyze_failures. Stuff happens, but I haven't been able to make heads or tails of it yet. Though, it doesn't take anywhere near ~5min to run, I guess due to less data?

Please bear with me while I attempt to grok everything..
The good news here is that it is intermittent:
https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions?type=other#id=5b224f746865725472616e73616374696f6e2f43656c6572792f736574612d616e616c797a652d6661696c75726573222c22225d

So at least the SETA data won't get *that* far out of date; it just won't update as frequently as we'd like.
(In reply to Andrew Halberstadt [:ahal] from comment #12)
> Thanks! Rob, do you have any tips for how you tested this?
> 
> So far I've ingested pushes from inbound, manually tagged failures as
> fixed-by-commit and ran analyze_failures. Stuff happens, but I haven't been
> able to make heads or tails of it yet. Though, it doesn't take anywhere near
> ~5min to run, I guess due to less data?

Yes, that's all you can really do: ingest a few pushes and hit the API locally with your changes, just to verify the API works. If you hit the API in your browser on localhost, you can also open the Django Debug Toolbar (right side of the browser) and click on SQL to see the actual SQL query being generated, if you want to verify at that level.

Otherwise, the only way to test against real data is to deploy your PR/branch to treeherder-prototype or treeherder-stage and run the analyze-failures script there via a Heroku worker, i.e. "heroku run --app treeherder-stage ./manage.py analyze_failures".
Once you have the SELECT SQL that's being run (I'd check out the traces in New Relic), you can run it against stage/prod to see what can be done to optimise it.
Thanks, great idea! I figured out how to get valid SQL out of the django ORM, so this should be a relatively easy way to test out performance with live data.

Though now I have the opposite problem.. Running the query on redash takes a lot longer than ~5min.. Would that be expected?
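
For posterity, this is roughly how I pulled the SQL out of the ORM (the queryset below is a simplified stand-in for the real one in analyze_failures.py, and the JobNote import path is from memory):

# Run from ./manage.py shell; the filter is a simplified stand-in for the
# real queryset built in analyze_failures.py.
from django.db import connection
from treeherder.model.models import JobNote  # import path from memory

qs = JobNote.objects.filter(failure_classification_id=2).select_related('job')
print(qs.query)  # the SELECT Django will generate (parameters inlined approximately)

# Or, with DEBUG=True, execute the queryset and inspect what actually ran:
list(qs)
print(connection.queries[-1]['sql'])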
The RDS instance used by redash is an m4.xlarge, compared to the read-write RDS instance which is an m4.2xlarge, so it has half the RAM/vCPUs (albeit under lower load), which would explain queries taking longer there.

So long as you can get the query to complete and can experiment with adjusting it to make it more efficient (might be worth starting with an EXPLAIN), that should be fine.
I can also give out credentials to access MySQL on the RDS instances directly, if you don't already have them.
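
In case it saves time, a minimal sketch of running EXPLAIN from a Django shell once you have the raw SQL (the statement below is a cut-down placeholder, not the full query from the slow query log):

# From ./manage.py shell (pointed at stage/prod with read-only credentials);
# the SQL here is a cut-down placeholder for the full slow-query-log statement.
from django.db import connection

sql = (
    "SELECT job_note.id FROM job_note "
    "INNER JOIN job ON (job_note.job_id = job.id) "
    "WHERE job_note.failure_classification_id = 2 "
    "AND job_note.created > '2017-06-14'"
)
with connection.cursor() as cursor:
    cursor.execute("EXPLAIN " + sql)
    for row in cursor.fetchall():
        print(row)  # check the 'key' and 'rows' columns for each joined table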
I should probably unassign myself. The last thing I did here was try (and fail) to improve the perf of the query. Keep in mind I've hardly ever written a SQL query in my life, so I'm by no means saying it can't be improved.. just that I didn't see anything obviously wrong.

We noticed that it was only timing out part of the time, so the SETA data was still being updated every other day or so. Because it's not a huge deal to have data ~1 day stale, this bug isn't the biggest priority from the SETA point of view.. though of course if performance got worse, we could definitely run into trouble.

I'd be tempted to just bump up the timeout for this task (if possible); it runs overnight PST, so it doesn't seem terribly bad to give it a bit of extra time. Especially if there's a re-write of SETA being planned for sometime in Q4/Q1.
Assignee: ahalberstadt → nobody
Status: ASSIGNED → NEW
The query that is taking 3-4 minutes to run is the one that searches job_note:
https://emorley.pastebin.mozilla.org/9029280
The pastebin in comment 20 has expired; here's another attempt...

Looking at:
https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions?show_browser=false&tw%5Bend%5D=1505229772&tw%5Bstart%5D=1505143372&type=other#id=5b224f746865725472616e73616374696f6e2f43656c6572792f736574612d616e616c797a652d6661696c75726573222c22225d&app_trace_id=b45a2d51-974e-11e7-86ce-0242ac110011_39982_42495

Traceback:
...
File "/app/treeherder/seta/tasks.py", line 8, in seta_analyze_failures
File "/app/treeherder/seta/analyze_failures.py", line 27, in run
File "/app/treeherder/seta/analyze_failures.py", line 82, in get_failures_fixed_by_commit

The DB query that was taking too long is (I fetched an un-obfuscated query from the AWS slow query log):

# Time: 2017-09-12T00:09:52.674933Z
# Query_time: 90.000682  Lock_time: 0.000299 Rows_sent: 30331  Rows_examined: 18114891
SET timestamp=1505174992;
SELECT `job_note`.`id`, `job_note`.`job_id`, `job_note`.`failure_classification_id`, `job_note`.`user_id`, `job_note`.`text`, `job_note`.`created`, `job`.`id`, `job`.`repository_id`, `job`.`guid`, `job`.`project_specific_id`, `job`.`autoclassify_status`, `job`.`coalesced_to_guid`, `job`.`signature_id`, `job`.`build_platform_id`, `job`.`machine_platform_id`, `job`.`machine_id`, `job`.`option_collection_hash`, `job`.`job_type_id`, `job`.`job_group_id`, `job`.`product_id`, `job`.`failure_classification_id`, `job`.`who`, `job`.`reason`, `job`.`result`, `job`.`state`, `job`.`submit_time`, `job`.`start_time`, `job`.`end_time`, `job`.`last_modified`, `job`.`running_eta`, `job`.`tier`, `job`.`push_id`, `reference_data_signatures`.`id`, `reference_data_signatures`.`name`, `reference_data_signatures`.`signature`, `reference_data_signatures`.`build_os_name`, `reference_data_signatures`.`build_platform`, `reference_data_signatures`.`build_architecture`, `reference_data_signatures`.`machine_os_name`, `reference_data_signatures`.`machine_platform`, `reference_data_signatures`.`machine_architecture`, `reference_data_signatures`.`job_group_name`, `reference_data_signatures`.`job_group_symbol`, `reference_data_signatures`.`job_type_name`, `reference_data_signatures`.`job_type_symbol`, `reference_data_signatures`.`option_collection_hash`, `reference_data_signatures`.`build_system_type`, `reference_data_signatures`.`repository`, `reference_data_signatures`.`first_submission_timestamp`, `job_type`.`id`, `job_type`.`symbol`, `job_type`.`name`, `job_type`.`description` FROM `job_note` INNER JOIN `job` ON (`job_note`.`job_id` = `job`.`id`) INNER JOIN `repository` ON (`job`.`repository_id` = `repository`.`id`) INNER JOIN `reference_data_signatures` ON (`job`.`signature_id` = `reference_data_signatures`.`id`) INNER JOIN `job_type` ON (`job`.`job_type_id` = `job_type`.`id`) WHERE (`job_note`.`created` > '2017-06-14 00:08:13.816394' AND `job_note`.`failure_classification_id` = 2 AND `job_note`.`text` IS NOT NULL AND `repository`.`name` IN ('mozilla-central', 'autoland', 'mozilla-inbound') AND NOT (`reference_data_signatures`.`build_platform` IN ('osx-10-9', 'windows8-32', 'windowsxp', 'linux64-stylo', 'osx-10-7', 'other', 'android-5-0-armv8-api15', 'android-6-0-armv8-api15', 'android-4-2-armv7-api15', 'Win 6.3.9600 x86_64', 'windows7-64', 'osx-10-11', 'android-4-4-armv7-api15', 'taskcluster-images', 'android-5-1-armv7-api15')) AND NOT (`job_note`.`text` = ''));

An explain of that query:
https://emorley.pastebin.mozilla.org/9032227
The query examined 18 million rows before it got cut off by the 90s query timeout, which seems pretty broken.
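
Purely as an illustration of the kind of change that might help (untried in this bug, and it would need an EXPLAIN before/after to confirm), a composite index on the job_note filter columns could reduce the rows examined. A hypothetical migration:

# Hypothetical Django migration -- illustrative only, untested.
from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [('model', '0001_initial')]  # placeholder dependency

    operations = [
        migrations.RunSQL(
            "CREATE INDEX idx_job_note_classification_created "
            "ON job_note (failure_classification_id, created);",
            reverse_sql="DROP INDEX idx_job_note_classification_created ON job_note;",
        ),
    ]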
Priority: P1 → P3

We're moving away from SETA, so wontfixing this.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WONTFIX
Product: Tree Management → Tree Management Graveyard