
cycle_data is failing on stage/production

Status: REOPENED
Product: Tree Management
Component: Treeherder: Infrastructure
Priority: P1
Severity: normal
Reported: 7 months ago
Modified: 9 days ago

People: (Reporter: emorley, Assigned: emorley)

Tracking: (Depends on: 2 bugs)

Attachments: (2 attachments)
(Assignee)

Description

7 months ago
Breaking out from bug 1284432 comment 35.

`cycle_non_job_data()` is failing during:

used_machine_ids = set(Job.objects.values_list(
            'machine_id', flat=True).distinct())
Machine.objects.exclude(id__in=used_machine_ids).delete()

The SQL being used for this is roughly:

used_machine_ids -> "SELECT DISTINCT `job`.`machine_id` FROM `job`"

Then 

"SELECT `machine`.`id`, `machine`.`name` FROM `machine` WHERE NOT (`machine`.`id` IN (1, 2, 3, 4, 5, ...))"

This doesn't work when there are 2.6 million distinct machine ids used in the jobs table (i.e. it tries to pass a 2.6-million-item tuple).

Instead of flattening the ids into a Python collection, we should pass the queryset directly, e.g. something like:

Machine.objects.exclude(id__in=Job.objects.values('machine_id')).delete()
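
For illustration only, a minimal sketch of the two approaches side by side (the function names here are hypothetical; the models are the existing treeherder.model.models ones):

from treeherder.model.models import Job, Machine


def delete_unused_machines_materialised():
    # Current approach: evaluates the queryset in Python, building a
    # multi-million-element set that then gets inlined into the
    # DELETE's WHERE ... IN (...) clause.
    used_machine_ids = set(Job.objects.values_list(
        'machine_id', flat=True).distinct())
    Machine.objects.exclude(id__in=used_machine_ids).delete()


def delete_unused_machines_subquery():
    # Proposed approach: the unevaluated queryset is rendered as a SQL
    # subquery, so the ids never have to be fetched into Python at all.
    Machine.objects.exclude(
        id__in=Job.objects.values('machine_id')).delete()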
(Assignee)

Comment 1

7 months ago
$ ./manage.py shell
...
>>> from treeherder.model.models import Job, Machine
>>> print Machine.objects.exclude(id__in=Job.objects.values('machine_id')).query
SELECT `machine`.`id`, `machine`.`name` FROM `machine` WHERE NOT (`machine`.`id` IN (SELECT U0.`machine_id` FROM `job` U0))

> EXPLAIN SELECT `machine`.`id`, `machine`.`name` FROM `machine` WHERE NOT (`machine`.`id` IN (SELECT U0.`machine_id` FROM `job` U0))

******************** 1. row *********************
           id: 1
  select_type: PRIMARY
        table: machine
         type: index
possible_keys: 
          key: machine_name_4a8b45973a00cd64_uniq
      key_len: 302
          ref: 
         rows: 5340061
        Extra: Using where; Using index
******************** 2. row *********************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: U0
         type: index_subquery
possible_keys: job_a9374927
          key: job_a9374927
      key_len: 4
          ref: func
         rows: 3
        Extra: Using index
2 rows in set


-> Looks good to me.
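
(Aside: on newer Django versions, 2.1 and later, the same check can be done without dropping to the MySQL shell; a rough sketch:)

from treeherder.model.models import Job, Machine

qs = Machine.objects.exclude(id__in=Job.objects.values('machine_id'))
print(qs.query)      # the SQL Django will generate
print(qs.explain())  # the database's EXPLAIN output (QuerySet.explain(), Django 2.1+)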

Comment 2

7 months ago
Created attachment 8846450 [details] [review]
[treeherder] mozilla:cycle-non-job-data-timeouts > mozilla:master
(Assignee)

Updated

7 months ago
Attachment #8846450 - Flags: review?(wlachance)
(Assignee)

Comment 3

7 months ago
I don't think we need to chunk this query. Even in the worst-case scenario (it hasn't run for a month, it's running on prototype's slower m4.xlarge, bug 1346565 not yet fixed, etc.), it was still manageable on prototype:

19:02:03    DELETE FROM `machine` WHERE NOT (`machine`.`id` IN (SELECT U0.`machine_id` FROM `job` U0))
  1687275 row(s) affected 69.625 sec
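
(For contrast, chunking the delete would look roughly like the sketch below; the helper name and chunk size are illustrative, not what landed.)

from treeherder.model.models import Job, Machine

CHUNK_SIZE = 2000


def delete_unused_machines_in_chunks():
    unused = Machine.objects.exclude(id__in=Job.objects.values('machine_id'))
    while True:
        # Sliced querysets can't be delete()d directly, so fetch a small
        # batch of ids and delete by primary key instead.
        ids = list(unused.values_list('id', flat=True)[:CHUNK_SIZE])
        if not ids:
            break
        Machine.objects.filter(id__in=ids).delete()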
Comment 4

Comment on attachment 8846450 [details] [review]
[treeherder] mozilla:cycle-non-job-data-timeouts > mozilla:master

This is great, thank you!
Attachment #8846450 - Flags: review?(wlachance) → review+

Comment 5

6 months ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/b12defb0cba0c945ed65c86342474f5b6c684cc4
Bug 1346567 - Rewrite inefficient queries in cycle_non_job_data()

Previously the select was evaluated independently of the delete,
causing timeouts when attempting to return 2.6 million machine ids
from the jobs table.

Now the select queryset isn't evaluated on its own, and is instead only
used to generate the subquery in e.g.:
  SELECT `machine`.`id`, `machine`.`name` FROM `machine` WHERE NOT
        (`machine`.`id` IN (SELECT U0.`machine_id` FROM `job` U0));
(Assignee)

Updated

6 months ago
Status: ASSIGNED → RESOLVED
Last Resolved: 6 months ago
Resolution: --- → FIXED
(Assignee)

Comment 6

6 months ago
Still getting timeouts:
https://papertrailapp.com/systems/treeherder-prod/events?centered_on_id=778402766182834248&q=program%3A%2Frun.8132
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 7

2 months ago
cycle_data is now failing due to running out of RAM (Error R15), e.g.:
https://papertrailapp.com/systems/treeherder-prod/events?q=program%3Aworker_default.2%20-sample&focus=829935121451872309

Aug 04 01:15:34 treeherder-prod app/worker_default.2: [2017-08-04 00:15:34,572: INFO/Worker-6] Reporting to: https://rpm.newrelic.com/accounts/677903/applications/14179757 
Aug 04 02:11:59 treeherder-prod heroku/worker_default.2: Process running mem=1155M(112.9%) 
Aug 04 02:11:59 treeherder-prod heroku/worker_default.2: Error R14 (Memory quota exceeded) 
Aug 04 02:12:22 treeherder-prod heroku/worker_default.2: Process running mem=1918M(187.3%) 
Aug 04 02:12:22 treeherder-prod heroku/worker_default.2: Error R14 (Memory quota exceeded) 
Aug 04 02:12:43 treeherder-prod heroku/worker_default.2: Process running mem=2270M(221.7%) 
Aug 04 02:12:43 treeherder-prod heroku/worker_default.2: Error R15 (Memory quota vastly exceeded) 
Aug 04 02:12:43 treeherder-prod heroku/worker_default.2: Stopping process with SIGKILL 
Aug 04 02:12:44 treeherder-prod heroku/worker_default.2: State changed from up to crashed 
Aug 04 02:12:44 treeherder-prod heroku/worker_default.2: Process exited with status 137 

In addition to fixing the RAM issue, we should also:
* improve the logging (since there's nothing in the log that actually says this task started, I'm having to read between the lines after noticing cycle_data isn't listed in New Relic's non-web transactions); see the sketch after this list
* consider moving these long-running tasks, which only run a few times a day, to the Heroku scheduler instead (bug 1176492)
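
A rough idea of the kind of start/finish logging meant in the first bullet (the function names and the call shown are illustrative placeholders, not Treeherder's actual task code):

import logging
import time

logger = logging.getLogger(__name__)


def run_cycle_data(chunk_size, sleep_time):
    logger.info("cycle_data starting (chunk_size=%s, sleep_time=%s)",
                chunk_size, sleep_time)
    started = time.time()
    try:
        cycle_non_job_data(chunk_size, sleep_time)  # placeholder for the real work
    except Exception:
        logger.exception("cycle_data failed after %.1fs", time.time() - started)
        raise
    logger.info("cycle_data finished in %.1fs", time.time() - started)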
Summary: Refactor cycle_non_job_data() to prevent 'MySQL server has gone away' → cycle_data is failing on stage/production
(Assignee)

Updated

2 months ago
Depends on: 1176492
(Assignee)

Updated

2 months ago
Depends on: 1387543
(Assignee)

Comment 8

2 months ago
I'm running a one-off cycle_data to try to clear some of the backlog (and to see how bad the RAM usage really is):
`thp run:detached --size=performance-l -- ./manage.py cycle_data --debug --sleep-time 0  --chunk-size 2000`

Logs:
https://papertrailapp.com/systems/treeherder-prod/events?q=program%3A%2Frun.6241
(Assignee)

Comment 9

2 months ago
Created attachment 8893997 [details]
cycle-data log
(Assignee)

Updated

2 months ago
Attachment #8893997 - Attachment description: cycle-data.log → cycle-data log
Attachment #8893997 - Attachment mime type: text/x-log → text/plain
(Assignee)

Updated

9 days ago
Component: Treeherder → Treeherder: Infrastructure