Closed Bug 1346567 Opened 7 years ago Closed 5 years ago

cycle_data is failing on stage/production

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: igoldan)

References

(Blocks 1 open bug)

Details

Attachments

(7 files)

Breaking out from bug 1284432 comment 35.

`cycle_non_job_data()` is failing during:

used_machine_ids = set(Job.objects.values_list(
    'machine_id', flat=True).distinct())
Machine.objects.exclude(id__in=used_machine_ids).delete()

The SQL being used for this is roughly:

used_machine_ids -> "SELECT DISTINCT `job`.`machine_id` FROM `job`"

Then:

"SELECT `machine`.`id`, `machine`.`name` FROM `machine` WHERE NOT (`machine`.`id` IN (1, 2, 3, 4, 5, ...))"

This doesn't work when there are 2.6 million distinct machine ids used in the jobs table (ie it tries to pass a 2.6-million-item tuple).

Instead of flattening to a list, we should pass the queryset directly, eg something like:

Machine.objects.exclude(id__in=Job.objects.values('machine_id')).delete()
$ ./manage.py shell
...
>>> from treeherder.model.models import Job, Machine
>>> print Machine.objects.exclude(id__in=Job.objects.values('machine_id')).query
SELECT `machine`.`id`, `machine`.`name` FROM `machine` WHERE NOT (`machine`.`id` IN (SELECT U0.`machine_id` FROM `job` U0))

> EXPLAIN SELECT `machine`.`id`, `machine`.`name` FROM `machine` WHERE NOT (`machine`.`id` IN (SELECT U0.`machine_id` FROM `job` U0))

******************** 1. row *********************
           id: 1
  select_type: PRIMARY
        table: machine
         type: index
possible_keys: 
          key: machine_name_4a8b45973a00cd64_uniq
      key_len: 302
          ref: 
         rows: 5340061
        Extra: Using where; Using index
******************** 2. row *********************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: U0
         type: index_subquery
possible_keys: job_a9374927
          key: job_a9374927
      key_len: 4
          ref: func
         rows: 3
        Extra: Using index
2 rows in set


-> Looks good to me.
Attachment #8846450 - Flags: review?(wlachance)
I don't think we need to chunk this query. Even in the worst-case scenario (it hasn't run for a month, it's running on prototype's slower m4.xlarge, bug 1346565 isn't yet fixed, etc), it was still manageable on prototype:

19:02:03    DELETE FROM `machine` WHERE NOT (`machine`.`id` IN (SELECT U0.`machine_id` FROM `job` U0))
  1687275 row(s) affected 69.625 sec
Comment on attachment 8846450 [details] [review]
[treeherder] mozilla:cycle-non-job-data-timeouts > mozilla:master

This is great, thank you!
Attachment #8846450 - Flags: review?(wlachance) → review+
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/b12defb0cba0c945ed65c86342474f5b6c684cc4
Bug 1346567 - Rewrite inefficient queries in cycle_non_job_data()

Previously the select was evaluated independently of the delete,
causing timeouts when attempting to return 2.6 million machine ids
from the jobs table.

Now the select queryset isn't evaluated on its own, and instead is only
used to generate the subquery in eg:
  SELECT `machine`.`id`, `machine`.`name` FROM `machine` WHERE NOT
        (`machine`.`id` IN (SELECT U0.`machine_id` FROM `job` U0));
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Still getting timeouts:
https://papertrailapp.com/systems/treeherder-prod/events?centered_on_id=778402766182834248&q=program%3A%2Frun.8132
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
cycle_data is now failing due to running out of RAM (Error R15), eg:
https://papertrailapp.com/systems/treeherder-prod/events?q=program%3Aworker_default.2%20-sample&focus=829935121451872309

Aug 04 01:15:34 treeherder-prod app/worker_default.2: [2017-08-04 00:15:34,572: INFO/Worker-6] Reporting to: https://rpm.newrelic.com/accounts/677903/applications/14179757 
Aug 04 02:11:59 treeherder-prod heroku/worker_default.2: Process running mem=1155M(112.9%) 
Aug 04 02:11:59 treeherder-prod heroku/worker_default.2: Error R14 (Memory quota exceeded) 
Aug 04 02:12:22 treeherder-prod heroku/worker_default.2: Process running mem=1918M(187.3%) 
Aug 04 02:12:22 treeherder-prod heroku/worker_default.2: Error R14 (Memory quota exceeded) 
Aug 04 02:12:43 treeherder-prod heroku/worker_default.2: Process running mem=2270M(221.7%) 
Aug 04 02:12:43 treeherder-prod heroku/worker_default.2: Error R15 (Memory quota vastly exceeded) 
Aug 04 02:12:43 treeherder-prod heroku/worker_default.2: Stopping process with SIGKILL 
Aug 04 02:12:44 treeherder-prod heroku/worker_default.2: State changed from up to crashed 
Aug 04 02:12:44 treeherder-prod heroku/worker_default.2: Process exited with status 137 

In addition to fixing the RAM issue, we should also:
* improve the logging (since there's nothing in the log that actually says this task started, I'm having to read between the lines after noticing that cycle_data isn't listed in New Relic's non-web transactions)
* consider moving these long-running tasks (which only run a few times a day) to the Heroku scheduler instead (bug 1176492)
Summary: Refactor cycle_non_job_data() to prevent 'MySQL server has gone away' → cycle_data is failing on stage/production
Depends on: 1176492
Depends on: 1387543
I'm running a one-off cycle_data to try and clear some of the backlog (and to see how bad the RAM usage really is):
`thp run:detached --size=performance-l -- ./manage.py cycle_data --debug --sleep-time 0  --chunk-size 2000`

Logs:
https://papertrailapp.com/systems/treeherder-prod/events?q=program%3A%2Frun.6241
Attached file cycle-data log
Attachment #8893997 - Attachment description: cycle-data.log → cycle-data log
Attachment #8893997 - Attachment mime type: text/x-log → text/plain
Component: Treeherder → Treeherder: Infrastructure
Assignee: emorley → nobody
Blocks: 1078392
See Also: → 1419483
So one of the reasons cycle_data is timing out is the performance_datum cycling, which currently results in queries like:

SELECT `performance_datum`.`id` FROM `performance_datum` WHERE (`performance_datum`.`repository_id` = 4 AND `performance_datum`.`push_timestamp` < '2017-07-25 21:27:32.388466') LIMIT 500

There are a few issues with this:
1) There should be no need for a select before the delete (the select exists only because the queryset is cast to a list to figure out whether there are any rows left to delete, but .delete() returns the number of rows deleted, which should be used instead - see the sketch after this list)
2) It's filtering on `repository`, which isn't really necessary
3) `push_timestamp` isn't indexed, though without the repository clause I suspect the table scan order would mean the rows to be expired are seen first, so that's not an issue on its own
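
As a rough illustration of point 1 - a minimal sketch only, assuming the model lives at treeherder.perf.models.PerformanceDatum with a push_timestamp field, and with placeholder chunk size and cutoff (this is not the actual cycle_data code) - the loop can be driven by the count that QuerySet.delete() returns, rather than by materialising a list of ids just to check whether anything is left. The per-chunk id select stays, since the ORM can't delete a sliced queryset:

# Hedged sketch only - not the actual Treeherder implementation.
from datetime import timedelta

from django.utils import timezone

from treeherder.perf.models import PerformanceDatum

CHUNK_SIZE = 500  # illustrative
cutoff = timezone.now() - timedelta(days=365)  # illustrative retention period

while True:
    # Pick the next chunk of expirable rows by primary key.
    chunk_ids = list(
        PerformanceDatum.objects
        .filter(push_timestamp__lt=cutoff)
        .values_list('id', flat=True)[:CHUNK_SIZE]
    )
    # delete() returns (rows_deleted, per_model_counts); use that count to
    # decide when to stop, instead of a separate "any rows left?" query.
    deleted, _ = PerformanceDatum.objects.filter(id__in=chunk_ids).delete()
    if deleted == 0:
        break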
Assignee: nobody → emorley
(In reply to Ed Morley [:emorley] from comment #10)
> 2) It's filtering on `repository`, which isn't really necessary

Oh, except some repositories are marked as being excluded from data expiration. Grumble.
So I guess we'll have to add yet another index to the performance_datum table (whose indexes are already 2.5 times larger than the data), since the partial data expiration (ie excluding some repositories) means that there are thousands of rows at the start of the table that get table scanned over, every time the next chunk is deleted. If those rows did not exist, the number of rows scanned would be virtually identical to the number of rows deleted.
Attachment #8934946 - Flags: review?(wlachance)
(In reply to Ed Morley [:emorley] from comment #11)
> (In reply to Ed Morley [:emorley] from comment #10)
> > 2) It's filtering on `repository`, which isn't really necessary
> 
> Oh, except some repositories are marked as being excluded from data
> expiration. Grumble.

It is possible that I was being overly cautious about adding this option (not to expire performance data on some repositories), and we can just take it out. Old performance test results haven't proved to be very useful in my experience (since we typically lack the context to explain them). I would talk to :jmaher.
Attachment #8934946 - Flags: review?(wlachance) → review+
Depends on: 1423607
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/82f2b2fe9dab64831f40e7888de3b553dc315f3f
Bug 1346567 - Temporarily disable perfherder data cycling (#3022)

Since it's timing out due to missing indexes, preventing the rest of
the jobs data cycling from taking place (resulting in disk space
usage alerts).
Assignee: emorley → nobody
In bug 1484642 it was found that the cycle_data task was interfering with bug fetching, since it exceeded memory limits and caused the dyno to be restarted. Bug 1176492 has moved cycle_data to the Heroku scheduler addon (ie it now runs in a periodically spun up separate dyno, rather than alongside other tasks in the generic "default" worker) - which improves isolation between tasks, gives greater flexibility over dyno sizes used for each task, and saves money, since there's no need to run a larger dyno type 24/7 just so the "default" worker can handle cycle_data once a day.

The cycle_data task in the scheduler addon is currently using a Performance-M dyno (2.5GB RAM). Once the memory issues are resolved, it can be scaled back to save dyno credits.
In Papertrail I'm seeing:

Oct 30 05:16:04 treeherder-prod app/scheduler.2100:   File "/app/treeherder/model/management/commands/cycle_data.py", line 84, in cycle_non_job_data 
Oct 30 05:16:04 treeherder-prod app/scheduler.2100:     Machine.objects.exclude(id__in=used_machine_ids).delete()
...
Oct 30 05:16:04 treeherder-prod app/scheduler.2100: django.db.utils.OperationalError: (2006, 'MySQL server has gone away') 
(https://papertrailapp.com/systems/treeherder-prod/events?q=program%3Ascheduler)

I've run that delete manually for now:

`DELETE FROM `machine` WHERE NOT (`machine`.`id` IN (SELECT U0.`machine_id` FROM `job` U0));`

9,233,915 rows were deleted in 296s, leaving 4,270,102 rows in the table (68% were stale).

And then defragged the table:

`OPTIMIZE TABLE treeherder.machine;`
(took 90s)

I believe the reason there were so many stale rows is that the later stages of cycle_data had been failing for so long (until the memory issues were worked around in bug 1176492) that a backlog of old machines had built up, making it even harder for the query to finish in time.

It's possible the Django ORM delete will complete successfully now that the table is back to a more typical size. If not, we'll have to rework the query to make it more efficient.
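
If the query does need reworking, one possible shape - a sketch only, assuming that chunking via a NOT EXISTS subquery plus LIMIT is acceptable here; this is not the existing Treeherder code - would be to delete stale machines in bounded batches from a raw cursor, using the affected-row count to know when to stop:

# Hedged sketch of one possible rework - chunked raw-SQL delete of machines
# that no longer have any jobs; not the actual cycle_non_job_data() code.
from django.db import connection

CHUNK_SIZE = 5000  # illustrative

while True:
    with connection.cursor() as cursor:
        cursor.execute(
            "DELETE FROM machine "
            "WHERE NOT EXISTS (SELECT 1 FROM job WHERE job.machine_id = machine.id) "
            "LIMIT %s",
            [CHUNK_SIZE],
        )
        if cursor.rowcount == 0:
            break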
Assignee: nobody → emorley

This is better than it was, but still needs some more TLC.

NB: Perfherder data cycling is still disabled due to it being slow - something which should be addressed in the coming quarters, before DB disk space becomes an issue.

Assignee: emorley → nobody

Some querying approaches for the performance_datum table are impossible for me, as it now has more than 688 million rows.
I tried various queries and noticed that this table grows by more than 600 rows/minute, whereas the machine table grows by around 100 rows/minute. It may grow even faster than this.

The current implementation of the cycle_data.py management command needs a different cleaning algorithm - one that's capable of working through a table that's 500 million rows big.

It also needs to run separately. ATM, cycling performance data happens as a side effect of cycling Treeherder data.
This is not good, as they have different specs & requirements:

  • machine table grows slower than performance_datum
  • Treeherder-data expires after 4 months, while Perfherder-data should expire after 1 year
Assignee: nobody → igoldan

Kyle, I could use your help on this. The perf cycling algorithm resides here.
It won't meet our needs, even after we clean data that's too old.

Let's assume that if we're able to delete N rows every day (as this will be a daily-run script), we'll keep the database at a decent size.

I currently see 2 different approaches to this:

  1. Have a SQL script that randomly picks N+something rows. From these we'll filter only what we need to delete and do the delete.
    The random picking allows us to eschew the query that's currently timing out.
    The drawbacks are that we could waste some days doing nothing, as we could randomly pick N rows of fresh (not deletable) data. Also, there's a big chance we'll delete fewer than N rows every day.

  2. Query by MIN(performance_datum.id). It's blazing fast, and this id is auto-incrementing, meaning lower ids tend to correspond to old push_timestamps, which is what we use to decide whether to delete or not.
    Basically, we would need to find the smallest id that meets our needs - that is, a row that is too old to keep.
    Then filter all rows from that id up to id+N.
    This approach is more deterministic and seems the better one.

What do you think?

Flags: needinfo?(klahnakoski)

I would question how long we keep the data in ActiveData. If we keep it there for a full year, could we not trim down our Perfherder instance to 6 months?

I would also question- do we need all the data from old frameworks? That should significantly reduce the datasets- anything with no fresh data in 4 months should be removed.

A final question- how long do we keep perf data from try runs? I would like to propose something <1 year, maybe even 6 weeks.

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #21)

I would also question- do we need all the data from old frameworks? That should significantly reduce the datasets- anything with no fresh data in 4 months should be removed.

That should be identified and expired ASAP. I'm also thinking we could keep perf data from autoland, mozilla-inbound, mozilla-beta & mozilla-central for 6 months or 1 year. Perf data from other repos could be expired faster.

What about AWFY, Autophone, platform_microbenchmarks, etc.? (see bug 1516630)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #21)

A final question- how long do we keep perf data from try runs? I would like to propose something <1 year, maybe even 6 weeks.

The current algorithm would expire it in a year. 6 weeks sounds good to me also :)

Depends on: 1516630

Ionut: Tables in a database have a "clustered index", which is the scan order of a table; I will call it "the natural order" or just "the order" of a table. With large tables you must use this order to get good query response times. It is important that the natural order is reasonably aligned with your expiration strategy. Lucky for us, the default clustered index happens to be the indexed ID column, which is the same as the insert order, which aligns well with the age of the records.

So your technique of using MIN(performance_datum.id) is leveraging the clustered index (the id).

Assuming the id is your clustered index (which appears to be the case, since you see high performance when using ids), you can go further by using ranges of ids to speed up your queries:

  1. Using the time range you want to delete, find the range of ids you plan to scan (SELECT MAX(id), MIN(id) FROM performance_datum WHERE timestamp<SOMETHING). This is the hard part; it may induce a table scan, and you do not want to do that. This id range need not be perfect; it can include way more than you plan to delete. As long as this range is much smaller than your table size, your query will run much faster. Rough estimates for the id range are fine; you may consider collecting id ranges grouped by month just one time (and remembering them elsewhere, because that scan is expensive). You may consider starting at MIN(id) and working on 10K chunks of records up from there, stopping when a chunk does not have many deleted records. You may use a binary search for an id that has a timestamp that is close enough.

  2. With a known range of ids, you must include them in the query (eg id BETWEEN minID AND maxID); this will ensure the table scan is limited to that range of contiguous rows.

chunk = 10000
start = MIN(id)
while true:
    existing  = "SELECT count(1) FROM performance_datum WHERE id BETWEEN {start} AND {start + chunk}"
    "DELETE FROM performance_datum WHERE id BETWEEN {start} AND {start + chunk} AND <OTHER CONSTRAINTS>"
    remaining = "SELECT count(1) FROM performance_datum WHERE id BETWEEN {start} AND {start + chunk}"
    if existing - remaining < 100:
        break
    start += chunk

(Be sure each statement, or at most the body of one loop iteration, runs in its own transaction - not the whole loop.)
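
For the binary search mentioned in step 1, a rough sketch using the Django ORM - assuming the PerformanceDatum model and that ids are roughly ordered by push_timestamp, as described above; none of this is existing code - where each probe is a single primary-key index lookup, so no table scan is needed:

# Hedged sketch: binary-search the id space for the smallest id whose
# push_timestamp is at or past the expiry cutoff, so later DELETEs can be
# bounded with `id BETWEEN ...`.
from django.db.models import Max, Min

from treeherder.perf.models import PerformanceDatum


def find_cutoff_id(cutoff):
    bounds = PerformanceDatum.objects.aggregate(lo=Min('id'), hi=Max('id'))
    lo, hi = bounds['lo'], bounds['hi']
    if lo is None:
        return None  # empty table
    while lo < hi:
        mid = (lo + hi) // 2
        # push_timestamp of the first row at or after `mid`, via the PK index.
        ts = (PerformanceDatum.objects
              .filter(id__gte=mid)
              .order_by('id')
              .values_list('push_timestamp', flat=True)
              .first())
        if ts is None or ts >= cutoff:
            hi = mid       # rows from `mid` onwards are new enough to keep
        else:
            lo = mid + 1   # rows around `mid` are still older than the cutoff
    return lo              # ids below this are (approximately) expirable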

Flags: needinfo?(klahnakoski)

jmaher,

The performance datum from Treeherder is not in ActiveData; the jobs are, and the raw PERFHERDER records are, but not performance_datum. It could be included, since it is only a billion records.

By copying the performance data, we can offload some of the query load. With fewer queries, TH can reduce its table indexes, which means smaller data sizes and better performance. ActiveData may help with the cycle_data problem: ActiveData stores TH record ids, so we can quickly find the id ranges required to make the delete queries run faster: https://activedata.allizom.org/tools/query.html#query_id=Qdm77nBW

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #22)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #21)

I would also question- do we need all the data from old frameworks? That should significantly reduce the datasets- anything with no fresh data in 4 months should be removed.

That should be identified and expired ASAP. I'm also thinking we could keep perf data from autoland, mozilla-inbound, mozilla-beta & mozilla-central for 6 months or 1 year. Perf data from other repos could be expired faster.

A nice simplification here might just be to expire performance data on the same schedule as other treeherder data (which, last I checked, was something like 4-6 months, irrespective of repository). In my experience, the more you can get rid of domain-specific complexity in a system like this, the better off you are. This way, you could also add an index on the push_timestamp field that would make the query return very quickly (without worrying about the relationship between id and date).
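
For reference, a hedged sketch of what adding that push_timestamp index could look like as a Django migration - assuming the model is treeherder.perf.models.PerformanceDatum and 2.x-style AddIndex; the dependency and index names below are placeholders, not an actual Treeherder migration:

# Hedged sketch only - placeholder names throughout.
from django.db import migrations, models


class Migration(migrations.Migration):

    dependencies = [
        ('perf', '0001_initial'),  # placeholder: whatever the latest perf migration is
    ]

    operations = [
        migrations.AddIndex(
            model_name='performancedatum',
            index=models.Index(fields=['push_timestamp'],
                               name='perf_datum_push_timestamp_idx'),
        ),
    ]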

I like the idea of using activedata as an historical record of performance test results, if people really need them-- I think its existing storage of PERFHERDER records should be fine -- that's basically just a superset of what is stored in performance_datum.

I also liked Ionut's suggestion of putting this into a separate task. If you simplified things as above, did that, and deleted in chunks, I am pretty sure that would resolve this issue in a satisfactory way.

Depends on: 1539479
Priority: P1 → P2
Depends on: 1541387
See Also: → 1343328
Depends on: 1567874
Priority: P2 → P1
Priority: P1 → P2

I've merged the latest PR to master.
:camd, could you set up a new Heroku Scheduler job on treeherder-staging, similar to the one on treeherder-prototype? It should run daily.

Its Job should be precisely `newrelic-admin run-program ./manage.py cycle_data --days 365 --chunk-size 1000 from:perfherder`. Thanks!

Flags: needinfo?(cdawson)

OK, I created that on stage. :)

Flags: needinfo?(cdawson)

I've modified the scheduler for production.

Status: REOPENED → RESOLVED
Closed: 7 years ago5 years ago
Resolution: --- → FIXED

The cycle_data task has actually been failing for Job data since December 3, 2018. We are hitting a utf-8 exception on some of the FailureLine records:

builtins:UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 18: invalid continuation byte

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Re-closing this in favor of my work happening in: Bug 1581227

Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED
See Also: → 1581227
No longer depends on: 1516630