cycle_data is failing on stage/production
Categories
(Tree Management :: Treeherder: Infrastructure, defect, P2)
Tracking
(Not tracked)
People
(Reporter: emorley, Assigned: igoldan)
References
(Blocks 1 open bug)
Details
Attachments
(7 files)
47 bytes, text/x-github-pull-request | wlach: review+
76.23 KB, text/plain
47 bytes, text/x-github-pull-request | wlach: review+
47 bytes, text/x-github-pull-request
47 bytes, text/x-github-pull-request
47 bytes, text/x-github-pull-request
47 bytes, text/x-github-pull-request
Breaking out from bug 1284432 comment 35. `cycle_non_job_data()` is failing during:

    used_machine_ids = set(Job.objects.values_list(
        'machine_id', flat=True).distinct())
    Machine.objects.exclude(id__in=used_machine_ids).delete()

The SQL being used for this is roughly:

used_machine_ids ->
    SELECT DISTINCT `job`.`machine_id` FROM `job`

Then:

    SELECT `machine`.`id`, `machine`.`name` FROM `machine`
    WHERE NOT (`machine`.`id` IN (1, 2, 3, 4, 5, ...))

This doesn't work when there are 2.6 million distinct machine ids used in the jobs table (ie it tries to pass a 2.6 million item tuple). Instead of flattening to a list, we should pass the queryset directly, eg something like:

    Machine.objects.exclude(id__in=Job.objects.values('machine_id')).delete()
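A minimal sketch of why the queryset form fixes this, using an in-memory sqlite3 database standing in for MySQL (table and column names follow the bug; the data here is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE machine (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE job (id INTEGER PRIMARY KEY, machine_id INTEGER);
""")
conn.executemany("INSERT INTO machine (id, name) VALUES (?, ?)",
                 [(i, "machine-%d" % i) for i in range(1, 11)])
# Only machines 1-3 are still referenced by jobs.
conn.executemany("INSERT INTO job (machine_id) VALUES (?)",
                 [(1,), (2,), (3,), (2,)])

# The broken approach materialised every distinct machine_id client-side and
# interpolated it into a huge `NOT IN (1, 2, 3, ...)` tuple. Keeping the
# select as a subquery means the ids never leave the database server:
cur = conn.execute(
    "DELETE FROM machine "
    "WHERE NOT (machine.id IN (SELECT machine_id FROM job))")
conn.commit()
print(cur.rowcount)  # 7 unreferenced machines deleted
remaining = [row[0] for row in conn.execute("SELECT id FROM machine ORDER BY id")]
print(remaining)  # [1, 2, 3]
```

With millions of distinct machine ids, the subquery form keeps the statement a constant size, whereas the flattened tuple grows with the jobs table.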
Reporter | ||
Comment 1•7 years ago
$ ./manage.py shell
...
>>> from treeherder.model.models import Job, Machine
>>> print Machine.objects.exclude(id__in=Job.objects.values('machine_id')).query
SELECT `machine`.`id`, `machine`.`name` FROM `machine` WHERE NOT (`machine`.`id` IN (SELECT U0.`machine_id` FROM `job` U0))

> EXPLAIN SELECT `machine`.`id`, `machine`.`name` FROM `machine` WHERE NOT (`machine`.`id` IN (SELECT U0.`machine_id` FROM `job` U0))
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: machine
         type: index
possible_keys:
          key: machine_name_4a8b45973a00cd64_uniq
      key_len: 302
          ref:
         rows: 5340061
        Extra: Using where; Using index
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: U0
         type: index_subquery
possible_keys: job_a9374927
          key: job_a9374927
      key_len: 4
          ref: func
         rows: 3
        Extra: Using index
2 rows in set

-> Looks good to me.
Comment 2•7 years ago
Reporter | ||
Updated•7 years ago
Reporter | ||
Comment 3•7 years ago
I don't think we need to chunk this query. Even in the worst case scenario (hasn't run for a month, is running on prototype's slower m4.xlarge, bug 1346565 not yet fixed etc), it still was manageable on prototype:

    19:02:03 DELETE FROM `machine` WHERE NOT (`machine`.`id` IN (SELECT U0.`machine_id` FROM `job` U0))
    1687275 row(s) affected  69.625 sec
Comment 4•7 years ago
Comment on attachment 8846450 [details] [review]
[treeherder] mozilla:cycle-non-job-data-timeouts > mozilla:master

This is great, thank you!
Comment 5•7 years ago
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/b12defb0cba0c945ed65c86342474f5b6c684cc4

Bug 1346567 - Rewrite inefficient queries in cycle_non_job_data()

Previously the select was evaluated independently of the delete, causing timeouts when attempting to return 2.6 million machine ids from the jobs table. Now the select queryset isn't evaluated on its own, and is instead only used to generate the subquery in eg:

    SELECT `machine`.`id`, `machine`.`name` FROM `machine` WHERE NOT (`machine`.`id` IN (SELECT U0.`machine_id` FROM `job` U0));
Reporter | ||
Updated•7 years ago
Reporter | ||
Comment 6•7 years ago
Still getting timeouts: https://papertrailapp.com/systems/treeherder-prod/events?centered_on_id=778402766182834248&q=program%3A%2Frun.8132
Reporter | ||
Comment 7•7 years ago
cycle_data is now failing due to running out of RAM (Error R15), eg:
https://papertrailapp.com/systems/treeherder-prod/events?q=program%3Aworker_default.2%20-sample&focus=829935121451872309

    Aug 04 01:15:34 treeherder-prod app/worker_default.2: [2017-08-04 00:15:34,572: INFO/Worker-6] Reporting to: https://rpm.newrelic.com/accounts/677903/applications/14179757
    Aug 04 02:11:59 treeherder-prod heroku/worker_default.2: Process running mem=1155M(112.9%)
    Aug 04 02:11:59 treeherder-prod heroku/worker_default.2: Error R14 (Memory quota exceeded)
    Aug 04 02:12:22 treeherder-prod heroku/worker_default.2: Process running mem=1918M(187.3%)
    Aug 04 02:12:22 treeherder-prod heroku/worker_default.2: Error R14 (Memory quota exceeded)
    Aug 04 02:12:43 treeherder-prod heroku/worker_default.2: Process running mem=2270M(221.7%)
    Aug 04 02:12:43 treeherder-prod heroku/worker_default.2: Error R15 (Memory quota vastly exceeded)
    Aug 04 02:12:43 treeherder-prod heroku/worker_default.2: Stopping process with SIGKILL
    Aug 04 02:12:44 treeherder-prod heroku/worker_default.2: State changed from up to crashed
    Aug 04 02:12:44 treeherder-prod heroku/worker_default.2: Process exited with status 137

In addition to fixing the RAM issue, we should also:
* improve the logging (since there's nothing in the log that actually says this task started, I'm having to read between the lines, after noticing cycle_data isn't listed in New Relic non-web transactions)
* consider moving these long-running, but only run a few times a day, tasks to the Heroku scheduler instead (bug 1176492)
Reporter | ||
Comment 8•7 years ago
I'm running a one-off cycle_data to try and clear some of the backlog (/see how bad the RAM usage really is):

    thp run:detached --size=performance-l -- ./manage.py cycle_data --debug --sleep-time 0 --chunk-size 2000

Logs: https://papertrailapp.com/systems/treeherder-prod/events?q=program%3A%2Frun.6241
Reporter | ||
Comment 9•7 years ago
Reporter | ||
Updated•7 years ago
Reporter | ||
Updated•7 years ago
Reporter | ||
Updated•7 years ago
Reporter | ||
Comment 10•7 years ago
So one of the reasons cycle_data is timing out is the performance_datum cycling, which currently results in queries like:

    SELECT `performance_datum`.`id` FROM `performance_datum` WHERE (`performance_datum`.`repository_id` = 4 AND `performance_datum`.`push_timestamp` < '2017-07-25 21:27:32.388466') LIMIT 500

There are a few issues with this:
1) There should be no need for a select before the delete (it's due to casting to a list to figure out whether there are any rows left to delete, but .delete() returns the number of rows deleted, which should be used instead)
2) It's filtering on `repository`, which isn't really necessary
3) `push_timestamp` isn't indexed; albeit without the repository clause I suspect table scan order would mean the rows to be expired are seen first, so not an issue
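A minimal sketch of point 1, using sqlite3 standing in for MySQL (the data and chunk size are made up): the chunked expiry loop is driven off the delete's own row count, which is what Django's `.delete()` returns, so no separate select is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE performance_datum "
             "(id INTEGER PRIMARY KEY, push_timestamp INTEGER)")
conn.executemany("INSERT INTO performance_datum (push_timestamp) VALUES (?)",
                 [(ts,) for ts in range(1, 1301)])

CUTOFF = 1000   # expire rows with push_timestamp < 1000
CHUNK = 500
total_deleted = 0
while True:
    # sqlite emulates MySQL's `DELETE ... LIMIT n` via a subquery on the
    # primary key; the technique is the same either way.
    cur = conn.execute(
        "DELETE FROM performance_datum WHERE id IN ("
        "  SELECT id FROM performance_datum WHERE push_timestamp < ? LIMIT ?)",
        (CUTOFF, CHUNK))
    conn.commit()
    total_deleted += cur.rowcount
    if cur.rowcount < CHUNK:   # last (partial) chunk: nothing left to expire
        break

print(total_deleted)  # 999
```

Each iteration costs one statement instead of two, and the loop terminates from the deleted-row count alone.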
Reporter | ||
Comment 11•7 years ago
(In reply to Ed Morley [:emorley] from comment #10)
> 2) It's filtering on `repository`, which isn't really necessary

Oh, except some repositories are marked as being excluded from data expiration. Grumble.
Reporter | ||
Comment 12•7 years ago
So I guess we'll have to add yet another index to the performance_datum table (whose indexes are already 2.5 times larger than the data itself), since the partial data expiration (ie excluding some repositories) means there are thousands of rows at the start of the table that get scanned over every time the next chunk is deleted. If those rows did not exist, the number of rows scanned would be virtually identical to the number of rows deleted.
Comment 13•7 years ago
Reporter | ||
Updated•7 years ago
Comment 14•7 years ago
(In reply to Ed Morley [:emorley] from comment #11)
> (In reply to Ed Morley [:emorley] from comment #10)
> > 2) It's filtering on `repository`, which isn't really necessary
>
> Oh, except some repositories are marked as being excluded from data expiration. Grumble.

It is possible that I was being overly cautious about adding this option (not to expire performance data on some repositories), and we can just take it out. Old performance test results haven't proved to be very useful in my experience (since we typically lack the context to explain them). I would talk to :jmaher.
Updated•7 years ago
Comment 15•7 years ago
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/82f2b2fe9dab64831f40e7888de3b553dc315f3f

Bug 1346567 - Temporarily disable perfherder data cycling (#3022)

Since it's timing out due to missing indexes, preventing the rest of the jobs data cycling from taking place (resulting in disk space usage alerts).
Reporter | ||
Updated•6 years ago
Reporter | ||
Comment 16•6 years ago
In bug 1484642 it was found that the cycle_data task was interfering with bugs fetching, since it exceeded memory limits and caused the dyno to be restarted.

Bug 1176492 has moved cycle_data to the Heroku scheduler addon (ie it now runs in a periodically spun up separate dyno, rather than alongside other tasks in the generic "default" worker), which improves isolation between tasks, gives greater flexibility over the dyno size used for each task, and saves money, since there's no need for a larger dyno type 24/7 just so the "default" worker can handle cycle_data once a day.

The cycle_data task in the scheduler addon is currently using a Performance-M dyno (2.5GB RAM). Once the memory issues are resolved, it can be scaled back to save dyno credits.
Reporter | ||
Comment 17•6 years ago
In Papertrail I'm seeing:

    Oct 30 05:16:04 treeherder-prod app/scheduler.2100: File "/app/treeherder/model/management/commands/cycle_data.py", line 84, in cycle_non_job_data
    Oct 30 05:16:04 treeherder-prod app/scheduler.2100: Machine.objects.exclude(id__in=used_machine_ids).delete()
    ...
    Oct 30 05:16:04 treeherder-prod app/scheduler.2100: django.db.utils.OperationalError: (2006, 'MySQL server has gone away')

(https://papertrailapp.com/systems/treeherder-prod/events?q=program%3Ascheduler)

I've run that delete manually for now:

    DELETE FROM `machine` WHERE NOT (`machine`.`id` IN (SELECT U0.`machine_id` FROM `job` U0));

9,233,915 rows were deleted in 296s, leaving 4,270,102 rows in the table (68% were stale). I then defragged the table with `OPTIMIZE TABLE treeherder.machine;` (took 90s).

I believe the reason there were so many stale rows is that the later stages of cycle_data had been failing for so long (until the memory issues were worked around in bug 1176492) that a backlog of old machines had built up, making it even harder for the query to finish in time.

It's possible the Django ORM delete will complete successfully now that the table is back to a more typical size. If not, we'll have to rework the query to make it more efficient.
Reporter | ||
Updated•6 years ago
Reporter | ||
Comment 18•5 years ago
This is better than it was, but still needs some more TLC.
NB: Perfherder data cycling is still disabled due to it being slow - something which should be addressed in the coming quarters, before DB disk space becomes an issue.
Assignee | ||
Comment 19•5 years ago
Some querying approaches for the `performance_datum` table are impossible for me, as the table now has more than 688 million rows.
I tried various queries and noticed that this table increases by more than 600 rows/minute, whereas the `machine` table grows by around 100 rows/minute. It may increase even faster than this.
The current implementation of the cycle_data.py management command needs a different cleaning algorithm, one that's capable of going through a table that's 500 million rows big.
It also needs to run separately. ATM, cycling performance data happens as a side effect of cycling Treeherder data.
This is not good, as they have different specs & requirements:
- the `machine` table grows slower than `performance_datum`
- Treeherder data expires after 4 months, while Perfherder data should expire after 1 year
Assignee | ||
Updated•5 years ago
Assignee | ||
Comment 20•5 years ago
Kyle, I could use your help on this. The perf cycling algorithm resides here.
It won't meet our needs, even after we clean the data that's too old.
Let's assume that if we're able to delete N rows every day (as this will be a daily run script), we'll keep the database at a decent size.
I currently see 2 different approaches to this:

1. Have a SQL script that randomly picks N+something rows. From these rows we'll filter only what we need to delete, and do the delete. The random picking allows us to eschew the query that currently times out. Drawbacks are that we could waste some days doing nothing, as we could randomly pick N rows of fresh (not deletable) data. Also, there's a big chance we'll delete fewer than N rows every day.

2. Query by MIN(`performance_datum`.`id`). It's blazing fast, and this `id` is autoincrementing, meaning lower `id`s tend to correspond to old `push_timestamp`s, by which we decide whether to delete or not. Basically, we would need to find the smallest `id` that meets our needs, that is, a row that is too old to keep. Then filter all rows from that `id` up to `id`+N.

This approach is more deterministic and seems the better one.
What do you think?
Comment 21•5 years ago
I would question how long we keep the data in ActiveData. If we keep it there for a full year, could we not trim down our Perfherder instance to 6 months?
I would also question: do we need all the data from old frameworks? Dropping that should significantly reduce the datasets; anything with no fresh data in 4 months should be removed.
A final question: how long do we keep perf data from try runs? I would like to propose something <1 year, maybe even 6 weeks.
Assignee | ||
Comment 22•5 years ago
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #21)
> I would also question- do we need all the data from old frameworks? That should significantly reduce the datasets- anything with no fresh data in 4 months should be removed.

That should be identified and expired ASAP. I'm also thinking we could keep perf data from autoland, mozilla-inbound, mozilla-beta & mozilla-central for 6 months or 1 year. Perf data from other repos could be expired faster.
Comment 23•5 years ago
What about AWFY, Autophone, platform_microbenchmarks, etc.? (see bug 1516630)
Assignee | ||
Comment 24•5 years ago
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #21)
> A final question- how long do we keep perf data from try runs? I would like to propose something <1 year, maybe even 6 weeks.

The current algorithm would expire it in a year. 6 weeks sounds good to me also :)
Comment 25•5 years ago
Ionut: Tables in a database have a "clustered index", which is the scan-order of the table; I will call it "the natural order" or just "the order" of a table. With large tables you must use this order to get good query response times. It is important that the natural order is reasonably aligned with your expiration strategy. Lucky for us, the default clustered index happens to be the indexed ID column, which is the same as the insert order, which aligns well with the age of the records.

So your technique of using MIN(`performance_datum`.`id`) is leveraging the clustered index (the `id`).

Assuming the `id` is your clustered index, which appears to be the case since you see high performance when using ids, you can go further by using ranges of `id` to speed up your queries:

1. Using the time range you want to delete, find the range of ids you plan to scan (SELECT MAX(id), MIN(id) FROM performance_datum WHERE timestamp<SOMETHING). This is the hard part; it may induce a table scan, and you do not want to do that. This `id` range need not be perfect; it can include way more than you plan to delete. As long as this range is much less than your table size, your query will run much faster. Rough estimates for the id range are fine; you may consider collecting id ranges grouped by month just one time (and remembering them elsewhere, because that scan is expensive). You may consider starting at MIN(id) and working on 10K chunks of records up from there, stopping when a chunk no longer contains many deleted records. You may use a binary search for an `id` that has a timestamp that is close enough.

2. With a known range of `id`s, you must include them in the query (eg `id BETWEEN minID AND maxID`); this will ensure the table scan is limited to that range of contiguous rows.
start = MIN(id)
while true:
    existing = "SELECT count(1) FROM performance_datum WHERE id BETWEEN {start} AND {start+10000}"
    "DELETE FROM performance_datum WHERE id BETWEEN {start} AND {start+10000} AND <OTHER CONSTRAINTS>"
    remaining = "SELECT count(1) FROM performance_datum WHERE id BETWEEN {start} AND {start+10000}"
    if existing - remaining < 100:
        break
    start += 10000
(Be sure each line, or the body of the loop, is run under one transaction, not the whole loop.)
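The loop above can be exercised end to end against an in-memory sqlite3 table (standing in for MySQL); the chunk size, cutoff, and stop-threshold here are illustration values, and the delete's own row count replaces the before/after selects:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE performance_datum "
             "(id INTEGER PRIMARY KEY, push_timestamp INTEGER)")
# ids 1..5000; rows with push_timestamp < 3000 are old enough to expire
conn.executemany("INSERT INTO performance_datum (id, push_timestamp) VALUES (?, ?)",
                 [(i, i) for i in range(1, 5001)])

CHUNK = 1000
CUTOFF = 3000
start = conn.execute("SELECT MIN(id) FROM performance_datum").fetchone()[0]
deleted = 0
while True:
    # Constrain both the scan and the delete to one contiguous id range.
    cur = conn.execute(
        "DELETE FROM performance_datum "
        "WHERE id BETWEEN ? AND ? AND push_timestamp < ?",
        (start, start + CHUNK, CUTOFF))
    conn.commit()   # one transaction per chunk, not one for the whole loop
    deleted += cur.rowcount
    # Stop once a chunk contains (almost) nothing old enough to delete.
    if cur.rowcount < 100:
        break
    start += CHUNK

print(deleted)  # 2999
```

Because every statement carries `id BETWEEN ...`, each iteration touches only one contiguous slice of the clustered index, which is the whole point of Kyle's suggestion.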
Comment 26•5 years ago
jmaher,
The performance datum from Treeherder is not in ActiveData; the jobs are, and the raw PERFHERDER records are, but not `performance_datum`. It could be included, since it is only a billion records.
By copying the performance data, we can offload some of the query load. With fewer queries, TH can decrease the table indexes, which means smaller data sizes and better performance. ActiveData may help with the cycle_data problem: ActiveData stores TH record ids, so we can quickly find the `id` ranges required to make the delete queries run faster: https://activedata.allizom.org/tools/query.html#query_id=Qdm77nBW
Comment 27•5 years ago
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #22)
> (In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #21)
> > I would also question- do we need all the data from old frameworks? That should significantly reduce the datasets- anything with no fresh data in 4 months should be removed.
>
> That should be identified and expired ASAP. I'm also thinking we could keep perf data from autoland, mozilla-inbound, mozilla-beta & mozilla-central for 6 months or 1 year. Perf data from other repos could be expired faster.

A nice simplification here might just be to expire performance data on the same schedule as other Treeherder data (which, last I checked, was something like 4-6 months, irrespective of repository). In my experience, the more you can get rid of domain-specific complexity in a system like this, the better off you are. This way, you could also add an index on the push_timestamp field that would make the query return very quickly (without worrying about the relationship between id and date).

I like the idea of using ActiveData as an historical record of performance test results, if people really need them. I think its existing storage of PERFHERDER records should be fine; that's basically just a superset of what is stored in performance_datum.

I also liked Ionut's suggestion of putting this into a separate task. If you simplified things as above, did that, and deleted in chunks, I am pretty sure that would resolve this issue in a satisfactory way.
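A minimal sketch of the simplification above, again with sqlite3 standing in for MySQL (schema and data are made up): with one retention policy for every repository and an index on push_timestamp, expiry becomes a plain indexed range delete, and the planner confirms it no longer needs a full table scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE performance_datum ("
             "id INTEGER PRIMARY KEY, repository_id INTEGER, push_timestamp INTEGER)")
conn.execute("CREATE INDEX idx_perf_push_timestamp "
             "ON performance_datum (push_timestamp)")
conn.executemany(
    "INSERT INTO performance_datum (repository_id, push_timestamp) VALUES (?, ?)",
    [(i % 5, ts) for i, ts in enumerate(range(1, 1001))])

CUTOFF = 400  # one cutoff for every repository: no per-repo special cases
plan = conn.execute(
    "EXPLAIN QUERY PLAN DELETE FROM performance_datum WHERE push_timestamp < ?",
    (CUTOFF,)).fetchall()
# The planner picks the push_timestamp index rather than scanning the table.
print(plan[0][-1])

cur = conn.execute("DELETE FROM performance_datum WHERE push_timestamp < ?",
                   (CUTOFF,))
conn.commit()
print(cur.rowcount)  # 399
```

The trade-off discussed earlier in the bug still applies: the index costs disk space on a very large table, but it removes the dependence on id/date correlation entirely.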
Comment 28•5 years ago
Assignee | ||
Updated•5 years ago
Comment 29•5 years ago
Comment 30•5 years ago
Assignee | ||
Updated•5 years ago
Assignee | ||
Updated•5 years ago
Assignee | ||
Comment 31•5 years ago
I've merged the latest PR to master.
:camd, could you set up a new Heroku Scheduler on treeherder-staging, similar to that on treeherder-prototype? It should run daily.
Its Job should be precisely `newrelic-admin run-program ./manage.py cycle_data --days 365 --chunk-size 1000 from:perfherder`. Thanks!
Comment 33•5 years ago
I've modified the scheduler for production.
Assignee | ||
Updated•5 years ago
Comment 34•5 years ago
The cycle_data has actually been failing for Job data since December 3, 2018. We are hitting a utf-8 exception on some of the FailureLine records:
builtins:UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 18: invalid continuation byte
Comment 35•5 years ago
Comment 36•5 years ago
Re-closing this in favor of my work happening in: Bug 1581227