cycle_data is failing on stage/production
Categories
(Tree Management :: Treeherder: Infrastructure, defect, P2)
Tracking
(Not tracked)
People
(Reporter: emorley, Assigned: igoldan)
References
(Blocks 1 open bug)
Details
Attachments
(7 files)
1. 47 bytes, text/x-github-pull-request (wlach: review+)
2. 76.23 KB, text/plain
3. 47 bytes, text/x-github-pull-request (wlach: review+)
4. 47 bytes, text/x-github-pull-request
5. 47 bytes, text/x-github-pull-request
6. 47 bytes, text/x-github-pull-request
7. 47 bytes, text/x-github-pull-request
Comment 18 • 6 years ago (Reporter)
This is better than it was, but it still needs some more TLC.
NB: Perfherder data cycling is still disabled because it is too slow; this should be addressed in the coming quarters, before DB disk space becomes an issue.
Comment 19 • 6 years ago (Assignee)
Some querying approaches for performance_datum are impossible for me, as the table now has more than 688 million rows.
I tried various queries and noticed that this table grows by more than 600 rows/minute (roughly 860,000 rows/day, or over 300 million rows/year), whereas the machine table grows by around 100 rows/minute. It may grow even faster than this.
The current implementation of the cycle_data.py management command needs a different cleaning algorithm, one that is capable of going through a table that is over 500 million rows big.
It also needs to run separately. At the moment, cycling performance data happens as a side effect of cycling Treeherder data.
This is not good, as they have different specs & requirements:
- the machine table grows more slowly than performance_datum
- Treeherder data expires after 4 months, while Perfherder data should expire after 1 year
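For illustration only, this is the kind of single-statement expiry that cannot finish at this scale (a sketch assuming Django's raw connection and a push_timestamp column on performance_datum, not the actual cycle_data code):

    # Hypothetical sketch -- NOT the actual cycle_data implementation.
    # An unbounded DELETE like this forces a scan over hundreds of
    # millions of rows and times out long before it finishes.
    from django.db import connection

    def naive_expire(days=365):
        with connection.cursor() as cursor:
            cursor.execute(
                "DELETE FROM performance_datum "
                "WHERE push_timestamp < NOW() - INTERVAL %s DAY",
                [days],
            )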
Comment 20 • 6 years ago (Assignee)
Kyle, I could use your help on this. The perf cycling algorithm resides here.
It won't meet our needs, even after we clean out data that is too old.
Let's assume that if we're able to delete N rows every day (as this will be a daily-run script), we'll keep the database at a decent size.
I currently see 2 different approaches to this:

1. Have a SQL script that randomly picks N+something rows. On these N rows we'll filter only what we need to delete and do the delete.
The random picking allows us to eschew the current timing-out query.
The drawbacks are that we could waste some days doing nothing, as we could randomly pick N rows of fresh data (not deletable). Also, there's a big chance we'll delete fewer than N rows every day.

2. Query by MIN(performance_datum.id). It's blazing fast, and this id is auto-incrementing, meaning lower ids tend to stick to old push_timestamps, by which we decide whether to delete or not.
Basically, we would need to find the smallest id that meets our needs, that is, a row that is too old to keep.
Then filter all rows from that id up to id+N.
This approach is more deterministic and seems the better one; a minimal sketch follows below.
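A minimal sketch of approach 2, assuming MySQL, Django's raw connection, and the table/column names discussed in this bug (the cutoff and chunk size are illustrative):

    # Sketch only; table/column names and parameters are assumptions.
    from django.db import connection

    def expire_oldest_chunk(days=365, chunk=1000):
        with connection.cursor() as cursor:
            # Ids follow insert order, so the smallest id is the oldest row.
            cursor.execute("SELECT MIN(id) FROM performance_datum")
            min_id = cursor.fetchone()[0]
            if min_id is None:
                return 0
            # Bounding the DELETE to [min_id, min_id + chunk] keeps the
            # scan on a small, contiguous slice of the clustered index.
            cursor.execute(
                "DELETE FROM performance_datum "
                "WHERE id BETWEEN %s AND %s "
                "AND push_timestamp < NOW() - INTERVAL %s DAY",
                [min_id, min_id + chunk, days],
            )
            return cursor.rowcount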
What do you think?
Comment 21 • 6 years ago
I would question how long we keep the data in ActiveData. If we keep it there for a full year, could we not trim down our Perfherder instance to 6 months?
I would also question: do we need all the data from old frameworks? That should significantly reduce the datasets; anything with no fresh data in 4 months should be removed.
A final question: how long do we keep perf data from try runs? I would like to propose something under 1 year, maybe even 6 weeks.
Comment 22 • 6 years ago (Assignee)
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #21)
> I would also question: do we need all the data from old frameworks? That should significantly reduce the datasets; anything with no fresh data in 4 months should be removed.

That should be identified and expired ASAP. I'm also thinking we could keep perf data from autoland, mozilla-inbound, mozilla-beta & mozilla-central for 6 months or 1 year. Perf data from other repos could be expired faster.
Comment 23 • 6 years ago
What about AWFY, Autophone, platform_microbenchmarks, etc.? (see bug 1516630)
Comment 24 • 6 years ago (Assignee)
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #21)
> A final question: how long do we keep perf data from try runs? I would like to propose something under 1 year, maybe even 6 weeks.

The current algorithm would expire it in a year. 6 weeks sounds good to me too :)
Comment 25 • 6 years ago
Ionut: Tables in a database have a "clustered index", which is the scan order of a table; I will call it "the natural order" or just "the order" of a table. With large tables you must use this order to get good query response times. It is important that the natural order is reasonably aligned with your expiration strategy. Lucky for us, the default clustered index happens to be the indexed id column, which is the same as the insert order, which aligns well with the age of the records.
So your technique of using MIN(performance_datum.id) is leveraging the clustered index (the id). The id does appear to be your clustered index, since you see high performance when using ids. You can go further by using ranges of id to speed up your queries:

1. Using the time range you want to delete, find the range of ids you plan to scan (SELECT MAX(id), MIN(id) FROM performance_datum WHERE timestamp<SOMETHING). This is the hard part: it may induce a table scan, and you do not want that. The id range need not be perfect; it can include way more than you plan to delete. As long as this range is much smaller than your table size, your query will run much faster. Rough estimates for the id range are fine; you may consider collecting id ranges grouped by month just one time (and remembering them elsewhere, because that scan is expensive). You may consider starting at MIN(id) and working on 10K chunks of records up from there, stopping when a chunk does not have many deleted records. You may use a binary search for an id that has a timestamp that is close enough (a sketch follows below the loop).

2. With a known range of ids, you must include them in the query (e.g. id BETWEEN minID AND maxID); this will ensure the table scan is limited to that range of contiguous rows.
start = MIN(id)  # begin at the oldest row
while true:
    existing = "SELECT count(1) FROM performance_datum WHERE id BETWEEN {start} AND {start+10000}"
    "DELETE FROM performance_datum WHERE id BETWEEN {start} AND {start+10000} AND <OTHER CONSTRAINTS>"
    remaining = "SELECT count(1) FROM performance_datum WHERE id BETWEEN {start} AND {start+10000}"
    # (existing - remaining) is how many rows this chunk actually expired
    if existing - remaining < 100:
        break
    start += 10000  # advance by the same chunk size the queries use
(Be sure each statement, or at most the body of the loop, runs in its own transaction; do not run the whole loop under one transaction.)
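As mentioned in step 1, a binary search can locate an id close to a timestamp cutoff without scanning the table. A minimal sketch, assuming Django's connection and the column names used above (the helper name is illustrative):

    # Hypothetical helper; relies on ids being roughly ordered by
    # push_timestamp, as discussed above.
    from django.db import connection

    def find_cutoff_id(cutoff):
        # Return (approximately) the smallest id at or after `cutoff`.
        with connection.cursor() as cursor:
            cursor.execute("SELECT MIN(id), MAX(id) FROM performance_datum")
            lo, hi = cursor.fetchone()
            while lo < hi:
                mid = (lo + hi) // 2
                # Probe the first row at or after mid: a point lookup on
                # the clustered index, never a table scan.
                cursor.execute(
                    "SELECT push_timestamp FROM performance_datum "
                    "WHERE id >= %s ORDER BY id LIMIT 1", [mid])
                (ts,) = cursor.fetchone()
                if ts < cutoff:
                    lo = mid + 1
                else:
                    hi = mid
            return lo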
Comment 26 • 6 years ago
jmaher,
The performance datum from Treeherder is not in ActiveData; the jobs are, and the raw PERFHERDER records are, but not performance_datum. It could be included, since it is only a billion records.
By copying the performance data, we can offload some of the query load. With fewer queries, TH can drop some of the table indexes, which means smaller data sizes and better performance. ActiveData may also help with the cycle_data problem: ActiveData stores the TH record ids, so we can quickly find the id ranges required to make the delete queries run faster: https://activedata.allizom.org/tools/query.html#query_id=Qdm77nBW
Comment 27 • 6 years ago
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #22)
> (In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #21)
> > I would also question: do we need all the data from old frameworks? That should significantly reduce the datasets; anything with no fresh data in 4 months should be removed.
> That should be identified and expired ASAP. I'm also thinking we could keep perf data from autoland, mozilla-inbound, mozilla-beta & mozilla-central for 6 months or 1 year. Perf data from other repos could be expired faster.
A nice simplification here might just be to expire performance data on the same schedule as other treeherder data (which, last I checked, was something like 4-6 months, irrespective of repository). In my experience, the more you can get rid of domain-specific complexity in a system like this, the better off you are. This way, you could also add an index on the push_timestamp field that would make the query return very quickly (without worrying about the relationship between id and date).
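A minimal sketch of that index suggestion as a Django migration (the app label and dependency name are illustrative assumptions, not Treeherder's actual ones):

    # Illustrative sketch; app label and dependency are assumptions.
    from django.db import migrations

    class Migration(migrations.Migration):
        dependencies = [("perf", "0001_initial")]
        operations = [
            migrations.RunSQL(
                sql="CREATE INDEX perf_datum_push_timestamp "
                    "ON performance_datum (push_timestamp);",
                reverse_sql="DROP INDEX perf_datum_push_timestamp "
                            "ON performance_datum;",
            ),
        ]

With such an index in place, the expiry delete could filter on push_timestamp directly instead of reasoning about id ranges.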
I like the idea of using ActiveData as a historical record of performance test results, if people really need them. I think its existing storage of PERFHERDER records should be fine; that's basically just a superset of what is stored in performance_datum.
I also liked Ionut's suggestion of putting this into a separate task. If you simplified things as above, did that, and deleted in chunks, I am pretty sure that would resolve this issue in a satisfactory way.
Comment 31 • 6 years ago (Assignee)
I've merged the latest PR to master.
:camd, could you set up a new Heroku Scheduler job on treeherder-staging, similar to the one on treeherder-prototype? It should run daily.
Its job should be precisely newrelic-admin run-program ./manage.py cycle_data --days 365 --chunk-size 1000 from:perfherder. Thanks!
Comment 33 • 6 years ago
I've modified the scheduler for production.
Comment 34 • 6 years ago
cycle_data has actually been failing for job data since December 3, 2018. We are hitting a UTF-8 exception on some of the FailureLine records:
builtins:UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 18: invalid continuation byte
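For reference, a minimal reproduction of this class of error (the byte values are illustrative, not taken from the failing record):

    # 0xed opens a three-byte UTF-8 sequence, so the 0x20 that follows
    # is an "invalid continuation byte", as in the traceback above.
    raw = b"abc\xed\x20def"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        print(exc)  # 'utf-8' codec can't decode byte 0xed in position 3: ...
    print(raw.decode("utf-8", errors="replace"))  # one lossy workaround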
Comment 36 • 6 years ago
Re-closing this in favor of my work happening in Bug 1581227.