Closed Bug 1126943 Opened 10 years ago Closed 9 years ago

Cycle data from the objectstore table more aggressively than the others

Categories

(Tree Management :: Treeherder: Data Ingestion, defect, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

References

(Blocks 1 open bug)

Details

Attachments

(2 files, 1 obsolete file)

The current data expiration works roughly like this:
- Find resultsets older than 4 months
- Prune all jobs entries for those resultsets
- Prune all objectstore entries corresponding to that ingested data

I know we said in the past it would be good to keep the objectstore data around, so we could replay the ingestion if there were any problems - but realistically we're not going to do that for jobs older than, say, a week. The objectstore tables across all DBs currently total 25 GB; reducing the lifecycle to 1 week would reduce that to 1.5 GB. This should also help with the performance issue seen in bug 1125410 (not that the table was insanely large anyway, but it can't make things any worse).
(In reply to Ed Morley [:edmorley] from comment #0)
> The current data expiration works roughly like this:

Whereas to have a different lifecycle we could simplify this and just prune anything with an objectstore.loaded_timestamp older than 1 week ago.
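To illustrate the simplification being proposed, here is a minimal sketch of what the pruning condition could look like. The helper name `objectstore_prune_cutoff` and the constant are hypothetical; only the table name `objectstore` and column `loaded_timestamp` come from the bug.

```python
import time

# Assumed cycle interval for this sketch: 1 week, per the proposal above.
OBJECTSTORE_CYCLE_SECONDS = 7 * 24 * 60 * 60


def objectstore_prune_cutoff(now=None, cycle_seconds=OBJECTSTORE_CYCLE_SECONDS):
    """Return the loaded_timestamp cutoff; rows older than this get pruned.

    Hypothetical helper - the real logic lives in Treeherder's cycle_data task.
    """
    now = time.time() if now is None else now
    return int(now - cycle_seconds)


# The whole prune then collapses to a single parameterised statement,
# instead of walking resultsets -> jobs -> objectstore:
PRUNE_SQL = "DELETE FROM objectstore WHERE loaded_timestamp < %s"
```

The key point is that no joins against the jobs tables are needed: the cutoff is computed once and applied directly to `loaded_timestamp`.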
Given bug 1125410 has just re-surfaced IMO this is a P1. Even if it doesn't directly improve perf/avoid the issue, it will at the least reduce the time taken to run an OPTIMIZE (which fixes the issue), hopefully to the point where we can run it in realtime on the master.
Maybe we should just delete rows in the objectstore when we ingest them, rather than setting to "loaded". Would save having to expire them after the fact...
(In reply to Ed Morley [:edmorley] from comment #1)
> Whereas to have a different lifecycle we could simplify this and just prune
> anything with an objectstore.loaded_timestamp older than 1 week ago.

Also, this simplification would mean we actually expire all old objectstore entries, even the ones stuck in the "loading" state (bug 1125476), whereas at the moment we never clean them up.
Blocks: 1130355
Priority: P2 → P3
Summary: Reduce the objectstore table lifecycle from 4 months to N weeks → Delete jobs from the objectstore table once they are ingested
Assignee: nobody → emorley
Status: NEW → ASSIGNED
In a followup bug I'll handle the existing completed records in the objectstore and tweak the data cycle task, but for now this will at least stop us keeping any more completed jobs (eg the mozilla-inbound objectstore currently contains 2.9 million records for the last 4 months).
Attachment #8601796 - Flags: review?(mdoglio)
I forgot we used uniqueness in the objectstore rather than presence in the jobs table to prevent re-ingestion of jobs in builds-4hr. In which case, the simplest solution is just to go back to the cycle-data-more-aggressively plan :-)
Summary: Delete jobs from the objectstore table once they are ingested → Cycle data from the objectstore table more agressively than the others
Attachment #8601796 - Flags: review?(mdoglio) → review-
Attachment #8601796 - Attachment is obsolete: true
Attachment #8602264 - Flags: review?(mdoglio)
Attachment #8602264 - Flags: review?(mdoglio) → review+
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/d462c2322f2408706fbd409702a56dd05c51cccc
Bug 1126943 - Factor out the calculation of the cycle timestamp

Since we'll be using it with differing cycle_interval values shortly.

https://github.com/mozilla/treeherder/commit/b02368414ef0235dc0160df76550fa00f2b23020
Bug 1126943 - Expire data from the objectstore independently of jobs DB

Items in the objectstore are currently expired by finding the list of result sets matching the date range, then looking up the jobs for those result sets, and finally searching for matching job guids in the datastore table. This is not only bad for performance of objectstore deletes (since we end up with lists of thousands of guids), but also means we cannot set a different cycle interval for the objectstore.

The new approach is much simpler: we only query the objectstore, and use loaded_timestamp to determine which rows to cycle. The objectstore does not have any foreign keys, so this isn't a problem. The only constraint is that we must keep the complete jobs long enough for the job to stop appearing in builds-4hr, to prevent us from continually re-adding it to the objectstore.

For now, we also only cycle jobs with a processed_state of 'complete', so entries with errors (or that are stuck in the 'loading' state due to bug 1125476) are not lost (this matches the prior behaviour, since the list of job_guids would only include successfully ingested jobs).

For now the objectstore cycle interval has been set to the same default interval as the jobs tables, but this will be reduced once manual cycle data runs are run on stage/prod.

https://github.com/mozilla/treeherder/commit/2e0eda9a9e0c3aea90fc442cecab543559d82d78
Bug 1126943 - Display count of deleted objectstore rows
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/e63c4650001bb370ddac000ac0cebea03381d09e
Bug 1126943 - Correct displayed count of deleted objectstore rows

The break was before the addition of the number of rows deleted in that chunk, so it was always slightly less than the real number of rows deleted.
I've run this on stage and got down to 1 day for the objectstore. The deletes were pretty quick in the end, particularly once the table size was reduced - we can probably raise the default chunk size for the objectstore to 10,000 or similar.
I ran against prod last night using an objectstore cycle interval of 1 day: https://emorley.pastebin.mozilla.org/8833054
Summary: Cycle data from the objectstore table more agressively than the others → Cycle data from the objectstore table more aggressively than the others
Attachment #8604049 - Flags: review?(mdoglio) → review+
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/950339a92f34988408a98b125c0e7ca53fdd82a2
Bug 1126943 - Lower the default objectstore cycle interval to 1 day

Now that stage+prod have had their objectstores reduced in size by manual |manage.py cycle_data| runs, we can safely reduce the default interval used by the once a day automated data cycle.

https://github.com/mozilla/treeherder/commit/ac14f791fb27a3e687a4d65d6502a40c89f9ae22
Bug 1126943 - Increase the default objectstore data cycle chunk size

Now that the objectstores on stage/prod only contain 1 day's worth of jobs, the deletes are much faster, so we can increase the chunk size. On production, deleting either 5000 or 10000 rows from the inbound objectstore both took about 0.4s, so the latter seems safe enough.
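The chunked delete described above, including the off-by-one counting fix from commit e63c465, can be sketched as follows. The function name `cycle_objectstore` and the `execute` callable are hypothetical stand-ins for Treeherder's real cycle_data implementation; the SQL reflects the `loaded_timestamp` and `processed_state = 'complete'` conditions described in the commits above.

```python
def cycle_objectstore(execute, cutoff_timestamp, chunk_size=10000):
    """Delete expired objectstore rows in chunks; return the total deleted.

    `execute` is a hypothetical wrapper around a DB cursor: it runs one
    parameterised DELETE and returns the number of rows affected.
    """
    sql = (
        "DELETE FROM objectstore "
        "WHERE loaded_timestamp < %s "
        "AND processed_state = 'complete' "
        "LIMIT %s"
    )
    deleted = 0
    while True:
        rows = execute(sql, (cutoff_timestamp, chunk_size))
        # Count the chunk *before* checking for the end; breaking first
        # was the under-counting bug fixed in commit e63c465.
        deleted += rows
        if rows < chunk_size:
            break
    return deleted
```

Deleting in bounded chunks keeps each statement short, so the delete never holds locks for long; once the table only holds a day's worth of jobs, larger chunks (10,000) become cheap, matching the ~0.4s figure reported above.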
Blocks: 1163588
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Blocks: 1130303