1140349 - Remove the objectstore

Reporter

Description

10 years ago

tl;dr: There are a number of issues with the current objectstore implementation - how should we redesign it, or should we get rid of it entirely?

It's my understanding that:
* all jobs found in builds-{pending,running,4hr} are mangled into an appropriate form for the Treeherder API by the etl layer, then placed into the objectstore, so they can be processed and inserted via the API asynchronously by the process-objects task.
* this was done to allow builds-* ingestion even if the API was down or under heavy load or ...? However, if the API were down, the etl process would still fail, since it calls the API at various points.
* jobs submitted directly to the API (eg from Taskcluster) skip the objectstore, but are rate-limited by the API.

However we have the following problems with the objectstore:

a) For unknown reasons (other than perhaps high levels of churn/activity on those tables) we've experienced several cases of corruption/locking causing extreme slowness in the process-objects task (bug 1125410).

b) Jobs can get stuck in the objectstore in the "loading" state and never get ingested (bug 1125476).

c) The data-cycle deletes for the objectstore table take 10x as long as those for any of the other tables (eg https://rpm.newrelic.com/accounts/677903/applications/4180461/datastores#/overview/All shows 7 minute query times).

d) We keep all jobs in the objectstore (for the same amount of time as all the other tables; 4 months), even after they've been successfully submitted to the API. This consumes about 25GB on prod (bug 1078523 comment 20) and means the query times are much larger than they need to be. I imagine many of the corruption/locking/slowness issues from (a)+(d) would be avoided if the table was 0.1% of its current size. For example the mozilla-inbound objectstore contains 2.5 million rows, even though at any one time, only a few hundred of those rows are actually yet to be ingested.

e) Whilst the objectstore tables have an "error" and "error_msg" field, there are zero rows in four months of mozilla-inbound jobs that have a value other than "N". I struggle to believe we've not hit an error in that time. We're either not using this field (and should remove it) or something is broken.

f) "loaded_timestamp" is an unfortunately named field: it actually means "the time we inserted the row into the objectstore table", not "the time we submitted the job to the API".

g) The "revision_hash" field only seems to be populated after we have submitted the job, so we could presumably remove this field.

Some options:

1) Remove the objectstore entirely and have the etl layer submit to the API directly.
2) Keep the objectstore, but delete rows as soon as they are submitted to the API.
3) Keep the objectstore, and delete all ingested rows as part of the daily data-cycle.
4) Keep the objectstore, and delete ingested rows older than a certain age (eg 7 days) as part of the daily data-cycle.

Thanks for reading! :-)

--

Some additional context...

Example objectstore records:

id: 5474407
job_guid: 5b21287514aea99bccbeb3e873f6ad6011b2322f_40493
revision_hash:
loaded_timestamp: 1425640702
processed_state: ready
error: N
error_msg: N
json_blob: ...
worker_id:

id: 1131
job_guid: b0767165317388efe54592a594a861a7363aeb8b
revision_hash:
loaded_timestamp: 1402952309
processed_state: loading
error: N
error_msg: N
json_blob: ...
worker_id: 636665

id: 5474407
job_guid: 5b21287514aea99bccbeb3e873f6ad6011b2322f_40493
revision_hash: 8ff8af56181ccad5759dd86d25e7899f88b1d3db
loaded_timestamp: 1425640702
processed_state: complete
error: N
error_msg: N
json_blob: ...
worker_id: 78459293

Related bugs:
* Bug 1130355 - The cycle-data task should not block process-objects for 25 mins
* Bug 1125410 - Extremely slow job ingestion due to process-objects tasks being blocked on MySQL system locks
* Bug 1135112 - The process-objects tasks is run against all repositories, even those marked as onhold
* Bug 1126943 - Reduce the objectstore table lifecycle from 4 months to N week
* Bug 1125476 - Jobs in the objectstore can get stuck in the 'loading' state and never get ingested
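
For illustration, options 3 and 4 could look roughly like the sketch below. It is not existing Treeherder code: it uses an in-memory SQLite table as a stand-in for the real MySQL objectstore, keeps only four of the columns shown in the example records, and the names prune_ingested, batch_size and max_age_seconds are hypothetical. Deleting in small batches is one possible way to avoid holding locks for long (compare bug 1130355); whether it would help with the contention seen in production is untested.

```python
import sqlite3

# Simplified stand-in for the objectstore table (assumption: the real table
# lives in MySQL and has more columns, e.g. json_blob and worker_id).
SCHEMA = """
CREATE TABLE objectstore (
    id INTEGER PRIMARY KEY,
    job_guid TEXT,
    loaded_timestamp INTEGER,
    processed_state TEXT
)
"""


def prune_ingested(conn, now, max_age_seconds, batch_size=1000):
    """Delete rows already ingested ('complete') and older than the cutoff.

    Rows are removed in batches of at most batch_size, committing after each
    batch. Returns the total number of rows deleted.
    """
    cutoff = now - max_age_seconds
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM objectstore WHERE id IN ("
            " SELECT id FROM objectstore"
            " WHERE processed_state = 'complete' AND loaded_timestamp < ?"
            " LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total


def demo():
    """Build a tiny table and prune ingested rows older than 7 days."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    now = 1425640702  # timestamp borrowed from the example records
    day = 86400
    rows = [
        (1, "a", now - 30 * day, "complete"),  # old and ingested
        (2, "b", now - 30 * day, "loading"),   # old but never ingested
        (3, "c", now - 1 * day, "complete"),   # ingested, but recent
        (4, "d", now, "ready"),                # still waiting
    ]
    conn.executemany("INSERT INTO objectstore VALUES (?, ?, ?, ?)", rows)
    deleted = prune_ingested(conn, now, 7 * day, batch_size=1)
    remaining = [r[0] for r in conn.execute("SELECT id FROM objectstore ORDER BY id")]
    return deleted, remaining
```

With max_age_seconds=0 the same function behaves like option 3 (delete every ingested row); rows still in the "ready" or "loading" state are never touched, so stuck jobs (bug 1125476) would remain visible.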

Attachments:
* PR Phase 1 Bypass and Drain Objectstore (9 years ago, Cameron Dawson [:camd], 46 bytes, text/x-github-pull-request) emorley : review+, mdoglio : feedback+
* PR Phase 2 Remove Objectstore code (9 years ago, Cameron Dawson [:camd], 46 bytes, text/x-github-pull-request) emorley : review+, emorley : feedback+