Closed Bug 1140349 Opened 10 years ago Closed 9 years ago

Remove the objectstore

Categories

(Tree Management :: Treeherder: Data Ingestion, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: camd)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

tl;dr: There are a number of issues with the current objectstore implementation - how should we redesign it, or should we get rid of it entirely? It's my understanding that: * all jobs found in builds-{pending,running,4hr} are mangled into an appropriate form for the Treeherder API by the etl layer, then placed into the objectstore, so they can be processed and inserted via the API asynchronously by the process-objects task. * this was done to allow builds-* ingestion even if the API was down or under heavy load or ...? However, if the API were down, the etl process would still fail, since it calls the API at various points. * jobs submitted directly to the API (eg from Taskcluster) skip the objectstore, but are rate-limited by the API However we have the following problems with the objectstore: a) For unknown reasons (other than perhaps high levels of churn/activity on those tables) we've experienced several cases of corruption/locking causing extreme slowness in the process-objects task (bug 1125410). b) Jobs can get stuck in the objectstore in the "loading" state and never get ingested (bug 1125476). c) The data-cycle deletes for the objectstore table take 10x as long as any of the other tables (eg https://rpm.newrelic.com/accounts/677903/applications/4180461/datastores#/overview/All shows 7 minute query times). d) We keep all jobs in the objectstore (for the same amount of time as all the other tables; 4 months), even after they've been successfully submitted to the API. This consumes about 25GB on prod (bug 1078523 comment 20) and means the queries times are much larger than they need be. I imagine many of the corruption/locking/slowness issues from (a)+(d) would be avoided if the table was 0.1% of its current size. For example the mozilla-inbound objectstore contains 2.5 million rows, even though at any one time, only a few hundred of those rows are actually yet to be ingested. e) Whilst the objectstore tables have an "error" and "error_msg" field, there are zero rows in four months of mozilla-inbound jobs that have a value other than "N". I struggle to believe we've not hit an error in that time. We're either not using this field (and should remove it) or something is broken. f) "loaded_timestamp" is an unfortunately named field, it actually means "the time we inserted the row into the objectstore table", bit "the time we submitted the job to the API". g) The "revision_hash" field only seems to be populated after we have submitted the job, so we could presumably remove this field. Some options: 1) Remove the objectstore entirely and have the etl layer submit to the API directly. 2) Keep the objectstore, but delete rows as soon as they are submitted to the API. 3) Keep the objectstore, and delete all ingested rows as part of the daily data-cycle. 4) Keep the objectstore, and delete ingested rows older than a certain age (eg 7 days) as part of the daily data-cycle. Thanks for reading! :-) -- Some additional context... Example objectstore records: id: 5474407 job_guid: 5b21287514aea99bccbeb3e873f6ad6011b2322f_40493 revision_hash: loaded_timestamp: 1425640702 processed_state: ready error: N error_msg: N json_blob: ... worker_id: id: 1131 job_guid: b0767165317388efe54592a594a861a7363aeb8b revision_hash: loaded_timestamp: 1402952309 processed_state: loading error: N error_msg: N json_blob: ... worker_id: 636665 id: 5474407 job_guid: 5b21287514aea99bccbeb3e873f6ad6011b2322f_40493 revision_hash: 8ff8af56181ccad5759dd86d25e7899f88b1d3db loaded_timestamp: 1425640702 processed_state: complete error: N error_msg: N json_blob: ... worker_id: 78459293 Related bugs: * Bug 1130355 - The cycle-data task should not block process-objects for 25 mins * Bug 1125410 - Extremely slow job ingestion due to process-objects tasks being blocked on MySQL system locks * Bug 1135112 - The process-objects tasks is run against all repositories, even those marked as onhold * Bug 1126943 - Reduce the objectstore table lifecycle from 4 months to N week * Bug 1125476 - Jobs in the objectstore can get stuck in the 'loading' state and never get ingested
(In reply to Ed Morley [:edmorley] from comment #0) > Some options: > 1) Remove the objectstore entirely and have the etl layer submit to the API > directly. > 2) Keep the objectstore, but delete rows as soon as they are submitted to > the API. > 3) Keep the objectstore, and delete all ingested rows as part of the daily > data-cycle. > 4) Keep the objectstore, and delete ingested rows older than a certain age > (eg 7 days) as part of the daily data-cycle. Out of 2-4, I'm leaning towards #2 - however I'm interested to know if people think removing the objectstore entirely is something we should be considering?
Flags: needinfo?(mdoglio)
Flags: needinfo?(cdawson)
I'm in favor of 1). The objectstore data model was inherited from Datazilla, which has no concept of queues and asynchronous tasks. Since we are using rabbitmq and celery, I can't think of any benefit from using the objectstore.
Flags: needinfo?(mdoglio)
Works for me :-)
Summary: Rethinking the objectstore → Remove the objectstore
Yep yep. +1. I hope we do this in Q2.
Flags: needinfo?(cdawson)
Blocks: 1125476
Assignee: nobody → cdawson
Here is the raw sql query we need to copy the OAUTH keys in the ``datasource`` table from the ``objectstore`` records to the ``jobs`` records. This can be done prior to merging any code. But definitely after the first merge: UPDATE treeherder.datasource ds INNER JOIN ( SELECT project, contenttype, oauth_consumer_key, oauth_consumer_secret FROM treeherder.datasource WHERE contenttype='objectstore' ) newdata ON newdata.project=ds.project SET ds.oauth_consumer_key=newdata.oauth_consumer_key, ds.oauth_consumer_secret=newdata.oauth_consumer_secret WHERE ds.project = newdata.project AND ds.contenttype = 'jobs'
Unfortunately this will break existing submitters of data to stage/prod. Can we instead copy from the objectstore to jobs? (Since people aren't submitting to the objectstore themselves, right?)
Attachment #8627754 - Flags: review?(emorley)
Attachment #8627754 - Flags: feedback?(mdoglio)
Attachment #8627756 - Flags: review?(emorley)
Attachment #8627756 - Flags: feedback?(mdoglio)
Blocks: 1178227
(In reply to Ed Morley [:emorley] from comment #6) > Unfortunately this will break existing submitters of data to stage/prod. Can > we instead copy from the objectstore to jobs? (Since people aren't > submitting to the objectstore themselves, right?) I wonder if the initial PR could do a redirect from objectstore to jobs. The data is identical. That way, it would still work, regardless of the client version they're using. I think that would be a good call. I'll update the PR. Alternatively, I could 40X the old endpoint and tell them to update their client...
Attachment #8627754 - Flags: feedback?(mdoglio) → feedback+
I think TaskCluster is not using the python-client to post jobs to treeherder. We should probably send out an email before we switch the objectstore off on stage/prod. Maybe we can: 1 - change the python client (adding a few tests for completed jobs passing through the job endpoint) and send out an email to tell our api consumers to update their client. 2 - after a week or so 301 all the requests on the objectstore endpoint to the job endpoint 3 - if we still have clients attempting requests to the objectstore, send another email 4 - remove the objectstore code. I think in general this process would be easier if we were versioning the api (bug 1145720).
I only got partway through the review of this today; I'll finish it off tomorrow :-)
Comment on attachment 8627754 [details] [review] PR Phase 1 Bypass and Drain Objectstore Actually, re-reading it sounds like you're going to change the approach slightly? Will hold off until the PR has been updated :-)
Attachment #8627754 - Flags: review?(emorley)
Attachment #8627756 - Flags: review?(emorley)
I took another look at this, and the easiest thing was to modify the implementation of the ``ObjectstoreViewSet.create`` function to have the same code as the ``JobsViewSet.create`` function. This is to say that both endpoints getting a POST will skip the objectstore and post directly to the jobs database. This copied code will go away in the phase 2 commit where we just remove the ``ObjectstoreViewSet`` completely. This was my thinking: https://github.com/mozilla/treeherder/blob/drain-objectstore/treeherder/webapp/api/objectstore.py#L34-L36 Mdoglio: is there a better way to do this? POST doesn't support redirects, according to the spec.
Flags: needinfo?(mdoglio)
It really depends on how the client is implemented. The python requests library for example follows redirects by default, but you can disable that. If you want to do a redirect, I would just return a HttpResponseRedirect [1] object without any content in the body. [1]https://docs.djangoproject.com/en/1.8/ref/request-response/#django.http.HttpResponseRedirect
Flags: needinfo?(mdoglio)
It's not just a case of following the requests, it's ensuring the library you use doesn't convert the POST to GET and drop the body, per the spec: http://tools.ietf.org/html/rfc2616.html#section-10.3.2 urllib2 does drop the POST body on the floor for example: https://docs.python.org/2/library/urllib2.html#urllib2.HTTPRedirectHandler.redirect_request The wording of the requests comment here is unclear to me - it may still not preserve the body: https://github.com/kennethreitz/requests/blob/8b5e457b756b2ab4c02473f7a42c2e0201ecc7e9/requests/api.py#L33
Attachment #8627756 - Flags: feedback?(mdoglio)
Attachment #8627754 - Flags: review?(emorley)
Ran the sql in comment 5 on stage. All is well. :)
Comment on attachment 8627754 [details] [review] PR Phase 1 Bypass and Drain Objectstore r=me with test tweaks
Attachment #8627754 - Flags: review?(emorley) → review+
Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/ad4f87b131c3e4f9ae70c4e38204cfc7019bd9b2 Bug 1140349 - Skip and drain the objectstore Since we use Celery for queueing job ingestion, the objectstore is now irrelevant. This code is the first step. This will bypass the Objectstore and ingest jobs directly to our ``jobs`` database. Phase 2 is to remove all the Objectstore code (in a later commit) Phase 3 is to delete the Objectstore databases and related fields in other tables.
Comment on attachment 8627756 [details] [review] PR Phase 2 Remove Objectstore code This one's a bit of a whopper! Aren't you lucky.... :)
Attachment #8627756 - Flags: review?(emorley)
Comment on attachment 8627756 [details] [review] PR Phase 2 Remove Objectstore code Yey lots of deadcode removal! Almost ready, have commented with a few other things that can be deleted :-)
Attachment #8627756 - Flags: review?(emorley) → feedback+
Status: NEW → ASSIGNED
The other nice thing about this removal is that since we only run process-objects once a minute, and it takes an average of 10s (but up to 40s) to run, we're reducing the time taken for completed jobs to appear in the UI by an average of ~40s - or in the peak load + unfortunate schedule timing case, by as much as 1m40s. The builds-4hr task itself runs every 3 minutes and takes an average of 1 minute (and this will only get lower with bug 1178240 - we may even be able to run builds-4hr ingestion more often), so the end-to-end processing time saved by removing the objectstore is actually fairly significant.
Blocks: 1096863
Blocks: 1179858
Comment on attachment 8627756 [details] [review] PR Phase 2 Remove Objectstore code I happened to spot this in the IRC scrollback: <camd> emorley: OK, I rebased the remove-objectstore branch now and removed the grunt build commit. Thanks for looking over that. At your leisure, sir. :) Please can you re-request review when a PR has been updated, to make sure it goes in my queue (see also https://github.com/mozilla/treeherder/pull/660#issuecomment-122937977)? I might not always see IRC messages after hours, and I find it helpful to have as few sources of tasks to keep track of as possible, rather than Bugzilla+email+todoist+IRC+... :-) I'll take a look at this tomorrow - some nice cleanup!
Attachment #8627756 - Flags: review?(emorley)
Comment on attachment 8627756 [details] [review] PR Phase 2 Remove Objectstore code r=me with the s/jobs_execute/execute/ change in rewrite_perf_data.py It's great that we can remove all this code now! :-)
Attachment #8627756 - Flags: review?(emorley) → review+
Blocks: 1185030
Before this lands, the oauth credentials need copying over from the objectstore rows to the jobs rows on Heroku, using the comment 5 query (stage/prod are fine).
(In reply to Ed Morley [:emorley] from comment #22) > Comment on attachment 8627756 [details] [review] > PR Phase 2 Remove Objectstore code > > I happened to spot this in the IRC scrollback: > <camd> emorley: OK, I rebased the remove-objectstore branch now and removed > the grunt build commit. Thanks for looking over that. At your leisure, > sir. :) > > Please can you re-request review when a PR has been updated, to make sure it > goes in my queue (see also > https://github.com/mozilla/treeherder/pull/660#issuecomment-122937977)? I > might not always see IRC messages after hours, and I find it helpful to have > as few sources of tasks to keep track of as possible, rather than > Bugzilla+email+todoist+IRC+... :-) > > I'll take a look at this tomorrow - some nice cleanup! Yep, makes sense. I feel the same way. My BZ queue is one of my main to do lists. :) In the future I'll be sure to do that.
Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/00cfe6643db96d5339887493423b962ec0a189d1 Bug 1140349 - Remove the objectstore code After the previous commit, the Objectstore is effectively "dead code". So this commit removes all the dead code after anything left over in the Objectstore has been drained and added to the DB.
Blocks: 1186357
Depends on: 1186366
Think everything is done here; remaining work is being covered by bug 1185030 and bug 1186357.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Blocks: 1190343
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: