Add a manage command to just ingest data for a single push

RESOLVED FIXED

People

(Reporter: wlach, Assigned: wlach)

It would be useful for testing purposes to just be able to ingest data for a single push. I'm going to implement something that allows us to do this.

For reference, here's mdoglio's suggestions on doing this (irc log):

<wlach> how hard would it be to just get treeherder to ingest data for a single push? it feels like that would make testing things like this a lot easier (as opposed to the "start ingesting data on your local testing instance and hope you find something")
<mdoglio> wlach: to ingest a single push is not complicated, we could write a django command to do that
<wlach> mdoglio: I think I'd like to add that, can you give me a few pointers on where to look?
<mdoglio> sure
<mdoglio> it's actually pretty easy
<mdoglio> I'll grab a couple of links
<mdoglio> so this is an example of a django command, you can just copy&paste to a new file inside the same dir
<mdoglio> https://github.com/mozilla/treeherder-service/blob/master/treeherder/etl/management/commands/export_project_credentials.py
<mdoglio> the name of the file (with .py stripped) will be the name of the django command
<wlach> right
<mdoglio> in the handle function you should instantiate a HgPushlogProcess object
<mdoglio> very much like it's done ehre https://github.com/mozilla/treeherder-service/blob/master/treeherder/etl/tasks/buildapi_tasks.py#L68
<mdoglio> *here
<mdoglio> the 2 parameters there are the pushlog url and the repository name. if you add a &changeset= parameter to the url it should fetch just one push
<RyanVM|sheriffduty> poor logparser just got inundated
<wlach> mdoglio: ah ok, that works!
<mdoglio> wlach: after that in the same function you can run one or many of the Buildapi
<mdoglio> https://github.com/mozilla/treeherder-service/blob/master/treeherder/etl/tasks/buildapi_tasks.py#L17-L41
<mdoglio> processes
<wlach> mdoglio: can I just run all of them in succession?
<mdoglio> yes
<mdoglio> after that you need to call the process_objects task synchronously
<mdoglio> or execute its content
<wlach> which one is the process_objects task?
<wlach> or do you mean just the three you mentioned generally
<mdoglio> I just found out that there's already a command for that 
<mdoglio> :)
<mdoglio> https://github.com/mozilla/treeherder-service/blob/master/treeherder/model/management/commands/process_objects.py
<mdoglio> the process_objects task is usually executed in a periodic task
<wlach> so hmm
<mdoglio> but again, you can execute it as a normal function
<mdoglio> or call the already existing command using django.core.management.call_command
<mdoglio> something like call_command('process_objects', limit=n)
<wlach> so HgPushogParser gets the raw object data we need to process
<wlach> (for that push)
<wlach> and the process_objects task processes that data?
<mdoglio> HgPushlogProcess fetches the pushlog from hg.m.o
<mdoglio> then the buildapi processes fetch the jobs data from buildapi
<mdoglio> at this point you have pending and running jobs in the jobs database and completed jobs in the objectstore
<mdoglio> process_objects processes the completed jobs and stores them in the jobs database, and after that calls the log parser
<wlach> mdoglio: the buildapi processes are what you linked to in https://github.com/mozilla/treeherder-service/blob/master/treeherder/etl/tasks/buildapi_tasks.py#L17-L41 ?
<mdoglio> wlach: yes
<wlach> ok cool! that sounds pretty straightforward
<mdoglio> it is indeed.
<mdoglio> wlach: the only thing you need to run is the web server
<wlach> mdoglio: right
<mdoglio> with either gunicorn or the runserver command
<mdoglio> oh, one last thing
<mdoglio> before running anything, set CELERY_ALWAYS_EAGER = True in the settings module
<wlach> what does that do?
<mdoglio> from django.conf import settings; settings.CELERY_ALWAYS_EAGER = True
<mdoglio> it runs every celery task you call synchronously
<wlach> and I assume I put that in the manage command?
<mdoglio> yes from the django command you write
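
A minimal sketch of the order of operations described in the log above, using only the Python standard library so it runs without a Treeherder checkout. The helper names (`single_push_url`, `run_steps`), the example URL, and the step labels are assumptions for illustration. In a real command the step callables would wrap HgPushlogProcess, the buildapi processes, and process_objects, whose exact call signatures are not shown in this bug.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def single_push_url(pushlog_url, changeset):
    """Return pushlog_url with a changeset= query parameter added or replaced."""
    parts = urlsplit(pushlog_url)
    # Keep every existing parameter except a previous changeset value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "changeset"
    ]
    query.append(("changeset", changeset))
    return urlunsplit(parts._replace(query=urlencode(query)))


def run_steps(steps):
    """Call each (name, callable) pair in order and return the names that ran.

    Running the steps one after another in the same process mirrors what
    CELERY_ALWAYS_EAGER = True is described as doing for celery tasks.
    """
    completed = []
    for name, step in steps:
        step()
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Hypothetical pushlog URL and revision, for illustration only.
    url = single_push_url("https://hg.example.org/some-repo/json-pushes/?full=1", "abcdef123456")
    print(url)
    # Placeholder callables standing in for the real ingestion steps.
    print(run_steps([
        ("fetch pushlog", lambda: None),
        ("run buildapi processes", lambda: None),
        ("process objects", lambda: None),
    ]))
```

The ordering matters more than the helpers: the pushlog has to be fetched before the buildapi data, and process_objects runs last because it consumes the completed jobs left in the objectstore.
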
When you say 'single push' do you mean "all jobs associated with any recent single push, doesn't matter which, so we can test" or "all jobs associated with push with revision foo"

If the latter, the harder part for this bug, is figuring out which builds-4hr daily archive the jobs for a push belong in. ie: We can't just take "time of push", because retriggers may have happened since. So I guess we'd have to query buildapi for the finish times for all jobs, then search the daily archive for each date that was referenced.

Depending on the above (and whether this is just for testing, or also to help recovery in production - something that would be useful), parts of this may overlap with bug 1069467.
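
To illustrate the date lookup described above, here is a rough sketch that turns a list of job finish times into the set of daily archives to search. It assumes the finish times are Unix epoch seconds and that the daily archives are split by UTC date. Neither detail is confirmed in this bug, and `archive_dates` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def archive_dates(finish_times):
    """Return the sorted, distinct UTC dates covered by the finish times.

    finish_times is assumed to be an iterable of Unix epoch seconds.
    """
    return sorted(
        {datetime.fromtimestamp(ts, tz=timezone.utc).date() for ts in finish_times}
    )


if __name__ == "__main__":
    # Two jobs finishing on the same day plus a retrigger one day later.
    for day in archive_dates([0, 86399, 86400]):
        print(day.isoformat())
```

A retrigger that finishes days after the original push simply adds another date to the result, which is why the push time alone is not enough.
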
Component: Treeherder → Treeherder: Data Ingestion
(In reply to Ed Morley (moved to Treeherder) [:edmorley] from comment #1)
> When you say 'single push' do you mean "all jobs associated with any recent
> single push, doesn't matter which, so we can test" or "all jobs associated
> with push with revision foo"
> 
> If the latter, the harder part for this bug, is figuring out which
> builds-4hr daily archive the jobs for a push belong in. ie: We can't just
> take "time of push", because retriggers may have happened since. So I guess
> we'd have to query buildapi for the finish times for all jobs, then search
> the daily archive for each date that was referenced.
> 
> Depending on the above (and whether this is just for testing, or also to
> help recovery in production - something that would be useful), parts of this
> may overlap with bug 1069467.

Yeah, my use case here is being able to just import all the data associated with a specific revision in a single shot, for testing purposes. I think bug 1069467 is strictly a separate thing, though I think my approach here might be useful as a basis for creating a recovery command as proposed there...
(In reply to William Lachance (:wlach) from comment #2)
> though I think my approach here might
> be useful as a basis for creating a recovery command as proposed there...

Yeah, this was more the reason I commented, but along the lines of our conversation on IRC, I think let's not worry about it too much for this bug, to avoid scope creep.
Created attachment 8535892
Add a manage command to just ingest data for a single push

This seems to work ok here, not sure if there's a more elegant way of handling the filtering?
Attachment #8535892 - Flags: review?(mdoglio)
Comment on attachment 8535892
Add a manage command to just ingest data for a single push

Oh grump, this doesn't actually work. Cancelling review for now.
Attachment #8535892 - Flags: review?(mdoglio)
Created attachment 8538862
Working patch!

Ok this one actually works! And it even has a profiling option so you can see what takes up the most time (hint: log parsing in most cases...)
Attachment #8538862 - Flags: review?(mdoglio)
Attachment #8538862 - Flags: review?(mdoglio) → review+
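
For readers wondering what a profiling option like the one mentioned above might look like, here is one possible sketch using the standard library's cProfile. How the actual patch implements profiling is not shown in this bug, so `run_profiled` and `fake_ingest` are hypothetical names and the whole block is an assumption.

```python
import cProfile
import io
import pstats


def run_profiled(func, *args, **kwargs):
    """Call func under cProfile and return (result, text report).

    The report lists the ten most expensive entries by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


if __name__ == "__main__":
    # Placeholder workload standing in for the real ingestion function.
    def fake_ingest(n):
        return sum(i * i for i in range(n))

    value, report = run_profiled(fake_ingest, 100000)
    print(value)
    print(report)
```

Sorting by cumulative time makes it easy to spot a step such as log parsing that dominates the run.
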

Comment 7

4 years ago
Commits pushed to master at https://github.com/mozilla/treeherder-service

https://github.com/mozilla/treeherder-service/commit/393a62067fae922fef2ce2c96d4a44f69bf78b70
Bug 1110463 - Add a manage command to just ingest data for a single push

https://github.com/mozilla/treeherder-service/commit/3f762b296f86f014cfa34c33a916ae9c10f96e18
Merge pull request #305 from wlach/1110463

Bug 1110463 - Add a manage command to just ingest data for a single push
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED