Bug 1216578 (Closed) - Opened 10 years ago, Closed 10 years ago
Autophone - queue Treeherder jobs and get Treeherder submission off of main thread
Categories: Testing Graveyard :: Autophone, defect
Tracking: firefox44 affected
Status: RESOLVED FIXED
People: bc (Reporter), gbrown (Assignee)
Attachments (1 file, 2 obsolete files)
patch, 17.38 KB - bc: review+
Autophone's use of Treeherder is problematic:
Treeherder submission occurs in the main autophone.py process, both on the main thread (during startup and trigger-jobs processing) and on the threads that process pulse messages. It also occurs in each worker process (which is single-threaded).
To prevent hangs caused by contention during simultaneous Treeherder submissions, we have to use a shared lock to serialize submission across the main Autophone process (main thread and pulse threads) and the worker processes.
When a Treeherder submission fails due to Treeherder server errors (which happens fairly often), we currently retry up to a default of 3 times with a default wait of 5 minutes between attempts. Since Treeherder submission is guarded by the shared lock, this can cause a temporary but lengthy deadlock, with one process or thread holding the shared lock while it retries a Treeherder submission and the other processes waiting to obtain the lock.
We should create a means of queuing the Treeherder submissions so that any process or thread can quickly add a new submission. This queue should be persistent so that it survives Autophone and Host restarts. I'm not sure how we should handle repeated failures. I hate to just drop them, but there may be cases where we have a malformed submission that needs to be cleared. We should of course log the errors and use email notification when we can't recover after repeated attempts.
A separate process or thread can periodically check for pending submissions and then submit them, clearing out the submissions as they are successfully submitted.
The queue could be implemented as an sqlite table in the jobs.sql database or perhaps in a different sql database, e.g. treeherder.sql.
The current Treeherder submit_.* methods would be changed to call a new method "queue_request" instead of "post_request". "queue_request" would still use the shared lock to serialize insertion of submissions into the queue.
As for actually submitting to Treeherder, we can use either a separate process or a new thread in the Autophone process to monitor the Treeherder submissions table and make submissions. It may seem counterintuitive, but I think I would prefer a thread in the main Autophone process over a distinct process implemented using multiprocessing. This would still use the shared lock to serialize access to the Treeherder submissions table, but it would only hold the lock for a short period, and it would not hold the lock while actually posting the submission to Treeherder.
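The design above can be sketched with stdlib sqlite3 and a shared lock. The table layout and method names below are illustrative assumptions modeled on the terminology in this bug, not Autophone's actual jobs.py:

```python
import sqlite3


class TreeherderQueue:
    """Persistent submission queue; the shared lock guards only table access,
    never the actual HTTP post to Treeherder."""

    def __init__(self, db_path, lock):
        self.db_path = db_path
        self.lock = lock
        with sqlite3.connect(self.db_path) as conn:
            conn.execute("""CREATE TABLE IF NOT EXISTS treeherder (
                                id INTEGER PRIMARY KEY AUTOINCREMENT,
                                machine TEXT,
                                project TEXT,
                                job_collection TEXT,
                                attempts INTEGER DEFAULT 0)""")

    def queue_request(self, machine, project, job_collection_json):
        # Fast path: insert under the lock and return immediately.
        with self.lock:
            with sqlite3.connect(self.db_path) as conn:
                conn.execute(
                    "INSERT INTO treeherder (machine, project, job_collection) "
                    "VALUES (?, ?, ?)",
                    (machine, project, job_collection_json))

    def get_next(self):
        # Hold the lock only long enough to read one row.
        with self.lock:
            with sqlite3.connect(self.db_path) as conn:
                return conn.execute(
                    "SELECT id, machine, project, job_collection, attempts "
                    "FROM treeherder ORDER BY id LIMIT 1").fetchone()

    def complete(self, job_id):
        # Called by the submitting thread after a successful post.
        with self.lock:
            with sqlite3.connect(self.db_path) as conn:
                conn.execute("DELETE FROM treeherder WHERE id = ?", (job_id,))
```

The submitting thread would call get_next(), release the lock, post to Treeherder outside the lock, then call complete() on success — so retries never block the other processes.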
Comment 1 • 10 years ago
:gbrown, we discussed this briefly last week; if you have a couple of days to hack on this, it would be a big help. We can regroup in 6 weeks and see who has time to tackle this if it isn't in progress or completed by then.
Updated • 10 years ago (Assignee)
Assignee: nobody → gbrown
Comment 2 • 10 years ago (Assignee)
:bc -- Thanks for the detailed explanation; it has been very helpful.
Have you considered how to serialize the TreeherderJobCollection to the database? I am pondering keeping the existing TJC construction code in submit_pending/running/complete, then, in queue_request, storing the TJC as a single json string; I think I can retrieve the json string from the DB, build dicts from it, then re-construct a TJC from those...but I'm open to other approaches.
Also, I notice several timestamps in the TJC. Should those reflect the time queued, or the time submitted?
Flags: needinfo?(bob)
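The round-trip described in comment 2 can be sketched with plain dicts standing in for the treeherder client's TreeherderJob/TreeherderJobCollection classes. The stand-in classes below are hypothetical and only show the serialization shape:

```python
import json


# Hypothetical stand-ins for the treeherder client classes, illustration only.
class Job:
    def __init__(self, data):
        self.data = data


class JobCollection:
    def __init__(self):
        self.jobs = []

    def add(self, job):
        self.jobs.append(job)

    def to_json(self):
        # One JSON string per collection: a list of per-job dicts.
        return json.dumps([job.data for job in self.jobs])


# Serialize: store the collection as a single JSON string in the database.
tjc = JobCollection()
tjc.add(Job({'state': 'pending', 'job_guid': 'abc'}))
stored = tjc.to_json()

# Deserialize: load the string from the DB, rebuild the collection job by job.
restored = JobCollection()
for data in json.loads(stored):
    restored.add(Job(data))

assert restored.to_json() == stored
```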
Comment 3 • 10 years ago (Reporter)
(In reply to Geoff Brown [:gbrown] from comment #2)
> :bc -- Thanks for the detailed explanation; it has been very helpful.
>
:-)
> Have you considered how to serialize the TreeherderJobCollection to the
> database? I am pondering keeping the existing TJC construction code in
> submit_pending/running/complete, then in queue_request, storing the TJC as a
> single json string; I think I can retrieve the json string from the DB,
> build a dict from that, then re-construct a TJC from that...but I'm open to
> other approaches.
>
That sounds very reasonable. We might want some additional data such as the initial time it was queued, the last time submission was attempted, the last error message and the number of attempts. This would be for our benefit and perhaps be of use in cleaning out stale/bad jobs.
> Also, I notice several timestamps in the TJC. Should those reflect the time
> queued, or the time submitted?
There were problems initially with the treeherder client sending empty values for the dates, so for the pending submission they were all set to the same value.
The running submission would keep the original requested value but change the started/ended values to match the time the job began running.
Finally, the completed submission would keep the requested and started values from the running submission and change the ended value to match when the job completed.
In our case, I think the important pieces of information to display in treeherder are the actual times the tjc was queued rather than when it was sent to treeherder. We are more interested in how the job ran than in how well treeherder receives the data.
The newest treeherder client may allow us to not send the started/ended values for the pending job; or the ended value for the running job. If you wouldn't mind checking on that and updating the tjc accordingly if possible that would be great.
Flags: needinfo?(bob)
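The timestamp progression described in comment 3 can be sketched with illustrative epoch values (an assumption for illustration, not actual Autophone data):

```python
# Illustrative values only; the real values come from the clock at each event.
# Pending: all three timestamps set to the same queue-time value.
pending = {'submit': 1000, 'start': 1000, 'end': 1000}

# Running: submit kept; start/end both set to when the job began running.
running = {'submit': 1000, 'start': 1200, 'end': 1200}

# Completed: submit and start kept; end set to when the job finished.
completed = {'submit': 1000, 'start': 1200, 'end': 1500}

assert pending['submit'] == running['submit'] == completed['submit']
assert running['start'] == completed['start']
assert completed['end'] >= completed['start'] >= completed['submit']
```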
Comment 4 • 10 years ago (Assignee)
Comment 5 • 10 years ago (Assignee)
(In reply to Bob Clary [:bc:] from comment #3)
> The newest treeherder client may allow us to not send the started/ended
> values for the pending job; or the ended value for the running job. If you
> wouldn't mind checking on that and updating the tjc accordingly if possible
> that would be great.
The started/ended values are still required: the server reports an error if they are not included.
However, I found that they can be specified as 0, resulting in no corresponding display in the treeherder UI -- I'll use that.
Comment 6 • 10 years ago (Assignee)
I think I've implemented this very much as described in comment 0. All previous calls to "post_request" now call "queue_request" which uses jobs to save the treeherder request data into a new "treeherder" table. The new "treeherder_thread" runs forever, retrieving rows from the treeherder table, submitting them to treeherder, and deleting them on successful submission. If there are no treeherder jobs queued, treeherder_thread sleeps for the treeherder-retry-wait. If a job cannot be submitted to treeherder, treeherder_thread sleeps for the treeherder-retry-wait.
Extra stuff:
- the treeherder-retries option is removed; we retry infinitely now
- we submit 0 for pending/running timestamps instead of the current time, so that the started/ended times are not displayed on treeherder
- in autophonetreeherder, logurl/logname are initialized earlier to avoid a reference-before-defined exception when the s3 bucket is not configured
This works fine for me, but I'm testing just s1s2 against just one device.
https://treeherder.allizom.org/#/jobs?repo=mozilla-central&revision=f07e71078bc8&exclusion_profile=false&filter-searchStr=autophone
Concerns: Error handling is simplistic. If treeherder is down or a bad job is stuck in our queue, autophone will try to submit it and report the failure by mail every 30 seconds (treeherder-retry-wait).
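One possible way to soften the every-retry failure mail described above (a sketch of an assumption, not something in this patch): rate-limit notifications per queued job so a stuck job mails at most once per interval.

```python
import time


class FailureNotifier:
    """Send at most one failure mail per job per `min_interval` seconds.
    Hypothetical helper; `send_mail` is any callable(subject, body)."""

    def __init__(self, send_mail, min_interval=3600, now=time.time):
        self.send_mail = send_mail
        self.min_interval = min_interval
        self.now = now                  # injectable clock for testing
        self._last_sent = {}            # job_id -> timestamp of last mail

    def notify(self, job_id, message):
        t = self.now()
        last = self._last_sent.get(job_id)
        if last is not None and t - last < self.min_interval:
            return False                # suppressed: mailed too recently
        self._last_sent[job_id] = t
        self.send_mail('Treeherder submission failing', message)
        return True
```

The retry loop would still run every treeherder-retry-wait seconds, but repeated failures of the same queued job would only generate mail once per interval.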
Attachment #8696118 - Attachment is obsolete: true
Attachment #8698466 - Flags: review?(bob)
Comment 7 • 10 years ago (Reporter)
Comment on attachment 8698466 [details] [diff] [review]
save treeherder jobs in db, retrieve and submit in a new thread
Review of attachment 8698466 [details] [diff] [review]:
-----------------------------------------------------------------
Note to self: Need to remove old jobs.sqlite in order for code to create the new table.
This looks pretty good and tests well. Would you mind addressing some of the comments and attaching a new patch? I'll test it and do a quick review asap.
Thanks!
::: autophone.py
@@ +134,5 @@
> self.pulse_monitor = None
> self.restart_workers = {}
> self.treeherder = AutophoneTreeherder(None,
> self.options,
> + self.jobs,
We need mailer=self.mailer here.
@@ +225,5 @@
> name='CmdTCPThread')
> self.server_thread.daemon = True
> self.server_thread.start()
> +
> + if self.treeherder:
self.treeherder is never None, so this will always be true.
I had previously hidden whether or not results were submitted to Treeherder inside of AutophoneTreeherder's submit_* methods which checked for the Treeherder url and/or the build's revision_hash.
It is kind of ugly, but I think the easiest approach here is to check for self.treeherder.url instead.
@@ +406,4 @@
> self.server.shutdown()
> if self.server_thread:
> self.server_thread.join()
> + if self.treeherder:
ditto self.treeherder.url ...
::: autophonetreeherder.py
@@ +35,4 @@
>
> class AutophoneTreeherder(object):
>
> + def __init__(self, worker_subprocess, options, jobs, s3_bucket=None, mailer=None,
break the line after s3_bucket and put mailer on the same line as shared_lock.
@@ +81,1 @@
> logger.debug('AutophoneTreeherder.post_request: %s' % job_collection.__dict__)
Should we log the attempts and last_attempt?
@@ +87,5 @@
> + try:
> + client.post_collection(project, job_collection)
> + return True
> + except Exception, e:
> + logger.exception('Error submitting request to Treeherder')
log attempts, last_attempt?
@@ +113,4 @@
> logger.debug('AutophoneTreeherder shared_lock.acquire')
> self.shared_lock.acquire()
> try:
> + self.jobs.new_treeherder_job(machine, project, job_collection.to_json())
See the comments in the jobs.py new_treeherder_job where I recommend we pass the job_collection as is to the new_treeherder_job method and have it call to_json() inside the method.
@@ +161,5 @@
> tj.add_state(TestState.PENDING)
> tj.add_submit_timestamp(t.submit_timestamp)
> # XXX need to send these until Bug 1066346 fixed.
> + tj.add_start_timestamp(0)
> + tj.add_end_timestamp(0)
Sweet! Works like a charm.
@@ +222,4 @@
> tj.add_submit_timestamp(t.submit_timestamp)
> tj.add_start_timestamp(t.start_timestamp)
> # XXX need to send these until Bug 1066346 fixed.
> + tj.add_end_timestamp(0)
ditto!
@@ +315,5 @@
>
> # Attach logs, ANRs, tombstones, etc.
>
> + logurl = None
> + logname = None
Ok, I see that these might be uninitialized when we are creating the text_log_summary if self.s3_bucket evaluates to False. But since we are initializing them here, we can remove the initializers inside the if self.s3_bucket just above the # UnitTest Log.
@@ +556,5 @@
> tjc.to_json())
>
> + self.queue_request(machine, project, tjc)
> +
> + def serve_forever(self):
I'm a bit concerned that Autophone will wait for the full time.sleep(self.retry_wait) before shutting down. Since the default is 300 seconds, we could wait 5 minutes or longer to complete shutdown. If we rewrote this to have only one call to time.sleep, we could do something like this (not tested):
def serve_forever(self):
    while not self.shutdown_requested:
        wait = True
        job = self.jobs.get_next_treeherder_job()
        if job:
            tjc = TreeherderJobCollection()
            for data in job['job_collection']:
                tj = TreeherderJob(data)
                tjc.add(tj)
            if self.post_request(job['machine'], job['project'], tjc,
                                 job['attempts'], job['last_attempt']):
                self.jobs.treeherder_job_completed(job['id'])
                wait = False
        if wait:
            for i in range(self.retry_wait):
                if self.shutdown_requested:
                    break
                time.sleep(1)
which would only delay shutdown by a second regardless of the value of self.retry_wait.
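As an aside, a threading.Event would make the wait interruptible in a single call rather than a one-second polling loop (an alternative sketch, not what the patch does):

```python
import threading


class Submitter:
    """Minimal sketch: an Event-based interruptible retry wait."""

    def __init__(self, retry_wait):
        self.retry_wait = retry_wait
        self._shutdown = threading.Event()

    def shutdown(self):
        # Wakes any in-progress wait() immediately.
        self._shutdown.set()

    def interruptible_wait(self):
        # Blocks up to retry_wait seconds; returns True if shutdown
        # was requested during (or before) the wait.
        return self._shutdown.wait(self.retry_wait)
```

With this, shutdown latency is effectively zero regardless of retry_wait, at the cost of introducing an Event alongside the existing shutdown_requested flag.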
::: jobs.py
@@ +370,4 @@
> self._commit_connection(conn)
> self._close_connection(conn)
>
> + def new_treeherder_job(self, machine, project, job_collection):
It is a little confusing that we are using the name job_collection here even though we are passing the json representation of the job collection.
In fact, I think it would be better if we passed the real job_collection to new_treeherder_job and did the to_json() call here instead of relying on the caller to do it. In addition this is more consistent with the way get_next_treeherder_job hides the fact that the job_collection is stored as a string representation of the json.
Attachment #8698466 - Flags: review?(bob)
Comment 8 • 10 years ago (Assignee)
Thanks for the careful review. I think this addresses all the issues you raised.
For me, this works just as before and shuts down quickly.
Attachment #8698466 - Attachment is obsolete: true
Attachment #8700082 - Flags: review?(bob)
Comment 9 • 10 years ago (Reporter)
Comment on attachment 8700082 [details] [diff] [review]
save treeherder jobs in db, retrieve and submit in a new thread, v2
Review of attachment 8700082 [details] [diff] [review]:
-----------------------------------------------------------------
::: autophonetreeherder.py
@@ +35,4 @@
>
> class AutophoneTreeherder(object):
>
> + def __init__(self, worker_subprocess, options, jobs, s3_bucket=None,
trailing whitespace.
@@ +77,5 @@
> d[attr] = getattr(self, attr)
> return '%s' % d
>
> + def post_request(self, machine, project, job_collection, attempts, last_attempt):
> + logger.debug('AutophoneTreeherder.post_request: %s, attempt=%d, last=%s' %
trailing whitespace.
Attachment #8700082 - Flags: review?(bob) → review+
Comment 10 • 10 years ago (Assignee)
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 11 • 10 years ago (Assignee)
I had to update the commit message:
https://github.com/mozilla/autophone/commit/e535d843a1c13d974f65c2f084b93dd09df31b5a
Updated • 4 years ago
Product: Testing → Testing Graveyard