All users were logged out of Bugzilla on October 13th, 2018

Robustly handle iterations in taskcluster services

RESOLVED FIXED

Status

RESOLVED FIXED
2 years ago
8 months ago

People

(Reporter: pmoore, Unassigned)

Tracking

Details

Attachments

(1 attachment)

(Reporter)

Description

2 years ago
Copied from https://bugzilla.mozilla.org/show_bug.cgi?id=1274929#c19:

author: jhford



So it seems like we have a common failure pattern in taskcluster services.  We have lots of timed iteration loops which all have roughly the same requirements.  If any of them get stuck, we get into really bad situations like this.  Right now, each tool has it's own iteration logic.  The actual iteration logic used everywhere really ought to be centralized so that we can share the fail-safes and make everything rugged.

What about making a node class which took the following options:

  maxIterations: 0, undef for infinite, positive integer for limited
  maxFailures: undef for 7, a positive integer that says how many back to back failing iterations before crashing
  maxIterationTime: positive integer of second stating the maximum amount of time an iteration can take before crashing
  minIterationTime: 0, undef for no min, positive integer for minimum amount of time an iteration can take, crash if less
  watchDog: positive integer of the number of seconds the watchdog should be set to
  waitTime: positive integer of seconds for how long to wait between iteration attempts
  waitTimeAfterFail: 0 to use waitTime, positive integer to use 
  handler: an async function that resolves when an iteration is good and rejects when bad
  dmsConfig: undef or object with a deadman's snitch to hit after each iteration completes

The class would have the methods:

  start() -- start the loop
  stop() -- stop the loop after the next iteration

The handler function would be called as:
    
  let result = await this.handler(watchDog);

Passing in the watchDog as a parameter instead of binding the function to an object is so we can pass methods from an object that can use the correct `this`.  This watch dog is an instance of https://github.com/taskcluster/aws-provisioner/blob/master/src/watchdog.js, which will abort the iteration if it hasn't been touched in the required number of seconds.  The reason we would pass it into the function instead of just doing it automatically is because some long lasting iterations will have repeating actions which can be used as a sign of non-freezing.  Example: provisioner loops can take up to 20ish minutes, but submit something every half second, it's only frozen if those half second things are happening, we should wait 20 minutes.  If they aren't happening at least once a minute, we should probably crash.

This watchdog and max iteration time system ensures that inside the process we try really really hard for things to not fail, but we'll fail as safely as possible.

The dmsConfig is used to configure an instance of a Deadman's Snitch (https://deadmanssnitch.com/) which is hit after each iteration.  This is an external service, so we'll be notified if the queue process is stuck without relying on the queue service itself to notify us.

I'd also like to provide a promise timeout function, which rejects a promise that takes too long to complete as a helper for writing robust services.

Doing these things for the provisioner has been extremely helpful to make it a lot more resiliant.  Does anyone disagree with using something like this?
(Reporter)

Updated

2 years ago
Flags: needinfo?(jopsen)
hmm... Might be useful... I think there might be a bit too many features here.
The loops in the queue where actually polling an azure queue, and each process had 4 such loops,
with little sleeping between iterations (sleep depended on whether the queue was empty).

So not sure it's easy to make one thing that runs them all.
If so we should probably focus on fewer features..
Flags: needinfo?(jopsen)
Agreed -- this might be a few composable tools.
Created attachment 8772914 [details]
repository

First review.  I haven't written a README.md file yet, but src/iterate.js has an explanation of the library.  Maybe I should move that there.

Thanks for looking at this.
Attachment #8772914 - Flags: review?(bstack)
Thoughts on the implementation of the library (I'm staying out of the debate over how much functionality this should have as I don't understand all of the ramifications):

1. https://github.com/jhford/taskcluster-lib-iterate/blob/master/src/watchdog.js#L7 is bringing back all of my childhood memories of http://static.tvtropes.org/pmwiki/pub/images/bomberman.png
2. I'm a big fan of the documentation in here. Definitely get a good bit of it in the README.md when that happens.
3. I assume the example.js file won't stay there once the real package is created?
4. Maybe a bit more clarification on what "exclusive" means here https://github.com/jhford/taskcluster-lib-iterate/blob/master/src/iterate.js#L44

5. More to come. Didn't finish looking this entirely over today.
(In reply to Brian Stack [:bstack] from comment #5)
> Thoughts on the implementation of the library (I'm staying out of the debate
> over how much functionality this should have as I don't understand all of
> the ramifications):

I'm interested in your overall opinions about whether there's too many things going on here as well.  A lot of the seemingly paranoid things in here are the direct result of my experience with the provisioner freeze-ups.  If there's something in particular that you think isn't needed, I'd love to know!

> 1.
> https://github.com/jhford/taskcluster-lib-iterate/blob/master/src/watchdog.
> js#L7 is bringing back all of my childhood memories of
> http://static.tvtropes.org/pmwiki/pub/images/bomberman.png

hehe :)

> 2. I'm a big fan of the documentation in here. Definitely get a good bit of
> it in the README.md when that happens.

Great.  I was worried that I was over-documenting, but then I wonder if that's even a possibility...

> 3. I assume the example.js file won't stay there once the real package is
> created?

I'm a fan of example code.  I guess I could put it in the readme, but it's also nice to be able to execute it directly.  Since these are such simple examples, I suspect they're OK to copy and paste.

> 4. Maybe a bit more clarification on what "exclusive" means here
> https://github.com/jhford/taskcluster-lib-iterate/blob/master/src/iterate.
> js#L44

I meant it in the sense that the wait time is excluded from the value that should be provided.  I can reword this to make it more clear.

> 5. More to come. Didn't finish looking this entirely over today.

great!
Brian, did you have any other concerns?  If not, I'm going to land this very soon.
Flags: needinfo?(bstack)
Nope, I think we're good to land it!
Flags: needinfo?(bstack)

Updated

2 years ago
Attachment #8772914 - Flags: review?(bstack) → review+

Updated

8 months ago
Status: NEW → RESOLVED
Last Resolved: 8 months ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.