Bug 1563307 (Open), opened 6 years ago, updated 5 years ago

generic-worker: config setting numberOfTasksToRun should not include superseded tasks

Categories: Taskcluster :: Workers (enhancement)
Priority: Not set; Severity: normal

Tracking: Not tracked

People: Reporter: bc; Assignee: Unassigned; Mentored

Currently when generic-worker is configured to run one task then exit, it will exit after consuming a superseded task. This is a particular problem with bitbar tasks since we have to do a lot of work to get a device and start a container before we can start generic-worker.

Since there has been no change in local state, there should be no problem with generic-worker continuing to consume tasks until it finds one that has not been superseded.

This would alleviate a major issue with android-hw/bitbar when we have large backlogs.

Summary: generic-worker should consume superseded tasks until it find one that is not → generic-worker: config setting numberOfTasksToRun should not include superseded tasks

This could be an option, or we could possibly create a different config setting for this particular use case.

Although for bitbar workers this could be an appropriate behaviour, it might not be universally desirable (some workers might wish to run some custom code between each and every resolved task). In this case, keeping existing behaviour for the current config setting may be better.

Another option would be that the worker outputs structured data (e.g. a json file) with a summary of the task runs it resolved, and the resolution statuses. This would allow the wrapper script that starts the worker to decide whether it wants to start it up again to claim another task.
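
Purely as an illustration of this option, here is a minimal Go sketch of such a summary file; the file name task-run-summary.json and all of the field names are hypothetical, not an existing generic-worker format.

```go
// Hypothetical sketch only: the file name and fields below are not an existing
// generic-worker format; they just illustrate the kind of summary a wrapper
// script could read to decide whether to start the worker again.
package main

import (
	"encoding/json"
	"os"
)

// taskResolution summarises one resolved task run.
type taskResolution struct {
	TaskID     string `json:"taskId"`
	RunID      int    `json:"runId"`
	Resolution string `json:"resolution"` // e.g. "completed", "superseded", "malformed-payload"
	Executed   bool   `json:"executed"`   // true only if the task's commands actually ran
}

func main() {
	summary := []taskResolution{
		{TaskID: "abc123", RunID: 0, Resolution: "superseded", Executed: false},
	}
	f, err := os.Create("task-run-summary.json") // hypothetical file name
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := json.NewEncoder(f).Encode(summary); err != nil {
		panic(err)
	}
}
```

With something like this, the wrapper could restart the worker whenever no entry has executed set to true.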

Also note that other resolution statuses, such as malformed-payload, likewise indicate that no task commands were executed, so they may be worth considering too.

See Also: → 1563377
See Also: → 1562988

I think the behavior suggested here should be the default. I can't think of a use case where we would want different behavior (all of the existing deployments would benefit from the new behavior), so I think we should wait for a use-case before implementing an option to behave otherwise.

(In reply to Tom Prince [:tomprince] from comment #2)

> I think the behavior suggested here should be the default. I can't think of a use case where we would want different behavior (all of the existing deployments would benefit from the new behavior), so I think we should wait for a use-case before implementing an option to behave otherwise.

OK, that is a fair point; I think I am happy to adapt the semantics.

If at some point there is a use case to exit the worker after resolving X tasks rather than running X tasks, we could introduce a config setting numberOfTasksToResolve for this purpose. It works in our favour that the current config setting is named numberOfTasksToRun.

I'll keep tasks-resolved-count.txt as it is (number of resolved tasks) but additionally create tasks-run-count.txt (number of tasks actually executed on the local worker).
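
To make the adapted semantics concrete, here is a rough sketch, not the actual generic-worker code; claimAndRunTask and its return values are stand-ins for illustration. The idea is that numberOfTasksToRun only counts tasks whose commands actually executed, while tasks-resolved-count.txt keeps counting every resolution.

```go
// Sketch only: not the real generic-worker main loop. claimAndRunTask is a
// stand-in for the claim/run cycle; in the real worker, "executed" would be
// false for resolutions such as superseded or malformed-payload where no
// task commands were run.
package main

import (
	"fmt"
	"os"
)

func claimAndRunTask() (executed bool, claimed bool) {
	// Stand-in implementation so the sketch terminates; the real worker
	// would claim a task from the queue and run (or supersede) it.
	return true, true
}

func main() {
	numberOfTasksToRun := 1 // existing config setting
	resolved, run := 0, 0

	for run < numberOfTasksToRun {
		executed, claimed := claimAndRunTask()
		if !claimed {
			continue // no task available; the real worker would poll/back off here
		}
		resolved++ // every resolution counts here, superseded or not
		if executed {
			run++ // only tasks that actually ran count against numberOfTasksToRun
		}
		// Both counters are written out so wrapper scripts can inspect them.
		os.WriteFile("tasks-resolved-count.txt", []byte(fmt.Sprintf("%d\n", resolved)), 0o644)
		os.WriteFile("tasks-run-count.txt", []byte(fmt.Sprintf("%d\n", run)), 0o644)
	}
}
```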

At some point in the future we could potentially create a structured log of task activity in a local file task-activity.log, where each line is a JSON object containing run information for each task run claim and each task run resolution. That might be best left for a separate bug; it isn't clear there would be demand for this information, so the additional development/maintenance cost of such a feature might not be worth it.
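
For illustration only, a minimal sketch of appending one JSON object per line to such a log; the event structure and field names are assumptions, not an agreed format.

```go
// Sketch only: the file name task-activity.log comes from the comment above,
// but the event structure and field names are assumptions for illustration.
package main

import (
	"encoding/json"
	"os"
	"time"
)

type taskEvent struct {
	Time       time.Time `json:"time"`
	Event      string    `json:"event"` // e.g. "claimed" or "resolved"
	TaskID     string    `json:"taskId"`
	RunID      int       `json:"runId"`
	Resolution string    `json:"resolution,omitempty"`
}

func appendEvent(ev taskEvent) error {
	f, err := os.OpenFile("task-activity.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	return json.NewEncoder(f).Encode(ev) // Encode writes one JSON object per line
}

func main() {
	_ = appendEvent(taskEvent{Time: time.Now(), Event: "resolved", TaskID: "abc123", Resolution: "superseded"})
}
```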

Ah, CI tests currently rely heavily on the semantics of numberOfTasksToRun. Typically, tests submit a task, run the worker, then check the results. Since tests also test superseding functionality, they need to be able to resolve a task as superseded and then return.

So it looks like I'll need to introduce numberOfTasksToResolve at the same time, at least for the test framework.

The dangers of changing semantics! :-)

I'm not sure if it is currently the case, but does generic-worker exit after processing a cancelled task? It would be good to handle cancelled tasks in a similar fashion to superseded tasks if we don't already.

Cancelled tasks can have been running for some time and may have polluted the environment, so the worker will still need to exit after resolving them.

I think it depends on whether the task was cancelled before or after the worker claimed it. If it was cancelled before the worker claimed it, I think we can count it as not having run.

Sorry, this doesn't make sense. A worker wouldn't claim a cancelled task! Ignore me!

Is there any hope of expediting this? We've deployed our threaded test run manager and have increased our utilization of devices as a result, but our current backlog is still over 6200 tasks and we aren't making any headway reducing it. I believe this new worker would go a long way to helping us catch up.

(In reply to Bob Clary [:bc:] from comment #8)

> Is there any hope of expediting this? We've deployed our threaded test run manager and have increased our utilization of devices as a result, but our current backlog is still over 6200 tasks and we aren't making any headway reducing it. I believe this new worker would go a long way to helping us catch up.

Unfortunately we have many competing priorities at the moment, many of them impacting important Mozilla projects.

The quickest options to expedite the situation would be:

  1. Run the worker in a loop, and grep the worker log for a line indicating the task was superseded. If such a line is found, start the worker again to claim another task rather than shutting down (see the sketch after this list). This seems relatively simple and should be possible in a few lines of e.g. a bash script, without needing a new worker release or assistance from the Taskcluster team.

  2. We do welcome PRs if you have a resource available to look into it.
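
For option 1, a minimal sketch of such a wrapper, written here in Go (the comment above suggests a bash script would do just as well). The command line and the "superseded" log marker are assumptions and should be checked against the actual worker log output before relying on them.

```go
// Sketch only: a wrapper that restarts generic-worker when the previous
// invocation only resolved a superseded task. The marker string and the
// exact command line are assumptions; check the real worker log output and
// generic-worker's help before using this.
package main

import (
	"bytes"
	"log"
	"os/exec"
	"strings"
)

func main() {
	for {
		cmd := exec.Command("generic-worker", "run", "--config", "generic-worker.config")
		var out bytes.Buffer
		cmd.Stdout = &out
		cmd.Stderr = &out
		err := cmd.Run()
		log.Printf("worker exited (err=%v)", err)

		// If the only thing the worker did was resolve a superseded task,
		// loop and claim another one instead of shutting the device down.
		if strings.Contains(out.String(), "superseded") { // assumed log marker
			continue
		}
		break
	}
}
```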

If those aren't doable, we can ask :coop to make a priority call.

Apologies, I should have mentioned that I wasn't planning to work on the implementation at the moment; from my earlier comments, it could have looked like I was about to start!

Mentor: pmoore
Keywords: good-first-bug

Update: We implemented Option #1 above for android-hw/bitbar workers.

After careful consideration, this might not be a good first bug after all.

Keywords: good-first-bug
QA Whiteboard: [lang=go]