generic-worker: config setting numberOfTasksToRun should not include superseded tasks
Categories
(Taskcluster :: Workers, enhancement)
Tracking
(Not tracked)
People
(Reporter: bc, Unassigned, Mentored)
References
Details
Currently when generic-worker is configured to run one task then exit, it will exit after consuming a superseded task. This is a particular problem with bitbar tasks since we have to do a lot of work to get a device and start a container before we can start generic-worker.
Since there has been no change in local state, there should be no problem with generic-worker continuing to consume tasks until it finds one that has not been superseded.
This would alleviate a major issue with andorid-hw/bitbar when we have large backlogs.
Updated•6 years ago
|
Comment 1•6 years ago
|
||
This could be an option, or we could possibly create a different config setting for this particular use case.
Although for bitbar workers this could be an appropriate behaviour, it might not be universally desirable (some workers might wish to run some custom code between each and every resolved task). In this case, keeping existing behaviour for the current config setting may be better.
Another option would be that the worker outputs structured data (e.g. a json file) with a summary of the task runs it resolved, and the resolution statuses. This would allow the wrapper script that starts the worker to decide whether it wants to start it up again to claim another task.
Also note, other resolution statuses such as malformed-payload also indicate that no task commands were executed, so they may be worth also considering.
Comment 2•6 years ago
|
||
I think the behavior suggested here should be the default. I can't think of a use case where we would want different behavior (all of the existing deployments would benefit from the current behavior), so I think we should wait for a use-case before implementing an option to behave otherwise.
Comment 3•6 years ago
|
||
(In reply to Tom Prince [:tomprince] from comment #2)
I think the behavior suggested here should be the default. I can't think of a use case where we would want different behavior (all of the existing deployments would benefit from the current behavior), so I think we should wait for a use-case before implementing an option to behave otherwise.
OK, that is a fair point, I think I am happy to adapt the semantics.
If at some point there is a use case to exit the worker after resolving X tasks rather than running X tasks, we could introduce a config setting numberOfTasksToResolve for this purpose. It works in our favour that the current config setting is named numberOfTasksToRun.
I'll keep tasks-resolved-count.txt as it is (number of resolved tasks) but additionally create tasks-run-count.txt (number of tasks actually executed on the local worker).
For some future time - we could potentially create a structured log of task activity in a local file task-activity.log, where each line is a json object, containing a run information object for each task run claim and each task run resolution. That might be best left for some other bug than this one, and it isn't clear there would be a demand for this information, so the additional development/maintenance cost of such a feature might not be worth it.
Comment 4•6 years ago
|
||
Ah, CI tests currently rely heavily on the semantics of numberOfTasksToRun. Typically, tests submit a task, run the worker, then check the results. Since tests also test superseding functionality, they need to be able to resolve a task as superseded and then return.
So it looks like I'll need to introduce numberOfTasksToResolve at the same time, at least for the test framework.
The dangers of changing semantics! :-)
| Reporter | ||
Comment 5•6 years ago
|
||
I'm sure if it is currently the case but does the generic worker exit after processing a cancelled task? It would be good to handle cancelled tasks in a similar fashion to the superseded tasks if we don't already.
Comment 6•6 years ago
|
||
Cancelled tasks can have run for some time, so have polluted the environment, so they will need to exit.
Comment 7•6 years ago
•
|
||
I think it depends on whether the task was cancelled before or after the worker claimed it. If it was cancelled before the worker claimed it, I think we can count it as not having run.
Sorry, this doesn't make sense. A worker wouldn't claim a cancelled task! Ignore me!
| Reporter | ||
Comment 8•6 years ago
|
||
Is there any hope of expediting this? We've deployed our threaded test run manager and have increased our utilization of devices as a result, but our current backlog is still over 6200 tasks and we aren't making any headway reducing it. I believe this new worker would go a long way to helping us catch up.
Comment 9•6 years ago
•
|
||
(In reply to Bob Clary [:bc:] from comment #8)
Is there any hope of expediting this? We've deployed our threaded test run manager and have increased our utilization of devices as a result, but our current backlog is still over 6200 tasks and we aren't making any headway reducing it. I believe this new worker would go a long way to helping us catch up.
Unfortunately we have many competing priorities at the moment, many of them impacting important Mozilla projects.
The quickest options to expedite the situation would be:
-
To run the worker in a loop, and grep for a log line in the worker log indicating the task had been superseded. If it finds one, run another task, rather than shutting down. This seems relatively simple and should be possible in a few lines of e.g. a bash script, without needing a new worker release, or assistance from the taskcluster team.
-
We do welcome PRs if you have a resource available to look into it.
If those aren't doable, we can ask :coop to make a priority call.
Apologies, I should have mentioned that I wasn't planning at the moment to work on the implementation, as from my earlier comments, it could have looked like I was about to start!
Updated•6 years ago
|
Comment 10•6 years ago
|
||
Update: We implemented Option #1 above for android-hw/bitbar workers.
Comment 11•6 years ago
|
||
After careful consideration, this might not be a good first bug after all.
Updated•5 years ago
|
Description
•