Bug 1563307 (Open), opened 6 years ago, updated 5 years ago

generic-worker: config setting numberOfTasksToRun should not include superseded tasks

Categories: Taskcluster :: Workers (enhancement)
Priority: Not set; Severity: normal

Tracking: Not tracked

People: Reporter: bc; Assignee: Unassigned; Mentored

Currently when generic-worker is configured to run one task then exit, it will exit after consuming a superseded task. This is a particular problem with bitbar tasks since we have to do a lot of work to get a device and start a container before we can start generic-worker.

Since there has been no change in local state, there should be no problem with generic-worker continuing to consume tasks until it finds one that has not been superseded.

This would alleviate a major issue with android-hw/bitbar when we have large backlogs.

Summary: generic-worker should consume superseded tasks until it find one that is not → generic-worker: config setting numberOfTasksToRun should not include superseded tasks

This could be an option, or we could possibly create a different config setting for this particular use case.

Although for bitbar workers this could be an appropriate behaviour, it might not be universally desirable (some workers might wish to run some custom code between each and every resolved task). In this case, keeping existing behaviour for the current config setting may be better.

Another option would be that the worker outputs structured data (e.g. a json file) with a summary of the task runs it resolved, and the resolution statuses. This would allow the wrapper script that starts the worker to decide whether it wants to start it up again to claim another task.
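
Purely as an illustration of this option, here is a minimal Go sketch of such a summary file; the file name task-run-summary.json and all of the field names are hypothetical, not an existing generic-worker format.

```go
// Hypothetical sketch only: the file name and fields below are not an existing
// generic-worker format; they just illustrate the kind of summary a wrapper
// script could read to decide whether to start the worker again.
package main

import (
	"encoding/json"
	"os"
)

// taskResolution summarises one resolved task run.
type taskResolution struct {
	TaskID     string `json:"taskId"`
	RunID      int    `json:"runId"`
	Resolution string `json:"resolution"` // e.g. "completed", "superseded", "malformed-payload"
	Executed   bool   `json:"executed"`   // true only if the task's commands actually ran
}

func main() {
	summary := []taskResolution{
		{TaskID: "abc123", RunID: 0, Resolution: "superseded", Executed: false},
	}
	f, err := os.Create("task-run-summary.json") // hypothetical file name
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := json.NewEncoder(f).Encode(summary); err != nil {
		panic(err)
	}
}
```

With something like this, the wrapper could restart the worker whenever no entry has executed set to true.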

Also note that other resolution statuses, such as malformed-payload, likewise indicate that no task commands were executed, so they may be worth considering too.

See Also: → 1563377
See Also: → 1562988

I think the behavior suggested here should be the default. I can't think of a use case where we would want different behavior (all of the existing deployments would benefit from the new behavior), so I think we should wait for a use-case before implementing an option to behave otherwise.

(In reply to Tom Prince [:tomprince] from comment #2)

> I think the behavior suggested here should be the default. I can't think of a use case where we would want different behavior (all of the existing deployments would benefit from the new behavior), so I think we should wait for a use-case before implementing an option to behave otherwise.

OK, that is a fair point; I think I am happy to adapt the semantics.

If at some point there is a use case to exit the worker after resolving X tasks rather than running X tasks, we could introduce a config setting numberOfTasksToResolve for this purpose. It works in our favour that the current config setting is named numberOfTasksToRun.

I'll keep tasks-resolved-count.txt as it is (number of resolved tasks) but additionally create tasks-run-count.txt (number of tasks actually executed on the local worker).
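
To make the adapted semantics concrete, here is a rough sketch, not the actual generic-worker code; claimAndRunTask and its return values are stand-ins for illustration. The idea is that numberOfTasksToRun only counts tasks whose commands actually executed, while tasks-resolved-count.txt keeps counting every resolution.

```go
// Sketch only: not the real generic-worker main loop. claimAndRunTask is a
// stand-in for the claim/run cycle; in the real worker, "executed" would be
// false for resolutions such as superseded or malformed-payload where no
// task commands were run.
package main

import (
	"fmt"
	"os"
)

func claimAndRunTask() (executed bool, claimed bool) {
	// Stand-in implementation so the sketch terminates; the real worker
	// would claim a task from the queue and run (or supersede) it.
	return true, true
}

func main() {
	numberOfTasksToRun := 1 // existing config setting
	resolved, run := 0, 0

	for run < numberOfTasksToRun {
		executed, claimed := claimAndRunTask()
		if !claimed {
			continue // no task available; the real worker would poll/back off here
		}
		resolved++ // every resolution counts here, superseded or not
		if executed {
			run++ // only tasks that actually ran count against numberOfTasksToRun
		}
		// Both counters are written out so wrapper scripts can inspect them.
		os.WriteFile("tasks-resolved-count.txt", []byte(fmt.Sprintf("%d\n", resolved)), 0o644)
		os.WriteFile("tasks-run-count.txt", []byte(fmt.Sprintf("%d\n", run)), 0o644)
	}
}
```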

At some point in the future we could potentially create a structured log of task activity in a local file task-activity.log, where each line is a JSON object containing run information for each task run claim and each task run resolution. That might be best left for a separate bug; it isn't clear there would be demand for this information, so the additional development/maintenance cost of such a feature might not be worth it.
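
For illustration only, a minimal sketch of appending one JSON object per line to such a log; the event structure and field names are assumptions, not an agreed format.

```go
// Sketch only: the file name task-activity.log comes from the comment above,
// but the event structure and field names are assumptions for illustration.
package main

import (
	"encoding/json"
	"os"
	"time"
)

type taskEvent struct {
	Time       time.Time `json:"time"`
	Event      string    `json:"event"` // e.g. "claimed" or "resolved"
	TaskID     string    `json:"taskId"`
	RunID      int       `json:"runId"`
	Resolution string    `json:"resolution,omitempty"`
}

func appendEvent(ev taskEvent) error {
	f, err := os.OpenFile("task-activity.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	return json.NewEncoder(f).Encode(ev) // Encode writes one JSON object per line
}

func main() {
	_ = appendEvent(taskEvent{Time: time.Now(), Event: "resolved", TaskID: "abc123", Resolution: "superseded"})
}
```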

Ah, CI tests currently rely heavily on the semantics of numberOfTasksToRun. Typically, tests submit a task, run the worker, then check the results. Since tests also test superseding functionality, they need to be able to resolve a task as superseded and then return.

So it looks like I'll need to introduce numberOfTasksToResolve at the same time, at least for the test framework.

The dangers of changing semantics! :-)

I'm not sure if it is currently the case, but does generic-worker exit after processing a cancelled task? It would be good to handle cancelled tasks in a similar fashion to superseded tasks if we don't already.

Cancelled tasks can have been running for some time and may have polluted the environment, so the worker will still need to exit after resolving them.

I think it depends on whether the task was cancelled before or after the worker claimed it. If it was cancelled before the worker claimed it, I think we can count it as not having run.

Sorry, this doesn't make sense. A worker wouldn't claim a cancelled task! Ignore me!

Is there any hope of expediting this? We've deployed our threaded test run manager and have increased our utilization of devices as a result, but our current backlog is still over 6200 tasks and we aren't making any headway reducing it. I believe this new worker would go a long way to helping us catch up.

(In reply to Bob Clary [:bc:] from comment #8)

> Is there any hope of expediting this? We've deployed our threaded test run manager and have increased our utilization of devices as a result, but our current backlog is still over 6200 tasks and we aren't making any headway reducing it. I believe this new worker would go a long way to helping us catch up.

Unfortunately we have many competing priorities at the moment, many of them impacting important Mozilla projects.

The quickest options to expedite the situation would be:

  1. Run the worker in a loop, and grep the worker log for a line indicating the task was superseded. If such a line is found, start the worker again to claim another task rather than shutting down (see the sketch after this list). This seems relatively simple and should be possible in a few lines of e.g. a bash script, without needing a new worker release or assistance from the Taskcluster team.

  2. We do welcome PRs if you have a resource available to look into it.
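
For option 1, a minimal sketch of such a wrapper, written here in Go (the comment above suggests a bash script would do just as well). The command line and the "superseded" log marker are assumptions and should be checked against the actual worker log output before relying on them.

```go
// Sketch only: a wrapper that restarts generic-worker when the previous
// invocation only resolved a superseded task. The marker string and the
// exact command line are assumptions; check the real worker log output and
// generic-worker's help before using this.
package main

import (
	"bytes"
	"log"
	"os/exec"
	"strings"
)

func main() {
	for {
		cmd := exec.Command("generic-worker", "run", "--config", "generic-worker.config")
		var out bytes.Buffer
		cmd.Stdout = &out
		cmd.Stderr = &out
		err := cmd.Run()
		log.Printf("worker exited (err=%v)", err)

		// If the only thing the worker did was resolve a superseded task,
		// loop and claim another one instead of shutting the device down.
		if strings.Contains(out.String(), "superseded") { // assumed log marker
			continue
		}
		break
	}
}
```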

If those aren't doable, we can ask :coop to make a priority call.

Apologies, I should have mentioned that I wasn't planning to work on the implementation at the moment; from my earlier comments, it could have looked like I was about to start!

Mentor: pmoore
Keywords: good-first-bug

Update: We implemented Option #1 above for android-hw/bitbar workers.

After careful consideration, this might not be a good first bug after all.

Keywords: good-first-bug
QA Whiteboard: [lang=go]