Open Bug 1680271 Opened 5 years ago Updated 3 years ago

frequent backlog for gecko-t-win10-64-1803-hw Windows worker pool

Categories

(Infrastructure & Operations :: RelOps: Windows OS, task)

Tracking

(Not tracked)

People

(Reporter: aryx, Unassigned)

References

Details

Yesterday and today, we observe backlog for Linux and Windows test machines owned by Mozilla (also used for performance tasks). The Linux backlog is gone but the Windows backlog persists and doesn't decrease much.

Is it possible to get a breakdown of the pending and processed tasks (e.g. counts per push and tree) and compare it to 2 weeks ago?

The Treeherder database doesn't store the worker type, and the BigQuery table taskclusteretl.derived_task_summary contains only completed tasks and is updated only once per day.
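For reference, the Taskcluster Queue API exposes a live count of pending tasks per worker pool, which can be polled to track a backlog like this one over time. A minimal sketch follows; the deployment root URL and the `releng-hardware` provisioner name for this pool are assumptions based on the firefox-ci setup:

```python
import json
import urllib.request

# Assumed Taskcluster deployment root; other instances use a different URL.
ROOT_URL = "https://firefox-ci-tc.services.mozilla.com"

def pending_url(provisioner_id, worker_type):
    """Build the Queue API endpoint that reports the pending-task count."""
    return f"{ROOT_URL}/api/queue/v1/pending/{provisioner_id}/{worker_type}"

def parse_pending(payload):
    """Extract the pending-task count from the API's JSON response body."""
    return payload["pendingTasks"]

def fetch_pending(provisioner_id, worker_type):
    """Fetch the current backlog size for one worker pool."""
    with urllib.request.urlopen(pending_url(provisioner_id, worker_type)) as resp:
        return parse_pending(json.load(resp))

if __name__ == "__main__":
    # The pool discussed in this bug (provisioner name is an assumption):
    print(fetch_pending("releng-hardware", "gecko-t-win10-64-1803-hw"))
```

Polling this endpoint on a schedule and logging the counts would give the per-pool trend that neither Treeherder nor the once-daily BigQuery table provides, though not the per-push/per-tree breakdown asked for above.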

At the moment, tasks on these workers from Try pushes made by developers in the morning will not be completed by end of day.

Flags: needinfo?(mcornmesser)

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #0)

Yesterday and today, we observe backlog for Linux and Windows test machines owned by Mozilla (also used for performance tasks). The Linux backlog is gone but the Windows backlog persists and doesn't decrease much.

Incidentally, I was planning to audit the Windows hardware workers this week. It is likely that a significant number of nodes have dropped off. I will start that audit today and see where we stand.

Is it possible to get a breakdown of the pending and processed tasks (e.g. counts per push and tree) and compare it to 2 weeks ago?

I am not aware of a way. We may want to ask the Taskcluster team whether this is possible, or whether it could be made possible in the future.

Flags: needinfo?(mcornmesser)

A significant number had dropped out of the pool. I am working on getting them back into the pool, as well as testing some theories about why so many dropped off. I will also keep a close eye on the pool for the next couple of days.

This pool should be in better shape now. There was an issue with the nodes being rebooted during deployments and restores. We have changed the time interval, which should prevent that from happening in the future.

See Also: → 1694954

Backlogged since 3am UTC today, and the queue is growing rather than decreasing - Mark, can you check the state of the pool, please?

Flags: needinfo?(mcornmesser)

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #4)

Backlogged since 3am UTC today, and the queue is growing rather than decreasing - Mark, can you check the state of the pool, please?

There does not appear to be a significant number of workers missing; most of the workers are either currently running a task or have just finished one. I will run some audit scripts tonight to pick up any down workers, but I don't expect that will return much more capacity.

Flags: needinfo?(mcornmesser)

Mark, the gecko-t-win10-64-1803-hw queue has been steadily increasing for the last 10 hours; could you please take a look?

Flags: needinfo?(mcornmesser)

Looking into it now.

Flags: needinfo?(mcornmesser)

I found the issue; it is related to Bug 1722015. When the package was updated, the casing of the file name changed. This caused the Puppet run to perma-fail. The filename has been corrected, and workers should start coming back online shortly.
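Case-only filename changes are easy to miss because some filesystems are case-insensitive while the tooling comparing or fetching the files is case-sensitive. A hypothetical sanity check like the one below can flag files whose on-disk casing differs from what a manifest expects; the function name and the example filenames are illustrative, not taken from the actual Puppet configuration:

```python
def find_case_mismatches(actual_names, expected_names):
    """Return (expected, actual) pairs for files that exist under a
    different casing than the manifest expects.

    actual_names:   filenames found on disk (e.g. from os.listdir()).
    expected_names: filenames the manifest/config refers to.
    """
    # Map the case-folded form of each on-disk name back to its real casing.
    by_fold = {name.casefold(): name for name in actual_names}
    mismatches = []
    for expected in expected_names:
        found = by_fold.get(expected.casefold())
        # Present under some casing, but not the exact one expected.
        if found is not None and found != expected:
            mismatches.append((expected, found))
    return mismatches
```

Run as part of deployment, a check like this would turn a perma-failing Puppet run into an immediate, readable error about which file's casing drifted.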

It looks like the queue is steadily dropping and the workers are continuing to pick up tasks. I will check back in on it in the morning (PDT).

See Also: → 1792745