frequent backlog for gecko-t-win10-64-1803-hw Windows worker pool
Categories
(Infrastructure & Operations :: RelOps: Windows OS, task)
Tracking
(Not tracked)
People
(Reporter: aryx, Unassigned)
Yesterday and today, we have observed a backlog for Linux and Windows test machines owned by Mozilla (also used for performance tasks). The Linux backlog is gone, but the Windows backlog persists and isn't decreasing much.
Is it possible to get a breakdown of the pending and processed tasks (e.g. counts per push and tree) and compare it with the situation 2 weeks ago?
The Treeherder database doesn't store the worker type, and the BigQuery table taskclusteretl.derived_task_summary only contains completed tasks and is updated only once per day.
At the moment, Try pushes made by developers in the morning won't have their tasks on these workers completed by the end of the day.
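Even with its once-a-day refresh, derived_task_summary could still support the completed-task half of that comparison. Below is a minimal sketch using the google-cloud-bigquery client; the column names (project, resolved, worker_type) are assumptions, since the table's actual schema isn't shown here.

from google.cloud import bigquery

# Sketch: compare completed-task counts per tree for this pool against
# two weeks earlier. The column names (project, resolved, worker_type)
# are assumptions about the derived_task_summary schema.
client = bigquery.Client()
query = """
SELECT project, DATE(resolved) AS day, COUNT(*) AS tasks
FROM `taskclusteretl.derived_task_summary`
WHERE worker_type = 'gecko-t-win10-64-1803-hw'
  AND DATE(resolved) IN (CURRENT_DATE(),
                         DATE_SUB(CURRENT_DATE(), INTERVAL 14 DAY))
GROUP BY project, day
ORDER BY project, day
"""
for row in client.query(query).result():
    print(f"{row.project} {row.day}: {row.tasks} tasks")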
Comment 1•5 years ago
(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #0)
> Yesterday and today, we have observed a backlog for Linux and Windows test machines owned by Mozilla (also used for performance tasks). The Linux backlog is gone, but the Windows backlog persists and isn't decreasing much.
Incidentally, I was planning to audit the Windows hardware workers this week. It is likely that a significant number of nodes have dropped off. I will start that audit today and see where we stand.
> Is it possible to get a breakdown of the pending and processed tasks (e.g. counts per push and tree) and compare it with the situation 2 weeks ago?
I am not aware of a way. We may want to talk to the Taskcluster team about whether this is possible, or whether it could be made possible in the future.
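One thing the queue already exposes is a live pending-task count per worker type. A minimal sketch over the public REST endpoint follows; the root URL (firefox-ci-tc.services.mozilla.com) and the provisioner ID (releng-hardware) are assumptions about this deployment.

import requests

# Sketch: read the live pending-task count for the pool. The root URL and
# the provisioner ID (releng-hardware) are assumptions about this deployment.
ROOT = "https://firefox-ci-tc.services.mozilla.com"
url = f"{ROOT}/api/queue/v1/pending/releng-hardware/gecko-t-win10-64-1803-hw"
resp = requests.get(url, timeout=30)
resp.raise_for_status()
print(resp.json()["pendingTasks"])

This only gives a point-in-time count rather than the per-push, per-tree breakdown asked for above, but polling it periodically would at least show whether the backlog is growing or draining.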
Comment 2•5 years ago
A significant number had dropped out of the pool. I am working on getting them back into the pool and testing some theories about why so many dropped off. I will also keep a close eye on the pool for the next couple of days.
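For an audit like this, one rough approach is to diff the workers the queue currently knows about against the expected hardware inventory. A minimal sketch, reusing the root URL and provisioner ID assumed above; the EXPECTED inventory and its naming scheme are hypothetical.

import requests

# Sketch: diff the workers the queue currently knows about against an
# expected inventory. EXPECTED and its naming scheme are hypothetical.
ROOT = "https://firefox-ci-tc.services.mozilla.com"
url = (f"{ROOT}/api/queue/v1/provisioners/releng-hardware"
       "/worker-types/gecko-t-win10-64-1803-hw/workers")
seen = set()
params = {}
while True:
    data = requests.get(url, params=params, timeout=30).json()
    seen.update(w["workerId"] for w in data["workers"])
    token = data.get("continuationToken")
    if not token:
        break
    params = {"continuationToken": token}

EXPECTED = {f"t-w1064-ms-{n:03d}" for n in range(1, 281)}  # hypothetical
print("missing from queue:", sorted(EXPECTED - seen))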
Comment 3•5 years ago
This pool should be in better shape now. There was an issue with the nodes being rebooted during deployments and restores. We have changed the time interval, which should hopefully prevent that from happening in the future.
Reporter
Comment 4•4 years ago
Backlogged since 3am UTC today, and the queue is growing rather than decreasing. Mark, can you check the state of the pool, please?
Comment 5•4 years ago
(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #4)
> Backlogged since 3am UTC today, and the queue is growing rather than decreasing. Mark, can you check the state of the pool, please?
It doesn't appear that a significant number of workers are missing; most of the workers are either currently running a task or have just finished one. I will run some audit scripts tonight to pick up any down workers, but I don't expect that will return much more capacity.
Comment 6•4 years ago
Mark, the gecko-t-win10-64-1803-hw queue has been steadily increasing for the last 10 hours. Could you please take a look?
Comment 8•4 years ago
I found the issue; it is related to Bug 1722015. When the package was updated, the casing of the file name was different, which caused the Puppet run to perma-fail. The filename has been corrected, and workers should start coming back online shortly.
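For context on why this bites: Windows filesystems are case-insensitive, so a plain existence check succeeds even when the on-disk casing differs from what a case-sensitive consumer (a checksum manifest, a Puppet source reference) expects. A minimal sketch of a casing check; the directory and filename are illustrative.

import os

# Sketch: detect a case-only filename mismatch. os.listdir returns the
# on-disk casing, so an exact string comparison catches a file that a
# case-insensitive exists() check would wave through.
def casing_matches(directory: str, expected_name: str) -> bool:
    return expected_name in os.listdir(directory)

# Illustrative use; the directory and package name are hypothetical.
print(casing_matches(r"C:\installers", "MozillaMaintenanceSetup.exe"))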
Comment 9•4 years ago
It looks like the queue is steadily dropping and the workers are continuing to pick up tasks. I will check back in on it in the morning (PDT).