Closed Bug 1752461 Opened 4 years ago Closed 11 months ago

Number of active comm-t/win11-64-2009 workers keeps dropping to zero or near-zero

Categories

(Release Engineering :: Firefox-CI Administration, defect)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: darktrojan, Assigned: jmoss)

References

Details

This could be a duplicate of bug 1735411, so feel free to dupe it. In this case it's only really been happening in the past few weeks so it may be a different problem.

At least once a day we're seeing a high number of pending Windows test jobs, while the list of workers shows only two workers actually doing anything. Sometimes there are zero workers doing anything and no pending tasks are getting cleared. After a number of hours we seem to get lots more workers and the tasks run.

It could be a matter of infrequent demand for workers. There are often periods of many hours where no pushes happen.

When did the Azure machines become the default for the jobs in comm? Any idea how much time passed between switching over to Azure and hitting this?

Flags: needinfo?(geoff)
Flags: needinfo?(geoff) → needinfo?(rob)

The timing definitely lines up with the switch to Azure, which landed on Jan 15. Previously we only had occasional issues with decision tasks.

Flags: needinfo?(rob)
Severity: -- → S3
Depends on: 1741946
See Also: → 1735411

Marking this as S3. This is on the roadmap to be fixed properly; if it starts getting out of hand, NI me and I'll see if I can prioritize this higher with the team.

We're currently seeing 3+ hour gaps between the build finishing and the tests starting. Not every push, but frequently. It's not the end of the world as we still have the other platforms, but it is annoying.

Also, comm-beta uses the same workers, so the push-to-release time there is longer.

(Actually, I've just been contradicted on the last point: we can do the release steps without waiting for test results. But that feels a bit wrong.)

I know this is an old bug and things may have changed somewhat, but this still happens. I've had a task waiting on try-comm-central for two hours while no workers were awake to take it. One eventually awoke (probably triggered by a build on comm-central or by another try run I started), but now there are 14 tasks waiting and one worker.

(In reply to Geoff Lankow (:darktrojan) from comment #7)

> I know this is an old bug and things may have changed somewhat, but this still happens. I've had a task waiting on try-comm-central for two hours while no workers were awake to take it. One eventually awoke (probably triggered by a build on comm-central or by another try run I started), but now there are 14 tasks waiting and one worker.

Which worker pool are you referring to? comm-t/win10-64-2004 or comm-t/win11-64-2009? We're getting rid of comm-t/win10-64-2004.

Flags: needinfo?(geoff)

Ah yes, I should've updated the bug title.

Flags: needinfo?(geoff)
Summary: Number of active comm-t win10-64-2004 workers keeps dropping to zero or near-zero → Number of active comm-t/win11-64-2009 workers keeps dropping to zero or near-zero
QA Contact: michelle
Assignee: nobody → jmoss
Status: NEW → RESOLVED
Closed: 11 months ago
Resolution: --- → FIXED