Closed Bug 1462110 Opened 7 years ago Closed 7 years ago

Windows tests are not running and are creating a backlog

Categories

(Taskcluster :: Operations and Service Requests, task)

task
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aciure, Unassigned)

Details

No description provided.
Severity: normal → blocker
Summary: Windows 7 Builds are not running and are creating a backlog → Windows Builds are not running and are creating a backlog
grenade is monitoring pending counts. It looks like we are having some trouble getting instances from EC2, although there are no recent errors (the most recent are from 21 hours ago). Most workerTypes -- not just Windows -- have big orange bars showing that there are lots of outstanding requests for machines, but we do seem to be getting machines. Backlogs are not high and are shrinking, so I think we can re-open the trees.
Assignee: nobody → rthijssen
trees are re-opened
on windows 7 & 10 we have a somewhat higher than normal (anecdotal evidence only) count of instances shutting down with this error:
> May 16 22:21:05 i-08eb9fed693cb2a53.gecko-t-win7-32.usw2.mozilla.com generic-worker: 2018/05/16 19:21:04 *********** PANIC occurred! ***********
> May 16 22:21:05 i-08eb9fed693cb2a53.gecko-t-win7-32.usw2.mozilla.com generic-worker: 2018/05/16 19:21:04 WORKER EXCEPTION due to response code 401 from Queue when uploading artifact &main.RedirectArtifact{BaseArtifact:(*main.BaseArtifact)(0x13d8e8a0), URL:"https://queue.taskcluster.net/v1/task/AK7QdzOOSx-Wvwt-XRlGDQ/runs/2/artifacts/public/logs/live_backing.log"} with CreateArtifact payload {"contentType":"text/plain; charset=utf-8","expires":"2018-05-30T11:03:41.873Z","storageType":"reference","url":"https://queue.taskcluster.net/v1/task/AK7QdzOOSx-Wvwt-XRlGDQ/runs/2/artifacts/public/logs/live_backing.log"}
> May 16 22:21:05 i-08eb9fed693cb2a53.gecko-t-win7-32.usw2.mozilla.com generic-worker: 2018/05/16 19:21:04 Exiting worker with exit code 69
> May 16 22:21:05 i-08eb9fed693cb2a53.gecko-t-win7-32.usw2.mozilla.com generic-worker: 2018/05/16 19:21:04 Immediate shutdown being issued...
> May 16 22:21:05 i-08eb9fed693cb2a53.gecko-t-win7-32.usw2.mozilla.com generic-worker: 2018/05/16 19:21:04 generic-worker internal error
> May 16 22:21:05 i-08eb9fed693cb2a53.gecko-t-win7-32.usw2.mozilla.com USER32: The process C:\Windows\System32\shutdown.exe (I-08EB9FED693CB) has initiated the shutdown of computer I-08EB9FED693CB on behalf of user NT AUTHORITY\SYSTEM for the following reason: No title for this reason could be found Reason Code: 0x800000ff Shutdown Type: shutdown Comment: generic-worker internal error
see all instances in this state at: https://papertrailapp.com/searches/39634711
could be a clock sync issue or some other problem talking to the queue. gw is shutting down the instance when this happens which is the expected and intended behaviour.
i have set an alert to monitor the count of instances in this state.
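For illustration only, the kind of count such an alert tracks could be sketched in Python as below. This is not the actual alert configuration. It assumes log lines shaped like the excerpt quoted above, with hostnames of the form <instance-id>.<workerType>.<region>.mozilla.com; the function name count_401_instances and the regular expression are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: hostnames look like <instance-id>.<workerType>.<region>.mozilla.com,
# as in the excerpt quoted in this bug. The pattern is illustrative, not official.
LINE_RE = re.compile(
    r"(?P<instance>i-[0-9a-f]+)\.(?P<worker_type>[A-Za-z0-9-]+)\.\S+ "
    r"generic-worker: .*WORKER EXCEPTION due to response code 401"
)


def count_401_instances(lines):
    """Return {workerType: number of distinct instances} that logged the 401 exception."""
    instances_by_type = defaultdict(set)
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            # A set keeps one entry per instance even if the line repeats.
            instances_by_type[match.group("worker_type")].add(match.group("instance"))
    return {worker_type: len(ids) for worker_type, ids in instances_by_type.items()}
```

Counting distinct instance ids rather than raw lines keeps a single noisy instance from inflating the number.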
i've checked the logs and see no errors around the ntp utc time sync, so the problem must be elsewhere. pete: can you investigate why the gw/queue 401 exception is occurring?
Assignee: rthijssen → nobody
Flags: needinfo?(pmoore)
issue with workers shutting down is only evident on testers (gecko-t-win7-32[-gpu], gecko-t-win10-64[-gpu]), not builders. pending counts for test worker types and gecko-3-b-win2012 were around the 100 mark which for testers does not seem high enough (to me) to warrant a tree closure. for builders that is somewhat high but would need to persist for more than 30 minutes before i would be concerned. it takes ~15 minutes to spin up windows builders so if the number of build task requests spiked by about 100 jobs and then 30 minutes later, the pending counts were starting to drop, i would call that normal behaviour and again no reason for a tree closure. if pending counts continued to rise or stay high for more than 30 minutes, then there might be an issue.
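The rule of thumb above (a backlog only matters if it stays high or keeps rising for more than 30 minutes) can be written down as a rough Python sketch. The function name backlog_is_concerning, the sample format, and the default threshold of 100 are assumptions made for illustration; the comment above notes that around 100 pending is not necessarily high for testers, so the threshold would need tuning per workerType.

```python
def backlog_is_concerning(samples, high=100, window_minutes=30):
    """Decide whether a pending backlog looks abnormal.

    samples: chronological list of (minute, pending_count) tuples.
    high and window_minutes are illustrative defaults, not official limits.
    """
    if not samples:
        return False
    latest_minute, latest_pending = samples[-1]
    if latest_pending < high:
        return False  # backlog is not high right now

    # Walk backwards to find when the current run of high counts began.
    start_minute = latest_minute
    peak = latest_pending
    for minute, pending in reversed(samples):
        if pending < high:
            break
        start_minute = minute
        peak = max(peak, pending)

    if latest_minute - start_minute <= window_minutes:
        return False  # a spike this recent may just be instances spinning up

    # High for longer than the window: concerning unless it has started to drop.
    return latest_pending >= peak
```

Treating any value below the peak as "starting to drop" is a simplification; a real check would probably smooth over noisy samples.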
Summary: Windows Builds are not running and are creating a backlog → Windows tests are not running and are creating a backlog
I suspect this was due to bug 1458873 which was rolled out to production AWS Windows testers in: https://bugzilla.mozilla.org/show_bug.cgi?id=1461901#c5
Flags: needinfo?(pmoore)
Sounds like this was solved and just not updated.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Operations → Operations and Service Requests