Closed
Bug 1462110
Opened 7 years ago
Closed 7 years ago
Windows tests are not running and are creating a backlog
Categories
(Taskcluster :: Operations and Service Requests, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: aciure, Unassigned)
Details
No description provided.
Comment 1•7 years ago
Trees are closed for this.
E.g. https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=5d2a29dd0e399877f528acee8282db6b5606f399&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable&filter-resultStatus=pending&filter-resultStatus=running just got its first running Windows builds and still has some pending.
Shows backlogs for more workers (e.g. Linux testers).
Severity: normal → blocker
Summary: Windows 7 Builds are not running and are creating a backlog → Windows Builds are not running and are creating a backlog
Comment 2•7 years ago
grenade is monitoring pending.
It looks like we are having some trouble getting instances from EC2, although there are no recent errors (the most recent are from 21 hours ago). Most workerTypes -- not just Windows -- have big orange bars showing that there are lots of outstanding requests for machines.
But we seem to be getting machines, and the backlogs are not high and are shrinking.
So, I think we can re-open.
Assignee: nobody → rthijssen
Comment 3•7 years ago
trees are re-opened
Comment 4•7 years ago
on windows 7 & 10 we have a somewhat higher than normal (anecdotal evidence only) count of instances shutting down with this error:
> May 16 22:21:05 i-08eb9fed693cb2a53.gecko-t-win7-32.usw2.mozilla.com generic-worker: 2018/05/16 19:21:04 *********** PANIC occurred! ***********
> May 16 22:21:05 i-08eb9fed693cb2a53.gecko-t-win7-32.usw2.mozilla.com generic-worker: 2018/05/16 19:21:04 WORKER EXCEPTION due to response code 401 from Queue when uploading artifact &main.RedirectArtifact{BaseArtifact:(*main.BaseArtifact)(0x13d8e8a0), URL:"https://queue.taskcluster.net/v1/task/AK7QdzOOSx-Wvwt-XRlGDQ/runs/2/artifacts/public/logs/live_backing.log"} with CreateArtifact payload {"contentType":"text/plain; charset=utf-8","expires":"2018-05-30T11:03:41.873Z","storageType":"reference","url":"https://queue.taskcluster.net/v1/task/AK7QdzOOSx-Wvwt-XRlGDQ/runs/2/artifacts/public/logs/live_backing.log"}
> May 16 22:21:05 i-08eb9fed693cb2a53.gecko-t-win7-32.usw2.mozilla.com generic-worker: 2018/05/16 19:21:04 Exiting worker with exit code 69
> May 16 22:21:05 i-08eb9fed693cb2a53.gecko-t-win7-32.usw2.mozilla.com generic-worker: 2018/05/16 19:21:04 Immediate shutdown being issued...
> May 16 22:21:05 i-08eb9fed693cb2a53.gecko-t-win7-32.usw2.mozilla.com generic-worker: 2018/05/16 19:21:04 generic-worker internal error
> May 16 22:21:05 i-08eb9fed693cb2a53.gecko-t-win7-32.usw2.mozilla.com USER32: The process C:\Windows\System32\shutdown.exe (I-08EB9FED693CB) has initiated the shutdown of computer I-08EB9FED693CB on behalf of user NT AUTHORITY\SYSTEM for the following reason: No title for this reason could be found Reason Code: 0x800000ff Shutdown Type: shutdown Comment: generic-worker internal error
see all instances in this state at: https://papertrailapp.com/searches/39634711
this could be a clock sync issue or some other problem talking to the queue. generic-worker (gw) shuts down the instance when this happens, which is the expected and intended behaviour. i have set an alert to monitor the count of instances in this state.
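As a rough illustration of the kind of count such an alert tracks, the sketch below tallies distinct instances per worker type that logged the 401 panic. It assumes log lines shaped like the excerpt above; the function name `count_401_shutdowns` and the regular expression are hypothetical and are not part of generic-worker or Papertrail.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpt above:
# "<Mon> <day> <time> <instance-id>.<workerType>.<region>.mozilla.com generic-worker: <message>"
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+"
    r"(?P<instance>i-[0-9a-f]+)\.(?P<worker_type>[\w-]+)\.\S+\s+generic-worker:\s+"
    r"(?P<message>.*)$"
)

# Text that identifies the failure discussed in this comment.
MARKER = "WORKER EXCEPTION due to response code 401 from Queue"


def count_401_shutdowns(lines):
    """Count distinct instances per worker type that logged the 401 panic."""
    seen = set()
    for line in lines:
        # Tolerate the "> " quote prefix used when logs are pasted into a comment.
        match = LINE_RE.match(line.lstrip("> ").strip())
        if match and MARKER in match.group("message"):
            seen.add((match.group("worker_type"), match.group("instance")))
    return Counter(worker_type for worker_type, _ in seen)
```

Counting distinct instances rather than raw lines avoids inflating the number when the same host appears more than once in the search results.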
Comment 5•7 years ago
i've checked the logs and see no errors around the ntp utc time sync, so the problem must be elsewhere.
pete: can you investigate why the gw/queue 401 exception is occurring?
Assignee: rthijssen → nobody
Flags: needinfo?(pmoore)
Comment 6•7 years ago
issue with workers shutting down is only evident on testers (gecko-t-win7-32[-gpu], gecko-t-win10-64[-gpu]), not builders.
pending counts for test worker types and gecko-3-b-win2012 were around the 100 mark, which for testers does not seem high enough (to me) to warrant a tree closure. for builders that is somewhat high, but it would need to persist for more than 30 minutes before i would be concerned. it takes ~15 minutes to spin up windows builders, so if the number of build task requests spiked by about 100 jobs and the pending counts were starting to drop 30 minutes later, i would call that normal behaviour and again no reason for a tree closure. if pending counts continued to rise or stayed high for more than 30 minutes, then there might be an issue.
Summary: Windows Builds are not running and are creating a backlog → Windows tests are not running and are creating a backlog
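The rule of thumb in comment 6 could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `backlog_needs_attention`, the sample format, and the default values (a high mark of 100 pending tasks, a 30-minute window) are hypothetical readings of the comment, not an existing Taskcluster check, and "starting to drop" is approximated by comparing the last two samples.

```python
def backlog_needs_attention(samples, high_mark=100, window_minutes=30):
    """Decide whether a pending backlog looks abnormal.

    samples: list of (minute, pending_count) tuples, oldest first.
    Returns True only if the count has stayed at or above high_mark for
    longer than window_minutes and is not yet dropping.
    """
    if not samples:
        return False

    # Walk backwards to find where the current uninterrupted high run began.
    run_start = None
    for minute, pending in reversed(samples):
        if pending < high_mark:
            break
        run_start = minute
    if run_start is None:
        return False  # the latest sample is already below the high mark

    latest_minute, latest_pending = samples[-1]
    if latest_minute - run_start <= window_minutes:
        return False  # still within the time new instances need to spin up

    # High for longer than the window: a concern unless the count is falling.
    previous_pending = samples[-2][1]
    return latest_pending >= previous_pending
```

For example, a spike to 120 pending tasks that falls to 110 after 35 minutes would not be flagged, while one that keeps climbing over the same period would.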
Comment 7•7 years ago
I suspect this was due to bug 1458873 which was rolled out to production AWS Windows testers in:
https://bugzilla.mozilla.org/show_bug.cgi?id=1461901#c5
Flags: needinfo?(pmoore)
Comment 8•7 years ago
Sounds like this was solved and just not updated.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Updated•6 years ago
Component: Operations → Operations and Service Requests