Bug 1407409 (Closed) - Opened 7 years ago - Closed 7 years ago

Growing pending backlog for gecko-t-win7-32-gpu

Categories: Taskcluster :: Workers (enhancement)
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED

People

(Reporter: garndt, Unassigned)

References

Details

There has been a growing backlog, and the number of machines running has been fluctuating (often decreasing).

In the papertrail logs [1], it seems that many machines are being shut down with the reason:
The process C:\windows\system32\shutdown.exe (I-0B0CCEBABFE1E) has initiated the shutdown of computer I-0B0CCEBABFE1E on behalf of user NT AUTHORITY\SYSTEM for the following reason: Application: Maintenance (Planned)   Reason Code: 0x80040001   Shutdown Type: shutdown   Comment: HaltOnIdle :: instance failed validity checks

[1] https://papertrailapp.com/groups/853883/events?q=gecko-t-win7-32-gpu%20%22shutdown%20type%22%20-restart
There are almost 500 instances now.  About the time this bug was filed, I raised the price for the instance from $0.90 to $1.10/hr.

Is it possible that this was a spot-kill?
As far as I'm aware this is the machine shutting itself down, and I don't think OCC or generic-worker responds to spot termination events.  I might be mistaken.

It seems that generic-worker is exiting with an exit code other than 0 or 67, which causes the format-and-reboot script to exit silently.  A minute or so later HaltOnIdle runs, sees that generic-worker isn't running, and shuts the machine down because it considers the instance invalid at that point.
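
For illustration, here is a minimal sketch (in Go, not the actual OCC script) of the launch-and-reboot logic described above. The paths, commands, and the meaning of exit codes 0 and 67 are assumptions:

// Hypothetical sketch of the launch/reboot wrapper described above (not the
// actual OCC script): run generic-worker, then act on its exit code.
package main

import (
	"log"
	"os/exec"
)

func main() {
	cmd := exec.Command(`C:\generic-worker\generic-worker.exe`, "run",
		"--config", `C:\generic-worker\generic-worker.config`)
	err := cmd.Run()

	exitCode := 0
	if exitErr, ok := err.(*exec.ExitError); ok {
		exitCode = exitErr.ExitCode()
	} else if err != nil {
		log.Fatalf("could not start generic-worker: %v", err)
	}

	switch exitCode {
	case 0:
		// Clean exit: reboot so the next task starts on a fresh session.
		reboot()
	case 67:
		// 67 is assumed here to mean "format the cache drive, then reboot".
		formatCacheDrive()
		reboot()
	default:
		// Any other exit code falls through silently in the current script,
		// leaving no worker running; HaltOnIdle then detects this and halts.
		log.Printf("unexpected exit code %d, doing nothing", exitCode)
	}
}

func reboot()           { _ = exec.Command("shutdown", "/r", "/t", "0").Run() }
func formatCacheDrive() { /* placeholder for the format step */ }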

Pete, Rob, any ideas?
Flags: needinfo?(rthijssen)
Flags: needinfo?(pmoore)
HaltOnIdle's "instance failed validity checks" message and the subsequent termination is logged when the following conditions are true:

1: the instance is not a loaner
2: generic-worker is not running
3: occ is not running
4: there is no active rdp session
5: no drive formatting is in progress

HaltOnIdle runs every 2 minutes; if all of the conditions above are true, it terminates the machine, since at that point it is basically just spending money being alive and doing nothing. A quick look at the logs preceding the message usually gives an indication of what has gone wrong, in the form of a crash notification from generic-worker, but in the logs I checked today there's no mention of a crash. There are some routine messages from generic-worker, followed by two minutes of silence, and then the "instance failed validity checks" message from HaltOnIdle.

To me this indicates that generic-worker has stopped or crashed without enough time to log an exception or stack trace. HaltOnIdle is correct to terminate the instance at this point; if it didn't, we would have lots of idle instances doing nothing.
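
For reference, a rough sketch (in Go) of the kind of check HaltOnIdle is described as performing every two minutes. This is not the real OCC implementation; the loaner, RDP, and format helpers are hypothetical placeholders, and powershell.exe is only a stand-in for "OCC is running":

// Rough sketch of the HaltOnIdle validity check described above.
package main

import (
	"bytes"
	"log"
	"os/exec"
)

// processRunning reports whether a process with the given image name exists.
// tasklist prints "INFO: No tasks are running..." when nothing matches.
func processRunning(image string) bool {
	out, err := exec.Command("tasklist", "/FI", "IMAGENAME eq "+image).Output()
	return err == nil && !bytes.Contains(out, []byte("No tasks"))
}

func isLoaner() bool         { return false } // placeholder
func activeRDPSession() bool { return false } // placeholder
func formatInProgress() bool { return false } // placeholder

func main() {
	busy := isLoaner() ||
		processRunning("generic-worker.exe") ||
		processRunning("powershell.exe") || // stand-in for the OCC check
		activeRDPSession() ||
		formatInProgress()

	if !busy {
		// None of the "busy" indicators are present: halt the instance.
		log.Println("HaltOnIdle :: instance failed validity checks")
		_ = exec.Command("shutdown", "/s", "/t", "0", "/c",
			"HaltOnIdle :: instance failed validity checks").Run()
	}
}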

We'll need to investigate further and hopefully find an instance that has managed to log something useful in between the routine messages from g-w and the termination.
Flags: needinfo?(rthijssen)
It is worth noting that of the 10 or so worker types we have running generic-worker, this only appears to happen on gecko-t-win7-32-gpu instances, which leads me to suspect the problem may not lie in generic-worker (e.g. win7 non-gpu runs exactly the same version of the worker, and win10 gpu and win10 non-gpu run an almost identical version).

We could probably upgrade win7 and win7 gpu to generic-worker 8.3.0 without problems (they currently run 8.2.0). I suspect it won't solve this problem, but it would at least align the testers on the same worker version until bug 1399401 lands.

The only behavioral change between 8.2.0 and 8.3.0 is bug 1347956 (gzipping more artifacts) [1].

But like I say, I don't think this will fix the underlying problem.

One anomaly I see with win7 gpu is that we aren't installing fakemon [2] (like we do on win10 gpu [3]). I'll create a separate bug to get that done. It may not be related at all - I don't know enough about fakemon.

Q, can you confirm we should also be installing it on win7 gpu workers?

----

[1] https://github.com/taskcluster/generic-worker/compare/v8.2.0...v8.3.0
[2] https://wiki.mozilla.org/ReleaseEngineering/Buildduty/Slave_Management#Fix_2nd_monitor
[3] https://github.com/mozilla-releng/OpenCloudConfig/commit/fbae1cb0ca13fef35dc724e0e539582ae6f5dcba
Flags: needinfo?(pmoore) → needinfo?(q)
It was later discovered that this also affected non-gpu instances.  Here is an updated papertrail search:
https://papertrailapp.com/groups/853883/events?q=gecko-t-win7%20%22shutdown%20type%22%20-restart%20-beta%20-gpu%20validity&focus=855085508085338135
Looking at one system at random (https://papertrailapp.com/systems/1223479972/events):

It looks as though gw has not started at all on this instance.

In the logs we see that occ does its check to make sure gw is installed, then sets the readystate flag and waits for gw to start. When gw doesn't start, it reboots the machine to see if an autologin will force gw to start. When that fails, it shuts down.

Since gw does start at least some of the time (evidenced by the instances that are taking jobs), I'm guessing there must be some condition that sometimes causes gw to fail to start. Maybe some call to the queue fails, or something similar?
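
A hedged sketch (in Go) of the startup sequence described in this comment: check the worker is installed, set the ready-state flag, wait for gw to start, attempt one reboot, then shut down. The file paths, timeouts, and reboot marker are all hypothetical:

// Hypothetical sketch of the occ-side startup/recovery flow described above.
package main

import (
	"os"
	"os/exec"
	"time"
)

const rebootMarker = `C:\dsc\reboot-attempted.flag` // hypothetical marker file

func main() {
	if _, err := os.Stat(`C:\generic-worker\generic-worker.exe`); err != nil {
		// Worker not installed; the real occ would (re)install it here.
	}

	// Signal "ready" (a flag file here; the real mechanism may differ).
	_ = os.WriteFile(`C:\dsc\ready.flag`, []byte("ready"), 0644)

	// Wait a while for generic-worker to appear.
	for i := 0; i < 30; i++ {
		if workerRunning() {
			return // worker started; nothing else to do
		}
		time.Sleep(10 * time.Second)
	}

	if _, err := os.Stat(rebootMarker); err != nil {
		// First failure: reboot and rely on autologon to start the worker.
		_ = os.WriteFile(rebootMarker, nil, 0644)
		_ = exec.Command("shutdown", "/r", "/t", "0").Run()
		return
	}

	// Already rebooted once and the worker still never started: give up.
	_ = exec.Command("shutdown", "/s", "/t", "0").Run()
}

func workerRunning() bool { return false } // see the tasklist-based check in the earlier sketch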
Since deploying new AMIs for an unrelated issue (Y: drive (cache) not mounted) several hours ago, the issue seems to have been resolved. I've not seen new validity check failures in the logs since that AMI went live.

To be honest, I don't think the Y: drive issue was the problem, since that problem has existed for several weeks and this validity check failure seems newer. But it seems that the new AMIs don't suffer from this problem, or at least aren't exhibiting the log messages we saw earlier - not yet, anyway. I will monitor again tomorrow morning to make sure.
Thanks Rob! It seems the problems with win7 stopped around 3.5 hours ago.  I am still seeing it for win10 gpu instances.  Perhaps those have a different cause?
See Also: → 1398748
Appears resolved.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Flags: needinfo?(q)
It's Saturday, and there are 877 pending jobs (instances are at maximum capacity).
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Closing, this time it's gecko-t-win10-64-gpu.
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
max capacity is set to 256 and there were 256 running instances.  247 of those instances were failing status checks over the last 24 hours.  The 9 newer machines from today seem to be behaving ok.  Killed those rogue machines and new ones should spawn.  No diagnostic logging that I can find other than AWS saying the instances became unreachable.
This backlog quickly recovered itself once these failed machines were out of the pool.  Trees were reopened.
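
For illustration, a hedged sketch (using the aws-sdk-go v1 EC2 client in Go) of how instances failing EC2 status checks could be found and terminated, as described above. Filtering down to the specific worker type and result pagination are omitted for brevity:

// Sketch: find instances reporting "impaired" status checks and terminate them.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	svc := ec2.New(session.Must(session.NewSession(&aws.Config{
		Region: aws.String("us-east-1"),
	})))

	out, err := svc.DescribeInstanceStatus(&ec2.DescribeInstanceStatusInput{
		IncludeAllInstances: aws.Bool(true),
	})
	if err != nil {
		log.Fatal(err)
	}

	var impaired []*string
	for _, s := range out.InstanceStatuses {
		// Instances failing their status checks report "impaired" for the
		// instance and/or system status summary.
		if aws.StringValue(s.InstanceStatus.Status) == "impaired" ||
			aws.StringValue(s.SystemStatus.Status) == "impaired" {
			impaired = append(impaired, s.InstanceId)
		}
	}

	if len(impaired) == 0 {
		fmt.Println("no impaired instances found")
		return
	}

	// Terminate them so the provisioner can spawn healthy replacements.
	if _, err := svc.TerminateInstances(&ec2.TerminateInstancesInput{
		InstanceIds: impaired,
	}); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("terminated %d impaired instances\n", len(impaired))
}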
Ok, reopening this one because we also talked about the win10 instances here.

Directly after Greg killed the broken machines, Windows 10 gl jobs failed with an exception, see e.g. https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=d71e8e0053d8043bc9deb98b35ca5220a0c9adea&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception . That got pushed after the instances were killed, so it shouldn't be an effect of the termination.

And now the issue is back: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=60d3df91ae2f1e724072d234e83b8b82edfc91cb&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=pending&filter-resultStatus=running still has the Windows 10 gl & gpu jobs pending 7 hours after the push. There are 259 instances running and 243 pending.
Status: RESOLVED → REOPENED
Flags: needinfo?(garndt)
Resolution: FIXED → ---
(In reply to Sebastian Hengst [:aryx][:archaeopteryx] (needinfo on intermittent or backout) from comment #14)

> And now the issue is back,
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-
> inbound&revision=60d3df91ae2f1e724072d234e83b8b82edfc91cb&filter-
> resultStatus=retry&filter-resultStatus=usercancel&filter-
> resultStatus=testfailed&filter-resultStatus=busted&filter-
> resultStatus=exception&filter-resultStatus=pending&filter-
> resultStatus=running still has the Windows 10 gl & gpu jobs pending 7 hours
> after push. There are 259 instances running and 243 pending.

The pending counts for that worker type have gone down since comment 14.  However, there are 253 win10 gpu instances in us-east-1 that are failing all status checks (mostly machines started around 2 hours ago).  No other instance types are failing status checks (not even the win7 gpu instances), so I wonder if there is something different about the win10 gpu instances that is causing them to crash.

I have killed all the machines, and have a list of IDs that I can provide for ones I killed.

Looking at one of them [1], HaltOnIdle was reporting the machine as productive, but then all output stopped.  Rob, I am at a loss as to what might be going wrong with these Windows machines.


[1] https://papertrailapp.com/systems/1231448412/events
Flags: needinfo?(garndt) → needinfo?(rthijssen)
(In reply to Sebastian Hengst [:aryx][:archaeopteryx] (needinfo on intermittent or backout) from comment #14)
> Ok, reopening this one because we also talked about the win10 instances here.
> 
> Directly after Greg killed the broken machines, Windows 10 gl jobs failed
> with an exception, see e.g.
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-
> inbound&revision=d71e8e0053d8043bc9deb98b35ca5220a0c9adea&filter-
> resultStatus=retry&filter-resultStatus=usercancel&filter-
> resultStatus=testfailed&filter-resultStatus=busted&filter-
> resultStatus=exception That got pushed after the instances got killed, so
> shouldn't be an affect of the termination.
> 

As far as random Windows 10 gpu jobs being reported as an exception goes, there could be a couple of reasons:
1. If they ran close enough to the time I killed the bad gpu instances, I might have killed good ones, which caused tasks to fail.
2. I believe there is a known intermittent issue with Windows gpu instances where the graphics driver causes them to crash.  That might be the case here, combined with bad timing.
Windows 10 instances in an impaired state are tracked in bug 1372172 (exercising the gpu often causes the instance to hang). This bug was just about the backlog on win7 and was fixed with the AMI update to that worker type.
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Flags: needinfo?(rthijssen)
Resolution: --- → FIXED
Component: Generic-Worker → Workers