Closed
Bug 1457868
Opened 7 years ago
Closed 7 years ago
halt unproductive windows instances where explorer is in a crash loop
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: grenade, Assigned: grenade)
References
Details
jonasfj pointed out a windows 10 instance today which had been stuck in a crashed state.
the papertrail logs show 3 days of the error below repeated every second or so:
Apr 27 14:57:03 i-01c09dbd16ceb19cb.gecko-t-win10-64.use1.mozilla.com Application_Error: Faulting application name: explorer.exe, version: 10.0.15063.0, time stamp: 0xa25b6083 Faulting module name: twinui.dll, version: 10.0.15063.0, time stamp: 0xd3bee097 Exception code: 0xc0000409 Fault offset: 0x0000000000170abc Faulting process id: 0x1fa0 Faulting application start time: 0x01d3de1edaefc3a0 Faulting application path: C:\Windows\explorer.exe Faulting module path: C:\Windows\system32\twinui.dll Report Id: 0e1abc0f-df56-4d5c-a04e-77de83fb8a68 Faulting package full name: Faulting package-relative application ID:
the messages immediately preceding the first explorer crash message were:
Apr 27 14:57:02 i-01c09dbd16ceb19cb.gecko-t-win10-64.use1.mozilla.com generic-worker: 2018/04/27 11:57:01 Result: 1 0 The operation completed successfully.
Apr 27 14:57:02 i-01c09dbd16ceb19cb.gecko-t-win10-64.use1.mozilla.com generic-worker: Z:\task_1524830212>echo Wait for registry setting to exist before changing it...
Apr 27 14:57:02 i-01c09dbd16ceb19cb.gecko-t-win10-64.use1.mozilla.com generic-worker: Wait for registry setting to exist before changing it...
Apr 27 14:57:02 i-01c09dbd16ceb19cb.gecko-t-win10-64.use1.mozilla.com generic-worker: Z:\task_1524830212>reg query HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\StuckRects3 /ve
Apr 27 14:57:02 i-01c09dbd16ceb19cb.gecko-t-win10-64.use1.mozilla.com generic-worker: HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Explorer\StuckRects3
Apr 27 14:57:02 i-01c09dbd16ceb19cb.gecko-t-win10-64.use1.mozilla.com generic-worker: (Default) REG_SZ (value not set)
Apr 27 14:57:02 i-01c09dbd16ceb19cb.gecko-t-win10-64.use1.mozilla.com generic-worker: Z:\task_1524830212>if 0 EQU 0 goto HideTaskBar
Apr 27 14:57:02 i-01c09dbd16ceb19cb.gecko-t-win10-64.use1.mozilla.com generic-worker: Z:\task_1524830212>echo Hiding taskbar...
Apr 27 14:57:02 i-01c09dbd16ceb19cb.gecko-t-win10-64.use1.mozilla.com generic-worker: Hiding taskbar...
Apr 27 14:57:02 i-01c09dbd16ceb19cb.gecko-t-win10-64.use1.mozilla.com generic-worker: Z:\task_1524830212>powershell -command "&{$p='HKCU:SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\StuckRects3';$v=(Get-ItemProperty -Path $p).Settings;$v[8]=3;&Set-ItemProperty -Path $p -Name Settings -Value $v;&Stop-Process -ProcessName explorer}"
Apr 27 14:57:02 i-01c09dbd16ceb19cb.gecko-t-win10-64.use1.mozilla.com HaltOnIdle: Is-ConditionTrue :: generic-worker is running.
Apr 27 14:57:02 i-01c09dbd16ceb19cb.gecko-t-win10-64.use1.mozilla.com HaltOnIdle: instance appears to be productive.
Apr 27 14:57:02 i-01c09dbd16ceb19cb.gecko-t-win10-64.use1.mozilla.com HaltOnIdle: process priority for generic worker altered from Normal to AboveNormal.
Apr 27 14:57:03 i-01c09dbd16ceb19cb.gecko-t-win10-64.use1.mozilla.com generic-worker: Z:\task_1524830212>echo Setting visual effects
Apr 27 14:57:03 i-01c09dbd16ceb19cb.gecko-t-win10-64.use1.mozilla.com generic-worker: Setting visual effects
Apr 27 14:57:03 i-01c09dbd16ceb19cb.gecko-t-win10-64.use1.mozilla.com generic-worker: Z:\task_1524830212>REG ADD "HKCU\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\VisualEffects" /v "VisualFXSetting" /t REG_DWORD /d 1
logs above were taken from:
https://papertrailapp.com/systems/1816519341/events?focus=926492398777982992&selected=926492398777982992
Assignee
Comment 1•7 years ago
patched HaltOnIdle to halt instances where there have been 30 (or more) explorer crash event logs.
https://github.com/mozilla-releng/OpenCloudConfig/commit/cf97e51a5851ec32dc9801d4ea75c07ba1c7e581
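roughly, the check amounts to something like the sketch below (illustrative only: the event source name, time window and shutdown command here are examples, the exact logic is in the linked commit):
# count recent explorer crash events in the Application event log (sketch, not the literal patch)
$crashes = @(Get-EventLog -LogName 'Application' -Source 'Application Error' -After ((Get-Date).AddHours(-1)) -ErrorAction SilentlyContinue | Where-Object { $_.Message -match 'Faulting application name: explorer\.exe' })
if ($crashes.Count -ge 30) {
  # the instance is stuck in an explorer crash loop: pull it out of the pool
  shutdown /s /t 0 /f /c "HaltOnIdle :: explorer crash loop detected"
}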
Assignee: relops → rthijssen
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Comment 2•7 years ago
This is causing some fallout in bug 1444168.
If there are explorer crashes, we should find the root cause, rather than terminate workers mid-task.
When workers are terminated, running tasks get bad exit codes and resolve as task failures, which won't be retried. This makes it difficult to diagnose the underlying problem, leaves a trail of broken tasks, and can cause tree closures.
Let's review this particular explorer crashing issue, and see if we can find the root cause, rather than just terminating machines that we believe exhibit the issue. Terminating workers should really be a last resort, due to the trails of broken tasks they leave behind.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment hidden (Intermittent Failures Robot)
Assignee
Comment 4•7 years ago
The instances being terminated by haltonidle are not productive. They do nothing other than send explorer crash messages indefinitely. When enough instances get into this state, the trees close because there are no productive workers left, just useless workers which burn resources and prevent other jobs from running. From an infra perspective, I'd rather not burn the whole farm just to make it easier to debug a problem.
Comment 5•7 years ago
So, what's the solution to debugging explorer.exe crashes that is not 1) nuking every instance, which causes jobs to not be retried, or 2) letting instances loop and do nothing until we run out of capacity? You seem to have each end of the spectrum; what's the middle ground?
Comment 6•7 years ago
I'm fine with terminating instances that are wasting resources, but we should do it in a way that doesn't leave tasks as failed.
The key points for me are:
1) Let's try to avoid solving problems by killing workers. This can be a stop-gap, but not a long term solution.
2) Let's try to check the health of workers before they start claiming tasks, not while tasks are running, where possible. Most tasks have a maximum run time of an hour, and will be aborted in a controlled manner if they do not complete within this time.
3) If we are sure that workers are in a bad state, and need to check this while tasks are running, let's agree an interface such that tasks can be resolved as exception/internal-error rather than failed/failed since:
a) failed tasks don't get retried
b) no information is currently provided in task logs for the user to see that the worker was forcefully terminated, and why - the user just thinks their process crashed for no good reason
Instead, with an agreed interface (which we can discuss and review together), the reason for the termination could be logged in the task log so the user that submitted the task has better context on the reason for the failure, rather than the worker just being shut down mid-task. A rough sketch of what I mean is at the end of this comment.
4) Let's try to avoid landing patches in production that haven't undergone any review, especially those that kill workers. Killing workers should be a last resort, and the conditions under which workers get terminated should be closely monitored and reviewed.
In the case of https://bugzilla.mozilla.org/show_bug.cgi?id=1444168#c39 we should review logs to be sure that explorer really was crashing and the worker really was in a bad state. From a cursory look at the logs, this didn't appear to be the case, but a deeper investigation is warranted.
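For illustration only (this interface doesn't exist today, and the file name and fields below are made up), even something as simple as a reason file dropped by the health check would let the worker resolve the task properly and get the context into the task log:
# hypothetical interface: the health check writes a halt-reason file which the
# worker reads, resolves the running task as exception/internal-error, logs the
# reason to the task log, and only then allows the shutdown to proceed
$reason = @{
  reportedBy = 'HaltOnIdle'
  reason     = 'explorer.exe crash loop detected (30+ crash events)'
  action     = 'resolve running task as exception/internal-error, then shut down'
}
$reason | ConvertTo-Json | Set-Content -Path 'C:\generic-worker\halt-reason.json'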
Assignee
Comment 7•7 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #6)
> I'm fine with terminating instances that are wasting resources, but we
> should do it in a way that doesn't leave tasks as failed.
when haltonidle is terminating an instance that is in an unresponsive state, how would you suggest that it "doesn't leave tasks as failed"?
> The key points for me are:
>
> 1) Let's try to avoid solving problems by killing workers. This can be a
> stop-gap, but not a long term solution.
what should we do with an unresponsive worker if not kill it?
> 2) Let's try to check health of workers before it starts claiming tasks, not
> while tasks are running, where possible. Most tasks have a maximum run time
> of an hour, and will be aborted in a controlled manner if they do not
> complete within this time.
the condition addressed by this bug is one where the only communication from the instance for more than 24 hours was a once-per-second message saying that explorer had crashed, and a notification every two minutes from haltonidle that the generic worker process was running. clearly generic worker was not in a state that allowed it to gracefully fail tasks.
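for context, the "generic-worker is running" check that produced those notifications is essentially just a process presence test, along the lines of the sketch below (illustrative, not the literal haltonidle code), so it cannot tell a hung process apart from a healthy one:
# a running process is not the same as a responsive worker
if (Get-Process -Name 'generic-worker' -ErrorAction SilentlyContinue) {
  Write-Output 'Is-ConditionTrue :: generic-worker is running.'
} else {
  Write-Output 'Is-ConditionTrue :: generic-worker is not running.'
}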
> 3) If we are sure that workers are in a bad state, and need to check this
> while tasks are running, let's agree an interface such that tasks can be
> resolved as exception/internal-error rather than failed/failed since:
> a) failed tasks don't get retried
> b) no information is currently provided in task logs for user to see that
> the worker was forcefully terminated, and why - the user just thinks their
> process crashed for no good reason
> Instead with an agreed interface (which we can discuss and review together)
> the reason for the termination could be logged in the task log so the user
> that submitted the task has better context on the reason for the failure,
> rather than the worker just being shut down mid-task.
if we assume that generic worker was not in a responsive state, since the gw process was running for more than 24 hours but was not outputting log messages, i don't really see how an interface into gw would have helped on this occasion. gw was not capable of logging, never mind responding on a status interface.
> 4) Let's try to avoid landing patches in production that haven't undergone
> any review, especially those that kill workers. Killing workers should be a
> last resort, and the conditions under which workers get terminated should be
> closely monitored and reviewed.
agreed, although i maintain that killing an unresponsive worker is an appropriate course of action. we would otherwise spend money keeping unresponsive workers online for no good reason.
> In the case of https://bugzilla.mozilla.org/show_bug.cgi?id=1444168#c39 we
> should review logs to be sure that explorer really was crashing and the
> worker really was in a bad state. From a cursory look at the logs, this
> didn't appear to be the case, but a deeper investigation is warranted.
i reviewed the logs. furthermore, the patch that was landed to close this bug specifically checks the event log for messages that contain "Faulting application name: explorer.exe", which is part of the text of the full failure logs, which read:
"Application_Error: Faulting application name: explorer.exe, version: 10.0.15063.0, time stamp: 0xa25b6083 Faulting module name: twinui.dll, version: 10.0.15063.0, time stamp: 0xd3bee097 Exception code: 0xc0000409 Fault offset: 0x0000000000170abc Faulting process id: 0x1fa0 Faulting application start time: 0x01d3de1edaefc3a0 Faulting application path: C:\Windows\explorer.exe Faulting module path: C:\Windows\system32\twinui.dll Report Id: 0e1abc0f-df56-4d5c-a04e-77de83fb8a68 Faulting package full name: Faulting package-relative application ID:"
the patch checks for 30 recurrences of this exception before terminating the worker. from my more-than-cursory look at the logs, i would say that explorer had indeed crashed.
i agree that it would be nice for generic worker or the build process to put some nicer messages into the build log when tasks are failing, but i would suggest creating a new bug for that. this bug is specifically about removing unresponsive instances from the resource pool in order to keep capacity available for responsive instances. i'm confused as to why this bug was reopened at all, and i have no idea how to close it now that the goalposts seem to be moving to encompass all kinds of things that aren't within the scope of occ infra and resource management.
Assignee
Comment 8•7 years ago
also i believe the orange factor link to this bug is completely unrelated.
those orange factor builds contain the following log messages:
[taskcluster:error] Uploading error artifact public/test_info from file build/blobber_upload_dir with message "Could not read directory 'Z:\\task_1525311161\\build\\blobber_upload_dir'", reason "file-missing-on-worker" and expiry 2019-05-02T23:58:43.351Z
[taskcluster:error] TASK FAILURE during artifact upload: file-missing-on-worker: Could not read directory 'Z:\task_1525311161\build\blobber_upload_dir'
[taskcluster 2018-05-03T01:40:41.339Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/M3Ldb-FxSoOrWD2gtl9Jow/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2019-05-02T23:58:43.351Z
[taskcluster:error] Task not successful due to following exception(s):
[taskcluster:error] Exception 1)
[taskcluster:error] exit status 1073807364
[taskcluster:error] Exception 2)
[taskcluster:error] file-missing-on-worker: Could not read directory 'Z:\task_1525311161\build\blobber_upload_dir'
[taskcluster:error]
if the instance where that task ran had been terminated because of explorer crashes, we would not be getting messages about generic worker being unable to find an artifact. generic worker itself would not be saying anything, since the instance it was on would be dead. i don't know why there's any association between whatever problems gw is having with finding artifacts and this bug at all. they are two different problems, so i'm closing this again. the problem this bug describes has been fixed. the problems described by the orange factor links cannot relate to that fix, since a terminated worker would not be able to send messages about missing artifacts. please create another bug that is actually related to those errors.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
Assignee
Comment 9•7 years ago
having read through the comments on bug 1444168 more carefully, i see that i am wrong. it is in fact the killing of the instance that causes the failure to find artefacts.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Comment 10•7 years ago
other recent commits to occ check for an idle generic worker, so we'll have to rely on those to determine whether or not to terminate an instance. i have removed the check for multiple explorer crashes, and these will no longer cause a termination.
https://github.com/mozilla-releng/OpenCloudConfig/commit/7b01ea86a55fe380e2c675623bf40bdd459c118f
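those idleness checks are roughly of this shape (illustrative only: the log path and threshold here are assumptions, the real logic is in the occ repo):
# if the generic-worker log hasn't been written to recently, treat the instance as idle
$gwLog = Get-Item 'C:\generic-worker\generic-worker.log' -ErrorAction SilentlyContinue
if ($gwLog -and (((Get-Date) - $gwLog.LastWriteTime).TotalMinutes -gt 60)) {
  Write-Output 'HaltOnIdle :: generic-worker appears to be idle.'
}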
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
Comment hidden (Intermittent Failures Robot)