Closed Bug 1444168 Opened 6 years ago Closed 6 years ago

Intermittent failures with Exit Code: 1073807364/3221225786 / Success Code: 0x0 (task process killed by *something*)

Categories: Taskcluster :: General, defect
Priority: Not set
Severity: blocker
Tracking: Not tracked
Status: RESOLVED FIXED
People: (Reporter: nataliaCs, Unassigned)
Whiteboard: [stockwell infra]
Attachments: (1 file)

The issue is present on the following push: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=1d01aa7f63ade6f7afc41370cd28cfbc35c14f13

Example log: https://treeherder.mozilla.org/logviewer.html#?job_id=166806532&repo=mozilla-inbound&lineNumber=36687

19:29:03     INFO -  Beginning work for file: z:\build\build\src\obj-firefox\toolkit\library\xul.dll
19:29:03     INFO -  Processing file: z:\build\build\src\obj-firefox\toolkit\library\xul.dll
19:29:03     INFO -  z:\build\build\src\obj-firefox\dist\host\bin\dump_syms.exe z:\build\build\src\obj-firefox\toolkit\library\xul.dll
19:29:03     INFO -  Finished processing z:\build\build\src\obj-firefox\toolkit\library\xul.dll in 85.66s
19:29:03     INFO -  Running: z:/build/build/src/obj-firefox/_virtualenv/Scripts/python.exe z:/build/build/src\toolkit\crashreporter\tools\symbolstore.py -c --vcs-info -i --install-manifest=z:/build/build/src/obj-firefox\_build_manifests\install\dist_include,z:/build/build/src/obj-firefox\dist\include -s z:/build/build/src z:/build/build/src/obj-firefox\dist\host\bin\dump_syms.exe z:/build/build/src/obj-firefox\dist\crashreporter-symbols z:\build\build\src\obj-firefox\toolkit\library\xul.dll
19:29:03     INFO -  mozmake.EXE[4]: Leaving directory 'z:/build/build/src/obj-firefox/toolkit/library'
[taskcluster:error]    Exit Code: 1073807364
[taskcluster:error] Success Code: 0x0
[taskcluster:error]    User Time: 0s
[taskcluster:error]  Kernel Time: 31.25ms
[taskcluster:error]    Wall Time: 36m44.6030212s
[taskcluster:error]  Peak Memory: 1703936
[taskcluster:error]       Result: FAILED
[taskcluster 2018-03-08T19:29:14.145Z] === Task Finished ===
[taskcluster 2018-03-08T19:29:14.145Z] Task Duration: 43m16.7670142s
[taskcluster:error] Uploading error artifact public/build from file public/build with message "Could not read directory 'Z:\\task_1520534506\\public\\build'", reason "file-missing-on-worker" and expiry 2019-03-08T18:45:33.009Z
[taskcluster:error] TASK FAILURE during artifact upload: file-missing-on-worker: Could not read directory 'Z:\task_1520534506\public\build'
[taskcluster 2018-03-08T19:29:15.225Z] Uploading artifact public/logs/certified.log from file generic-worker\certified.log with content encoding "gzip", mime type "text/plain; charset=utf-8" and expiry 2019-03-08T18:45:33.009Z
[taskcluster 2018-03-08T19:29:16.299Z] Uploading artifact public/chainOfTrust.json.asc from file generic-worker\chainOfTrust.json.asc with content encoding "gzip", mime type "text/plain; charset=utf-8" and expiry 2019-03-08T18:45:33.009Z
[taskcluster 2018-03-08T19:29:16.885Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/JGSjjqYkRMm3kcCEtWQ2ug/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2019-03-08T18:45:33.009Z
[taskcluster:error] Task not successful due to following exception(s):
[taskcluster:error] Exception 1)
[taskcluster:error] []
[taskcluster:error] 
[taskcluster:error] Exit code: 1073807364
[taskcluster:error] Exception 2)
[taskcluster:error] file-missing-on-worker: Could not read directory 'Z:\task_1520534506\public\build'
[taskcluster:error] 

RESULTED IN TREE CLOSURE
tl;dr: this was caused by spot terminations (we had > 1000 spot terminations in the last day), combined with the fact that bug 1180187 is not yet implemented, so tasks were simply failing rather than resolving as exception/worker-shutdown[1]. I will look into implementing bug 1180187 as soon as possible to mitigate this, so that in future tasks are resolved as exception/worker-shutdown (blue in treeherder) rather than failed/failed (red in treeherder), and are automatically retried.
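
(For context: resolving a run as exception/worker-shutdown happens via the queue's reportException endpoint; see the API reference in [1]. Below is a minimal, hedged Go sketch of such a call. The taskId/runId are placeholders, and a real call must be signed with Taskcluster credentials rather than sent unauthenticated as shown.)

package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Placeholder task identifiers; a real worker knows its own taskId/runId.
	taskId, runId := "JGSjjqYkRMm3kcCEtWQ2ug", 0
	url := fmt.Sprintf("https://queue.taskcluster.net/v1/task/%s/runs/%d/exception", taskId, runId)
	// "worker-shutdown" resolves the run as exception and allows automatic retry.
	payload := []byte(`{"reason": "worker-shutdown"}`)
	// NOTE: real requests must carry Hawk-signed Taskcluster credentials;
	// this unauthenticated call would be rejected.
	resp, err := http.Post(url, "application/json", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("queue response:", resp.Status)
}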


More detailed explanation
=========================

When implementing bug 1333957, the task logging was updated so that errors were logged with a different syntax that treeherder could parse. This involved removing the timestamp from the error line, as required by treeherder. This means we can now find errors automatically that we could not find before. That may have helped classify this issue more uniformly than we were able to before, though I am not sure about this part.

However, the rate of spot terminations was high during this time, perhaps due to higher demand in AWS for this instance type in the regions where we had provisioned nodes; whatever the reason was, we were getting spot terminations.

Spot request termination detection has been built into OCC[2], which was able to log the spot termination in the worker log[3] at 19:28:13 UTC, for a planned termination 58 seconds later at 19:29:11. Assuming the shutdown was issued on time, the machine would have begun shutting down at 19:29:11, and two seconds later at 19:29:13 we see the exit code 0x40010004[4] = 1073807364 = DBG_TERMINATE_PROCESS = "Debugger terminated the process"[5] as reported in the task log[6].
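
For reference, the detection OCC performs boils down to polling the EC2 instance metadata service, which returns 404 until a spot termination is scheduled and then returns the planned termination time. A minimal Go sketch of the same check (OCC's actual implementation is the PowerShell linked in [2]):

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// EC2 instance metadata endpoint for spot termination notices.
	const url = "http://169.254.169.254/latest/meta-data/spot/termination-time"
	for {
		resp, err := http.Get(url)
		if err == nil && resp.StatusCode == http.StatusOK {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			// e.g. "2018-03-08T19:29:11Z": roughly a minute or two of warning.
			fmt.Printf("spot termination notice received: %s\n", body)
			return
		}
		if resp != nil {
			resp.Body.Close()
		}
		time.Sleep(5 * time.Second) // 404 means no termination scheduled yet
	}
}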

--

[1] https://docs.taskcluster.net/reference/platform/taskcluster-queue/references/api#get-task-status
[2] https://github.com/mozilla-releng/OpenCloudConfig/blob/cb7a8948b9068302a062a9badec6d88ca3b4bc6b/userdata/HaltOnIdle.ps1#L63-L77
[3] https://papertrailapp.com/systems/1668716761/events?focus=908486543633731587&selected=908486543633731587 :

> Mar 08 19:28:13 i-0f019b81cdfaf7468.gecko-3-b-win2012.usw1.mozilla.com HaltOnIdle: Is-Terminating :: spot termination notice received: 2018-03-08T19:29:11Z. 

[4] https://stackoverflow.com/questions/37078953/in-which-cases-does-program-exit-with-0x40010004-code
[5] https://msdn.microsoft.com/en-us/library/cc704588.aspx?f=255&MSPPError=-2147217396
[6] https://treeherder.mozilla.org/logviewer.html#?job_id=166808009&repo=mozilla-inbound&lineNumber=55501 :

> 19:29:13     INFO -  mozmake.EXE[4]: Leaving directory 'z:/build/build/src/obj-firefox/gfx/angle/targets/libANGLE'
> [taskcluster:error]    Exit Code: 1073807364
See Also: → 1180187
Note, trees are no longer closed.
Depends on: 1180187
Attached file spot-kills.txt —
This is a list of all instances which terminated because they were spot-killed by the EC2 service in the last 24 hours (give or take the few minutes it took to produce this attachment). The query run was:

ec2-manager::CRIMSON=> select region, "workerType", "instanceType", id, terminated from terminations where code = 'Server.SpotInstanceTermination' and terminated > now() - interval '1 day';

This was the result of a ping on IRC, so please forgive me if this isn't the correct place for this attachment.
Severity: blocker → major
Summary: TREE CLOSURE Windows Build Bustage with Exit Code: 1073807364, with Uploading error artifact public/build from file public/build → Windows Build Bustage with Exit Code: 1073807364, with Uploading error artifact public/build from file public/build
(In reply to Pete Moore [:pmoore][:pete] from comment #2)
> This means that now we are able to find errors automatically,
> that we could not before. Maybe this helped classify this issue more
> uniformly than we were able to do before - this part I am not sure.

To test this theory, I searched the new ec2-manager postgres database for spot termination notices that occurred before generic-worker was upgraded to provide the new error messages, and found e.g.

https://treeherder.mozilla.org/logviewer.html#?job_id=165524756&repo=nss-try&lineNumber=178

> [taskcluster 2018-03-02T14:10:26.295Z]    Exit Code: 3221225786

Note, this is a similar exit code signifying process termination: 3221225786 = 0xC000013A = STATUS_CONTROL_C_EXIT[1].

which corresponds to:

> ec2-manager::CRIMSON=> select id, "workerType", region, reason, terminated from terminations where id = 'i-0f20dd96e96d8f45d';
>          id          |  workerType   |  region   |          reason           |       terminated       
> ---------------------+---------------+-----------+---------------------------+------------------------
>  i-0f20dd96e96d8f45d | nss-win2012r2 | us-east-1 | Spot instance termination | 2018-03-02 14:11:21+00
> (1 row)

> ec2-manager::CRIMSON=> select generated from cloudwatchlog where id = 'i-0f20dd96e96d8f45d' and state='shutting-down';
>        generated        
> ------------------------
>  2018-03-02 14:10:25+00
> (1 row)

So in this case from a week ago, before any worker upgrade, we see that a spot termination did cause a task to fail (in this case an NSS "Windows 2012 opt Windows 2012 32 (opt) (B)" job), but that failure was not classified by treeherder, whereas now treeherder is able to parse the task log error message and report it.
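
For reference, both raw exit codes seen in this bug decode to Windows NTSTATUS values. A small illustrative Go snippet (the mappings come from the MSDN and Stack Overflow links referenced above and in [1] below):

package main

import "fmt"

// describeExitCode maps the two raw exit codes from this bug back to
// their NTSTATUS names.
func describeExitCode(code uint32) string {
	switch code {
	case 0x40010004: // 1073807364
		return "DBG_TERMINATE_PROCESS: debugger terminated the process"
	case 0xC000013A: // 3221225786
		return "STATUS_CONTROL_C_EXIT: process terminated by Ctrl+C/console close"
	default:
		return "unknown"
	}
}

func main() {
	for _, c := range []uint32{1073807364, 3221225786} {
		fmt.Printf("%d (0x%08X): %s\n", c, c, describeExitCode(c))
	}
}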


Summary
=======

It looks like we've *always* had this problem in generic-worker; we just weren't classifying it before, because the worker logs weren't in a form treeherder could use to report on this issue.

With improved logging as a consequence of bug 1333957, we've now discovered an issue we've had for the last 3 years: we've not been handling task resolutions due to spot terminations properly on Windows. The path to resolution is to finally implement bug 1180187, which unfortunately slipped through the net until now.



--
[1] https://stackoverflow.com/questions/25444765/why-does-a-non-interactive-batch-script-think-ive-pressed-control-c
Summary: Windows Build Bustage with Exit Code: 1073807364, with Uploading error artifact public/build from file public/build → Windows Build Bustage with Exit Code: 1073807364/3221225786
Summary: Windows Build Bustage with Exit Code: 1073807364/3221225786 → Windows Build Bustage with Exit Code: 1073807364/3221225786 / Success Code: 0x0
Whiteboard: [stockwell infra]
Update:
There have been 81 failures in the last week.
Affected platforms: Windows 2012 x64, windows2012-32, windows2012-32-noopt

Flags: needinfo?(jmaher)
:coop, can you ensure that all the taskcluster::* bugzilla components have a triage owner?  When we have intermittent failures in infrastructure there needs to be a person who can act as a point of contact.  For this bug, I will follow up in bug 1180187 which has more actionable activity.
Flags: needinfo?(jmaher) → needinfo?(coop)
We are in the process of finding triage owners for all components, but most of the team is also away this week, which is why many bugs are sitting idle.
Flags: needinfo?(coop)
:coop, can you assign this to someone to be a point of contact? I know this might be difficult to fix, but it is good to have someone who understands the problem, even if there are dependencies that need to be fixed.
Flags: needinfo?(coop)
I'm assigning this to Pete and he will tell me why that was the wrong thing to do.

Honestly, I think getting the newest version of the generic-worker deployed to the builders will make this much easier to fix, and that's what Pete has been focusing on for the past "little" while.
Assignee: nobody → pmoore
Status: NEW → ASSIGNED
Flags: needinfo?(coop)
This should actually be fixed now with the deployment of bug 1399401.

We can leave it open a while, to be sure. I expect OrangeFactor to comment at least once more, since we've only just deployed, but after that I'd expect this issue to be gone.

In short, the exit codes are caused by worker spot terminations. Previously we weren't listening for them, so tasks would suddenly get killed. Now we are ahead of the game: we listen for notifications that give us one or two minutes to abort the task, report a worker shutdown, and perform cleanup (upload artifacts, etc.), so the task should have resolved by the time the worker gets the hard kill.
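
A minimal sketch of that pattern (hypothetical code, not generic-worker's actual implementation): race the running task against the termination notice, and on a notice kill the task, upload artifacts, and resolve the run as worker-shutdown during the grace period:

package main

import (
	"context"
	"log"
	"os/exec"
	"time"
)

func main() {
	// Simulated termination notice arriving 2 seconds in; in reality it
	// comes from polling the metadata endpoint as sketched earlier.
	notice := make(chan time.Time, 1)
	go func() {
		time.Sleep(2 * time.Second)
		notice <- time.Now().Add(2 * time.Minute)
	}()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	// Stand-in for the real task command.
	cmd := exec.CommandContext(ctx, "sleep", "600")
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done:
		log.Printf("task finished on its own: %v", err)
	case t := <-notice:
		cancel() // kill the task process now...
		<-done
		// ...then upload artifacts and resolve exception/worker-shutdown
		// while the one-to-two-minute grace period is still running.
		log.Printf("termination due at %s: resolved worker-shutdown", t)
	}
}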
As we see from comment 26, April 3 was the last day this occurred; indeed it seems to be fixed. \o/
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Summary: Windows Build Bustage with Exit Code: 1073807364/3221225786 / Success Code: 0x0 → Intermittent failures with Exit Code: 1073807364/3221225786 / Success Code: 0x0
the trees are closed because of this and backfills fail for regular jobs :(
Severity: major → blocker
We think this may have been caused by the "fix" for bug 1457979.
I understand as well that the fix has itself been fixed.

10:51:25 <grenade> yes, my earlier patch was too aggressive. fixed now (HaltOnIdle will kill gw if it does not make a log entry within 5 hours, instead of 20 minutes).

We are retriggering some builds now to see if this is actually fixed.
Trees are reopened. afaict, the fix that grenade applied has worked! Please reopen if this happens again.
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → FIXED
Note, *for future reference*: if this error begins occurring again, it means something killed the task process. It did not stop of its own accord; something killed it. ***There was an assassin***.

What the assassin is can vary: a buggy script, a shutdown command, an undetected spot termination, an interactive user RDP'd onto the machine, an unintended Windows Update that forcefully kills processes... a myriad of possibilities to consider.
Summary: Intermittent failures with Exit Code: 1073807364/3221225786 / Success Code: 0x0 → Intermittent failures with Exit Code: 1073807364/3221225786 / Success Code: 0x0 (task process killed by *something*)
Assignee: pmoore → nobody
Hi, there are still some failures with this exit code:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=2a7aef0304b22b0dd3efbfb105ede8ea515d4fb0&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=success&selectedJob=176574693

https://treeherder.mozilla.org/logviewer.html#?job_id=176574693&repo=mozilla-inbound&lineNumber=1424

[taskcluster:error]    Exit Code: 1073807364
[taskcluster:error]    User Time: 15.625ms
[taskcluster:error]  Kernel Time: 15.625ms
[taskcluster:error]    Wall Time: 1m49.3949647s
[taskcluster:error]       Result: FAILED
[taskcluster 2018-05-02T15:21:34.500Z] === Task Finished ===
[taskcluster 2018-05-02T15:21:34.500Z] Task Duration: 1m49.3959979s
[taskcluster 2018-05-02T15:21:35.103Z] Uploading artifact public/logs/localconfig.json from file logs\localconfig.json with content encoding "gzip", mime type "application/octet-stream" and expiry 2019-05-02T14:04:15.986Z
[taskcluster 2018-05-02T15:21:35.555Z] Uploading artifact public/logs/log_critical.log from file logs\log_critical.log with content encoding "gzip", mime type "text/plain" and expiry 2019-05-02T14:04:15.986Z
[taskcluster 2018-05-02T15:21:35.937Z] Uploading artifact public/logs/log_error.log from file logs\log_error.log with content encoding "gzip", mime type "text/plain" and expiry 2019-05-02T14:04:15.986Z
[taskcluster 2018-05-02T15:21:36.600Z] Uploading artifact public/logs/log_fatal.log from file logs\log_fatal.log with content encoding "gzip", mime type "text/plain" and expiry 2019-05-02T14:04:15.986Z
[taskcluster 2018-05-02T15:21:37.722Z] Uploading artifact public/logs/log_info.log from file logs\log_info.log with content encoding "gzip", mime type "text/plain" and expiry 2019-05-02T14:04:15.986Z
[taskcluster 2018-05-02T15:21:38.093Z] Uploading artifact public/logs/log_raw.log from file logs\log_raw.log with content encoding "gzip", mime type "text/plain" and expiry 2019-05-02T14:04:15.986Z
[taskcluster 2018-05-02T15:21:38.501Z] Uploading artifact public/logs/log_warning.log from file logs\log_warning.log with content encoding "gzip", mime type "text/plain" and expiry 2019-05-02T14:04:15.986Z
[taskcluster:error] Uploading error artifact public/test_info from file build/blobber_upload_dir with message "Could not read directory 'Z:\\task_1525273982\\build\\blobber_upload_dir'", reason "file-missing-on-worker" and expiry 2019-05-02T14:04:15.986Z
[taskcluster:error] TASK FAILURE during artifact upload: file-missing-on-worker: Could not read directory 'Z:\task_1525273982\build\blobber_upload_dir'
[taskcluster 2018-05-02T15:21:39.514Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/PSc1s4OcRFSXdninXsZiDw/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2019-05-02T14:04:15.986Z
[taskcluster:error] Task not successful due to following exception(s):
[taskcluster:error] Exception 1)
[taskcluster:error] exit status 1073807364
[taskcluster:error] Exception 2)
[taskcluster:error] file-missing-on-worker: Could not read directory 'Z:\task_1525273982\build\blobber_upload_dir'
[taskcluster:error] 

Can you please take a look?
Flags: needinfo?(pmoore)
(In reply to Cristian Brindusan [:cristian_brindusan] from comment #37)

> Can you please take a look?

The logs show that the worker was shut down by HaltOnIdle mid-task, as follows:

May 02 17:21:03 i-0014c5b2ade476936.gecko-t-win10-64.usw2.mozilla.com HaltOnIdle: Is-ConditionTrue :: generic-worker is running. 
May 02 17:21:04 i-0014c5b2ade476936.gecko-t-win10-64.usw2.mozilla.com HaltOnIdle: Is-ConditionTrue :: Explorer is crashing repeatedy. 
May 02 17:21:04 i-0014c5b2ade476936.gecko-t-win10-64.usw2.mozilla.com HaltOnIdle: instance failed reliability check and will be halted. uptime:  
May 02 17:21:05 i-0014c5b2ade476936.gecko-t-win10-64.usw2.mozilla.com User32: The process C:\Windows\system32\shutdown.exe (I-0014C5B2ADE47) has initiated the shutdown of computer I-0014C5B2ADE47 on behalf of user NT AUTHORITY\SYSTEM for the following reason: Application: Maintenance (Planned)   Reason Code: 0x80040001   Shutdown Type: shutdown   Comment: HaltOnIdle :: instance failed reliability check 


This seems to be related to bug 1457868.
Flags: needinfo?(pmoore)
See Also: → 1457868
Please disregard the above OrangeFactor comment; that should be for bug 1457868.
Thank you.