Closed Bug 1444168 Opened 6 years ago Closed 6 years ago

Intermittent failures with Exit Code: 1073807364/3221225786 / Success Code: 0x0 (task process killed by *something*)

Categories: Taskcluster :: General, defect
Priority: Not set
Severity: blocker
Tracking: Not tracked
Status: RESOLVED FIXED
People: (Reporter: nataliaCs, Unassigned)
Whiteboard: [stockwell infra]
Attachments: (1 file)

The issue is present on the following push: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=1d01aa7f63ade6f7afc41370cd28cfbc35c14f13

Example log: https://treeherder.mozilla.org/logviewer.html#?job_id=166806532&repo=mozilla-inbound&lineNumber=36687

19:29:03     INFO -  Beginning work for file: z:\build\build\src\obj-firefox\toolkit\library\xul.dll
19:29:03     INFO -  Processing file: z:\build\build\src\obj-firefox\toolkit\library\xul.dll
19:29:03     INFO -  z:\build\build\src\obj-firefox\dist\host\bin\dump_syms.exe z:\build\build\src\obj-firefox\toolkit\library\xul.dll
19:29:03     INFO -  Finished processing z:\build\build\src\obj-firefox\toolkit\library\xul.dll in 85.66s
19:29:03     INFO -  Running: z:/build/build/src/obj-firefox/_virtualenv/Scripts/python.exe z:/build/build/src\toolkit\crashreporter\tools\symbolstore.py -c --vcs-info -i --install-manifest=z:/build/build/src/obj-firefox\_build_manifests\install\dist_include,z:/build/build/src/obj-firefox\dist\include -s z:/build/build/src z:/build/build/src/obj-firefox\dist\host\bin\dump_syms.exe z:/build/build/src/obj-firefox\dist\crashreporter-symbols z:\build\build\src\obj-firefox\toolkit\library\xul.dll
19:29:03     INFO -  mozmake.EXE[4]: Leaving directory 'z:/build/build/src/obj-firefox/toolkit/library'
[taskcluster:error]    Exit Code: 1073807364
[taskcluster:error] Success Code: 0x0
[taskcluster:error]    User Time: 0s
[taskcluster:error]  Kernel Time: 31.25ms
[taskcluster:error]    Wall Time: 36m44.6030212s
[taskcluster:error]  Peak Memory: 1703936
[taskcluster:error]       Result: FAILED
[taskcluster 2018-03-08T19:29:14.145Z] === Task Finished ===
[taskcluster 2018-03-08T19:29:14.145Z] Task Duration: 43m16.7670142s
[taskcluster:error] Uploading error artifact public/build from file public/build with message "Could not read directory 'Z:\\task_1520534506\\public\\build'", reason "file-missing-on-worker" and expiry 2019-03-08T18:45:33.009Z
[taskcluster:error] TASK FAILURE during artifact upload: file-missing-on-worker: Could not read directory 'Z:\task_1520534506\public\build'
[taskcluster 2018-03-08T19:29:15.225Z] Uploading artifact public/logs/certified.log from file generic-worker\certified.log with content encoding "gzip", mime type "text/plain; charset=utf-8" and expiry 2019-03-08T18:45:33.009Z
[taskcluster 2018-03-08T19:29:16.299Z] Uploading artifact public/chainOfTrust.json.asc from file generic-worker\chainOfTrust.json.asc with content encoding "gzip", mime type "text/plain; charset=utf-8" and expiry 2019-03-08T18:45:33.009Z
[taskcluster 2018-03-08T19:29:16.885Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/JGSjjqYkRMm3kcCEtWQ2ug/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2019-03-08T18:45:33.009Z
[taskcluster:error] Task not successful due to following exception(s):
[taskcluster:error] Exception 1)
[taskcluster:error] []
[taskcluster:error] 
[taskcluster:error] Exit code: 1073807364
[taskcluster:error] Exception 2)
[taskcluster:error] file-missing-on-worker: Could not read directory 'Z:\task_1520534506\public\build'
[taskcluster:error] 

RESULTED IN TREE CLOSURE
tl;dr: this was caused by spot terminations (we had > 1000 spot terminations in the last day), combined with the fact that bug 1180187 is not yet implemented, so tasks were simply failing rather than resolving as exception/worker-shutdown[1]. I will look into implementing bug 1180187 as soon as possible to mitigate this, so that in future tasks are resolved as exception/worker-shutdown (blue in treeherder) rather than failed/failed (red in treeherder), and are automatically retried.
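
(For context: resolving a run as exception/worker-shutdown happens via the queue's reportException endpoint; see the API reference in [1]. Below is a minimal, hedged Go sketch of such a call. The taskId/runId are placeholders, and a real call must be signed with Taskcluster credentials rather than sent unauthenticated as shown.)

package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Placeholder task identifiers; a real worker knows its own taskId/runId.
	taskId, runId := "JGSjjqYkRMm3kcCEtWQ2ug", 0
	url := fmt.Sprintf("https://queue.taskcluster.net/v1/task/%s/runs/%d/exception", taskId, runId)
	// "worker-shutdown" resolves the run as exception and allows automatic retry.
	payload := []byte(`{"reason": "worker-shutdown"}`)
	// NOTE: real requests must carry Hawk-signed Taskcluster credentials;
	// this unauthenticated call would be rejected.
	resp, err := http.Post(url, "application/json", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("queue response:", resp.Status)
}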


More detailed explanation
=========================

When implementing bug 1333957, the task logging was updated so that errors were logged with a different syntax that treeherder could parse. This involved removing the timestamp from the error line, as required by treeherder. This means we can now find errors automatically that we could not find before. That may have helped classify this issue more uniformly than we were able to before, though I am not sure about this part.

However, the rate of spot terminations was high during this time, perhaps due to higher demand in AWS for this instance type in the regions where we had provisioned nodes; whatever the reason was, we were getting spot terminations.

Spot request termination detection has been built into OCC[2], which was able to log the spot termination in the worker log[3] at 19:28:13 UTC, for a planned termination 58 seconds later at 19:29:11. Assuming the shutdown was issued on time, the machine would have begun shutting down at 19:29:11, and two seconds later at 19:29:13 we see the exit code 0x40010004[4] = 1073807364 = DBG_TERMINATE_PROCESS = "Debugger terminated the process"[5] as reported in the task log[6].
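
For reference, the detection OCC performs boils down to polling the EC2 instance metadata service, which returns 404 until a spot termination is scheduled and then returns the planned termination time. A minimal Go sketch of the same check (OCC's actual implementation is the PowerShell linked in [2]):

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// EC2 instance metadata endpoint for spot termination notices.
	const url = "http://169.254.169.254/latest/meta-data/spot/termination-time"
	for {
		resp, err := http.Get(url)
		if err == nil && resp.StatusCode == http.StatusOK {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			// e.g. "2018-03-08T19:29:11Z": roughly a minute or two of warning.
			fmt.Printf("spot termination notice received: %s\n", body)
			return
		}
		if resp != nil {
			resp.Body.Close()
		}
		time.Sleep(5 * time.Second) // 404 means no termination scheduled yet
	}
}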

--

[1] https://docs.taskcluster.net/reference/platform/taskcluster-queue/references/api#get-task-status
[2] https://github.com/mozilla-releng/OpenCloudConfig/blob/cb7a8948b9068302a062a9badec6d88ca3b4bc6b/userdata/HaltOnIdle.ps1#L63-L77
[3] https://papertrailapp.com/systems/1668716761/events?focus=908486543633731587&selected=908486543633731587 :

> Mar 08 19:28:13 i-0f019b81cdfaf7468.gecko-3-b-win2012.usw1.mozilla.com HaltOnIdle: Is-Terminating :: spot termination notice received: 2018-03-08T19:29:11Z. 

[4] https://stackoverflow.com/questions/37078953/in-which-cases-does-program-exit-with-0x40010004-code
[5] https://msdn.microsoft.com/en-us/library/cc704588.aspx?f=255&MSPPError=-2147217396
[6] https://treeherder.mozilla.org/logviewer.html#?job_id=166808009&repo=mozilla-inbound&lineNumber=55501 :

> 19:29:13     INFO -  mozmake.EXE[4]: Leaving directory 'z:/build/build/src/obj-firefox/gfx/angle/targets/libANGLE'
> [taskcluster:error]    Exit Code: 1073807364
See Also: → 1180187
Note, trees are no longer closed.
Depends on: 1180187
Attached file spot-kills.txt —
This is a list of all instances which terminated because they were spot-killed by the EC2 service in the last 24 hours (give or take the few minutes it took to produce this attachment). The query run was:

ec2-manager::CRIMSON=> select region, "workerType", "instanceType", id, terminated from terminations where code = 'Server.SpotInstanceTermination' and terminated > now() - interval '1 day';

This was the result of a ping on IRC, so please forgive me if this isn't the correct place for this attachment.
Severity: blocker → major
Summary: TREE CLOSURE Windows Build Bustage with Exit Code: 1073807364, with Uploading error artifact public/build from file public/build → Windows Build Bustage with Exit Code: 1073807364, with Uploading error artifact public/build from file public/build
(In reply to Pete Moore [:pmoore][:pete] from comment #2)
> This means that now we are able to find errors automatically,
> that we could not before. Maybe this helped classify this issue more
> uniformly than we were able to do before - this part I am not sure.

To test this theory, I searched the new ec2-manager postgres database for spot termination notices that occurred before generic-worker was upgraded to provide the new error messages, and found e.g.

https://treeherder.mozilla.org/logviewer.html#?job_id=165524756&repo=nss-try&lineNumber=178

> [taskcluster 2018-03-02T14:10:26.295Z]    Exit Code: 3221225786

Note, this is a similar exit code signifying process termination: 3221225786 = 0xC000013A = STATUS_CONTROL_C_EXIT[1].

which corresponds to:

> ec2-manager::CRIMSON=> select id, "workerType", region, reason, terminated from terminations where id = 'i-0f20dd96e96d8f45d';
>          id          |  workerType   |  region   |          reason           |       terminated       
> ---------------------+---------------+-----------+---------------------------+------------------------
>  i-0f20dd96e96d8f45d | nss-win2012r2 | us-east-1 | Spot instance termination | 2018-03-02 14:11:21+00
> (1 row)

> ec2-manager::CRIMSON=> select generated from cloudwatchlog where id = 'i-0f20dd96e96d8f45d' and state='shutting-down';
>        generated        
> ------------------------
>  2018-03-02 14:10:25+00
> (1 row)

So in this case from a week ago, before any worker upgrade, we see that a spot termination did cause a task to fail (in this case an NSS "Windows 2012 opt Windows 2012 32 (opt) (B)" job), but that failure was not classified by treeherder, whereas now treeherder is able to parse the task log error message and report it.
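
For reference, both raw exit codes seen in this bug decode to Windows NTSTATUS values. A small illustrative Go snippet (the mappings come from the MSDN and Stack Overflow links referenced above and in [1] below):

package main

import "fmt"

// describeExitCode maps the two raw exit codes from this bug back to
// their NTSTATUS names.
func describeExitCode(code uint32) string {
	switch code {
	case 0x40010004: // 1073807364
		return "DBG_TERMINATE_PROCESS: debugger terminated the process"
	case 0xC000013A: // 3221225786
		return "STATUS_CONTROL_C_EXIT: process terminated by Ctrl+C/console close"
	default:
		return "unknown"
	}
}

func main() {
	for _, c := range []uint32{1073807364, 3221225786} {
		fmt.Printf("%d (0x%08X): %s\n", c, c, describeExitCode(c))
	}
}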


Summary
=======

It looks like we've *always* had this problem in generic-worker; we just weren't classifying it before, because the worker logs weren't in a form treeherder could use to report on this issue.

With improved logging as a consequence of bug 1333957, we've now discovered an issue we've had for the last 3 years: we've not been handling task resolutions due to spot terminations properly on Windows. The path to resolution is to finally implement bug 1180187, which unfortunately slipped through the net until now.



--
[1] https://stackoverflow.com/questions/25444765/why-does-a-non-interactive-batch-script-think-ive-pressed-control-c
Summary: Windows Build Bustage with Exit Code: 1073807364, with Uploading error artifact public/build from file public/build → Windows Build Bustage with Exit Code: 1073807364/3221225786
Summary: Windows Build Bustage with Exit Code: 1073807364/3221225786 → Windows Build Bustage with Exit Code: 1073807364/3221225786 / Success Code: 0x0
Whiteboard: [stockwell infra]
Update:
There have been 81 failures in the last week.
Affected platforms: Windows 2012 x64, windows2012-32, windows2012-32-noopt

Flags: needinfo?(jmaher)
:coop, can you ensure that all the taskcluster::* bugzilla components have a triage owner?  When we have intermittent failures in infrastructure there needs to be a person who can act as a point of contact.  For this bug, I will follow up in bug 1180187 which has more actionable activity.
Flags: needinfo?(jmaher) → needinfo?(coop)
We are in the process of finding triage owners for all components, but most of the team is also away this week, which is why many bugs are sitting idle.
Flags: needinfo?(coop)
:coop, can you assign this to someone to be a point of contact? I know this might be difficult to fix, but it is good to have someone who understands the problem, even if there are dependencies that need to be fixed.
Flags: needinfo?(coop)
I'm assigning this to Pete and he will tell me why that was the wrong thing to do.

Honestly, I think getting the newest version of the generic-worker deployed to the builders will make this much easier to fix, and that's what Pete has been focusing on for the past "little" while.
Assignee: nobody → pmoore
Status: NEW → ASSIGNED
Flags: needinfo?(coop)
This should actually be fixed now with the deployment of bug 1399401.

We can leave it open a while, to be sure. I expect OrangeFactor to comment at least once more, since we've only just deployed, but after that I'd expect this issue to be gone.

In short, the exit codes are caused by worker spot terminations. Previously we weren't listening for them, so tasks would suddenly get killed. Now we are ahead of the game: we listen for notifications that give us one or two minutes to abort the task, report a worker shutdown, and perform cleanup (upload artifacts, etc.), so the task should have resolved by the time the worker gets the hard kill.
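
A minimal sketch of that pattern (hypothetical code, not generic-worker's actual implementation): race the running task against the termination notice, and on a notice kill the task, upload artifacts, and resolve the run as worker-shutdown during the grace period:

package main

import (
	"context"
	"log"
	"os/exec"
	"time"
)

func main() {
	// Simulated termination notice arriving 2 seconds in; in reality it
	// comes from polling the metadata endpoint as sketched earlier.
	notice := make(chan time.Time, 1)
	go func() {
		time.Sleep(2 * time.Second)
		notice <- time.Now().Add(2 * time.Minute)
	}()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	// Stand-in for the real task command.
	cmd := exec.CommandContext(ctx, "sleep", "600")
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done:
		log.Printf("task finished on its own: %v", err)
	case t := <-notice:
		cancel() // kill the task process now...
		<-done
		// ...then upload artifacts and resolve exception/worker-shutdown
		// while the one-to-two-minute grace period is still running.
		log.Printf("termination due at %s: resolved worker-shutdown", t)
	}
}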
As we see from comment 26, April 3 was the last day this occurred; indeed it seems to be fixed. \o/
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Summary: Windows Build Bustage with Exit Code: 1073807364/3221225786 / Success Code: 0x0 → Intermittent failures with Exit Code: 1073807364/3221225786 / Success Code: 0x0
the trees are closed because of this and backfills fail for regular jobs :(
Severity: major → blocker
We think this may have been caused by the "fix" for bug 1457979.
I understand as well that the fix has itself been fixed.

10:51:25 <grenade> yes, my earlier patch was too aggressive. fixed now (HaltOnIdle will kill gw if it does not make a log entry within 5 hours, instead of 20 minutes).

We are retriggering some builds now to see if this is actually fixed.
Trees are reopened. afaict, the fix that grenade applied has worked! Please reopen if this happens again.
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → FIXED
Note, *for future reference*: if this error begins occurring again, it means something killed the task process. It did not stop of its own accord; something killed it. ***There was an assassin***.

What the assassin is can vary: a buggy script, a shutdown command, an undetected spot termination, an interactive user RDP'd onto the machine, an unintended Windows Update that forcefully kills processes... a myriad of possibilities to consider.
Summary: Intermittent failures with Exit Code: 1073807364/3221225786 / Success Code: 0x0 → Intermittent failures with Exit Code: 1073807364/3221225786 / Success Code: 0x0 (task process killed by *something*)
Assignee: pmoore → nobody
Hi, there are still some failures with this exit code:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=2a7aef0304b22b0dd3efbfb105ede8ea515d4fb0&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=success&selectedJob=176574693

https://treeherder.mozilla.org/logviewer.html#?job_id=176574693&repo=mozilla-inbound&lineNumber=1424

[taskcluster:error]    Exit Code: 1073807364
[taskcluster:error]    User Time: 15.625ms
[taskcluster:error]  Kernel Time: 15.625ms
[taskcluster:error]    Wall Time: 1m49.3949647s
[taskcluster:error]       Result: FAILED
[taskcluster 2018-05-02T15:21:34.500Z] === Task Finished ===
[taskcluster 2018-05-02T15:21:34.500Z] Task Duration: 1m49.3959979s
[taskcluster 2018-05-02T15:21:35.103Z] Uploading artifact public/logs/localconfig.json from file logs\localconfig.json with content encoding "gzip", mime type "application/octet-stream" and expiry 2019-05-02T14:04:15.986Z
[taskcluster 2018-05-02T15:21:35.555Z] Uploading artifact public/logs/log_critical.log from file logs\log_critical.log with content encoding "gzip", mime type "text/plain" and expiry 2019-05-02T14:04:15.986Z
[taskcluster 2018-05-02T15:21:35.937Z] Uploading artifact public/logs/log_error.log from file logs\log_error.log with content encoding "gzip", mime type "text/plain" and expiry 2019-05-02T14:04:15.986Z
[taskcluster 2018-05-02T15:21:36.600Z] Uploading artifact public/logs/log_fatal.log from file logs\log_fatal.log with content encoding "gzip", mime type "text/plain" and expiry 2019-05-02T14:04:15.986Z
[taskcluster 2018-05-02T15:21:37.722Z] Uploading artifact public/logs/log_info.log from file logs\log_info.log with content encoding "gzip", mime type "text/plain" and expiry 2019-05-02T14:04:15.986Z
[taskcluster 2018-05-02T15:21:38.093Z] Uploading artifact public/logs/log_raw.log from file logs\log_raw.log with content encoding "gzip", mime type "text/plain" and expiry 2019-05-02T14:04:15.986Z
[taskcluster 2018-05-02T15:21:38.501Z] Uploading artifact public/logs/log_warning.log from file logs\log_warning.log with content encoding "gzip", mime type "text/plain" and expiry 2019-05-02T14:04:15.986Z
[taskcluster:error] Uploading error artifact public/test_info from file build/blobber_upload_dir with message "Could not read directory 'Z:\\task_1525273982\\build\\blobber_upload_dir'", reason "file-missing-on-worker" and expiry 2019-05-02T14:04:15.986Z
[taskcluster:error] TASK FAILURE during artifact upload: file-missing-on-worker: Could not read directory 'Z:\task_1525273982\build\blobber_upload_dir'
[taskcluster 2018-05-02T15:21:39.514Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/PSc1s4OcRFSXdninXsZiDw/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2019-05-02T14:04:15.986Z
[taskcluster:error] Task not successful due to following exception(s):
[taskcluster:error] Exception 1)
[taskcluster:error] exit status 1073807364
[taskcluster:error] Exception 2)
[taskcluster:error] file-missing-on-worker: Could not read directory 'Z:\task_1525273982\build\blobber_upload_dir'
[taskcluster:error] 

Can you please take a look?
Flags: needinfo?(pmoore)
(In reply to Cristian Brindusan [:cristian_brindusan] from comment #37)

> Can you please take a look?

The logs show that the worker was shut down by HaltOnIdle mid-task, as follows:

May 02 17:21:03 i-0014c5b2ade476936.gecko-t-win10-64.usw2.mozilla.com HaltOnIdle: Is-ConditionTrue :: generic-worker is running. 
May 02 17:21:04 i-0014c5b2ade476936.gecko-t-win10-64.usw2.mozilla.com HaltOnIdle: Is-ConditionTrue :: Explorer is crashing repeatedy. 
May 02 17:21:04 i-0014c5b2ade476936.gecko-t-win10-64.usw2.mozilla.com HaltOnIdle: instance failed reliability check and will be halted. uptime:  
May 02 17:21:05 i-0014c5b2ade476936.gecko-t-win10-64.usw2.mozilla.com User32: The process C:\Windows\system32\shutdown.exe (I-0014C5B2ADE47) has initiated the shutdown of computer I-0014C5B2ADE47 on behalf of user NT AUTHORITY\SYSTEM for the following reason: Application: Maintenance (Planned)   Reason Code: 0x80040001   Shutdown Type: shutdown   Comment: HaltOnIdle :: instance failed reliability check 


This seems to be related to bug 1457868.
Flags: needinfo?(pmoore)
See Also: → 1457868
Please disregard the above OrangeFactor comment; that should be for bug 1457868.
Thank you.