Closed
Bug 1444168
Opened 7 years ago
Closed 7 years ago
Intermittent failures with Exit Code: 1073807364/3221225786 / Success Code: 0x0 (task process killed by *something*)
Categories
(Taskcluster :: General, defect)
Taskcluster
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: nataliaCs, Unassigned)
References
Details
(Whiteboard: [stockwell infra])
Attachments
(1 file)
124.00 KB,
text/plain
The issue is present on the following push https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=1d01aa7f63ade6f7afc41370cd28cfbc35c14f13
Example log: https://treeherder.mozilla.org/logviewer.html#?job_id=166806532&repo=mozilla-inbound&lineNumber=36687
19:29:03 INFO - Beginning work for file: z:\build\build\src\obj-firefox\toolkit\library\xul.dll
19:29:03 INFO - Processing file: z:\build\build\src\obj-firefox\toolkit\library\xul.dll
19:29:03 INFO - z:\build\build\src\obj-firefox\dist\host\bin\dump_syms.exe z:\build\build\src\obj-firefox\toolkit\library\xul.dll
19:29:03 INFO - Finished processing z:\build\build\src\obj-firefox\toolkit\library\xul.dll in 85.66s
19:29:03 INFO - Running: z:/build/build/src/obj-firefox/_virtualenv/Scripts/python.exe z:/build/build/src\toolkit\crashreporter\tools\symbolstore.py -c --vcs-info -i --install-manifest=z:/build/build/src/obj-firefox\_build_manifests\install\dist_include,z:/build/build/src/obj-firefox\dist\include -s z:/build/build/src z:/build/build/src/obj-firefox\dist\host\bin\dump_syms.exe z:/build/build/src/obj-firefox\dist\crashreporter-symbols z:\build\build\src\obj-firefox\toolkit\library\xul.dll
19:29:03 INFO - mozmake.EXE[4]: Leaving directory 'z:/build/build/src/obj-firefox/toolkit/library'
[taskcluster:error] Exit Code: 1073807364
[taskcluster:error] Success Code: 0x0
[taskcluster:error] User Time: 0s
[taskcluster:error] Kernel Time: 31.25ms
[taskcluster:error] Wall Time: 36m44.6030212s
[taskcluster:error] Peak Memory: 1703936
[taskcluster:error] Result: FAILED
[taskcluster 2018-03-08T19:29:14.145Z] === Task Finished ===
[taskcluster 2018-03-08T19:29:14.145Z] Task Duration: 43m16.7670142s
[taskcluster:error] Uploading error artifact public/build from file public/build with message "Could not read directory 'Z:\\task_1520534506\\public\\build'", reason "file-missing-on-worker" and expiry 2019-03-08T18:45:33.009Z
[taskcluster:error] TASK FAILURE during artifact upload: file-missing-on-worker: Could not read directory 'Z:\task_1520534506\public\build'
[taskcluster 2018-03-08T19:29:15.225Z] Uploading artifact public/logs/certified.log from file generic-worker\certified.log with content encoding "gzip", mime type "text/plain; charset=utf-8" and expiry 2019-03-08T18:45:33.009Z
[taskcluster 2018-03-08T19:29:16.299Z] Uploading artifact public/chainOfTrust.json.asc from file generic-worker\chainOfTrust.json.asc with content encoding "gzip", mime type "text/plain; charset=utf-8" and expiry 2019-03-08T18:45:33.009Z
[taskcluster 2018-03-08T19:29:16.885Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/JGSjjqYkRMm3kcCEtWQ2ug/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2019-03-08T18:45:33.009Z
[taskcluster:error] Task not successful due to following exception(s):
[taskcluster:error] Exception 1)
[taskcluster:error] []
[taskcluster:error]
[taskcluster:error] Exit code: 1073807364
[taskcluster:error] Exception 2)
[taskcluster:error] file-missing-on-worker: Could not read directory 'Z:\task_1520534506\public\build'
[taskcluster:error]
RESULTED IN TREE CLOSURE
Updated•7 years ago
Severity: normal → blocker
Comment hidden (Intermittent Failures Robot)
Comment 2•7 years ago
tl;dr: this was caused by spot terminations (we had > 1000 spot terminations in the last day), combined with the fact that bug 1180187 is not yet implemented, so tasks were simply failing rather than resolving as exception/worker-shutdown[1]. I will look into implementing bug 1180187 as soon as possible to mitigate this, so that in future tasks are resolved as exception/worker-shutdown (blue in treeherder) rather than failed/failed (red in treeherder), and automatically retried.
More detailed explanation
=========================
When implementing bug 1333957, the task logging was updated so that errors were logged with a different syntax that treeherder could parse. This involved removing the timestamp from the error line, as required by treeherder. This means that we are now able to find errors automatically that we could not find before. Maybe this helped classify this issue more uniformly than we were able to do before - of this part I am not sure.
However, the rate of spot terminations was high during this time (perhaps higher demand in AWS for this instance type in the regions where we had provisioned nodes) - but whatever the reason, we were getting spot terminations.
Spot request termination detection has been built into OCC[2], which was able to log the spot termination in the worker log[3] at 19:28:13 UTC, for a planned termination 58 seconds later at 19:29:11. Assuming the shutdown was issued on time, the machine would have begun shutting down at 19:29:11, and two seconds later at 19:29:13 we see the exit code 0x40010004[4] = 1073807364 = DBG_TERMINATE_PROCESS = "Debugger terminated the process"[5] as reported in the task log[6].
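The hex/decimal correspondence above is easy to verify. A minimal Python sketch, using the NTSTATUS names from references [4] and [5] (the helper function is illustrative, not from any worker code):

```python
# NTSTATUS codes seen as process exit codes in these task logs.
DBG_TERMINATE_PROCESS = 0x40010004  # "Debugger terminated the process"
STATUS_CONTROL_C_EXIT = 0xC000013A  # process terminated by CTRL+C / forced kill

def ntstatus_to_exit_code(status):
    """Windows reports the raw NTSTATUS as the process exit code; taskcluster
    logs show it as an unsigned 32-bit decimal value."""
    return status & 0xFFFFFFFF

# The two exit codes reported in this bug:
assert ntstatus_to_exit_code(DBG_TERMINATE_PROCESS) == 1073807364
assert ntstatus_to_exit_code(STATUS_CONTROL_C_EXIT) == 3221225786
```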
--
[1] https://docs.taskcluster.net/reference/platform/taskcluster-queue/references/api#get-task-status
[2] https://github.com/mozilla-releng/OpenCloudConfig/blob/cb7a8948b9068302a062a9badec6d88ca3b4bc6b/userdata/HaltOnIdle.ps1#L63-L77
[3] https://papertrailapp.com/systems/1668716761/events?focus=908486543633731587&selected=908486543633731587 :
> Mar 08 19:28:13 i-0f019b81cdfaf7468.gecko-3-b-win2012.usw1.mozilla.com HaltOnIdle: Is-Terminating :: spot termination notice received: 2018-03-08T19:29:11Z.
[4] https://stackoverflow.com/questions/37078953/in-which-cases-does-program-exit-with-0x40010004-code
[5] https://msdn.microsoft.com/en-us/library/cc704588.aspx?f=255&MSPPError=-2147217396
[6] https://treeherder.mozilla.org/logviewer.html#?job_id=166808009&repo=mozilla-inbound&lineNumber=55501 :
> 19:29:13 INFO - mozmake.EXE[4]: Leaving directory 'z:/build/build/src/obj-firefox/gfx/angle/targets/libANGLE'
> [taskcluster:error] Exit Code: 1073807364
See Also: → 1180187
Comment 3•7 years ago
Note, trees are no longer closed.
Updated•7 years ago
Comment 4•7 years ago
This is a list of all instances which terminated because they were spot killed by the EC2 service in the last 24 hours (give or take a few minutes to do this attachment). The query run was:
ec2-manager::CRIMSON=> select region, "workerType", "instanceType", id, terminated from terminations where code = 'Server.SpotInstanceTermination' and terminated > now() - interval '1 day';
This was the result of a ping on IRC, so please forgive me if this isn't the correct place for this attachment.
Updated•7 years ago
Severity: blocker → major
Summary: TREE CLOSURE Windows Build Bustage with Exit Code: 1073807364, with Uploading error artifact public/build from file public/build → Windows Build Bustage with Exit Code: 1073807364, with Uploading error artifact public/build from file public/build
Comment 5•7 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #2)
> This means that now we are able to find errors automatically,
> that we could not before. Maybe this helped classify this issue more
> uniformly than we were able to do before - this part I am not sure.
To test this theory, I searched the new ec2-manager postgres database, to find spot termination notices that occurred before upgrading generic-worker to provide the new error messages, and found e.g.
https://treeherder.mozilla.org/logviewer.html#?job_id=165524756&repo=nss-try&lineNumber=178
> [taskcluster 2018-03-02T14:10:26.295Z] Exit Code: 3221225786
Note, this is a similar exit code signifying process termination (STATUS_CONTROL_C_EXIT)[1].
This job corresponds to:
> ec2-manager::CRIMSON=> select id, "workerType", region, reason, terminated from terminations where id = 'i-0f20dd96e96d8f45d';
> id | workerType | region | reason | terminated
> ---------------------+---------------+-----------+---------------------------+------------------------
> i-0f20dd96e96d8f45d | nss-win2012r2 | us-east-1 | Spot instance termination | 2018-03-02 14:11:21+00
> (1 row)
> ec2-manager::CRIMSON=> select generated from cloudwatchlog where id = 'i-0f20dd96e96d8f45d' and state='shutting-down';
> generated
> ------------------------
> 2018-03-02 14:10:25+00
> (1 row)
So in this case from a week ago, before any worker upgrade, we see that a spot termination did cause a task to fail (in this case an NSS "Windows 2012 opt Windows 2012 32 (opt) (B)" job), but that failure was not classified by treeherder, whereas now treeherder is able to parse the task log error message and report it.
Summary
=======
It looks like we've *always* had this problem in generic-worker, we just weren't classifying it before, because the worker logs weren't suitable for treeherder to report on this issue.
With improved logging as a consequence of bug 1333957, we've now discovered an issue we've had for the last 3 years, namely that we've not been handling task resolutions due to spot terminations properly on Windows. The path to resolution is to finally implement bug 1180187, which unfortunately slipped through the net until now.
--
[1] https://stackoverflow.com/questions/25444765/why-does-a-non-interactive-batch-script-think-ive-pressed-control-c
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Updated•7 years ago
Summary: Windows Build Bustage with Exit Code: 1073807364, with Uploading error artifact public/build from file public/build → Windows Build Bustage with Exit Code: 1073807364/3221225786
Updated•7 years ago
Summary: Windows Build Bustage with Exit Code: 1073807364/3221225786 → Windows Build Bustage with Exit Code: 1073807364/3221225786 / Success Code: 0x0
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Updated•7 years ago
Whiteboard: [stockwell infra]
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 16•7 years ago
Update:
There have been 81 failures in the last week.
Summary: TREE CLOSURE Windows Build Bustage with Exit Code: 1073807364, with Uploading error artifact public/build from file public/build
Affected platforms: Windows 2012 x64, windows2012-32, windows2012-32-noopt
Depends on: 1180187
Flags: needinfo?(jmaher)
Comment 17•7 years ago
:coop, can you ensure that all the taskcluster::* bugzilla components have a triage owner? When we have intermittent failures in infrastructure there needs to be a person who can act as a point of contact. For this bug, I will follow up in bug 1180187 which has more actionable activity.
Flags: needinfo?(jmaher) → needinfo?(coop)
Comment 18•7 years ago
We are in the process of finding triage owners for all components, but most of the team is also away this week, which is why many bugs are sitting idle.
Flags: needinfo?(coop)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 22•7 years ago
:coop, can you assign this to someone to be a point of contact? I know this might be difficult to fix, but it is good to have someone who understands the problem even if there are dependencies that need to be fixed.
Flags: needinfo?(coop)
Comment hidden (Intermittent Failures Robot)
Comment 24•7 years ago
I'm assigning this to Pete and he will tell me why that was the wrong thing to do.
Honestly, I think getting the newest version of the generic-worker deployed to the builders will make this much easier to fix, and that's what Pete has been focusing on for the past "little" while.
Assignee: nobody → pmoore
Status: NEW → ASSIGNED
Flags: needinfo?(coop)
Comment 25•7 years ago
This should actually be fixed now with the deployment of bug 1399401.
We can leave it open a while, to be sure. I expect OrangeFactor to comment at least once more, since we've only just deployed, but after that, I'd expect this issue to have gone.
In short, the exit codes are caused by worker spot terminations. Previously we weren't listening for them, so suddenly tasks would get killed. Now we are ahead of the game, and listen for notifications that give us one or two minutes to abort the task, report a worker shutdown, and perform cleanup (upload artifacts, etc) so the task should have resolved by the time the worker gets the hard kill.
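The "listen for notifications" mechanism described above can be sketched against the AWS-documented EC2 instance metadata endpoint for spot termination notices. This is a hypothetical illustration of the idea, not generic-worker's actual implementation; the function names are invented:

```python
import datetime
import urllib.error
import urllib.request

# AWS-documented metadata endpoint: returns an ISO-8601 UTC time
# (e.g. "2018-03-08T19:29:11Z") once termination is scheduled, 404 otherwise.
TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"

def check_for_termination():
    """Poll the metadata service; None means no termination is scheduled."""
    try:
        with urllib.request.urlopen(TERMINATION_URL, timeout=2) as resp:
            return resp.read().decode()
    except (urllib.error.URLError, OSError):
        return None

def seconds_until_termination(notice, now):
    """Grace period remaining, given the notice string from the endpoint."""
    when = datetime.datetime.strptime(notice, "%Y-%m-%dT%H:%M:%SZ")
    return (when - now).total_seconds()

# From the HaltOnIdle log in comment 2: notice received at 19:28:13 UTC
# for a termination scheduled at 19:29:11 UTC, i.e. 58 seconds of grace.
now = datetime.datetime(2018, 3, 8, 19, 28, 13)
assert seconds_until_termination("2018-03-08T19:29:11Z", now) == 58.0
```

Within that grace window the worker can abort the task, upload artifacts, and resolve the run as exception/worker-shutdown before the hard kill arrives.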
Comment hidden (Intermittent Failures Robot)
Comment 27•7 years ago
As we see from comment 26, April 3 was the last day this occurred; indeed it seems to be fixed. \o/
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Updated•7 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment hidden (Intermittent Failures Robot)
Updated•7 years ago
Summary: Windows Build Bustage with Exit Code: 1073807364/3221225786 / Success Code: 0x0 → Intermittent failures with Exit Code: 1073807364/3221225786 / Success Code: 0x0
Comment 31•7 years ago
The trees are closed because of this, and backfills fail for regular jobs :(
Updated•7 years ago
Severity: major → blocker
Comment 32•7 years ago
We think this may have been caused by the "fix" for bug 1457979.
Comment 33•7 years ago
I understand as well that the fix has now itself been fixed:
10:51:25 <grenade> yes, my earlier patch was too aggressive. fixed now (HaltOnIdle will kill gw if it does not make a log entry within 5 hours, instead of 20 minutes).
We are retriggering some builds now to see if this is actually fixed.
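The HaltOnIdle change quoted above amounts to a watchdog on generic-worker's log activity. A rough sketch of the decision logic, where the 20-minute and 5-hour thresholds come from grenade's IRC quote and the function shape is an assumption for illustration:

```python
import datetime

# Thresholds from the IRC quote: the earlier, too-aggressive patch halted the
# worker after 20 minutes without a generic-worker log entry; the fix raised
# that to 5 hours.
OLD_THRESHOLD = datetime.timedelta(minutes=20)
NEW_THRESHOLD = datetime.timedelta(hours=5)

def should_halt(last_log_entry, now, threshold):
    """Halt the worker only if generic-worker has been silent for longer
    than the threshold."""
    return (now - last_log_entry) > threshold

# A build working quietly for 40 minutes: killed mid-task by the old rule
# (producing the exit codes in this bug), left alone by the new one.
last = datetime.datetime(2018, 5, 2, 14, 0)
now = datetime.datetime(2018, 5, 2, 14, 40)
assert should_halt(last, now, OLD_THRESHOLD) is True
assert should_halt(last, now, NEW_THRESHOLD) is False
```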
Comment 34•7 years ago
Trees are reopened. afaict, the fix that grenade applied has worked! Please reopen if this happens again.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
Comment hidden (Intermittent Failures Robot)
Comment 36•7 years ago
Note, *for future reference*: if this error begins occurring again, it means something killed the process; it stopped running because something killed it; ***there was an assassin***.
What the assassin is can vary (a buggy script, a shutdown command, an undetected spot termination, an interactive user RDP'd onto the machine, an unintended Windows Update that forcefully kills processes... a myriad of possibilities to consider).
Updated•7 years ago
Summary: Intermittent failures with Exit Code: 1073807364/3221225786 / Success Code: 0x0 → Intermittent failures with Exit Code: 1073807364/3221225786 / Success Code: 0x0 (task process killed by *something*)
Updated•7 years ago
Assignee: pmoore → nobody
Comment 37•7 years ago
Hi, there are still some failures with this exit code:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=2a7aef0304b22b0dd3efbfb105ede8ea515d4fb0&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=success&selectedJob=176574693
https://treeherder.mozilla.org/logviewer.html#?job_id=176574693&repo=mozilla-inbound&lineNumber=1424
[taskcluster:error] Exit Code: 1073807364
[taskcluster:error] User Time: 15.625ms
[taskcluster:error] Kernel Time: 15.625ms
[taskcluster:error] Wall Time: 1m49.3949647s
[taskcluster:error] Result: FAILED
[taskcluster 2018-05-02T15:21:34.500Z] === Task Finished ===
[taskcluster 2018-05-02T15:21:34.500Z] Task Duration: 1m49.3959979s
[taskcluster 2018-05-02T15:21:35.103Z] Uploading artifact public/logs/localconfig.json from file logs\localconfig.json with content encoding "gzip", mime type "application/octet-stream" and expiry 2019-05-02T14:04:15.986Z
[taskcluster 2018-05-02T15:21:35.555Z] Uploading artifact public/logs/log_critical.log from file logs\log_critical.log with content encoding "gzip", mime type "text/plain" and expiry 2019-05-02T14:04:15.986Z
[taskcluster 2018-05-02T15:21:35.937Z] Uploading artifact public/logs/log_error.log from file logs\log_error.log with content encoding "gzip", mime type "text/plain" and expiry 2019-05-02T14:04:15.986Z
[taskcluster 2018-05-02T15:21:36.600Z] Uploading artifact public/logs/log_fatal.log from file logs\log_fatal.log with content encoding "gzip", mime type "text/plain" and expiry 2019-05-02T14:04:15.986Z
[taskcluster 2018-05-02T15:21:37.722Z] Uploading artifact public/logs/log_info.log from file logs\log_info.log with content encoding "gzip", mime type "text/plain" and expiry 2019-05-02T14:04:15.986Z
[taskcluster 2018-05-02T15:21:38.093Z] Uploading artifact public/logs/log_raw.log from file logs\log_raw.log with content encoding "gzip", mime type "text/plain" and expiry 2019-05-02T14:04:15.986Z
[taskcluster 2018-05-02T15:21:38.501Z] Uploading artifact public/logs/log_warning.log from file logs\log_warning.log with content encoding "gzip", mime type "text/plain" and expiry 2019-05-02T14:04:15.986Z
[taskcluster:error] Uploading error artifact public/test_info from file build/blobber_upload_dir with message "Could not read directory 'Z:\\task_1525273982\\build\\blobber_upload_dir'", reason "file-missing-on-worker" and expiry 2019-05-02T14:04:15.986Z
[taskcluster:error] TASK FAILURE during artifact upload: file-missing-on-worker: Could not read directory 'Z:\task_1525273982\build\blobber_upload_dir'
[taskcluster 2018-05-02T15:21:39.514Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/PSc1s4OcRFSXdninXsZiDw/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2019-05-02T14:04:15.986Z
[taskcluster:error] Task not successful due to following exception(s):
[taskcluster:error] Exception 1)
[taskcluster:error] exit status 1073807364
[taskcluster:error] Exception 2)
[taskcluster:error] file-missing-on-worker: Could not read directory 'Z:\task_1525273982\build\blobber_upload_dir'
[taskcluster:error]
Can you please take a look?
Flags: needinfo?(pmoore)
Comment 38•7 years ago
(In reply to Cristian Brindusan [:cristian_brindusan] from comment #37)
> Can you please take a look?
The logs show that the worker was shutdown by HaltOnIdle mid-task as follows:
May 02 17:21:03 i-0014c5b2ade476936.gecko-t-win10-64.usw2.mozilla.com HaltOnIdle: Is-ConditionTrue :: generic-worker is running.
May 02 17:21:04 i-0014c5b2ade476936.gecko-t-win10-64.usw2.mozilla.com HaltOnIdle: Is-ConditionTrue :: Explorer is crashing repeatedy.
May 02 17:21:04 i-0014c5b2ade476936.gecko-t-win10-64.usw2.mozilla.com HaltOnIdle: instance failed reliability check and will be halted. uptime:
May 02 17:21:05 i-0014c5b2ade476936.gecko-t-win10-64.usw2.mozilla.com User32: The process C:\Windows\system32\shutdown.exe (I-0014C5B2ADE47) has initiated the shutdown of computer I-0014C5B2ADE47 on behalf of user NT AUTHORITY\SYSTEM for the following reason: Application: Maintenance (Planned) Reason Code: 0x80040001 Shutdown Type: shutdown Comment: HaltOnIdle :: instance failed reliability check
This seems to be related to bug 1457868.
Flags: needinfo?(pmoore)
See Also: → 1457868
Comment 39•7 years ago
Comment hidden (Intermittent Failures Robot)
Comment 41•7 years ago
Please disregard the above OrangeFactor comment, that should be for bug 1457868.
Thank you.
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)