Closed Bug 1544403 Opened 4 months ago Closed 4 months ago

worker should log an exception with queue on exit code 1073807364 in order to trigger a retry

Categories

(Taskcluster :: Workers, enhancement)

Unspecified
Windows
enhancement
Not set

Tracking

(Not tracked)

RESOLVED FIXED
mozilla68

People

(Reporter: grenade, Assigned: grenade)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

on windows builders in gcp on tier 3, we see a significant number of builds failing with a final log message of:

[taskcluster:error] exit status 1073807364

the symptom is an old one and can be caused by different issues. typically it occurs when the instance is lost mid-task through a reboot or instance termination that is unexpected by the worker process.

i initially assumed we were getting terminations from gcloud but the papertrail instance logs don't really support that assumption. for example, here are the last few lines of a build log:

11:32:10     INFO -  mozmake.EXE[5]: Leaving directory 'z:/task_1555325675/build/src/obj-firefox/media/webrtc/trunk/gtest'
11:32:10     INFO -  mozmake.EXE[5]: Entering directory 'z:/task_1555325675/build/src/obj-firefox/media/webrtc/trunk/gtest'
11:32:10     INFO -  media/webrtc/trunk/gtest/webrtc-gtest.exe
11:32:10     INFO -  mozmake.EXE[5]: Leaving directory 'z:/task_1555325675/build/src/obj-firefox/media/webrtc/trunk/gtest'
11:32:16     INFO -  mozmake.EXE[5]: Entering directory 'z:/task_1555325675/build/src/obj-firefox/media/webrtc/trunk/gtest'
11:32:16     INFO -  z:/task_1555325675/build/src/clang/bin/lld-link.exe -NOLOGO -OUT:../../../../dist/bin/webrtc-gtest.exe -PDB:webrtc-gtest.pdb -IMPLIB:webrtc-gtest.lib -SUBSYSTEM:CONSOLE,6.01 -STACK:1572864  -guard:cf,nolongjmp -LARGEADDRESSAWARE -RELEASE -SAFESEH -DEBUG -OPT:REF,ICF clang_rt.profile-i386.lib   @z:/task_1555325675/build/src/obj-firefox/media/webrtc/trunk/gtest/webrtc-gtest_exe_pgo.list ./module.res   ..\..\..\..\mozglue\build\mozglue.lib kernel32.lib user32.lib gdi32.lib winmm.lib wsock32.lib advapi32.lib secur32.lib winmm.lib Strmiids.lib d3d11.lib dxgi.lib amstrmid.lib dmoguids.lib msdmo.lib shell32.lib shlwapi.lib strmiids.lib ws2_32.lib wmcodecdspuuid.lib
11:32:16     INFO -  lld-link: warning: ignoring unknown argument: -RELEASE
11:32:16     INFO -  mozmake.EXE[5]: Leaving directory 'z:/task_1555325675/build/src/obj-firefox/media/webrtc/trunk/gtest'
11:32:16     INFO -  mozmake.EXE[5]: Entering directory 'z:/task_1555325675/build/src/obj-firefox/media/webrtc/trunk/gtest'
11:32:16     INFO -  touch -t `date +%Y%m%d%H%M.%S -d 'now+5seconds'` pgo.relink
11:32:16     INFO -  mozmake.EXE[5]: Leaving directory 'z:/task_1555325675/build/src/obj-firefox/media/webrtc/trunk/gtest'
11:32:17     INFO -  mozmake.EXE[5]: Nothing to be done for 'syms'.
[taskcluster 2019-04-15T11:39:30.853Z]    Exit Code: 1073807364
[taskcluster 2019-04-15T11:39:30.853Z]    User Time: 0s
[taskcluster 2019-04-15T11:39:30.853Z]  Kernel Time: 0s
[taskcluster 2019-04-15T11:39:30.853Z]    Wall Time: 32m10.7963246s
[taskcluster 2019-04-15T11:39:30.853Z]       Result: FAILED
[taskcluster 2019-04-15T11:39:30.853Z] === Task Finished ===
[taskcluster 2019-04-15T11:39:30.853Z] Task Duration: 42m44.7676934s
[taskcluster 2019-04-15T11:39:31.367Z] Uploading artifact public/logs/localconfig.json from file logs\localconfig.json with content encoding "gzip", mime type "application/octet-stream" and expiry 2019-05-13T10:34:46.420Z
[taskcluster 2019-04-15T11:39:32.040Z] Uploading artifact public/logs/certified.log from file generic-worker\certified.log with content encoding "gzip", mime type "text/plain; charset=utf-8" and expiry 2019-05-13T10:34:46.420Z
[taskcluster 2019-04-15T11:39:32.975Z] Uploading artifact public/chain-of-trust.json from file generic-worker\chain-of-trust.json with content encoding "gzip", mime type "text/plain; charset=utf-8" and expiry 2019-05-13T10:34:46.420Z
[taskcluster 2019-04-15T11:39:33.343Z] Uploading artifact public/chain-of-trust.json.sig from file generic-worker\chain-of-trust.json.sig with content encoding "gzip", mime type "application/octet-stream" and expiry 2019-05-13T10:34:46.420Z
[taskcluster 2019-04-15T11:39:33.880Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/Va8tiYL9T5GvQIagmugNIg/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2019-05-13T10:34:46.420Z
[taskcluster:error] exit status 1073807364

and here's the last few entries from the instance papertrail logs at the same time:


Apr 15 11:38:01 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com HaltOnIdle: Test-InstanceProductivity :: begin - 2019-04-15T11:38:01.1562091Z 
Apr 15 11:38:01 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com HaltOnIdle: Is-Terminating :: begin - 2019-04-15T11:38:01.1869996Z 
Apr 15 11:38:01 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com HaltOnIdle: Is-Terminating :: end - 2019-04-15T11:38:01.2065237Z 
Apr 15 11:38:01 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com HaltOnIdle: Is-ConditionTrue :: drive Y: is detected. 
Apr 15 11:38:01 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com HaltOnIdle: Is-ConditionTrue :: drive Z: is detected. 
Apr 15 11:38:01 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com HaltOnIdle: Is-ConditionTrue :: generic-worker is running. 
Apr 15 11:38:01 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com HaltOnIdle: Test-InstanceProductivity :: instance appears to be productive. 
Apr 15 11:38:01 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com HaltOnIdle: Test-InstanceProductivity :: end - 2019-04-15T11:38:01.2670906Z 
Apr 15 11:38:45 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com nxlog: ERROR Failed to retrieve eventlog fields; The handle is invalid.   
Apr 15 11:38:52 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com nxlog: ERROR last message repeated 2 times 
Apr 15 11:39:02 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com nxlog: ERROR Failed to retrieve eventlog fields; The handle is invalid.   
Apr 15 11:39:32 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Request 
Apr 15 11:39:32 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: PUT /Va8tiYL9T5GvQIagmugNIg/0/public/logs/localconfig.json?AWSAccessKeyId=AKIAJTPURLXKI2RSRWOQ&Content-Type=application%2Foctet-stream&Expires=1555331081&Signature=B%2BaFuLYamQ9LlVQ9qA52%2BjCaf8o%3D HTTP/1.1  Host: taskcluster-public-artifacts.s3.us-west-2.amazonaws.com  User-Agent: Go-http-client/1.1  Content-Length: 1331  Content-Encoding: gzip  Content-Type: application/octet-stream  Accept-Encoding: gzip   
Apr 15 11:39:32 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: 1 put requests issued to https://taskcluster-public-artifacts.s3.us-west-2.amazonaws.com/Va8tiYL9T5GvQIagmugNIg/0/public/logs/localconfig.json?AWSAccessKeyId=AKIAJTPURLXKI2RSRWOQ&Content-Type=application%2Foctet-stream&Expires=1555331081&Signature=B%2BaFuLYamQ9LlVQ9qA52%2BjCaf8o%3D 
Apr 15 11:39:32 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Response 
Apr 15 11:39:32 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: HTTP/1.1 200 OK  Content-Length: 0  Date: Mon, 15 Apr 2019 11:39:32 GMT  Etag: "d9c56c6b785ca907d72ed166cb6309a1"  Server: AmazonS3  X-Amz-Id-2: 4T/eElP/LpdGoNpJjtRkhxSdnKoNDVQtSGZQkAh7cvbSeb4pJ8dmhJOUdvbGSFBmmBGLTUAldlI=  X-Amz-Request-Id: F34D3872B0E69178  X-Amz-Version-Id: fW9_rF9cY3bunOIP_55PcnZimv1LHarL   
Apr 15 11:39:32 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Stopping task feature Chain of Trust... 
Apr 15 11:39:33 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Request 
Apr 15 11:39:33 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: PUT /Va8tiYL9T5GvQIagmugNIg/0/public/logs/certified.log?AWSAccessKeyId=AKIAJTPURLXKI2RSRWOQ&Content-Type=text%2Fplain%3B%20charset%3Dutf-8&Expires=1555331082&Signature=H%2FC9YPwtFQCXuhBdIxSZ1zXuN3Y%3D HTTP/1.1  Host: taskcluster-public-artifacts.s3.us-west-2.amazonaws.com  User-Agent: Go-http-client/1.1  Content-Length: 476139  Content-Encoding: gzip  Content-Type: text/plain; charset=utf-8  Accept-Encoding: gzip   
Apr 15 11:39:33 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: 1 put requests issued to https://taskcluster-public-artifacts.s3.us-west-2.amazonaws.com/Va8tiYL9T5GvQIagmugNIg/0/public/logs/certified.log?AWSAccessKeyId=AKIAJTPURLXKI2RSRWOQ&Content-Type=text%2Fplain%3B%20charset%3Dutf-8&Expires=1555331082&Signature=H%2FC9YPwtFQCXuhBdIxSZ1zXuN3Y%3D 
Apr 15 11:39:33 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Response 
Apr 15 11:39:33 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: HTTP/1.1 200 OK  Content-Length: 0  Date: Mon, 15 Apr 2019 11:39:33 GMT  Etag: "589cc5a2881ec948da215aab72535922"  Server: AmazonS3  X-Amz-Id-2: G5Qqf9Mrjt2nP8cu06cxvU56EQDSV8zYbPFSRtwiRvUIgGWXo0ThoZ69HvM1p+UJW4AtFNl/rXE=  X-Amz-Request-Id: B0FD95DFF30CE0D2  X-Amz-Version-Id: DWfn9mpCscE6tTyGav8M6QA.E5_f6ZnW   
Apr 15 11:39:33 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Request 
Apr 15 11:39:33 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: PUT /Va8tiYL9T5GvQIagmugNIg/0/public/chain-of-trust.json?AWSAccessKeyId=AKIAJTPURLXKI2RSRWOQ&Content-Type=text%2Fplain%3B%20charset%3Dutf-8&Expires=1555331083&Signature=GZKF%2F9TOcm3luu1Se9LyB9xcIAg%3D HTTP/1.1  Host: taskcluster-public-artifacts.s3.us-west-2.amazonaws.com  User-Agent: Go-http-client/1.1  Content-Length: 2287  Content-Encoding: gzip  Content-Type: text/plain; charset=utf-8  Accept-Encoding: gzip   
Apr 15 11:39:33 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: 1 put requests issued to https://taskcluster-public-artifacts.s3.us-west-2.amazonaws.com/Va8tiYL9T5GvQIagmugNIg/0/public/chain-of-trust.json?AWSAccessKeyId=AKIAJTPURLXKI2RSRWOQ&Content-Type=text%2Fplain%3B%20charset%3Dutf-8&Expires=1555331083&Signature=GZKF%2F9TOcm3luu1Se9LyB9xcIAg%3D 
Apr 15 11:39:33 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Response 
Apr 15 11:39:33 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: HTTP/1.1 200 OK  Content-Length: 0  Date: Mon, 15 Apr 2019 11:39:34 GMT  Etag: "fa0b6e3aab78925230e370c09158ddc6"  Server: AmazonS3  X-Amz-Id-2: XyieHF5QNmTod3TkSP5ZTRJjDQ+89u+pEuquH8NGcdHutDPIVu3ZfWDsLiRsJXFgkmf1s/afIhw=  X-Amz-Request-Id: DA6B11EAFD867C89  X-Amz-Version-Id: YFzf6HgWN_.fOght8tcDZYngjL9Dmtc_   
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Request 
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: PUT /Va8tiYL9T5GvQIagmugNIg/0/public/chain-of-trust.json.sig?AWSAccessKeyId=AKIAJTPURLXKI2RSRWOQ&Content-Type=application%2Foctet-stream&Expires=1555331083&Signature=ydoCiJ2HEddW%2B4Pkwii8GNutPqs%3D HTTP/1.1  Host: taskcluster-public-artifacts.s3.us-west-2.amazonaws.com  User-Agent: Go-http-client/1.1  Content-Length: 116  Content-Encoding: gzip  Content-Type: application/octet-stream  Accept-Encoding: gzip   
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: 1 put requests issued to https://taskcluster-public-artifacts.s3.us-west-2.amazonaws.com/Va8tiYL9T5GvQIagmugNIg/0/public/chain-of-trust.json.sig?AWSAccessKeyId=AKIAJTPURLXKI2RSRWOQ&Content-Type=application%2Foctet-stream&Expires=1555331083&Signature=ydoCiJ2HEddW%2B4Pkwii8GNutPqs%3D 
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Response 
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: HTTP/1.1 200 OK  Content-Length: 0  Date: Mon, 15 Apr 2019 11:39:34 GMT  Etag: "8c69926670ca9ad05dbc93b05859edf0"  Server: AmazonS3  X-Amz-Id-2: T2kJBEV1dPP6zvmLfZ04vQ/OCzlPSMVeKycnm3AreiWg6L443iLwy+lNIf66lrZLE4hMGkFL7iE=  X-Amz-Request-Id: 27E059F90EE9F495  X-Amz-Version-Id: M7O07vvCBzAwfbU2R6Z9Q45NBVFieFma   
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Stopping task feature Supersede... 
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Stopping task feature Mounts/Caches... 
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Stopping task feature OS Groups... 
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Stopping task feature Taskcluster Proxy... 
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Stopped taskcluster proxy process (PID 3740) 
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Stopping task feature Live Log... 
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Redirecting public/logs/live.log to public/logs/live_backing.log 
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: WARNING: could not terminate livelog exposure: close tcp [::]:60023: use of closed network connection 
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com nxlog: ERROR last message repeated 3 times 
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com nxlog: ERROR Failed to retrieve eventlog fields; The handle is invalid.   
Apr 15 11:39:35 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Request 
Apr 15 11:39:35 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: PUT /Va8tiYL9T5GvQIagmugNIg/0/public/logs/live_backing.log?AWSAccessKeyId=AKIAJTPURLXKI2RSRWOQ&Content-Type=text%2Fplain%3B%20charset%3Dutf-8&Expires=1555331084&Signature=8T%2FC%2F97nslh2gbxC71POu5fpVt8%3D HTTP/1.1  Host: taskcluster-public-artifacts.s3.us-west-2.amazonaws.com  User-Agent: Go-http-client/1.1  Content-Length: 476349  Content-Encoding: gzip  Content-Type: text/plain; charset=utf-8  Accept-Encoding: gzip   
Apr 15 11:39:35 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: 1 put requests issued to https://taskcluster-public-artifacts.s3.us-west-2.amazonaws.com/Va8tiYL9T5GvQIagmugNIg/0/public/logs/live_backing.log?AWSAccessKeyId=AKIAJTPURLXKI2RSRWOQ&Content-Type=text%2Fplain%3B%20charset%3Dutf-8&Expires=1555331084&Signature=8T%2FC%2F97nslh2gbxC71POu5fpVt8%3D 
Apr 15 11:39:35 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Response 
Apr 15 11:39:35 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: HTTP/1.1 200 OK  Content-Length: 0  Date: Mon, 15 Apr 2019 11:39:35 GMT  Etag: "8f47d39d054c22b12398614337344af8"  Server: AmazonS3  X-Amz-Id-2: bs1nKj6vth2jhB9gEltHaK3KGipHYX5PNwj/cRBRe2g9FOlB6rcZNX5WdoBhljvU5oUQoLeZ3H4=  X-Amz-Request-Id: FAA486598297BAB9  X-Amz-Version-Id: o7ZwmW92hKLOw0xpxgGfHQHSvhboQW75   
Apr 15 11:39:35 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Resolving task Va8tiYL9T5GvQIagmugNIg ... 
Apr 15 11:39:35 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: ERROR(s) encountered: exit status 1073807364 
Apr 15 11:39:35 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Making system call GetProfilesDirectoryW with args: [0 C0422161FC] 
Apr 15 11:39:35 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker:   Result: 0 0 The data area passed to a system call is too small. 
Apr 15 11:39:35 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Making system call GetProfilesDirectoryW with args: [C0423B45C0 C0422161FC] 
Apr 15 11:39:35 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker:   Result: 1 0 The operation completed successfully. 
Apr 15 11:39:35 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Looking for existing task users to delete... 
Apr 15 11:39:36 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Resolved 1 tasks in total so far. 
Apr 15 11:39:36 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Saving file file-caches.json (absolute path: C:\generic-worker\file-caches.json) 
Apr 15 11:39:36 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Saving file directory-caches.json (absolute path: C:\generic-worker\directory-caches.json) 
Apr 15 11:39:36 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Exiting worker with exit code 67 
Apr 15 11:39:36 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com GCEMetadataScripts: INFO: 2019/04/15 11:39:35.594969 main.go:407: GCEMetadataScripts: Starting shutdown scripts (version 4.2.1@1).  
Apr 15 11:39:36 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com GCEMetadataScripts: INFO: 2019/04/15 11:39:35.634992 main.go:416: GCEMetadataScripts: No shutdown scripts to run.  
Apr 15 11:39:36 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com GCEMetadataScripts: INFO: 2019/04/15 11:39:35.645749 main.go:421: GCEMetadataScripts: Finished running shutdown scripts.  
Apr 15 11:39:36 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com User32: The process C:\Windows\system32\shutdown.exe (VM-SMX3T1SFAI5D) has initiated the restart of computer VM-SMX3T1SFAI5D on behalf of user NT AUTHORITY\SYSTEM for the following reason: Application: Maintenance (Planned)   Reason Code: 0x80040001   Shutdown Type: restart   Comment: rebooting; generic worker task run completed 
Apr 15 11:39:37 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com nxlog: ERROR last message repeated 3 times 
Apr 15 11:39:38 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com Service_Control_Manager: The sshd service terminated unexpectedly.  It has done this 1 time(s). 
Apr 15 11:39:39 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com nxlog: ERROR Failed to retrieve eventlog fields; The handle is invalid.   
Apr 15 11:39:42 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com nxlog: WARNING received a system shutdown request 
Apr 15 11:39:42 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com nxlog: WARNING stopping nxlog service 
Apr 15 11:39:42 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com nxlog: WARNING nxlog-ce received a termination request signal, exiting... 

it's difficult to say what's going on but from the perspective of what's in the logs, it looks like

  • a task finished normally
  • generic-worker exited with an exit code of 67
  • which was interpreted by the wrapper script as a request to reboot
  • the instance never restarted (this suggests something was wrong on the infra front, but i haven't figured out what that is yet)

in any case, retriggering tasks that fail in this way, usually results in successful builds. as such, it feels like we could improve the resilience of what happens or is shown in treeherder by triggering a retry instead of showing a build or test failure.

here's some links to recent failures:

Blocks: 1525096
See Also: → 1527134

We should be able to do this in the gecko task definitions with:

payload:
  ...
  onExitStatus:
    retry:
      - 1073807364

We could include this in all gecko windows tasks in the decision code machinery.

See onExitStatus docs.

(In reply to Pete Moore [:pmoore][:pete] from comment #1)

We should be able to do this in the gecko task definitions with:

payload:
  ...
  onExitStatus:
    retry:
      - 1073807364

We could include this in all gecko windows tasks in the decision code machinery.

See onExitStatus docs.

testing now:
https://hg.mozilla.org/try/rev/7f63803
https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=7f63803

seems to work well.
i deliberately killed worker instance vm-mh2jgoe41ewuocvcr whilst it was running its task.
the livelog recorded these messages before the task run was recorded as an exception and re-triggered:

updating [==================>                            ] 112100/275564 5m46s
updating [==================>                            ] 112500/275564 5m46s
updating [==================>                            ] 112900/275564 5m47s[taskcluster 2019-04-30T20:51:36.164Z]    Exit Code: 3221225786
[taskcluster 2019-04-30T20:51:36.164Z]    User Time: 15.625ms
[taskcluster 2019-04-30T20:51:36.164Z]  Kernel Time: 0s
[taskcluster 2019-04-30T20:51:36.164Z]    Wall Time: 4m14.2193896s
[taskcluster 2019-04-30T20:51:36.164Z]       Result: FAILED
[taskcluster 2019-04-30T20:51:36.164Z] === Task Finished ===
[taskcluster 2019-04-30T20:51:36.164Z] Task Duration: 4m14.4235599s
[taskcluster:error] Uploading error artifact public/logs from file logs with message "Could not read directory 'Z:\\task_1556652867\\logs'", reason "file-missing-on-worker" and expiry 2019-05-28T20:35:50.681Z
[taskcluster:error] TASK FAILURE during artifact upload: file-missing-on-worker: Could not read directory 'Z:\task_1556652867\logs'
[taskcluster:error] Uploading error artifact public/build from file public/build with message "Could not read directory 'Z:\\task_1556652867\\public\\build'", reason "file-missing-on-worker" and expiry 2019-05-28T20:35:50.681Z
[taskcluster:error] TASK FAILURE during artifact upload: file-missing-on-worker: Could not read directory 'Z:\task_1556652867\public\build'
[taskcluster 2019-04-30T20:51:38.143Z] Uploading artifact public/logs/certified.log from file generic-worker\certified.log with content encoding "gzip", mime type "text/plain; charset=utf-8" and expiry 2019-05-28T20:35:50.681Z
[taskcluster 2019-04-30T20:51:39.109Z] Uploading artifact public/chain-of-trust.json from file generic-worker\chain-of-trust.json with content encoding "gzip", mime type "text/plain; charset=utf-8" and expiry 2019-05-28T20:35:50.681Z
[taskcluster 2019-04-30T20:51:39.649Z] Uploading artifact public/chain-of-trust.json.sig from file generic-worker\chain-of-trust.json.sig with content encoding "gzip", mime type "application/octet-stream" and expiry 2019-05-28T20:35:50.681Z
[taskcluster 2019-04-30T20:51:40.340Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/F5HBezBETdWtTP_u-PQV5A/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2019-05-28T20:35:50.681Z
[taskcluster:error] Task appears to have failed intermittently - exit code 3221225786 found in task payload.onExitStatus list
[taskcluster:error] file-missing-on-worker: Could not read directory 'Z:\task_1556652867\logs'
[taskcluster:error] file-missing-on-worker: Could not read directory 'Z:\task_1556652867\public\build'

when an instance is terminated while it is still running a task, the generic worker process exits with an interrupt exit code. this change treats such exit codes as an exception which triggers a task retry

Assignee: nobody → rthijssen
Status: NEW → ASSIGNED
Attachment #9061761 - Flags: review?(pmoore)

Comment on attachment 9061761 [details]
trigger retry on instance termination mid-task

Looks good Rob, but I think just needs a comment to explain why these exit codes are special, and also needs to be filtered to only add them to tasks on windows workers (e.g. to avoid the mac tests and linux talos jobs we run on generic-worker).

Thanks!

Attachment #9061761 - Flags: review?(pmoore) → review-

Note, the current implementation won't easily allow other exit codes to be added to the list - ideally we'd implement this the same way we have in docker-worker, e.g. here.

Although maybe we can do that in a subsequent bug, in order not to block this one.

(In reply to Pete Moore [:pmoore][:pete] from comment #6)

Note, the current implementation won't easily allow other exit codes to be added to the list - ideally we'd implement this the same way we have in docker-worker, e.g. here.

Although maybe we can do that in a subsequent bug, in order not to block this one.

here's an attempt at what i think was suggested:

    task_def['payload'] = {
        'command': worker['command'],
        'maxRunTime': worker['max-run-time'],
        'onExitStatus': {},
    }

    if worker['os'] == 'windows':
        worker.setdefault('retry-exit-status', []).extend(
            # these codes (on windows) indicate a process interruption,
            # rather than a task run failure. see bug 1544403.
            1073807364, # process force-killed due to system shutdown
            3221225786, # sigint (any interrupt)
        )
    if 'retry-exit-status' in worker:
        task_def['payload']['onExitStatus']['retry'] = worker['retry-exit-status']

instead of what was implemented in the patch:

    task_def['payload'] = {
        'command': worker['command'],
        'maxRunTime': worker['max-run-time'],
        'onExitStatus': {
            'retry': [
                # these codes (on windows) indicate a process interruption,
                # rather than a task run failure. see bug 1544403.
                1073807364, # process force-killed due to system shutdown
                3221225786, # sigint (any interrupt)
            ],
        },
    }

i'm of the fairly strong opinion that the second of these two implementations is cleaner. i'm not sure what was meant by "won't easily allow other exit codes to be added to the list". please clarify that. possibly with a code example here as i must have failed to comprehend that. i think it's pretty much exactly the same amount of effort to add exit codes to either implementation.

(In reply to Rob Thijssen [:grenade (EET)] from comment #7)

(In reply to Pete Moore [:pmoore][:pete] from comment #6)

Note, the current implementation won't easily allow other exit codes to be added to the list - ideally we'd implement this the same way we have in docker-worker, e.g. here.

Although maybe we can do that in a subsequent bug, in order not to block this one.

here's an attempt at what i think was suggested:

    task_def['payload'] = {
        'command': worker['command'],
        'maxRunTime': worker['max-run-time'],
        'onExitStatus': {},
    }

    if worker['os'] == 'windows':
        worker.setdefault('retry-exit-status', []).extend(
            # these codes (on windows) indicate a process interruption,
            # rather than a task run failure. see bug 1544403.
            1073807364, # process force-killed due to system shutdown
            3221225786, # sigint (any interrupt)
        )
    if 'retry-exit-status' in worker:
        task_def['payload']['onExitStatus']['retry'] = worker['retry-exit-status']

instead of what was implemented in the patch:

    task_def['payload'] = {
        'command': worker['command'],
        'maxRunTime': worker['max-run-time'],
        'onExitStatus': {
            'retry': [
                # these codes (on windows) indicate a process interruption,
                # rather than a task run failure. see bug 1544403.
                1073807364, # process force-killed due to system shutdown
                3221225786, # sigint (any interrupt)
            ],
        },
    }

i'm of the fairly strong opinion that the second of these two implementations is cleaner. i'm not sure what was meant by "won't easily allow other exit codes to be added to the list". please clarify that. possibly with a code example here as i must have failed to comprehend that. i think it's pretty much exactly the same amount of effort to add exit codes to either implementation.

The payload builder of docker-worker includes an optional property retry-exit-status here. This allows any of the task-building machinery to mutate the list of exit codes that should cause a retry, by manipulating the retry-exit-status worker property. For example,

  • here inside a job transformation,
  • here during task transformation,
  • here inside a job default,

The retry exit codes, like the other parts of the task payload (mounts, commands to run, artifacts to include, ...), should be part of the generic-worker payload builder (like it is for the docker worker) so that any of the transformation steps can mutate the list. In the final transformation, the values from the list should be taken to generate the payload, like we do in docker-worker.

The reason I say it isn't possible to mutate the list currently, is because in your patch there currently are no means for the transforms to alter the list contents, since the final transformation is deciding what goes in there.

(In reply to Pete Moore [:pmoore][:pete] from comment #8)

The reason I say it isn't possible to mutate the list currently, is because in your patch there currently are no means for the transforms to alter the list contents, since the final transformation is deciding what goes in there.

thanks for explaining that. it sounds like a great feature to add. i'll leave that for another feature request as it sounds like it has wider implications than the scope of this bug, which is just about getting generic worker to stop reporting task exceptions as build failures.

i've added the platform check you requested to the phabricator patch and i look forward to getting that merged and this issue closed.

Pushed by ccoroiu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e1d2dc6db0fa
trigger retry on instance termination mid-task r=pmoore

Keywords: checkin-needed
Status: ASSIGNED → RESOLVED
Closed: 4 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68
You need to log in before you can comment on or make changes to this bug.