Closed Bug 1497753 Opened 7 years ago Closed 6 years ago

docker-worker jobs intermittently fail to shut down the liveLog container (with no error shown in live.log)

Categories

(Taskcluster :: Workers, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: nataliaCs, Unassigned)

References

Details

Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=204362372&repo=autoland Task details: https://tools.taskcluster.net/groups/JDvmPcfiShimWhYvTBIFDw/tasks/YamOThfOQ9e6PmHVKJ1Eag/runs/0 [task 2018-10-09T23:13:57.446Z] 23:13:57 INFO - Creating Python environment [task 2018-10-09T23:13:58.806Z] 23:13:58 INFO - New python executable in /builds/worker/workspace/build/src/obj-firefox/_virtualenvs/init/bin/python2.7 [task 2018-10-09T23:13:58.806Z] 23:13:58 INFO - Also creating executable in /builds/worker/workspace/build/src/obj-firefox/_virtualenvs/init/bin/python [task 2018-10-09T23:13:58.806Z] 23:13:58 INFO - Installing setuptools, pip, wheel...done. [task 2018-10-09T23:13:59.007Z] 23:13:59 INFO - WARNING: Python.h not found. Install Python development headers. task 2018-10-09T23:37:24.773Z] 23:37:24 INFO - 883 compiler warnings present. [task 2018-10-09T23:37:25.456Z] 23:37:25 INFO - Overall system resources - Wall time: 1408s; CPU: 0%; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0 [task 2018-10-09T23:37:25.572Z] 23:37:25 INFO - warning: build/win32/crashinject.cpp:46:58 [-Wformat] format specifies type 'int' but the argument has type 'DWORD' (aka 'unsigned long') [task 2018-10-09T23:37:25.572Z] 23:37:25 INFO - warning: build/win32/crashinject.cpp:65:53 [-Wformat] format specifies type 'int' but the argument has type 'DWORD' (aka 'unsigned long') [task 2018-10-09T23:37:25.572Z] 23:37:25 INFO - warning: build/win32/crashinject.cpp:72:57 [-Wformat] format specifies type 'int' but the argument has type 'DWORD' (aka 'unsigned long') [task 2018-10-09T23:37:25.572Z] 23:37:25 INFO - warning: build/win32/crashinject.cpp:84:57 [-Wformat] format specifies type 'int' but the argument has type 'DWORD' (aka 'unsigned long') [task 2018-10-09T23:37:25.572Z] 23:37:25 INFO - warning: clang/i686-w64-mingw32/include/yvals.h:271:9 [-Wmacro-redefined] '_THROW_BAD_ALLOC' macro redefined [task 2018-10-09T23:37:25.572Z] 23:37:25 INFO - warning: db/sqlite3/src/sqlite3.c:22622:15 [-Wimplicit-function-declaration] implicit declaration of function 'malloc_usable_size' is invalid in C99 [task 2018-10-09T23:37:25.572Z] 23:37:25 INFO - warning: db/sqlite3/src/sqlite3.c:22649:7 [-Wimplicit-function-declaration] implicit declaration of function 'malloc_usable_size' is invalid in C99 [task 2018-10-09T23:37:25.573Z] 23:37:25 INFO - warning: db/sqlite3/src/sqlite3.c:29138:10 [-Wunreachable-code-return] 'return' will never be executed [task 2018-10-09T23:37:25.573Z] 23:37:25 INFO - warning: db/sqlite3/src/sqlite3.c:40645:35 [-Wunused-variable] unused variable 'sqlite3_os_type' [task 2018-10-09T23:37:25.573Z] 23:37:25 INFO - warning: db/sqlite3/src/sqlite3.c:43312:5 [-Wunreachable-code] code will never be executed [task 2018-10-09T23:37:25.573Z] 23:37:25 INFO - warning: db/sqlite3/src/sqlite3.c:43339:11 [-Wunreachable-code] code will never be executed [task 2018-10-09T23:37:25.573Z] 23:37:25 INFO - warning: db/sqlite3/src/sqlite3.c:44828:18 [-Wunreachable-code] code will never be executed [task 2018-10-09T23:37:25.573Z] 23:37:25 INFO - warning: db/sqlite3/src/sqlite3.c:45016:5 [-Wunreachable-code] code will never be executed [task 2018-10-09T23:37:25.573Z] 23:37:25 INFO - warning: db/sqlite3/src/sqlite3.c:45822:5 [-Wunreachable-code] code will never be executed [task 2018-10-09T23:37:25.573Z] 23:37:25 INFO - warning: db/sqlite3/src/sqlite3.c:104758:12 [-Wunreachable-code] code will never be executed [task 2018-10-09T23:37:25.574Z] 23:37:25 INFO - warning: db/sqlite3/src/sqlite3.c:139224:39 [-Wunreachable-code] code will never be executed [task 2018-10-09T23:37:25.574Z] 23:37:25 INFO - warning: db/sqlite3/src/sqlite3.c:139283:39 [-Wunreachable-code] code will never be executed [task 2018-10-09T23:37:25.574Z] 23:37:25 INFO - warning: db/sqlite3/src/sqlite3.c:148624:9 [-Wunreachable-code] code will never be executed [task 2018-10-09T23:37:25.574Z] 23:37:25 INFO - warning: dom/bindings/BindingUtils.cpp:156:26 [-Wclass-varargs] passing object of class type 'typename raw_type<char16_t, int>::type' (aka 'char16ptr_t') through variadic function [task 2018-10-09T23:37:25.574Z] 23:37:25 INFO - warning: dom/bindings/BindingUtils.cpp:156:45 [-Wclass-varargs] passing object of class type 'typename raw_type<char16_t, int>::type' (aka 'char16ptr_t') through variadic function [task 2018-10-09T23:37:25.574Z] 23:37:25 INFO - warning: dom/ipc/ContentParent.cpp:784:56 [-Wformat] format specifies type 'int' but the argument has type 'base::ProcessId' (aka 'unsigned long') [task 2018-10-09T23:37:25.574Z] 23:37:25 INFO - warning: dom/ipc/ContentParent.cpp:1462:57 [-Wformat] format specifies type 'int' but the argument has type 'base::ProcessId' (aka 'unsigned long') [task 2018-10-09T23:37:25.575Z] 23:37:25 INFO - warning: dom/media/fake-cdm/cdm-test-output-protection.h:47:87 [-Wformat] format specifies type 'unsigned int' but the argument has type 'HRESULT' (aka 'long') [task 2018-10-09T23:37:25.575Z] 23:37:25 INFO - warning: dom/media/fake-cdm/cdm-test-output-protection.h:69:76 [-Wformat] format specifies type 'unsigned int' but the argument has type 'HRESULT' (aka 'long') [task 2018-10-09T23:37:25.575Z] 23:37:25 INFO - warning: dom/media/gmp/GMPChild.cpp:570:65 [-Wformat] format specifies type 'int' but the argument has type 'DWORD' (aka 'unsigned long') [task 2018-10-09T23:37:25.575Z] 23:37:25 INFO - warning: dom/media/gmp/GMPChild.cpp:626:61 [-Wformat] format specifies type 'int' but the argument has type 'DWORD' (aka 'unsigned long') [task 2018-10-09T23:37:25.575Z] 23:37:25 INFO - warning: dom/media/platforms/wmf/DXVA2Manager.cpp:314:75 [-Wformat] format specifies type 'unsigned int' but the argument has type 'HRESULT' (aka 'long') [task 2018-10-09T23:37:25.575Z] 23:37:25 INFO - warning: dom/media/platforms/wmf/DXVA2Manager.cpp:342:77 [-Wformat] format specifies type 'unsigned int' but the argument has type 'HRESULT' (aka 'long') [task 2018-10-09T23:37:25.575Z] 23:37:25 INFO - warning: dom/media/platforms/wmf/DXVA2Manager.cpp:354:74 [-Wformat] format specifies type 'unsigned int' but the argument has type 'HRESULT' (aka 'long') [task 2018-10-09T23:37:25.576Z] 23:37:25 INFO - warning: dom/media/platforms/wmf/DXVA2Manager.cpp:366:65 [-Wformat] format specifies type 'unsigned int' but the argument has type 'HRESULT' (aka 'long') [task 2018-10-09T23:37:25.576Z] 23:37:25 INFO - warning: dom/media/platforms/wmf/DXVA2Manager.cpp:372:68 [-Wformat] format specifies type 'unsigned int' but the argument has type 'HRESULT' (aka 'long') [task 2018-10-09T23:37:25.577Z] 23:37:25 INFO - warning: dom/media/platforms/wmf/DXVA2Manager.cpp:381:73 [-Wformat] format specifies type 'unsigned int' but the argument has type 'HRESULT' (aka 'long') [task 2018-10-09T23:37:25.577Z] 23:37:25 INFO - warning: dom/media/platforms/wmf/DXVA2Manager.cpp:390:75 [-Wformat] format specifies type 'unsigned int' but the argument has type 'HRESULT' (aka 'long') [task 2018-10-09T23:37:25.577Z] 23:37:25 INFO - warning: dom/media/platforms/wmf/DXVA2Manager.cpp:400:7 [-Wformat] format specifies type 'unsigned int' but the argument has type 'HRESULT' (aka 'long') [task 2018-10-09T23:37:25.577Z] 23:37:25 INFO - warning: dom/media/platforms/wmf/DXVA2Manager.cpp:424:66 [-Wformat] format specifies type 'unsigned int' but the argument has type 'HRESULT' (aka 'long') [task 2018-10-09T23:37:25.577Z] 23:37:25 INFO - warning: dom/media/platforms/wmf/DXVA2Manager.cpp:779:72 [-Wformat] format specifies type 'unsigned int' but the argument has type 'HRESULT' (aka 'long') [task 2018-10-09T23:37:25.578Z] 23:37:25 INFO - warning: dom/media/platforms/wmf/DXVA2Manager.cpp:786:64 [-Wformat] format specifies type 'unsigned int' but the argument has type 'HRESULT' (aka 'long') [task 2018-10-09T23:37:25.578Z] 23:37:25 INFO - warning: dom/media/platforms/wmf/DXVA2Manager.cpp:801:76 [-Wformat] format specifies type 'unsigned int' but the argument has type 'HRESULT' (aka 'long') [task 2018-10-09T23:37:25.578Z] 23:37:25 INFO - warning: dom/media/platforms/wmf/DXVA2Manager.cpp:810:40 [-Wformat] format specifies type 'unsigned int' but the argument has type 'HRESULT' (aka 'long') [task 2018-10-09T23:37:25.578Z] 23:37:25 INFO - warning: dom/media/platforms/wmf/DXVA2Manager.cpp:825:70 [-Wformat] format specifies type 'unsigned int' but the argument has type 'HRESULT' (aka 'long') [task 2018-10-09T23:37:25.578Z] 23:37:25 INFO - warning: dom/media/platforms/wmf/DXVA2Manager.cpp:851:70 [-Wformat] format specifies type 'unsigned int' but the argument has type 'HRESULT' (aka 'long') [task 2018-10-09T23:37:25.578Z] 23:37:25 INFO - warning: dom/media/platforms/wmf/DXVA2Manager.cpp:864:64 [-Wformat] format specifies type 'unsigned int' but the argument has type 'HRESULT' (aka 'long') [task 2018-10-09T23:37:25.578Z] 23:37:25 INFO - warning: dom/media/platforms/wmf/DXVA2Manager.cpp:872:70 [-Wformat] format specifies type 'unsigned int' but the argument has type 'HRESULT' (aka 'long') [task 2018-10-09T23:37:25.579Z] 23:37:25 INFO - warning: dom/media/platforms/wmf/DXVA2Manager.cpp:880:68 [-Wformat] format specifies type 'unsigned int' but the argument has type 'HRESULT' (aka 'long') [task 2018-10-09T23:37:25.579Z] 23:37:25 INFO - warning: dom/media/platforms/wmf/WMFMediaDataDecoder.cpp:124:49 [-Wformat] format specifies type 'unsigned int' but the argument has type 'HRESULT' (aka 'long') [task 2018-10-09T23:37:25.579Z] 23:37:25 INFO - warning: dom/media/platforms/wmf/WMFVideoMFTManager.cpp:204:9 [-Wformat-security] format string is not a string literal (potentially insecure) [task 2018-10-09T23:37:25.579Z] 23:37:25 INFO - warning: dom/media/platforms/wmf/WMFVideoMFTManager.cpp:641:62 [-Wformat] format specifies type 'unsigned int' but the argument has type 'HRESULT' (aka 'long') [task 2018-10-09T23:37:25.579Z] 23:37:25 INFO - warning: dom/plugins/base/nsPluginNativeWindowWin.cpp:225:5 [-Wimplicit-fallthrough] unannotated fall-through between switch labels [task 2018-10-09T23:37:25.579Z] 23:37:25 INFO - warning: dom/plugins/base/nsPluginNativeWindowWin.cpp:240:5 [-Wimplicit-fallthrough] unannotated fall-through between switch labels [task 2018-10-09T23:37:25.579Z] 23:37:25 INFO - warning: dom/plugins/base/nsPluginsDir.h:47:14 [-Wunused-private-field] private field 'pLibrary' is not used ....
@Aryx can you please take a look and help with the component/correct failures? Thank you.
Flags: needinfo?(aryx.bugmail)
Is this an upload failure? The log mentions at the end > Copying /builds/worker/logs/log_fatal.log to /builds/worker/workspace/build/upload/logs/log_fatal.log but the log viewer doesn't list such an artifact: https://treeherder.mozilla.org/logviewer.html#?job_id=204362372&repo=autoland
Component: Untriaged → Operations
Flags: needinfo?(aryx.bugmail)
Product: Firefox → Taskcluster
Assuming mozharness logs what it does, that's just a filesystem copy operation. It's possible it doesn't copy it if the file is empty? If there were errors uploading such artifacts, those would appear in the live log. I'm also not sure how an upload failure would cause Python.h to be missing much earlier. Or if that's even the issue -- [task 2018-10-09T23:13:59.007Z] 23:13:59 INFO - WARNING: Python.h not found. Install Python development headers. looks like a harmless warning to me. And the task ultimately succeeded. Is there actually anything wrong here?
Component: Operations → Untriaged
Flags: needinfo?(ncsoregi)
Product: Taskcluster → Firefox
Dustin: Can you check comment 2, please? The issue is that Treeherder shows the log as failed (without summary). Either the log or the task metadate could trigger that. (Natalia will be back on Friday.)
Flags: needinfo?(ncsoregi)
Summary: Intermittent WARNING: Python.h not found with compiler warnings present. → Intermittent successful job orange on treeherder
Ah, indeed, run 0 failed but run 1 succeeded. Looking in the worker logs, I see Oct 09 23:40:59 docker-worker.aws-provisioner.us-west-2a.ami-6d401815.m5-4xlarge.i-06e4f3ce28a1582d9 docker-worker: {"type":"error killing states","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-06e4f3ce28a1582d9","workerGroup":"us-west-2","workerType":"gecko-3-b-linux","workerNodeType":"m5.4xlarge","error":"Could not kill states properly. Error: Error: Error calling 'killed' for localLiveLog : Error: timeout of 30000ms exceeded\n at Promise.all.then.results (/home/ubuntu/docker_worker/src/lib/states.js:58:15)\n at <anonymous>\n at process._tickDomainCallback (internal/process/next_tick.js:228:7)"} and later Oct 09 23:41:16 docker-worker.aws-provisioner.us-west-2a.ami-6d401815.m5-4xlarge.i-06e4f3ce28a1582d9 docker-worker: {"type":"task resolved","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-06e4f3ce28a1582d9","workerGroup":"us-west-2","workerType":"gecko-3-b-linux","workerNodeType":"m5.4xlarge","taskId":"YamOThfOQ9e6PmHVKJ1Eag","runId":0,"taskState":"failed"} So it looks like the shutdown of the liveLog container failed (similar to the intermittents around linking docker containers at startup). Then someone manually re-ran the task, and on the second run it succeeded. I'm not sure why treeherder didn't catch the rerun and mark the job as not "busted", though. Is that typically how reruns are handled in treeherder? Docker-worker is deprecated, and this seems to be an incredibly rare event (first I've seen it anyway) so I don't think it's worthwhile to try to add task-logging output for this particular case.
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #5) > Docker-worker is deprecated, and this seems to be an incredibly rare event > (first I've seen it anyway) Bug 1497740 may be the same issue, with 18 failures in the last week.
See Also: → 1497740
Hi guys, this issue seems to still be happening once a week, can someone help us find a component for this issue? Thank you
Flags: needinfo?(past)
I think the remaining issue here is that treeherder is not showing the rerun as successful?
Treeherder's job result display doesn't aggregate reruns, it will show job every run as a single task instance. If it doesn't get automatically run (and set as exception for run 0), sheriffs try to identify what caused the job to fail. A successful rerun gets ignored for classification of such intermittent issues.
As far as I'm aware Treeherder does not treat re-runs as an update to an existing job. They would need to be submitted as a new job. As such, this seems to be a taskcluster-treeherder issue instead?
Flags: needinfo?(dustin)
I have no idea *what* the issue is.
Flags: needinfo?(dustin)
Eugh comment 17 was accidentally submitted earlier somehow (the page weirdly refreshed at one point) before I'd finished writing it, but the draft left in the comment field suggesting it hadn't? Strange. Anyway the revised version is... It looks like there are multiple separate issues in this bug: 1) (What this bug was filed about) Sometimes the shutdown of the liveLog container intermittently fails (details in comment 15) 2) That for these intermittents, `live.log` (and the others) don't contain any output that indicates as to why the failure occurred (therefore Treeherder not showing anything in the error summary panel is expected, since there is nothing for it to show). This would need changes to docker-worker to improve, however comment 15 says it is deprecated. 3) That clicking the task inspector link in Treeherder (either in the job details panel, or in the log viewer) for such an intermittent job, links to a page that seemingly shows a different result/log. Regarding (3), the short version is that Treeherder's links to the Task inspector currently don't include the run ID, so all end up showing the page for the latest run. I've filed bug 1502365 for changing this on our side, but it needs support from Taskcluster tools first, for which I've filed bug 1502368. Moving this bug back to Taskcluster, since the original issue it is about (the intermittent) isn't a Treeherder issue. (From the sounds of comment 15 it may end up being wontfix?)
Component: Treeherder → Docker-Worker
Product: Tree Management → Taskcluster
See Also: → 1502368
Summary: Intermittent successful job orange on treeherder → docker-worker jobs intermittently fail to shut down the liveLog container (with no error shown in live.log)
Version: --- → unspecified
See Also: 15023681502365
(In reply to Ed Morley [:emorley] from comment #18) > 1) (What this bug was filed about) Sometimes the shutdown of the liveLog > container intermittently fails (details in comment 15) Sorry that should be comment 5.
Component: Docker-Worker → Workers
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.