Closed Bug 1457694 Opened 6 years ago Closed 5 years ago

test harness and taskcluster output not in sync, showing too early '[taskcluster:error] Task timeout after 1800 seconds. Force killing container.'

Categories

(Taskcluster :: Workers, defect)


Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: aryx, Assigned: wcosta)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

Look at https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=168a838455004d0e71546cf7198836c69016ca94&filter-searchStr=Doc&selectedJob=175985384

Focus on the line '[taskcluster:error] Task timeout after 1800 seconds. Force killing container.'

Log: https://treeherder.mozilla.org/logviewer.html#?job_id=175985384&repo=autoland&lineNumber=760

[task 2018-04-27T16:40:01.321Z] Traceback (most recent call last):
[task 2018-04-27T16:40:01.321Z]   File "/builds/worker/checkouts/gecko/obj-x86_64-pc-linux-gnu/_virtualenv/lib/python2.7/site-packages/sphinx/ext/autodoc.py", line 658, in import_object
[taskcluster:error] Task timeout after 1800 seconds. Force killing container.

[task 2018-04-27T16:40:01.321Z]     __import__(self.modname)
[task 2018-04-27T16:40:01.321Z]   File "/builds/worker/checkouts/gecko/build/mach_bootstrap.py", line 363, in __call__
[task 2018-04-27T16:40:01.321Z]     module = self._original_import(name, globals, locals, fromlist, level)
[task 2018-04-27T16:40:01.321Z]   File "/builds/worker/checkouts/gecko/python/mozbuild/mozbuild/action/generate_searchjson.py", line 14, in <module>
[task 2018-04-27T16:40:01.322Z]     output = open(output_file, 'w')
[task 2018-04-27T16:40:01.322Z] IOError: [Errno 21] Is a directory: 'docs-out'

But the job actually started and was killed at these times (UTC+2):
Fri Apr 27, 18:34:39
Fri Apr 27, 19:05:17

That terminating message should be at the end of the log.
[taskcluster 2018-04-27 17:05:14.263Z] Unsuccessful task run with exit code: -1 completed in 1834.169 seconds

It looks like it took an additional 25 minutes to upload the artifacts for the task.  I only see small artifacts, but perhaps there was a larger one that it never finished uploading?  Or perhaps S3 just got slow -- that happens far more than AWS likes to say :)

I think the log mis-ordering is a buffering issue -- I suspect that the order of events was:

 * task terminated
 * uploads initiated
 * timeout
 * write "Force killing container" to log
 * close docker container streams, flushing buffers
 * write last 14ms worth of logs
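
A toy illustration of that buffering hypothesis (not docker-worker code; the path and buffer size are arbitrary) -- a line held in a userspace buffer only lands when the buffer is flushed, so it can appear after a line that was written later but unbuffered:

import os

path = '/tmp/reorder-demo.log'
open(path, 'w').close()  # start with an empty "log"

# Stand-in for the task's stdout stream: block-buffered, so this line sits
# in the userspace buffer rather than reaching the file immediately.
task_stream = open(path, 'a', buffering=1 << 16)
task_stream.write('[task] ...final lines of task output...\n')

# Stand-in for the worker's own writer: an unbuffered append lands right away.
worker_fd = os.open(path, os.O_WRONLY | os.O_APPEND)
os.write(worker_fd, b'[taskcluster:error] Task timeout after 1800 seconds.\n')
os.close(worker_fd)

# Closing the buffered stream flushes it, so the task output now appears
# *after* the error line -- the same ordering seen in the log above.
task_stream.close()

print(open(path).read())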

I'll let coop and wcosta be the arbiters, but I'm inclined not to try to fix this.  Every hour we spend on docker-worker is an hour less work on docker-engine, and in the scheme of things the log-ordering issue is pretty minor.  Without a reproduction, there's likely nothing we can do about the slow upload.
Component: General → Docker-Worker
Bug 1411358 collects intermittent failures due to task timeouts. There are a variety of causes, of course, but in recent weeks there have been quite a few similar to this one -- a long delay between the last output from the harness and the end of the task, plus a "Task timeout after..." message in a random location close to, but not at, the end of the log -- across a variety of Android and Linux test tasks.

https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=177828889&lineNumber=24944

[task 2018-05-10T10:01:46.396Z] 10:01:46     INFO -  [967, Main Thread] WARNING: Provided buffer length is out-of-bounds in WriteBuffer: file /builds/worker/workspace/build/src/image/SurfacePipe.h, line 309
[task 2018-05-10T10:01:46.396Z] 10:01:46     INFO -  [967, Main Thread] WARNING: Provided buffer length is out-of-bounds in WriteBuffer: file /builds/worker/workspace/build/src/image/SurfacePipe.h, line 309

[taskcluster:error] Task timeout after 3600 seconds. Force killing container.
[task 2018-05-10T10:01:46.397Z] 10:01:46     INFO -  [967, Main Thread] WARNING: Provided buffer length is out-of-bounds in WriteBuffer: file /builds/worker/workspace/build/src/image/SurfacePipe.h, line 309
[task 2018-05-10T10:01:46.397Z] 10:01:46     INFO -  [967, Main Thread] WARNING: Provided buffer length is out-of-bounds in WriteBuffer: file /builds/worker/workspace/build/src/image/SurfacePipe.h, line 309

...

[task 2018-05-10T10:01:46.407Z] 10:01:46     INFO -  TEST-START | ImageBitmapColorUtils.YUV444PToRGBA32
[task 2018-05-10T10:01:46.407Z] 10:01:46     INFO -  TEST-PASS | ImageBitmapColorUtils.YUV444PToRGBA32 | test completed (time: 0ms)
[task 2018-05-10T10:01:46.407Z] 10:01:46     INFO -  TEST-START | ImageBitmapColorUtils.YUV444PToBGRA32
[task 2018-05-10T10:01:46.407Z] 10:01:46     INFO -  TEST-PASS | ImageBitmapColorUtils.YUV444PToBGRA32 | test completed (time: 0ms)[taskcluster 2018-05-10 10:52:03.052Z] === Task Finished ===
[taskcluster 2018-05-10 10:52:03.053Z] Unsuccessful task run with exit code: -1 completed in 3601.594 seconds
Blocks: 1411358
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=177915320&lineNumber=24827

[task 2018-05-10T17:53:40.871Z] 17:53:40     INFO -  [965, Main Thread] WARNING: Provided buffer length is out-of-bounds in WriteBuffer: file /builds/worker/workspace/build/src/image/SurfacePipe.h, line 309
[task 2018-05-10T17:53:40.872Z] 17:53:40     INFO -  [965, Main Thread] WARNING: Provided buffer length is out-of-bounds in WriteBuffer: file /builds/worker/workspace/build/src/image/SurfacePipe.h, line 309

[taskcluster:error] Task timeout after 3600 seconds. Force killing container.
[task 2018-05-10T17:53:40.872Z] 17:53:40     INFO -  [965, Main Thread] WARNING: Provided buffer length is out-of-bounds in WriteBuffer: file /builds/worker/workspace/build/src/image/SurfacePipe.h, line 309
[task 2018-05-10T17:53:40.873Z] 17:53:40     INFO -  [965, Main Thread] WARNING: Provided buffer length is out-of-bounds in WriteBuffer: file /builds/worker/workspace/build/src/image/SurfacePipe.h, line 309

...

[task 2018-05-10T17:53:40.906Z] 17:53:40     INFO -  TEST-START | ImageSurfaceSink.PalettedSurfaceSinkWriteUnsafeComputedRow
[task 2018-05-10T17:53:40.907Z] 17:53:40     INFO -  TEST-PASS | ImageSurfaceSink.PalettedSurfaceSinkWriteUnsafeComputedRow | test completed (time: 0ms)
[task 2018-05-10T17:53:40.907Z] 17:53:40     INFO -  TEST-START | GlobalWindowDialogOptions.TestDialogTokenize[taskcluster 2018-05-10 18:46:55.516Z] === Task Finished ===
[taskcluster 2018-05-10 18:46:55.516Z] Unsuccessful task run with exit code: -1 completed in 3602.469 seconds
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=177850775&lineNumber=1396

[task 2018-05-10T12:59:25.622Z] 12:59:25     INFO - root      13    2     0      0     c007340c 00000000 S kswapd0
[task 2018-05-10T12:59:25.623Z] 12:59:25     INFO - root      14    2     0      0     c004ce30 00000000 S aio/0
[task 2018-05-10T12:59:25.623Z] 12:59:25     INFO - root      25    2     0      0     c016d0f8 00000000 S mtdblockd

[taskcluster:error] Task timeout after 7200 seconds. Force killing container.
[task 2018-05-10T12:59:25.623Z] 12:59:25     INFO - root      26    2     0      0     c004ce30 00000000 S kstriped
[task 2018-05-10T12:59:25.623Z] 12:59:25     INFO - root      27    2     0      0     c004ce30 00000000 S hid_compat

...

[task 2018-05-10T13:00:10.662Z] 13:00:10     INFO -    process:
[task 2018-05-10T13:00:10.662Z] 13:00:10     INFO -       USER     PID   PPID  VSIZE  RSS     WCHAN    PC         NAME
[task 2018-05-10T13:00:10.663Z] 13:00:10     INFO -       root      1     0     364    268   c0098770 00017d58 S /init[taskcluster 2018-05-10 14:56:38.590Z] === Task Finished ===
[taskcluster 2018-05-10 14:56:38.590Z] Unsuccessful task run with exit code: -1 completed in 7461.379 seconds
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=177834671&lineNumber=174

[task 2018-05-10T10:32:02.128Z] + true
[task 2018-05-10T10:32:02.128Z] + . /builds/worker/scripts/xvfb.sh

[taskcluster:error] Task timeout after 3600 seconds. Force killing container.
[task 2018-05-10T10:32:02.129Z] #! /bin/bash -x
[task 2018-05-10T10:32:02.129Z] 

...

[task 2018-05-10T10:32:19.686Z] 10:32:19     INFO -  'emulator': {'device_id': 'emulator-5554', 'name': 'test-1'},
[task 2018-05-10T10:32:19.686Z] 10:32:19     INFO -  'emulator_extra_args': '-show-kernel -debug init,console,gles,memcheck,adbserver,adbclient,adb,avd_config,socket',
[task 2018-05-10T10:32:19.687Z] 10:32:19     INFO -  'emulator_manifest': '\n        [\n        {\n        "size": 38451805,\n        "digest": "ac3000aa6846dd1b09a420a1ba84727e393036f49da1419d6c9e5ec2490fc6105c74ca18b616c465dbe693992e2939afd88bc2042d961a9[taskcluster 2018-05-10 11:32:00.847Z] === Task Finished ===
[taskcluster 2018-05-10 11:32:00.848Z] Unsuccessful task run with exit code: -1 completed in 3602.395 seconds
I don't care much about the location of the timeout error message (though it seems odd); the timeouts themselves, however, appear random and are pretty frequent -- very inconvenient.
How could you get more information on the timeouts?  I looked at the syslog from the host when I commented a few weeks ago, and didn't see anything out of the ordinary; the worker seemed to be plugging along as normal.  The links above are all from the 10th, which is beyond the 3-day syslog retention, so I can't re-check that.
Here's the syslog from that instance at the time (with accessTokens hidden).

Basically, docker-worker continued doing its housekeeping, including renewing its lease on the task during that time.  Nothing out of the ordinary.

This is also consistent with my thoughts about buffering: the output between the "Force killing container" and "Unsuccessful task run" was likely stuck in buffers with no further data to push it along during those 42 minutes.  So I'm guessing something was sent:

[task 2018-05-15T13:09:36.883Z] 13:09:36     INFO -  Sending-->

and nothing came back.  Perhaps adding a timeout there will help?
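
A minimal sketch of what such an output timeout could look like (hypothetical; not code from any of these harnesses, and the command and numbers are placeholders), bounding how long a blocking read on the child's stdout pipe may wait:

import os
import select
import subprocess

harness_cmd = ['./run-tests.sh']   # placeholder for whatever the harness actually runs
OUTPUT_TIMEOUT = 300               # seconds of silence before we give up

proc = subprocess.Popen(harness_cmd, stdout=subprocess.PIPE)
while True:
    # Wait until the pipe has data (or EOF), but never longer than OUTPUT_TIMEOUT.
    ready, _, _ = select.select([proc.stdout], [], [], OUTPUT_TIMEOUT)
    if not ready:
        proc.kill()
        raise RuntimeError('no output for %d seconds; killed the harness' % OUTPUT_TIMEOUT)
    chunk = os.read(proc.stdout.fileno(), 65536)
    if not chunk:                  # EOF: the child closed its stdout
        break
    # ...forward `chunk` to the log...
proc.wait()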
Flags: needinfo?(dustin)
Thanks for checking the syslog.

That "Sending-->" came from gtest code. I don't know if gtest already has a timeout, but many harnesses do. Remember, we see this sort of thing in a variety of test suites -- not just gtest. Certainly mochitests and reftests have mechanisms to time out if there is no output for a few minutes; oddly, these logs don't show those timeouts -- part of the mystery.

Consider comment 4, where an Android job is in the middle of dumping a process list when the log goes silent for almost 2 hours. It seems unlikely that this code was hung:

https://dxr.mozilla.org/mozilla-central/rev/cf3ee14023483cbbb57129479537c713e22c1980/layout/tools/reftest/remotereftest.py#342-351

A correlation I may be noticing (or may be imagining) is that these hangs often occur when there is a lot of output being logged at one time. Some gtests dump a lot to the log all at once; the Android jobs retrieve all that device info, then dump 50+ lines in a tight loop....
(In reply to Geoff Brown [:gbrown] from comment #10)
> A correlation I may be noticing (or may be imagining) is that these hangs
> often occur when there is a lot of output being logged at one time. Some
> gtests dump a lot to the log all at once; the Android jobs retrieve all that
> device info, then dump 50+ lines in a tight loop....

Here's an experiment then: Hack the mozharness script to just dump a lot of text, rather than running tests. Does the task intermittently hang and time-out? It does!

https://treeherder.mozilla.org/#/jobs?repo=try&revision=610ed4b30f42755cb83cbd5e5427f02c809bac17

https://treeherder.mozilla.org/logviewer.html#?job_id=186514985&repo=try&lineNumber=126958

[task 2018-07-04T23:31:19.465Z]     INFO - CKTpQNyGnqMMJlysbYKoxXSQwjxLqTZnphzPkOCuewgrhBR fIuPcHbLaNURJk dXvWvGtre YtAEmSVDAoWadlOHfzFsBZEmc F
[task 2018-07-04T23:31:19.466Z]     INFO - ODMKvoTQDgmkCzxHQGA lFoeukEuzdJaGrgfNXcwLOyFSthBPjeqxsVvRJ bUCXsmEnyMZWlSKW ZtfRiqiIcanVjdINThpBwYYP

[taskcluster:error] Task timeout after 3600 seconds. Force killing container.
[task 2018-07-04T23:31:19.466Z]     INFO -  KAaNW rLPSTiDIdhIOge ctvvwmFTBKngxDYCoflynGHNXAVXWSfp EZQUxhdpzaOJeRuLzYqZmMbGQMsyRqrslVjBiPEbkcouJ
[task 2018-07-04T23:31:19.466Z]     INFO - VnXTbKaFfmrjdhEiw vlqXAUJusJYiMAbtPMczDBIDgNOCUmPTrVtWNKgYxoHS GOLqcRZoeHf LzEuswQpBlnxkSZI WyRQavep
...
[task 2018-07-04T23:31:19.480Z]     INFO - lbURwuvxoLKWGpiVA sMmCHYkhCj SfIdXTcGEqrqOeHyeDdRrtzNIBVksKaTPJlv gcESjLQgFQNWBP JiAzhaXMwupZFyntbDo
[task 2018-07-04T23:31:19.480Z]     INFO - bSaexzqOslWPvdicmIwyeZyIJpaORFiVVTvTCxBtmYGdJDoNfjUYAGHUrgKuncbtqLDHr zslhjQMMXEQfLSKRukENk hCPw A X
[task 2018-07-04T23:31:19.480Z]     INFO - dNKERsbkJaFmqzhjUzMQiLHXYotcTSZJxRnUZcPMYCyofFICOdlu vOwpwgLIgtfNaGe EQ AexrDnXvWpPDVSumVHGqABTbiK h[taskcluster 2018-07-05 00:29:03.229Z] === Task Finished ===
[taskcluster 2018-07-05 00:29:03.229Z] Unsuccessful task run with exit code: -1 completed in 4053.648 seconds
And here's another experiment: Hack test-linux.sh to just dump a lot of text -- intermittent time-outs continue.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=120f4c0b95e0a2c07cf317fa167e07d39bf2cdef

https://treeherder.mozilla.org/logviewer.html#?job_id=186547391&repo=try&lineNumber=251136

[task 2018-07-05T04:41:44.054Z] DaE+r3xVq1Y1BS0IK9q7GOFxKt03zmbx4hkU2MTvqu84HXXcv3M/NutX35Efx21OS8MhBTSAH7MK
[task 2018-07-05T04:41:44.054Z] GFerhqhbPFyGH1JaCmIZUCgKU4P/MGo3JNQN7aouQy9RajogR75P7QGYQTVL0nkq/ihPb7w5luKY

[taskcluster:error] Task timeout after 3600 seconds. Force killing container.
[task 2018-07-05T04:41:44.055Z] vcp+tvuSz9p7BEg4wtc17mbCUjntB5FBE6NojOb88QVnDIOz7eYtB0l11011ODbpmJpNqV/RAKtB
...
[task 2018-07-05T04:41:44.058Z] 1TfJAJe053HMQDSI4JhF6dpxDk542M8XFE8X9Ha183GjrYt9RbKYeDaCTKT+gCCpcHhHG3NbVVpu
[task 2018-07-05T04:41:44.058Z] +nkqP1i2KIY0Gq7wKypRoPNVS2e5+E0lGnMQBmVqO1eRXqVURDHVFNdbtaVUOKpLQIDhFmaM3lf1
[task 2018-07-05T04:41:44.058Z] /4NOhL/fQCHC9oJmQElhFoQ4exE+0x/Yp9IqkVZMs/85cA80vb6A1Ui0fzOyp/0y8NBlLIL5g6gF[taskcluster 2018-07-05 05:41:14.985Z] === Task Finished ===
[taskcluster 2018-07-05 05:41:14.986Z] Unsuccessful task run with exit code: -1 completed in 3846.688 seconds
So it looks like this much output causes docker-worker to "lock up" until maxRunTime expires.  Once it does expire, it prints a message, flushes some output, and then kills the task.  This sounds suspiciously like a streams issue.....

Wander, any ideas?
Flags: needinfo?(wcosta)
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #13)
> So it looks like this much output causes docker-worker to "lock up" until
> maxRunTime expires.  Once it does expire, it prints a message, flushes some
> output, and then kills the task.  This sounds suspiciously like a streams
> issue.....
> 
> Wander, any ideas?

Putting in my back log
Assignee: nobody → wcosta
Status: NEW → ASSIGNED
Flags: needinfo?(wcosta)
Here's one more experiment, with mixed results... This time I modified run-task to generate random output itself, rather than calling test-linux.sh; in this case, there is still an intermittent task timeout, but the frequency of the failure seems much lower.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=391e230e9482441345576ec7988f9138611a4390

https://treeherder.mozilla.org/logviewer.html#?job_id=186668532&repo=try&lineNumber=121279-121284

cPMmfs LYIEVPLtEuGZkCxvjst hSrcQwMzTHjpXSuXbRiHNoOCVFoJB UfgkdUODZJnDFabnWdwlTveKpyygKmlIhzNYqWRiGQe
xMOKjSJwhjfYofqzNtnQHHgBkUPNuaLYFWtwDJixQkiMyubDUFXESVsGCnd vpvcAZIImroemzsABGyrTdlT hclRbK qRCEVZL 

[taskcluster:error] Task timeout after 3600 seconds. Force killing container.
TmXAOzN UMRNKH ydJZDGsLQzOjKihnXHpegMfrWaRmyIbwvUZlBQLJEqStAuVhFdg  cGecxbvxjWVfTolkDBCYYIqunkFiPsCa
nBkKIYNdcwMkXsZCflDtVOexSWLcAwhxdaj yaRIveYDzoM s qQG viZyhrPGouWuiUmjbpXNFAECFHJVRzTLtEKHbfglOTUBPn
...
Gvo aMmTORQZJGXZejyWRULqofpipDhYPuA UEDAKqYCLJkcdjKHdav mVfIVOtlBFiwgbM etHywTCEBWXSuFQPIbrlxnzngNzh
EneEhXroOTWQkcvoJDMKdNICB AfQK SSlsbrwwxWiYH FIeubqVvLgCpLdpzagNmmcT ytHyGlfjFkiJRBaDxPOjRuZUzGnVPAt[taskcluster 2018-07-05 18:23:09.260Z] === Task Finished ===
[taskcluster 2018-07-05 18:23:09.261Z] Unsuccessful task run with exit code: -1 completed in 3925.886 seconds


That is, we seem to have fewer of these streaming hangs when we do not use this code:

https://dxr.mozilla.org/mozilla-central/rev/6c0fa9a675c91390ca27664ffb626c56e8afea4d/taskcluster/scripts/run-task#116-137
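
For reference, that block is roughly a prefix-and-stream loop of this shape (an assumed paraphrase based on the snippets quoted later in this bug, not the exact code at the link):

import io
import subprocess
import sys

def print_line(prefix, data):
    # Stand-in for run-task's print_line: prefix and forward one raw output line.
    sys.stdout.buffer.write(b'[' + prefix + b'] ' + data)
    sys.stdout.buffer.flush()

def run_and_prefix_output(prefix, args):
    p = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    stdout = io.TextIOWrapper(p.stdout, encoding='latin1')
    while True:
        data = stdout.readline().encode('latin1')
        if data == b'':
            break
        print_line(prefix, data)
    return p.wait()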

:gps - Any thoughts?
Flags: needinfo?(gps)
I agree that run-task:116-137 seems suspect. And that code did change recently and could explain the uptick in failures.

I didn't have time today, but I'd like to test running `run-task` locally, having it execute a process that spits out random output, and see if I can reproduce. Reproducing locally is important, as that would enable someone to put the processes under strace and see what's going on. It would also be possible to enter a Python debugger and see what's up.

I'd also like to test exiting a child process without a newline being the final character.

I *think* the behavior of all these Python streams is that when the pipe is closed, an EOFError is supposed to be raised and any in-progress read() or readline() operations should terminate immediately.

What I'm not certain about is how the child process, pipes, and the various Python wrappers are interplaying. It is quite possible that we trigger a readline() in Python (which blocks), the child then exits (having sent no more output), and the parent hangs waiting on output that will never arrive.

There's also the possibility that we're hitting a Python bug where Python isn't handling the pipe close properly, leading to deadlock.

It might be worth scouring the Internets for recipes that stream line output from subprocesses in Python and see what we're doing differently/incorrectly. I concede our use of a TextIOWrapper for doing (universal) line reading is a bit wonky. That could very well be where the problem is. I wonder if we shouldn't just read() from the underlying pipe and do our own line resolving. It is a little more code. But if it avoids this issue...
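
A minimal sketch of that "read() from the underlying pipe and do our own line resolving" alternative (assumed structure and hypothetical helper names -- not the actual run-task code):

import os
import subprocess
import sys

def stream_lines(cmd, handle_line):
    # Read raw bytes straight from the pipe and split lines ourselves, so no
    # text-layer wrapper (TextIOWrapper) sits between us and the pipe.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    fd = proc.stdout.fileno()
    buf = b''
    while True:
        chunk = os.read(fd, 65536)   # returns b'' at EOF; no EOFError involved
        if not chunk:
            break
        buf += chunk
        while b'\n' in buf:          # emit every complete line we have so far
            line, buf = buf.split(b'\n', 1)
            handle_line(line + b'\n')
    if buf:
        handle_line(buf)             # trailing output without a final newline
    return proc.wait()

if __name__ == '__main__':
    # e.g. prefix every line, roughly in the spirit of run-task's print_line
    code = stream_lines([b'bash', b'-c', b'seq 5'],
                        lambda line: sys.stdout.buffer.write(b'[demo] ' + line))
    sys.exit(code)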

I'm not sure if I'll have time to debug this in the next ~week. I'm a bit busy with migrating hg.mozilla.org to a different data center right now and that project is under a hard deadline. I would prefer someone else debug this. But this is a problem in code I wrote and I do feel a responsibility to help resolve it. I'll gladly take a look at patches sent my way. But I can't dedicate unbounded time to debugging this issue right now :/
Flags: needinfo?(gps)
Something else to watch out for when generating "random" output is that the machine may run out of entropy and fail to generate more random data. Instead of generating random output, consider writing something more deterministic, e.g. iterate over `xrange(2**32)` and print lines with the running counter. And consider sleeping a little between writes so you don't overwhelm the CPUs.
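
A hypothetical deterministic spew along those lines (counter plus constant text, with an occasional short sleep; the numbers are arbitrary):

import sys
import time

for i in range(500000):
    sys.stdout.write('line %07d %s\n' % (i, 'x' * 80))
    if i % 10000 == 0:
        sys.stdout.flush()
        time.sleep(0.01)   # brief pause so generating output doesn't peg the CPU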
Based on https://hg.mozilla.org/try/rev/bdf6ac60376d, I added

if __name__ == '__main__':
    run_and_prefix_output(b'foo', [b'bash', b'/home/dustin/tmp/spew.sh'])  

and spew.sh is

base64 /dev/random | head -c 50000000

I ran this redirected to /tmp/stuff.  It ran for about 30 seconds, then stopped of its own accord:

dustin@jemison ~ $ head /tmp/stuff 
[foo 2018-07-07T16:48:55.039Z] executing [b'bash', b'/home/dustin/tmp/spew.sh']
[zzz 2018-07-07T16:48:55.040Z] run-task: reading
[foo 2018-07-07T16:48:55.068Z] q4/361VKfUfgU4H4n7TBYG05RiXG1Vv4xO0wLtG0mCT1hihlMDnOiUK51FszC+p56SxLWcy+qC4S
[foo 2018-07-07T16:48:55.068Z] Yj2Ro+RW1f69tAdCt7o3JRxHug6nzX7swjTucwUGutHS4t2bLc325U5IAZHJX2wJPr17rcpKOHPV
[foo 2018-07-07T16:48:55.068Z] QGV7cYF34UDyOwDZZCkeOyIEdGCHTgthkdnPvHfhrpFn8axqEpb0vvVn3qqC6VvIf/Kedc3TlZZI
[foo 2018-07-07T16:48:55.068Z] 8gCeqnsThW9/gl9tpei96t25iyvdobwOJMMrM/3KRpjiqHm9ejNFWIPQJxUF0zhnzNDvhiy1RwQo
[foo 2018-07-07T16:48:55.068Z] mwngah1Ymzy8Mf+bRZ8f16I39sVjwslqzAOOm0Ly8/pQUEunkvlx31VzLAT+xL1Ykb1liOoRb0ln
[foo 2018-07-07T16:48:55.068Z] 0NJjyOoiJOz0wu5SC5iYeRMEPmrXZuSssOLHjMjYz4cGAW74OusBgXH3hYfMczKhDTwNhiO7DJEf
[foo 2018-07-07T16:48:55.068Z] 4El7nBfWCYxkgW83SyOKTtvKo+XmtmkqYXohpv1qR8KaeIZHtAIcYrUXu9IwJ+mgeXT1PD23XyJS
[foo 2018-07-07T16:48:55.068Z] TM24S4gxvp4fRmbmWrDuo1kj6MQ4wQ1iQRUY/J1Q6WLTe2XMidyXE+2fpUBVbi8QgDTeaasqE2ly
dustin@jemison ~ $  tail /tmp/stuff 
[foo 2018-07-07T16:49:31.019Z] Icx5QrDj3KeQAGiyr33mXfHnwNIBRvC1wU+gkBFv3phPeL3m5esssuc6KyFQx8Jgc9mJ/I8p69jx
[foo 2018-07-07T16:49:31.019Z] yAVRdxN7iCT/d8LAFn5QIFJvuBIsN6OXchLCF1yt2o3gb8i8GtrGUnd30aNmG1yIaS+xNhYRRtTD
[foo 2018-07-07T16:49:31.019Z] HTfpP6WCia/wW0RVdHRHmI6rFGmh6tQK5L86ReukhEtNKH/codhz1ayuk8QsRmwWv30DVexZI0Dw
[foo 2018-07-07T16:49:31.019Z] 31jnDbIW+KvtNEhjmnhJlHjFTO+SEEa5xS1OsYZY0cH2Y9/ENuiABIWu1/av8cq2iNZ/AR6dhuyU
[foo 2018-07-07T16:49:31.019Z] ObH+tclJuE9DufIc0jpH2o3onzz195Dc8nMue3KyreVTrcNdckCExVEB48SWnKg2pWiBbyHuJBbu
[foo 2018-07-07T16:49:31.019Z] QiOIDy7JJ/83fsQ+idpr+R7ywrN1xMUTFZoIBo3CKIFDdJQt0ryHGy7t1jI6Z2uezOPei41fmA2c
[foo 2018-07-07T16:49:31.019Z] 05pR33kUV5x+ZxksQJoWl68sJ5dCofyA8sW0iwA+ySHZZJhQFwrb0cvkyH4l7zCxLibS4eVKwnZP
[foo 2018-07-07T16:49:31.019Z] iv1aqv5/QQq7qGsbxybtdqCNMR5pbPd35KxXZkBzQLfQqj4Iknln0RkRpTEMKFoddGtER3PAI9CR
[foo 2018-07-07T16:49:31.046Z] xOVHaUQaRZHOsKH6v9AGOCEXUJymXZn6trl8f9qxELOiu2NQEV[zzz 2018-07-07T16:49:31.046Z] run-task: waiting after 649351 lines read
[zzz 2018-07-07T16:49:31.046Z] run-task: wait complete

Running it again:

dustin@jemison ~ $ head /tmp/stuff 
[foo 2018-07-07T17:03:39.061Z] executing [b'bash', b'/home/dustin/tmp/spew.sh']
[zzz 2018-07-07T17:03:39.062Z] run-task: reading
[foo 2018-07-07T17:03:39.090Z] oIzTkltCUPf4lGvkPoXB1ibL0jt5/No55YyKPcuDEtOsk4NkXNtsghH6tNp5t/jsazlRfKlcnE4J
[foo 2018-07-07T17:03:39.090Z] FkFuaZB0TlJ76JZI5hatigQG+Oe4DuKiXR++aqT1qZmdLM78JxM7fJ984I9bZWbt9CB+vYBsc/2N
[foo 2018-07-07T17:03:39.090Z] vFZyIOFw/q+C0yqJ4XPLShxMu7Y3FgQcrikjfzrNNxhhuOH78zLfNxTepfNxf+z4C/lFbfId5zub
[foo 2018-07-07T17:03:39.090Z] f9YuLp4P2iPgrzo73k/U4HbylrAL9sj3PzrSe3GGYAZyghpPGpCEwPNjHEMAvv+23P7EXhuU8dYO
[foo 2018-07-07T17:03:39.090Z] VUcQ8yuxY9aDXlu+OVPVgE0qPx0uHOPpJcfIOY1+4OFA2IrRIcrdLh88h9LpHiGU0eDJfThbDDSF
[foo 2018-07-07T17:03:39.090Z] SDVpxj/c2Y3jgg5hPwGyMwpuW9XNNlyTMoIbW8z3jiBZz2yUWwdhG3FWHIawbA9rBBYK9oL3Cd99
[foo 2018-07-07T17:03:39.090Z] zgLj2tb4OQslLvf+SXfSfTgkh9fNhRgPmg4aK4pVxvEVCTeq0SYAZCwzBr48iAi0QjAAjYTDuGq5
[foo 2018-07-07T17:03:39.090Z] q13+X8/q6F9O3YsJ2XoqbxIQllQio7VCkgTUJwQR7kunRtfGr7kZ1+6fUxmgAUg4eoLlKoZwdx6Z
dustin@jemison ~ $  tail /tmp/stuff 
[foo 2018-07-07T17:04:15.924Z] 18VruuHjL2ot06Ctu+72R+Sh2JIDNOsJ8sVg7C1Cxyb7JIMWbY8le/UWY28V0MMCzYQzyZEp5+az
[foo 2018-07-07T17:04:15.924Z] E60JX1NpzQMUEFk96PWzMSvIyUph1djTd0wIKYy0Lhdl9/0zxyOb8tPsUSuW4+z9x0KumQgtWciW
[foo 2018-07-07T17:04:15.924Z] iB1me8InodPaTRh5F5P9ZONVLdK8TtbDFCdvaN3+whMpnSK6tr1F9r0zK2PoOoP8gtfcDn6b7rIv
[foo 2018-07-07T17:04:15.924Z] x8uOmmj0braPZ9cwEGio16WusE4cF3G7gHT0/NoTMsC8nMW6LmtmOO4tj+J1otIut8JOtB7slXOP
[foo 2018-07-07T17:04:15.924Z] Iargf3la8H/xHlq4Gsp+rhHJLJw5Qc94GknJF6mPnlI2Wo7qBYtD2VgUZK1nuwHuodcCGAUoNRn6
[foo 2018-07-07T17:04:15.924Z] lrSYHE5g5nqh9OM2zZYW16Nw8lks78l3ZZrJM4G9VkEYuQVQ/rLYQhHTaNtjaucjI1eyA/qnm7pn
[foo 2018-07-07T17:04:15.924Z] 11w2lNQOKcxeYCNneO1omaPAAlT0zJhJTdna26/LmWS1RjB6oA8hBLX9Fyn99mYLimpEKgz3NggK
[foo 2018-07-07T17:04:15.924Z] lmSxLwkC+gKr4Z8APXUZRP1weKoVCV8e1l67GEzyaF8TAsHGgl1KRfnDGHLAvbJ36vf61jiVx+Ys
[foo 2018-07-07T17:04:15.951Z] eINZ7e7Ytv9B7en7lP73++L4hFbLuBBXd7+Ngzq94I+fvp9rUn[zzz 2018-07-07T17:04:15.951Z] run-task: waiting after 649351 lines read
[zzz 2018-07-07T17:04:15.951Z] run-task: wait complete

Note that the line count is the same!

Watching `top` while this is going on, I don't see the processes using a lot of RAM or anything else that would cause the spew to fail.
Dropping the TextIOWrapper doesn't change things (precisely the same result):

    #stdout = io.TextIOWrapper(p.stdout, encoding='latin1')

    print_line(b"zzz", b"run-task: reading\n")
    line_count = 0
    while True:
        data = p.stdout.readline() #.encode('latin1')
Printing the result of `p.wait()` shows 0.
Oh, I just realized that's head *-c*, so this is running to completion (I was thinking it was -n and expecting 50000000 lines).

So, yeah, I can't reproduce the hang.
(In reply to Gregory Szorc [:gps] from comment #17)
> Something else to watch out for when generating "random" output is that the
> machine may run out of entropy and fail to generate more random data.
> Instead of generating random output, consider writing some more
> deterministic. e.g. run `xrange(2**32)` and print lines with the running
> counter. 

Reproduced with line count + constant text (no random output):

https://treeherder.mozilla.org/#/jobs?repo=try&revision=585d4e195903994840d299a29d0cdde0ff8622e1

This happened fairly frequently: Nearly 1 time in 10. This reduces run-task logging to a loop around a simple print statement, suggesting that the trouble is really in docker-worker.

> And consider sleeping a little between output so you don't
> overwhelm CPUs.

That seems unrealistic: run-task runs all sorts of builds and tests that generate output without regard to overwhelming CPUs.
I'm running some really basic tasks in https://tools.taskcluster.net/groups/e4a1BdK3RyiQ9QLDcagT9w as a further replication.
...and 20 minutes later, 19 out of 60 were failures with exactly this symptom. Here's the script to generate that:

const taskcluster = require('taskcluster-client');

const N = 60; 

const main = async () => {
  const queue = new taskcluster.Queue();
  const taskGroupId = taskcluster.slugid();
  console.log(`taskGroupId: ${taskGroupId}`);

  for (let n = 0; n < N; n++) {
    const task = { 
      taskGroupId,
      provisionerId: 'aws-provisioner-v1',
      workerType: 'gecko-t-linux-large',
      created: taskcluster.fromNow('0 seconds'),
      deadline: taskcluster.fromNow('3 hours'),    
      expires: taskcluster.fromNow('28 days'),
      priority: 'low',
      metadata: {
        source: 'https://bugzilla.mozilla.org/show_bug.cgi?id=1457694',
        owner: 'dustin@mozilla.com',
        description: 'testing docker-worker',
        name: `test ${n}`,
      },  
      payload: {
        image: 'ubuntu:16.04',
        command: ['/bin/bash', '-c', 'for ((i=1;i<=500000;i++)); do echo xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx $i; done'],
        maxRunTime: 1200,
      },  
    };  

    const taskId = taskcluster.slugid();
    await queue.createTask(taskId, task);
    console.log(taskId);
  }
};

main().catch(err => {
  console.error(err);
});
Removing the TTY allocation would avert this problem, but in IRC, Dustin
explained how that would violate many existing users' assumptions in subtle
ways. I'm wondering if an explicit opt-out could be implemented as part of the
task configuration. It seems like that could allow users who were experiencing
the issue to circumvent it and make any other modifications to their tasks that
the change necessitated.

Then again, if the only motivation for that feature is to work around a bug
which may have already been fixed, then it may not be worth the effort. Without
knowing how much this is affecting other users or knowing when the upstream fix
will be available, I'm in no position to say. Still, I thought I'd suggest this
in case it makes sense for the project.
It's a good argument, and would push me slightly in the "let's fix it" direction, except that we're trying to migrate from docker-worker to generic-worker, and any modification to docker-worker pulls time and attention from that project.

Basically, we're only fixing major bugs in docker-worker.  So I think we could take the time to upgrade docker when the fix is released, as that's just twiddling a version number and doing some smoke-testing; but implementing a tty option would take quite a bit more time and attention than is probably justified.
That makes sense to me. Thanks for the explanation!
Good news! From the containerd issue referenced in comment 25:

> Docker 18.06 has been released, and ships with containerd 1.1.1, which has
> that fix.

We will live with this until we switch to generic-worker.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX

:coop Since :dustin fixed this via bug 1478394, I think it's more accurate to close it as "FIXED".

Component: Docker-Worker → Workers