Closed Bug 1457694 Opened 6 years ago Closed 5 years ago

test harness and taskcluster output not in sync, showing too early '[taskcluster:error] Task timeout after 1800 seconds. Force killing container.'

Categories

(Taskcluster :: Workers, defect)


Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: aryx, Assigned: wcosta)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

Look at https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=168a838455004d0e71546cf7198836c69016ca94&filter-searchStr=Doc&selectedJob=175985384

Focus on the line '[taskcluster:error] Task timeout after 1800 seconds. Force killing container.'

Log: https://treeherder.mozilla.org/logviewer.html#?job_id=175985384&repo=autoland&lineNumber=760

[task 2018-04-27T16:40:01.321Z] Traceback (most recent call last):
[task 2018-04-27T16:40:01.321Z]   File "/builds/worker/checkouts/gecko/obj-x86_64-pc-linux-gnu/_virtualenv/lib/python2.7/site-packages/sphinx/ext/autodoc.py", line 658, in import_object
[taskcluster:error] Task timeout after 1800 seconds. Force killing container.

[task 2018-04-27T16:40:01.321Z]     __import__(self.modname)
[task 2018-04-27T16:40:01.321Z]   File "/builds/worker/checkouts/gecko/build/mach_bootstrap.py", line 363, in __call__
[task 2018-04-27T16:40:01.321Z]     module = self._original_import(name, globals, locals, fromlist, level)
[task 2018-04-27T16:40:01.321Z]   File "/builds/worker/checkouts/gecko/python/mozbuild/mozbuild/action/generate_searchjson.py", line 14, in <module>
[task 2018-04-27T16:40:01.322Z]     output = open(output_file, 'w')
[task 2018-04-27T16:40:01.322Z] IOError: [Errno 21] Is a directory: 'docs-out'

But the job actually started and was killed at these times (UTC+2):
Fri Apr 27, 18:34:39
Fri Apr 27, 19:05:17

That terminating message should be at the end of the log.
[taskcluster 2018-04-27 17:05:14.263Z] Unsuccessful task run with exit code: -1 completed in 1834.169 seconds

It looks like it took an additional 25 minutes to upload the artifacts for the task.  I only see small artifacts, but perhaps there was a larger one that it never finished uploading?  Or perhaps S3 just got slow -- that happens far more than AWS likes to say :)

I think the log mis-ordering is a buffering issue -- I suspect that the order of events was:

 * task terminated
 * uploads initiated
 * timeout
 * write "Force killing container" to log
 * close docker container streams, flushing buffers
 * write last 14ms worth of logs
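
A toy illustration of that buffering hypothesis (not docker-worker code; the path and buffer size are arbitrary) -- a line held in a userspace buffer only lands when the buffer is flushed, so it can appear after a line that was written later but unbuffered:

import os

path = '/tmp/reorder-demo.log'
open(path, 'w').close()  # start with an empty "log"

# Stand-in for the task's stdout stream: block-buffered, so this line sits
# in the userspace buffer rather than reaching the file immediately.
task_stream = open(path, 'a', buffering=1 << 16)
task_stream.write('[task] ...final lines of task output...\n')

# Stand-in for the worker's own writer: an unbuffered append lands right away.
worker_fd = os.open(path, os.O_WRONLY | os.O_APPEND)
os.write(worker_fd, b'[taskcluster:error] Task timeout after 1800 seconds.\n')
os.close(worker_fd)

# Closing the buffered stream flushes it, so the task output now appears
# *after* the error line -- the same ordering seen in the log above.
task_stream.close()

print(open(path).read())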

I'll let coop and wcosta be the arbiters, but I'm inclined not to try to fix this.  Every hour we spend on docker-worker is an hour less work on docker-engine, and in the scheme of things the log-ordering issue is pretty minor.  Without a reproduction, there's likely nothing we can do about the slow upload.
Component: General → Docker-Worker
Bug 1411358 collects intermittent failures due to task timeouts. There are a variety of causes, of course, but in recent weeks there have been quite a few similar to this one -- a long delay between the last output from the harness and the end of the task, plus a "Task timeout after..." message in a random location close to, but not at, the end of the log -- across a variety of Android and Linux test tasks.

https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=177828889&lineNumber=24944

[task 2018-05-10T10:01:46.396Z] 10:01:46     INFO -  [967, Main Thread] WARNING: Provided buffer length is out-of-bounds in WriteBuffer: file /builds/worker/workspace/build/src/image/SurfacePipe.h, line 309
[task 2018-05-10T10:01:46.396Z] 10:01:46     INFO -  [967, Main Thread] WARNING: Provided buffer length is out-of-bounds in WriteBuffer: file /builds/worker/workspace/build/src/image/SurfacePipe.h, line 309

[taskcluster:error] Task timeout after 3600 seconds. Force killing container.
[task 2018-05-10T10:01:46.397Z] 10:01:46     INFO -  [967, Main Thread] WARNING: Provided buffer length is out-of-bounds in WriteBuffer: file /builds/worker/workspace/build/src/image/SurfacePipe.h, line 309
[task 2018-05-10T10:01:46.397Z] 10:01:46     INFO -  [967, Main Thread] WARNING: Provided buffer length is out-of-bounds in WriteBuffer: file /builds/worker/workspace/build/src/image/SurfacePipe.h, line 309

...

[task 2018-05-10T10:01:46.407Z] 10:01:46     INFO -  TEST-START | ImageBitmapColorUtils.YUV444PToRGBA32
[task 2018-05-10T10:01:46.407Z] 10:01:46     INFO -  TEST-PASS | ImageBitmapColorUtils.YUV444PToRGBA32 | test completed (time: 0ms)
[task 2018-05-10T10:01:46.407Z] 10:01:46     INFO -  TEST-START | ImageBitmapColorUtils.YUV444PToBGRA32
[task 2018-05-10T10:01:46.407Z] 10:01:46     INFO -  TEST-PASS | ImageBitmapColorUtils.YUV444PToBGRA32 | test completed (time: 0ms)[taskcluster 2018-05-10 10:52:03.052Z] === Task Finished ===
[taskcluster 2018-05-10 10:52:03.053Z] Unsuccessful task run with exit code: -1 completed in 3601.594 seconds
Blocks: 1411358
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=177915320&lineNumber=24827

[task 2018-05-10T17:53:40.871Z] 17:53:40     INFO -  [965, Main Thread] WARNING: Provided buffer length is out-of-bounds in WriteBuffer: file /builds/worker/workspace/build/src/image/SurfacePipe.h, line 309
[task 2018-05-10T17:53:40.872Z] 17:53:40     INFO -  [965, Main Thread] WARNING: Provided buffer length is out-of-bounds in WriteBuffer: file /builds/worker/workspace/build/src/image/SurfacePipe.h, line 309

[taskcluster:error] Task timeout after 3600 seconds. Force killing container.
[task 2018-05-10T17:53:40.872Z] 17:53:40     INFO -  [965, Main Thread] WARNING: Provided buffer length is out-of-bounds in WriteBuffer: file /builds/worker/workspace/build/src/image/SurfacePipe.h, line 309
[task 2018-05-10T17:53:40.873Z] 17:53:40     INFO -  [965, Main Thread] WARNING: Provided buffer length is out-of-bounds in WriteBuffer: file /builds/worker/workspace/build/src/image/SurfacePipe.h, line 309

...

[task 2018-05-10T17:53:40.906Z] 17:53:40     INFO -  TEST-START | ImageSurfaceSink.PalettedSurfaceSinkWriteUnsafeComputedRow
[task 2018-05-10T17:53:40.907Z] 17:53:40     INFO -  TEST-PASS | ImageSurfaceSink.PalettedSurfaceSinkWriteUnsafeComputedRow | test completed (time: 0ms)
[task 2018-05-10T17:53:40.907Z] 17:53:40     INFO -  TEST-START | GlobalWindowDialogOptions.TestDialogTokenize[taskcluster 2018-05-10 18:46:55.516Z] === Task Finished ===
[taskcluster 2018-05-10 18:46:55.516Z] Unsuccessful task run with exit code: -1 completed in 3602.469 seconds
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=177850775&lineNumber=1396

[task 2018-05-10T12:59:25.622Z] 12:59:25     INFO - root      13    2     0      0     c007340c 00000000 S kswapd0
[task 2018-05-10T12:59:25.623Z] 12:59:25     INFO - root      14    2     0      0     c004ce30 00000000 S aio/0
[task 2018-05-10T12:59:25.623Z] 12:59:25     INFO - root      25    2     0      0     c016d0f8 00000000 S mtdblockd

[taskcluster:error] Task timeout after 7200 seconds. Force killing container.
[task 2018-05-10T12:59:25.623Z] 12:59:25     INFO - root      26    2     0      0     c004ce30 00000000 S kstriped
[task 2018-05-10T12:59:25.623Z] 12:59:25     INFO - root      27    2     0      0     c004ce30 00000000 S hid_compat

...

[task 2018-05-10T13:00:10.662Z] 13:00:10     INFO -    process:
[task 2018-05-10T13:00:10.662Z] 13:00:10     INFO -       USER     PID   PPID  VSIZE  RSS     WCHAN    PC         NAME
[task 2018-05-10T13:00:10.663Z] 13:00:10     INFO -       root      1     0     364    268   c0098770 00017d58 S /init[taskcluster 2018-05-10 14:56:38.590Z] === Task Finished ===
[taskcluster 2018-05-10 14:56:38.590Z] Unsuccessful task run with exit code: -1 completed in 7461.379 seconds
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=177834671&lineNumber=174

[task 2018-05-10T10:32:02.128Z] + true
[task 2018-05-10T10:32:02.128Z] + . /builds/worker/scripts/xvfb.sh

[taskcluster:error] Task timeout after 3600 seconds. Force killing container.
[task 2018-05-10T10:32:02.129Z] #! /bin/bash -x
[task 2018-05-10T10:32:02.129Z] 

...

[task 2018-05-10T10:32:19.686Z] 10:32:19     INFO -  'emulator': {'device_id': 'emulator-5554', 'name': 'test-1'},
[task 2018-05-10T10:32:19.686Z] 10:32:19     INFO -  'emulator_extra_args': '-show-kernel -debug init,console,gles,memcheck,adbserver,adbclient,adb,avd_config,socket',
[task 2018-05-10T10:32:19.687Z] 10:32:19     INFO -  'emulator_manifest': '\n        [\n        {\n        "size": 38451805,\n        "digest": "ac3000aa6846dd1b09a420a1ba84727e393036f49da1419d6c9e5ec2490fc6105c74ca18b616c465dbe693992e2939afd88bc2042d961a9[taskcluster 2018-05-10 11:32:00.847Z] === Task Finished ===
[taskcluster 2018-05-10 11:32:00.848Z] Unsuccessful task run with exit code: -1 completed in 3602.395 seconds
I don't care much about the location of the timeout error message (though it seems odd); the timeouts themselves, however, appear random and are pretty frequent -- very inconvenient.
How could you get more information on the timeouts?  I looked at the syslog from the host when I commented a few weeks ago, and didn't see anything out of the ordinary; the worker seemed to be plugging along as normal.  The links above are all from the 10th, which is beyond the 3-day syslog retention, so I can't re-check that.
Here's the syslog from that instance at the time (with accessTokens hidden).

Basically, docker-worker continued doing its housekeeping, including renewing its lease on the task during that time.  Nothing out of the ordinary.

This is also consistent with my thoughts about buffering: the output between the "Force killing container" and "Unsuccessful task run" was likely stuck in buffers with no further data to push it along during those 42 minutes.  So I'm guessing something was sent:

[task 2018-05-15T13:09:36.883Z] 13:09:36     INFO -  Sending-->

and nothing came back.  Perhaps adding a timeout there will help?
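
A minimal sketch of what such an output timeout could look like (hypothetical; not code from any of these harnesses, and the command and numbers are placeholders), bounding how long a blocking read on the child's stdout pipe may wait:

import os
import select
import subprocess

harness_cmd = ['./run-tests.sh']   # placeholder for whatever the harness actually runs
OUTPUT_TIMEOUT = 300               # seconds of silence before we give up

proc = subprocess.Popen(harness_cmd, stdout=subprocess.PIPE)
while True:
    # Wait until the pipe has data (or EOF), but never longer than OUTPUT_TIMEOUT.
    ready, _, _ = select.select([proc.stdout], [], [], OUTPUT_TIMEOUT)
    if not ready:
        proc.kill()
        raise RuntimeError('no output for %d seconds; killed the harness' % OUTPUT_TIMEOUT)
    chunk = os.read(proc.stdout.fileno(), 65536)
    if not chunk:                  # EOF: the child closed its stdout
        break
    # ...forward `chunk` to the log...
proc.wait()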
Flags: needinfo?(dustin)
Thanks for checking the syslog.

That "Sending-->" came from gtest code. I don't know if gtest already has a timeout, but many harnesses do. Remember, we see this sort of thing in a variety of test suites -- not just gtest. Certainly mochitests and reftests have mechanisms to time out if there is no output for a few minutes; oddly, these logs don't show those timeouts -- part of the mystery.

Consider comment 4, where an Android job is in the middle of dumping a process list when the log goes silent for almost 2 hours. It seems unlikely that this code was hung:

https://dxr.mozilla.org/mozilla-central/rev/cf3ee14023483cbbb57129479537c713e22c1980/layout/tools/reftest/remotereftest.py#342-351

A correlation I may be noticing (or may be imagining) is that these hangs often occur when there is a lot of output being logged at one time. Some gtests dump a lot to the log all at once; the Android jobs retrieve all that device info, then dump 50+ lines in a tight loop....
(In reply to Geoff Brown [:gbrown] from comment #10)
> A correlation I may be noticing (or may be imagining) is that these hangs
> often occur when there is a lot of output being logged at one time. Some
> gtests dump a lot to the log all at once; the Android jobs retrieve all that
> device info, then dump 50+ lines in a tight loop....

Here's an experiment then: Hack the mozharness script to just dump a lot of text, rather than running tests. Does the task intermittently hang and time-out? It does!

https://treeherder.mozilla.org/#/jobs?repo=try&revision=610ed4b30f42755cb83cbd5e5427f02c809bac17

https://treeherder.mozilla.org/logviewer.html#?job_id=186514985&repo=try&lineNumber=126958

[task 2018-07-04T23:31:19.465Z]     INFO - CKTpQNyGnqMMJlysbYKoxXSQwjxLqTZnphzPkOCuewgrhBR fIuPcHbLaNURJk dXvWvGtre YtAEmSVDAoWadlOHfzFsBZEmc F
[task 2018-07-04T23:31:19.466Z]     INFO - ODMKvoTQDgmkCzxHQGA lFoeukEuzdJaGrgfNXcwLOyFSthBPjeqxsVvRJ bUCXsmEnyMZWlSKW ZtfRiqiIcanVjdINThpBwYYP

[taskcluster:error] Task timeout after 3600 seconds. Force killing container.
[task 2018-07-04T23:31:19.466Z]     INFO -  KAaNW rLPSTiDIdhIOge ctvvwmFTBKngxDYCoflynGHNXAVXWSfp EZQUxhdpzaOJeRuLzYqZmMbGQMsyRqrslVjBiPEbkcouJ
[task 2018-07-04T23:31:19.466Z]     INFO - VnXTbKaFfmrjdhEiw vlqXAUJusJYiMAbtPMczDBIDgNOCUmPTrVtWNKgYxoHS GOLqcRZoeHf LzEuswQpBlnxkSZI WyRQavep
...
[task 2018-07-04T23:31:19.480Z]     INFO - lbURwuvxoLKWGpiVA sMmCHYkhCj SfIdXTcGEqrqOeHyeDdRrtzNIBVksKaTPJlv gcESjLQgFQNWBP JiAzhaXMwupZFyntbDo
[task 2018-07-04T23:31:19.480Z]     INFO - bSaexzqOslWPvdicmIwyeZyIJpaORFiVVTvTCxBtmYGdJDoNfjUYAGHUrgKuncbtqLDHr zslhjQMMXEQfLSKRukENk hCPw A X
[task 2018-07-04T23:31:19.480Z]     INFO - dNKERsbkJaFmqzhjUzMQiLHXYotcTSZJxRnUZcPMYCyofFICOdlu vOwpwgLIgtfNaGe EQ AexrDnXvWpPDVSumVHGqABTbiK h[taskcluster 2018-07-05 00:29:03.229Z] === Task Finished ===
[taskcluster 2018-07-05 00:29:03.229Z] Unsuccessful task run with exit code: -1 completed in 4053.648 seconds
And here's another experiment: Hack test-linux.sh to just dump a lot of text -- intermittent time-outs continue.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=120f4c0b95e0a2c07cf317fa167e07d39bf2cdef

https://treeherder.mozilla.org/logviewer.html#?job_id=186547391&repo=try&lineNumber=251136

[task 2018-07-05T04:41:44.054Z] DaE+r3xVq1Y1BS0IK9q7GOFxKt03zmbx4hkU2MTvqu84HXXcv3M/NutX35Efx21OS8MhBTSAH7MK
[task 2018-07-05T04:41:44.054Z] GFerhqhbPFyGH1JaCmIZUCgKU4P/MGo3JNQN7aouQy9RajogR75P7QGYQTVL0nkq/ihPb7w5luKY

[taskcluster:error] Task timeout after 3600 seconds. Force killing container.
[task 2018-07-05T04:41:44.055Z] vcp+tvuSz9p7BEg4wtc17mbCUjntB5FBE6NojOb88QVnDIOz7eYtB0l11011ODbpmJpNqV/RAKtB
...
[task 2018-07-05T04:41:44.058Z] 1TfJAJe053HMQDSI4JhF6dpxDk542M8XFE8X9Ha183GjrYt9RbKYeDaCTKT+gCCpcHhHG3NbVVpu
[task 2018-07-05T04:41:44.058Z] +nkqP1i2KIY0Gq7wKypRoPNVS2e5+E0lGnMQBmVqO1eRXqVURDHVFNdbtaVUOKpLQIDhFmaM3lf1
[task 2018-07-05T04:41:44.058Z] /4NOhL/fQCHC9oJmQElhFoQ4exE+0x/Yp9IqkVZMs/85cA80vb6A1Ui0fzOyp/0y8NBlLIL5g6gF[taskcluster 2018-07-05 05:41:14.985Z] === Task Finished ===
[taskcluster 2018-07-05 05:41:14.986Z] Unsuccessful task run with exit code: -1 completed in 3846.688 seconds
So it looks like this much output causes docker-worker to "lock up" until maxRunTime expires.  Once it does expire, it prints a message, flushes some output, and then kills the task.  This sounds suspiciously like a streams issue.....

Wander, any ideas?
Flags: needinfo?(wcosta)
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #13)
> So it looks like this much output causes docker-worker to "lock up" until
> maxRunTime expires.  Once it does expire, it prints a message, flushes some
> output, and then kills the task.  This sounds suspiciously like a streams
> issue.....
> 
> Wander, any ideas?

Putting in my back log
Assignee: nobody → wcosta
Status: NEW → ASSIGNED
Flags: needinfo?(wcosta)
Here's one more experiment, with mixed results... This time I modified run-task to generate random output itself, rather than calling test-linux.sh; in this case, there is still an intermittent task timeout, but the frequency of the failure seems much lower.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=391e230e9482441345576ec7988f9138611a4390

https://treeherder.mozilla.org/logviewer.html#?job_id=186668532&repo=try&lineNumber=121279-121284

cPMmfs LYIEVPLtEuGZkCxvjst hSrcQwMzTHjpXSuXbRiHNoOCVFoJB UfgkdUODZJnDFabnWdwlTveKpyygKmlIhzNYqWRiGQe
xMOKjSJwhjfYofqzNtnQHHgBkUPNuaLYFWtwDJixQkiMyubDUFXESVsGCnd vpvcAZIImroemzsABGyrTdlT hclRbK qRCEVZL 

[taskcluster:error] Task timeout after 3600 seconds. Force killing container.
TmXAOzN UMRNKH ydJZDGsLQzOjKihnXHpegMfrWaRmyIbwvUZlBQLJEqStAuVhFdg  cGecxbvxjWVfTolkDBCYYIqunkFiPsCa
nBkKIYNdcwMkXsZCflDtVOexSWLcAwhxdaj yaRIveYDzoM s qQG viZyhrPGouWuiUmjbpXNFAECFHJVRzTLtEKHbfglOTUBPn
...
Gvo aMmTORQZJGXZejyWRULqofpipDhYPuA UEDAKqYCLJkcdjKHdav mVfIVOtlBFiwgbM etHywTCEBWXSuFQPIbrlxnzngNzh
EneEhXroOTWQkcvoJDMKdNICB AfQK SSlsbrwwxWiYH FIeubqVvLgCpLdpzagNmmcT ytHyGlfjFkiJRBaDxPOjRuZUzGnVPAt[taskcluster 2018-07-05 18:23:09.260Z] === Task Finished ===
[taskcluster 2018-07-05 18:23:09.261Z] Unsuccessful task run with exit code: -1 completed in 3925.886 seconds


That is, we seem to have fewer of these streaming hangs when we do not use this code:

https://dxr.mozilla.org/mozilla-central/rev/6c0fa9a675c91390ca27664ffb626c56e8afea4d/taskcluster/scripts/run-task#116-137
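
For reference, that block is roughly a prefix-and-stream loop of this shape (an assumed paraphrase based on the snippets quoted later in this bug, not the exact code at the link):

import io
import subprocess
import sys

def print_line(prefix, data):
    # Stand-in for run-task's print_line: prefix and forward one raw output line.
    sys.stdout.buffer.write(b'[' + prefix + b'] ' + data)
    sys.stdout.buffer.flush()

def run_and_prefix_output(prefix, args):
    p = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    stdout = io.TextIOWrapper(p.stdout, encoding='latin1')
    while True:
        data = stdout.readline().encode('latin1')
        if data == b'':
            break
        print_line(prefix, data)
    return p.wait()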

:gps - Any thoughts?
Flags: needinfo?(gps)
I agree that run-task:116-137 seems suspect. And that code did change recently and could explain the uptick in failures.

I didn't have time today, but I'd like to test running `run-task` locally, having it execute a process that spits out random output, and see if I can reproduce. Reproducing locally is important, as that would enable someone to put the processes under strace and see what's going on. It would also be possible to enter a Python debugger and see what's up.

I'd also like to test exiting a child process without a newline being the final character.

I *think* the behavior of all these Python streams is that when the pipe is closed, an EOFError is supposed to be raised and any in-progress read() or readline() operations should terminate immediately.

What I'm not certain about is how the child process, pipes, and the various Python wrappers are interplaying. It is quite possible that we trigger a readline() in Python (which blocks), the child then exits (having sent no more output), and the parent hangs waiting on output that will never arrive.

There's also the possibility that we're hitting a Python bug where Python isn't handling the pipe close properly, leading to deadlock.

It might be worth scouring the Internets for recipes that stream line output from subprocesses in Python and see what we're doing differently/incorrectly. I concede our use of a TextIOWrapper for doing (universal) line reading is a bit wonky. That could very well be where the problem is. I wonder if we shouldn't just read() from the underlying pipe and do our own line resolving. It is a little more code. But if it avoids this issue...
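
A minimal sketch of that "read() from the underlying pipe and do our own line resolving" alternative (assumed structure and hypothetical helper names -- not the actual run-task code):

import os
import subprocess
import sys

def stream_lines(cmd, handle_line):
    # Read raw bytes straight from the pipe and split lines ourselves, so no
    # text-layer wrapper (TextIOWrapper) sits between us and the pipe.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    fd = proc.stdout.fileno()
    buf = b''
    while True:
        chunk = os.read(fd, 65536)   # returns b'' at EOF; no EOFError involved
        if not chunk:
            break
        buf += chunk
        while b'\n' in buf:          # emit every complete line we have so far
            line, buf = buf.split(b'\n', 1)
            handle_line(line + b'\n')
    if buf:
        handle_line(buf)             # trailing output without a final newline
    return proc.wait()

if __name__ == '__main__':
    # e.g. prefix every line, roughly in the spirit of run-task's print_line
    code = stream_lines([b'bash', b'-c', b'seq 5'],
                        lambda line: sys.stdout.buffer.write(b'[demo] ' + line))
    sys.exit(code)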

I'm not sure if I'll have time to debug this in the next ~week. I'm a bit busy with migrating hg.mozilla.org to a different data center right now and that project is under a hard deadline. I would prefer someone else debug this. But this is a problem in code I wrote and I do feel a responsibility to help resolve it. I'll gladly take a look at patches sent my way. But I can't dedicate unbounded time to debugging this issue right now :/
Flags: needinfo?(gps)
Something else to watch out for when generating "random" output is that the machine may run out of entropy and fail to generate more random data. Instead of generating random output, consider writing something more deterministic, e.g. iterate over `xrange(2**32)` and print lines with the running counter. And consider sleeping a little between writes so you don't overwhelm the CPUs.
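
A hypothetical deterministic spew along those lines (counter plus constant text, with an occasional short sleep; the numbers are arbitrary):

import sys
import time

for i in range(500000):
    sys.stdout.write('line %07d %s\n' % (i, 'x' * 80))
    if i % 10000 == 0:
        sys.stdout.flush()
        time.sleep(0.01)   # brief pause so generating output doesn't peg the CPU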
Based on https://hg.mozilla.org/try/rev/bdf6ac60376d, I added

if __name__ == '__main__':
    run_and_prefix_output(b'foo', [b'bash', b'/home/dustin/tmp/spew.sh'])  

and spew.sh is

base64 /dev/random | head -c 50000000

I ran this redirected to /tmp/stuff.  It ran for about 30 seconds, then stopped of its own accord:

dustin@jemison ~ $ head /tmp/stuff 
[foo 2018-07-07T16:48:55.039Z] executing [b'bash', b'/home/dustin/tmp/spew.sh']
[zzz 2018-07-07T16:48:55.040Z] run-task: reading
[foo 2018-07-07T16:48:55.068Z] q4/361VKfUfgU4H4n7TBYG05RiXG1Vv4xO0wLtG0mCT1hihlMDnOiUK51FszC+p56SxLWcy+qC4S
[foo 2018-07-07T16:48:55.068Z] Yj2Ro+RW1f69tAdCt7o3JRxHug6nzX7swjTucwUGutHS4t2bLc325U5IAZHJX2wJPr17rcpKOHPV
[foo 2018-07-07T16:48:55.068Z] QGV7cYF34UDyOwDZZCkeOyIEdGCHTgthkdnPvHfhrpFn8axqEpb0vvVn3qqC6VvIf/Kedc3TlZZI
[foo 2018-07-07T16:48:55.068Z] 8gCeqnsThW9/gl9tpei96t25iyvdobwOJMMrM/3KRpjiqHm9ejNFWIPQJxUF0zhnzNDvhiy1RwQo
[foo 2018-07-07T16:48:55.068Z] mwngah1Ymzy8Mf+bRZ8f16I39sVjwslqzAOOm0Ly8/pQUEunkvlx31VzLAT+xL1Ykb1liOoRb0ln
[foo 2018-07-07T16:48:55.068Z] 0NJjyOoiJOz0wu5SC5iYeRMEPmrXZuSssOLHjMjYz4cGAW74OusBgXH3hYfMczKhDTwNhiO7DJEf
[foo 2018-07-07T16:48:55.068Z] 4El7nBfWCYxkgW83SyOKTtvKo+XmtmkqYXohpv1qR8KaeIZHtAIcYrUXu9IwJ+mgeXT1PD23XyJS
[foo 2018-07-07T16:48:55.068Z] TM24S4gxvp4fRmbmWrDuo1kj6MQ4wQ1iQRUY/J1Q6WLTe2XMidyXE+2fpUBVbi8QgDTeaasqE2ly
dustin@jemison ~ $  tail /tmp/stuff 
[foo 2018-07-07T16:49:31.019Z] Icx5QrDj3KeQAGiyr33mXfHnwNIBRvC1wU+gkBFv3phPeL3m5esssuc6KyFQx8Jgc9mJ/I8p69jx
[foo 2018-07-07T16:49:31.019Z] yAVRdxN7iCT/d8LAFn5QIFJvuBIsN6OXchLCF1yt2o3gb8i8GtrGUnd30aNmG1yIaS+xNhYRRtTD
[foo 2018-07-07T16:49:31.019Z] HTfpP6WCia/wW0RVdHRHmI6rFGmh6tQK5L86ReukhEtNKH/codhz1ayuk8QsRmwWv30DVexZI0Dw
[foo 2018-07-07T16:49:31.019Z] 31jnDbIW+KvtNEhjmnhJlHjFTO+SEEa5xS1OsYZY0cH2Y9/ENuiABIWu1/av8cq2iNZ/AR6dhuyU
[foo 2018-07-07T16:49:31.019Z] ObH+tclJuE9DufIc0jpH2o3onzz195Dc8nMue3KyreVTrcNdckCExVEB48SWnKg2pWiBbyHuJBbu
[foo 2018-07-07T16:49:31.019Z] QiOIDy7JJ/83fsQ+idpr+R7ywrN1xMUTFZoIBo3CKIFDdJQt0ryHGy7t1jI6Z2uezOPei41fmA2c
[foo 2018-07-07T16:49:31.019Z] 05pR33kUV5x+ZxksQJoWl68sJ5dCofyA8sW0iwA+ySHZZJhQFwrb0cvkyH4l7zCxLibS4eVKwnZP
[foo 2018-07-07T16:49:31.019Z] iv1aqv5/QQq7qGsbxybtdqCNMR5pbPd35KxXZkBzQLfQqj4Iknln0RkRpTEMKFoddGtER3PAI9CR
[foo 2018-07-07T16:49:31.046Z] xOVHaUQaRZHOsKH6v9AGOCEXUJymXZn6trl8f9qxELOiu2NQEV[zzz 2018-07-07T16:49:31.046Z] run-task: waiting after 649351 lines read
[zzz 2018-07-07T16:49:31.046Z] run-task: wait complete

Running it again:

dustin@jemison ~ $ head /tmp/stuff 
[foo 2018-07-07T17:03:39.061Z] executing [b'bash', b'/home/dustin/tmp/spew.sh']
[zzz 2018-07-07T17:03:39.062Z] run-task: reading
[foo 2018-07-07T17:03:39.090Z] oIzTkltCUPf4lGvkPoXB1ibL0jt5/No55YyKPcuDEtOsk4NkXNtsghH6tNp5t/jsazlRfKlcnE4J
[foo 2018-07-07T17:03:39.090Z] FkFuaZB0TlJ76JZI5hatigQG+Oe4DuKiXR++aqT1qZmdLM78JxM7fJ984I9bZWbt9CB+vYBsc/2N
[foo 2018-07-07T17:03:39.090Z] vFZyIOFw/q+C0yqJ4XPLShxMu7Y3FgQcrikjfzrNNxhhuOH78zLfNxTepfNxf+z4C/lFbfId5zub
[foo 2018-07-07T17:03:39.090Z] f9YuLp4P2iPgrzo73k/U4HbylrAL9sj3PzrSe3GGYAZyghpPGpCEwPNjHEMAvv+23P7EXhuU8dYO
[foo 2018-07-07T17:03:39.090Z] VUcQ8yuxY9aDXlu+OVPVgE0qPx0uHOPpJcfIOY1+4OFA2IrRIcrdLh88h9LpHiGU0eDJfThbDDSF
[foo 2018-07-07T17:03:39.090Z] SDVpxj/c2Y3jgg5hPwGyMwpuW9XNNlyTMoIbW8z3jiBZz2yUWwdhG3FWHIawbA9rBBYK9oL3Cd99
[foo 2018-07-07T17:03:39.090Z] zgLj2tb4OQslLvf+SXfSfTgkh9fNhRgPmg4aK4pVxvEVCTeq0SYAZCwzBr48iAi0QjAAjYTDuGq5
[foo 2018-07-07T17:03:39.090Z] q13+X8/q6F9O3YsJ2XoqbxIQllQio7VCkgTUJwQR7kunRtfGr7kZ1+6fUxmgAUg4eoLlKoZwdx6Z
dustin@jemison ~ $  tail /tmp/stuff 
[foo 2018-07-07T17:04:15.924Z] 18VruuHjL2ot06Ctu+72R+Sh2JIDNOsJ8sVg7C1Cxyb7JIMWbY8le/UWY28V0MMCzYQzyZEp5+az
[foo 2018-07-07T17:04:15.924Z] E60JX1NpzQMUEFk96PWzMSvIyUph1djTd0wIKYy0Lhdl9/0zxyOb8tPsUSuW4+z9x0KumQgtWciW
[foo 2018-07-07T17:04:15.924Z] iB1me8InodPaTRh5F5P9ZONVLdK8TtbDFCdvaN3+whMpnSK6tr1F9r0zK2PoOoP8gtfcDn6b7rIv
[foo 2018-07-07T17:04:15.924Z] x8uOmmj0braPZ9cwEGio16WusE4cF3G7gHT0/NoTMsC8nMW6LmtmOO4tj+J1otIut8JOtB7slXOP
[foo 2018-07-07T17:04:15.924Z] Iargf3la8H/xHlq4Gsp+rhHJLJw5Qc94GknJF6mPnlI2Wo7qBYtD2VgUZK1nuwHuodcCGAUoNRn6
[foo 2018-07-07T17:04:15.924Z] lrSYHE5g5nqh9OM2zZYW16Nw8lks78l3ZZrJM4G9VkEYuQVQ/rLYQhHTaNtjaucjI1eyA/qnm7pn
[foo 2018-07-07T17:04:15.924Z] 11w2lNQOKcxeYCNneO1omaPAAlT0zJhJTdna26/LmWS1RjB6oA8hBLX9Fyn99mYLimpEKgz3NggK
[foo 2018-07-07T17:04:15.924Z] lmSxLwkC+gKr4Z8APXUZRP1weKoVCV8e1l67GEzyaF8TAsHGgl1KRfnDGHLAvbJ36vf61jiVx+Ys
[foo 2018-07-07T17:04:15.951Z] eINZ7e7Ytv9B7en7lP73++L4hFbLuBBXd7+Ngzq94I+fvp9rUn[zzz 2018-07-07T17:04:15.951Z] run-task: waiting after 649351 lines read
[zzz 2018-07-07T17:04:15.951Z] run-task: wait complete

Note that the line count is the same!

Watching `top` while this is going on, I don't see the processes using a lot of RAM or anything else that would cause the spew to fail.
Dropping the TextIOWrapper doesn't change things (precisely the same result):

    #stdout = io.TextIOWrapper(p.stdout, encoding='latin1')

    print_line(b"zzz", b"run-task: reading\n")
    line_count = 0
    while True:
        data = p.stdout.readline() #.encode('latin1')
Printing the result of `p.wait()` shows 0.
Oh, I just realized that's head *-c*, so this is running to completion (I was thinking it was -n and expecting 50000000 lines).

So, yeah, I can't reproduce the hang.
(In reply to Gregory Szorc [:gps] from comment #17)
> Something else to watch out for when generating "random" output is that the
> machine may run out of entropy and fail to generate more random data.
> Instead of generating random output, consider writing some more
> deterministic. e.g. run `xrange(2**32)` and print lines with the running
> counter. 

Reproduced with line count + constant text (no random output):

https://treeherder.mozilla.org/#/jobs?repo=try&revision=585d4e195903994840d299a29d0cdde0ff8622e1

This happened fairly frequently: Nearly 1 time in 10. This reduces run-task logging to a loop around a simple print statement, suggesting that the trouble is really in docker-worker.

> And consider sleeping a little between output so you don't
> overwhelm CPUs.

That seems unrealistic: run-task runs all sorts of builds and tests that generate output without regard to overwhelming CPUs.
I'm running some really basic tasks in https://tools.taskcluster.net/groups/e4a1BdK3RyiQ9QLDcagT9w as a further replication.
...and 20 minutes later, 19 out of 60 were failures with exactly this symptom. Here's the script to generate that:

const taskcluster = require('taskcluster-client');

const N = 60; 

const main = async () => {
  const queue = new taskcluster.Queue();
  const taskGroupId = taskcluster.slugid();
  console.log(`taskGroupId: ${taskGroupId}`);

  for (let n = 0; n < N; n++) {
    const task = { 
      taskGroupId,
      provisionerId: 'aws-provisioner-v1',
      workerType: 'gecko-t-linux-large',
      created: taskcluster.fromNow('0 seconds'),
      deadline: taskcluster.fromNow('3 hours'),    
      expires: taskcluster.fromNow('28 days'),
      priority: 'low',
      metadata: {
        source: 'https://bugzilla.mozilla.org/show_bug.cgi?id=1457694',
        owner: 'dustin@mozilla.com',
        description: 'testing docker-worker',
        name: `test ${n}`,
      },  
      payload: {
        image: 'ubuntu:16.04',
        command: ['/bin/bash', '-c', 'for ((i=1;i<=500000;i++)); do echo xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx $i; done'],
        maxRunTime: 1200,
      },  
    };  

    const taskId = taskcluster.slugid();
    await queue.createTask(taskId, task);
    console.log(taskId);
  }
};

main().catch(err => {
  console.error(err);
});
Removing the TTY allocation would avert this problem, but in IRC, Dustin
explained how that would violate many existing users' assumptions in subtle
ways. I'm wondering if an explicit opt-out could be implemented as part of the
task configuration. It seems like that could allow users who were experiencing
the issue to circumvent it and make any other modifications to their tasks that
the change necessitated.

Then again, if the only motivation for that feature is to work around a bug
which may have already been fixed, then it may not be worth the effort. Without
knowing how much this is affecting other users or knowing when the upstream fix
will be available, I'm in no position to say. Still, I thought I'd suggest this
in case it makes sense for the project.
It's a good argument, and would push me slightly in the "let's fix it" direction, except that we're trying to migrate from docker-worker to generic-worker, and any modification to docker-worker pulls time and attention from that project.

Basically, we're only fixing major bugs in docker-worker.  So I think we could take the time to upgrade docker when the fix is released, as that's just twiddling a version number and doing some smoke-testing; but implementing a tty option would take quite a bit more time and attention than is probably justified.
That makes sense to me. Thanks for the explanation!
Good news! From the containerd issue referenced in comment 25:

> Docker 18.06 has been released, and ships with containerd 1.1.1, which has
> that fix.

We will live with this until we switch to generic-worker.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX

:coop Since :dustin fixed this via bug 1478394, I think it's more accurate to close it as "FIXED".

Component: Docker-Worker → Workers