Closed Bug 1246176 Opened 8 years ago Closed 8 years ago

tc-W(3) and tc-W-e10s(3) has started retrying a lot after changing from 12 to 8 chunks

Categories

(Taskcluster :: General, defect)

Type: defect
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: dustin)

Details

Attachments

(3 files)

We don't know what is causing this, but we should investigate before it becomes more problematic (as of now it is taking CPU time).

It would be easier to know why if bug 1243020 was fixed.

https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=tc-W%283&group_state=expanded&fromchange=648e852030e3&exclusion_profile=false
More noticeable if we also include e10s jobs:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=tc-W%20%283&group_state=expanded&fromchange=648e852030e3&exclusion_profile=false
Summary: tc-W(3) has start retrying a lot starting last night → tc-W(3) and tc-W-e10s(3) has start retrying a lot starting last night
I know this was caused by my push [1] where I changed from 12 chunks to 8 and started trickling bug 1238435.

[1] https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=a7f2b93dd564
Attachment #8716324 - Flags: review?(garndt)
Armen just pointed us to a more recent failure,
  https://tools.taskcluster.net/task-inspector/#fIAKGG5qRWiQMgcSt7LXAQ/0
which still has (well, did a second ago) a live log.  It ends rather precipitously and unhelpfully:

16:25:51     INFO - TEST-OK | /html/dom/dynamic-markup-insertion/opening-the-input-stream/004.html | took 1622ms
16:25:51     INFO - TEST-START | /html/dom/dynamic-markup-insertion/opening-the-input-stream/005.html
16:25:52     INFO - PROCESS | 1551 | ++DOCSHELL 0x7f3e211d6800 == 55 [pid = 1551] [id = 470]
16:25:52     INFO - PROCESS | 1551 | ++DOMWINDOW == 155 (0x7f3e0dd52000) [pid = 1551] [serial = 1459] [outer = (nil)]
16:25:52     INFO - PROCESS | 1551 | ++DOMWINDOW == 156 (0x7f3e0e9ca400) [pid = 1551] [serial = 1460] [outer = 0x7f3e0dd52000]
16:25:53     INFO - PROCESS | 1551 | ++DOMWINDOW == 157 (0x7f3e13240400) [pid = 1551] [serial = 1461] [outer = 0x7f3e0dd52000]
16:25:53     INFO - PROCESS | 1551 | [1551] WARNING: Suboptimal indexes for the SQL statement 0x7f3e0881b530 (http://mzl.la/1FuID0j).: file /home/worker/workspace/build/src/storage/mozStoragePrivateHelpers.cpp, line 114
16:25:53     INFO - PROCESS | 1551 | [1551] WARNING: Suboptimal indexes for the SQL statement 0x7f3e0881b400 (http://mzl.la/1FuID0j).: file /home/worker/workspace/build/src/storage/mozStoragePrivateHelpers.cpp, line 114
16:25:56     INFO - PROCESS | 1551 | ++DOCSHELL 0x7f3e249b4800 == 56 [pid = 1551] [id = 471]
16:25:56     INFO - PROCESS | 1551 | ++DOMWINDOW == 158 (0x7f3e1339ac00) [pid = 1551] [serial = 1462] [outer = (nil)]
16:25:56     INFO - PROCESS | 1551 | [1551] WARNING: Suboptimal indexes for the SQL statement 0x7f3e0881b530 (http://mzl.la/1FuID0j).: file /home/worker/workspace/build/src/storage/mozStoragePrivateHelpers.cpp, line 114
16:25:56     INFO - PROCESS | 1551 | [1551] WARNING: Suboptimal indexes for the SQL statement 0x7f3e0881b400 (http://mzl.la/1FuID0j).: file /home/worker/workspace/build/src/storage/mozStoragePrivateHelpers.cpp, line 114
16:25:58     INFO - PROCESS | 1551 | ++DOMWINDOW == 159 (0x7f3e13240800) [pid = 1551] [serial = 1463] [outer = 0x7f3e1339ac00]
16:26:01     INFO - PROCESS | 1551 | ++DOMWINDOW == 160 (0x7f3e0ec40000) [pid = 1551] [serial = 1464] [outer = 0x7f3e1339ac00]
16:26:06     INFO - TEST-PASS | /html/dom/dynamic-markup-insertion/opening-the-input-stream/005.html | Cancelling timeout after document.open
16:26:06     INFO - TEST-OK | /html/dom/dynamic-markup-insertion/opening-the-input-stream/005.html | took 15151ms
16:26:06     INFO - TEST-START | /html/dom/dynamic-markup-insertion/opening-the-input-stream/006.html

The "live log" is fetched from the actual EC2 instance doing the work.  So we do have a datapoint here, in that the instance was still alive 12 minutes after the task was resolved claim-expired.  And we also have the instance ID and can look up some logging output from docker-worker itself via papertrail.

I see a reclaim of this task:

 Feb 05 07:53:20 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-1ec2d9ac docker-worker: {"type":"reclaiming task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-1ec2d9ac","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium","primaryTaskId":"fIAKGG5qRWiQMgcSt7LXAQ","primaryRunId":0,"taskId":"fIAKGG5qRWiQMgcSt7LXAQ","runId":0,"takenUntil":"2016-02-05T15:57:57.288Z"}
Feb 05 07:53:21 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-1ec2d9ac docker-worker: {"type":"reclaimed task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-1ec2d9ac","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium","primaryTaskId":"fIAKGG5qRWiQMgcSt7LXAQ","primaryRunId":0,"taskId":"fIAKGG5qRWiQMgcSt7LXAQ","runId":0,"takenUntil":"2016-02-05T16:13:21.065Z"}
Feb 05 07:53:21 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-1ec2d9ac docker-worker: {"type":"next claim","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-1ec2d9ac","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium","time":923014.6153846154,"primaryTaskId":"fIAKGG5qRWiQMgcSt7LXAQ","primaryRunId":0,"taskId":"fIAKGG5qRWiQMgcSt7LXAQ","runId":0,"takenUntil":"2016-02-05T16:13:21.065Z"}
Feb 05 07:53:34 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-1ec2d9ac docker-worker: {"type":"aliveness check","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-1ec2d9ac","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium","alive":true,"uptime":5816,"interval":30000} 

but no further reclaims after that time.  The queue resolved the task with claim-expired at 8:44, which doesn't seem to match up with the timestamps in the mozharness logging.

There's some state inconsistency in this docker-worker, somehow.
Ah, the mozharness timestamps are UTC (!?), so 16:26 UTC (mozharness) == 8:26 Pacific (papertrail).  So basically the tests up and quit at 8:26 in such a way that docker-worker forgot about them.
https://tools.taskcluster.net/task-inspector/#BdSBzuaCQgSzrQGqPvATkA/0

17:16:13     INFO - TEST-PASS | /html/browsers/the-window-object/window-properties.html | Window replaceable attribute: outerWidth
17:16:13     INFO - TEST-PASS | /html/browsers/the-window-object/window-properties.html | Window replaceable attribute: outerHeight
17:16:13     INFO - TEST-PASS | /html/browsers/the-window-object/window-properties.html | Window replaceable attribute: devicePixelRatio
17:16:13     INFO - TEST-PASS | /html/browsers/the-window-object/window-properties.html | constructor
17:16:13     INFO - TEST-OK | /html/browsers/the-window-object/window-properties.html | took 1926ms
17:16:13     INFO - TEST-START | /html/browsers/the-window-object/window-prototype-chain.html
17:16:14     INFO - PROCESS | 1567 | ++DOCSHELL 0x7f0e52f2b800 == 73 [pid = 1567] [id = 271]
17:16:14     INFO - PROCESS | 1567 | ++DOMWINDOW == 185 (0x7f0e52df3c00) [pid = 1567] [serial = 896] [outer = (nil)]
17:16:14     INFO - PROCESS | 1567 | ++DOMWINDOW == 186 (0x7f0e53010800) [pid = 1567] [serial = 897] [outer = 0x7f0e52df3c00]
17:16:14     INFO - PROCESS | 1567 | ++DOMWINDOW == 187 (0x7f0e54ceb000) [pid = 1567] [serial = 898] [outer = 0x7f0e52df3c00]
17:16:14     INFO - PROCESS | 1567 | [1567] WARNING: Suboptimal indexes for the SQL statement 0x7f0e7959b2e0 (http://mzl.la/1FuID0j).: file /home/worker/workspace/build/src/storage/mozStoragePrivateHelpers.cpp, line 114
17:16:14     INFO - PROCESS | 1567 | [1567] WARNING: Suboptimal indexes for the SQL statement 0x7f0e7959abc0 (http://mzl.la/1FuID0j).: file /home/worker/workspace/build/src/storage/mozStoragePrivateHelpers.cpp, line 114
17:16:19     INFO - TEST-PASS | /html/browsers/the-window-object/window-prototype-chain.html | window object
17:16:19     INFO - TEST-PASS | /html/browsers/the-window-object/window-prototype-chain.html | Window.prototype
17:16:19     INFO - TEST-PASS | /html/browsers/the-window-object/window-prototype-chain.html | Global scope polluter
17:16:19     INFO - TEST-PASS | /html/browsers/the-window-object/window-prototype-chain.html | EventTarget.prototype
17:16:19     INFO - TEST-PASS | /html/browsers/the-window-object/window-prototype-chain.html | Object.prototype
17:16:19     INFO - TEST-OK | /html/browsers/the-window-object/window-prototype-chain.html | took 5363ms
17:16:19     INFO - TEST-START | /html/browsers/windows/browsing-context-first-created.xhtml
17:16:22     INFO - PROCESS | 1567 | ++DOCSHELL 0x7f0e52d20000 == 74 [pid = 1567] [id = 272]
17:16:22     INFO - PROCESS | 1567 | ++DOMWINDOW == 188 (0x7f0e5259b800) [pid = 1567] [serial = 899] [outer = (nil)]

The claim expired at 9:21.

     Feb 05 06:59:57 docker-worker.aws-provisioner.us-west-1c.ami-c7f285a7.m1-medium.i-a83ced6b docker-worker: {"type":"aliveness check","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-a83ced6b","workerGroup":"us-west-1c","workerType":"desktop-test","workerNodeType":"m1.medium","alive":true,"uptime":3759,"interval":30000}
    Feb 05 06:59:57 docker-worker.aws-provisioner.us-west-1c.ami-c7f285a7.m1-medium.i-a83ced6b docker-worker: {"type":"garbage collection started","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-a83ced6b","workerGroup":"us-west-1c","workerType":"desktop-test","workerNodeType":"m1.medium"}
    Feb 05 06:59:58 docker-worker.aws-provisioner.us-west-1c.ami-c7f285a7.m1-medium.i-a83ced6b docker-worker: {"type":"garbage collection finished","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-a83ced6b","workerGroup":"us-west-1c","workerType":"desktop-test","workerNodeType":"m1.medium"}
    Feb 05 07:59:36 docker-worker.aws-provisioner.us-west-1c.ami-c7f285a7.m1-medium.i-a83ced6b docker-worker: {"type":"aliveness check","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-a83ced6b","workerGroup":"us-west-1c","workerType":"desktop-test","workerNodeType":"m1.medium","alive":true,"uptime":7338,"interval":30000}
    Feb 05 08:59:18 docker-worker.aws-provisioner.us-west-1c.ami-c7f285a7.m1-medium.i-a83ced6b docker-worker: {"type":"aliveness check","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-a83ced6b","workerGroup":"us-west-1c","workerType":"desktop-test","workerNodeType":"m1.medium","alive":true,"uptime":10921,"interval":30000}
    Feb 05 08:59:48 docker-worker.aws-provisioner.us-west-1c.ami-c7f285a7.m1-medium.i-a83ced6b docker-worker: {"type":"aliveness check","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-a83ced6b","workerGroup":"us-west-1c","workerType":"desktop-test","workerNodeType":"m1.medium","alive":true,"uptime":10951,"interval":30000}
    Feb 05 09:40:38 docker-worker.aws-provisioner.us-west-1c.ami-c7f285a7.m1-medium.i-a83ced6b docker-worker: {"type":"aliveness check","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-a83ced6b","workerGroup":"us-west-1c","workerType":"desktop-test","workerNodeType":"m1.medium","alive":true,"uptime":13400,"interval":30000}
    Feb 05 09:40:39 docker-worker.aws-provisioner.us-west-1c.ami-c7f285a7.m1-medium.i-a83ced6b docker-worker: Fri, 05 Feb 2016 17:40:40 GMT taskcluster-docker-worker:stats stats: ["worker_memory",{"free":3262676992,"total":3930931200,"used":668254208,"workerId":"i-a83ced6b","workerGroup":"us-west-1c","workerType":"desktop-test","instanceType":"m1.medium","provisionerId":"aws-provisioner-v1","capacity":0}]
    Feb 05 09:40:39 docker-worker.aws-provisioner.us-west-1c.ami-c7f285a7.m1-medium.i-a83ced6b docker-worker: Fri, 05 Feb 2016 17:40:40 GMT taskcluster-docker-worker:stats stats: ["worker_uptime",{"value":13402,"workerId":"i-a83ced6b","workerGroup":"us-west-1c","workerType":"desktop-test","instanceType":"m1.medium","provisionerId":"aws-provisioner-v1","capacity":0}]
    Feb 05 09:40:39 docker-worker.aws-provisioner.us-west-1c.ami-c7f285a7.m1-medium.i-a83ced6b docker-worker: Fri, 05 Feb 2016 17:40:40 GMT taskcluster-docker-worker:stats stats: ["worker_cpu_load",{"value":48.10168041494221,"workerId":"i-a83ced6b","workerGroup":"us-west-1c","workerType":"desktop-test","instanceType":"m1.medium","provisionerId":"aws-provisioner-v1","capacity":0}]
    Feb 05 09:40:39 docker-worker.aws-provisioner.us-west-1c.ami-c7f285a7.m1-medium.i-a83ced6b docker-worker: Fri, 05 Feb 2016 17:40:40 GMT taskcluster-docker-worker:stats stats: ["worker_hd",{"free":389032792064,"used":33552953344,"total":422585745408,"workerId":"i-a83ced6b","workerGroup":"us-west-1c","workerType":"desktop-test","instanceType":"m1.medium","provisionerId":"aws-provisioner-v1","capacity":0}]
And Greg is better at Papertrail than I am!

 Feb 05 09:16:22 docker-worker.aws-provisioner.us-west-1c.ami-c7f285a7.m1-medium.i-a83ced6b docker-worker: child_process.js:1162
Feb 05 09:16:22 docker-worker.aws-provisioner.us-west-1c.ami-c7f285a7.m1-medium.i-a83ced6b docker-worker:     throw errnoException(err, 'spawn');
Feb 05 09:16:22 docker-worker.aws-provisioner.us-west-1c.ami-c7f285a7.m1-medium.i-a83ced6b docker-worker:           ^
Feb 05 09:16:22 docker-worker.aws-provisioner.us-west-1c.ami-c7f285a7.m1-medium.i-a83ced6b docker-worker: Error: spawn ENOMEM
Feb 05 09:16:22 docker-worker.aws-provisioner.us-west-1c.ami-c7f285a7.m1-medium.i-a83ced6b docker-worker:     at exports._errnoException (util.js:746:11)
Feb 05 09:16:22 docker-worker.aws-provisioner.us-west-1c.ami-c7f285a7.m1-medium.i-a83ced6b docker-worker:     at ChildProcess.spawn (child_process.js:1162:11)
Feb 05 09:16:22 docker-worker.aws-provisioner.us-west-1c.ami-c7f285a7.m1-medium.i-a83ced6b docker-worker:     at exports.spawn (child_process.js:995:9)
Feb 05 09:16:22 docker-worker.aws-provisioner.us-west-1c.ami-c7f285a7.m1-medium.i-a83ced6b docker-worker:     at Object.exports.execFile (child_process.js:689:15)
Feb 05 09:16:22 docker-worker.aws-provisioner.us-west-1c.ami-c7f285a7.m1-medium.i-a83ced6b docker-worker:     at Object.exports.exec (child_process.js:649:18)
Feb 05 09:16:22 docker-worker.aws-provisioner.us-west-1c.ami-c7f285a7.m1-medium.i-a83ced6b docker-worker:     at Object.check (/home/ubuntu/docker_worker/node_modules/diskspace/diskspace.js:46:17)
Feb 05 09:16:22 docker-worker.aws-provisioner.us-west-1c.ami-c7f285a7.m1-medium.i-a83ced6b docker-worker:     at module.exports (/home/ubuntu/docker_worker/lib/stats/host_metrics.js:45:13)
Feb 05 09:16:22 docker-worker.aws-provisioner.us-west-1c.ami-c7f285a7.m1-medium.i-a83ced6b docker-worker:     at [object Object].wrapper [as _onTimeout] (timers.js:265:14)
Feb 05 09:16:22 docker-worker.aws-provisioner.us-west-1c.ami-c7f285a7.m1-medium.i-a83ced6b docker-worker:     at Timer.listOnTimeout (timers.js:110:15)
Feb 05 09:16:22 docker-worker.aws-provisioner.us-west-1c.ami-c7f285a7.m1-medium.i-a83ced6b kernel: [11945.137480] init: docker-worker main process ended, respawning 

In other words, an ENOMEM from docker-worker at about the time of the last log entry from the subprocess.  ENOMEM from fork tends to indicate kernel limits, rather than actual RAM/swap limits.  I wonder if the test is forkbombing?  Perhaps e10s isn't cleaning up zombies?
That instance is dead now, but if I see this intermittent again I'll try to connect to the host and see what's up.

Worth noting that the process docker-worker is spawning here is `df -k`, via the `diskspace` package.
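
If this reproduces on a live host, one cheap way to test the forkbomb/zombie hypothesis is to compare the process count against the kernel limits that can make fork()/spawn return ENOMEM, and to count zombies.  A minimal sketch (not part of docker-worker; it just assumes a Linux host with a standard /proc layout):

import os

def read_first_int(path):
    with open(path) as f:
        return int(f.read().split()[0])

# every numeric entry in /proc is a live process
pids = [d for d in os.listdir('/proc') if d.isdigit()]
zombies = 0
for pid in pids:
    try:
        with open('/proc/%s/stat' % pid) as f:
            # the state field comes right after the "(comm)" entry; 'Z' == zombie
            state = f.read().rpartition(')')[2].split()[0]
        if state == 'Z':
            zombies += 1
    except IOError:
        pass  # the process exited while we were iterating

print('processes: %d (zombies: %d)' % (len(pids), zombies))
print('kernel.pid_max: %d' % read_first_int('/proc/sys/kernel/pid_max'))
print('kernel.threads-max: %d' % read_first_int('/proc/sys/kernel/threads-max'))
print('vm.overcommit_memory: %d' % read_first_int('/proc/sys/vm/overcommit_memory'))

If the zombie count is climbing or the process/thread count is anywhere near the limits, that would point at the forkbomb theory rather than actual RAM exhaustion.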
From a running (not yet hung) instance of this job:

ubuntu    5225  0.0  0.0  12340  1552 pts/0    S+   17:48   0:00      \_ /bin/bash -xe /home/worker/workspace/test-linux.sh --no-read-buildbot-config --installer-url=https://queue.taskcluster.net/v1/task/azneRrY5RGWrZz8XQsHgEQ/artifacts/public/build/target.tar.bz2 --test-packages-url=https://queue.taskcluster.net/v1/task/azneRrY5RGWrZz8XQsHgEQ/artifacts/public/build/test_packages.json --download-symbols=ondemand --test-type=testharness --total-chunk=8 --this-chunk=3
ubuntu    5248  0.8  3.3 215924 128312 pts/0   S+   17:48   0:17      |   \_ Xvfb :0 -nolisten tcp -screen 0 1600x1200x24
ubuntu    5257  0.0  0.2 397916 10428 pts/0    Sl+  17:49   0:00      |   \_ x-session-manager
ubuntu    5307  0.0  0.0  12568   316 ?        Ss   17:49   0:00      |   |   \_ /usr/bin/ssh-agent /usr/bin/dbus-launch --exit-with-session x-session-manager
ubuntu    5323  0.0  0.4 683292 18360 pts/0    Sl+  17:49   0:01      |   |   \_ /usr/lib/gnome-settings-daemon/gnome-settings-daemon
ubuntu    5344  0.2  0.3 363392 13808 pts/0    Sl+  17:49   0:04      |   |   \_ metacity
ubuntu    5355  0.0  0.7 577520 27792 pts/0    Sl+  17:49   0:01      |   |   \_ unity-2d-panel
ubuntu    5356  0.1  1.9 1160560 74868 pts/0   Sl+  17:49   0:03      |   |   \_ unity-2d-shell
ubuntu    5363  0.0  0.2 456076  9200 pts/0    Sl+  17:49   0:00      |   |   \_ /usr/lib/gnome-settings-daemon/gnome-fallback-mount-helper
ubuntu    5364  0.0  0.7 1024200 29152 pts/0   Sl+  17:49   0:01      |   |   \_ nautilus -n
ubuntu    5372  0.0  0.1 368424  4464 pts/0    Sl+  17:49   0:00      |   |   \_ /usr/lib/deja-dup/deja-dup/deja-dup-monitor
ubuntu    5374  0.0  0.3 509772 12868 pts/0    Sl+  17:49   0:00      |   |   \_ bluetooth-applet
ubuntu    5598  0.0  0.2 427476 11080 pts/0    Sl+  17:49   0:00      |   |   \_ telepathy-indicator
ubuntu    5717  0.0  0.2 313208  9564 pts/0    Sl+  17:49   0:00      |   |   \_ gnome-screensaver
ubuntu    6012  0.0  0.3 410132 13436 pts/0    Sl+  17:50   0:00      |   |   \_ update-notifier
ubuntu    5296  5.3  0.4 151076 16048 pts/0    Sl+  17:49   1:54      |   \_ python2.7 /home/worker/workspace/mozharness/scripts/web_platform_tests.py --config-file mozharness/configs/web_platform_tests/prod_config.py --config-file mozharness/configs/remove_executables.py --no-read-buildbot-config --installer-url=https://queue.taskcluster.net/v1/task/azneRrY5RGWrZz8XQsHgEQ/artifacts/public/build/target.tar.bz2 --test-packages-url=https://queue.taskcluster.net/v1/task/azneRrY5RGWrZz8XQsHgEQ/artifacts/public/build/test_packages.json --download-symbols=ondemand --test-type=testharness --total-chunk=8 --this-chunk=3
ubuntu    6151  0.1  0.3  72256 14612 pts/0    S+   17:50   0:02      |       \_ python2.7 /home/worker/workspace/mozharness/scripts/web_platform_tests.py --config-file mozharness/configs/web_platform_tests/prod_config.py --config-file mozharness/configs/remove_executables.py --no-read-buildbot-config --installer-url=https://queue.taskcluster.net/v1/task/azneRrY5RGWrZz8XQsHgEQ/artifacts/public/build/target.tar.bz2 --test-packages-url=https://queue.taskcluster.net/v1/task/azneRrY5RGWrZz8XQsHgEQ/artifacts/public/build/test_packages.json --download-symbols=ondemand --test-type=testharness --total-chunk=8 --this-chunk=3
ubuntu    6179  3.3  3.5 680136 136100 pts/0   Sl   17:51   1:06      |       \_ /home/worker/workspace/build/venv/bin/python -u /home/worker/workspace/build/tests/web-platform/runtests.py --log-raw=- --log-raw=/home/worker/workspace/build/blobber_upload_dir/wpt_raw.log --log-errorsummary=/home/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log --binary=/home/worker/workspace/build/application/firefox/firefox --symbols-path=https://queue.taskcluster.net/v1/task/azneRrY5RGWrZz8XQsHgEQ/artifacts/public/build/target.crashreporter-symbols.zip --stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk --test-type=testharness --total-chunks=8 --this-chunk=3 --prefs-root=/home/worker/workspace/build/tests/web-platform/prefs --processes=1 --config=/home/worker/workspace/build/tests/web-platform/wptrunner.ini --ca-cert-path=/home/worker/workspace/build/tests/web-platform/certs/cacert.pem --host-key-path=/home/worker/workspace/build/tests/web-platform/certs/web-platform.test.key --host-cert-path=/home/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem --certutil-binary=/home/worker/workspace/build/tests/bin/certutil
ubuntu    6202  0.0  2.5 237076 98372 pts/0    S    17:51   0:00      |           \_ /home/worker/workspace/build/venv/bin/python -u /home/worker/workspace/build/tests/web-platform/runtests.py --log-raw=- --log-raw=/home/worker/workspace/build/blobber_upload_dir/wpt_raw.log --log-errorsummary=/home/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log --binary=/home/worker/workspace/build/application/firefox/firefox --symbols-path=https://queue.taskcluster.net/v1/task/azneRrY5RGWrZz8XQsHgEQ/artifacts/public/build/target.crashreporter-symbols.zip --stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk --test-type=testharness --total-chunks=8 --this-chunk=3 --prefs-root=/home/worker/workspace/build/tests/web-platform/prefs --processes=1 --config=/home/worker/workspace/build/tests/web-platform/wptrunner.ini --ca-cert-path=/home/worker/workspace/build/tests/web-platform/certs/cacert.pem --host-key-path=/home/worker/workspace/build/tests/web-platform/certs/web-platform.test.key --host-cert-path=/home/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem --certutil-binary=/home/worker/workspace/build/tests/bin/certutil
ubuntu    6203  0.0  2.5 720416 99016 pts/0    S    17:51   0:00      |           \_ /home/worker/workspace/build/venv/bin/python -u /home/worker/workspace/build/tests/web-platform/runtests.py --log-raw=- --log-raw=/home/worker/workspace/build/blobber_upload_dir/wpt_raw.log --log-errorsummary=/home/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log --binary=/home/worker/workspace/build/application/firefox/firefox --symbols-path=https://queue.taskcluster.net/v1/task/azneRrY5RGWrZz8XQsHgEQ/artifacts/public/build/target.crashreporter-symbols.zip --stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk --test-type=testharness --total-chunks=8 --this-chunk=3 --prefs-root=/home/worker/workspace/build/tests/web-platform/prefs --processes=1 --config=/home/worker/workspace/build/tests/web-platform/wptrunner.ini --ca-cert-path=/home/worker/workspace/build/tests/web-platform/certs/cacert.pem --host-key-path=/home/worker/workspace/build/tests/web-platform/certs/web-platform.test.key --host-cert-path=/home/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem --certutil-binary=/home/worker/workspace/build/tests/bin/certutil
ubuntu    6204  0.0  2.6 311340 99824 pts/0    Sl   17:51   0:00      |           \_ /home/worker/workspace/build/venv/bin/python -u /home/worker/workspace/build/tests/web-platform/runtests.py --log-raw=- --log-raw=/home/worker/workspace/build/blobber_upload_dir/wpt_raw.log --log-errorsummary=/home/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log --binary=/home/worker/workspace/build/application/firefox/firefox --symbols-path=https://queue.taskcluster.net/v1/task/azneRrY5RGWrZz8XQsHgEQ/artifacts/public/build/target.crashreporter-symbols.zip --stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk --test-type=testharness --total-chunks=8 --this-chunk=3 --prefs-root=/home/worker/workspace/build/tests/web-platform/prefs --processes=1 --config=/home/worker/workspace/build/tests/web-platform/wptrunner.ini --ca-cert-path=/home/worker/workspace/build/tests/web-platform/certs/cacert.pem --host-key-path=/home/worker/workspace/build/tests/web-platform/certs/web-platform.test.key --host-cert-path=/home/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem --certutil-binary=/home/worker/workspace/build/tests/bin/certutil
ubuntu    6206  0.0  2.6 237628 100348 pts/0   Sl   17:51   0:00      |           \_ /home/worker/workspace/build/venv/bin/python -u /home/worker/workspace/build/tests/web-platform/runtests.py --log-raw=- --log-raw=/home/worker/workspace/build/blobber_upload_dir/wpt_raw.log --log-errorsummary=/home/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log --binary=/home/worker/workspace/build/application/firefox/firefox --symbols-path=https://queue.taskcluster.net/v1/task/azneRrY5RGWrZz8XQsHgEQ/artifacts/public/build/target.crashreporter-symbols.zip --stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk --test-type=testharness --total-chunks=8 --this-chunk=3 --prefs-root=/home/worker/workspace/build/tests/web-platform/prefs --processes=1 --config=/home/worker/workspace/build/tests/web-platform/wptrunner.ini --ca-cert-path=/home/worker/workspace/build/tests/web-platform/certs/cacert.pem --host-key-path=/home/worker/workspace/build/tests/web-platform/certs/web-platform.test.key --host-cert-path=/home/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem --certutil-binary=/home/worker/workspace/build/tests/bin/certutil
ubuntu    6214  0.3  2.7 747840 105132 pts/0   Sl   17:51   0:06      |           \_ /home/worker/workspace/build/venv/bin/python -u /home/worker/workspace/build/tests/web-platform/runtests.py --log-raw=- --log-raw=/home/worker/workspace/build/blobber_upload_dir/wpt_raw.log --log-errorsummary=/home/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log --binary=/home/worker/workspace/build/application/firefox/firefox --symbols-path=https://queue.taskcluster.net/v1/task/azneRrY5RGWrZz8XQsHgEQ/artifacts/public/build/target.crashreporter-symbols.zip --stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk --test-type=testharness --total-chunks=8 --this-chunk=3 --prefs-root=/home/worker/workspace/build/tests/web-platform/prefs --processes=1 --config=/home/worker/workspace/build/tests/web-platform/wptrunner.ini --ca-cert-path=/home/worker/workspace/build/tests/web-platform/certs/cacert.pem --host-key-path=/home/worker/workspace/build/tests/web-platform/certs/web-platform.test.key --host-cert-path=/home/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem --certutil-binary=/home/worker/workspace/build/tests/bin/certutil
ubuntu    6218  0.0  2.6 543056 100960 pts/0   Sl   17:51   0:00      |           \_ /home/worker/workspace/build/venv/bin/python -u /home/worker/workspace/build/tests/web-platform/runtests.py --log-raw=- --log-raw=/home/worker/workspace/build/blobber_upload_dir/wpt_raw.log --log-errorsummary=/home/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log --binary=/home/worker/workspace/build/application/firefox/firefox --symbols-path=https://queue.taskcluster.net/v1/task/azneRrY5RGWrZz8XQsHgEQ/artifacts/public/build/target.crashreporter-symbols.zip --stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk --test-type=testharness --total-chunks=8 --this-chunk=3 --prefs-root=/home/worker/workspace/build/tests/web-platform/prefs --processes=1 --config=/home/worker/workspace/build/tests/web-platform/wptrunner.ini --ca-cert-path=/home/worker/workspace/build/tests/web-platform/certs/cacert.pem --host-key-path=/home/worker/workspace/build/tests/web-platform/certs/web-platform.test.key --host-cert-path=/home/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem --certutil-binary=/home/worker/workspace/build/tests/bin/certutil
ubuntu    6220  0.0  2.6 543076 101820 pts/0   Sl   17:51   0:00      |           \_ /home/worker/workspace/build/venv/bin/python -u /home/worker/workspace/build/tests/web-platform/runtests.py --log-raw=- --log-raw=/home/worker/workspace/build/blobber_upload_dir/wpt_raw.log --log-errorsummary=/home/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log --binary=/home/worker/workspace/build/application/firefox/firefox --symbols-path=https://queue.taskcluster.net/v1/task/azneRrY5RGWrZz8XQsHgEQ/artifacts/public/build/target.crashreporter-symbols.zip --stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk --test-type=testharness --total-chunks=8 --this-chunk=3 --prefs-root=/home/worker/workspace/build/tests/web-platform/prefs --processes=1 --config=/home/worker/workspace/build/tests/web-platform/wptrunner.ini --ca-cert-path=/home/worker/workspace/build/tests/web-platform/certs/cacert.pem --host-key-path=/home/worker/workspace/build/tests/web-platform/certs/web-platform.test.key --host-cert-path=/home/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem --certutil-binary=/home/worker/workspace/build/tests/bin/certutil
ubuntu    6969 67.9  6.1 1047656 237204 pts/0  Rl   18:22   1:34      |           \_ /home/worker/workspace/build/application/firefox/firefox --marionette about:blank -profile /tmp/tmpYuQOtg.mozrunner
ubuntu    6975  0.3  3.4 674040 134196 pts/0   Rl   18:22   0:00      |           \_ /home/worker/workspace/build/venv/bin/python -u /home/worker/workspace/build/tests/web-platform/runtests.py --log-raw=- --log-raw=/home/worker/workspace/build/blobber_upload_dir/wpt_raw.log --log-errorsummary=/home/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log --binary=/home/worker/workspace/build/application/firefox/firefox --symbols-path=https://queue.taskcluster.net/v1/task/azneRrY5RGWrZz8XQsHgEQ/artifacts/public/build/target.crashreporter-symbols.zip --stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk --test-type=testharness --total-chunks=8 --this-chunk=3 --prefs-root=/home/worker/workspace/build/tests/web-platform/prefs --processes=1 --config=/home/worker/workspace/build/tests/web-platform/wptrunner.ini --ca-cert-path=/home/worker/workspace/build/tests/web-platform/certs/cacert.pem --host-key-path=/home/worker/workspace/build/tests/web-platform/certs/web-platform.test.key --host-cert-path=/home/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem --certutil-binary=/home/worker/workspace/build/tests/bin/certutil

So, that does seem like a lot of copies of runtests.py.
That's

https://tools.taskcluster.net/task-inspector/#MFAXcHy0T_W7K7BEyd4B9Q/0
https://papertrailapp.com/systems/176844673/events

18:24:24     INFO - TEST-START | /html/browsers/history/the-history-interface/001.html
18:24:24     INFO - PROCESS | 1569 | ++DOCSHELL 0x7f30a2e04000 == 16 [pid = 1569] [id = 84]
18:24:24     INFO - PROCESS | 1569 | ++DOMWINDOW == 45 (0x7f309d4ca800) [pid = 1569] [serial = 256] [outer = (nil)]
18:24:24     INFO - PROCESS | 1569 | ++DOMWINDOW == 46 (0x7f309d4f9400) [pid = 1569] [serial = 257] [outer = 0x7f309d4ca800]

is the last thing logged, and now it's 18:31.

Papertrail logs from around that time don't have anything especially interesting:

 Feb 05 10:23:41 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"aliveness check","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium","alive":true,"uptime":5825,"interval":30000}
Feb 05 10:23:53 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"reclaiming task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium","primaryTaskId":"MFAXcHy0T_W7K7BEyd4B9Q","primaryRunId":0,"taskId":"MFAXcHy0T_W7K7BEyd4B9Q","runId":0,"takenUntil":"2016-02-05T18:28:30.225Z"}
Feb 05 10:23:54 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"reclaimed task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium","primaryTaskId":"MFAXcHy0T_W7K7BEyd4B9Q","primaryRunId":0,"taskId":"MFAXcHy0T_W7K7BEyd4B9Q","runId":0,"takenUntil":"2016-02-05T18:43:53.879Z"}
Feb 05 10:23:54 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"next claim","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium","time":922858.4615384615,"primaryTaskId":"MFAXcHy0T_W7K7BEyd4B9Q","primaryRunId":0,"taskId":"MFAXcHy0T_W7K7BEyd4B9Q","runId":0,"takenUntil":"2016-02-05T18:43:53.879Z"}
Feb 05 10:24:11 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"aliveness check","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium","alive":true,"uptime":5855,"interval":30000}
Feb 05 10:24:16 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"garbage collection started","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium"}
Feb 05 10:24:16 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"garbage collection finished","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium"}
Feb 05 10:24:29 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e sshd: error: Could not load host key: /etc/ssh/ssh_host_ed25519_key
Feb 05 10:24:31 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e sshd: pam_unix(sshd:session): session opened for user ubuntu by (uid=0)
Feb 05 10:24:41 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"aliveness check","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium","alive":true,"uptime":5885,"interval":30000}
Feb 05 10:25:11 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"aliveness check","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium","alive":true,"uptime":5915,"interval":30000}
Feb 05 10:25:16 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"garbage collection started","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium"}
Feb 05 10:25:17 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"garbage collection finished","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium"}
Feb 05 10:25:41 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"aliveness check","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium","alive":true,"uptime":5945,"interval":30000}
Feb 05 10:26:11 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"aliveness check","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium","alive":true,"uptime":5975,"interval":30000}
Feb 05 10:26:17 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"garbage collection started","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium"}
Feb 05 10:26:17 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"garbage collection finished","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium"}
Feb 05 10:26:41 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"aliveness check","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium","alive":true,"uptime":6005,"interval":30000}
Feb 05 10:26:41 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"polling","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium","message":"polling interval adjusted","oldInterval":5000,"newInterval":60000}
Feb 05 10:27:11 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"aliveness check","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium","alive":true,"uptime":6035,"interval":30000}
Feb 05 10:27:17 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"garbage collection started","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium"}
Feb 05 10:27:17 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"garbage collection finished","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium"}
Feb 05 10:27:41 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"aliveness check","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium","alive":true,"uptime":6065,"interval":30000}
Feb 05 10:28:11 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"aliveness check","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium","alive":true,"uptime":6095,"interval":30000}
Feb 05 10:28:17 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"garbage collection started","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium"}
Feb 05 10:28:17 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"garbage collection finished","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium"}
Feb 05 10:28:41 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"aliveness check","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium","alive":true,"uptime":6125,"interval":30000}
Feb 05 10:29:11 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"aliveness check","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium","alive":true,"uptime":6156,"interval":30000}
Feb 05 10:29:17 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"garbage collection started","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium"}
Feb 05 10:29:17 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"garbage collection finished","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium"}
Feb 05 10:29:41 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"aliveness check","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium","alive":true,"uptime":6186,"interval":30000}
Feb 05 10:30:12 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"aliveness check","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium","alive":true,"uptime":6216,"interval":30000}
Feb 05 10:30:17 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"garbage collection started","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium"}
Feb 05 10:30:17 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"garbage collection finished","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium"}
Feb 05 10:30:24 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e dhclient: DHCPREQUEST of 172.31.15.109 on eth0 to 172.31.0.1 port 67 (xid=0x2b2ecadf)
Feb 05 10:30:24 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e dhclient: DHCPACK of 172.31.15.109 from 172.31.0.1
Feb 05 10:30:24 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e dhclient: bound to 172.31.15.109 -- renewal in 1635 seconds.
Feb 05 10:31:12 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"aliveness check","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium","alive":true,"uptime":6276,"interval":30000}
Feb 05 10:31:17 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"garbage collection started","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium"}
Feb 05 10:31:17 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"garbage collection finished","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium"}
Feb 05 10:31:42 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"aliveness check","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium","alive":true,"uptime":6306,"interval":30000}
Feb 05 10:32:12 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"aliveness check","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium","alive":true,"uptime":6336,"interval":30000}
Feb 05 10:32:17 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"garbage collection started","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium"}
Feb 05 10:32:18 docker-worker.aws-provisioner.us-west-1b.ami-c7f285a7.m1-medium.i-2ce3f89e docker-worker: {"type":"garbage collection finished","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-2ce3f89e","workerGroup":"us-west-1b","workerType":"desktop-test","workerNodeType":"m1.medium"} 

On the host system, top shows

 6214 ubuntu    20   0 1996384 1.349g   2592 S 57.5 36.8   0:35.74 python
 6969 ubuntu    20   0 1092932 296664  67080 R 31.6  7.7   4:19.39 firefox
 1659 root      20   0 1067696 200096   6108 S  3.0  5.2  16:05.43 node

which is an awful lot for a host with 3.75GB of RAM, although vmstat doesn't show it swapping.  Just to check, Armen, these aren't the tests that we run on an m3.xlarge in Buildbot, are they?

In fact, I see things changing: RAM allocated and deallocated, CPU usage of processes changing... this is not totally wedged.  And now,

6214 ubuntu    20   0 3098180 2.347g   1664 S  0.7 64.1   0:58.68 python

so that Python process (one of the runtests.py copies) is now using 2.347GB on its own.  I strongly suspect that something is deeply wrong with the test runner in this case.

Looking at lsof, it's writing a great deal of information to /home/worker/workspace/build/blobber_upload_dir/wpt_raw.log

Tailing that file:

root@taskcluster-worker:~# tail /home/worker/workspace/build/blobber_upload_dir/wpt_raw.log
{"thread": "ProcessReader", "process": "1569", "pid": 934, "source": "web-platform-tests", "command": "/home/worker/workspace/build/application/firefox/firefox --marionette about:blank -profile /tmp/tmpYuQOtg.mozrunner", "time": 1454697676977, "action": "process_output", "data": "[1569] WARNING: Suboptimal indexes for the SQL statement 0x7f30a0e07120 (http://mzl.la/1FuID0j).: file /home/worker/workspace/build/src/storage/mozStoragePrivateHelpers.cpp, line 114"}
{"status": "PASS", "thread": "Thread-TestrunnerManager-1", "subtest": "document.write plaintext", "pid": 934, "source": "web-platform-tests", "test": "/html/dom/dynamic-markup-insertion/document-write/050.html", "time": 1454697677748, "action": "test_status"}
{"status": "OK", "source": "web-platform-tests", "test": "/html/dom/dynamic-markup-insertion/document-write/050.html", "thread": "Thread-TestrunnerManager-1", "time": 1454697677748, "action": "test_end", "pid": 934}
{"source": "web-platform-tests", "test": "/html/dom/dynamic-markup-insertion/document-write/051.html", "thread": "Thread-TestrunnerManager-1", "time": 1454697677779, "action": "test_start", "pid": 934}
{"thread": "ProcessReader", "process": "1569", "pid": 934, "source": "web-platform-tests", "command": "/home/worker/workspace/build/application/firefox/firefox --marionette about:blank -profile /tmp/tmpYuQOtg.mozrunner", "time": 1454697678097, "action": "process_output", "data": "++DOCSHELL 0x7f308dcd6800 == 44 [pid = 1569] [id = 423]"}
{"thread": "ProcessReader", "process": "1569", "pid": 934, "source": "web-platform-tests", "command": "/home/worker/workspace/build/application/firefox/firefox --marionette about:blank -profile /tmp/tmpYuQOtg.mozrunner", "time": 1454697678097, "action": "process_output", "data": "++DOMWINDOW == 153 (0x7f308bf51800) [pid = 1569] [serial = 1329] [outer = (nil)]"}
{"thread": "ProcessReader", "process": "1569", "pid": 934, "source": "web-platform-tests", "command": "/home/worker/workspace/build/application/firefox/firefox --marionette about:blank -profile /tmp/tmpYuQOtg.mozrunner", "time": 1454697678138, "action": "process_output", "data": "++DOMWINDOW == 154 (0x7f308d58a800) [pid = 1569] [serial = 1330] [outer = 0x7f308bf51800]"}
{"thread": "ProcessReader", "process": "1569", "pid": 934, "source": "web-platform-tests", "command": "/home/worker/workspace/build/application/firefox/firefox --marionette about:blank -profile /tmp/tmpYuQOtg.mozrunner", "time": 1454697678307, "action": "process_output", "data": "++DOMWINDOW == 155 (0x7f308e01c400) [pid = 1569] [serial = 1331] [outer = 0x7f308bf51800]"}
{"thread": "ProcessReader", "process": "1569", "pid": 934, "source": "web-platform-tests", "command": "/home/worker/workspace/build/application/firefox/firefox --marionette about:blank -profile /tmp/tmpYuQOtg.mozrunner", "time": 1454697678344, "action": "process_output", "data": "[1569] WARNING: Suboptimal indexes for the SQL statement 0x7f30a0e07250 (http://mzl.la/1FuID0j).: file /home/worker/workspace/build/src/storage/mozStoragePrivateHelpers.cpp, line 114"}
{"thread": "ProcessReader", "process": "1569", "pid": 934, "source": "web-platform-tests", "command": "/home/worker/workspace/build/application/firefox/firefox --marionette about:blank -profile /tmp/tmpYuQOtg.mozrunner", "time": 1454697678344, "action": "process_output", "data": "[1569] WARNING: Suboptimal indexes for the SQL statement 0x7f30a0e07120 (http://mzl.la/1FuID0j).: file /home/worker/workspace/build/src/storage/mozStoragePrivateHelpers.cpp, line 114"}

In /home/worker/workspace/logs/log_info.log I see logging far beyond that available from the livelog daemon, too.  And then the docker-worker process exited and killed the docker container (same logs as before, running 'df').

I suppose the thing to dig into is why the test runner is using that much memory.  Is it buffering output from some other process in memory?
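
For concreteness, the two patterns that question comes down to look roughly like this.  This is a stand-alone sketch with a stand-in child process, not the actual wptrunner code:

import subprocess
import sys

# a chatty child process standing in for the Firefox/debug-log output
child = [sys.executable, '-c',
         'print("\\n".join("TEST-PASS | line %d" % i for i in range(200000)))']

# Pattern A: the child's entire output lands in this process's memory at once.
buffered, _ = subprocess.Popen(child, stdout=subprocess.PIPE).communicate()

# Pattern B: stream line by line; each line can be logged and dropped, so the
# parent's peak memory stays roughly constant regardless of output volume.
proc = subprocess.Popen(child, stdout=subprocess.PIPE)
streamed = 0
for line in iter(proc.stdout.readline, b''):
    streamed += 1          # hand the line to the logger, then let it go
proc.wait()

print('buffered %d bytes at once vs. %d lines streamed' % (len(buffered), streamed))

If some layer of the harness is doing something closer to pattern A with the browser's output (or retaining every logged line in a list), that alone could explain multi-GB growth on a chunk with this much logging.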
Flags: needinfo?(armenzg)
My guess is that the logs are getting buffered unnecessarily. At least they are pretty large for this set of tests (some of the tests are generating multiple tens of thousands of results), and debug firefox produces a lot of junk output.
We can try a bigger instance and see if it behaves better?
Flags: needinfo?(armenzg)
We spent a lot of time running these tests on this specific instance type, so that's probably not a good direction to pursue.  I wonder if there's some particular junk message that's being spewed repeatedly.
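
One cheap way to check the "particular junk message" theory is to count the most frequent lines in the raw log.  A rough sketch (assumes a local copy of wpt_raw.log or log_info.log; the normalization regex is just a guess at what varies between repeats):

import collections
import re
import sys

counts = collections.Counter()
with open(sys.argv[1]) as f:           # e.g. log_info.log or wpt_raw.log
    for line in f:
        # strip hex addresses and numbers so "same message, different pointer"
        # lands in one bucket
        key = re.sub(r'0x[0-9a-f]+|\d+', 'N', line.strip())
        counts[key] += 1

for message, n in counts.most_common(20):
    print('%8d  %s' % (n, message[:120]))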

It looks like https://tools.taskcluster.net/task-inspector/#b2ZQqE7LTr6kMKY4mC58eg/0 is starting to hang the same way, so let me take a look there.
Just as a data point I will try a larger instance:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=dde5300cbdbf

I will be back on Monday.
So, there are MANY more lines of output in log_info.log than in the livelog right now:

18:49:25     INFO - TEST-PASS | /editing/run/forecolor.html | [["stylewithcss","true"],["forecolor","#0000FF"]] "<p>[foo</p> <p>bar]</p>" queryCommandIndeterm("stylewithcss") before 
18:49:25     INFO - TEST-PASS | /editing/run/forecolor.html | [["stylewithcss","true"],["forecolor","#0000FF"]] "<p>[foo</p> <p>bar]</p>" queryCommandState("stylewithcss") before 
18:49:25     INFO - TEST-PASS | /editing/run/forecolor.html | [["stylewithcss","true"],["forecolor","#0000FF"]] "<p>[foo</p> <p>bar]</p>" queryCommandValue("stylewithcss") before 
18:49:25     INFO - TEST-PASS | /editing/run/forecolor.html | [["stylewithcss","true"],["forecolor","#0000FF"]] "<p>[foo</p> <p>bar]</p>" queryCommandIndeterm("stylewithcss") after 
18:49:25     INFO - TEST-PASS | /editing/run/forecolor.html | [["stylewithcss","true"],["forecolor","#0000FF"]] "<p>[foo</p> <p>bar]</p>" queryCommandState("stylewithcss") after 
18:49:25     INFO - TEST-PASS | /editing/run/forecolor.html | [["stylewithcss","true"],["forecolor","#0000FF"]] "<p>[foo</p> <p>bar]</p>" queryCommandValue("stylewithcss") after 
18:49:25     INFO - TEST-PASS | /editing/run/forecolor.html | [["stylewithcss","true"],["forecolor","#0000FF"]] "<p>[foo</p> <p>bar]</p>" queryCommandIndeterm("forecolor") before 
18:49:25     INFO - TEST-PASS | /editing/run/forecolor.html | [["stylewithcss","true"],["forecolor","#0000FF"]] "<p>[foo</p> <p>bar]</p>" queryCommandState("forecolor") before 
18:49:25     INFO - TEST-PASS | /editing/run/forecolor.html | [["stylewithcss","true"],["forecolor","#0000FF"]] "<p>[foo</p> <p>bar]</p>" queryCommandValue("forecolor") before 
18:49:25     INFO - TEST-PASS | /editing/run/forecolor.html | [["stylewithcss","true"],["forecolor","#0000FF"]] "<p>[foo</p> <p>bar]</p>" queryCommandIndeterm("forecolor") after 
18:49:25     INFO - TEST-PASS | /editing/run/forecolor.html | [["stylewithcss","true"],["forecolor","#0000FF"]] "<p>[foo</p> <p>bar]</p>" queryCommandState("forecolor") after 
18:49:25     INFO - TEST-PASS | /editing/run/forecolor.html | [["stylewithcss","true"],["forecolor","#0000FF"]] "<p>[foo</p> <p>bar]</p>" queryCommandValue("forecolor") after 
18:49:25     INFO - TEST-PASS | /editing/run/forecolor.html | [["stylewithcss","false"],["forecolor","#0000FF"]] "<p>[foo</p> <p>bar]</p>": execCommand("stylewithcss", false, "false") return value 
18:49:25     INFO - TEST-PASS | /editing/run/forecolor.html | [["stylewithcss","false"],["forecolor","#0000FF"]] "<p>[foo</p> <p>bar]</p>": execCommand("forecolor", false, "#0000FF") return value 
18:49:25     INFO - TEST-PASS | /editing/run/forecolor.html | [["stylewithcss","false"],["forecolor","#0000FF"]] "<p>[foo</p> <p>bar]</p>" checks for modifications to non-editable content 
18:49:25     INFO - TEST-FAIL | /editing/run/forecolor.html | [["stylewithcss","false"],["forecolor","#0000FF"]] "<p>[foo</p> <p>bar]</p>" compare innerHTML - assert_equals: Unexpected innerHTML (after normalizing inline style) expected "<p><font color=\"#0000ff\">foo</font></p> <p><font color=\"#0000ff\">bar</font></p>" but got "<p><font color=\"#0000FF\">foo</font></p> <p><font color=\"#0000FF\">bar</font></p>"
18:49:25     INFO - runConformanceTest/<@http://web-platform.test:8000/editing/include/tests.js:5640:9
---- livelog stops here ----
18:49:25     INFO - Test.prototype.step@http://web-platform.test:8000/resources/testharness.js:1382:20
18:49:25     INFO - test@http://web-platform.test:8000/resources/testharness.js:496:9
18:49:25     INFO - runConformanceTest@http://web-platform.test:8000/editing/include/tests.js:5637:5
18:49:25     INFO - @http://web-platform.test:8000/editing/run/forecolor.html:40:5
18:49:25     INFO - @http://web-platform.test:8000/editing/run/forecolor.html:21:2
18:49:25     INFO - TEST-PASS | /editing/run/forecolor.html | [["stylewithcss","false"],["forecolor","#0000FF"]] "<p>[foo</p> <p>bar]</p>" queryCommandIndeterm("stylewithcss") before 
18:49:25     INFO - TEST-PASS | /editing/run/forecolor.html | [["stylewithcss","false"],["forecolor","#0000FF"]] "<p>[foo</p> <p>bar]</p>" queryCommandState("stylewithcss") before 
18:49:25     INFO - TEST-PASS | /editing/run/forecolor.html | [["stylewithcss","false"],["forecolor","#0000FF"]] "<p>[foo</p> <p>bar]</p>" queryCommandValue("stylewithcss") before 
18:49:25     INFO - TEST-PASS | /editing/run/forecolor.html | [["stylewithcss","false"],["forecolor","#0000FF"]] "<p>[foo</p> <p>bar]</p>" queryCommandIndeterm("stylewithcss") after 
18:49:25     INFO - TEST-PASS | /editing/run/forecolor.html | [["stylewithcss","false"],["forecolor","#0000FF"]] "<p>[foo</p> <p>bar]</p>" queryCommandState("stylewithcss") after 
18:49:25     INFO - TEST-PASS | /editing/run/forecolor.html | [["stylewithcss","false"],["forecolor","#0000FF"]] "<p>[foo</p> <p>bar]</p>" queryCommandValue("stylewithcss") after 
18:49:25     INFO - TEST-PASS | /editing/run/forecolor.html | [["stylewithcss","false"],["forecolor","#0000FF"]] "<p>[foo</p> <p>bar]</p>" queryCommandIndeterm("forecolor") before 
18:49:25     INFO - TEST-PASS | /editing/run/forecolor.html | [["stylewithcss","false"],["forecolor","#0000FF"]] "<p>[foo</p> <p>bar]</p>" queryCommandState("forecolor") before 
18:49:25     INFO - TEST-PASS | /editing/run/forecolor.html | [["stylewithcss","false"],["forecolor","#0000FF"]] "<p>[foo</p> <p>bar]</p>" queryCommandValue("forecolor") before 
18:49:25     INFO - TEST-PASS | /editing/run/forecolor.html | [["stylewithcss","false"],["forecolor","#0000FF"]] "<p>[foo</p> <p>bar]</p>" queryCommandIndeterm("forecolor") after 

followed by about 140,000 lines of additional logging, and still going. It looks like the "normal" level of logging repetitiveness: lots of TEST-PASS, DOMWINDOW, etc.

James, what could we do to try to debug what runtests.py is up to?
Comment on attachment 8716324 [details] [diff] [review]
do_not_retry.diff

Review of attachment 8716324 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good but I think you would want "retries: 0" unless you want it to retry once.
Attachment #8716324 - Flags: review?(garndt) → review+
garndt: I will land it with retry 0.

dustin: running on a larger instance clears this issue.

Adding NI for jgraham based on comment 16.
Flags: needinfo?(james)
Summary: tc-W(3) and tc-W-e10s(3) has start retrying a lot starting last night → tc-W(3) and tc-W-e10s(3) has started retrying a lot after changing from 12 to 8 chunks
Well, I'm happy to see tests moved to desktop-test-xlarge, but perhaps not in this case.

One thing to note: Buildbot and docker-worker may be doing different things with ttys -- it's possible that runtests.py is talking to a pty on Buildbot and not on docker-worker, or vice versa, if that makes a difference.
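
A quick way to check that from inside each environment is to have the job report whether its stdio is a tty; whether stdout is a pty also changes the default stdio buffering, which could plausibly matter here.  A minimal sketch (not tied to runtests.py; just run it the same way the harness is launched in each environment):

import os
import sys

# report whether stdin/stdout are connected to a terminal (pty) or a pipe/file
print('stdout isatty: %s' % sys.stdout.isatty())
print('stdin  isatty: %s' % sys.stdin.isatty())
try:
    print('stdout terminal: %s' % os.ttyname(sys.stdout.fileno()))
except OSError:
    print('stdout terminal: none (pipe or file)')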
https://hg.mozilla.org/integration/mozilla-inbound/rev/d1e6662456b20f839eb9ed0b14b700a7161a8d59
Bug 1246176 - TaskCluster web platform tests should not retry for infrastructures issues until bug 1246176 is fixed. DONTBUILD. r=garndt
So, in terms of the amount of output generated, that seems entirely reasonable. This chunk happens to contain some tests with many, many subtests, so getting hundreds of thousands of lines of output is not unexpected. A recent W3-e10s on Buildbot produced ~140,000 lines of JSON data, which expanded to 210,000 lines when formatted in the TBPL style. Those files are 54 MB and 29 MB respectively, so not huge, but not tiny either.

In theory this data should not be kept in memory for the duration of the run; at least I don't think the mozharness layer or the test harness layer does this, although I have not checked by more than code inspection. Presumably the live log view *does* do this.

To debug this further, I guess using a tool like Heapy [1] or memory_profiler [2] would help (a rough sketch follows the links below).

[1] http://guppy-pe.sourceforge.net/
[2] https://pypi.python.org/pypi/memory_profiler
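
Roughly what using memory_profiler would look like here (assumes `pip install memory_profiler`; `suspect_function` is a hypothetical stand-in for whichever wptrunner/mozharness code path we suspect of accumulating output):

from memory_profiler import profile

@profile
def suspect_function():
    # stand-in for e.g. a log-handling loop that retains every line
    lines = []
    for i in range(200000):
        lines.append('TEST-PASS | /some/test.html | subtest %d' % i)
    return len(lines)

if __name__ == '__main__':
    suspect_function()

Running the script directly prints a line-by-line memory report for the decorated function; the `mprof run` / `mprof plot` tools that ship with the package should chart RSS over a whole run.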
Flags: needinfo?(james)
The easier way for now would be to jump to the next bigger instance (I don't think we really need xlarge), only for web platform tests.

We could leave this bug open until we have time to look into it properly.
FWIW, the live-log daemon is not caching in memory (and certainly not in runtests.py's address space, which is the one that seems to be growing to multiple GB).

I see that wptrunner uses a bunch of threads, which is probably what we are seeing with the multiple instances in `ps` output.

Armen, let's put this on the m3.xlarge workerType for the time being.  I'll see if I can run this locally under memory_profiler.  Even if these tests work on an m3.xlarge, it's wasteful to be burning 1.5 GB of RAM, and it will end up costing us a lot of money by preventing us from running tests in parallel on the same instances.
Assignee: nobody → dustin
Attachment #8717100 - Flags: review?(dustin)
Attachment #8717104 - Flags: review?(dustin)
Attachment #8717104 - Flags: review?(dustin) → review+
Attachment #8717100 - Flags: review?(dustin) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/408fc73f39a0ee9c3d2aba2d247e7bf17843cb33
Bug 1246176 - Backout d1e6662456b2. DONTBUILD. r=backout

https://hg.mozilla.org/integration/mozilla-inbound/rev/dc67e43019fc5107e0f84986eeed98004318382f
Bug 1246176 - TaskCluster tier-2: Run web-platform-tests on m3.xlarge instances. CLOSED TREE. DONTBUILD. r=dustin
https://hg.mozilla.org/mozilla-central/rev/dc67e43019fc
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED