Browsertime tasks severely impacting pending queues on Windows hardware
Categories: Taskcluster :: Workers, defect
Tracking: (not tracked)
People: (Reporter: rmaries, Unassigned)
References: (Regression)
Details
There is a pending queue for Windows hardware and talos: https://earthangel-b40313e5.influxcloud.net/d/slXwf4emz/workers?orgId=1&refresh=1m.
Only two Windows hardware workers are running right now: https://firefox-ci-tc.services.mozilla.com/provisioners/releng-hardware/worker-types/gecko-t-win10-64-hw
Comment 2•6 years ago
This also affects Linux.
There are more details in bug 1595279 comment 5 but let's use this bug for the generic worker issue and keep bug 1595279 focused on the Windows AArch machines.
Comment 3•6 years ago
We don't yet understand what is causing the generic-worker (gw) panic.
We do know:
- the version of generic-worker in question (16.5.1) works fine in EC2
- using an older gw (14.1.2), which was working fine until recently, throws the same exception/panic
- the yogas show evidence that the problem exists whether or not the instance has had operating system KB updates
- resetting generic-worker (deleting current-task-user.json, next-task-user.json, tasks-resolved-count.txt), at least on the yogas, makes no difference
I can't help but suspect that some Microsoft update or change is responsible, but I can't find evidence of any change making an impact.
Comment 4•6 years ago
There is also the Linux backlog (comment 2): https://firefox-ci-tc.services.mozilla.com/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/
Those machines are occupied by tier-3 browsertime tasks which each run for ~7h. The Windows machines that have been processing their current task for a long time are also running browsertime jobs.
Bug 1594438 added "cold page load tests for browsertime on integration branches" - these new tasks run for every push for one day to collect data on how often they fail (and thus whether they should run).
The scheduling of all these new tasks should stop in ~3h.
Comment 5•6 years ago
I'm troubleshooting with Rob in a zoom meeting (751-471-384) and I'm seeing some strange errors when workers are publishing artifacts, e.g.
Nov 12 05:08:44 T-W1064-MS-110.mdc1.mozilla.com-1 generic-worker PUT /GdJyuSGaT1OHbbJDLaHXtw/0/public/test_info/browsertime-results/raptor-tp6-youtube-firefox/windowrecording-1573527996199.87/frame0001.png?AWSAccessKeyId=************&Content-Type=image%2Fpng&Expires=1573533537&Signature=fNMCrNy5A7d12N4oJdWVEjXA1FI%3D HTTP/1.1 Host: firefoxcitc-public-artifacts.s3.us-west-2.amazonaws.com User-Agent: Go-http-client/1.1 Content-Length: 0 Content-Encoding: identity Content-Type: image/png Accept-Encoding: gzip
Nov 12 05:08:44 T-W1064-MS-110.mdc1.mozilla.com-1 generic-worker Error: (Intermittent) HTTP response code 501 HTTP/1.1 501 Not Implemented Connection: close Transfer-Encoding: chunked Content-Type: application/xml Date: Tue, 12 Nov 2019 04:08:43 GMT Server: AmazonS3 X-Amz-Id-2: PTJKhQoq23Lo+ptbGuHuIzrJnvnPOZhVMkOVk6gLkAw9DU6okHZv9yYJOFAM9EWjUSOvEqib9+0= X-Amz-Request-Id: C333C73A2324E6BC 14d <?xml version="1.0" encoding="UTF-8"?> <Error><Code>NotImplemented</Code><Message>A header you provided implies functionality that is not implemented</Message><Header>Transfer-Encoding</Header><RequestId>C333C73A2324E6BC</RequestId><HostId>PTJKhQoq23Lo+ptbGuHuIzrJnvnPOZhVMkOVk6gLkAw9DU6okHZv9yYJOFAM9EWjUSOvEqib9+0=</HostId></Error> 0
Nov 12 05:09:47 T-W1064-MS-110.mdc1.mozilla.com-1 generic-worker Request
I'm not sure if this is related at the moment, but it is certainly unusual.
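One plausible mechanism for the 501 (a sketch, not a claim about generic-worker's actual upload code): the request dump above shows Go-http-client with Content-Length: 0, and Go's net/http sends a non-nil body of unknown length with Transfer-Encoding: chunked, which S3 rejects on a plain PUT with exactly this NotImplemented/Transfer-Encoding error. The URL below is illustrative:
```
// Sketch: how Go's net/http chooses between Content-Length and
// Transfer-Encoding: chunked for an outgoing PUT. S3 rejects chunked
// plain PUTs with "501 NotImplemented ... <Header>Transfer-Encoding</Header>".
package main

import (
	"bytes"
	"io"
	"net/http"
	"net/http/httputil"
	"os"
)

func main() {
	const url = "https://firefoxcitc-public-artifacts.s3.us-west-2.amazonaws.com/example.png"

	// Known length: NewRequest recognises *bytes.Reader and sets
	// ContentLength, so the wire request carries a Content-Length header.
	known, _ := http.NewRequest("PUT", url, bytes.NewReader([]byte("png-bytes")))

	// Unknown length: a plain io.Reader leaves ContentLength at 0 with a
	// non-nil Body, so the transport falls back to chunked encoding.
	var body io.Reader = io.MultiReader(bytes.NewReader([]byte("png-bytes")))
	unknown, _ := http.NewRequest("PUT", url, body)

	for _, req := range []*http.Request{known, unknown} {
		dump, _ := httputil.DumpRequestOut(req, false)
		os.Stdout.Write(dump) // second dump shows "Transfer-Encoding: chunked"
	}
}
```
A zero-length or unsized upload body (e.g. a corrupted png, see comment 16) would take the chunked path.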
Comment 6•6 years ago
Some interesting things have turned up.
Looking at worker T-W1064-MS-110.mdc1.mozilla.com
we are getting the repeated failure:
Interactive username task_1573531952 does not match task user task_1573533123 from next-task-user.json file
When generic-worker creates a task user, it uses the current time (unix timestamp in seconds) for the numerical part of the username, so we can see that these two users were created at the following times:
- task_1573531952: Tuesday, November 12, 2019 5:12:32 AM GMT+01:00
- task_1573533123: Tuesday, November 12, 2019 5:32:03 AM GMT+01:00
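The timestamps above can be recovered mechanically from the usernames; a quick sketch (the taskUserCreated helper is illustrative, not generic-worker code):
```
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// taskUserCreated recovers the creation time from a task user name such
// as "task_1573531952" (the suffix is a unix timestamp in seconds).
func taskUserCreated(username string) (time.Time, error) {
	secs, err := strconv.ParseInt(strings.TrimPrefix(username, "task_"), 10, 64)
	if err != nil {
		return time.Time{}, err
	}
	return time.Unix(secs, 0).UTC(), nil
}

func main() {
	t, _ := taskUserCreated("task_1573531952")
	fmt.Println(t) // 2019-11-12 04:12:32 +0000 UTC (05:12:32 GMT+1)
}
```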
The error indicates that generic-worker had created the task user task_1573533123 for running the next task, updated the Windows autologon registry entries to automatically log in as that user, and then rebooted the computer. generic-worker then started up in a Windows Service, running as the LocalSystem user, called WTSGetActiveConsoleSessionId to get the session ID of the console session (the session associated with the interactive desktop user), and then called WTSQueryUserToken to fetch the default auth token of the interactive session. From the auth token, a call is made to GetTokenInformation, which returns a TOKEN_USER structure, from which the task username of the interactive user is ultimately derived. If this does not match the value that generic-worker put in the registry before rebooting, it panics with the above error message.
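A minimal sketch of that startup check (struct and field names here are illustrative, not generic-worker's actual source):
```
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// TaskUser mirrors the shape of next-task-user.json for this sketch;
// the real file format is generic-worker's own.
type TaskUser struct {
	Name string `json:"name"`
}

// checkInteractiveUser compares the logged-in user (derived via
// WTSGetActiveConsoleSessionId / WTSQueryUserToken / GetTokenInformation)
// against the user recorded before the reboot.
func checkInteractiveUser(interactiveUser string) error {
	data, err := os.ReadFile("next-task-user.json")
	if err != nil {
		return err
	}
	var next TaskUser
	if err := json.Unmarshal(data, &next); err != nil {
		// A corrupted file fails here, e.g.
		// invalid character '\x00' looking for beginning of value
		return err
	}
	if interactiveUser != next.Name {
		// generic-worker panics at this point with the error quoted above
		return fmt.Errorf("Interactive username %v does not match task user %v from next-task-user.json file",
			interactiveUser, next.Name)
	}
	return nil
}

func main() {
	if err := checkInteractiveUser("task_1573531952"); err != nil {
		fmt.Println(err)
	}
}
```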
Comment 8•6 years ago
We can see that the worker was executing tasks prior to the failures. The logs have some interesting context here:
At 05:32:05 GMT+1 we see that the worker created the task user task_1573533123. Presumably it also successfully wrote it to the next-task-user.json file, since the error message in comment 6 shows us that it found this value in the next-task-user.json file. So all fine so far.
Nov 12 05:32:05 T-W1064-MS-110.mdc1.mozilla.com-1 generic-worker Running command: net user task_1573533123 pWd0_p2LWTZDThuCe2XwzKhEBLLWG /add /expires:never /passwordchg:no /y
21 seconds later, we see that the worker resolves the task it is currently running:
Nov 12 05:32:26 T-W1064-MS-110.mdc1.mozilla.com-1 generic-worker Resolving task P5aY5oRpRDms0hhjdWcNAw ...
This is strange, because it suggests the task only ran for a few seconds, which is unusual.
Looking at the task logs, we see that the task command exited with code 1073807364 (hex 0x40010004), which typically happens when the task process is killed for some reason. Bug 1544403 added it to the onExitStatus list of exit codes that should cause a task to be retried, since if something outside the task kills it, that suggests something is at fault externally to the task, on the host, such as the machine unexpectedly being sent a shutdown signal, for example.
Indeed the task reports that it completed in 9.3 seconds.
Shortly afterwards we see the first panic:
Nov 12 05:34:50 T-W1064-MS-110.mdc1.mozilla.com-1 generic-worker goroutine 1 [running]: runtime/debug.Stack(0x0, 0xc0424aa7f8, 0x0) #011/home/travis/.gimme/versions/go1.10.8.src/src/runtime/debug/stack.go:24 +0xae main.HandleCrash(0x921ac0, 0xc0424b3ea0) #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:351 +0x2d main.RunWorker.func1(0xc042461e30) #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:370 +0x59 panic(0x921ac0, 0xc0424b3ea0) #011/home/travis/.gimme/versions/go1.10.8.src/src/runtime/panic.go:502 +0x237 main.(*CacheMap).LoadFromFile(0xd635f8, 0x9d80d4, 0x10, 0xc0420784e0, 0x8) #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/mounts.go:146 +0x4a5 main.(*MountsFeature).Initialise(0xd81bc8, 0x1f, 0xc042461ae8) #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/mounts.go:154 +0x66 main.initialiseFeatures(0xc0420c8000, 0xc0424b44c0) #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:87 +0x471 main.RunWorker(0x0) #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:407 +0x36c main.main() #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:157 +0x7ba
Nov 12 05:34:50 T-W1064-MS-110.mdc1.mozilla.com-1 generic-worker *********** PANIC occurred! ***********
Nov 12 05:34:50 T-W1064-MS-110.mdc1.mozilla.com-1 generic-worker invalid character '\x00' looking for beginning of value
Comment 9•6 years ago
A formatted version of this panic is:
runtime/debug.Stack(0x0, 0xc0424aa7f8, 0x0)
    /home/travis/.gimme/versions/go1.10.8.src/src/runtime/debug/stack.go:24 +0xae
main.HandleCrash(0x921ac0, 0xc0424b3ea0)
    /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:351 +0x2d
main.RunWorker.func1(0xc042461e30)
    /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:370 +0x59
panic(0x921ac0, 0xc0424b3ea0)
    /home/travis/.gimme/versions/go1.10.8.src/src/runtime/panic.go:502 +0x237
main.(*CacheMap).LoadFromFile(0xd635f8, 0x9d80d4, 0x10, 0xc0420784e0, 0x8)
    /home/travis/gopath/src/github.com/taskcluster/generic-worker/mounts.go:146 +0x4a5
main.(*MountsFeature).Initialise(0xd81bc8, 0x1f, 0xc042461ae8)
    /home/travis/gopath/src/github.com/taskcluster/generic-worker/mounts.go:154 +0x66
main.initialiseFeatures(0xc0420c8000, 0xc0424b44c0)
    /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:87 +0x471
main.RunWorker(0x0)
    /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:407 +0x36c
main.main()
    /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:157 +0x7ba
We can see from another log line that we are running generic-worker 16.2.0 on this worker, with git revision b321a877c8f1616d8912399f599c325171f15e95:
Nov 12 05:34:50 T-W1064-MS-110.mdc1.mozilla.com-1 generic-worker Config: { "accessToken": "*************", "authBaseURL": "", "availabilityZone": "", "cachesDir": "C:\\cache", "certificate": "", "checkForNewDeploymentEverySecs": 0, "cleanUpTaskDirs": true, "clientId": "project/releng/generic-worker/datacenter-gecko-t-win10-64-hw", "deploymentId": "", "disableReboots": true, "downloadsDir": "C:\\downloads", "ed25519SigningKeyLocation": "C:\\generic-worker\\ed25519-private.key", "idleTimeoutSecs": 7200, "instanceId": "", "instanceType": "", "livelogCertificate": "", "livelogExecutable": "C:\\generic-worker\\livelog.exe", "livelogGETPort": 60023, "livelogKey": "", "livelogPUTPort": 60022, "livelogSecret": "*************", "numberOfTasksToRun": 0, "privateIP": "", "provisionerBaseURL": "", "provisionerId": "releng-hardware", "publicIP": "10.49.40.66", "purgeCacheBaseURL": "", "queueBaseURL": "", "region": "", "requiredDiskSpaceMegabytes": 10240, "rootURL": "https://firefox-ci-tc.services.mozilla.com", "runAfterUserCreation": "C:\\generic-worker\\task-user-init.cmd", "runTasksAsCurrentUser": false, "secretsBaseURL": "", "sentryProject": "generic-worker", "shutdownMachineOnIdle": false, "shutdownMachineOnInternalError": false, "subdomain": "taskcluster-worker.net", "taskclusterProxyExecutable": "C:\\generic-worker\\taskcluster-proxy.exe", "taskclusterProxyPort": 80, "tasksDir": "C:\\Users", "workerGroup": "mdc1", "workerId": "T-W1064-MS-110", "workerManagerBaseURL": "", "workerType": "gecko-t-win10-64-hw", "workerTypeMetadata": { "config": { "deploymentId": "" }, "generic-worker": { "engine": "multiuser", "go-arch": "amd64", "go-os": "windows", "go-version": "go1.10.8", "release": "https://github.com/taskcluster/generic-worker/releases/tag/v16.2.0", "revision": "b321a877c8f1616d8912399f599c325171f15e95", "source": "https://github.com/taskcluster/generic-worker/commits/b321a877c8f1616d8912399f599c325171f15e95", "version": "16.2.0" } }, "wstAudience": "taskcluster-net", "wstServerURL": "https://websocktunnel.tasks.build" }
Translating the above line references to the release version of generic-worker we get:
runtime/debug.Stack(0x0, 0xc0424aa7f8, 0x0)
    /home/travis/.gimme/versions/go1.10.8.src/src/runtime/debug/stack.go:24 +0xae
main.HandleCrash(0x921ac0, 0xc0424b3ea0)
    /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:351 +0x2d
main.RunWorker.func1(0xc042461e30)
    /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:370 +0x59
panic(0x921ac0, 0xc0424b3ea0)
    /home/travis/.gimme/versions/go1.10.8.src/src/runtime/panic.go:502 +0x237
main.(*CacheMap).LoadFromFile(0xd635f8, 0x9d80d4, 0x10, 0xc0420784e0, 0x8)
    /home/travis/gopath/src/github.com/taskcluster/generic-worker/mounts.go:146 +0x4a5
main.(*MountsFeature).Initialise(0xd81bc8, 0x1f, 0xc042461ae8)
    /home/travis/gopath/src/github.com/taskcluster/generic-worker/mounts.go:154 +0x66
main.initialiseFeatures(0xc0420c8000, 0xc0424b44c0)
    /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:87 +0x471
main.RunWorker(0x0)
    /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:407 +0x36c
main.main()
    /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:157 +0x7ba
So the problem seems to be that the attempt to read file-caches.json in mounts.go:154 is failing, since the file (or a value inside the file) appears to start with a zero byte ('\x00').
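The error is easy to reproduce in isolation; a minimal sketch (assuming the state file's bytes were zeroed out, which the next comment argues is what happened):
```
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Simulate a state file such as file-caches.json whose content was
	// replaced by zero bytes after an unclean shutdown.
	corrupted := []byte{0x00, 0x00, 0x00, 0x00}

	var caches map[string]interface{}
	err := json.Unmarshal(corrupted, &caches)
	fmt.Println(err)
	// invalid character '\x00' looking for beginning of value
}
```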
Comment 10•6 years ago
In summary, I think the issue is as follows:
- a worker is running successfully, claiming and resolving tasks, rebooting, claiming and resolving tasks, rebooting, etc.
- at some point something happens on a worker that causes a running task to get killed (such as this one), and touches state files managed by generic-worker (file-caches.json, directory-caches.json, current-task-user.json, next-task-user.json, tasks-resolved-count.txt), either deleting or corrupting their content
- the worker reboots and we get a worker panic due to the corrupted state file(s): invalid character '\x00' looking for beginning of value
- on all future reboots we get a different panic, like: Interactive username task_1573531952 does not match task user task_1573533123 from next-task-user.json file
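A standard defense against this class of corruption (not a description of what generic-worker did at the time, just a sketch of the usual pattern) is to write state files atomically: write to a temporary file in the same directory, flush it to disk, then rename it over the target, so a crash mid-write can never leave a half-written or zeroed file behind:
```
package main

import (
	"os"
	"path/filepath"
)

// writeStateFile replaces path with data atomically: the new content
// becomes visible only via the final rename, never half-written.
func writeStateFile(path string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".state-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // harmless no-op once the rename succeeds

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // flush to disk before renaming
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

func main() {
	_ = writeStateFile("next-task-user.json", []byte(`{"name":"task_1573533123"}`))
}
```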
Comment 11•6 years ago
I am also seeing "message": "Client ID project/releng/generic-worker/datacenter-gecko-t-win10-64-hw does not have sufficient scopes and is missing the following scopes:\n\n```\nauth:websocktunnel-token:taskcluster-net/mdc1.T-W1064-MS-018.60099" throughout the logs.
Comment 12•6 years ago
Browsertime tasks on autoland got canceled and autoland reopened.
Comment 13•6 years ago
We (miles, markco, and I) have been trying to make forward progress on figuring out the root issue of this but have not made much progress. Since trees are reopened and markco has some more mitigations in place, we're going to take a step back from actively debugging this and try to come up with ways to make debugging it easier.
Comment 14•6 years ago
I don't think the generic-worker panic was responsible for the backlog in this case. Going back through the papertrail logs, it looks like ms-110 was the only node that was looping on the panic during this time frame: https://my.papertrailapp.com/groups/1141234/events?focus=1130798589950451736&q=ms%20AND%20panic&selected=1130798589950451736
I think the cause is more related to comment 4 and comment 12. During the browsertime tasks this error message seems to be prevalent:
"generic-worker UTC Error: (Intermittent) HTTP response code 501 HTTP/1.1 501 Not Implemented Connection: close Transfer-Encoding: chunked Content-Type: application/xml Date: Tue, 12 Nov 2019 18:31:00 GMT Server: AmazonS3 X-Amz-Id-2: ugKEcuzAT0WhRadnDezZep57DNpm4hK0psg7LYgcfcvSf2ywKcIYwLuVGehdyAmuyxXul9Yx3RI= X-Amz-Request-Id: B0227CE2775EF361 14d <?xml version="1.0" encoding="UTF-8"?> <Error><Code>NotImplemented</Code><Message>A header you provided implies functionality that is not implemented</Message><Header>Transfer-Encoding</Header><RequestId>B0227CE2775EF361</RequestId><HostId>ugKEcuzAT0WhRadnDezZep57DNpm4hK0psg7LYgcfcvSf2ywKcIYwLuVGehdyAmuyxXul9Yx3RI=</HostId></Error>
which led to some tasks finishing as an exception:
https://firefox-ci-tc.services.mozilla.com/tasks/JQ9rRMRFSf6bNFe5v66tdQ/runs/0
aryx: Re: comment 4. Is this a one-time thing, or should we expect to see this again?
Comment 15•6 years ago
Mark: The perf team is working on zipping all files of that process instead of uploading them one by one. I don't know how much this will reduce the run time of these tasks when they reland, but even then they shouldn't run on every single push, only every hour or every fifth push.
Comment 16•6 years ago
Hi, we're disabling the visual-metrics on browsertime until we fix the underlying issues here so it shouldn't be occurring anymore: https://bugzilla.mozilla.org/show_bug.cgi?id=1596102
For context, the issue started because of bug 1581240. It caused the task to produce corrupted png files (causing the failure you're seeing). Then, bug 1594438 exacerbated the issue by adding more tests. We are currently working on a fix for the png files, and we are also going to zip the artifacts instead of what's being done now to try to prevent the issue from occurring in the future.
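For illustration, the "zip the artifacts" mitigation amounts to something like the following (a sketch; the zipDir helper and paths are hypothetical, not the perf team's actual code):
```
package main

import (
	"archive/zip"
	"io"
	"io/fs"
	"os"
	"path/filepath"
)

// zipDir bundles every file under src into a single zip at dest, so the
// worker uploads one artifact instead of thousands of individual files.
func zipDir(src, dest string) error {
	out, err := os.Create(dest)
	if err != nil {
		return err
	}
	defer out.Close()

	zw := zip.NewWriter(out)
	defer zw.Close()

	return filepath.WalkDir(src, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		rel, err := filepath.Rel(src, path)
		if err != nil {
			return err
		}
		w, err := zw.Create(filepath.ToSlash(rel)) // zip entries use forward slashes
		if err != nil {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		_, err = io.Copy(w, f)
		return err
	})
}

func main() {
	// e.g. zip the browsertime results before artifact upload
	_ = zipDir("browsertime-results", "browsertime-results.zip")
}
```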
Comment 17•6 years ago
Thanks Greg!
Comment 19•6 years ago
Sorry for playing needinfo-tennis. Redirecting to the bug reporter to determine if this is resolved now.
Reporter
Comment 20•6 years ago
Queue is looking good now. Closing.