Closed Bug 1595715 Opened 6 years ago Closed 6 years ago

Browsertime tasks severely impacting pending queues on Windows hardware

Categories: Taskcluster :: Workers
Type: defect
Priority: Not set
Severity: normal

Tracking: Not tracked
Status: RESOLVED FIXED

People: Reporter: rmaries; Assignee: Unassigned

References: (Regression)

Rob, are there any updates on the talos queue?

Flags: needinfo?(rthijssen)

This also affects Linux.

There are more details in bug 1595279 comment 5 but let's use this bug for the generic worker issue and keep bug 1595279 focused on the Windows AArch machines.

Flags: needinfo?(pmoore)
See Also: → 1595279

we don't yet understand what is causing the gw panic.

we do know:

  • the version of generic worker in question (16.5.1) works fine in ec2
  • using older gw (14.1.2), which was working fine until recently, throws the same exception/panic
  • the yogas show evidence that the problem exists whether the instance has had operating system kb updates or not.
  • resetting generic worker (delete current-task-user.json, next-task-user.json, tasks-resolved-count.txt), at least on yogas, makes no difference.

i can't help but suspect that some microsoft update or change is responsible, but i can't find evidence of any change making an impact.

Flags: needinfo?(rthijssen)

There is also the Linux backlog (comment 2): https://firefox-ci-tc.services.mozilla.com/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/

Those machines are occupied by tier-3 browsertime tasks which each run for ~7h. The Windows machines that have been processing their current task for a long time are also running a browsertime job.

Bug 1594438 added "cold page load tests for browsertime on integration branches" - new tasks which run for every push for one day, to collect data on how often they fail (and hence whether they should run).

The scheduling of all these new tasks should stop in ~3h.

Regressed by: 1594438

I'm troubleshooting with Rob in a zoom meeting (751-471-384) and I'm seeing some strange errors when workers are publishing artifacts, e.g.

Nov 12 05:08:44 T-W1064-MS-110.mdc1.mozilla.com-1 generic-worker PUT /GdJyuSGaT1OHbbJDLaHXtw/0/public/test_info/browsertime-results/raptor-tp6-youtube-firefox/windowrecording-1573527996199.87/frame0001.png?AWSAccessKeyId=************&Content-Type=image%2Fpng&Expires=1573533537&Signature=fNMCrNy5A7d12N4oJdWVEjXA1FI%3D HTTP/1.1  Host: firefoxcitc-public-artifacts.s3.us-west-2.amazonaws.com  User-Agent: Go-http-client/1.1  Content-Length: 0  Content-Encoding: identity  Content-Type: image/png  Accept-Encoding: gzip
Nov 12 05:08:44 T-W1064-MS-110.mdc1.mozilla.com-1 generic-worker Error: (Intermittent) HTTP response code 501 HTTP/1.1 501 Not Implemented  Connection: close  Transfer-Encoding: chunked  Content-Type: application/xml  Date: Tue, 12 Nov 2019 04:08:43 GMT  Server: AmazonS3  X-Amz-Id-2: PTJKhQoq23Lo+ptbGuHuIzrJnvnPOZhVMkOVk6gLkAw9DU6okHZv9yYJOFAM9EWjUSOvEqib9+0=  X-Amz-Request-Id: C333C73A2324E6BC    14d  <?xml version="1.0" encoding="UTF-8"?> <Error><Code>NotImplemented</Code><Message>A header you provided implies functionality that is not implemented</Message><Header>Transfer-Encoding</Header><RequestId>C333C73A2324E6BC</RequestId><HostId>PTJKhQoq23Lo+ptbGuHuIzrJnvnPOZhVMkOVk6gLkAw9DU6okHZv9yYJOFAM9EWjUSOvEqib9+0=</HostId></Error>  0
Nov 12 05:09:47 T-W1064-MS-110.mdc1.mozilla.com-1 generic-worker Request

I'm not sure if this is related at the moment, but it is certainly unusual.
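
For reference, one plausible mechanism for the 501 above (a minimal sketch, not the generic-worker upload code): when a request body is non-nil but its length is unknown to Go's HTTP client (ContentLength stays 0), the PUT is sent with "Transfer-Encoding: chunked", which S3 rejects with 501 Not Implemented. An empty or corrupted artifact file could end up in this state. The URL below is just a placeholder for a signed artifact PUT URL.

// Sketch: contrast a PUT whose body length is unknown (sent chunked) with a
// zero-length PUT whose length is known (sent with Content-Length: 0).
package main

import (
	"bytes"
	"io"
	"net/http"
	"os"
)

func main() {
	// Placeholder standing in for a signed S3 artifact PUT URL.
	const url = "https://firefoxcitc-public-artifacts.s3.us-west-2.amazonaws.com/example.png"

	// Wrapping the empty buffer hides its type, so net/http cannot infer the
	// length: the wire request includes "Transfer-Encoding: chunked".
	chunked, _ := http.NewRequest(http.MethodPut, url, io.NopCloser(&bytes.Buffer{}))

	// Passing *bytes.Buffer directly lets net/http see the zero length, so a
	// plain "Content-Length: 0" PUT is sent instead.
	plain, _ := http.NewRequest(http.MethodPut, url, &bytes.Buffer{})

	for _, r := range []*http.Request{chunked, plain} {
		_ = r.Write(os.Stdout) // dump the request as it would appear on the wire
		os.Stdout.WriteString("\n")
	}
}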

Flags: needinfo?(pmoore)

Some interesting things have turned up.

Looking at worker T-W1064-MS-110.mdc1.mozilla.com we are getting the repeated failure:

Interactive username task_1573531952 does not match task user task_1573533123 from next-task-user.json file

When generic-worker creates a task user, it uses the current time (a unix timestamp in seconds) for the numerical part of the username, so we can see that these two users were created at the following times (a small conversion sketch follows the list):

  • task_1573531952: Tuesday, November 12, 2019 5:12:32 AM GMT+01:00
  • task_1573533123: Tuesday, November 12, 2019 5:32:03 AM GMT+01:00
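
A minimal Go sketch of that conversion, purely illustrative (the usernames are the ones from the error message above):

// Convert the numeric suffix of a generic-worker task username
// (task_<unix-seconds>) back into a human-readable creation time.
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

func taskUserCreatedAt(username string) (time.Time, error) {
	secs, err := strconv.ParseInt(strings.TrimPrefix(username, "task_"), 10, 64)
	if err != nil {
		return time.Time{}, err
	}
	return time.Unix(secs, 0), nil
}

func main() {
	for _, u := range []string{"task_1573531952", "task_1573533123"} {
		created, err := taskUserCreatedAt(u)
		if err != nil {
			fmt.Println(u, err)
			continue
		}
		fmt.Println(u, "->", created.UTC()) // e.g. task_1573531952 -> 2019-11-12 04:12:32 +0000 UTC
	}
}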

The error indicates that generic-worker had created the task user task_1573533123 for running the next task, updated the Windows autologon registry to automatically log in as that user, and then rebooted the computer. On startup, generic-worker runs as a Windows service under the LocalSystem user; it calls WTSGetActiveConsoleSessionId to get the session ID of the console session (the session associated with the interactive desktop user), and then WTSQueryUserToken to fetch the default auth token of that interactive session. From the auth token, a call is made to GetTokenInformation, which returns a TOKEN_USER structure, from which the username of the interactive user is ultimately derived. If this does not match the value that generic-worker put in the registry before rebooting, it panics with the above error message.
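
A minimal sketch of that lookup (not the actual generic-worker source; it uses golang.org/x/sys/windows plus a direct kernel32 call, and the expected username is hard-coded here rather than read from next-task-user.json):

// Sketch: resolve the user of the interactive console session and compare it
// against the task user that was recorded before the reboot.
package main

import (
	"fmt"
	"log"

	"golang.org/x/sys/windows"
)

func interactiveUsername() (string, error) {
	// Get the session ID of the physical console session via kernel32.
	proc := windows.NewLazySystemDLL("kernel32.dll").NewProc("WTSGetActiveConsoleSessionId")
	sessionID, _, _ := proc.Call()

	// Fetch the primary access token of that interactive session
	// (requires running as LocalSystem, as the generic-worker service does).
	var token windows.Token
	if err := windows.WTSQueryUserToken(uint32(sessionID), &token); err != nil {
		return "", err
	}
	defer token.Close()

	// GetTokenUser wraps GetTokenInformation(TokenUser); the returned SID is
	// then resolved to an account name.
	tokenUser, err := token.GetTokenUser()
	if err != nil {
		return "", err
	}
	account, _, _, err := tokenUser.User.Sid.LookupAccount("")
	return account, err
}

func main() {
	expected := "task_1573533123" // in generic-worker this comes from next-task-user.json
	actual, err := interactiveUsername()
	if err != nil {
		log.Fatal(err)
	}
	if actual != expected {
		log.Fatalf("Interactive username %v does not match task user %v from next-task-user.json file", actual, expected)
	}
	fmt.Println("interactive user matches expected task user:", actual)
}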

We can see that the worker was executing tasks prior to the failures. The logs have some interesting context here:

At 05:32:05 GMT+1 we see that the worker created the task user task_1573533123. Presumably it also successfully wrote it to the next-task-user.json file, since the error message in comment 6 shows that it found this value in the next-task-user.json file. So all fine so far.

Nov 12 05:32:05 T-W1064-MS-110.mdc1.mozilla.com-1 generic-worker Running command: net user task_1573533123 pWd0_p2LWTZDThuCe2XwzKhEBLLWG /add /expires:never /passwordchg:no /y

21 seconds later, we see that the worker resolves the task it is currently running:

Nov 12 05:32:26 T-W1064-MS-110.mdc1.mozilla.com-1 generic-worker Resolving task P5aY5oRpRDms0hhjdWcNAw ...

This is strange, because it suggests the task only ran for a few seconds, which is unusual.

Looking at the task logs, we see that the task command exited with exit code 1073807364 (0x40010004, the Windows DBG_TERMINATE_PROCESS status), which typically happens when the task process is killed for some reason. Bug 1544403 added it to the onExitStatus list of exit codes that should cause a task to be retried, since if something outside the task kills it, that usually points to a fault external to the task, on the host, such as the machine unexpectedly being sent a shutdown signal.

Indeed the task reports that it completed in 9.3 seconds.

Shortly afterwards we see the first panic:

Nov 12 05:34:50 T-W1064-MS-110.mdc1.mozilla.com-1 generic-worker goroutine 1 [running]: runtime/debug.Stack(0x0, 0xc0424aa7f8, 0x0) #011/home/travis/.gimme/versions/go1.10.8.src/src/runtime/debug/stack.go:24 +0xae main.HandleCrash(0x921ac0, 0xc0424b3ea0) #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:351 +0x2d main.RunWorker.func1(0xc042461e30) #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:370 +0x59 panic(0x921ac0, 0xc0424b3ea0) #011/home/travis/.gimme/versions/go1.10.8.src/src/runtime/panic.go:502 +0x237 main.(*CacheMap).LoadFromFile(0xd635f8, 0x9d80d4, 0x10, 0xc0420784e0, 0x8) #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/mounts.go:146 +0x4a5 main.(*MountsFeature).Initialise(0xd81bc8, 0x1f, 0xc042461ae8) #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/mounts.go:154 +0x66 main.initialiseFeatures(0xc0420c8000, 0xc0424b44c0) #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:87 +0x471 main.RunWorker(0x0) #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:407 +0x36c main.main() #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:157 +0x7ba
Nov 12 05:34:50 T-W1064-MS-110.mdc1.mozilla.com-1 generic-worker  *********** PANIC occurred! *********** 
Nov 12 05:34:50 T-W1064-MS-110.mdc1.mozilla.com-1 generic-worker invalid character '\x00' looking for beginning of value

A formatted version of this panic is:

runtime/debug.Stack(0x0, 0xc0424aa7f8, 0x0)
	/home/travis/.gimme/versions/go1.10.8.src/src/runtime/debug/stack.go:24 +0xae
main.HandleCrash(0x921ac0, 0xc0424b3ea0)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:351 +0x2d
main.RunWorker.func1(0xc042461e30)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:370 +0x59
panic(0x921ac0, 0xc0424b3ea0)
	/home/travis/.gimme/versions/go1.10.8.src/src/runtime/panic.go:502 +0x237
main.(*CacheMap).LoadFromFile(0xd635f8, 0x9d80d4, 0x10, 0xc0420784e0, 0x8)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/mounts.go:146 +0x4a5
main.(*MountsFeature).Initialise(0xd81bc8, 0x1f, 0xc042461ae8)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/mounts.go:154 +0x66
main.initialiseFeatures(0xc0420c8000, 0xc0424b44c0)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:87 +0x471
main.RunWorker(0x0)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:407 +0x36c
main.main()
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:157 +0x7ba

We can see from another log line that we are running generic-worker 16.2.0 on this worker with git revision b321a877c8f1616d8912399f599c325171f15e95:

Nov 12 05:34:50 T-W1064-MS-110.mdc1.mozilla.com-1 generic-worker Config: {   "accessToken": "*************",   "authBaseURL": "",   "availabilityZone": "",   "cachesDir": "C:\\cache",   "certificate": "",   "checkForNewDeploymentEverySecs": 0,   "cleanUpTaskDirs": true,   "clientId": "project/releng/generic-worker/datacenter-gecko-t-win10-64-hw",   "deploymentId": "",   "disableReboots": true,   "downloadsDir": "C:\\downloads",   "ed25519SigningKeyLocation": "C:\\generic-worker\\ed25519-private.key",   "idleTimeoutSecs": 7200,   "instanceId": "",   "instanceType": "",   "livelogCertificate": "",   "livelogExecutable": "C:\\generic-worker\\livelog.exe",   "livelogGETPort": 60023,   "livelogKey": "",   "livelogPUTPort": 60022,   "livelogSecret": "*************",   "numberOfTasksToRun": 0,   "privateIP": "",   "provisionerBaseURL": "",   "provisionerId": "releng-hardware",   "publicIP": "10.49.40.66",   "purgeCacheBaseURL": "",   "queueBaseURL": "",   "region": "",   "requiredDiskSpaceMegabytes": 10240,   "rootURL": "https://firefox-ci-tc.services.mozilla.com",   "runAfterUserCreation": "C:\\generic-worker\\task-user-init.cmd",   "runTasksAsCurrentUser": false,   "secretsBaseURL": "",   "sentryProject": "generic-worker",   "shutdownMachineOnIdle": false,   "shutdownMachineOnInternalError": false,   "subdomain": "taskcluster-worker.net",   "taskclusterProxyExecutable": "C:\\generic-worker\\taskcluster-proxy.exe",   "taskclusterProxyPort": 80,   "tasksDir": "C:\\Users",   "workerGroup": "mdc1",   "workerId": "T-W1064-MS-110",   "workerManagerBaseURL": "",   "workerType": "gecko-t-win10-64-hw",   "workerTypeMetadata": {     "config": {       "deploymentId": ""     },     "generic-worker": {       "engine": "multiuser",       "go-arch": "amd64",       "go-os": "windows",       "go-version": "go1.10.8",       "release": "https://github.com/taskcluster/generic-worker/releases/tag/v16.2.0",       "revision": "b321a877c8f1616d8912399f599c325171f15e95",       "source": "https://github.com/taskcluster/generic-worker/commits/b321a877c8f1616d8912399f599c325171f15e95",       "version": "16.2.0"     }   },   "wstAudience": "taskcluster-net",   "wstServerURL": "https://websocktunnel.tasks.build" }

Translating the above line references to the release version of generic-worker we get:

So the problem seems to be that the attempt to read file-caches.json in mounts.go:154 is failing, since the file (or a value inside the file) appears to start with a zero byte '\x00'.
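
The panic message matches what Go's encoding/json package reports when the input begins with a NUL byte. A small illustration (not generic-worker code, just reproducing the error text):

// Unmarshalling data that starts with a NUL byte fails with
// "invalid character '\x00' looking for beginning of value".
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Simulate a file-caches.json whose content was zeroed out on disk.
	corrupted := []byte("\x00\x00\x00\x00")

	var caches map[string]interface{}
	if err := json.Unmarshal(corrupted, &caches); err != nil {
		fmt.Println(err) // invalid character '\x00' looking for beginning of value
	}
}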

In summary, I think the issue is as follows:

  • a worker is running successfully, claiming and resolving tasks, rebooting, claiming and resolving tasks, rebooting, etc
  • at some point something happens on the worker that causes a running task to get killed (such as this one) and that deletes or corrupts the content of state files managed by generic-worker, such as file-caches.json, directory-caches.json, current-task-user.json, next-task-user.json and tasks-resolved-count.txt
  • the worker reboots and we get a worker panic due to the corrupted state file(s): invalid character '\x00' looking for beginning of value
  • on all future reboots we get a different panic, like Interactive username task_1573531952 does not match task user task_1573533123 from next-task-user.json file

I also see "message": "Client ID project/releng/generic-worker/datacenter-gecko-t-win10-64-hw does not have sufficient scopes and is missing the following scopes:\n\n```\nauth:websocktunnel-token:taskcluster-net/mdc1.T-W1064-MS-018.60099" throughout the logs.

https://my.papertrailapp.com/groups/1141234/events?focus=1130937660785975299&q=ms-018&selected=1130937660785975299

Browsertime tasks on autoland got canceled and autoland was reopened.

Severity: blocker → normal
Summary: [CLOSED TREES] Busted generic worker → Busted generic worker

We (miles, markco, and I) have been trying to figure out the root cause of this but have not made much progress. Since the trees are reopened and markco has some more mitigations in place, we're going to take a step back from actively debugging this and try to come up with ways to make debugging it easier.

I don't think the generic-worker panic was responsible for the backlog in this case. Going back through the papertrail logs, it looks like ms-110 was the only node that was looping on the panic during this time frame: https://my.papertrailapp.com/groups/1141234/events?focus=1130798589950451736&q=ms%20AND%20panic&selected=1130798589950451736

I think the cause is more related to comments 4 and 12. During the browsertime tasks this error message seems to be prevalent:

"generic-worker UTC Error: (Intermittent) HTTP response code 501 HTTP/1.1 501 Not Implemented Connection: close Transfer-Encoding: chunked Content-Type: application/xml Date: Tue, 12 Nov 2019 18:31:00 GMT Server: AmazonS3 X-Amz-Id-2: ugKEcuzAT0WhRadnDezZep57DNpm4hK0psg7LYgcfcvSf2ywKcIYwLuVGehdyAmuyxXul9Yx3RI= X-Amz-Request-Id: B0227CE2775EF361 14d <?xml version="1.0" encoding="UTF-8"?> <Error><Code>NotImplemented</Code><Message>A header you provided implies functionality that is not implemented</Message><Header>Transfer-Encoding</Header><RequestId>B0227CE2775EF361</RequestId><HostId>ugKEcuzAT0WhRadnDezZep57DNpm4hK0psg7LYgcfcvSf2ywKcIYwLuVGehdyAmuyxXul9Yx3RI=</HostId></Error>

This led to some tasks finishing as an exception:
https://firefox-ci-tc.services.mozilla.com/tasks/JQ9rRMRFSf6bNFe5v66tdQ/runs/0

aryx: Re: comment 4. Is this a one-time thing, or should we expect to see this again?

Flags: needinfo?(aryx.bugmail)
See Also: → 1596044

Mark: The perf team is working on zipping all the files of that process instead of uploading them one by one. I don't know how much this will reduce the run time of these tasks when they reland, but even then they shouldn't run on every single push, only every hour or every fifth push.
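
For reference, a minimal sketch of bundling a results directory into a single zip artifact (illustrative only; the paths are hypothetical and this is not the perf team's actual implementation):

// Bundle every file under a browsertime results directory into one zip file,
// so that a single artifact is uploaded instead of thousands of small ones.
package main

import (
	"archive/zip"
	"io"
	"io/fs"
	"log"
	"os"
	"path/filepath"
)

func zipDir(srcDir, dest string) error {
	out, err := os.Create(dest)
	if err != nil {
		return err
	}
	defer out.Close()

	zw := zip.NewWriter(out)
	defer zw.Close()

	return filepath.WalkDir(srcDir, func(path string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil || d.IsDir() {
			return walkErr
		}
		rel, err := filepath.Rel(srcDir, path)
		if err != nil {
			return err
		}
		w, err := zw.Create(filepath.ToSlash(rel))
		if err != nil {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		_, err = io.Copy(w, f)
		return err
	})
}

func main() {
	// Hypothetical paths for illustration.
	if err := zipDir("browsertime-results", "browsertime-results.zip"); err != nil {
		log.Fatal(err)
	}
}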

Flags: needinfo?(aryx.bugmail)
Summary: Busted generic worker → Browsertime tasks severely impacting pending queues on Windows hardware

Hi, we're disabling the visual-metrics on browsertime until we fix the underlying issues here, so it shouldn't be occurring anymore: https://bugzilla.mozilla.org/show_bug.cgi?id=1596102

For context, the issue started because of bug 1581240. It caused the task to produce corrupted png files (causing the failure you're seeing). Then, bug 1594438 exacerbated the issue by adding more tests. We are currently working on a fix for the png files, and we are also going to zip the artifacts instead of uploading them individually, to try to prevent the issue from occurring in the future.

Thanks Greg!

:pmoore can you confirm if this is okay now?

Flags: needinfo?(pmoore)
Flags: needinfo?(pmoore) → needinfo?(gmierz2)

Sorry for playing needinfo-tennis. Redirecting to the bug reporter to determine if this is resolved now.

Flags: needinfo?(gmierz2) → needinfo?(rmaries)

Queue is looking good now. Closing.

Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(rmaries)
Resolution: --- → FIXED
See Also: → 1615347