Closed Bug 1493688 Opened 6 years ago Closed 5 years ago

generic-worker panic on windows 10 build 1803

Categories

(Taskcluster :: Workers, defect, P2)

x86_64
Windows 10
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: grenade, Assigned: pmoore)

References

()

Details

Attachments

(3 files, 1 obsolete file)

when running generic worker on windows 10 build 1803, no tasks are claimed or started. the instance is instantiated by occ from an ami where the occ and dsc runs report successful completion. the papertrail logs indicate that generic worker starts successfully, but panics shortly after startup and shuts down the system resulting in an instance termination. this happens consistently on all instances running windows 10 1803 rather than intermittently. see the attached logs or https://papertrailapp.com/groups/7318672/events?q=%22generic-worker%20internal%20error%22 for examples of the issue.
(In reply to Rob Thijssen (:grenade UTC+2) from comment #0)
> Created attachment 9011466 [details]
> example papertrail logs showing the panic and shutdown

Is it possible that Mircrosoft removed the 'net' command in build 1803?

It looks like it is failing to run `net localgroup "Remote Desktop Users" /add task_1537798578`.
Flags: needinfo?(rthijssen)
scratch that, that command ran fine, and it was a later syscall that failed ...
Flags: needinfo?(rthijssen)
The syscalls update the known folders for APPDATA/LOCALAPPDATA to be subfolders of the tasks folder, which is configured to be Z:\ ( https://papertrailapp.com/systems/2341364692/events?focus=980903817052876826&selected=980903817052876826 )

If there isn't a Z: drive on these workers (or hasn't been successfully mounted etc) I think that could be the cause.

The failure is happening here:

  https://github.com/taskcluster/generic-worker/blob/v10.11.2/plat_windows.go#L214-L217
there is a z: drive:

Windows PowerShell                                                                                                              
Copyright (C) Microsoft Corporation. All rights reserved.                                                                       

Loading personal and system profiles took 877ms.                                                                                
PS C:\Users\Administrator> z:                                                                                                   
PS Z:\>

pmoore: if you add your ssh public key here:
https://github.com/mozilla-releng/OpenCloudConfig/blob/master/userdata/Configuration/ssh/authorized_keys
you can ssh directly to any instance that is running (if you catch it before gw terminates it)
the authorized_keys file is added to C:\users\Administrator\ssh\authorized_keys so log in as Administrator
Hi Rob,

This is my public RSA key.

Many thanks!
Assignee: nobody → pmoore
Status: NEW → ASSIGNED
Attachment #9011497 - Flags: review?(rthijssen)
Comment on attachment 9011497 [details] [review]
GitHub Pull Request for OpenCloudConfig (pmoore rsa key)

merged
Attachment #9011497 - Flags: review?(rthijssen) → review+
Depends on: 1493744
pmoore: in case it's helpful for debugging, the most recent windows 10 1803 iso we're using is here:
https://s3.console.aws.amazon.com/s3/object/windows-ami-builder/iso/en_windows_10_business_edition_version_1803_updated_sep_2018_x64_dvd_37051f54.iso?region=us-west-2&tab=overview
(we had the same issue on all of the 1803 isos).
i can also redeploy the win 10 cu worker type with an 1803 ami in case that's helpful with your debugging.
That should do it! Thanks Rob.
Attachment #9011497 - Attachment is obsolete: true
Attachment #9012581 - Flags: review?(rthijssen)
Attachment #9012581 - Flags: review?(rthijssen) → review+
i've now run the iso-to-ami script from bug 1474548 against a windows 1703 iso and validated that we don't get the gw panic there.

eg: all of the win 10 ec2 tests in these builds were run against the iso-to-ami 1703 amis and gw runs fine there:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0103760141327347a754a452f2fb698b6764757c
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5802dd41141a829e942562112f723d07756e5eeb

if we run an 1803 iso through iso-to-ami, we do get the gw panic which suggests that something in the 1803 windows build is incompatible with the gw startup sequence.
retest 1803 amis, validate that we still have this issue
Flags: needinfo?(rthijssen)
currently blocked by bug 1372172 which is utilising the gecko-t-win10-64-gpu-b worker type with a 1703 based ami. as soon as that is closed and the worker type released, i will retest 1803.
Depends on: 1372172
Flags: needinfo?(rthijssen)
Assignee: pmoore → nobody
Status: ASSIGNED → NEW
pmoore: i've attempted a rerun of our test suites on an 1803 based ami and confirmed that we still get a panic from gw as soon as it has started.
https://papertrailapp.com/groups/12074372/events?q=program%3Ageneric-worker%20PANIC

i've created 2 new worker types which will be dedicated to running 1803 until we can get it working. they are:
- gecko-t-win10-64-alpha
- gecko-t-win10-64-gpu-a

here's a try push that attempts to get tasks sent to those workers:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=35cd29eaba14044cf11afc53b4aa8953c45b365f

have you had any luck with working out what is causing the gw panic?
Flags: needinfo?(pmoore)
I'll try to get to this by the end of the week - there is a lot going on at the moment, but I'm aware this is waiting for my input...
I'm afraid I haven't managed to make any progress on this this week. Maybe we can meet up next week in Orlando and look at it together.

Pete: can you find some time to dig into this soon? It is blocking OKRs for relops.

Severity: normal → critical
Flags: needinfo?(pmoore)
Priority: -- → P2

The generic-worker indeed seems to be crashing. Here is the crash report...

Jan 17 15:01:34 i-05486e568aff5bc17.gecko-t-win10-64-alpha.use1.mozilla.com generic-worker: Creating Windows user task_1547733693... 
Jan 17 15:01:34 i-05486e568aff5bc17.gecko-t-win10-64-alpha.use1.mozilla.com generic-worker: Running command: 'net' 'user' 'task_1547733693' 'pWd0_Gw30Igm87fMg81sTDSCzxInb' '/add' '/expires:never' '/passwordchg:no' '/y' 
Jan 17 15:01:34 i-05486e568aff5bc17.gecko-t-win10-64-alpha.use1.mozilla.com generic-worker: Created new OS user! 
Jan 17 15:01:34 i-05486e568aff5bc17.gecko-t-win10-64-alpha.use1.mozilla.com generic-worker: Running command: 'wmic' 'useraccount' 'where' 'name='task_1547733693'' 'set' 'passwordexpires=false' Updating property(s) of '\\I-05486E568AFF5\ROOT\CIMV2:Win32_UserAccount.Domain="I-05486E568AFF5",Name="task_1547733693"'   Property(s) update successful.  
Jan 17 15:01:34 i-05486e568aff5bc17.gecko-t-win10-64-alpha.use1.mozilla.com generic-worker: Running command: 'net' 'localgroup' 'Remote Desktop Users' '/add' 'task_1547733693' The command completed successfully.   
Jan 17 15:01:34 i-05486e568aff5bc17.gecko-t-win10-64-alpha.use1.mozilla.com generic-worker: Making system call LogonUserW with args: [C0422CF140 C042245360 C0422C3640 2 0 C042245368] 
Jan 17 15:01:34 i-05486e568aff5bc17.gecko-t-win10-64-alpha.use1.mozilla.com generic-worker:   Result: 1 0 The operation completed successfully. 
Jan 17 15:01:34 i-05486e568aff5bc17.gecko-t-win10-64-alpha.use1.mozilla.com generic-worker: Making system call LoadUserProfileW with args: [460 C042073080] 
Jan 17 15:01:35 i-05486e568aff5bc17.gecko-t-win10-64-alpha.use1.mozilla.com generic-worker:   Result: 1 0 The operation completed successfully. 
Jan 17 15:01:35 i-05486e568aff5bc17.gecko-t-win10-64-alpha.use1.mozilla.com generic-worker: Making system call SHSetKnownFolderPath with args: [C7D850 0 460 C042355DB0] 
Jan 17 15:01:35 i-05486e568aff5bc17.gecko-t-win10-64-alpha.use1.mozilla.com generic-worker:   Result: 0 0 The specified module could not be found. 
Jan 17 15:01:35 i-05486e568aff5bc17.gecko-t-win10-64-alpha.use1.mozilla.com generic-worker: Making system call SHGetKnownFolderPath with args: [C7D850 8000 460 C042245418] 
Jan 17 15:01:35 i-05486e568aff5bc17.gecko-t-win10-64-alpha.use1.mozilla.com generic-worker:   Result: 0 0 The operation completed successfully. 
Jan 17 15:01:35 i-05486e568aff5bc17.gecko-t-win10-64-alpha.use1.mozilla.com generic-worker: Making system call CoTaskMemFree with args: [14FC2C0] 
Jan 17 15:01:35 i-05486e568aff5bc17.gecko-t-win10-64-alpha.use1.mozilla.com generic-worker:   Result: 1 0 The operation completed successfully. 
Jan 17 15:01:35 i-05486e568aff5bc17.gecko-t-win10-64-alpha.use1.mozilla.com generic-worker: Making system call SHSetKnownFolderPath with args: [C7D840 0 460 C042355EA0] 
Jan 17 15:01:35 i-05486e568aff5bc17.gecko-t-win10-64-alpha.use1.mozilla.com generic-worker:   Result: 80070057 0 Too many posts were made to a semaphore. 
Jan 17 15:01:35 i-05486e568aff5bc17.gecko-t-win10-64-alpha.use1.mozilla.com generic-worker: Saving file file-caches.json (absolute path: C:\generic-worker\file-caches.json) 
Jan 17 15:01:35 i-05486e568aff5bc17.gecko-t-win10-64-alpha.use1.mozilla.com generic-worker: Saving file directory-caches.json (absolute path: C:\generic-worker\directory-caches.json) 
Jan 17 15:01:35 i-05486e568aff5bc17.gecko-t-win10-64-alpha.use1.mozilla.com generic-worker: goroutine 1 [running]: runtime/debug.Stack(0x0, 0xc042096398, 0x0) 	/home/travis/.gimme/versions/go1.10.3.src/src/runtime/debug/stack.go:24 +0xae main.HandleCrash(0x8f4240, 0xc0422454a8) 	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:561 +0x2d main.RunWorker.func1(0xc04246fd70) 	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:580 +0x59 panic(0x8f4240, 0xc0422454a8) 	/home/travis/.gimme/versions/go1.10.3.src/src/runtime/panic.go:502 +0x237 main.prepareTaskUser(0xc0421f56f0, 0xf, 0x2) 	/home/travis/gopath/src/github.com/taskcluster/generic-worker/plat_windows.go:216 +0x782 main.PrepareTaskEnvironment(0xbf08408f55930cc8) 	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:1339 +0x120 main.RunWorker(0x0) 	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:639 +0x52a main.main() 	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:417 +0x946 
Jan 17 15:01:35 i-05486e568aff5bc17.gecko-t-win10-64-alpha.use1.mozilla.com generic-worker:  *********** PANIC occurred! ***********  
Jan 17 15:01:35 i-05486e568aff5bc17.gecko-t-win10-64-alpha.use1.mozilla.com generic-worker: The parameter is incorrect. 
Jan 17 15:01:36 i-05486e568aff5bc17.gecko-t-win10-64-alpha.use1.mozilla.com generic-worker: Exiting worker with exit code 69 
Jan 17 15:01:36 i-05486e568aff5bc17.gecko-t-win10-64-alpha.use1.mozilla.com generic-worker: Immediate shutdown being issued... 
Jan 17 15:01:36 i-05486e568aff5bc17.gecko-t-win10-64-alpha.use1.mozilla.com generic-worker: generic-worker internal error

From here we can see this is running version 10.11.2 of generic-worker (commit 4b47d5d9).

It looks like comment 3 is accurate - it is the call to update folder location that is failing.

What is interesting as that we are able to successfully update local app data folder, but not roaming app data folder.

We move them from C: drive to Z: drive.

See MSDN known folder IDs (and search for "FOLDERID_LocalAppData" and "FOLDERID_RoamingAppData" on that page).

After successfully updating the roaming app data folder, we fail to update the local app data folder with the error "The parameter is incorrect.".

Note, the reason we make these syscalls is to relocate data folders from C: drive to Z: drive. Now that we are creating our base images in automation, we may be able to put the users directory on the Z: drive at Windows-installation time. This could mean we wouldn't need to change the location later. This may be a worthwhile simplification, and we could remove the code that changes these folder locations dynamically after creating the task user account.

Other options:

  • we disable moving known folders for windows 10 build 1803
  • we consider not being able to relocate a known folder a non-fatal error

Another thing worth trying is running the knownfolder utility on a base image directly, to see if it has the same issue.

I've checked the GUIDs and they match up. I have no good explanation for why this fails on Windows 10 build 1803 but works on all other Windows platforms we've tried. It is quite an obscure feature, perhaps it genuinely is broken in this build of Windows.

(In reply to Pete Moore [:pmoore][:pete] from comment #22)

Note, the reason we make these syscalls is to relocate data folders from C: drive to Z: drive. Now that we are creating our base images in automation, we may be able to put the users directory on the Z: drive at Windows-installation time. This could mean we wouldn't need to change the location later. This may be a worthwhile simplification, and we could remove the code that changes these folder locations dynamically after creating the task user account.

unfortunately this isn't easily practicable. one of the reasons we use the z: drive at all is because of speed problems on the first ebs volume (see bugs 1305174, 1307204). since we have to attach the z: drive to new spot instances rather than bake it into the ami in order to get the better performing drives, we can't rely on any data added to the z: drive during ami creation.

this makes it impossible to set the users folder to the z: drive during ami creation, since the z: drive that is available during ami creation is discarded and a new blank one attached for each spot instance instantiated from the ami later. if the users folder was set to z during ami creation, the admin and system users folders would be installed there and discarded during ami creation. when a new spot instance was spun up with a blank z drive, it would be missing the admin and system profiles and would behave in ways i couldn't predict.

(In reply to Pete Moore [:pmoore][:pete] from comment #23)

we disable moving known folders for windows 10 build 1803

also worth a shot

we consider not being able to relocate a known folder a non-fatal error

seems to be the path of least resistance

either of these should give us a way forward. windows 10 1803 tests will probably write temp data to the c: drive which will have a performance cost but is better than not being able to test on 1803.
once we get tests working, we could also look at adding some symlinks, setting env vars or some other approach that causes temp data to be written to the task folder on z: in order to fix performance.

Assignee: nobody → pmoore
Status: NEW → ASSIGNED
Attachment #9037982 - Flags: review?(jhford)

(In reply to Rob Thijssen [:grenade (UTC+2)] from comment #25)

(In reply to Pete Moore [:pmoore][:pete] from comment #22)

Note, the reason we make these syscalls is to relocate data folders from C: drive to Z: drive. Now that we are creating our base images in automation, we may be able to put the users directory on the Z: drive at Windows-installation time. This could mean we wouldn't need to change the location later. This may be a worthwhile simplification, and we could remove the code that changes these folder locations dynamically after creating the task user account.

unfortunately this isn't easily practicable. one of the reasons we use the z: drive at all is because of speed problems on the first ebs volume (see bugs 1305174, 1307204). since we have to attach the z: drive to new spot instances rather than bake it into the ami in order to get the better performing drives, we can't rely on any data added to the z: drive during ami creation.

this makes it impossible to set the users folder to the z: drive during ami creation, since the z: drive that is available during ami creation is discarded and a new blank one attached for each spot instance instantiated from the ami later. if the users folder was set to z during ami creation, the admin and system users folders would be installed there and discarded during ami creation. when a new spot instance was spun up with a blank z drive, it would be missing the admin and system profiles and would behave in ways i couldn't predict.

This makes perfect sense - many thanks for the explanation.

(In reply to Pete Moore [:pmoore][:pete] from comment #23)

we disable moving known folders for windows 10 build 1803

also worth a shot

we consider not being able to relocate a known folder a non-fatal error

seems to be the path of least resistance

either of these should give us a way forward. windows 10 1803 tests will probably write temp data to the c: drive which will have a performance cost but is better than not being able to test on 1803.
once we get tests working, we could also look at adding some symlinks, setting env vars or some other approach that causes temp data to be written to the task folder on z: in order to fix performance.

Great - I've just made a PR to make it a non-fatal error. Like you say, this should unblock us.

I'm still curious if knownfolder utility runs ok or not on 1803 but I haven't had the chance to check yet.

Attachment #9037982 - Flags: review?(jhford) → review+
Commits pushed to master at https://github.com/taskcluster/generic-worker

https://github.com/taskcluster/generic-worker/commit/9c4462a3783a188d4d706bc3bd913fa1e9d17992
Bug 1493688 - make redirecting app data folders a non-critical error

https://github.com/taskcluster/generic-worker/commit/9f6cc54997577f2bdf9fc95495932a14c12af01d
Merge pull request #138 from taskcluster/bug1493688

Bug 1493688 - make redirecting app data folders a non-critical error

Hi Rob!

Can you try again with https://github.com/taskcluster/generic-worker/releases/tag/v11.1.0?

Let me know if you get stuck at all.

Thanks,
Pete

Flags: needinfo?(rthijssen)
Flags: needinfo?(rthijssen)

pmoore: looks good! feel free to close this.

Thanks Rob!

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Component: Generic-Worker → Workers
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: