Closed Bug 1417881 Opened 7 years ago Closed 6 years ago

Intermittent IOError: [Errno 13] y:\hg-shared\ (...) Access is denied

Categories

(Firefox Build System :: Task Configuration, task, P1)

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: aryx, Unassigned)

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell infra])

Some Windows machines failed with cloning issues today and are still failing:

https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-3-b-win2012/workers/us-east-1/i-0e6fb293655cbec79
https://treeherder.mozilla.org/logviewer.html#?job_id=145297456&repo=autoland
IOError: [Errno 13] y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg/store\data/servo/components/style/stylesheets/rule__parser.rs.i: Access is denied
abort: y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg/store\data/servo/components/style/stylesheets/rule__parser.rs.i: Access is denied

This is a different machine: https://treeherder.mozilla.org/logviewer.html#?job_id=145226897&repo=autoland

From #vcs:
nagios-scl3: Thu 09:27:37 UTC [5469] [devservices] reviewboard1.webapp.scl3.mozilla.com:Swap is CRITICAL: SWAP CRITICAL - 1% free (0 MB out of 2047 MB) (http://m.mozilla.org/Swap)
(In reply to Sebastian Hengst [:aryx][:archaeopteryx] (needinfo on intermittent or backout) from comment #0)
> Some Windows machines failed and fail with cloning issues today:
> 
> https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/
> gecko-3-b-win2012/workers/us-east-1/i-0e6fb293655cbec79
> https://treeherder.mozilla.org/logviewer.html#?job_id=145297456&repo=autoland
> IOError: [Errno 13]
> y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg/store\data/servo/
> components/style/stylesheets/rule__parser.rs.i: Access is denied
> abort:
> y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg/store\data/servo/
> components/style/stylesheets/rule__parser.rs.i: Access is denied

This sorta looks like bug 1406073#c20. Rob?
 
> From #vcs:
> nagios-scl3: Thu 09:27:37 UTC [5469] [devservices]
> reviewboard1.webapp.scl3.mozilla.com:Swap is CRITICAL: SWAP CRITICAL - 1%
> free (0 MB out of 2047 MB) (http://m.mozilla.org/Swap)

Completely unrelated. "Access is denied" is a local error message from the filesystem, not from hgmo. Also, reviewboard != hgmo; nothing in automation is querying MozReview's webheads.
Flags: needinfo?(rthijssen)
yes, it's the same symptom but for different reasons. last time occ/dsc was actually broken on 2012 and failing to run for all 2012 instances. this time we know occ and dsc are running properly, otherwise this issue would occur on every instance, on all builds after the first one.

the issue occurs when an hg operation runs before occ has applied access controls to hg-shared.
https://github.com/mozilla-releng/OpenCloudConfig/blob/d3044939/userdata/Manifest/gecko-1-b-win2012.json#L1165-L1181
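
(for reference, a minimal sketch in python of the kind of icacls reset the manifest performs, assuming the y:\hg-shared path from the logs; the real grantee and flags are whatever the manifest above defines:)

import subprocess

# recursively (/t) grant Everyone full control (F) of the hg-shared cache,
# inherited by new files and subdirectories ((OI)(CI)). a sketch of the
# kind of icacls reset occ runs; the production grantee/flags are defined
# by the OCC manifest linked above.
subprocess.run(
    ["icacls", r"y:\hg-shared", "/grant", "Everyone:(OI)(CI)F", "/t"],
    check=True,
)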

normally this shouldn't be possible. things that i can think of that could be responsible:
- occ failed to run the icacls command on some instances (if we see lots of occurrences, i'd bet on this. if only a few, then less likely, because i don't know why it would work sometimes and not others)
- the build task has started before occ has completed. this isn't supposed to happen, as occ sets a flag to say it has completed and generic-worker checks the flag before it runs (see the sketch after this list). i don't know enough about how the gw code works to know how likely this is or isn't.
- maybe permissions need to be reset after every robustcheckout operation (between every task). this seems unlikely as i'd expect to see/have seen more occurrences of this error besides the two mentioned above.
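
(a rough sketch of the kind of gate gw could apply for the second bullet, assuming a hypothetical flag file written by occ on completion - the real flag name and mechanism may differ:)

import os
import time

# hypothetical path of the flag occ writes when it finishes its run;
# the real name/mechanism lives in occ and may differ.
OCC_DONE_FLAG = r"C:\occ\run-complete.flag"

def wait_for_occ(timeout=1800, poll=15):
    """block until occ signals completion, or give up after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists(OCC_DONE_FLAG):
            return True
        time.sleep(poll)
    return False

# gw would only start claiming tasks once this returns True.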

in any case, if it occurs again, we'll need to debug it on the instance where it occurs. we don't have much data to work with otherwise.
Flags: needinfo?(rthijssen)
(Of course this also affects inbound; TH had stopped loading results.)
Priority: -- → P1
Looking at treeherder, this seems to be affecting many machines, so perhaps grenade's idea is most likely?

"- occ failed to run the icacls command on some instances (if we see lots of occurences, i'd bet on this. if only a few, then less likely because i don't know why it would work sometimes and not others)"

Looking at one of the level 3 failed build tasks, the AMI was generated a month ago, so nothing has changed with the AMI or generic-worker to cause this. I'm not sure what the exact steps are to reproduce this scenario.

fubar, any idea of someone that can jump on the level 3 machines and troubleshoot?
Flags: needinfo?(klibby)
markco is looking
Flags: needinfo?(klibby)
rebuilt ami for gecko-3-b-win2012 in hopes that it corrects whatever's going on.
found no evidence of any of the possible causes i mentioned in comment 2.
occ is reporting no errors and the icacls commands appear to have succeeded.
retriggered failures in comment 3 and waiting to see if the new ami helps resolve these.
Flags: needinfo?(rthijssen)
Thank you. Started opening trees at 11pm UTC.
More of these reported today:

14:34:58 < gps> more windows workers hg permissions issues on try:
https://treeherder.mozilla.org/#/jobs?repo=try&author=gszorc@mozilla.com&selectedJob=146188147
15:35:05 < kats> getting a file access denied error on a windows taskcluster build, known
                 issue? https://treeherder.mozilla.org/logviewer.html#?job_id=146210204&repo=try&lineNumber=114
...
15:51:57 < kats> i saw the same failure on i-08e2a03630b9230de  and i-05b0544c86e70602e and
                 i-0eb41d3e4c49f198a
15:52:24 < kats> i-0eb41d3e4c49f198a was the one from which i linked the log above
15:52:25 <&garndt> two of those are the ones I just killed
Also had the problem on i-0d5a03eea706280e5.
rebuilt gecko-1-b-win2012 amis this morning.
what is our path forward? rebuilding AMIs every 5 days isn't a great strategy, though it disproves that it was anything about the AMI being more than a month old.
i don't know. however the amis i rebuilt today were for gecko-1-b-win2012 (try). the ones that i rebuilt last week were for gecko-3-b-win2012 (m-c, m-i, al, etc). today's were just exhibiting the same behaviour as last week's.
10:51:21 <@grenade> fubar: gecko-1-b-win2012 was last deployed Oct. 16

so maybe we're still looking at an issue with old AMIs. are there any other similar AMIs of the same age that we should keep an eye on (esp. given that you're PTO soon)?
Happened again on i-003a954ffa8c36dc6.
some errors to investigate from i-003a954ffa8c36dc6 before it was terminated (https://papertrailapp.com/systems/1383074912/events?focus=873916286793605142)

Dec 03 09:58:21 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com ESENT: taskhostex (2244) WebCacheLocal: An attempt to write to the file "Z:\task_1512295051\AppData\Local\Microsoft\Windows\WebCache\WebCacheV01.dat" at offset 0 (0x0000000000000000) for 32768 (0x00008000) bytes failed after 0.000 seconds with system error 21 (0x00000015): "The device is not ready. ".  The write operation will fail with error -1022 (0xfffffc02).  If this error persists then the file may be damaged and may need to be restored from a previous backup. 
Dec 03 09:58:21 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com ESENT: taskhostex (2244) WebCacheLocal: Unable to write a shadowed header for file Z:\task_1512295051\AppData\Local\Microsoft\Windows\WebCache\WebCacheV01.dat. Error -1022. 
Dec 03 09:58:21 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com ESENT: taskhostex (2244) WebCacheLocal: The database engine stopped the instance (0) with error (-1022).      Internal Timing Sequence: [1] 0.000, [2] 0.000, [3] 0.000, [4] 0.000, [5] 0.016, [6] 0.000, [7] 0.000, [8] 0.000, [9] 0.000, [10] 0.000, [11] 0.000, [12] 0.000, [13] 0.000, [14] 0.000, [15] 0.000.
and some more (need to understand what "Exception 1" is and whether or not it's related to the hg checkout):

Dec 03 09:55:07 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Resolving task... 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 ERROR encountered: Task not successful due to following exception(s): 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: Exception 1) 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: [] 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: Exit code: 255 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: Exception 2) 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: file-missing-on-worker: Could not read directory 'Z:\task_1512294859\public\build' 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Making system call GetProfilesDirectoryW with args: [0 C0421EE670] 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06   Result: 0 0 The data area passed to a system call is too small. 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Making system call GetProfilesDirectoryW with args: [C0421D2DA0 C0421EE670] 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06   Result: 1 0 The operation completed successfully. 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Trying to remove directory 'Z:\task_1512284784' via os.RemoveAll(path) call as GenericWorker user... 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Trying to remove directory 'Z:\task_1512292393' via os.RemoveAll(path) call as GenericWorker user... 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Trying to remove directory 'Z:\task_1512294623' via os.RemoveAll(path) call as GenericWorker user... 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Trying to remove directory 'Z:\task_1512294672' via os.RemoveAll(path) call as GenericWorker user... 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Trying to remove directory 'Z:\task_1512294723' via os.RemoveAll(path) call as GenericWorker user... 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Trying to remove directory 'Z:\task_1512294770' via os.RemoveAll(path) call as GenericWorker user... 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Trying to remove directory 'Z:\task_1512294814' via os.RemoveAll(path) call as GenericWorker user...
And again on i-0266dc3dac3a60cba.
As well as i-013cd233c248f2d9a and i-077ebab3ae38f2248.
And i-0b99c70cd74cc6c41.
And i-0e983859ff5e8ea39.
And i-01a028775db8de3ac.

How do we change the bug summary so that this bug shows up in treeherder?
The output you want to match is either "IOError: [Errno 13] .* Access is denied" or "abort: .* Access is denied"?

First you either rewrite treeherder to support globbing in bug summaries, or you wrap the invocation of hg in a script that examines the output and logs a line without a path when it sees that error. For lines of output that aren't in the format failureword | filename | errormessage, treeherder searches for some number of leading characters of the full line (which isn't actually the number it claims), so it's looking for something like "IOError: [Errno 13] y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg/store\data/python/mozbuild/mozpack/manifests.py.i: Access", probably including the filename but most certainly including the hash directory name.
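
A minimal sketch of the wrapper approach, assuming the marker line format below is something we'd agree on (it's illustrative, not an existing convention):

import re
import subprocess
import sys

# Run hg, mirror its output, and emit a stable, path-free marker line
# whenever the "Access is denied" error appears, so treeherder's
# fixed-prefix search can match a bug summary without the per-clone hash.
ACCESS_DENIED = re.compile(r"(IOError: \[Errno 13\]|abort:).*Access is denied")

proc = subprocess.run(["hg"] + sys.argv[1:], capture_output=True, text=True)
for stream, text in ((sys.stdout, proc.stdout), (sys.stderr, proc.stderr)):
    for line in text.splitlines():
        print(line, file=stream)
        if ACCESS_DENIED.search(line):
            # hypothetical marker format; would need agreement before use
            print("hg-wrapper: Access is denied (path elided)", file=stream)
sys.exit(proc.returncode)
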
And i-0eb286184904967a9, i-0f1d0fb4107929ddd, and i-02ff7a92f7d09a2b2.
Summary: Intermittent Windows cloning failure with 'Access is denied' on some nodes → Intermittent IOError: [Errno 13] y:\hg-shared\ (...) Access is defined
Summary: Intermittent IOError: [Errno 13] y:\hg-shared\ (...) Access is defined → Intermittent IOError: [Errno 13] y:\hg-shared\ (...) Access is denied
Update: these failures appeared on the following machines: i-0efc11fb275fdf878, i-09551a89807a7984b and i-01695fe31c7a4f3e8

Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=169622298&repo=mozilla-central&lineNumber=114
Also machine:  i-0c29d030b3efe31de
https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-1-b-win2012/workers/eu-central-1/i-0a4f5932a234f74f0 caused mass failures (got quarantined):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e20ebff215891454e33b6a5facd48b1a0893e0fb&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable&group_state=expanded&selectedJob=169640871

Expect this to become worse later today (upcoming tree closure).

The Taskcluster team is out of the office for the next few hours. Eli, needinfoing you in case something has to be done for taskcluster.

Kendall, is this an hg.mozilla.org issue, an hg cache issue (?), or a taskcluster worker issue? Thank you in advance.
Flags: needinfo?(klibby)
Flags: needinfo?(eperelman)
I think #c2 still applies; :grenade is redeploying AMIs which should get us out of the immediate issue. We still need to investigate and fix what's actually wrong.
Flags: needinfo?(klibby)
we have in the past seen this resolved by an ami redeploy, so i am trying that now.
https://tools.taskcluster.net/task-group-inspector/#/NWxVSJ7kQwW-6Sw4-G-rrQ
Redirecting.
Flags: needinfo?(eperelman) → needinfo?(coop)
(In reply to Kendall Libby [:fubar] from comment #33)
> I think #c2 still applies; :grenade is redeploying AMIs which should get us
> out of the immediate issue. We still need to investigate and fix what's
> actually wrong.

This seems to be one symptom of more general OCC behavior we'd like to change. I'd rather not do too much to solve this particular issue (i.e. adding more special case code to workaround this), but rather spend time tackling those root issues.

We talked a bit about how to do so at the work week. I'll send a summation email to the relevant parties to get this moving.
Flags: needinfo?(coop)
just my two cents, but i don't think the issue is caused by an occ behaviour. running an occ deploy actually gets things working again. the instances reach some state where task users fail to write to (or possibly read from) the hg robustcheckout cache. this is a consequence of tasks running as a new user on each run, and the files in the hg cache being owned by the task user from a previous task. this behaviour has nothing to do with occ and is implemented in generic-worker and robustcheckout.

a workaround was implemented in occ to grant full access to all users on the system to the directory where the hg robustcheckout cache exists whenever occ is run.
https://github.com/mozilla-releng/OpenCloudConfig/blob/4902e29a/userdata/Manifest/gecko-1-b-win2012.json#L1136-L1175

if generic-worker waits until occ has completed its run (after reboot, following a task completion), then occ has a chance to apply its workaround of resetting permissions on the cache.

the occ workaround to set permissions on every reboot is currently what gets things working when this issue pops up. an ami redeploy essentially just kills all the running workers and resets everything to a known good state.

a real solution would involve generic-worker managing cache permissions instead of relying on the hacky workaround in occ. eg generic-worker should manage cache access rights as it is the process that creates the various users that read and write from the cache. gw should not run tasks if the user it has created does not have the necessary permissions to access the caches.
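
(a rough sketch of that pre-task check, assuming write access can be probed by creating a temp file in the cache dir - paths and policy are illustrative:)

import os
import tempfile

def task_user_can_use_cache(cache_dir=r"y:\hg-shared"):
    """crude probe: try to create and delete a temp file as the task user.
    gw could refuse to claim tasks, or repair the acls, when this fails."""
    try:
        fd, path = tempfile.mkstemp(dir=cache_dir)
        os.close(fd)
        os.unlink(path)
        return True
    except OSError:
        return False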

i'm open to discussion on this, but i think the suggestion that something in occ is causing this problem isn't quite accurate. occ does its best to clean up after other processes that are not correctly managing their cache accesses. i don't really know what occ could do better in this regard, as it's currently working rather well under the circumstances, and an occ ami deploy run is the only mechanism we have to mitigate this issue until gw gets patched to handle this scenario better.
Assignee: relops → nobody
Component: RelOps → Generic-Worker
Product: Infrastructure & Operations → Taskcluster
QA Contact: klibby → pmoore
i've moved this bug to the gw component because i believe that's the only place a real fix can be implemented. if someone has a suggestion about what occ could do to better handle the failure of gw task users to access caches that other gw task users created, i'll be happy to take another look. for now i don't see how occ can be modified further: it already cleans up after gw, and is getting blamed for the bustage in this instance.
We could also have robustcheckout set file permissions on created directories such that all users have access rights. We already have code in robustcheckout for calling Win32 APIs. This feels a bit like a layering violation. But I'm not opposed to adding the workaround to robustcheckout.
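
For illustration, a sketch of what such a workaround could look like with pywin32, granting Everyone full, inherited control of a newly created cache directory; this isn't robustcheckout's actual code, just the shape of the Win32 calls involved:

import ntsecuritycon
import win32security

def grant_everyone_full_control(path):
    # Replace path's DACL with one granting Everyone full control,
    # inherited by files and subdirectories created beneath it.
    everyone = win32security.CreateWellKnownSid(win32security.WinWorldSid)
    dacl = win32security.ACL()
    dacl.AddAccessAllowedAceEx(
        win32security.ACL_REVISION_DS,
        ntsecuritycon.OBJECT_INHERIT_ACE | ntsecuritycon.CONTAINER_INHERIT_ACE,
        ntsecuritycon.GENERIC_ALL,
        everyone,
    )
    win32security.SetNamedSecurityInfo(
        path,
        win32security.SE_FILE_OBJECT,
        win32security.DACL_SECURITY_INFORMATION,
        None, None, dacl, None,
    )
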
It looks like the task definitions aren't mounting a cache for the hg shared repo:

https://queue.taskcluster.net/v1/task/IWEOwdCtRoOQOv1v5Jeeeg


If tasks wish to use cached data, they should mount a writable directory cache.

Search for "Writable Directory Cache" on https://docs.taskcluster.net/reference/workers/generic-worker/docs/payload to see how to implement this in gecko.
Component: Generic-Worker → Task Configuration
Product: Taskcluster → Firefox Build System
QA Contact: pmoore
Depends on: 1350956
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME