Closed Bug 1417881 Opened 7 years ago Closed 6 years ago

Intermittent IOError: [Errno 13] y:\hg-shared\ (...) Access is denied

Categories

(Firefox Build System :: Task Configuration, task, P1)

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: aryx, Unassigned)

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell infra])

Some Windows machines failed with cloning issues today and are still failing:

https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-3-b-win2012/workers/us-east-1/i-0e6fb293655cbec79
https://treeherder.mozilla.org/logviewer.html#?job_id=145297456&repo=autoland
IOError: [Errno 13] y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg/store\data/servo/components/style/stylesheets/rule__parser.rs.i: Access is denied
abort: y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg/store\data/servo/components/style/stylesheets/rule__parser.rs.i: Access is denied

This is a different machine: https://treeherder.mozilla.org/logviewer.html#?job_id=145226897&repo=autoland

From #vcs:
nagios-scl3: Thu 09:27:37 UTC [5469] [devservices] reviewboard1.webapp.scl3.mozilla.com:Swap is CRITICAL: SWAP CRITICAL - 1% free (0 MB out of 2047 MB) (http://m.mozilla.org/Swap)
(In reply to Sebastian Hengst [:aryx][:archaeopteryx] (needinfo on intermittent or backout) from comment #0)
> Some Windows machines failed and fail with cloning issues today:
> 
> https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/
> gecko-3-b-win2012/workers/us-east-1/i-0e6fb293655cbec79
> https://treeherder.mozilla.org/logviewer.html#?job_id=145297456&repo=autoland
> IOError: [Errno 13]
> y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg/store\data/servo/
> components/style/stylesheets/rule__parser.rs.i: Access is denied
> abort:
> y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg/store\data/servo/
> components/style/stylesheets/rule__parser.rs.i: Access is denied

This sorta looks like bug 1406073#c20. Rob?
 
> From #vcs:
> nagios-scl3: Thu 09:27:37 UTC [5469] [devservices]
> reviewboard1.webapp.scl3.mozilla.com:Swap is CRITICAL: SWAP CRITICAL - 1%
> free (0 MB out of 2047 MB) (http://m.mozilla.org/Swap)

Completely unrelated. "Access is denied" is a local error message from the filesystem, not from hgmo. Also, reviewboard != hgmo; nothing in automation is querying MozReview's webheads.
Flags: needinfo?(rthijssen)
yes, it's the same symptom but for different reasons. last time occ/dsc was actually broken on 2012 and failing to run for all 2012 instances. this time we know occ and dsc are running properly, otherwise this issue would occur on every instance, on all builds after the first one.

the issue occurs when an hg operation runs before occ has applied access controls to hg-shared.
https://github.com/mozilla-releng/OpenCloudConfig/blob/d3044939/userdata/Manifest/gecko-1-b-win2012.json#L1165-L1181
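
(for reference, a minimal sketch in python of the kind of icacls reset the manifest performs, assuming the y:\hg-shared path from the logs; the real grantee and flags are whatever the manifest above defines:)

import subprocess

# recursively (/t) grant Everyone full control (F) of the hg-shared cache,
# inherited by new files and subdirectories ((OI)(CI)). a sketch of the
# kind of icacls reset occ runs; the production grantee/flags are defined
# by the OCC manifest linked above.
subprocess.run(
    ["icacls", r"y:\hg-shared", "/grant", "Everyone:(OI)(CI)F", "/t"],
    check=True,
)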

normally this shouldn't be possible. things that i can think of that could be responsible:
- occ failed to run the icacls command on some instances (if we see lots of occurrences, i'd bet on this. if only a few, then less likely, because i don't know why it would work sometimes and not others)
- the build task has started before occ has completed. this isn't supposed to happen, as occ sets a flag to say it has completed and generic-worker checks the flag before it runs (see the sketch after this list). i don't know enough about how the gw code works to know how likely this is or isn't.
- maybe permissions need to be reset after every robustcheckout operation (between every task). this seems unlikely as i'd expect to see/have seen more occurrences of this error besides the two mentioned above.
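
(a rough sketch of the kind of gate gw could apply for the second bullet, assuming a hypothetical flag file written by occ on completion - the real flag name and mechanism may differ:)

import os
import time

# hypothetical path of the flag occ writes when it finishes its run;
# the real name/mechanism lives in occ and may differ.
OCC_DONE_FLAG = r"C:\occ\run-complete.flag"

def wait_for_occ(timeout=1800, poll=15):
    """block until occ signals completion, or give up after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists(OCC_DONE_FLAG):
            return True
        time.sleep(poll)
    return False

# gw would only start claiming tasks once this returns True.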

in any case, if it occurs again, we'll need to debug it on the instance where it occurs. we don't have much data to work with otherwise.
Flags: needinfo?(rthijssen)
(Of course this also affects inbound; TH had stopped loading results.)
Priority: -- → P1
Looking at treeherder, this seems to be affecting many machines, so perhaps grenade's idea is most likely?

"- occ failed to run the icacls command on some instances (if we see lots of occurences, i'd bet on this. if only a few, then less likely because i don't know why it would work sometimes and not others)"

Looking at one of the level 3 failed build tasks, the AMI was generated a month ago, so nothing has changed with the AMI or generic-worker to cause this. I'm not sure what the exact steps are to reproduce this scenario.

fubar, any idea of someone that can jump on the level 3 machines and troubleshoot?
Flags: needinfo?(klibby)
markco is looking
Flags: needinfo?(klibby)
rebuilt ami for gecko-3-b-win2012 in hopes that it corrects whatever's going on.
found no evidence of any of the possible causes i mentioned in comment 2.
occ is reporting no errors and the icacls commands appear to have succeeded.
retriggered failures in comment 3 and waiting to see if the new ami helps resolve these.
Flags: needinfo?(rthijssen)
Thank you. Started opening trees at 11pm UTC.
More of these reported today:

14:34:58 < gps> more windows workers hg permissions issues on try:
https://treeherder.mozilla.org/#/jobs?repo=try&author=gszorc@mozilla.com&selectedJob=146188147
15:35:05 < kats> getting a file access denied error on a windows taskcluster build, known
                 issue? https://treeherder.mozilla.org/logviewer.html#?job_id=146210204&repo=try&lineNumber=114
...
15:51:57 < kats> i saw the same failure on i-08e2a03630b9230de  and i-05b0544c86e70602e and
                 i-0eb41d3e4c49f198a
15:52:24 < kats> i-0eb41d3e4c49f198a was the one from which i linked the log above
15:52:25 <&garndt> two of those are the ones I just killed
Also had the problem on i-0d5a03eea706280e5.
rebuilt gecko-1-b-win2012 amis this morning.
what is our path forward? rebuilding AMIs every 5 days isn't a great strategy, though it disproves that it was anything about the AMI being more than a month old.
i don't know. however the amis i rebuilt today were for gecko-1-b-win2012 (try). the ones that i rebuilt last week were for gecko-3-b-win2012 (m-c, m-i, al, etc). today's were just exhibiting the same behaviour as last week's.
10:51:21 <@grenade> fubar: gecko-1-b-win2012 was last deployed Oct. 16

so maybe we're still looking at an issue with old AMIs. are there any other similar AMIs of the same age that we should keep an eye on (esp. given that you're PTO soon)?
Happened again on i-003a954ffa8c36dc6.
some errors to investigate from i-003a954ffa8c36dc6 before it was terminated (https://papertrailapp.com/systems/1383074912/events?focus=873916286793605142)

Dec 03 09:58:21 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com ESENT: taskhostex (2244) WebCacheLocal: An attempt to write to the file "Z:\task_1512295051\AppData\Local\Microsoft\Windows\WebCache\WebCacheV01.dat" at offset 0 (0x0000000000000000) for 32768 (0x00008000) bytes failed after 0.000 seconds with system error 21 (0x00000015): "The device is not ready. ".  The write operation will fail with error -1022 (0xfffffc02).  If this error persists then the file may be damaged and may need to be restored from a previous backup. 
Dec 03 09:58:21 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com ESENT: taskhostex (2244) WebCacheLocal: Unable to write a shadowed header for file Z:\task_1512295051\AppData\Local\Microsoft\Windows\WebCache\WebCacheV01.dat. Error -1022. 
Dec 03 09:58:21 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com ESENT: taskhostex (2244) WebCacheLocal: The database engine stopped the instance (0) with error (-1022).      Internal Timing Sequence: [1] 0.000, [2] 0.000, [3] 0.000, [4] 0.000, [5] 0.016, [6] 0.000, [7] 0.000, [8] 0.000, [9] 0.000, [10] 0.000, [11] 0.000, [12] 0.000, [13] 0.000, [14] 0.000, [15] 0.000.
and some more (need to understand what "Exception 1" is and whether or not it's related to the hg checkout):

Dec 03 09:55:07 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Resolving task... 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 ERROR encountered: Task not successful due to following exception(s): 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: Exception 1) 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: [] 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: Exit code: 255 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: Exception 2) 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: file-missing-on-worker: Could not read directory 'Z:\task_1512294859\public\build' 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Making system call GetProfilesDirectoryW with args: [0 C0421EE670] 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06   Result: 0 0 The data area passed to a system call is too small. 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Making system call GetProfilesDirectoryW with args: [C0421D2DA0 C0421EE670] 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06   Result: 1 0 The operation completed successfully. 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Trying to remove directory 'Z:\task_1512284784' via os.RemoveAll(path) call as GenericWorker user... 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Trying to remove directory 'Z:\task_1512292393' via os.RemoveAll(path) call as GenericWorker user... 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Trying to remove directory 'Z:\task_1512294623' via os.RemoveAll(path) call as GenericWorker user... 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Trying to remove directory 'Z:\task_1512294672' via os.RemoveAll(path) call as GenericWorker user... 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Trying to remove directory 'Z:\task_1512294723' via os.RemoveAll(path) call as GenericWorker user... 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Trying to remove directory 'Z:\task_1512294770' via os.RemoveAll(path) call as GenericWorker user... 
Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Trying to remove directory 'Z:\task_1512294814' via os.RemoveAll(path) call as GenericWorker user...
And again on i-0266dc3dac3a60cba.
As well as i-013cd233c248f2d9a and i-077ebab3ae38f2248.
And i-0b99c70cd74cc6c41.
And i-0e983859ff5e8ea39.
And i-01a028775db8de3ac.

How do we change the bug summary so that this bug shows up in treeherder?
The output you want to match is either "IOError: [Errno 13] .* Access is denied" or "abort: .* Access is denied"?

First you either rewrite treeherder to support globbing in bug summaries, or you wrap the invocation of hg in a script that examines the output and logs a line without a path when it sees that error. For lines of output that aren't in the format failureword | filename | errormessage, treeherder searches for some number of leading characters of the full line (which isn't actually the number it claims), so it's looking for something like "IOError: [Errno 13] y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg/store\data/python/mozbuild/mozpack/manifests.py.i: Access", probably including the filename but most certainly including the hash directory name.
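
A minimal sketch of the wrapper approach, assuming the marker line format below is something we'd agree on (it's illustrative, not an existing convention):

import re
import subprocess
import sys

# Run hg, mirror its output, and emit a stable, path-free marker line
# whenever the "Access is denied" error appears, so treeherder's
# fixed-prefix search can match a bug summary without the per-clone hash.
ACCESS_DENIED = re.compile(r"(IOError: \[Errno 13\]|abort:).*Access is denied")

proc = subprocess.run(["hg"] + sys.argv[1:], capture_output=True, text=True)
for stream, text in ((sys.stdout, proc.stdout), (sys.stderr, proc.stderr)):
    for line in text.splitlines():
        print(line, file=stream)
        if ACCESS_DENIED.search(line):
            # hypothetical marker format; would need agreement before use
            print("hg-wrapper: Access is denied (path elided)", file=stream)
sys.exit(proc.returncode)
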
And i-0eb286184904967a9, i-0f1d0fb4107929ddd, and i-02ff7a92f7d09a2b2.
Summary: Intermittent Windows cloning failure with 'Access is denied' on some nodes → Intermittent IOError: [Errno 13] y:\hg-shared\ (...) Access is defined
Summary: Intermittent IOError: [Errno 13] y:\hg-shared\ (...) Access is defined → Intermittent IOError: [Errno 13] y:\hg-shared\ (...) Access is denied
Update: these failures appeared on the following machines: i-0efc11fb275fdf878, i-09551a89807a7984b and i-01695fe31c7a4f3e8

Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=169622298&repo=mozilla-central&lineNumber=114
Also machine:  i-0c29d030b3efe31de
https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-1-b-win2012/workers/eu-central-1/i-0a4f5932a234f74f0 caused mass failures (got quarantined):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e20ebff215891454e33b6a5facd48b1a0893e0fb&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable&group_state=expanded&selectedJob=169640871

Expect this to become worse later today (upcoming tree closure).

The Taskcluster team is out of the office for the next few hours. Eli, needinfoing you in case something has to be done for taskcluster.

Kendall, is this an hg.mozilla.org issue, an hg cache issue (?), or a taskcluster worker issue? Thank you in advance.
Flags: needinfo?(klibby)
Flags: needinfo?(eperelman)
I think #c2 still applies; :grenade is redeploying AMIs which should get us out of the immediate issue. We still need to investigate and fix what's actually wrong.
Flags: needinfo?(klibby)
we have in the past seen this resolved by an ami redeploy, so i am trying that now.
https://tools.taskcluster.net/task-group-inspector/#/NWxVSJ7kQwW-6Sw4-G-rrQ
Redirecting.
Flags: needinfo?(eperelman) → needinfo?(coop)
(In reply to Kendall Libby [:fubar] from comment #33)
> I think #c2 still applies; :grenade is redeploying AMIs which should get us
> out of the immediate issue. We still need to investigate and fix what's
> actually wrong.

This seems to be one symptom of more general OCC behavior we'd like to change. I'd rather not do too much to solve this particular issue (i.e. adding more special case code to workaround this), but rather spend time tackling those root issues.

We talked a bit about how to do so at the work week. I'll send a summation email to the relevant parties to get this moving.
Flags: needinfo?(coop)
just my two cents, but i don't think the issue is caused by an occ behaviour. running an occ deploy actually gets things working again. the instances reach some state where task users fail to write to (or possibly read from) the hg robustcheckout cache. this is a consequence of tasks running as a new user on each run, and the files in the hg cache being owned by the task user from a previous task. this behaviour has nothing to do with occ and is implemented in generic-worker and robustcheckout.

a workaround was implemented in occ to grant full access to all users on the system to the directory where the hg robustcheckout cache exists whenever occ is run.
https://github.com/mozilla-releng/OpenCloudConfig/blob/4902e29a/userdata/Manifest/gecko-1-b-win2012.json#L1136-L1175

if generic-worker waits until occ has completed its run (after reboot, following a task completion), then occ has a chance to apply its workaround of resetting permissions on the cache.

the occ workaround to set permissions on every reboot is currently what gets things working when this issue pops up. an ami redeploy essentially just kills all the running workers and resets everything to a known good state.

a real solution would involve generic-worker managing cache permissions instead of relying on the hacky workaround in occ. eg generic-worker should manage cache access rights as it is the process that creates the various users that read and write from the cache. gw should not run tasks if the user it has created does not have the necessary permissions to access the caches.
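
(a rough sketch of that pre-task check, assuming write access can be probed by creating a temp file in the cache dir - paths and policy are illustrative:)

import os
import tempfile

def task_user_can_use_cache(cache_dir=r"y:\hg-shared"):
    """crude probe: try to create and delete a temp file as the task user.
    gw could refuse to claim tasks, or repair the acls, when this fails."""
    try:
        fd, path = tempfile.mkstemp(dir=cache_dir)
        os.close(fd)
        os.unlink(path)
        return True
    except OSError:
        return False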

i'm open to discussion on this, but i think the suggestion that something in occ is causing this problem isn't quite accurate. occ does its best to clean up after other processes that are not correctly managing their cache accesses. i don't really know what occ could do better in this regard, as it's currently working rather well under the circumstances, and an occ ami deploy run is the only mechanism we have to mitigate this issue until gw gets patched to handle this scenario better.
Assignee: relops → nobody
Component: RelOps → Generic-Worker
Product: Infrastructure & Operations → Taskcluster
QA Contact: klibby → pmoore
i've moved this bug to the gw component because i believe that's the only place a real fix can be implemented. if someone has a suggestion about what occ could do to better handle the failure of gw task users to access caches that other gw task users created, i'll be happy to take another look. for now i don't see how occ can be modified further: it already cleans up after gw, and is getting blamed for the bustage in this instance.
We could also have robustcheckout set file permissions on created directories such that all users have access rights. We already have code in robustcheckout for calling Win32 APIs. This feels a bit like a layering violation. But I'm not opposed to adding the workaround to robustcheckout.
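
For illustration, a sketch of what such a workaround could look like with pywin32, granting Everyone full, inherited control of a newly created cache directory; this isn't robustcheckout's actual code, just the shape of the Win32 calls involved:

import ntsecuritycon
import win32security

def grant_everyone_full_control(path):
    # Replace path's DACL with one granting Everyone full control,
    # inherited by files and subdirectories created beneath it.
    everyone = win32security.CreateWellKnownSid(win32security.WinWorldSid)
    dacl = win32security.ACL()
    dacl.AddAccessAllowedAceEx(
        win32security.ACL_REVISION_DS,
        ntsecuritycon.OBJECT_INHERIT_ACE | ntsecuritycon.CONTAINER_INHERIT_ACE,
        ntsecuritycon.GENERIC_ALL,
        everyone,
    )
    win32security.SetNamedSecurityInfo(
        path,
        win32security.SE_FILE_OBJECT,
        win32security.DACL_SECURITY_INFORMATION,
        None, None, dacl, None,
    )
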
It looks like the task definitions aren't mounting a cache for the hg shared repo:

https://queue.taskcluster.net/v1/task/IWEOwdCtRoOQOv1v5Jeeeg


If tasks wish to use cached data, they should mount a writable directory cache.

Search for "Writable Directory Cache" on https://docs.taskcluster.net/reference/workers/generic-worker/docs/payload to see how to implement this in gecko.
Component: Generic-Worker → Task Configuration
Product: Taskcluster → Firefox Build System
QA Contact: pmoore
Depends on: 1350956
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME