Closed
Bug 1417881
Opened 7 years ago
Closed 6 years ago
Intermittent IOError: [Errno 13] y:\hg-shared\ (...) Access is denied
Categories
(Firefox Build System :: Task Configuration, task, P1)
Firefox Build System
Task Configuration
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: aryx, Unassigned)
References
Details
(Keywords: intermittent-failure, Whiteboard: [stockwell infra])
Some Windows machines failed and fail with cloning issues today: https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-3-b-win2012/workers/us-east-1/i-0e6fb293655cbec79 https://treeherder.mozilla.org/logviewer.html#?job_id=145297456&repo=autoland IOError: [Errno 13] y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg/store\data/servo/components/style/stylesheets/rule__parser.rs.i: Access is denied abort: y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg/store\data/servo/components/style/stylesheets/rule__parser.rs.i: Access is denied This is a different machine: https://treeherder.mozilla.org/logviewer.html#?job_id=145226897&repo=autoland From #vcs: nagios-scl3: Thu 09:27:37 UTC [5469] [devservices] reviewboard1.webapp.scl3.mozilla.com:Swap is CRITICAL: SWAP CRITICAL - 1% free (0 MB out of 2047 MB) (http://m.mozilla.org/Swap)
Comment 1•7 years ago
|
||
(In reply to Sebastian Hengst [:aryx][:archaeopteryx] (needinfo on intermittent or backout) from comment #0) > Some Windows machines failed and fail with cloning issues today: > > https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/ > gecko-3-b-win2012/workers/us-east-1/i-0e6fb293655cbec79 > https://treeherder.mozilla.org/logviewer.html#?job_id=145297456&repo=autoland > IOError: [Errno 13] > y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg/store\data/servo/ > components/style/stylesheets/rule__parser.rs.i: Access is denied > abort: > y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg/store\data/servo/ > components/style/stylesheets/rule__parser.rs.i: Access is denied This sorta looks like bug 1406073#c20. Rob? > From #vcs: > nagios-scl3: Thu 09:27:37 UTC [5469] [devservices] > reviewboard1.webapp.scl3.mozilla.com:Swap is CRITICAL: SWAP CRITICAL - 1% > free (0 MB out of 2047 MB) (http://m.mozilla.org/Swap) Completely unrelated. "Access is denied" is a local error message from the filesystem, not from hgmo. Also, reviewboard != hgmo; nothing in automation is querying MozReview's webheads.
Flags: needinfo?(rthijssen)
Comment 2•7 years ago
|
||
yes, it's the same symptom but for different reasons. last time occ/dsc was actually broken on 2012 and failing to run for all 2012 instances. this time we know occ and dsc are running properly otherwise this issue would occur on every instance, on all builds after the first one. the issue occurs when an hg operation occurs before occ has run access controls against hg-shared. https://github.com/mozilla-releng/OpenCloudConfig/blob/d3044939/userdata/Manifest/gecko-1-b-win2012.json#L1165-L1181 normally this shouldn't be possible. things that i can think of that could be responsible: - occ failed to run the icacls command on some instances (if we see lots of occurences, i'd bet on this. if only a few, then less likely because i don't know why it would work sometimes and not others) - the build task has started before occ has completed. this isn't supposed to happen as occ sets a flag to say it has completed. generic worker checks the flag before it runs. i don't know enough about how the gw code works to know how likely this is or isn't. - maybe permissions need to be reset after every robustcheckout operation (between every task). this seems unlikely as i'd expect to see/have seen more occurences of this error besides the two mentioned above. in any case, if it occurs again, we'll need to debug it on the instance where it occurs. we don't have much data to work with otherwise.
Flags: needinfo?(rthijssen)
Reporter | ||
Comment 3•7 years ago
|
||
Trees are closed because Windows builds on autoland are mass failing with this: https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable&tochange=6bf259a96c4946259eeca33a31774c6c7ca72915&fromchange=1c30145251a460fc34d1ca50a718dff3b911fa45
Flags: needinfo?(rthijssen)
Reporter | ||
Comment 4•7 years ago
|
||
(Of course also affects inbound, TH had stopped loading results.)
Updated•7 years ago
|
Priority: -- → P1
Comment 5•7 years ago
|
||
Looking at on treeherder, this seems to be affecting many machines, so perhaps grenade's idea is most likely? "- occ failed to run the icacls command on some instances (if we see lots of occurences, i'd bet on this. if only a few, then less likely because i don't know why it would work sometimes and not others)" Looking at one of the level 3 failed build tasks, the AMI was generated a month ago so nothing has changed with the AMI or generic-worker to cause this. I'm not sure what the exact steps are to reproduce this scenario. fubar, any idea of someone that can jump on the level 3 machines and troubleshoot?
Flags: needinfo?(klibby)
Comment 7•7 years ago
|
||
rebuilt ami for gecko-3-b-win2012 in hopes that it corrects whatever's going on. found no evidence of any of the possible causes i mentioned in comment 2. occ is reporting no errors and the icacls commands appear to have succeeded. retriggered failures in comment 3 and waiting to see if the new ami helps resolve these.
Flags: needinfo?(rthijssen)
Reporter | ||
Comment 8•7 years ago
|
||
Thank you. Started opening trees at 11pm UTC.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 11•7 years ago
|
||
More of these reported today: 14:34:58 < gps> more windows workers hg permissions issues on try: https://treeherder.mozilla.org/#/jobs?repo=try&author=gszorc@mozilla.com&selectedJob=146188147 15:35:05 < kats> getting a file access denied error on a windows taskcluster build, known issue? https://treeherder.mozilla.org/logviewer.html#?job_id=146210204&repo=try&lineNumber=114 ... 15:51:57 < kats> i saw the same failure on i-08e2a03630b9230de and i-05b0544c86e70602e and i-0eb41d3e4c49f198a 15:52:24 < kats> i-0eb41d3e4c49f198a was the one from which i linked the log above 15:52:25 <&garndt> two of those are the ones I just killed
Comment 12•7 years ago
|
||
Also had the problem on i-0d5a03eea706280e5.
Comment 13•7 years ago
|
||
rebuilt gecko-1-b-win2012 amis this morning.
Comment 14•7 years ago
|
||
what is our path forward? rebuilding AMIs every 5 days isn't a great strategy, though it disproves that it was anything about the AMI being more than a month old.
Comment 15•7 years ago
|
||
i don't know. however the amis i rebuilt today were for gecko-1-b-win2012 (try). the ones that i rebuilt last week were for gecko-3-b-win2012 (m-c, m-i, al, etc). todays were just exhibiting the same behaviour as last weeks.
Comment 16•7 years ago
|
||
10:51:21 <@grenade> fubar: gecko-1-b-win2012 was last deployed Oct. 16 so maybe we're still looking at an issue with old AMIs. are there any other similar AMIs that are the same age that we can keep an eye out for (esp given that you're PTO soon) ?
Comment 17•7 years ago
|
||
i went ahead and deployed gecko-2-b-win2012 for good measure. that's all of 'em. https://tools.taskcluster.net/groups/HajqTx-LQbi8s3dR7U7HjA/tasks/bN9da_-YQGyJIfSqIvPhtA/runs/0/logs/public%2Flogs%2Flive.log
Comment 18•7 years ago
|
||
Happened again on i-003a954ffa8c36dc6.
Comment 19•7 years ago
|
||
precautionary redeploy of gecko-1-b-win2012 (instance type of i-003a954ffa8c36dc6 from comment 18) https://github.com/mozilla-releng/OpenCloudConfig/commit/23b390bb52b04b5720fcbad8f7310672cbf61205 https://tools.taskcluster.net/groups/TiTUsxJjSNKh8GxC-3C7eQ/tasks/HIvfAjeHRNypkKPSfhmuhA/runs/0/logs/public%2Flogs%2Flive.log
Comment 20•7 years ago
|
||
some errors to investigate from i-003a954ffa8c36dc6 before it was terminated (https://papertrailapp.com/systems/1383074912/events?focus=873916286793605142) Dec 03 09:58:21 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com ESENT: taskhostex (2244) WebCacheLocal: An attempt to write to the file "Z:\task_1512295051\AppData\Local\Microsoft\Windows\WebCache\WebCacheV01.dat" at offset 0 (0x0000000000000000) for 32768 (0x00008000) bytes failed after 0.000 seconds with system error 21 (0x00000015): "The device is not ready. ". The write operation will fail with error -1022 (0xfffffc02). If this error persists then the file may be damaged and may need to be restored from a previous backup. Dec 03 09:58:21 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com ESENT: taskhostex (2244) WebCacheLocal: Unable to write a shadowed header for file Z:\task_1512295051\AppData\Local\Microsoft\Windows\WebCache\WebCacheV01.dat. Error -1022. Dec 03 09:58:21 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com ESENT: taskhostex (2244) WebCacheLocal: The database engine stopped the instance (0) with error (-1022). Internal Timing Sequence: [1] 0.000, [2] 0.000, [3] 0.000, [4] 0.000, [5] 0.016, [6] 0.000, [7] 0.000, [8] 0.000, [9] 0.000, [10] 0.000, [11] 0.000, [12] 0.000, [13] 0.000, [14] 0.000, [15] 0.000.
Comment 21•7 years ago
|
||
and some more (need to understand what "Exception 1" is and whether or not it's related to the hg checkout): Dec 03 09:55:07 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Resolving task... Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 ERROR encountered: Task not successful due to following exception(s): Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: Exception 1) Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: [] Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: Exit code: 255 Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: Exception 2) Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: file-missing-on-worker: Could not read directory 'Z:\task_1512294859\public\build' Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Making system call GetProfilesDirectoryW with args: [0 C0421EE670] Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Result: 0 0 The data area passed to a system call is too small. Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Making system call GetProfilesDirectoryW with args: [C0421D2DA0 C0421EE670] Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Result: 1 0 The operation completed successfully. Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Trying to remove directory 'Z:\task_1512284784' via os.RemoveAll(path) call as GenericWorker user... Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Trying to remove directory 'Z:\task_1512292393' via os.RemoveAll(path) call as GenericWorker user... Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Trying to remove directory 'Z:\task_1512294623' via os.RemoveAll(path) call as GenericWorker user... Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Trying to remove directory 'Z:\task_1512294672' via os.RemoveAll(path) call as GenericWorker user... Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Trying to remove directory 'Z:\task_1512294723' via os.RemoveAll(path) call as GenericWorker user... Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Trying to remove directory 'Z:\task_1512294770' via os.RemoveAll(path) call as GenericWorker user... Dec 03 09:55:08 i-003a954ffa8c36dc6.gecko-1-b-win2012.euc1.mozilla.com generic-worker: 2017/12/03 09:55:06 Trying to remove directory 'Z:\task_1512294814' via os.RemoveAll(path) call as GenericWorker user...
Comment 22•7 years ago
|
||
And again on i-0266dc3dac3a60cba.
Comment 23•7 years ago
|
||
As well as i-013cd233c248f2d9a and i-077ebab3ae38f2248.
Comment 24•7 years ago
|
||
And i-0b99c70cd74cc6c41.
Reporter | ||
Comment 25•7 years ago
|
||
https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-1-b-win2012/workers/eu-central-1/i-05a78f3f34463681d had one passing spidermonkey-sm-compacting-win32/debug and only this failure after that.
Comment 26•7 years ago
|
||
And i-0e983859ff5e8ea39.
Comment 27•7 years ago
|
||
And i-01a028775db8de3ac. How do we change the bug summary so that this bug shows up in treeherder?
Comment 28•7 years ago
|
||
The output you want to match is either "IOError: [Errno 13] .* Access is denied" or "abort: .* Access is denied"? First you either rewrite treeherder to support globbing in bug summaries, or you wrap the invocation of hg in a script that examines the output and logs a line without a path when it sees that error. For lines of output that aren't in the format failureword | filename | errormessage, treeherder searches for some number of characters (which isn't actually the number it claims) of the full line, so it's looking for something like "IOError: [Errno 13] y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg/store\data/python/mozbuild/mozpack/manifests.py.i: Access", probably including the filename but most certainly including the hash directory name.
Comment 29•7 years ago
|
||
And i-0eb286184904967a9, i-0f1d0fb4107929ddd, and i-02ff7a92f7d09a2b2.
Updated•7 years ago
|
Summary: Intermittent Windows cloning failure with 'Access is denied' on some nodes → Intermittent IOError: [Errno 13] y:\hg-shared\ (...) Access is defined
Summary: Intermittent IOError: [Errno 13] y:\hg-shared\ (...) Access is defined → Intermittent IOError: [Errno 13] y:\hg-shared\ (...) Access is denied
Comment 30•6 years ago
|
||
Update: these failures appeared on the following machines: i-0efc11fb275fdf878, i-09551a89807a7984b and i-01695fe31c7a4f3e8 Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=169622298&repo=mozilla-central&lineNumber=114
Comment 31•6 years ago
|
||
Also machine: i-0c29d030b3efe31de
Reporter | ||
Comment 32•6 years ago
|
||
https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-1-b-win2012/workers/eu-central-1/i-0a4f5932a234f74f0 caused mass failures (got quarantined): https://treeherder.mozilla.org/#/jobs?repo=try&revision=e20ebff215891454e33b6a5facd48b1a0893e0fb&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable&group_state=expanded&selectedJob=169640871 Expect this to become worse later today (upcoming tree closure). The Taskcluster team is out of office for the next hours. Eli, needinfoing you in case something has to be done for taskcluster. Kendall, is this a hg.mozilla.org, a hg cache (?) or a taskcluster worker issue? Thank you in advance.
Flags: needinfo?(klibby)
Flags: needinfo?(eperelman)
Comment 33•6 years ago
|
||
I think #c2 still applies; :grenade is redeploying AMIs which should get us out of the immediate issue. We still need to investigate and fix what's actually wrong.
Flags: needinfo?(klibby)
Comment 34•6 years ago
|
||
we have in the past seen this resolved by an ami redploy so i am trying that now. https://tools.taskcluster.net/task-group-inspector/#/NWxVSJ7kQwW-6Sw4-G-rrQ
Comment hidden (Intermittent Failures Robot) |
Comment 37•6 years ago
|
||
(In reply to Kendall Libby [:fubar] from comment #33) > I think #c2 still applies; :grenade is redeploying AMIs which should get us > out of the immediate issue. We still need to investigate and fix what's > actually wrong. This seems to be one symptom of more general OCC behavior we'd like to change. I'd rather not do too much to solve this particular issue (i.e. adding more special case code to workaround this), but rather spend time tackling those root issues. We talked a bit about how to do so at the work week. I'll send a summation email to the relevant parties to get this moving.
Flags: needinfo?(coop)
Comment 38•6 years ago
|
||
just my two cents but i don't think the issue is caused by an occ behaviour. running an occ deploy actually gets things working again. the instances reach some state where task users fail to write (or read, possibly) from the hg robustcheckout cache. this is a consequence of tasks running as a new user on each run and the file permisions of the hg cache being owned by the task user from a previous task. this behaviour has nothing to do with occ and is implemented in generic-worker and robustcheckout. a workaround was implemented in occ to grant full access to all users on the system to the directory where the hg robustcheckout cache exists whenever occ is run. https://github.com/mozilla-releng/OpenCloudConfig/blob/4902e29a/userdata/Manifest/gecko-1-b-win2012.json#L1136-L1175 if generic-worker waits until occ has completed its run (after reboot, following a task completion), then occ has a chance to apply its workaround of resetting permissions on the cache. the occ workaround to set permissions on every reboot is currently what gets things working when this issue pops up. an ami redeploy essentially just kills all the running workers and resets everything to a known good state. a real solution would involve generic-worker managing cache permissions instead of relying on the hacky workaround in occ. eg generic-worker should manage cache access rights as it is the process that creates the various users that read and write from the cache. gw should not run tasks if the user it has created does not have the necessary permissions to access the caches. i'm open to discussion on this but i think the suggestion that something in occ is causing this problem isn't quite accurate. occ does it's best to clean up after other processes that are not correctly managing their cache accesses. i don't really know what occ could do better in this regard as it's currently working rather well under the circumstances and an occ ami deploy run is the only mechanism we have to mitigate this issue until gw gets patched to handle this scenario better.
Updated•6 years ago
|
Assignee: relops → nobody
Component: RelOps → Generic-Worker
Product: Infrastructure & Operations → Taskcluster
QA Contact: klibby → pmoore
Comment 39•6 years ago
|
||
i've moved this bug to the gw component because i believe that's the only place a real fix can be implemented. if someone has a suggestion about what occ could do to better handle failure of gw task users to access caches that other gw task users created i'll be happy to take another look but for now i don't see how occ can be modified further to continue cleaning up after gw and getting blamed for the bustage in this instance.
Comment 40•6 years ago
|
||
We could also have robustcheckout set file permissions on created directories such that all users have access rights. We already have code in robustcheckout for calling Win32 APIs. This feels a bit like a layering violation. But I'm not opposed to adding the workaround to robustcheckout.
Comment 41•6 years ago
|
||
As an update, this continues to occur on i-011b9d85bb1e3d353: https://treeherder.mozilla.org/logviewer.html#?job_id=170513267&repo=mozilla-central&lineNumber=113 and i-07f697b865b190bea : https://treeherder.mozilla.org/logviewer.html#?job_id=170513134&repo=mozilla-central&lineNumber=113
Comment hidden (Intermittent Failures Robot) |
Comment 43•6 years ago
|
||
It looks like the task definitions aren't mounting a cache for the hg shared repo: https://queue.taskcluster.net/v1/task/IWEOwdCtRoOQOv1v5Jeeeg If tasks wish to used cache data, they should mount a writable directory cache. Search for "Writable Directory Cache" on https://docs.taskcluster.net/reference/workers/generic-worker/docs/payload to see how to implement this in gecko.
Component: Generic-Worker → Task Configuration
Product: Taskcluster → Firefox Build System
QA Contact: pmoore
Comment hidden (Intermittent Failures Robot) |
Updated•6 years ago
|
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
You need to log in
before you can comment on or make changes to this bug.
Description
•