Closed Bug 1406073 Opened 7 years ago Closed 7 years ago

Intermittent Windows cloning failure with .hg/store\data/dom/media/_channel_media_decoder.cpp.i: Access is denied

Categories

(Infrastructure & Operations :: RelOps: General, task)

Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aryx, Assigned: grenade)

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell infra])

Attachments

(1 file)

https://treeherder.mozilla.org/logviewer.html#?job_id=135150424&repo=autoland

About half a dozen Windows jobs failed with this kind of failure, all on different worker instances; retriggers succeeded.

[taskcluster 2017-10-05T12:56:22.699Z] Executing command 4: "c:\Program Files\Mercurial\hg.exe" robustcheckout --sharebase y:\hg-shared --purge --upstream https://hg.mozilla.org/mozilla-unified --revision 1df4622b548bade48b18d2b9f2cbf7aa2de1bc63 https://hg.mozilla.org/integration/autoland .\build\src
z:\build>"c:\Program Files\Mercurial\hg.exe" robustcheckout --sharebase y:\hg-shared --purge --upstream https://hg.mozilla.org/mozilla-unified --revision 1df4622b548bade48b18d2b9f2cbf7aa2de1bc63 https://hg.mozilla.org/integration/autoland .\build\src
(using Mercurial 4.3.1)
ensuring https://hg.mozilla.org/integration/autoland@1df4622b548bade48b18d2b9f2cbf7aa2de1bc63 is available at .\build\src
(cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
(sharing from existing pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
searching for changes
adding changesets
adding manifests
adding file changes
added 2 changesets with 2 changes to 2 files
searching [ <=> ] 2
searching [ <=> ] 4
adding remote bookmark aurora
adding remote bookmark beta
adding remote bookmark central
adding remote bookmark esr10
adding remote bookmark esr17
adding remote bookmark esr24
adding remote bookmark esr31
adding remote bookmark esr38
adding remote bookmark esr45
adding remote bookmark esr52
adding remote bookmark fx-team
adding remote bookmark inbound
adding remote bookmark release
(pulling to obtain 1df4622b548bade48b18d2b9f2cbf7aa2de1bc63)
searching for changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
Traceback (most recent call last):
  File "mercurial\scmutil.pyc", line 150, in callcatch
  File "mercurial\dispatch.pyc", line 292, in _runcatchfunc
  File "mercurial\dispatch.pyc", line 896, in _dispatch
  File "mercurial\dispatch.pyc", line 658, in runcommand
  File "mercurial\dispatch.pyc", line 904, in _runcommand
  File "mercurial\dispatch.pyc", line 893, in <lambda>
  File "mercurial\util.pyc", line 1077, in check
  File "C:/mozilla-build/robustcheckout.py", line 239, in robustcheckout
  File "C:/mozilla-build/robustcheckout.py", line 516, in _docheckout
  File "mercurial\exchange.pyc", line 1251, in pull
  File "mercurial\exchange.pyc", line 1393, in _pullbundle2
  File "mercurial\bundle2.pyc", line 364, in processbundle
  File "mercurial\bundle2.pyc", line 464, in _processpart
  File "mercurial\bundle2.pyc", line 1552, in handlechangegroup
  File "mercurial\bundle2.pyc", line 405, in _processchangegroup
  File "mercurial\changegroup.pyc", line 327, in apply
  File "mercurial\changegroup.pyc", line 981, in _addchangegroupfiles
  File "mercurial\revlog.pyc", line 1886, in addgroup
  File "mercurial\store.pyc", line 488, in __call__
  File "mercurial\vfs.pyc", line 405, in __call__
  File "mercurial\windows.pyc", line 140, in posixfile
IOError: [Errno 13] y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg/store\data/dom/media/_channel_media_decoder.cpp.i: Access is denied
abort: y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg/store\data/dom/media/_channel_media_decoder.cpp.i: Access is denied
Assignee: nobody → relops
Component: Mercurial: hg.mozilla.org → RelOps
Flags: needinfo?(klibby)
Product: Developer Services → Infrastructure & Operations
QA Contact: arich
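For anyone who gets onto an affected worker, the failure can be reproduced without a full robustcheckout run by attempting the same kind of append-mode open that the traceback above shows failing inside Mercurial's posixfile(). A minimal diagnostic sketch in Python; the store path is the one from the log, normalized to backslashes, and would differ per share hash:

# Minimal probe: try the append-mode open that a revlog append needs, so an
# NTFS permission problem surfaces as the same EACCES the tasks hit.
import errno

STORE_FILE = (r"y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29"
              r"\.hg\store\data\dom\media\_channel_media_decoder.cpp.i")

try:
    with open(STORE_FILE, "a+b"):
        print("append-mode open succeeded; this user can write the revlog")
except (IOError, OSError) as exc:
    if exc.errno == errno.EACCES:
        print("EACCES: same 'Access is denied' the robustcheckout tasks hit")
    else:
        raise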
from #taskcluster (times EDT):

12:14:13 <&garndt> so, if this is burning all builders, and it might be something that corrupted hg-share on all of them, any objections to me just destroying all gecko-3-win2012 machines and we can try to retrigger some jobs?
12:14:22 < RyanVM> wfm
12:14:47 <&garndt> I don't know enough about hg-share, or how to unstick the windows machines when they hit issues like that (gp.s is away too)
12:16:03 < RyanVM> I'm happy with an "It's an older code but it checks out" solution right now
12:25:40 <&garndt> Aryx: RyanVM I drained the geck-3-b-win2012 pool if you want to try retriggering some level 3 things and see what happens
So, trees were reopened for a while, but we eventually started hitting this again with failures like https://treeherder.mozilla.org/logviewer.html#?job_id=135215737&repo=mozilla-inbound
These machines are still running an AMI from September 20th (15 days ago). I am not sure what might have changed to cause a permission error using hg-shared.

https://tools.taskcluster.net/groups/P8uVnYZlSHOVyFdKcDrxOA/tasks/FV8HqvvbSj6NSk7csY3MiA/runs/0/logs/public%2Flogs%2Flive.log#L24

Looking at one particular machine, it completed its first task successfully, and every task after that hits the issue.

first task good: https://tools.taskcluster.net/groups/QDR6Q1u4QzWItvPjYuWv0w/tasks/Eunwd0dlQrmgWZhqBqy5Ug
second task bad: https://tools.taskcluster.net/groups/P8uVnYZlSHOVyFdKcDrxOA/tasks/FV8HqvvbSj6NSk7csY3MiA
Is it possible the perms are changing on write from the clone, i.e. in the repo they are read-only? The only other time I see this is when some other process is accessing the file.
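One way to answer that on a live worker is to dump the NTFS ACL and the read-only attribute for one of the offending store files. A rough diagnostic sketch, assuming Python 3.5+ on the worker and the store path from the failure log above (icacls ships with Windows):

# Show who can do what to the store file and whether the DOS read-only
# attribute is set; either can produce "Access is denied" on write.
import os
import stat
import subprocess

STORE_FILE = (r"y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29"
              r"\.hg\store\data\dom\media\_channel_media_decoder.cpp.i")

# Full ACL as NTFS sees it.
print(subprocess.check_output(["icacls", STORE_FILE]).decode(errors="replace"))

# DOS read-only attribute.
attrs = os.stat(STORE_FILE).st_file_attributes
print("read-only attribute set:", bool(attrs & stat.FILE_ATTRIBUTE_READONLY))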
(In reply to Greg Arndt [:garndt] from comment #6)
> first task good: https://tools.taskcluster.net/groups/QDR6Q1u4QzWItvPjYuWv0w/tasks/Eunwd0dlQrmgWZhqBqy5Ug
> second task bad: https://tools.taskcluster.net/groups/P8uVnYZlSHOVyFdKcDrxOA/tasks/FV8HqvvbSj6NSk7csY3MiA

What does the cleanup between jobs look like for Windows AMIs? Can processes persist from one job to the next (as Q suggested)?

Is there an instance in this state that someone could investigate?
Capturing the current state of investigations, per IRC (#taskcluster):

18:13 <&garndt> glandium: do you have any ideas of why we might be hitting bug 1406073? wondering if you had any magical ideas
18:14 <&garndt> KWierso: Aryx On a different note, most of the existing machines have died off, so if you want to try to get some more builds going (although likely to break again) you can retrigger things
18:14 <&garndt> I'm just not sure what is causing this yet, and having trouble jumping inot windows machines
18:15 <Aryx> garndt: i won't be around long enough if they fail again, so it's up to Wes
18:19 <coop> garndt: will retriggering existing builds that failed suffice as an initial test?
18:19 <&garndt> yea, that's why I mentioned retrigger :)
18:20 <&garndt> I can click retrigger around some jobs
18:20 <&garndt> if someone wants to watch them
18:21 <coop> i've already retriggered some, i'll be around for an hour or two to watch
18:21 <coop> avoiding PGO
18:23 <&garndt> retriggered jobs on https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=224d8c57a4693138c8823ac108659bb1fe5b308e&filter-searchStr=2012
18:23 <fubar> so I'm gonna be that guy: what changed in the N hours preceding the bustage?
18:23 <&garndt> fubar: so far, I have no idea
18:23 <&garndt> the machines themselves should be the same since 9/20
18:23 <&garndt> not sure if this is a vcs thing, only impacts windows
18:23 <fubar> *nod* so.. something in a task?
18:23 <&garndt> linux doesn't use hg-shared I dont' think
18:24 <&garndt> neither does os x
18:24 <fubar> yeah, gonna say it's not VCS
18:24 <fubar> Q has a prior engagement and has to step out. he's going to kick off setting up a loaner so that he can try stuff manually later
18:25 <coop> do trees need to be open for retriggers to work? not seeing these retriggers appear in TH
18:25 <fubar> robustcheckout was my primary suspect for a bit, but that hasn't changed in a week or so
18:26 <Aryx> coop: where did you retrigger? might be just a delay because we have to acquire instances to run the retriggers first
18:26 <coop> see by asking that question, i can guarantee they'll show up a minute late
18:26 <&garndt> I saw some retriggers, and not others
18:26 <coop> er, later
18:26 <&garndt> sometime TH ingestion can be lagged
18:26 <coop> they're there now
18:27 <&garndt> and I don't know why this isn't impacting level 1 buidlers
18:28 <&garndt> nevermind
18:28 <&garndt> it is, we just haven't closed try for it
18:28 <&garndt> is it always the same file?
18:28 <&garndt> _tab_parent.cpp
18:29 <fubar> no
18:29 <glandium> garndt: have you looked at the contents and permissions of the files on that worker?
18:29 <Aryx> garndt: only a small set of file
18:29 <Aryx> files
18:29 <&garndt> glandium: I'm not sure, most of the machiens causing the problem are gone now, and Q tried getting on one of the machines but coulnd't (they are machines that grenade configures)
18:31 <glandium> garndt: what it sounds like is that a builder is pulling stuff into hg-shared at some point with some perms, and a following worker can't append to the same file
18:31 <glandium> because each task uses a different user
18:32 <&garndt> hrm, I wonder why this is an issue now...
18:32 <&garndt> and doesn't happen consistently
18:32 <markco> Is there currently an instance up that hit the error that I can take a look at?
18:33 <&garndt> not that I'm aware of, I was looking for one for Q and couldn't find any
18:33 <arr> markco: dude, you're supposed to be on pto!
18:33 <markco> I know
18:37 <coop> all my retriggers have made it past the hg-shared access point
18:37 <coop> i can check back in on them after bagpipe lessons
18:38 <&garndt> bagpipe lessons? that's awesome
18:49 <KWierso> coop/garndt: my retriggers are also passing the stage where it'd hit the errors :(
18:50 <fubar> wait, what
18:55 <&garndt> KWierso: yea, but they were passing earlier when we reopened too and then it took a couple of hours for things to fail again I think
18:56 <KWierso> garndt: shall I just reopen things and see if it breaks again before end of day?
18:58 <nalexander> On linux, what we were seeing is that different tasks were using the same username (worker) with different uid/gids. Is it possible that different Windows tasks are running as the same "user", but the on-disk permission records diverge in an analogous fashion?
19:01 <&garndt> KWierso: hrm, we could
19:03 <KWierso> autoland opened
19:04 <&garndt> markco: fubar found a broken machine: i-0b7ea0c9eb7cb62dc
19:04 <&garndt> it was an older one
19:04 <&garndt> https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-3-b-win2012/workers/us-west-1/i-0b7ea0c9eb7cb62dc
19:04 <&garndt> those are the tasks it ran
19:05 <markco> That is random. i triggered a task to take a look and that is the instance that picked it up
19:05 <&garndt> heh
19:07 <&garndt> hrm
19:07 <&garndt> doesn't like the username/password that's in secrets
19:07 <&garndt> so probably the same error Q is hitting
19:07 <&garndt> oh
19:07 <markco> I was able to log into wwith the admin password
19:08 <&garndt> oh that's your session I'm seeing then, sorry
19:08 <&garndt> let me disconnect
19:08 <markco> rgr
19:08 <&garndt> yea, I didn't realize there are multiple admin passwords in there for each ami release
19:09 <&garndt> and that just made it very clear to me how much I do not know about windows, felt so lost looking how to disconnect :)
19:13 <markco> Are we seeing any pattern with which file it hits the error on?
19:15 <KWierso> garndt: if it helps, new pushes are hitting the errors
19:15 <KWierso> all on i-0d1b47a780793dd5e apparently
19:19 <nalexander> Well, once it's got a poisoned cache, all similar jobs should drink the poison.
19:20 <nalexander> Are these caches on the local machine? Or are they networked?
19:20 <nalexander> (NFS, I guess, on Windows.)
19:20 <&garndt> I believe on the local machine
19:31 <&garndt> markco: I haven't seen a pattern across machines, but when one file acts like that on one machine, it's always a problem on that same machine
19:34 <markco> I know it is a long shot but, do we have an instance that hasn't had failures on it?
19:34 — KWierso|brb will be back, need to get something from the office before they close
19:39 <&garndt> markco: trying to find you one
19:40 <&garndt> markco: I would assume this one will turn green when it's done. https://tools.taskcluster.net/groups/G1VsEwy8QLKF17cp0FVLnA/tasks/E_mbZ_qNSvGrs7k1w7dDMQ/runs/0
19:40 <&garndt> it's past the hg-shared part
19:49 <&garndt> if you want to check it out before the next task fails you might want to hop on there when it's still running to see what hg-shared looks like
19:50 <coop> back
19:50 <markco> cool. I am takening a look
19:50 <&garndt> i-038d4c295c7a5f6f3 is another machine that ran a green task 9 minutes ago
19:50 <&garndt> it hasn't claimed another task yet
19:55 <&garndt> and it's still currently alive
19:59 <markco> I am currently waiting on the first one to complete.
20:05 <&garndt> the second one is a mchine that is alive, completed 1, and is not doing anything in case you wanted to see what it's like after one completes green
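glandium's and nalexander's theory above (tasks run as different users, so a later task cannot append to revlogs written by an earlier one) could be checked directly by comparing file ownership inside the pooled share with the current task user. A rough sketch; the share path is assumed from the failure logs, and "dir /q" is the stock cmd listing that prints the owner column:

# Rough check of the diverging-owners theory: list the owners of a handful of
# revlog files in the pooled share and compare them with the current user.
import getpass
import os
import subprocess

SHARE = r"y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg\store\data"

print("current user:", getpass.getuser())

shown = 0
for root, _dirs, files in os.walk(SHARE):
    for name in files:
        path = os.path.join(root, name)
        # "dir /q" includes the owning account in its listing.
        listing = subprocess.check_output(["cmd", "/c", "dir", "/q", path])
        print(listing.decode(errors="replace"))
        shown += 1
        if shown >= 10:  # a handful is enough to spot owner divergence
            break
    if shown >= 10:
        break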
(In reply to Chris Cooper [:coop] from comment #9)
> Capturing the current state of investigations, per IRC (#taskcluster):

Summary:
* new jobs on the new Windows builder AMI are not hitting the hg-shared issue (yet?)
* markco is investigating the filesystem on some builders to verify the state of the hg-shared clone
To further confuse things, it was assumed that once a machine was failing jobs, it would continue to fail those jobs. However, here is a machine where that is not the case: https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-3-b-win2012/workers/us-west-1/i-0b7ea0c9eb7cb62dc It failed a win32 opt build, and then afterwards passed one.
(In reply to Chris Cooper [:coop] from comment #8)
> (In reply to Greg Arndt [:garndt] from comment #6)
> > first task good: https://tools.taskcluster.net/groups/QDR6Q1u4QzWItvPjYuWv0w/tasks/Eunwd0dlQrmgWZhqBqy5Ug
> > second task bad: https://tools.taskcluster.net/groups/P8uVnYZlSHOVyFdKcDrxOA/tasks/FV8HqvvbSj6NSk7csY3MiA
>
> What does the cleanup between jobs look like for Windows AMIs? Can processes
> persist from one job to the next (as Q suggested)?
>
> Is there an instance in this state that someone could investigate?

As far as I am aware, the cleanup mainly consists of destroying the task directory, removing the temp user, creating a new task user, creating a new task directory that has the new task user as the owner, and then claiming the next task. I would need pmoore to elaborate further.

Currently our builders do not reboot between tasks, and this has not been a problem. It looks like it has been that way for 4 months: https://github.com/mozilla-releng/OpenCloudConfig/blob/master/userdata/Configuration/GenericWorker/generic-worker.config#L3

If the theory is that there might be something lingering around on these machines that is causing access issues, perhaps we could change this to reboot after every job for the time being and then try it in an isolated worker type. I am not sure of the ramifications of changing this on the builders.
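For context, here is a very rough sketch of the between-task cycle described above. All names are invented for illustration; the real logic lives inside generic-worker (which is written in Go), not in a script like this:

# Hypothetical outline of the per-task cleanup described above: drop the old
# task user and directory, create fresh ones, and hand ownership of the new
# directory to the new user. Not generic-worker's actual implementation.
import os
import shutil
import subprocess
import uuid

def recycle_task_environment(old_user, old_task_dir):
    # 1. Destroy the previous task directory.
    shutil.rmtree(old_task_dir, ignore_errors=True)

    # 2. Remove the temporary task user.
    subprocess.call(["net", "user", old_user, "/delete"])

    # 3. Create a new task user with a throwaway password (kept under 14
    #    characters so net.exe does not prompt for confirmation).
    new_user = "task_" + uuid.uuid4().hex[:8]
    password = "Aa1!" + uuid.uuid4().hex[:10]
    subprocess.check_call(["net", "user", new_user, password, "/add"])

    # 4. Create the new task directory and make the new user its owner.
    new_task_dir = os.path.join("z:\\", new_user)
    os.makedirs(new_task_dir)
    subprocess.check_call(["icacls", new_task_dir, "/setowner", new_user, "/T"])

    return new_user, new_task_dir

Worth noting: the cycle described above does not mention y:\hg-shared, so the pooled cache presumably keeps whatever permissions it was given when the instance booted.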
Never mind, it does appear to be rebooting between jobs as it is:

Oct 05 17:24:37 i-0b7ea0c9eb7cb62dc.gecko-3-b-win2012.usw1.mozilla.com generic-worker: 2017/10/05 22:24:37 Exiting worker with exit code 67

That exit code causes the machine to format and reboot.
Hey Rob, have you seen this before?

Thanks,
Pete
Flags: needinfo?(rthijssen)
It looks like OpenCloudConfig and generic-worker are running at the same time. This may be related.
I see this in ship-it: "Warning: releases/comm-beta has status "closed". Reason: bug 1406073 windows builds often fail cloning. if your branch doesn't use Windows, push with "CLOSED TREE" in the commit message"

Does this also affect Thunderbird release building on the old build system?
Wayne, buildbot is unaffected. It's just the gecko-3-b-win2012 taskcluster workers.
The OCC patch to prevent generic-worker and OCC from running simultaneously on gecko-3-b-win2012 is complete, and the new AMI is live: https://github.com/mozilla-releng/OpenCloudConfig/commit/03505f592ebb834619f7540c58df172c70832db1
Flags: needinfo?(rthijssen)
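The real guard is the PowerShell change in the OCC commit linked above. Purely to illustrate the shape of such a guard, a sketch that refuses to start configuration work while the worker process is running might look like the following (process name and timeouts are assumptions):

# Illustrative only: do not start OCC-style configuration while
# generic-worker.exe is running. tasklist ships with Windows.
import subprocess
import sys
import time

def generic_worker_running():
    out = subprocess.check_output(
        ["tasklist", "/FI", "IMAGENAME eq generic-worker.exe"]
    ).decode(errors="replace")
    return "generic-worker.exe" in out

def wait_for_worker_to_stop(timeout_seconds=600, poll_seconds=15):
    deadline = time.time() + timeout_seconds
    while generic_worker_running():
        if time.time() > deadline:
            sys.exit("generic-worker still running; refusing to reconfigure")
        time.sleep(poll_seconds)

if __name__ == "__main__":
    wait_for_worker_to_stop()
    # ...configuration work would go here...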
I believe the problem was caused by recent changes in OCC that were incompatible with the currently running builder AMIs. When new worker instances spin up, they run OCC, mostly in order to mount fresh EBS volumes for the y: and z: drives, but also, significantly for this outage, to grant new task users permission on the shared hg cache on y:. A recent OCC change, intended to solve an issue for a different worker type (Windows 7), relies on functionality in WMF and PowerShell 5 that was not yet present on the gecko-*-b-win2012 base AMIs. This caused new builder instances to fail to apply the permission patch to the shared hg cache on y: when booting up, hence the access errors on instances that already had a populated hg cache. I've raised bug 1406354 to deal with patching OCC to prevent this sort of issue in future.
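For reference, the boot-time step that failed here is essentially an ACL grant on the pooled share. A hedged sketch of what such a grant might look like (the actual implementation is PowerShell/DSC in OpenCloudConfig; the group name and the modify right below are assumptions for illustration):

# Hedged sketch of the boot-time grant that failed to apply: give task users
# modify access to the pooled hg share. "Users" and the modify (M) right are
# assumptions; the real OCC code is PowerShell.
import subprocess

HG_SHARED = r"y:\hg-shared"

# (OI)(CI)M = object-inherit, container-inherit, modify; /T also applies the
# grant to files that already exist in the share.
subprocess.check_call(["icacls", HG_SHARED, "/grant", "Users:(OI)(CI)M", "/T"])

If that step fails to run, as it did here when the OCC change required PowerShell 5, an already-populated cache keeps its old ACLs and the next task user hits exactly the "Access is denied" errors shown in the logs above.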
Assignee: relops → rthijssen
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Whiteboard: [stockwell infra]