Closed
Bug 1406073
Opened 7 years ago
Closed 7 years ago
Intermittent Windows cloning failure with .hg/store\data/dom/media/_channel_media_decoder.cpp.i: Access is denied
Categories
(Infrastructure & Operations :: RelOps: General, task)
Infrastructure & Operations
RelOps: General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: aryx, Assigned: grenade)
Details
(Keywords: intermittent-failure, Whiteboard: [stockwell infra])
Attachments
(1 file: image/png, 2.38 MB)
https://treeherder.mozilla.org/logviewer.html#?job_id=135150424&repo=autoland
About half a dozen Windows jobs failed with this kind of failure, all on different worker instances; retriggers succeeded.
[taskcluster 2017-10-05T12:56:22.699Z] Executing command 4: "c:\Program Files\Mercurial\hg.exe" robustcheckout --sharebase y:\hg-shared --purge --upstream https://hg.mozilla.org/mozilla-unified --revision 1df4622b548bade48b18d2b9f2cbf7aa2de1bc63 https://hg.mozilla.org/integration/autoland .\build\src
z:\build>"c:\Program Files\Mercurial\hg.exe" robustcheckout --sharebase y:\hg-shared --purge --upstream https://hg.mozilla.org/mozilla-unified --revision 1df4622b548bade48b18d2b9f2cbf7aa2de1bc63 https://hg.mozilla.org/integration/autoland .\build\src
(using Mercurial 4.3.1)
ensuring https://hg.mozilla.org/integration/autoland@1df4622b548bade48b18d2b9f2cbf7aa2de1bc63 is available at .\build\src
(cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
(sharing from existing pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
searching for changes
adding changesets
adding manifests
adding file changes
added 2 changesets with 2 changes to 2 files
searching [ <=> ] 2
searching [ <=> ] 4
adding remote bookmark aurora
adding remote bookmark beta
adding remote bookmark central
adding remote bookmark esr10
adding remote bookmark esr17
adding remote bookmark esr24
adding remote bookmark esr31
adding remote bookmark esr38
adding remote bookmark esr45
adding remote bookmark esr52
adding remote bookmark fx-team
adding remote bookmark inbound
adding remote bookmark release
(pulling to obtain 1df4622b548bade48b18d2b9f2cbf7aa2de1bc63)
searching for changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
Traceback (most recent call last):
File "mercurial\scmutil.pyc", line 150, in callcatch
File "mercurial\dispatch.pyc", line 292, in _runcatchfunc
File "mercurial\dispatch.pyc", line 896, in _dispatch
File "mercurial\dispatch.pyc", line 658, in runcommand
File "mercurial\dispatch.pyc", line 904, in _runcommand
File "mercurial\dispatch.pyc", line 893, in <lambda>
File "mercurial\util.pyc", line 1077, in check
File "C:/mozilla-build/robustcheckout.py", line 239, in robustcheckout
File "C:/mozilla-build/robustcheckout.py", line 516, in _docheckout
File "mercurial\exchange.pyc", line 1251, in pull
File "mercurial\exchange.pyc", line 1393, in _pullbundle2
File "mercurial\bundle2.pyc", line 364, in processbundle
File "mercurial\bundle2.pyc", line 464, in _processpart
File "mercurial\bundle2.pyc", line 1552, in handlechangegroup
File "mercurial\bundle2.pyc", line 405, in _processchangegroup
File "mercurial\changegroup.pyc", line 327, in apply
File "mercurial\changegroup.pyc", line 981, in _addchangegroupfiles
File "mercurial\revlog.pyc", line 1886, in addgroup
File "mercurial\store.pyc", line 488, in __call__
File "mercurial\vfs.pyc", line 405, in __call__
File "mercurial\windows.pyc", line 140, in posixfile
IOError: [Errno 13] y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg/store\data/dom/media/_channel_media_decoder.cpp.i: Access is denied
abort: y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg/store\data/dom/media/_channel_media_decoder.cpp.i: Access is denied
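For anyone debugging this on a live worker, the failing operation can be reproduced with a small diagnostic along the lines below. This is an illustration only, not part of robustcheckout: Mercurial opens the revlog files in the shared store for appending while applying the pulled changegroup, and on the broken workers that open raises errno 13.

import os

# Hypothetical diagnostic: can the current task user append to a revlog file in
# the shared store? The path is the one from the log above; adjust as needed.
STORE_FILE = (r"y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29"
              r"\.hg\store\data\dom\media\_channel_media_decoder.cpp.i")

def can_append(path):
    """Return True if the file can be opened for appending by the current user."""
    try:
        # "a+b" approximates the append-style open Mercurial uses on revlogs.
        with open(path, "a+b"):
            return True
    except (IOError, OSError) as exc:
        # On the broken workers this is errno 13 (EACCES, "Access is denied").
        print("cannot append to %s: %s" % (path, exc))
        return False

if __name__ == "__main__":
    print("exists:", os.path.exists(STORE_FILE))
    print("os.access W_OK:", os.access(STORE_FILE, os.W_OK))
    print("appendable:", can_append(STORE_FILE))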
Reporter
Comment 1•7 years ago
Two observations:
1) The frequency is increasing.
2) The issue doesn't hit Try so far: https://treeherder.mozilla.org/#/jobs?repo=try&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable&fromchange=6ca29397356f7bc3becf588136b8f82b34f8cb12&selectedJob=135158116
Reporter
Comment 2•7 years ago
Closed trees after this started to affect almost all Windows builds on https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=ebd010e955f77a1aaebbe32841913677579ad33c&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Severity: normal → blocker
Flags: needinfo?(klibby)
Reporter
Updated•7 years ago
Assignee: nobody → relops
Component: Mercurial: hg.mozilla.org → RelOps
Flags: needinfo?(klibby)
Product: Developer Services → Infrastructure & Operations
QA Contact: arich
Comment 3•7 years ago
from #taskcluster (times EDT):
12:14:13 <&garndt> so, if this is burning all builders, and it might be something that
corrupted hg-share on all of them, any objections to me just destroying all
gecko-3-win2012 machines and we can try to retrigger some jobs?
12:14:22 < RyanVM> wfm
12:14:47 <&garndt> I don't know enough about hg-share, or how to unstick the windows machines
when they hit issues like that (gp.s is away too)
12:16:03 < RyanVM> I'm happy with an "It's an older code but it checks out" solution right now
12:25:40 <&garndt> Aryx: RyanVM I drained the gecko-3-b-win2012 pool if you want to try
retriggering some level 3 things and see what happens
Comment 4•7 years ago
So, trees were reopened for a while, but eventually we started hitting this again, with failures like https://treeherder.mozilla.org/logviewer.html#?job_id=135215737&repo=mozilla-inbound
Comment 6•7 years ago
These machines are still running an AMI from September 20th (15 days ago), so I am not sure what might have changed to cause a permission error when using hg-shared. https://tools.taskcluster.net/groups/P8uVnYZlSHOVyFdKcDrxOA/tasks/FV8HqvvbSj6NSk7csY3MiA/runs/0/logs/public%2Flogs%2Flive.log#L24
Looking at one particular machine: it completed its first task successfully, and every task after that hits the issue.
first task good: https://tools.taskcluster.net/groups/QDR6Q1u4QzWItvPjYuWv0w/tasks/Eunwd0dlQrmgWZhqBqy5Ug
second task bad: https://tools.taskcluster.net/groups/P8uVnYZlSHOVyFdKcDrxOA/tasks/FV8HqvvbSj6NSk7csY3MiA
Is it possible the permissions are changing on write from the clone, i.e. in the repo the files are read-only? The only other time I see this is if some other process is accessing the file.
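A quick way to test that hypothesis on an affected worker would be a spot check like the following (hypothetical; the path is taken from the log in the description):

import os
import stat

PATH = (r"y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29"
        r"\.hg\store\data\dom\media\_channel_media_decoder.cpp.i")

st = os.stat(PATH)
# On Windows the write bit in st_mode mirrors the read-only file attribute.
print("read-only attribute set:", not (st.st_mode & stat.S_IWRITE))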
Comment 8•7 years ago
(In reply to Greg Arndt [:garndt] from comment #6)
> first task good:
> https://tools.taskcluster.net/groups/QDR6Q1u4QzWItvPjYuWv0w/tasks/
> Eunwd0dlQrmgWZhqBqy5Ug
> second task bad:
> https://tools.taskcluster.net/groups/P8uVnYZlSHOVyFdKcDrxOA/tasks/
> FV8HqvvbSj6NSk7csY3MiA
What does the cleanup between jobs look like for Windows AMIs? Can processes persist from one job to the next (as Q suggested)?
Is there an instance in this state that someone could investigate?
Comment 9•7 years ago
Capturing the current state of investigations, per IRC (#taskcluster):
18:13 <&garndt> glandium: do you have any ideas of why we might be hitting bug 1406073? wondering if you had any magical ideas
18:14 <&garndt> KWierso: Aryx On a different note, most of the existing machines have died off, so if you want to try to get some more builds going (although likely to break again) you can retrigger things
18:14 <&garndt> I'm just not sure what is causing this yet, and having trouble jumping into windows machines
18:15 <Aryx> garndt: i won't be around long enough if they fail again, so it's up to Wes
18:19 <coop> garndt: will retriggering existing builds that failed suffice as an initial test?
18:19 <&garndt> yea, that's why I mentioned retrigger :)
18:20 <&garndt> I can click retrigger around some jobs
18:20 <&garndt> if someone wants to watch them
18:21 <coop> i've already retriggered some, i'll be around for an hour or two to watch
18:21 <coop> avoiding PGO
18:23 <&garndt> retriggered jobs on https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=224d8c57a4693138c8823ac108659bb1fe5b308e&filter-searchStr=2012
18:23 <fubar> so I'm gonna be that guy: what changed in the N hours preceding the bustage?
18:23 <&garndt> fubar: so far, I have no idea
18:23 <&garndt> the machines themselves should be the same since 9/20
18:23 <&garndt> not sure if this is a vcs thing, only impacts windows
18:23 <fubar> *nod* so.. something in a task?
18:23 <&garndt> linux doesn't use hg-shared I don't think
18:24 <&garndt> neither does os x
18:24 <fubar> yeah, gonna say it's not VCS
18:24 <fubar> Q has a prior engagement and has to step out. he's going to kick off setting up a loaner so that he can try stuff manually later
18:25 <coop> do trees need to be open for retriggers to work? not seeing these retriggers appear in TH
18:25 <fubar> robustcheckout was my primary suspect for a bit, but that hasn't changed in a week or so
18:26 <Aryx> coop: where did you retrigger? might be just a delay because we have to acquire instances to run the retriggers first
18:26 <coop> see by asking that question, i can guarantee they'll show up a minute late
18:26 <&garndt> I saw some retriggers, and not others
18:26 <coop> er, later
18:26 <&garndt> sometimes TH ingestion can be lagged
18:26 <coop> they're there now
18:27 <&garndt> and I don't know why this isn't impacting level 1 builders
18:28 <&garndt> nevermind
18:28 <&garndt> it is, we just haven't closed try for it
18:28 <&garndt> is it always the same file?
18:28 <&garndt> _tab_parent.cpp
18:29 <fubar> no
18:29 <glandium> garndt: have you looked at the contents and permissions of the files on that worker?
18:29 <Aryx> garndt: only a small set of file
18:29 <Aryx> files
18:29 <&garndt> glandium: I'm not sure, most of the machines causing the problem are gone now, and Q tried getting on one of the machines but couldn't (they are machines that grenade configures)
18:31 <glandium> garndt: what it sounds like is that a builder is pulling stuff into hg-shared at some point with some perms, and a following worker can't append to the same file
18:31 <glandium> because each task uses a different user
18:32 <&garndt> hrm, I wonder why this is an issue now...
18:32 <&garndt> and doesn't happen consistently
18:32 <markco> Is there currently an instance up that hit the error that I can take a look at?
18:33 <&garndt> not that I'm aware of, I was looking for one for Q and couldn't find any
18:33 <arr> markco: dude, you're supposed to be on pto!
18:33 <markco> I know
18:37 <coop> all my retriggers have made it past the hg-shared access point
18:37 <coop> i can check back in on them after bagpipe lessons
18:38 <&garndt> bagpipe lessons? that's awesome
18:49 <KWierso> coop/garndt: my retriggers are also passing the stage where it'd hit the errors :(
18:50 <fubar> wait, what
18:55 <&garndt> KWierso: yea, but they were passing earlier when we reopened too and then it took a couple of hours for things to fail again I think
18:56 <KWierso> garndt: shall I just reopen things and see if it breaks again before end of day?
18:58 <nalexander> On linux, what we were seeing is that different tasks were using the same username (worker) with different uid/gids. Is it possible that different Windows tasks are running as the same "user", but the on-disk permission records diverge in an analogous fashion?
19:01 <&garndt> KWierso: hrm, we could
19:03 <KWierso> autoland opened
19:04 <&garndt> markco: fubar found a broken machine: i-0b7ea0c9eb7cb62dc
19:04 <&garndt> it was an older one
19:04 <&garndt> https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-3-b-win2012/workers/us-west-1/i-0b7ea0c9eb7cb62dc
19:04 <&garndt> those are the tasks it ran
19:05 <markco> That is random. i triggered a task to take a look and that is the instance that picked it up
19:05 <&garndt> heh
19:07 <&garndt> hrm
19:07 <&garndt> doesn't like the username/password that's in secrets
19:07 <&garndt> so probably the same error Q is hitting
19:07 <&garndt> oh
19:07 <markco> I was able to log in with the admin password
19:08 <&garndt> oh that's your session I'm seeing then, sorry
19:08 <&garndt> let me disconnect
19:08 <markco> rgr
19:08 <&garndt> yea, I didn't realize there are multiple admin passwords in there for each ami release
19:09 <&garndt> and that just made it very clear to me how much I do not know about windows, felt so lost looking how to disconnect :)
19:13 <markco> Are we seeing any pattern with which file it hits the error on?
19:15 <KWierso> garndt: if it helps, new pushes are hitting the errors
19:15 <KWierso> all on i-0d1b47a780793dd5e apparently
19:19 <nalexander> Well, once it's got a poisoned cache, all similar jobs should drink the poison.
19:20 <nalexander> Are these caches on the local machine? Or are they networked?
19:20 <nalexander> (NFS, I guess, on Windows.)
19:20 <&garndt> I believe on the local machine
19:31 <&garndt> markco: I haven't seen a pattern across machines, but when one file acts like that on one machine, it's always a problem on that same machine
19:34 <markco> I know it is a long shot but, do we have an instance that hasn't had failures on it?
19:34 — KWierso|brb will be back, need to get something from the office before they close
19:39 <&garndt> markco: trying to find you one
19:40 <&garndt> markco: I would assume this one will turn green when it's done. https://tools.taskcluster.net/groups/G1VsEwy8QLKF17cp0FVLnA/tasks/E_mbZ_qNSvGrs7k1w7dDMQ/runs/0
19:40 <&garndt> it's past the hg-shared part
19:49 <&garndt> if you want to check it out before the next task fails you might want to hop on there when it's still running to see what hg-shared looks like
19:50 <coop> back
19:50 <markco> cool. I am taking a look
19:50 <&garndt> i-038d4c295c7a5f6f3 is another machine that ran a green task 9 minutes ago
19:50 <&garndt> it hasn't claimed another task yet
19:55 <&garndt> and it's still currently alive
19:59 <markco> I am currently waiting on the first one to complete.
20:05 <&garndt> the second one is a machine that is alive, completed 1, and is not doing anything in case you wanted to see what it's like after one completes green
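Following glandium's theory in the transcript above (each task runs as a different user, so files pulled into the shared cache by one task can end up with permissions a later task user cannot write), one way to inspect the on-disk state would be a sketch like the one below. The sharebase path and the use of the stock Windows icacls tool are assumptions for illustration.

import os
import subprocess

SHAREBASE = r"y:\hg-shared"

def dump_acls(root, limit=20):
    """Print the ACL of up to `limit` revlog files under the shared store."""
    shown = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith((".i", ".d")):  # revlog index/data files only
                continue
            path = os.path.join(dirpath, name)
            out = subprocess.check_output(["icacls", path])
            print(out.decode("utf-8", "replace").strip())
            shown += 1
            if shown >= limit:
                return

if __name__ == "__main__":
    dump_acls(SHAREBASE)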
Comment 10•7 years ago
(In reply to Chris Cooper [:coop] from comment #9)
> Capturing the current state of investigations, per IRC (#taskcluster):
Summary:
* new jobs on new Windows builder AMI are not hitting the hg-shared issue (yet?)
* markco is investigating the filesystem on some builders to verify the state of the hg-shared clone
Comment 11•7 years ago
To further confuse things, it was assumed that once a machine was failing jobs, it would continue to fail those jobs. However, here is a machine where that is not the case:
https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-3-b-win2012/workers/us-west-1/i-0b7ea0c9eb7cb62dc
It failed a win32 opt build and then passed one afterwards.
Comment 12•7 years ago
(In reply to Chris Cooper [:coop] from comment #8)
> (In reply to Greg Arndt [:garndt] from comment #6)
> > first task good:
> > https://tools.taskcluster.net/groups/QDR6Q1u4QzWItvPjYuWv0w/tasks/
> > Eunwd0dlQrmgWZhqBqy5Ug
> > second task bad:
> > https://tools.taskcluster.net/groups/P8uVnYZlSHOVyFdKcDrxOA/tasks/
> > FV8HqvvbSj6NSk7csY3MiA
>
> What does the cleanup between jobs look like for Windows AMIs? Can processes
> persist from one job to the next (as Q suggested)?
>
> Is there an instance in this state that someone could investigate?
As far as I am aware, the cleanup mainly consists of destroying the task directory, removing the temporary task user, creating a new task user, creating a new task directory owned by the new task user, and then claiming the next task. I would need pmoore to elaborate further.
Currently our builders do not reboot between tasks, and this has not been a problem. It looks like it's been that way for 4 months:
https://github.com/mozilla-releng/OpenCloudConfig/blob/master/userdata/Configuration/GenericWorker/generic-worker.config#L3
If the theory is that there might be something lingering around on these machines that is causing access issues, perhaps we could change this to reboot after every job for the time being and then test on an isolated worker type?
I am not sure of the ramifications of changing this on the builders.
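For context, a rough Python pseudocode rendering of the cycle described above is sketched below. The real logic lives in generic-worker (written in Go); the net user and icacls commands here are assumptions about how the steps could be done on Windows, not what the worker actually runs.

import os
import shutil
import subprocess
import uuid

def run(cmd):
    print("running:", " ".join(cmd))
    subprocess.check_call(cmd)

def recycle_task_user(old_user, old_task_dir, task_root=r"z:\task_users"):
    # 1. destroy the previous task directory
    shutil.rmtree(old_task_dir, ignore_errors=True)
    # 2. remove the temporary task user
    run(["net", "user", old_user, "/delete"])
    # 3. create a fresh task user with a random password
    new_user = "task_%s" % uuid.uuid4().hex[:8]
    run(["net", "user", new_user, uuid.uuid4().hex, "/add"])
    # 4. create a new task directory and give the new user full control,
    #    standing in for "owned by the new task user"
    new_task_dir = os.path.join(task_root, new_user)
    os.makedirs(new_task_dir)
    run(["icacls", new_task_dir, "/grant", "%s:(OI)(CI)F" % new_user])
    # 5. the worker then claims the next task and runs it as new_user
    return new_user, new_task_dir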
Comment hidden (Intermittent Failures Robot)
Comment 14•7 years ago
Nevermind, it does appear to be rebooting between jobs as it is.
Oct 05 17:24:37 i-0b7ea0c9eb7cb62dc.gecko-3-b-win2012.usw1.mozilla.com generic-worker: 2017/10/05 22:24:37 Exiting worker with exit code 67
That exit code causes the machine to format and reboot.
Comment 15•7 years ago
Hey Rob, have you seen this before?
Thanks,
Pete
Flags: needinfo?(rthijssen)
Comment 16•7 years ago
It looks like OpenCloudConfig and generic-worker are running at the same time. This may be related.
Comment 17•7 years ago
I see this in ship-it: "Warning: releases/comm-beta has status "closed". Reason: bug 1406073 windows builds often fail cloning. if your branch doesn't use Windows, push with "CLOSED TREE" in the commit message"
Does this also affect Thunderbird release building on the old build system?
Assignee
Comment 18•7 years ago
Wayne, buildbot is unaffected. It's just the gecko-3-b-win2012 taskcluster workers.
Assignee
Comment 19•7 years ago
occ patch to prevent gw and occ running simultaneously for gecko-3-b-win2012 is complete and new ami is live
https://github.com/mozilla-releng/OpenCloudConfig/commit/03505f592ebb834619f7540c58df172c70832db1
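One conceivable way to serialize the two, purely as an illustration (the actual fix is the PowerShell change in the OpenCloudConfig commit linked above), is to have one side wait for the other's process to exit before touching shared state:

import subprocess
import time

def process_running(image_name):
    out = subprocess.check_output(
        ["tasklist", "/FI", "IMAGENAME eq %s" % image_name]
    ).decode("utf-8", "replace")
    return image_name.lower() in out.lower()

def wait_for_exit(image_name="generic-worker.exe", poll_seconds=10, timeout=3600):
    waited = 0
    while process_running(image_name):
        if waited >= timeout:
            raise RuntimeError("%s still running after %ss" % (image_name, timeout))
        time.sleep(poll_seconds)
        waited += poll_seconds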
Flags: needinfo?(rthijssen)
Assignee
Comment 20•7 years ago
i believe the problem was caused by recent changes in occ which were incompatible with the currently running builder amis. when new worker instances spin up, they run occ, mostly in order to mount fresh ebs volumes for the y: and z: drives, but also, significantly for this outage, to grant permission on the shared hg cache on y: to new task users.
a recent change in occ, intended to solve an issue for a different worker type (win 7), relies on functionality in wmf and powershell 5 which was not yet present on the gecko-*-b-win2012 base amis.
this caused new builder instances to fail to apply the permission patch to the shared hg cache on y: when booting up, hence the access errors on instances that already had a populated hg cache.
i've raised bug 1406354 to deal with patching occ to prevent this sort of issue in future.
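For illustration, the kind of permission patch described above amounts to recursively granting task users modify access on the shared hg cache, roughly as in the sketch below. The real step is applied by OpenCloudConfig in PowerShell at instance boot; the principal name and the use of icacls are assumptions.

import subprocess

def grant_hg_shared_access(sharebase=r"y:\hg-shared", principal="Users"):
    # (OI)(CI)M = object-inherit + container-inherit, Modify rights;
    # /T recurses so previously pulled revlog files become writable too.
    subprocess.check_call(
        ["icacls", sharebase, "/grant", "%s:(OI)(CI)M" % principal, "/T"]
    )

if __name__ == "__main__":
    grant_hg_shared_access()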
Updated•7 years ago
Assignee: relops → rthijssen
Comment hidden (Intermittent Failures Robot)
Assignee
Updated•7 years ago
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Updated•7 years ago
Whiteboard: [stockwell infra]