Closed Bug 898519 Opened 9 years ago Closed 9 years ago

hdiutil: attach failed - Device not configured

Categories: (Infrastructure & Operations Graveyard :: CIDuty, task)
Hardware: x86 macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: armenzg, Unassigned)

Details

Attachments

(1 file)

aki believes that a reboot fixes the issue and it seems that we indeed have reboot issues:
http://buildbot-master79.srv.releng.usw2.mozilla.com:8201/builders/Rev4%20MacOSX%20Lion%2010.7%20fx-team%20debug%20test%20jsreftest/builds/116/steps/reboot/logs/stdio

/tools/buildbot/bin/python scripts/external_tools/count_and_reboot.py -f ../reboot_count.txt -n 1 -z
 in dir /Users/cltbld/talos-slave/test/. (timeout 1200 secs)
 watching logfiles {}
 argv: ['/tools/buildbot/bin/python', 'scripts/external_tools/count_and_reboot.py', '-f', '../reboot_count.txt', '-n', '1', '-z']
 environment:
  Apple_PubSub_Socket_Render=/tmp/launch-mHOjbd/Render
  CVS_RSH=ssh
  DISPLAY=/tmp/launch-DOVuck/org.x:0
  HOME=/Users/cltbld
  LOGNAME=cltbld
  MOZ_HIDE_RESULTS_TABLE=1
  MOZ_NO_REMOTE=1
  NO_EM_RESTART=1
  NO_FAIL_ON_TEST_ERRORS=1
  PATH=/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin
  PROPERTIES_FILE=/Users/cltbld/talos-slave/test/buildprops.json
  PWD=/Users/cltbld/talos-slave/test
  PYTHONPATH=/Library/Python/2.7/site-packages
  SHELL=/bin/bash
  SSH_AUTH_SOCK=/tmp/launch-sQIWfR/Listeners
  TMPDIR=/var/folders/qd/srwd5f710sj0fcl9z464lkj00000gn/T/
  USER=cltbld
  VERSIONER_PYTHON_PREFER_32_BIT=no
  VERSIONER_PYTHON_VERSION=2.7
  XPCOM_DEBUG_BREAK=warn
  __CF_USER_TEXT_ENCODING=0x1F5:0:0
 using PTY: False
************************************************************************************************
*********** END OF RUN - NOW DOING SCHEDULED REBOOT; FOLLOWING ERROR MESSAGE EXPECTED **********
************************************************************************************************
sudo: unknown uid: 501
Per IRC convo:

I think the "unknown uid" issue is the root cause of both the borked reboot and the inability to run hdiutil.

I'm not sure if we're losing the mach context mid-run, or if we started buildbot badly.

We've lost mach context before, when we started buildbot via ssh, then disconnected.  I'm not sure if there's another cause.

If we don't have mach context at the beginning, we could avoid this by having a check in runslave.py that makes sure we have mach context before starting buildbot.  If we're losing it mid-run, I'm not sure how to deal with it.
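A minimal sketch of the kind of pre-flight check runslave.py could do (hedged: this is an assumption on my part, not code from runslave.py — "sudo: unknown uid" suggests that uid-to-user lookups are failing, so that's the condition tested here):

```python
import os
import pwd

def user_lookup_works():
    """Return True if the current uid resolves to a user record.

    Hypothetical pre-flight check: "sudo: unknown uid" means the system
    could not resolve our uid to a user name, so runslave.py could
    refuse to start buildbot whenever the same lookup fails here.
    """
    try:
        pwd.getpwuid(os.getuid())
        return True
    except KeyError:
        return False

# In runslave.py this would gate the buildbot startup; here we just
# report the result.
print("uid lookup works:", user_lookup_works())
```

On OS X this lookup goes through Directory Services, so it plausibly exercises the same machinery that sudo trips over — but that is an untested guess.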

(Maybe ssh'ing into localhost via some local key, and running |sudo reboot| inside that ssh command?  Hacky, but might work, as that would establish a new mach context.)
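A rough sketch of that ssh-reboot idea (all specifics here — the key path, user name, and ssh options — are hypothetical; with dry_run on, the sketch only builds the command rather than running it):

```python
import subprocess

def reboot_via_fresh_ssh(key="~/.ssh/reboot_key", dry_run=True):
    """Reboot through a new ssh session into localhost.

    The point (per the comment above) is that sshd would hand the
    command a fresh Mach bootstrap context, so sudo could work even
    when the calling process's own context is dead. Untested sketch.
    """
    cmd = ["ssh", "-i", key, "-o", "BatchMode=yes",
           "cltbld@localhost", "sudo reboot"]
    if dry_run:
        return cmd  # show what would run instead of running it
    return subprocess.call(cmd)

print(" ".join(reboot_via_fresh_ssh()))
```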
I found this forum post: http://www.hochschule-trier.de/index.php?id=12677 which mentions a couple of things to try.

One person reports running sudo several times eventually worked.  Another says "dscacheutil -flushcache" resolved a DNS issue and made sudo recognize his uid again.
Product: mozilla.org → Release Engineering
(In reply to Tim Taubert [:ttaubert] from comment #5)
> Should we turn this into an intermittent failure?
> 
> https://tbpl.mozilla.org/php/getParsedLog.php?id=26474180&tree=Mozilla-Aurora
> https://tbpl.mozilla.org/php/getParsedLog.php?id=26473954&tree=Mozilla-Aurora

Nick Thomas rebooted the slave and I filed the investigation/problem tracking bug - bug 904473
(In reply to Wes Kocher (:KWierso) from comment #8)
That slave has done a green job since, so it must have self-corrected.
(In reply to Wes Kocher (:KWierso) from comment #10)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=26657387&tree=Fx-Team
> 
> talos-r4-lion-014 has now failed five in a row.

Rebooted.
Attached file tmp.txt
A crash, from the first instance of this on this slave today - which references internal system error(s) from: https://tbpl.mozilla.org/php/getParsedLog.php?id=26677373&tree=Mozilla-Aurora&full=1
(In reply to Justin Wood (:Callek) from comment #12)
> Created attachment 791741 [details]
> tmp.txt
> 
> A crash, from the first instance of this on this slave today - which
> references internal system error(s) from:
> https://tbpl.mozilla.org/php/getParsedLog.php?id=26677373&tree=Mozilla-
> Aurora&full=1

Hey Bill, I see you disabled this test (at least on trunk) for 10.8. Is there something you can come up with that would be causing this level of system failure on 10.7 sporadically? And/or a patch we need to uplift to fix it, and/or a decision to disable this test.

I have a feeling this is the underlying problem causing this degree of failure.
Flags: needinfo?(wmccloskey)
(In reply to Justin Wood (:Callek) from comment #13)
> I have a feeling this is the underlying problem causing this degree of
> failure

see-also Bug 900453
Sorry, I'm at a complete loss about what's going on. The test_fullscreen test has been crashing in various ways for a long time on Macs. I'll pass the buck to Steven Michaud.
Flags: needinfo?(wmccloskey) → needinfo?(smichaud)
I doubt this failure is related to the test_fullscreen failure.

I haven't a clue about this either ... yet.

What *exactly* is the hdiutil command that fails with the "hdiutil: attach failed - Device not configured" error?  I've been trying to reproduce that error, but haven't yet managed to find out how.
Flags: needinfo?(smichaud)
The command is "bash ../tools/buildfarm/utils/installdmg.sh firefox-18.0.en-US.mac.dmg", running http://mxr.mozilla.org/build/source/tools/buildfarm/utils/installdmg.sh, so the failing call is `hdiutil attach -noautoopen -mountpoint ./mnt "firefox-18.0.en-US.mac.dmg"`
(In reply to Steven Michaud from comment #17)
> I doubt this failure is related to the test_fullscreen failure.
> 
> I haven't a clue about this either ... yet.
> 
> What *exactly* is the hdiutil command that fails with the "hdiutil: attach
> failed - Device not configured" error?  I've been trying to reproduce that
> error, but haven't yet managed to find out how.

Yet all signs point to this: we've run numerous machines through diagnostics with no sign of problems, it is happening across all our Mac minis on this OS, and every time I've looked it has this test failure/crash - a test which has had problems on OS X across the board.

Also, in the same run where this fails, AFTER the failure, when trying to reboot we get:

sudo: unknown uid: 28


while normally the system would reboot. The hdiutil issue is also fixed by a reboot, which works when we manually ssh in. So some interaction between the fullscreen API and the OS is causing the whole OS to fault... I suspect this may be a real issue for our users as well as our test farm.
Flags: needinfo?(smichaud)
> hdiutil attach -noautoopen -mountpoint ./mnt "firefox-18.0.en-US.mac.dmg"

Thanks.

Is ./mnt part of a local file system (on a local hard drive), or is it by some chance remote (on some kind of network share)?
Flags: needinfo?(smichaud)
(In reply to comment #20)

I don't really understand this comment -- probably because I'm missing a lot of context.  Please rewrite it with more detail (and more context).
(Following up comment #22)

You can start by telling me which part of my comment you're responding to :-)
(Following up comment #21)

Also where exactly is ./mnt?  What's its full path?
I've noticed that "device not configured" is the error text for kPOSIXErrorENXIO aka ENXIO.  But this is *supposed* to mean "no such device or address".

In other words (as best I can tell), you get the "device not configured" error message when the "device" corresponding to the attached DMG file was "created" without error, but hdiutil can't (for some reason) find it afterwards.

I'll look for a way to confirm or deny this.
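For reference, ENXIO is a standard errno value, and the naming mismatch above comes from the message text varying by platform (OS X renders it as "Device not configured", other systems as "No such device or address"):

```python
import errno
import os

# ENXIO is errno 6 on both OS X and Linux; only the strerror() text
# differs ("Device not configured" vs. "No such device or address").
print(errno.ENXIO, os.strerror(errno.ENXIO))
```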
(In reply to Steven Michaud from comment #21)
> > hdiutil attach -noautoopen -mountpoint ./mnt "firefox-18.0.en-US.mac.dmg"
> 
> Thanks.
> 
> Is ./mnt part of a local file system (on a local hard drive), or is it by
> some chance remote (on some kind of network share)?

[cltbld@talos-r4-lion-041.build.scl1.mozilla.com ~]$ mount -v
/dev/disk0s2 on / (hfs, local, journaled)
devfs on /dev (devfs, local, nobrowse)
map -hosts on /net (autofs, nosuid, automounted, nobrowse)
map auto_home on /home (autofs, automounted, nobrowse)
map -fstab on /Network/Servers (autofs, automounted, nobrowse)

(In reply to Steven Michaud from comment #24)
> (Following up comment #21)
> 
> Also where exactly is ./mnt?  What's its full path?

As per the log, it's ${CWD}/mnt/ for whatever job runs it.

(In reply to Steven Michaud from comment #23)
> (Following up comment #22)
> 
> You can start by telling me which part of my comment you're responding to :-)

Sure: 

>(In reply to Steven Michaud from comment #17)
>> I doubt this failure is related to the test_fullscreen failure.

... as for:
(In reply to Steven Michaud from comment #22)
> (In reply to comment #20)
> 
> I don't really understand this comment -- probably because I'm missing a lot
> of context.  Please rewrite it with more detail (and more context).

So, backing up: you don't feel it is related to the test_fullscreen failure. However, the following facts hold true here:

* All OS X 10.7 testers have the exact same hardware profile, with the exact same configuration.
* This test has been the source of numerous problems on OS X across all platforms, causing it to be disabled in multiple places.
* When this issue has hit, we have multiple times run the physical machine through both Apple's and our internal hardware diagnostics, and no issues have been found.
* The log linked in c#18 shows errors like:
  |09:33:36     INFO -  2013-08-19 09:33:36.060 firefox-bin[1227:507] Mozilla has caught an Obj-C exception [NSInternalInconsistencyException: Error (268435459) creating CGSWindow on line 263]|
* Every time I have seen this bug and looked, it has the following traits:
  ** it repeatedly fails with the hdiutil message, job after job, until manually recovered
  ** a reboot of any sort recovers from this problem
  ** of all the failures I have looked at, the first-failure-in-a-row matches the log in c#18, pointing at this test
  ** the machines are unable to reboot automatically after this hits (see below)
  ** releng is able to ssh in and reboot using the same command, as the same user, as what is failing on the host itself
   **** this leads me to suspect that this test triggers a crash/failure in the OS X Mach context, which causes all sorts of issues, and that a new ssh session establishes a new Mach context for itself


For the reboot issue: the job where the test_fullscreen stuff fails [and all subsequent jobs as well] also fails with the following message:

sudo: unknown uid: 28

And again, when we manually try to run this, over an ssh prompt, we don't get that message and the host reboots.
Flags: needinfo?(smichaud)
So you're saying you think the "hdiutil: attach failed - Device not configured" errors and the test_fullscreen failures have the same cause, and that both have to do with the "OSX Mach Context" (whatever that is)?

I suppose it's possible.

But the approach I always take (here and elsewhere) is to work from the ground up, looking for parts of the problem that can be reproduced (and thereby explained).  This approach takes a while, but it's much surer than top-down guessing.  For now I'm trying to find out when/why the "hdiutil: attach failed - Device not configured" error message gets displayed.
Flags: needinfo?(smichaud)
By the way, to my mind the most interesting messages in the test_fullscreen failure logs are these:

09:33:35     INFO -  2013-08-19 09:33:35.001 plugin-container[1245:903] HIToolbox: received notification of WindowServer event port death.
09:33:35     INFO -  2013-08-19 09:33:35.031 plugin-container[1246:903] HIToolbox: received notification of WindowServer event port death.

Something is apparently killing the WindowServer during the fullscreen tests, or at least making it inaccessible (by killing the Mach port that client programs use to talk to it).
09:33:34     INFO -  2013-08-19 09:33:34.979 firefox-bin[1227:12f8f] ERROR: CGSSetWindowTransformAtPlacement() returned -308
09:33:34     INFO -  2013-08-19 09:33:34.986 firefox-bin[1227:12f8f] ERROR: CGSSetWindowTransformAtPlacement() returned 268435459
09:33:34     INFO -  2013-08-19 09:33:34.986 firefox-bin[1227:12f8f] ERROR: CGSSetWindowTransformAtPlacement() returned 268435459
09:33:35     INFO -  2013-08-19 09:33:35.044 firefox-bin[1227:507] HIToolbox: received notification of WindowServer event port death.
09:33:35     INFO -  2013-08-19 09:33:35.045 firefox-bin[1227:507] port matched the WindowServer port created in BindCGSToRunLoop
09:33:35     INFO -  2013-08-19 09:33:35.052 firefox-bin[1227:a403] invalid context
09:33:35     INFO -  2013-08-19 09:33:35.060 firefox-bin[1227:a403] invalid context
09:33:35     INFO -  2013-08-19 09:33:35.081 firefox-bin[1227:507] ___createFullScreenMessageConnection_block_invoke_1: Fullscreen message error: Connection interrupted
09:33:35     INFO -  2013-08-19 09:33:35.001 plugin-container[1245:903] HIToolbox: received notification of WindowServer event port death.
09:33:35     INFO -  2013-08-19 09:33:35.031 plugin-container[1246:903] HIToolbox: received notification of WindowServer event port death.
(In reply to Steven Michaud from comment #27)
> So you're saying you think the "hdiutil: attach failed - Device not
> configured" errors and the test_fullscreen failures have the same cause, and
> that both have to do with the "OSX Mach Context" (whatever that is)?

Almost: my theory is that this test triggers some bug in the OS X backend, which causes both the problem we see in the test and the other system issues.

> But the approach I always take (here and elsewhere) is to work from the
> ground up, looking for parts of the problem that can be reproduced (and
> thereby explained).  This approach takes a while, but it's much surer than
> top-down guessing.  For now I'm trying to find out when/why the "hdiutil:
> attach failed - Device not configured" error message gets displayed.

I don't disagree, but given this evidence I also highly suggest that we consider disabling this test, and be careful what aspects of it we push (enabled) to try, until we can identify the issue/problem here.
> I also highly suggest given this evidence that we consider disabling
> this test, and being careful what aspects of it we push (enabled) to
> try until we can identify the issue/problem here.

I assume you mean the test_fullscreen test.

If so, I think that's an excellent idea.  Among other things, we'll
see if disabling it also gets rid of the "hdiutil: attach failed -
Device not configured" errors.
Very interestingly, the man page for hdiutil says the following for
the ENXIO error (aka "Device not configured"):

  This error is returned explicitly by DiskImages when its kernel
  driver or framework helper cannot be contacted.  It also often shows
  up when a device has been removed while I/O is still active.  One
  common case of the helper not being found is when Foundation's
  Distributed Objects RPC mechanism cannot be configured.  D.O.
  doesn't work under dead Mach bootstrap contexts such as can exist in
  a reattached screen(1) session.  Root users can take advantage of
  StartupItemContext(8) (in /usr/libexec) to access the startup item
  Mach bootstrap context.

And the manpage for StartupItemContext says (partly):

  The StartupItemContext utility launches the specified program in
  StartupItem bootstrap context.  Each Darwin and Mac OS X login
  creates a unique bootstrap subset context to contain login specific
  Mach port registrations with the bootstrap server.  All such
  registrations performed within the context of that subset are only
  visible to other processes within that context or subsequent subsets
  of it.  Therefore, a Mach port based service/daemon launched within
  a login context will not be visible to other such contexts.

(Ah, more information about the mysterious "Mach contexts".)

That raises some more questions:

What user do the tests which fail run as?

How does that "user" log in?

Do the failures happen only when the tests are run from twistd on a
slave that's connected to a buildbot master?  Or do they also occur
when desktop_unittest.py is run (on a slave) without any connection to
a buildbot server (as per the instructions here:
https://wiki.mozilla.org/ReleaseEngineering/Mozharness/07-May-2013?title=ReleaseEngineering/Mozharness)?

And here's a link that might be relevant:

http://jenkins-ci.361315.n4.nabble.com/hdiutil-attach-OS-X-command-fails-through-Hudson-td380541.html
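If the dead-bootstrap-context theory is right, the man page's suggestion above could be tried directly on an affected slave. A hedged sketch follows (untested, requires root, and the DMG name is just the example from this bug's logs; the command is printed rather than executed):

```python
import subprocess

# Per the hdiutil man page quoted above: run hdiutil under the startup
# item bootstrap context instead of the (possibly dead) login context.
# subprocess.call(cmd) would run it for real; we only print it here.
cmd = ["sudo", "/usr/libexec/StartupItemContext",
       "hdiutil", "attach", "-noautoopen", "-mountpoint", "./mnt",
       "firefox-18.0.en-US.mac.dmg"]
print(" ".join(cmd))
```

If hdiutil succeeds under StartupItemContext while failing in the login context, that would be strong evidence for the dead-bootstrap-context diagnosis.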
> What user do the tests which fail run as?
>
> How does that "user" log in?

Is the user cltbld?

If so, does it make a difference whether or not cltbld is logged in locally (using the standard OS X login screen)?
(In reply to Steven Michaud from comment #32)
> (Ah, more information about the mysterious "Mach contexts".)

Sounds like my suspicion is at least in-part correct ;-)

> That raises some more questions:
> 
> What user do the tests which fail run as?

cltbld

> How does that "user" log in?

Logs in automatically (i.e. we use puppet to specify the default login user name as cltbld and set the password appropriately):

http://mxr.mozilla.org/build/source/puppet/modules/users/manifests/builder/autologin.pp#7

Which then runs puppet at boot: http://mxr.mozilla.org/build/source/puppet/modules/puppet/files/puppet-atboot.plist

Followed by buildbot if puppet was successful: http://mxr.mozilla.org/build/source/puppet/modules/buildslave/templates/buildslave.plist.erb

> Do the failures happen only when the tests are run from twistd on a
> slave that's connected to a buildbot master?  Or do they also occur
> when desktop_unittest.py is run (on a slave) without any connection to
> a buildbot server (as per the instructions here
> https://wiki.mozilla.org/ReleaseEngineering/Mozharness/07-May-
> 2013?title=ReleaseEngineering/Mozharness).

I don't know for sure. Here is the interesting bit, though: if you log in via ssh, OS X creates a new Mach context over your login, which lasts for the duration of your login; however, it doesn't reap it from under any running (detached) processes right away.

This was a problem on our mobile development side for a while, causing us to launch mobile tests from inside a |screen| session (manually, on every machine, which we rarely rebooted), because the Mach context would get lost at some random point in the future if we launched the jobs over an ssh session. Losing the Mach context has never previously been an issue for OS X tests on our OS X hardware, though.

(In reply to Steven Michaud from comment #33)
> > What user do the tests which fail run as?
> >
> > How does that "user" log in?
> 
> Is the user cltbld?

yes

> If so, does it make a difference whether or not cltbld is logged in locally
> (using the standard OS X login screen)?

Untested but I doubt it based on the data we have so far.
Thanks for your info.  I need to digest it, and in particular find out more about puppet.

But here's another question:

How is twistd run on a slave?  Is it a "login item" for the cltbld account?  Is it a "Startup Item" (run from /Library/StartupItems or /System/Library/StartupItems)?  Or is it a "launch daemon" (run from /Library/LaunchDaemons or /System/Library/LaunchDaemons)?
Yet another question:

I noticed the following in the logs:

DISPLAY=/tmp/launch-rLlb3o/org.x:0

This implies there's some kind of X Windows set up on the slaves.  Is it documented somewhere how this is set up?

Do the slaves have physical monitors?
(In reply to Steven Michaud from comment #35)
> Thanks for your info.  I need to digest it, and in particular find out more
> about puppet.

Other than as a way to drive our configuration, and as it relates to making sure it ran successfully before we launch, it's not relevant here. For all intents and purposes you can ignore it (it just helps dictate what the machine's configuration will be).

> But here's another question:
> 
> How is twistd run on a slave?  


(In reply to Justin Wood (:Callek) from comment #34)
> Followed by buildbot if puppet was successful:
> http://mxr.mozilla.org/build/source/puppet/modules/buildslave/templates/
> buildslave.plist.erb

That plist gets noticed after login (which happens automatically, as mentioned above), and then runs after puppet (essentially, this plist watches for a file that puppet touches).

That plist calls: 
http://mxr.mozilla.org/build/source/puppet/modules/buildslave/files/darwin-run-buildslave.sh

which then calls:
http://mxr.mozilla.org/build/source/puppet/modules/buildslave/files/runslave.py

In that file, it gets the twisted command:
http://mxr.mozilla.org/build/source/puppet/modules/buildslave/files/runslave.py#391
and then runs it in:
http://mxr.mozilla.org/build/source/puppet/modules/buildslave/files/runslave.py#211
To comment 35, it's a LaunchAgent, which starts in the context of the logged-in user (cltbld, in particular).

To comment 36, launchd listens for X connections by default.  I don't know if X is installed - it's possible that connecting to that unix socket would result in an error being logged.

Unrelated to X, but also in comment 36: yes, these hosts have physical monitors - more precisely, they have EDID boxes which look like physical monitors to the hosts, thus activating the GPUs.
Thanks Justin and Dustin.  I think I have a pretty good idea now how tests are run on the slaves.  It's still puzzling that the DISPLAY environment variable is set while the tests are running, even though they (apparently) don't use X Windows.  But that may just be a coincidence.

The big (central) puzzle is still how the script(s) running the tests can somehow get "detached" from the WindowServer during the middle of the test_fullscreen test, and how the DiskImages framework can end up semi-permanently "detached" from its "kernel driver or framework helper".

At bug 892107 I have a loaner slave that I've been using to investigate bug 884471.  I think I'm close to being done with bug 884471.  But I'll hang on to my loaner to do some tests that may throw light on this bug.
Sure enough, we hit a fullscreen crash on talos-r4-lion-036 today and as expected, it started throwing up hdiutil errors after.
So I have an r5 mini at home, and occasionally it just loses its head - literally.  One of the two video outputs (I think the HDMI, but I'm not at home to check) just outputs colored static, and the other flashes blue indicating that the OS is recalculating the available screen real estate.  I've always assumed this was a problem with the monitor, but maybe it's a problem with the Mini.  As an aside, the Mini also lacks the ability to run both outputs rotated 90 degrees, failing with a very odd ghosting effect.  So there *are* bugs in the display drivers/firmware.

I wonder if the same thing is happening here?  I realize that's pretty vague - and I don't have any logging or other info to correlate with this behavior, either.  Just a thought.
I think this could easily be a hardware problem.  Here's how the first step (the test_fullscreen failure) might happen:

Using information from the attached EDID box, the computer tries to put the graphics hardware into a fullscreen mode that it doesn't support.  This makes the WindowServer choke.

Where do the EDID box settings come from?  Are they all the same, or are they separately "programmed" for each major version of the OS?

(By the way, when I'm finally done with bug 884471 on my loaner, I'm going to try out deliberately disconnecting the WindowServer while tests are running, and see what happens.  I won't actually be *killing* the WindowServer, which forces a logout (and the termination of all the user's running processes).  Instead I'll be playing tricks with an interpose library to kill the Mach port FF uses to connect to the WindowServer, or something along those lines.)
The EDID settings are the same for all hosts.  In fact, a copy of them is at
  http://hg.mozilla.org/build/puppet/file/22cddf00975a/modules/gui/files/edid.bin

I don't recall the details of how this is physically configured.  If the above info is not enough, we will need to consult dcops (Datacenter Operations) to find out the process.
> I don't recall the details of how this is physically configured.  If
> the above info is not enough, we will need to consult dcops
> (Datacenter Operations) to find out the process.

Please do.

I'm not entirely sure this information will be relevant.  But I think
we should follow up all reasonable leads.
Van, can you provide information on how each of the EDIDs are configured?  I'm pretty sure they're all individually programmed to be identical to all of the others, but you have insight into the specific process that you use in DCOps.
Flags: needinfo?(vle)
We use these EDIDs - http://www.extron.com/product/product.aspx?id=edid101d&s=5

They're currently attached to the minis' HDMI ports and there is no extra configuration done on our end.
Flags: needinfo?(vle)
(In reply to comment #46)

So, to use Extron's terminology, we're using the "pre-stored EDID [information]".  We're not making use of the following capability:

  Alternatively, the EDID 101D can be set to capture and store EDID information
  when connected to a display.

Is this correct?
Flags: needinfo?(vle)
These installations predate me, but I'm pretty sure we don't make use of that capability right now. The ones I've worked on have the rotary dial set to B (1600x1200 resolution).
Flags: needinfo?(vle)
(In reply to Wes Kocher (:KWierso) from comment #49)
> https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.
> html?name=talos-r4-lion-052 doesn't look too good...

Rebooted.
We've removed the lion platform. I know there are a handful of mtnlion events too, but none reported in 5 months.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → INCOMPLETE
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard