Closed Bug 898519 Opened 9 years ago Closed 9 years ago

hdiutil: attach failed - Device not configured

Categories: (Infrastructure & Operations Graveyard :: CIDuty, task)
Hardware: x86 macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: armenzg, Unassigned)

Details

Attachments

(1 file)

aki believes that a reboot fixes the issue and it seems that we indeed have reboot issues:
http://buildbot-master79.srv.releng.usw2.mozilla.com:8201/builders/Rev4%20MacOSX%20Lion%2010.7%20fx-team%20debug%20test%20jsreftest/builds/116/steps/reboot/logs/stdio

/tools/buildbot/bin/python scripts/external_tools/count_and_reboot.py -f ../reboot_count.txt -n 1 -z
 in dir /Users/cltbld/talos-slave/test/. (timeout 1200 secs)
 watching logfiles {}
 argv: ['/tools/buildbot/bin/python', 'scripts/external_tools/count_and_reboot.py', '-f', '../reboot_count.txt', '-n', '1', '-z']
 environment:
  Apple_PubSub_Socket_Render=/tmp/launch-mHOjbd/Render
  CVS_RSH=ssh
  DISPLAY=/tmp/launch-DOVuck/org.x:0
  HOME=/Users/cltbld
  LOGNAME=cltbld
  MOZ_HIDE_RESULTS_TABLE=1
  MOZ_NO_REMOTE=1
  NO_EM_RESTART=1
  NO_FAIL_ON_TEST_ERRORS=1
  PATH=/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin
  PROPERTIES_FILE=/Users/cltbld/talos-slave/test/buildprops.json
  PWD=/Users/cltbld/talos-slave/test
  PYTHONPATH=/Library/Python/2.7/site-packages
  SHELL=/bin/bash
  SSH_AUTH_SOCK=/tmp/launch-sQIWfR/Listeners
  TMPDIR=/var/folders/qd/srwd5f710sj0fcl9z464lkj00000gn/T/
  USER=cltbld
  VERSIONER_PYTHON_PREFER_32_BIT=no
  VERSIONER_PYTHON_VERSION=2.7
  XPCOM_DEBUG_BREAK=warn
  __CF_USER_TEXT_ENCODING=0x1F5:0:0
 using PTY: False
************************************************************************************************
*********** END OF RUN - NOW DOING SCHEDULED REBOOT; FOLLOWING ERROR MESSAGE EXPECTED **********
************************************************************************************************
sudo: unknown uid: 501
Per IRC convo:

I think the "unknown uid" issue is the root cause of both the borked reboot and the inability to run hdiutil.

I'm not sure if we're losing the mach context mid-run, or if we started buildbot badly.

We've lost mach context before, when we started buildbot via ssh, then disconnected.  I'm not sure if there's another cause.

If we don't have mach context at the beginning, we could avoid this by having a check in runslave.py that makes sure we have mach context before starting buildbot.  If we're losing it mid-run, I'm not sure how to deal with it.
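A minimal sketch of the kind of pre-flight check runslave.py could do (hedged: this is an assumption on my part, not code from runslave.py — "sudo: unknown uid" suggests that uid-to-user lookups are failing, so that's the condition tested here):

```python
import os
import pwd

def user_lookup_works():
    """Return True if the current uid resolves to a user record.

    Hypothetical pre-flight check: "sudo: unknown uid" means the system
    could not resolve our uid to a user name, so runslave.py could
    refuse to start buildbot whenever the same lookup fails here.
    """
    try:
        pwd.getpwuid(os.getuid())
        return True
    except KeyError:
        return False

# In runslave.py this would gate the buildbot startup; here we just
# report the result.
print("uid lookup works:", user_lookup_works())
```

On OS X this lookup goes through Directory Services, so it plausibly exercises the same machinery that sudo trips over — but that is an untested guess.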

(Maybe ssh'ing into localhost via some local key, and running |sudo reboot| inside that ssh command?  Hacky, but might work, as that would establish a new mach context.)
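A rough sketch of that ssh-reboot idea (all specifics here — the key path, user name, and ssh options — are hypothetical; with dry_run on, the sketch only builds the command rather than running it):

```python
import subprocess

def reboot_via_fresh_ssh(key="~/.ssh/reboot_key", dry_run=True):
    """Reboot through a new ssh session into localhost.

    The point (per the comment above) is that sshd would hand the
    command a fresh Mach bootstrap context, so sudo could work even
    when the calling process's own context is dead. Untested sketch.
    """
    cmd = ["ssh", "-i", key, "-o", "BatchMode=yes",
           "cltbld@localhost", "sudo reboot"]
    if dry_run:
        return cmd  # show what would run instead of running it
    return subprocess.call(cmd)

print(" ".join(reboot_via_fresh_ssh()))
```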
I found this forum post: http://www.hochschule-trier.de/index.php?id=12677 which mentions a couple of things to try.

One person reports running sudo several times eventually worked.  Another says "dscacheutil -flushcache" resolved a DNS issue and made sudo recognize his uid again.
Product: mozilla.org → Release Engineering
(In reply to Tim Taubert [:ttaubert] from comment #5)
> Should we turn this into an intermittent failure?
> 
> https://tbpl.mozilla.org/php/getParsedLog.php?id=26474180&tree=Mozilla-Aurora
> https://tbpl.mozilla.org/php/getParsedLog.php?id=26473954&tree=Mozilla-Aurora

Nick Thomas rebooted the slave and I filed the investigation/problem tracking bug - bug 904473
(In reply to Wes Kocher (:KWierso) from comment #8)
That slave has done a green job since, so it must have self-corrected.
(In reply to Wes Kocher (:KWierso) from comment #10)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=26657387&tree=Fx-Team
> 
> talos-r4-lion-014 has now failed five in a row.

Rebooted.
Attached file tmp.txt
A crash, from the first instance of this on this slave today - which references internal system error(s) from: https://tbpl.mozilla.org/php/getParsedLog.php?id=26677373&tree=Mozilla-Aurora&full=1
(In reply to Justin Wood (:Callek) from comment #12)
> Created attachment 791741 [details]
> tmp.txt
> 
> A crash, from the first instance of this on this slave today - which
> references internal system error(s) from:
> https://tbpl.mozilla.org/php/getParsedLog.php?id=26677373&tree=Mozilla-
> Aurora&full=1

Hey Bill, I see you disabled this test (at least on trunk) for 10.8. Is there something you can come up with that would be causing this level of system failure on 10.7 sporadically? And/or a patch we need to uplift to fix it, and/or a decision to disable this test.

I have a feeling this is the underlying problem causing this degree of failure.
Flags: needinfo?(wmccloskey)
(In reply to Justin Wood (:Callek) from comment #13)
> I have a feeling this is the underlying problem causing this degree of
> failure

see-also Bug 900453
Sorry, I'm at a complete loss about what's going on. The test_fullscreen test has been crashing in various ways for a long time on Macs. I'll pass the buck to Steven Michaud.
Flags: needinfo?(wmccloskey) → needinfo?(smichaud)
I doubt this failure is related to the test_fullscreen failure.

I haven't a clue about this either ... yet.

What *exactly* is the hdiutil command that fails with the "hdiutil: attach failed - Device not configured" error?  I've been trying to reproduce that error, but haven't yet managed to find out how.
Flags: needinfo?(smichaud)
The command is "bash ../tools/buildfarm/utils/installdmg.sh firefox-18.0.en-US.mac.dmg", running http://mxr.mozilla.org/build/source/tools/buildfarm/utils/installdmg.sh, so the failing call is `hdiutil attach -noautoopen -mountpoint ./mnt "firefox-18.0.en-US.mac.dmg"`
(In reply to Steven Michaud from comment #17)
> I doubt this failure is related to the test_fullscreen failure.
> 
> I haven't a clue about this either ... yet.
> 
> What *exactly* is the hdiutil command that fails with the "hdiutil: attach
> failed - Device not configured" error?  I've been trying to reproduce that
> error, but haven't yet managed to find out how.

Yet all signs point to this: we've run numerous machines through diagnostics with no sign of problems, it is happening across all our Mac minis on this OS, and every time I've looked it has this test failure/crash - a test which has had problems on OS X across the board.

Also, in the same run where this fails, AFTER the failure, when trying to reboot we get:

sudo: unknown uid: 28


while normally the system would reboot. The hdiutil issue is also fixed by a reboot, which works when we manually ssh in. So some interaction between the fullscreen API and the OS is causing the whole OS to fault... I suspect this may be a real issue for our users as well as our test farm.
Flags: needinfo?(smichaud)
> hdiutil attach -noautoopen -mountpoint ./mnt "firefox-18.0.en-US.mac.dmg"

Thanks.

Is ./mnt part of a local file system (on a local hard drive), or is it by some chance remote (on some kind of network share)?
Flags: needinfo?(smichaud)
(In reply to comment #20)

I don't really understand this comment -- probably because I'm missing a lot of context.  Please rewrite it with more detail (and more context).
(Following up comment #22)

You can start by telling me which part of my comment you're responding to :-)
(Following up comment #21)

Also where exactly is ./mnt?  What's its full path?
I've noticed that "device not configured" is the error text for kPOSIXErrorENXIO aka ENXIO.  But this is *supposed* to mean "no such device or address".

In other words (as best I can tell), you get the "device not configured" error message when the "device" corresponding to the attached DMG file was "created" without error, but hdiutil can't (for some reason) find it afterwards.

I'll look for a way to confirm or deny this.
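For reference, ENXIO is a standard errno value, and the naming mismatch above comes from the message text varying by platform (OS X renders it as "Device not configured", other systems as "No such device or address"):

```python
import errno
import os

# ENXIO is errno 6 on both OS X and Linux; only the strerror() text
# differs ("Device not configured" vs. "No such device or address").
print(errno.ENXIO, os.strerror(errno.ENXIO))
```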
(In reply to Steven Michaud from comment #21)
> > hdiutil attach -noautoopen -mountpoint ./mnt "firefox-18.0.en-US.mac.dmg"
> 
> Thanks.
> 
> Is ./mnt part of a local file system (on a local hard drive), or is it by
> some chance remote (on some kind of network share)?

[cltbld@talos-r4-lion-041.build.scl1.mozilla.com ~]$ mount -v
/dev/disk0s2 on / (hfs, local, journaled)
devfs on /dev (devfs, local, nobrowse)
map -hosts on /net (autofs, nosuid, automounted, nobrowse)
map auto_home on /home (autofs, automounted, nobrowse)
map -fstab on /Network/Servers (autofs, automounted, nobrowse)

(In reply to Steven Michaud from comment #24)
> (Following up comment #21)
> 
> Also where exactly is ./mnt?  What's its full path?

As per the log, it's ${CWD}/mnt/ for whatever job runs it.

(In reply to Steven Michaud from comment #23)
> (Following up comment #22)
> 
> You can start by telling me which part of my comment you're responding to :-)

Sure: 

>(In reply to Steven Michaud from comment #17)
>> I doubt this failure is related to the test_fullscreen failure.

... as for:
(In reply to Steven Michaud from comment #22)
> (In reply to comment #20)
> 
> I don't really understand this comment -- probably because I'm missing a lot
> of context.  Please rewrite it with more detail (and more context).

So, backing up: you don't feel it is related to the test_fullscreen failure. However, the following facts hold true here:

* All OS X 10.7 testers have the exact same hardware profile, with the exact same configuration.
* This test has been the source of numerous problems on OS X across all platforms, causing it to be disabled in multiple places.
* When this issue has hit, we have multiple times run the physical machine through both Apple's and our internal hardware diagnostics, and no issues have been found.
* The log linked in c#18 shows errors like:
  |09:33:36     INFO -  2013-08-19 09:33:36.060 firefox-bin[1227:507] Mozilla has caught an Obj-C exception [NSInternalInconsistencyException: Error (268435459) creating CGSWindow on line 263]|
* Every time I have seen this bug and looked, it has the following traits:
  ** it repeatedly fails with the hdiutil message, job after job, until manually recovered
  ** a reboot of any sort recovers from this problem
  ** of all the failures I have looked at, the first-failure-in-a-row matches the log in c#18, pointing at this test
  ** the machines are unable to reboot automatically after this hits (see below)
  ** releng is able to ssh in and reboot using the same command, as the same user, as what is failing on the host itself
   **** this leads me to suspect that this test triggers a crash/failure in the OS X Mach context, which causes all sorts of issues, and that a new ssh session establishes a new Mach context for itself


For the reboot issue: the job where the test_fullscreen stuff fails [and all subsequent jobs as well] also fails with the following message:

sudo: unknown uid: 28

And again, when we manually try to run this, over an ssh prompt, we don't get that message and the host reboots.
Flags: needinfo?(smichaud)
So you're saying you think the "hdiutil: attach failed - Device not configured" errors and the test_fullscreen failures have the same cause, and that both have to do with the "OSX Mach Context" (whatever that is)?

I suppose it's possible.

But the approach I always take (here and elsewhere) is to work from the ground up, looking for parts of the problem that can be reproduced (and thereby explained).  This approach takes a while, but it's much surer than top-down guessing.  For now I'm trying to find out when/why the "hdiutil: attach failed - Device not configured" error message gets displayed.
Flags: needinfo?(smichaud)
By the way, to my mind the most interesting messages in the test_fullscreen failure logs are these:

09:33:35     INFO -  2013-08-19 09:33:35.001 plugin-container[1245:903] HIToolbox: received notification of WindowServer event port death.
09:33:35     INFO -  2013-08-19 09:33:35.031 plugin-container[1246:903] HIToolbox: received notification of WindowServer event port death.

Something is apparently killing the WindowServer during the fullscreen tests, or at least making it inaccessible (by killing the Mach port that client programs use to talk to it).
09:33:34     INFO -  2013-08-19 09:33:34.979 firefox-bin[1227:12f8f] ERROR: CGSSetWindowTransformAtPlacement() returned -308
09:33:34     INFO -  2013-08-19 09:33:34.986 firefox-bin[1227:12f8f] ERROR: CGSSetWindowTransformAtPlacement() returned 268435459
09:33:34     INFO -  2013-08-19 09:33:34.986 firefox-bin[1227:12f8f] ERROR: CGSSetWindowTransformAtPlacement() returned 268435459
09:33:35     INFO -  2013-08-19 09:33:35.044 firefox-bin[1227:507] HIToolbox: received notification of WindowServer event port death.
09:33:35     INFO -  2013-08-19 09:33:35.045 firefox-bin[1227:507] port matched the WindowServer port created in BindCGSToRunLoop
09:33:35     INFO -  2013-08-19 09:33:35.052 firefox-bin[1227:a403] invalid context
09:33:35     INFO -  2013-08-19 09:33:35.060 firefox-bin[1227:a403] invalid context
09:33:35     INFO -  2013-08-19 09:33:35.081 firefox-bin[1227:507] ___createFullScreenMessageConnection_block_invoke_1: Fullscreen message error: Connection interrupted
09:33:35     INFO -  2013-08-19 09:33:35.001 plugin-container[1245:903] HIToolbox: received notification of WindowServer event port death.
09:33:35     INFO -  2013-08-19 09:33:35.031 plugin-container[1246:903] HIToolbox: received notification of WindowServer event port death.
(In reply to Steven Michaud from comment #27)
> So you're saying you think the "hdiutil: attach failed - Device not
> configured" errors and the test_fullscreen failures have the same cause, and
> that both have to do with the "OSX Mach Context" (whatever that is)?

Almost: my theory is that this test triggers some bug in the OS X backend, which causes both the problem we see in the test and the other system issues.

> But the approach I always take (here and elsewhere) is to work from the
> ground up, looking for parts of the problem that can be reproduced (and
> thereby explained).  This approach takes a while, but it's much surer than
> top-down guessing.  For now I'm trying to find out when/why the "hdiutil:
> attach failed - Device not configured" error message gets displayed.

I don't disagree, but given this evidence I also highly suggest that we consider disabling this test, and be careful what aspects of it we push (enabled) to try, until we can identify the issue/problem here.
> I also highly suggest given this evidence that we consider disabling
> this test, and being careful what aspects of it we push (enabled) to
> try until we can identify the issue/problem here.

I assume you mean the test_fullscreen test.

If so, I think that's an excellent idea.  Among other things, we'll
see if disabling it also gets rid of the "hdiutil: attach failed -
Device not configured" errors.
Very interestingly, the man page for hdiutil says the following for
the ENXIO error (aka "Device not configured"):

  This error is returned explicitly by DiskImages when its kernel
  driver or framework helper cannot be contacted.  It also often shows
  up when a device has been removed while I/O is still active.  One
  common case of the helper not being found is when Foundation's
  Distributed Objects RPC mechanism cannot be configured.  D.O.
  doesn't work under dead Mach bootstrap contexts such as can exist in
  a reattached screen(1) session.  Root users can take advantage of
  StartupItemContext(8) (in /usr/libexec) to access the startup item
  Mach bootstrap context.

And the manpage for StartupItemContext says (partly):

  The StartupItemContext utility launches the specified program in
  StartupItem bootstrap context.  Each Darwin and Mac OS X login
  creates a unique bootstrap subset context to contain login specific
  Mach port registrations with the bootstrap server.  All such
  registrations performed within the context of that subset are only
  visible to other processes within that context or subsequent subsets
  of it.  Therefore, a Mach port based service/daemon launched within
  a login context will not be visible to other such contexts.

(Ah, more information about the mysterious "Mach contexts".)

That raises some more questions:

What user do the tests which fail run as?

How does that "user" log in?

Do the failures happen only when the tests are run from twistd on a
slave that's connected to a buildbot master?  Or do they also occur
when desktop_unittest.py is run (on a slave) without any connection to
a buildbot server (as per the instructions here:
https://wiki.mozilla.org/ReleaseEngineering/Mozharness/07-May-2013?title=ReleaseEngineering/Mozharness)?

And here's a link that might be relevant:

http://jenkins-ci.361315.n4.nabble.com/hdiutil-attach-OS-X-command-fails-through-Hudson-td380541.html
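If the dead-bootstrap-context theory is right, the man page's suggestion above could be tried directly on an affected slave. A hedged sketch follows (untested, requires root, and the DMG name is just the example from this bug's logs; the command is printed rather than executed):

```python
import subprocess

# Per the hdiutil man page quoted above: run hdiutil under the startup
# item bootstrap context instead of the (possibly dead) login context.
# subprocess.call(cmd) would run it for real; we only print it here.
cmd = ["sudo", "/usr/libexec/StartupItemContext",
       "hdiutil", "attach", "-noautoopen", "-mountpoint", "./mnt",
       "firefox-18.0.en-US.mac.dmg"]
print(" ".join(cmd))
```

If hdiutil succeeds under StartupItemContext while failing in the login context, that would be strong evidence for the dead-bootstrap-context diagnosis.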
> What user do the tests which fail run as?
>
> How does that "user" log in?

Is the user cltbld?

If so, does it make a difference whether or not cltbld is logged in locally (using the standard OS X login screen)?
(In reply to Steven Michaud from comment #32)
> (Ah, more information about the mysterious "Mach contexts".)

Sounds like my suspicion is at least in-part correct ;-)

> That raises some more questions:
> 
> What user do the tests which fail run as?

cltbld

> How does that "user" log in?

Logs in automatically (i.e. we use puppet to specify the default login user name as cltbld and set the password appropriately):

http://mxr.mozilla.org/build/source/puppet/modules/users/manifests/builder/autologin.pp#7

Which then runs puppet at boot: http://mxr.mozilla.org/build/source/puppet/modules/puppet/files/puppet-atboot.plist

Followed by buildbot if puppet was successful: http://mxr.mozilla.org/build/source/puppet/modules/buildslave/templates/buildslave.plist.erb

> Do the failures happen only when the tests are run from twistd on a
> slave that's connected to a buildbot master?  Or do they also occur
> when desktop_unittest.py is run (on a slave) without any connection to
> a buildbot server (as per the instructions here
> https://wiki.mozilla.org/ReleaseEngineering/Mozharness/07-May-
> 2013?title=ReleaseEngineering/Mozharness).

I don't know for sure. Here is the interesting bit, though: if you log in via ssh, OS X creates a new Mach context over your login, which lasts for the duration of your login; however, it doesn't reap it from under any running (detached) processes right away.

This was a problem on our mobile development side for a while, causing us to launch mobile tests from inside a |screen| session (manually, on every machine, which we rarely rebooted), because the Mach context would get lost at some random point in the future if we launched the jobs over an ssh session. Losing the Mach context has never previously been an issue for OS X tests on our OS X hardware, though.

(In reply to Steven Michaud from comment #33)
> > What user do the tests which fail run as?
> >
> > How does that "user" log in?
> 
> Is the user cltbld?

yes

> If so, does it make a difference whether or not cltbld is logged in locally
> (using the standard OS X login screen)?

Untested but I doubt it based on the data we have so far.
Thanks for your info.  I need to digest it, and in particular find out more about puppet.

But here's another question:

How is twistd run on a slave?  Is it a "login item" for the cltbld account?  Is it a "Startup Item" (run from /Library/StartupItems or /System/Library/StartupItems)?  Or is it a "launch daemon" (run from /Library/LaunchDaemons or /System/Library/LaunchDaemons)?
Yet another question:

I noticed the following in the logs:

DISPLAY=/tmp/launch-rLlb3o/org.x:0

This implies there's some kind of X Windows set up on the slaves.  Is it documented somewhere how this is set up?

Do the slaves have physical monitors?
(In reply to Steven Michaud from comment #35)
> Thanks for your info.  I need to digest it, and in particular find out more
> about puppet.

Other than as a way to drive our configuration, and as it relates to making sure it ran successfully before we launch, it's not relevant here. For all intents and purposes you can ignore it (it just helps dictate what the machine's configuration will be).

> But here's another question:
> 
> How is twistd run on a slave?  


(In reply to Justin Wood (:Callek) from comment #34)
> Followed by buildbot if puppet was successful:
> http://mxr.mozilla.org/build/source/puppet/modules/buildslave/templates/
> buildslave.plist.erb

That plist gets noticed after login (which happens automatically, as mentioned above), and then runs after puppet (essentially, this plist watches for a file that puppet touches).

That plist calls: 
http://mxr.mozilla.org/build/source/puppet/modules/buildslave/files/darwin-run-buildslave.sh

which then calls:
http://mxr.mozilla.org/build/source/puppet/modules/buildslave/files/runslave.py

In that file, it gets the twisted command:
http://mxr.mozilla.org/build/source/puppet/modules/buildslave/files/runslave.py#391
and then runs it in:
http://mxr.mozilla.org/build/source/puppet/modules/buildslave/files/runslave.py#211
To comment 35, it's a LaunchAgent, which starts in the context of the logged-in user (cltbld, in particular).

To comment 36, launchd listens for X connections by default.  I don't know if X is installed - it's possible that connecting to that unix socket would result in an error being logged.

Unrelated to X, but also in comment 36: yes, these hosts have physical monitors - more precisely, they have EDID boxes which look like physical monitors to the hosts, thus activating the GPUs.
Thanks Justin and Dustin.  I think I have a pretty good idea now how tests are run on the slaves.  It's still puzzling that the DISPLAY environment variable is set while the tests are running, even though they (apparently) don't use X Windows.  But that may just be a coincidence.

The big (central) puzzle is still how the script(s) running the tests can somehow get "detached" from the WindowServer during the middle of the test_fullscreen test, and how the DiskImages framework can end up semi-permanently "detached" from its "kernel driver or framework helper".

At bug 892107 I have a loaner slave that I've been using to investigate bug 884471.  I think I'm close to being done with bug 884471.  But I'll hang on to my loaner to do some tests that may throw light on this bug.
Sure enough, we hit a fullscreen crash on talos-r4-lion-036 today and as expected, it started throwing up hdiutil errors after.
So I have an r5 mini at home, and occasionally it just loses its head - literally.  One of the two video outputs (I think the HDMI, but I'm not at home to check) just outputs colored static, and the other flashes blue indicating that the OS is recalculating the available screen real estate.  I've always assumed this was a problem with the monitor, but maybe it's a problem with the Mini.  As an aside, the Mini also lacks the ability to run both outputs rotated 90 degrees, failing with a very odd ghosting effect.  So there *are* bugs in the display drivers/firmware.

I wonder if the same thing is happening here?  I realize that's pretty vague - and I don't have any logging or other info to correlate with this behavior, either.  Just a thought.
I think this could easily be a hardware problem.  Here's how the first step (the test_fullscreen failure) might happen:

Using information from the attached EDID box, the computer tries to put the graphics hardware into a fullscreen mode that it doesn't support.  This makes the WindowServer choke.

Where do the EDID box settings come from?  Are they all the same, or are they separately "programmed" for each major version of the OS?

(By the way, when I'm finally done with bug 884471 on my loaner, I'm going to try out deliberately disconnecting the WindowServer while tests are running, and see what happens.  I won't actually be *killing* the WindowServer, which forces a logout (and the termination of all the user's running processes).  Instead I'll be playing tricks with an interpose library to kill the Mach port FF uses to connect to the WindowServer, or something along those lines.)
The EDID settings are the same for all hosts.  In fact, a copy of them is at
  http://hg.mozilla.org/build/puppet/file/22cddf00975a/modules/gui/files/edid.bin

I don't recall the details of how this is physically configured.  If the above info is not enough, we will need to consult dcops (Datacenter Operations) to find out the process.
> I don't recall the details of how this is physically configured.  If
> the above info is not enough, we will need to consult dcops
> (Datacenter Operations) to find out the process.

Please do.

I'm not entirely sure this information will be relevant.  But I think
we should follow up all reasonable leads.
Van, can you provide information on how each of the EDIDs are configured?  I'm pretty sure they're all individually programmed to be identical to all of the others, but you have insight into the specific process that you use in DCOps.
Flags: needinfo?(vle)
We use these EDIDs - http://www.extron.com/product/product.aspx?id=edid101d&s=5

They're currently attached to the minis' HDMI ports and there is no extra configuration done on our end.
Flags: needinfo?(vle)
(In reply to comment #46)

So, to use Extron's terminology, we're using the "pre-stored EDID [information]".  We're not making use of the following capability:

  Alternatively, the EDID 101D can be set to capture and store EDID information
  when connected to a display.

Is this correct?
Flags: needinfo?(vle)
These installations predate me, but I'm pretty sure we don't make use of that capability right now. The ones I've worked on have the rotary dial set to B (1600x1200 resolution).
Flags: needinfo?(vle)
(In reply to Wes Kocher (:KWierso) from comment #49)
> https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.
> html?name=talos-r4-lion-052 doesn't look too good...

Rebooted.
We've removed the lion platform. I know there are a handful of mtnlion events too, but none reported in 5 months.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → INCOMPLETE
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard