Closed Bug 546424 Opened 14 years ago Closed 14 years ago

linux ix machines hang at grub sometimes

Categories

(Release Engineering :: General, defect)

x86
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: bhearsum)

It gets stuck at the first GRUB message. I can't seem to get into the BIOS setup via http://10.250.49.100 either.
catlee: does this need IT intervention then?
Running grub-install /dev/sda seems to have fixed both slave01 and slave08. Still not sure what the initial cause was.

There's a rescue .iso image on the desktop of admin.b.m.o.  If you boot the slave off of that, and then select 'grubdisk' from the first menu, then "AUTOMAGIC BOOT", and then select the 2.6.18 kernel to boot from, you can then run 'grub-install /dev/sda' as root.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
04 and 09 are stuck at grub now. I'll fix them up shortly.
04 was rebooted 14 times through buildbot, the last being on the 17th around 1600 PST.
09 was rebooted 16 times, the last being around 1800 PST on the 18th.


The other slaves seem to have similar numbers. Eg, 05 has rebooted 15 times so far, and is OK.
I just fixed up 04 and 09.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Summary: mv-moz2-linux-ix-slave01 is busted → linux ix machines hang at grub sometimes
bhearsum said he'd look at this.
Assignee: nobody → bhearsum
Haven't seen any more occurrences of this, yet.
Blocks: 545801
slave11 just hung at grub when rebooted to go into production
slave05 is also hung at grub, last build finished on sm01 at Wed Mar 3 12:19:03 2010.
I compared the broken slave11's boot sector to those of slave01 and slave02 (both of which are working). slave11's and slave02's boot sectors were identical. Compared to slave01's, the 61st - 70th bytes of the mbr differed. It's interesting to me that slave02's and slave11's mbrs are identical, yet the machines behave differently. I'm going to compare their BIOS settings and partition tables for any differences.
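A byte-by-byte comparison like this can be sketched in a few lines of Python. This is only an illustration, not the tool that was actually used here: read_first_sector and diff_offsets are hypothetical helper names, and the paths you pass in (a disk device or a saved image of one) are assumptions.

```python
from typing import List


def read_first_sector(path: str, size: int = 512) -> bytes:
    """Read the first `size` bytes of a disk device or a saved image file."""
    with open(path, "rb") as f:
        data = f.read(size)
    if len(data) != size:
        raise ValueError(f"{path}: expected {size} bytes, got {len(data)}")
    return data


def diff_offsets(sector_a: bytes, sector_b: bytes) -> List[int]:
    """Return every offset at which two equal-length sectors differ."""
    if len(sector_a) != len(sector_b):
        raise ValueError("sectors must be the same length")
    return [i for i, (a, b) in enumerate(zip(sector_a, sector_b)) if a != b]
```

With a copy of each machine's first sector saved to a file, diff_offsets(read_first_sector("slave01.mbr"), read_first_sector("slave11.mbr")) would list the differing offsets. The file names are made up for the example.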
I did some more digging into the mbr differences and found out that the first byte that differed refers to the boot drive. In 01, which broke and was fixed, it was set to 0xFF, which means "boot from the drive the bios tells you to". On 02, which has never broken, it was set to 0x80, which means "boot from the first hard drive". On 11, which broke and hasn't been fixed yet, it's also set to 0x80.
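As a rough sketch of how that byte can be read and interpreted: the offset 0x40 is taken from the hexdump instructions later in this bug, the two meanings are the ones described above, and describe_boot_drive is a hypothetical name. Any other value is simply reported as unrecognised rather than guessed at.

```python
BOOT_DRIVE_OFFSET = 0x40  # offset of the boot drive byte, per the hexdump notes in this bug


def describe_boot_drive(sector: bytes) -> str:
    """Describe the boot drive byte of a 512-byte boot sector."""
    if len(sector) <= BOOT_DRIVE_OFFSET:
        raise ValueError("sector is too short to contain the boot drive byte")
    value = sector[BOOT_DRIVE_OFFSET]
    if value == 0xFF:
        return "0xFF: boot from the drive the BIOS passes"
    if value == 0x80:
        return "0x80: boot from the first hard drive"
    # Only 0xFF and 0x80 were observed on these machines.
    return f"0x{value:02X}: not one of the two values seen in this bug"
```

The sector argument is the raw first 512 bytes of the disk, read as root from the device or from a saved copy.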

Catlee told me that the grub guys told him that the BIOS could be reordering drives semi-randomly. If that's the case, having GRUB trying to boot from "the first hard drive" could be what's breaking us.

Here's a table of slave statuses and what the boot drive byte is set to in the MBR:
SLAVE      |  HAS BROKEN? | HAS BEEN FIXED? | BOOT DRIVE BYTE
ix-slave01 |     YES      |       YES       |      0xFF
ix-slave02 |     NO       |       NO        |      0x80
ix-slave03 |     NO       |       NO        |      0x80
ix-slave04 |     YES      |       YES       |      0xFF
ix-slave05 |     YES      |       NO        |      0x80
ix-slave06 |     NO       |       NO        |      0x80
ix-slave07 |     NO       |       NO        |      0x80
ix-slave08 |     YES      |       YES       |      0xFF
ix-slave09 |     YES      |       YES       |      0xFF
ix-slave10 |     NO       |       NO        |      0x80
ix-slave11 |     YES      |       NO        |      0x80
ix-slave12 |     NO       |       NO        |      0x80
ix-slave13 |     NO       |       NO        |      0x80
ix-slave14 |     NO       |       NO        |      0x80
ix-slave15 |     NO       |       NO        |      0x80
ix-slave16 |     NO       |       NO        |      0x80
ix-slave17 |     NO       |       NO        |      0x80
ix-slave18 |     NO       |       NO        |      0x80
ix-slave19 |     NO       |       NO        |      0x80
ix-slave20 |     NO       |       NO        |      0x80
ix-slave21 |     NO       |       NO        |      0x80
ix-slave22 |     NO       |       NO        |      0x80
ix-slave23 |     NO       |       NO        |      0x80
ix-slave24 |     NO       |       NO        |      0x80
ix-slave25 |     NO       |       NO        |      0x80

To summarize: All machines that have either not broken or haven't yet been fixed hardcode "first hard drive" as the boot drive. Machines that have broken and then been fixed use "the thing the bios passes" as the boot drive. Given that we haven't had any re-occurrences after a slave has been fixed, and what Catlee found out from the grub folk, I strongly suspect re-installing grub on all of these machines will fix the issue.
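The summary can be double-checked against the table with a short tally. This is just a sketch: the values are transcribed by hand from the table above, and ROWS and summarize are names made up for the example.

```python
from collections import Counter

# slave number -> (has broken, has been fixed, boot drive byte), from the table above.
ROWS = {n: (False, False, 0x80) for n in range(1, 26)}
ROWS.update({
    1: (True, True, 0xFF),
    4: (True, True, 0xFF),
    5: (True, False, 0x80),
    8: (True, True, 0xFF),
    9: (True, True, 0xFF),
    11: (True, False, 0x80),
})


def summarize(rows):
    """Count how many slaves share each (broken, fixed, byte) combination."""
    return Counter(rows.values())


def fixed_always_ff(rows):
    """True if 0xFF appears on exactly the slaves that were fixed."""
    return all((byte == 0xFF) == fixed for _, fixed, byte in rows.values())
```

Calling summarize(ROWS) groups the 25 slaves into the three combinations in the table, and fixed_always_ff(ROWS) checks that 0xFF shows up only on machines where grub was reinstalled.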

I'm going to go around and do that now.
All Linux ix machines have had grub reinstalled on them and I've confirmed that the boot drive byte of the mbr is correctly set to 0xFF.

I *think* this bug is fixed now but I'll leave it open for a couple of days at least.
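For posterity, here's how to check what the boot drive byte is set to:
dd bs=512 count=1 if=/dev/hda | od -Ax -tx1z -v

Replace /dev/hda with the machine's boot disk; on these slaves grub was reinstalled to /dev/sda. The command reads the first 512-byte sector of the disk and prints a hexdump of the entire mbr. The byte at 0x40 is the boot drive byte. Specifically, it is the second column (the first byte after the address) in the row starting with "000040".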

Eg, in this snippet the boot byte is "80":
000000 eb 48 90 d0 bc 00 7c fb 50 07 50 1f fc be 1b 7c  >.H....|.P.P....|<
000010 bf 1b 06 50 57 b9 e5 01 f3 a4 cb be be 07 b1 04  >...PW...........<
000020 38 2c 7c 09 75 15 83 c6 10 e2 f5 cd 18 8b 14 8b  >8,|.u...........<
000030 ee 83 c6 10 49 74 16 38 2c 74 f6 be 10 07 03 02  >....It.8,t......<
000040 80 00 00 80 b8 85 64 00 00 08 fa 80 ca 80 ea 53  >......d........S<
000050 7c 00 00 31 c0 8e d8 8e d0 bc 00 20 fb a0 40 7c  >|..1....... ..@|<
000060 3c ff 74 02 88 c2 52 be 79 7d e8 34 01 f6 c2 80  ><.t...R.y}.4....<
000070 74 54 b4 41 bb aa 55 cd 13 5a 52 72 49 81 fb 55  >tT.A..U..ZRrI..U<
000080 aa 75 43 a0 41 7c 84 c0 75 05 83 e1 01 74 37 66  >.uC.A|..u....t7f<
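If you'd rather not pick the byte out by eye, the hexdump text can be parsed. The following is a minimal sketch that assumes the od -Ax -tx1z -v layout shown above (hex address, then hex bytes, then the printable text between ">" and "<"); byte_from_hexdump is a hypothetical helper name.

```python
def byte_from_hexdump(dump: str, offset: int) -> int:
    """Return the byte at `offset` from `od -Ax -tx1z -v` style output."""
    for line in dump.splitlines():
        # Drop the printable-text column, which starts at the first ">".
        fields = line.split(">", 1)[0].split()
        if len(fields) < 2:
            continue
        try:
            base = int(fields[0], 16)
        except ValueError:
            continue  # not an address line
        values = fields[1:]
        index = offset - base
        if 0 <= index < len(values):
            return int(values[index], 16)
    raise ValueError(f"offset 0x{offset:x} not found in dump")
```

Feeding it the snippet above with offset 0x40 picks out the same "80" byte that the second column of the "000040" row shows.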
Haven't seen any recurrences, closing this bug.
Status: REOPENED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering