
linux ix machines hang at grub sometimes

RESOLVED FIXED

Status

Release Engineering
General
RESOLVED FIXED
8 years ago
4 years ago

People

(Reporter: catlee, Assigned: bhearsum)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

8 years ago
It gets stuck at the first GRUB message. I can't seem to get into the BIOS setup via http://10.250.49.100 either.

Comment 1

8 years ago
catlee: does this need IT intervention then?
(Reporter)

Comment 2

8 years ago
Running grub-install /dev/sda seems to have fixed both slave01 and slave08. Still not sure what the initial cause was.

There's a rescue .iso image on the desktop of admin.b.m.o.  If you boot the slave off of that, and then select 'grubdisk' from the first menu, then "AUTOMAGIC BOOT", and then select the 2.6.18 kernel to boot from, you can then run 'grub-install /dev/sda' as root.
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
(Reporter)

Updated

8 years ago
Duplicate of this bug: 546490
(Assignee)

Comment 4

8 years ago
04 and 09 are stuck at grub now. I'll fix them up shortly.
(Assignee)

Comment 5

8 years ago
04 was rebooted 14 times through buildbot, the last being on the 17th around 1600 PST.
09 was rebooted 16 times, the last being around 1800 PST on the 18th.


The other slaves seem to have similar numbers. Eg, 05 has rebooted 15 times so far, and is OK.
(Assignee)

Comment 6

8 years ago
I just fixed up 04 and 09.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Summary: mv-moz2-linux-ix-slave01 is busted → linux ix machines hang at grub sometimes
bhearsum said he'd look at this.
Assignee: nobody → bhearsum
(Assignee)

Comment 8

8 years ago
Haven't seen any more occurrences of this, yet.
(Assignee)

Updated

8 years ago
Blocks: 545801
(Reporter)

Comment 9

8 years ago
slave11 just hung at grub when rebooted to go into production
slave05 is also hung at grub, last build finished on sm01 at Wed Mar 3 12:19:03 2010.
(Assignee)

Comment 11

8 years ago
I compared the broken slave11's boot sector to those of slave01 and slave02 (both of which are working). I found that slave11's and slave02's boot sectors were identical. Compared to slave01's, the 61st through 70th bytes of the MBR differed. It's interesting to me that slave02's and slave11's MBRs are identical, yet they exhibit different behaviour. I'm going to compare their BIOS settings and partition tables for any differences.
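For anyone repeating this comparison, here is a minimal sketch of how two saved boot sector dumps could be diffed byte by byte. It assumes each sector has already been read into a 512-byte `bytes` object; the function name `diff_offsets` and the synthetic data are illustrative only and are not taken from the real slaves.

```python
from typing import List


def diff_offsets(sector_a: bytes, sector_b: bytes) -> List[int]:
    """Return the 0-based offsets at which two equal-length sectors differ."""
    if len(sector_a) != len(sector_b):
        raise ValueError("sectors must be the same length")
    return [i for i, (a, b) in enumerate(zip(sector_a, sector_b)) if a != b]


# Synthetic example: two all-zero 512-byte sectors that differ in one byte.
# These are NOT real dumps from any slave.
sector_one = bytes(512)
modified = bytearray(sector_one)
modified[0x40] = 0xFF
sector_two = bytes(modified)

differing = diff_offsets(sector_one, sector_two)
print([hex(offset) for offset in differing])
```

Note that the offsets returned are 0-based, so offset 0x40 is the 65th byte of the sector.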
(Assignee)

Comment 12

8 years ago
I did some more digging into the mbr differences and found out that the first byte that differed refers to the boot drive. In 01, which broke and was fixed, it was set to 0xFF, which means "boot from the drive the bios tells you to". On 02, which has never broken, it was set to 0x80, which means "boot from the first hard drive". On 11, which broke and hasn't been fixed yet, it's also set to 0x80.

Catlee told me that the grub guys told him that the BIOS could be reordering drives semi-randomly. If that's the case, having GRUB trying to boot from "the first hard drive" could be what's breaking us.
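To make the hypothesis concrete, here is a small sketch of the drive selection as described above: 0xFF means "use whatever drive the BIOS passes", anything else is used as a hardcoded drive number. The function name and the scenario where the BIOS presents the disk as 0x81 are assumptions for illustration; this is a model of the behaviour described in this comment, not GRUB's actual code.

```python
USE_BIOS_DRIVE = 0xFF  # "boot from the drive the bios tells you to"
FIRST_HARD_DRIVE = 0x80  # "boot from the first hard drive"


def effective_boot_drive(boot_drive_byte: int, bios_drive: int) -> int:
    """Return the drive number the boot code would end up reading from.

    boot_drive_byte: the value stored in the MBR's boot drive byte.
    bios_drive: the drive number the BIOS passed in for the disk it booted.
    """
    if boot_drive_byte == USE_BIOS_DRIVE:
        return bios_drive
    return boot_drive_byte


# Hypothetical scenario: the BIOS reorders drives and presents the boot
# disk as 0x81 (second hard drive) instead of 0x80.
bios_drive = 0x81

# Hardcoded 0x80: the boot code reads from a different drive than the one
# the BIOS actually booted from.
print(hex(effective_boot_drive(FIRST_HARD_DRIVE, bios_drive)))

# 0xFF: the boot code follows the BIOS, so reordering does not matter.
print(hex(effective_boot_drive(USE_BIOS_DRIVE, bios_drive)))
```

If the BIOS really is reordering drives, only the 0xFF setting keeps the boot code on the same disk the BIOS booted from.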

Here's a table of each slave's status and what the boot drive byte is set to in the MBR:
SLAVE      |  HAS BROKEN? | HAS BEEN FIXED? | BOOT DRIVE BYTE
ix-slave01 |     YES      |       YES       |      0xFF
ix-slave02 |     NO       |       NO        |      0x80
ix-slave03 |     NO       |       NO        |      0x80
ix-slave04 |     YES      |       YES       |      0xFF
ix-slave05 |     YES      |       NO        |      0x80
ix-slave06 |     NO       |       NO        |      0x80
ix-slave07 |     NO       |       NO        |      0x80
ix-slave08 |     YES      |       YES       |      0xFF
ix-slave09 |     YES      |       YES       |      0xFF
ix-slave10 |     NO       |       NO        |      0x80
ix-slave11 |     YES      |       NO        |      0x80
ix-slave12 |     NO       |       NO        |      0x80
ix-slave13 |     NO       |       NO        |      0x80
ix-slave14 |     NO       |       NO        |      0x80
ix-slave15 |     NO       |       NO        |      0x80
ix-slave16 |     NO       |       NO        |      0x80
ix-slave17 |     NO       |       NO        |      0x80
ix-slave18 |     NO       |       NO        |      0x80
ix-slave19 |     NO       |       NO        |      0x80
ix-slave20 |     NO       |       NO        |      0x80
ix-slave21 |     NO       |       NO        |      0x80
ix-slave22 |     NO       |       NO        |      0x80
ix-slave23 |     NO       |       NO        |      0x80
ix-slave24 |     NO       |       NO        |      0x80
ix-slave25 |     NO       |       NO        |      0x80

To summarize: All machines that have either not broken or haven't yet been fixed hardcode "first hard drive" as the boot drive. Machines that have broken and then been fixed use "the thing the bios passes" as the boot drive. Given that we haven't had any re-occurrences after a slave has been fixed, and what Catlee found out from the grub folk, I strongly suspect re-installing grub on all of these machines will fix the issue.

I'm going to go around and do that now.
(Assignee)

Comment 13

8 years ago
All Linux ix machines have had grub reinstalled on them and I've confirmed that the boot drive byte of the MBR is correctly set to 0xFF.

I *think* this bug is fixed now but I'll leave it open for a couple of days at least.
(Assignee)

Comment 14

8 years ago
For posterity, here's how to check what the boot drive byte is set to:
dd bs=512 skip=62 count=1 if=/dev/hda | od -Ax -tx1z -v

That will print a hexdump of the entire MBR. The byte at 0x40 is the boot drive byte. Specifically, it is the first byte value (the second column, right after the offset) in the row starting with "000040".

Eg, in this snippet the boot byte is "80":
000000 eb 48 90 d0 bc 00 7c fb 50 07 50 1f fc be 1b 7c  >.H....|.P.P....|<
000010 bf 1b 06 50 57 b9 e5 01 f3 a4 cb be be 07 b1 04  >...PW...........<
000020 38 2c 7c 09 75 15 83 c6 10 e2 f5 cd 18 8b 14 8b  >8,|.u...........<
000030 ee 83 c6 10 49 74 16 38 2c 74 f6 be 10 07 03 02  >....It.8,t......<
000040 80 00 00 80 b8 85 64 00 00 08 fa 80 ca 80 ea 53  >......d........S<
000050 7c 00 00 31 c0 8e d8 8e d0 bc 00 20 fb a0 40 7c  >|..1....... ..@|<
000060 3c ff 74 02 88 c2 52 be 79 7d e8 34 01 f6 c2 80  ><.t...R.y}.4....<
000070 74 54 b4 41 bb aa 55 cd 13 5a 52 72 49 81 fb 55  >tT.A..U..ZRrI..U<
000080 aa 75 43 a0 41 7c 84 c0 75 05 83 e1 01 74 37 66  >.uC.A|..u....t7f<
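
If you need to check many machines, the lookup can be scripted rather than read by eye. The following is a rough sketch that parses output in the format shown above (hex offset, byte values, then the printable text between ">" and "<") and pulls out the byte at 0x40. The function names are made up for this example, and the sample text is just two rows copied from the snippet above.

```python
from typing import Dict


def parse_od_hexdump(text: str) -> Dict[int, int]:
    """Map each byte offset to its value for `od -Ax -tx1z -v` style output."""
    data = {}
    for line in text.splitlines():
        # Drop the printable-text column, which starts at the first '>'.
        fields = line.split(">", 1)[0].split()
        if len(fields) < 2:
            continue  # blank line or a trailing offset-only line
        base = int(fields[0], 16)
        for index, token in enumerate(fields[1:]):
            data[base + index] = int(token, 16)
    return data


def boot_drive_byte(text: str) -> int:
    """Return the value of the byte at offset 0x40."""
    return parse_od_hexdump(text)[0x40]


SAMPLE = """\
000040 80 00 00 80 b8 85 64 00 00 08 fa 80 ca 80 ea 53  >......d........S<
000050 7c 00 00 31 c0 8e d8 8e d0 bc 00 20 fb a0 40 7c  >|..1....... ..@|<
"""

print(hex(boot_drive_byte(SAMPLE)))  # 0x80 for this sample
```

A machine that has had grub reinstalled should report 0xff here instead.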
(Assignee)

Comment 15

8 years ago
Haven't seen any recurrences, closing this bug.
Status: REOPENED → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering