Closed Bug 699250 Opened 13 years ago Closed 13 years ago

talos-r3-fed64-xxx that failed to start the kernel are ready to go back in the pool

Categories

(Release Engineering :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: coop)

Details

(Whiteboard: [badslave?][hardware])

They encounter an error we saw in bug 663644:
    Initializing network drop monitor service
They continue to show this error even after a reimage.

With some experimentation, digipengi has determined that the kernel flags suggested in

  https://bugzilla.redhat.com/show_bug.cgi?id=632811

which are

  nohz=off highres=off

will allow a host with this issue to boot.  I don't have any data on *why* this works, or why it's only necessary on some hosts - I can only surmise that this is a race condition that's tickled by machines at a certain point in their trajectory to the slag-heap.

Sadly, both of these options are timing-related.  nohz controls "dynamic ticks": with dynamic ticks enabled, the kernel does not interrupt every 100th of a second as normal to do housekeeping tasks, but schedules interrupts as needed.  So if there's only one runnable process, that process just keeps running.  If there are no runnable processes, the kernel sleeps.  My understanding from the docs is that nohz=off means to *use* a fixed-schedule interrupt.

highres=off turns off the high-resolution timer support.

I suspect that changing these options will change talos numbers.  That said, we don't have many other options for these systems.

What should we do?

talos-r3-fed-034 is currently disabled in slavealloc and booted with this option, but it's not in the grub config so without changes it will fail to boot if rebooted.
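
For reference, a minimal sketch of how the flags could be made persistent, assuming the hosts use a legacy GRUB config whose boot entries contain `kernel` lines (the config format and the helper name `add_kernel_flags` are assumptions, not something confirmed on these machines). It appends the two flags to each `kernel` line that lacks them and leaves everything else alone:

```python
import re

# Flags taken from the Red Hat bug referenced above.
FLAGS = ("nohz=off", "highres=off")


def add_kernel_flags(grub_conf_text, flags=FLAGS):
    """Return the config text with each flag appended to every 'kernel' line.

    Assumes legacy GRUB syntax, where each boot entry has a line that
    starts with the word 'kernel'. Flags already present are not repeated.
    """
    out = []
    for line in grub_conf_text.splitlines():
        if re.match(r"\s*kernel\s", line):
            present = line.split()
            missing = [f for f in flags if f not in present]
            if missing:
                line = line.rstrip() + " " + " ".join(missing)
        out.append(line)
    return "\n".join(out) + "\n"
```

The function only transforms text; reading and writing the actual config file (and backing it up first) is left to the caller, since the file's location on these hosts is not stated here.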

talos-r3-fed64-018
talos-r3-fed64-028
talos-r3-fed64-048
talos-r3-fed64-049
(from bug 695580) also show this problem, but are currently hung at boot.
(In reply to Dustin J. Mitchell [:dustin] from comment #0)
> What should we do?
> 
> talos-r3-fed-034 is currently disabled in slavealloc and booted with this
> option, but it's not in the grub config so without changes it will fail to
> boot if rebooted.

I suggest four possible courses of action:

1) retire affected minis (not ideal for capacity reasons)
2) determine whether these minis are still under AppleCare, and if so, have them serviced
3) have someone from releng run the machines with "nohz=off highres=off" through staging and compare talos numbers 
4) reimage these minis with another target OS to see whether we can put them to use elsewhere. We could then reimage some machines already running that target OS as fed64 to maintain the same capacity.
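
For option 3, the comparison could be as simple as the sketch below. The function names and the 2% threshold are assumptions for illustration only, and the inputs would be whatever per-run numbers talos reports for a given test on a normal slave versus one booted with the flags:

```python
from statistics import mean


def percent_change(baseline, candidate):
    """Percent difference of the candidate mean relative to the baseline mean."""
    base = mean(baseline)
    return (mean(candidate) - base) / base * 100.0


def exceeds_threshold(baseline, candidate, threshold_pct=2.0):
    """True if the means differ by more than threshold_pct (an arbitrary default)."""
    return abs(percent_change(baseline, candidate)) > threshold_pct
```

A comparison of means says nothing about run-to-run noise, so a handful of runs per machine would be needed before reading much into the result.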
Priority: -- → P3
Whiteboard: [badslave?][hardware]
Unless we've proven that there's a problem with the system while it's running OS X, I don't think we can send it in for repair.  I'm fairly certain that Apple will just laugh, install OS X, and tell us nothing is wrong with it.  We can try taking one of these machines and putting OS X on it, though, to see if it still has an issue booting.
Coop and I talked about this today, and would like to do a double-pronged attack.

Releng should enable talos-r3-fed-034 and test it to see if it makes any difference to test times.

We will take another machine and install OS X on it to see if it has issues with another OS (if it does, we can ship them off for repair).
Assignee: nobody → server-ops-releng
Component: Release Engineering → Server Operations: RelEng
QA Contact: release → zandr
colo-trip: --- → scl1
Assignee: server-ops-releng → arich
coop: talos-r3-fed64-018 is now running snow leopard and had no issues imaging as far as I can tell.  Can it be run through its paces?
colo-trip: scl1 → ---
Priority: P3 → --
(In reply to Amy Rich [:arich] [:arr] from comment #4)
> coop: talos-r3-fed64-018 is now running snow leopard and had no issues
> imaging as far as I can tell.  Can it be run through its paces?

The slave is still named talos-r3-fed64-018 though, correct?

I'll grab that slave and talos-r3-fed-034 and put them both through their paces in staging.
Yeah, I didn't change the name, still talos-r3-fed64-018.
Since I was in the datacenter myself, I had a poke at these.  I reset the PRAM (unplug the machine, wait 10 sec, hold down the power button while plugging the machine back in again), and this seems to have done the trick.

It has the odd side effect of no longer turning on the power light, but a working machine seems more important than a power light.

I've zapped:

talos-r3-fed64-028
talos-r3-fed64-048
talos-r3-fed64-049
talos-r3-fed-034 

talos-r3-fed64-018 was running Snow Leopard as a test (see above) and was reinstalled with fed64.  Having OS X on there also appears to have fixed the issue without needing a PRAM reset.

All of these machines need to have their names changed from the ref image and to be re-puppetized.
Assignee: arich → nobody
Component: Server Operations: RelEng → Release Engineering
QA Contact: zandr → release
Summary: some talos-r3-fed64-xxx fail to start the kernel → talos-r3-fed64-xxx that failed to start the kernel are ready to go back in the pool
(In reply to Amy Rich [:arich] [:arr] from comment #7)
> Since I was in the datacenter myself, I had a poke at these.  I reset the
> PRAM (unplug the machine, wait 10 sec, hold down the power button while
> plugging the machine back in again), and this seems to have done the trick.

Thanks for the extra effort here, Amy. I'll get them back into production.
Assignee: nobody → coop
Status: NEW → ASSIGNED
Priority: -- → P2
(In reply to Amy Rich [:arich] [:arr] from comment #7)
> talos-r3-fed64-028
> talos-r3-fed64-048
> talos-r3-fed64-049
> talos-r3-fed-034 

talos-r3-fed-034 is back in service now.
talos-r3-fed64-018
talos-r3-fed64-028
talos-r3-fed64-048
talos-r3-fed64-049

^^ These slaves are now back in production.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering