Closed Bug 473126 Opened 12 years ago Closed 12 years ago

restart build VMs after netapp outage


(Release Engineering :: General, defect, P1)



(Not tracked)



(Reporter: joduinn, Assigned: joduinn)



bug#473113 caused any VM using netapp-c or netapp-d to fail. Disks are now back online, so we are manually restarting the VMs and processes.

Tree is already closed for unrelated code crash, so leaving tree closed.

nthomas has already revived production-master, production-1.9-master and is working on qm-rhel02
Component: Release Engineering → Release Engineering: Maintenance
Priority: -- → P1
* production-master just had a soft reboot, plus an update on the hg buildbot-configs to pick up moz2-win32-slave14 (bug 465868) and bug 472779.
* production-1.9-master had a soft reboot
* qm-rhel02 needed a hard reset and spent some time running fsck; I didn't see any errors corrected, but the fancy boot GUI may have hidden them

All three are up and accepting slaves.
* staging-master is up (soft restart) but only moz2-master is running there right now
* try-master is up (soft restart), three linux & three win32 slaves restarted. NB: these are on eql01-bm06
* qm-buildbot01 (talos staging) needed a hard reset after a kernel panic, didn't start any masters there
* staging-try-master (soft restart), staging-try1-{win32,linux}-slave soft restarted - all three need buildbot started
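For the machines above that still need buildbot started, buildbot writes a twistd.pid file in each slave's basedir, so checking for a live pid is a quick way to tell who needs a restart. A minimal sketch (the basedir path is hypothetical):

```python
import errno
import os


def buildbot_running(basedir):
    """Return True if twistd.pid in a buildbot basedir names a live process."""
    pidfile = os.path.join(basedir, 'twistd.pid')
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        # No pidfile (or garbage in it): buildbot isn't running here.
        return False
    try:
        os.kill(pid, 0)  # signal 0: existence check only, sends nothing
    except OSError as e:
        # EPERM means the pid exists but belongs to another user.
        return e.errno == errno.EPERM
    return True


print(buildbot_running('/builds/slave'))  # False unless a slave runs there
```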
(In reply to comment #2)
> * staging-master is up (soft restart) but only moz2-master is running there
> right now
> * try-master is up (soft restart), three linux & three win32 slaves restarted.
> NB: these are on eql01-bm06

reed confirmed in IRC that this is also a problem with EqualLogic.  :-(
* production-1.8-master needed a hard reset because of a kernel panic
At this point all the moz2-{win32,linux}-slaveNN machines have been rebooted, as well as fx-{linux,win32}-tbox, fxdbug-{linux,win32}-tbox, and assorted other machines. The major outstanding set is the unit testers for 3.0, which will probably be needed for the 3.0.6 close.

Some clobbering has been needed for corrupted source/objdirs, so we should expect to see more redness throughout the day as slaves are allocated jobs. We originally started clobbering all build dirs on moz2-*-slaveNN, but the number of machines affected made that too slow a process to complete.
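A clobber here just means deleting the slave's (possibly corrupted) build directory so the next job starts from a fresh checkout. A minimal sketch, with a throwaway directory standing in for the real objdir:

```python
import os
import shutil
import tempfile


def clobber(builddir):
    """Delete a possibly-corrupted build directory; safe to call twice."""
    if os.path.isdir(builddir):
        shutil.rmtree(builddir)


# Demo against a throwaway directory rather than a real slave's objdir.
d = tempfile.mkdtemp()
clobber(d)
print(os.path.isdir(d))  # False: the next job would rebuild from scratch
```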
Depends on: 473112
Alright, I've done the following:
* fx-linux-1.9-slave04 - didn't have an IP, restarted networking, restarted buildbot
* fx-win32-1.9-slave07 - restarted buildbot
* bm-centos5-unittest-01 - didn't have an IP, restarted networking, restarted buildbot
* restarted fx-linux-1.9-slave07, 09
* restarted buildbot on fx-linux-1.9-slave08

After seeing how many staging slaves needed help recovering I decided to let them be for now and fix them when we need them. AFAICT all production slaves are back up and running. I'm keeping an eye on the 1.9.1/m-c/tm master for burning that requires a clobber.
clobbered m-c leak test and m-c build on moz2-linux-slave07
clobbered 1.9.1 unit test on moz2-win32-slave07
At this point, we think all Talos machines are back running fine, as are most of the build/unittest machines. 

A few build/unittest machines are still showing up needing manual clobbers before turning green, so leaving this bug open while those get spotted and fixed.
Haven't seen any failures since I posted my last comment. Going to consider this FIXED now.
Closed: 12 years ago
Resolution: --- → FIXED
doug/stuart reported hitting weird failures in mobile linux builds all day. One example is moz2-linux-slave07 hitting the following several times in a row:

 closing stdin
 using PTY: True
Upon execvpe /scratchbox/moz_scratchbox ['/scratchbox/moz_scratchbox', '-p', 'mkdir -p build'] in environment id 150137140
Traceback (most recent call last):
  File "/tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/internet/", line 393, in _fork
    executable, args, environment)
  File "/tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/internet/", line 439, in _execChild
    os.execvpe(executable, args, environment)
  File "/tools/python-2.5.2/lib/python2.5/", line 362, in execvpe
    _execvpe(file, args, env)
  File "/tools/python-2.5.2/lib/python2.5/", line 377, in _execvpe
    func(file, *argrest)
OSError: [Errno 2] No such file or directory
program finished with exit code 1
=== Output ended ===
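The OSError above is Twisted failing to exec /scratchbox/moz_scratchbox because the binary isn't present on those slaves; errno 2 is ENOENT ("No such file or directory"). A minimal reproduction of the same errno, using a guaranteed-missing stand-in path so nothing is actually exec'd:

```python
import errno
import os
import tempfile

# A path that is guaranteed not to exist, standing in for
# /scratchbox/moz_scratchbox on a slave missing its scratchbox setup.
missing = os.path.join(tempfile.mkdtemp(), 'moz_scratchbox')
try:
    os.execvpe(missing, [missing, '-p', 'mkdir -p build'], os.environ.copy())
except OSError as e:
    print(e.errno == errno.ENOENT)  # True: "No such file or directory"
```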
Assignee: nobody → joduinn
Resolution: FIXED → ---
John, I've been working on mobile-arm in general over in bug 472779. My guess is that moz2-linux-slave07, 08 and 10 are older clones and were not brought up to spec on the scratchbox setup. I assumed they were and added them to the slave list when we restarted moz2-master yesterday; they'll get removed again when coop does his thing tomorrow, and we can reclone or fix without so much time pressure.
Closed: 12 years ago
Resolution: --- → FIXED
Looking through the day's builds, it seems there were odd errors like this
from the following machines:

moz2-linux-slave07 (twice)
moz2-linux-slave10 (twice)
moz2-linux-slave11 (once)

In comment#12, nthomas notes that slave07, 08, 10 were just newly brought over to production, and he suspects something was missed in their setup; he's tracking fixing that in bug#472779. 

Looks like slave11 hit a separate problem which, aiui, is caused by a change to mozconfig since 9th December. I've just clobbered the mobile-linux-arm-dep directory on slave07, 10, 11, as they were idle. Let's watch and see if that "fixes" the problem that slave11 was reporting.
Resolution: FIXED → ---
(In reply to comment #13)

To tweak this a bit, the problem on moz2-linux-slave11 is bug 472779, which is now resolved. The problem with slave07/08/09/10 is noted in bug 465868 and will be solved there, or spun out to a new bug.
Closed: 12 years ago
Resolution: --- → FIXED
Component: Release Engineering: Maintenance → Release Engineering
Product: → Release Engineering