Closed
Bug 659344
Opened 13 years ago
Closed 12 years ago
slaves require access to build.mozilla.org to reboot
Categories
(Release Engineering :: General, defect, P4)
Tracking
(Not tracked)
RESOLVED
DUPLICATE
of bug 712205
People
(Reporter: jrmuizel, Unassigned)
Details
I was looking over some try job runs yesterday and noticed a bunch failed with the following error:

--00:16:46-- http://build.mozilla.org/talos/tools/buildfarm/maintenance/count_and_reboot.py
=> `count_and_reboot.py'
Resolving build.mozilla.org... failed: Unknown host.

Things seem to be working again, but it might be worth looking into preventing this in the future.
Comment 1•13 years ago
The DNS problem was bug 659238, but why the heck is count_and_reboot.py resolving build.mozilla.org?
Comment 2•13 years ago
Talos jobs wget the count_and_reboot.py from there rather than using tools. jrmuizel could you please point us to the tbpl link? or could you tell us if this happened only to talos jobs?
Comment 3•13 years ago
(In reply to comment #2)
> Talos jobs wget the count_and_reboot.py from there rather than using tools.

We should *definitely* fix that, particularly since a failed count_and_reboot.py often leaves the slave running another job without rebooting (where DisconnectStep isn't used).

Can we just feed this file directly to the slave from the buildbot configs? In most cases, AIUI, we're not even counting, so it's a simple platform check and 'sudo reboot' or 'shutdown -r -f -t 0'..
Summary: build.mozilla.org lookups failed lastnight → slaves require access to build.mozilla.org to reboot
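The platform check suggested in comment 3 could be as small as the sketch below. It is illustrative only: the function name `reboot_command` is hypothetical, and pairing 'shutdown -r -f -t 0' with Windows and 'sudo reboot' with every other platform is an assumption, since the comment does not say which command belongs to which platform.

```python
import sys


def reboot_command(platform=None):
    """Return the argv for an immediate reboot on the given platform.

    `platform` defaults to sys.platform. The platform-to-command mapping
    is an assumption made for this sketch, not taken from the bug.
    """
    if platform is None:
        platform = sys.platform
    # "win32" starts with "win"; "darwin" and "cygwin" do not, so they
    # fall through to the Unix-style command.
    if platform.startswith("win"):
        return ["shutdown", "-r", "-f", "-t", "0"]
    return ["sudo", "reboot"]
```

Returning the argument list rather than executing it keeps the check testable; a caller would hand the result to something like `subprocess.call`.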
Reporter
Comment 4•13 years ago
(In reply to comment #2)
> Talos jobs wget the count_and_reboot.py from there rather than using tools.
>
> jrmuizel could you please point us to the tbpl link?
> or could you tell us if this happened only to talos jobs?

http://tbpl.mozilla.org/?tree=Try&rev=0fb5e427719a
Comment 5•13 years ago
(In reply to comment #3)
> (In reply to comment #2)
> > Talos jobs wget the count_and_reboot.py from there rather than using tools.
>
> We should *definitely* fix that, particularly since a failed
> count_and_reboot.py often leaves the slave running another job without
> rebooting (where DisconnectStep isn't used).
>
> Can we just feed this file directly to the slave from the buildbot configs?
> In most cases, AIUI, we're not even counting, so it's a simple platform
> check and 'sudo reboot' or 'shutdown -r -f -t 0'..

Makes sense. We are currently rebooting on every test job and I don't see us going back to counting and rebooting anytime soon.
Comment 6•13 years ago
Or, if we want to maintain behaviour for staging purposes (do they reboot every run?), we could have it try count_and_reboot.py once and, if that fails, fall back to a plain reboot as the failsafe.
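The try-once-then-fall-back idea in comment 6 might look roughly like the sketch below. The function name `reboot_with_fallback` and its parameters are hypothetical, and the actual command lines are left to the caller because the bug does not specify how count_and_reboot.py is invoked. A non-zero exit status or a failure to start the command is treated as "failed", which is an assumption of this sketch.

```python
import subprocess


def reboot_with_fallback(primary_cmd, fallback_cmd, run=subprocess.call):
    """Run primary_cmd once; if it fails, run fallback_cmd instead.

    Returns "primary" when the first command exits with status 0 and
    "fallback" otherwise. `run` is injectable so the logic can be
    exercised without rebooting anything.
    """
    try:
        if run(primary_cmd) == 0:
            return "primary"
    except OSError:
        # The command could not be started at all, for example because
        # the script is missing on this machine.
        pass
    # Failsafe: no retry of the primary command, go straight to the reboot.
    run(fallback_cmd)
    return "fallback"
```

Note that this only covers a failure of the script itself; in the failure reported here the download step failed first, so the same fallback would have to wrap that step as well.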
Comment 7•13 years ago
Bear in mind that rebooting after every job is an integral part of the new slave-monitoring regime, so -- aside from fuzzer jobs -- I think we should just reboot always.
Updated•13 years ago
Priority: -- → P4
Comment 8•12 years ago
count_and_reboot.py should already be on the slave before it even tries to take a job.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → DUPLICATE
Assignee
Updated•11 years ago
Product: mozilla.org → Release Engineering