Closed Bug 659344 Opened 13 years ago Closed 12 years ago

slaves require access to build.mozilla.org to reboot

Categories

(Release Engineering :: General, defect, P4)

x86
macOS

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 712205

People

(Reporter: jrmuizel, Unassigned)

Details

I was looking over some try job runs yesterday and noticed a bunch failed with the following error:

--00:16:46--  http://build.mozilla.org/talos/tools/buildfarm/maintenance/count_and_reboot.py
           => `count_and_reboot.py'
Resolving build.mozilla.org... failed: Unknown host.

Things seem to be working again, but it might be worth looking into preventing this in the future.
The DNS problem was bug 659238, but why the heck is count_and_reboot.py resolving build.mozilla.org?
Talos jobs wget the count_and_reboot.py from there rather than using tools.

jrmuizel, could you please point us to the tbpl link?
Or could you tell us whether this happened only to talos jobs?
(In reply to comment #2)
> Talos jobs wget the count_and_reboot.py from there rather than using tools.

We should *definitely* fix that, particularly since a failed count_and_reboot.py often leaves the slave running another job without rebooting (where DisconnectStep isn't used).

Can we just feed this file directly to the slave from the buildbot configs? In most cases, AIUI, we're not even counting, so it's a simple platform check and 'sudo reboot' or 'shutdown -r -f -t 0'.
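
A rough sketch of what that "simple platform check" might look like in Python, for illustration only. The function name reboot_command is hypothetical, and the mapping of the two quoted commands to platforms (the 'shutdown' form on Windows, 'sudo reboot' everywhere else) is an assumption, not something taken from count_and_reboot.py itself:

```python
import sys


def reboot_command(platform=sys.platform):
    """Return the reboot command (as an argv list) for the given platform.

    Assumption: Windows slaves use the 'shutdown' form quoted in this bug,
    and every other platform uses 'sudo reboot'.
    """
    if platform.startswith("win") or platform == "cygwin":
        return ["shutdown", "-r", "-f", "-t", "0"]
    return ["sudo", "reboot"]
```

A buildbot step could then run whatever list this returns; nothing here actually executes the reboot.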
Summary: build.mozilla.org lookups failed lastnight → slaves require access to build.mozilla.org to reboot
(In reply to comment #2)
> Talos jobs wget the count_and_reboot.py from there rather than using tools.
> 
> jrmuizel could you please point us to the tbpl link?
> or could you tell us if this happened only to talos jobs?

http://tbpl.mozilla.org/?tree=Try&rev=0fb5e427719a
(In reply to comment #3)
> (In reply to comment #2)
> > Talos jobs wget the count_and_reboot.py from there rather than using tools.
> 
> We should *definitely* fix that, particularly since a failed
> count_and_reboot.py often leaves the slave running another job without
> rebooting (where DisconnectStep isn't used).
> 
> Can we just feed this file directly to the slave from the buildbot configs? 
> In most cases, AIUI, we're not even counting, so it's a simple platform
> check and 'sudo reboot' or 'shutdown -r -f -t 0'..

Makes sense. We are currently rebooting on every test job and I don't see us going back to counting and rebooting anytime soon.
Or, if we want to maintain the current behaviour for staging purposes (do they reboot every run?), we could have it try count_and_reboot.py once and, if that fails, fall back to a plain reboot as the failsafe.
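
The try-once-then-fall-back idea could be sketched roughly as follows. This is only an illustration: reboot_with_fallback is a hypothetical helper, the two commands are passed in by the caller, and the injectable run parameter exists only so the logic can be exercised without rebooting anything:

```python
import subprocess


def reboot_with_fallback(primary_cmd, fallback_cmd, run=subprocess.call):
    """Run primary_cmd once; if it fails, run fallback_cmd instead.

    "Fails" here means a non-zero exit code or an OSError raised when the
    command cannot be started (for example, the script is missing).
    Returns "primary" or "fallback" to say which path was taken.
    """
    try:
        returncode = run(primary_cmd)
    except OSError:
        returncode = None
    if returncode == 0:
        return "primary"
    # Failsafe: the primary attempt did not succeed, so reboot directly.
    run(fallback_cmd)
    return "fallback"
```

For example, primary_cmd might be a list invoking count_and_reboot.py and fallback_cmd might be ['sudo', 'reboot']; the exact arguments are left to the caller.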
Bear in mind that rebooting after every job is an integral part of the new slave-monitoring regime, so -- aside from fuzzer jobs -- I think we should just reboot always.
Priority: -- → P4
count_and_reboot.py should already be on the slave before it even tries to take a job.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → DUPLICATE
Product: mozilla.org → Release Engineering