Closed Bug 517862 Opened 15 years ago Closed 14 years ago

some talos vista slaves throwing a "Service request failed" dialog on boot

Categories

(Release Engineering :: General, defect)

Platform: x86
OS: Windows Vista
Type: defect
Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: bhearsum, Unassigned)

References

Details

(Whiteboard: [buildslaves])

Catlee noticed qm-pvista-try03 with a blocking dialog up. I went around and looked and found it on talos-rev2-vista01, vista05, vista10, and vista15. This is pretty bad, as we rely on these machines coming back up cleanly.
(In reply to comment #0)
> Catlee noticed qm-pvista-try03

I meant qm-pvista-try04 here.
It _seems_ like we got in this state because the slaves couldn't reach the OPSI server while they were booting. I believe that I restarted the OPSI server at some point on Friday, which might explain this. It seems like quite the coincidence that 5 slaves would hit it at the same time, though.

I'm looking into a way to make sure these slaves _always_ come back up, regardless of the state of the server. Getting to the slave logs when they come back up will be helpful.
There are a few parameters in the manual that might be of help (see the sketch after this list):
LoginBlockerStart
LoginBlockerTimeoutConnect
LoginBlockerTimeoutInstall
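
If these can be tuned per client, a script along the following lines would be one way to push them out. This is only a rough Python sketch: the registry location, value types, and units are my assumptions rather than anything out of the OPSI manual, so double-check before running it anywhere real.

    # Hypothetical sketch: lower the login blocker's timeouts so a client
    # unblocks logins quickly when the OPSI service is unreachable at boot.
    # The registry path, REG_DWORD type, and units (seconds) are assumptions.
    import winreg

    KEY_PATH = r"SOFTWARE\opsi.org\preloginloader"  # assumed location of the settings

    settings = {
        "LoginBlockerStart": 1,             # keep the blocker enabled at boot
        "LoginBlockerTimeoutConnect": 30,   # give up on reaching the service after 30s
        "LoginBlockerTimeoutInstall": 300,  # give up on pending installs after 5 minutes
    }

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0,
                        winreg.KEY_SET_VALUE) as key:
        for name, value in settings.items():
            winreg.SetValueEx(key, name, 0, winreg.REG_DWORD, value)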

I've also posted a thread on the OPSI forums to see if anyone there has hit this. https://forum.opsi.org/viewtopic.php?f=8&t=939&start=0&sid=373b227488c7ad629dcf791fb906f765
Some updates:
* We had issues with XP machines disappearing overnight. Some of them had an OPSI screen up when I logged on, others had a screensaver; those recovered fine after logging in and moving the mouse.
* Even with guidance from the OPSI developers, I cannot get the login blocker to stop running.

I'm still trying to debug this, but I'm running out of ideas. Given that we can reproduce this problem in staging, I think we should back it out of the production machines until I can find a fix. If we leave it running, we're going to lose Vista machines very quickly.
Catlee pointed out some win32 build machines that are hitting issues too. He also suggested that it might be a load issue on the server. I looked at production-opsi and it's definitely not quick: it's swapping like mad right now. We should add some RAM to it in a downtime to help combat that. We also still need to figure out how to get VMware Tools on it.
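
For the record, here's a quick way to eyeball memory and swap pressure on the server from Python; it just reads the stock Linux /proc/meminfo, nothing OPSI-specific:

    # Report total/free RAM and swap in use on a Linux box (values are in kB).
    def meminfo():
        info = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, value = line.split(":", 1)
                info[key] = int(value.split()[0])
        return info

    m = meminfo()
    swap_used = m["SwapTotal"] - m["SwapFree"]
    print("RAM:  %d kB total, %d kB free" % (m["MemTotal"], m["MemFree"]))
    print("Swap: %d kB total, %d kB in use" % (m["SwapTotal"], swap_used))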

For now, I've restarted the OPSI service to get us a clean state. I'm very interested to see if this fixes any of the issues.
Interesting point: the 3 build machines that Catlee noticed came back up fine right away after the opsi service had been restarted. My leading theory on this now is load on the OPSI server (thanks Catlee!).
Depends on: 518167
We had 2 more Vista machines disappear overnight. Load is better on the OPSI server since the service was restarted, but still not great. I've done a few upgrades on staging-opsi and they seem to be helping:
* upgrade to 2 GB of RAM
* upgrade the kernel to a 686 one
* install VMware Tools

I'm going to do the same upgrades to production-opsi in the downtime tomorrow.

One of the OPSI devs has given me new preloginloader packages which I'll try if the server upgrades don't fix this issue.
Found two more XP slaves with the screensaver up; still not sure whether it's connected to OPSI or not.
qm-pvista-try01 got itself into the "Service request failed" state around 8pm yesterday.
Depends on: 518564
No longer blocks: 510556
(In reply to comment #6)
> One of the OPSI devs has given me new preloginloader packages which I'll try if
> the server upgrades don't fix this issue.

I'm finally getting around to trying out the latest preloginloader. Since this issue is so hard to reproduce, here's my plan:
* test the new preloginloader in staging
* if it works (e.g., doesn't immediately break), roll it out across the Vista machines

We usually have a machine or two go down with a dialog every day, so we should be able to see pretty quickly if this fixes things.
Blocks: 510556
OK, version 3.4-30 of the preloginloader seems to work fine. I installed it on talos-rev1-vista01 and then did a test install of a package - everything was fine. The OPSI folks say that they've changed the retry behaviour when the client can't contact the service, so it sounds promising.

I'll get this rolled out across the Vista machines and then we'll watch and wait.
Unfortunately we still had 2 Vista slaves hang with the OPSI dialog overnight. Back to the drawing board.
I had no luck reproducing this problem yesterday. However, when I was getting set up this morning my local mini *did* hit the same problem. Unfortunately, the log level on it wasn't set high enough to get any new information. I've bumped the log level now and I've got it set to reboot constantly. Hopefully it'll break again soon.
The OPSI folks gave me a way to bump up the log level, which should help us debug why opsiclientd is crashing. To do so, I did this (a scripted equivalent is sketched after the steps):
* open up config editor
* click 'server configuration'
* add a new property to the 'additional configuration' section called 'opsiclientd.global.log_level' and set the value to '7'.
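
If we ever want to script that instead of clicking through the config editor, something like the following should be roughly equivalent. Only the property name and value come from the OPSI folks; the URL, port, credentials, and the setGeneralConfig method name are my guesses at the 3.x JSON-RPC interface and need to be verified against the server before use.

    # Assumed equivalent of the config editor steps, via the OPSI web service.
    import requests  # third-party: pip install requests

    OPSI_RPC = "https://production-opsi:4447/rpc"  # assumed service address/port
    payload = {
        "id": 1,
        "method": "setGeneralConfig",              # assumed 3.x method name
        "params": [{"opsiclientd.global.log_level": "7"}],
    }
    resp = requests.post(OPSI_RPC, json=payload,
                         auth=("adminuser", "password"),  # placeholder credentials
                         verify=False)                    # appliance uses a self-signed cert
    print(resp.json())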

The next machine that fails should have a bunch of extra information in the log, which I'll have to grab after it gets rebooted.

In the meantime, I've still got my local mini cycling over and over trying to reproduce the problem again here.
No Vista machines came up with the OPSI dialog over the weekend.
So, we still haven't seen any slaves fail. However, in bug 525016 we backed out the new preloginloader, back to the original version running on the Vista slaves. It will be interesting to see if they start to fail again now.
Even with the downgraded preloginloader package there still haven't been any vista slaves with dialogs up over the weekend. Leaving this open for now, but at this current moment, we're in a good state.
See Also: → 522078
(In reply to comment #16)
> Even with the downgraded preloginloader package there still haven't been any
> vista slaves with dialogs up over the weekend. Leaving this open for now, but
> at this current moment, we're in a good state.

Two months later, and we've been seeing failures now and again. Not nearly as many as we have in the past, though. We should probably look into this a bit more, but it's not nearly as pressing, so I'm going to future this bug for now.
Component: Release Engineering → Release Engineering: Future
Per bhearsum, this is Vista-specific. Don't know yet about Win7.
OS: Mac OS X → Windows Vista
Whiteboard: [buildslaves]
I've seen this on win7.
(In reply to comment #19)
> I've seen this on win7.

Are you sure? This bug is about something specific triggered by OPSI, which we don't have on Windows 7 yet.
Oops, sorry for the confusion. I must have seen it on the new rev3 XP machines.
We don't see this very often anymore, for whatever reason, and given that we're moving to Windows 7 and retiring the Vista slaves, I doubt we're going to spend any more effort here. I'm going to close this WONTFIX -- we can reopen if this changes or if we see the issue on Windows 7.
Assignee: bhearsum → nobody
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → WONTFIX
Summary: some talos slaves throwing a "Service request failed" dialog on boot → some talos vista slaves throwing a "Service request failed" dialog on boot
Moving closed Future bugs into Release Engineering in preparation for removing the Future component.
Component: Release Engineering: Future → Release Engineering
Product: mozilla.org → Release Engineering