Closed
Bug 517862
Opened 15 years ago
Closed 15 years ago
some talos vista slaves throwing a "Service request failed" dialog on boot
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: bhearsum, Unassigned)
References
Details
(Whiteboard: [buildslaves])
Catlee noticed qm-pvista-try03 with a blocking dialog up. I went around and looked and found it on talos-rev2-vista01, vista05, vista10, and vista15. This is pretty bad, as we rely on these machines coming back up cleanly.
Reporter
Comment 1•15 years ago
(In reply to comment #0)
> Catlee notice qm-pvista-try03
I meant qm-pvista-try04 here.
Reporter
Comment 2•15 years ago
It _seems_ like we got in this state because the slaves couldn't reach the OPSI server while they were booting. I believe that I restarted the OPSI server at some point on Friday, which might explain this. It seems like quite the coincidence that 5 slaves would hit it at the same time, though.
I'm looking into a way to make sure these slaves _always_ come back up, regardless of the state of the server. Getting to the slave logs when they come back up will be helpful.
There are a few parameters in the manual that might help:
LoginBlockerStart
LoginBlockerTimeoutConnect
LoginBlockerTimeoutInstall
I've also posted a thread on the OPSI forums to see if anyone there has hit this. https://forum.opsi.org/viewtopic.php?f=8&t=939&start=0&sid=373b227488c7ad629dcf791fb906f765
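The goal stated above, a slave that always finishes booting whatever state the server is in, amounts to putting a hard upper bound on how long the client waits. Below is a minimal sketch of that general idea in Python. It is not how the OPSI login blocker is implemented, and it does not claim to reflect what the LoginBlocker* parameters actually do; `wait_for_service`, `boot_step`, and all of the timeout values are hypothetical names and numbers chosen for illustration.

```python
import socket
import time


def wait_for_service(host, port, total_timeout=30.0, attempt_timeout=3.0,
                     pause=1.0, connect=socket.create_connection,
                     clock=time.monotonic, sleep=time.sleep):
    """Return True if a TCP connection succeeds before total_timeout, else False.

    connect, clock and sleep are injectable so the loop can be exercised
    without a real network or real waiting.
    """
    deadline = clock() + total_timeout
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            # Out of time: give up instead of blocking the boot forever.
            return False
        try:
            # Never let a single attempt run past the overall deadline.
            conn = connect((host, port), timeout=min(attempt_timeout, remaining))
        except OSError:
            # Server unreachable for now: pause briefly, then retry.
            sleep(min(pause, max(0.0, deadline - clock())))
            continue
        conn.close()
        return True


def boot_step(host, port, **kwargs):
    """Decide what the (hypothetical) boot hook should do next."""
    if wait_for_service(host, port, **kwargs):
        return "run-install"
    # The important branch: an unreachable server must not block the login.
    return "skip-and-continue"
```

The point of the design is the second branch of `boot_step`: a failed connection degrades to "skip this run" rather than to a dialog that needs a human to dismiss it.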
Reporter
Comment 3•15 years ago
Some updates:
* We had issues with XP machines disappearing overnight. Some of them had an OPSI screen up when I logged on, others had a screensaver - these ones recovered fine after logging in and moving the mouse.
* Even with guidance from the OPSI developers I cannot get the login blocker to stop running.
I'm still trying to debug this but I'm running out of ideas. Given that we can reproduce this problem in staging I think we should back it out of the production machines until I can find a fix. If we leave it running, we're going to lose Vista machines very quickly.
Reporter
Comment 4•15 years ago
Catlee pointed out some win32 build machines that are hitting issues too. He also suggested that it might be a load issue on the server. I looked at production-opsi and it's definitely not quick: it's swapping like mad right now. We should add some RAM to it in a downtime to help combat that. We also need to figure out how to get VMware tools on it, still.
For now, I've restarted the OPSI service to get us a clean state. I'm very interested to see if this fixes any of the issues.
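For anyone wanting to put a number on "swapping like mad", here is a rough sketch that reads swap usage from /proc/meminfo. It assumes production-opsi is a Linux host (not stated outright in this bug), and the function names are made up for illustration. The fraction of swap in use is only a crude proxy: it shows how much has been pushed out to swap, not how actively the machine is paging right now.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: kilobytes} dict."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_used_fraction(meminfo):
    """Fraction of swap in use, from 0.0 to 1.0 (0.0 if no swap is configured)."""
    total = meminfo.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return (total - meminfo.get("SwapFree", 0)) / total


def read_swap_used_fraction(path="/proc/meminfo"):
    """Convenience wrapper for a live Linux host."""
    with open(path) as handle:
        return swap_used_fraction(parse_meminfo(handle.read()))
```

Comparing the result of `read_swap_used_fraction()` before and after the RAM upgrade would give a simple way to tell whether the extra memory actually relieved the pressure.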
Reporter
Comment 5•15 years ago
Interesting point: the 3 build machines that Catlee noticed came back up fine right away after the opsi service had been restarted. My leading theory on this now is load on the OPSI server (thanks Catlee!).
Reporter
Comment 6•15 years ago
We had 2 more Vista machines disappear overnight. Load is better on the OPSI server since the service was restarted, but still not great. I've done a few upgrades on staging-opsi and they seem to be helping:
* upgrade to 2 GB of RAM
* upgrade the kernel to a 686 one
* install vmware tools
I'm going to do the same upgrades to production-opsi in the downtime tomorrow.
One of the OPSI devs has given me new preloginloader packages which I'll try if the server upgrades don't fix this issue.
Reporter
Comment 7•15 years ago
Found two more XP slaves with the screensaver up, still not sure if it's connected to OPSI or not.
Reporter
Comment 8•15 years ago
qm-pvista-try01 got itself into the "Service request failed" state around 8pm yesterday.
Reporter
Comment 9•15 years ago
(In reply to comment #6)
> One of the OPSI devs has given me new preloginloader packages which I'll try if
> the server upgrades don't fix this issue.
I'm finally getting around to trying out the latest preloginloader. Since this issue is so hard to reproduce, here's my plan:
* test the new preloginloader in staging
* if it works (e.g., doesn't immediately break), roll it out across the Vista machines
We usually have a machine or two go down with a dialog every day, so we should be able to see pretty quickly if this fixes things.
Blocks: 510556
Reporter
Comment 10•15 years ago
Ok, version 3.4-30 of the preloginloader seems to work ok. I installed it on talos-rev1-vista01 and then did a test install of a package - everything was fine. The OPSI folks say that they've changed the retry behaviour when OPSI can't contact the service, so it sounds promising.
I'll get this rolled out across the Vista machines and then we'll watch and wait.
Reporter
Comment 11•15 years ago
Unfortunately, we still had 2 Vista slaves hang with the OPSI dialog overnight. Back to the drawing board.
Reporter
Comment 12•15 years ago
I had no luck reproducing this problem yesterday. However, when I was getting set up this morning my local mini *did* hit the same problem. Unfortunately, the log level on it wasn't set high enough to get any new information. I've bumped the log level now and I've got it set to reboot constantly. Hopefully it'll break again soon.
Reporter
Comment 13•15 years ago
The OPSI folks gave me a way to bump up the log level, which should help us debug why opsiclientd is crashing. To do so, I did this:
* open up config editor
* click 'server configuration'
* add a new property to the 'additional configuration' section called 'opsiclientd.global.log_level' and set the value to '7'.
The next machine that fails should have a bunch of extra information in the log, which I'll have to grab after it gets rebooted.
In the meantime, I've still got my local mini cycling over and over trying to reproduce the problem again here.
Reporter
Comment 14•15 years ago
No Vista machines came up with the OPSI dialog over the weekend.
Reporter
Comment 15•15 years ago
So, we still haven't seen any slaves fail. However, in bug 525016 we backed out the new preloginloader, back to the original version running on the Vista slaves. It will be interesting to see if they start to fail again now.
Reporter
Comment 16•15 years ago
Even with the downgraded preloginloader package there still haven't been any vista slaves with dialogs up over the weekend. Leaving this open for now, but at this current moment, we're in a good state.
Reporter
Comment 17•15 years ago
(In reply to comment #16)
> Even with the downgraded preloginloader package there still haven't been any
> vista slaves with dialogs up over the weekend. Leaving this open for now, but
> at this current moment, we're in a good state.
Two months later, and we've been seeing failures now and again. Not nearly as many as we have in the past, though. We should probably look into this a bit more, but it's not nearly as pressing, so I'm going to future this bug for now.
Component: Release Engineering → Release Engineering: Future
Comment 18•15 years ago
Per bhearsum, this is Vista-specific. Don't know yet about win7.
OS: Mac OS X → Windows Vista
Reporter
Updated•15 years ago
Whiteboard: [buildslaves]
Comment 19•15 years ago
I've seen this on win7.
Reporter
Comment 20•15 years ago
(In reply to comment #19)
> I've seen this on win7.
Are you sure? This bug is about something specific triggered by OPSI, which we don't have on Windows 7 yet.
Comment 21•15 years ago
Oops, sorry for the confusion. I must have seen it on the new rev3 XP machines.
Reporter
Comment 22•15 years ago
We don't see this very often anymore, for whatever reason, and given that we're moving to Windows 7 and retiring the Vista slaves, I doubt we're going to spend any more effort here. I'm going to close WONTFIX -- we can re-open if this changes, or if we see the issues on Windows 7.
Assignee: bhearsum → nobody
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → WONTFIX
Summary: some talos slaves throwing a "Service request failed" dialog on boot → some talos vista slaves throwing a "Service request failed" dialog on boot
Comment 23•15 years ago
Moving closed Future bugs into Release Engineering in preparation for removing the Future component.
Component: Release Engineering: Future → Release Engineering
Assignee
Updated•11 years ago
Product: mozilla.org → Release Engineering