517862 - some talos vista slaves throwing a "Service request failed" dialog on boot

Reporter

Description

•

15 years ago

Catlee notice qm-pvista-try03 with a blocking dialog up. I went around and looked and found it on talos-rev2-vista01, vista05, vista10, and vista15. This is pretty bad, as we rely on these machines coming back up cleanly.

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 1

•

15 years ago

(In reply to comment #0) > Catlee notice qm-pvista-try03 I meant qm-pvista-try04 here.

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 2

•

15 years ago

It _seems_ like we got in this state because the slaves couldn't reach the OPSI server while they were booting. I believe that I restarted the OPSI server at some point on Friday, which might explain this. It seems like quite the coincidence that 5 slaves would hit it at the same time, though. I'm looking into a way to make sure these slaves _always_ come back up, regardless of the state of the server. Getting to the slave logs when they come back up will be helpful. There's a few parameters in the manual that might be of help: LoginBlockerStart LoginBlockerTimeoutConnect LoginBlockerTimeoutInstall I've also posted a thread on the OPSI forums to see if anyone there has hit this. https://forum.opsi.org/viewtopic.php?f=8&t=939&start=0&sid=373b227488c7ad629dcf791fb906f765

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 3

•

15 years ago

Some updates: * We had issues with XP machines disappearing overnight. Some of them had an OPSI screen up when I logged on, others had a screensaver - these ones recovered fine after logging in and moving the mouse. * Even with guidance from the OPSI developers I cannot get the login blocker to stop running. I'm still trying to debug this but I'm running out of ideas. Given that we can reproduce this problem in staging I think we should back it out of the production machines until I can find a fix. If we leave it running, we're going to lose Vista machines very quickly.

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 4

•

15 years ago

Catlee pointed out some win32 build machines that are hitting issues too. He also suggested that it might be a load issue on the server. I looked at production-opsi and it's definitely not quick: it's swapping like mad right now. We should add some RAM to it in a downtime to help combat that. We also need to figure out how to get VMware tools on it, still. For now, I've restarted the OPSI service to get us a clean state. I'm very interested to see if this fixes any of the issues.

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 5

•

15 years ago

Interesting point: the 3 build machines that Catlee noticed came back up fine right away after the opsi service had been restarted. My leading theory on this now is load on the OPSI server (thanks Catlee!).

bhearsum@mozilla.com (:bhearsum)

Reporter

Updated

•

15 years ago

Depends on: 518167

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 6

•

15 years ago

We had 2 more Vista machines disappear overnight. Load is better on the OPSI server since the service was restarted, but still not great. I've done a few upgrades on staging-opsi and they seem to be helping: * upgrade to 2gb of ram * upgrade the kernel to a 686 one * install vmware tools I'm going to do the same upgrades to production-opsi in the downtime tomorrow. One of the OPSI devs has given me new preloginloader packages which I'll try if the server upgrades don't fix this issue.

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 7

•

15 years ago

Found two more XP slaves with the screensaver up, still not sure if it's connected to OPSI or not.

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 8

•

15 years ago

qm-pvista-try01 got itself into the "Service request failed" state around 8pm yesterday.

bhearsum@mozilla.com (:bhearsum)

Reporter

Updated

•

15 years ago

Depends on: 518564

bhearsum@mozilla.com (:bhearsum)

Reporter

Updated

•

15 years ago

No longer blocks: 510556

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 9

•

15 years ago

(In reply to comment #6) > One of the OPSI devs has given me new preloginloader packages which I'll try if > the server upgrades don't fix this issue. I'm finally getting around to trying out the latest preloginloader. Since this issue is so hard to reproduce, here's my plan: * test the new preloginloader in staging * if it works (eg, doesn't immediately break) roll it out across the vista machines We usually have a machine or two go down with a dialog every day, so we should be able to see pretty quickly if this fixes things.

Blocks: 510556

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 10

•

15 years ago

Ok, version 3.4-30 of the preloginloader seems to work ok. I installed it on talos-rev1-vista01 and then did a test install of a package - everything was fine. The OPSI folks say that they've changed the retry behaviour when OPSI can't contact the service, so it sounds promising. I'll get this rolled out across the Vista machines and then we'll watch and wait.

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 11

•

15 years ago

Unfortunately we still had 2 Vistas slaves hang with the OPSI dialog overnight. Back to the drawing board.

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 12

•

15 years ago

I had no luck reproducing this problem yesterday. However, when I was getting set up this morning my local mini *did* hit the same problem. Unfortunately, the log level on it wasn't set high enough to get any new information. I've bumped the log level now and I've got it set to reboot constantly. Hopefully it'll break again soon.

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 13

•

15 years ago

The OPSI folks gave me a way to bump up the log level, which should help us debug why opsiclientd is crashing. To do so, I did this: * open up config editor * click 'server configuration' * add a new property to the 'additional configuration' section called 'opsiclientd.global.log_level' and set the value to '7'. The next machine that fails should have a bunch of extra information in the log, which I'll have to grab after it gets rebooted. In the meantime, I've still got my local mini cycling over and over trying to reproduce the problem again here.

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 14

•

15 years ago

No Vista machines came up with the OPSI dialog over the weekend.

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 15

•

15 years ago

So, we still haven't seen any slaves fail. However, in bug 525016 we backed out the new preloginloader, back to the original version running on the Vista slaves. It will be interesting to see if they start to fail again now.

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 16

•

15 years ago

Even with the downgraded preloginloader package there still haven't been any vista slaves with dialogs up over the weekend. Leaving this open for now, but at this current moment, we're in a good state.

bhearsum@mozilla.com (:bhearsum)

Reporter

Updated

•

15 years ago

Comment 17

•

15 years ago

(In reply to comment #16) > Even with the downgraded preloginloader package there still haven't been any > vista slaves with dialogs up over the weekend. Leaving this open for now, but > at this current moment, we're in a good state. Two months later, and we've been saying failures now and again. Not nearly as many as we have in the past, though. We should probably look into this a bit more, but it's not nearly as pressing, so I'm going to future this bug for now.

Component: Release Engineering → Release Engineering: Future

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 18

•

15 years ago

Per bhearsum this is Vista specific. Dont know yet about win7.

OS: Mac OS X → Windows Vista

bhearsum@mozilla.com (:bhearsum)

Reporter

Updated

•

15 years ago

Whiteboard: [buildslaves]

alice nodelman [:alice] [:anode]

Comment 19

•

15 years ago

I've seen this on win7.

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 20

•

15 years ago

(In reply to comment #19) > I've seen this on win7. Are you sure? This bug is about something specific triggered by OPSI, which we don't have on Windows 7 yet.

alice nodelman [:alice] [:anode]

Comment 21

•

15 years ago

Oop, sorry for the confusion. I must have seen it on the new rev3 xp machines.

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 22

•

15 years ago

We don't see this very often anymore, for whatever reason, and given that we're moving to Windows 7 and retiring the Vista slaves, I doubt we're going to spend anymore effort here. I'm going to close WONTFIX -- we can re-open if this changes, or we see the issues on Windows 7.

Assignee: bhearsum → nobody

Status: ASSIGNED → RESOLVED

Closed: 15 years ago

Resolution: --- → WONTFIX

Summary: some talos slaves throwing a "Service request failed" dialog on boot → some talos vista slaves throwing a "Service request failed" dialog on boot

Chris Cooper [:coop] (he/him)

Comment 23

•

15 years ago

Moving closed Future bugs into Release Engineering in preparation for removing the Future component.

Component: Release Engineering: Future → Release Engineering

Nobody; OK to take it and work on it

Assignee

Updated

•

11 years ago

Product: mozilla.org → Release Engineering