Closed Bug 664318 Opened 13 years ago Closed 12 years ago

[stage] input.allizom.org Test automation hampered by frequent timeouts

Categories

(mozilla.org Graveyard :: Server Operations, task)

All
Other
task
Not set
normal

Tracking

(Not tracked)

VERIFIED WONTFIX

People

(Reporter: mbrandt, Assigned: cturra)

Attachments

(2 files)

Attached image Service Unavailable
Service Unavailable:
This is fairly recent and is eliciting false positives in our test automation. We're seeing quite a few timeouts and "Service Unavailable" errors. We might be able to bypass the timeouts by increasing the time window in which the test runs; however, we can't bypass the "Service Unavailable" errors.

Can you investigate the cause of this? Please let us know what we can do to help dig into the cause of the behavior.
Attached image timeout
Here are a few timestamps to help w/ looking through the logs:
http://qa-selenium.mv.mozilla.com:8080/view/Input/job/input.stage/

Failed > Console Output  #725 	Jun 14, 2011 2:10:09 PM	
Failed > Console Output  #724 	Jun 14, 2011 8:00:32 AM
Assignee: server-ops → nmaul
The screenshot in 725 is a Zeus 500 error, meaning the backend server did not respond in a timely manner.

It's worth noting that Apache gets reloaded on this server at least every 10 minutes (at 0,10,20,30,40,50 min past the hour, every hour). This is a shared staging server, serving more than just input.allizom.org, so it may be related to one of the other sites causing a load problem or something.

I offset this in Selenium... it was set for this schedule:

0 8,16 * * *

it is now:

2 8,16 * * *

Assuming I changed the right setting in the right place, your tests should no longer overlap with the Apache restart, which should give you more consistent results.
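As a rough illustration of why the two-minute offset should help (a sketch only, not the actual Selenium/Jenkins configuration): assuming Apache reloads at minutes 0, 10, 20, 30, 40 and 50 as described above, the old start minute lands on a reload and the new one does not. The function name and the `RELOAD_MINUTES` set are hypothetical, and the check only handles a plain integer in the cron minute field.

```python
# Hypothetical sketch: reload minutes are taken from the comment above
# (0, 10, 20, 30, 40, 50 past the hour), not read from any real config.
RELOAD_MINUTES = set(range(0, 60, 10))


def start_overlaps_reload(cron_expr: str) -> bool:
    """Return True if the minute field of a 5-field cron expression
    falls on an Apache reload minute.

    Only a single integer minute field is supported (no lists, ranges
    or steps), which is enough for the two schedules in this bug.
    """
    minute_field = cron_expr.split()[0]
    return int(minute_field) in RELOAD_MINUTES


print(start_overlaps_reload("0 8,16 * * *"))  # old schedule -> True
print(start_overlaps_reload("2 8,16 * * *"))  # new schedule -> False
```

Note that this only looks at the start minute. A run that is still going at the next 10-minute mark could presumably still hit a later reload.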

Of course, manually run tests can still trip up. I recommend just paying attention to their timing: failures occurring on a 10-minute mark within the hour are likely bogus.
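A minimal sketch of that timing check, assuming failure timestamps in the format shown in the console output list earlier in this bug (e.g. "Jun 14, 2011 2:10:09 PM"). The function name is hypothetical, and parsing the month abbreviation and AM/PM assumes an English locale.

```python
from datetime import datetime

# Format of the timestamps listed earlier in this bug (assumed, English locale).
TIMESTAMP_FORMAT = "%b %d, %Y %I:%M:%S %p"


def on_reload_mark(timestamp: str) -> bool:
    """Return True if the timestamp's minute is a 10-minute mark
    (0, 10, 20, 30, 40, 50), i.e. when Apache is reloaded."""
    parsed = datetime.strptime(timestamp, TIMESTAMP_FORMAT)
    return parsed.minute % 10 == 0


# The two failed runs listed above both fall on a 10-minute mark.
print(on_reload_mark("Jun 14, 2011 2:10:09 PM"))  # True
print(on_reload_mark("Jun 14, 2011 8:00:32 AM"))  # True
```

A failure flagged this way is only a candidate for being reload-related; it does not rule out other causes on the shared server.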

Let us know if it doesn't get any better over the next couple days.
Status: NEW → ASSIGNED
Going to close this out... if the problem is not solved, please re-open and we'll keep looking. Thanks!
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Thanks for digging into this. We had a good run with far fewer timeouts, but there's been an increase in intermittent timeouts recently. What can I provide to help diagnose this further?

- Are we still hitting timeouts because Apache is reloading (comment 3)?
- What is the Zeus timeout threshold, and is this setting shared with the other projects hosted on the server?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
In talking to Corey on IRC, it looks like bug 651148 will either obviate this or make it all the more apparent. After discussing it with him, the plan is to mark this bug as depending on bug 651148, which should happen sometime next week or so, and then we (WebQA) can comment on what we're seeing in the new setup.
Depends on: 651148
Component: Server Operations → Server Operations: Web Operations
QA Contact: mrz → cshields
How has this been? The bug in comment 6 was resolved recently, wondering if we can close this out too.
(In reply to Jake Maul [:jakem] from comment #7)
> How has this been? The bug in comment 6 was resolved recently, wondering if
> we can close this out too.

Jake: still seeing quite a few failures (timeouts, as well as "Search Unavailables," the latter of which we've xfailed a few).
(In reply to Stephen Donner [:stephend] from comment #8)
> (In reply to Jake Maul [:jakem] from comment #7)
> > How has this been? The bug in comment 6 was resolved recently, wondering if
> > we can close this out too.
> 
> Jake: still seeing quite a few failures (timeouts, as well as "Search
> Unavailables," the latter of which we've xfailed a few).

http://qa-selenium.mv.mozilla.com:8080/job/input.stage/buildTimeTrend
I wonder if we're hitting an issue with the Seamicro dev nodes not being fast enough.

Phong: can we get a VM put up and in place on this cluster (input-dev, PHX1)? Once it's puppetized I can put it in the right node class and zeus cluster and all that.

RHEL6, 2GB RAM, 1 core, 10GB disk should be sufficient for testing. Thanks!
Assignee: nmaul → phong
Whiteboard: want vm
Is this in phx1 or sjc1?
PHX1. :)
Can you create a VM on the PHX cluster for this?
Assignee: phong → dparsons
Assignee: dparsons → server-ops
Component: Server Operations: Web Operations → Server Operations: Virtualization
QA Contact: cshields → dparsons
Let's hold off on a VM - we'll have new Xeon gear there in about a week, I'd like to just give a full blade to it.
OK, gonna move it back to the server-ops queue then.
Component: Server Operations: Virtualization → Server Operations
QA Contact: dparsons → phong
Whiteboard: want vm
Just wanted to note, since I've been poking at input a bunch the last couple of days for bug 725782, that input.allizom.org is hosted on mrapp-stage02 right now, which is an older DL360 G4 and runs a bunch of stuff. So it's not seamicro slowness, just older-hardware slowness made worse by sharing. Input-dev is on a pair of seamicro nodes, but stage and prod were never migrated to the new admin node and set up with the current deployment style.

As Corey mentioned above, moving staging to a new Xeon seamicro and to the new inputadm and deployment style is probably for the best long term, so once a node is ready, I can take the lead on setting up a new input.allizom.org on it and getting it tested by the developers.
Depends on: 728219
Assignee: server-ops → bburton
Pending Xeon seamicros and migration to phx1
:solarace - thx for the update :)
Assignee: bburton → cturra
Blocks: 737547
Per the "Input: Sequel or Reboot" meeting this morning, we will be doing a reboot on input, so marking this bug "won't fix".
Status: REOPENED → RESOLVED
Closed: 13 years ago → 12 years ago
Resolution: --- → WONTFIX
QA verified wontfix. Thank you for following up and cleaning out the bug queue, cturra.
Status: RESOLVED → VERIFIED
Product: mozilla.org → mozilla.org Graveyard