Closed Bug 664318 Opened 14 years ago Closed 13 years ago

[stage] input.allizom.org Test automation hampered by frequent timeouts

Tracking

(Not tracked)

Status:

VERIFIED WONTFIX

People

(Reporter: mbrandt, Assigned: cturra)

Details

Attachments

(2 files)

Service Unavaliable 14 years ago Matt Brandt [:mbrandt] 14.90 KB, image/png		Details
timeout 14 years ago Matt Brandt [:mbrandt] 294.54 KB, image/png		Details

Matt Brandt [:mbrandt]

Reporter

Description

•

14 years ago

Attached image Service Unavaliable — Details

Service Unavailable: This is fairly recent and is eliciting false positives in our test automation. We're seeing quite a few timeouts and "Service Unavailable" errors. We might be able to bypass the timeouts by increasing the time window in which the test runs; however, we can't bypass the "Service Unavailable" errors. Can you investigate the cause of this? Please let us know what we can do to help dig into the cause of the behavior.

Matt Brandt [:mbrandt]

Reporter

Comment 1

•

14 years ago

Attached image timeout — Details

Matt Brandt [:mbrandt]

Reporter

Comment 2

•

14 years ago

Here are a few timestamps to help w/ looking through the logs:
http://qa-selenium.mv.mozilla.com:8080/view/Input/job/input.stage/
Failed > Console Output #725 Jun 14, 2011 2:10:09 PM
Failed > Console Output #724 Jun 14, 2011 8:00:32 AM

Phong Tran [:phong]

Updated

•

14 years ago

Assignee: server-ops → nmaul

Jake Maul [:jakem]

Comment 3

•

14 years ago

The screenshot in 725 is a Zeus 500 error, meaning the backend server did not respond in a timely manner. It's worth noting that Apache gets reloaded on this server at least every 10 minutes (at 0,10,20,30,40,50 min past the hour, every hour). This is a shared staging server, serving more than just input.allizom.org, so it may be related to one of the other sites causing a load problem or something.

I offset this in Selenium... it was set for this schedule:
    0 8,16 * * *
it is now:
    2 8,16 * * *

Assuming I changed the right setting in the right place, your tests should no longer overlap with the Apache restart, which should give you more consistent results. Of course, manually run tests can still trip up. I recommend just paying attention to their timing: failures occurring on a 10-minute mark within the hour are likely bogus. Let us know if it doesn't get any better over the next couple of days.
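As a rough illustration of the "10-minute mark" rule of thumb above, the check could be sketched in Python as follows. This is only a sketch: the function name `near_apache_reload` and the 60-second window are assumptions made for illustration, not anything configured on the server or in the test job. The two sample timestamps are the failed builds listed in comment 2.

```python
from datetime import datetime

# From comment 3: Apache reloads at 0,10,20,30,40,50 minutes past the hour.
RELOAD_INTERVAL_MINUTES = 10

# Assumption for illustration only: treat a failure as suspect if it was
# logged within this many seconds after a reload mark.
WINDOW_SECONDS = 60


def near_apache_reload(timestamp: datetime,
                       window_seconds: int = WINDOW_SECONDS) -> bool:
    """Return True if timestamp falls within window_seconds after a 10-minute mark."""
    seconds_into_hour = timestamp.minute * 60 + timestamp.second
    seconds_since_mark = seconds_into_hour % (RELOAD_INTERVAL_MINUTES * 60)
    return seconds_since_mark < window_seconds


# Failure timestamps reported in comment 2 (builds #725 and #724).
failures = {
    725: datetime(2011, 6, 14, 14, 10, 9),
    724: datetime(2011, 6, 14, 8, 0, 32),
}

for build, when in failures.items():
    verdict = "likely bogus" if near_apache_reload(when) else "worth a look"
    print(f"#{build} at {when.time()}: {verdict}")
```

Note that this only looks at a single timestamp. A run that starts at minute 2 and lasts longer than eight minutes would still cross the next reload mark, so a fuller check would compare the whole run interval against the marks.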

Status: NEW → ASSIGNED

Jake Maul [:jakem]

Comment 4

•

14 years ago

Going to close this out... if the problem is not solved, please re-open and we'll keep looking. Thanks!

Status: ASSIGNED → RESOLVED

Closed: 14 years ago

Resolution: --- → FIXED

Matt Brandt [:mbrandt]

Reporter

Comment 5

•

14 years ago

Thanks for digging into this. We've had a good run with many fewer timeouts, but there's been an increase in intermittent timeouts recently. What can I provide to help diagnose this further?
- Are we still hitting timeouts because Apache is reloading (comment 3)?
- What is the Zeus timeout threshold, and is this setting shared with the other projects hosted on the server?

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Stephen Donner [:stephend] Not actively reading bugmail

Comment 6

•

14 years ago

In talking to Corey on IRC, it looks like bug 651148 will either obviate this or make it all the more apparent. After discussing it with him, the plan is to mark this bug as depending on bug 651148, which should happen sometime next week or so, and then we (WebQA) can comment on what we're seeing in the new setup.

Depends on: 651148

Corey Shields [:cshields]

Updated

•

14 years ago

Component: Server Operations → Server Operations: Web Operations

QA Contact: mrz → cshields

Jake Maul [:jakem]

Comment 7

•

14 years ago

How has this been? The bug in comment 6 was resolved recently, wondering if we can close this out too.

Stephen Donner [:stephend] Not actively reading bugmail

Comment 8

•

14 years ago

(In reply to Jake Maul [:jakem] from comment #7)
> How has this been? The bug in comment 6 was resolved recently, wondering if
> we can close this out too.

Jake: still seeing quite a few failures (timeouts, as well as "Search Unavailables," the latter of which we've xfailed a few).

Stephen Donner [:stephend] Not actively reading bugmail

Comment 9

•

14 years ago

(In reply to Stephen Donner [:stephend] from comment #8)
> (In reply to Jake Maul [:jakem] from comment #7)
> > How has this been? The bug in comment 6 was resolved recently, wondering if
> > we can close this out too.
>
> Jake: still seeing quite a few failures (timeouts, as well as "Search
> Unavailables," the latter of which we've xfailed a few).

http://qa-selenium.mv.mozilla.com:8080/job/input.stage/buildTimeTrend

Jake Maul [:jakem]

Comment 10

•

14 years ago

I wonder if we're hitting an issue with the Seamicro dev nodes not being fast enough. Phong: can we get a VM put up and in place on this cluster (input-dev, PHX1)? Once it's puppetized I can put it in the right node class and zeus cluster and all that. RHEL6, 2GB RAM, 1 core, 10GB disk should be sufficient for testing. Thanks!

Assignee: nmaul → phong

Whiteboard: want vm

Phong Tran [:phong]

Comment 11

•

13 years ago

Is this in phx1 or sjc1?

Jake Maul [:jakem]

Comment 12

•

13 years ago

PHX1. :)

Phong Tran [:phong]

Comment 13

•

13 years ago

Can you create a VM on the PHX cluster for this?

Assignee: phong → dparsons

Dan Parsons [:lerxst]

Updated

•

13 years ago

Assignee: dparsons → server-ops

Component: Server Operations: Web Operations → Server Operations: Virtualization

QA Contact: cshields → dparsons

Corey Shields [:cshields]

Comment 14

•

13 years ago

Let's hold off on a VM - we'll have new Xeon gear there in about a week, I'd like to just give a full blade to it.

Dan Parsons [:lerxst]

Comment 15

•

13 years ago

OK, gonna move it back to the server-ops queue then.

Component: Server Operations: Virtualization → Server Operations

QA Contact: dparsons → phong

Whiteboard: want vm

Brandon Burton [:solarce]

Comment 16

•

13 years ago

Just wanted to note, since I've been poking at input a bunch the last couple of days for bug 725782, that input.allizom.org is hosted on mrapp-stage02 right now, which is an older DL360 G4 and runs a bunch of stuff, so it's not seamicro slowness, just older-hardware slowness due to sharing. Input-dev is on a pair of seamicro nodes, but stage and prod were never migrated to the new admin node and set up with the current deployment style. As Corey mentioned above, moving staging to a new Xeon seamicro and to the new inputadm and deployment style is probably for the best long term, so once a node is ready, I can take the lead on setting up a new input.allizom.org on it and getting it tested by the developers.

Corey Shields [:cshields]

Updated

•

13 years ago

Depends on: 728219

Brandon Burton [:solarce]

Updated

•

13 years ago

Assignee: server-ops → bburton

Brandon Burton [:solarce]

Comment 17

•

13 years ago

Pending Xeon seamicros and migration to phx1

Matt Brandt [:mbrandt]

Reporter

Comment 18

•

13 years ago

:solarce - thx for the update :)

Brandon Burton [:solarce]

Updated

•

13 years ago

Assignee: bburton → cturra

Chris Turra [:cturra]

Assignee

Updated

•

13 years ago

Blocks: 737547

Chris Turra [:cturra]

Assignee

Comment 19

•

13 years ago

per the "Input: Sequel or Reboot" meeting this morning, we will be doing a reboot on input so marking this bug "won't fix"

Status: REOPENED → RESOLVED

Closed: 14 years ago → 13 years ago

Resolution: --- → WONTFIX

Matt Brandt [:mbrandt]

Reporter

Comment 20

•

13 years ago

QA verified wontfix - thank you for following up and cleaning out the bug queue, cturra.

Status: RESOLVED → VERIFIED

Nobody; OK to take it and work on it

Updated

•

10 years ago

Product: mozilla.org → mozilla.org Graveyard
