Closed Bug 936538 Opened 12 years ago Closed 11 years ago

Negatus eventually hangs on Linux

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: dminor, Unassigned)

References

(Blocks 1 open bug)

Details

Dan Minor [:dminor]

Reporter

Description

12 years ago

About once a week I need to reboot the steeplechase clients because I either cannot connect to the SUTAgent or can connect but never get a prompt. I'm going to call this a Negatus bug until I have a chance to investigate further, as everything else appears to be running fine (e.g. I can still ssh into the client machine).

Mark Côté [:mcote]

Comment 1

12 years ago

I don't think we ever did any extensive testing for memory leaks or other stability issues in Negatus, so it's entirely likely that it's freezing or crashing after a while. Some basic valgrind analysis would probably go a long way.

Dan Minor [:dminor]

Reporter

Comment 2

12 years ago

After a bit more investigation, this appears to be a slowdown rather than a hang. The agent takes so long to respond that our scripts hit their timeouts, but a telnet session will eventually connect. I'm not sure it is a resource leak, since it affects one of the two steeplechase clients without impacting the other. I've restarted the client machines, and I'm interested to see whether the same client fails next time. On the problem client, Firefox was apparently not being killed properly: the process was still hanging around after the tests ran. Killing Firefox improved performance to the point that tests would start, but they would still eventually fail with timeouts. On the other hand, the non-problem client had five or six Firefox processes left over from cancelled jobs and was responding just fine. I'll give Valgrind a go and see what it turns up.
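To tell a slowdown from a hang, it helps to measure how long the agent takes to accept a connection and to produce its prompt, rather than relying on a script timeout. The following is a minimal sketch, not the harness's actual code: the function names, the `$>` prompt string, and the default timeout are assumptions for illustration, and the host and port must be supplied by the caller.

```python
import socket
import time


def wait_for_prompt(sock, prompt=b"$>", timeout=30.0):
    """Read from sock until `prompt` appears; return the seconds it took.

    The prompt bytes are an assumption; pass whatever the agent really sends.
    """
    start = time.monotonic()
    received = b""
    while prompt not in received:
        remaining = timeout - (time.monotonic() - start)
        if remaining <= 0:
            raise TimeoutError(f"no prompt within {timeout} seconds")
        sock.settimeout(remaining)
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("peer closed the connection before sending a prompt")
        received += chunk
    return time.monotonic() - start


def time_to_prompt(host, port, prompt=b"$>", timeout=30.0):
    """Return (connect_seconds, prompt_seconds) for one probe of the agent."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        connect_seconds = time.monotonic() - start
        prompt_seconds = wait_for_prompt(sock, prompt, timeout)
    return connect_seconds, prompt_seconds
```

Logging both numbers from a periodic probe on each client would show whether response time degrades gradually (suggesting a leak) or jumps after a particular test run.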

Dan Minor [:dminor]

Reporter

Comment 3

12 years ago

It seems that restarting the agent and restarting the ssh tunnelling do not fix things, but rebooting the system will. Not sure what that points to.

Dan Minor [:dminor]

Reporter

Comment 4

12 years ago

I've run stress tests of Negatus over several days that use the same functions as the steeplechase tests do (without starting Firefox or running the steeplechase test suite) and have not been able to reproduce this, so the problem seems to lie with either Firefox or the test suite.

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → WORKSFORME

(not currently active) Ted Mielczarek

Comment 5

12 years ago

Can you check stats on open file descriptors and other measurable things to see if we're leaking some resource?
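One way to gather such stats on Linux is to read the symlinks under `/proc/<pid>/fd` and count them by kind. This is only a sketch under that assumption; the function names and the coarse categories are made up for illustration and are not part of Negatus or the test harness.

```python
import os
from collections import Counter


def classify_fd_target(target):
    """Bucket a /proc/<pid>/fd symlink target into a coarse resource kind."""
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("pipe:"):
        return "pipe"
    if target.startswith("anon_inode:"):
        return "anon_inode"
    return "file"


def fd_summary(pid):
    """Count a process's open descriptors by kind (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
        counts[classify_fd_target(target)] += 1
    return counts
```

Calling `fd_summary(pid)` for the agent's process before and after each test run, and comparing the counts, would show whether sockets or pipes accumulate over time. Reading another user's `/proc/<pid>/fd` typically requires matching privileges.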

Dan Minor [:dminor]

Reporter

Comment 6

12 years ago

Reopening. I'm also seeing a bunch of sockets in CLOSE_WAIT; I'll double-check that we're closing things cleanly on the test harness side. There are also a number of pipes left open, basically one per connection in CLOSE_WAIT. I'm not sure whether Negatus will clean these up automatically once the socket in CLOSE_WAIT times out. The above might be caused by test harness problems, but I also noticed a slow memory leak.
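For tracking this over time, the socket states can be counted from the text of `/proc/net/tcp`, where the fourth column is the state in hex and `08` is CLOSE_WAIT. Note that a socket sits in CLOSE_WAIT when the peer has closed and the local process has not yet closed its own descriptor, so these entries point at whichever process owns the socket. The sketch below is illustrative only; the function name and the partial state table are assumptions.

```python
from collections import Counter

# Subset of the kernel's TCP state codes as printed in /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED",
    "06": "TIME_WAIT",
    "08": "CLOSE_WAIT",
    "0A": "LISTEN",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the contents of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        state_hex = fields[3].upper()
        counts[TCP_STATES.get(state_hex, state_hex)] += 1
    return counts
```

On a client machine, the input would come from reading `/proc/net/tcp` (and `/proc/net/tcp6` for IPv6 sockets); a CLOSE_WAIT count that grows with each test run would match the pipes-per-connection observation above.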

Status: RESOLVED → REOPENED

Resolution: WORKSFORME → ---

Dan Minor [:dminor]

Reporter

Comment 7

12 years ago

This is solvable by adding a reboot step for the clients after the tests run. My understanding is that this is our standard practice for other test suites. Any objections to doing this here?

Flags: needinfo?(ted)

(not currently active) Ted Mielczarek

Comment 8

12 years ago

That seems fine, but it would probably also behoove us to figure out the root cause here.

Flags: needinfo?(ted)

Dan Minor [:dminor]

Reporter

Comment 9

12 years ago

Sigh, looks like the easy way out is no longer cooperating. Rebooting the VMs through Negatus is not bringing them back in a runnable state. Root cause it is.

Dan Minor [:dminor]

Reporter

Comment 10

11 years ago

Syd, I know you fixed a number of Negatus bugs. Have you noticed any Negatus stability problems on Linux since your fixes landed?

Flags: needinfo?(spolk)

Syd Polk :sydpolk

Comment 11

11 years ago

I fixed a major buffer overrun, and since then we have not seen anything like this in our environment.

Flags: needinfo?(spolk)

Dan Minor [:dminor]

Reporter

Comment 12

11 years ago

Thanks, Syd. I think it is safe to close this one now.

Status: REOPENED → RESOLVED

Closed: 12 years ago → 11 years ago

Resolution: --- → FIXED

BMO Automation

Updated

8 years ago

Product: Testing → Testing Graveyard


Categories

(Testing Graveyard :: SUTAgent, defect)
