Closed Bug 936538 Opened 6 years ago Closed 5 years ago
Negatus eventually hangs on Linux
About once a week I need to reboot the steeplechase clients due to either being unable to connect to the SUTAgent or connecting but not getting a prompt. I'm going to call this a Negatus bug until I have a chance to investigate further as it appears things are running fine otherwise (e.g. I can still ssh into the client machine.)
I don't think we ever did any extensive testing for memory leaks or other stability issues in Negatus, so it's entirely likely that it's freezing or crashing after a while. Some basic valgrind analysis would probably go a long way.
After a bit more investigation this appears to be a slowdown rather than a hang. It takes so long to respond that our scripts hit timeouts, but a telnet session will eventually connect. I'm not sure if it is a resource leak, since it will affect one of the two steeplechase clients without impacting the other one. I've restarted the client machines, but I'm interested to see if the same client fails next time. On the problem client it appeared that it was not killing firefox properly as the process was hanging around after the tests ran. Killing firefox would improve performance to the point that tests would start, but they would still eventually fail with timeouts. On the other hand, the non-problem client had five or six firefoxes running from cancelled jobs and it was responding just fine. I'll give Valgrind a go and see what it turns up.
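To tell a slowdown from a hang, it may help to time the connection and the first response separately. The following is a minimal sketch, assuming the agent sends some data (such as its prompt) as soon as a client connects; the function name `time_to_first_bytes` and the 60-second default timeout are illustrative and not part of Negatus or the test harness.

```python
import socket
import time


def time_to_first_bytes(host, port, timeout=60.0):
    """Return (connect_seconds, first_data_seconds) for one TCP session.

    Hypothetical helper: assumes the agent writes something (e.g. a prompt)
    right after accepting the connection.
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        connected = time.monotonic()
        sock.settimeout(timeout)
        # Blocks until the peer sends data, closes, or the timeout expires.
        data = sock.recv(1024)
        received = time.monotonic()
    if not data:
        raise ConnectionError("peer closed the connection before sending data")
    return connected - start, received - connected
```

Logging both numbers over the course of a week would show whether the delay grows gradually, which would fit a leak, or appears suddenly.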
It seems that restarting the agent and restarting the ssh tunnelling do not fix things, but rebooting the system will. Not sure what that points to.
I've run stress tests of Negatus that use the same functions as the steeplechase tests do (without starting firefox and running the steeplechase test suite) over several days and have not been able to reproduce this, so it seems the problem is something with either firefox or the test suite.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
Can you check stats on open file descriptors and other measurable things to see if we're leaking some resource?
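One way to gather those stats on Linux is to read the symlinks under /proc/<pid>/fd and group them by type. This is only a sketch: the helper names are made up for illustration, and it assumes the script runs as a user allowed to read the agent's /proc entry.

```python
import os
from collections import Counter


def classify_fd_target(target):
    """Map a /proc/<pid>/fd symlink target to a coarse resource type."""
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("pipe:"):
        return "pipe"
    if target.startswith("anon_inode:"):
        return "anon_inode"
    return "file"


def summarize_fd_targets(targets):
    """Count descriptors per resource type."""
    return Counter(classify_fd_target(t) for t in targets)


def read_fd_targets(pid):
    """Return the symlink targets of every open descriptor of `pid`."""
    fd_dir = "/proc/%d/fd" % pid
    targets = []
    for name in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return targets
```

Calling `summarize_fd_targets(read_fd_targets(pid))` periodically for the agent's pid and comparing the counts over time should show whether sockets or pipes keep accumulating.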
Reopening. I'm also seeing a bunch of sockets in CLOSE_WAIT; I'll double-check that we're closing things cleanly on the test harness side. There are also a number of pipes left open, roughly one per connection in CLOSE_WAIT. I'm not sure if Negatus will clean these up automatically once the socket in CLOSE_WAIT times out. The above might be caused by test harness problems, but I also noticed a slow memory leak.
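To track the CLOSE_WAIT count over time, the lines of /proc/net/tcp can be parsed directly; the kernel reports CLOSE_WAIT as state 08 in that file. The sketch below is illustrative only: `count_close_wait` is a hypothetical helper, it covers IPv4 only, and the port used in the usage note (20701) is an assumption about the agent's configuration. A socket is in CLOSE_WAIT when the remote end has closed and the local process has not yet closed its own end.

```python
CLOSE_WAIT = 0x08  # TCP state code used in /proc/net/tcp


def count_close_wait(lines, local_port=None):
    """Count CLOSE_WAIT sockets in /proc/net/tcp-style lines.

    If local_port is given, only sockets bound to that local port count.
    """
    count = 0
    for line in lines:
        fields = line.split()
        # Skip the header row and anything that is not a socket entry.
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        local_address, state = fields[1], fields[3]
        if int(state, 16) != CLOSE_WAIT:
            continue
        # The address is HEXIP:HEXPORT; only the port is needed here.
        port = int(local_address.rsplit(":", 1)[1], 16)
        if local_port is None or port == local_port:
            count += 1
    return count
```

For example, `count_close_wait(open("/proc/net/tcp"), local_port=20701)` would count only the sockets on that port, assuming that is where the agent listens.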
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
This is solvable by adding a reboot step for the clients after the tests run. My understanding is that this is our standard practice for other test suites. Any objections to doing this here?
That seems fine, but it would probably also behoove us to figure out the root cause here.
Sigh, looks like the easy way out is no longer cooperating. Rebooting the VMs through Negatus is not bringing them back in a runnable state. Root cause it is.
Syd, I know you fixed a number of Negatus bugs. Have you noticed any Negatus stability problems on Linux since your fixes landed?
I fixed a major buffer overrun, and since then we have not seen anything like this in our environment.
Thanks Syd, I think it is safe to close this one now.
Status: REOPENED → RESOLVED
Closed: 6 years ago → 5 years ago
Resolution: --- → FIXED