Bug 781162 (Closed) — Opened 8 years ago, Closed 8 years ago

resolve our twisted errors on tegras as they are causing many tests to fail for infra reasons

Categories: Release Engineering :: General (defect)
Platform: ARM / Android
Priority: Not set
Severity: normal
Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: jmaher, Unassigned)

References

Details

Looking at 51 infra-related failures (red/purple/blue), I see 15 that have Twisted failures in them, all within less than a 24-hour window. If we saw Twisted connection failures at this rate on desktop tests, nobody would trust the automation.

I really don't know much about Twisted, but this appears to be that we are losing the connection between the master and slave. Speculating here: my understanding is that we have many slaves on a single foopy, and either we need to restart those slaves periodically, or the load on the foopy is high enough that it fails to respond in time, causing a failure.

Do we have a schedule for maintaining the foopies?  
Maybe reboot them once a day?
Can we reduce the number of tegras on a foopy?
Do we know what type of load we are seeing on the foopies?

Since this accounts for about 30% of our infra related failures, we really should take steps to resolve this as much as possible.
This *feels* like the connection is lost because our foopies forcibly kill the buildbot process. In many cases that is a good thing: it indicates that clientproxy found an error.flg, decided the tegra is bad, and pulled it out of production.
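To make the error.flg mechanism concrete, here is a minimal sketch of that pattern — NOT the actual clientproxy code. The directory layout and the `twistd.pid` filename are assumptions for illustration; only `error.flg` itself is from the discussion above.

```shell
#!/bin/sh
# Sketch (hypothetical) of the clientproxy error.flg pattern: if a tegra's
# builder directory contains error.flg, kill that tegra's buildbot slave
# process so the tegra drops out of production.
check_tegra() {
  dir="$1"
  if [ -f "$dir/error.flg" ]; then
    echo "error.flg found in $dir -- killing buildbot slave"
    # Kill the slave via its pidfile, if one exists (assumed layout).
    if [ -f "$dir/twistd.pid" ]; then
      kill "$(cat "$dir/twistd.pid")" 2>/dev/null
    fi
    return 1    # signal: this tegra is bad
  fi
  return 0      # tegra looks healthy
}
```

From the master's side, a slave killed this way looks like an abrupt Twisted connection loss, which would match the errors reported in comment 0.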

This is the only *good* way to kill off a tegra atm, since my "gracefully kill buildbot" solution didn't work/isn't working yet.

So, all in all, I think this (as stated) is wontfix, since I don't see anything actionable that will improve our situation on the whole.

Let me address your Q's though anyway (and ask some of my own):

(In reply to Joel Maher (:jmaher) from comment #0)
> Looking at 51 infra related failures (red/purple/blue), I see 15 failures
> that have twisted failures in them.  

You don't cite the Twisted errors you see, or which steps they occurred in, which leads me to my conclusion above; happy to hear a different idea, though.

> I really don't know much about twisted, but this appears to be that we are
> losing the connection between the master and slave.  Speculating here, my
> understanding is that we have many slaves on a single foopy and we probably
> need to restart those slaves or we have such a high load on the foopy that
> we fail to respond in time causing a failure.

If my assumption above turns out not to be accurate, this might be a good line of thought.

> Do we have a schedule for maintaining the foopies?  

Not really, but if it helps we can.

> Maybe reboot them once a day?

Sadly, we can't do this easily. To reboot a foopy today, we have to do a bunch of things: (a) gracefully shut down the master, which has many tegras and a bunch of foopies attached; (b) run ./stop_cp.sh on all the foopies [which takes a while]; (c) reboot the foopies; (d) reboot the master [or restart it manually]; (e) run ./start_cp on all the foopies from a screen session.
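The sequence above could in principle be scripted. Here is a dry-run sketch under stated assumptions: the host names are hypothetical, and only `stop_cp.sh` / `start_cp` come from the procedure itself. By default nothing is executed; each command is only printed.

```shell
#!/bin/sh
# Dry-run sketch of the foopy reboot sequence (hosts are hypothetical).
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi
}

MASTER="buildbot-master01"       # hypothetical master host
FOOPIES="foopy01 foopy02"        # hypothetical foopy hosts

reboot_sequence() {
  run ssh "$MASTER" 'buildbot stop /builds/master'   # (a) graceful master stop
  for f in $FOOPIES; do
    run ssh "$f" 'cd /builds && ./stop_cp.sh'        # (b) stop clientproxy (slow)
  done
  for f in $FOOPIES; do
    run ssh "$f" 'sudo reboot'                       # (c) reboot each foopy
  done
  run ssh "$MASTER" 'sudo reboot'                    # (d) reboot the master
  for f in $FOOPIES; do
    run ssh "$f" 'screen -dmS cp sh -c "cd /builds && ./start_cp"'  # (e) restart cp
  done
}
```

Even scripted, the wall-clock cost of the graceful shutdown and stop_cp.sh steps is what eats into uptime, so automation alone doesn't make a daily reboot cheap.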

So it's not exactly automatable the way we have it today, and it would cut into our uptime by a good few hours per master.

> Can we reduce the number of tegras on a foopy?

Yes, but we are already low enough that I would rather not go much lower; there is a real cost to having more foopies (space/power/time). As it is, even without pandas, we need over a whole rack for all the foopies [if we account for a Linux foopy hosting every tegra].

> Do we know what type of load we are seeing on the foopies?

Yes, we have Ganglia installed on the foopies [you must be authorized to access its dashboard, iirc]. The CPU/network load on the foopies is quite low. We do see wio spikes, but for the most part we're pretty good with the tegras-per-foopy we have right now.

> Since this accounts for about 30% of our infra related failures, we really
> should take steps to resolve this as much as possible.

I'd love to get a better solution here, but I don't know a good way to make buildbot happily turn this error into a red/fail/retry or an ignored case; we are working toward that, though.
From what has been said, it appears the Twisted errors are most likely a symptom of something else rather than the root cause. I would say that is pretty accurate, knowing that I see other errors in the log files. Maybe 3 or 4 of the 51 failures show a Twisted error with no other displayed error.

If this is something we cannot tackle, then we should close this bug.
(In reply to Joel Maher (:jmaher) from comment #2)
> If this is something we cannot tackle, then we should close this bug.

Since I don't know of a way to tackle it in our current infra, I'll reso/wontfix this. We can revisit in the future once we get some of our other issues squared away.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Product: mozilla.org → Release Engineering
Component: General Automation → General