Bug 618363 (Closed) - Opened 14 years ago, Closed 14 years ago

clientproxy.py not detecting down devices

Categories

(Release Engineering :: General, defect)

Platform: ARM
OS: Android
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mozilla, Assigned: bear)

Attachments

(5 files, 4 obsolete files)

* no proxy.flg
* cannot telnet to tegra-007 20701 (among many other tegras)
* clientproxy still "listening for heartbeat"
* buildbot is sending jobs there, resulting in a ton of reds

This may be worth fixing if it's simple. If not, it's probably a sign we need to rearchitect this with ateam so that clientproxy takes the ini and listens for a heartbeat rather than polling. Also, since a tegra can go down without warning, the sut_tools need to be able to tell clientproxy that they're unable to reach the device, and have clientproxy do a ping or something.
Blocks: 610600
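
A minimal sketch of the kind of reachability check described above - the helper name is hypothetical, and the host/port values just mirror the report:

    import socket

    def is_tegra_reachable(host, port, timeout=5.0):
        # returns True if a TCP connection to the device's agent port succeeds
        try:
            conn = socket.create_connection((host, port), timeout)
            conn.close()
            return True
        except (socket.error, socket.timeout):
            return False

    # e.g. the unreachable device from the report
    if not is_tegra_reachable('tegra-007', 20701):
        print('tegra unreachable; clientproxy should stop handing it jobs')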
I've been working on a refactoring that is more robust given the changes to devicemanager and sutagent.
Adds a dialback handler, so clientproxy is now aware when a tegra is starting up. Adds better support for when a tegra disappears and the heartbeat fails: the buildslave is now stopped and won't be started again until the tegra is rebooted. I did not include the extra logging support and some other minor refactoring in this patch - I wanted to test this core change first, and the other code was distracting me.
Attachment #501608 - Flags: review?(aki)
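
Not from the patch itself - just a sketch of the stop-until-reboot behavior described above, assuming the standard buildslave start/stop commands and a hypothetical slave_dir:

    import subprocess

    def on_heartbeat_failure(slave_dir):
        # heartbeat gone: stop the buildslave so buildbot stops sending jobs
        subprocess.call(['buildslave', 'stop', slave_dir])

    def on_dialback(slave_dir):
        # a dialback means the tegra has (re)started, so it is safe to restart
        subprocess.call(['buildslave', 'start', slave_dir])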
Realized after submitting that I had accidentally removed the installApp() proxy.flg check in the dialback handler.
Attachment #501608 - Attachment is obsolete: true
Attachment #501610 - Flags: review?(aki)
Attachment #501608 - Flags: review?(aki)
Bah - I shouldn't do last-minute patch tweaks at 0300. The last patch was diffed with local changes that hadn't been double-checked.
Attachment #501610 - Attachment is obsolete: true
Attachment #501612 - Flags: review?(aki)
Attachment #501610 - Flags: review?(aki)
Comment on attachment 501612 [details] [diff] [review]
add dialback handler and also handle heartbeat going-away

>+class DialbackHandler(asyncore.dispatcher_with_send):
>+    def handle_read(self):
>+        data = self.recv(1024)

Does this 1024 limit us in any way that we should be aware of? I was wondering if we'd catch things mid-stream, but that's probably port 20700 rather than the dialback port. I might be mistaken on both of these counts.

>+    dialback = DialbackServer('0.0.0.0', port)

Does this work with 0.0.0.0? Does that become the host IP?

I think I have some high-level questions about clientproxy, but overall this looks like an improvement. I'm thinking we should test this on foopy01 and put it live if it works, and meanwhile talk with Bob about how this is architected later today.
(In reply to comment #5)
> Does this 1024 limit us in any way that we should be aware of?
> I was wondering if we'd catch things mid-stream, but that's probably port
> 20700 rather than the dialback port.

I looked at the data coming in and 1024 seemed roomy enough - but yeah, I could easily bump it to 4k. What I really should do, now that you point it out, is make sure the code can handle two socket read triggers where the second is partial data continued from the prior trigger.

> Does this work with 0.0.0.0? Does that become the host IP?

0.0.0.0 signals that it should listen on all available IPs.

> I'm thinking we should test this on foopy01 and put it live if it works, and
> meanwhile talk with Bob about how this is architected later today.

Cool!
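
A sketch of the partial-read handling described above: buffer whatever recv() returns and only act on complete messages. The newline delimiter is an assumption for illustration, not the actual protocol:

    import asyncore

    class BufferedDialbackHandler(asyncore.dispatcher_with_send):
        def __init__(self, sock=None):
            asyncore.dispatcher_with_send.__init__(self, sock)
            self.buffer = b''

        def handle_read(self):
            self.buffer += self.recv(1024)
            # one read trigger may end mid-message; keep the tail around
            # until the next trigger delivers the rest
            while b'\n' in self.buffer:
                line, self.buffer = self.buffer.split(b'\n', 1)
                self.process_message(line)

        def process_message(self, line):
            print('complete message: %r' % line)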
Comment on attachment 501612 [details] [diff] [review]
add dialback handler and also handle heartbeat going-away

I'm going to r+ this with the knowledge that we may want more enhancements later.
Attachment #501612 - Flags: review?(aki) → review+
Blocks: 609536
Refactored the dialback and heartbeat code (and the rest, actually) to remove a variable-write issue that had snuck in. The new code now uses a singleton for the two main process loops and multiprocess-safe queues for all communication.
Attachment #501612 - Attachment is obsolete: true
Attachment #504697 - Flags: review?(aki)
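
A sketch of that pattern, not the patch itself: a single long-lived loop fed exclusively through a multiprocess-safe queue:

    from multiprocessing import Process, Queue

    def event_loop(events):
        while True:
            tegra, event = events.get()   # blocks until an event arrives
            if event == 'shutdown':
                break
            print('%s: %s' % (tegra, event))

    if __name__ == '__main__':
        events = Queue()                  # safe to share across processes
        loop = Process(target=event_loop, args=(events,))
        loop.start()
        events.put(('tegra-007', 'heartbeat'))
        events.put(('tegra-007', 'shutdown'))
        loop.join()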
Attachment #504697 - Flags: review?(aki) → review+
Added some details to the logs so we can tell in the event loop which tegra we are seeing, added an hbRepeats check so that certain routine log output is only printed every N cycles, and fixed a glitch in the dialback handler that was causing a tegra not to be re-instated if it rebooted after being taken offline by clientproxy.
Attachment #504942 - Flags: review?(aki)
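
The hbRepeats idea in miniature - names and values here are illustrative, not the patch's:

    hbRepeats = 10   # emit routine output only every N heartbeat cycles
    cycle = 0

    def heartbeat_tick(tegra):
        global cycle
        cycle += 1
        if cycle % hbRepeats == 0:
            print('[%s] heartbeat ok (cycle %d)' % (tegra, cycle))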
Comment on attachment 504942 [details] [diff] [review]
add some detail to logs and fix a bug in dialback

I think this is good.
Attachment #504942 - Flags: review?(aki) → review+
The code checks twistd.log for its last activity; if that is more than 14400 seconds (4 hours) ago, it will shut down the slave.
Attachment #505725 - Flags: review?(aki)
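
A sketch of that check, using the log file's mtime as a stand-in for whatever "last activity" test the patch actually performs:

    import os
    import time

    MAX_IDLE = 14400   # 4 hours, per the comment above

    def slave_is_hung(slave_dir):
        log = os.path.join(slave_dir, 'twistd.log')
        if not os.path.exists(log):
            return False
        return (time.time() - os.path.getmtime(log)) > MAX_IDLE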
Realized I had left in some extra debug logging.
Attachment #505725 - Attachment is obsolete: true
Attachment #505729 - Flags: review?(aki)
Attachment #505725 - Flags: review?(aki)
Attachment #505729 - Attachment is patch: true
Attachment #505729 - Attachment mime type: application/octet-stream → text/plain
Comment on attachment 505729 [details] [diff] [review]
added code to monitor twistd.log for last activity

Hm, you don't seem to be using bbHangs at all. Old code?

I would prefer that 14400 be configurable, but I'm happy just to see this -- twistd.log monitoring is along the lines I was brainstorming as well. Have you thought about trying to launch buildbot via twistd rather than build{bot,slave} start? (Not sure how, just know that it's supposed to hang less.)

We don't have any tegras in production, which is bad, but it allows us absolutely free rein in our staging testing. Let's hammer on this, tweak if necessary, and hopefully it gets us patched up enough to stay up. And get some sleep!
Attachment #505729 - Flags: review?(aki) → review+
Comment on attachment 505970 [details] [diff] [review]
update error regex to flag a step as an exception for remote device errors

I assume you explicitly don't want to retry here? (This will be an improvement either way.)
Attachment #505970 - Flags: review?(aki) → review+
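
A sketch of the regex-based classification being reviewed - the pattern and result constants are illustrative, not the actual patch:

    import re

    FAILURE = 'failure'       # a real test failure: red
    EXCEPTION = 'exception'   # an infrastructure problem: exception/purple

    REMOTE_ERROR = re.compile(r'DeviceManager|remote device error', re.I)

    def classify_step(log_text):
        # remote-device errors are infra noise, not test results
        if REMOTE_ERROR.search(log_text):
            return EXCEPTION
        return FAILURE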
committed changeset 1074:7ed7ea2c708c
committed changeset 1236:6dde8b84aa47
The new code forced a refactoring of the event and monitor loops. Now each tegra gets its own monitor loop that maintains state inside its own thread, with *all* events being passed in by a singleton SlaveMgr instance. Also cleaned up error handling and log output to be more consistent.
Attachment #506948 - Flags: review?(aki)
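
A sketch of the layout described above - one monitor thread per tegra owning its own state, with all events funneled in through a queue by a single manager. Class and attribute names are illustrative:

    import threading
    try:
        import Queue as queue   # Python 2, the era of this code
    except ImportError:
        import queue             # Python 3

    class TegraMonitor(threading.Thread):
        def __init__(self, tegra):
            threading.Thread.__init__(self, name=tegra)
            self.daemon = True
            self.events = queue.Queue()   # the sole way state is changed
            self.online = False

        def run(self):
            while True:
                event = self.events.get()
                if event == 'stop':
                    break
                # state lives inside this thread; only events cross in
                self.online = (event == 'dialback')

    class SlaveMgr(object):
        def __init__(self):
            self.monitors = {}

        def add(self, tegra):
            monitor = TegraMonitor(tegra)
            self.monitors[tegra] = monitor
            monitor.start()

        def dispatch(self, tegra, event):
            self.monitors[tegra].events.put(event)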
Comment on attachment 506948 [details] [diff] [review]
added code to handle tegras that are offline

That's a pretty big patch, but I really can't argue with the number of greens we're seeing.
Attachment #506948 - Flags: review?(aki) → review+
committed changeset 1087:5246670fa3b0
Aki - I want to mark this as fixed, since we should file new bugs for anything that happens from this point on.
Go for it. Close out whatever bugs have been fixed. The only one I can think of offhand that we might want to keep open is bug 610600, since it tracks other bug fixes, but if we keep it open it should be re-summarized with a normal priority.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering