Closed
Bug 618363
Opened 14 years ago
Closed 14 years ago
clientproxy.py not detecting down devices
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: mozilla, Assigned: bear)
References
Details
Attachments
(5 files, 4 obsolete files)
53.80 KB,
patch
|
mozilla
:
review+
|
Details | Diff | Splinter Review |
10.42 KB,
patch
|
mozilla
:
review+
|
Details | Diff | Splinter Review |
6.38 KB,
patch
|
mozilla
:
review+
|
Details | Diff | Splinter Review |
604 bytes,
patch
|
mozilla
:
review+
|
Details | Diff | Splinter Review |
26.59 KB,
patch
|
mozilla
:
review+
|
Details | Diff | Splinter Review |
* no proxy.flg
* cannot telnet to tegra-007 20701 (among many other tegras)
* clientproxy still "listening for heartbeat"
* buildbot is sending jobs there, resulting in a ton of reds
This may be worth fixing if it's simple.
If not, this is probably a sign we need to rearchitect this with ateam to take the ini and listen for heartbeat rather than poll.
Also, since a tegra can go down without warning, the sut_tools need to be able to tell clientproxy that they're unable to reach the device, and have clientproxy do a ping or something.
Assignee | ||
Comment 1•14 years ago
|
||
I've been working on a refactoring that is more robust given the changes to devicemanager and sutagent
Assignee | ||
Comment 2•14 years ago
|
||
adds dialback handler so clientproxy is now aware when a tegra is starting up.
adds better support for when tegra disappears and heartbeat fails - now buildslave is stopped and it won't be started until tegra is rebooted.
did not include in this patch extra logging support and some other minor refactoring - I wanted to test this core change first and the other code was distracting me
Attachment #501608 -
Flags: review?(aki)
Assignee | ||
Comment 3•14 years ago
|
||
realized after submitting that I had accidently removed installApp() proxy.flg check in the dialback handler
Attachment #501608 -
Attachment is obsolete: true
Attachment #501610 -
Flags: review?(aki)
Attachment #501608 -
Flags: review?(aki)
Assignee | ||
Comment 4•14 years ago
|
||
bah - shouldn't do last minute patch tweaks at 0300 - last patch was diff'd with changes local that hadn't been double-checked
Attachment #501610 -
Attachment is obsolete: true
Attachment #501612 -
Flags: review?(aki)
Attachment #501610 -
Flags: review?(aki)
Reporter | ||
Comment 5•14 years ago
|
||
Comment on attachment 501612 [details] [diff] [review]
add dialback handler and also handle heartbeat going-away
>+class DialbackHandler(asyncore.dispatcher_with_send):
>+ def handle_read(self):
>+ data = self.recv(1024)
Does this 1024 limit us in any way that we should be aware of?
I was wondering if we'd catch things mid-stream, but that's probably port 20700 rather than the dialback port.
I might be mistaken on both of these counts.
>+ dialback = DialbackServer('0.0.0.0', port)
Does this work with 0.0.0.0? Does that become host ip?
I think I have some high level questions about clientproxy, but this overall looks like an improvement. I'm thinking we should test this on foopy01 and put live if it works, and meanwhile talk with Bob about how this is architected later today.
Assignee | ||
Comment 6•14 years ago
|
||
(In reply to comment #5)
> Comment on attachment 501612 [details] [diff] [review]
> add dialback handler and also handle heartbeat going-away
>
> >+class DialbackHandler(asyncore.dispatcher_with_send):
> >+ def handle_read(self):
> >+ data = self.recv(1024)
>
> Does this 1024 limit us in any way that we should be aware of?
> I was wondering if we'd catch things mid-stream, but that's probably port 20700
> rather than the dialback port.
I looked at the data coming and 1024 seemed roomy enough - but yea, I could easily bump it to 4k. what I really should do, now that you point it out, is make sure the code can handle two socket read triggers where the second half is partial data from the prior trigger.
> I might be mistaken on both of these counts.
>
> >+ dialback = DialbackServer('0.0.0.0', port)
>
> Does this work with 0.0.0.0? Does that become host ip?
0.0.0.0 signals that it should listen on all available ip's
>
> I think I have some high level questions about clientproxy, but this overall
> looks like an improvement. I'm thinking we should test this on foopy01 and put
> live if it works, and meanwhile talk with Bob about how this is architected
> later today.
cool!
Reporter | ||
Comment 7•14 years ago
|
||
Comment on attachment 501612 [details] [diff] [review]
add dialback handler and also handle heartbeat going-away
I'm going to r+ this with the knowledge that we may want more enhancements later.
Attachment #501612 -
Flags: review?(aki) → review+
Assignee | ||
Comment 8•14 years ago
|
||
refactored the dialback and heartbeat code (and the rest actually) to remove a variable write issue that had snuck in. new code now uses singleton for the two main process loops and multiprocess safe queue's for all communication
Attachment #501612 -
Attachment is obsolete: true
Attachment #504697 -
Flags: review?(aki)
Reporter | ||
Updated•14 years ago
|
Attachment #504697 -
Flags: review?(aki) → review+
Assignee | ||
Comment 9•14 years ago
|
||
added some details to the logs to know in eventloop which tegra we are seeing, added a hbRepeats check to only print certain normal log output every N cycles and also fixed a glitch in the dialback handler that was causing the tegra to not get re-instated if it rebooted after being taken offline by cp
Attachment #504942 -
Flags: review?(aki)
Reporter | ||
Comment 10•14 years ago
|
||
Comment on attachment 504942 [details] [diff] [review]
add some detail to logs and fix a bug in dialback
I think this is good.
Attachment #504942 -
Flags: review?(aki) → review+
Assignee | ||
Comment 11•14 years ago
|
||
code checks twistd.log for last activity and if it's more than 14400 seconds (4 hrs) it will shutdown the slave
Attachment #505725 -
Flags: review?(aki)
Assignee | ||
Comment 12•14 years ago
|
||
realized I had left in extra debug logging
Attachment #505725 -
Attachment is obsolete: true
Attachment #505729 -
Flags: review?(aki)
Attachment #505725 -
Flags: review?(aki)
Reporter | ||
Updated•14 years ago
|
Attachment #505729 -
Attachment is patch: true
Attachment #505729 -
Attachment mime type: application/octet-stream → text/plain
Reporter | ||
Comment 13•14 years ago
|
||
Comment on attachment 505729 [details] [diff] [review]
added code to monitor twistd.log for last activity
Hm, you don't seem to be using bbHangs at all. Old code?
I would prefer that 14400 be configurable, but I'm happy just to see this -- twistd.log monitoring is along the lines I was brainstorming as well.
Have you thought about trying to launch buildbot via twistd rather than build{bot,slave} start? (Not sure how, just know that it's supposed to hang less.)
We don't have any tegras in production, which is bad, but allows us absolute free reign in our staging testing. Let's hammer on this, tweak if necessary, and hopefully it gets us patched up enough to stay up.
And get some sleep!
Attachment #505729 -
Flags: review?(aki) → review+
Assignee | ||
Comment 14•14 years ago
|
||
Attachment #505970 -
Flags: review?(aki)
Reporter | ||
Comment 15•14 years ago
|
||
Comment on attachment 505970 [details] [diff] [review]
update error regex to flag a step as an exception for remote device errors
I assume you explicitly don't want to retry here?
(this will be an improvement either way)
Attachment #505970 -
Flags: review?(aki) → review+
Assignee | ||
Comment 16•14 years ago
|
||
committed changeset 1074:7ed7ea2c708c
Assignee | ||
Comment 17•14 years ago
|
||
committed changeset 1236:6dde8b84aa47
Assignee | ||
Comment 18•14 years ago
|
||
The new code forced a refactoring of the event and monitor loops. Now each tegra get's it's own monitor loop that maintains state inside of that thread with *all* events being passed in by a singleton SlaveMgr instance.
Cleaned up error handling and log output to be more consistent.
Attachment #506948 -
Flags: review?(aki)
Reporter | ||
Comment 19•14 years ago
|
||
Comment on attachment 506948 [details] [diff] [review]
added code to handle tegra's who are offline
That's a pretty big patch, but I really can't argue with the number of green's we're seeing.
Attachment #506948 -
Flags: review?(aki) → review+
Assignee | ||
Comment 20•14 years ago
|
||
committed changeset 1087:5246670fa3b0
Assignee | ||
Comment 21•14 years ago
|
||
Aki - I want to mark this as fixed as we should file new bugs for anything that happens from this point on.
Reporter | ||
Comment 22•14 years ago
|
||
Go for it. Clean up whatever bugs that have been fixed. The only one I can think of offhand that we might want to keep is bug 610600, since it tracks other bug fixes, but if we keep that open it should be re-summarized with a normal priority.
Assignee | ||
Updated•14 years ago
|
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated•12 years ago
|
Product: mozilla.org → Release Engineering
You need to log in
before you can comment on or make changes to this bug.
Description
•