Closed
Bug 648306
Opened 14 years ago
Closed 14 years ago
buildslave monitor thread gets wedged on long running clientproxy sessions
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bear, Assigned: bear)
References
Details
Attachments
(2 files, 4 obsolete files)
16.24 KB, text/plain | mozilla: review+ | bear: checked-in+ | Details
24.05 KB, patch | mozilla: review+ | bear: checked-in+ | Details | Diff | Splinter Review
The thread responsible for monitoring if/how the buildslave instance is running, and for communicating that to the main tegra monitoring thread, is dying, and I'm not seeing/catching that event properly.
The two choices are to 1) finish making cp a complete multi-threaded app, or 2) refactor it back to a single-threaded app with a monster state machine.
After looking at how much "inside info" is needed to manage and restart the Tegra pool, I've decided to refactor back to a basic single-process app and comment the hell out of the state machine :). This is possible now because the SUTAgent has gotten a *lot* better and more stable, so I don't need the fancy stuff.
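The single-process refactor described above can be sketched as a small explicit state machine. This is purely illustrative: the state names and transition rules here are assumptions for the sketch, not the actual clientproxy code.

```python
# Hypothetical sketch of a single-process state machine for managing one
# tegra's buildslave lifecycle; states and transitions are illustrative.
IDLE, STARTING, RUNNING, ERROR, REBOOTING = range(5)

def next_state(state, slave_alive, error_flag):
    """Pick the next state from simple observations of the slave.

    slave_alive: is the buildslave process running?
    error_flag:  is an error.flg file present?
    """
    if error_flag:
        return ERROR
    if state in (IDLE, STARTING):
        return RUNNING if slave_alive else STARTING
    if state == RUNNING:
        return RUNNING if slave_alive else ERROR
    if state == ERROR:
        return REBOOTING     # hard reset: reboot the tegra
    if state == REBOOTING:
        return STARTING      # after reboot, try to start the slave again
    return IDLE
```

A main loop would call `next_state` once per poll interval, which keeps all the "inside info" in one commented function rather than spread across threads.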
Updated•14 years ago
Assignee: nobody → bear
Assignee
Comment 1•14 years ago
This is the first pass; the next version will include the items pulled from the installApp, config, and reboot .py's.
Attachment #525747 - Flags: feedback?(aki)
Assignee
Comment 2•14 years ago
Attachment #525748 - Flags: feedback?(aki)
Comment 3•14 years ago
Comment on attachment 525747 [details]
pulled common code into sut_lib - first pass
>log = logging.getLogger()
I'm guessing this is for pylint or standalone execution?
If so, cool.
>def getOurIP(hostname='bm-foopy.build.mozilla.org'):
Maybe the hardcoded hostname should be bm-remote, since that's already a dependency elsewhere. bm-foopy is only a dependency for staging slaves, I think.
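For the curious, a common way to implement a `getOurIP`-style helper is to open a UDP socket toward a known host and read back the socket's own address. This is a sketch of that technique, not the attachment's exact code; the `bm-remote` default hostname reflects the suggestion above and is an assumption here.

```python
import socket

def getOurIP(hostname='bm-remote.build.mozilla.org'):
    """Return the local IP used to reach hostname, or None on failure.

    A UDP connect() sends no packets; it just selects a route, so
    getsockname() reveals which local address would be used.
    """
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.connect((hostname, 80))
        ip = s.getsockname()[0]
        s.close()
        return ip
    except socket.error:
        return None
```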
>def getLastLine(filename):
I'm going to guess we're tail -1'ing twistd.log that often has empty lines?
Adding that to the comment might be useful for the curious.
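A minimal sketch of `getLastLine` as the review describes it — return the last non-empty line, since twistd.log often ends in blank lines. This is an illustration of the idea, assuming that behavior, not the attachment's exact implementation.

```python
import os

def getLastLine(filename):
    """Return the last non-empty line of filename, or '' if none.

    twistd.log can end with blank lines, so we walk backwards past them.
    """
    if not os.path.isfile(filename):
        return ''
    with open(filename) as f:
        for line in reversed(f.readlines()):
            stripped = line.strip()
            if stripped:
                return stripped
    return ''
```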
>def stopSlave(pidFile):
Sounds like you're only having issues with buildslave startup, which is awesome in terms of stopSlave finally working.
>def checkSlaveAlive(bbClient):
> """Check if the buildslave process is alive.
> If it is alive, then also check to determine what was the
> date/time of the last line of it's log output.
>
> Return None if it's not alive or datetime delta
> """
> pidFile = os.path.join(bbClient, 'twistd.pid')
> if os.path.isfile(pidFile):
> try:
> pid = int(file(pidFile, 'r').read().strip())
> try:
> os.kill(pid, signal.SIGHUP)
> return True
> except OSError:
> dumpException('check to see if pid %s is active failed')
> except:
> dumpException('unable to check slave - pidfile [%s] was not readable' % pidFile)
> return False
Does being able to sighup == alive?
Or is that part of this next function?
(In general sighupping buildbot gives me the jibblies but if sighup is harmless in buildbot land, that's cool.)
>def checkSlaveActive(bbClient):
> logFile = os.path.join(bbClient, 'twistd.log')
> lastline = getLastLine(logFile)
> if len(lastline) > 0:
> logTS = datetime.datetime.strptime(lastline[:19], '%Y-%m-%d %H:%M:%S')
> logTD = datetime.datetime.now() - logTS
> return logTD
> return None
As above, IIRC you needed the sighup to force twistd.log output, which would mean checkSlaveActive depends on checkSlaveAlive.
Overall, cool, avoiding having these as duplicate functions in every script helps.
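On the "does being able to SIGHUP == alive?" question: a side-effect-free variant, sketched here as an alternative rather than as the patch's code, is to send signal 0 — the kernel performs the existence/permission check but delivers nothing, so no log activity is forced.

```python
import os

def isPidAlive(pid):
    """Return True if a process with this pid exists.

    Signal 0 is a probe: os.kill raises OSError if the pid is gone
    (or not ours), but never actually signals the process.
    """
    try:
        os.kill(pid, 0)
        return True
    except OSError:
        return False
```

The trade-off, per the discussion below, is that SIGHUP also forces a fresh twistd.log line, which the activity check relies on; signal 0 would not.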
Attachment #525747 - Flags: feedback?(aki) → feedback+
Assignee
Comment 4•14 years ago
(In reply to comment #3)
> Comment on attachment 525747 [details]
> pulled common code into sut_lib - first pass
>
> >log = logging.getLogger()
>
> I'm guessing this is for pylint or standalone execution?
> If so, cool.
Yeah, I find it easier to leave it in so that when you do have file logging it "just works".
>
> >def getOurIP(hostname='bm-foopy.build.mozilla.org'):
>
> Maybe the hardcoded hostname should be bm-remote, since that's already a
> dependency elsewhere. bm-foopy is only a dependency for staging slaves, I
> think.
Oh - good point! I'll make that change.
>
> >def getLastLine(filename):
>
> I'm going to guess we're tail -1'ing twistd.log that often has empty lines?
> Adding that to the comment might be useful for the curious.
It can happen in very odd cases, as the twistd log will also contain output from the slaves - I'll add the comment.
>
> >def stopSlave(pidFile):
>
> Sounds like you're only having issues with buildslave startup, which is awesome
> in terms of stopSlave finally working.
The change to how we start the slave really helped make killing it saner.
>
> >def checkSlaveAlive(bbClient):
> > """Check if the buildslave process is alive.
> > If it is alive, then also check to determine what was the
> > date/time of the last line of it's log output.
> >
> > Return None if it's not alive or datetime delta
> > """
> > pidFile = os.path.join(bbClient, 'twistd.pid')
> > if os.path.isfile(pidFile):
> > try:
> > pid = int(file(pidFile, 'r').read().strip())
> > try:
> > os.kill(pid, signal.SIGHUP)
> > return True
> > except OSError:
> > dumpException('check to see if pid %s is active failed')
> > except:
> > dumpException('unable to check slave - pidfile [%s] was not readable' % pidFile)
> > return False
>
> Does being able to sighup == alive?
> Or is that part of this next function?
You will get an exception if the process doesn't exist. The SIGHUP also generates log activity, which helps the next function.
> (In general sighupping buildbot gives me the jibblies but if sighup is harmless
> in buildbot land, that's cool.)
It's harmless in buildslave land - not something I can say for sure about the master.
>
> >def checkSlaveActive(bbClient):
> > logFile = os.path.join(bbClient, 'twistd.log')
> > lastline = getLastLine(logFile)
> > if len(lastline) > 0:
> > logTS = datetime.datetime.strptime(lastline[:19], '%Y-%m-%d %H:%M:%S')
> > logTD = datetime.datetime.now() - logTS
> > return logTD
> > return None
>
> As above, IIRC you needed the sighup to force twistd.log output, which would
> mean checkSlaveActive depends on checkSlaveAlive.
I am doing the SIGHUP to handle the rare, but real, case where the slave is truly idle.
>
> Overall, cool, avoiding having these as duplicate functions in every script
> helps.
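The dependency described above — SIGHUP forces a fresh twistd.log line, then `checkSlaveActive` measures the age of that line — hinges on parsing the leading `YYYY-MM-DD HH:MM:SS` timestamp. A self-contained sketch of that parsing step (a hypothetical helper, not the attachment's function):

```python
import datetime

def logAgeFromLastLine(lastline, now=None):
    """Return the age of a twistd.log line's leading timestamp, or None.

    twistd.log lines start with 'YYYY-MM-DD HH:MM:SS' (19 characters);
    anything shorter can't carry a timestamp, so we return None.
    """
    if len(lastline) < 19:
        return None
    logTS = datetime.datetime.strptime(lastline[:19], '%Y-%m-%d %H:%M:%S')
    return (now or datetime.datetime.now()) - logTS
```

A caller would treat a large delta as "slave wedged" and a small one as "slave active" — which only works if something (the SIGHUP) guarantees recent log output on an otherwise idle slave.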
Comment 5•14 years ago
Comment on attachment 525748 [details]
updated check.py to use sut_lib and also to improve the output
>def checkTegra(tegra):
> tegraIP = getIPAddress(tegra)
> tegraPath = os.path.join(options.bbpath, tegra)
> errorFile = os.path.join(tegraPath, 'error.flg')
> proxyFile = os.path.join(tegraPath, 'proxy.flg')
>
> status = { 'active': False,
> 'error.flg': False,
> 'proxy.flg': False,
> 'tegra': tegra,
> 'rebooted': False,
> }
>
> log.debug('%s: %s' % (tegra, tegraIP))
>
> try:
> hbSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> hbSocket.settimeout(float(120))
> hbSocket.connect((tegraIP, 20700))
>
> time.sleep(2)
>
> hbSocket.send('info\n')
>
> d = hbSocket.recv(4096)
>
> log.debug('socket data length %d' % len(d))
> log.debug(d)
>
> if len(d) > 0:
> result = True
>
> if options.reset:
> status['rebooted'] = True
> hbSocket.send('rebt\n')
> # not interested in result
> # taking advantage of the pause
> d = hbSocket.recv(4096)
This represents a change in how we use check.py -r -- afaict, if we run check.py -r on an active production tegra, it'll kill the currently running job due to a reboot.
Is that inaccurate?
I think blindly rebooting a tegra before checking error.flg and checkSlaveActive, just because someone used the -r flag, is a mistake.
This is a pretty big one, but other than that this file looks good.
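The guarded ordering this review asks for — check error.flg and slave activity before honoring -r — can be sketched as a small predicate. The names here (`reset_requested`, `error_flg_present`, `slave_active`, `send_reboot`) are hypothetical stand-ins for the script's actual variables and socket call.

```python
def maybeReboot(reset_requested, error_flg_present, slave_active, send_reboot):
    """Reboot only when -r was given, error.flg exists, and the slave
    is not actively running a job. Returns True if a reboot was sent."""
    if reset_requested and error_flg_present and not slave_active:
        send_reboot()   # e.g. 'rebt\n' over the SUTAgent heartbeat socket
        return True
    return False
```

This way `check.py -r` on a healthy, busy production tegra is a no-op instead of killing the running job.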
Attachment #525748 - Flags: feedback?(aki) → feedback-
Assignee
Comment 6•14 years ago
(In reply to comment #5)
> Comment on attachment 525748 [details]
> updated check.py to use sut_lib and also to improve the output
>
> >def checkTegra(tegra):
> > tegraIP = getIPAddress(tegra)
> > tegraPath = os.path.join(options.bbpath, tegra)
> > errorFile = os.path.join(tegraPath, 'error.flg')
> > proxyFile = os.path.join(tegraPath, 'proxy.flg')
> >
> > status = { 'active': False,
> > 'error.flg': False,
> > 'proxy.flg': False,
> > 'tegra': tegra,
> > 'rebooted': False,
> > }
> >
> > log.debug('%s: %s' % (tegra, tegraIP))
> >
> > try:
> > hbSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> > hbSocket.settimeout(float(120))
> > hbSocket.connect((tegraIP, 20700))
> >
> > time.sleep(2)
> >
> > hbSocket.send('info\n')
> >
> > d = hbSocket.recv(4096)
> >
> > log.debug('socket data length %d' % len(d))
> > log.debug(d)
> >
> > if len(d) > 0:
> > result = True
> >
> > if options.reset:
> > status['rebooted'] = True
> > hbSocket.send('rebt\n')
> > # not interested in result
> > # taking advantage of the pause
> > d = hbSocket.recv(4096)
>
> This represents a change in how we use check.py -r -- afaict if we check.py -r
> on an active production tegra, it'll kill the currently running job due to a
> reboot.
>
> Is that inaccurate?
>
> I think blindly rebooting a tegra before checking error.flg and
> checkSlaveActive, just because someone used the -r flag, is a mistake.
> This is a pretty big one, but other than that this file looks good.
Nope - that is a great catch. I'll fix it.
Assignee
Comment 7•14 years ago
Moved the reboot to a saner location.
Attachment #525748 - Attachment is obsolete: true
Attachment #525830 - Flags: feedback?(aki)
Comment 8•14 years ago
Comment on attachment 525830 [details]
updated check.py to use sut_lib and also to improve the output
This looks a lot better.
What happens if proxy.flg is set with -r ?
Also, should we wait to make sure it comes up properly? Or just check again later?
Attachment #525830 - Flags: feedback?(aki) → feedback+
Assignee
Comment 9•14 years ago
(In reply to comment #8)
> Comment on attachment 525830 [details]
> updated check.py to use sut_lib and also to improve the output
>
> This looks a lot better.
> What happens if proxy.flg is set with -r ?
> Also, should we wait to make sure it comes up properly? Or just check again
> later?
The -r only clears/reboots for error.flg; proxy.flg is only present during a normal reboot, so it should never appear at the same time as error.flg.
I didn't want to slow down the "status check" behaviour that is the major focus of this tool by having it enter a wait-for-return loop.
Assignee
Comment 10•14 years ago
Attachment #525830 - Attachment is obsolete: true
Attachment #525946 - Flags: review?(aki)
Assignee
Comment 11•14 years ago
The only piece that is still in a thread is the dialback server.
Attachment #525947 - Flags: review?(aki)
Assignee
Comment 12•14 years ago
adds sut_lib.py to repo
refactors installApp, reboot and cleanup to use sut_lib
includes changes to check.py for refactoring and sut_lib
Attachment #525747 - Attachment is obsolete: true
Attachment #525946 - Attachment is obsolete: true
Attachment #525946 - Flags: review?(aki)
Attachment #526002 - Flags: review?(aki)
Comment 13•14 years ago
Comment on attachment 525947 [details]
refactored clientproxy to be single(ish) threaded
Sweet.
How does this look?
Attachment #525947 - Flags: review?(aki) → review+
Updated•14 years ago
Attachment #526002 - Flags: review?(aki) → review+
Assignee
Comment 14•14 years ago
(In reply to comment #13)
> Comment on attachment 525947 [details]
> refactored clientproxy to be single(ish) threaded
>
> Sweet.
> How does this look?
So far it's running just as stably as before, but with the tests not hanging as often I haven't had a chance to see the one edge case I'm worried about.
The good news is that both the soft reset (erase error.flg and try again) *and* the hard reset (reboot the tegra) code are now working great. Combine this with the check.py improvements and I'm feeling good about turning the tegras over to normal slave duty next Monday.
Assignee
Comment 15•14 years ago
Aki - going to check in the code and set up foopy01 with the very latest, to test the sut_tools update against staging.
Assignee
Updated•14 years ago
Attachment #525947 - Flags: checked-in+
Assignee
Comment 16•14 years ago
Comment on attachment 526002 [details] [diff] [review]
add sut_lib.py to repo, refactor sut_tools helper scripts to use common lib
committed changeset 1355:409d955e3493
Attachment #526002 - Flags: checked-in+
Assignee
Comment 17•14 years ago
This code is now live on bm-foopy and foopy01 - foopy04.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated•12 years ago
|
Product: mozilla.org → Release Engineering