Closed Bug 648306 (opened 10 years ago, closed 10 years ago)

buildslave monitor thread gets wedged on long-running clientproxy sessions

Categories: Release Engineering :: General, defect
Platform: ARM (Android)
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: bear; Assignee: bear
Attachments: 2 files, 4 obsolete files

The thread that is responsible for monitoring if/how the buildslave instance is running, and for communicating that to the main tegra monitoring thread, is dying, and I'm not seeing/catching that event properly.

The two choices are to 1) finish making clientproxy a fully multi-threaded app, or 2) refactor it back to a single-threaded app with a monster state machine.

After looking at how much "inside info" is needed to manage and restart the Tegra pool, I've decided to refactor back to a basic single-process app and comment the hell out of the state machine :). This is possible now because the SUTAgent has gotten a *lot* better and more stable, so I don't need the fancy stuff.
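
roughly the shape I have in mind - a hypothetical sketch, with made-up state names and placeholder hooks (none of this is the real sut_tools API), just to show the single-process loop:

    import time

    # hypothetical placeholder hooks - the real ones would talk to SUTAgent
    def sutagent_reachable(tegra): return True
    def start_buildslave(tegra): pass
    def slave_healthy(tegra): return True
    def reboot_tegra(tegra): pass
    def flag_error(tegra): pass

    IDLE, STARTING, RUNNING, REBOOTING, ERROR = range(5)

    def clientproxy_loop(tegra):
        state = IDLE
        while True:
            if state == IDLE:
                state = STARTING if sutagent_reachable(tegra) else ERROR
            elif state == STARTING:
                start_buildslave(tegra)
                state = RUNNING
            elif state == RUNNING:
                if not slave_healthy(tegra):
                    state = REBOOTING
            elif state == REBOOTING:
                reboot_tegra(tegra)
                state = IDLE
            elif state == ERROR:
                flag_error(tegra)    # drop error.flg for check.py to find
                state = IDLE
            time.sleep(60)           # one poll per minute keeps it simple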
Assignee: nobody → bear
this is the first pass.  the next version will include the items pulled from installApp.py, config.py, and reboot.py
Attachment #525747 - Flags: feedback?(aki)
Attachment #525748 - Flags: feedback?(aki)
Comment on attachment 525747 [details]
pulled common code into sut_lib - first pass

>log = logging.getLogger()

I'm guessing this is for pylint or standalone execution?
If so, cool.

>def getOurIP(hostname='bm-foopy.build.mozilla.org'):

Maybe the hardcoded hostname should be bm-remote, since that's already a dependency elsewhere.  bm-foopy is only a dependency for staging slaves, I think.

>def getLastLine(filename):

I'm going to guess we're tail -1'ing twistd.log that often has empty lines?
Adding that to the comment might be useful for the curious.

>def stopSlave(pidFile):

Sounds like you're only having issues with buildslave startup, which is awesome in terms of stopSlave finally working.

>def checkSlaveAlive(bbClient):
>    """Check if the buildslave process is alive.
>    If it is alive, then also check to determine what was the
>    date/time of the last line of it's log output.
>
>    Return None if it's not alive or datetime delta
>    """
>    pidFile = os.path.join(bbClient, 'twistd.pid')
>    if os.path.isfile(pidFile):
>        try:
>            pid = int(file(pidFile, 'r').read().strip())
>            try:
>                os.kill(pid, signal.SIGHUP)
>                return True
>            except OSError:
>            dumpException('check to see if pid %s is active failed' % pid)
>        except:
>            dumpException('unable to check slave - pidfile [%s] was not readable' % pidFile)
>    return False

Does being able to sighup == alive?
Or is that part of this next function?
(In general sighupping buildbot gives me the jibblies but if sighup is harmless in buildbot land, that's cool.)

>def checkSlaveActive(bbClient):
>    logFile  = os.path.join(bbClient, 'twistd.log')
>    lastline = getLastLine(logFile)
>    if len(lastline) > 0:
>        logTS  = datetime.datetime.strptime(lastline[:19], '%Y-%m-%d %H:%M:%S')
>        logTD  = datetime.datetime.now() - logTS
>        return logTD
>    return None

As above, IIRC you needed the sighup to force twistd.log output, which would mean checkSlaveActive depends on checkSlaveAlive.

Overall, cool, avoiding having these as duplicate functions in every script helps.
Attachment #525747 - Flags: feedback?(aki) → feedback+
(In reply to comment #3)
> Comment on attachment 525747 [details]
> pulled common code into sut_lib - first pass
> 
> >log = logging.getLogger()
> 
> I'm guessing this is for pylint or standalone execution?
> If so, cool.

yeah, I find it easier to leave it in so that when you do have file logging it "just works"
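
i.e. (a minimal sketch, not the patch code) the module grabs the root logger, and whatever handlers the calling script configures later just get used:

    import logging

    log = logging.getLogger()    # root logger; no handlers attached here

    def example():
        log.info('this goes wherever the calling script pointed the root logger')

    # a calling script that wants file logging just does, e.g.:
    # logging.basicConfig(filename='clientproxy.log', level=logging.INFO)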

> 
> >def getOurIP(hostname='bm-foopy.build.mozilla.org'):
> 
> Maybe the hardcoded hostname should be bm-remote, since that's already a
> dependency elsewhere.  bm-foopy is only a dependency for staging slaves, I
> think.

oh - good point!  I'll make that change

> 
> >def getLastLine(filename):
> 
> I'm going to guess we're tail -1'ing twistd.log that often has empty lines?
> Adding that to the comment might be useful for the curious.

it can happen in very odd cases as the twistd log will also contain output from the slaves - I'll add the comment
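
something like this - a sketch, not necessarily the exact code in the patch, but it shows where the comment lands:

    import os

    def getLastLine(filename):
        """Return the last non-empty line of the given file.

        twistd.log also carries output from the slaves, so in odd
        cases it can end with blank lines; skip those instead of
        returning an empty string.
        """
        result = ''
        if os.path.isfile(filename):
            for line in open(filename, 'r'):
                line = line.strip()
                if line:
                    result = line
        return result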

> 
> >def stopSlave(pidFile):
> 
> Sounds like you're only having issues with buildslave startup, which is awesome
> in terms of stopSlave finally working.

the change to how we start the slave really helped make killing it saner

> 
> >def checkSlaveAlive(bbClient):
> >    """Check if the buildslave process is alive.
> >    If it is alive, then also check to determine what was the
> >    date/time of the last line of it's log output.
> >
> >    Return None if it's not alive or datetime delta
> >    """
> >    pidFile = os.path.join(bbClient, 'twistd.pid')
> >    if os.path.isfile(pidFile):
> >        try:
> >            pid = int(file(pidFile, 'r').read().strip())
> >            try:
> >                os.kill(pid, signal.SIGHUP)
> >                return True
> >            except OSError:
> >            dumpException('check to see if pid %s is active failed' % pid)
> >        except:
> >            dumpException('unable to check slave - pidfile [%s] was not readable' % pidFile)
> >    return False
> 
> Does being able to sighup == alive?
> Or is that part of this next function?

you will get an exception if the process doesn't exist.  sighup generates log activity which helps the next function
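
for reference (sketch, not from the patch) - signal 0 would test liveness alone, the sighup buys the log activity on top:

    os.kill(pid, 0)              # raises OSError if the pid is gone; delivers no signal
    os.kill(pid, signal.SIGHUP)  # same OSError check, but twistd also writes a log
                                 # line on HUP, which is what checkSlaveActive reads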

> (In general sighupping buildbot gives me the jibblies but if sighup is harmless
> in buildbot land, that's cool.)

it's harmless in buildslave land - not something I can say for sure about the master

> 
> >def checkSlaveActive(bbClient):
> >    logFile  = os.path.join(bbClient, 'twistd.log')
> >    lastline = getLastLine(logFile)
> >    if len(lastline) > 0:
> >        logTS  = datetime.datetime.strptime(lastline[:19], '%Y-%m-%d %H:%M:%S')
> >        logTD  = datetime.datetime.now() - logTS
> >        return logTD
> >    return None
> 
> As above, IIRC you needed the sighup to force twistd.log output, which would
> mean checkSlaveActive depends on checkSlaveAlive.

I am doing the sighup to handle the rare, but real, case where the slave is truly idle
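
e.g. how I expect the two to chain together (a sketch - the staleness threshold and the two hooks are made up):

    import datetime

    STALE = datetime.timedelta(minutes=5)      # hypothetical threshold

    if checkSlaveAlive(bbClient):              # also sighups, forcing fresh log output
        delta = checkSlaveActive(bbClient)     # time since the last twistd.log line
        if delta is None or delta > STALE:
            # the process exists but the log isn't moving even after the sighup
            recoverSlave(bbClient)             # hypothetical recovery hook
    else:
        startSlave(bbClient)                   # hypothetical restart hook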

> 
> Overall, cool, avoiding having these as duplicate functions in every script
> helps.
Comment on attachment 525748 [details]
updated check.py to use sut_lib and also to improve the output

>def checkTegra(tegra):
>    tegraIP   = getIPAddress(tegra)
>    tegraPath = os.path.join(options.bbpath, tegra)
>    errorFile = os.path.join(tegraPath, 'error.flg')
>    proxyFile = os.path.join(tegraPath, 'proxy.flg')
>
>    status = { 'active':    False,
>               'error.flg': False,
>               'proxy.flg': False,
>               'tegra':     tegra,
>               'rebooted':  False,
>             }
>
>    log.debug('%s: %s' % (tegra, tegraIP))
>
>    try:
>        hbSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>        hbSocket.settimeout(float(120))
>        hbSocket.connect((tegraIP, 20700))
>
>        time.sleep(2)
>
>        hbSocket.send('info\n')
>
>        d = hbSocket.recv(4096)
>
>        log.debug('socket data length %d' % len(d))
>        log.debug(d)
>
>        if len(d) > 0:
>            result = True
>
>        if options.reset:
>            status['rebooted'] = True
>            hbSocket.send('rebt\n')
>            # not interested in result
>            # taking advantage of the pause
>            d = hbSocket.recv(4096)

This represents a change in how we use check.py -r -- afaict if we check.py -r on an active production tegra, it'll kill the currently running job due to a reboot.

Is that inaccurate?

I think blindly rebooting a tegra before checking error.flg and checkSlaveActive, just because someone used the -r flag, is a mistake.
This is a pretty big one, but other than that this file looks good.
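
i.e. something along these lines - a sketch reusing the names from the quoted snippet, not a drop-in patch:

    if options.reset:
        # only reboot if the tegra actually looks to be in trouble
        active = checkSlaveActive(tegraPath)   # None, or time since the last log line
        if os.path.isfile(errorFile) or active is None:
            status['rebooted'] = True
            hbSocket.send('rebt\n')
            # not interested in result, taking advantage of the pause
            d = hbSocket.recv(4096)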
Attachment #525748 - Flags: feedback?(aki) → feedback-
(In reply to comment #5)
> Comment on attachment 525748 [details]
> updated check.py to use sut_lib and also to improve the output
> 
> >def checkTegra(tegra):
> >    tegraIP   = getIPAddress(tegra)
> >    tegraPath = os.path.join(options.bbpath, tegra)
> >    errorFile = os.path.join(tegraPath, 'error.flg')
> >    proxyFile = os.path.join(tegraPath, 'proxy.flg')
> >
> >    status = { 'active':    False,
> >               'error.flg': False,
> >               'proxy.flg': False,
> >               'tegra':     tegra,
> >               'rebooted':  False,
> >             }
> >
> >    log.debug('%s: %s' % (tegra, tegraIP))
> >
> >    try:
> >        hbSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> >        hbSocket.settimeout(float(120))
> >        hbSocket.connect((tegraIP, 20700))
> >
> >        time.sleep(2)
> >
> >        hbSocket.send('info\n')
> >
> >        d = hbSocket.recv(4096)
> >
> >        log.debug('socket data length %d' % len(d))
> >        log.debug(d)
> >
> >        if len(d) > 0:
> >            result = True
> >
> >        if options.reset:
> >            status['rebooted'] = True
> >            hbSocket.send('rebt\n')
> >            # not interested in result
> >            # taking advantage of the pause
> >            d = hbSocket.recv(4096)
> 
> This represents a change in how we use check.py -r -- afaict if we check.py -r
> on an active production tegra, it'll kill the currently running job due to a
> reboot.
> 
> Is that inaccurate?
> 
> I think blindly rebooting a tegra before checking error.flg and
> checkSlaveActive, just because someone used the -r flag, is a mistake.
> This is a pretty big one, but other than that this file looks good.

nope - that is a great catch.  I'll fix it.
moved the reboot to a saner location
Attachment #525748 - Attachment is obsolete: true
Attachment #525830 - Flags: feedback?(aki)
Comment on attachment 525830 [details]
updated check.py to use sut_lib and also to improve the output

This looks a lot better.
What happens if proxy.flg is set with -r ?
Also, should we wait to make sure it comes up properly? Or just check again later?
Attachment #525830 - Flags: feedback?(aki) → feedback+
(In reply to comment #8)
> Comment on attachment 525830 [details]
> updated check.py to use sut_lib and also to improve the output
> 
> This looks a lot better.
> What happens if proxy.flg is set with -r ?
> Also, should we wait to make sure it comes up properly? Or just check again
> later?

the -r only clears/reboots for error.flg - proxy.flg is only present during a 
normal reboot so should never appear at the same time as error.flg

I didn't want to slow down the "status check" behaviour that is the major focus of this tool by having it enter a wait-for-return loop
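
so the final shape is roughly this (a sketch, using the names from the earlier snippet):

    if options.reset and status['error.flg'] and not status['proxy.flg']:
        os.remove(errorFile)           # clear the error flag before retrying
        status['rebooted'] = True
        hbSocket.send('rebt\n')        # fire and forget: no wait-for-return loop,
                                       # so the status sweep stays fast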
Attachment #525830 - Attachment is obsolete: true
Attachment #525946 - Flags: review?(aki)
the only piece that is still in a thread is the dialback server
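
e.g. (sketch, with a placeholder target function) the one remaining thread is just a daemon next to the main loop:

    import threading

    def run_dialback_server():      # placeholder for the real dialback listener
        pass

    dialback = threading.Thread(target=run_dialback_server)
    dialback.setDaemon(True)        # let it die with the main process
    dialback.start()

    # ... and everything else stays in the main thread's state machine loop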
Attachment #525947 - Flags: review?(aki)
Blocks: 649994
adds sut_lib.py to repo
refactors installApp, reboot and cleanup to use sut_lib
includes changes to check.py for refactoring and sut_lib
Attachment #525747 - Attachment is obsolete: true
Attachment #525946 - Attachment is obsolete: true
Attachment #525946 - Flags: review?(aki)
Attachment #526002 - Flags: review?(aki)
Comment on attachment 525947 [details]
refactored clientproxy to be single(ish) threaded

Sweet.
How does this look?
Attachment #525947 - Flags: review?(aki) → review+
Attachment #526002 - Flags: review?(aki) → review+
(In reply to comment #13)
> Comment on attachment 525947 [details]
> refactored clientproxy to be single(ish) threaded
> 
> Sweet.
> How does this look?

so far it's running just as stably as before, but with the tests not hanging as often I haven't had a chance to see the one edge case I'm worried about.

the good news is that the soft reset (erase error.flg and try again) *and* the hard reset (reboot tegra) code is now working great.  combine this with the check.py improvements and I'm feeling good about turning tegras over to normal slave duty next Monday.
aki - going to check in the code and set up foopy01 with the very latest to test the sut_tools update against staging
Attachment #525947 - Flags: checked-in+
Comment on attachment 526002 [details] [diff] [review]
add sut_lib.py to repo, refactor sut_tools helper scripts to use common lib

committed changeset 1355:409d955e3493
Attachment #526002 - Flags: checked-in+
this code is now live on bm-foopy, foopy01 - foopy04
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering