Closed Bug 648306 (opened 10 years ago, closed 10 years ago)

buildslave monitor thread gets wedged on long-running clientproxy sessions

Categories: Release Engineering :: General, defect
Platform: ARM (Android)
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: bear; Assignee: bear
Attachments: 2 files, 4 obsolete files

The thread that is responsible for monitoring if/how the buildslave instance is running, and for communicating that to the main tegra monitoring thread, is dying, and I'm not seeing/catching that event properly.

The two choices are to 1) finish making clientproxy a fully multi-threaded app, or 2) refactor it back to a single-threaded app with a monster state machine.

After looking at how much "inside info" is needed to manage and restart the Tegra pool, I've decided to refactor back to a basic single-process app and comment the hell out of the state machine :). This is possible now because the SUTAgent has gotten a *lot* better and more stable, so I don't need the fancy stuff.
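
roughly the shape I have in mind - a hypothetical sketch, with made-up state names and placeholder hooks (none of this is the real sut_tools API), just to show the single-process loop:

    import time

    # hypothetical placeholder hooks - the real ones would talk to SUTAgent
    def sutagent_reachable(tegra): return True
    def start_buildslave(tegra): pass
    def slave_healthy(tegra): return True
    def reboot_tegra(tegra): pass
    def flag_error(tegra): pass

    IDLE, STARTING, RUNNING, REBOOTING, ERROR = range(5)

    def clientproxy_loop(tegra):
        state = IDLE
        while True:
            if state == IDLE:
                state = STARTING if sutagent_reachable(tegra) else ERROR
            elif state == STARTING:
                start_buildslave(tegra)
                state = RUNNING
            elif state == RUNNING:
                if not slave_healthy(tegra):
                    state = REBOOTING
            elif state == REBOOTING:
                reboot_tegra(tegra)
                state = IDLE
            elif state == ERROR:
                flag_error(tegra)    # drop error.flg for check.py to find
                state = IDLE
            time.sleep(60)           # one poll per minute keeps it simple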
Assignee: nobody → bear
this is the first pass.  the next version will include the items pulled from installApp.py, config.py, and reboot.py
Attachment #525747 - Flags: feedback?(aki)
Attachment #525748 - Flags: feedback?(aki)
Comment on attachment 525747 [details]
pulled common code into sut_lib - first pass

>log = logging.getLogger()

I'm guessing this is for pylint or standalone execution?
If so, cool.

>def getOurIP(hostname='bm-foopy.build.mozilla.org'):

Maybe the hardcoded hostname should be bm-remote, since that's already a dependency elsewhere.  bm-foopy is only a dependency for staging slaves, I think.

>def getLastLine(filename):

I'm going to guess we're tail -1'ing twistd.log that often has empty lines?
Adding that to the comment might be useful for the curious.

>def stopSlave(pidFile):

Sounds like you're only having issues with buildslave startup, which is awesome in terms of stopSlave finally working.

>def checkSlaveAlive(bbClient):
>    """Check if the buildslave process is alive.
>    If it is alive, then also check to determine what was the
>    date/time of the last line of it's log output.
>
>    Return None if it's not alive or datetime delta
>    """
>    pidFile = os.path.join(bbClient, 'twistd.pid')
>    if os.path.isfile(pidFile):
>        try:
>            pid = int(file(pidFile, 'r').read().strip())
>            try:
>                os.kill(pid, signal.SIGHUP)
>                return True
>            except OSError:
>            dumpException('check to see if pid %s is active failed' % pid)
>        except:
>            dumpException('unable to check slave - pidfile [%s] was not readable' % pidFile)
>    return False

Does being able to sighup == alive?
Or is that part of this next function?
(In general sighupping buildbot gives me the jibblies but if sighup is harmless in buildbot land, that's cool.)

>def checkSlaveActive(bbClient):
>    logFile  = os.path.join(bbClient, 'twistd.log')
>    lastline = getLastLine(logFile)
>    if len(lastline) > 0:
>        logTS  = datetime.datetime.strptime(lastline[:19], '%Y-%m-%d %H:%M:%S')
>        logTD  = datetime.datetime.now() - logTS
>        return logTD
>    return None

As above, IIRC you needed the sighup to force twistd.log output, which would mean checkSlaveActive depends on checkSlaveAlive.

Overall, cool, avoiding having these as duplicate functions in every script helps.
Attachment #525747 - Flags: feedback?(aki) → feedback+
(In reply to comment #3)
> Comment on attachment 525747 [details]
> pulled common code into sut_lib - first pass
> 
> >log = logging.getLogger()
> 
> I'm guessing this is for pylint or standalone execution?
> If so, cool.

yeah, I find it easier to leave it in so that when you do have file logging it "just works"
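
i.e. (a minimal sketch, not the patch code) the module grabs the root logger, and whatever handlers the calling script configures later just get used:

    import logging

    log = logging.getLogger()    # root logger; no handlers attached here

    def example():
        log.info('this goes wherever the calling script pointed the root logger')

    # a calling script that wants file logging just does, e.g.:
    # logging.basicConfig(filename='clientproxy.log', level=logging.INFO)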

> 
> >def getOurIP(hostname='bm-foopy.build.mozilla.org'):
> 
> Maybe the hardcoded hostname should be bm-remote, since that's already a
> dependency elsewhere.  bm-foopy is only a dependency for staging slaves, I
> think.

oh - good point!  I'll make that change

> 
> >def getLastLine(filename):
> 
> I'm going to guess we're tail -1'ing twistd.log that often has empty lines?
> Adding that to the comment might be useful for the curious.

it can happen in very odd cases as the twistd log will also contain output from the slaves - I'll add the comment
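
something like this - a sketch, not necessarily the exact code in the patch, but it shows where the comment lands:

    import os

    def getLastLine(filename):
        """Return the last non-empty line of the given file.

        twistd.log also carries output from the slaves, so in odd
        cases it can end with blank lines; skip those instead of
        returning an empty string.
        """
        result = ''
        if os.path.isfile(filename):
            for line in open(filename, 'r'):
                line = line.strip()
                if line:
                    result = line
        return result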

> 
> >def stopSlave(pidFile):
> 
> Sounds like you're only having issues with buildslave startup, which is awesome
> in terms of stopSlave finally working.

the change to how we start the slave really helped make killing it saner

> 
> >def checkSlaveAlive(bbClient):
> >    """Check if the buildslave process is alive.
> >    If it is alive, then also check to determine what was the
> >    date/time of the last line of it's log output.
> >
> >    Return None if it's not alive or datetime delta
> >    """
> >    pidFile = os.path.join(bbClient, 'twistd.pid')
> >    if os.path.isfile(pidFile):
> >        try:
> >            pid = int(file(pidFile, 'r').read().strip())
> >            try:
> >                os.kill(pid, signal.SIGHUP)
> >                return True
> >            except OSError:
> >            dumpException('check to see if pid %s is active failed' % pid)
> >        except:
> >            dumpException('unable to check slave - pidfile [%s] was not readable' % pidFile)
> >    return False
> 
> Does being able to sighup == alive?
> Or is that part of this next function?

you will get an exception if the process doesn't exist.  sighup generates log activity which helps the next function
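
for reference (sketch, not from the patch) - signal 0 would test liveness alone, the sighup buys the log activity on top:

    os.kill(pid, 0)              # raises OSError if the pid is gone; delivers no signal
    os.kill(pid, signal.SIGHUP)  # same OSError check, but twistd also writes a log
                                 # line on HUP, which is what checkSlaveActive reads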

> (In general sighupping buildbot gives me the jibblies but if sighup is harmless
> in buildbot land, that's cool.)

it's harmless in buildslave land - not something I can say for sure about the master

> 
> >def checkSlaveActive(bbClient):
> >    logFile  = os.path.join(bbClient, 'twistd.log')
> >    lastline = getLastLine(logFile)
> >    if len(lastline) > 0:
> >        logTS  = datetime.datetime.strptime(lastline[:19], '%Y-%m-%d %H:%M:%S')
> >        logTD  = datetime.datetime.now() - logTS
> >        return logTD
> >    return None
> 
> As above, IIRC you needed the sighup to force twistd.log output, which would
> mean checkSlaveActive depends on checkSlaveAlive.

I am doing the sighup to handle the rare, but real, case where the slave is truly idle
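
e.g. how I expect the two to chain together (a sketch - the staleness threshold and the two hooks are made up):

    import datetime

    STALE = datetime.timedelta(minutes=5)      # hypothetical threshold

    if checkSlaveAlive(bbClient):              # also sighups, forcing fresh log output
        delta = checkSlaveActive(bbClient)     # time since the last twistd.log line
        if delta is None or delta > STALE:
            # the process exists but the log isn't moving even after the sighup
            recoverSlave(bbClient)             # hypothetical recovery hook
    else:
        startSlave(bbClient)                   # hypothetical restart hook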

> 
> Overall, cool, avoiding having these as duplicate functions in every script
> helps.
Comment on attachment 525748 [details]
updated check.py to use sut_lib and also to improve the output

>def checkTegra(tegra):
>    tegraIP   = getIPAddress(tegra)
>    tegraPath = os.path.join(options.bbpath, tegra)
>    errorFile = os.path.join(tegraPath, 'error.flg')
>    proxyFile = os.path.join(tegraPath, 'proxy.flg')
>
>    status = { 'active':    False,
>               'error.flg': False,
>               'proxy.flg': False,
>               'tegra':     tegra,
>               'rebooted':  False,
>             }
>
>    log.debug('%s: %s' % (tegra, tegraIP))
>
>    try:
>        hbSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>        hbSocket.settimeout(float(120))
>        hbSocket.connect((tegraIP, 20700))
>
>        time.sleep(2)
>
>        hbSocket.send('info\n')
>
>        d = hbSocket.recv(4096)
>
>        log.debug('socket data length %d' % len(d))
>        log.debug(d)
>
>        if len(d) > 0:
>            result = True
>
>        if options.reset:
>            status['rebooted'] = True
>            hbSocket.send('rebt\n')
>            # not interested in result
>            # taking advantage of the pause
>            d = hbSocket.recv(4096)

This represents a change in how we use check.py -r -- afaict if we check.py -r on an active production tegra, it'll kill the currently running job due to a reboot.

Is that inaccurate?

I think blindly rebooting a tegra before checking error.flg and checkSlaveActive, just because someone used the -r flag, is a mistake.
This is a pretty big one, but other than that this file looks good.
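
i.e. something along these lines - a sketch reusing the names from the quoted snippet, not a drop-in patch:

    if options.reset:
        # only reboot if the tegra actually looks to be in trouble
        active = checkSlaveActive(tegraPath)   # None, or time since the last log line
        if os.path.isfile(errorFile) or active is None:
            status['rebooted'] = True
            hbSocket.send('rebt\n')
            # not interested in result, taking advantage of the pause
            d = hbSocket.recv(4096)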
Attachment #525748 - Flags: feedback?(aki) → feedback-
(In reply to comment #5)
> Comment on attachment 525748 [details]
> updated check.py to use sut_lib and also to improve the output
> 
> >def checkTegra(tegra):
> >    tegraIP   = getIPAddress(tegra)
> >    tegraPath = os.path.join(options.bbpath, tegra)
> >    errorFile = os.path.join(tegraPath, 'error.flg')
> >    proxyFile = os.path.join(tegraPath, 'proxy.flg')
> >
> >    status = { 'active':    False,
> >               'error.flg': False,
> >               'proxy.flg': False,
> >               'tegra':     tegra,
> >               'rebooted':  False,
> >             }
> >
> >    log.debug('%s: %s' % (tegra, tegraIP))
> >
> >    try:
> >        hbSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> >        hbSocket.settimeout(float(120))
> >        hbSocket.connect((tegraIP, 20700))
> >
> >        time.sleep(2)
> >
> >        hbSocket.send('info\n')
> >
> >        d = hbSocket.recv(4096)
> >
> >        log.debug('socket data length %d' % len(d))
> >        log.debug(d)
> >
> >        if len(d) > 0:
> >            result = True
> >
> >        if options.reset:
> >            status['rebooted'] = True
> >            hbSocket.send('rebt\n')
> >            # not interested in result
> >            # taking advantage of the pause
> >            d = hbSocket.recv(4096)
> 
> This represents a change in how we use check.py -r -- afaict if we check.py -r
> on an active production tegra, it'll kill the currently running job due to a
> reboot.
> 
> Is that inaccurate?
> 
> I think blindly rebooting a tegra before checking error.flg and
> checkSlaveActive, just because someone used the -r flag, is a mistake.
> This is a pretty big one, but other than that this file looks good.

nope - that is a great catch.  I'll fix it.
moved the reboot to a saner location
Attachment #525748 - Attachment is obsolete: true
Attachment #525830 - Flags: feedback?(aki)
Comment on attachment 525830 [details]
updated check.py to use sut_lib and also to improve the output

This looks a lot better.
What happens if proxy.flg is set with -r ?
Also, should we wait to make sure it comes up properly? Or just check again later?
Attachment #525830 - Flags: feedback?(aki) → feedback+
(In reply to comment #8)
> Comment on attachment 525830 [details]
> updated check.py to use sut_lib and also to improve the output
> 
> This looks a lot better.
> What happens if proxy.flg is set with -r ?
> Also, should we wait to make sure it comes up properly? Or just check again
> later?

the -r only clears/reboots for error.flg - proxy.flg is only present during a 
normal reboot so should never appear at the same time as error.flg

I didn't want to slow down the "status check" behaviour that is the major focus of this tool by having it enter a wait-for-return loop
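
so the final shape is roughly this (a sketch, using the names from the earlier snippet):

    if options.reset and status['error.flg'] and not status['proxy.flg']:
        os.remove(errorFile)           # clear the error flag before retrying
        status['rebooted'] = True
        hbSocket.send('rebt\n')        # fire and forget: no wait-for-return loop,
                                       # so the status sweep stays fast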
Attachment #525830 - Attachment is obsolete: true
Attachment #525946 - Flags: review?(aki)
the only piece that is still in a thread is the dialback server
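
e.g. (sketch, with a placeholder target function) the one remaining thread is just a daemon next to the main loop:

    import threading

    def run_dialback_server():      # placeholder for the real dialback listener
        pass

    dialback = threading.Thread(target=run_dialback_server)
    dialback.setDaemon(True)        # let it die with the main process
    dialback.start()

    # ... and everything else stays in the main thread's state machine loop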
Attachment #525947 - Flags: review?(aki)
Blocks: 649994
adds sut_lib.py to repo
refactors installApp, reboot and cleanup to use sut_lib
includes changes to check.py for refactoring and sut_lib
Attachment #525747 - Attachment is obsolete: true
Attachment #525946 - Attachment is obsolete: true
Attachment #525946 - Flags: review?(aki)
Attachment #526002 - Flags: review?(aki)
Comment on attachment 525947 [details]
refactored clientproxy to be single(ish) threaded

Sweet.
How does this look?
Attachment #525947 - Flags: review?(aki) → review+
Attachment #526002 - Flags: review?(aki) → review+
(In reply to comment #13)
> Comment on attachment 525947 [details]
> refactored clientproxy to be single(ish) threaded
> 
> Sweet.
> How does this look?

so far it's running just as stably as before, but with the tests not hanging as often I haven't had a chance to see the one edge case I'm worried about.

the good news is that the soft reset (erase error.flg and try again) *and* the hard reset (reboot tegra) code is now working great.  combine this with the check.py improvements and I'm feeling good about turning tegras over to normal slave duty next Monday.
aki - going to check in the code and set up foopy01 with the very latest to test the sut_tools update against staging
Attachment #525947 - Flags: checked-in+
Comment on attachment 526002 [details] [diff] [review]
add sut_lib.py to repo, refactor sut_tools helper scripts to use common lib

committed changeset 1355:409d955e3493
Attachment #526002 - Flags: checked-in+
this code is now live on bm-foopy, foopy01 - foopy04
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering