Closed Bug 799896 Opened 13 years ago Closed 13 years ago

Clientproxy is stopping tegra buildbot instances in some cases where it seems it shouldn't be.

Categories

(Release Engineering :: General, defect)

x86_64
Windows 7
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Assigned: Callek)

Details

Attachments

(1 file)

So, I was looking at some of the common cases recently where we had buildslave-disconnected errors, and it seems like the weird cases are primarily coming from clientproxy right now. As an example, take tegra-058, which had issues where buildbot would get killed off in the middle of its normal job. It seems there are cases where the tegra reboots and we, well, decide we should kill off buildbot.

http://mxr.mozilla.org/build/source/tools/sut_tools/clientproxy.py#306
http://mxr.mozilla.org/build/source/tools/sut_tools/clientproxy.py#328

Something seems to have changed recently to cause this to surface, and it is killing off a good chunk of good jobs. Any insight you can provide would be appreciated.

The following is from the tegra-058 clientproxy.log. We have:

2012-10-09 19:00:57,495 DEBUG MainProcess: EVENT: active
2012-10-09 19:00:57,495 DEBUG MainProcess: event active hbFails 0 / 50
2012-10-09 19:00:57,495 DEBUG MainProcess: bbActive True tegraActive True
2012-10-09 19:00:57,495 DEBUG MainProcess: checking if slave is alive [/builds/tegra-058/twistd.pid]
2012-10-09 19:00:57,496 DEBUG MainProcess: EVENT: None
2012-10-09 19:01:02,372 DEBUG MainProcess: socket data [20121009-19:01:02 Rebooting ... ]
2012-10-09 19:01:02,372 WARNING MainProcess: device is rebooting
2012-10-09 19:01:07,373 DEBUG MainProcess: bbActive True tegraActive True
2012-10-09 19:01:07,373 DEBUG MainProcess: checking if slave is alive [/builds/tegra-058/twistd.pid]
2012-10-09 19:02:14,440 INFO MainProcess: Error connecting to data port - sleeping for 215 seconds
2012-10-09 19:05:49,465 DEBUG MainProcess: EVENT: reboot
2012-10-09 19:05:49,465 DEBUG MainProcess: event reboot hbFails 1 / 50
2012-10-09 19:05:49,465 WARNING MainProcess: Tegra rebooting, stopping buildslave
2012-10-09 19:05:49,465 DEBUG MainProcess: bbActive True tegraActive False
2012-10-09 19:05:49,466 DEBUG MainProcess: checking if slave is alive [/builds/tegra-058/twistd.pid]
2012-10-09 19:05:49,484 DEBUG MainProcess: EVENT: stop
2012-10-09 19:05:49,484 DEBUG MainProcess: event stop hbFails 1 / 50
2012-10-09 19:05:49,484 DEBUG MainProcess: buildslave: looking for /builds/tegra-058/twistd.pid
2012-10-09 19:05:49,484 DEBUG MainProcess: buildslave: looking for /builds/tegra-058/twistd.pid
2012-10-09 19:05:49,485 DEBUG MainProcess: buildslave: first attempt to kill 9616
2012-10-09 19:05:49,485 INFO MainProcess: calling kill for pid 9616 with signal 15
2012-10-09 19:05:49,485 DEBUG MainProcess: calling [ps -U cltbld -O ppid,pgid,command]
2012-10-09 19:05:49,569 DEBUG MainProcess: PID PPID PGID COMMAND TT STAT TIME COMMAND
2012-10-09 19:05:49,570 DEBUG MainProcess: 3837 1 3836 /opt/local/Libra ?? S 0:05.01 /opt/local/$
2012-10-09 19:05:49,570 DEBUG MainProcess: 4319 1 4318 /opt/local/Libra ?? S 1:07.24 /opt/local/$
2012-10-09 19:05:49,570 DEBUG MainProcess: 4751 1 4751 /sbin/launchd ?? Ss 14:18.79 /sbin/launc$
2012-10-09 19:05:49,570 DEBUG MainProcess: 9616 1 9615 /opt/local/Libra ?? S 0:02.51 /opt/local/$
2012-10-09 19:05:50,033 DEBUG MainProcess: 9865 3837 3836 /opt/local/Libra ?? S 0:02.34 /opt/local/$
2012-10-09 19:05:50,033 DEBUG MainProcess: 9870 9865 3836 ../hostutils/bin ?? S 0:12.41 ../hostutil$
2012-10-09 19:05:50,034 DEBUG MainProcess: 10257 4319 4318 /opt/local/Libra ?? S 0:02.75 /opt/local/$
2012-10-09 19:05:50,034 DEBUG MainProcess: 10264 91032 91031 /opt/local/Libra ?? S 0:01.87 /opt/local/$
2012-10-09 19:05:50,034 DEBUG MainProcess: 10267 10257 4318 /builds/tegra-05 ?? S 0:08.14 /builds/teg$
2012-10-09 19:05:50,034 DEBUG MainProcess: 10268 97981 97980 /opt/local/Libra ?? S 0:04.12 /opt/local/$
2012-10-09 19:05:50,034 DEBUG MainProcess: 10272 10268 97980 ../hostutils/bin ?? S 1:09.54 ../hostutil$
2012-10-09 19:05:50,035 DEBUG MainProcess: 10278 10264 91031 /builds/tegra-06 ?? S 0:07.39 /builds/teg$
2012-10-09 19:05:50,035 DEBUG MainProcess: 10287 21401 21400 /opt/local/Libra ?? S 0:01.32 /opt/local/$
2012-10-09 19:05:50,035 DEBUG MainProcess: 10290 10287 21400 /builds/tegra-06 ?? S 0:05.28 /builds/teg$
2012-10-09 19:05:50,035 DEBUG MainProcess: 10293 92509 92508 /opt/local/Libra ?? S 0:00.25 /opt/local/$
2012-10-09 19:05:50,047 DEBUG MainProcess: 10336 92546 92544 /opt/local/Libra ?? S 0:01.24 /opt/local/$
2012-10-09 19:05:50,048 DEBUG MainProcess: 10363 10336 92544 /builds/tegra-05 ?? S 0:03.85 /builds/teg$
2012-10-09 19:05:50,048 DEBUG MainProcess: 10387 63702 63701 /opt/local/Libra ?? S 0:00.49 /opt/local/$
2012-10-09 19:05:50,048 DEBUG MainProcess: 10392 10387 63701 ../hostutils/bin ?? S 0:04.56 ../hostutil$
2012-10-09 19:05:50,048 DEBUG MainProcess: 10394 10393 10394 /bin/sh -c /buil ?? Ss 0:00.00 /bin/sh -c $
2012-10-09 19:05:50,048 DEBUG MainProcess: 10395 10394 10394 /bin/bash /build ?? S 0:00.00 /bin/bash /$
2012-10-09 19:05:50,049 DEBUG MainProcess: 10396 10395 10394 python sut_tools ?? S 0:00.67 python sut_$
2012-10-09 19:05:50,049 DEBUG MainProcess: 10410 9616 9615 /opt/local/Libra ?? S 0:00.35 /opt/local/$
2012-10-09 19:05:50,049 DEBUG MainProcess: 10418 10410 9615 ../hostutils/bin ?? S 0:00.41 ../hostutil$
2012-10-09 19:05:50,049 DEBUG MainProcess: 10427 59694 59693 unzip -oq ../fen ?? R 0:03.13 unzip -oq .$
2012-10-09 19:05:50,049 DEBUG MainProcess: 10432 92613 92612 rm -rf build ?? R 0:00.71 rm -rf build
2012-10-09 19:05:50,049 DEBUG MainProcess: 10436 10396 10394 ping -c 5 tegra- ?? S 0:00.01 ping -c 5 t$
2012-10-09 19:05:50,050 DEBUG MainProcess: 10437 41737 41736 ps -U cltbld -O ?? R 0:00.01 ps -U cltbl$
2012-10-09 19:05:50,050 DEBUG MainProcess: 21401 1 21400 /opt/local/Libra ?? S 0:57.95 /opt/local/$
2012-10-09 19:05:50,050 DEBUG MainProcess: 41610 1 41609 /opt/local/Libra ?? S 0:43.84 /opt/local/$
2012-10-09 19:05:50,050 DEBUG MainProcess: 41611 41610 41609 /opt/local/Libra ?? S 1:22.53 /opt/local/$
2012-10-09 19:05:50,050 DEBUG MainProcess: 41631 1 41630 /opt/local/Libra ?? S 0:45.30 /opt/local/$
2012-10-09 19:05:50,050 DEBUG MainProcess: 41632 41631 41630 /opt/local/Libra ?? S 1:22.61 /opt/local/$
2012-10-09 19:05:50,051 DEBUG MainProcess: 41662 1 41661 /opt/local/Libra ?? S 0:45.91 /opt/local/$
2012-10-09 19:05:50,051 DEBUG MainProcess: 41663 41662 41661 /opt/local/Libra ?? S 1:22.61 /opt/local/$
...
2012-10-09 19:05:50,058 INFO MainProcess: calling kill for pid 10410 with signal 15
2012-10-09 19:05:50,058 DEBUG MainProcess: calling [ps -U cltbld -O ppid,pgid,command]
2012-10-09 19:05:50,090 DEBUG MainProcess: PID PPID PGID COMMAND TT STAT TIME COMMAND
...
2012-10-09 19:05:50,119 INFO MainProcess: calling kill for pid 10418 with signal 15
2012-10-09 19:05:50,120 DEBUG MainProcess: calling [ps -U cltbld -O ppid,pgid,command]
....
2012-10-09 19:05:59,238 DEBUG MainProcess: 10418 not found
2012-10-09 19:05:59,238 DEBUG MainProcess: 10410 not found
2012-10-09 19:06:03,243 DEBUG MainProcess: verifying buildslave is gone
2012-10-09 19:06:03,243 INFO MainProcess: buildslave stopped
2012-10-09 19:06:03,244 DEBUG MainProcess: verify step of stopProcess
2012-10-09 19:06:03,244 DEBUG MainProcess: Traceback (most recent call last):
2012-10-09 19:06:03,244 DEBUG MainProcess:
2012-10-09 19:06:03,244 DEBUG MainProcess: File "/builds/tools/sut_tools/sut_lib.py", line 264, in stopProce$
2012-10-09 19:06:03,245 DEBUG MainProcess: if len(fileTail) > 0:
2012-10-09 19:06:03,245 DEBUG MainProcess:
2012-10-09 19:06:03,245 DEBUG MainProcess: OSError: [Errno 2] No such file or directory: '/builds/tegra-058/tw$
2012-10-09 19:06:03,245 DEBUG MainProcess:
2012-10-09 19:06:03,245 DEBUG MainProcess: Traceback End
2012-10-09 19:06:03,245 ERROR MainProcess: buildslave: pid 9616 not found
2012-10-09 19:06:03,246 DEBUG MainProcess: bbActive False tegraActive False
2012-10-09 19:06:03,246 DEBUG MainProcess: EVENT: None
2012-10-09 19:06:03,246 DEBUG MainProcess: socket data [20121009-19:05:49 trace output]
2012-10-09 19:06:03,246 INFO MainProcess: heartbeat detected
2012-10-09 19:06:03,246 DEBUG MainProcess: bbActive False tegraActive False
2012-10-09 19:06:03,247 DEBUG MainProcess: EVENT: active
2012-10-09 19:06:03,247 DEBUG MainProcess: event active hbFails 0 / 50
2012-10-09 19:06:03,247 DEBUG MainProcess: bbActive False tegraActive True
2012-10-09 19:06:03,248 DEBUG MainProcess: EVENT: verify
2012-10-09 19:06:03,248 DEBUG MainProcess: event verify hbFails 0 / 50
2012-10-09 19:06:03,248 INFO MainProcess: Running verify code
2012-10-09 19:06:03,248 DEBUG MainProcess: calling [python /builds/sut_tools/verify.py tegra-058]
......

(that last verify was successful, btw)
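For anyone skimming the log above, the stop sequence it shows (read twistd.pid, send signal 15 to the buildslave pid, then walk the ps output for its children) looks roughly like the following Python sketch. This is an illustration only, not the actual sut_lib.py/clientproxy.py code; the function name is hypothetical, and only the paths, the cltbld user, and the ps invocation are taken from the log.

    # Illustrative sketch only -- not the real sut_lib.py/clientproxy.py code.
    # It mirrors the log above: read the slave's twistd.pid, send signal 15
    # (SIGTERM) to that pid, then scan `ps -U cltbld -O ppid,pgid,command`
    # and SIGTERM any direct children, tolerating ones that already exited.
    import os
    import signal
    import subprocess

    def stop_buildslave(slave_dir):
        # e.g. slave_dir = "/builds/tegra-058" (hypothetical helper name)
        pid_file = os.path.join(slave_dir, "twistd.pid")
        with open(pid_file) as f:
            pid = int(f.read().strip())

        os.kill(pid, signal.SIGTERM)  # "calling kill for pid ... with signal 15"

        ps_out = subprocess.check_output(
            ["ps", "-U", "cltbld", "-O", "ppid,pgid,command"], text=True)
        for line in ps_out.splitlines()[1:]:
            fields = line.split()
            child_pid, ppid = int(fields[0]), int(fields[1])
            if ppid == pid:
                try:
                    os.kill(child_pid, signal.SIGTERM)
                except ProcessLookupError:
                    pass  # already gone ("<pid> not found" in the log)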
I just looked at every single purple job on inbound today, and with the exception of 118 (which has been mostly broken but left in service since September 28th), they were done by 058, 102, 103, 105, 109, 196 or 212, every one of which, while completely blameless, had committed suicide and was brought back to life on Monday by being PDU cycled. So, what code changed on those blameless slaves when they got rebooted on Monday?
So, as discussed on IRC, I think we're safe assuming (with clientproxy) that all reboots are actually legit reboots at this point.

Reasons:
* We shut off the watcher-induced reboots on the device.
* We have verify.py to alert us if a device is down before we do the real-job aspects, and it fails out with an error.flg when the device isn't there.
* Other things that need to wait on a reboot fail out if the device doesn't come back, and even if they don't also set an error.flg, they will get us to a new job that has verify.py handy to catch faults.
* The only non-expected causes of reboots at this point will either be done manually, be a case of a bad device, or be a colo issue --> all of which should be visible to us via other means, without needing a slave takedown triggered by proxy.flg's absence.

Why this matters now:
* With recent sut/devicemanager changes we can now reconnect to the SUTAgent socket much faster in most cases, and also fail out properly, just as fast, in the cases where the device is bad.
* I theorize that our event-queue model for clientproxy was causing "seen" reboots of the device to get back to handling the "reboot" aspect AFTER proxy.flg went away, therefore treating it as not a legit case --> and forcibly killing buildbot because of it (see the sketch below).
* Slaves that are otherwise good can look like bad slaves if they come back "too fast", since they keep getting killed due to this clientproxy behaviour.
* Slaves that are otherwise bad won't get killed this way anyway, because proxy.flg lingers while we wait for the device to come back (or for cleanup to kill off the flag file).
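To make the theorized race concrete, here is a minimal sketch. This is a reconstruction from this comment and the log, not the real clientproxy event loop; the function name and queue wiring are assumptions, only proxy.flg and the tegra-058 path come from the bug.

    # Minimal sketch of the suspected race -- not the actual clientproxy.py.
    # The device's "Rebooting ..." message becomes a queued "reboot" event;
    # whether the reboot is treated as expected depends on proxy.flg still
    # existing when the event is finally dequeued, not when it was observed.
    import os
    import queue

    def handle_reboot_event(slave_dir):
        # slave_dir e.g. "/builds/tegra-058"; flag name from this bug
        proxy_flag = os.path.join(slave_dir, "proxy.flg")
        if os.path.exists(proxy_flag):
            # Flag present: a reboot we initiated; leave the buildslave alone.
            return "expected-reboot"
        # Flag already cleaned up by the time we got here: the reboot looks
        # unexpected, so (pre-patch) the buildslave gets stopped, even though
        # the reboot itself was perfectly legitimate.
        return "stop-buildslave"

    events = queue.Queue()
    events.put("reboot")  # observed while proxy.flg still existed
    # ... cleanup removes proxy.flg before the event is processed ...
    if events.get() == "reboot":
        action = handle_reboot_event("/builds/tegra-058")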
Assignee: nobody → bugspam.Callek
Status: NEW → ASSIGNED
Attachment #670784 - Flags: review?(jmaher)
Attachment #670784 - Flags: review?(kmoir)
Attachment #670784 - Flags: review?
Comment on attachment 670784 [details] [diff] [review]
[tools] remove proxy.flg and magic stopping of the slave when we don't have it

Callek asked me to look at this patch in IRC - lgtm. Also, I tested it in staging and the builds are green so far.
Attachment #670784 - Flags: review?(kmoir)
Attachment #670784 - Flags: review?
Attachment #670784 - Flags: review+
Okay, Callek mentioned that I needed to do a stop_cp and start_cp on the staging foopies to actually verify this. I did that, and the builds are still green.
Comment on attachment 670784 [details] [diff] [review]
[tools] remove proxy.flg and magic stopping of the slave when we don't have it

Review of attachment 670784 [details] [diff] [review]:
-----------------------------------------------------------------

overall this looks good. The more we can remove and/or simplify in all this code the better off we will be going forward.

::: sut_tools/config.py
@@ +114,5 @@
> + width, height = getResolution(dm)
> + print("current resolution X:%d Y:%d" % (width, height))
> + if width != refWidth and height != refHeight:
> + setFlag(errorFile, "Remote Device Error: current resolution X:%d Y:%d does not match what was set X:%d Y:%d" % (width, height, refWidth, refHeight))
> + sys.exit(1)

this whole section was moved out of a try/finally block. What will happen if we find an exception? Will we clean up good, or cause ugly nasty problems which are hard to debug?
Attachment #670784 - Flags: review?(jmaher) → review+
(In reply to Joel Maher (:jmaher) from comment #5)
> ::: sut_tools/config.py
> @@ +114,5 @@
> > + width, height = getResolution(dm)
> > + print("current resolution X:%d Y:%d" % (width, height))
> > + if width != refWidth and height != refHeight:
> > + setFlag(errorFile, "Remote Device Error: current resolution X:%d Y:%d does not match what was set X:%d Y:%d" % (width, height, refWidth, refHeight))
> > + sys.exit(1)
>
> this whole section was moved out of a try/finally block. What will happen
> if we find an exception? Will we clean up good, or cause ugly nasty
> problems which are hard to debug?

The try/finally was *only* used to clean up the proxy.flg on error; all other exceptions still bubble up as they always have, since we never actually caught an error explicitly here.

--

Landed in http://hg.mozilla.org/build/tools/rev/bc38e47c0372 and updated all foopies. Then ran stop_cp // start_cp against all devices that we hadn't already explicitly run stop_cp on (I used check.sh to know which).
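For clarity, here is roughly what that answer means in code. The function names are made up and this is not the actual config.py hunk: the point is only that the try/finally existed purely so proxy.flg was removed on the way out, and it never swallowed exceptions, so dropping it leaves error propagation unchanged.

    # Hypothetical illustration -- names are invented, not the real config.py.
    import os

    def do_configuration(dm):
        pass  # stand-in for the real device-setup steps, which may raise

    # Before: the try/finally only guaranteed proxy.flg cleanup; any
    # exception from do_configuration() still propagated to the caller.
    def configure_old(dm, proxy_flag):
        try:
            do_configuration(dm)
        finally:
            if os.path.exists(proxy_flag):
                os.remove(proxy_flag)

    # After the patch there is no proxy.flg to clean up, so the wrapper goes
    # away and exceptions bubble exactly as they did before.
    def configure_new(dm):
        do_configuration(dm)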
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Interesting:

2012-10-15 15:45:28,268 DEBUG MainProcess: bbActive True tegraActive True
2012-10-15 15:45:28,268 DEBUG MainProcess: checking if slave is alive [/builds/tegra-096/twistd.pid]
2012-10-15 15:45:28,268 DEBUG MainProcess: EVENT: None
2012-10-15 15:45:28,268 DEBUG MainProcess: bbActive True tegraActive True
2012-10-15 15:45:28,269 DEBUG MainProcess: checking if slave is alive [/builds/tegra-096/twistd.pid]
2012-10-15 15:45:28,270 INFO MainProcess: Error connecting to data port - sleeping for 10 seconds
2012-10-15 15:45:38,270 DEBUG MainProcess: EVENT: offline
2012-10-15 15:45:38,270 DEBUG MainProcess: event offline hbFails 1 / 50
2012-10-15 15:45:38,270 DEBUG MainProcess: buildslave: looking for /builds/tegra-096/twistd.pid
2012-10-15 15:45:38,271 DEBUG MainProcess: buildslave: first attempt to kill 51630
2012-10-15 15:45:38,271 INFO MainProcess: calling kill for pid 51630 with signal 15

So a (slightly) different cause here, though it has the potential to be a regression from this bug.
Product: mozilla.org → Release Engineering
