1018118 - Growing pending queues for tegras and time between jobs per tegra > 6 hours

Reporter

Description

•

11 years ago

All the tegras I've looked at show the same pattern: they grab a single job, run it to completion, and then hang for 6 hours until the slaveapi reboot successfully resurrects them. Here's an example tegra. Notice the ~6hr span between jobs: https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=tegra&name=tegra-101 If I look at the log for the most recent job as of now, it's hard (for me) to tell if the reboot is working as intended: http://buildbot-master88.srv.releng.scl3.mozilla.com:8201/builders/Android%202.2%20Tegra%20b2g-inbound%20opt%20test%20plain-reftest-4/builds/627/steps/reboot%20device/logs/stdio A 6hr cycle time is obviously untenable. Has anything changed in this code recently? I took a look at the tools repo but didn't see anything. Maybe something has changed on the networking side recently?

Pete Moore [:pmoore][:pete]

Reporter

Comment 1

•

11 years ago

1) lots of useful information in the watcher.log, error.flg, even twisted.log for this tegra - which should really be available from slavehealth page (bug 976100 would fix this) 2) I could connect to the sit agent, but the test root was not available - however, error.log said it could not connect to the device, which is not true (bug 976106 would fix this) 3) test root is not available if sd card cannot be mounted, but I found a race condition that can cause this problem (bug 1010173 would fix the race condition) - so I wanted to see if this was the cause… i could test this by seeing if i can write to the sd card, even though the device says the test root is unavailable… so i tried this, and had a strange surprise! 4) when i connect to it, and ls /mnt/sdcard, i can see it is an empty directory, i double checked the test root with the testroot command and confirmed it returned the known error “##AGENT-WARNING## unable to determine test root”. Then I tried to copy a (small) file into /mnt/sdcard to see if I could write there - and this is what happened… The copy command (cp) crashed the SUT Agent (baaaaaad bug). It didn’t just cause the client to disconnect, it actually crashed the SUT agent daemon on port 20701. But no worries - there is also a daemon on port 20700. I tried the same on this port too (notice, on port 20700 there is no prompt, so slightly harder to differentiate input/output below). This did exactly the same thing - so now I crashed but daemons for this device, and rendered it useless, until it is PDU rebooted. [cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20701 Trying 10.26.85.77... Connected to tegra-101. Escape character is '^]'. $>testroot ##AGENT-WARNING## unable to determine test root $>ls /mnt/sdcard $>ls dev root data default.prop init init.goldfish.rc init.rc init_recovery.rc proc sbin sys system etc d mnt acct sdcard cache misc config usbdisk $>cat default.prop # # ADDITIONAL_DEFAULT_PROPERTIES # ro.secure=0 ro.allow.mock.location=1 ro.debuggable=1 persist.service.adb.enable=1 persist.tegra.dpy5.mode.width=1280 persist.tegra.dpy5.mode.height=720 EXTERNAL_STORAGE_MOUNT=/mnt/sdcard ro.opengles.version=131072 dalvik.vm.heapsize=48m $>cp default.prop /mnt/sdcard Connection closed by foreign host. [cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20701 Trying 10.26.85.77... telnet: connect to address 10.26.85.77: Connection refused [cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20701 Trying 10.26.85.77... telnet: connect to address 10.26.85.77: Connection refused [cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20701 Trying 10.26.85.77... telnet: connect to address 10.26.85.77: Connection refused [cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20701 Trying 10.26.85.77... telnet: connect to address 10.26.85.77: Connection refused [cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20700 Trying 10.26.85.77... Connected to tegra-101. Escape character is '^]'. 20140530-00:52:29 trace output ls /mnt/sdcard ls dev root data default.prop init init.goldfish.rc init.rc init_recovery.rc proc sbin sys system etc d mnt acct sdcard cache misc config usbdisk 20140530-00:53:29 Thump thump - 00:26:e8:d4:25:33 cp default.prop /mnt/sdcard/default.prop Connection closed by foreign host. [cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20700 Trying 10.26.85.77... telnet: connect to address 10.26.85.77: Connection refused [cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20700 Trying 10.26.85.77... telnet: connect to address 10.26.85.77: Connection refused [cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20701 Trying 10.26.85.77... telnet: connect to address 10.26.85.77: Connection refused [cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20700 Trying 10.26.85.77... telnet: connect to address 10.26.85.77: Connection refused [cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$

Pete Moore [:pmoore][:pete]

Reporter

Comment 2

•

11 years ago

(In reply to Pete Moore [:pete][:pmoore] from comment #0) > All the tegras I've looked at show the same pattern: they grab a single > job, run it to completion, and then hang for 6 hours until the slaveapi > reboot successfully resurrects them. > > Here's an example tegra. Notice the ~6hr span between jobs: > > https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave. > html?class=test&type=tegra&name=tegra-101 > > If I look at the log for the most recent job as of now, it's hard (for > me) to tell if the reboot is working as intended: > > http://buildbot-master88.srv.releng.scl3.mozilla.com:8201/builders/ > Android%202.2%20Tegra%20b2g-inbound%20opt%20test%20plain-reftest-4/builds/ > 627/steps/reboot%20device/logs/stdio > > A 6hr cycle time is obviously untenable. Has anything changed in this > code recently? I took a look at the tools repo but didn't see anything. > Maybe something has changed on the networking side recently? I forgot to say this was copied/pasted from an email from :coop ...

Pete Moore [:pmoore][:pete]

Reporter

Comment 3

•

11 years ago

Current error.flg files I see: 75 tegras: Automation Error: Unable to connect to device after 5 attempts 49 tegras: Unknown verify failure 8 tegras: Remote Device Error: Unhandled exception in cleanupDevice 7 tegras: Remote Device Error: updateApp() call failed - exiting 5 tegras: Remote Device Error: Unable to properly remove /mnt/sdcard/tests 3 tegras: Automation Error: Unable to ping device after 5 attempts 1 tegra: Remote Device Error: unable to write to sdcard 1 tegra: Remote Device Error: failed to restore /system/etc/hosts

Pete Moore [:pmoore][:pete]

Reporter

Comment 4

•

11 years ago

Of the 75 "Automation Error: Unable to connect to device after 5 attempts" I believe many may be because of bug 1010173. Testing this theory now by attempting to write to the test root, even though SUT Agent says it is unavailable. I don't yet have an explanation for the long periods of time between taking jobs/rebooting successfully.

Pete Moore [:pmoore][:pete]

Reporter

Comment 5

•

11 years ago

Attached file watcher.log-20140530 — Details

this watcher.log is quite revealing... 29/05/2014 20:27:28 - 29/05/2014 21:18:01 PDT ran: http://buildbot-master88.srv.releng.scl3.mozilla.com:8201/builders/Android%202.2%20Armv6%20Tegra%20try%20opt%20test%20robocop-2/builds/3541 nasty syslog: http://buildbot-master88.srv.releng.scl3.mozilla.com:8201/builders/Android%202.2%20Armv6%20Tegra%20try%20opt%20test%20robocop-2/builds/3541/steps/reboot%20device/logs/stdio/text green 05/29/2014 21:20:01: DEBUG: Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ 2014-05-29 21:30:14-0700 [Broker,client] I have a leftover directory 'talos-data' that is not being used by the buildmaster: you can delete it now It has just finished cleaning up, and making itself available. One second later, it is given a job from the buildbot master: 29/05/2014 21:30:15 - 29/05/2014 21:51:35 PDT ran: http://buildbot-master88.srv.releng.scl3.mozilla.com:8201/builders/Android%202.2%20Armv6%20Tegra%20try%20opt%20test%20crashtest/builds/2875 The logs now show it is running successfully again: 3821 2014-05-29 21:35:01 -- ############################################################### 3822 2014-05-29 21:35:01 -- # Starting cycle for our device (tegra-101 = 10.26.85.77) now # 3823 2014-05-29 21:35:01 -- ############################################################### 3824 2014-05-29 21:35:01 -- contacting slavealloc 3825 2014-05-29 21:35:01 -- buildbot pid is 18217 3826 2014-05-29 21:35:01 -- (heartbeat) buildbot is running 3827 2014-05-29 21:35:01 -- Cycle for our device (tegra-101) complete 3849 2014-05-29 21:55:01 -- ############################################################### 3850 2014-05-29 21:55:01 -- # Starting cycle for our device (tegra-101 = 10.26.85.77) now # 3851 2014-05-29 21:55:01 -- ############################################################### 3852 2014-05-29 21:55:01 -- contacting slavealloc 3853 2014-05-29 21:55:01 -- Buildbot is not running The job completes (successfully) at 21:51:35 PDT. A few minutes later: 401 05/29/2014 21:55:08: INFO: Uninstalling org.mozilla.fennec... 402 05/29/2014 21:55:08: INFO: Waiting for device to come back... 403 05/29/2014 21:57:08: INFO: Try 1 404 05/29/2014 21:57:58: INFO: devroot None 405 05/29/2014 21:59:58: INFO: Try 2 406 2014-05-29 22:00:01 -- *** ERROR *** failed to aquire lockfile 407 05/29/2014 22:00:49: INFO: devroot None 408 05/29/2014 22:02:49: INFO: Try 3 409 05/29/2014 22:03:39: INFO: devroot None 410 2014-05-29 22:05:01 -- *** ERROR *** failed to aquire lockfile 411 05/29/2014 22:05:39: ERROR: Remote Device Error: waiting for device timed out. 412 05/29/2014 22:05:39: ERROR: Remote Device Error: waiting for device timed out. 413 05/29/2014 22:06:29: INFO: /builds/tegra-101/error.flg 414 05/29/2014 22:06:59: INFO: verifyDevice: failing to cleanup device So there is a cleanup job running, which fails after the reboot. I am quite sure this again could be the race condition from bug 1010173. However, only a reboot can fix this. For the next several hours, no reboot is attempted. Instead, every 5 minutes the watcher runs, sees there is an error.flg and then exits, apart from once per hour when it deletes the error.flg and retries. Each time, it will get the device root error “##AGENT-WARNING## unable to determine test root” and so it will recreate the error.flg file, and give up again. I think eventually, there is some generic slave rebooter, that sees - "wow - 6 hours since this machine last took a job, I'm going to reboot it". And then, hopefully all is ok again, although of course the race condition is always a possibility. Conclusion ========== * Necessary: This race condition needs to be fixed (bug 1010173) * Necessary: Logging needs to be improved (don't just say error.flg exists, output what it contains) * Necessary: Errors should not be swallowed, e.g. bug 976106 needs to be completed * Optional: Slave health should provide links to log files, to make life easier to troubleshoot (bug 976100) * Necessary: We need to find out what the numerous error messages are caused by in sut tools, e.g.: 1) http://buildbot-master88.srv.releng.scl3.mozilla.com:8201/builders/Android%202.2%20Armv6%20Tegra%20try%20opt%20test%20robocop-2/builds/3541/steps/reboot%20device/logs/stdio/text green 2) 05/29/2014 21:20:01: DEBUG: Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ 3) Maybe ok, maybe not: 2014-05-29 21:30:14-0700 [Broker,client] I have a leftover directory 'talos-data' that is not being used by the buildmaster: you can delete it now * Optional: until bug 1010173 is landed, consider allowing watcher to reboot device in case of persistent failure

Chris Cooper [:coop] (he/him)

Assignee

Comment 6

•

11 years ago

Attached patch Reboot unreachable tegras — Details — Splinter Review

Whoever gets to this first, please clear the review request for the other reviewer unless you think it needs more eyes. This patch does 3 things: 1) Adds the contents of the error.flg file where pertinent in watch_devices.sh 2) Switches sut_lib to use the Linux syntax for ps. The Mac syntax was causing the "Warning: bad syntax, perhaps a bogus '-'" output, and foopies are not Macs any more. 3) Attempts a soft reboot if we cannot ping the device

Attachment #8431652 - Flags: review?(pmoore)

Attachment #8431652 - Flags: review?(bugspam.Callek)

Pete Moore [:pmoore][:pete]

Reporter

Comment 7

•

11 years ago

Comment on attachment 8431652 [details] [diff] [review] Reboot unreachable tegras Review of attachment 8431652 [details] [diff] [review]: ----------------------------------------------------------------- r+ with the changes I suggested :) ::: buildfarm/mobile/watch_devices.sh @@ +58,5 @@ > death "Not Starting, slavealloc says we're disabled" 64 > fi > if [ -f /builds/$device/error.flg ]; then > log "error.flg file detected" > + local contents=`cat error.flg` This should be: local contents=`cat /builds/$device/error.flg` @@ +86,5 @@ > else # buildbot running > log "(heartbeat) buildbot is running" > if [ -f /builds/$device/error.flg ]; then > log "Found an error.flg, expecting buildbot to self-kill after this job" > + local contents=`cat error.flg` This should be: local contents=`cat /builds/$device/error.flg` ::: lib/python/sut_lib/__init__.py @@ +197,5 @@ > # 18456 1 18455 /opt/local/Libra ?? S 0:00.88 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/Resources/Python.app/Contents/MacOS/Python /opt/local/Library/Frameworks/Python.framework/Versions/2.6/bin/twistd$ > # 18575 18456 18455 /opt/local/Libra ?? S 0:00.52 > # /opt/local/Library/Frameworks/Python.framework/Versions/2.6/Resources/Python.app/Contents/MacOS/Python > # ../../sut_tools/installApp.py 10.250.49.8 ../talos-data/fennec-4.1a1pr$ > + p, lines = runCommand(['ps', 'xU', 'cltbld', '-O', 'ppid,pgid,command']) This looks good - did you test it? ::: sut_tools/verify.py @@ +367,3 @@ > log.info("verifyDevice: failing to ping") > + # See if we can recover the device with a reboot. > + soft_reboot(dm=dm, device=device) I'll defer to Callek on this one :)

Attachment #8431652 - Flags: review?(pmoore) → review+

Pete Moore [:pmoore][:pete]

Reporter

Comment 8

•

11 years ago

Callek, are you happy with the soft_reboot part?

Flags: needinfo?(bugspam.Callek)

Chris Cooper [:coop] (he/him)

Assignee

Updated

•

11 years ago

Assignee: nobody → coop

Status: NEW → ASSIGNED

Priority: -- → P2

Chris Cooper [:coop] (he/him)

Assignee

Comment 9

•

11 years ago

Callek: ping re: comment #8?

Justin Wood (:Callek)

Comment 10

•

11 years ago

Not ideal imo. But yea let's do it

Flags: needinfo?(bugspam.Callek)

Chris Cooper [:coop] (he/him)

Assignee

Comment 11

•

11 years ago

Comment on attachment 8431652 [details] [diff] [review] Reboot unreachable tegras Review of attachment 8431652 [details] [diff] [review]: ----------------------------------------------------------------- https://hg.mozilla.org/build/tools/rev/1aee4963a541

Attachment #8431652 - Flags: review?(bugspam.Callek) → checked-in+

Chris Cooper [:coop] (he/him)

Assignee

Comment 12

•

11 years ago

This got deployed yesterday.

Status: ASSIGNED → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Jordan Lund (:jlund)

Updated

•

11 years ago

Comment 13

•

11 years ago

There are just a couple of trailing parts left open, so reopening to track these... * Investigate: 2014-05-29 21:30:14-0700 [Broker,client] I have a leftover directory 'talos-data' that is not being used by the buildmaster: you can delete it now * Until bug 1010173 is landed, consider allowing watcher to reboot device in case of persistent failure

Status: RESOLVED → REOPENED

Depends on: 1010173, 976106, 976100

Resolution: FIXED → ---

Pete Moore [:pmoore][:pete]

Reporter

Comment 14

•

11 years ago

Ah - the second one is handled by coop's patch, i believe - so just the first one - but i think i can handle that with adding a bug dependency (we already have a bug about cleanup not being done - i'll find it...)

Pete Moore [:pmoore][:pete]

Reporter

Updated

•

11 years ago

Depends on: 742479

Pete Moore [:pmoore][:pete]

Reporter

Updated

•

11 years ago

Summary: Pending queue for tegras > 1000 and time between jobs per tegra is > 6 hours → Growing pending queues for tegras and time between jobs per tegra > 6 hours

Justin Wood (:Callek)

Comment 15

•

11 years ago

* Investigate: 2014-05-29 21:30:14-0700 [Broker,client] I have a leftover directory 'talos-data' that is not being used by the buildmaster: you can delete it now This has been a long standing thing, and is a directory on the foopy. its a product of how tests run for tegras, and is just informational from buildbot when it connects with/coordinates with a master. I call this bug fixed.

Status: REOPENED → RESOLVED

Closed: 11 years ago → 11 years ago

Resolution: --- → FIXED

BMO Automation

Updated

•

7 years ago

Product: Release Engineering → Infrastructure & Operations

BMO Automation

Updated

•

5 years ago

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard

watcher.log-20140530 11 years ago Pete Moore [:pmoore][:pete] 145.27 KB, text/plain		Details
Reboot unreachable tegras 11 years ago Chris Cooper [:coop] (he/him) 5.48 KB, patch	pmoore : review+ coop : checked-in+	Details \| Diff \| Splinter Review