Closed Bug 1018118 Opened 11 years ago Closed 11 years ago

Growing pending queues for tegras and time between jobs per tegra > 6 hours

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Assigned: coop)

References

Details

Attachments

(2 files)

All the tegras I've looked at show the same pattern: they grab a single job, run it to completion, and then hang for 6 hours until the slaveapi reboot successfully resurrects them. Here's an example tegra. Notice the ~6hr span between jobs: https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=tegra&name=tegra-101 If I look at the log for the most recent job as of now, it's hard (for me) to tell if the reboot is working as intended: http://buildbot-master88.srv.releng.scl3.mozilla.com:8201/builders/Android%202.2%20Tegra%20b2g-inbound%20opt%20test%20plain-reftest-4/builds/627/steps/reboot%20device/logs/stdio A 6hr cycle time is obviously untenable. Has anything changed in this code recently? I took a look at the tools repo but didn't see anything. Maybe something has changed on the networking side recently?
1) lots of useful information in the watcher.log, error.flg, even twisted.log for this tegra - which should really be available from slavehealth page (bug 976100 would fix this) 2) I could connect to the sit agent, but the test root was not available - however, error.log said it could not connect to the device, which is not true (bug 976106 would fix this) 3) test root is not available if sd card cannot be mounted, but I found a race condition that can cause this problem (bug 1010173 would fix the race condition) - so I wanted to see if this was the cause… i could test this by seeing if i can write to the sd card, even though the device says the test root is unavailable… so i tried this, and had a strange surprise! 4) when i connect to it, and ls /mnt/sdcard, i can see it is an empty directory, i double checked the test root with the testroot command and confirmed it returned the known error “##AGENT-WARNING## unable to determine test root”. Then I tried to copy a (small) file into /mnt/sdcard to see if I could write there - and this is what happened… The copy command (cp) crashed the SUT Agent (baaaaaad bug). It didn’t just cause the client to disconnect, it actually crashed the SUT agent daemon on port 20701. But no worries - there is also a daemon on port 20700. I tried the same on this port too (notice, on port 20700 there is no prompt, so slightly harder to differentiate input/output below). This did exactly the same thing - so now I crashed but daemons for this device, and rendered it useless, until it is PDU rebooted. [cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20701 Trying 10.26.85.77... Connected to tegra-101. Escape character is '^]'. $>testroot ##AGENT-WARNING## unable to determine test root $>ls /mnt/sdcard $>ls dev root data default.prop init init.goldfish.rc init.rc init_recovery.rc proc sbin sys system etc d mnt acct sdcard cache misc config usbdisk $>cat default.prop # # ADDITIONAL_DEFAULT_PROPERTIES # ro.secure=0 ro.allow.mock.location=1 ro.debuggable=1 persist.service.adb.enable=1 persist.tegra.dpy5.mode.width=1280 persist.tegra.dpy5.mode.height=720 EXTERNAL_STORAGE_MOUNT=/mnt/sdcard ro.opengles.version=131072 dalvik.vm.heapsize=48m $>cp default.prop /mnt/sdcard Connection closed by foreign host. [cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20701 Trying 10.26.85.77... telnet: connect to address 10.26.85.77: Connection refused [cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20701 Trying 10.26.85.77... telnet: connect to address 10.26.85.77: Connection refused [cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20701 Trying 10.26.85.77... telnet: connect to address 10.26.85.77: Connection refused [cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20701 Trying 10.26.85.77... telnet: connect to address 10.26.85.77: Connection refused [cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20700 Trying 10.26.85.77... Connected to tegra-101. Escape character is '^]'. 20140530-00:52:29 trace output ls /mnt/sdcard ls dev root data default.prop init init.goldfish.rc init.rc init_recovery.rc proc sbin sys system etc d mnt acct sdcard cache misc config usbdisk 20140530-00:53:29 Thump thump - 00:26:e8:d4:25:33 cp default.prop /mnt/sdcard/default.prop Connection closed by foreign host. [cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20700 Trying 10.26.85.77... telnet: connect to address 10.26.85.77: Connection refused [cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20700 Trying 10.26.85.77... telnet: connect to address 10.26.85.77: Connection refused [cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20701 Trying 10.26.85.77... telnet: connect to address 10.26.85.77: Connection refused [cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20700 Trying 10.26.85.77... telnet: connect to address 10.26.85.77: Connection refused [cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$
(In reply to Pete Moore [:pete][:pmoore] from comment #0) > All the tegras I've looked at show the same pattern: they grab a single > job, run it to completion, and then hang for 6 hours until the slaveapi > reboot successfully resurrects them. > > Here's an example tegra. Notice the ~6hr span between jobs: > > https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave. > html?class=test&type=tegra&name=tegra-101 > > If I look at the log for the most recent job as of now, it's hard (for > me) to tell if the reboot is working as intended: > > http://buildbot-master88.srv.releng.scl3.mozilla.com:8201/builders/ > Android%202.2%20Tegra%20b2g-inbound%20opt%20test%20plain-reftest-4/builds/ > 627/steps/reboot%20device/logs/stdio > > A 6hr cycle time is obviously untenable. Has anything changed in this > code recently? I took a look at the tools repo but didn't see anything. > Maybe something has changed on the networking side recently? I forgot to say this was copied/pasted from an email from :coop ...
Current error.flg files I see: 75 tegras: Automation Error: Unable to connect to device after 5 attempts 49 tegras: Unknown verify failure 8 tegras: Remote Device Error: Unhandled exception in cleanupDevice 7 tegras: Remote Device Error: updateApp() call failed - exiting 5 tegras: Remote Device Error: Unable to properly remove /mnt/sdcard/tests 3 tegras: Automation Error: Unable to ping device after 5 attempts 1 tegra: Remote Device Error: unable to write to sdcard 1 tegra: Remote Device Error: failed to restore /system/etc/hosts
Of the 75 "Automation Error: Unable to connect to device after 5 attempts" I believe many may be because of bug 1010173. Testing this theory now by attempting to write to the test root, even though SUT Agent says it is unavailable. I don't yet have an explanation for the long periods of time between taking jobs/rebooting successfully.
Attached file watcher.log-20140530
this watcher.log is quite revealing... 29/05/2014 20:27:28 - 29/05/2014 21:18:01 PDT ran: http://buildbot-master88.srv.releng.scl3.mozilla.com:8201/builders/Android%202.2%20Armv6%20Tegra%20try%20opt%20test%20robocop-2/builds/3541 nasty syslog: http://buildbot-master88.srv.releng.scl3.mozilla.com:8201/builders/Android%202.2%20Armv6%20Tegra%20try%20opt%20test%20robocop-2/builds/3541/steps/reboot%20device/logs/stdio/text green 05/29/2014 21:20:01: DEBUG: Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ 2014-05-29 21:30:14-0700 [Broker,client] I have a leftover directory 'talos-data' that is not being used by the buildmaster: you can delete it now It has just finished cleaning up, and making itself available. One second later, it is given a job from the buildbot master: 29/05/2014 21:30:15 - 29/05/2014 21:51:35 PDT ran: http://buildbot-master88.srv.releng.scl3.mozilla.com:8201/builders/Android%202.2%20Armv6%20Tegra%20try%20opt%20test%20crashtest/builds/2875 The logs now show it is running successfully again: 3821 2014-05-29 21:35:01 -- ############################################################### 3822 2014-05-29 21:35:01 -- # Starting cycle for our device (tegra-101 = 10.26.85.77) now # 3823 2014-05-29 21:35:01 -- ############################################################### 3824 2014-05-29 21:35:01 -- contacting slavealloc 3825 2014-05-29 21:35:01 -- buildbot pid is 18217 3826 2014-05-29 21:35:01 -- (heartbeat) buildbot is running 3827 2014-05-29 21:35:01 -- Cycle for our device (tegra-101) complete 3849 2014-05-29 21:55:01 -- ############################################################### 3850 2014-05-29 21:55:01 -- # Starting cycle for our device (tegra-101 = 10.26.85.77) now # 3851 2014-05-29 21:55:01 -- ############################################################### 3852 2014-05-29 21:55:01 -- contacting slavealloc 3853 2014-05-29 21:55:01 -- Buildbot is not running The job completes (successfully) at 21:51:35 PDT. A few minutes later: 401 05/29/2014 21:55:08: INFO: Uninstalling org.mozilla.fennec... 402 05/29/2014 21:55:08: INFO: Waiting for device to come back... 403 05/29/2014 21:57:08: INFO: Try 1 404 05/29/2014 21:57:58: INFO: devroot None 405 05/29/2014 21:59:58: INFO: Try 2 406 2014-05-29 22:00:01 -- *** ERROR *** failed to aquire lockfile 407 05/29/2014 22:00:49: INFO: devroot None 408 05/29/2014 22:02:49: INFO: Try 3 409 05/29/2014 22:03:39: INFO: devroot None 410 2014-05-29 22:05:01 -- *** ERROR *** failed to aquire lockfile 411 05/29/2014 22:05:39: ERROR: Remote Device Error: waiting for device timed out. 412 05/29/2014 22:05:39: ERROR: Remote Device Error: waiting for device timed out. 413 05/29/2014 22:06:29: INFO: /builds/tegra-101/error.flg 414 05/29/2014 22:06:59: INFO: verifyDevice: failing to cleanup device So there is a cleanup job running, which fails after the reboot. I am quite sure this again could be the race condition from bug 1010173. However, only a reboot can fix this. For the next several hours, no reboot is attempted. Instead, every 5 minutes the watcher runs, sees there is an error.flg and then exits, apart from once per hour when it deletes the error.flg and retries. Each time, it will get the device root error “##AGENT-WARNING## unable to determine test root” and so it will recreate the error.flg file, and give up again. I think eventually, there is some generic slave rebooter, that sees - "wow - 6 hours since this machine last took a job, I'm going to reboot it". And then, hopefully all is ok again, although of course the race condition is always a possibility. Conclusion ========== * Necessary: This race condition needs to be fixed (bug 1010173) * Necessary: Logging needs to be improved (don't just say error.flg exists, output what it contains) * Necessary: Errors should not be swallowed, e.g. bug 976106 needs to be completed * Optional: Slave health should provide links to log files, to make life easier to troubleshoot (bug 976100) * Necessary: We need to find out what the numerous error messages are caused by in sut tools, e.g.: 1) http://buildbot-master88.srv.releng.scl3.mozilla.com:8201/builders/Android%202.2%20Armv6%20Tegra%20try%20opt%20test%20robocop-2/builds/3541/steps/reboot%20device/logs/stdio/text green 2) 05/29/2014 21:20:01: DEBUG: Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ 3) Maybe ok, maybe not: 2014-05-29 21:30:14-0700 [Broker,client] I have a leftover directory 'talos-data' that is not being used by the buildmaster: you can delete it now * Optional: until bug 1010173 is landed, consider allowing watcher to reboot device in case of persistent failure
Whoever gets to this first, please clear the review request for the other reviewer unless you think it needs more eyes. This patch does 3 things: 1) Adds the contents of the error.flg file where pertinent in watch_devices.sh 2) Switches sut_lib to use the Linux syntax for ps. The Mac syntax was causing the "Warning: bad syntax, perhaps a bogus '-'" output, and foopies are not Macs any more. 3) Attempts a soft reboot if we cannot ping the device
Attachment #8431652 - Flags: review?(pmoore)
Attachment #8431652 - Flags: review?(bugspam.Callek)
Comment on attachment 8431652 [details] [diff] [review] Reboot unreachable tegras Review of attachment 8431652 [details] [diff] [review]: ----------------------------------------------------------------- r+ with the changes I suggested :) ::: buildfarm/mobile/watch_devices.sh @@ +58,5 @@ > death "Not Starting, slavealloc says we're disabled" 64 > fi > if [ -f /builds/$device/error.flg ]; then > log "error.flg file detected" > + local contents=`cat error.flg` This should be: local contents=`cat /builds/$device/error.flg` @@ +86,5 @@ > else # buildbot running > log "(heartbeat) buildbot is running" > if [ -f /builds/$device/error.flg ]; then > log "Found an error.flg, expecting buildbot to self-kill after this job" > + local contents=`cat error.flg` This should be: local contents=`cat /builds/$device/error.flg` ::: lib/python/sut_lib/__init__.py @@ +197,5 @@ > # 18456 1 18455 /opt/local/Libra ?? S 0:00.88 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/Resources/Python.app/Contents/MacOS/Python /opt/local/Library/Frameworks/Python.framework/Versions/2.6/bin/twistd$ > # 18575 18456 18455 /opt/local/Libra ?? S 0:00.52 > # /opt/local/Library/Frameworks/Python.framework/Versions/2.6/Resources/Python.app/Contents/MacOS/Python > # ../../sut_tools/installApp.py 10.250.49.8 ../talos-data/fennec-4.1a1pr$ > + p, lines = runCommand(['ps', 'xU', 'cltbld', '-O', 'ppid,pgid,command']) This looks good - did you test it? ::: sut_tools/verify.py @@ +367,3 @@ > log.info("verifyDevice: failing to ping") > + # See if we can recover the device with a reboot. > + soft_reboot(dm=dm, device=device) I'll defer to Callek on this one :)
Attachment #8431652 - Flags: review?(pmoore) → review+
Callek, are you happy with the soft_reboot part?
Flags: needinfo?(bugspam.Callek)
Assignee: nobody → coop
Status: NEW → ASSIGNED
Priority: -- → P2
Callek: ping re: comment #8?
Not ideal imo. But yea let's do it
Flags: needinfo?(bugspam.Callek)
Comment on attachment 8431652 [details] [diff] [review] Reboot unreachable tegras Review of attachment 8431652 [details] [diff] [review]: ----------------------------------------------------------------- https://hg.mozilla.org/build/tools/rev/1aee4963a541
Attachment #8431652 - Flags: review?(bugspam.Callek) → checked-in+
This got deployed yesterday.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
See Also: → 1028772
There are just a couple of trailing parts left open, so reopening to track these... * Investigate: 2014-05-29 21:30:14-0700 [Broker,client] I have a leftover directory 'talos-data' that is not being used by the buildmaster: you can delete it now * Until bug 1010173 is landed, consider allowing watcher to reboot device in case of persistent failure
Status: RESOLVED → REOPENED
Depends on: 1010173, 976106, 976100
Resolution: FIXED → ---
Ah - the second one is handled by coop's patch, i believe - so just the first one - but i think i can handle that with adding a bug dependency (we already have a bug about cleanup not being done - i'll find it...)
Depends on: 742479
Summary: Pending queue for tegras > 1000 and time between jobs per tegra is > 6 hours → Growing pending queues for tegras and time between jobs per tegra > 6 hours
* Investigate: 2014-05-29 21:30:14-0700 [Broker,client] I have a leftover directory 'talos-data' that is not being used by the buildmaster: you can delete it now This has been a long standing thing, and is a directory on the foopy. its a product of how tests run for tegras, and is just informational from buildbot when it connects with/coordinates with a master. I call this bug fixed.
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: