Bug 817762 - Mozpools states seem stuck
Status: RESOLVED FIXED
Product: Infrastructure & Operations
Classification: Other
Component: RelOps
Version: other
Hardware: x86_64 Windows 7
Importance: -- normal
Target Milestone: ---
Assigned To: Dustin J. Mitchell [:dustin]
QA Contact: Amy Rich [:arr] [:arich]
Mentors:
Duplicates: 818906 819552 891173
Depends on:
Blocks:
 
Reported: 2012-12-03 12:15 PST by Jake Watkins [:dividehex] (pto-till-12/05)
Modified: 2013-07-29 07:50 PDT
CC: 9 users
See Also:
colo-trip: ---
Due Date:
QA Whiteboard:
Iteration: ---
Points: ---
Cab Review: ServiceNow Change Request


Attachments
logging.patch (1.89 KB, patch)
2012-12-06 09:23 PST, Dustin J. Mitchell [:dustin]
mcote: review+
Details | Diff | Splinter Review
bug817762.patch (5.41 KB, patch)
2012-12-06 14:39 PST, Dustin J. Mitchell [:dustin]
mcote: review+
Details | Diff | Splinter Review
bug817762.patch (2.11 KB, patch)
2013-02-11 14:03 PST, Dustin J. Mitchell [:dustin]
rsoderberg: review+
Details | Diff | Splinter Review
bug817762-set-keepalive.patch (2.09 KB, patch)
2013-03-01 10:57 PST, Dustin J. Mitchell [:dustin]
catlee: review+
Details | Diff | Splinter Review
bug817762-curframes.patch (1.67 KB, patch)
2013-04-02 06:03 PDT, Dustin J. Mitchell [:dustin]
jwatkins: review+
Details | Diff | Splinter Review
bug817762-rollback.patch (2.62 KB, patch)
2013-06-06 09:45 PDT, Dustin J. Mitchell [:dustin]
jwatkins: review+
dustin: checked-in+
Details | Diff | Splinter Review
bug817762-suicide.patch (1.80 KB, patch)
2013-07-11 11:58 PDT, Dustin J. Mitchell [:dustin]
no flags Details | Diff | Splinter Review
bug817762-suicide-r2.patch (2.20 KB, patch)
2013-07-12 08:46 PDT, Dustin J. Mitchell [:dustin]
jwatkins: review+
dustin: checked-in+
Details | Diff | Splinter Review
bug817762-puppet.patch (1.97 KB, patch)
2013-07-12 08:59 PDT, Dustin J. Mitchell [:dustin]
jwatkins: review+
dustin: checked-in+
Details | Diff | Splinter Review

Description Jake Watkins [:dividehex] (pto-till-12/05) 2012-12-03 12:15:18 PST
I ran into an issue while installing the panda-android.2 image on chassis 2.  Everything seemed to be going well until the statemachine entered android_rebooting.  Once it entered that state, it never changed.


2012-12-03T11:20:01 statemachine writing pxe_config panda-android.2, boot config {} to db
2012-12-03T11:20:01 statemachine entering state start_pxe_boot
2012-12-03T11:20:01 statemachine entering state pxe_power_cycling
2012-12-03T11:20:01 bmm setting PXE config to 'panda-android.2'
2012-12-03T11:20:01 bmm initiating power cycle
2012-12-03T11:20:04 statemachine entering state pxe_booting
2012-12-03T11:21:15 syslog second stage URL found: http://10.12.48.27/scripts/android-second-stage.sh
2012-12-03T11:21:15 syslog wget success: --2025-11-06 06:42:00-- http://10.12.48.27/scripts/android-second-stage.sh#012Connecting to 10.12.48.27:80... connected.#012HTTP request sent, awaiting response... 200 OK#012Length: 1829 (1.8K) [application/x-sh]#012Saving to: `/opt/scripts/second-stage.sh'#012#012 0K . 100% 9.53M=0s#012#0122025-11-06 06:42:00 (9.53 MB/s) - `/opt/scripts/second-stage.sh' saved [1829/1829]
2012-12-03T11:21:15 syslog Executing second-stage.sh
2012-12-03T11:21:15 syslog beginning android-second-stage.sh
2012-12-03T11:21:15 syslog setting time with ntpdate
2012-12-03T11:21:24 syslog Submitting lifeguard event at http://10.12.48.27/api/device/panda-0022/event/android_downloading/
2012-12-03T11:21:24 bmm clearing PXE config
2012-12-03T11:21:24 statemachine entering state android_downloading
2012-12-03T11:21:24 syslog got android image panda-android.2
2012-12-03T11:21:24 syslog fetching http://10.12.48.27/artifacts/panda-android.2/system.tar.bz2
2012-12-03T11:21:36 syslog fetching http://10.12.48.27/artifacts/panda-android.2/userdata.tar.bz2
2012-12-03T11:21:38 syslog fetching http://10.12.48.27/artifacts/panda-android.2/boot.tar.bz2
2012-12-03T11:21:39 syslog Formatting partitions
2012-12-03T11:21:51 syslog Mounting partitions
2012-12-03T11:21:51 syslog Submitting lifeguard event at http://10.12.48.27/api/device/panda-0022/event/android_extracting/
2012-12-03T11:21:51 statemachine entering state android_extracting
2012-12-03T11:21:52 syslog Extracting artifacts
2012-12-03T11:23:43 syslog Unmounting partitions
2012-12-03T11:23:58 syslog Imaging complete. Rebooting
2012-12-03T11:23:58 syslog Submitting lifeguard event at http://10.12.48.27/api/device/panda-0022/event/android_rebooting/
2012-12-03T11:23:58 statemachine entering state android_rebooting
Comment 1 Dustin J. Mitchell [:dustin] 2012-12-03 12:40:35 PST
My guess is that something got wedged in the power-cycling code, since all of the affected pandas were on the same relay board, and other operations were proceeding apace.

Amy had something similar happen in rack 10, where she had mis-formatted the relay info.

It's possible that this happens when an "unexpected" error (which the misformatted relay info might trigger) occurs.
Comment 2 Aki Sasaki [:aki] 2012-12-03 17:46:42 PST
Does this only affect a handful of devices?  Only Android?

Asking because jhopkins was having issues with b2g pandas not coming up post-image.
I'm not sure if that's this bug or something else?
Comment 3 Dustin J. Mitchell [:dustin] 2012-12-03 18:00:32 PST
It has nothing to do with the image on the pandas.  The manifestation is that a state that should time out (e.g., *_rebooting) never does.  It may be related to what jhopkins is seeing.  I restarted mozpool on mobile-imaging-001 for him a few hours ago.
Comment 4 Dustin J. Mitchell [:dustin] 2012-12-04 12:09:12 PST
I've been tailing the logs on mobile-services for a while now, and I haven't seen this happen again.
Comment 5 Dustin J. Mitchell [:dustin] 2012-12-04 12:16:52 PST
I also can't reproduce this at home.  I've tried:
 - incorrectly formatted relay hosts (no :'s, like amy's)
 - correctly formatted relay hosts with a bogus hostname
 - correctly formatted relay hosts with a bogus bank #

in all cases, I get a traceback in the log, and things proceed as expected.
Comment 6 Dustin J. Mitchell [:dustin] 2012-12-05 13:29:27 PST
Still no replication here.  Another possibility is that a MySQL connection hangs.  We do all of the MySQL transactions synchronously, so if this hung the devicemanager or requestmanager thread, all timeouts would stop.  I'm not sure how to work around that, so I don't want to try to address it without more evidence.
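
To make that failure mode concrete, here is a minimal sketch (not mozpool's actual code) of a synchronous timeout-polling loop; if the blocking DB query never returns, every timeout behind it stops firing:

import time

def poll_forever(get_timed_out_names, handle_timeout, interval=10):
    while True:
        # If this call blocks forever (e.g. on a dead MySQL connection that
        # never errors out), no timeout below it is ever handled again.
        for name in get_timed_out_names():
            handle_timeout(name)
        time.sleep(interval)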
Comment 7 Dustin J. Mitchell [:dustin] 2012-12-06 08:22:50 PST
*** Bug 818906 has been marked as a duplicate of this bug. ***
Comment 8 Dustin J. Mitchell [:dustin] 2012-12-06 09:22:13 PST
My notes from debugging this on mobile-imaging-001.p1 (which has now been restarted and is back in service):

mozpool-s 17890 apache    3u  IPv4 21022167      0t0      TCP localhost:8010->localhost:49740 (ESTABLISHED)
mozpool-s 17890 apache    4u  IPv4 21020638      0t0      TCP mobile-imaging-001.p1.releng.scl1.mozilla.com:38556->stage-rw-vip.db.scl3.mozilla.com:mysql (CLOSE_WAIT)
mozpool-s 17890 apache    5u  IPv4 19798499      0t0      TCP *:8010 (LISTEN)
mozpool-s 17890 apache    6u  IPv4 21020575      0t0      TCP mobile-imaging-001.p1.releng.scl1.mozilla.com:38554->stage-rw-vip.db.scl3.mozilla.com:mysql (CLOSE_WAIT)
mozpool-s 17890 apache    7u  IPv4 21020014      0t0      TCP mobile-imaging-001.p1.releng.scl1.mozilla.com:38542->stage-rw-vip.db.scl3.mozilla.com:mysql (CLOSE_WAIT)
mozpool-s 17890 apache    9u  IPv4 21020152      0t0      TCP mobile-imaging-001.p1.releng.scl1.mozilla.com:38545->stage-rw-vip.db.scl3.mozilla.com:mysql (CLOSE_WAIT)


all CLOSE_WAIT = server closed, client didn't get the memo yet.  Doing an API call will open one in ESTABLISHED.

on a "running" system, these are all ESTABLISHED, and anything reading/writing those fd's would get an error, so this is unlikely to be a MySQL-originated hang

(gdb) info threads
  11 Thread 0x7f6b3a7ae700 (LWP 17899)  0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
  10 Thread 0x7f6b39dad700 (LWP 17900)  0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
  9 Thread 0x7f6b393ac700 (LWP 17901)  0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
  8 Thread 0x7f6b23fff700 (LWP 17902)  0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
  7 Thread 0x7f6b235fe700 (LWP 17903)  0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
  6 Thread 0x7f6b22bfd700 (LWP 17904)  0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
  5 Thread 0x7f6b221fc700 (LWP 17905)  0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
  4 Thread 0x7f6b217fb700 (LWP 17906)  0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
  3 Thread 0x7f6b20dfa700 (LWP 17907)  0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
  2 Thread 0x7f6b0bfff700 (LWP 17908)  0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
* 1 Thread 0x7f6b45d27700 (LWP 17890)  0x000000368d8dc273 in poll () from /lib64/libc.so.6


device list API call works
power-control API calls work
everything works except states that need to timeout, e.g., pc_rebooting

state transitions (handle_event) seem to work anyway, and use the same LockByName instance
  - handle_event (for bogus event "fooie") successful on every panda on the server
    - slowness is Apache, grr

Oldest state_timeout value is for panda-0159 and panda-0141 (tie), when they went to b2g_rebooting:
device.panda-0141 INFO - [2012-12-05 18:57:37,035] entering state b2g_rebooting
device.panda-0159 INFO - [2012-12-05 18:57:37,409] entering state b2g_rebooting

So whatever failed, failed then?  Is it global or panda-specific?
  -> maybe the tie is responsible??

request.1643 INFO - [2012-12-05 18:57:14,447] entering state closed
127.0.0.1:40744 - - [05/Dec/2012 18:57:14] "HTTP/1.1 POST /api/device/panda-0085/request/" - 409 Conflict
127.0.0.1:40747 - - [05/Dec/2012 18:57:21] "HTTP/1.1 GET /api/request/1638/status/" - 200 OK
device.panda-0141 INFO - [2012-12-05 18:57:37,035] entering state b2g_rebooting  *****
127.0.0.1:40748 - - [05/Dec/2012 18:57:37] "HTTP/1.1 GET /api/device/panda-0141/event/b2g_rebooting/" - 200 OK
device.panda-0159 INFO - [2012-12-05 18:57:37,409] entering state b2g_rebooting  *****
127.0.0.1:40749 - - [05/Dec/2012 18:57:37] "HTTP/1.1 GET /api/device/panda-0159/event/b2g_rebooting/" - 200 OK
device.panda-0137 INFO - [2012-12-05 18:57:38,313] entering state b2g_rebooting (also failed)
127.0.0.1:40750 - - [05/Dec/2012 18:57:38] "HTTP/1.1 GET /api/device/panda-0137/event/b2g_rebooting/" - 200 OK
device.panda-0143 INFO - [2012-12-05 18:57:47,087] entering state b2g_rebooting (also failed) ..
127.0.0.1:40751 - - [05/Dec/2012 18:57:47] "HTTP/1.1 GET /api/device/panda-0143/event/b2g_rebooting/" - 200 OK
device.panda-0158 INFO - [2012-12-05 18:57:48,686] entering state b2g_rebooting
127.0.0.1:40752 - - [05/Dec/2012 18:57:48] "HTTP/1.1 GET /api/device/panda-0158/event/b2g_rebooting/" - 200 OK
device.panda-0127 INFO - [2012-12-05 18:57:50,703] entering state b2g_rebooting
127.0.0.1:40753 - - [05/Dec/2012 18:57:50] "HTTP/1.1 GET /api/device/panda-0127/event/b2g_rebooting/" - 200 OK
device.panda-0104 INFO - [2012-12-05 18:57:54,091] entering state b2g_rebooting
127.0.0.1:40754 - - [05/Dec/2012 18:57:54] "HTTP/1.1 GET /api/device/panda-0104/event/b2g_rebooting/" - 200 OK
127.0.0.1:40755 - - [05/Dec/2012 18:57:54] "HTTP/1.1 GET /api/request/1630/status/" - 200 OK
request.1644 INFO - [2012-12-05 18:57:55,004] entering state finding_device
request.1644 INFO - [2012-12-05 18:57:55,046] assigning requested device panda-0106
request.1644 INFO - [2012-12-05 18:57:55,086] request succeeded
request.1644 INFO - [2012-12-05 18:57:55,086] entering state contacting_lifeguard
request.1644 INFO - [2012-12-05 18:57:55,144] entering state device_busy
requests.packages.urllib3.connectionpool INFO - [2012-12-05 18:57:55,169] Starting new HTTP connection (1): mobile-imaging-001.p1.releng.scl1.mozilla.com
device.panda-0106 WARNING - [2012-12-05 18:57:55,192] ignored event free in state failed_pxe_booting
127.0.0.1:40758 - - [05/Dec/2012 18:57:55] "HTTP/1.1 POST /api/device/panda-0106/event/free/" - 200 OK
requests.packages.urllib3.connectionpool DEBUG - [2012-12-05 18:57:55,211] "POST /api/device/panda-0106/event/free/ HTTP/1.1" 200 2
request.1644 INFO - [2012-12-05 18:57:55,211] entering state closed
127.0.0.1:40756 - - [05/Dec/2012 18:57:55] "HTTP/1.1 POST /api/device/panda-0106/request/" - 409 Conflict
127.0.0.1:40759 - - [05/Dec/2012 18:57:56] "HTTP/1.1 GET /api/request/1632/status/" - 200 OK
device.panda-0150 INFO - [2012-12-05 18:57:58,719] entering state b2g_rebooting
127.0.0.1:40760 - - [05/Dec/2012 18:57:58] "HTTP/1.1 GET /api/device/panda-0150/event/b2g_rebooting/" - 200 OK
127.0.0.1:40761 - - [05/Dec/2012 18:58:01] "HTTP/1.1 GET /api/request/1634/status/" - 200 OK
127.0.0.1:40762 - - [05/Dec/2012 18:58:03] "HTTP/1.1 GET /api/request/1635/status/" - 200 OK
127.0.0.1:40763 - - [05/Dec/2012 18:58:04] "HTTP/1.1 GET /api/request/1636/status/" - 200 OK
127.0.0.1:40764 - - [05/Dec/2012 18:58:13] "HTTP/1.1 GET /api/request/1633/status/" - 200 OK
127.0.0.1:40765 - - [05/Dec/2012 18:58:14] "HTTP/1.1 GET /api/request/1637/status/" - 200 OK
request.1645 INFO - [2012-12-05 18:58:14,586] entering state finding_device
request.1645 INFO - [2012-12-05 18:58:14,628] assigning requested device panda-0085
request.1645 INFO - [2012-12-05 18:58:14,641] request succeeded
request.1645 INFO - [2012-12-05 18:58:14,641] entering state contacting_lifeguard
request.1645 INFO - [2012-12-05 18:58:14,700] entering state device_busy
requests.packages.urllib3.connectionpool INFO - [2012-12-05 18:58:14,780] Starting new HTTP connection (1): mobile-imaging-001.p1.releng.scl1.mozilla.com
device.panda-0085 WARNING - [2012-12-05 18:58:14,803] ignored event free in state failed_b2g_downloading
127.0.0.1:40768 - - [05/Dec/2012 18:58:14] "HTTP/1.1 POST /api/device/panda-0085/event/free/" - 200 OK
requests.packages.urllib3.connectionpool DEBUG - [2012-12-05 18:58:14,821] "POST /api/device/panda-0085/event/free/ HTTP/1.1" 200 2
request.1645 INFO - [2012-12-05 18:58:14,822] entering state closed
127.0.0.1:40766 - - [05/Dec/2012 18:58:14] "HTTP/1.1 POST /api/device/panda-0085/request/" - 409 Conflict
127.0.0.1:40769 - - [05/Dec/2012 18:58:21] "HTTP/1.1 GET /api/request/1638/status/" - 200 OK
127.0.0.1:40770 - - [05/Dec/2012 18:58:54] "HTTP/1.1 GET /api/request/1630/status/" - 200 OK
request.1646 INFO - [2012-12-05 18:58:55,355] entering state finding_device
request.1646 INFO - [2012-12-05 18:58:55,397] assigning requested device panda-0106
request.1646 INFO - [2012-12-05 18:58:55,410] request succeeded
request.1646 INFO - [2012-12-05 18:58:55,410] entering state contacting_lifeguard
request.1646 INFO - [2012-12-05 18:58:55,501] entering state device_busy
requests.packages.urllib3.connectionpool INFO - [2012-12-05 18:58:55,526] Starting new HTTP connection (1): mobile-imaging-001.p1.releng.scl1.mozilla.com
device.panda-0106 WARNING - [2012-12-05 18:58:55,548] ignored event free in state failed_pxe_booting
127.0.0.1:40773 - - [05/Dec/2012 18:58:55] "HTTP/1.1 POST /api/device/panda-0106/event/free/" - 200 OK
requests.packages.urllib3.connectionpool DEBUG - [2012-12-05 18:58:55,566] "POST /api/device/panda-0106/event/free/ HTTP/1.1" 200 2
request.1646 INFO - [2012-12-05 18:58:55,566] entering state closed
127.0.0.1:40771 - - [05/Dec/2012 18:58:55] "HTTP/1.1 POST /api/device/panda-0106/request/" - 409 Conflict
127.0.0.1:40774 - - [05/Dec/2012 18:58:56] "HTTP/1.1 GET /api/request/1632/status/" - 200 OK
127.0.0.1:40775 - - [05/Dec/2012 18:59:01] "HTTP/1.1 GET /api/request/1634/status/" - 200 OK
127.0.0.1:40776 - - [05/Dec/2012 18:59:03] "HTTP/1.1 GET /api/request/1635/status/" - 200 OK
127.0.0.1:40777 - - [05/Dec/2012 18:59:04] "HTTP/1.1 GET /api/request/1636/status/" - 200 OK
127.0.0.1:40778 - - [05/Dec/2012 18:59:14] "HTTP/1.1 GET /api/request/1633/status/" - 200 OK
127.0.0.1:40779 - - [05/Dec/2012 18:59:14] "HTTP/1.1 GET /api/request/1637/status/" - 200 OK
request.1647 INFO - [2012-12-05 18:59:14,969] entering state finding_device
request.1647 INFO - [2012-12-05 18:59:15,010] assigning requested device panda-0085
request.1647 INFO - [2012-12-05 18:59:15,025] request succeeded
request.1647 INFO - [2012-12-05 18:59:15,025] entering state contacting_lifeguard
request.1647 INFO - [2012-12-05 18:59:15,218] entering state device_busy
requests.packages.urllib3.connectionpool INFO - [2012-12-05 18:59:15,244] Starting new HTTP connection (1): mobile-imaging-001.p1.releng.scl1.mozilla.com
device.panda-0085 WARNING - [2012-12-05 18:59:15,267] ignored event free in state failed_b2g_downloading
127.0.0.1:40782 - - [05/Dec/2012 18:59:15] "HTTP/1.1 POST /api/device/panda-0085/event/free/" - 200 OK
requests.packages.urllib3.connectionpool DEBUG - [2012-12-05 18:59:15,285] "POST /api/device/panda-0085/event/free/ HTTP/1.1" 200 2
request.1647 INFO - [2012-12-05 18:59:15,285] entering state closed
127.0.0.1:40780 - - [05/Dec/2012 18:59:15] "HTTP/1.1 POST /api/device/panda-0085/request/" - 409 Conflict
127.0.0.1:40783 - - [05/Dec/2012 18:59:21] "HTTP/1.1 GET /api/request/1638/status/" - 200 OK
127.0.0.1:40784 - - [05/Dec/2012 18:59:51] "HTTP/1.1 GET /api/bmm/pxe_config/list/" - 200 OK
127.0.0.1:40785 - - [05/Dec/2012 18:59:54] "HTTP/1.1 GET /api/request/1630/status/" - 200 OK
request.1648 INFO - [2012-12-05 18:59:55,708] entering state finding_device
request.1648 INFO - [2012-12-05 18:59:55,748] assigning requested device panda-0106
request.1648 INFO - [2012-12-05 18:59:55,762] request succeeded
request.1648 INFO - [2012-12-05 18:59:55,762] entering state contacting_lifeguard
request.1648 INFO - [2012-12-05 18:59:55,817] entering state device_busy
requests.packages.urllib3.connectionpool INFO - [2012-12-05 18:59:55,843] Starting new HTTP connection (1): mobile-imaging-001.p1.releng.scl1.mozilla.com
device.panda-0106 WARNING - [2012-12-05 18:59:55,866] ignored event free in state failed_pxe_booting
127.0.0.1:40788 - - [05/Dec/2012 18:59:55] "HTTP/1.1 POST /api/device/panda-0106/event/free/" - 200 OK
requests.packages.urllib3.connectionpool DEBUG - [2012-12-05 18:59:55,885] "POST /api/device/panda-0106/event/free/ HTTP/1.1" 200 2
request.1648 INFO - [2012-12-05 18:59:55,885] entering state closed
127.0.0.1:40786 - - [05/Dec/2012 18:59:55] "HTTP/1.1 POST /api/device/panda-0106/request/" - 409 Conflict
127.0.0.1:40789 - - [05/Dec/2012 18:59:56] "HTTP/1.1 GET /api/request/1632/status/" - 200 OK
<---- timeout for panda-0141 and panda-0159 should have happened here ---->
127.0.0.1:40790 - - [05/Dec/2012 19:00:01] "HTTP/1.1 GET /api/request/1634/status/" - 200 OK
127.0.0.1:40791 - - [05/Dec/2012 19:00:03] "HTTP/1.1 GET /api/request/1635/status/" - 200 OK
127.0.0.1:40792 - - [05/Dec/2012 19:00:04] "HTTP/1.1 GET /api/request/1636/status/" - 200 OK
127.0.0.1:40793 - - [05/Dec/2012 19:00:14] "HTTP/1.1 GET /api/request/1633/status/" - 200 OK
127.0.0.1:40794 - - [05/Dec/2012 19:00:14] "HTTP/1.1 GET /api/request/1637/status/" - 200 OK
Comment 9 Dustin J. Mitchell [:dustin] 2012-12-06 09:23:03 PST
Created attachment 689234 [details] [diff] [review]
logging.patch
Comment 10 Mark Côté [:mcote] 2012-12-06 09:26:17 PST
Comment on attachment 689234 [details] [diff] [review]
logging.patch

Cool cool.
Comment 11 Dustin J. Mitchell [:dustin] 2012-12-06 14:11:29 PST
With the logging in place,

device INFO - [2012-12-06 12:39:45,679] tick
device ERROR - [2012-12-06 12:39:45,686] _run failed
Traceback (most recent call last):
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/statedriver.py", line 44, in run
    self._run();
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/statedriver.py", line 65, in _run
    for machine_name in self._get_timed_out_machine_names():
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/lifeguard/devicemachine.py", line 61, in _get_timed_out_machine_names
    return data.get_timed_out_devices(self.imaging_server_id)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/data.py", line 218, in get_timed_out_devices
    return get_timed_out(model.devices, model.devices.c.name, imaging_server_id)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/data.py", line 199, in get_timed_out
    & (tbl.c.imaging_server_id == imaging_server_id)))
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1449, in execute
    params)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1584, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1698, in _execute_context
    context)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1691, in _execute_context
    context)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 331, in do_execute
    cursor.execute(statement, parameters)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/cursors.py", line 117, in execute
    self.errorhandler(self, exc, value)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/connections.py", line 189, in defaulterrorhandler
    raise errorclass, errorvalue
ProgrammingError: (ProgrammingError) (1146, u"Table 'mozpool.devices' doesn't exist") 'SELECT devices.name \nFROM devices \nWHERE devices.state_timeout < %s AND devices.imaging_server_id = %s' (datetime.datetime(2012, 12, 6, 12, 39, 45, 679912), 201L)
device INFO - [2012-12-06 12:39:45,688] _run returned


(and I know why that specific error happened -- I had to quickly drop and re-create the table to work around a MySQL bug).

I'm willing to blame earlier failures on similar things, and assume that the traceback just didn't make it into the logs (although it should..).  So, time to bulletproof that loop.
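
Roughly what "bulletproofing" means here (a sketch, not the attached patch): catch and log any exception per polling pass so a single failed query can't silently end the thread.

import logging
import time

logger = logging.getLogger('statedriver')

def run(poll_once, interval=10):
    while True:
        try:
            poll_once()
        except Exception:
            # keep the thread alive and leave a trace in the logs
            logger.exception('timeout polling pass failed')
        time.sleep(interval)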
Comment 12 Dustin J. Mitchell [:dustin] 2012-12-06 14:39:42 PST
Created attachment 689396 [details] [diff] [review]
bug817762.patch

This leaves a not-unreasonable amount of logging in, I think.
Comment 13 Mark Côté [:mcote] 2012-12-06 15:19:21 PST
Comment on attachment 689396 [details] [diff] [review]
bug817762.patch

Review of attachment 689396 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozpool/statedriver.py
@@ +42,1 @@
>          try:

Missed this in the last patch, but having a try/except inside a try/finally seems like overkill. :)
Comment 14 Dustin J. Mitchell [:dustin] 2012-12-06 15:27:28 PST
Mark also pointed out in IRC that the thread is not a daemon, so it makes interrupting the process a bit hard (note that web.py's worker threads do the same thing - they don't quit until Firefox drops the HTTP/1.1 connection).
Comment 15 Dustin J. Mitchell [:dustin] 2012-12-06 19:10:03 PST
I landed this with a flattened try syntax and thd.setDaemon(1) added.

I'm going to optimistically call this done.
Comment 16 Dustin J. Mitchell [:dustin] 2012-12-07 14:16:07 PST
*** Bug 819552 has been marked as a duplicate of this bug. ***
Comment 17 Dustin J. Mitchell [:dustin] 2013-02-06 14:33:33 PST
optimism is just future disappointment..

request WARNING - [2013-02-06 14:21:20,200] polling thread still running at 104186s; not starting another
device WARNING - [2013-02-06 14:21:20,200] polling thread still running at 104186s; not starting another

this is on the staging server.
Comment 18 Dustin J. Mitchell [:dustin] 2013-02-06 14:52:21 PST
Looking back in the logs, I see a few times when a single timeout loop spilled past 10 seconds, but recovered afterward.  These really are just warnings.

device WARNING - [2013-02-05 03:19:17,396] polling thread still running at 10s; not starting another
device.fakepanda-2043 INFO - [2013-02-05 03:19:17,271] entering state free
device INFO - [2013-02-05 03:19:18,356] handling timeout on fakepanda-2044
fakedevices.fakepanda-2043 DEBUG - [2013-02-05 03:19:18,360] pinged; result=True (state self_test_ok)
device WARNING - [2013-02-05 03:19:18,826] polling thread still running at 11s; not starting another
device.fakepanda-2044 INFO - [2013-02-05 03:19:18,599] entering state free
device INFO - [2013-02-05 03:19:19,808] handling timeout on panda-0010
fakedevices.fakepanda-2044 DEBUG - [2013-02-05 03:19:19,810] pinged; result=True (state self_test_ok)
device WARNING - [2013-02-05 03:19:20,310] polling thread still running at 12s; not starting another
device.panda-0010 INFO - [2013-02-05 03:19:20,062] entering state free
device INFO - [2013-02-05 03:19:22,145] handling timeout on fakepanda-2052

later, some failures caused by losing a db connection:

request WARNING - [2013-02-05 09:25:03,688] polling thread still running at 10s; not starting another
device WARNING - [2013-02-05 09:25:03,865] polling thread still running at 10s; not starting another
request WARNING - [2013-02-05 09:25:04,689] polling thread still running at 11s; not starting another
device WARNING - [2013-02-05 09:25:04,866] polling thread still running at 11s; not starting another
request WARNING - [2013-02-05 09:25:05,791] polling thread still running at 12s; not starting another
device WARNING - [2013-02-05 09:25:05,968] polling thread still running at 12s; not starting another
request WARNING - [2013-02-05 09:25:07,003] polling thread still running at 13s; not starting another
device WARNING - [2013-02-05 09:25:07,180] polling thread still running at 13s; not starting another
request WARNING - [2013-02-05 09:25:08,336] polling thread still running at 14s; not starting another
device WARNING - [2013-02-05 09:25:08,512] polling thread still running at 14s; not starting another
request WARNING - [2013-02-05 09:25:09,802] polling thread still running at 16s; not starting another
device WARNING - [2013-02-05 09:25:09,977] polling thread still running at 16s; not starting another
request WARNING - [2013-02-05 09:25:11,414] polling thread still running at 17s; not starting another
device WARNING - [2013-02-05 09:25:11,590] polling thread still running at 17s; not starting another
request WARNING - [2013-02-05 09:25:13,188] polling thread still running at 19s; not starting another
device WARNING - [2013-02-05 09:25:13,363] polling thread still running at 19s; not starting another
request WARNING - [2013-02-05 09:25:15,139] polling thread still running at 21s; not starting another
device WARNING - [2013-02-05 09:25:15,314] polling thread still running at 21s; not starting another
request WARNING - [2013-02-05 09:25:17,286] polling thread still running at 23s; not starting another
device WARNING - [2013-02-05 09:25:17,460] polling thread still running at 23s; not starting another
request WARNING - [2013-02-05 09:25:19,646] polling thread still running at 25s; not starting another
device WARNING - [2013-02-05 09:25:19,821] polling thread still running at 25s; not starting another
request WARNING - [2013-02-05 09:25:22,241] polling thread still running at 28s; not starting another
device WARNING - [2013-02-05 09:25:22,418] polling thread still running at 28s; not starting another
request WARNING - [2013-02-05 09:25:25,098] polling thread still running at 31s; not starting another
device WARNING - [2013-02-05 09:25:25,274] polling thread still running at 31s; not starting another
request WARNING - [2013-02-05 09:25:28,240] polling thread still running at 34s; not starting another
device WARNING - [2013-02-05 09:25:28,416] polling thread still running at 34s; not starting another
request WARNING - [2013-02-05 09:25:31,696] polling thread still running at 38s; not starting another
device WARNING - [2013-02-05 09:25:31,872] polling thread still running at 38s; not starting another
request WARNING - [2013-02-05 09:25:35,498] polling thread still running at 41s; not starting another
device WARNING - [2013-02-05 09:25:35,674] polling thread still running at 41s; not starting another
request WARNING - [2013-02-05 09:25:39,679] polling thread still running at 46s; not starting another
device WARNING - [2013-02-05 09:25:39,856] polling thread still running at 45s; not starting another
request WARNING - [2013-02-05 09:25:44,279] polling thread still running at 50s; not starting another
device WARNING - [2013-02-05 09:25:44,455] polling thread still running at 50s; not starting another
request WARNING - [2013-02-05 09:25:49,339] polling thread still running at 55s; not starting another
device WARNING - [2013-02-05 09:25:49,515] polling thread still running at 55s; not starting another
request WARNING - [2013-02-05 09:25:54,899] polling thread still running at 61s; not starting another
device WARNING - [2013-02-05 09:25:55,076] polling thread still running at 61s; not starting another
request WARNING - [2013-02-05 09:26:01,022] polling thread still running at 67s; not starting another
device WARNING - [2013-02-05 09:26:01,198] polling thread still running at 67s; not starting another
request WARNING - [2013-02-05 09:26:07,756] polling thread still running at 74s; not starting another
device WARNING - [2013-02-05 09:26:07,929] polling thread still running at 74s; not starting another
request WARNING - [2013-02-05 09:26:15,165] polling thread still running at 81s; not starting another
device WARNING - [2013-02-05 09:26:15,337] polling thread still running at 81s; not starting another
request WARNING - [2013-02-05 09:26:23,312] polling thread still running at 89s; not starting another
device WARNING - [2013-02-05 09:26:23,482] polling thread still running at 89s; not starting another
request WARNING - [2013-02-05 09:26:32,273] polling thread still running at 98s; not starting another
device WARNING - [2013-02-05 09:26:32,446] polling thread still running at 98s; not starting another
request WARNING - [2013-02-05 09:26:42,124] polling thread still running at 108s; not starting another
device WARNING - [2013-02-05 09:26:42,306] polling thread still running at 108s; not starting another
request WARNING - [2013-02-05 09:26:52,970] polling thread still running at 119s; not starting another
device WARNING - [2013-02-05 09:26:53,147] polling thread still running at 119s; not starting another
request WARNING - [2013-02-05 09:27:04,900] polling thread still running at 131s; not starting another
device WARNING - [2013-02-05 09:27:05,075] polling thread still running at 131s; not starting another
request WARNING - [2013-02-05 09:27:18,021] polling thread still running at 144s; not starting another
device WARNING - [2013-02-05 09:27:18,194] polling thread still running at 144s; not starting another
Traceback (most recent call last):
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/web/application.py", line 239, in process
    return self.handle()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/web/application.py", line 230, in handle
    return self._delegate(fn, self.fvars, args)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/web/application.py", line 420, in _delegate
    return handle_class(cls)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/web/application.py", line 396, in handle_class
    return tocall(*args)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/templeton/handlers.py", line 65, in wrap
    results = json.dumps(func(*a, **kw), cls=DateTimeJSONEncoder)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/bmm/handlers.py", line 86, in GET
    return data.list_pxe_configs()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/data.py", line 445, in list_pxe_configs
    conn = sql.get_conn()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/sql.py", line 49, in get_conn
    return get_engine().connect()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2471, in connect
    return self._connection_cls(self, **kwargs)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 878, in __init__
    self.__connection = connection or engine.raw_connection()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2557, in raw_connection
    return self.pool.unique_connection()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/pool.py", line 184, in unique_connection
    return _ConnectionFairy(self).checkout()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/pool.py", line 474, in checkout
    self)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/event.py", line 377, in __call__
    fn(*args, **kw)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/sql.py", line 15, in _checkout_listener
    cursor.execute("SELECT 1")
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/cursors.py", line 117, in execute
    self.errorhandler(self, exc, value)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/connections.py", line 187, in defaulterrorhandler
    raise Error(errorclass, errorvalue)
Error: (<class 'socket.error'>, error(104, 'Connection reset by peer'))

127.0.0.1:38413 - - [05/Feb/2013 09:27:31] "HTTP/1.1 GET /api/bmm/pxe_config/list/" - 500 Internal Server Error
request WARNING - [2013-02-05 09:27:32,455] polling thread still running at 158s; not starting another
device WARNING - [2013-02-05 09:27:32,626] polling thread still running at 158s; not starting another
request WARNING - [2013-02-05 09:27:48,329] polling thread still running at 174s; not starting another
device WARNING - [2013-02-05 09:27:48,505] polling thread still running at 174s; not starting another
request WARNING - [2013-02-05 09:28:05,789] polling thread still running at 192s; not starting another
device WARNING - [2013-02-05 09:28:05,972] polling thread still running at 192s; not starting another
request WARNING - [2013-02-05 09:28:24,997] polling thread still running at 211s; not starting another
device WARNING - [2013-02-05 09:28:25,181] polling thread still running at 211s; not starting another
request WARNING - [2013-02-05 09:28:46,130] polling thread still running at 232s; not starting another
device WARNING - [2013-02-05 09:28:46,313] polling thread still running at 232s; not starting another
request WARNING - [2013-02-05 09:29:09,369] polling thread still running at 255s; not starting another
device WARNING - [2013-02-05 09:29:09,561] polling thread still running at 255s; not starting another
127.0.0.1:38414 - - [05/Feb/2013 09:29:31] "HTTP/1.1 GET /api/bmm/pxe_config/list/" - 200 OK
request WARNING - [2013-02-05 09:29:34,939] polling thread still running at 281s; not starting another
device WARNING - [2013-02-05 09:29:35,120] polling thread still running at 281s; not starting another
request WARNING - [2013-02-05 09:30:03,061] polling thread still running at 309s; not starting another
device WARNING - [2013-02-05 09:30:03,243] polling thread still running at 309s; not starting another
127.0.0.1:38417 - - [05/Feb/2013 09:30:27] "HTTP/1.1 GET /api/device/panda-0021/state/" - 200 OK
127.0.0.1:38416 - - [05/Feb/2013 09:30:27] "HTTP/1.1 GET /api/device/panda-0010/state/" - 200 OK
request WARNING - [2013-02-05 09:30:33,997] polling thread still running at 340s; not starting another
device WARNING - [2013-02-05 09:30:34,181] polling thread still running at 340s; not starting another
request WARNING - [2013-02-05 09:31:08,022] polling thread still running at 374s; not starting another
device WARNING - [2013-02-05 09:31:08,208] polling thread still running at 374s; not starting another
request WARNING - [2013-02-05 09:31:45,464] polling thread still running at 411s; not starting another
device WARNING - [2013-02-05 09:31:45,628] polling thread still running at 411s; not starting another
request WARNING - [2013-02-05 09:32:26,644] polling thread still running at 452s; not starting another
device WARNING - [2013-02-05 09:32:26,808] polling thread still running at 452s; not starting another
..

it never recovered from that one.

Immediate plan is to add some monitoring for this (probably by touching a file on every startup, and monitoring that file age with nagios).  I'll need to stare at the code some more to figure out what's really going wrong here.  The MySQL error is probably a big hint.
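
One plausible shape for that heartbeat (the path and the touch-per-pass placement here are assumptions, not the eventual implementation):

import os
import time

HEARTBEAT = '/var/run/mozpool/poll-heartbeat'  # hypothetical path

def touch_heartbeat(path=HEARTBEAT):
    # call this from each polling pass; a wedged loop stops updating mtime
    with open(path, 'a'):
        os.utime(path, None)

def heartbeat_age(path=HEARTBEAT):
    return time.time() - os.stat(path).st_mtime

A file-age check (e.g. nagios's check_file_age) pointed at that path would then alert once the loop stops touching it.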
Comment 19 Dustin J. Mitchell [:dustin] 2013-02-06 18:07:18 PST
bug 838925 for the monitoring.
Comment 20 Dustin J. Mitchell [:dustin] 2013-02-06 18:09:19 PST
http://stackoverflow.com/questions/132058/showing-the-stack-trace-from-a-running-python-application may be helpful for debugging this
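
The usual trick from that thread, for reference (sketch only -- the eventual curframes patch may differ): install a signal handler that dumps every thread's stack via sys._current_frames(), so a hung process can be poked with kill -USR1.

import signal
import sys
import traceback

def dump_stacks(signum, frame):
    for thread_id, stack in sys._current_frames().items():
        sys.stderr.write('---- thread %s ----\n' % thread_id)
        sys.stderr.write(''.join(traceback.format_stack(stack)))

signal.signal(signal.SIGUSR1, dump_stacks)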
Comment 21 Dustin J. Mitchell [:dustin] 2013-02-11 07:31:34 PST
In my dev environment, I tried cutting off traffic from mozpool to mysql using iptables -j DROP.  I do see the timeout loop hang, which is not ideal -- but it comes back after the iptables rule is deleted.  This is the case even if I restart mysql in the interim - the ongoing queries eventually get an RST, at which time they raise an exception, which is caught and logged, and life goes on normally.

I would blame synchronous calls to `requests`, but those only occur in the mozpool layer, and both layers are seeing the timeouts at the same time.
Comment 22 Dustin J. Mitchell [:dustin] 2013-02-11 09:47:32 PST
Still can't replicate :(
Comment 23 Dustin J. Mitchell [:dustin] 2013-02-11 12:14:54 PST
I *may* have replicated this.  At least, I have a thread that's hung for 3558s, after blocking and unblocking traffic to the mysql server repeatedly.

It's blocked somewhere in DeviceMachine.handle_timeout().

My guess is that I managed to block traffic right after a query had been sent, but before the query response was returned.  So from the mozpool system, it looks like the query is still pending (with a full TCP connection and everything).  From the DB server, it probably tried to send a few packets for the response, got no ACKs, and finally closed the socket.

So I think that the fix is to add timeouts to all DB calls and/or set SO_KEEPALIVE and tweak the kernel to use something less than 2h for that timer.
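
For the per-socket half of that idea, the setsockopt calls would look roughly like this (Linux-specific TCP_KEEP* options, assuming the raw socket object is reachable; the alternative is the kernel-wide sysctls):

import socket

def enable_keepalive(sock, idle=60, interval=60, probes=5):
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)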
Comment 24 Dustin J. Mitchell [:dustin] 2013-02-11 12:26:02 PST
Poking around with gdb shows a bunch of threads in __libc_recv.  The fd is optimized out, unfortunately, but that's strong evidence we're hung in a read here.
Comment 25 Dustin J. Mitchell [:dustin] 2013-02-11 13:32:46 PST
It's funny how all roads lead back to asynchronous programming being a good idea :)

Otherwise, the best fix I can find for this is to use multiprocessing to put all DB connections in a subprocess, and shoot that process in the head if it takes longer than desired to return a result.  That's pretty awful.
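
For the record, that multiprocessing workaround would be something like the following hypothetical helper -- part of what makes it awful is that the DB connection can't be shared with the child, so every call pays for a fork:

import multiprocessing

def _runner(queue, func, args):
    queue.put(func(*args))

def query_with_deadline(func, args=(), timeout=30):
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_runner, args=(queue, func, args))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()          # shoot the hung worker in the head
        proc.join()
        raise RuntimeError('DB call did not return within %ss' % timeout)
    return queue.get(timeout=1)   # Queue.Empty here means the child died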
Comment 26 Dustin J. Mitchell [:dustin] 2013-02-11 13:48:52 PST
Or use SO_KEEPALIVE.  I've confirmed that MySQLdb sets SO_KEEPALIVE, so it only remains to dial the kernel's timeout to something < 2h.
Comment 27 Dustin J. Mitchell [:dustin] 2013-02-11 14:03:46 PST
Created attachment 712617 [details] [diff] [review]
bug817762.patch
Comment 28 Richard Soderberg [:atoll] 2013-02-11 14:41:11 PST
Comment on attachment 712617 [details] [diff] [review]
bug817762.patch

Review of attachment 712617 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, feel free to ignore my bikeshed about timeout values below.

::: modules/tweaks/manifests/tcp_keepalive.pp
@@ +1,5 @@
> +# This Source Code Form is subject to the terms of the Mozilla Public
> +# License, v. 2.0. If a copy of the MPL was not distributed with this
> +# file, You can obtain one at http://mozilla.org/MPL/2.0/.
> +
> +# Configure the kernel to send keepalive packets every minute on idle tcp

s/every minute/60 seconds/, to match the value below ("every minute" is technically equivalent to "60 seconds", but the unit conversion obscures the link between this paragraph and that value).

@@ +2,5 @@
> +# License, v. 2.0. If a copy of the MPL was not distributed with this
> +# file, You can obtain one at http://mozilla.org/MPL/2.0/.
> +
> +# Configure the kernel to send keepalive packets every minute on idle tcp
> +# connections, and fail after 5m of no response (5 probes).  This is pretty

I don't think that itvl=60 is a good choice here. If the remote isn't responding, it probably isn't going to respond by waiting 60 seconds between probes; a delay of 5 seconds between probes should be more than sufficient?

300 second timeouts exceed the useful timeout window for most applications. What is the maximum timeout that provides a tolerable experience?

The tcp_keepalive_time here of "60 seconds" implies that your app needs to know within 60 seconds that the connection has gone away, and so you would need time=40, itvl=5, probes=4 to make sure that the socket is terminated at T=60s for being idle.

If the maximum application timeout window is 300 seconds, then you would use e.g. time=280, itvl=5, probes=4.

@@ +9,5 @@
> +# failure at the other end of the connection.
> +
> +class tweaks::tcp_keepalive {
> +    case $::operatingsystem {
> +        CentOS, Ubuntu: {

You'll have better compatibility with either case $::osfamily { RedHat, Debian: ... } or case $::kernel { Linux: ... }

@@ +11,5 @@
> +class tweaks::tcp_keepalive {
> +    case $::operatingsystem {
> +        CentOS, Ubuntu: {
> +            sysctl::value {
> +                'net.ipv4.tcp_keepalive_time':

# wait N seconds since last data packet (or probe response) received before sending the first probe

@@ +13,5 @@
> +        CentOS, Ubuntu: {
> +            sysctl::value {
> +                'net.ipv4.tcp_keepalive_time':
> +                    value => 60;
> +                'net.ipv4.tcp_keepalive_itvl':

# wait N seconds before a keepalive probe fails (times out)

@@ +15,5 @@
> +                'net.ipv4.tcp_keepalive_time':
> +                    value => 60;
> +                'net.ipv4.tcp_keepalive_itvl':
> +                    value => 60;
> +                'net.ipv4.tcp_keepalive_probes':

# terminate the connection after N probes have timed out

@@ +17,5 @@
> +                'net.ipv4.tcp_keepalive_itvl':
> +                    value => 60;
> +                'net.ipv4.tcp_keepalive_probes':
> +                    value => 5;
> +            }

Add per-line docs for each value, as these settings are particularly tricky to understand without context (and even the explanatory paragraph above isn't necessarily enough).
Comment 29 Richard Soderberg [:atoll] 2013-02-11 14:44:11 PST
(In reply to Richard Soderberg [:atoll] from comment #28)
> use need time=40, itvl=5, probes=4 to make sure that the socket is
> terminated at T=60s for being idle.

Correction: to make sure that the socket is terminated at T=60s for being dead.
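
(For reference, the arithmetic: the kernel gives up on a dead connection roughly tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes seconds after the last data, so time=40, itvl=5, probes=4 gives 40 + 5*4 = 60s, matching the T=60s above.)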
Comment 30 Dustin J. Mitchell [:dustin] 2013-02-11 15:01:00 PST
"a few minutes" is OK for this application.  I picked 3.  $::operatingsystem is a very common pattern around puppetagain, so I kept that pattern, although you're right that $::kernel is equivalent.
Comment 31 Dustin J. Mitchell [:dustin] 2013-02-12 10:06:57 PST
I landed this.  Let's hope it works :)
Comment 32 Dustin J. Mitchell [:dustin] 2013-02-25 07:44:52 PST
This is still occurring for shorter times..  On mobile-imaging-002, I see some pauses for about 900s and about 1050s.  There's still no obvious cause in the logs - the timeout before is for a device in the free state, so it should just be pinging.

I see the same on mobile-imaging-005, 003, and 004.  All at the same time - beginning at 
  device WARNING - [2013-02-23 04:39:12,276] polling thread still running at 10s; not starting another

:cyborgshadow, I see

[04:35:55] <nagios-scl3> cyborgshadow: Downtime for backup2.db.scl3.mozilla.com scheduled for 4:00:00

right around that time.  Did anything related happen on the buildbot cluster?
Comment 33 Armen Zambrano [:armenzg] (EDT/UTC-4) 2013-02-25 07:57:37 PST
FTR, racks 2 to 5 are not in use by the automation since we don't yet have enough load.

I could set rack 2 and see if it clears this.

FYI, racks 1 to 5 are allocated for b2g pandas.
Comment 34 Dustin J. Mitchell [:dustin] 2013-02-25 08:01:36 PST
I assume it's good/bad luck which racks are affected.
Comment 36 Dustin J. Mitchell [:dustin] 2013-03-01 07:44:51 PST
All of the servers failed with something like

request WARNING - [2013-02-28 13:48:30,166] polling thread still running at 211s; not starting another
device WARNING - [2013-02-28 13:48:38,886] polling thread still running at 211s; not starting another

at exactly that time.  They then recovered.  That was the Great Switch Failure Of 2013-02-28.  So that's good news.  However, one didn't recover, on mobile-imaging-007:

device WARNING - [2013-02-28 13:48:30,985] polling thread still running at 211s; not starting another
...
device WARNING - [2013-03-01 07:41:20,658] polling thread still running at 64581s; not starting another

note that only the device loop failed, not the requests loop.
Comment 37 Dustin J. Mitchell [:dustin] 2013-03-01 09:59:27 PST
Unfortunately, all of the cool introspection techniques require Python's symbols, which we don't build for the custom package.

The relevant threads are, I think,

Thread 14 (Thread 0x7fd5c7228700 (LWP 25028)):
#0  0x0000003e9dade513 in select () from /lib64/libc.so.6
#1  0x00007fd5cb7b51d9 in ?? () from /opt/mozpool/frontend/lib/python2.7/lib-dynload/time.so
#2  0x00007fd5d1a3dec6 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#3  0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#4  0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#5  0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#6  0x00007fd5d19bedc1 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#7  0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#8  0x00007fd5d19a256f in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#9  0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#10 0x00007fd5d1a37493 in PyEval_CallObjectWithKeywords () from /tools/python27/lib/libpython2.7.so.1.0
#11 0x00007fd5d1a75692 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#12 0x0000003e9de077f1 in start_thread () from /lib64/libpthread.so.0
#13 0x0000003e9dae592d in clone () from /lib64/libc.so.6

Process 25028 attached - interrupt to quit
select(0, NULL, NULL, NULL, {17, 489110}) = 0 (Timeout)
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
write(2, "device WARNING - [2013-03-01 08:"..., 104) = 104
select(0, NULL, NULL, NULL, {60, 240069}

  ** the Lifeguard StateDriver, repeatedly logging about the polling thread still running


Thread 13 (Thread 0x7fd5c5e26700 (LWP 25030)):
#0  0x0000003e9dade513 in select () from /lib64/libc.so.6
#1  0x00007fd5cb7b51d9 in ?? () from /opt/mozpool/frontend/lib/python2.7/lib-dynload/time.so
#2  0x00007fd5d1a3dec6 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#3  0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#4  0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#5  0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#6  0x00007fd5d19bedc1 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#7  0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#8  0x00007fd5d19a256f in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#9  0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#10 0x00007fd5d1a37493 in PyEval_CallObjectWithKeywords () from /tools/python27/lib/libpython2.7.so.1.0
#11 0x00007fd5d1a75692 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#12 0x0000003e9de077f1 in start_thread () from /lib64/libpthread.so.0
#13 0x0000003e9dae592d in clone () from /lib64/libc.so.6

Process 25030 attached - interrupt to quit
select(0, NULL, NULL, NULL, {0, 727435}) = 0 (Timeout)
clone(child_stack=0x7fd5c5424ff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fd5c54259d0, tls=0x7fd5c5425700, child_tidptr=0x7fd5c54259d0) = 12054
futex(0x29819a0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fd5bc0013c0, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x29819a0, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
select(0, NULL, NULL, NULL, {10, 0})    = 0 (Timeout)
clone(child_stack=0x7fd5c5424ff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fd5c54259d0, tls=0x7fd5c5425700, child_tidptr=0x7fd5c54259d0) = 12057
futex(0x29819a0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x29819a0, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x29819a0, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
select(0, NULL, NULL, NULL, {10, 0}^C <unfinished ...>

  ** the Mozpool StateDriver, repeatedly cloning off a new timeout handling thread



Thread 2 (Thread 0x7fd5961fc700 (LWP 2196)):
#0  0x0000003e9de0e8ec in recv () from /lib64/libpthread.so.0
#1  0x00007fd5c9a7fa81 in ?? () from /opt/mozpool/frontend/lib/python2.7/lib-dynload/_socket.so
#2  0x00007fd5c9a7fcae in ?? () from /opt/mozpool/frontend/lib/python2.7/lib-dynload/_socket.so
#3  0x00007fd5d1a3dec6 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#4  0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#5  0x00007fd5d1a3dae7 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#6  0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#7  0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#8  0x00007fd5d19bedc1 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#9  0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#10 0x00007fd5d19a256f in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#11 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#12 0x00007fd5d19fbec0 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#13 0x00007fd5d19f21b8 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#14 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#15 0x00007fd5d1a3d24d in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#16 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#17 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#18 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#19 0x00007fd5d19beec8 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#20 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#21 0x00007fd5d19a256f in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#22 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#23 0x00007fd5d19fbec0 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#24 0x00007fd5d19f21b8 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#25 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#26 0x00007fd5d1a3c5ac in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#27 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#28 0x00007fd5d19beec8 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#29 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#30 0x00007fd5d1a3c5ac in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#31 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#32 0x00007fd5d19beec8 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#33 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#34 0x00007fd5d1a3c5ac in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#35 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#36 0x00007fd5d1a3dae7 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#37 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#38 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#39 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#40 0x00007fd5d19bedc1 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#41 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#42 0x00007fd5d19a256f in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#43 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#44 0x00007fd5d19fbec0 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#45 0x00007fd5d19f21b8 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#46 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#47 0x00007fd5d1a3d24d in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#48 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#49 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#50 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#51 0x00007fd5d19beec8 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#52 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#53 0x00007fd5d19a256f in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#54 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#55 0x00007fd5d19fbec0 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#56 0x00007fd5d19f21b8 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#57 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#58 0x00007fd5d1a3c5ac in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#59 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#60 0x00007fd5d1a3dae7 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#61 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#62 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#63 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#64 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#65 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#66 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#67 0x00007fd5d19beec8 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#68 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#69 0x00007fd5d1a3c5ac in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#70 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#71 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#72 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#73 0x00007fd5d19bedc1 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#74 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#75 0x00007fd5d19a256f in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#76 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#77 0x00007fd5d1a37493 in PyEval_CallObjectWithKeywords () from /tools/python27/lib/libpython2.7.so.1.0
#78 0x00007fd5d1a75692 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#79 0x0000003e9de077f1 in start_thread () from /lib64/libpthread.so.0
#80 0x0000003e9dae592d in clone () from /lib64/libc.so.6

Process 2196 attached - interrupt to quit
recvfrom(5, ^C <unfinished ...>

[root@mobile-imaging-007 ~]# lsof -p 25019
COMMAND     PID   USER   FD   TYPE   DEVICE SIZE/OFF     NODE NAME
...
mozpool-s 25019 apache    5u  IPv4 53125226      0t0      TCP mobile-imaging-007.p7.releng.scl1.mozilla.com:48075->buildbot-rw-vip.db.scl3.mozilla.com:mysql (ESTABLISHED)

meaning that it's waiting for incoming data from the DB.  Per above, that TCP connection should see traffic every 120s, if not every 5s.

[root@mobile-imaging-007 ~]# tcpdump tcp port 48075
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes

shows no output over ~10m.  Is SO_KEEPALIVE set?  Linux doesn't provide a way to inspect another process's socket options from the outside, but we can call getsockopt from within the process using gdb.

(gdb) call (int*)malloc(4)
$1 = (int *) 0x2c87990
(gdb) call (int*)malloc(4)
$2 = (int *) 0x2bc56d0
(gdb) call *(int *)$2 = 4
$3 = 4
(gdb) call getsockopt(5, 1, 9, $1, $2)
$4 = 0
(gdb) p *(int *)$1
$5 = 0
(gdb) quit

$1 is optval and $2 is optlen (both pointers), so *(int *)$1 is the value of SO_KEEPALIVE.  The $5 = 0 means "disabled", where 1 would mean "enabled", as confirmed by creating a socket by hand in another process.  So at least for some connections, Python-MySQL is not setting SO_KEEPALIVE, contrary to what I thought in comment 26.  A grep of the Python-MySQL source for SO_KEEPALIVE confirms this.  Comment 26 was based on reading an strace of a new connection being made, but that must have been a mis-read.
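
For reference: in the gdb call, level 1 is SOL_SOCKET and optname 9 is SO_KEEPALIVE on Linux.  The same check can be done from inside a Python process, without gdb -- a minimal sketch, assuming we know the fd number:

import socket

def keepalive_enabled(fd):
    # socket.fromfd dups the fd, so closing the wrapper leaves the
    # original connection (fd 5 in the lsof output above) untouched.
    sock = socket.fromfd(fd, socket.AF_INET, socket.SOCK_STREAM)
    try:
        return bool(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
    finally:
        sock.close()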
Comment 38 Dustin J. Mitchell [:dustin] 2013-03-01 10:06:15 PST
Sorry, I forgot we're using PyMySQL and not Python-MySQL.  PyMySQL doesn't set SO_KEEPALIVE either, but at least it's in Python and thus monkey-patchable.
Comment 39 Dustin J. Mitchell [:dustin] 2013-03-01 10:09:47 PST
https://github.com/petehunt/PyMySQL/issues/139
Comment 40 Dustin J. Mitchell [:dustin] 2013-03-01 10:57:17 PST
Created attachment 720043 [details] [diff] [review]
bug817762-set-keepalive.patch

Unfortunately, this fix isn't possible with MySQLdb -- it uses the mysql client API, which hides the socket even from the C code, let alone from Python.  The patch at least logs loudly in that case.
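
The general shape of such a fix (a rough sketch, not necessarily the attached patch) is to set SO_KEEPALIVE on each new DB-API connection as the pool creates it, via a SQLAlchemy pool event.  This assumes PyMySQL exposes its raw socket on the connection object (the attribute name has varied across versions), and the URL below is a placeholder:

import logging
import socket
from sqlalchemy import create_engine, event

log = logging.getLogger('db.pool')

# Hypothetical DSN; the real one comes from mozpool's config.
engine = create_engine('mysql+pymysql://user:pass@dbhost/mozpool')

def set_keepalive(dbapi_conn, connection_record):
    # Probe for the raw socket rather than assuming an attribute name.
    sock = getattr(dbapi_conn, 'socket', None) or getattr(dbapi_conn, '_sock', None)
    if sock is None:
        # e.g. MySQLdb, where the C client library owns the socket
        log.warning("cannot set SO_KEEPALIVE: driver does not expose its socket")
        return
    log.debug("setting SO_KEEPALIVE on MySQL socket %d", sock.fileno())
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

event.listen(engine.pool, 'connect', set_keepalive)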
Comment 41 Chris AtLee [:catlee] 2013-03-01 11:05:47 PST
Comment on attachment 720043 [details] [diff] [review]
bug817762-set-keepalive.patch

Review of attachment 720043 [details] [diff] [review]:
-----------------------------------------------------------------

If we ever need to use the mysql driver, we could iterate over all open fd's and try to set the keepalive socket option on them.
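
For the record, a rough sketch of that fallback -- it blindly touches every fd in the process that happens to be a TCP socket, which is why it would only ever be a last resort:

import errno
import os
import socket

def keepalive_all_sockets():
    # Walk this process's open fds (Linux-specific) and turn on SO_KEEPALIVE
    # for any that turn out to be sockets; everything else is skipped.
    for fd_name in os.listdir('/proc/self/fd'):
        try:
            sock = socket.fromfd(int(fd_name), socket.AF_INET, socket.SOCK_STREAM)
        except (socket.error, OSError):
            continue  # fd went away or could not be dup'd
        try:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        except socket.error as e:
            if e.errno != errno.ENOTSOCK:
                raise
        finally:
            sock.close()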
Comment 42 Dustin J. Mitchell [:dustin] 2013-03-01 11:18:58 PST
Richard helpfully points out that the MySQL client libraries which Python-MySQL talks to *do* set this option.  So there's no need to work around this with Python-MySQL.  I've already landed attachment 720043 [details] [diff] [review], but I'll make some comment-only updates to indicate that this is a complete solution.  I'll update the PyMySQL issue too.
Comment 43 Dustin J. Mitchell [:dustin] 2013-03-07 10:30:48 PST
I just deployed mozpool-2.0.3 in bug 848738, which contains only this fix.
Comment 44 Dustin J. Mitchell [:dustin] 2013-03-11 15:26:17 PDT
request.13980 INFO - [2013-03-11 15:01:02,522] entering state device_not_found
device WARNING - [2013-03-11 15:01:03,141] polling thread still running at 295639s; not starting another
request WARNING - [2013-03-11 15:01:12,357] polling thread still running at 10s; not starting another
request WARNING - [2013-03-11 15:01:13,359] polling thread still running at 11s; not starting another
request WARNING - [2013-03-11 15:01:14,460] polling thread still running at 12s; not starting another
request WARNING - [2013-03-11 15:01:15,672] polling thread still running at 13s; not starting another
request WARNING - [2013-03-11 15:01:17,004] polling thread still running at 14s; not starting another
request WARNING - [2013-03-11 15:01:18,470] polling thread still running at 16s; not starting another
request WARNING - [2013-03-11 15:01:20,083] polling thread still running at 17s; not starting another
request WARNING - [2013-03-11 15:01:21,856] polling thread still running at 19s; not starting another
request WARNING - [2013-03-11 15:01:23,807] polling thread still running at 21s; not starting another
request WARNING - [2013-03-11 15:01:25,953] polling thread still running at 23s; not starting another
request WARNING - [2013-03-11 15:01:28,314] polling thread still running at 25s; not starting another
request WARNING - [2013-03-11 15:01:30,910] polling thread still running at 28s; not starting another
request WARNING - [2013-03-11 15:01:33,767] polling thread still running at 31s; not starting another
request WARNING - [2013-03-11 15:01:36,908] polling thread still running at 34s; not starting another
request WARNING - [2013-03-11 15:01:40,364] polling thread still running at 38s; not starting another
request WARNING - [2013-03-11 15:01:44,166] polling thread still running at 41s; not starting another
request WARNING - [2013-03-11 15:01:48,348] polling thread still running at 46s; not starting another
request WARNING - [2013-03-11 15:01:52,947] polling thread still running at 50s; not starting another
request WARNING - [2013-03-11 15:01:58,007] polling thread still running at 55s; not starting another
request.13980 INFO - [2013-03-11 15:02:02,619] entering state closed
device WARNING - [2013-03-11 15:02:03,441] polling thread still running at 295699s; not starting another
device WARNING - [2013-03-11 15:03:03,742] polling thread still running at 295759s; not starting another
device WARNING - [2013-03-11 15:04:04,043] polling thread still running at 295820s; not starting another
db.sql DEBUG - [2013-03-11 15:04:13,723] setting SO_KEEPALIVE on MySQL socket 3
db.sql DEBUG - [2013-03-11 15:04:33,744] setting SO_KEEPALIVE on MySQL socket 4
device WARNING - [2013-03-11 15:05:04,307] polling thread still running at 295880s; not starting another
db.sql DEBUG - [2013-03-11 15:05:33,817] setting SO_KEEPALIVE on MySQL socket 6
device WARNING - [2013-03-11 15:06:04,608] polling thread still running at 295940s; not starting another
device WARNING - [2013-03-11 15:07:04,908] polling thread still running at 296001s; not starting another
db.sql DEBUG - [2013-03-11 15:07:13,912] setting SO_KEEPALIVE on MySQL socket 9
device WARNING - [2013-03-11 15:08:05,209] polling thread still running at 296061s; not starting another
db.sql DEBUG - [2013-03-11 15:08:13,975] setting SO_KEEPALIVE on MySQL socket 8
device WARNING - [2013-03-11 15:09:05,510] polling thread still running at 296121s; not starting another
device WARNING - [2013-03-11 15:10:05,810] polling thread still running at 296182s; not starting another
device WARNING - [2013-03-11 15:11:06,111] polling thread still running at 296242s; not starting another
device WARNING - [2013-03-11 15:12:06,411] polling thread still running at 296302s; not starting another
device WARNING - [2013-03-11 15:13:06,712] polling thread still running at 296362s; not starting another
device WARNING - [2013-03-11 15:14:07,013] polling thread still running at 296423s; not starting another
db.sql DEBUG - [2013-03-11 15:14:14,354] setting SO_KEEPALIVE on MySQL socket 3
db.sql DEBUG - [2013-03-11 15:14:34,375] setting SO_KEEPALIVE on MySQL socket 4
device WARNING - [2013-03-11 15:15:07,313] polling thread still running at 296483s; not starting another
db.sql DEBUG - [2013-03-11 15:15:34,448] setting SO_KEEPALIVE on MySQL socket 6
device WARNING - [2013-03-11 15:16:07,614] polling thread still running at 296543s; not starting another
device WARNING - [2013-03-11 15:17:07,914] polling thread still running at 296604s; not starting another
db.sql DEBUG - [2013-03-11 15:17:14,543] setting SO_KEEPALIVE on MySQL socket 9
device WARNING - [2013-03-11 15:18:08,215] polling thread still running at 296664s; not starting another
db.sql DEBUG - [2013-03-11 15:18:14,607] setting SO_KEEPALIVE on MySQL socket 8
device WARNING - [2013-03-11 15:19:08,516] polling thread still running at 296724s; not starting another

and from the sec mozpool:

device WARNING - [2013-03-11 15:23:34,240] polling thread still running at 293767s; not starting another
request WARNING - [2013-03-11 15:23:36,863] polling thread still running at 10s; not starting another
request WARNING - [2013-03-11 15:23:37,864] polling thread still running at 11s; not starting another
request WARNING - [2013-03-11 15:23:38,965] polling thread still running at 12s; not starting another
request WARNING - [2013-03-11 15:23:40,177] polling thread still running at 13s; not starting another
request WARNING - [2013-03-11 15:23:41,509] polling thread still running at 14s; not starting another
request WARNING - [2013-03-11 15:23:42,975] polling thread still running at 16s; not starting another
request WARNING - [2013-03-11 15:23:44,588] polling thread still running at 17s; not starting another
request WARNING - [2013-03-11 15:23:46,361] polling thread still running at 19s; not starting another
request WARNING - [2013-03-11 15:23:48,312] polling thread still running at 21s; not starting another
request WARNING - [2013-03-11 15:23:50,458] polling thread still running at 23s; not starting another
request WARNING - [2013-03-11 15:23:52,819] polling thread still running at 25s; not starting another
request WARNING - [2013-03-11 15:23:55,415] polling thread still running at 28s; not starting another
request ERROR - [2013-03-11 15:23:56,863] failure in _tick
Traceback (most recent call last):
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/statedriver.py", line 78, in _tick
    self.poll_for_timeouts()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/statedriver.py", line 101, in poll_for_timeouts
    for machine_name in self._get_timed_out_machine_names():
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/mozpool/requestmachine.py", line 71, in _get_timed_out_machine_names
    return data.get_timed_out_requests(self.imaging_server_id)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/data.py", line 491, in get_timed_out_requests
    return get_timed_out(model.requests, model.requests.c.id, imaging_server_id)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/data.py", line 276, in get_timed_out
    res = sql.get_conn().execute(select(
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/sql.py", line 49, in get_conn
    return get_engine().connect()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2471, in connect
    return self._connection_cls(self, **kwargs)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 878, in __init__
    self.__connection = connection or engine.raw_connection()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2557, in raw_connection
    return self.pool.unique_connection()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/pool.py", line 184, in unique_connection
    return _ConnectionFairy(self).checkout()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/pool.py", line 401, in __init__
    rec = self._connection_record = pool._do_get()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/pool.py", line 738, in _do_get
    (self.size(), self.overflow(), self._timeout))
TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30
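
For reference, those numbers are SQLAlchemy's QueuePool limits.  A minimal sketch of where they come from (the values below are the library defaults echoed in the error, not necessarily what mozpool configures; the URL is a placeholder):

from sqlalchemy import create_engine

# pool_size=5, max_overflow=10 and pool_timeout=30 are exactly the numbers in
# the TimeoutError above: once 5+10 connections are checked out and none is
# returned within 30s, connect() raises instead of blocking forever.
engine = create_engine(
    'mysql+pymysql://user:pass@dbhost/mozpool',
    pool_size=5,
    max_overflow=10,
    pool_timeout=30,
)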
Comment 45 Dustin J. Mitchell [:dustin] 2013-03-11 15:33:55 PDT
The sec server is running 2.0.2 (without the fix), but mobile-imaging-001 is running the fixed version:

/var/log/mozpool.log.1:root INFO - [2013-03-07 10:44:06,038] Mozpool-2.0.3 server starting

so something else is amiss.
Comment 46 Dustin J. Mitchell [:dustin] 2013-03-11 15:40:12 PDT
Same on 003 and 004.  The heartbeat monitor is missing these because only one of the two timeout loops (for mozpool and lifeguard) is failing.
Comment 47 Dustin J. Mitchell [:dustin] 2013-03-12 14:13:00 PDT
This isn't actually blocking 802317.

I need some help on this.  As you can see from the above, I'm beyond grasping at straws at this point.
Comment 48 Dustin J. Mitchell [:dustin] 2013-03-12 14:49:16 PDT
I verified that the socket fd's do have SO_KEEPALIVE set via the gdb trick above.

I shut off access to the db using iptables.  When the connection was restored, there were some "Connection reset by peer" errors, followed by normal operation - no wedging.
Comment 49 Chris AtLee [:catlee] 2013-03-13 20:14:02 PDT
I think mozpool on mobile-imaging-001.p1.releng.scl1.mozilla.com got wedged as part of the firewall maintenance tonight.

20:00 < nagios-releng> Wed 20:00:37 PDT [433] mobile-imaging-001.p1.releng.scl1.mozilla.com:http is CRITICAL: CRITICAL - Socket timeout after 10 seconds (http://m.allizom.org/http)
20:01 < nagios-releng> Wed 20:01:07 PDT [434] panda-0102.p1.releng.scl1.mozilla.com is DOWN :(Host Check Timed Out)
20:01 < nagios-releng> Wed 20:01:07 PDT [435] panda-0094.p1.releng.scl1.mozilla.com is DOWN :(Host Check Timed Out)
20:01 < nagios-releng> Wed 20:01:28 PDT [436] panda-0099.p1.releng.scl1.mozilla.com is DOWN :(Host Check Timed Out)
20:01 < nagios-releng> Wed 20:01:28 PDT [437] panda-0122.p1.releng.scl1.mozilla.com is DOWN :(Host Check Timed Out)

I logged in as root@mobile-imaging-001.p1.releng.scl1.mozilla.com and killed
apache   30477 30475  0 Mar11 ?        00:16:45 /opt/mozpool/frontend/bin/python2.7 /opt/mozpool/frontend/bin/mozpool-server 8010

nagios started being happy again, and there was much rejoicing.
Comment 50 Dustin J. Mitchell [:dustin] 2013-03-27 10:23:58 PDT
Well, there was less sadness anyway.

I learned about sys._current_frames() at PyCon.  It might be worth patching mozpool to dump that periodically if this is indeed still the problem.  The logs from the 13th are gone, unfortunately.
Comment 51 Dustin J. Mitchell [:dustin] 2013-04-02 06:03:54 PDT
Created attachment 732293 [details] [diff] [review]
bug817762-curframes.patch

Surely this debugging output will lead me to a solution!

This only happens rarely now, so it could still take a while...
Comment 52 Jake Watkins [:dividehex] (pto-till-12/05) 2013-04-02 08:54:47 PDT
Comment on attachment 732293 [details] [diff] [review]
bug817762-curframes.patch

Will this be backed out once the cause is discovered and fixed?

r+
Comment 53 Dustin J. Mitchell [:dustin] 2013-04-02 09:35:27 PDT
Yes, along with a pile of other patches made on this bug's behalf.  The patch is landed, although it will wait for a new deployment.
Comment 54 Dustin J. Mitchell [:dustin] 2013-05-03 06:10:12 PDT
We haven't seen this recur in a month now.  I'll give it another month, and then remove some of the debugging code.
Comment 55 Dustin J. Mitchell [:dustin] 2013-05-30 13:48:07 PDT
36h ago, when there was a Zayo failure and ~90s of BGP reconvergence:

device WARNING - [2013-05-29 01:56:23,320] polling thread still running at 98s; not starting another
request WARNING - [2013-05-29 01:56:30,549] polling thread still running at 98s; not starting another

but they all recovered.  Another week, and we'll start backing this out.
Comment 56 Armen Zambrano [:armenzg] (EDT/UTC-4) 2013-06-03 10:17:52 PDT
dustin, is this related? There are 30 minutes of nothing.

2013-06-03T07:18:55 statemachine entering state operation_complete
2013-06-03T07:18:55 statemachine sending imaging result 'complete' to Mozpool
2013-06-03T07:18:55 statemachine entering state ready
2013-06-03T07:49:16 sut connecting to SUT agent

"If you see an "Imaging complete" line, and then big fat nothing -- no pinging or anything -- then that's a repeat of bug 817762, which makes pandas sad (literally). Re-open the bug and find Dustin. "
Comment 57 Dustin J. Mitchell [:dustin] 2013-06-03 10:27:30 PDT
Nope - that looks normal, actually.  It imaged the board, and then didn't do anything to it while it was reserved, then did a SUT check 40m later, after the reservation was closed.
Comment 58 Dustin J. Mitchell [:dustin] 2013-06-06 09:45:15 PDT
Created attachment 759209 [details] [diff] [review]
bug817762-rollback.patch

Bug 817762: Roll back
    
    Revert "Bug 817762: write out current frames whenever a timeout thread
    is wedged; r=dividehex" but leave the monitoring of the polling thread's
    duration, as this gives useful output during network failures, etc.
    Soften the logging a little bit by not logging until the hang has lasted
    for 80s.
Comment 59 Jake Watkins [:dividehex] (pto-till-12/05) 2013-06-06 15:13:07 PDT
Comment on attachment 759209 [details] [diff] [review]
bug817762-rollback.patch

looks good
Comment 60 Dustin J. Mitchell [:dustin] 2013-06-11 12:18:25 PDT
This just recurred on -010.

> 127.0.0.1:52550 - - [11/Jun/2013 11:48:56] "HTTP/1.1 GET /api/request/208647/status/" - 200 OK
> sut.cli ERROR - [2013-06-11 11:49:06,060] Exception initiating DeviceManager!: Remote Device Error: unable to connect to panda-0071.p10.releng.scl1.mozilla.com after 1 attempts
> 127.0.0.1:52551 - - [11/Jun/2013 11:49:08] "HTTP/1.1 GET /api/request/208644/status/" - 200 OK
> 127.0.0.1:52552 - - [11/Jun/2013 11:49:11] "HTTP/1.1 GET /api/request/208650/status/" - 200 OK
> device INFO - [2013-06-11 11:49:14,965] handling timeout on panda-0072
> request INFO - [2013-06-11 11:49:16,060] handling timeout on 208650
> 127.0.0.1:52553 - - [11/Jun/2013 11:49:26] "HTTP/1.1 GET /api/request/208643/status/" - 200 OK
> 127.0.0.1:52555 - - [11/Jun/2013 11:49:34] "HTTP/1.1 GET /api/request/208646/status/" - 200 OK
> db.pool DEBUG - [2013-06-11 11:49:55,610] setting SO_KEEPALIVE on MySQL socket 13
> db.pool DEBUG - [2013-06-11 11:49:56,549] setting SO_KEEPALIVE on MySQL socket 15
> 127.0.0.1:52560 - - [11/Jun/2013 11:49:56] "HTTP/1.1 GET /api/request/208647/status/" - 200 OK
> db.pool DEBUG - [2013-06-11 11:50:08,733] setting SO_KEEPALIVE on MySQL socket 18
> 127.0.0.1:52563 - - [11/Jun/2013 11:50:08] "HTTP/1.1 GET /api/request/208644/status/" - 200 OK
> 127.0.0.1:52565 - - [11/Jun/2013 11:50:11] "HTTP/1.1 GET /api/request/208650/status/" - 200 OK
> db.pool DEBUG - [2013-06-11 11:50:18,054] setting SO_KEEPALIVE on MySQL socket 20
> db.pool DEBUG - [2013-06-11 11:50:27,101] setting SO_KEEPALIVE on MySQL socket 22
> 127.0.0.1:52569 - - [11/Jun/2013 11:50:27] "HTTP/1.1 GET /api/request/208643/status/" - 200 OK
> db.pool DEBUG - [2013-06-11 11:50:34,490] setting SO_KEEPALIVE on MySQL socket 24
> 127.0.0.1:52572 - - [11/Jun/2013 11:50:34] "HTTP/1.1 GET /api/request/208646/status/" - 200 OK
> db.pool DEBUG - [2013-06-11 11:50:34,744] setting SO_KEEPALIVE on MySQL socket 26
> device WARNING - [2013-06-11 11:50:36,441] polling thread still running at 81s; not starting another
> 127.0.0.1:52576 - - [11/Jun/2013 11:50:37] "HTTP/1.1 GET /api/relay/panda-relay-004/test/" - 200 OK
> 127.0.0.1:52577 - - [11/Jun/2013 11:50:38] "HTTP/1.1 GET /api/relay/panda-relay-006/test/" - 200 OK
> 127.0.0.1:52578 - - [11/Jun/2013 11:50:39] "HTTP/1.1 GET /api/relay/panda-relay-002/test/" - 200 OK
> 127.0.0.1:52580 - - [11/Jun/2013 11:50:39] "HTTP/1.1 GET /api/bmm/pxe_config/list/" - 200 OK
> device WARNING - [2013-06-11 11:50:44,589] polling thread still running at 89s; not starting another
> 127.0.0.1:52583 - - [11/Jun/2013 11:50:48] "HTTP/1.1 GET /api/relay/panda-relay-079/test/" - 200 OK
> device WARNING - [2013-06-11 11:50:53,553] polling thread still running at 98s; not starting another
> device WARNING - [2013-06-11 11:51:03,413] polling thread still running at 108s; not starting another
> device WARNING - [2013-06-11 11:51:14,259] polling thread still running at 119s; not starting another
> device WARNING - [2013-06-11 11:51:26,189] polling thread still running at 131s; not starting another
> device WARNING - [2013-06-11 11:51:39,313] polling thread still running at 144s; not starting another
> device WARNING - [2013-06-11 11:51:53,749] polling thread still running at 158s; not starting another
> device WARNING - [2013-06-11 11:52:09,628] polling thread still running at 174s; not starting another
> device WARNING - [2013-06-11 11:52:27,084] polling thread still running at 192s; not starting another
> device WARNING - [2013-06-11 11:52:46,298] polling thread still running at 211s; not starting another
> device WARNING - [2013-06-11 11:53:07,433] polling thread still running at 232s; not starting another
> device WARNING - [2013-06-11 11:53:30,662] polling thread still running at 255s; not starting another
> device WARNING - [2013-06-11 11:53:56,236] polling thread still running at 281s; not starting another
> device WARNING - [2013-06-11 11:54:24,367] polling thread still running at 309s; not starting another
> device WARNING - [2013-06-11 11:54:55,311] polling thread still running at 340s; not starting another
> device WARNING - [2013-06-11 11:55:29,349] polling thread still running at 374s; not starting another
> device WARNING - [2013-06-11 11:56:06,781] polling thread still running at 411s; not starting another
> device WARNING - [2013-06-11 11:56:47,968] polling thread still running at 453s; not starting another
> device WARNING - [2013-06-11 11:57:33,273] polling thread still running at 498s; not starting another
> device WARNING - [2013-06-11 11:58:23,108] polling thread still running at 548s; not starting another
> device WARNING - [2013-06-11 11:59:17,927] polling thread still running at 602s; not starting another
> device WARNING - [2013-06-11 12:00:18,228] polling thread still running at 663s; not starting another
> root INFO - [2013-06-11 12:01:01,564] Mozpool-4.1.1 server starting
> db.pool DEBUG - [2013-06-11 12:01:01,658] setting SO_KEEPALIVE on MySQL socket 3
> db.pool DEBUG - [2013-06-11 12:01:01,690] setting SO_KEEPALIVE on MySQL socket 4
> request INFO - [2013-06-11 12:01:01,696] handling timeout on 208643
> device INFO - [2013-06-11 12:01:01,704] handling timeout on panda-0061
> db.pool DEBUG - [2013-06-11 12:01:01,744] setting SO_KEEPALIVE on MySQL socket 6
> db.pool DEBUG - [2013-06-11 12:01:09,643] setting SO_KEEPALIVE on MySQL socket 10
> db.pool DEBUG - [2013-06-11 12:01:11,695] setting SO_KEEPALIVE on MySQL socket 12
> db.pool DEBUG - [2013-06-11 12:01:17,705] setting SO_KEEPALIVE on MySQL socket 14
> db.pool DEBUG - [2013-06-11 12:01:18,431] setting SO_KEEPALIVE on MySQL socket 17
> db.pool DEBUG - [2013-06-11 12:01:28,213] setting SO_KEEPALIVE on MySQL socket 19
> db.pool DEBUG - [2013-06-11 12:01:35,736] setting SO_KEEPALIVE on MySQL socket 22
> db.pool DEBUG - [2013-06-11 12:01:47,746] setting SO_KEEPALIVE on MySQL socket 24
> device WARNING - [2013-06-11 12:02:23,155] polling thread still running at 81s; not starting another
> device WARNING - [2013-06-11 12:02:31,304] polling thread still running at 89s; not starting another
> device WARNING - [2013-06-11 12:02:40,268] polling thread still running at 98s; not starting another
> device WARNING - [2013-06-11 12:02:50,128] polling thread still running at 108s; not starting another
> device WARNING - [2013-06-11 12:03:00,974] polling thread still running at 119s; not starting another
> device WARNING - [2013-06-11 12:03:12,904] polling thread still running at 131s; not starting another
> device WARNING - [2013-06-11 12:03:26,027] polling thread still running at 144s; not starting another
> device WARNING - [2013-06-11 12:03:40,463] polling thread still running at 158s; not starting another
> device WARNING - [2013-06-11 12:03:56,342] polling thread still running at 174s; not starting another
> device WARNING - [2013-06-11 12:04:13,810] polling thread still running at 192s; not starting another
> device WARNING - [2013-06-11 12:04:33,024] polling thread still running at 211s; not starting another
> device WARNING - [2013-06-11 12:04:54,157] polling thread still running at 232s; not starting another
> device WARNING - [2013-06-11 12:05:17,406] polling thread still running at 255s; not starting another
> device WARNING - [2013-06-11 12:05:42,980] polling thread still running at 281s; not starting another
> device WARNING - [2013-06-11 12:06:11,111] polling thread still running at 309s; not starting another
> device WARNING - [2013-06-11 12:06:42,055] polling thread still running at 340s; not starting another
> request.208643 WARNING - [2013-06-11 12:01:01,756] HERE
> request WARNING - [2013-06-11 12:02:23,174] polling thread still running at 81s; not starting another
> db.pool DEBUG - [2013-06-11 12:07:08,132] setting SO_KEEPALIVE on MySQL socket 203
> device INFO - [2013-06-11 12:07:08,133] handling timeout on panda-0062
> request.208645 INFO - [2013-06-11 12:01:04,179] Request closed.

Note that this continued to time out for 5m *after* I restarted it ("Mozpool-4.1.1 server starting").  Note, too, the times on the log entries -- they're *way* out of order.
Comment 61 Amy Rich [:arr] [:arich] 2013-06-11 12:23:02 PDT
There were a bunch of dead items reported on AWS buildbot masters right after this, so maybe there was a network blip?
Comment 62 Dustin J. Mitchell [:dustin] 2013-06-11 13:13:55 PDT
And the other weird thing here is that it fixed itself this time.
Comment 63 Dustin J. Mitchell [:dustin] 2013-06-11 14:14:45 PDT
The mystery of the mis-ordered log entries is solved.  The message time is set early in the call to _log, and then it runs through the relevant handlers.  Well, for 'request.###', the first handler is writing to the logs table in the DB.  It would seem that all of the DB connections were a bit tied up at that time.
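
To illustrate (hypothetical handler, not mozpool's actual code): the record's timestamp is captured when the logging call is made, before any handler runs, so a slow DB handler delays the line's arrival in the file without changing its timestamp.

import logging
import time

class DBLogHandler(logging.Handler):
    """Stand-in for a handler that writes each record to a `logs` table."""
    def emit(self, record):
        time.sleep(60)  # pretend every DB connection is tied up for a minute

logger = logging.getLogger('request.208643')
logger.addHandler(DBLogHandler())           # runs first, blocks on the "DB"
logger.addHandler(logging.StreamHandler())  # runs second, writes the line

# record.created is set here, before any handler runs, so by the time the
# StreamHandler emits the line its timestamp is already a minute old and
# appears out of order relative to lines logged by other threads meanwhile.
logger.warning("HERE")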

Looking on another imaging server:
device WARNING - [2013-06-11 12:04:15,207] polling thread still running at 904s; not starting another

So, I think that the DB just got overloaded at this time.  Mozpool weathered it as best it could.
Comment 64 Dustin J. Mitchell [:dustin] 2013-06-11 16:17:52 PDT
DB load graphs show a spike at almost exactly 9am pacific, but nothing at noon pacific.  So, this could be network or it could be DB.
Comment 65 Jake Watkins [:dividehex] (pto-till-12/05) 2013-07-08 17:38:53 PDT
*** Bug 891173 has been marked as a duplicate of this bug. ***
Comment 66 Jake Watkins [:dividehex] (pto-till-12/05) 2013-07-08 17:43:59 PDT
mozpool-server on mobile-imaging-009.p9 hung today @ ~1646 PDT.  Restarting mozpool-server recovered it.  This also caused the nagios checks for the pandas and relays in p9 to alert since the checks are dependent on the mozpool http API.
Comment 67 Jake Watkins [:dividehex] (pto-till-12/05) 2013-07-08 18:04:02 PDT
First Traceback started at 1530 PDT.  I would be surprised if this wasn't related to the major SCL3 network outage taking place today.

===============================

device ERROR - [2013-07-08 15:30:11,388] failure in _tick
Traceback (most recent call last):
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/statedriver.py", line 81, in _tick
    self.poll_for_timeouts()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/statedriver.py", line 115, in poll_for_timeouts
    for machine_name in self._get_timed_out_machine_names():
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/lifeguard/devicemachine.py", line 73, in _get_timed_out_machine_names
    return self.db.devices.list_timed_out(self.imaging_server_id)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/base.py", line 100, in list_timed_out
    & (tbl.c.imaging_server_id == imaging_server_id)))
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/pool.py", line 65, in execute
    conn = self.engine.connect()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2471, in connect
    return self._connection_cls(self, **kwargs)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 878, in __init__
    self.__connection = connection or engine.raw_connection()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2557, in raw_connection
    return self.pool.unique_connection()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/pool.py", line 184, in unique_connection
    return _ConnectionFairy(self).checkout()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/pool.py", line 402, in __init__
    conn = self.connection = self._connection_record.get_connection()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/pool.py", line 324, in get_connection
    self.connection = self.__connect()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/pool.py", line 344, in __connect
    connection = self.__pool._creator()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/strategies.py", line 80, in connect
    return dialect.connect(*cargs, **cparams)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 281, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/__init__.py", line 93, in Connect
    return Connection(*args, **kwargs)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/connections.py", line 510, in __init__
    self._connect()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/connections.py", line 677, in _connect
    self._request_authentication()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/connections.py", line 719, in _request_authentication
    self._send_authentication()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/connections.py", line 767, in _send_authentication
    auth_packet = MysqlPacket(self)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/connections.py", line 200, in __init__
    self.__recv_packet()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/connections.py", line 206, in __recv_packet
    raise OperationalError(2013, "Lost connection to MySQL server during query")
OperationalError: (OperationalError) (2013, 'Lost connection to MySQL server during query') None None
Comment 68 Dustin J. Mitchell [:dustin] 2013-07-08 18:46:33 PDT
Yes, that tb is normal during a MySQL failure.  I'll be curious to know how long the hang was, and whether it was greater than the socket timeout.
Comment 69 Jake Watkins [:dividehex] (pto-till-12/05) 2013-07-08 19:06:56 PDT
For reference to the scl3 outage:
https://bugzilla.mozilla.org/show_bug.cgi?id=891128
Comment 70 Dustin J. Mitchell [:dustin] 2013-07-09 08:59:22 PDT
On mobile-imaging-001, it looks like MySQL went away at about 2013-07-08 16:11 (pacific).  About 180s after that, the "Lost connection.." errors start.  There is a recovery a few minutes later, followed by another spate of polling failures.  The last "polling thread still running" is at 16:22:52.  There were some additional MySQL failures at 16:29:54, but those did not affect the timeout loop.

On mobile-imaging-009, things started at 16:10:26 with some lost connections.  Mozpool treated those as failed operations, handled within the normal flow of the device state machine.  The timeout loop failed at the same times as on mobile-imaging-001.  It saw the same MySQL failures at around 16:29:54.

However, its device polling thread continued to hang:

> device WARNING - [2013-07-08 17:25:11,049] polling thread still running at 4462s; not starting another

There were lots of errors about

> Timeout: HTTPConnectionPool(host='mobile-imaging-009.p9.releng.scl1.mozilla.com', port=80): Request timed out. (timeout=30)

and eventually

> socket.error: [Errno 24] Too many open files

followed by a bunch of failures to connect to MySQL that are probably hiding the same underlying errno.  Then the host was restarted.

Without seeing what happened, I can't say much more than that.  This could partly be a resource overload - too many async HTTP connections to self waiting for a long `requests`  timeout, while lifeguard is happily initiating more connections.  But that doesn't explain the state of the device loop.
Comment 71 Dustin J. Mitchell [:dustin] 2013-07-09 09:24:49 PDT
This also affected mobile-imaging-008.

I hate threads.

I'm tired of this bug.

Let's change the "still running" warning to kill the entire process when it reaches, say, 600s.  Supervisord will restart the process.
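
Roughly, the idea (a sketch, not the eventual patch; the 600s threshold is the number suggested above):

import logging
import os
import signal

log = logging.getLogger('statedriver')

MAX_POLL_HANG = 600  # seconds

def check_polling_thread(running_for):
    # Called from the timer that currently emits the "polling thread still
    # running" warning; running_for is how long the poll has been stuck.
    if running_for < MAX_POLL_HANG:
        log.warning("polling thread still running at %ds; not starting another",
                    running_for)
        return
    log.error("polling thread wedged for %ds; killing process so supervisord "
              "can restart it", running_for)
    # SIGKILL rather than sys.exit: a wedged thread holding a lock could
    # otherwise block a clean shutdown indefinitely.
    os.kill(os.getpid(), signal.SIGKILL)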
Comment 72 Jake Watkins [:dividehex] (pto-till-12/05) 2013-07-09 10:19:51 PDT
(In reply to Dustin J. Mitchell [:dustin] from comment #71)
> This also affected mobile-imaging-008.
> 
> I hate threads.
> 
> I'm tired of this bug.
> 
> Let's change the "still running" warning to kill the entire process when it
> reaches, say, 600s.  Supervisord will restart the process.

I think that's a good idea as long as it gets reported somewhere (nagios, email, etc) with 'loud noises and bright flashing lights'
Comment 73 Dustin J. Mitchell [:dustin] 2013-07-11 08:53:17 PDT
This failed on -007 as well.  None of the threads were blocked reading or writing from a socket -- they were all in sem_wait, which indicates some kind of sync primitive.
Comment 74 Dustin J. Mitchell [:dustin] 2013-07-11 11:58:59 PDT
Created attachment 774158 [details] [diff] [review]
bug817762-suicide.patch

DIAF
Comment 75 Dustin J. Mitchell [:dustin] 2013-07-12 05:42:24 PDT
I didn't add any notification to this, aside from the logging.  I forgot, but on reflection I don't see why we need one -- what action can we take when this occurs, other than to needlessly re-open this bug again?
Comment 76 Dustin J. Mitchell [:dustin] 2013-07-12 07:50:47 PDT
Comment on attachment 774158 [details] [diff] [review]
bug817762-suicide.patch

I'll revise to send an email.
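
A rough idea of that revision, assuming plain smtplib; the relay host and recipient are placeholders:

import smtplib
from email.mime.text import MIMEText

def mail_suicide_notice(hostname, running_for):
    # Hypothetical notification sent just before the watchdog kills the process.
    msg = MIMEText("mozpool-server on %s killed itself after its polling "
                   "thread hung for %ds" % (hostname, running_for))
    msg['Subject'] = 'mozpool watchdog restart on %s' % hostname
    msg['From'] = 'mozpool@%s' % hostname
    msg['To'] = 'relops@example.com'
    s = smtplib.SMTP('localhost')
    try:
        s.sendmail(msg['From'], [msg['To']], msg.as_string())
    finally:
        s.quit()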
Comment 77 Dustin J. Mitchell [:dustin] 2013-07-12 08:46:08 PDT
Created attachment 774687 [details] [diff] [review]
bug817762-suicide-r2.patch
Comment 78 Dustin J. Mitchell [:dustin] 2013-07-12 08:59:32 PDT
Created attachment 774696 [details] [diff] [review]
bug817762-puppet.patch
Comment 79 Jake Watkins [:dividehex] (pto-till-12/05) 2013-07-12 13:55:15 PDT
Comment on attachment 774687 [details] [diff] [review]
bug817762-suicide-r2.patch

Review of attachment 774687 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm
Comment 80 Jake Watkins [:dividehex] (pto-till-12/05) 2013-07-14 01:36:05 PDT
This closed the trees tonight :-{  SHIPIT PLS

https://bugzilla.mozilla.org/show_bug.cgi?id=893511
Comment 81 Dustin J. Mitchell [:dustin] 2013-07-15 06:35:12 PDT
The preceding two patches will be in Mozpool-4.1.2 (bug 893757)
Comment 82 Dustin J. Mitchell [:dustin] 2013-07-15 08:44:44 PDT
I pushed a fix to the puppet patch in https://hg.mozilla.org/build/puppet/rev/32837fc8e15d
Comment 83 Dustin J. Mitchell [:dustin] 2013-07-15 09:07:00 PDT
It occurs to me that the problem here is probably two-part now:
 1. DB error causes an exception, which skips or breaks lock cleanup
 2. Threads pile up on the lock

So a possible way to duplicate this is to make all DB methods randomly raise exceptions, and see if I can get a failure.
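
A quick sketch of that fault-injection idea (hypothetical wrapper, not mozpool code):

import functools
import random

class InjectedDBError(Exception):
    """Stands in for the OperationalError a real DB outage would raise."""

def flaky(method, failure_rate=0.1):
    # Wrap a DB-layer method so it randomly fails, to exercise the error
    # paths and verify that locks are released even when a query blows up.
    @functools.wraps(method)
    def wrapper(*args, **kwargs):
        if random.random() < failure_rate:
            raise InjectedDBError("injected failure in %s" % method.__name__)
        return method(*args, **kwargs)
    return wrapper

# e.g. (hypothetical): db.devices.list_timed_out = flaky(db.devices.list_timed_out)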
Comment 84 Dustin J. Mitchell [:dustin] 2013-07-29 07:50:45 PDT
So the root cause here is loooong DB queries, blocked while the entire DB server is locked for many minutes.  Some of the failures we saw were only that and nothing more.

There seems to be a deadlock that occurs, too, in this situation.  That may occur when something that is supposed to take only N seconds doesn't even *start* for N seconds.

At any rate, the suicide code seems to be holding this together, and getting production Mozpool off of the buildbot DB cluster (which has the multi-minute locks) will help, too (bug 897109).  There's nothing more to do here.
