Closed Bug 817762 Opened 12 years ago Closed 11 years ago

Mozpools states seem stuck

Categories: Infrastructure & Operations :: RelOps: General (task)
Platform: x86_64 Windows 7
Priority: Not set    Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: dividehex, Assigned: dustin
Attachments: 8 files, 1 obsolete file

I ran into an issue while installing the panda-android.2 image on chassis 2.  Everything seemed to be going well until the state machine entered android_rebooting.  Once it entered that state, it never changed.


2012-12-03T11:20:01 statemachine writing pxe_config panda-android.2, boot config {} to db
2012-12-03T11:20:01 statemachine entering state start_pxe_boot
2012-12-03T11:20:01 statemachine entering state pxe_power_cycling
2012-12-03T11:20:01 bmm setting PXE config to 'panda-android.2'
2012-12-03T11:20:01 bmm initiating power cycle
2012-12-03T11:20:04 statemachine entering state pxe_booting
2012-12-03T11:21:15 syslog second stage URL found: http://10.12.48.27/scripts/android-second-stage.sh
2012-12-03T11:21:15 syslog wget success: --2025-11-06 06:42:00-- http://10.12.48.27/scripts/android-second-stage.sh#012Connecting to 10.12.48.27:80... connected.#012HTTP request sent, awaiting response... 200 OK#012Length: 1829 (1.8K) [application/x-sh]#012Saving to: `/opt/scripts/second-stage.sh'#012#012 0K . 100% 9.53M=0s#012#0122025-11-06 06:42:00 (9.53 MB/s) - `/opt/scripts/second-stage.sh' saved [1829/1829]
2012-12-03T11:21:15 syslog Executing second-stage.sh
2012-12-03T11:21:15 syslog beginning android-second-stage.sh
2012-12-03T11:21:15 syslog setting time with ntpdate
2012-12-03T11:21:24 syslog Submitting lifeguard event at http://10.12.48.27/api/device/panda-0022/event/android_downloading/
2012-12-03T11:21:24 bmm clearing PXE config
2012-12-03T11:21:24 statemachine entering state android_downloading
2012-12-03T11:21:24 syslog got android image panda-android.2
2012-12-03T11:21:24 syslog fetching http://10.12.48.27/artifacts/panda-android.2/system.tar.bz2
2012-12-03T11:21:36 syslog fetching http://10.12.48.27/artifacts/panda-android.2/userdata.tar.bz2
2012-12-03T11:21:38 syslog fetching http://10.12.48.27/artifacts/panda-android.2/boot.tar.bz2
2012-12-03T11:21:39 syslog Formatting partitions
2012-12-03T11:21:51 syslog Mounting partitions
2012-12-03T11:21:51 syslog Submitting lifeguard event at http://10.12.48.27/api/device/panda-0022/event/android_extracting/
2012-12-03T11:21:51 statemachine entering state android_extracting
2012-12-03T11:21:52 syslog Extracting artifacts
2012-12-03T11:23:43 syslog Unmounting partitions
2012-12-03T11:23:58 syslog Imaging complete. Rebooting
2012-12-03T11:23:58 syslog Submitting lifeguard event at http://10.12.48.27/api/device/panda-0022/event/android_rebooting/
2012-12-03T11:23:58 statemachine entering state android_rebooting
My guess is that something got wedged in the power-cycling code, since all of the affected pandas were on the same relay board, and other operations were proceeding apace.

Amy had something similar happen in rack 10, where she had mis-formatted the relay info.

It's possible that this happens when an "unexpected" error (which the misformatted relay info might trigger) occurs.
Does this only affect a handful of devices?  Only Android?

Asking because jhopkins was having issues with b2g pandas not coming up post-image.
I'm not sure if that's this bug or something else?
It has nothing to do with the image on the pandas.  The manifestation is that a state that should time out (e.g., *_rebooting) never does.  It may be related to what jhopkins is seeing.  I restarted mozpool on mobile-imaging-001 for him a few hours ago.
I've been tailing the logs on mobile-services for a while now, and I haven't seen this happen again.
I also can't reproduce this at home.  I've tried:
 - incorrectly formatted relay hosts (no :'s, like amy's)
 - correctly formatted relay hosts with a bogus hostname
 - correctly formatted relay hosts with a bogus bank #

in all cases, I get a traceback in the log, and things proceed as expected.
Still no replication here.  Another possibility is that a MySQL connection hangs.  We do all of the MySQL transactions synchronously, so if this hung the devicemanager or requestmanager thread, all timeouts would stop.  I'm not sure how to work around that, so I don't want to try to address it without more evidence.
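To make that failure mode concrete, here's a minimal sketch of the shape of such a polling loop (illustrative names only, not mozpool's actual code): a single thread both finds and handles timed-out machines, so one DB call that never returns stops every timeout on the server.

```python
import time

def timeout_poll_loop(get_timed_out_machine_names, handle_timeout, interval=10):
    # One thread services all timeouts.  Both callables hit MySQL
    # synchronously; if either blocks forever, no timeout ever fires again.
    while True:
        for name in get_timed_out_machine_names():
            handle_timeout(name)
        time.sleep(interval)
```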
Blocks: 802317
My notes from debugging this on mobile-imaging-001.p1 (which has now been restarted and is back in service):

mozpool-s 17890 apache    3u  IPv4 21022167      0t0      TCP localhost:8010->localhost:49740 (ESTABLISHED)
mozpool-s 17890 apache    4u  IPv4 21020638      0t0      TCP mobile-imaging-001.p1.releng.scl1.mozilla.com:38556->stage-rw-vip.db.scl3.mozilla.com:mysql (CLOSE_WAIT)
mozpool-s 17890 apache    5u  IPv4 19798499      0t0      TCP *:8010 (LISTEN)
mozpool-s 17890 apache    6u  IPv4 21020575      0t0      TCP mobile-imaging-001.p1.releng.scl1.mozilla.com:38554->stage-rw-vip.db.scl3.mozilla.com:mysql (CLOSE_WAIT)
mozpool-s 17890 apache    7u  IPv4 21020014      0t0      TCP mobile-imaging-001.p1.releng.scl1.mozilla.com:38542->stage-rw-vip.db.scl3.mozilla.com:mysql (CLOSE_WAIT)
mozpool-s 17890 apache    9u  IPv4 21020152      0t0      TCP mobile-imaging-001.p1.releng.scl1.mozilla.com:38545->stage-rw-vip.db.scl3.mozilla.com:mysql (CLOSE_WAIT)


all CLOSE_WAIT = server closed, client didn't get the memo yet.  Doing an API call will open one in ESTABLISHED.

on a "running" system, these are all ESTABLISHED, and anything reading/writing those fd's would get an error, so this is unlikely a MySQL-originated hang

(gdb) info threads
  11 Thread 0x7f6b3a7ae700 (LWP 17899)  0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
  10 Thread 0x7f6b39dad700 (LWP 17900)  0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
  9 Thread 0x7f6b393ac700 (LWP 17901)  0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
  8 Thread 0x7f6b23fff700 (LWP 17902)  0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
  7 Thread 0x7f6b235fe700 (LWP 17903)  0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
  6 Thread 0x7f6b22bfd700 (LWP 17904)  0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
  5 Thread 0x7f6b221fc700 (LWP 17905)  0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
  4 Thread 0x7f6b217fb700 (LWP 17906)  0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
  3 Thread 0x7f6b20dfa700 (LWP 17907)  0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
  2 Thread 0x7f6b0bfff700 (LWP 17908)  0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
* 1 Thread 0x7f6b45d27700 (LWP 17890)  0x000000368d8dc273 in poll () from /lib64/libc.so.6


device list API call works
power-control API calls work
everything works except states that need to time out, e.g., pc_rebooting

state transitions (handle_event) seem to work anyway, and use the same LockByName instance
  - handle_event (for bogus event "fooie") successful on every panda on the server
    - slowness is Apache, grr

Oldest state_timeout value is for panda-0159 and panda-0141 (tie), when they went to b2g_rebooting:
device.panda-0141 INFO - [2012-12-05 18:57:37,035] entering state b2g_rebooting
device.panda-0159 INFO - [2012-12-05 18:57:37,409] entering state b2g_rebooting

So whatever failed, failed then?  Is it global or panda-specific?
  -> maybe the tie is responsible??

request.1643 INFO - [2012-12-05 18:57:14,447] entering state closed
127.0.0.1:40744 - - [05/Dec/2012 18:57:14] "HTTP/1.1 POST /api/device/panda-0085/request/" - 409 Conflict
127.0.0.1:40747 - - [05/Dec/2012 18:57:21] "HTTP/1.1 GET /api/request/1638/status/" - 200 OK
device.panda-0141 INFO - [2012-12-05 18:57:37,035] entering state b2g_rebooting  *****
127.0.0.1:40748 - - [05/Dec/2012 18:57:37] "HTTP/1.1 GET /api/device/panda-0141/event/b2g_rebooting/" - 200 OK
device.panda-0159 INFO - [2012-12-05 18:57:37,409] entering state b2g_rebooting  *****
127.0.0.1:40749 - - [05/Dec/2012 18:57:37] "HTTP/1.1 GET /api/device/panda-0159/event/b2g_rebooting/" - 200 OK
device.panda-0137 INFO - [2012-12-05 18:57:38,313] entering state b2g_rebooting (also failed)
127.0.0.1:40750 - - [05/Dec/2012 18:57:38] "HTTP/1.1 GET /api/device/panda-0137/event/b2g_rebooting/" - 200 OK
device.panda-0143 INFO - [2012-12-05 18:57:47,087] entering state b2g_rebooting (also failed) ..
127.0.0.1:40751 - - [05/Dec/2012 18:57:47] "HTTP/1.1 GET /api/device/panda-0143/event/b2g_rebooting/" - 200 OK
device.panda-0158 INFO - [2012-12-05 18:57:48,686] entering state b2g_rebooting
127.0.0.1:40752 - - [05/Dec/2012 18:57:48] "HTTP/1.1 GET /api/device/panda-0158/event/b2g_rebooting/" - 200 OK
device.panda-0127 INFO - [2012-12-05 18:57:50,703] entering state b2g_rebooting
127.0.0.1:40753 - - [05/Dec/2012 18:57:50] "HTTP/1.1 GET /api/device/panda-0127/event/b2g_rebooting/" - 200 OK
device.panda-0104 INFO - [2012-12-05 18:57:54,091] entering state b2g_rebooting
127.0.0.1:40754 - - [05/Dec/2012 18:57:54] "HTTP/1.1 GET /api/device/panda-0104/event/b2g_rebooting/" - 200 OK
127.0.0.1:40755 - - [05/Dec/2012 18:57:54] "HTTP/1.1 GET /api/request/1630/status/" - 200 OK
request.1644 INFO - [2012-12-05 18:57:55,004] entering state finding_device
request.1644 INFO - [2012-12-05 18:57:55,046] assigning requested device panda-0106
request.1644 INFO - [2012-12-05 18:57:55,086] request succeeded
request.1644 INFO - [2012-12-05 18:57:55,086] entering state contacting_lifeguard
request.1644 INFO - [2012-12-05 18:57:55,144] entering state device_busy
requests.packages.urllib3.connectionpool INFO - [2012-12-05 18:57:55,169] Starting new HTTP connection (1): mobile-imaging-001.p1.releng.scl1.mozilla.com
device.panda-0106 WARNING - [2012-12-05 18:57:55,192] ignored event free in state failed_pxe_booting
127.0.0.1:40758 - - [05/Dec/2012 18:57:55] "HTTP/1.1 POST /api/device/panda-0106/event/free/" - 200 OK
requests.packages.urllib3.connectionpool DEBUG - [2012-12-05 18:57:55,211] "POST /api/device/panda-0106/event/free/ HTTP/1.1" 200 2
request.1644 INFO - [2012-12-05 18:57:55,211] entering state closed
127.0.0.1:40756 - - [05/Dec/2012 18:57:55] "HTTP/1.1 POST /api/device/panda-0106/request/" - 409 Conflict
127.0.0.1:40759 - - [05/Dec/2012 18:57:56] "HTTP/1.1 GET /api/request/1632/status/" - 200 OK
device.panda-0150 INFO - [2012-12-05 18:57:58,719] entering state b2g_rebooting
127.0.0.1:40760 - - [05/Dec/2012 18:57:58] "HTTP/1.1 GET /api/device/panda-0150/event/b2g_rebooting/" - 200 OK
127.0.0.1:40761 - - [05/Dec/2012 18:58:01] "HTTP/1.1 GET /api/request/1634/status/" - 200 OK
127.0.0.1:40762 - - [05/Dec/2012 18:58:03] "HTTP/1.1 GET /api/request/1635/status/" - 200 OK
127.0.0.1:40763 - - [05/Dec/2012 18:58:04] "HTTP/1.1 GET /api/request/1636/status/" - 200 OK
127.0.0.1:40764 - - [05/Dec/2012 18:58:13] "HTTP/1.1 GET /api/request/1633/status/" - 200 OK
127.0.0.1:40765 - - [05/Dec/2012 18:58:14] "HTTP/1.1 GET /api/request/1637/status/" - 200 OK
request.1645 INFO - [2012-12-05 18:58:14,586] entering state finding_device
request.1645 INFO - [2012-12-05 18:58:14,628] assigning requested device panda-0085
request.1645 INFO - [2012-12-05 18:58:14,641] request succeeded
request.1645 INFO - [2012-12-05 18:58:14,641] entering state contacting_lifeguard
request.1645 INFO - [2012-12-05 18:58:14,700] entering state device_busy
requests.packages.urllib3.connectionpool INFO - [2012-12-05 18:58:14,780] Starting new HTTP connection (1): mobile-imaging-001.p1.releng.scl1.mozilla.com
device.panda-0085 WARNING - [2012-12-05 18:58:14,803] ignored event free in state failed_b2g_downloading
127.0.0.1:40768 - - [05/Dec/2012 18:58:14] "HTTP/1.1 POST /api/device/panda-0085/event/free/" - 200 OK
requests.packages.urllib3.connectionpool DEBUG - [2012-12-05 18:58:14,821] "POST /api/device/panda-0085/event/free/ HTTP/1.1" 200 2
request.1645 INFO - [2012-12-05 18:58:14,822] entering state closed
127.0.0.1:40766 - - [05/Dec/2012 18:58:14] "HTTP/1.1 POST /api/device/panda-0085/request/" - 409 Conflict
127.0.0.1:40769 - - [05/Dec/2012 18:58:21] "HTTP/1.1 GET /api/request/1638/status/" - 200 OK
127.0.0.1:40770 - - [05/Dec/2012 18:58:54] "HTTP/1.1 GET /api/request/1630/status/" - 200 OK
request.1646 INFO - [2012-12-05 18:58:55,355] entering state finding_device
request.1646 INFO - [2012-12-05 18:58:55,397] assigning requested device panda-0106
request.1646 INFO - [2012-12-05 18:58:55,410] request succeeded
request.1646 INFO - [2012-12-05 18:58:55,410] entering state contacting_lifeguard
request.1646 INFO - [2012-12-05 18:58:55,501] entering state device_busy
requests.packages.urllib3.connectionpool INFO - [2012-12-05 18:58:55,526] Starting new HTTP connection (1): mobile-imaging-001.p1.releng.scl1.mozilla.com
device.panda-0106 WARNING - [2012-12-05 18:58:55,548] ignored event free in state failed_pxe_booting
127.0.0.1:40773 - - [05/Dec/2012 18:58:55] "HTTP/1.1 POST /api/device/panda-0106/event/free/" - 200 OK
requests.packages.urllib3.connectionpool DEBUG - [2012-12-05 18:58:55,566] "POST /api/device/panda-0106/event/free/ HTTP/1.1" 200 2
request.1646 INFO - [2012-12-05 18:58:55,566] entering state closed
127.0.0.1:40771 - - [05/Dec/2012 18:58:55] "HTTP/1.1 POST /api/device/panda-0106/request/" - 409 Conflict
127.0.0.1:40774 - - [05/Dec/2012 18:58:56] "HTTP/1.1 GET /api/request/1632/status/" - 200 OK
127.0.0.1:40775 - - [05/Dec/2012 18:59:01] "HTTP/1.1 GET /api/request/1634/status/" - 200 OK
127.0.0.1:40776 - - [05/Dec/2012 18:59:03] "HTTP/1.1 GET /api/request/1635/status/" - 200 OK
127.0.0.1:40777 - - [05/Dec/2012 18:59:04] "HTTP/1.1 GET /api/request/1636/status/" - 200 OK
127.0.0.1:40778 - - [05/Dec/2012 18:59:14] "HTTP/1.1 GET /api/request/1633/status/" - 200 OK
127.0.0.1:40779 - - [05/Dec/2012 18:59:14] "HTTP/1.1 GET /api/request/1637/status/" - 200 OK
request.1647 INFO - [2012-12-05 18:59:14,969] entering state finding_device
request.1647 INFO - [2012-12-05 18:59:15,010] assigning requested device panda-0085
request.1647 INFO - [2012-12-05 18:59:15,025] request succeeded
request.1647 INFO - [2012-12-05 18:59:15,025] entering state contacting_lifeguard
request.1647 INFO - [2012-12-05 18:59:15,218] entering state device_busy
requests.packages.urllib3.connectionpool INFO - [2012-12-05 18:59:15,244] Starting new HTTP connection (1): mobile-imaging-001.p1.releng.scl1.mozilla.com
device.panda-0085 WARNING - [2012-12-05 18:59:15,267] ignored event free in state failed_b2g_downloading
127.0.0.1:40782 - - [05/Dec/2012 18:59:15] "HTTP/1.1 POST /api/device/panda-0085/event/free/" - 200 OK
requests.packages.urllib3.connectionpool DEBUG - [2012-12-05 18:59:15,285] "POST /api/device/panda-0085/event/free/ HTTP/1.1" 200 2
request.1647 INFO - [2012-12-05 18:59:15,285] entering state closed
127.0.0.1:40780 - - [05/Dec/2012 18:59:15] "HTTP/1.1 POST /api/device/panda-0085/request/" - 409 Conflict
127.0.0.1:40783 - - [05/Dec/2012 18:59:21] "HTTP/1.1 GET /api/request/1638/status/" - 200 OK
127.0.0.1:40784 - - [05/Dec/2012 18:59:51] "HTTP/1.1 GET /api/bmm/pxe_config/list/" - 200 OK
127.0.0.1:40785 - - [05/Dec/2012 18:59:54] "HTTP/1.1 GET /api/request/1630/status/" - 200 OK
request.1648 INFO - [2012-12-05 18:59:55,708] entering state finding_device
request.1648 INFO - [2012-12-05 18:59:55,748] assigning requested device panda-0106
request.1648 INFO - [2012-12-05 18:59:55,762] request succeeded
request.1648 INFO - [2012-12-05 18:59:55,762] entering state contacting_lifeguard
request.1648 INFO - [2012-12-05 18:59:55,817] entering state device_busy
requests.packages.urllib3.connectionpool INFO - [2012-12-05 18:59:55,843] Starting new HTTP connection (1): mobile-imaging-001.p1.releng.scl1.mozilla.com
device.panda-0106 WARNING - [2012-12-05 18:59:55,866] ignored event free in state failed_pxe_booting
127.0.0.1:40788 - - [05/Dec/2012 18:59:55] "HTTP/1.1 POST /api/device/panda-0106/event/free/" - 200 OK
requests.packages.urllib3.connectionpool DEBUG - [2012-12-05 18:59:55,885] "POST /api/device/panda-0106/event/free/ HTTP/1.1" 200 2
request.1648 INFO - [2012-12-05 18:59:55,885] entering state closed
127.0.0.1:40786 - - [05/Dec/2012 18:59:55] "HTTP/1.1 POST /api/device/panda-0106/request/" - 409 Conflict
127.0.0.1:40789 - - [05/Dec/2012 18:59:56] "HTTP/1.1 GET /api/request/1632/status/" - 200 OK
<---- timeout for panda-0141 and panda-0159 should have happened here ---->
127.0.0.1:40790 - - [05/Dec/2012 19:00:01] "HTTP/1.1 GET /api/request/1634/status/" - 200 OK
127.0.0.1:40791 - - [05/Dec/2012 19:00:03] "HTTP/1.1 GET /api/request/1635/status/" - 200 OK
127.0.0.1:40792 - - [05/Dec/2012 19:00:04] "HTTP/1.1 GET /api/request/1636/status/" - 200 OK
127.0.0.1:40793 - - [05/Dec/2012 19:00:14] "HTTP/1.1 GET /api/request/1633/status/" - 200 OK
127.0.0.1:40794 - - [05/Dec/2012 19:00:14] "HTTP/1.1 GET /api/request/1637/status/" - 200 OK
Attached patch logging.patch
Attachment #689234 - Flags: review?(mcote)
Comment on attachment 689234 [details] [diff] [review]
logging.patch

Cool cool.
Attachment #689234 - Flags: review?(mcote) → review+
With the logging in place,

device INFO - [2012-12-06 12:39:45,679] tick
device ERROR - [2012-12-06 12:39:45,686] _run failed
Traceback (most recent call last):
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/statedriver.py", line 44, in run
    self._run();
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/statedriver.py", line 65, in _run
    for machine_name in self._get_timed_out_machine_names():
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/lifeguard/devicemachine.py", line 61, in _get_timed_out_machine_names
    return data.get_timed_out_devices(self.imaging_server_id)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/data.py", line 218, in get_timed_out_devices
    return get_timed_out(model.devices, model.devices.c.name, imaging_server_id)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/data.py", line 199, in get_timed_out
    & (tbl.c.imaging_server_id == imaging_server_id)))
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1449, in execute
    params)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1584, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1698, in _execute_context
    context)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1691, in _execute_context
    context)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 331, in do_execute
    cursor.execute(statement, parameters)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/cursors.py", line 117, in execute
    self.errorhandler(self, exc, value)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/connections.py", line 189, in defaulterrorhandler
    raise errorclass, errorvalue
ProgrammingError: (ProgrammingError) (1146, u"Table 'mozpool.devices' doesn't exist") 'SELECT devices.name \nFROM devices \nWHERE devices.state_timeout < %s AND devices.imaging_server_id = %s' (datetime.datetime(2012, 12, 6, 12, 39, 45, 679912), 201L)
device INFO - [2012-12-06 12:39:45,688] _run returned


(and I know why that specific error happened -- I had to quickly drop and re-create the table to work around a MySQL bug).

I'm willing to blame earlier failures on similar things, and assume that the traceback just didn't make it into the logs (although it should..).  So, time to bulletproof that loop.
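The shape of the fix, as a rough sketch (method names taken from the traceback above; how the driver actually dispatches each timeout is an assumption -- the real change is attachment 689396): log and swallow exceptions inside the loop so a transient DB error can't stop future passes.

```python
import logging
import traceback

log = logging.getLogger('device')

def run_timeout_pass(driver):
    # One pass of the timeout loop.  A failure here (e.g. the missing-table
    # error above) is logged, and the next pass still runs.
    try:
        for machine_name in driver._get_timed_out_machine_names():
            driver.handle_timeout(machine_name)
    except Exception:
        log.error("_run failed\n%s" % traceback.format_exc())
```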
Attached patch bug817762.patch
This leaves a not-unreasonable amount of logging in, I think.
Attachment #689396 - Flags: review?(mcote)
Comment on attachment 689396 [details] [diff] [review]
bug817762.patch

Review of attachment 689396 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozpool/statedriver.py
@@ +42,1 @@
>          try:

Missed this in the last patch, but having a try/except inside a try/finally seems like overkill. :)
Attachment #689396 - Flags: review?(mcote) → review+
Mark also pointed out in IRC that the thread is not a daemon, so it makes interrupting the process a bit hard (note that web.py's worker threads do the same thing - they don't quit until Firefox drops the HTTP/1.1 connection).
I landed this with a flattened try syntax and thd.setDaemon(1) added.
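For reference, the daemon-flag part is just (a sketch; the target function is a placeholder):

```python
import threading

def start_poller(poll_fn):
    thd = threading.Thread(target=poll_fn, name='timeout-poller')
    thd.setDaemon(1)   # don't keep the process alive waiting for this loop
    thd.start()
    return thd
```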

I'm going to optimistically call this done.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
optimism is just future disappointment..

request WARNING - [2013-02-06 14:21:20,200] polling thread still running at 104186s; not starting another
device WARNING - [2013-02-06 14:21:20,200] polling thread still running at 104186s; not starting another

this is on the staging server.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Looking back in the logs, I see a few times when a single timeout loop spilled past 10 seconds, but recovered afterward.  These really are just warnings.

device WARNING - [2013-02-05 03:19:17,396] polling thread still running at 10s; not starting another
device.fakepanda-2043 INFO - [2013-02-05 03:19:17,271] entering state free
device INFO - [2013-02-05 03:19:18,356] handling timeout on fakepanda-2044
fakedevices.fakepanda-2043 DEBUG - [2013-02-05 03:19:18,360] pinged; result=True (state self_test_ok)
device WARNING - [2013-02-05 03:19:18,826] polling thread still running at 11s; not starting another
device.fakepanda-2044 INFO - [2013-02-05 03:19:18,599] entering state free
device INFO - [2013-02-05 03:19:19,808] handling timeout on panda-0010
fakedevices.fakepanda-2044 DEBUG - [2013-02-05 03:19:19,810] pinged; result=True (state self_test_ok)
device WARNING - [2013-02-05 03:19:20,310] polling thread still running at 12s; not starting another
device.panda-0010 INFO - [2013-02-05 03:19:20,062] entering state free
device INFO - [2013-02-05 03:19:22,145] handling timeout on fakepanda-2052

later, some failures caused by losing a db connection:

request WARNING - [2013-02-05 09:25:03,688] polling thread still running at 10s; not starting another
device WARNING - [2013-02-05 09:25:03,865] polling thread still running at 10s; not starting another
request WARNING - [2013-02-05 09:25:04,689] polling thread still running at 11s; not starting another
device WARNING - [2013-02-05 09:25:04,866] polling thread still running at 11s; not starting another
request WARNING - [2013-02-05 09:25:05,791] polling thread still running at 12s; not starting another
device WARNING - [2013-02-05 09:25:05,968] polling thread still running at 12s; not starting another
request WARNING - [2013-02-05 09:25:07,003] polling thread still running at 13s; not starting another
device WARNING - [2013-02-05 09:25:07,180] polling thread still running at 13s; not starting another
request WARNING - [2013-02-05 09:25:08,336] polling thread still running at 14s; not starting another
device WARNING - [2013-02-05 09:25:08,512] polling thread still running at 14s; not starting another
request WARNING - [2013-02-05 09:25:09,802] polling thread still running at 16s; not starting another
device WARNING - [2013-02-05 09:25:09,977] polling thread still running at 16s; not starting another
request WARNING - [2013-02-05 09:25:11,414] polling thread still running at 17s; not starting another
device WARNING - [2013-02-05 09:25:11,590] polling thread still running at 17s; not starting another
request WARNING - [2013-02-05 09:25:13,188] polling thread still running at 19s; not starting another
device WARNING - [2013-02-05 09:25:13,363] polling thread still running at 19s; not starting another
request WARNING - [2013-02-05 09:25:15,139] polling thread still running at 21s; not starting another
device WARNING - [2013-02-05 09:25:15,314] polling thread still running at 21s; not starting another
request WARNING - [2013-02-05 09:25:17,286] polling thread still running at 23s; not starting another
device WARNING - [2013-02-05 09:25:17,460] polling thread still running at 23s; not starting another
request WARNING - [2013-02-05 09:25:19,646] polling thread still running at 25s; not starting another
device WARNING - [2013-02-05 09:25:19,821] polling thread still running at 25s; not starting another
request WARNING - [2013-02-05 09:25:22,241] polling thread still running at 28s; not starting another
device WARNING - [2013-02-05 09:25:22,418] polling thread still running at 28s; not starting another
request WARNING - [2013-02-05 09:25:25,098] polling thread still running at 31s; not starting another
device WARNING - [2013-02-05 09:25:25,274] polling thread still running at 31s; not starting another
request WARNING - [2013-02-05 09:25:28,240] polling thread still running at 34s; not starting another
device WARNING - [2013-02-05 09:25:28,416] polling thread still running at 34s; not starting another
request WARNING - [2013-02-05 09:25:31,696] polling thread still running at 38s; not starting another
device WARNING - [2013-02-05 09:25:31,872] polling thread still running at 38s; not starting another
request WARNING - [2013-02-05 09:25:35,498] polling thread still running at 41s; not starting another
device WARNING - [2013-02-05 09:25:35,674] polling thread still running at 41s; not starting another
request WARNING - [2013-02-05 09:25:39,679] polling thread still running at 46s; not starting another
device WARNING - [2013-02-05 09:25:39,856] polling thread still running at 45s; not starting another
request WARNING - [2013-02-05 09:25:44,279] polling thread still running at 50s; not starting another
device WARNING - [2013-02-05 09:25:44,455] polling thread still running at 50s; not starting another
request WARNING - [2013-02-05 09:25:49,339] polling thread still running at 55s; not starting another
device WARNING - [2013-02-05 09:25:49,515] polling thread still running at 55s; not starting another
request WARNING - [2013-02-05 09:25:54,899] polling thread still running at 61s; not starting another
device WARNING - [2013-02-05 09:25:55,076] polling thread still running at 61s; not starting another
request WARNING - [2013-02-05 09:26:01,022] polling thread still running at 67s; not starting another
device WARNING - [2013-02-05 09:26:01,198] polling thread still running at 67s; not starting another
request WARNING - [2013-02-05 09:26:07,756] polling thread still running at 74s; not starting another
device WARNING - [2013-02-05 09:26:07,929] polling thread still running at 74s; not starting another
request WARNING - [2013-02-05 09:26:15,165] polling thread still running at 81s; not starting another
device WARNING - [2013-02-05 09:26:15,337] polling thread still running at 81s; not starting another
request WARNING - [2013-02-05 09:26:23,312] polling thread still running at 89s; not starting another
device WARNING - [2013-02-05 09:26:23,482] polling thread still running at 89s; not starting another
request WARNING - [2013-02-05 09:26:32,273] polling thread still running at 98s; not starting another
device WARNING - [2013-02-05 09:26:32,446] polling thread still running at 98s; not starting another
request WARNING - [2013-02-05 09:26:42,124] polling thread still running at 108s; not starting another
device WARNING - [2013-02-05 09:26:42,306] polling thread still running at 108s; not starting another
request WARNING - [2013-02-05 09:26:52,970] polling thread still running at 119s; not starting another
device WARNING - [2013-02-05 09:26:53,147] polling thread still running at 119s; not starting another
request WARNING - [2013-02-05 09:27:04,900] polling thread still running at 131s; not starting another
device WARNING - [2013-02-05 09:27:05,075] polling thread still running at 131s; not starting another
request WARNING - [2013-02-05 09:27:18,021] polling thread still running at 144s; not starting another
device WARNING - [2013-02-05 09:27:18,194] polling thread still running at 144s; not starting another
Traceback (most recent call last):
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/web/application.py", line 239, in process
    return self.handle()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/web/application.py", line 230, in handle
    return self._delegate(fn, self.fvars, args)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/web/application.py", line 420, in _delegate
    return handle_class(cls)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/web/application.py", line 396, in handle_class
    return tocall(*args)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/templeton/handlers.py", line 65, in wrap
    results = json.dumps(func(*a, **kw), cls=DateTimeJSONEncoder)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/bmm/handlers.py", line 86, in GET
    return data.list_pxe_configs()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/data.py", line 445, in list_pxe_configs
    conn = sql.get_conn()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/sql.py", line 49, in get_conn
    return get_engine().connect()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2471, in connect
    return self._connection_cls(self, **kwargs)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 878, in __init__
    self.__connection = connection or engine.raw_connection()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2557, in raw_connection
    return self.pool.unique_connection()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/pool.py", line 184, in unique_connection
    return _ConnectionFairy(self).checkout()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/pool.py", line 474, in checkout
    self)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/event.py", line 377, in __call__
    fn(*args, **kw)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/sql.py", line 15, in _checkout_listener
    cursor.execute("SELECT 1")
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/cursors.py", line 117, in execute
    self.errorhandler(self, exc, value)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/connections.py", line 187, in defaulterrorhandler
    raise Error(errorclass, errorvalue)
Error: (<class 'socket.error'>, error(104, 'Connection reset by peer'))

127.0.0.1:38413 - - [05/Feb/2013 09:27:31] "HTTP/1.1 GET /api/bmm/pxe_config/list/" - 500 Internal Server Error
request WARNING - [2013-02-05 09:27:32,455] polling thread still running at 158s; not starting another
device WARNING - [2013-02-05 09:27:32,626] polling thread still running at 158s; not starting another
request WARNING - [2013-02-05 09:27:48,329] polling thread still running at 174s; not starting another
device WARNING - [2013-02-05 09:27:48,505] polling thread still running at 174s; not starting another
request WARNING - [2013-02-05 09:28:05,789] polling thread still running at 192s; not starting another
device WARNING - [2013-02-05 09:28:05,972] polling thread still running at 192s; not starting another
request WARNING - [2013-02-05 09:28:24,997] polling thread still running at 211s; not starting another
device WARNING - [2013-02-05 09:28:25,181] polling thread still running at 211s; not starting another
request WARNING - [2013-02-05 09:28:46,130] polling thread still running at 232s; not starting another
device WARNING - [2013-02-05 09:28:46,313] polling thread still running at 232s; not starting another
request WARNING - [2013-02-05 09:29:09,369] polling thread still running at 255s; not starting another
device WARNING - [2013-02-05 09:29:09,561] polling thread still running at 255s; not starting another
127.0.0.1:38414 - - [05/Feb/2013 09:29:31] "HTTP/1.1 GET /api/bmm/pxe_config/list/" - 200 OK
request WARNING - [2013-02-05 09:29:34,939] polling thread still running at 281s; not starting another
device WARNING - [2013-02-05 09:29:35,120] polling thread still running at 281s; not starting another
request WARNING - [2013-02-05 09:30:03,061] polling thread still running at 309s; not starting another
device WARNING - [2013-02-05 09:30:03,243] polling thread still running at 309s; not starting another
127.0.0.1:38417 - - [05/Feb/2013 09:30:27] "HTTP/1.1 GET /api/device/panda-0021/state/" - 200 OK
127.0.0.1:38416 - - [05/Feb/2013 09:30:27] "HTTP/1.1 GET /api/device/panda-0010/state/" - 200 OK
request WARNING - [2013-02-05 09:30:33,997] polling thread still running at 340s; not starting another
device WARNING - [2013-02-05 09:30:34,181] polling thread still running at 340s; not starting another
request WARNING - [2013-02-05 09:31:08,022] polling thread still running at 374s; not starting another
device WARNING - [2013-02-05 09:31:08,208] polling thread still running at 374s; not starting another
request WARNING - [2013-02-05 09:31:45,464] polling thread still running at 411s; not starting another
device WARNING - [2013-02-05 09:31:45,628] polling thread still running at 411s; not starting another
request WARNING - [2013-02-05 09:32:26,644] polling thread still running at 452s; not starting another
device WARNING - [2013-02-05 09:32:26,808] polling thread still running at 452s; not starting another
..

it never recovered from that one.

Immediate plan is to add some monitoring for this (probably by touching a file on every startup, and monitoring that file age with nagios).  I'll need to stare at the code some more to figure out what's really going wrong here.  The MySQL error is probably a big hint.
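One way the heartbeat could look (hypothetical path and helper; the bug only commits to "touching a file" and monitoring its age with nagios, so where and how often it's called is an assumption):

```python
import os

HEARTBEAT = '/var/run/mozpool/heartbeat'  # hypothetical path

def touch_heartbeat(path=HEARTBEAT):
    # Bump the file's mtime; a nagios check_file_age-style check then alerts
    # when the mtime is too old, i.e. when mozpool has stopped making progress.
    with open(path, 'a'):
        os.utime(path, None)
```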
bug 838925 for the monitoring.
In my dev environment, I tried cutting off traffic from mozpool to mysql using iptables -j DROP.  I do see the timeout loop hang, which is not ideal -- but it comes back after the iptables rule is deleted.  This is the case even if I restart mysql in the interim - the ongoing queries eventually get an RST, at which time they raise an exception, which is caught and logged, and life goes on normally.

I would blame synchronous calls to `requests`, but those only occur in the mozpool layer, and both layers are seeing the timeouts at the same time.
Still can't replicate :(
I *may* have replicated this.  At least, I have a thread that's hung for 3558s, after blocking and unblocking traffic to the mysql server repeatedly.

It's blocked somewhere in DeviceMachine.handle_timeout().

My guess is that I managed to block traffic right after a query had been sent, but before the query response was returned.  So from the mozpool system, it looks like the query is still pending (with a full TCP connection and everything).  From the DB server, it probably tried to send a few packets for the response, got no ACKs, and finally closed the socket.

So I think that the fix is to add timeouts to all DB calls and/or set SO_KEEPALIVE and tweak the kernel to use something less than 2h for that timer.
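For the socket-option route, the same knobs the kernel exposes via sysctl can also be set per-socket from Python.  A sketch, using the values from the patch as attached below (60s idle, 60s interval, 5 probes); this is illustration only -- the landed fix adjusts the kernel-wide sysctls instead:

```python
import socket

def enable_keepalive(sock, idle=60, interval=60, probes=5):
    # Detect a dead peer after roughly idle + interval * probes seconds,
    # instead of the kernel default of 2 hours.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, 'TCP_KEEPIDLE'):  # Linux-only constants
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
```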
Poking around with gdb shows a bunch of threads in __libc_recv.  The fd is optimized out, unfortunately, but that's strong evidence we're hung in a read here.
It's funny how all roads lead back to asynchronous programming being a good idea :)

Otherwise, the best fix I can find for this is to use multiprocessing to put all DB connections in a subprocess, and shoot that process in the head if it takes longer than desired to return a result.  That's pretty awful.
Or use SO_KEEPALIVE.  I've confirmed that MySQLdb sets SO_KEEPALIVE, so it only remains to dial the kernel's timeout to something < 2h.
Attached patch bug817762.patch
Attachment #712617 - Flags: review?(rsoderberg)
Comment on attachment 712617 [details] [diff] [review]
bug817762.patch

Review of attachment 712617 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, feel free to ignore my bikeshed about timeout values below.

::: modules/tweaks/manifests/tcp_keepalive.pp
@@ +1,5 @@
> +# This Source Code Form is subject to the terms of the Mozilla Public
> +# License, v. 2.0. If a copy of the MPL was not distributed with this
> +# file, You can obtain one at http://mozilla.org/MPL/2.0/.
> +
> +# Configure the kernel to send keepalive packets every minute on idle tcp

s/every minute/60 seconds/, to match the value below ("every minute" is technically equivalent to "60 seconds", but the unit conversion obscures the link between this paragraph and that value).

@@ +2,5 @@
> +# License, v. 2.0. If a copy of the MPL was not distributed with this
> +# file, You can obtain one at http://mozilla.org/MPL/2.0/.
> +
> +# Configure the kernel to send keepalive packets every minute on idle tcp
> +# connections, and fail after 5m of no response (5 probes).  This is pretty

I don't think that itvl=60 is a good choice here. If the remote isn't responding, it probably isn't going to respond by waiting 60 seconds between probes; a delay of 5 seconds between probes should be more than sufficient?

300 second timeouts exceed the useful timeout window for most applications. What is the maximum timeout that provides a tolerable experience?

The tcp_keepalive_time here of "60 seconds" implies that your app needs to know within 60 seconds that the connection has gone away, and so you would need time=40, itvl=5, probes=4 to make sure that the socket is terminated at T=60s for being idle.

If the maximum application timeout window is 300 seconds, then you would use e.g. time=280, itvl=5, probes=4.

@@ +9,5 @@
> +# failure at the other end of the connection.
> +
> +class tweaks::tcp_keepalive {
> +    case $::operatingsystem {
> +        CentOS, Ubuntu: {

You'll have better compatibility with either case $::osfamily { RedHat, Debian: ... } or case $::kernel { Linux: ... }

@@ +11,5 @@
> +class tweaks::tcp_keepalive {
> +    case $::operatingsystem {
> +        CentOS, Ubuntu: {
> +            sysctl::value {
> +                'net.ipv4.tcp_keepalive_time':

# wait N seconds since last data packet (or probe response) received before sending the first probe

@@ +13,5 @@
> +        CentOS, Ubuntu: {
> +            sysctl::value {
> +                'net.ipv4.tcp_keepalive_time':
> +                    value => 60;
> +                'net.ipv4.tcp_keepalive_itvl':

# wait N seconds before a keepalive probe fails (times out)

@@ +15,5 @@
> +                'net.ipv4.tcp_keepalive_time':
> +                    value => 60;
> +                'net.ipv4.tcp_keepalive_itvl':
> +                    value => 60;
> +                'net.ipv4.tcp_keepalive_probes':

# terminate the connection after N probes have timed out

@@ +17,5 @@
> +                'net.ipv4.tcp_keepalive_itvl':
> +                    value => 60;
> +                'net.ipv4.tcp_keepalive_probes':
> +                    value => 5;
> +            }

Add per-line docs for each value, as these settings are particularly tricky to understand without context (and even the explanatory paragraph above isn't necessarily enough).
Attachment #712617 - Flags: review?(rsoderberg) → review+
(In reply to Richard Soderberg [:atoll] from comment #28)
> use need time=40, itvl=5, probes=4 to make sure that the socket is
> terminated at T=60s for being idle.

Correction: to make sure that the socket is terminated at T=60s for being dead.
"a few minutes" is OK for this application.  I picked 3.  $::operatingsystem is a very common pattern around puppetagain, so I kept that pattern, although you're right that $::kernel is equivalent.
I landed this.  Let's hope it works :)
Status: REOPENED → RESOLVED
Closed: 12 years ago11 years ago
Resolution: --- → FIXED
This is still occurring, though for shorter times.  On mobile-imaging-002, I see some pauses for about 900s and about 1050s.  There's still no obvious cause in the logs - the timeout before is for a device in the free state, so it should just be pinging.

I see the same on mobile-imaging-005, 003, and 004.  All at the same time - beginning at 
  device WARNING - [2013-02-23 04:39:12,276] polling thread still running at 10s; not starting another

:cyborgshadow, I see

[04:35:55] <nagios-scl3> cyborgshadow: Downtime for backup2.db.scl3.mozilla.com scheduled for 4:00:00

right around that time.  Did anything related happen on the buildbot cluster?
Group: infra
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
FTR, racks 2 to 5 are not in use by the automation since we don't yet have enough load.

I could set up rack 2 and see if it clears this.

FYI, racks 1 to 5 are allocated for b2g pandas.
I assume it's good/bad luck which racks are affected.
All of the servers failed with something like

request WARNING - [2013-02-28 13:48:30,166] polling thread still running at 211s; not starting another
device WARNING - [2013-02-28 13:48:38,886] polling thread still running at 211s; not starting another

at exactly that time.  They then recovered.  That was the Great Switch Failure Of 2013-02-28.  So that's good news.  However, one didn't recover, on mobile-imaging-007:

device WARNING - [2013-02-28 13:48:30,985] polling thread still running at 211s; not starting another
...
device WARNING - [2013-03-01 07:41:20,658] polling thread still running at 64581s; not starting another

note that only the device loop failed, not the requests loop.
Unfortunately, all of the cool introspection techniques require Python's symbols, which we don't build for the custom package.

The relevant threads are, I think,

Thread 14 (Thread 0x7fd5c7228700 (LWP 25028)):
#0  0x0000003e9dade513 in select () from /lib64/libc.so.6
#1  0x00007fd5cb7b51d9 in ?? () from /opt/mozpool/frontend/lib/python2.7/lib-dynload/time.so
#2  0x00007fd5d1a3dec6 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#3  0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#4  0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#5  0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#6  0x00007fd5d19bedc1 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#7  0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#8  0x00007fd5d19a256f in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#9  0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#10 0x00007fd5d1a37493 in PyEval_CallObjectWithKeywords () from /tools/python27/lib/libpython2.7.so.1.0
#11 0x00007fd5d1a75692 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#12 0x0000003e9de077f1 in start_thread () from /lib64/libpthread.so.0
#13 0x0000003e9dae592d in clone () from /lib64/libc.so.6

Process 25028 attached - interrupt to quit
select(0, NULL, NULL, NULL, {17, 489110}) = 0 (Timeout)
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
write(2, "device WARNING - [2013-03-01 08:"..., 104) = 104
select(0, NULL, NULL, NULL, {60, 240069}

  ** the Lifeguard StateDriver, repeatedly logging about the polling thread still running


Thread 13 (Thread 0x7fd5c5e26700 (LWP 25030)):
#0  0x0000003e9dade513 in select () from /lib64/libc.so.6
#1  0x00007fd5cb7b51d9 in ?? () from /opt/mozpool/frontend/lib/python2.7/lib-dynload/time.so
#2  0x00007fd5d1a3dec6 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#3  0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#4  0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#5  0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#6  0x00007fd5d19bedc1 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#7  0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#8  0x00007fd5d19a256f in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#9  0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#10 0x00007fd5d1a37493 in PyEval_CallObjectWithKeywords () from /tools/python27/lib/libpython2.7.so.1.0
#11 0x00007fd5d1a75692 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#12 0x0000003e9de077f1 in start_thread () from /lib64/libpthread.so.0
#13 0x0000003e9dae592d in clone () from /lib64/libc.so.6

Process 25030 attached - interrupt to quit
select(0, NULL, NULL, NULL, {0, 727435}) = 0 (Timeout)
clone(child_stack=0x7fd5c5424ff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fd5c54259d0, tls=0x7fd5c5425700, child_tidptr=0x7fd5c54259d0) = 12054
futex(0x29819a0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fd5bc0013c0, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x29819a0, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
select(0, NULL, NULL, NULL, {10, 0})    = 0 (Timeout)
clone(child_stack=0x7fd5c5424ff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fd5c54259d0, tls=0x7fd5c5425700, child_tidptr=0x7fd5c54259d0) = 12057
futex(0x29819a0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x29819a0, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x29819a0, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
select(0, NULL, NULL, NULL, {10, 0}^C <unfinished ...>

  ** the Mozpool StateDriver, repeatedly cloning off a new timeout handling thread



Thread 2 (Thread 0x7fd5961fc700 (LWP 2196)):
#0  0x0000003e9de0e8ec in recv () from /lib64/libpthread.so.0
#1  0x00007fd5c9a7fa81 in ?? () from /opt/mozpool/frontend/lib/python2.7/lib-dynload/_socket.so
#2  0x00007fd5c9a7fcae in ?? () from /opt/mozpool/frontend/lib/python2.7/lib-dynload/_socket.so
#3  0x00007fd5d1a3dec6 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#4  0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#5  0x00007fd5d1a3dae7 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#6  0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#7  0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#8  0x00007fd5d19bedc1 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#9  0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#10 0x00007fd5d19a256f in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#11 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#12 0x00007fd5d19fbec0 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#13 0x00007fd5d19f21b8 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#14 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#15 0x00007fd5d1a3d24d in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#16 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#17 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#18 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#19 0x00007fd5d19beec8 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#20 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#21 0x00007fd5d19a256f in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#22 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#23 0x00007fd5d19fbec0 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#24 0x00007fd5d19f21b8 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#25 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#26 0x00007fd5d1a3c5ac in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#27 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#28 0x00007fd5d19beec8 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#29 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#30 0x00007fd5d1a3c5ac in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#31 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#32 0x00007fd5d19beec8 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#33 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#34 0x00007fd5d1a3c5ac in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#35 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#36 0x00007fd5d1a3dae7 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#37 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#38 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#39 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#40 0x00007fd5d19bedc1 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#41 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#42 0x00007fd5d19a256f in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#43 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#44 0x00007fd5d19fbec0 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#45 0x00007fd5d19f21b8 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#46 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#47 0x00007fd5d1a3d24d in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#48 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#49 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#50 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#51 0x00007fd5d19beec8 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#52 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#53 0x00007fd5d19a256f in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#54 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#55 0x00007fd5d19fbec0 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#56 0x00007fd5d19f21b8 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#57 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#58 0x00007fd5d1a3c5ac in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#59 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#60 0x00007fd5d1a3dae7 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#61 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#62 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#63 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#64 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#65 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#66 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#67 0x00007fd5d19beec8 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#68 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#69 0x00007fd5d1a3c5ac in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#70 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#71 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#72 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#73 0x00007fd5d19bedc1 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#74 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#75 0x00007fd5d19a256f in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#76 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#77 0x00007fd5d1a37493 in PyEval_CallObjectWithKeywords () from /tools/python27/lib/libpython2.7.so.1.0
#78 0x00007fd5d1a75692 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#79 0x0000003e9de077f1 in start_thread () from /lib64/libpthread.so.0
#80 0x0000003e9dae592d in clone () from /lib64/libc.so.6

Process 2196 attached - interrupt to quit
recvfrom(5, ^C <unfinished ...>

[root@mobile-imaging-007 ~]# lsof -p 25019
COMMAND     PID   USER   FD   TYPE   DEVICE SIZE/OFF     NODE NAME
...
mozpool-s 25019 apache    5u  IPv4 53125226      0t0      TCP mobile-imaging-007.p7.releng.scl1.mozilla.com:48075->buildbot-rw-vip.db.scl3.mozilla.com:mysql (ESTABLISHED)

meaning that it's waiting for incoming data from the DB.  Per above, that TCP connection should see traffic every 120s, if not every 5s.

[root@mobile-imaging-007 ~]# tcpdump tcp port 48075
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes

shows no output over ~10m.  Is SO_KEEPALIVE set?  Linux doesn't provide a way to inspect another process's sockopts, but we can do it in gdb.

(gdb) call (int*)malloc(4)
$1 = (int *) 0x2c87990
(gdb) call (int*)malloc(4)
$2 = (int *) 0x2bc56d0
(gdb) call *(int *)$2 = 4
$3 = 4
(gdb) call getsockopt(5, 1, 9, $1, $2)
$4 = 0
(gdb) p *(int *)$1
$5 = 0
(gdb) quit

$1 is *optval and $2 is *optlen, so *(int *)$1 is the value of SO_KEEPALIVE.  The $5 = 0 means "disabled", where 1 would mean "enabled", as confirmed by creating a socket by hand in another process.  So at least for some connections, Python-MySQL is not setting SO_KEEPALIVE, contrary to what I thought in comment 26.  A grep of the Python-MySQL source for SO_KEEPALIVE confirms this.  Comment 26 was based on reading an strace of a new connection being made, but must have been a mis-read of the source.
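(The "socket by hand" check is just the Python equivalent of the gdb call above -- a throwaway snippet, not mozpool code:)

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)   # 0 = disabled (the default)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
print s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)   # 1 = enabled
```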
Sorry, I forgot we're using PyMySQL and not Python-MySQL.  PyMySQL doesn't set SO_KEEPALIVE either, but at least it's in Python and thus monkey-patchable.
Group: infra
Unfortunately, this fix isn't possible with MySQLdb -- it uses the mysql client API, which hides the socket even from the C code, let alone the Python.  The patch at least logs loudly in that case.
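Roughly, the idea behind the patch (a sketch only -- the attribute holding PyMySQL's raw socket and the exact hook point are assumptions; attachment 720043 is the real change, which also logs loudly when it can't set the option):

```python
import logging
import socket

from sqlalchemy import event

log = logging.getLogger('db.sql')

def _set_keepalive(dbapi_conn, connection_record):
    # PyMySQL keeps the raw socket on the connection object; the attribute
    # name has varied between versions, so probe for it (assumption).
    sock = getattr(dbapi_conn, 'socket', None) or getattr(dbapi_conn, '_sock', None)
    if sock is None:
        log.warning("could not find the MySQL socket; SO_KEEPALIVE not set")
        return
    log.debug("setting SO_KEEPALIVE on MySQL socket %d" % sock.fileno())
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# registered once when the engine is created, e.g.:
# event.listen(engine.pool, 'connect', _set_keepalive)
```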
Attachment #720043 - Flags: review?(catlee)
Comment on attachment 720043 [details] [diff] [review]
bug817762-set-keepalive.patch

Review of attachment 720043 [details] [diff] [review]:
-----------------------------------------------------------------

If we ever need to use the mysql driver, we could iterate over all open fd's and try to set the keepalive socket option on them.
Attachment #720043 - Flags: review?(catlee) → review+
Richard helpfully points out that the MySQL client libraries which Python-MySQL talks to *do* set this option.  So there's no need to work around this with Python-MySQL.  I've already landed attachment 720043 [details] [diff] [review], but I'll make some comment-only updates to indicate that this is a complete solution.  I'll update the PyMySQL issue too.
I just deployed mozpool-2.0.3 in bug 848738, which contains only this fix.
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → FIXED
request.13980 INFO - [2013-03-11 15:01:02,522] entering state device_not_found
device WARNING - [2013-03-11 15:01:03,141] polling thread still running at 295639s; not starting another
request WARNING - [2013-03-11 15:01:12,357] polling thread still running at 10s; not starting another
request WARNING - [2013-03-11 15:01:13,359] polling thread still running at 11s; not starting another
request WARNING - [2013-03-11 15:01:14,460] polling thread still running at 12s; not starting another
request WARNING - [2013-03-11 15:01:15,672] polling thread still running at 13s; not starting another
request WARNING - [2013-03-11 15:01:17,004] polling thread still running at 14s; not starting another
request WARNING - [2013-03-11 15:01:18,470] polling thread still running at 16s; not starting another
request WARNING - [2013-03-11 15:01:20,083] polling thread still running at 17s; not starting another
request WARNING - [2013-03-11 15:01:21,856] polling thread still running at 19s; not starting another
request WARNING - [2013-03-11 15:01:23,807] polling thread still running at 21s; not starting another
request WARNING - [2013-03-11 15:01:25,953] polling thread still running at 23s; not starting another
request WARNING - [2013-03-11 15:01:28,314] polling thread still running at 25s; not starting another
request WARNING - [2013-03-11 15:01:30,910] polling thread still running at 28s; not starting another
request WARNING - [2013-03-11 15:01:33,767] polling thread still running at 31s; not starting another
request WARNING - [2013-03-11 15:01:36,908] polling thread still running at 34s; not starting another
request WARNING - [2013-03-11 15:01:40,364] polling thread still running at 38s; not starting another
request WARNING - [2013-03-11 15:01:44,166] polling thread still running at 41s; not starting another
request WARNING - [2013-03-11 15:01:48,348] polling thread still running at 46s; not starting another
request WARNING - [2013-03-11 15:01:52,947] polling thread still running at 50s; not starting another
request WARNING - [2013-03-11 15:01:58,007] polling thread still running at 55s; not starting another
request.13980 INFO - [2013-03-11 15:02:02,619] entering state closed
device WARNING - [2013-03-11 15:02:03,441] polling thread still running at 295699s; not starting another
device WARNING - [2013-03-11 15:03:03,742] polling thread still running at 295759s; not starting another
device WARNING - [2013-03-11 15:04:04,043] polling thread still running at 295820s; not starting another
db.sql DEBUG - [2013-03-11 15:04:13,723] setting SO_KEEPALIVE on MySQL socket 3
db.sql DEBUG - [2013-03-11 15:04:33,744] setting SO_KEEPALIVE on MySQL socket 4
device WARNING - [2013-03-11 15:05:04,307] polling thread still running at 295880s; not starting another
db.sql DEBUG - [2013-03-11 15:05:33,817] setting SO_KEEPALIVE on MySQL socket 6
device WARNING - [2013-03-11 15:06:04,608] polling thread still running at 295940s; not starting another
device WARNING - [2013-03-11 15:07:04,908] polling thread still running at 296001s; not starting another
db.sql DEBUG - [2013-03-11 15:07:13,912] setting SO_KEEPALIVE on MySQL socket 9
device WARNING - [2013-03-11 15:08:05,209] polling thread still running at 296061s; not starting another
db.sql DEBUG - [2013-03-11 15:08:13,975] setting SO_KEEPALIVE on MySQL socket 8
device WARNING - [2013-03-11 15:09:05,510] polling thread still running at 296121s; not starting another
device WARNING - [2013-03-11 15:10:05,810] polling thread still running at 296182s; not starting another
device WARNING - [2013-03-11 15:11:06,111] polling thread still running at 296242s; not starting another
device WARNING - [2013-03-11 15:12:06,411] polling thread still running at 296302s; not starting another
device WARNING - [2013-03-11 15:13:06,712] polling thread still running at 296362s; not starting another
device WARNING - [2013-03-11 15:14:07,013] polling thread still running at 296423s; not starting another
db.sql DEBUG - [2013-03-11 15:14:14,354] setting SO_KEEPALIVE on MySQL socket 3
db.sql DEBUG - [2013-03-11 15:14:34,375] setting SO_KEEPALIVE on MySQL socket 4
device WARNING - [2013-03-11 15:15:07,313] polling thread still running at 296483s; not starting another
db.sql DEBUG - [2013-03-11 15:15:34,448] setting SO_KEEPALIVE on MySQL socket 6
device WARNING - [2013-03-11 15:16:07,614] polling thread still running at 296543s; not starting another
device WARNING - [2013-03-11 15:17:07,914] polling thread still running at 296604s; not starting another
db.sql DEBUG - [2013-03-11 15:17:14,543] setting SO_KEEPALIVE on MySQL socket 9
device WARNING - [2013-03-11 15:18:08,215] polling thread still running at 296664s; not starting another
db.sql DEBUG - [2013-03-11 15:18:14,607] setting SO_KEEPALIVE on MySQL socket 8
device WARNING - [2013-03-11 15:19:08,516] polling thread still running at 296724s; not starting another

and from the sec mozpool:

device WARNING - [2013-03-11 15:23:34,240] polling thread still running at 293767s; not starting another
request WARNING - [2013-03-11 15:23:36,863] polling thread still running at 10s; not starting another
request WARNING - [2013-03-11 15:23:37,864] polling thread still running at 11s; not starting another
request WARNING - [2013-03-11 15:23:38,965] polling thread still running at 12s; not starting another
request WARNING - [2013-03-11 15:23:40,177] polling thread still running at 13s; not starting another
request WARNING - [2013-03-11 15:23:41,509] polling thread still running at 14s; not starting another
request WARNING - [2013-03-11 15:23:42,975] polling thread still running at 16s; not starting another
request WARNING - [2013-03-11 15:23:44,588] polling thread still running at 17s; not starting another
request WARNING - [2013-03-11 15:23:46,361] polling thread still running at 19s; not starting another
request WARNING - [2013-03-11 15:23:48,312] polling thread still running at 21s; not starting another
request WARNING - [2013-03-11 15:23:50,458] polling thread still running at 23s; not starting another
request WARNING - [2013-03-11 15:23:52,819] polling thread still running at 25s; not starting another
request WARNING - [2013-03-11 15:23:55,415] polling thread still running at 28s; not starting another
request ERROR - [2013-03-11 15:23:56,863] failure in _tick
Traceback (most recent call last):
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/statedriver.py", line 78, in _tick
    self.poll_for_timeouts()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/statedriver.py", line 101, in poll_for_timeouts
    for machine_name in self._get_timed_out_machine_names():
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/mozpool/requestmachine.py", line 71, in _get_timed_out_machine_names
    return data.get_timed_out_requests(self.imaging_server_id)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/data.py", line 491, in get_timed_out_requests
    return get_timed_out(model.requests, model.requests.c.id, imaging_server_id)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/data.py", line 276, in get_timed_out
    res = sql.get_conn().execute(select(
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/sql.py", line 49, in get_conn
    return get_engine().connect()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2471, in connect
    return self._connection_cls(self, **kwargs)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 878, in __init__
    self.__connection = connection or engine.raw_connection()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2557, in raw_connection
    return self.pool.unique_connection()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/pool.py", line 184, in unique_connection
    return _ConnectionFairy(self).checkout()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/pool.py", line 401, in __init__
    rec = self._connection_record = pool._do_get()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/pool.py", line 738, in _do_get
    (self.size(), self.overflow(), self._timeout))
TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30
The sec server is running 2.0.2 (without the fix), but mobile-imaging-001 is running the fixed version:

/var/log/mozpool.log.1:root INFO - [2013-03-07 10:44:06,038] Mozpool-2.0.3 server starting

so something else is amiss.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Same on 003 and 004.  The heartbeat monitor is missing these because only one of the two timeout loops (for mozpool and lifeguard) is failing.
This isn't actually blocking 802317.

I need some help on this.  As you can see from the above, I'm beyond grasping at straws at this point.
No longer blocks: 802317
I verified that the socket fd's do have SO_KEEPALIVE set via the gdb trick above.

I shut off access to the db using iptables.  When the connection was restored, there were some "Connection reset by peer" errors, followed by normal operation - no wedging.
I think mozpool on mobile-imaging-001.p1.releng.scl1.mozilla.com got wedged as part of the firewall maintenance tonight.

20:00 < nagios-releng> Wed 20:00:37 PDT [433] mobile-imaging-001.p1.releng.scl1.mozilla.com:http is CRITICAL: CRITICAL - Socket timeout after 10 seconds (http://m.allizom.org/http)
20:01 < nagios-releng> Wed 20:01:07 PDT [434] panda-0102.p1.releng.scl1.mozilla.com is DOWN :(Host Check Timed Out)
20:01 < nagios-releng> Wed 20:01:07 PDT [435] panda-0094.p1.releng.scl1.mozilla.com is DOWN :(Host Check Timed Out)
20:01 < nagios-releng> Wed 20:01:28 PDT [436] panda-0099.p1.releng.scl1.mozilla.com is DOWN :(Host Check Timed Out)
20:01 < nagios-releng> Wed 20:01:28 PDT [437] panda-0122.p1.releng.scl1.mozilla.com is DOWN :(Host Check Timed Out)

I logged in as root@mobile-imaging-001.p1.releng.scl1.mozilla.com and killed
apache   30477 30475  0 Mar11 ?        00:16:45 /opt/mozpool/frontend/bin/python2.7 /opt/mozpool/frontend/bin/mozpool-server 8010

nagios started being happy again, and there was much rejoicing.
Well, there was less sadness anyway.

I learned about sys._current_frames() at PyCon.  It might be worth patching mozpool to dump that periodically if this is indeed still the problem.  The logs from the 13th are gone, unfortunately.
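Roughly what I have in mind -- a sketch, not the attached patch; the logger name and interval are placeholders:

    import logging
    import sys
    import threading
    import traceback

    def dump_all_frames(logger=logging.getLogger('frames'), interval=600):
        # Log a stack for every live thread, then reschedule; if a polling
        # thread wedges again, the periodic dump shows where it is stuck.
        for thread_id, frame in sys._current_frames().items():
            logger.warning('thread %s:\n%s', thread_id,
                           ''.join(traceback.format_stack(frame)))
        t = threading.Timer(interval, dump_all_frames, args=(logger, interval))
        t.daemon = True
        t.start()

    # call dump_all_frames() once at startup; it reschedules itself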
Surely this debugging output will lead me to a solution!

This only happens rarely now, so it could still take a while...
Attachment #732293 - Flags: review?(jwatkins)
Comment on attachment 732293 [details] [diff] [review]
bug817762-curframes.patch

Will this be backed out once the cause is discovered and fixed?

r+
Attachment #732293 - Flags: review?(jwatkins) → review+
This and a pile of other patches were made on this bug's behalf.  The patch has landed, although it will have to wait for a new deployment to take effect.
We haven't seen this recur in a month now.  I'll give it another month, and then remove some of the debugging code.
36h ago, when there was a Zayo failure and ~90s of BGP reconvergence:

device WARNING - [2013-05-29 01:56:23,320] polling thread still running at 98s; not starting another
request WARNING - [2013-05-29 01:56:30,549] polling thread still running at 98s; not starting another

but they all recovered.  Another week, and we'll start backing this out.
dustin, is this related? There are 30 minutes of nothing.

2013-06-03T07:18:55 statemachine entering state operation_complete
2013-06-03T07:18:55 statemachine sending imaging result 'complete' to Mozpool
2013-06-03T07:18:55 statemachine entering state ready
2013-06-03T07:49:16 sut connecting to SUT agent

"If you see an "Imaging complete" line, and then big fat nothing -- no pinging or anything -- then that's a repeat of bug 817762, which makes pandas sad (literally). Re-open the bug and find Dustin. "
Nope - that looks normal, actually.  It imaged the board, and then didn't do anything to it while it was reserved, then did a SUT check 40m later, after the reservation was closed.
Bug 817762: Roll back
    
    Revert "Bug 817762: write out current frames whenever a timeout thread
    is wedged; r=dividehex" but leave the monitoring of the polling thread's
    duration, as this gives useful output during network failures, etc.
    Soften the logging a little bit by not logging until the hang has lasted
    for 80s.
Attachment #759209 - Flags: review?(jwatkins)
Comment on attachment 759209 [details] [diff] [review]
bug817762-rollback.patch

looks good
Attachment #759209 - Flags: review?(jwatkins) → review+
Attachment #759209 - Flags: checked-in+
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → FIXED
This just recurred on -010.

> 127.0.0.1:52550 - - [11/Jun/2013 11:48:56] "HTTP/1.1 GET /api/request/208647/status/" - 200 OK
> sut.cli ERROR - [2013-06-11 11:49:06,060] Exception initiating DeviceManager!: Remote Device Error: unable to connect to panda-0071.p10.releng.scl1.mozilla.com after 1 attempts
> 127.0.0.1:52551 - - [11/Jun/2013 11:49:08] "HTTP/1.1 GET /api/request/208644/status/" - 200 OK
> 127.0.0.1:52552 - - [11/Jun/2013 11:49:11] "HTTP/1.1 GET /api/request/208650/status/" - 200 OK
> device INFO - [2013-06-11 11:49:14,965] handling timeout on panda-0072
> request INFO - [2013-06-11 11:49:16,060] handling timeout on 208650
> 127.0.0.1:52553 - - [11/Jun/2013 11:49:26] "HTTP/1.1 GET /api/request/208643/status/" - 200 OK
> 127.0.0.1:52555 - - [11/Jun/2013 11:49:34] "HTTP/1.1 GET /api/request/208646/status/" - 200 OK
> db.pool DEBUG - [2013-06-11 11:49:55,610] setting SO_KEEPALIVE on MySQL socket 13
> db.pool DEBUG - [2013-06-11 11:49:56,549] setting SO_KEEPALIVE on MySQL socket 15
> 127.0.0.1:52560 - - [11/Jun/2013 11:49:56] "HTTP/1.1 GET /api/request/208647/status/" - 200 OK
> db.pool DEBUG - [2013-06-11 11:50:08,733] setting SO_KEEPALIVE on MySQL socket 18
> 127.0.0.1:52563 - - [11/Jun/2013 11:50:08] "HTTP/1.1 GET /api/request/208644/status/" - 200 OK
> 127.0.0.1:52565 - - [11/Jun/2013 11:50:11] "HTTP/1.1 GET /api/request/208650/status/" - 200 OK
> db.pool DEBUG - [2013-06-11 11:50:18,054] setting SO_KEEPALIVE on MySQL socket 20
> db.pool DEBUG - [2013-06-11 11:50:27,101] setting SO_KEEPALIVE on MySQL socket 22
> 127.0.0.1:52569 - - [11/Jun/2013 11:50:27] "HTTP/1.1 GET /api/request/208643/status/" - 200 OK
> db.pool DEBUG - [2013-06-11 11:50:34,490] setting SO_KEEPALIVE on MySQL socket 24
> 127.0.0.1:52572 - - [11/Jun/2013 11:50:34] "HTTP/1.1 GET /api/request/208646/status/" - 200 OK
> db.pool DEBUG - [2013-06-11 11:50:34,744] setting SO_KEEPALIVE on MySQL socket 26
> device WARNING - [2013-06-11 11:50:36,441] polling thread still running at 81s; not starting another
> 127.0.0.1:52576 - - [11/Jun/2013 11:50:37] "HTTP/1.1 GET /api/relay/panda-relay-004/test/" - 200 OK
> 127.0.0.1:52577 - - [11/Jun/2013 11:50:38] "HTTP/1.1 GET /api/relay/panda-relay-006/test/" - 200 OK
> 127.0.0.1:52578 - - [11/Jun/2013 11:50:39] "HTTP/1.1 GET /api/relay/panda-relay-002/test/" - 200 OK
> 127.0.0.1:52580 - - [11/Jun/2013 11:50:39] "HTTP/1.1 GET /api/bmm/pxe_config/list/" - 200 OK
> device WARNING - [2013-06-11 11:50:44,589] polling thread still running at 89s; not starting another
> 127.0.0.1:52583 - - [11/Jun/2013 11:50:48] "HTTP/1.1 GET /api/relay/panda-relay-079/test/" - 200 OK
> device WARNING - [2013-06-11 11:50:53,553] polling thread still running at 98s; not starting another
> device WARNING - [2013-06-11 11:51:03,413] polling thread still running at 108s; not starting another
> device WARNING - [2013-06-11 11:51:14,259] polling thread still running at 119s; not starting another
> device WARNING - [2013-06-11 11:51:26,189] polling thread still running at 131s; not starting another
> device WARNING - [2013-06-11 11:51:39,313] polling thread still running at 144s; not starting another
> device WARNING - [2013-06-11 11:51:53,749] polling thread still running at 158s; not starting another
> device WARNING - [2013-06-11 11:52:09,628] polling thread still running at 174s; not starting another
> device WARNING - [2013-06-11 11:52:27,084] polling thread still running at 192s; not starting another
> device WARNING - [2013-06-11 11:52:46,298] polling thread still running at 211s; not starting another
> device WARNING - [2013-06-11 11:53:07,433] polling thread still running at 232s; not starting another
> device WARNING - [2013-06-11 11:53:30,662] polling thread still running at 255s; not starting another
> device WARNING - [2013-06-11 11:53:56,236] polling thread still running at 281s; not starting another
> device WARNING - [2013-06-11 11:54:24,367] polling thread still running at 309s; not starting another
> device WARNING - [2013-06-11 11:54:55,311] polling thread still running at 340s; not starting another
> device WARNING - [2013-06-11 11:55:29,349] polling thread still running at 374s; not starting another
> device WARNING - [2013-06-11 11:56:06,781] polling thread still running at 411s; not starting another
> device WARNING - [2013-06-11 11:56:47,968] polling thread still running at 453s; not starting another
> device WARNING - [2013-06-11 11:57:33,273] polling thread still running at 498s; not starting another
> device WARNING - [2013-06-11 11:58:23,108] polling thread still running at 548s; not starting another
> device WARNING - [2013-06-11 11:59:17,927] polling thread still running at 602s; not starting another
> device WARNING - [2013-06-11 12:00:18,228] polling thread still running at 663s; not starting another
> root INFO - [2013-06-11 12:01:01,564] Mozpool-4.1.1 server starting
> db.pool DEBUG - [2013-06-11 12:01:01,658] setting SO_KEEPALIVE on MySQL socket 3
> db.pool DEBUG - [2013-06-11 12:01:01,690] setting SO_KEEPALIVE on MySQL socket 4
> request INFO - [2013-06-11 12:01:01,696] handling timeout on 208643
> device INFO - [2013-06-11 12:01:01,704] handling timeout on panda-0061
> db.pool DEBUG - [2013-06-11 12:01:01,744] setting SO_KEEPALIVE on MySQL socket 6
> db.pool DEBUG - [2013-06-11 12:01:09,643] setting SO_KEEPALIVE on MySQL socket 10
> db.pool DEBUG - [2013-06-11 12:01:11,695] setting SO_KEEPALIVE on MySQL socket 12
> db.pool DEBUG - [2013-06-11 12:01:17,705] setting SO_KEEPALIVE on MySQL socket 14
> db.pool DEBUG - [2013-06-11 12:01:18,431] setting SO_KEEPALIVE on MySQL socket 17
> db.pool DEBUG - [2013-06-11 12:01:28,213] setting SO_KEEPALIVE on MySQL socket 19
> db.pool DEBUG - [2013-06-11 12:01:35,736] setting SO_KEEPALIVE on MySQL socket 22
> db.pool DEBUG - [2013-06-11 12:01:47,746] setting SO_KEEPALIVE on MySQL socket 24
> device WARNING - [2013-06-11 12:02:23,155] polling thread still running at 81s; not starting another
> device WARNING - [2013-06-11 12:02:31,304] polling thread still running at 89s; not starting another
> device WARNING - [2013-06-11 12:02:40,268] polling thread still running at 98s; not starting another
> device WARNING - [2013-06-11 12:02:50,128] polling thread still running at 108s; not starting another
> device WARNING - [2013-06-11 12:03:00,974] polling thread still running at 119s; not starting another
> device WARNING - [2013-06-11 12:03:12,904] polling thread still running at 131s; not starting another
> device WARNING - [2013-06-11 12:03:26,027] polling thread still running at 144s; not starting another
> device WARNING - [2013-06-11 12:03:40,463] polling thread still running at 158s; not starting another
> device WARNING - [2013-06-11 12:03:56,342] polling thread still running at 174s; not starting another
> device WARNING - [2013-06-11 12:04:13,810] polling thread still running at 192s; not starting another
> device WARNING - [2013-06-11 12:04:33,024] polling thread still running at 211s; not starting another
> device WARNING - [2013-06-11 12:04:54,157] polling thread still running at 232s; not starting another
> device WARNING - [2013-06-11 12:05:17,406] polling thread still running at 255s; not starting another
> device WARNING - [2013-06-11 12:05:42,980] polling thread still running at 281s; not starting another
> device WARNING - [2013-06-11 12:06:11,111] polling thread still running at 309s; not starting another
> device WARNING - [2013-06-11 12:06:42,055] polling thread still running at 340s; not starting another
> request.208643 WARNING - [2013-06-11 12:01:01,756] HERE
> request WARNING - [2013-06-11 12:02:23,174] polling thread still running at 81s; not starting another
> db.pool DEBUG - [2013-06-11 12:07:08,132] setting SO_KEEPALIVE on MySQL socket 203
> device INFO - [2013-06-11 12:07:08,133] handling timeout on panda-0062
> request.208645 INFO - [2013-06-11 12:01:04,179] Request closed.

Note that this continued to time out for 5m *after* I restarted it ("Mozpool-4.1.1 server starting").  Note, too, the times on the log entries -- they're *way* out of order.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
There were a bunch of dead items reported on AWS buildbot masters right after this, so maybe there was a network blip?
And the other weird thing here is that it fixed itself this time.
The mystery of the mis-ordered log entries is solved.  The message time is set early in the call to _log, and then it runs through the relevant handlers.  Well, for 'request.###', the first handler is writing to the logs table in the DB.  It would seem that all of the DB connections were a bit tied up at that time.

Looking on another imaging server:
device WARNING - [2013-06-11 12:04:15,207] polling thread still running at 904s; not starting another

So, I think that the DB just got overloaded at this time.  Mozpool weathered it as best it could.
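For illustration, a toy version of the effect (the handler here is made up; mozpool's real handler INSERTs into the logs table, but the record timestamp is taken at creation time either way):

    import logging
    import time

    class SlowDBHandler(logging.Handler):
        # Stand-in for a handler whose DB writes are blocked on a tied-up pool.
        def emit(self, record):
            time.sleep(5)  # pretend the INSERT is waiting on a connection
            created = time.strftime('%H:%M:%S', time.localtime(record.created))
            emitted = time.strftime('%H:%M:%S')
            print('created %s, emitted %s: %s' % (created, emitted, record.getMessage()))

    log = logging.getLogger('request.208643')
    log.addHandler(SlowDBHandler())
    log.warning('HERE')  # the timestamp is stamped here, long before emit() finishes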
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → FIXED
DB load graphs show a spike at almost exactly 9am pacific, but nothing at noon pacific.  So, this could be network or it could be DB.
mozpool-server on mobile-imaging-009.p9 hung today @ ~1646 PDT.  Restarting mozpool-server recovered it.  This also caused the nagios checks for the pandas and relays in p9 to alert since the checks are dependent on the mozpool http API.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
The first traceback started at 1530 PDT.  I would be surprised if this wasn't related to the major SCL3 network outage taking place today.

===============================

device ERROR - [2013-07-08 15:30:11,388] failure in _tick
Traceback (most recent call last):
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/statedriver.py", line 81, in _tick
    self.poll_for_timeouts()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/statedriver.py", line 115, in poll_for_timeouts
    for machine_name in self._get_timed_out_machine_names():
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/lifeguard/devicemachine.py", line 73, in _get_timed_out_machine_names
    return self.db.devices.list_timed_out(self.imaging_server_id)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/base.py", line 100, in list_timed_out
    & (tbl.c.imaging_server_id == imaging_server_id)))
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/pool.py", line 65, in execute
    conn = self.engine.connect()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2471, in connect
    return self._connection_cls(self, **kwargs)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 878, in __init__
    self.__connection = connection or engine.raw_connection()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2557, in raw_connection
    return self.pool.unique_connection()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/pool.py", line 184, in unique_connection
    return _ConnectionFairy(self).checkout()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/pool.py", line 402, in __init__
    conn = self.connection = self._connection_record.get_connection()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/pool.py", line 324, in get_connection
    self.connection = self.__connect()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/pool.py", line 344, in __connect
    connection = self.__pool._creator()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/strategies.py", line 80, in connect
    return dialect.connect(*cargs, **cparams)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 281, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/__init__.py", line 93, in Connect
    return Connection(*args, **kwargs)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/connections.py", line 510, in __init__
    self._connect()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/connections.py", line 677, in _connect
    self._request_authentication()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/connections.py", line 719, in _request_authentication
    self._send_authentication()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/connections.py", line 767, in _send_authentication
    auth_packet = MysqlPacket(self)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/connections.py", line 200, in __init__
    self.__recv_packet()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/connections.py", line 206, in __recv_packet
    raise OperationalError(2013, "Lost connection to MySQL server during query")
OperationalError: (OperationalError) (2013, 'Lost connection to MySQL server during query') None None
Yes, that tb is normal during a MySQL failure.  I'll be curious to know how long the hang was, and whether it was greater than the socket timeout.
For reference to the scl3 outage:
https://bugzilla.mozilla.org/show_bug.cgi?id=891128
On mobile-imaging-001, it looks like MySQL went away at about 2013-07-08 16:11 (pacific).  About 180s after that, the "Lost connection.." errors start.  There is a recovery a few minutes later, followed by another spate of polling failures.  The last "polling thread still running" is at 16:22:52.  There were some additional MySQL failures at 16:29:54, but those did not affect the timeout loop.

On mobile-imaging-009, things started at 16:10:26 with some lost connections.  Mozpool treated those as failed operations, handled within the normal flow of the device state machine.  The timeout loop failed at the same times as on mobile-imaging-001.  It saw the same MySQL failures at around 16:29:54.

However, its device polling thread continued to hang:

> device WARNING - [2013-07-08 17:25:11,049] polling thread still running at 4462s; not starting another

There were lots of errors about

> Timeout: HTTPConnectionPool(host='mobile-imaging-009.p9.releng.scl1.mozilla.com', port=80): Request timed out. (timeout=30)

and eventually

> socket.error: [Errno 24] Too many open files

followed by a bunch of failures to connect to MySQL that are probably hiding the same underlying errno.  Then the host was restarted.

Without seeing what happened, I can't say much more than that.  This could partly be a resource overload - too many async HTTP connections to self waiting for a long `requests`  timeout, while lifeguard is happily initiating more connections.  But that doesn't explain the state of the device loop.
This also affected mobile-imaging-008.

I hate threads.

I'm tired of this bug.

Let's change the "still running" warning to kill the entire process when it reaches, say, 600s.  Supervisord will restart the process.
(In reply to Dustin J. Mitchell [:dustin] from comment #71)
> This also affected mobile-imaging-008.
> 
> I hate threads.
> 
> I'm tired of this bug.
> 
> Let's change the "still running" warning to kill the entire process when it
> reaches, say, 600s.  Supervisord will restart the process.

I think that's a good idea as long as it gets reported somewhere (nagios, email, etc.) with 'loud noises and bright flashing lights'
This failed on -007 as well.  None of the threads were blocked reading from or writing to a socket -- they were all in sem_wait, which indicates some kind of synchronization primitive.
Attached patch bug817762-suicide.patch (obsolete) — Splinter Review
DIAF
Attachment #774158 - Flags: review?(jwatkins)
I didn't add any notification to this, aside from the logging.  I forgot to, but on reflection I don't see why we need it -- what action could we take when this occurs, other than needlessly re-opening this bug again?
Comment on attachment 774158 [details] [diff] [review]
bug817762-suicide.patch

I'll revise to send an email.
Attachment #774158 - Attachment is obsolete: true
Attachment #774158 - Flags: review?(jwatkins)
Attachment #774687 - Flags: review?(jwatkins)
Attachment #774696 - Flags: review?(jwatkins)
Comment on attachment 774687 [details] [diff] [review]
bug817762-suicide-r2.patch

Review of attachment 774687 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm
Attachment #774687 - Flags: review?(jwatkins) → review+
Attachment #774696 - Flags: review?(jwatkins) → review+
This closed the trees tonight :-{  SHIPIT PLS

https://bugzilla.mozilla.org/show_bug.cgi?id=893511
Attachment #774687 - Flags: checked-in+
Attachment #774696 - Flags: checked-in+
The preceding two patches will be in Mozpool-4.1.2 (bug 893757)
It occurs to me that the problem here is probably two-part now:
 1. DB error causes an exception, which skips or breaks lock cleanup
 2. Threads pile up on the lock

So a possible way to duplicate this is to make all DB methods randomly raise exceptions, and see if I can get a failure.
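Something like this wrapper would do it (a sketch; the public-method filtering is naive and the shape of the DB object is assumed):

    import random
    from functools import wraps

    def add_random_db_failures(db_obj, rate=0.05):
        # Wrap every public method of a DB-layer object so it randomly raises,
        # to test whether an exception at the wrong moment leaves a lock held.
        for name in dir(db_obj):
            if name.startswith('_'):
                continue
            method = getattr(db_obj, name)
            if not callable(method):
                continue
            def wrap(m):
                @wraps(m)
                def wrapper(*args, **kwargs):
                    if random.random() < rate:
                        raise RuntimeError('injected DB failure in %s' % m.__name__)
                    return m(*args, **kwargs)
                return wrapper
            setattr(db_obj, name, wrap(method))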
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations
So the root cause here is loooong DB queries, blocked while the entire DB server is locked for many minutes.  Some of the failures we saw were only that and nothing more.

There seems to be a deadlock that occurs, too, in this situation.  That may occur when something that is supposed to take only N seconds doesn't even *start* for N seconds.

At any rate, the suicide code seems to be holding this together, and getting production Mozpool off of the buildbot DB cluster (which has the multi-minute locks) will help, too (bug 897109).  There's nothing more to do here.
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → FIXED