Bug 959296 (Closed): Opened 10 years ago, Closed 10 years ago

Node-hubble becomes unavailable

Categories

(Webmaker Graveyard :: General, defect)

Platform: x86 macOS
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: johns, Assigned: humph)

References

Details

Attachments

(2 files)

In the last two weeks, we have seen node-hubble stop answering requests.

On the nodes themselves, the app is running, and the TCP port is listening.

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State      
tcp        0      0 0.0.0.0:5666            0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN     
tcp6       0      0 :::22                   :::*                    LISTEN     
ubuntu@i-f44eecd5:~$ ps -ef | grep node
root     12132 12130  0 Jan08 ?        00:00:00 sudo -u root NODE_ENV=production PORT=80 MEMCACHED_URL=node-hubble-cache.mofoprod.net:11211 /usr/bin/node /var/www/node-hubble/server.js
root     12133 12130  0 Jan08 ?        00:00:00 /usr/bin/logger -s -t app=node-hubble-production -p local3.info
root     12134 12132  0 Jan08 ?        00:00:48 /usr/bin/node /var/www/node-hubble/server.js


When I traced the process with strace, it appeared to be failing accept() calls:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  -nan    0.000000           0        13           read
  -nan    0.000000           0         6           write
  -nan    0.000000           0         8           close
  -nan    0.000000           0         1           getpeername
  -nan    0.000000           0        34           gettimeofday
  -nan    0.000000           0        22           futex
  -nan    0.000000           0        41           clock_gettime
  -nan    0.000000           0        20           epoll_wait
  -nan    0.000000           0        13           epoll_ctl
  -nan    0.000000           0        18         8 accept4
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                   176         8 total


Below is a portion of the raw strace output; the table above is the summary from the same trace.
17:15:29.161876 epoll_ctl(5, EPOLL_CTL_ADD, 14, {EPOLLIN, {u32=14, u64=14}}) = 0
17:15:29.162037 epoll_ctl(5, EPOLL_CTL_ADD, 15, {EPOLLIN, {u32=15, u64=15}}) = 0
17:15:29.162198 epoll_wait(5, {{EPOLLIN, {u32=11, u64=11}}}, 1024, 119997) = 1
17:15:34.308530 clock_gettime(CLOCK_MONOTONIC, {418338, 154076498}) = 0
17:15:34.308776 accept4(11, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = 16
17:15:34.309000 futex(0x7f7c5c0008c8, FUTEX_WAKE_PRIVATE, 1) = 1
17:15:34.309368 gettimeofday({1389633334, 309458}, NULL) = 0
17:15:34.309631 accept4(11, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
17:15:34.309832 clock_gettime(CLOCK_MONOTONIC, {418338, 155344179}) = 0
17:15:34.309998 epoll_ctl(5, EPOLL_CTL_ADD, 16, {EPOLLIN, {u32=16, u64=16}}) = 0
17:15:34.310185 epoll_wait(5, {{EPOLLIN, {u32=16, u64=16}}}, 1024, 114849) = 1
17:15:34.310343 clock_gettime(CLOCK_MONOTONIC, {418338, 155852633}) = 0
17:15:34.310514 read(16, "", 65536)     = 0
17:15:34.310734 futex(0x7f7c5c0008c8, FUTEX_WAKE_PRIVATE, 1) = 1
17:15:34.310933 gettimeofday({1389633334, 311000}, NULL) = 0
17:15:34.311197 close(16)               = 0
17:15:34.311436 clock_gettime(CLOCK_MONOTONIC, {418338, 156944226}) = 0
17:15:34.311590 epoll_wait(5, {{EPOLLIN, {u32=11, u64=11}}}, 1024, 114848) = 1
17:15:45.349378 clock_gettime(CLOCK_MONOTONIC, {418349, 194865152}) = 0
17:15:45.349474 accept4(11, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = 16
17:15:45.349709 futex(0x7f7c5c0008c8, FUTEX_WAKE_PRIVATE, 1) = 1
17:15:45.350062 gettimeofday({1389633345, 350146}, NULL) = 0
17:15:45.350352 accept4(11, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)

TCP stats were as follows:
Ip:
    1466808 total packets received
    2 with invalid addresses
    0 forwarded
    0 incoming packets discarded
    1466806 incoming packets delivered
    723525 requests sent out
Icmp:
    0 ICMP messages received
    0 input ICMP message failed.
    ICMP input histogram:
    0 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
Tcp:
    39170 active connections openings
    64034 passive connection openings
    46 failed connection attempts
    7545 connection resets received
    7 connections established
    1408016 segments received
    680547 segments send out
    235 segments retransmited
    0 bad segments received.
    9033 resets sent
Udp:
    58790 packets received
    0 packets to unknown port received.
    0 packet receive errors
    63830 packets sent
UdpLite:
TcpExt:
    1 invalid SYN cookies received
    13530 TCP sockets finished time wait in fast timer
    6463 delayed acks sent
    229 delayed acks further delayed because of locked socket
    Quick ack mode was activated 581 times
    91 packets directly queued to recvmsg prequeue.
    57283 bytes directly in process context from backlog
    20199 bytes directly received in process context from prequeue
    822534 packet headers predicted
    98 packets header predicted and directly queued to user
    230848 acknowledgments not containing data payload received
    59003 predicted acknowledgments
    3 times recovered from packet loss by selective acknowledgements
    9 congestion windows recovered without slow start after partial ack
    1 timeouts after SACK recovery
    3 fast retransmits
    65 other TCP timeouts
    584 DSACKs sent for old packets
    5 DSACKs received
    66 connections reset due to unexpected data
    7544 connections reset due to early user close
    41 connections aborted due to timeout
    TCPSackShifted: 2
    TCPSackMerged: 4
    TCPSackShiftFallback: 31
IpExt:
    InOctets: 1268140652
    OutOctets: 130484909
We've put an alarm on the specific endpoint https://node-hubble.mofoprod.net/mime/https://google.ca
Additionally, I am running a set of netstat commands every 60 seconds and appending the output to /home/ubuntu/netstat.log, so we can correlate what the network is up to when we see an alert.
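For reference, here is a minimal sketch of the kind of check that alarm performs, written against Node's core https module. The 10-second timeout, the log format, and the one-minute interval are assumptions for illustration, not the actual alarm configuration:

var https = require('https');

function checkMimeEndpoint() {
  var url = 'https://node-hubble.mofoprod.net/mime/https://google.ca';
  var req = https.get(url, function (res) {
    console.log(new Date().toISOString(), 'status', res.statusCode);
    res.resume(); // drain the response so the socket is freed
  });
  req.setTimeout(10000, function () {
    // 10s is an assumed threshold, not the real alarm setting
    console.error(new Date().toISOString(), 'no response within 10s from', url);
    req.abort();
  });
  req.on('error', function (err) {
    console.error(new Date().toISOString(), 'request error:', err.message);
  });
}

// Poll once a minute, mirroring the 60-second netstat polling interval.
setInterval(checkMimeEndpoint, 60 * 1000);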
We just saw this happen again. The node I grabbed logs from was instance i-f44eecd5, which went out of service between 20:55 and 21:00.

I've gone through it and haven't seen anything unusual yet, but I'm still combing through it.
Attached file i-f44eecd5.txt
It looks like there are a number of TCP connections being tied up:

ubuntu@i-e44ae8c5:~$ sudo netstat -alpn
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      32641/node
tcp        0      0 10.178.40.208:80        10.245.73.118:43926     ESTABLISHED 32641/node
tcp        0      0 10.178.40.208:80        10.245.73.118:43918     ESTABLISHED 32641/node
tcp        0      0 10.178.40.208:37133     10.2.13.163:11211       ESTABLISHED 32641/node
tcp        0      0 10.178.40.208:80        10.245.73.118:43934     ESTABLISHED 32641/node
tcp        0      0 10.178.40.208:80        10.245.73.118:43931     ESTABLISHED 32641/node

5 is the default HTTP agent pool limit (maxSockets) in Node, so maybe we're hitting that?
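For context, a hedged sketch of what that pool limit looks like in Node and how it could be raised while we test; this isn't saying node-hubble does either of these things today, just illustrating the suspicion:

var http = require('http');
var https = require('https');

// On Node versions of this era the shared agent allows only 5 concurrent
// sockets per host, so a few slow upstream requests can queue everything else.
console.log('current pool limit:', http.globalAgent.maxSockets);

// One thing to try while testing: raise the limit (64 is an arbitrary value
// chosen for illustration), or opt out of pooling for a single request.
http.globalAgent.maxSockets = 64;
https.globalAgent.maxSockets = 64;

http.get({ host: 'example.org', path: '/', agent: false }, function (res) {
  res.resume(); // agent: false gives this request its own, unpooled socket
});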
Jon points out on IRC that it's the /mime/ endpoint that seems to go down.
I'll see whether this is code related, and I'll take this bug until we're sure it isn't...
Assignee: jon → david.humphrey
Status: NEW → ASSIGNED
Other info: we're using memcached on these instances. If this is code related, I suspect it has something to do with a faulty cache read that never comes back, but I need to test. We could run the app with no cache for now.
I've disabled the cache for now; let's see if that helps things.
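For the record, one hypothetical way to make the cache optional is to fall back to a no-op cache when MEMCACHED_URL is unset; this is an illustration of "run the app with no cache", not the actual node-hubble code:

var cache;

if (process.env.MEMCACHED_URL) {
  // Assumes the `memcached` npm client; the real wiring in node-hubble may differ.
  var Memcached = require('memcached');
  cache = new Memcached(process.env.MEMCACHED_URL);
} else {
  // No-op stand-in with the same get/set shape (assumed), so every lookup
  // behaves like a cache miss and the app runs without memcached.
  cache = {
    get: function (key, callback) { callback(null, undefined); },
    set: function (key, value, lifetime, callback) { callback(null, true); }
  };
}

module.exports = cache;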
In terms of timing, it has died at:

Thu Jan 16 20:07:07 UTC 2014
Thu Jan 16 20:00:07 UTC 2014
Wed Jan 15 18:45:35 UTC 2014
Tue Jan 14 21:55:33 UTC 2014

So there's no time-related pattern that I can see.
WIP fix

This adds timeout functionality to cache reads (defaulting to 3s). I've also added logging and updated modules.

I still have failing tests in the img/ route, but I'll deal with those separately (they are pre-existing, not due to my changes).
Attachment #8361845 - Flags: review?(jon)
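As a rough sketch of the idea in the pull request (the function name and shape here are assumptions, not the actual PR code), a cache read with a timeout might look like this:

// Wrap each cache read in a timer and treat a read that takes longer than the
// timeout (3s by default, per the comment above) as a cache miss, so a hung
// memcached connection can't stall the request.
function getWithTimeout(cache, key, timeoutMs, callback) {
  var finished = false;
  timeoutMs = timeoutMs || 3000;

  var timer = setTimeout(function () {
    if (finished) { return; }
    finished = true;
    callback(null, undefined); // hung read => behave like a miss
  }, timeoutMs);

  cache.get(key, function (err, value) {
    if (finished) { return; }
    finished = true;
    clearTimeout(timer);
    callback(err, value);
  });
}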
Comment on attachment 8361845 [details] [review]
https://github.com/mozilla/node-hubble/pull/17

Everything seems to be working as expected but I have some questions about the cache timeout function.
Attachment #8361845 - Flags: review?(jon) → review-
Comment on attachment 8361845 [details] [review]
https://github.com/mozilla/node-hubble/pull/17

Fixes in the PR.
Attachment #8361845 - Flags: review- → review?(jon)
Attachment #8361845 - Flags: review?(jon) → review+
We haven't seen this recur in a very, very long time.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
