Bug 959296 (Closed) - Node-hubble becomes unavailable
Opened 10 years ago · Closed 10 years ago
Product/Component: Webmaker Graveyard :: General (defect)
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: johns · Assigned: humph
Attachments: 2 files
In the last two weeks, we have seen node-hubble stop answering requests. On the nodes themselves, the app is running, and the TCP port is listening.

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:5666            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp6       0      0 :::22                   :::*                    LISTEN

ubuntu@i-f44eecd5:~$ ps -ef | grep node
root 12132 12130 0 Jan08 ? 00:00:00 sudo -u root NODE_ENV=production PORT=80 MEMCACHED_URL=node-hubble-cache.mofoprod.net:11211 /usr/bin/node /var/www/node-hubble/server.js
root 12133 12130 0 Jan08 ? 00:00:00 /usr/bin/logger -s -t app=node-hubble-production -p local3.info
root 12134 12132 0 Jan08 ? 00:00:48 /usr/bin/node /var/www/node-hubble/server.js

When tracing the process itself, it seemed to be failing accept calls:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  -nan    0.000000           0        13           read
  -nan    0.000000           0         6           write
  -nan    0.000000           0         8           close
  -nan    0.000000           0         1           getpeername
  -nan    0.000000           0        34           gettimeofday
  -nan    0.000000           0        22           futex
  -nan    0.000000           0        41           clock_gettime
  -nan    0.000000           0        20           epoll_wait
  -nan    0.000000           0        13           epoll_ctl
  -nan    0.000000           0        18         8 accept4
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                   176         8 total

Above is the summary; what follows is raw output from the same strace run.
17:15:29.161876 epoll_ctl(5, EPOLL_CTL_ADD, 14, {EPOLLIN, {u32=14, u64=14}}) = 0
17:15:29.162037 epoll_ctl(5, EPOLL_CTL_ADD, 15, {EPOLLIN, {u32=15, u64=15}}) = 0
17:15:29.162198 epoll_wait(5, {{EPOLLIN, {u32=11, u64=11}}}, 1024, 119997) = 1
17:15:34.308530 clock_gettime(CLOCK_MONOTONIC, {418338, 154076498}) = 0
17:15:34.308776 accept4(11, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = 16
17:15:34.309000 futex(0x7f7c5c0008c8, FUTEX_WAKE_PRIVATE, 1) = 1
17:15:34.309368 gettimeofday({1389633334, 309458}, NULL) = 0
17:15:34.309631 accept4(11, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
17:15:34.309832 clock_gettime(CLOCK_MONOTONIC, {418338, 155344179}) = 0
17:15:34.309998 epoll_ctl(5, EPOLL_CTL_ADD, 16, {EPOLLIN, {u32=16, u64=16}}) = 0
17:15:34.310185 epoll_wait(5, {{EPOLLIN, {u32=16, u64=16}}}, 1024, 114849) = 1
17:15:34.310343 clock_gettime(CLOCK_MONOTONIC, {418338, 155852633}) = 0
17:15:34.310514 read(16, "", 65536) = 0
17:15:34.310734 futex(0x7f7c5c0008c8, FUTEX_WAKE_PRIVATE, 1) = 1
17:15:34.310933 gettimeofday({1389633334, 311000}, NULL) = 0
17:15:34.311197 close(16) = 0
17:15:34.311436 clock_gettime(CLOCK_MONOTONIC, {418338, 156944226}) = 0
17:15:34.311590 epoll_wait(5, {{EPOLLIN, {u32=11, u64=11}}}, 1024, 114848) = 1
17:15:45.349378 clock_gettime(CLOCK_MONOTONIC, {418349, 194865152}) = 0
17:15:45.349474 accept4(11, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = 16
17:15:45.349709 futex(0x7f7c5c0008c8, FUTEX_WAKE_PRIVATE, 1) = 1
17:15:45.350062 gettimeofday({1389633345, 350146}, NULL) = 0
17:15:45.350352 accept4(11, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)

TCP stats were as follows:

Ip:
    1466808 total packets received
    2 with invalid addresses
    0 forwarded
    0 incoming packets discarded
    1466806 incoming packets delivered
    723525 requests sent out
Icmp:
    0 ICMP messages received
    0 input ICMP message failed.
    ICMP input histogram:
    0 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
Tcp:
    39170 active connections openings
    64034 passive connection openings
    46 failed connection attempts
    7545 connection resets received
    7 connections established
    1408016 segments received
    680547 segments send out
    235 segments retransmited
    0 bad segments received.
    9033 resets sent
Udp:
    58790 packets received
    0 packets to unknown port received.
    0 packet receive errors
    63830 packets sent
UdpLite:
TcpExt:
    1 invalid SYN cookies received
    13530 TCP sockets finished time wait in fast timer
    6463 delayed acks sent
    229 delayed acks further delayed because of locked socket
    Quick ack mode was activated 581 times
    91 packets directly queued to recvmsg prequeue.
    57283 bytes directly in process context from backlog
    20199 bytes directly received in process context from prequeue
    822534 packet headers predicted
    98 packets header predicted and directly queued to user
    230848 acknowledgments not containing data payload received
    59003 predicted acknowledgments
    3 times recovered from packet loss by selective acknowledgements
    9 congestion windows recovered without slow start after partial ack
    1 timeouts after SACK recovery
    3 fast retransmits
    65 other TCP timeouts
    584 DSACKs sent for old packets
    5 DSACKs received
    66 connections reset due to unexpected data
    7544 connections reset due to early user close
    41 connections aborted due to timeout
    TCPSackShifted: 2
    TCPSackMerged: 4
    TCPSackShiftFallback: 31
IpExt:
    InOctets: 1268140652
    OutOctets: 130484909
Reporter
Comment 1 • 10 years ago
We've put an alarm on the specific endpoint https://node-hubble.mofoprod.net/mime/https://google.ca. Additionally, I am polling several netstat commands every 60 seconds into /home/ubuntu/netstat.log so we can correlate what the network is doing when we see an alert.
Reporter
Comment 2 • 10 years ago
We just saw this happen again. The node I grabbed logs from was instance i-f44eecd5, which went out of service between 20:55 and 21:00. I went through the logs and haven't seen anything unusual yet, but I'm still combing through them.
Reporter
Comment 3 • 10 years ago

Comment 5 • 10 years ago
It looks like there are a number of TCP connections being tied up:

ubuntu@i-e44ae8c5:~$ sudo netstat -alpn
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      32641/node
tcp        0      0 10.178.40.208:80        10.245.73.118:43926     ESTABLISHED 32641/node
tcp        0      0 10.178.40.208:80        10.245.73.118:43918     ESTABLISHED 32641/node
tcp        0      0 10.178.40.208:37133     10.2.13.163:11211       ESTABLISHED 32641/node
tcp        0      0 10.178.40.208:80        10.245.73.118:43934     ESTABLISHED 32641/node
tcp        0      0 10.178.40.208:80        10.245.73.118:43931     ESTABLISHED 32641/node

Five is the default HTTP agent pool limit in Node; maybe we're hitting that?
Assignee
Comment 6 • 10 years ago
Jon points out on IRC that it's the /mime/ endpoint that seems to go down.
Assignee
Comment 7 • 10 years ago
I'll see whether this is code-related, and take the bug until we're sure it isn't.
Assignee: jon → david.humphrey
Status: NEW → ASSIGNED
Assignee
Comment 8 • 10 years ago
Other info: we're using memcached on these instances. If this is code-related, I suspect it's a faulty cache read that never calls back, but I need to test. We could run the app with no cache for now.
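Running cache-free can be done with a no-op stand-in that satisfies the same get/set callback interface as a memcached client. The interface here is an assumption for illustration, not node-hubble's actual cache module:

```javascript
// Sketch: a no-op cache that always misses, so every request falls
// through to the real work and memcached is ruled in or out.
// The get/set callback signatures are assumptions.
var noopCache = {
  get: function (key, callback) {
    // Always report a miss, asynchronously like a real client would.
    process.nextTick(function () { callback(null, null); });
  },
  set: function (key, value, ttl, callback) {
    if (callback) process.nextTick(function () { callback(null); });
  }
};
```

Swapping this in (e.g. when MEMCACHED_URL is unset) isolates whether the hangs come from the cache path.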
Comment 9 • 10 years ago
I've disabled the cache for now; let's see if that helps things.
Comment 10 • 10 years ago
In terms of timing, it's died at:

Thu Jan 16 20:07:07 UTC 2014
Thu Jan 16 20:00:07 UTC 2014
Wed Jan 15 18:45:35 UTC 2014
Tue Jan 14 21:55:33 UTC 2014

So there's no time-related pattern I can see.
Assignee
Comment 11 • 10 years ago
WIP fix

This adds timeout functionality to cache reads (defaults to 3s). I've also added logging and updated modules. I still have failing tests in the img/ route, but I'll deal with those separately (they are pre-existing, not due to my changes).
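The general shape of such a cache-read timeout is a guard that ensures the callback fires exactly once, treating a hung read as a miss. This is a sketch of the idea, not the actual code in the PR; the cache's `get` signature is assumed:

```javascript
// Sketch of a cache-read timeout guard: if the cache never calls
// back (e.g. a hung memcached connection), fall through to a cache
// miss after timeoutMs instead of blocking the request forever.
function getWithTimeout(cache, key, timeoutMs, callback) {
  var done = false;
  var timer = setTimeout(function () {
    if (done) return;
    done = true;
    callback(null, null); // treat a hung read as a miss
  }, timeoutMs);

  cache.get(key, function (err, value) {
    if (done) return; // the timeout already fired; ignore late reply
    done = true;
    clearTimeout(timer);
    callback(err, value);
  });
}
```

With a 3s default, a stalled memcached connection costs one request 3 seconds instead of wedging the whole server.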
Attachment #8361845 -
Flags: review?(jon)
Comment 12 • 10 years ago
Comment on attachment 8361845 [details] [review]
https://github.com/mozilla/node-hubble/pull/17

Everything seems to be working as expected, but I have some questions about the cache timeout function.
Attachment #8361845 -
Flags: review?(jon) → review-
Assignee
Comment 13 • 10 years ago
Comment on attachment 8361845 [details] [review]
https://github.com/mozilla/node-hubble/pull/17

Fixes in the PR.
Attachment #8361845 -
Flags: review- → review?(jon)
Updated • 10 years ago
Attachment #8361845 -
Flags: review?(jon) → review+
Reporter
Comment 14 • 10 years ago
We haven't seen this recur in a very, very long time.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED