MakeAPI server constantly crashing

RESOLVED FIXED

Status

Webmaker
MakeAPI
RESOLVED FIXED
5 years ago
4 years ago

People

(Reporter: mjschranz, Assigned: jbuck)

Attachments

(1 attachment)

(Reporter)

Description

5 years ago
Apparently Chris knows about this, but I basically can't save anything in Popcorn Maker.

The save request just hangs and never returns from the server. Dave seemed to think the MakeAPI was the cause.
Depends on: 917080
(Assignee)

Comment 1

5 years ago
Yes, things are definitely failing:

localhost:Downloads jon$ elbh makeapi-production
INSTANCE_ID  i-29deef49  InService     N/A                                                                                         N/A
INSTANCE_ID  i-2bdeef4b  InService     N/A                                                                                         N/A
INSTANCE_ID  i-320f4158  InService     N/A                                                                                         N/A
INSTANCE_ID  i-2ddf934f  InService     N/A                                                                                         N/A
INSTANCE_ID  i-701a1113  InService     N/A                                                                                         N/A
INSTANCE_ID  i-8c02cbf7  InService     N/A                                                                                         N/A
INSTANCE_ID  i-8e02cbf5  InService     N/A                                                                                         N/A
INSTANCE_ID  i-fa1f4391  InService     N/A                                                                                         N/A
INSTANCE_ID  i-69af0913  InService     N/A                                                                                         N/A
INSTANCE_ID  i-6faf0915  InService     N/A                                                                                         N/A
INSTANCE_ID  i-9f4752f4  OutOfService  Instance has failed at least the UnhealthyThreshold number of health checks consecutively.  Instance
INSTANCE_ID  i-f1859b99  InService     N/A                                                                                         N/A
INSTANCE_ID  i-599c9731  InService     N/A                                                                                         N/A
INSTANCE_ID  i-5f9c9737  InService     N/A                                                                                         N/A
INSTANCE_ID  i-b06378d1  OutOfService  Instance has failed at least the UnhealthyThreshold number of health checks consecutively.  Instance
INSTANCE_ID  i-b26378d3  OutOfService  Instance has failed at least the UnhealthyThreshold number of health checks consecutively.  Instance
INSTANCE_ID  i-107a8873  InService     N/A                                                                                         N/A
(Assignee)

Comment 2

5 years ago
I ran an instance in my terminal and got the following error after about a minute:

{"name":"makeapi","hostname":"i-b06378d1","pid":3556,"level":60,"err":{"message":"connect EADDRNOTAVAIL","name":"Error","stack":"Error: connect EADDRNOTAVAIL\n    at errnoException (net.js:884:11)\n    at connect (net.js:747:19)\n    at net.js:825:9\n    at asyncCallback (dns.js:68:16)\n    at Object.onanswer [as oncomplete] (dns.js:121:9)","code":"EADDRNOTAVAIL"},"msg":"connect EADDRNOTAVAIL","time":"2013-09-17T09:49:55.639Z","v":0}

/var/www/makeapi/lib/logger.js:54
  throw err;
        ^
Error: connect EADDRNOTAVAIL
    at errnoException (net.js:884:11)
    at connect (net.js:747:19)
    at net.js:825:9
    at asyncCallback (dns.js:68:16)
    at Object.onanswer [as oncomplete] (dns.js:121:9)

Curiously enough, the server stopped responding, but the process appeared to still be running?
(Assignee)

Comment 3

5 years ago
DATA

While running the makeapi in my terminal, I also ran this bash loop, which prints the number of open TCP/UDP sockets every five seconds:

ubuntu@i-b06378d1:~$ while (true); do date; netstat -an | egrep -c 'tcp|udp'; sleep 5; done;
Tue Sep 17 10:36:49 UTC 2013
1191
Tue Sep 17 10:36:54 UTC 2013
2781
Tue Sep 17 10:36:59 UTC 2013
4382
Tue Sep 17 10:37:04 UTC 2013
5795
Tue Sep 17 10:37:09 UTC 2013
7431
Tue Sep 17 10:37:15 UTC 2013
9024
Tue Sep 17 10:37:20 UTC 2013
10664
Tue Sep 17 10:37:25 UTC 2013
12026
Tue Sep 17 10:37:30 UTC 2013
13073
Tue Sep 17 10:37:36 UTC 2013
14633
Tue Sep 17 10:37:41 UTC 2013
15533
Tue Sep 17 10:37:46 UTC 2013
17217
Tue Sep 17 10:37:52 UTC 2013
18868
Tue Sep 17 10:37:57 UTC 2013
18181
Tue Sep 17 10:38:03 UTC 2013
19786
Tue Sep 17 10:38:08 UTC 2013
21247
Tue Sep 17 10:38:14 UTC 2013
20666
Tue Sep 17 10:38:19 UTC 2013
22377
Tue Sep 17 10:38:25 UTC 2013
22627
Tue Sep 17 10:38:30 UTC 2013
23357
Tue Sep 17 10:38:36 UTC 2013
25002
Tue Sep 17 10:38:41 UTC 2013
24385
Tue Sep 17 10:38:47 UTC 2013
25858
Tue Sep 17 10:38:53 UTC 2013
27476
Tue Sep 17 10:38:58 UTC 2013
27073
Tue Sep 17 10:39:04 UTC 2013
28260

As you can see, we're getting that EADDRNOTAVAIL error because we quickly exhaust the local ports available for outbound connections. Now, what's using up all these ports?
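
As a rough check on the samples above, the growth rate can be estimated from the first and last readings. This is only a sketch: the two entries in `samples` are copied from the log, and the port range of 32768 to 61000 is an assumption (a common Linux default for `net.ipv4.ip_local_port_range` at the time), not something verified on this instance.

```javascript
// First and last readings copied from the netstat loop above.
const samples = [
  { time: '2013-09-17T10:36:49Z', sockets: 1191 },
  { time: '2013-09-17T10:39:04Z', sockets: 28260 },
];

// Assumed ephemeral port range (common Linux default; not verified on this host).
const ASSUMED_PORT_RANGE = { low: 32768, high: 61000 };

// Average number of new sockets per second between two readings.
function growthRate(first, last) {
  const seconds = (Date.parse(last.time) - Date.parse(first.time)) / 1000;
  return (last.sockets - first.sockets) / seconds;
}

const rate = growthRate(samples[0], samples[1]);
const availablePorts = ASSUMED_PORT_RANGE.high - ASSUMED_PORT_RANGE.low;
const secondsToExhaust = availablePorts / rate;

console.log(`about ${rate.toFixed(0)} new sockets per second`);
console.log(`about ${secondsToExhaust.toFixed(0)} s to use ${availablePorts} ports at that rate`);
```

Under that assumption, the last reading of 28260 is already about the size of the whole range, which would be consistent with new connects starting to fail.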

ubuntu@i-b06378d1:~$ while (true); do date; netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -n; sleep 5; done;
Tue Sep 17 11:29:53 UTC 2013
      1 172.16.0.23
      1 50.31.164.149
      1 Address
      1 servers)
      4 212.184.128.155
      5 10.139.0.148
    921 10.152.173.219
Tue Sep 17 11:29:58 UTC 2013
      1 50.31.164.149
      1 Address
      1 servers)
      4 212.184.128.155
      5 10.139.0.148
   2518 10.152.173.219
Tue Sep 17 11:30:03 UTC 2013
      1 172.16.0.23
      1 Address
      1 servers)
      4 212.184.128.155
      5 10.139.0.148
   4096 10.152.173.219
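
The `Address` and `servers)` rows in the output above are not hosts. They are the fifth whitespace-separated field of netstat's two header lines, which the awk/cut pipeline passes through. A minimal sketch of the same tally that skips those headers is below; `countByRemoteHost` and the sample input (including its local addresses and states) are hypothetical and only illustrate the parsing.

```javascript
// Count connections per remote IP from `netstat -ntu` style output,
// skipping the header lines that the shell pipeline lets through.
function countByRemoteHost(netstatOutput) {
  const counts = new Map();
  for (const line of netstatOutput.split('\n')) {
    const fields = line.trim().split(/\s+/);
    // Data rows start with the protocol (tcp, tcp6, udp, udp6).
    if (fields.length < 5 || !/^(tcp|udp)/.test(fields[0])) continue;
    const foreign = fields[4]; // e.g. "10.152.173.219:9200"
    const host = foreign.slice(0, foreign.lastIndexOf(':'));
    counts.set(host, (counts.get(host) || 0) + 1);
  }
  return counts;
}

// Hypothetical sample in the netstat layout; these rows are made up for illustration.
const sample = [
  'Active Internet connections (w/o servers)',
  'Proto Recv-Q Send-Q Local Address           Foreign Address         State',
  'tcp        0      0 10.0.0.5:41000          10.152.173.219:9200     TIME_WAIT',
  'tcp        0      0 10.0.0.5:41001          10.152.173.219:9200     ESTABLISHED',
  'tcp        0      0 10.0.0.5:22             212.184.128.155:50211   ESTABLISHED',
].join('\n');

const counts = countByRemoteHost(sample);
console.log(counts);
```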

We're connecting to 10.152.173.219:9200, and port 9200 is the Elasticsearch port in our config:

# Host and port for your Elastic search cluster
ELASTIC_SEARCH_URL='elasticsearch://makeapi-es.mofoprod.net:9200'

I don't know how to fix this (yet), but we do have a root cause!

Comment 4

5 years ago
JP showed me almost the same stack trace last night, and I found references suggesting that you can easily overload a socket with async code if you don't pause to let it flush and write out some of what's in its buffer. I know Chris uses async-like code to bundle results.

http://stackoverflow.com/questions/17588237/error-connect-eaddrnotavail-while-processing-big-async-loop
(Assignee)

Comment 5

5 years ago
This behaviour, where we exhaust the number of local sockets, is this guy's fault:

https://github.com/mozilla/MakeAPI/blob/master/lib/models/make.js#L144

Comment that out, and you don't spin up thousands of connections. Another solution would be to update the client to use keep-alive logic...
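
For the keep-alive option, a minimal sketch using Node's built-in `http.Agent` might look like the following. It assumes a current Node.js (the `keepAlive` agent option did not exist in the Node releases of the time), and `esAgent`, `esRequestOptions`, `esRequest` and the `maxSockets` value are hypothetical names and numbers, not MakeAPI or mongoosastic code. The host and port are taken from ELASTIC_SEARCH_URL above.

```javascript
const http = require('http');

// A shared agent that reuses sockets instead of opening one per request.
// maxSockets caps concurrent connections per host; 10 is an arbitrary example value.
const esAgent = new http.Agent({ keepAlive: true, maxSockets: 10 });

// Build request options for the Elasticsearch host named in the config.
// Method and path are supplied by the caller.
function esRequestOptions(method, path) {
  return {
    host: 'makeapi-es.mofoprod.net',
    port: 9200,
    method,
    path,
    agent: esAgent,
    headers: { 'Content-Type': 'application/json' },
  };
}

// Send a JSON body and resolve with the response status code.
function esRequest(method, path, body) {
  return new Promise((resolve, reject) => {
    const req = http.request(esRequestOptions(method, path), (res) => {
      res.resume(); // drain the response so the socket can return to the pool
      res.on('end', () => resolve(res.statusCode));
    });
    // Connection failures such as EADDRNOTAVAIL are delivered here.
    req.on('error', reject);
    req.end(JSON.stringify(body));
  });
}
```

With a bounded, reused pool like this, a bulk sync would queue requests on at most `maxSockets` connections rather than opening a new local port for each document.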
Depends on: 879432
(Assignee)

Comment 6

5 years ago
Sent https://github.com/jamescarr/mongoosastic/pull/70 to get merged. I'll use my repo for our fix in the meantime.
(Assignee)

Comment 7

5 years ago
Created attachment 805957 [details] [review]
https://github.com/mozilla/MakeAPI/pull/146
Assignee: cade → jon
Attachment #805957 - Flags: review?(cade)

Comment 8

5 years ago
Commit pushed to master at https://github.com/mozilla/MakeAPI

https://github.com/mozilla/MakeAPI/commit/0521524c8c134eb9a2829d369b0d2698792f5ebd
Bug 917022 - Handle error events from MongoDB-ES sychronization
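
As background on why handling error events matters (this is a sketch of general Node behaviour, not a description of what the commit changes): an EventEmitter that emits 'error' with no listener attached throws, which can take the process down. `fakeSyncStream` below is a hypothetical stand-in, not mongoosastic's real stream.

```javascript
const { EventEmitter } = require('events');

// Stand-in for a stream returned by a sync call; not mongoosastic's real object.
function fakeSyncStream() {
  return new EventEmitter();
}

// Without an 'error' listener, emitting 'error' throws the error.
function emitUnhandled() {
  const stream = fakeSyncStream();
  try {
    stream.emit('error', new Error('connect EADDRNOTAVAIL'));
    return 'no throw';
  } catch (err) {
    return `threw: ${err.message}`;
  }
}

// With a listener attached, the same error is delivered to the handler instead.
function emitHandled() {
  const stream = fakeSyncStream();
  const seen = [];
  stream.on('error', (err) => seen.push(err.message));
  stream.emit('error', new Error('connect EADDRNOTAVAIL'));
  return seen;
}

console.log(emitUnhandled());
console.log(emitHandled());
```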

Comment 9

5 years ago
Comment on attachment 805957 [details] [review]
https://github.com/mozilla/MakeAPI/pull/146

There's a hilarious async bug with how synchronize emits events that we are going to ignore for now. We've confirmed that the sync does indeed take place. It's only noticeable on small collections that can close the stream before Elasticsearch can respond to update calls.

R+
Attachment #805957 - Flags: review?(cade) → review+
(Assignee)

Comment 10

5 years ago
Everything is looking a-okay now. Let's hope it stays that way...
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Attachment mime type: text/plain → text/x-github-pull-request
(Assignee)

Updated

4 years ago
Blocks: 943926