Closed Bug 993164 Opened 10 years ago Closed 10 years ago

APK Factory stage release/review has 502

Categories

(Cloud Services :: Operations: Marketplace, task, P1)

x86
macOS
task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ozten, Assigned: jason)

Details

I'm seeing 502s. There are several possible reasons for this:

1) nginx is causing an early timeout and closing the connection
2) build queue logic doesn't work properly
3) build queue stale lock detection is set at 20 minutes and is too long
4) APK builds take longer than nginx timeout settings

Tested about 30 APKs; some failed on release but not review, and vice versa.
I wanted to get an idea of how long it takes to build an APK (we record a statsd timer for it), but I can't figure out what's going on with the data.

https://graphite.shared.us-east-1.stage.mozaws.net/
stats.timers.apk-controller-review.apk-generate.dur.count - values are None, 1.0, 2.0
I would have expected integers like 323, 4339, etc.
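For context, the timer above is presumably just a timing() call around the build step; a rough sketch of what that instrumentation looks like (the client library and every function name here are assumptions, not the actual apk-factory code):

    // Sketch only: assumes a node-statsd style client; the metric name is taken
    // from the graphite path above, everything else is hypothetical.
    var StatsD = require('node-statsd').StatsD;
    var statsd = new StatsD('localhost', 8125);

    function generateApk(manifestUrl, cb) {
      var start = Date.now();
      buildApk(manifestUrl, function (err, apk) {   // buildApk is a placeholder
        // Sends one timing sample in milliseconds. statsd then publishes derived
        // series per flush interval: .count is the number of samples seen in that
        // interval (hence small values like 1.0 or 2.0), while the durations
        // themselves land in .mean, .upper, .lower, .sum, etc.
        statsd.timing('apk-controller-review.apk-generate.dur', Date.now() - start);
        cb(err, apk);
      });
    }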
I misspoke; one of the errors is

    HTTP/1.1 504 Gateway Timeout

This happens exactly at 60 seconds into the request.
The default setting for proxy_connect_timeout, proxy_read_timeout, and proxy_send_timeout is 60 seconds.

APK https://marketplace.firefox.com/app/c252f794-369a-4f72-b3bf-0228ad997384/manifest.webapp
Size of packaged app: 915K
Assignee: server-ops-amo → jthomas
Priority: -- → P1
I've bumped proxy_read_timeout to 300 seconds. The rest of the values seem okay. 

https://github.com/mozilla-services/puppet-config/commit/db85b2b201d6199acb61aa8e1b85e63a0b6cab14
I've also pushed these changes to stage.
Using marketplace's Fireplace API, I'm requesting thousands of APKs.

Stage release - I'm seeing mostly 502s, some 504s and some 200s.

Production release - I'm seeing mostly 200s.

We're going to need to dig into what is going on.

I don't think we can ship what is on stage.
I'm guessing this is nginx config around buffer sizes, timeouts and/or number of open connections.

I'd like to get ssh access to apk-controller.dev.mozaws.net

I'm studying the Ops puppet configs to see if I can set up nginx locally to reproduce this:
https://github.com/mozilla-services/puppet-config/blob/master/apk/modules/apk_factory/manifests/nginx.pp
but working on apk-controller.dev.mozaws.net will be faster and less error prone.
At first blush, I'm not seeing a ton of 502s running behind an nginx in front of the controller:

    server {
        listen 80;

        client_max_body_size 1m;

        location / {
            try_files $uri @controller;
        }

        location @controller {
            proxy_pass_header Server;
            proxy_set_header Host $http_host;
            proxy_redirect off;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Scheme $scheme;
            proxy_set_header X-FORWARDED-PROTOCOL "ssl";
            proxy_connect_timeout 60;
            proxy_read_timeout 300;
            proxy_pass http://localhost:8080;
        }
    }

Next up, I'll put the generator behind an nginx also and run a long burn in.
Another possibility is that supervisord or whatever isn't restarting servers in a timely manner.

When controller or generator are down, nginx will return a 502.

nginx error logs would be really helpful.
I got a configuration going that will start producing 502s after a lot of load is applied.

This was caused by EMFILE exceptions.

I tracked this down to the Sentry logger leaking file handles.
With Sentry we'd get up to 1024 open FDs.
Without Sentry we oscillate between 12 and 23 FDs.

Not sure if this is the root cause of this bug, but it is a good leak to find.
Filed https://github.com/mattrobenolt/raven-node/issues/75 upstream.
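For anyone reproducing this, one quick way to watch the descriptor count is something like the following (Linux only; this is an ad hoc helper, not part of the service):

    // Ad hoc Linux-only check: /proc/self/fd has one entry per open descriptor.
    // Polling it under load makes a leak like the raven-node one obvious
    // (a steady climb toward the 1024 ulimit instead of oscillating low).
    var fs = require('fs');

    function countOpenFds() {
      return fs.readdirSync('/proc/self/fd').length;
    }

    setInterval(function () {
      console.log('open fds:', countOpenFds());
    }, 5000);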
Controller nginx error logs are filled with 

    client closed connection while waiting for request, client: 10.65.15.171, server: 0.0.0.0:81

I don't believe this is the root cause, but it brings up an idea. We might want to set

    proxy_ignore_client_abort on;

so that long running apk builds have a chance to finish, even if the client aborts.
I restarted dev this morning. 5 minutes after the restart, it had 502s.

I hit it 4 hours later and it looked much better. I'm not sure what to make of that, as jason is working on a new dev stack.

I applied load and after a few minutes, it started giving 502s.

I'll hold off on trying the two patches that I have until jason is done and I have ssh access to dev.
I've got access to dev.

In /var/log/controller-supervisor.log

I see 
    TypeError: Cannot read property 'length' of undefined
    at getFromList (/opt/apk-factory-service/lib/build_queue.js:135:25)
    at /opt/apk-factory-service/lib/build_queue.js:87:22
    at Array.forEach (native)
    at /opt/apk-factory-service/lib/build_queue.js:85:28
    at Query._callback (/opt/apk-factory-service/lib/db.js:196:16)
    ...
    at TCP.onread (net.js:528:21)

I fixed this in https://github.com/mozilla/apk-factory-service/commit/7aa7bbfdae510b2f9ecee9bc6ab8a5fbd6205e40
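For illustration only (the real change is in that commit; the names below are hypothetical), the guard is roughly:

    // Hypothetical sketch of the defensive guard; the actual fix is in the
    // linked apk-factory-service commit.
    function getFromList(list, manifestUrl) {
      // Previously list.length was dereferenced unconditionally, which throws
      // "Cannot read property 'length' of undefined" when no queue entries
      // exist for this manifest.
      if (!list || list.length === 0) {
        return null;
      }
      for (var i = 0; i < list.length; i++) {
        if (list[i].manifestUrl === manifestUrl) {
          return list[i];
        }
      }
      return null;
    }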

I also see very frequent mysql errors:

    {"level":"error","message":"Error: ER_CON_COUNT_ERROR: Too many connections

Connecting to the db and issuing

    show full processlist;

The number of results oscillates between 1 and 50+ sleeping processes.

I'm going to add mysql db pooling to reduce total number of connections.
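Something along these lines, using the node mysql driver's built-in pool (all connection parameters and the query below are placeholders, not the real config):

    // Sketch of connection pooling with the node "mysql" driver; credentials,
    // limits and table names are placeholders.
    var mysql = require('mysql');

    var pool = mysql.createPool({
      host: 'db.example.internal',
      user: 'apk_factory',
      password: 'secret',
      database: 'apk_factory',
      connectionLimit: 10   // caps concurrent connections instead of one per request
    });

    // pool.query() checks a connection out, runs the query and releases it,
    // so idle Sleep connections stop piling up in "show full processlist".
    pool.query('SELECT id, version FROM apk_metadata WHERE manifest_url = ?',
      ['http://example.com/manifest.webapp'],
      function (err, rows) {
        if (err) return console.error(err);
        console.log(rows);
      });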
I've switched from managing connections to using the connection pool in 19c3f7774564d2cbb4d77664b04c00c690e2d4ac.

I've run hundreds of APKs through dev and getting mostly 200s.

Cutting a new release (release-2014-04-02-02) and pushing to stage (331a37954f46abff7e).

I'm having some trouble with stage Dreadnot.
Dreadnot stage should be back up.
Stage deployment will take about 15-20 minutes because we recycle the ec2 instances.
Thanks Jason.

Still seeing 502s on stage.

Re-deployed in case it was a recent deployment issue. No dice.

We can't see supervisord's log, which would show why the servers are falling over.

As a short term workaround, I'll add an unhandled exception handler that logs to our standard logging system.
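Roughly this (console.error is a stand-in for our standard logger):

    // Last-resort handler so the failure reason ends up in our own logs
    // instead of only in supervisord's log, which we can't see on stage.
    process.on('uncaughtException', function (err) {
      console.error('uncaught exception, shutting down:', err.stack || err);
      // Exit so supervisord restarts the process in a known-good state;
      // continuing after an uncaught exception risks undefined behavior.
      process.exit(1);
    });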
Our stage wildcard cert was updated in AWS yesterday and it made the ELB sad. I fixed this up and it should be working now.
Thanks!

stage release - Consistent 200s under good load

stage review - Many errors (different from the original bug report):

50% - [Error: socket hang up] code: 'ECONNRESET'
40% - 504
10% - 200

Can we expose the review instance's /var/log/controller-supervisor.log and /var/log/generator-supervisor.log somehow? (Sentry, Kibana or Heka)
On the review instances all I see is 'listening on' in these logs.
jason - thanks for all your help!

review instance is in much better shape.

Closing this bug. I've filed bug 994767 and bug 994769 for new issues I'm seeing in the logs.

Overall, we're in much better shape. release is looking good and review has errors but lots of 200s.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
I'm still seeing this (or something else with similar symptoms).  Here are two attempts to download the APK mentioned in bug 988644, comment 10:

04-09 16:38 > curl https://apk-controller-review.stage.mozaws.net/application.apk?manifestUrl=http://chat.messageme.com/manifest.webapp -o messageme.apk
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:01:00 --:--:--     0
04-10 12:50 > curl https://apk-controller-review.stage.mozaws.net/application.apk?manifestUrl=http://chat.messageme.com/manifest.webapp -o messageme.apk
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100     2  100     2    0     0      1      0  0:00:02  0:00:01  0:00:01     1
04-10 12:52 > cat messageme.apk 
{}04-10 12:52 > 

The first attempt failed completely after exactly one minute. The second quickly returned a two-byte file with the contents "{}".
Update: ozten suggested I might have seen the issues in comment 23 because the server was in the middle of deployment, so I tried again and was unable to reproduce them.
Component: Server Operations: AMO Operations → Operations: Marketplace
Product: mozilla.org → Mozilla Services