Closed
Bug 993164
Opened 10 years ago
Closed 10 years ago
APK Factory stage release/review has 502
Categories
(Cloud Services :: Operations: Marketplace, task, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: ozten, Assigned: jason)
Details
I'm seeing 502s. There are several possible reasons for this:
1) nginx is causing an early timeout and closing the connection
2) build queue logic doesn't work properly
3) build queue stale lock detection is set at 20 minutes and is too long
4) APK builds take longer than nginx timeout settings
Tested about 30 APKs; some failed on release but not review, and vice versa.
Reporter
Comment 1•10 years ago
I wanted to get an idea of how long it takes to build an APK, which we record with a statsd timer, but I can't figure out what's going on with the data. https://graphite.shared.us-east-1.stage.mozaws.net/ stats.timers.apk-controller-review.apk-generate.dur.count shows values of None, 1.0, and 2.0. I would have expected integers like 323, 4339, etc.
Reporter
Comment 2•10 years ago
I misspoke; one of the errors is HTTP/1.1 504 Gateway Timeout. This happens exactly 60 seconds into the request. The default setting for nginx's proxy_connect_timeout, proxy_read_timeout, and proxy_send_timeout is 60 seconds. APK: https://marketplace.firefox.com/app/c252f794-369a-4f72-b3bf-0228ad997384/manifest.webapp Size of packaged app: 915K
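For reference, a hypothetical nginx fragment (directive names are from the nginx proxy module; the values are illustrative, not our deployed config) that raises the read timeout past the default:

```nginx
location @controller {
    # All three default to 60s; a slow APK build exceeds that and the
    # proxy returns 504 Gateway Timeout to the client.
    proxy_connect_timeout 60s;
    proxy_send_timeout    60s;
    proxy_read_timeout    300s;  # give the generator time to finish a build
    proxy_pass http://localhost:8080;
}
```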
Assignee
Updated•10 years ago
Assignee: server-ops-amo → jthomas
Priority: -- → P1
Assignee
Comment 3•10 years ago
I've bumped proxy_read_timeout to 300 seconds. The rest of the values seem okay. https://github.com/mozilla-services/puppet-config/commit/db85b2b201d6199acb61aa8e1b85e63a0b6cab14
Assignee
Comment 4•10 years ago
I've also pushed these changes to stage.
Reporter
Comment 5•10 years ago
Using Marketplace's Fireplace API, I'm requesting thousands of APKs.
Stage release: mostly 502s, some 504s, and some 200s.
Production release: mostly 200s.
We're going to need to dig into what is going on. I don't think we can ship what is on stage.
Reporter
Comment 6•10 years ago
I'm guessing this is nginx config around buffer sizes, timeouts, and/or the number of open connections. I'd like to get ssh access to apk-controller.dev.mozaws.net. I'm studying the Ops puppet configs to see if I can set up nginx locally to reproduce: https://github.com/mozilla-services/puppet-config/blob/master/apk/modules/apk_factory/manifests/nginx.pp but working on apk-controller.dev.mozaws.net would be faster and less error prone.
Reporter
Comment 7•10 years ago
At first blush, I'm not seeing a ton of 502s running behind an nginx in front of the controller:

server {
    listen 80;
    client_max_body_size 1m;

    location / {
        try_files $uri @controller;
    }

    location @controller {
        proxy_pass_header Server;
        proxy_set_header Host $http_host;
        proxy_redirect off;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Scheme $scheme;
        proxy_set_header X-FORWARDED-PROTOCOL "ssl";
        proxy_connect_timeout 60;
        proxy_read_timeout 300;
        proxy_pass http://localhost:8080;
    }
}

Next up, I'll put the generator behind an nginx also and run a long burn-in.
Reporter
Comment 8•10 years ago
Another possibility is that supervisord (or whatever manages the processes) isn't restarting crashed servers in a timely manner. When the controller or generator is down, nginx will return a 502. The nginx error logs would be really helpful here.
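If that's the case, a supervisord stanza along these lines would keep the gap short. This is a sketch; the program name and command path are assumptions, not our deployed config:

```ini
[program:controller]
command=node /opt/apk-factory-service/bin/controller
autorestart=true        ; restart on any exit, expected or not
startsecs=1             ; consider the process "up" after 1 second
startretries=10         ; keep trying rather than giving up and leaving a 502 gap
stderr_logfile=/var/log/controller-supervisor.log
```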
Reporter
Comment 9•10 years ago
I got a configuration going that starts producing 502s after a lot of load is applied. This was caused by EMFILE exceptions, which I tracked down to the Sentry logger leaking file handles. With Sentry we'd get up to 1024 FDs; without Sentry we oscillate between 12 and 23 FDs. Not sure if this is the root cause of this bug, but it's a good leak to find. Filed https://github.com/mattrobenolt/raven-node/issues/75 upstream.
Reporter
Comment 10•10 years ago
Controller nginx error logs are filled with: client closed connection while waiting for request, client: 10.65.15.171, server: 0.0.0.0:81. I don't believe this is the root cause, but it brings up an idea: we might want to set proxy_ignore_client_abort on; so that long-running APK builds have a chance to finish even if the client aborts.
Reporter
Comment 11•10 years ago
I restarted dev this morning. Five minutes after the restart, it was returning 502s. I hit it 4 hours later and it looked much better. I'm not sure why, as jason is working on a new dev stack. I applied load and after a few minutes it started giving 502s again. I'll hold off on trying the two patches I have until jason is done and I have ssh access to dev.
Reporter
Comment 12•10 years ago
I've got access to dev. In /var/log/controller-supervisor.log I see:

TypeError: Cannot read property 'length' of undefined
    at getFromList (/opt/apk-factory-service/lib/build_queue.js:135:25)
    at /opt/apk-factory-service/lib/build_queue.js:87:22
    at Array.forEach (native)
    at /opt/apk-factory-service/lib/build_queue.js:85:28
    at Query._callback (/opt/apk-factory-service/lib/db.js:196:16)
    ...
    at TCP.onread (net.js:528:21)

I fixed this in https://github.com/mozilla/apk-factory-service/commit/7aa7bbfdae510b2f9ecee9bc6ab8a5fbd6205e40

I also see very frequent mysql errors:

{"level":"error","message":"Error: ER_CON_COUNT_ERROR: Too many connections

Connecting to the db and issuing show full processlist; the results go between 1 and 50+ SLEEPING processes. I'm going to add mysql db pooling to reduce the total number of connections.
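The pooling idea, as a minimal hand-rolled sketch (hypothetical names; the actual change uses the mysql driver's built-in connection pool rather than anything like this): cap the number of live connections and hand back idle ones instead of opening a fresh connection per request.

```javascript
// Minimal connection-pool sketch. createConn is any factory function;
// maxSize caps concurrent connections, which is what stops the
// ER_CON_COUNT_ERROR "Too many connections" pile-up.
function Pool(createConn, maxSize) {
  this.createConn = createConn;
  this.maxSize = maxSize;
  this.idle = [];   // connections waiting to be reused
  this.total = 0;   // connections created so far
}

Pool.prototype.acquire = function () {
  if (this.idle.length > 0) return this.idle.pop(); // reuse before creating
  if (this.total < this.maxSize) {
    this.total += 1;
    return this.createConn();
  }
  throw new Error('pool exhausted'); // real pools queue the caller instead
};

Pool.prototype.release = function (conn) {
  this.idle.push(conn); // return the connection for the next caller
};
```

With a pool, the SLEEPING-process count observed in show full processlist stays bounded at maxSize instead of growing with request volume.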
Reporter
Comment 13•10 years ago
I've switched from managing connections by hand to using the connection pool in 19c3f7774564d2cbb4d77664b04c00c690e2d4ac. I've run hundreds of APKs through dev and am getting mostly 200s. Cutting a new release (release-2014-04-02-02) and pushing to stage (331a37954f46abff7e). I'm having some trouble with stage Dreadnot.
Assignee
Comment 14•10 years ago
Dreadnot stage should be back up.
Reporter
Comment 15•10 years ago
I think this deployment is stuck https://dreadnot.apk.us-east-1.stage.mozaws.net/stacks/apk-factory-service/regions/us-east-1/deployments/1
Assignee
Comment 16•10 years ago
Stage deployment will take about 15-20 minutes because we recycle the ec2 instances.
Reporter
Comment 17•10 years ago
Thanks Jason. Still seeing 502s on stage. I re-deployed in case it was a recent deployment issue; no dice. We can't see supervisord's log, which has the reason the servers fall over. As a short-term workaround, I'll add an unhandled exception handler that logs to our standard logging system.
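A sketch of what such a handler could look like (the log shape mirrors the JSON lines seen in comment 12; the exact logger call in the service may differ):

```javascript
// Format a fatal error as a JSON log line matching the service's
// {"level": ..., "message": ...} log shape.
function formatFatal(err) {
  return JSON.stringify({ level: 'error', message: err.message, stack: err.stack });
}

// Last-resort handler: record the crash in the standard logs, then exit
// so the process supervisor restarts us in a clean state.
process.on('uncaughtException', function (err) {
  console.error(formatFatal(err));
  process.exit(1);
});
```

Exiting after logging matters: continuing after an uncaught exception leaves the process in an undefined state, and the supervisor's restart is the recovery path.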
Reporter
Comment 18•10 years ago
I deployed to stage: https://dreadnot.apk.us-east-1.stage.mozaws.net/stacks/apk-factory-service/regions/us-east-1/deployments/3
release isn't accepting traffic (https://apk-controller.stage.mozaws.net)
review is building APKs (https://apk-controller-review.stage.mozaws.net)
Assignee
Comment 19•10 years ago
Our stage wildcard cert was updated in AWS yesterday, and it made the ELB sad. I fixed this up and it should be working now.
Reporter
Comment 20•10 years ago
Thanks!
stage release: consistent 200s under good load.
stage review: many errors (different from this original bug report):
  50% [Error: socket hang up] code: 'ECONNRESET'
  40% 504
  10% 200
Can we expose the reviewer's /var/log/controller-supervisor.log and /var/log/generator-supervisor.log somehow? (Sentry, Kibana, or Heka)
Assignee
Comment 21•10 years ago
On the review instances, all I see in those logs is 'listening on'.
Reporter
Comment 22•10 years ago
jason, thanks for all your help! The review instance is in much better shape, so I'm closing this bug. I've filed bug#994767 and bug#994769 for the new issues I'm seeing in the logs. Overall, release is looking good, and review still has errors but lots of 200s.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 23•10 years ago
I'm still seeing this (or something else with similar symptoms). Here are two attempts to download the APK mentioned in bug 988644, comment 10:

04-09 16:38 > curl https://apk-controller-review.stage.mozaws.net/application.apk?manifestUrl=http://chat.messageme.com/manifest.webapp -o messageme.apk
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:01:00 --:--:--     0

04-10 12:50 > curl https://apk-controller-review.stage.mozaws.net/application.apk?manifestUrl=http://chat.messageme.com/manifest.webapp -o messageme.apk
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100     2  100     2    0     0      1      0  0:00:02  0:00:01  0:00:01     1

04-10 12:52 > cat messageme.apk
{}

The first attempt completely failed after exactly one minute. The second quickly returned a two-byte file with the contents "{}".
Comment 24•10 years ago
Update: ozten suggested I might have seen the issues in comment 23 because the server was in the middle of a deployment, so I tried again and was unable to reproduce them.
Updated•10 years ago
Component: Server Operations: AMO Operations → Operations: Marketplace
Product: mozilla.org → Mozilla Services