Closed Bug 993164 Opened 10 years ago Closed 10 years ago

APK Factory stage release/review has 502

Categories

(Cloud Services :: Operations: Marketplace, task, P1)

x86
macOS
task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ozten, Assigned: jason)

Details

I'm seeing 502s. There are several possible reasons for this:

1) nginx is causing an early timeout and closing the connection
2) build queue logic doesn't work properly
3) build queue stale lock detection is set at 20 minutes and is too long
4) APK builds take longer than nginx timeout settings

Tested about 30 APKs; some failed on release but not review, and vice versa.
I wanted to get an idea of how long it takes to build an APK (we record a statsd timer for it), but I can't figure out what's going on with the data.

https://graphite.shared.us-east-1.stage.mozaws.net/
stats.timers.apk-controller-review.apk-generate.dur.count - values are None, 1.0, 2.0
I would have expected integers like 323, 4339, etc.
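For context, the timer above is presumably just a timing() call around the build step; a rough sketch of what that instrumentation looks like (the client library and every function name here are assumptions, not the actual apk-factory code):

    // Sketch only: assumes a node-statsd style client; the metric name is taken
    // from the graphite path above, everything else is hypothetical.
    var StatsD = require('node-statsd').StatsD;
    var statsd = new StatsD('localhost', 8125);

    function generateApk(manifestUrl, cb) {
      var start = Date.now();
      buildApk(manifestUrl, function (err, apk) {   // buildApk is a placeholder
        // Sends one timing sample in milliseconds. statsd then publishes derived
        // series per flush interval: .count is the number of samples seen in that
        // interval (hence small values like 1.0 or 2.0), while the durations
        // themselves land in .mean, .upper, .lower, .sum, etc.
        statsd.timing('apk-controller-review.apk-generate.dur', Date.now() - start);
        cb(err, apk);
      });
    }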
I misspoke; one of the errors is

    HTTP/1.1 504 Gateway Timeout

This happens exactly at 60 seconds into the request.
The default setting for proxy_connect_timeout, proxy_read_timeout, and proxy_send_timeout is 60 seconds.

APK https://marketplace.firefox.com/app/c252f794-369a-4f72-b3bf-0228ad997384/manifest.webapp
Size of packaged app: 915K
Assignee: server-ops-amo → jthomas
Priority: -- → P1
I've bumped proxy_read_timeout to 300 seconds. The rest of the values seem okay. 

https://github.com/mozilla-services/puppet-config/commit/db85b2b201d6199acb61aa8e1b85e63a0b6cab14
I've also pushed these changes to stage.
Using marketplace's Fireplace API, I'm requesting thousands of APKs.

Stage release - I'm seeing mostly 502s, some 504s and some 200s.

Production release - I'm seeing mostly 200s.

We're going to need to dig into what is going on.

I don't think we can ship what is on stage.
I'm guessing this is nginx config around buffer sizes, timeouts and/or number of open connections.

I'd like to get ssh access to apk-controller.dev.mozaws.net

I'm studying the Ops puppet configs to see if I can set up nginx locally to reproduce this:
https://github.com/mozilla-services/puppet-config/blob/master/apk/modules/apk_factory/manifests/nginx.pp
but working on apk-controller.dev.mozaws.net will be faster and less error prone.
At first blush, I'm not seeing a ton of 502s running behind an nginx in front of the controller:

    server {
        listen 80;

        client_max_body_size 1m;

        location / {
            try_files $uri @controller;
        }

        location @controller {
            proxy_pass_header Server;
            proxy_set_header Host $http_host;
            proxy_redirect off;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Scheme $scheme;
            proxy_set_header X-FORWARDED-PROTOCOL "ssl";
            proxy_connect_timeout 60;
            proxy_read_timeout 300;
            proxy_pass http://localhost:8080;
        }
    }

Next up, I'll put the generator behind an nginx also and run a long burn in.
Another possibility is that supervisord or whatever isn't restarting servers in a timely manner.

When controller or generator are down, nginx will return a 502.

nginx error logs would be really helpful.
I got a configuration going that will start producing 502s after a lot of load is applied.

This was caused by EMFILE exceptions.

I tracked this down to the Sentry logger leaking file handles.
With Sentry we'd get up to 1024 open FDs.
Without Sentry we oscillate between 12 and 23 FDs.

Not sure if this is the root cause of this bug, but it is a good leak to find.
Filed https://github.com/mattrobenolt/raven-node/issues/75 upstream.
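For anyone reproducing this, one quick way to watch the descriptor count is something like the following (Linux only; this is an ad hoc helper, not part of the service):

    // Ad hoc Linux-only check: /proc/self/fd has one entry per open descriptor.
    // Polling it under load makes a leak like the raven-node one obvious
    // (a steady climb toward the 1024 ulimit instead of oscillating low).
    var fs = require('fs');

    function countOpenFds() {
      return fs.readdirSync('/proc/self/fd').length;
    }

    setInterval(function () {
      console.log('open fds:', countOpenFds());
    }, 5000);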
Controller nginx error logs are filled with 

    client closed connection while waiting for request, client: 10.65.15.171, server: 0.0.0.0:81

I don't believe this is the root cause, but it brings up an idea. We might want to set

    proxy_ignore_client_abort on;

so that long running apk builds have a chance to finish, even if the client aborts.
I restarted dev this morning. 5 minutes after the restart, it had 502s.

I hit it 4 hours later and it looked much better. I'm not sure what to make of that, as jason is working on a new dev stack.

I applied load and after a few minutes, it started giving 502s.

I'll hold off on trying the two patches that I have until jason is done and I have ssh access to dev.
I've got access to dev.

In /var/log/controller-supervisor.log

I see 
    TypeError: Cannot read property 'length' of undefined
    at getFromList (/opt/apk-factory-service/lib/build_queue.js:135:25)
    at /opt/apk-factory-service/lib/build_queue.js:87:22
    at Array.forEach (native)
    at /opt/apk-factory-service/lib/build_queue.js:85:28
    at Query._callback (/opt/apk-factory-service/lib/db.js:196:16)
    ...
    at TCP.onread (net.js:528:21)

I fixed this in https://github.com/mozilla/apk-factory-service/commit/7aa7bbfdae510b2f9ecee9bc6ab8a5fbd6205e40
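For illustration only (the real change is in that commit; the names below are hypothetical), the guard is roughly:

    // Hypothetical sketch of the defensive guard; the actual fix is in the
    // linked apk-factory-service commit.
    function getFromList(list, manifestUrl) {
      // Previously list.length was dereferenced unconditionally, which throws
      // "Cannot read property 'length' of undefined" when no queue entries
      // exist for this manifest.
      if (!list || list.length === 0) {
        return null;
      }
      for (var i = 0; i < list.length; i++) {
        if (list[i].manifestUrl === manifestUrl) {
          return list[i];
        }
      }
      return null;
    }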

I also see very frequent mysql errors:

    {"level":"error","message":"Error: ER_CON_COUNT_ERROR: Too many connections

Connecting to the db and issuing

    show full processlist;

The number of results oscillates between 1 and 50+ sleeping processes.

I'm going to add mysql db pooling to reduce total number of connections.
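Something along these lines, using the node mysql driver's built-in pool (all connection parameters and the query below are placeholders, not the real config):

    // Sketch of connection pooling with the node "mysql" driver; credentials,
    // limits and table names are placeholders.
    var mysql = require('mysql');

    var pool = mysql.createPool({
      host: 'db.example.internal',
      user: 'apk_factory',
      password: 'secret',
      database: 'apk_factory',
      connectionLimit: 10   // caps concurrent connections instead of one per request
    });

    // pool.query() checks a connection out, runs the query and releases it,
    // so idle Sleep connections stop piling up in "show full processlist".
    pool.query('SELECT id, version FROM apk_metadata WHERE manifest_url = ?',
      ['http://example.com/manifest.webapp'],
      function (err, rows) {
        if (err) return console.error(err);
        console.log(rows);
      });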
I've switched from managing connections to using the connection pool in 19c3f7774564d2cbb4d77664b04c00c690e2d4ac.

I've run hundreds of APKs through dev and getting mostly 200s.

Cutting a new release (release-2014-04-02-02) and pushing to stage (331a37954f46abff7e).

I'm having some trouble with stage Dreadnot.
Dreadnot stage should be back up.
Stage deployment will take about 15-20 minutes because we recycle the ec2 instances.
Thanks Jason.

Still seeing 502s on stage.

Re-deployed in case it was a recent deployment issue. No dice.

We can't see supervisord's log, which would show why the servers are falling over.

As a short term workaround, I'll add an unhandled exception handler that logs to our standard logging system.
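Roughly this (console.error is a stand-in for our standard logger):

    // Last-resort handler so the failure reason ends up in our own logs
    // instead of only in supervisord's log, which we can't see on stage.
    process.on('uncaughtException', function (err) {
      console.error('uncaught exception, shutting down:', err.stack || err);
      // Exit so supervisord restarts the process in a known-good state;
      // continuing after an uncaught exception risks undefined behavior.
      process.exit(1);
    });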
Our stage wildcard cert was updated in AWS yesterday and it made the ELB sad. I fixed this up and it should be working now.
Thanks!

stage release - Consistent 200s under good load

stage review - Many errors (different from the original bug report):

50% - [Error: socket hang up] code: 'ECONNRESET'
40% - 504
10% - 200

Can we expose the review instance's /var/log/controller-supervisor.log and /var/log/generator-supervisor.log somehow? (Sentry, Kibana or Heka)
On the review instances all I see is 'listening on' in these logs.
jason - thanks for all your help!

review instance is in much better shape.

Closing this bug. I've filed bug 994767 and bug 994769 for new issues I'm seeing in the logs.

Overall, we're in much better shape. release is looking good and review has errors but lots of 200s.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
I'm still seeing this (or something else with similar symptoms).  Here are two attempts to download the APK mentioned in bug 988644, comment 10:

04-09 16:38 > curl https://apk-controller-review.stage.mozaws.net/application.apk?manifestUrl=http://chat.messageme.com/manifest.webapp -o messageme.apk
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:01:00 --:--:--     0
04-10 12:50 > curl https://apk-controller-review.stage.mozaws.net/application.apk?manifestUrl=http://chat.messageme.com/manifest.webapp -o messageme.apk
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100     2  100     2    0     0      1      0  0:00:02  0:00:01  0:00:01     1
04-10 12:52 > cat messageme.apk 
{}04-10 12:52 > 

The first attempt failed completely after exactly one minute. The second quickly returned a two-byte file with the contents "{}".
Update: ozten suggested I might have seen the issues in comment 23 because the server was in the middle of deployment, so I tried again and was unable to reproduce them.
Component: Server Operations: AMO Operations → Operations: Marketplace
Product: mozilla.org → Mozilla Services