Closed Bug 979149 Opened 11 years ago Closed 10 years ago

Production paas gives intermittent HTTP 500 errors during 'stackato push'

Categories

(Infrastructure & Operations :: IT-Managed Tools, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: jgmize, Unassigned)

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/137] [blocks-nucleus][unplanned - troubleshooting])

Example output:

$ stackato push
Would you like to deploy from the current directory ? [Yn]:
Using manifest file "stackato.yml"
Updating application 'nucleus'...
stackato.dea: Stopping application 'nucleus' on DEA 1b1588
stackato.dea: Stopping application 'nucleus' on DEA 1b1588
Comparing application [nucleus] to [https://api.paas.mozilla.org] ...
Framework: python
Runtime: <framework-specific default>
Application Url: nucleus.mozilla.org
Application Url: nucleus.paas.mozilla.org
Enter Memory Reservation [256M]:
Enter Disk Reservation [2048]:
Preserving Environment Variable [ADMINS]
Preserving Environment Variable [DJANGO_HMAC_KEY]
Preserving Environment Variable [DJANGO_SECRET_KEY]
Preserving Environment Variable [DJANGO_SERVE_STATIC]
Preserving Environment Variable [DJANGO_SETTINGS_MODULE]
Preserving Environment Variable [EMAIL_HOST]
Preserving Environment Variable [SERVER_EMAIL]
Updating environment ... No changes
Uploading Application [nucleus] ...
Checking for bad links ... 11609 OK
Copying to temp space ... 11608 OK
Checking for available resources ... OK
Processing resources ... OK
Packing application ... OK
Uploading (574K) ... 100% OK
Error (HTTP 500): <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
<title>Service Unavailable</title>
<style type="text/css">
body, p, h1 { font-family: Verdana, Arial, Helvetica, sans-serif; }
h2 { font-family: Arial, Helvetica, sans-serif; color: #b10b29; }
</style>
</head>
<body>
<h2>Service Unavailable</h2>
<p>The service is temporarily unavailable. Please try again later.</p>
</body>
</html>

Usually I just run the command repeatedly (tonight it took 3 tries), and it eventually works, but then you have to manually run "stackato start" because the application is in a stopped state after the first push that gets a 500 error.
This example is with nucleus, but I have seen this happen with other apps as well.
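The manual workaround described above (re-run the push until it succeeds, then start the app because a failed attempt leaves it stopped) could be scripted roughly as follows. This is a minimal sketch of the workaround, not a fix for the underlying error. It assumes the client exits non-zero when it prints "Error (HTTP 500)", which nothing in this bug confirms, and it assumes `-n` suppresses the interactive prompts. The function names, attempt limit, and delay are illustrative only.

```python
import subprocess
import time


def run_command(argv):
    """Run a command and return its exit code."""
    return subprocess.run(argv).returncode


def push_with_retry(run=run_command, max_attempts=5, delay_seconds=30,
                    sleep=time.sleep):
    """Retry the push until it succeeds, then start the app if needed.

    Returns the number of push attempts used, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        # ASSUMPTION: a failed push yields a non-zero exit code.
        if run(["stackato", "push", "-n"]) == 0:
            # A failed earlier attempt leaves the app stopped, so start it
            # explicitly only when this was not the first attempt.
            if attempt > 1:
                run(["stackato", "start"])
            return attempt
        if attempt < max_attempts:
            sleep(delay_seconds)
    return None
```

The command runner and the sleep function are passed in as parameters so the retry logic can be exercised without touching a real cluster.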
Whiteboard: [blocks-nucleus]
i spent quite a bit of time this morning trying to replicate this without success. are you seeing this every time you deploy nucleus? and does this happen with other apps you push? i have a test app on a deploy loop, which has now run hundreds of times without returning any errors.
Flags: needinfo?(jmize)
No, this has only happened intermittently. Based on the timing and your comment on https://bugzilla.mozilla.org/show_bug.cgi?id=973677#c5 I believe this may be another failure that can occur during puppet runs.
Flags: needinfo?(jmize)
Unless, of course, puppet runs were disabled before last night/early this morning, in which case this was another failure mode entirely.
puppet was disabled on the dea/stager nodes, but hasn't been on the core node, which is where the router lives. let's hold on this until we've got the puppet changes in place and see if we can reproduce after that. again, i've had no luck all morning with a for loop running pushes multiple times a minute.
:cturra does your test app that you mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=979149#c1 use a media service? During my troubleshooting I realized I could get by without it for nucleus, and I switched the caching to LocMem as well, so I'm only using the DB service for nucleus now, which seems to have helped as far as running into fewer issues.
to make my test app closer to yours, i added a filesystem service to it last night called 'media'. it's a very basic php app that has a couple larger files so there is data to transfer on push.

$ s delete s-cturra-com
Provisioned service [media] detected would you like to delete it ?: [yN]: y
Deleting application [s-cturra-com] ... OK
Deleting service [media] ... OK

$ s push -n
Using manifest file "stackato.yml"
Framework: php
Runtime: <framework-specific default>
Application Url: s-cturra-com.paas.mozilla.org
Creating Application [s-cturra-com] in [https://api.paas.mozilla.org] ... OK
Creating new service [media] ... OK
Binding Service media to s-cturra-com ... OK
Uploading Application [s-cturra-com] ...
Checking for bad links ... 8 OK
Copying to temp space ... 7 OK
Checking for available resources ... OK
Processing resources ... OK
Packing application ... OK
Uploading (359) ... 100% OK
Push Status: OK
stackato.stager: Staging application 's-cturra-com'
staging:
staging: end of staging
stackato.stager: Completed staging application 's-cturra-com'
stackato.dea: Starting application 's-cturra-com' on DEA 6b74ff
app[stderr.log.0]: apache2: Could not reliably determine the server's fully qualified domain name, using 127.0.0.1 for ServerName
stackato.dea: Application 's-cturra-com' is now running on DEA 6b74ff
http://s-cturra-com.paas.mozilla.org/ deployed
FYI I just ran into this again doing a production push to nucleus. This is still only using the db service-- no media or memcache for now. Second time I tried worked, I just had to do a "stackato start" afterwards, because it was in a stopped state after the first push. Transcript of the failed push follows:

$ stackato push
Would you like to deploy from the current directory ? [Yn]:
Using manifest file "stackato.yml"
Updating application 'nucleus'...
stackato.dea: Stopping application 'nucleus' on DEA 6b74ff
stackato.dea: Stopping application 'nucleus' on DEA 6b74ff
Comparing application [nucleus] to [https://api.paas.mozilla.org] ...
Framework: python
Runtime: <framework-specific default>
Application Url: nucleus.mozilla.org
Application Url: nucleus.paas.mozilla.org
Enter Memory Reservation [256M]:
Enter Disk Reservation [2048]:
Preserving Environment Variable [ADMIN_EMAILS]
Preserving Environment Variable [BROWSERID_AUDIENCES]
Preserving Environment Variable [DJANGO_HMAC_KEY]
Preserving Environment Variable [DJANGO_SECRET_KEY]
Preserving Environment Variable [DJANGO_SERVE_STATIC]
Preserving Environment Variable [DJANGO_SETTINGS_MODULE]
Preserving Environment Variable [EMAIL_HOST]
Preserving Environment Variable [NEW_RELIC_API_KEY]
Preserving Environment Variable [NEW_RELIC_APP_NAME]
Preserving Environment Variable [NEW_RELIC_LICENSE_KEY]
Preserving Environment Variable [SERVER_EMAIL]
Updating environment ... No changes
Uploading Application [nucleus] ...
Checking for bad links ... 11602 OK
Copying to temp space ... 11601 OK
Checking for available resources ... OK
Processing resources ... OK
Packing application ... OK
Uploading (3M) ... 100% OK
Error (HTTP 500): <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
<title>Service Unavailable</title>
<style type="text/css">
body, p, h1 { font-family: Verdana, Arial, Helvetica, sans-serif; }
h2 { font-family: Arial, Helvetica, sans-serif; color: #b10b29; }
</style>
</head>
<body>
<h2>Service Unavailable</h2>
<p>The service is temporarily unavailable. Please try again later.</p>
</body>
</html>
Whiteboard: [blocks-nucleus] → [blocks-nucleus][unplanned - troubleshooting]
I continue to run into the situation described in comment #7 -- not every time, but for example I ran into it on Friday at approximately 19:37 UTC and again just now. Each time this happens, the downtime due to production pushes is extended from the "normal" 5-10 minutes to about 15-20 minutes. It appears to be something specific to the production paas cluster, as I have not run into this on the dev paas cluster. The main difference I am aware of between the dev and prod clusters is the way the database services are configured, with the db servers external to the paas itself. Is it possible that there are timeouts or other network errors due to communication issues between newly created lxc containers and the database servers?
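One way to test the hypothesis above would be to run a timed TCP connect probe from inside a freshly created container against the external database server. The sketch below is only an illustration: this bug does not name the database hosts or ports, so the caller must supply them, and `probe_tcp` is a hypothetical helper. A successful connect shows only that the network path was open within the timeout, not that the database service itself is healthy.

```python
import socket
import time


def probe_tcp(host, port, timeout_seconds=3.0):
    """Attempt one TCP connection.

    Returns a tuple (ok, elapsed_seconds, error_text); error_text is empty
    when the connection succeeded.
    """
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True, time.monotonic() - started, ""
    except OSError as exc:
        # OSError covers timeouts, refused connections, and DNS failures.
        return False, time.monotonic() - started, str(exc)
```

Calling `probe_tcp(db_host, db_port)` in a loop during a push, and logging the results with timestamps, would show whether connection failures or slow connects coincide with the HTTP 500 responses.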
Assignee: server-ops-webops → cturra
Whiteboard: [blocks-nucleus][unplanned - troubleshooting] → [kanban:https://kanbanize.com/ctrl_board/4/225] [blocks-nucleus][unplanned - troubleshooting]
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/225] [blocks-nucleus][unplanned - troubleshooting] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3072] [kanban:https://kanbanize.com/ctrl_board/4/225] [blocks-nucleus][unplanned - troubleshooting]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3072] [kanban:https://kanbanize.com/ctrl_board/4/225] [blocks-nucleus][unplanned - troubleshooting] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3077] [kanban:https://kanbanize.com/ctrl_board/4/225] [blocks-nucleus][unplanned - troubleshooting]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3077] [kanban:https://kanbanize.com/ctrl_board/4/225] [blocks-nucleus][unplanned - troubleshooting] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3082] [kanban:https://kanbanize.com/ctrl_board/4/225] [blocks-nucleus][unplanned - troubleshooting]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3082] [kanban:https://kanbanize.com/ctrl_board/4/225] [blocks-nucleus][unplanned - troubleshooting] → [kanban:https://kanbanize.com/ctrl_board/4/225] [blocks-nucleus][unplanned - troubleshooting]
throwing this back into the unassigned queue. this bug is to track intermittent HTTP 500 errors. the plan was to evaluate this after a stackato upgrade to 3.x
Assignee: cturra → server-ops-webops
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/225] [blocks-nucleus][unplanned - troubleshooting] → [kanban:https://webops.kanbanize.com/ctrl_board/2/137] [blocks-nucleus][unplanned - troubleshooting]
The HTML error shown above is the output from Zeus when no backend nodes were able to serve the request, or when no backend nodes were healthy at the time of the request. :jgmize, is this still occurring? If so, we can use current logs to try and correlate; otherwise, I'd like to close as WFM.
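Because that page comes from Zeus rather than from the application, a deploy script could distinguish this failure from other HTTP 500 responses by inspecting the response body. A minimal sketch follows, using marker strings copied from the error output quoted in this bug. The function name is hypothetical, and matching on page text is fragile: it would break if the error page were ever customised.

```python
# Marker strings taken from the error page quoted in this bug.
ZEUS_MARKERS = (
    "<title>Service Unavailable</title>",
    "The service is temporarily unavailable. Please try again later.",
)


def looks_like_zeus_unavailable(body):
    """Return True if the body contains every marker of the quoted page."""
    return all(marker in body for marker in ZEUS_MARKERS)
```

A match suggests that no healthy backend node was available at the time of the request, which is the case worth correlating with the load balancer and router logs.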
It just happened to me again:

$ stackato update
Updating application 'nucleus'...
Stopping Application [nucleus] ... OK
Interrupted
Application Url: nucleus.mozilla.org
Application Url: nucleus.paas.mozilla.org
Preserving Environment Variable [ADMIN_EMAILS]
Preserving Environment Variable [BROWSERID_AUDIENCES]
Preserving Environment Variable [DJANGO_HMAC_KEY]
Preserving Environment Variable [DJANGO_SECRET_KEY]
Preserving Environment Variable [DJANGO_SERVE_STATIC]
Preserving Environment Variable [DJANGO_SETTINGS_MODULE]
Preserving Environment Variable [EMAIL_HOST]
Preserving Environment Variable [NEW_RELIC_LICENSE_KEY]
Preserving Environment Variable [NEW_RELIC_API_KEY]
Preserving Environment Variable [NEW_RELIC_APP_NAME]
Preserving Environment Variable [SERVER_EMAIL]
Updating environment: OK
Uploading Application [nucleus] ...
Checking for bad links: 11882 OK
Copying to temp space: 11881 OK
Checking for available resources: OK
Packing application: OK
Uploading (32M): 100% OK
Error (HTTP 500): <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
<title>Service Unavailable</title>
<style type="text/css">
body, p, h1 { font-family: Verdana, Arial, Helvetica, sans-serif; }
h2 { font-family: Arial, Helvetica, sans-serif; color: #b10b29; }
</style>
</head>
<body>
<h2>Service Unavailable</h2>
<p>The service is temporarily unavailable. Please try again later.</p>
</body>
</html>
After several tries, the stackato update command eventually succeeded, but now the app will no longer start; I've tried several times, and it doesn't always fail at the same point, but here is an example output:

$ stackato start
stackato.stager: Staging application 'nucleus'
staging:
staging: -----> Installing dependencies using PyPM
staging: Get: [pypm-be.activestate.com] :repository-index:
Get: [pypm-free.activestate.com] :repository-index:
autosync: synced 2 repositories
staging: The following packages will be installed into "/staging/staged/python" (2.7):
staging: readline-6.2.4.1 ipython-2.3.1 py-bcrypt-0.3 jinja2-2.5.5
staging: mysql-python-1.2.3
staging: Get: [pypm-free.activestate.com] ipython 2.3.1
staging: Get: [pypm-free.activestate.com] jinja2 2.5.5
staging: Get: [pypm-free.activestate.com] mysql-python 1.2.3
staging: Get: [pypm-free.activestate.com] py-bcrypt 0.3
staging: Get: [pypm-free.activestate.com] readline 6.2.4.1
staging: Installing readline-6.2.4.1
staging: Installing ipython-2.3.1
Failed to stage application: staging plugin exited with non-zero exit code.
After several more attempts to update and start the service, I finally tried deleting the nucleus app and pushing again, which worked, and https://nucleus.mozilla.org is finally working again, after about 30 minutes of downtime. Feel free to check the logs to see what you can find, but frankly I'm not sure if this platform is worth it. If you close this bug, please use the WONTFIX resolution due to declaring an official EOL of this platform at Mozilla, and I'll accelerate my plans to move services off of it.
Summary: Production paas gives intermittent HTTP 500 errors during "stackato push" → Production paas gives intermittent HTTP 500 errors during 'stackato push'
:jgmize I'm sorry we couldn't help you debug this. But to be frank, I think you are right. This platform is EOL and on its way out. I've heard rumours of end-of-year plans to decom it.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX