Closed
Bug 979149
Opened 11 years ago
Closed 10 years ago
Production paas gives intermittent HTTP 500 errors during 'stackato push'
Categories
(Infrastructure & Operations :: IT-Managed Tools, task)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: jgmize, Unassigned)
Details
(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/137] [blocks-nucleus][unplanned - troubleshooting])
Example output:
$ stackato push
Would you like to deploy from the current directory ? [Yn]:
Using manifest file "stackato.yml"
Updating application 'nucleus'...
stackato.dea: Stopping application 'nucleus' on DEA 1b1588
stackato.dea: Stopping application 'nucleus' on DEA 1b1588
Comparing application [nucleus] to [https://api.paas.mozilla.org] ...
Framework: python
Runtime: <framework-specific default>
Application Url: nucleus.mozilla.org
Application Url: nucleus.paas.mozilla.org
Enter Memory Reservation [256M]:
Enter Disk Reservation [2048]:
Preserving Environment Variable [ADMINS]
Preserving Environment Variable [DJANGO_HMAC_KEY]
Preserving Environment Variable [DJANGO_SECRET_KEY]
Preserving Environment Variable [DJANGO_SERVE_STATIC]
Preserving Environment Variable [DJANGO_SETTINGS_MODULE]
Preserving Environment Variable [EMAIL_HOST]
Preserving Environment Variable [SERVER_EMAIL]
Updating environment ... No changes
Uploading Application [nucleus] ...
Checking for bad links ... 11609 OK
Copying to temp space ... 11608 OK
Checking for available resources ... OK
Processing resources ... OK
Packing application ... OK
Uploading (574K) ... 100% OK
Error (HTTP 500): <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> <html> <head> <meta http-equiv="Content-Type" content="text/html;charset=utf-8"> <title>Service Unavailable</title> <style type="text/css"> body, p, h1 { font-family:
Verdana, Arial, Helvetica, sans-serif; } h2 { font-family: Arial, Helvetica, sans-serif; color: #b10b29; } </style> </head> <body> <h2>Service Unavailable</h2> <p>The service is temporarily unavailable. Please try again later.</p> </body> </html>
Usually I just rerun the command until it works (tonight it took 3 tries). After that I have to run "stackato start" manually, because the application is left in a stopped state by the first push that gets a 500 error. This example is with nucleus, but I have seen this happen with other apps as well.
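The workaround above (rerun the push until it succeeds, then start the app by hand) could be scripted roughly as follows. This is only a sketch under assumptions not confirmed in this bug: that the stackato client exits non-zero when the push ends in the HTTP 500 error, and that the `-n` flag seen in comment 6 suppresses the interactive prompts. The function name `push_with_retry` and its parameters are hypothetical.

```python
import subprocess


def push_with_retry(run=subprocess.call, max_attempts=3):
    """Retry `stackato push` until it exits 0, then start the app if needed.

    ASSUMPTION: the client returns a non-zero exit code when the push
    fails with the HTTP 500 error; this bug does not confirm that.
    Returns the attempt number that succeeded, or None if all failed.
    """
    for attempt in range(1, max_attempts + 1):
        if run(["stackato", "push", "-n"]) == 0:
            # A failed earlier attempt leaves the app stopped, so a later
            # successful push has to be followed by an explicit start.
            if attempt > 1:
                run(["stackato", "start"])
            return attempt
    return None
```

The `run` parameter exists so the logic can be exercised with a fake command runner before pointing it at a production cluster.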
Updated•11 years ago
Whiteboard: [blocks-nucleus]
Comment 1•11 years ago
i spent quite a bit of time this morning trying to replicate this without success. are you seeing this every time you deploy nucleus? and does this happen with other apps you push?
i have a test app on a deploy loop, which has now run hundreds of times without returning any errors.
Flags: needinfo?(jmize)
Comment 2•11 years ago (Reporter)
No, this has only happened intermittently. Based on the timing and your comment on https://bugzilla.mozilla.org/show_bug.cgi?id=973677#c5 I believe this may be another failure that can occur during puppet runs.
Flags: needinfo?(jmize)
Comment 3•11 years ago (Reporter)
Unless, of course, puppet runs were disabled before last night/early this morning, in which case this was another failure mode entirely.
Comment 4•11 years ago
puppet was disabled on the dea/stager nodes, but hasn't been on the core node, which is where the router lives. let's hold on this until we've got the puppet changes in place and see if we can reproduce after that.
again, i've had no luck all morning with a for loop running pushes multiple times a minute.
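A repro loop of the kind described in comments 1 and 4 might look like the sketch below. It records a UTC timestamp and exit code for every push, so that any failures can later be lined up against puppet run times. The names `deploy_loop` and `failures`, the 20-second pause, and the assumption that a failed push exits non-zero are all illustrative and not taken from the actual test setup.

```python
import subprocess
import time
from datetime import datetime, timezone


def deploy_loop(iterations, run=subprocess.call, pause_seconds=20, sleep=time.sleep):
    """Push repeatedly; return one (UTC timestamp, exit code) pair per push."""
    results = []
    for i in range(iterations):
        code = run(["stackato", "push", "-n"])
        results.append((datetime.now(timezone.utc), code))
        # Pause between pushes, but not after the last one.
        if i < iterations - 1:
            sleep(pause_seconds)
    return results


def failures(results):
    """Timestamps of pushes that exited non-zero (ASSUMED to mean failure)."""
    return [ts for ts, code in results if code != 0]
```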
Comment 5•11 years ago (Reporter)
:cturra does your test app that you mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=979149#c1 use a media service? During my troubleshooting I realized I could get by without it for nucleus, and I switched the caching to LocMem as well, so I'm only using the DB service for nucleus now. That seems to have helped: I'm running into fewer issues.
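For reference, switching Django caching to LocMem is a settings change along these lines. This is a generic sketch, not the actual nucleus settings; the `LOCATION` label is a made-up name.

```python
# Fragment of a Django settings module (illustrative, not nucleus's own).
CACHES = {
    "default": {
        # Django's built-in local-memory backend: no external cache service.
        "BACKEND": "django.core.cache.backends.locmem.LocMemCache",
        # Hypothetical label; it only distinguishes multiple locmem caches.
        "LOCATION": "nucleus-locmem",
    }
}
```

The local-memory cache is private to each process, so cached values are not shared between application instances. That is the trade-off for dropping the memcache service dependency.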
Comment 6•11 years ago
to make my test app closer to yours, i added a filesystem service to it last night called 'media'. it's a very basic php app that has a couple larger files so there is data to transfer on push.
$ s delete s-cturra-com
Provisioned service [media] detected would you like to delete it ?: [yN]: y
Deleting application [s-cturra-com] ... OK
Deleting service [media] ... OK
$ s push -n
Using manifest file "stackato.yml"
Framework: php
Runtime: <framework-specific default>
Application Url: s-cturra-com.paas.mozilla.org
Creating Application [s-cturra-com] in [https://api.paas.mozilla.org] ... OK
Creating new service [media] ... OK
Binding Service media to s-cturra-com ... OK
Uploading Application [s-cturra-com] ...
Checking for bad links ... 8 OK
Copying to temp space ... 7 OK
Checking for available resources ... OK
Processing resources ... OK
Packing application ... OK
Uploading (359) ... 100% OK
Push Status: OK
stackato.stager: Staging application 's-cturra-com'
staging:
staging: end of staging
stackato.stager: Completed staging application 's-cturra-com'
stackato.dea: Starting application 's-cturra-com' on DEA 6b74ff
app[stderr.log.0]: apache2: Could not reliably determine the server's fully qualified domain name, using 127.0.0.1 for ServerName
stackato.dea: Application 's-cturra-com' is now running on DEA 6b74ff
http://s-cturra-com.paas.mozilla.org/ deployed
Comment 7•11 years ago (Reporter)
FYI I just ran into this again doing a production push to nucleus. This is still only using the db service, with no media or memcache for now. The second attempt worked; I just had to do a "stackato start" afterwards, because the app was in a stopped state after the first push. Transcript of the failed push follows:
$ stackato push
Would you like to deploy from the current directory ? [Yn]:
Using manifest file "stackato.yml"
Updating application 'nucleus'...
stackato.dea: Stopping application 'nucleus' on DEA 6b74ff
stackato.dea: Stopping application 'nucleus' on DEA 6b74ff
Comparing application [nucleus] to [https://api.paas.mozilla.org] ...
Framework: python
Runtime: <framework-specific default>
Application Url: nucleus.mozilla.org
Application Url: nucleus.paas.mozilla.org
Enter Memory Reservation [256M]:
Enter Disk Reservation [2048]:
Preserving Environment Variable [ADMIN_EMAILS]
Preserving Environment Variable [BROWSERID_AUDIENCES]
Preserving Environment Variable [DJANGO_HMAC_KEY]
Preserving Environment Variable [DJANGO_SECRET_KEY]
Preserving Environment Variable [DJANGO_SERVE_STATIC]
Preserving Environment Variable [DJANGO_SETTINGS_MODULE]
Preserving Environment Variable [EMAIL_HOST]
Preserving Environment Variable [NEW_RELIC_API_KEY]
Preserving Environment Variable [NEW_RELIC_APP_NAME]
Preserving Environment Variable [NEW_RELIC_LICENSE_KEY]
Preserving Environment Variable [SERVER_EMAIL]
Updating environment ... No changes
Uploading Application [nucleus] ...
Checking for bad links ... 11602 OK
Copying to temp space ... 11601 OK
Checking for available resources ... OK
Processing resources ... OK
Packing application ... OK
Uploading (3M) ... 100% OK
Error (HTTP 500): <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> <html> <head> <meta
http-equiv="Content-Type" content="text/html;charset=utf-8"> <title>Service Unavailable</title> <style type="text/css"> body, p, h1 {
font-family: Verdana, Arial, Helvetica, sans-serif; } h2 { font-family: Arial, Helvetica, sans-serif; color: #b10b29; } </style> </head>
<body> <h2>Service Unavailable</h2> <p>The service is temporarily unavailable. Please try again later.</p> </body> </html>
Updated•11 years ago
Whiteboard: [blocks-nucleus] → [blocks-nucleus][unplanned - troubleshooting]
Comment 8•11 years ago (Reporter)
I continue to run into the situation described in comment #7 -- not every time, but for example I ran into it on Friday at approximately 19:37 UTC and again just now. Each time this happens, the downtime due to production pushes is extended from the "normal" 5-10 minutes to about 15-20 minutes.
It appears to be something specific to the production paas cluster, as I have not run into this on the dev paas cluster. The main difference I am aware of between the dev and prod clusters is the way the database services are configured, with the db servers external to the paas itself. Is it possible that there are timeouts or other network errors due to communication issues between newly created lxc containers and the database servers?
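One way to probe the timeout hypothesis above would be to time a plain TCP connect from inside a freshly created container to the database server. The sketch below uses only the Python standard library. The function name `check_tcp` is hypothetical, and the host and port would have to come from the real database service configuration, which is not given in this bug.

```python
import socket
import time


def check_tcp(host, port, timeout=3.0):
    """Attempt one TCP connect; return (ok, seconds_elapsed, error)."""
    started = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - started, None
    except OSError as exc:
        return False, time.monotonic() - started, exc
```

A successful connect only shows that the network path is open at that moment; it says nothing about authentication or query latency.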
Updated•11 years ago
Assignee: server-ops-webops → cturra
Updated•11 years ago
Whiteboard: [blocks-nucleus][unplanned - troubleshooting] → [kanban:https://kanbanize.com/ctrl_board/4/225] [blocks-nucleus][unplanned - troubleshooting]
Updated•11 years ago
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/225] [blocks-nucleus][unplanned - troubleshooting] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3072] [kanban:https://kanbanize.com/ctrl_board/4/225] [blocks-nucleus][unplanned - troubleshooting]
Updated•11 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3072] [kanban:https://kanbanize.com/ctrl_board/4/225] [blocks-nucleus][unplanned - troubleshooting] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3077] [kanban:https://kanbanize.com/ctrl_board/4/225] [blocks-nucleus][unplanned - troubleshooting]
Updated•11 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3077] [kanban:https://kanbanize.com/ctrl_board/4/225] [blocks-nucleus][unplanned - troubleshooting] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3082] [kanban:https://kanbanize.com/ctrl_board/4/225] [blocks-nucleus][unplanned - troubleshooting]
Updated•11 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3082] [kanban:https://kanbanize.com/ctrl_board/4/225] [blocks-nucleus][unplanned - troubleshooting] → [kanban:https://kanbanize.com/ctrl_board/4/225] [blocks-nucleus][unplanned - troubleshooting]
Comment 9•11 years ago
throwing this back into the unassigned queue. this bug is to track intermittent HTTP 500 errors. the plan was to evaluate this after a stackato upgrade to 3.x
Assignee: cturra → server-ops-webops
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/225] [blocks-nucleus][unplanned - troubleshooting] → [kanban:https://webops.kanbanize.com/ctrl_board/2/137] [blocks-nucleus][unplanned - troubleshooting]
Comment 10•10 years ago
The HTML error shown above is the output from Zeus when no backend nodes were able to serve the request, or when no backend nodes were healthy at the time of the request.
:jgmize, is this still occurring? If so, we can use current logs to try and correlate; otherwise, I'd like to close as WFM.
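When correlating logs, it may help to separate failures whose body is this load-balancer page from errors generated by the Stackato API itself. A minimal sketch, assuming the page title stays "Service Unavailable" as in the transcripts in this bug (the function name is hypothetical):

```python
import re


def is_lb_unavailable_page(body):
    """True if an error body has the 'Service Unavailable' title quoted in this bug."""
    title = re.search(r"<title>\s*(.*?)\s*</title>", body, re.IGNORECASE | re.DOTALL)
    return bool(title) and title.group(1) == "Service Unavailable"
```

A match suggests the request never reached a healthy backend, so the timestamp of that push is the one to look up in the Zeus and router logs.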
Comment 11•10 years ago (Reporter)
It just happened to me again:
$ stackato update
Updating application 'nucleus'...
Stopping Application [nucleus] ... OK
Interrupted
Application Url: nucleus.mozilla.org
Application Url: nucleus.paas.mozilla.org
Preserving Environment Variable [ADMIN_EMAILS]
Preserving Environment Variable [BROWSERID_AUDIENCES]
Preserving Environment Variable [DJANGO_HMAC_KEY]
Preserving Environment Variable [DJANGO_SECRET_KEY]
Preserving Environment Variable [DJANGO_SERVE_STATIC]
Preserving Environment Variable [DJANGO_SETTINGS_MODULE]
Preserving Environment Variable [EMAIL_HOST]
Preserving Environment Variable [NEW_RELIC_LICENSE_KEY]
Preserving Environment Variable [NEW_RELIC_API_KEY]
Preserving Environment Variable [NEW_RELIC_APP_NAME]
Preserving Environment Variable [SERVER_EMAIL]
Updating environment: OK
Uploading Application [nucleus] ...
Checking for bad links: 11882 OK
Copying to temp space: 11881 OK
Checking for available resources: OK
Packing application: OK
Uploading (32M): 100% OK
Error (HTTP 500): <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
<title>Service Unavailable</title>
<style type="text/css">
body, p, h1 {
font-family: Verdana, Arial, Helvetica, sans-serif;
}
h2 {
font-family: Arial, Helvetica, sans-serif;
color: #b10b29;
}
</style>
</head>
<body>
<h2>Service Unavailable</h2>
<p>The service is temporarily unavailable. Please try again later.</p>
</body>
</html>
Comment 12•10 years ago (Reporter)
After several tries, the stackato update command eventually succeeded, but now the app will no longer start; I've tried several times, and it doesn't always fail at the same point, but here is an example output:
$ stackato start
stackato.stager: Staging application 'nucleus'
staging:
staging: -----> Installing dependencies using PyPM
staging: Get: [pypm-be.activestate.com] :repository-index:
Get: [pypm-free.activestate.com] :repository-index:
autosync: synced 2 repositories
staging: The following packages will be installed into "/staging/staged/python" (2.7):
staging: readline-6.2.4.1 ipython-2.3.1 py-bcrypt-0.3 jinja2-2.5.5
staging: mysql-python-1.2.3
staging: Get: [pypm-free.activestate.com] ipython 2.3.1
staging: Get: [pypm-free.activestate.com] jinja2 2.5.5
staging: Get: [pypm-free.activestate.com] mysql-python 1.2.3
staging: Get: [pypm-free.activestate.com] py-bcrypt 0.3
staging: Get: [pypm-free.activestate.com] readline 6.2.4.1
staging: Installing readline-6.2.4.1
staging: Installing ipython-2.3.1
Failed to stage application:
staging plugin exited with non-zero exit code.
Comment 13•10 years ago (Reporter)
After several more attempts to update and start the service, I finally tried deleting the nucleus app and pushing again, which worked, and https://nucleus.mozilla.org is finally working again, after about 30 minutes of downtime. Feel free to check the logs to see what you can find, but frankly I'm not sure if this platform is worth it. If you close this bug, please use the WONTFIX resolution due to declaring an official EOL of this platform at Mozilla, and I'll accelerate my plans to move services off of it.
Summary: Production paas gives intermittent HTTP 500 errors during "stackato push" → Production paas gives intermittent HTTP 500 errors during 'stackato push'
Comment 14•10 years ago
:jgmize I'm sorry we couldn't help you debug this. But to be frank, I think you are right. This platform is EOL and on its way out. I've heard rumours of end-of-year plans to decom it.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX