1161644 - Unable to successfully deploy One and Done to production PAAS

Reporter

Description

•

10 years ago

The production site for One and Done just went down. The staging env. is fine. I was working on it at the time and it was fine and then just started timing out about ~5 mins ago [roughly 11:05am PST].

:kanban

Updated

•

10 years ago

Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/1115]

Rebecca Billings [:rbillings]

Reporter

Comment 1

•

10 years ago

ok, and it just came back up seconds ago....

Stephen Donner [:stephend] Not actively reading bugmail

Comment 2

•

10 years ago

Attached image oneanddone-prod-down.png — Details

I happened to catch it and some headers - do we know what caused/fixed this?

Rebecca Billings [:rbillings]

Reporter

Comment 3

•

10 years ago

And it's down again.

C. Liang [:cyliang]

Comment 4

•

10 years ago

It looks like the monitoring on the load balancer end was a bit over-sensitive and had a tendency to mark PaaS nodes as "unavailable" if the app was slow to respond. Those settings have been changed and the app was restarted on May 5th, at around 11:30 PDT. It looks like the site is up and available for the moment.

Rebecca Billings [:rbillings]

Reporter

Comment 5

•

10 years ago

The site has been up and down since the original report. I just tried to load https://oneanddone.mozilla.org/tasks/45/ and it failed. I was able to repro in other tabs & browsers.

Bob Silverberg [:bsilverberg]

Comment 6

•

10 years ago

Apologies for the really ugly output below, but I wanted to capture the log file as seen via the Stackato Management Console, which shows some issues: [pid: 4569|app: 0|req: 5378/9388] 10.22.93.30 () {52 vars in 2832 bytes} [Wed May 6 14:12:20 2015] GET /static/img/smoothness/ui-bg_flat_75_ffffff_40x100.png => generated 0 bytes in 0 msecs (HTTP/1.1 304) 0 headers in 29 bytes (0 switches on core 0) app 0 2015-05-06T14:12:20-0400 [pid: 4569|app: 0|req: 5379/9389] 10.22.93.31 () {50 vars in 2748 bytes} [Wed May 6 14:12:24 2015] GET /en-US/tasks/metrics/ => generated 42743 bytes in 1165 msecs (HTTP/1.1 200) 4 headers in 119 bytes (1 switches on core 0) app 0 2015-05-06T14:12:25-0400 [pid: 4569|app: 0|req: 5380/9390] 10.22.93.30 () {46 vars in 2289 bytes} [Wed May 6 14:12:26 2015] GET /en-US/tasks/45 => generated 0 bytes in 4 msecs (HTTP/1.1 301) 4 headers in 174 bytes (1 switches on core 0) app 0 2015-05-06T14:12:26-0400 DAMN ! worker 1 (pid: 4569) died, killed by signal 9 :( trying respawn ... app 0 2015-05-06T14:14:19-0400 Respawned uWSGI worker 1 (new pid: 4591) app 0 2015-05-06T14:14:19-0400 2015-05-06 14:14:19,699 (4591/MainThread) newrelic.core.agent INFO - New Relic Python Agent (1.11.0.55) app 0 2015-05-06T14:14:19-0400 2015-05-06 14:14:26,802 (4591/NR-Activate-Session/oneanddone.mozilla.org) newrelic.core.data_collector INFO - Successfully registered New Relic Python agent where app_name='oneanddone.mozilla.org', pid=4591, redirect_host='collector-130.newrelic.com' and agent_run_id=343599759, in 0.87 seconds. app 0 2015-05-06T14:14:26-0400 INFO:newrelic.core.data_collector:Successfully registered New Relic Python agent where app_name='oneanddone.mozilla.org', pid=4591, redirect_host='collector-130.newrelic.com' and agent_run_id=343599759, in 0.87 seconds. app 0 2015-05-06T14:14:26-0400 2015-05-06 14:14:26,802 (4591/NR-Activate-Session/oneanddone.mozilla.org) newrelic.core.data_collector INFO - High security mode is being applied to all communications between the agent and the data collector for this session. app 0 2015-05-06T14:14:26-0400 INFO:newrelic.core.data_collector:High security mode is being applied to all communications between the agent and the data collector for this session. app 0 2015-05-06T14:14:26-0400 2015-05-06 14:14:26,802 (4591/NR-Activate-Session/oneanddone.mozilla.org) newrelic.core.data_collector INFO - Reporting to: https://rpm.newrelic.com/accounts/263620/applications/3612614 app 0 2015-05-06T14:14:26-0400 INFO:newrelic.core.data_collector:Reporting to: https://rpm.newrelic.com/accounts/263620/applications/3612614 app 0 2015-05-06T14:14:26-0400 DAMN ! worker 1 (pid: 4591) died, killed by signal 9 :( trying respawn ... app 0 2015-05-06T14:16:11-0400 Respawned uWSGI worker 1 (new pid: 4608) app 0 2015-05-06T14:16:11-0400 2015-05-06 14:16:11,239 (4608/MainThread) newrelic.core.agent INFO - New Relic Python Agent (1.11.0.55) app 0 2015-05-06T14:16:11-0400 [pid: 4608|app: 0|req: 5383/9391] 10.22.93.30 () {50 vars in 2726 bytes} [Wed May 6 14:16:11 2015] GET /tasks/45/ => generated 0 bytes in 26 msecs (HTTP/1.1 301) 4 headers in 178 bytes (1 switches on core 0) app 0 2015-05-06T14:16:11-0400 2015-05-06 14:16:17,666 (4608/NR-Activate-Session/oneanddone.mozilla.org) newrelic.core.data_collector INFO - Successfully registered New Relic Python agent where app_name='oneanddone.mozilla.org', pid=4608, redirect_host='collector-130.newrelic.com' and agent_run_id=343605103, in 0.73 seconds. app 0 2015-05-06T14:16:17-0400 INFO:newrelic.core.data_collector:Successfully registered New Relic Python agent where app_name='oneanddone.mozilla.org', pid=4608, redirect_host='collector-130.newrelic.com' and agent_run_id=343605103, in 0.73 seconds. app 0 2015-05-06T14:16:17-0400 2015-05-06 14:16:17,667 (4608/NR-Activate-Session/oneanddone.mozilla.org) newrelic.core.data_collector INFO - High security mode is being applied to all communications between the agent and the data collector for this session. app 0 2015-05-06T14:16:17-0400 INFO:newrelic.core.data_collector:High security mode is being applied to all communications between the agent and the data collector for this session. app 0 2015-05-06T14:16:17-0400 2015-05-06 14:16:17,667 (4608/NR-Activate-Session/oneanddone.mozilla.org) newrelic.core.data_collector INFO - Reporting to: https://rpm.newrelic.com/accounts/263620/applications/3612614 app 0 2015-05-06T14:16:17-0400 INFO:newrelic.core.data_collector:Reporting to: https://rpm.newrelic.com/accounts/263620/applications/3612614 app 0 2015-05-06T14:16:17-0400 DAMN ! worker 1 (pid: 4608) died, killed by signal 9 :( trying respawn ... app 0 2015-05-06T14:18:10-0400 Respawned uWSGI worker 1 (new pid: 4629) app 0 2015-05-06T14:18:10-0400 2015-05-06 14:18:10,819 (4629/MainThread) newrelic.core.agent INFO - New Relic Python Agent (1.11.0.55)

Bob Silverberg [:bsilverberg]

Comment 7

•

10 years ago

After some testing it seems that this is related to the last deploy which updated the site to Django 1.7, among many other changes. At this point I have no idea what is causing it. All I can see is that at certain times, after a GET request, a message shows up in the log like: DAMN ! worker 1 (pid: 4591) died, killed by signal 9 :( trying respawn ... Looking earlier in the log, I can see that this is a uWSGI worker. For example, here are some entries, in chronological order: Application 'oneanddone' is now running on DEA 5f6f45 WSGI app 0 (mountpoint='') ready in 1 seconds on interpreter 0x1433080 pid: 18303 (default app) spawned uWSGI master process (pid: 18303) spawned uWSGI worker 1 (pid: 18308, cores: 1) DAMN ! worker 1 (pid: 18308) died, killed by signal 9 :( trying respawn ... Respawned uWSGI worker 1 (new pid: 18334) So, for whatever reason, the uWSGI worker died, and was respawned, which seems to leave the site in a "Service Unavailable" state for some time. It did seem, during testing, as if we were able to issue a number of requests of specific types once the site came back up, such as logging in, logging out, listing tasks, and also were able to look at the details of some tasks, without the site going down. But then looking at the details of some particular tasks would cause the site to go down. We cannot say for sure that there is a correlation between those tasks and the site going down, but it seems possible. I have backed out the code on prod for now and the site seems to be available and working fine again. I never get any emails or errors when this happens, all I can see is that message in the log. If anyone has any advice on how I might debug this or try to narrow down the cause I would appreciate it.

Bob Silverberg [:bsilverberg]

Comment 8

•

10 years ago

Shyam, we briefly discussed this on IRC today. I would very much like to try to get to the bottom of this, and am willing to help debug it with some assistance. Here are some thoughts: 1. Is there a log somewhere that will tell us why this worker process is being killed? I am not getting any error messages from Python, and all I see in the app log is what I pasted above. 2. I read something on SO [1] about this likely being due to the app running out of memory (and Laura mentioned that too). Is there a way to monitor or profile the memory usage of the worker to figure out if and why this is happening? 3. Related to 2, above, perhaps just increasing the amount of memory the app can use might address this, but I am wary of just indiscriminately doing that. I'm also not sure _how_ to do it, as, as far as I can see, the uWSGI config us hidden from me and not available to be configured by me. I look forward to hearing back from you when you have some time to look at this. I would very much like to address this as soon as possible as it is preventing me from upgrading the app to a modern (and supported) version of Django. [1] http://stackoverflow.com/questions/25414902/how-to-find-out-why-uwsgi-kill-workers

Flags: needinfo?(smani)

Bob Silverberg [:bsilverberg]

Comment 9

•

10 years ago

I had another thought about this: As it is only happening on prod and not dev and stage, it would make sense to compare the configurations between the prod paas and the stage paas to see if anything jumps out that could possibly be changed to resolve this issue. Unfortunately I don't have access to either information, nor the ability to change any of it, so I would need help with this from someone who does.

Summary: One and Done production is down → Unable to successfully deploy One and Done to production PAAS

Shyam Mani [:fox2mike]

Assignee

Updated

•

10 years ago

Assignee: server-ops-webops → smani

Flags: needinfo?(smani)

Jake Maul [:jakem]

Updated

•

10 years ago

Depends on: 1180767

Bob Silverberg [:bsilverberg]

Comment 10

•

10 years ago

By doing some local debugging with the production database I was able to trace this to a bug in the bleach package that we were using, which caused an infinite loop that would just keep consuming more and more memory. Upgrading the package to the latest from GitHub solved the problem, so I am now able to deploy successfully. Therefore I am going to close this bug, but I believe that we should still migrate the site off of the paas as it doesn't seem to be supported. I will leave bug 1180767 open and am happy to contribute my time to working on that bug.

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

BMO Automation

Updated

•

6 years ago

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard

Bugzilla

Unable to successfully deploy One and Done to production PAAS

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

Tracking

(Not tracked)

People

(Reporter: rbillings, Assigned: fox2mike)

References

(
URL
)

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/1115] )

Crash Data

Security

(public)

User Story

Attachments

(1 file)

Description

Updated

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Updated

Updated

Comment 10

Updated

Attachment

General

Description

File Name

Content Type