Unable to successfully deploy One and Done to production PAAS

Status

RESOLVED FIXED
Opened 4 years ago
Closed 3 years ago

People

(Reporter: rbillings, Assigned: fox2mike)

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/1115])

Attachments

(1 attachment)

(Reporter)

Description

4 years ago
The production site for One and Done just went down. The staging environment is fine. I was working on the site at the time and it was fine, then it started timing out about 5 minutes ago [roughly 11:05 am PST].

Updated

4 years ago
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/1115]
(Reporter)

Comment 1

4 years ago
OK, and it just came back up seconds ago...

Comment 2

4 years ago
Created attachment 8601633 [details]
oneanddone-prod-down.png

I happened to catch it and some headers - do we know what caused/fixed this?
(Reporter)

Comment 3

4 years ago
And it's down again.

Comment 4

4 years ago
It looks like the monitoring on the load-balancer end was a bit over-sensitive and tended to mark PaaS nodes as "unavailable" if the app was slow to respond. Those settings have been changed, and the app was restarted on May 5 at around 11:30 PDT.

It looks like the site is up and available for the moment.
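For future reference (this is not what was changed here, and I don't know what the load balancer in front of the PaaS supports), one common way to make health checks less sensitive to a generally slow app is to give the probe its own very cheap endpoint to hit instead of a full page. A minimal Django sketch, with a made-up URL, purely as an illustration:

# healthcheck.py - hypothetical lightweight probe view; nothing like this exists in the app today
from django.http import HttpResponse

def healthcheck(request):
    # Deliberately avoids the database and templates, so a busy or slow
    # worker is less likely to make the load balancer's probe time out.
    return HttpResponse("OK", content_type="text/plain")

The load balancer's check would then be pointed at that URL rather than at a page that does real work.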
(Reporter)

Comment 5

4 years ago
The site has been up and down since the original report. I just tried to load https://oneanddone.mozilla.org/tasks/45/ and it failed. I was able to repro in other tabs & browsers.
Apologies for the really ugly output below, but I wanted to capture the log file as seen via the Stackato Management Console, which shows some issues:

[pid: 4569|app: 0|req: 5378/9388] 10.22.93.30 () {52 vars in 2832 bytes} [Wed May 6 14:12:20 2015] GET /static/img/smoothness/ui-bg_flat_75_ffffff_40x100.png => generated 0 bytes in 0 msecs (HTTP/1.1 304) 0 headers in 29 bytes (0 switches on core 0)	app	0	2015-05-06T14:12:20-0400
[pid: 4569|app: 0|req: 5379/9389] 10.22.93.31 () {50 vars in 2748 bytes} [Wed May 6 14:12:24 2015] GET /en-US/tasks/metrics/ => generated 42743 bytes in 1165 msecs (HTTP/1.1 200) 4 headers in 119 bytes (1 switches on core 0)	app	0	2015-05-06T14:12:25-0400
[pid: 4569|app: 0|req: 5380/9390] 10.22.93.30 () {46 vars in 2289 bytes} [Wed May 6 14:12:26 2015] GET /en-US/tasks/45 => generated 0 bytes in 4 msecs (HTTP/1.1 301) 4 headers in 174 bytes (1 switches on core 0)	app	0	2015-05-06T14:12:26-0400
DAMN ! worker 1 (pid: 4569) died, killed by signal 9 :( trying respawn ...	app	0	2015-05-06T14:14:19-0400
Respawned uWSGI worker 1 (new pid: 4591)	app	0	2015-05-06T14:14:19-0400
2015-05-06 14:14:19,699 (4591/MainThread) newrelic.core.agent INFO - New Relic Python Agent (1.11.0.55)	app	0	2015-05-06T14:14:19-0400
2015-05-06 14:14:26,802 (4591/NR-Activate-Session/oneanddone.mozilla.org) newrelic.core.data_collector INFO - Successfully registered New Relic Python agent where app_name='oneanddone.mozilla.org', pid=4591, redirect_host='collector-130.newrelic.com' and agent_run_id=343599759, in 0.87 seconds.	app	0	2015-05-06T14:14:26-0400
INFO:newrelic.core.data_collector:Successfully registered New Relic Python agent where app_name='oneanddone.mozilla.org', pid=4591, redirect_host='collector-130.newrelic.com' and agent_run_id=343599759, in 0.87 seconds.	app	0	2015-05-06T14:14:26-0400
2015-05-06 14:14:26,802 (4591/NR-Activate-Session/oneanddone.mozilla.org) newrelic.core.data_collector INFO - High security mode is being applied to all communications between the agent and the data collector for this session.	app	0	2015-05-06T14:14:26-0400
INFO:newrelic.core.data_collector:High security mode is being applied to all communications between the agent and the data collector for this session.	app	0	2015-05-06T14:14:26-0400
2015-05-06 14:14:26,802 (4591/NR-Activate-Session/oneanddone.mozilla.org) newrelic.core.data_collector INFO - Reporting to: https://rpm.newrelic.com/accounts/263620/applications/3612614	app	0	2015-05-06T14:14:26-0400
INFO:newrelic.core.data_collector:Reporting to: https://rpm.newrelic.com/accounts/263620/applications/3612614	app	0	2015-05-06T14:14:26-0400
DAMN ! worker 1 (pid: 4591) died, killed by signal 9 :( trying respawn ...	app	0	2015-05-06T14:16:11-0400
Respawned uWSGI worker 1 (new pid: 4608)	app	0	2015-05-06T14:16:11-0400
2015-05-06 14:16:11,239 (4608/MainThread) newrelic.core.agent INFO - New Relic Python Agent (1.11.0.55)	app	0	2015-05-06T14:16:11-0400
[pid: 4608|app: 0|req: 5383/9391] 10.22.93.30 () {50 vars in 2726 bytes} [Wed May 6 14:16:11 2015] GET /tasks/45/ => generated 0 bytes in 26 msecs (HTTP/1.1 301) 4 headers in 178 bytes (1 switches on core 0)	app	0	2015-05-06T14:16:11-0400
2015-05-06 14:16:17,666 (4608/NR-Activate-Session/oneanddone.mozilla.org) newrelic.core.data_collector INFO - Successfully registered New Relic Python agent where app_name='oneanddone.mozilla.org', pid=4608, redirect_host='collector-130.newrelic.com' and agent_run_id=343605103, in 0.73 seconds.	app	0	2015-05-06T14:16:17-0400
INFO:newrelic.core.data_collector:Successfully registered New Relic Python agent where app_name='oneanddone.mozilla.org', pid=4608, redirect_host='collector-130.newrelic.com' and agent_run_id=343605103, in 0.73 seconds.	app	0	2015-05-06T14:16:17-0400
2015-05-06 14:16:17,667 (4608/NR-Activate-Session/oneanddone.mozilla.org) newrelic.core.data_collector INFO - High security mode is being applied to all communications between the agent and the data collector for this session.	app	0	2015-05-06T14:16:17-0400
INFO:newrelic.core.data_collector:High security mode is being applied to all communications between the agent and the data collector for this session.	app	0	2015-05-06T14:16:17-0400
2015-05-06 14:16:17,667 (4608/NR-Activate-Session/oneanddone.mozilla.org) newrelic.core.data_collector INFO - Reporting to: https://rpm.newrelic.com/accounts/263620/applications/3612614	app	0	2015-05-06T14:16:17-0400
INFO:newrelic.core.data_collector:Reporting to: https://rpm.newrelic.com/accounts/263620/applications/3612614	app	0	2015-05-06T14:16:17-0400
DAMN ! worker 1 (pid: 4608) died, killed by signal 9 :( trying respawn ...	app	0	2015-05-06T14:18:10-0400
Respawned uWSGI worker 1 (new pid: 4629)	app	0	2015-05-06T14:18:10-0400
2015-05-06 14:18:10,819 (4629/MainThread) newrelic.core.agent INFO - New Relic Python Agent (1.11.0.55)

After some testing, it seems that this is related to the last deploy, which updated the site to Django 1.7 among many other changes. At this point I have no idea what is causing it. All I can see is that at certain times, after a GET request, a message like this shows up in the log:

DAMN ! worker 1 (pid: 4591) died, killed by signal 9 :( trying respawn ...

Looking earlier in the log, I can see that this is a uWSGI worker. For example, here are some entries, in chronological order:

Application 'oneanddone' is now running on DEA 5f6f45
WSGI app 0 (mountpoint='') ready in 1 seconds on interpreter 0x1433080 pid: 18303 (default app)
spawned uWSGI master process (pid: 18303)
spawned uWSGI worker 1 (pid: 18308, cores: 1)
DAMN ! worker 1 (pid: 18308) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 1 (new pid: 18334)

So, for whatever reason, the uWSGI worker died, and was respawned, which seems to leave the site in a "Service Unavailable" state for some time.

During testing, it did seem that once the site came back up we could issue a number of requests of specific types, such as logging in, logging out, listing tasks, and viewing the details of some tasks, without the site going down. But viewing the details of certain other tasks would bring the site down again. We cannot say for sure that there is a correlation between those particular tasks and the outages, but it seems possible.
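One way to test that suspicion more systematically would be a small probe that walks the task detail pages and flags the ones that time out. A rough sketch using the requests library; the ID range is a placeholder, and this is obviously better pointed at stage or a local instance than at prod:

# probe_tasks.py - hypothetical helper for finding task pages that hang the site
import requests

BASE = "https://oneanddone.mozilla.org/en-US/tasks/%d/"

for task_id in range(1, 101):  # placeholder range; substitute real task IDs
    try:
        resp = requests.get(BASE % task_id, timeout=15)
        print("%d -> %s" % (task_id, resp.status_code))
    except requests.RequestException as exc:
        # A timeout or dropped connection makes this task a candidate for the
        # "viewing it kills the worker" pattern described above.
        print("%d -> FAILED: %s" % (task_id, exc))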

I have backed out the code on prod for now, and the site seems to be available and working fine again. I never get any emails or errors when this happens; all I can see is that message in the log. If anyone has any advice on how I might debug this or narrow down the cause, I would appreciate it.

Shyam, we briefly discussed this on IRC today. I would very much like to get to the bottom of this, and am willing to help debug it with some assistance. Here are some thoughts:

1. Is there a log somewhere that will tell us why this worker process is being killed? I am not getting any error messages from Python, and all I see in the app log is what I pasted above.
2. I read something on Stack Overflow [1] about this likely being due to the app running out of memory (and Laura mentioned that too). Is there a way to monitor or profile the memory usage of the worker to figure out if and why this is happening? (See the sketch after this list for the kind of thing I mean.)
3. Related to 2 above, perhaps just increasing the amount of memory the app can use would address this, but I am wary of doing that indiscriminately. I'm also not sure _how_ to do it because, as far as I can see, the uWSGI config is hidden from me and not available for me to configure.
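
Regarding point 2, this is the kind of monitoring I have in mind, purely as a sketch: it assumes psutil can be installed alongside the app and that we can learn the current worker PID (for example from the log lines above). uWSGI also has a built-in memory-report option, if the platform lets us set uWSGI options at all.

# watch_worker_memory.py - hypothetical RSS watcher; pass it a uWSGI worker PID
import sys
import time

import psutil

def watch(pid, interval=5):
    proc = psutil.Process(pid)
    while True:
        rss_mb = proc.memory_info().rss / (1024.0 * 1024.0)
        print("pid=%d rss=%.1f MB" % (pid, rss_mb))
        time.sleep(interval)

if __name__ == "__main__":
    watch(int(sys.argv[1]))

If the RSS climbs steadily right up until the "killed by signal 9" line, that would point strongly at the out-of-memory theory.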

I look forward to hearing back from you when you have some time to look at this. I would very much like to address this as soon as possible as it is preventing me from upgrading the app to a modern (and supported) version of Django.

[1] http://stackoverflow.com/questions/25414902/how-to-find-out-why-uwsgi-kill-workers
Flags: needinfo?(smani)

I had another thought about this: as it is only happening on prod and not on dev or stage, it would make sense to compare the configuration of the prod PaaS with that of the stage PaaS to see if anything jumps out that could be changed to resolve this issue. Unfortunately I don't have access to either configuration, nor the ability to change them, so I would need help with this from someone who does.
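If someone with access can dump both environments to text files (I believe the Stackato client can list an app's environment and settings, though I may be wrong about the exact commands), the comparison itself is easy; a sketch with made-up file names:

# diff_envs.py - hypothetical comparison of two saved config/env dumps
import difflib
import sys

with open("stage-env.txt") as f:
    stage = f.readlines()
with open("prod-env.txt") as f:
    prod = f.readlines()

for line in difflib.unified_diff(stage, prod,
                                 fromfile="stage-env.txt", tofile="prod-env.txt"):
    sys.stdout.write(line)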
Summary: One and Done production is down → Unable to successfully deploy One and Done to production PAAS
(Assignee)

Updated

3 years ago
Assignee: server-ops-webops → smani
Flags: needinfo?(smani)

Updated

3 years ago
Depends on: 1180767
By doing some local debugging with the production database, I was able to trace this to a bug in the bleach package we were using, which caused an infinite loop that kept consuming more and more memory. Upgrading the package to the latest version from GitHub solved the problem, so I am now able to deploy successfully.
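For anyone needing to hunt down a record like this, one workable approach is to feed each task's content through bleach locally with a timeout, so the record that triggers the infinite loop identifies itself. A sketch only; the model and field names are illustrative, and bleach.clean stands in for whatever helper the app actually calls:

# find_hanging_record.py - hypothetical; run from "python manage.py shell" so the ORM is configured.
# Unix-only, since it relies on SIGALRM to interrupt the loop.
import signal

import bleach
from oneanddone.tasks.models import Task  # assumed model path

class Timeout(Exception):
    pass

def _on_alarm(signum, frame):
    raise Timeout()

signal.signal(signal.SIGALRM, _on_alarm)

for task in Task.objects.all():
    signal.alarm(5)  # give bleach five seconds per record
    try:
        bleach.clean(task.instructions or "")  # stand-in for the app's real bleach call
    except Timeout:
        print("task %d never returns from bleach" % task.pk)
    finally:
        signal.alarm(0)  # cancel any pending alarm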

Therefore I am going to close this bug, but I believe we should still migrate the site off the PaaS, as it doesn't seem to be supported. I will leave bug 1180767 open and am happy to contribute my time to working on that bug.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED