Army of Awesome website returning an error

RESOLVED FIXED

Status

Infrastructure & Operations
WebOps: Other
--
blocker
RESOLVED FIXED
7 years ago
5 years ago

People

(Reporter: verdi, Assigned: oremj)

(Reporter)

Description

7 years ago
The site was reported down about 40 minutes ago. I don't know how long it was down before that.
Assignee: server-ops → mburns
Assignee: mburns → nobody
Component: Server Operations → Army of Awesome
Product: mozilla.org → support.mozilla.com
QA Contact: cshields → community-care
Version: other → unspecified
The fastest way to bring the site back up is for someone in IT to run the following:

$ python26 manage.py shell
>>> from django.core.cache import cache
>>> from django.conf import settings
>>> cache.delete_many([settings.CC_TOP_CONTRIB_CACHE_KEY, settings.CC_TWEET_ACTIVITY_CACHE_KEY])

This is a result of bad data coming from Metrics, so additionally please stop (for now) the "get_customercare_stats" cron (it may be running on both the admin box and sumocelery1).

I will file a follow up to handle bad data better, as well as a metrics bug to understand why the data is coming in wrong, but both of those will take time. This will get AoA back up in the meantime.
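
As a rough illustration of what "handle bad data better" could mean, here is a minimal sketch that refuses to overwrite a cached value when the incoming payload fails validation. The thread does not say what shape the bad Metrics data took, so the field names, the function names, and the dict standing in for the cache backend are all assumptions for illustration, not kitsune code.

```python
# Hypothetical sketch: none of these names come from the kitsune codebase.
REQUIRED_FIELDS = ("username", "count")  # assumed shape of one contributor row


def is_valid_payload(payload):
    """Return True only for a non-empty list of dicts with the assumed fields."""
    if not isinstance(payload, list) or not payload:
        return False
    return all(
        isinstance(row, dict) and all(field in row for field in REQUIRED_FIELDS)
        for row in payload
    )


def update_cache(cache, key, payload):
    """Store payload under key only if it validates; otherwise keep the old value.

    Returns True when the cache was updated.
    """
    if not is_valid_payload(payload):
        return False
    cache[key] = payload
    return True


# A plain dict stands in for the real cache backend here.
cache = {"top_contrib": [{"username": "alice", "count": 12}]}
rejected = update_cache(cache, "top_contrib", None)  # bad data is ignored
accepted = update_cache(cache, "top_contrib", [{"username": "bob", "count": 3}])
```

With a guard like this, a bad run of the cron would leave the last good value in place instead of requiring a manual cache.delete_many to recover.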
Assignee: nobody → server-ops
Component: Army of Awesome → Server Operations: Web Operations
Product: support.mozilla.com → mozilla.org
QA Contact: community-care → cshields
Version: unspecified → other
Please text or call me on my cell if the above doesn't work.
Severity: critical → blocker
Assignee: server-ops → mburns
When will it be fixed? :|
James, thanks for that insight. The site is back up and running now.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
Thanks Michael!

Swarnava, like I said in comment 1, and in that thread, this can happen from time to time when we get bad data from metrics. The statistics will come back when we get that sorted out.
Status: RESOLVED → VERIFIED
problem is back again :|
Apparently the cron in comment 1 was not disabled and it horked itself again.

Please re-run the steps in comment 1 and disable the "get_customercare_stats" cron.
Assignee: mburns → server-ops
Status: VERIFIED → REOPENED
Resolution: FIXED → ---
Assignee: server-ops → ashish
Site back up now. Commented the get_customercare_stats cron job on sumocelery1.
Thanks Ashish!

Before we close this, we should have someone familiar with SUMO's deploy process make sure this cron isn't going to get turned on again before we push a fix in code. Unless something has changed, I think that's still possible.
:jsocol - Agreed. I've updated the next SUMO push Bug 691793 so that this doesn't get lost.
It's down again, please rerun the commands in comment 1, and we really, really need to turn the cron off, for real.
Assignee: ashish → server-ops
re-ran the quickfix commands.
Assignee: server-ops → mburns
It's going to keep breaking if we can't fix the cron. It's clearly either running in two places (sumocelery and the admin box) or it's getting restored, or both. Jake Maul, Shyam, Jeremy, or Jason should be familiar enough with SUMO deploys to help.
Agreed. I tracked the cron job down (set to run every 6 hours) and commented it out. We can coordinate with the SUMO folks to turn it back on.

root@sumocelery1:/data/www/support.mozilla.com/kitsune/scripts/crontab/crontab.tpl
Status: REOPENED → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Looks like this is bust again.

Also looks like crons are generated from code, so can we please get this removed from the script for now? Everything is from git, so removing that will certainly help solve this faster.
(In reply to Shyam Mani [:fox2mike] from comment #16)
> Looks like this is bust again.

And again, or still.

> Also looks like crons are generated from code, so can we please get this
> removed from the script for now? Everything is from git, so removing that
> will certainly help solve this faster.

By the time we have our next push, we'll have worked around the problem. There can't *not* be a way to turn this off, temporarily.

The uptime over the past few days has been ridiculous because we can't figure out how to keep a cron from automatically restoring itself. What is wrong with our deploy infrastructure that this is impossible? The only thing we need to do to keep this site up right now is disable this cron, keep it disabled, and re-run the steps in comment 1.
Assignee: mburns → server-ops
Assignee: server-ops → ashish
When it's back, the statistics still don't come up :|
oremj - assigning this bug to you, please take a look at this while pushing out Bug 691793. Thanks!
Assignee: ashish → jeremy.orem+bugs
(Assignee)

Comment 20

7 years ago
Looks like someone fixed this already. We can enable it after the push again, if it is ready.
Status: REOPENED → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
(In reply to Jeremy Orem [oremj@mozilla.com] from comment #20)
> Looks like someone fixed this already. We can enable it after the push
> again, if it is ready.

Hopefully it won't re-enable itself again. We have a fix ready to go out this afternoon so we can restart the cron after our push.
(Assignee)

Comment 22

7 years ago
It looks like someone finally fixed the re-enable this morning.
In the right tab of statistics, yesterday was missing along with the Twitter contributors!
(In reply to Swarnava Sengupta (:Swarnava) from comment #23)
> In the right tab of statistics, yesterday was missing along with the
> Twitter contributors!

Hi Swarnava, the list of top contributors in the sidebar should be appearing again shortly. More background information can be found in James' comment on the SUMO community discussion forum.

https://support.mozilla.com/en-US/forums/contributors/707707?last=43005

Apologies for the inconvenience over the past few days.
Thanks for the link.

Comment 26

7 years ago
I fixed this yesterday, but forgot to update this bug.

The cron needed to be commented out on the copy on mradm02, then committed to the internal git repo there. From there, it will propagate out to ip-admin02 (phx admin node), then to sumocelery1, where it will be made "live" (in this case, live meaning commented out).

See Bug 669388#c4 for an explanation of how this all works.


I modified the 'www' copy on mradm02. This will be overwritten on the next push, which will automatically result in the job being re-enabled.
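
For reference, "commenting out" a job in a crontab template can be sketched as below. This is only an illustration under stated assumptions: the real crontab.tpl is not shown in this bug, so the template lines and the helper name comment_out_job are hypothetical, and only the job name and the 6-hour schedule come from this thread. As described above, the edit has to be made on the upstream copy and committed there, or a later sync will overwrite it.

```python
def comment_out_job(crontab_text, job_name):
    """Prefix every active line mentioning job_name with '#'; leave the rest alone."""
    out = []
    for line in crontab_text.splitlines():
        stripped = line.strip()
        # Blank lines and lines that are already comments are not "active".
        active = bool(stripped) and not stripped.startswith("#")
        if active and job_name in line:
            out.append("#" + line)
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical template content; the real crontab.tpl is not shown in this bug.
template = (
    "0 */6 * * * python26 manage.py cron get_customercare_stats\n"
    "*/5 * * * * python26 manage.py cron some_other_job\n"
)
disabled = comment_out_job(template, "get_customercare_stats")
```

Running the helper a second time leaves the text unchanged, so it is safe to apply to a copy that has already been edited by hand.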
FTR, we pushed a change that fixes this on our end, so the cron can continue running as normal.

But it's good that we have a strategy for turning them off--and keeping them that way--in the future!
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations