693078 - Army of Awesome website returning an error

Reporter

Description

•

13 years ago

The site was reported down about 40 minutes ago. I don't know how long it was down before that.

Michael Burns [:mburns]

Updated

•

13 years ago

Assignee: server-ops → mburns

Michael Burns [:mburns]

Updated

•

13 years ago

Assignee: mburns → nobody

Component: Server Operations → Army of Awesome

Product: mozilla.org → support.mozilla.com

QA Contact: cshields → community-care

Version: other → unspecified

James Socol [:jsocol, :james]

Comment 1

•

13 years ago

The fastest way to bring the site back up is for someone in IT to run the following:

$ python26 manage.py shell
>>> from django.core.cache import cache
>>> from django.conf import settings
>>> cache.delete_many([settings.CC_TOP_CONTRIB_CACHE_KEY, settings.CC_TWEET_ACTIVITY_CACHE_KEY])

This is a result of bad data coming from Metrics, so additionally please stop (for now) the "get_customercare_stats" cron (it may be running on both the admin box and sumocelery1).

I will file a follow up to handle bad data better, as well as a metrics bug to understand why the data is coming in wrong, but both of those will take time. This will get AoA back up in the meantime.
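
For illustration only, here is a minimal sketch of what "handling bad data better" could look like: validate the incoming stats before they replace the cached copy, so a bad payload leaves the last good value in place. The payload shape (a list of dicts with a `username` key), the helper names, and the `DictCache` stand-in are all assumptions made for this example; this is not the fix that was actually pushed.

```python
# Illustrative sketch: payload shape and all names here are hypothetical,
# not taken from the real SUMO code.

def looks_valid(payload):
    """Return True if payload is a non-empty list of dicts with a 'username' key."""
    if not isinstance(payload, list) or not payload:
        return False
    return all(isinstance(row, dict) and "username" in row for row in payload)


def store_if_valid(cache, key, payload):
    """Write payload to the cache only when it passes validation.

    Returns True if the cache was updated, False if the old value was kept.
    """
    if not looks_valid(payload):
        return False
    cache.set(key, payload)
    return True


class DictCache:
    """Tiny in-memory stand-in for a cache backend, used only in this example."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)


cache = DictCache()
store_if_valid(cache, "top_contrib", [{"username": "alice"}])  # accepted
store_if_valid(cache, "top_contrib", None)  # rejected, old value is kept
```

With a guard like this, a bad run of the cron would skip the write instead of replacing good cached data with something the page cannot render.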

Assignee: nobody → server-ops

Component: Army of Awesome → Server Operations: Web Operations

Product: support.mozilla.com → mozilla.org

QA Contact: community-care → cshields

Version: unspecified → other

James Socol [:jsocol, :james]

Comment 2

•

13 years ago

Please text or call me on my cell if the above doesn't work.

Severity: critical → blocker

Michael Burns [:mburns]

Updated

•

13 years ago

Assignee: server-ops → mburns

Swarnava Sengupta (:Swarnava)

Comment 3

•

13 years ago

When will it be fixed? :|

Michael Burns [:mburns]

Comment 4

•

13 years ago

James, thanks for that insight. The site is back up and running now.

Status: NEW → RESOLVED

Closed: 13 years ago

Resolution: --- → FIXED

Swarnava Sengupta (:Swarnava)

Comment 5

•

13 years ago

This bug is back again:

https://support.mozilla.com/en-US/forums/contributors/707685

James Socol [:jsocol, :james]

Comment 6

•

13 years ago

Thanks Michael!

Swarnava, like I said in comment 1, and in that thread, this can happen from time to time when we get bad data from metrics. The statistics will come back when we get that sorted out.

Status: RESOLVED → VERIFIED

Swarnava Sengupta (:Swarnava)

Comment 7

•

13 years ago

The problem is back again :|

James Socol [:jsocol, :james]

Comment 8

•

13 years ago

Apparently the cron in comment 1 was not disabled, and it horked itself again.

Please re-run the steps in comment 1 and disable the "get_customercare_stats" cron.

Assignee: mburns → server-ops

Status: VERIFIED → REOPENED

Resolution: FIXED → ---

Ashish Vijayaram [:ashish]

Updated

•

13 years ago

Assignee: server-ops → ashish

Ashish Vijayaram [:ashish]

Comment 9

•

13 years ago

Site is back up now. Commented out the get_customercare_stats cron job on sumocelery1.

James Socol [:jsocol, :james]

Comment 10

•

13 years ago

Thanks Ashish!

Before we close this, we should have someone familiar with SUMO's deploy process make sure this cron isn't going to get turned on again before we push a fix in code. Unless something has changed, I think that's still possible.

Ashish Vijayaram [:ashish]

Comment 11

•

13 years ago

:jsocol - Agreed. I've updated the next SUMO push Bug 691793 so that this doesn't get lost.

James Socol [:jsocol, :james]

Comment 12

•

13 years ago

It's down again. Please rerun the commands in comment 1. We really, really need to turn the cron off, for real.

Assignee: ashish → server-ops

Michael Burns [:mburns]

Comment 13

•

13 years ago

Re-ran the quickfix commands.

Assignee: server-ops → mburns

James Socol [:jsocol, :james]

Comment 14

•

13 years ago

It's going to keep breaking if we can't fix the cron. It's clearly either running in two places (sumocelery and the admin box) or it's getting restored, or both. Jake Maul, Shyam, Jeremy, or Jason should be familiar enough with SUMO deploys to help.

Michael Burns [:mburns]

Comment 15

•

13 years ago

Agreed. I tracked the cron job down (set to run every 6 hours) and commented it out. We can coordinate with the SUMO folks to turn it back on.

root@sumocelery1:/data/www/support.mozilla.com/kitsune/scripts/crontab/crontab.tpl
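
As a rough illustration of "commenting it out", the sketch below prefixes `#` to any crontab line that mentions a given job and is not already commented. The sample crontab lines are hypothetical (only the job name and the 6-hour schedule come from this bug), and, as comment 26 later explains, an edit made only on sumocelery1 does not persist, so treat this as a picture of the text change rather than a deploy procedure.

```python
def comment_out_job(crontab_text, job_name):
    """Return crontab_text with every active line mentioning job_name commented out."""
    out = []
    for line in crontab_text.splitlines(keepends=True):
        # Skip lines that are already comments so the function is idempotent.
        if job_name in line and not line.lstrip().startswith("#"):
            line = "#" + line
        out.append(line)
    return "".join(out)


# Hypothetical template contents; the real crontab.tpl is not shown in this bug.
sample = (
    "0 */6 * * * python26 manage.py cron get_customercare_stats\n"
    "*/5 * * * * python26 manage.py cron some_other_job\n"
)

disabled = comment_out_job(sample, "get_customercare_stats")
print(disabled)
```

Running the function a second time on its own output changes nothing, which makes it safe to reapply after a push that may or may not have restored the line.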

Status: REOPENED → RESOLVED

Closed: 13 years ago → 13 years ago

Resolution: --- → FIXED

James Socol [:jsocol, :james]

Updated

•

13 years ago

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Shyam Mani [:fox2mike]

Comment 16

•

13 years ago

Looks like this is bust again.

Also looks like crons are generated from code, so can we please get this removed from the script for now? Everything is from git so removing that will certainly help solve this faster.

James Socol [:jsocol, :james]

Comment 17

•

13 years ago

(In reply to Shyam Mani [:fox2mike] from comment #16)
> Looks like this is bust again.

And again, or still.

> Also looks like crons are generated from code, so can we please get this
> removed from the script for now? Everything is from git so removing that
> will certainly help solve this faster.

By the time we have our next push, we'll have worked around the problem. There can't *not* be a way to turn this off, temporarily.

The uptime over the past few days has been ridiculous because we can't figure out how to make a cron not automatically restore itself. What is wrong with our deploy infrastructure that this is impossible? The only thing we need to do to keep this site up right now is disable this cron and keep it disabled, and re-run the steps in comment 1.

Assignee: mburns → server-ops

Ashish Vijayaram [:ashish]

Updated

•

13 years ago

Assignee: server-ops → ashish

Swarnava Sengupta (:Swarnava)

Comment 18

•

13 years ago

When it's back, the statistics still don't come up :|

Ashish Vijayaram [:ashish]

Comment 19

•

13 years ago

oremj - assigning this bug to you, please take a look at this while pushing out Bug 691793. Thanks!

Assignee: ashish → jeremy.orem+bugs

Jeremy Orem [:oremj]

Assignee

Comment 20

•

13 years ago

Looks like someone fixed this already. We can enable it again after the push, if it is ready.

Status: REOPENED → RESOLVED

Closed: 13 years ago → 13 years ago

Resolution: --- → FIXED

James Socol [:jsocol, :james]

Comment 21

•

13 years ago

(In reply to Jeremy Orem [oremj@mozilla.com] from comment #20)
> Looks like someone fixed this already. We can enable it again after the
> push, if it is ready.

Hopefully it won't re-enable itself again. We have a fix ready to go out this afternoon so we can restart the cron after our push.

Jeremy Orem [:oremj]

Assignee

Comment 22

•

13 years ago

It looks like someone finally fixed the re-enable this morning.

Swarnava Sengupta (:Swarnava)

Comment 23

•

13 years ago

In the right tab of the statistics, yesterday was missing, along with the Twitter contributors!

William Reynolds [:williamr]

Comment 24

•

13 years ago

(In reply to Swarnava Sengupta (:Swarnava) from comment #23)
> In the right tab of the statistics, yesterday was missing, along with the
> Twitter contributors!

Hi Swarnava, the list of top contributors in the sidebar should be appearing again shortly. More background information can be found in James' comment on the SUMO community discussion forum.

https://support.mozilla.com/en-US/forums/contributors/707707?last=43005

Apologies for the inconveniences the past few days.

Swarnava Sengupta (:Swarnava)

Comment 25

•

13 years ago

Thanks for the link.

Jake Maul [:jakem]

Comment 26

•

13 years ago

I fixed this yesterday, but forgot to update this bug.

The cron needed to be commented out on the copy on mradm02, then committed to the internal git repo there. From there, it will propagate out to ip-admin02 (phx admin node), then to sumocelery1, where it will be made "live" (in this case, live meaning commented out).

See Bug 669388#c4 for an explanation of how this all works.


I modified the 'www' copy on mradm02. This will be overwritten on the next push, which will automatically result in the job being re-enabled.

James Socol [:jsocol, :james]

Comment 27

•

13 years ago

FTR, we pushed a change that fixes this on our end, so the cron can continue running as normal.

But it's good that we have a strategy for turning them off--and keeping them that way--in the future!

Nobody; OK to take it and work on it

Updated

•

11 years ago

Component: Server Operations: Web Operations → WebOps: Other

Product: mozilla.org → Infrastructure & Operations

BMO Automation

Updated

•

5 years ago

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard