Closed Bug 976600 Opened 10 years ago Closed 10 years ago

MDN site performance is killer & unable to push changes

Categories

(Infrastructure & Operations Graveyard :: WebOps: Community Platform, task)

Platform: x86 macOS
Type: task
Priority: Not set
Severity: blocker
Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: groovecoder, Assigned: bburton)


Please join us in #mdndev to resolve.
Taking this until I can find someone from webops to act.
Assignee: server-ops-webops → rwatson
Transaction profile indicates overall slowness for every part of the transaction:

https://rpm.newrelic.com/accounts/263620/applications/3172075/transactions#id=279154078
Assignee: rwatson → bburton
Seeing high CPU load on the web servers and increased load on the DB; investigating further and asking the on-call DBA to review database activity as well.
Status: NEW → ASSIGNED
Drained in Zeus and graceful'd httpd 25 min ago.
https://github.com/mozilla/kuma/commit/18d6e74799415f8d388df758502520fc6266ba72 is a known-good commit.

But the only differences between it and the current master are a few front-end changes:

https://github.com/mozilla/kuma/compare/18d6e7...master

So, we may try to update our chief_deploy.py script to use `service httpd restart` (instead of graceful):

https://github.com/mozilla/kuma/blob/master/scripts/chief_deploy.py#L67

Should we also have chief clear out all *.pyc files after deploy?

We will need to push twice for it to have an effect.
Flags: needinfo?(bburton)
(In reply to Luke Crouch [:groovecoder] from comment #6)

> So, we may try to update our chief_deploy.py script to use `service httpd
> restart` (instead of graceful):
> 
> https://github.com/mozilla/kuma/blob/master/scripts/chief_deploy.py#L67
> 

* Committed in https://github.com/mozilla/kuma/commit/29c6f2f8e283f52f22c322a4e0d68346b52b4591
* I updated Chief to point directly to the files, instead of a symlink from the old file location

> Should we also have chief clear out all *.pyc files after deploy?
> 
The part of the deploy that does the pull on the webservers includes:

/usr/bin/rsync -aq --delete \
    --delete-excluded \
    --exclude='.git' \
    --exclude='.svn' \
    --exclude='*.pyc' \
    --exclude='.hg' \
    developeradm.private.scl3.mozilla.com::developer/$SYNC_DIR /data/www/$SYNC_DIR

which should prevent .pyc files from being pushed, but we could add something like the following, run before the rsync:

# SRC_DIR is the checkout that settings.SRC_DIR points at
find "$SRC_DIR" -type f -iname '*.pyc' -delete
Flags: needinfo?(bburton)
Wanted to note the webops actions taken so far

* brought in DBA to look at things at 9AM PST
* DBA and devs reviewed database activity and code changes, performed initial 'service httpd restart' manually at 9:27AM PST
 * confirmed memcache is disabled for kuma right now, a known issue with kuma + memcache
* discussed disabling kumascript; it would cause more issues than it would possibly fix
* reviewed the possibility of a submodule in https://github.com/mozilla/kuma-lib having been updated; devs confirmed all submodules are version-locked, last update was a few days ago
* 9:54AM - tried reverting the regex change from https://bugzilla.mozilla.org/show_bug.cgi?id=976328#c4 on developer1.webapp.phx1 to see if it helped; it did not have any noticeable impact, later reverted at 11:30AM PST
* did some quick analysis of top URLs to look for possible traffic spike problem
 * for url in $(cut -d' ' -f 7 access_2014-02-25-16 | sort | uniq); do echo "$(grep -c -- "$url" access_2014-02-25-16) -- $url"; done > /tmp/access_2014-02-25-16_sorted.log
 * top 20 urls from 8-9am pst: http://git.io/mrhjJw
* cyborgshadow (DBA) noted that, based on graphs, the issue started yesterday evening and was just exacerbated by morning PST traffic
* 10:20 - 11:00AM - devs rolled back the commit which made database changes and manually reversed the migration, doing a couple of 'service httpd restart's on apache during this process
Depends on: 976847
We saw a huge number of requests failing with 503s, so we disabled KumaScript for now.

18:22:20 - davidwalsh: With no optimizely, we saw no change
18:22:25 - davidwalsh: With no analytics, we saw no change
18:22:33 - davidwalsh: With no kumascript, we've seen change
18:22:53 - _6a68: davidwalsh: yes, I had JS disabled on the client this whole time I've been seeing intermittent 503s
18:22:56 - davidwalsh: While I did see errors outside of the wiki, it could have been caused by pain within the wiki
18:23:18 - davidwalsh: I never saw client-side redirects outside the wiki
18:23:27 - davidwalsh: Kumascript is only used inside the wiki
18:23:34 - davidwalsh: So I feel like we have a template issue somewhere
FFR: To disable kumascript, set KUMASCRIPT_TIMEOUT to 0 at https://developer.mozilla.org/admin/constance/config/
The problem was a combination of https://github.com/mozilla/kuma/blob/master/media/redesign/js/google-analytics.js#L66-L75 and https://github.com/mozilla/kuma/blob/master/media/redesign/js/wiki.js#L76-L81, which together made all page views with subnav items (i.e., a lot of them) redirect to themselves indefinitely.

Fixed in https://github.com/mozilla/kuma/pull/2066/files by disabling the click-tracking until we can give it a better selector.
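
To make the failure mode concrete, here's a simplified, hypothetical sketch of the interaction (the selectors, event names, and GA calls below are illustrative, not the exact contents of google-analytics.js or wiki.js):

// Delegated click listener (google-analytics.js-style): links whose hostname
// doesn't match the site's are treated as outbound, so navigation is deferred
// until the analytics hit has been sent.
$(document).on('click', 'a', function() {
  if (this.hostname !== window.location.hostname) {
    var href = this.href; // '' when the anchor has no href attribute
    _gaq.push(['_trackEvent', 'Outbound link', 'click', href]);
    setTimeout(function() {
      window.location = href; // assigning '' reloads the current page
    }, 100);
    return false;
  }
});

// wiki.js-style page setup: fire a synthetic click on a subnav / tab anchor
// so the right section is shown (illustrative selector).
$('.subnav a, .htab a').first().trigger('click');

// Net effect: every page load fires the synthetic click, the listener treats the
// href-less anchor as outbound and redirects the page to itself, and the cycle
// repeats on the next load.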
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
I'm still not convinced the `trigger` in wiki.js caused the issue, but will be looking more into it before restoring the analytics tracking.
OK, I believe I found the actual issue: the legacy "htab" code that we carried forward from the previous MDN designs (and possibly mindtouch):

https://github.com/mozilla/kuma/blob/master/media/redesign/js/wiki.js#L192

...ends in a click call:

https://github.com/mozilla/kuma/blob/master/media/redesign/js/wiki.js#L204

The "A" elements which act as the compat table tab triggers have no HREF, so their hostnames don't match in the google-analytics delegation listener.  We should do three things:

1.  Merge this PR (https://github.com/mozilla/kuma/pull/2067) to use synthetic event names for triggers

2.  Provide real HREFs for those compat table tabs (A elements)

3.  The delegation listener in google-analytics.js should ensure that a hostname exists when comparing to the site hostname

The reason that disabling kumascript eased the problem is that compat tables weren't being generated.  The left subnav / quick links *weren't* the root cause because their link hostnames matched (unless users writing docs didn't provide HREFs for links, which is entirely possible, considering anyone can write a doc).
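
A rough sketch of the guard in #3 (illustrative only, not the actual patch) would be to bail out early unless the anchor has a real href and a non-empty hostname that differs from the site's:

$(document).on('click', 'a', function(event) {
  var offSite = this.href &&
                this.hostname &&                          // '' for href-less anchors
                this.hostname !== window.location.hostname;

  if (!offSite) {
    return; // synthetic clicks on href-less tabs fall straight through
  }

  // ...only genuine outbound links get the track-then-redirect treatment
});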

Kudos to :openjck for reverting this merge initially, he was right!
Of the list of action items in my previous message, #3 is most important, then #1, then #2.
We effectively DDOS'd ourselves with client-side location redirects.

Every <a> element on the site for which we triggered a click call was part of the problem - both htabs and subnav links. (Their href attributes are relative URLs, so this.hostname evaluates to '' in their click handler, while window.location.hostname evaluates to 'developer.mozilla.org'.)

https://bugzilla.mozilla.org/show_bug.cgi?id=976600#c14 is a good plan forward. Agree #3 is the most important. +1 :openjck
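
For anyone retracing this, a quick console snippet (a hypothetical debugging aid, not part of any fix) to count the anchors on a page that would fail the hostname comparison:

var suspects = Array.prototype.filter.call(document.querySelectorAll('a'), function(a) {
  return a.hostname !== window.location.hostname;
});
console.log(suspects.length + ' anchors would be treated as off-site', suspects.slice(0, 5));

On an affected wiki page this should include the href-less compat-table tabs described above.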
Flags: needinfo?(jkarahalis)
Oops, only meant to cc :openjck, not needinfo?
Flags: needinfo?(jkarahalis)
:solarce, can we create an alert (New Relic or otherwise) when there's a sudden spike in number of requests and/or errors? I don't see an easy way to do it in New Relic?
Flags: needinfo?(bburton)
(In reply to Luke Crouch [:groovecoder] from comment #18)
> :solarce, can we create an alert (New Relic or otherwise) when there's a
> sudden spike in number of requests and/or errors? I don't see an easy way to
> do it in New Relic?

There is some alerting support in New Relic based on Apdex or error rate: https://docs.newrelic.com/docs/alert-policies/alerting-in-new-relic

You can see a default (disabled) config in https://rpm.newrelic.com/accounts/263620/application_alert_policies?search[q]=developer

As far as configuring the alerts and who receives them, you'd want to open a bug in https://bugzilla.mozilla.org/enter_bug.cgi?product=mozilla.org&component=Server%20Operations to discuss the specifics with the MOC, how it might integrate with Nagios, and how on-call would respond to the alerts, as the MOC currently owns the monitoring pipeline and is the first responder to alerts.
Flags: needinfo?(bburton)
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard