Closed Bug 976600 Opened 10 years ago Closed 10 years ago

MDN site performance is killer & unable to push changes

Categories

(Infrastructure & Operations Graveyard :: WebOps: Community Platform, task)

Platform: x86 macOS
Type: task
Priority: Not set
Severity: blocker
Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: groovecoder, Assigned: bburton)


Please join us in #mdndev to resolve.
Taking this until I can find someone from webops to act.
Assignee: server-ops-webops → rwatson
Transaction profile indicates overall slowness for every part of the transaction:

https://rpm.newrelic.com/accounts/263620/applications/3172075/transactions#id=279154078
Assignee: rwatson → bburton
Seeing high CPU load on the web servers and increased load on the DB; investigating further and asking the on-call DBA to review database activity as well.
Status: NEW → ASSIGNED
Drained in Zeus and graceful'd httpd 25 min ago.
https://github.com/mozilla/kuma/commit/18d6e74799415f8d388df758502520fc6266ba72 is a known-good commit.

But the only differences between it and the current master are a few front-end changes:

https://github.com/mozilla/kuma/compare/18d6e7...master

So, we may try to update our chief_deploy.py script to use `service httpd restart` (instead of graceful):

https://github.com/mozilla/kuma/blob/master/scripts/chief_deploy.py#L67

Should we also have chief clear out all *.pyc files after deploy?

We will need to push twice for it to have an effect.
Flags: needinfo?(bburton)
(In reply to Luke Crouch [:groovecoder] from comment #6)

> So, we may try to update our chief_deploy.py script to use `service httpd
> restart` (instead of graceful):
> 
> https://github.com/mozilla/kuma/blob/master/scripts/chief_deploy.py#L67
> 

* Committed in https://github.com/mozilla/kuma/commit/29c6f2f8e283f52f22c322a4e0d68346b52b4591
* I updated Chief to point directly to the files, instead of a symlink from the old file location

> Should we also have chief clear out all *.pyc files after deploy?
> 
The part of the deploy that does the pull on the webservers includes:

/usr/bin/rsync -aq --delete \
    --delete-excluded \
    --exclude='.git' \
    --exclude='.svn' \
    --exclude='*.pyc' \
    --exclude='.hg' \
    developeradm.private.scl3.mozilla.com::developer/$SYNC_DIR /data/www/$SYNC_DIR

which should prevent .pyc files from being pushed, but we could add something like the following, run before the rsync:

# SRC_DIR is the checkout that settings.SRC_DIR points at
find "$SRC_DIR" -type f -iname '*.pyc' -delete
Flags: needinfo?(bburton)
Wanted to note the webops actions taken so far

* brought in DBA to look at things at 9AM PST
* DBA and devs reviewed database activity and code changes, performed initial 'service httpd restart' manually at 9:27AM PST
 * confirmed memcache is disabled for kuma right now, a known issue with kuma + memcache
* discussed disabling kumascript; it would cause more issues than it would possibly fix
* reviewed the possibility of a submodule in https://github.com/mozilla/kuma-lib having been updated; devs confirmed all submodules are version-locked, last update was a few days ago
* 9:54AM - tried reverting the regex change from https://bugzilla.mozilla.org/show_bug.cgi?id=976328#c4 on developer1.webapp.phx1 to see if it helped; it did not have any noticeable impact, later reverted at 11:30AM PST
* did some quick analysis of top URLs to look for possible traffic spike problem
 * for url in $(cut -d' ' -f 7 access_2014-02-25-16 | sort | uniq); do echo "$(grep -c -- "$url" access_2014-02-25-16) -- $url"; done > /tmp/access_2014-02-25-16_sorted.log
 * top 20 urls from 8-9am pst: http://git.io/mrhjJw
* cyborgshadow (DBA) noted that, based on graphs, the issue started yesterday evening and was just exacerbated by morning PST traffic
* 10:20 - 11:00AM - devs rolled back the commit which made database changes and manually reversed the migration, doing a couple of 'service httpd restart's on apache during this process
Depends on: 976847
We saw a huge number of requests failing with 503s, so we disabled KumaScript for now.

18:22:20 - davidwalsh: With no optimizely, we saw no change
18:22:25 - davidwalsh: With no analytics, we saw no change
18:22:33 - davidwalsh: With no kumascript, we've seen change
18:22:53 - _6a68: davidwalsh: yes, I had JS disabled on the client this whole time I've been seeing intermittent 503s
18:22:56 - davidwalsh: While I did see errors outside of the wiki, it could have been caused by pain within the wiki
18:23:18 - davidwalsh: I never saw client-side redirects outside the wiki
18:23:27 - davidwalsh: Kumascript is only used inside the wiki
18:23:34 - davidwalsh: So I feel like we have a template issue somewhere
FFR: To disable kumascript, set KUMASCRIPT_TIMEOUT to 0 at https://developer.mozilla.org/admin/constance/config/
The problem was a combination of https://github.com/mozilla/kuma/blob/master/media/redesign/js/google-analytics.js#L66-L75 and https://github.com/mozilla/kuma/blob/master/media/redesign/js/wiki.js#L76-L81, which together made all page views with subnav items (i.e., a lot of them) redirect to themselves indefinitely.

Fixed in https://github.com/mozilla/kuma/pull/2066/files by disabling the click-tracking until we can give it a better selector.
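
To make the failure mode concrete, here's a simplified, hypothetical sketch of the interaction (the selectors, event names, and GA calls below are illustrative, not the exact contents of google-analytics.js or wiki.js):

// Delegated click listener (google-analytics.js-style): links whose hostname
// doesn't match the site's are treated as outbound, so navigation is deferred
// until the analytics hit has been sent.
$(document).on('click', 'a', function() {
  if (this.hostname !== window.location.hostname) {
    var href = this.href; // '' when the anchor has no href attribute
    _gaq.push(['_trackEvent', 'Outbound link', 'click', href]);
    setTimeout(function() {
      window.location = href; // assigning '' reloads the current page
    }, 100);
    return false;
  }
});

// wiki.js-style page setup: fire a synthetic click on a subnav / tab anchor
// so the right section is shown (illustrative selector).
$('.subnav a, .htab a').first().trigger('click');

// Net effect: every page load fires the synthetic click, the listener treats the
// href-less anchor as outbound and redirects the page to itself, and the cycle
// repeats on the next load.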
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
I'm still not convinced the `trigger` in wiki.js caused the issue, but will be looking more into it before restoring the analytics tracking.
OK, I believe I found the actual issue: the legacy "htab" code that we carried forward from the previous MDN designs (and possibly mindtouch):

https://github.com/mozilla/kuma/blob/master/media/redesign/js/wiki.js#L192

...ends in a click call:

https://github.com/mozilla/kuma/blob/master/media/redesign/js/wiki.js#L204

The "A" elements which act as the compat table tab triggers have no HREF, so their hostnames don't match in the google-analytics delegation listener.  We should do three things:

1.  Merge this PR (https://github.com/mozilla/kuma/pull/2067) to use synthetic event names for triggers

2.  Provide real HREFs for those compat table tabs (A elements)

3.  The delegation listener in google-analytics.js should ensure that a hostname exists when comparing to the site hostname

The reason that disabling kumascript eased the problem is that compat tables weren't being generated.  The left subnav / quick links *weren't* the root cause because their link hostnames matched (unless users writing docs didn't provide HREFs for links, which is entirely possible, considering anyone can write a doc).
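
A rough sketch of the guard in #3 (illustrative only, not the actual patch) would be to bail out early unless the anchor has a real href and a non-empty hostname that differs from the site's:

$(document).on('click', 'a', function(event) {
  var offSite = this.href &&
                this.hostname &&                          // '' for href-less anchors
                this.hostname !== window.location.hostname;

  if (!offSite) {
    return; // synthetic clicks on href-less tabs fall straight through
  }

  // ...only genuine outbound links get the track-then-redirect treatment
});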

Kudos to :openjck for reverting this merge initially, he was right!
Of the list of action items in my previous message, #3 is most important, then #1, then #2.
We effectively DDOS'd ourselves with client-side location redirects.

Every <a> element on the site for which we triggered a click call was part of the problem - both htabs and subnav links. (Their href attributes are relative URLs, so this.hostname evaluates to '' in their click handler, while window.location.hostname evaluates to 'developer.mozilla.org'.)

https://bugzilla.mozilla.org/show_bug.cgi?id=976600#c14 is a good plan forward. Agree #3 is the most important. +1 :openjck
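
For anyone retracing this, a quick console snippet (a hypothetical debugging aid, not part of any fix) to count the anchors on a page that would fail the hostname comparison:

var suspects = Array.prototype.filter.call(document.querySelectorAll('a'), function(a) {
  return a.hostname !== window.location.hostname;
});
console.log(suspects.length + ' anchors would be treated as off-site', suspects.slice(0, 5));

On an affected wiki page this should include the href-less compat-table tabs described above.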
Flags: needinfo?(jkarahalis)
Oops, only meant to cc :openjck, not needinfo?
Flags: needinfo?(jkarahalis)
:solarce, can we create an alert (New Relic or otherwise) when there's a sudden spike in number of requests and/or errors? I don't see an easy way to do it in New Relic?
Flags: needinfo?(bburton)
(In reply to Luke Crouch [:groovecoder] from comment #18)
> :solarce, can we create an alert (New Relic or otherwise) when there's a
> sudden spike in number of requests and/or errors? I don't see an easy way to
> do it in New Relic?

There is some alerting support in New Relic based on Apdex or error rate: https://docs.newrelic.com/docs/alert-policies/alerting-in-new-relic

You can see a default (disabled) config in https://rpm.newrelic.com/accounts/263620/application_alert_policies?search[q]=developer

As far as configuring the alerts and who receives them, you'd want to open a bug in https://bugzilla.mozilla.org/enter_bug.cgi?product=mozilla.org&component=Server%20Operations to discuss the specifics with the MOC, how it might integrate with Nagios, and how on-call would respond to the alerts, as the MOC currently owns the monitoring pipeline and is the first responder to alerts.
Flags: needinfo?(bburton)
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard