Closed
Bug 976600
Opened 10 years ago
Closed 10 years ago
MDN site performance is killer & unable to push changes
Categories
(Infrastructure & Operations Graveyard :: WebOps: Community Platform, task)
x86
macOS
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: groovecoder, Assigned: bburton)
References
Details
Response times are 50s and climbing: https://rpm.newrelic.com/accounts/263620/applications/3172075

Stage chief is stuck, so we can't even push changes to try to fix it: http://developeradm.private.scl3.mozilla.com/chief/developer.stage/logs/fe3caa7f75e7fa59ee5bf302f4a273638917e71f.1393345687
Reporter
Comment 1•10 years ago
Please join us in #mdndev to resolve.
Comment 2•10 years ago
Taking this until I can find someone from webops to act.
Updated•10 years ago
Assignee: server-ops-webops → rwatson
Reporter
Comment 3•10 years ago
Transaction profile indicates overall slowness for every part of the transaction: https://rpm.newrelic.com/accounts/263620/applications/3172075/transactions#id=279154078
Assignee
Updated•10 years ago
Assignee: rwatson → bburton
Assignee
Comment 4•10 years ago
Seeing high CPU load on the web servers and increased load on the DB. Investigating further and asking the on-call DBA to review database activity as well.
Status: NEW → ASSIGNED
Comment 5•10 years ago
Drained in Zeus and graceful'd httpd 25 minutes ago.
Reporter
Comment 6•10 years ago
https://github.com/mozilla/kuma/commit/18d6e74799415f8d388df758502520fc6266ba72 is a known-good commit, but the only differences between it and current master are a few front-end changes: https://github.com/mozilla/kuma/compare/18d6e7...master

So we may try updating our chief_deploy.py script to use `service httpd restart` (instead of graceful): https://github.com/mozilla/kuma/blob/master/scripts/chief_deploy.py#L67

Should we also have chief clear out all *.pyc files after deploy? We will need to push twice for it to have an effect.
Reporter
Updated•10 years ago
Flags: needinfo?(bburton)
Assignee
Comment 7•10 years ago
(In reply to Luke Crouch [:groovecoder] from comment #6)
> So, we may try to update our chief_deploy.py script to use `service httpd
> restart` (instead of graceful):
>
> https://github.com/mozilla/kuma/blob/master/scripts/chief_deploy.py#L67

* Committed in https://github.com/mozilla/kuma/commit/29c6f2f8e283f52f22c322a4e0d68346b52b4591
* I updated Chief to point directly to the files, instead of a symlink from the old file location

> Should we also have chief clear out all *.pyc files after deploy?

The part that does the pull on the webservers includes:

    /usr/bin/rsync -aq --delete \
        --delete-excluded \
        --exclude='.git' \
        --exclude='.svn' \
        --exclude='*.pyc' \
        --exclude='.hg' \
        developeradm.private.scl3.mozilla.com::developer/$SYNC_DIR /data/www/$SYNC_DIR

which should prevent .pyc files from being pushed, but we could add something like the following before the rsync runs:

    find settings.SRC_DIR -type f -iname '*.pyc' | xargs rm -rf
Flags: needinfo?(bburton)
Assignee
Comment 8•10 years ago
Wanted to note the webops actions taken so far:

* brought in the DBA to look at things at 9AM PST
* DBA and devs reviewed database activity and code changes, performed initial 'service httpd restart' manually at 9:27AM PST
* confirmed memcache is disabled for kuma right now, a known issue with kuma + memcache
* discussed disabling kumascript; it would cause more issues than it might fix
* reviewed the possibility of a submodule in https://github.com/mozilla/kuma-lib getting updated; devs confirmed all submodules are version-locked, last update was a few days ago
* 9:54AM - tried reverting the regex change from https://bugzilla.mozilla.org/show_bug.cgi?id=976328#c4 on developer1.webapp.phx1 to see if it helped; it did not have any noticeable impact and was reverted at 11:30AM PST
* did some quick analysis of top URLs to look for a possible traffic-spike problem:

      for url in $(cat access_2014-02-25-16 | cut -d' ' -f 7 | sort | uniq); do echo "$(grep $url access_2014-02-25-16 | wc -l) -- $url"; done > /tmp/access_2014-02-25-16_sorted.log

* top 20 URLs from 8-9AM PST: http://git.io/mrhjJw
* cyborgshadow (DBA) noted the issue started yesterday evening based on graphs; it was just exacerbated by morning PST traffic
* 10:20 - 11:00AM - devs rolled back to revert the commit which made database changes and manually reversed the migration; did a couple of 'service httpd restart's on apache during this process
Reporter
Comment 10•10 years ago
We saw a crazy amount of requests being dropped to 503, so we disabled KumaScript for now.

18:22:20 - davidwalsh: With no optimizely, we saw no change
18:22:25 - davidwalsh: With no analytics, we saw no change
18:22:33 - davidwalsh: With no kumascript, we've seen change
18:22:53 - _6a68: davidwalsh: yes, I had JS disabled on the client this whole time and I've been seeing intermittent 503s
18:22:56 - davidwalsh: While I did see errors outside of the wiki, it could have been caused by pain within the wiki
18:23:18 - davidwalsh: I never saw client-side redirects outside the wiki
18:23:27 - davidwalsh: Kumascript is only used inside the wiki
18:23:34 - davidwalsh: So I feel like we have a template issue somewhere
Reporter
Comment 11•10 years ago
FFR: To disable kumascript, set KUMASCRIPT_TIMEOUT to 0 at https://developer.mozilla.org/admin/constance/config/
Reporter
Comment 12•10 years ago
The problem was a combination of https://github.com/mozilla/kuma/blob/master/media/redesign/js/google-analytics.js#L66-L75 and https://github.com/mozilla/kuma/blob/master/media/redesign/js/wiki.js#L76-L81, which was making all page views with subnav items (i.e., a lot of them) infinitely redirect onto themselves. Fixed in https://github.com/mozilla/kuma/pull/2066/files by disabling the click-tracking until we give it a better selector.
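The redirect loop can be sketched without a browser. The following Node snippet is a hypothetical stand-in, not Kuma's actual code: `page`, `tabAnchor`, and `analyticsClickHandler` are invented names that model the behavior described here, where the delegated analytics handler compares the clicked anchor's hostname to the page's and forces a location change when they differ.

```javascript
// Hypothetical model of the page; `redirects` stands in for
// assignments to document.location in the real handler.
const page = { hostname: 'developer.mozilla.org', redirects: 0 };

// Simplified version of the delegated click handler described above:
// any anchor whose hostname differs from the page's is treated as an
// outbound link and "redirected".
function analyticsClickHandler(anchor) {
  if (anchor.hostname !== page.hostname) {
    page.redirects += 1;
    return 'redirect';
  }
  return 'tracked';
}

// A link whose hostname never matches the page's (as happens with an
// href-less anchor, which reports hostname '').
const tabAnchor = { hostname: '' };

// Every synthetic click on such an anchor ends in a redirect, and each
// resulting page load triggers the click again: a self-inflicted loop.
console.log(analyticsClickHandler(tabAnchor)); // 'redirect'

// A normal same-site link passes through untouched.
console.log(analyticsClickHandler({ hostname: 'developer.mozilla.org' })); // 'tracked'
```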
Reporter
Updated•10 years ago
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 13•10 years ago
I'm still not convinced the `trigger` in wiki.js caused the issue, but will be looking more into it before restoring the analytics tracking.
Comment 14•10 years ago
OK, I believe I found the actual issue: the legacy "htab" code that we carried forward from the previous MDN designs (and possibly MindTouch): https://github.com/mozilla/kuma/blob/master/media/redesign/js/wiki.js#L192

...ends in a click call: https://github.com/mozilla/kuma/blob/master/media/redesign/js/wiki.js#L204

The "A" elements which act as the compat table tab trigger have no HREF, and thus the hostnames don't match in the google-analytics delegation listener. We should do three things:

1. Merge this PR (https://github.com/mozilla/kuma/pull/2067) to use synthetic event names for triggers
2. Provide real HREFs for those compat table tabs (A elements)
3. The delegation listener in google-analytics.js should ensure that a hostname exists when comparing to the site hostname

The reason that disabling kumascript eased this is that compat tables weren't being generated. The left subnav / quick links *weren't* the root cause, because their link hostnames matched (unless users writing docs didn't provide HREFs for links, which is entirely possible, considering anyone can write a doc).

Kudos to :openjck for initially reverting this merge; he was right!
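Item 3 (guarding the hostname comparison) could look roughly like this; `isOutbound` is a hypothetical helper name for illustration, not the actual google-analytics.js code:

```javascript
// Hypothetical guard: only treat a click as outbound when the anchor
// actually has a hostname to compare. Anchors without an HREF (like the
// compat-table tabs) report hostname '' and should be skipped, not
// redirected.
function isOutbound(anchor, pageHostname) {
  if (!anchor.hostname) {
    return false; // no hostname to compare; never an outbound link
  }
  return anchor.hostname !== pageHostname;
}

console.log(isOutbound({ hostname: '' }, 'developer.mozilla.org'));                      // false
console.log(isOutbound({ hostname: 'developer.mozilla.org' }, 'developer.mozilla.org')); // false
console.log(isOutbound({ hostname: 'example.com' }, 'developer.mozilla.org'));           // true
```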
Comment 15•10 years ago
Of the list of action items in my previous message, #3 is most important, then #1, then #2.
Reporter
Comment 16•10 years ago
We effectively DDoS'd ourselves with client-side location redirects. Every <a> element on the site for which we triggered a click call was part of the problem - both htabs and subnav links. (Their href attributes are relative URLs, so this.hostname evaluates to '' in their click handler, while window.location.hostname evaluates to 'developer.mozilla.org'.)

https://bugzilla.mozilla.org/show_bug.cgi?id=976600#c14 is a good plan forward. Agree #3 is the most important. +1 :openjck
Flags: needinfo?(jkarahalis)
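Item 1 from comment 14 (synthetic event names for triggers) can be illustrated with a minimal event emitter. The emitter and the `mdn:click` name below are illustrative only, not Kuma's actual jQuery code; the point is that a namespaced synthetic event never reaches listeners bound to the real 'click' event.

```javascript
// Tiny event emitter to model the pattern; `on` registers a handler,
// `trigger` fires all handlers for a given event name.
const handlers = {};
function on(name, fn) {
  (handlers[name] = handlers[name] || []).push(fn);
}
function trigger(name) {
  (handlers[name] || []).forEach(fn => fn());
}

let analyticsFired = 0;
let tabActivated = 0;

// The analytics delegation listens for real clicks...
on('click', () => { analyticsFired += 1; });
// ...while the tab code listens for a synthetic, namespaced event.
on('mdn:click', () => { tabActivated += 1; });

// Triggering the synthetic name activates the tab without ever
// touching the analytics click handler.
trigger('mdn:click');
console.log(analyticsFired, tabActivated); // 0 1
```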
Reporter
Comment 17•10 years ago
Oops, only meant to cc :openjck, not needinfo?
Flags: needinfo?(jkarahalis)
Reporter
Comment 18•10 years ago
:solarce, can we create an alert (New Relic or otherwise) for when there's a sudden spike in the number of requests and/or errors? I don't see an easy way to do it in New Relic.
Flags: needinfo?(bburton)
Assignee
Comment 19•10 years ago
(In reply to Luke Crouch [:groovecoder] from comment #18)
> :solarce, can we create an alert (New Relic or otherwise) when there's a
> sudden spike in number of requests and/or errors? I don't see an easy way to
> do it in New Relic?

There is some alerting support in New Relic based on Apdex or error rate: https://docs.newrelic.com/docs/alert-policies/alerting-in-new-relic

You can see a default (disabled) config in https://rpm.newrelic.com/accounts/263620/application_alert_policies?search[q]=developer

As far as configuring the alerts and who receives them, you'd want to open a bug in https://bugzilla.mozilla.org/enter_bug.cgi?product=mozilla.org&component=Server%20Operations to discuss the specifics with the MOC: how it might integrate with Nagios, and how on-call would respond to the alerts. The MOC currently owns the monitoring pipeline and is the first responder to alerts.
Flags: needinfo?(bburton)
Updated•6 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard