Closed Bug 396617 Opened 13 years ago Closed 12 years ago
Investigate source(s) of Litmus slowdowns
Tomcat: I have noticed of late that Litmus can be very slow. The story we usually get is that whatever else is running on the machine is slowing things down. Can Litmus get upgraded hardware, or can we do something else to improve things here?
Confirmed: surfing to litmus.mozilla.org sometimes takes a long time, and it's a bad user experience. Maybe it's an issue with the shared VM?
OS: Mac OS X → All
Hardware: PC → All
Assignee: ccooper → server-ops
Component: Litmus → Server Operations
Product: Webtools → mozilla.org
QA Contact: litmus → justin
Version: Trunk → other
I'm having a hard time identifying the bottleneck here, and would appreciate any advice people have. Memory usage doesn't seem to be the issue:

Mem:  514420k total, 110684k used, 403736k free,  1892k buffers
Swap: 524280k total,  21408k used, 502872k free, 21392k cached

And top reports CPU usage around 80% idle when I hit it with a bunch of requests at once. The same exact actions are sometimes quite speedy and sometimes dog slow, so I'm inclined to believe that slow MySQL queries aren't the issue here. Could this be related to activity by other users of the VM host or the db server?
When can we reboot and tweak configuration settings on this box?
We can be pretty flexible about downtime, as long as we don't do it during testdays or pre-release testing when everyone needs access to the tests. The easiest thing to do would probably be to ask in #as whether a time is OK for everyone. If you need me to generate some load on the server for diagnostics, let me know and I'd be happy to do so. Thanks for helping out with this.
So, the server that hosts the VM has plenty of RAM. I know you said that RAM probably wasn't the issue, but 512 MB is probably too low for any kind of production load. I doubled the RAM on the box, created a 2 GB local swap file, and gave it another CPU. Let's see if things look any better.
Stephen and I noticed that the server seemed to speed up right after Aravind rebooted it, but Stephen noted that even after that he was having issues with errors. I noticed that it is not quite as snappy with searches as it was right after it was restarted.
(In reply to comment #6)
> Stephen and I noticed that the server seemed to speed up right after Aravind
> rebooted it, but Stephen noted that even after that he was having issues with
> errors. I noticed that it is not quite as snappy with searches as it was right
> after it was restarted.

The errors I saw were all 500 internal server errors, and were most definitely post-upgrade, but I too did indeed see the speed boost, when I was able to log in.
At this point I don't think the problem is hardware-related, since the symptoms seem to be the same regardless of how much hardware we throw at it. Would it be possible to narrow this down to specific instances or usage scenarios? I'm guessing some kind of application memory leak or something like that, but I can't be sure; usage scenarios would help narrow it down.
Haven't heard from anyone yet; please reopen once you have concrete examples to replicate this.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → INCOMPLETE
Reopening: a connection to Litmus takes several minutes today (it started on the testday), community members have noticed it too, and Litmus reacts dog slow when I choose test runs or other actions.
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Some of this is *apparent* slowness. We do a lot of lookups to generate the index page, but we don't have to do them before display. I'm going to push all the big summations into AJAX (like the coverage already is), and that should improve the initial load time substantially. The index page and the test runs page are largely identical, so this will improve both.

However, that doesn't mean that nothing is broken at the server level.

Aravind: the slow query log isn't turned on in the db, at least not according to 'show variables'. Can we turn it on for the Litmus db if it's not already? If not permanently, then at least for the next testday. It's possible we're doing dumb things in SQL-land, and I'm not 100% trusting of Class::DBI either.
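For reference, enabling the slow query log on a MySQL server of that era is a my.cnf change followed by a server restart. A minimal sketch, assuming MySQL 5.0-era option names; the log path and the 2-second threshold are hypothetical choices, not values from this bug:

```ini
# my.cnf fragment (sketch; paths and thresholds are assumptions)
[mysqld]
log-slow-queries = /var/log/mysql/litmus-slow.log
long_query_time  = 2
# also log queries that do a full scan because no index matched (optional)
log-queries-not-using-indexes
```

The resulting log can then be summarized with the bundled mysqldumpslow tool to see which statements dominate.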
Priority: -- → P3
I landed the index/test run page display improvement this morning. Both pages display pretty quickly now. We could still use the slow query logs to investigate other parts of the interface for slowdowns.
Based on coop's comments, this is an app issue that needs to be optimized? If new hardware isn't needed, can I close this (or move it to the person who is working on it)?
Justin: no one's been able to nail down a set of reproducible conditions for the slowdown. I'm going to take this off of IT's plate for now, until we can say definitively that this isn't an app problem. Those slow query logs would still help a lot in diagnosing this, if we can get them turned on, please.
Assignee: aravind → ccooper
Status: REOPENED → NEW
Component: Server Operations → Litmus
Product: mozilla.org → Webtools
QA Contact: justin → litmus
Dropping the severity on this since I think the index/test run changes help a lot.
Severity: major → normal
Status: NEW → ASSIGNED
Summary: Litmus needs to run on better hardware (dog slow on some days) → Investigate source(s) of Litmus slowdowns
I used the YSlow extension from Yahoo to identify a few easy performance wins. I've got that code landed and pushed to the staging server already, and will push it to production later tonight once the testday stragglers trail off.
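The bug doesn't spell out which wins were landed, but the classic easy YSlow fixes are far-future Expires headers and on-the-wire compression for static assets. A hedged sketch of what that looks like in Apache 2.x configuration (not the actual Litmus change; requires mod_expires and mod_deflate to be loaded):

```
# Sketch only; cache lifetimes are illustrative assumptions.
ExpiresActive On
ExpiresByType text/css "access plus 1 week"
ExpiresByType application/x-javascript "access plus 1 week"
ExpiresByType image/png "access plus 1 month"

# Compress text responses before sending them to the browser
AddOutputFilterByType DEFLATE text/html text/css application/x-javascript
```

Longer cache lifetimes mean fewer repeat requests, at the cost of needing to version asset URLs when files change.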
Priority: P3 → P2
This code is in production now.
P3-ing this until I have a chance to do some diagnosis in the staging env.
Priority: P2 → P3
Seems to be reasonably peppy for me now. Please reopen or (better) file bugs on specific areas of slowness if necessary. Note: we'll use bug 401139 to track the server error/db disconnect issue which is not really related.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED