Closed Bug 419688 Opened 13 years ago Closed 8 years ago

Litmus Performance is very slow

Categories

(Webtools Graveyard :: Litmus, defect, P1)

x86
All
defect

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: marcia, Unassigned)

References

Details

(Keywords: perf)

Seen using  Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9b4pre) Gecko/2008022604 Minefield/3.0b4pre.

STR:
1. Load the set of tests that comprise the FFT for Firefox, or click a link in Litmus.

Result: It is extremely slow, especially to load the entire set of subgroups for the FFT.

Tomcat says he noticed this over the weekend as well. Does having all the Japanese characters in the query cause any kind of overhead during the search?
Adding a new testcase is also slow, after I press the submit button there is a pronounced delay.
utf8 adds some overhead, but that's likely not the root cause.

Based on the circumstances you've outlined, it sounds like speed is an issue wherever:

* we load large lists into JS from the database, and then try to manipulate them, e.g. testcase management; and,
* we have many database calls on a single page, e.g. loading subgroups (with completion %) when running tests

A couple of strategies we could use to mitigate the slowness:

* for testcases, don't load them all at the outset and then filter. We could force the user to choose some criteria, say product and branch, before loading any testcases at all (see the rough sketch after this list).

* for subgroup completion %, load this info using AJAX rather than at initial page load, just like we do for test runs.
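A rough Python sketch of the first strategy, purely for illustration (Litmus itself is a Perl CGI app, and the table and column names below - testcases, product_id, branch_id - are assumptions, not the real schema). The server simply refuses to fetch anything until both filters are supplied, so the page never ships thousands of rows to the client:

import sqlite3

def fetch_testcases(conn, product_id=None, branch_id=None, limit=500):
    """Return testcases only once both filters are chosen."""
    if product_id is None or branch_id is None:
        # Force the user to pick product and branch first instead of loading
        # everything and then filtering it in client-side JS.
        return []
    cur = conn.execute(
        "SELECT testcase_id, summary FROM testcases "
        "WHERE product_id = ? AND branch_id = ? LIMIT ?",
        (product_id, branch_id, limit),
    )
    return cur.fetchall()

# Usage against an in-memory stand-in for the real database:
#   conn = sqlite3.connect(":memory:")
#   fetch_testcases(conn)  -> [] because no product/branch was chosen yet

The second strategy is the same idea in reverse: the initial page ships without completion percentages, and a small endpoint returns them on demand, so only users who expand a subgroup pay for that query.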
Severity: normal → major
coop: The other day when stephen asked me to move a few testcases into the 2.0 FFT run, I noticed that sorting the list by ID takes a really long time. I am not sure what can be done to improve this. I don't always need to sort, but as the number of test cases has grown, this list has probably grown a bit unwieldy, especially if all the other FFT test suite runs (Japanese) are commingled in this list (in this instance I was editing the subgroup Firefox FFT).
Status: NEW → ASSIGNED
OS: Mac OS X → All
Priority: P3 → P2
(In reply to comment #3)
> coop: The other day when stephen asked me to move a few testcases into the 2.0
> FFT run, I noticed that sorting the list by ID takes a real long time. I am not
> sure what can be done to improve this. I don't always need to sort, but as the
> number of test cases has grown, this list has probably grown a bit unwieldy,
> especially if all the other FFT test suites runs (Japanese) are comingled in
> this list (In this instance I was editing the subgroup Firefox FFT).

There are lots of testcases in that list -- in some cases >1000 -- so sorting in JS is going to be slow. :/

The load on the Litmus VM never goes above 0.01. One potential solution would be to move back to a local database to remove network latency and db server load from the equation.
Assignee: ccooper → nobody
Status: ASSIGNED → NEW
Priority: P2 → P3
it does seem slow. Filtering on manage testcases is frequently painful - especially given it's a serial process to pick product, then branch - for what I'm guessing is a smallish db.
Keywords: perf
(In reply to comment #6)
> it does seem slow. filter on manage testcases frequently painful - especially
> given it's a serial process to pick product, then branch - for what I'm
> guessing is a smallish db.

The filtering is happening on the client side in JS. That's the painfully slow part. 

The first step here should be to stop generating the coverage data inline every time it is needed. If we can move the coverage lookup offline (via cron, or on result submission) and put that data into its own database table, that *should* improve perceived performance across the whole site.
That last comment was from me, BTW, just using Marcia's bugzilla instance as I fought with wireless on my own laptop.

To elaborate a bit, we'll need a separate coverage table in the db, and then we can do a simple lookup of this data rather than calculating it on the fly on every pageload. 

My preferred way of populating the coverage would be to do a smart lookup for matching results on new result submission. If we already have a matching result for the same config *and* coverage numbers, we don't have to do any calculations, i.e. adding a new result for the same config won't change existing coverage. If we're missing either value, we recalculate the coverage numbers for each subgroup/testgroup/test run that contains the testcase. This will add some overhead to result submission depending on how many subgroups/testgroups/test runs the testcase belongs to, probably on the order of a second or two.

If we choose to implement the coverage lookup via cron, we'll have the downside that the coverage data may become stale during the interval between cron calculations.
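A minimal Python sketch of that lookup-or-recalculate step (the real Litmus code is Perl; the coverage table, its columns, and the recalculate_coverage helper are hypothetical, and real coverage is tracked per subgroup/testgroup/test run rather than the single pair of keys shown here):

def record_result(conn, subgroup_id, config_id, recalculate_coverage):
    # If a matching coverage row for this config already exists, the new
    # result cannot change it, so the common case is one cheap indexed lookup.
    row = conn.execute(
        "SELECT percent_complete FROM coverage "
        "WHERE subgroup_id = ? AND config_id = ?",
        (subgroup_id, config_id),
    ).fetchone()
    if row is not None:
        return row[0]
    # Otherwise recalculate once, at submission time, and store the number so
    # later pageloads never have to compute coverage on the fly.
    percent = recalculate_coverage(conn, subgroup_id, config_id)
    conn.execute(
        "INSERT INTO coverage (subgroup_id, config_id, percent_complete) "
        "VALUES (?, ?, ?)",
        (subgroup_id, config_id, percent),
    )
    return percent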
It takes 90 seconds to select product=firefox on https://litmus.mozilla.org/manage_testcases.cgi.

I don't think there's much value in loading data before at least selecting product.
For the past two days, Litmus has been painfully slow, especially when trying to query for an individual test case. This was noticed by both jmaher and whimboo, and continues to be an issue today.
Performance is critical; marking P1. Some options to try: automate a user load with Selenium.
Severity: major → critical
Priority: P3 → P1
Running today's FFTs, Litmus seems slow in two instances:

(1) Editing an individual test case in a test run
(2) Expanding on the sections of the FFT for the first time
Would it help to benchmark performance with one user, versus several?

Should we create a separate bug for issues that are heavily dependent on client-side performance, e.g. comment 4 and comment 10?

Comment 5's observation on load average is particularly interesting. Do we have long-term data confirming such an extremely low load average (or benchmarked testcases)? Has the db been moved to put everything under one machine, as suggested? Since this is a VM and not bare hardware, to what extent do we know, or can we determine, how much its performance might be constrained by resource availability within the VM? This combination of items would help resolve the question of whether more hardware (real or virtual) is required.

(In reply to comment #9)
> That last comment was from me, BTW, just using Marcia's bugzilla instance as I
> fought with wireless on my own laptop.
> 
> To elaborate a bit, we'll need a separate coverage table in the db, and then we
> can do a simple lookup of this data rather than calculating it on the fly on
> every pageload. 

It would be surprising for this to not have a significant impact at times. And yet, surely not all cases of bad performance are due to this. Indeed it seems unlikely in most cases, given how improbable it is that someone else is loading that opening screen at the same time I am.
Can we get a WebDev engineer to work with Marcia and Wayne?  This seems like a top issue and we need to make progress on a few top issues with Litmus.
(In reply to comment #16)
> Can we get a WebDev engineer to work with Marcia and Wayne?  This seems like a
> top issue and we need to make progress on a few top issues with Litmus.

With Q2 starting next week, now would be a good time to talk to morgamic and get a webdev resource scheduled for Litmus for next quarter.
Today while doing some test case creation and cloning a few test cases, Litmus was extremely slow. Last comment on this bug is from almost a month ago - can we get some traction on this issue or at least start some investigation of how we can improve the situation?
Fred took a look at most of this.  Could you comment here about what you found when you were looking at the code?  What about the infra being slow?
Assignee: nobody → fwenzel
It seems that there are different problems:
- we are unsure if there is a hardware problem with the production litmus instance. The DB queries for litmus (advanced search, for example) can become quite complicated at times. Depending on what the boxes and DB are shared with, it is also possible that other apps negatively influence Litmus's performance. This step is currently blocked on bug 484016, waiting for some performance analysis/graphs from IT.
- Second, I am with Wayne in comment 15 here. The AJAX actions are very slow, as they are hit by every user, and that should change. However, this is not the only reason overall performance could be improved.
I have made a comment in Bug 484016 and will discuss with Reed when he arrives next week.

Frédéric's comment 20 is interesting, as some of us have speculated about whether the fact that the machine is shared somehow affects the performance.
Depends on: 484016
still an ongoing problem.

Lumbee noted in #thunderbird today 9:30am EST that litmus is slow - he's running tests. (tests aren't complicated, are they?) I agree it's slow - just getting the partial initial litmus screen (auto logged in) took >15 sec (I didn't run tests). I tried some other administrator stuff - starting the thunderbird testrun didn't seem slow, but some admin stuff was slow.

ludo mentioned this week or last encountering slowness in manipulating definitions.

Anyone monitoring cacti? I can't get there. (But I don't see 5-minute intervals being sufficient to catch what we've been seeing.)
(In reply to comment #22)

> ludo mentioned this week or last encountering slowness in manipulating
> definitions.

So what I've been trying to do is run test_run id=40. Things that are slow / feel slow:
 * Loading the page that lets me submit my configuration
 * Loading the tests themselves when I've chosen which ones I want to execute
 * Submitting test results
(In reply to comment #22)
> Lumbee noted in #thunderbird today 9:30am EST that litmus is slow - he's
> running tests.  (tests aren't complicated, are they?) 

Correction - Lumbee's problem was while testing yesterday. (He created an account Friday and was testing.)
Do we know that indexes and such are all still good? (bug 360954 / bug 350251)
(purely for historical purposes, ref bug 396617)
this is frustrating.
Can you guys post specific URLs that are slow for you?  We're trying to debug the problem.  Thanks.
Also, if anyone wants to take a look at the code and point out possible performance problems (now in hg: http://hg.mozilla.org/webtools/litmus/), please go right ahead.
I can probably throw some memcaching at some of the statistics queries. I am thinking about caching them for an hour or so.

Turning on slow query logging for litmus shows, just on the front page for example, such gems as:

# Query_time: 7  Lock_time: 1  Rows_sent: 954  Rows_examined: 5016
# Query_time: 1  Lock_time: 0  Rows_sent: 1  Rows_examined: 406024
# Query_time: 1  Lock_time: 0  Rows_sent: 1  Rows_examined: 292223
# Query_time: 1  Lock_time: 0  Rows_sent: 1  Rows_examined: 418163
# Query_time: 2  Lock_time: 0  Rows_sent: 1  Rows_examined: 669998

There are a lot of these, just there, and possibly on other pages as well. Examining a few million rows on every page load has the potential to harm performance quite a bit.
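The one-hour memcaching idea above is the usual cache-aside pattern. A Python sketch purely to show the shape of the approach (Litmus is Perl, and the key name and run_stats_query helper here are made up):

from pymemcache.client.base import Client

STATS_TTL = 60 * 60  # cache the statistics for an hour, as suggested above

cache = Client(("localhost", 11211))

def cached_stats(key, run_stats_query):
    # Serve the precomputed answer if we have one...
    cached = cache.get(key)
    if cached is not None:
        return cached.decode("utf-8")
    # ...otherwise run the slow, Rows_examined-heavy query once and cache it,
    # so it is executed at most once per hour instead of on every pageload.
    value = run_stats_query()
    cache.set(key, value.encode("utf-8"), expire=STATS_TTL)
    return value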
bah, sorry I have been remiss in not responding.

In my worst experiences, bad performance has often been seemingly random. For example, (if my memory is correct) two weeks ago a couple times it took about 15 sec to get any hint of the login screen (or it may have been after the initial screen after login). 

There may be some predictable issues, but my suspicion is most issues are not predictable (occurrence ratio of 1 in 20 to 1 in 50, or short duration) and caused by things not related to the work that performs poorly, ergo supplying URLs may be a crap shoot.

wenzel, Does that jive with your comment 28?
(In reply to comment #29)
> wenzel, Does that jive with your comment 28?

Only partly. Constant slowness would be more consistent with my observations, but of course, if you run a lot of complicated queries, the app is much more susceptible to random slowness whenever the server is otherwise in distress -- or when a lot of people access the app simultaneously.
Depends on: 505720
I added basic memcache support in bug 505720 and used it on the huge stats queries for now.

Not all of these slow ones may be that easy to spot. For example, on the advanced search page I don't see any huge queries other than the actual search query, which we have to actually perform for obvious reasons. I might cache that too, but that'll only help if the same one is being executed repeatedly, either by the same person or by others.

The manage testcases page from comment 11 is slow even on my local copy, so we might be able to do something there, not necessarily with caching, but with not showing such a huge list in the first place. I'll file a bug.
Depends on: 505805
One way to speed up the advanced_search.cgi page is to stop looking up the complete list of users. This list is now >10K users, and by itself can double the pageload time.

I don't know whether anyone in QA uses the existing admin interface on the advanced search page to limit to a single user. I think we could trivially replace that drop-down list with a text field that could be used as a regexp. The resulting search might take marginally longer to run, but the form itself would load much faster.
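A hypothetical Python sketch of that swap (the production code is Perl against MySQL, and the table and column names are assumptions): the form ships a plain text field, and its value is applied as a pattern inside the search query itself, so rendering the form never touches the >10K-row users table.

def search_results(conn, user_pattern=None):
    # conn is assumed to be a MySQL connection (e.g. pymysql), hence %s placeholders.
    sql = ("SELECT r.result_id, u.email FROM test_results r "
           "JOIN users u ON u.user_id = r.user_id")
    params = []
    if user_pattern:
        # MySQL's REGEXP operator; a plain LIKE '%...%' match would also work
        # and might be marginally cheaper.
        sql += " WHERE u.email REGEXP %s"
        params.append(user_pattern)
    with conn.cursor() as cur:
        cur.execute(sql, params)
        return cur.fetchall()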
(In reply to comment #32)
> One way to speed up the advanced_search.cgi page is to stop looking up the
> complete list of users. This list is now >10K users, and by itself can double
> the pageload time.
> 
> I don't know whether anyone in QA uses the existing admin interface on the
> advanced search page to limit to a single user. I think we could trivially

I know I don't - I rarely search using the user as a criterion (in fact I'm pretty sure I never have).
Raymond and I were just discussing how slow Litmus has been over the last few days when simply running tests. For example, it was fairly painful trying to run the Litmus 3.6 Smoketest yesterday. Raymond mentioned that he was seeing the same thing with his test runs.
Assignee: fwenzel → nobody
Closing this one out since Litmus is not being used any longer; we have moved to MozTrap.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Product: Webtools → Webtools Graveyard