Although we put a lot of effort into optimizing Firefox for running web apps, we don't really test this at the moment. TP5 loads pages, but doesn't interact with them in the way that one interacts with Gmail, Google Calendar, or Facebook.
I propose we create some new tests to measure how we're doing here.
Obviously we'd need to have smart testing infrastructure which knows how to replay a Gmail/Gcal/FB session.
I think there are three quantities we'd want to measure for each web app:
* Time taken to complete the test (load the page and play a set of interactions),
* Maximum RSS during the test, and
* RSS after closing the tab and triggering a GC.
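For illustration, reducing one run to those three numbers could look like the sketch below. The function name `summarize_run` is hypothetical, and the mechanism for actually sampling RSS (e.g. polling /proc/&lt;pid&gt;/status on Linux while the test runs) is assumed and not shown; only the bookkeeping is.

```python
# Sketch of reducing one webapp-test run to the three metrics above.
# `samples` is a list of (timestamp_seconds, rss_bytes) pairs collected
# while the test runs; `rss_after_gc` is a single RSS reading taken after
# the tab is closed and a GC has been forced. How samples are collected
# is outside this sketch (assumed: some poller feeds them in).

def summarize_run(samples, rss_after_gc):
    """Return (elapsed_seconds, max_rss_bytes, rss_after_gc_bytes)."""
    if not samples:
        raise ValueError("need at least one RSS sample")
    times = [t for t, _ in samples]
    elapsed = max(times) - min(times)          # time to complete the test
    max_rss = max(rss for _, rss in samples)   # peak memory during the test
    return elapsed, max_rss, rss_after_gc      # plus memory left after GC
```

The third number is the one that matters most for the leak concern below: it should return to roughly the pre-test baseline after the tab is closed.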
This kind of measurement is particularly important for B2G. People basically never restart their phones, so Firefox has to be really good at freeing up memory after a page is closed; otherwise, we'll slowly leak over the course of weeks.
We are building infrastructure that can support this type of testing this quarter (2011 Q4) in the Marionette project.
But there are larger problems here that we'd need to deal with:
* Repeatability - we need to pull these canned recorded sessions down off the net and ensure that we can reliably replay them on our own servers. This is both because sites change and because the machines running in automation cannot talk to the outside world. However, if you pull one of these larger, interesting web apps off the net, how much of what makes it prone to browser errors do you lose? I'm thinking of all those delicious ads that rotate through the page, load on the fly, run animations, etc. Can those be stubbed out well enough that the test is still representative of real browsing on the page in question?
* Time to done & deploy - running these with the current Talos infrastructure is possible, but not ideal. We could create something using driver scripts from other projects, but because of the new push for a native-UI Fennec, we won't be able to run those tests on Fennec (the driver scripts hook into the app via XUL and depend on the chrome DOM to drive the browser). With Marionette, this won't be a problem. So if we can wait for that, I recommend we do; otherwise, we simply don't test this on Fennec in the meantime.
Marionette: https://wiki.mozilla.org/Auto-tools/Projects/Marionette
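To make the repeatability bullet concrete, here is a toy model of the record/replay idea: during recording, responses are stored keyed by (method, url); during replay, the canned response is served and any request that was never recorded (e.g. a rotating ad) fails loudly instead of silently hitting the network. This class and its method names are illustrative; a real harness would sit at the HTTP-proxy level.

```python
# Toy record/replay cache for canned web-app sessions. A real
# implementation would intercept traffic in a proxy; this only models
# the lookup behavior the automation machines would rely on, since they
# cannot talk to the outside world.

class ReplayCache:
    def __init__(self):
        self._responses = {}

    def record(self, method, url, body):
        """Store a response captured during a live recording session."""
        self._responses[(method, url)] = body

    def replay(self, method, url):
        """Serve the canned response; fail loudly on unrecorded requests."""
        try:
            return self._responses[(method, url)]
        except KeyError:
            raise LookupError("unrecorded request: %s %s" % (method, url))
```

Failing loudly on unrecorded requests is the design choice that exposes non-determinism (rotating ads, cache-busting URLs) instead of papering over it.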
> However, if you pull one of these larger interesting web-apps off the net, how much of what makes
> them prone to browser errors do you lose?
I think this is a red herring. By its nature, a performance test must be repeatable. So it's going to be static, and it's not going to have full coverage of the web app. That's OK and unavoidable, I think.
> We can create something using driver scripts from other projects, but due to the new push for a
> native UI fennec, we won't be able to run these tests on fennec (those driver scripts hook
> themselves into the app via XUL and depend on chrome DOM to drive the browser).
I guess the question is how much work we'd waste developing these tests and then porting to Marionette, as opposed to developing for Marionette in the first place.
But given the sorry state of our performance test coverage, especially wrt memory, I'd rather move forward on desktop now and figure out mobile when we can.
I don't think Marionette is going to help in the short term on mobile. We will still have Gecko and the DOM, which we can traverse and send events to (mouse/keyboard events), but those events will not be at the native-UI level. From what I know of Marionette, it will work at the Gecko layer for the first round, then later (next quarter, maybe?) add shims to work at the OS layer.
Desktop and mobile will need different sets of tests. For example, Gmail on Android is a mobile site, not the full app; same with Zimbra. So investing time in waiting for a tool that is mobile-specific might not be very useful. I would say we should use technologies that are friendly to these mobile tools instead of writing a bunch of custom code. So if Marionette is going to work with Tool X (say, Selenium), then we should use Tool X for the web-app tests.
Most of these webapps rely on databases for the backend. How would we collect that information? I agree we could take a page which contains a bunch of links and wget it while we click, but then how is that different from loading a bunch of webpages?
In short, I see the need for more integrated testing of webpages to gain more insight into the performance (time and memory) of Firefox. If somebody could make this proposal better defined, I would appreciate it.
> Most of these webapps rely on databases for the backend. How would we collect this information? I
> agree we could take a page which contains a bunch of links and wget while we click, but then how is
> that different from loading a bunch of webpages.
How is Gmail different from a static webpage, from the browser's perspective? It's not different because it uses a database; it'd be different in the same way if all my e-mails were random gibberish delivered from the server.
These sites make heavy use of XHR to stream data to the browser; that's one difference. They also dynamically generate a lot of the content they display; the content isn't delivered as plain HTML, as it is in a static page.
(In reply to Justin Lebar [:jlebar] from comment #4)
> These sites make heavy use of XHR to stream data to the browser; that's one
> difference. They also dynamically-generate a lot of the content they
> display; the content isn't delivered as plain HTML, as it is in a static page.
Right, this is the part of the proposal that I find most interesting and also most confounding. How do we simulate this with any fidelity to the real thing on an offline, localhost-only network? Do we wireshark and cache an entire gmail session? Once we do that, how do we know if what we're testing bears any resemblance (latency wise, for example) to the real thing and the real user's experience?
I too love the idea of testing it, but I'm just having a tough time getting my head around how we're going to do this.
Right now, "porting the tests to Marionette" would mean "write them in Selenium 2", because Marionette will be able to understand Selenium 2's JSON protocol and use it to drive the browser. But Marionette is a long way from there. Right now, you'd be much better off doing these as a one-off with Selenium against the real website, measuring true end-to-end time (i.e., outside of Talos).
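A minimal sketch of that one-off approach: time a scripted interaction end to end. The `interact` callable stands in for real WebDriver calls (driver.get(), element clicks, and so on, which are omitted here); only the timing wrapper, i.e. the number a harness would eventually record, is shown.

```python
# Sketch of end-to-end timing for a scripted webapp interaction.
# `interact` is assumed to be a zero-argument callable wrapping the
# actual Selenium session; this wrapper only measures how long the
# whole interaction takes.
import time

def time_interaction(interact):
    """Run `interact()` and return (result, elapsed_seconds)."""
    start = time.monotonic()
    result = interact()
    return result, time.monotonic() - start
```

Using a monotonic clock avoids skew from system-clock adjustments during long sessions.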
> Do we wireshark and cache an entire gmail session? Once we do that, how do we know if what we're
> testing bears any resemblance (latency wise, for example) to the real thing and the real user's
> experience?
How do we know that Firefox's TP5 score has any relation to the user's experience? The pages in the test all have many resources; in the real world, they'd be delivered at different rates, with different latencies and so on.
I think yours is a good point, but I'm not sure that it applies more to web apps than to our existing tests!
Note also that I'm particularly concerned with memory benchmarks, which should be basically the same regardless of the network's speed. (And if they're not the same, we could always slow down the network.)
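The "slow down the network" option could be as simple as pacing canned responses at a fixed byte rate, so replay latency loosely resembles a real connection. This generator is a hypothetical sketch; the chunk size and rate are illustrative, not measured values.

```python
# Sketch of pacing a canned response at ~bytes_per_sec so that replayed
# sessions don't arrive instantaneously from localhost. A real replay
# proxy would do this per connection; this only shows the pacing logic.
import time

def paced_chunks(data, bytes_per_sec, chunk_size=1024):
    """Yield `data` in chunks, sleeping so throughput ~= bytes_per_sec."""
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        yield chunk
        time.sleep(len(chunk) / float(bytes_per_sec))
```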
(In reply to Justin Lebar [:jlebar] from comment #7)
> Note also that I'm particularly concerned with memory benchmarks, which
> should be basically the same regardless of the network's speed. (And if
> they're not the same, we could always slow down the network.)
Good point, good point.
Looping in a few JS guys here, since I know next to nothing about what makes a good versus a bad JS benchmark. It would be great to get your thoughts on whether this kind of interaction-with-a-webapp test would be a useful benchmark.
Out of curiosity, have you guys talked with anyone working on NeckoNet (https://wiki.mozilla.org/NeckoNet)? Since NeckoNet allows replay at the network level, we had been talking about integrating it with Selenium (no concrete plans yet) to give the JS team a reproducible harness for webapp perf. CC'ing Nick, who actually knows about NeckoNet. :)
It seems like we now have all the tools necessary to do this, but someone needs to own the project.
Clint, do you guys have any resources available to help with this? This is probably MemShrink's number-one automation-related issue.
WONTFIX now that we have AWSY, and it doesn't seem like this is going anywhere.