1409002 - Teaching tp5o page loader test to load multiple pages in parallel

Reporter

Description

•

8 years ago

Attached patch WIP (obsolete) — Details — Splinter Review

For measuring the efficiency of e10s-multi on various platforms with various setups, I want this test to be customizable to load each sites in N windows in parallel instead of only one. I'm not sure yet if we want to run a tp5o with let's say with 4 windows by default on perfherder, let's see first the usefulness of this customization and the results.

Gabor Krizsanits (INACTIVE)

Reporter

Updated

•

8 years ago

Attachment #8918841 - Attachment is patch: true

Gabor Krizsanits (INACTIVE)

Reporter

Updated

•

8 years ago

Assignee: nobody → gkrizsanits

Priority: -- → P1

Whiteboard: [e10s-multi:+]

Joel Maher ( :jmaher ) (UTC -8)

Comment 1

•

8 years ago

is this really different than tps (where we load all pages in different tabs and switch between the tabs)?

Gabor Krizsanits (INACTIVE)

Reporter

Comment 2

•

8 years ago

(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #1) > is this really different than tps (where we load all pages in different tabs > and switch between the tabs)? Yeah. Tps test loads all the pages, waits until all of them are loaded, and then measures the smoothness of tab switches. So it does not measure page loads nor does it wind a lot from multiple processes. What I'm proposing here is starting up several pageloads at once and stop the timer when all the pages are loaded. So it measures parallel pageloads (DOM/network/etc work) where e10s-multi should be a big win if the machine has multiple cores, and might not be all that great if it only has one...

Gabor Krizsanits (INACTIVE)

Reporter

Comment 3

•

8 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b1d2dd6b4ebec6af62e8caa327d951b75d680da2&selectedJob=137871195 So this experiment loads the same page concurrently in 4 windows at once, while only one is visible. And measures the time it takes to load it in each window and then calculate the average for that page. (and then repeats this a bunch of times and does another average) Now I do this for the first 3 URLs from the tp5o page set (I wanted to get some results fast...). And on all platform roughly we're twice as fast with 4 content processes than with one. Not too bad.

Gabor Krizsanits (INACTIVE)

Reporter

Comment 4

•

8 years ago

Attached file necko log for parent (obsolete) — Details

There is a necko issue that is blocking me here. What I'm doing here: It's a modified pageloader test. loading the tp5o pages, one URL at a time, but in several windows at once. So I initiate a page load in all (let's say 8 windows) with the same URL and when all of the pages are loaded I move on to the next step. Multiple content processes are enabled, but I think vanilla e10s is enough as well to hit the problem. Problem: sometimes some of the pages do not finish loading in some of the windows. A network request just hangs (I increased the timeout for this case and 30s was not enough for the browser to start doing anything again, so I'd say it's a pretty long hang, not sure if it would last forever or not). It's not a UI hang just a network hang. I did some logging and the interesting part starts around: 10:28:13.082000 in the parent process, that poll never returns and a lot of http channel stuff gets destroyed and then the test times out after a long pause. I think something goes wrong in one of the previous fetch that takes quite long: 2017-10-27 10:28:13.081000 UTC - [Socket Thread]: D/nsSocketTransport ...returned after 413 milliseconds and involves some cached resource. This issue is not just blocking me from landing this talos test and doing some important measurements, but also I'm kind of worried that this can happen for our users since this is a pretty simple setup. It might justa be something with out python file server though, but I think it's important to figure out what's going on and I really need some help for that.

Flags: needinfo?(honzab.moz)

Joel Maher ( :jmaher ) (UTC -8)

Comment 5

•

8 years ago

is this on a specific platform, or all plaforms?

Gabor Krizsanits (INACTIVE)

Reporter

Comment 6

•

8 years ago

Attached file necko log for one of the hanging child (obsolete) — Details

Gabor Krizsanits (INACTIVE)

Reporter

Comment 7

•

8 years ago

(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #5) > is this on a specific platform, or all plaforms? Seems to happen on all platforms based on my try runs.

Gabor Krizsanits (INACTIVE)

Reporter

Comment 8

•

8 years ago

It seems like that the problem is that before everything starts to hang I hit: http://searchfox.org/mozilla-central/rev/1ebd2eff44617df3b82eea7d2f3ca1b60cc591a0/image/imgLoader.cpp#2882 multiple times from one of the child processes with stack: 10:54:32 INFO - PID 10396 | #01: imgCacheValidator::OnStartRequest (c:\development\mozilla-central\image\imgloader.cpp:2926) 10:54:32 INFO - PID 10396 | #02: mozilla::net::HttpChannelChild::DoOnStartRequest (c:\development\mozilla-central\netwerk\protocol\http\httpchannelchild.cpp:689) 10:54:32 INFO - PID 10396 | #03: mozilla::net::HttpChannelChild::OnStartRequest (c:\development\mozilla-central\netwerk\protocol\http\httpchannelchild.cpp:620) 10:54:32 INFO - PID 10396 | #04: mozilla::net::StartRequestEvent::Run (c:\development\mozilla-central\netwerk\protocol\http\httpchannelchild.cpp:451) 10:54:32 INFO - PID 10396 | #05: mozilla::net::HttpChannelChild::RecvOnStartRequest (c:\development\mozilla-central\netwerk\protocol\http\httpchannelchild.cpp:523) 10:54:32 INFO - PID 10396 | #06: mozilla::net::PHttpChannelChild::OnMessageReceived (c:\development\mozilla-central\obj-i686-pc-mingw32\ipc\ipdl\phttpchannelchild.cpp:638) 10:54:32 INFO - PID 10396 | #07: mozilla::dom::PContentChild::OnMessageReceived (c:\development\mozilla-central\obj-i686-pc-mingw32\ipc\ipdl\pcontentchild.cpp:4868) 10:54:32 INFO - PID 10396 | #08: mozilla::dom::ContentChild::OnMessageReceived (c:\development\mozilla-central\dom\ipc\contentchild.cpp:3722) 10:54:32 INFO - PID 10396 | #09: mozilla::ipc::MessageChannel::DispatchAsyncMessage (c:\development\mozilla-central\ipc\glue\messagechannel.cpp:2120) 10:54:32 INFO - PID 10396 | #10: mozilla::ipc::MessageChannel::DispatchMessageW (c:\development\mozilla-central\ipc\glue\messagechannel.cpp:2052) 10:54:32 INFO - PID 10396 | #11: mozilla::ipc::MessageChannel::RunMessage (c:\development\mozilla-central\ipc\glue\messagechannel.cpp:1895) 10:54:32 INFO - PID 10396 | #12: mozilla::ipc::MessageChannel::MessageTask::Run (c:\development\mozilla-central\ipc\glue\messagechannel.cpp:1929) 10:54:32 INFO - PID 10396 | #13: mozilla::SchedulerGroup::Runnable::Run (c:\development\mozilla-central\xpcom\threads\schedulergroup.cpp:396) 10:54:32 INFO - PID 10396 | #14: nsThread::ProcessNextEvent (c:\development\mozilla-central\xpcom\threads\nsthread.cpp:1038) 10:54:32 INFO - PID 10396 | #15: NS_ProcessNextEvent (c:\development\mozilla-central\xpcom\threads\nsthreadutils.cpp:524) 10:54:32 INFO - PID 10396 | #16: mozilla::ipc::MessagePump::Run (c:\development\mozilla-central\ipc\glue\messagepump.cpp:97) 10:54:32 INFO - PID 10396 | #17: mozilla::ipc::MessagePumpForChildProcess::Run (c:\development\mozilla-central\ipc\glue\messagepump.cpp:301) 10:54:32 INFO - PID 10396 | #18: MessageLoop::RunHandler (c:\development\mozilla-central\ipc\chromium\src\base\message_loop.cc:320) 10:54:32 INFO - PID 10396 | #19: MessageLoop::Run (c:\development\mozilla-central\ipc\chromium\src\base\message_loop.cc:300) 10:54:32 INFO - PID 10396 | #20: nsBaseAppShell::Run (c:\development\mozilla-central\widget\nsbaseappshell.cpp:160) 10:54:32 INFO - PID 10396 | #21: nsAppShell::Run (c:\development\mozilla-central\widget\windows\nsappshell.cpp:232) 10:54:32 INFO - PID 10396 | #22: XRE_RunAppShell (c:\development\mozilla-central\toolkit\xre\nsembedfunctions.cpp:877) 10:54:32 INFO - PID 10396 | #23: mozilla::ipc::MessagePumpForChildProcess::Run (c:\development\mozilla-central\ipc\glue\messagepump.cpp:278) 10:54:32 INFO - PID 10396 | #24: MessageLoop::RunHandler (c:\development\mozilla-central\ipc\chromium\src\base\message_loop.cc:320) 10:54:32 INFO - PID 10396 | #25: MessageLoop::Run (c:\development\mozilla-central\ipc\chromium\src\base\message_loop.cc:300) 10:54:32 INFO - PID 10396 | #26: XRE_InitChildProcess (c:\development\mozilla-central\toolkit\xre\nsembedfunctions.cpp:707) 10:54:32 INFO - PID 10396 | #27: mozilla::BootstrapImpl::XRE_InitChildProcess (c:\development\mozilla-central\toolkit\xre\bootstrap.cpp:69) 10:54:32 INFO - PID 10396 | #28: content_process_main (c:\development\mozilla-central\ipc\contentproc\plugin-container.cpp:63) 10:54:32 INFO - PID 10396 | #29: NS_internal_main (c:\development\mozilla-central\browser\app\nsbrowserapp.cpp:283) 10:54:32 INFO - PID 10396 | #30: wmain (c:\development\mozilla-central\toolkit\xre\nswindowswmain.cpp:114) 10:54:32 INFO - PID 10396 | #31: __scrt_common_main_seh (f:\dd\vctools\crt\vcstartup\src\startup\exe_common.inl:25

Gabor Krizsanits (INACTIVE)

•

8 years ago

Attachment #8922801 - Flags: feedback?(honzab.moz)

Honza Bambas (:mayhemer)

Comment 12

•

8 years ago

(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #11) > Created attachment 8924202 [details] > necko parent log with cache > > 2017-11-01 16:17:31 is where the timeout call kills the current round and > moves on to the next one, so the part before that what matters Sorry, this log is useless. There is no interesting data. Could you use environment vars to start logging instead of using about:networking? I want a complete log. Also URLs that are hanging would be interesting to know. What feedback do you want on the old child log from me? Most of the http channels that are logged in that log are successfully done. I don't see any hangs, but the log is incomplete (started late).

Honza Bambas (:mayhemer)

Updated

•

8 years ago

Attachment #8922801 - Flags: feedback?(honzab.moz)

Gabor Krizsanits (INACTIVE)

Reporter

Comment 13

•

8 years ago

Attached file log.txt.0 - parent, from start — Details

(In reply to Honza Bambas (:mayhemer) from comment #12) > (In reply to Gabor Krizsanits [:krizsa :gabor] from comment #11) > Sorry, this log is useless. There is no interesting data. Could you use > environment vars to start logging instead of using about:networking? I want > a complete log. Also URLs that are hanging would be interesting to know. During the mid-run hangs I couldn't spot anything we're hanging on in these logs... And according to RawCap there is no http packet missing (so it's not like a hang in the middle of fetching some image as far as I can tell). There is some TCP traffic during the hang but not http (I have no idea what exactly). Anyway, let's start with something simpler, this is the first hang short after the test starts. Seems to be related to URLs like https://127.0.0.1/selfsupport-dummy/ , http://127.0.0.1/safebrowsing-dummy/update , and other weird urls (so not the first page I'm trying to load) > > What feedback do you want on the old child log from me? Most of the http > channels that are logged in that log are successfully done. I don't see any > hangs, but the log is incomplete (started late). Sorry, I accidentally set the flag on the wrong log. I wanted to set on the "necko parent log with cache" log. Anyway, a full log would be gigantic for the mid-run hangs, but I have a longer version, I'll try to share it with you via google drive, since bugzilla does not accept larger files. If the full log from the beginning of the test run is really needed I can prepare you a patch to reproduce it locally with a single command. Or if you could give me some hints about what are you typically looking for in these logs I can try it myself as well, or do a better work at capturing the meaningful bits with the logger.

Attachment #8922793 - Attachment is obsolete: true

Attachment #8922801 - Attachment is obsolete: true

Attachment #8924202 - Attachment is obsolete: true

Attachment #8924530 - Flags: feedback?(honzab.moz)

Gabor Krizsanits (INACTIVE)

Reporter

Comment 14

•

8 years ago

In case this is not a networking/cache issue as we thought, what's the best way to track what is blocking the load event on the document? Is there any better way than trying to fetch and printf something from the caller of nsDocument::BlockOnload and nsDocument::UnblockOnload?

Flags: needinfo?(bugs)

Honza Bambas (:mayhemer)

Comment 15

•

8 years ago

Comment on attachment 8924530 [details] log.txt.0 - parent, from start All channels are done (i.e. all of them have delivered data and onstop). This is not a net issue. Document loading and onload blocking happens on the child process, so a child log would be good to share.

Attachment #8924530 - Flags: feedback?(honzab.moz)

Honza Bambas (:mayhemer)

Comment 16

•

8 years ago

Pressed enter too soon! I can see an error in the log for all of the 3 channels: nsHttpConnection::Activate [this=199F7F90] Bad Socket 804b000d the error is NS_ERROR_CONNECTION_REFUSED. I'd like to ask you for one more log run with added "nsSocketTransport:5" module, it will tell us where from exactly does that comes. Thanks.

•

8 years ago

It's getting somewhere: https://treeherder.mozilla.org/#/jobs?repo=try&revision=95aa801706510c2e6a976715bc3bb204bf0e0b29

Gabor Krizsanits (INACTIVE)

Reporter

Comment 21

•

8 years ago

Attached patch Teaching the pageloader to load multiple pages at once. v1 (obsolete) — Details — Splinter Review

Attachment #8918841 - Attachment is obsolete: true

Gabor Krizsanits (INACTIVE)

Reporter

Comment 22

•

8 years ago

Comment on attachment 8932511 [details] [diff] [review] Teaching the pageloader to load multiple pages at once. v1 So the patch does mainly two things, first it removes a few functions that are no longer used, second it teaches the page loader to specify a number of windows parameter, which will trigger multiple page loads at once for the tp5o test. Hopefully without breaking all the dependent talos tests. I still have to see if all the tests are working as before (see my last try push in the previous comment about the status). But since I've heard that you're refactoring this file I wanted to request an early feedback.

Attachment #8932511 - Flags: feedback?(rwood)

Robert Wood [:rwood]

Comment 23

•

8 years ago

Comment on attachment 8932511 [details] [diff] [review] Teaching the pageloader to load multiple pages at once. v1 Review of attachment 8932511 [details] [diff] [review]: ----------------------------------------------------------------- Awesome! Thanks for this. I really like this approach and the refactoring, much cleaner. See comments inline. My main concern is it being complicated to debug failures with a large number of concurrent windows. For example, if say 3 windows have completed pageload but a 4th times out, it should be known that was the state. Having more logging of the number of windows open (and the number expected, and their state etc.) would help alot when investigating test failures. Thanks. I'm excited to see it in action. ::: testing/talos/talos.json @@ +90,5 @@ > "talos_options": ["--disable-stylo"], > "tests": ["speedometer"] > }, > "tp5o-e10s": { > + "tests": ["tp5o", "tp5o_multiwindow_4", "tp5o_multiwindow_4_singlecp"], You probably have it this way temporarily just so that it's possible to run the new tests on try; when they actually land I'd prefer a separate test 'tp5o-multiwin-e10s' or something like that ::: testing/talos/talos/pageloader/chrome/pageloader.js @@ +32,5 @@ > var cycle; > var pageCycle; > var report; > var timeout = -1; > +var delay = 500; I prefer this delay only be changed when numWindows > 1 so that existing job durations are not affected @@ +302,5 @@ > + browserWindow: wwatch.openWindow(null, > + "chrome://browser/content/", "_blank", > + `chrome,${toolbars},dialog=no,width=${winWidth},height=${winHeight}`, blank) > + }; > + I think it would be good to dumpLine indicating a window was opened; that way we can see more evidence in the log of multiple windows / how many were opened @@ +307,5 @@ > + gBrowserData[i].browserWindow.addEventListener("load", > + browserLoadFunc(gBrowserData[i]), true); > + } > + > + setTimeout(plLoadPage, 1000); Again, if possible probably best to only have this increased by 900ms if numWindows > 1 @@ +566,5 @@ > } > } else { > report.recordTime(recordedName, time); > } > dumpLine("Cycle " + (cycle + 1) + "(" + pageCycle + "): loaded " + pageName + " (next: " + nextName + ")"); If numWindows > 1 I'd add to this dumpLine "in N concurrent windows" @@ +581,5 @@ > mm.sendAsyncMessage("PageLoader:ScrollTest", { details }); > } > > +function _loadHandler(fnbpaint = 0, browserData) { > + if (gToBeLoaded == 0) Suggestion: dump some output here indicating load handler activated for window N or load handler activated with N windows remaining to be loaded or something. Just so if there are crashes/failures/intermittents it is more clear where we are at i.e. how many windows are open and what state each is in at that time. ::: testing/talos/talos/test.py @@ +575,5 @@ > unit = 'ms' > > > @register_test() > +class tp5o_multiwindow_4(tp5o): I'd make a base class for this: class tp5o_multiwindow(tp5o): tppagecycles = 10 tptimeout = 15000 timeout = 5000 Then we could derive tests for x number of windows, i.e.: class tp5o_multiwindow_2(tp5o_multiwindow): tpnumwindows = 2 class tp5o_multiwindow_4(tp5o_multiwindow): tpnumwindows = 4 @@ +589,5 @@ > + tpnumwindows = 4 > + preferences = {'dom.ipc.processCount': 1, > + 'dom.ipc.multiOptOut': 1} > + tptimeout = 15000 > + timeout = 5000 Same idea here, i.e.: class tp5o_multiwindow_singlecp(tp5o_multiwindow): preferences = {'dom.ipc.processCount': 1, 'dom.ipc.multiOptOut': 1} class tp5o_multiwindow_4_singlecp(tp5o_multiwindow_singlecp): tpnumwindows = 4

Attachment #8932511 - Flags: feedback?(rwood) → feedback+

Gabor Krizsanits (INACTIVE)

Reporter

•

8 years ago

Attached patch Teaching the pageloader to load multiple pages at once. v2 — Details — Splinter Review

(In reply to Robert Wood [:rwood] from comment #23) > You probably have it this way temporarily just so that it's possible to run > the new tests on try; when they actually land I'd prefer a separate test > 'tp5o-multiwin-e10s' or something like that Honestly I'm nor sure how to introduce a new test, can we do that part in a followup? I also removed the single process version, since that was only for perf testing. The one I'm keeping would make sense for perf regression finding for the future. > > + setTimeout(plLoadPage, 1000); > > Again, if possible probably best to only have this increased by 900ms if > numWindows > 1 I removed the other delay but have to keep this one. It's not really a 900ms, previously this call was from another setTimeout with 500 so that is at least 500 + 100 but in practice I had to keep the 1000ms (see the try runs above). The proper solution is to use BrowserTestUtils here but integrating that into talos test is a bit more work I'm afraid. I would look into that but these are my last days at Mozilla... Anyway, I think I addressed all the other comments.

Attachment #8932511 - Attachment is obsolete: true

Attachment #8935453 - Flags: review?(rwood)

Robert Wood [:rwood]

Comment 31

•

8 years ago

Comment on attachment 8935453 [details] [diff] [review] Teaching the pageloader to load multiple pages at once. v2 Review of attachment 8935453 [details] [diff] [review]: ----------------------------------------------------------------- Apologies for the review delay. This looks great - yes if you like file a bug against talos please, to actually introduce a new tp50 test using this multiwidow support and we can add that separately. Since the test is going to land separately just an r- here for one change in talos.json (revert it back to it's original state). Thanks! ::: testing/talos/talos.json @@ +90,5 @@ > "talos_options": ["--disable-stylo"], > "tests": ["speedometer"] > }, > "tp5o-e10s": { > + "tests": ["tp5o", "tp5o_multiwindow_4"], Please remove 'tp5o_multiwindow_4' since you're going to turn on / land the actual tests separately in the future

Attachment #8935453 - Flags: review?(rwood) → review-

Robert Wood [:rwood]

•

6 years ago

This bug stalled 2 years ago when it seemed to be important to have. Not sure if this is true also nowadays. Lets discuss that in the next perf triage meeting.

Priority: P2 → P3

Whiteboard: [e10s-multi:+] → [perftest:triage][e10s-multi:+]

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 35

•

5 years ago

Now that the reporter and Blake aren't with us anymore, Andrew what do you think about this request? Do you think it is still valuable to have?

Flags: needinfo?(acreskey)

Andrew Creskey [:acreskey]

Comment 36

•

5 years ago

Caveat: I know very little about the Talos pageload tests.
But because we're migrating all of our pageload tests to browsertime, my feeling is that is low value.
Maybe a multi-page open in browsertime (part of session restore?) could be logged and triaged.
Given the current gaps in testing that I'm aware of (e.g. url bar entry, applink on android), the priority would still be low in my opinion.

Flags: needinfo?(acreskey)

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 37

•

5 years ago

Thanks Andrew. Maybe lets also ask Mike given that he works on some performance stuff in Firefox for desktop and maybe he can give feedback or knows a good person to ask for.

Flags: needinfo?(mconley)

Mike Conley (:mconley) (:⚙️)

Comment 38

•

5 years ago

Ultimately, this seems like a form of stress test. I concur with acreskey - this is likely low priority at this point. It's just another gap in our performance testing story.

Flags: needinfo?(mconley)

Henrik Skupin [:whimboo][⌚️UTC+2]

Updated

•

5 years ago

Whiteboard: [perftest:triage][e10s-multi:+] → [e10s-multi:+]

BMO Automation

Updated

•

3 years ago

Severity: normal → S3

Alex Finder|PTO, returning August 18th

Comment 39

•

10 months ago

Closing this as WONTFIX, since it would be best to use the raptor or mozperftest test harness to achieve this.

Status: NEW → RESOLVED

Closed: 10 months ago

Resolution: --- → WONTFIX

Whiteboard: [e10s-multi:+] → [e10s-multi:+][fxp]

Jira Integration Bot

Updated

•

10 months ago

See Also: → https://mozilla-hub.atlassian.net/browse/FXP-3804

WIP 8 years ago Gabor Krizsanits (INACTIVE) 68.74 KB, patch		Details \| Diff \| Splinter Review
necko log for parent 8 years ago Gabor Krizsanits (INACTIVE) 4.80 MB, text/plain		Details
necko log for one of the hanging child 8 years ago Gabor Krizsanits (INACTIVE) 6.46 MB, text/plain		Details
necko parent log with cache 8 years ago Gabor Krizsanits (INACTIVE) 2.65 MB, text/plain		Details
log.txt.0 - parent, from start 8 years ago Gabor Krizsanits (INACTIVE) 834.70 KB, text/plain		Details
Teaching the pageloader to load multiple pages at once. v1 8 years ago Gabor Krizsanits (INACTIVE) 29.65 KB, patch	rwood : feedback+	Details \| Diff \| Splinter Review
Teaching the pageloader to load multiple pages at once. v2 8 years ago Gabor Krizsanits (INACTIVE) 30.98 KB, patch	rwood : review-	Details \| Diff \| Splinter Review