Closed Bug 1409002 Opened 8 years ago Closed 9 months ago

Teaching tp5o page loader test to load multiple pages in parallel

Categories

(Testing :: Talos, defect, P3)

Version 3
defect

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: gkrizsanits, Unassigned)

References

Details

(Whiteboard: [e10s-multi:+][fxp])

Attachments

(2 files, 5 obsolete files)

Attached patch WIP (obsolete) — Splinter Review
For measuring the efficiency of e10s-multi on various platforms with various setups, I want this test to be customizable to load each sites in N windows in parallel instead of only one. I'm not sure yet if we want to run a tp5o with let's say with 4 windows by default on perfherder, let's see first the usefulness of this customization and the results.
Attachment #8918841 - Attachment is patch: true
Assignee: nobody → gkrizsanits
Priority: -- → P1
Whiteboard: [e10s-multi:+]
is this really different than tps (where we load all pages in different tabs and switch between the tabs)?
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #1) > is this really different than tps (where we load all pages in different tabs > and switch between the tabs)? Yeah. Tps test loads all the pages, waits until all of them are loaded, and then measures the smoothness of tab switches. So it does not measure page loads nor does it wind a lot from multiple processes. What I'm proposing here is starting up several pageloads at once and stop the timer when all the pages are loaded. So it measures parallel pageloads (DOM/network/etc work) where e10s-multi should be a big win if the machine has multiple cores, and might not be all that great if it only has one...
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b1d2dd6b4ebec6af62e8caa327d951b75d680da2&selectedJob=137871195 So this experiment loads the same page concurrently in 4 windows at once, while only one is visible. And measures the time it takes to load it in each window and then calculate the average for that page. (and then repeats this a bunch of times and does another average) Now I do this for the first 3 URLs from the tp5o page set (I wanted to get some results fast...). And on all platform roughly we're twice as fast with 4 content processes than with one. Not too bad.
Attached file necko log for parent (obsolete) —
There is a necko issue that is blocking me here. What I'm doing here: It's a modified pageloader test. loading the tp5o pages, one URL at a time, but in several windows at once. So I initiate a page load in all (let's say 8 windows) with the same URL and when all of the pages are loaded I move on to the next step. Multiple content processes are enabled, but I think vanilla e10s is enough as well to hit the problem. Problem: sometimes some of the pages do not finish loading in some of the windows. A network request just hangs (I increased the timeout for this case and 30s was not enough for the browser to start doing anything again, so I'd say it's a pretty long hang, not sure if it would last forever or not). It's not a UI hang just a network hang. I did some logging and the interesting part starts around: 10:28:13.082000 in the parent process, that poll never returns and a lot of http channel stuff gets destroyed and then the test times out after a long pause. I think something goes wrong in one of the previous fetch that takes quite long: 2017-10-27 10:28:13.081000 UTC - [Socket Thread]: D/nsSocketTransport ...returned after 413 milliseconds and involves some cached resource. This issue is not just blocking me from landing this talos test and doing some important measurements, but also I'm kind of worried that this can happen for our users since this is a pretty simple setup. It might justa be something with out python file server though, but I think it's important to figure out what's going on and I really need some help for that.
Flags: needinfo?(honzab.moz)
is this on a specific platform, or all plaforms?
Attached file necko log for one of the hanging child (obsolete) —
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #5) > is this on a specific platform, or all plaforms? Seems to happen on all platforms based on my try runs.
It seems like that the problem is that before everything starts to hang I hit: http://searchfox.org/mozilla-central/rev/1ebd2eff44617df3b82eea7d2f3ca1b60cc591a0/image/imgLoader.cpp#2882 multiple times from one of the child processes with stack: 10:54:32 INFO - PID 10396 | #01: imgCacheValidator::OnStartRequest (c:\development\mozilla-central\image\imgloader.cpp:2926) 10:54:32 INFO - PID 10396 | #02: mozilla::net::HttpChannelChild::DoOnStartRequest (c:\development\mozilla-central\netwerk\protocol\http\httpchannelchild.cpp:689) 10:54:32 INFO - PID 10396 | #03: mozilla::net::HttpChannelChild::OnStartRequest (c:\development\mozilla-central\netwerk\protocol\http\httpchannelchild.cpp:620) 10:54:32 INFO - PID 10396 | #04: mozilla::net::StartRequestEvent::Run (c:\development\mozilla-central\netwerk\protocol\http\httpchannelchild.cpp:451) 10:54:32 INFO - PID 10396 | #05: mozilla::net::HttpChannelChild::RecvOnStartRequest (c:\development\mozilla-central\netwerk\protocol\http\httpchannelchild.cpp:523) 10:54:32 INFO - PID 10396 | #06: mozilla::net::PHttpChannelChild::OnMessageReceived (c:\development\mozilla-central\obj-i686-pc-mingw32\ipc\ipdl\phttpchannelchild.cpp:638) 10:54:32 INFO - PID 10396 | #07: mozilla::dom::PContentChild::OnMessageReceived (c:\development\mozilla-central\obj-i686-pc-mingw32\ipc\ipdl\pcontentchild.cpp:4868) 10:54:32 INFO - PID 10396 | #08: mozilla::dom::ContentChild::OnMessageReceived (c:\development\mozilla-central\dom\ipc\contentchild.cpp:3722) 10:54:32 INFO - PID 10396 | #09: mozilla::ipc::MessageChannel::DispatchAsyncMessage (c:\development\mozilla-central\ipc\glue\messagechannel.cpp:2120) 10:54:32 INFO - PID 10396 | #10: mozilla::ipc::MessageChannel::DispatchMessageW (c:\development\mozilla-central\ipc\glue\messagechannel.cpp:2052) 10:54:32 INFO - PID 10396 | #11: mozilla::ipc::MessageChannel::RunMessage (c:\development\mozilla-central\ipc\glue\messagechannel.cpp:1895) 10:54:32 INFO - PID 10396 | #12: mozilla::ipc::MessageChannel::MessageTask::Run (c:\development\mozilla-central\ipc\glue\messagechannel.cpp:1929) 10:54:32 INFO - PID 10396 | #13: mozilla::SchedulerGroup::Runnable::Run (c:\development\mozilla-central\xpcom\threads\schedulergroup.cpp:396) 10:54:32 INFO - PID 10396 | #14: nsThread::ProcessNextEvent (c:\development\mozilla-central\xpcom\threads\nsthread.cpp:1038) 10:54:32 INFO - PID 10396 | #15: NS_ProcessNextEvent (c:\development\mozilla-central\xpcom\threads\nsthreadutils.cpp:524) 10:54:32 INFO - PID 10396 | #16: mozilla::ipc::MessagePump::Run (c:\development\mozilla-central\ipc\glue\messagepump.cpp:97) 10:54:32 INFO - PID 10396 | #17: mozilla::ipc::MessagePumpForChildProcess::Run (c:\development\mozilla-central\ipc\glue\messagepump.cpp:301) 10:54:32 INFO - PID 10396 | #18: MessageLoop::RunHandler (c:\development\mozilla-central\ipc\chromium\src\base\message_loop.cc:320) 10:54:32 INFO - PID 10396 | #19: MessageLoop::Run (c:\development\mozilla-central\ipc\chromium\src\base\message_loop.cc:300) 10:54:32 INFO - PID 10396 | #20: nsBaseAppShell::Run (c:\development\mozilla-central\widget\nsbaseappshell.cpp:160) 10:54:32 INFO - PID 10396 | #21: nsAppShell::Run (c:\development\mozilla-central\widget\windows\nsappshell.cpp:232) 10:54:32 INFO - PID 10396 | #22: XRE_RunAppShell (c:\development\mozilla-central\toolkit\xre\nsembedfunctions.cpp:877) 10:54:32 INFO - PID 10396 | #23: mozilla::ipc::MessagePumpForChildProcess::Run (c:\development\mozilla-central\ipc\glue\messagepump.cpp:278) 10:54:32 INFO - PID 10396 | #24: MessageLoop::RunHandler (c:\development\mozilla-central\ipc\chromium\src\base\message_loop.cc:320) 10:54:32 INFO - PID 10396 | #25: MessageLoop::Run (c:\development\mozilla-central\ipc\chromium\src\base\message_loop.cc:300) 10:54:32 INFO - PID 10396 | #26: XRE_InitChildProcess (c:\development\mozilla-central\toolkit\xre\nsembedfunctions.cpp:707) 10:54:32 INFO - PID 10396 | #27: mozilla::BootstrapImpl::XRE_InitChildProcess (c:\development\mozilla-central\toolkit\xre\bootstrap.cpp:69) 10:54:32 INFO - PID 10396 | #28: content_process_main (c:\development\mozilla-central\ipc\contentproc\plugin-container.cpp:63) 10:54:32 INFO - PID 10396 | #29: NS_internal_main (c:\development\mozilla-central\browser\app\nsbrowserapp.cpp:283) 10:54:32 INFO - PID 10396 | #30: wmain (c:\development\mozilla-central\toolkit\xre\nswindowswmain.cpp:114) 10:54:32 INFO - PID 10396 | #31: __scrt_common_main_seh (f:\dd\vctools\crt\vcstartup\src\startup\exe_common.inl:25
Turning on the logging for cache as well it seems like the hangs happen right after CacheFile::WriteMetadataIfNeeded.
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #4) > Created attachment 8922793 [details] > necko log for parent Please add "sync,cache2:5,nsChannelClassifier:5" to the list of modules and rerun. There might be lines missing in the log. Thanks. Parent log will likely be sufficient.
Flags: needinfo?(honzab.moz)
Attached file necko parent log with cache (obsolete) —
2017-11-01 16:17:31 is where the timeout call kills the current round and moves on to the next one, so the part before that what matters
Attachment #8922801 - Flags: feedback?(honzab.moz)
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #11) > Created attachment 8924202 [details] > necko parent log with cache > > 2017-11-01 16:17:31 is where the timeout call kills the current round and > moves on to the next one, so the part before that what matters Sorry, this log is useless. There is no interesting data. Could you use environment vars to start logging instead of using about:networking? I want a complete log. Also URLs that are hanging would be interesting to know. What feedback do you want on the old child log from me? Most of the http channels that are logged in that log are successfully done. I don't see any hangs, but the log is incomplete (started late).
Attachment #8922801 - Flags: feedback?(honzab.moz)
(In reply to Honza Bambas (:mayhemer) from comment #12) > (In reply to Gabor Krizsanits [:krizsa :gabor] from comment #11) > Sorry, this log is useless. There is no interesting data. Could you use > environment vars to start logging instead of using about:networking? I want > a complete log. Also URLs that are hanging would be interesting to know. During the mid-run hangs I couldn't spot anything we're hanging on in these logs... And according to RawCap there is no http packet missing (so it's not like a hang in the middle of fetching some image as far as I can tell). There is some TCP traffic during the hang but not http (I have no idea what exactly). Anyway, let's start with something simpler, this is the first hang short after the test starts. Seems to be related to URLs like https://127.0.0.1/selfsupport-dummy/ , http://127.0.0.1/safebrowsing-dummy/update , and other weird urls (so not the first page I'm trying to load) > > What feedback do you want on the old child log from me? Most of the http > channels that are logged in that log are successfully done. I don't see any > hangs, but the log is incomplete (started late). Sorry, I accidentally set the flag on the wrong log. I wanted to set on the "necko parent log with cache" log. Anyway, a full log would be gigantic for the mid-run hangs, but I have a longer version, I'll try to share it with you via google drive, since bugzilla does not accept larger files. If the full log from the beginning of the test run is really needed I can prepare you a patch to reproduce it locally with a single command. Or if you could give me some hints about what are you typically looking for in these logs I can try it myself as well, or do a better work at capturing the meaningful bits with the logger.
Attachment #8922793 - Attachment is obsolete: true
Attachment #8922801 - Attachment is obsolete: true
Attachment #8924202 - Attachment is obsolete: true
Attachment #8924530 - Flags: feedback?(honzab.moz)
In case this is not a networking/cache issue as we thought, what's the best way to track what is blocking the load event on the document? Is there any better way than trying to fetch and printf something from the caller of nsDocument::BlockOnload and nsDocument::UnblockOnload?
Flags: needinfo?(bugs)
Comment on attachment 8924530 [details] log.txt.0 - parent, from start All channels are done (i.e. all of them have delivered data and onstop). This is not a net issue. Document loading and onload blocking happens on the child process, so a child log would be good to share.
Attachment #8924530 - Flags: feedback?(honzab.moz)
Pressed enter too soon! I can see an error in the log for all of the 3 channels: nsHttpConnection::Activate [this=199F7F90] Bad Socket 804b000d the error is NS_ERROR_CONNECTION_REFUSED. I'd like to ask you for one more log run with added "nsSocketTransport:5" module, it will tell us where from exactly does that comes. Thanks.
Grrr.. it's there already: 2017-11-02 13:17:50.434000 UTC - [Socket Thread]: D/nsSocketTransport ErrorAccordingToNSPR [in=-5981 out=804b000d] 2017-11-02 13:17:50.434000 UTC - [Socket Thread]: D/nsSocketTransport connection failed! [reason=804b000d] Which is PR_CONNECT_REFUSED_ERROR which means there is no server running. This should lead to a standard error page shown, if this is a navigation load. I can see it takes around 1 sec to get the error.
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #14) > In case this is not a networking/cache issue as we thought, what's the best > way to track what is blocking the load event on the document? Is there any > better way than trying to fetch and printf something from the caller of > nsDocument::BlockOnload and nsDocument::UnblockOnload? Add some breakpoint there and print each callstacks? Or just use printf, that might be simpler.
Flags: needinfo?(bugs)
As it turns out it was an imgRequestProxy onload blocker issue. After figuring out what the problem was I found a related bug and there is a fix landed for it already.
Depends on: 1404422
Attachment #8918841 - Attachment is obsolete: true
Comment on attachment 8932511 [details] [diff] [review] Teaching the pageloader to load multiple pages at once. v1 So the patch does mainly two things, first it removes a few functions that are no longer used, second it teaches the page loader to specify a number of windows parameter, which will trigger multiple page loads at once for the tp5o test. Hopefully without breaking all the dependent talos tests. I still have to see if all the tests are working as before (see my last try push in the previous comment about the status). But since I've heard that you're refactoring this file I wanted to request an early feedback.
Attachment #8932511 - Flags: feedback?(rwood)
Comment on attachment 8932511 [details] [diff] [review] Teaching the pageloader to load multiple pages at once. v1 Review of attachment 8932511 [details] [diff] [review]: ----------------------------------------------------------------- Awesome! Thanks for this. I really like this approach and the refactoring, much cleaner. See comments inline. My main concern is it being complicated to debug failures with a large number of concurrent windows. For example, if say 3 windows have completed pageload but a 4th times out, it should be known that was the state. Having more logging of the number of windows open (and the number expected, and their state etc.) would help alot when investigating test failures. Thanks. I'm excited to see it in action. ::: testing/talos/talos.json @@ +90,5 @@ > "talos_options": ["--disable-stylo"], > "tests": ["speedometer"] > }, > "tp5o-e10s": { > + "tests": ["tp5o", "tp5o_multiwindow_4", "tp5o_multiwindow_4_singlecp"], You probably have it this way temporarily just so that it's possible to run the new tests on try; when they actually land I'd prefer a separate test 'tp5o-multiwin-e10s' or something like that ::: testing/talos/talos/pageloader/chrome/pageloader.js @@ +32,5 @@ > var cycle; > var pageCycle; > var report; > var timeout = -1; > +var delay = 500; I prefer this delay only be changed when numWindows > 1 so that existing job durations are not affected @@ +302,5 @@ > + browserWindow: wwatch.openWindow(null, > + "chrome://browser/content/", "_blank", > + `chrome,${toolbars},dialog=no,width=${winWidth},height=${winHeight}`, blank) > + }; > + I think it would be good to dumpLine indicating a window was opened; that way we can see more evidence in the log of multiple windows / how many were opened @@ +307,5 @@ > + gBrowserData[i].browserWindow.addEventListener("load", > + browserLoadFunc(gBrowserData[i]), true); > + } > + > + setTimeout(plLoadPage, 1000); Again, if possible probably best to only have this increased by 900ms if numWindows > 1 @@ +566,5 @@ > } > } else { > report.recordTime(recordedName, time); > } > dumpLine("Cycle " + (cycle + 1) + "(" + pageCycle + "): loaded " + pageName + " (next: " + nextName + ")"); If numWindows > 1 I'd add to this dumpLine "in N concurrent windows" @@ +581,5 @@ > mm.sendAsyncMessage("PageLoader:ScrollTest", { details }); > } > > +function _loadHandler(fnbpaint = 0, browserData) { > + if (gToBeLoaded == 0) Suggestion: dump some output here indicating load handler activated for window N or load handler activated with N windows remaining to be loaded or something. Just so if there are crashes/failures/intermittents it is more clear where we are at i.e. how many windows are open and what state each is in at that time. ::: testing/talos/talos/test.py @@ +575,5 @@ > unit = 'ms' > > > @register_test() > +class tp5o_multiwindow_4(tp5o): I'd make a base class for this: class tp5o_multiwindow(tp5o): tppagecycles = 10 tptimeout = 15000 timeout = 5000 Then we could derive tests for x number of windows, i.e.: class tp5o_multiwindow_2(tp5o_multiwindow): tpnumwindows = 2 class tp5o_multiwindow_4(tp5o_multiwindow): tpnumwindows = 4 @@ +589,5 @@ > + tpnumwindows = 4 > + preferences = {'dom.ipc.processCount': 1, > + 'dom.ipc.multiOptOut': 1} > + tptimeout = 15000 > + timeout = 5000 Same idea here, i.e.: class tp5o_multiwindow_singlecp(tp5o_multiwindow): preferences = {'dom.ipc.processCount': 1, 'dom.ipc.multiOptOut': 1} class tp5o_multiwindow_4_singlecp(tp5o_multiwindow_singlecp): tpnumwindows = 4
Attachment #8932511 - Flags: feedback?(rwood) → feedback+
I wonder if bug 1416682 has something to do with the load issues you've seen.
(In reply to Olli Pettay [:smaug] (ni?s and f?s processed after r?s) from comment #25) > I wonder if bug 1416682 has something to do with the load issues you've seen. The load issue I've seen was bug 1404422.
(In reply to Robert Wood [:rwood] from comment #23) > You probably have it this way temporarily just so that it's possible to run > the new tests on try; when they actually land I'd prefer a separate test > 'tp5o-multiwin-e10s' or something like that Honestly I'm nor sure how to introduce a new test, can we do that part in a followup? I also removed the single process version, since that was only for perf testing. The one I'm keeping would make sense for perf regression finding for the future. > > + setTimeout(plLoadPage, 1000); > > Again, if possible probably best to only have this increased by 900ms if > numWindows > 1 I removed the other delay but have to keep this one. It's not really a 900ms, previously this call was from another setTimeout with 500 so that is at least 500 + 100 but in practice I had to keep the 1000ms (see the try runs above). The proper solution is to use BrowserTestUtils here but integrating that into talos test is a bit more work I'm afraid. I would look into that but these are my last days at Mozilla... Anyway, I think I addressed all the other comments.
Attachment #8932511 - Attachment is obsolete: true
Attachment #8935453 - Flags: review?(rwood)
Comment on attachment 8935453 [details] [diff] [review] Teaching the pageloader to load multiple pages at once. v2 Review of attachment 8935453 [details] [diff] [review]: ----------------------------------------------------------------- Apologies for the review delay. This looks great - yes if you like file a bug against talos please, to actually introduce a new tp50 test using this multiwidow support and we can add that separately. Since the test is going to land separately just an r- here for one change in talos.json (revert it back to it's original state). Thanks! ::: testing/talos/talos.json @@ +90,5 @@ > "talos_options": ["--disable-stylo"], > "tests": ["speedometer"] > }, > "tp5o-e10s": { > + "tests": ["tp5o", "tp5o_multiwindow_4"], Please remove 'tp5o_multiwindow_4' since you're going to turn on / land the actual tests separately in the future
Attachment #8935453 - Flags: review?(rwood) → review-
Also please note, you'll need to rebase and update accordingly as there was some pageloader refactoring/cleanup done in Bug 1423295 that will conflict with this. Thanks!
I'll take this since I've been working on bug 1399962 and this test is quite related.
Assignee: gkrizsanits → mrbkap
Assignee: mrbkap → nobody
Priority: P1 → P2

This bug stalled 2 years ago when it seemed to be important to have. Not sure if this is true also nowadays. Lets discuss that in the next perf triage meeting.

Priority: P2 → P3
Whiteboard: [e10s-multi:+] → [perftest:triage][e10s-multi:+]

Now that the reporter and Blake aren't with us anymore, Andrew what do you think about this request? Do you think it is still valuable to have?

Flags: needinfo?(acreskey)

Caveat: I know very little about the Talos pageload tests.
But because we're migrating all of our pageload tests to browsertime, my feeling is that is low value.
Maybe a multi-page open in browsertime (part of session restore?) could be logged and triaged.
Given the current gaps in testing that I'm aware of (e.g. url bar entry, applink on android), the priority would still be low in my opinion.

Flags: needinfo?(acreskey)

Thanks Andrew. Maybe lets also ask Mike given that he works on some performance stuff in Firefox for desktop and maybe he can give feedback or knows a good person to ask for.

Flags: needinfo?(mconley)

Ultimately, this seems like a form of stress test. I concur with acreskey - this is likely low priority at this point. It's just another gap in our performance testing story.

Flags: needinfo?(mconley)
Whiteboard: [perftest:triage][e10s-multi:+] → [e10s-multi:+]
Severity: normal → S3

Closing this as WONTFIX, since it would be best to use the raptor or mozperftest test harness to achieve this.

Status: NEW → RESOLVED
Closed: 9 months ago
Resolution: --- → WONTFIX
Whiteboard: [e10s-multi:+] → [e10s-multi:+][fxp]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: