Open Bug 1505493 Opened Last year Updated 2 months ago

Retarget nsHttpChannel handled by HttpChannelParent to a background thread, avoid main thread loop on critical path

Categories

(Core :: Networking: HTTP, enhancement, P3)

enhancement

Tracking

()

ASSIGNED
Tracking Status
firefox65 --- affected

People

(Reporter: mayhemer, Assigned: kershaw)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Whiteboard: [necko-triaged])

Attachments

(1 file, 1 obsolete file)

Attached file bt log
This is the first bug based on Backtrack data :)

We do an unnecessary parent process main thread loop for every chunk of data we deliver to the content process via PBackground, being it from cache or net.

Retargetting nsHttpChannel is the last piece of the PBg-HTTP project we are currently missing.  This has never been filed or I can't find the bug, and it has been completely forgotten about.

Please see the attachment for rational - a Backtrack log, a list of runnables/events leading to load of a css file (tho, this applies to any type of response/content), specifically this part:



EXECUTE_BEGIN "net::HttpBackgroundChannelParent::OnTransportAndData"
  24.665344s, Main Process(10532)/IPDL Background

DISPATCH ""
  24.665323s, Main Process(10532)/Main Thread

EXECUTE_BEGIN "CacheFileInputStream::NotifyListener"  <--- no need to post to main thread
  24.664373s, Main Process(10532)/Main Thread

DISPATCH ""
  24.53428s, Main Process(10532)/Main Thread

EXECUTE_BEGIN "CacheEntry::AvailableCallbackRunnable"
  24.533744s, Main Process(10532)/Main Thread



"CacheFileInputStream::NotifyListener" runnable is dispatched to the main thread regardless we are eventually sending the data via PBackground (IPDL Background).  This happens because we never retarget the http channel to any suitable background thread.

I think we have to have an "internal" variant of retargetting.  The http channel, when allowed, should decide where to retarget the data based on where from the content is coming (cache or net -> cache or socket thread).  http channel parent/bck parent has to be inspected for possible races.
Kershaw -- can you have a look at this? Thanks!
Assignee: nobody → kershaw
Priority: -- → P2
Status: NEW → ASSIGNED
Whiteboard: [necko-triaged]
Note that now this will effect both network and cache served responses.  When we enable the SPI, it will effect only cached responses (definitely still worth).
Blocks: 1516121
Note that this may help to also ease main thread to main thread IPC queuing on the parent process as all the data are sent using this critical IPC channel.
Depends on: 1528285

(In reply to Honza Bambas (:mayhemer) from comment #0)

Created attachment 9023355 [details]
bt log

This is the first bug based on Backtrack data :)

We do an unnecessary parent process main thread loop for every chunk of data
we deliver to the content process via PBackground, being it from cache or
net.

Retargetting nsHttpChannel is the last piece of the PBg-HTTP project we are
currently missing. This has never been filed or I can't find the bug, and
it has been completely forgotten about.

nsHttpChannel is already implemented nsIThreadRetargetableRequest. It seems we can just try to call nsHttpChannel::RetargetDeliveryTo from HttpBackgroundChannelParent.

That's what I had in mind, and it would definitely be better than what we have now.

The real affect is tho blocked by bug 1528285. CacheFileInputStream::NotifyListener (from comment 0) also calls OnStartRequest, which now happens on the main thread only.

We could also split this bug to two - to simply retarget nsHttpChannel to call OnData on a suitable worker thread as part 1 and then let call HttpChannelParent::OnStart/StopRequest off the main thread as part 2.

:selena, I am still discussing how to tackle those failed tests with Honza. Not sure if it's a good idea to turn off retargeting for those tests. If we really decide to do this, then I'll send you the list of tests.

Flags: needinfo?(sdeckelmann)

On second thoughts, it seems not right to turn off retargeting just because of test timeouts.

FWIW, Junior's encournted something like this before [1].

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1524154

(In reply to Kershaw Chang [:kershaw] from comment #9)

:selena, I am still discussing how to tackle those failed tests with Honza. Not sure if it's a good idea to turn off retargeting for those tests. If we really decide to do this, then I'll send you the list of tests.

I think there is no need to keep this ni.

Flags: needinfo?(sdeckelmann)

Hi John,

Recently, I've been investigating the reason of a media test [1] timeout, which I think might have something to do with my patch in this bug. The log shows that the "canplaythrough" event is not fired. I also found that in the successful cases, the http channel is closed at [2]. For the failed cases, the channel is closed at [3]. Do you know what could make this difference?
Moreover, from networking point of view, it seems that all the data is completely delivered to ChannelMediaResource::Listener::OnDataAvailable. Do you have any idea why the "canplaythrough" event is not fired?

Thanks.

[1] https://searchfox.org/mozilla-central/rev/201450283cddc9e409cec707acb65ba6cf6037b1/testing/web-platform/tests/2dcontext/imagebitmap/createImageBitmap-origin.sub.html
[2] https://searchfox.org/mozilla-central/rev/201450283cddc9e409cec707acb65ba6cf6037b1/dom/media/ChannelMediaResource.cpp#838
[3] https://searchfox.org/mozilla-central/rev/201450283cddc9e409cec707acb65ba6cf6037b1/dom/media/ChannelMediaResource.cpp#689

Flags: needinfo?(jolin)

Kershaw, may I know which platform are you using for the test? I'd like to reproduce it before checking the issue. Also, will you by any chance be able to enable debug log with MOZ_LOG=nsMediaElement:5 and upload the output? FWICT, canplaythrough event is fired at [1], and debug log would be very helpful.

[1] https://searchfox.org/mozilla-central/source/dom/html/HTMLMediaElement.cpp#5562

Flags: needinfo?(kershaw)

(In reply to John Lin [:jhlin][:jolin] from comment #13)

Kershaw, may I know which platform are you using for the test? I'd like to reproduce it before checking the issue. Also, will you by any chance be able to enable debug log with MOZ_LOG=nsMediaElement:5 and upload the output? FWICT, canplaythrough event is fired at [1], and debug log would be very helpful.

[1] https://searchfox.org/mozilla-central/source/dom/html/HTMLMediaElement.cpp#5562

I am only using linux platform for the test. Please see the failed test with MOZ_LOG=nsMediaElement:5 enabled at [1]. Unfortunately, I think this is unlikely to reproduce locally. This seems to be only reproducible on try.
The debug log is at [2]. You can search the http channel with the address 0xe6d26000 and you can see this channel is canceled with the status 0x805d0021 (NS_ERROR_PARSED_DATA_CACHED).

Thanks!

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=8df420e6a64674faa1061f0d2791e0c32894ba10&selectedJob=237923266
[2] https://taskcluster-artifacts.net/FcFT28CsRjCGYjkyAVLQEg/0/public/logs/live_backing.log

Flags: needinfo?(kershaw)

Kershaw,

Thanks for the log! It seems the decoder enter dormant state right after it finishes decoding the 1st frame. I tried to add some more log to see why, but couldn't trigger the failure on the try server anymore after rebasing on the latest central code. Could it be some patches landed there already fixes this bug? Can you confirm that the failure is still reproducible? Thanks a lot!

Flags: needinfo?(jolin) → needinfo?(kershaw)

(In reply to John Lin [:jhlin][:jolin] from comment #15)

Kershaw,

Thanks for the log! It seems the decoder enter dormant state right after it finishes decoding the 1st frame. I tried to add some more log to see why, but couldn't trigger the failure on the try server anymore after rebasing on the latest central code. Could it be some patches landed there already fixes this bug? Can you confirm that the failure is still reproducible? Thanks a lot!

As I mentioned in comment #12, I think this failure may be caused by my patch. Or my patch causes the failure rate higher.
I think it's not easy to reproduce this without my patch.
Could you send the patch with more logs to me? I can try to reproduce this again and send you the log.

Could you elaborate a bit on the reason of dormant state? My patch changes the speed of sending data from parent process to child process a bit faster. Is this the possible reason?

Thanks.

Flags: needinfo?(kershaw) → needinfo?(jolin)

Honza,

Could you take a look at the talos comparison result [1] of f03bdc6a8047 (without retargeting) and fa2553da1517 (with retargetgin)?

Honestly, I can't tell if retargeting have some improvments from the result. But I think there is also no performance regression.
FWIW, I think it's still worth to continue working on this. What do you think?

[1] https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=f03bdc6a8047c27df6fdc8880f70513ef4347f3b&newProject=try&newRevision=fa2553da15172fe3280431629de9599dcb25ba39

Flags: needinfo?(honzab.moz)

Ah.. I think acreskey might help here more than me, as I said before. I personally have trouble comparing talos results myself.

I usually push with ./mach try -b o -p linux64,macosx64,win32,win64 -u none -t all --rebuild-talos 10. Still, this doesn't run ALL talos tests anyway (despite the -t all!) I think I should add -shippable (former -pgo) to the list of platforms. You ran only a fraction of the tests and only 5 times? Maybe push with rebuild=20 twice: a vanilla, base cset and your patch applied on it.

Flags: needinfo?(honzab.moz)

(In reply to Kershaw Chang [:kershaw] from comment #16)

(In reply to John Lin [:jhlin][:jolin] from comment #15)

Kershaw,

Thanks for the log! It seems the decoder enter dormant state right after it finishes decoding the 1st frame. I tried to add some more log to see why, but couldn't trigger the failure on the try server anymore after rebasing on the latest central code. Could it be some patches landed there already fixes this bug? Can you confirm that the failure is still reproducible? Thanks a lot!

As I mentioned in comment #12, I think this failure may be caused by my patch. Or my patch causes the failure rate higher.
I think it's not easy to reproduce this without my patch.
Could you send the patch with more logs to me? I can try to reproduce this again and send you the log.

Sorry. I meant to say 'suspend' but typed 'dormant' by mistake. Could you please try again with https://paste.rs/KyZ.diff applied and log enabled with MOZ_LOG=nsMediaElement:5,MediaDecoder:5,MediaResource:5?

Could you elaborate a bit on the reason of dormant state? My patch changes the speed of sending data from parent process to child process a bit faster. Is this the possible reason?

Thanks.

IIUC, [1] makes sure readyState advance to HAVE_ENOUGH_DATA and canplaythrough event fired. Hope the log can show us how data arrive sooner could affect this.

Flags: needinfo?(jolin) → needinfo?(kershaw)

Sorry. I meant to say 'suspend' but typed 'dormant' by mistake. Could you please try again with https://paste.rs/KyZ.diff applied and log enabled with MOZ_LOG=nsMediaElement:5,MediaDecoder:5,MediaResource:5?

I finally managed to reproduce this again.
Please take a look at the logs [1][2].

[1] https://taskcluster-artifacts.net/c5PXlESrSV-2TRQ-YZDyBA/0/public/logs/live_backing.log
[2] https://taskcluster-artifacts.net/ZuNKcPcLSl-I8uk0fXSq6g/0/public/logs/live_backing.log

Flags: needinfo?(kershaw) → needinfo?(jolin)

Thanks a lot for your help!

I've compared the logs of passed and failed jobs. It looks like the video element didn't advance to HAVE_ENOUGH_DATA because mDownloadSuspendedByCache was not set [1]. Tracing back where that member is set leads me to [2]. The code above it determines whether to call NotifyDataEnded() or not according to the state of media cache. And I guess data arrival timing could affect that.

Could you please help reproduce one more time with https://paste.rs/mlr.diff applied and MOZ_LOG=nsMediaElement:5,MediaDecoder:5,MediaResource:5,MediaCache:5 to confirm my theory?

[1] https://searchfox.org/mozilla-central/source/dom/html/HTMLMediaElement.cpp#5448
[2] https://searchfox.org/mozilla-central/source/dom/media/ChannelMediaResource.cpp#407

Flags: needinfo?(jolin)

(In reply to John Lin [:jhlin][:jolin] from comment #21)

Thanks a lot for your help!

I've compared the logs of passed and failed jobs. It looks like the video element didn't advance to HAVE_ENOUGH_DATA because mDownloadSuspendedByCache was not set [1]. Tracing back where that member is set leads me to [2]. The code above it determines whether to call NotifyDataEnded() or not according to the state of media cache. And I guess data arrival timing could affect that.

Could you please help reproduce one more time with https://paste.rs/mlr.diff applied and MOZ_LOG=nsMediaElement:5,MediaDecoder:5,MediaResource:5,MediaCache:5 to confirm my theory?

[1] https://searchfox.org/mozilla-central/source/dom/html/HTMLMediaElement.cpp#5448
[2] https://searchfox.org/mozilla-central/source/dom/media/ChannelMediaResource.cpp#407

Please see the log at [1][2].
It seems that ChannelMediaResource::Seek is called at [3], so I can't see the second log you added at [4].

[1] https://taskcluster-artifacts.net/V_szf4jKQiekHrdEiKsPDA/0/public/logs/live_backing.log
[2] https://taskcluster-artifacts.net/FfiXLBHfTEOXZn6FwLNi1A/0/public/logs/live_backing.log
[3] https://searchfox.org/mozilla-central/rev/69ace9da347adcc4a33c6fa3d8e074759b91068c/dom/media/ChannelMediaResource.cpp#860
[4] https://searchfox.org/mozilla-central/rev/69ace9da347adcc4a33c6fa3d8e074759b91068c/dom/media/ChannelMediaResource.cpp#396

Flags: needinfo?(jolin)

as a note, 20x for each job is a bit overkill- typically 6 is plenty, 10 is top end. Also running this at scale (all tests, all platforms) puts a high load on our limited device pool which causes a backlog for all users on try.

Kershaw, what's the status here?

Flags: needinfo?(kershaw)

(In reply to Honza Bambas (:mayhemer) from comment #27)

Kershaw, what's the status here?

Sorry for the late progress here. I'll push this to try and see how many tests are still failure this week.
I also have to capture the log again for John to analyse the reason of failed media wpt test.

About the talos result, I've run 4 times and I can't tell if there is any improvement or regression of the results. But, I think we can continue working on this bug.

Flags: needinfo?(kershaw)

I'm not pushing, only asking :) Thanks!

I would be very curious as to whether this change improves the raptor tp6 and tp6-mobile page load tests.
Also, any chance it could impact Bug 1556022?

(In reply to Andrew Creskey from comment #30)

I would be very curious as to whether this change improves the raptor tp6 and tp6-mobile page load tests.
Also, any chance it could impact Bug 1556022?

I think e10s is disabled for Fenix, right? If yes, this change has no effect.
If e10s is enabled, this change might have some improvements.

(In reply to Kershaw Chang [:kershaw] from comment #31)

(In reply to Andrew Creskey from comment #30)

I would be very curious as to whether this change improves the raptor tp6 and tp6-mobile page load tests.
Also, any chance it could impact Bug 1556022?

I think e10s is disabled for Fenix, right? If yes, this change has no effect.
If e10s is enabled, this change might have some improvements.

Kershaw, on android geckoview_example and Fenix both use e10s, multi-process.
Because Fenix uses a fixed geckoview version (68 I believe, but it's regularly updated) I think it's easier to test the performance impact with geckoview_example.

I happen to have a good android baseline revision with a lot of pageload test jobs.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ab04ffe5d4d20069bda271eec7e33f390ef6b7ae
If you point me at a patch that will build against this I can kick off some tests.

(In reply to Andrew Creskey from comment #32)

(In reply to Kershaw Chang [:kershaw] from comment #31)

(In reply to Andrew Creskey from comment #30)

I would be very curious as to whether this change improves the raptor tp6 and tp6-mobile page load tests.
Also, any chance it could impact Bug 1556022?

I think e10s is disabled for Fenix, right? If yes, this change has no effect.
If e10s is enabled, this change might have some improvements.

Kershaw, on android geckoview_example and Fenix both use e10s, multi-process.
Because Fenix uses a fixed geckoview version (68 I believe, but it's regularly updated) I think it's easier to test the performance impact with geckoview_example.

I happen to have a good android baseline revision with a lot of pageload test jobs.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ab04ffe5d4d20069bda271eec7e33f390ef6b7ae
If you point me at a patch that will build against this I can kick off some tests.

I've rebased the patch to current central and pushed it to phabricator.
https://phabricator.services.mozilla.com/D22516

Thanks Kershaw.

We'll be able to see results here as I add repeat jobs:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=a15a5d367a7b7a197d90f13ff2670610a09a5003&newProject=try&newRevision=05ad273d6a5ad811ba821e1821343ab2dee19904&framework=10

I chose to only test android because I find that it takes time to test platforms and android is sensitive to performance but also reasonably stable.
We can always add tests for more targets to these same pushes.

Flags: needinfo?(acreskey)

... there was a problem with the android job queue so this will take a bit longer to get performance results...

Surprisingly that try result looks greener than previous push. One possible reason is that the easily failed media wpt test was modified in Bug 1529056.
So, there is no need to ask John's help right now.

I'll also try to fix the review comments and move this bug forward this week.

Flags: needinfo?(jolin)

Perfherder compare has got a reasonably number of results, feel free to add any jobs to it.
(left hand side is baseline, right hand side has the patch applied)
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=a15a5d367a7b7a197d90f13ff2670610a09a5003&newProject=try&newRevision=05ad273d6a5ad811ba821e1821343ab2dee19904&framework=10

Flags: needinfo?(acreskey)

Andrew, what other jobs would you suggest to run? From what I see in the log (and how I can understand it) it seems this change doesn't have much impact, the differences seems to be either a noise (but with 25 runs each?) or just a reorder of things w/o bringing a real consistent benefit.

Honza, the results are a bit puzzling to me.
Certainly there is a more noise than we'd like, so it's hard to say.

This one result, if this is real, is worrisome:
Amazon Load time on the Pixel 2 has almost tripled?
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&newProject=try&newRevision=05ad273d6a5ad811ba821e1821343ab2dee19904&originalSignature=1995546&newSignature=1995546&framework=10&originalRevision=a15a5d367a7b7a197d90f13ff2670610a09a5003

I'll add tests for Windows.

One theory: in backtrack logs I often see that, specially for responses coming from the cache, we do fire OnStart/OnData (once)/OnStop in one main thread loop (one event). With retargetting, we are adding another loop to deliver OnStop again on the main thread - in sequence, which can get significantly delayed by main thread blockers... Prioritization could help here, but it's still questionable if overall this will help at all anyway.

We can try to not retarget for cached loads to verify my theory, but to be honest, the same can happen for network loads when they are small enough to again deliver all 3 notifications at the same loop.

We may rather wait for bug 1528285 to be fixed first and use it here to retarget the whole stream listener.

Priority: P2 → P3
Attachment #9049208 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.