Bugzilla

Updated

•

6 years ago

Status: NEW → ASSIGNED

u408661

Updated

•

6 years ago

Whiteboard: [necko-triaged]

Reporter

Comment 2

•

6 years ago

Note that now this will effect both network and cache served responses.  When we enable the SPI, it will effect only cached responses (definitely still worth).

Reporter

Updated

•

5 years ago

Blocks: 1516121

Reporter

Comment 3

•

5 years ago

Note that this may help to also ease main thread to main thread IPC queuing on the parent process as all the data are sent using this critical IPC channel.

Reporter

Updated

•

5 years ago

Depends on: omt-networking

Comment 4

•

5 years ago

(In reply to Honza Bambas (:mayhemer) from comment #0)

Created attachment 9023355 [details]
bt log

This is the first bug based on Backtrack data :)

We do an unnecessary parent process main thread loop for every chunk of data
we deliver to the content process via PBackground, being it from cache or
net.

Retargetting nsHttpChannel is the last piece of the PBg-HTTP project we are
currently missing. This has never been filed or I can't find the bug, and
it has been completely forgotten about.

nsHttpChannel is already implemented nsIThreadRetargetableRequest. It seems we can just try to call nsHttpChannel::RetargetDeliveryTo from HttpBackgroundChannelParent.

Reporter

Comment 5

•

5 years ago

That's what I had in mind, and it would definitely be better than what we have now.

The real affect is tho blocked by bug 1528285. CacheFileInputStream::NotifyListener (from comment 0) also calls OnStartRequest, which now happens on the main thread only.

We could also split this bug to two - to simply retarget nsHttpChannel to call OnData on a suitable worker thread as part 1 and then let call HttpChannelParent::OnStart/StopRequest off the main thread as part 2.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=5c2ad2d52286e23d432d4ff7b2223a2e3f08b287

Comment 6

•

5 years ago

Comment 7

•

5 years ago

Attached file Bug 1505493 - Retarget http channel to background thread (obsolete) — Details

https://treeherder.mozilla.org/#/jobs?repo=try&revision=845897e98c2c13353677063328b5330703955083

Comment 8

•

5 years ago

•

Edited

There are still some test timeouts.

Comment 9

•

5 years ago

:selena, I am still discussing how to tackle those failed tests with Honza. Not sure if it's a good idea to turn off retargeting for those tests. If we really decide to do this, then I'll send you the list of tests.

Flags: needinfo?(sdeckelmann)

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1524154

Comment 10

•

5 years ago

On second thoughts, it seems not right to turn off retargeting just because of test timeouts.

FWIW, Junior's encournted something like this before [1].

Comment 11

•

5 years ago

(In reply to Kershaw Chang [:kershaw] from comment #9)

:selena, I am still discussing how to tackle those failed tests with Honza. Not sure if it's a good idea to turn off retargeting for those tests. If we really decide to do this, then I'll send you the list of tests.

I think there is no need to keep this ni.

Flags: needinfo?(sdeckelmann)

[1] https://searchfox.org/mozilla-central/rev/201450283cddc9e409cec707acb65ba6cf6037b1/testing/web-platform/tests/2dcontext/imagebitmap/createImageBitmap-origin.sub.html
[2] https://searchfox.org/mozilla-central/rev/201450283cddc9e409cec707acb65ba6cf6037b1/dom/media/ChannelMediaResource.cpp#838
[3] https://searchfox.org/mozilla-central/rev/201450283cddc9e409cec707acb65ba6cf6037b1/dom/media/ChannelMediaResource.cpp#689

Comment 12

•

5 years ago

Hi John,

Recently, I've been investigating the reason of a media test [1] timeout, which I think might have something to do with my patch in this bug. The log shows that the "canplaythrough" event is not fired. I also found that in the successful cases, the http channel is closed at [2]. For the failed cases, the channel is closed at [3]. Do you know what could make this difference?
Moreover, from networking point of view, it seems that all the data is completely delivered to ChannelMediaResource::Listener::OnDataAvailable. Do you have any idea why the "canplaythrough" event is not fired?

Thanks.

Flags: needinfo?(jolin)

[1] https://searchfox.org/mozilla-central/source/dom/html/HTMLMediaElement.cpp#5562

Comment 13

•

5 years ago

Kershaw, may I know which platform are you using for the test? I'd like to reproduce it before checking the issue. Also, will you by any chance be able to enable debug log with MOZ_LOG=nsMediaElement:5 and upload the output? FWICT, canplaythrough event is fired at [1], and debug log would be very helpful.

Flags: needinfo?(kershaw)

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=8df420e6a64674faa1061f0d2791e0c32894ba10&selectedJob=237923266
[2] https://taskcluster-artifacts.net/FcFT28CsRjCGYjkyAVLQEg/0/public/logs/live_backing.log

Comment 14

•

5 years ago

(In reply to John Lin [:jhlin][:jolin] from comment #13)

Kershaw, may I know which platform are you using for the test? I'd like to reproduce it before checking the issue. Also, will you by any chance be able to enable debug log with MOZ_LOG=nsMediaElement:5 and upload the output? FWICT, canplaythrough event is fired at [1], and debug log would be very helpful.

[1] https://searchfox.org/mozilla-central/source/dom/html/HTMLMediaElement.cpp#5562

I am only using linux platform for the test. Please see the failed test with MOZ_LOG=nsMediaElement:5 enabled at [1]. Unfortunately, I think this is unlikely to reproduce locally. This seems to be only reproducible on try.
The debug log is at [2]. You can search the http channel with the address 0xe6d26000 and you can see this channel is canceled with the status 0x805d0021 (NS_ERROR_PARSED_DATA_CACHED).

Thanks!

Flags: needinfo?(kershaw)

Comment 15

•

5 years ago

Kershaw,

Thanks for the log! It seems the decoder enter dormant state right after it finishes decoding the 1st frame. I tried to add some more log to see why, but couldn't trigger the failure on the try server anymore after rebasing on the latest central code. Could it be some patches landed there already fixes this bug? Can you confirm that the failure is still reproducible? Thanks a lot!

Flags: needinfo?(jolin) → needinfo?(kershaw)

Comment 16

•

5 years ago

(In reply to John Lin [:jhlin][:jolin] from comment #15)

Kershaw,

Thanks for the log! It seems the decoder enter dormant state right after it finishes decoding the 1st frame. I tried to add some more log to see why, but couldn't trigger the failure on the try server anymore after rebasing on the latest central code. Could it be some patches landed there already fixes this bug? Can you confirm that the failure is still reproducible? Thanks a lot!

As I mentioned in comment #12, I think this failure may be caused by my patch. Or my patch causes the failure rate higher.
I think it's not easy to reproduce this without my patch.
Could you send the patch with more logs to me? I can try to reproduce this again and send you the log.

Could you elaborate a bit on the reason of dormant state? My patch changes the speed of sending data from parent process to child process a bit faster. Is this the possible reason?

Thanks.

Flags: needinfo?(kershaw) → needinfo?(jolin)

[1] https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=f03bdc6a8047c27df6fdc8880f70513ef4347f3b&newProject=try&newRevision=fa2553da15172fe3280431629de9599dcb25ba39

Comment 17

•

5 years ago

Honza,

Could you take a look at the talos comparison result [1] of f03bdc6a8047 (without retargeting) and fa2553da1517 (with retargetgin)?

Honestly, I can't tell if retargeting have some improvments from the result. But I think there is also no performance regression.
FWIW, I think it's still worth to continue working on this. What do you think?

Flags: needinfo?(honzab.moz)

Reporter

Comment 18

•

5 years ago

Ah.. I think acreskey might help here more than me, as I said before. I personally have trouble comparing talos results myself.

I usually push with ./mach try -b o -p linux64,macosx64,win32,win64 -u none -t all --rebuild-talos 10. Still, this doesn't run ALL talos tests anyway (despite the -t all!) I think I should add -shippable (former -pgo) to the list of platforms. You ran only a fraction of the tests and only 5 times? Maybe push with rebuild=20 twice: a vanilla, base cset and your patch applied on it.

Flags: needinfo?(honzab.moz)

Comment 19

•

5 years ago

(In reply to Kershaw Chang [:kershaw] from comment #16)

(In reply to John Lin [:jhlin][:jolin] from comment #15)

Kershaw,

Thanks for the log! It seems the decoder enter dormant state right after it finishes decoding the 1st frame. I tried to add some more log to see why, but couldn't trigger the failure on the try server anymore after rebasing on the latest central code. Could it be some patches landed there already fixes this bug? Can you confirm that the failure is still reproducible? Thanks a lot!

As I mentioned in comment #12, I think this failure may be caused by my patch. Or my patch causes the failure rate higher.
I think it's not easy to reproduce this without my patch.
Could you send the patch with more logs to me? I can try to reproduce this again and send you the log.

Sorry. I meant to say 'suspend' but typed 'dormant' by mistake. Could you please try again with https://paste.rs/KyZ.diff applied and log enabled with MOZ_LOG=nsMediaElement:5,MediaDecoder:5,MediaResource:5?

Could you elaborate a bit on the reason of dormant state? My patch changes the speed of sending data from parent process to child process a bit faster. Is this the possible reason?

Thanks.

IIUC, [1] makes sure readyState advance to HAVE_ENOUGH_DATA and canplaythrough event fired. Hope the log can show us how data arrive sooner could affect this.

Flags: needinfo?(jolin) → needinfo?(kershaw)

[1] https://taskcluster-artifacts.net/c5PXlESrSV-2TRQ-YZDyBA/0/public/logs/live_backing.log
[2] https://taskcluster-artifacts.net/ZuNKcPcLSl-I8uk0fXSq6g/0/public/logs/live_backing.log

Comment 20

•

5 years ago

Sorry. I meant to say 'suspend' but typed 'dormant' by mistake. Could you please try again with https://paste.rs/KyZ.diff applied and log enabled with MOZ_LOG=nsMediaElement:5,MediaDecoder:5,MediaResource:5?

I finally managed to reproduce this again.
Please take a look at the logs [1][2].

Flags: needinfo?(kershaw) → needinfo?(jolin)

[1] https://searchfox.org/mozilla-central/source/dom/html/HTMLMediaElement.cpp#5448
[2] https://searchfox.org/mozilla-central/source/dom/media/ChannelMediaResource.cpp#407

Comment 21

•

5 years ago

Thanks a lot for your help!

I've compared the logs of passed and failed jobs. It looks like the video element didn't advance to HAVE_ENOUGH_DATA because mDownloadSuspendedByCache was not set [1]. Tracing back where that member is set leads me to [2]. The code above it determines whether to call NotifyDataEnded() or not according to the state of media cache. And I guess data arrival timing could affect that.

Could you please help reproduce one more time with https://paste.rs/mlr.diff applied and MOZ_LOG=nsMediaElement:5,MediaDecoder:5,MediaResource:5,MediaCache:5 to confirm my theory?

Flags: needinfo?(jolin)

[1] https://taskcluster-artifacts.net/V_szf4jKQiekHrdEiKsPDA/0/public/logs/live_backing.log
[2] https://taskcluster-artifacts.net/FfiXLBHfTEOXZn6FwLNi1A/0/public/logs/live_backing.log
[3] https://searchfox.org/mozilla-central/rev/69ace9da347adcc4a33c6fa3d8e074759b91068c/dom/media/ChannelMediaResource.cpp#860
[4] https://searchfox.org/mozilla-central/rev/69ace9da347adcc4a33c6fa3d8e074759b91068c/dom/media/ChannelMediaResource.cpp#396

Comment 22

•

5 years ago

(In reply to John Lin [:jhlin][:jolin] from comment #21)

Thanks a lot for your help!

I've compared the logs of passed and failed jobs. It looks like the video element didn't advance to HAVE_ENOUGH_DATA because mDownloadSuspendedByCache was not set [1]. Tracing back where that member is set leads me to [2]. The code above it determines whether to call NotifyDataEnded() or not according to the state of media cache. And I guess data arrival timing could affect that.

Could you please help reproduce one more time with https://paste.rs/mlr.diff applied and MOZ_LOG=nsMediaElement:5,MediaDecoder:5,MediaResource:5,MediaCache:5 to confirm my theory?

[1] https://searchfox.org/mozilla-central/source/dom/html/HTMLMediaElement.cpp#5448
[2] https://searchfox.org/mozilla-central/source/dom/media/ChannelMediaResource.cpp#407

Please see the log at [1][2].
It seems that ChannelMediaResource::Seek is called at [3], so I can't see the second log you added at [4].

Flags: needinfo?(jolin)

Talos result: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=be8b5ba10e4865a35091d31b1135917651cec472&newProject=try&newRevision=6683afca147d14e00661616c7706982ddf8c815a

Comment 23

•

5 years ago

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=11bd412954c2dcb6ace42ffa6a62f1d8b361c56c&newProject=try&newRevision=42653091e333cdf1a66359fe89f3ebf15e25f90b

Comment 24

•

5 years ago

(In reply to Kershaw Chang [:kershaw] from comment #23)

Talos result: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=be8b5ba10e4865a35091d31b1135917651cec472&newProject=try&newRevision=6683afca147d14e00661616c7706982ddf8c815a

Joel Maher ( :jmaher ) (UTC -8)

Comment 25

•

5 years ago

as a note, 20x for each job is a bit overkill- typically 6 is plenty, 10 is top end. Also running this at scale (all tests, all platforms) puts a high load on our limited device pool which causes a backlog for all users on try.

More talos result:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=0d5c0f6924d65e36bbf0483fe1f947399feaac6e&newProject=try&newRevision=83b56b4787a62553446a799bd103efc2dc66c680

Comment 26

•

5 years ago

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=23fa6f4ba990b5a09262b01f55a1599089b999c8&newProject=try&newRevision=3222de2534ce083deb131243d1479ae67769b14d&framework=10

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=2d6364bf6a4d699a5523831f68d3dfca85c46b75&newProject=try&newRevision=7c205f02e3735fbce74384daa8250b200ba9f856&framework=10

Reporter

Comment 27

•

5 years ago

Kershaw, what's the status here?

Flags: needinfo?(kershaw)

Comment 28

•

5 years ago

(In reply to Honza Bambas (:mayhemer) from comment #27)

Kershaw, what's the status here?

Sorry for the late progress here. I'll push this to try and see how many tests are still failure this week.
I also have to capture the log again for John to analyse the reason of failed media wpt test.

About the talos result, I've run 4 times and I can't tell if there is any improvement or regression of the results. But, I think we can continue working on this bug.

Flags: needinfo?(kershaw)

Reporter

Comment 29

•

5 years ago

I'm not pushing, only asking :) Thanks!

Comment 30

•

5 years ago

I would be very curious as to whether this change improves the raptor tp6 and tp6-mobile page load tests.
Also, any chance it could impact Bug 1556022?

Comment 31

•

5 years ago

(In reply to Andrew Creskey from comment #30)

I would be very curious as to whether this change improves the raptor tp6 and tp6-mobile page load tests.
Also, any chance it could impact Bug 1556022?

I think e10s is disabled for Fenix, right? If yes, this change has no effect.
If e10s is enabled, this change might have some improvements.

Comment 32

•

5 years ago

(In reply to Kershaw Chang [:kershaw] from comment #31)

(In reply to Andrew Creskey from comment #30)

I would be very curious as to whether this change improves the raptor tp6 and tp6-mobile page load tests.
Also, any chance it could impact Bug 1556022?

I think e10s is disabled for Fenix, right? If yes, this change has no effect.
If e10s is enabled, this change might have some improvements.

Kershaw, on android geckoview_example and Fenix both use e10s, multi-process.
Because Fenix uses a fixed geckoview version (68 I believe, but it's regularly updated) I think it's easier to test the performance impact with geckoview_example.

I happen to have a good android baseline revision with a lot of pageload test jobs.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ab04ffe5d4d20069bda271eec7e33f390ef6b7ae
If you point me at a patch that will build against this I can kick off some tests.

Comment 33

•

5 years ago

(In reply to Andrew Creskey from comment #32)

(In reply to Kershaw Chang [:kershaw] from comment #31)

(In reply to Andrew Creskey from comment #30)

I would be very curious as to whether this change improves the raptor tp6 and tp6-mobile page load tests.
Also, any chance it could impact Bug 1556022?

I think e10s is disabled for Fenix, right? If yes, this change has no effect.
If e10s is enabled, this change might have some improvements.

Kershaw, on android geckoview_example and Fenix both use e10s, multi-process.
Because Fenix uses a fixed geckoview version (68 I believe, but it's regularly updated) I think it's easier to test the performance impact with geckoview_example.

I happen to have a good android baseline revision with a lot of pageload test jobs.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ab04ffe5d4d20069bda271eec7e33f390ef6b7ae
If you point me at a patch that will build against this I can kick off some tests.

I've rebased the patch to current central and pushed it to phabricator.
https://phabricator.services.mozilla.com/D22516

Comment 34

•

5 years ago

Thanks Kershaw.

We'll be able to see results here as I add repeat jobs:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=a15a5d367a7b7a197d90f13ff2670610a09a5003&newProject=try&newRevision=05ad273d6a5ad811ba821e1821343ab2dee19904&framework=10

I chose to only test android because I find that it takes time to test platforms and android is sensitive to performance but also reasonably stable.
We can always add tests for more targets to these same pushes.

Flags: needinfo?(acreskey)

Comment 35

•

5 years ago

... there was a problem with the android job queue so this will take a bit longer to get performance results...

Comment 36

•

5 years ago

Surprisingly that try result looks greener than previous push. One possible reason is that the easily failed media wpt test was modified in Bug 1529056.
So, there is no need to ask John's help right now.

I'll also try to fix the review comments and move this bug forward this week.

Flags: needinfo?(jolin)

Comment 37

•

5 years ago

Perfherder compare has got a reasonably number of results, feel free to add any jobs to it.
(left hand side is baseline, right hand side has the patch applied)
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=a15a5d367a7b7a197d90f13ff2670610a09a5003&newProject=try&newRevision=05ad273d6a5ad811ba821e1821343ab2dee19904&framework=10

Flags: needinfo?(acreskey)

Reporter

Comment 38

•

5 years ago

Andrew, what other jobs would you suggest to run? From what I see in the log (and how I can understand it) it seems this change doesn't have much impact, the differences seems to be either a noise (but with 25 runs each?) or just a reorder of things w/o bringing a real consistent benefit.

Comment 39

•

5 years ago

Honza, the results are a bit puzzling to me.
Certainly there is a more noise than we'd like, so it's hard to say.

This one result, if this is real, is worrisome:
Amazon Load time on the Pixel 2 has almost tripled?
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&newProject=try&newRevision=05ad273d6a5ad811ba821e1821343ab2dee19904&originalSignature=1995546&newSignature=1995546&framework=10&originalRevision=a15a5d367a7b7a197d90f13ff2670610a09a5003

I'll add tests for Windows.

Reporter

Comment 40

•

5 years ago

One theory: in backtrack logs I often see that, specially for responses coming from the cache, we do fire OnStart/OnData (once)/OnStop in one main thread loop (one event). With retargetting, we are adding another loop to deliver OnStop again on the main thread - in sequence, which can get significantly delayed by main thread blockers... Prioritization could help here, but it's still questionable if overall this will help at all anyway.

We can try to not retarget for cached loads to verify my theory, but to be honest, the same can happen for network loads when they are small enough to again deliver all 3 notifications at the same loop.

We may rather wait for bug 1528285 to be fixed first and use it here to retarget the whole stream listener.

Randell Jesup [:jesup] (needinfo me)

Updated

•

5 years ago

Priority: P2 → P3

Phabricator Automation

Updated

•

5 years ago

Attachment #9049208 - Attachment is obsolete: true

Updated

•

4 years ago

Blocks: 1604248

Stephen Donner [:stephend] Not actively reading bugmail

Comment 41

•

4 years ago

Mass-removing myself from cc; search for 12b9dfe4-ece3-40dc-8d23-60e179f64ac1 or any reasonable part thereof, to mass-delete these notifications (and sorry!)