Open Bug 1925335 Opened 4 months ago Updated 2 months ago

Some sp3 tests (especially NewsSite) spend an appreciable amount of time in RasterImage::NotifyDecodeComplete ("notify images")

Categories

(Core :: Graphics: ImageLib, defect)

People

(Reporter: mstange, Unassigned, NeedInfo)

References

(Blocks 1 open bug)

Details

(Whiteboard: [sp3])

NewsSite-Next spends 4% of its time in RasterImage::NotifyDecodeComplete; NewsSite-Nuxt spends 1.1% of its time in it.
I can't find the equivalent work in the Chrome profiles anywhere.

NewsSite-Next: https://share.firefox.dev/3UcDiqv (4753 samples)
NewsSite-Nuxt: https://share.firefox.dev/3Ytgw06 (1051 samples)

Furthermore, some time is spent in AsyncNotifyRunnable::Run.

NewsSite-Next: https://share.firefox.dev/40b4hGy (345 samples)
NewsSite-Nuxt: https://share.firefox.dev/40azYju (309 samples)

If, somehow, we were able to eliminate all of this time completely, we would improve the overall sp3 score by 0.5%.

Nice find.

Wow, basically 3 seconds in AddRef is most of the first profile. Maybe our observers are better behaved now and we can avoid the AddRef'ing?
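
Not Gecko's actual ProgressTracker code, but a minimal sketch of the pattern in question. "Observer" is a hypothetical stand-in for imgINotificationObserver, with a thread-safe refcount of the kind NS_INLINE_DECL_THREADSAFE_REFCOUNTING would declare:

#include <atomic>
#include <vector>

struct Observer {
  std::atomic<long> mRefCnt{1};
  void AddRef() { mRefCnt.fetch_add(1, std::memory_order_relaxed); }
  void Release() {
    if (mRefCnt.fetch_sub(1, std::memory_order_acq_rel) == 1) {
      delete this;
    }
  }
  virtual void Notify(int aType) = 0;
  virtual ~Observer() = default;
};

// Current-style dispatch: pin each observer for the duration of its
// callback, paying an atomic AddRef/Release pair per notification.
void NotifyAll(const std::vector<Observer*>& aObservers, int aType) {
  for (Observer* obs : aObservers) {
    obs->AddRef();  // the hot spot in the profiles above
    obs->Notify(aType);
    obs->Release();
  }
}

// "Well behaved" alternative: if observers are guaranteed to outlive the
// notification loop (e.g. they only unregister synchronously on the same
// thread), the per-dispatch refcount traffic could be skipped entirely.
void NotifyAllUnpinned(const std::vector<Observer*>& aObservers, int aType) {
  for (Observer* obs : aObservers) {
    obs->Notify(aType);
  }
}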

Depends on: 1925423

These profiles have some surprising results. Filtering stacks on "addref" gives us about 17% of all samples (which seems reasonable). Of those samples, 87% are AddRef'ing while sending image notifications. It seems too high for image notifications to be responsible for 87% of all AddRef time.

I tried to replicate this myself. I used the built-in profiler on a Windows machine and a Mac, and did 100 iterations of NewsSite-Next and NewsSite-Nuxt. In both cases, AddRef'ing for image notifications takes far less of the time.

https://share.firefox.dev/4f8tgP4

https://share.firefox.dev/4eSmTQc

So something is different between the profiles I'm capturing and the ones posted in the first comment here. Maybe it's a difference in the hardware being used? Perhaps it's a difference between the built-in profiler and whatever profiler was used for the profiles in comment 0?

Markus, do you have any thoughts about that difference?

Flags: needinfo?(mstange.moz)

In fact, just AddRef'ing to send image notifications is fully 15% (72k of 483k samples) of the first profile in comment 0 (after expanding to show all samples). If that figure is accurate, is there something that makes these specific virtual AddRef calls more expensive than other virtual AddRef calls? Or is that the price we pay for all virtual AddRef calls?
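
For reference, a virtual thread-safe AddRef has the same shape everywhere. This sketch is modeled loosely on XPCOM-style refcounting (hypothetical types, not the real nsISupports code) and shows where the cost lives:

#include <atomic>

struct nsISupportsLike {
  virtual unsigned long AddRef() = 0;
  virtual unsigned long Release() = 0;
  virtual ~nsISupportsLike() = default;
};

struct RasterImageLike final : nsISupportsLike {
  std::atomic<unsigned long> mRefCnt{1};

  unsigned long AddRef() override {
    // One atomic read-modify-write; on x86 this is a `lock xadd` that
    // needs the cache line holding mRefCnt in exclusive state.
    return mRefCnt.fetch_add(1, std::memory_order_relaxed) + 1;
  }

  unsigned long Release() override {
    unsigned long count = mRefCnt.fetch_sub(1, std::memory_order_acq_rel) - 1;
    if (count == 0) {
      delete this;
    }
    return count;
  }
};

// Calling through the interface adds a vtable-pointer load and an indirect
// call in front of the atomic increment.
unsigned long Pin(nsISupportsLike* aObj) {
  return aObj->AddRef();
}

The instruction sequence is identical across call sites, so an unusually expensive AddRef usually means the vtable pointer or refcount cache line was cold, not that these calls are a different kind of call.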

That is very curious and I don't have an answer at the moment. I'll try to see if it's a machine issue by comparing a samply profile with a Gecko profiler profile.

What hardware was that profile captured on?

I will also try using samply to see if I get different results.

I tried on a very old, underpowered machine (on the theory that slower memory access, smaller CPU caches, and less intelligent prefetching might make virtual function calls slower), and as a percentage of total time the image notification work is actually smaller there.

I tried samply on macOS; its results are in line with what I see from the built-in profiler.

Severity: -- → S2
Depends on: 1863046
See Also: → 1927919

I tried the built-in profiler on what should be a machine identical to the one used for the profile in comment 0.

https://share.firefox.dev/3AkUB1R

The sum of time spent in stacks that include IProgressObserver or imgINotificationObserver (which double-counts a lot) is about 3%. This is in line with what I've been seeing locally.

And a profile with samply

https://share.firefox.dev/3YOHPlp

Sum of IProgressObserver or imgINotificationObserver stacks is about 4%. This number is 30% for the profile in comment 1.

And a samply profile with the recommended command line from Markus

/c/samply/samply/target/release/samply.exe record --browsers -r 4000 --breakpad-symbol-server https://symbols.mozilla.org/ ETW_ENABLED=1 JIT_OPTION_enableICFramePointers=true JIT_OPTION_onlyInlineSelfHosted=true JIT_OPTION_emitInterpreterEntryTrampoline=true "C:\Program Files\Firefox Nightly\firefox.exe" --wait-for-browser

https://share.firefox.dev/3UC08YI

Similar result, about 4% of time in IProgressObserver or imgINotificationObserver.

Working with Markus, it turned out the difference was the iteration count: I was using around 40, while Markus' profile used 200 iterations. Doing 200 iterations, I can reproduce a similar-looking profile.

See Also: → 1900470

Some observations.

The number of notifications per iteration stays the same whether we do 20 iterations or 200, so the extra time means the cost of a single notification increases with the total number of iterations. The time taken per notification seems to increase monotonically as the iterations progress.

It takes approximately 20,000 ms to do 10,000 discard notifications, i.e. 2 ms per notification (almost all of it spent doing AddRef). That 2 ms is the average over all 200 iterations; since the per-notification cost grows roughly monotonically from near zero, the average sits around the midpoint, so it's probably at least 4 ms per notification/AddRef towards the end of the iterations.

All of the discard notifications are for the same RasterImage object (which makes sense, since the test reuses the same image as a stand-in). That makes it less likely, though still possible, that the cost is long memory accesses fetching something that is in no cache level (due to memory contention/thrashing elsewhere). Perhaps the long cycle collections that bug 1900470 is about play a role here: towards the end of the iterations I can see that we are basically alternating a long CC with a long stretch in these notifications.
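
One way to poke at the eviction theory is a micro-benchmark along these lines. This is a hypothetical sketch, not Gecko code: a large memory sweep stands in for a long CC pass, and the steady_clock overhead is included in both numbers, so only the relative difference is meaningful:

#include <atomic>
#include <chrono>
#include <cstdio>
#include <vector>

using Clock = std::chrono::steady_clock;

std::atomic<long> gRefCnt{1};

// Time a single refcount increment, in nanoseconds.
static double OneAddRef() {
  auto t0 = Clock::now();
  gRefCnt.fetch_add(1, std::memory_order_relaxed);
  auto t1 = Clock::now();
  return std::chrono::duration<double, std::nano>(t1 - t0).count();
}

int main() {
  constexpr int kTrials = 50;
  // 64 MB should fall outside most last-level caches.
  std::vector<char> sweep(64 * 1024 * 1024);

  double hot = 0.0, cold = 0.0;
  for (int i = 0; i < kTrials; ++i) {
    gRefCnt.fetch_add(1);  // warm the refcount's cache line
    hot += OneAddRef();    // increment while the line is hot

    // Touch one byte per cache line, standing in for a long CC pass.
    for (size_t j = 0; j < sweep.size(); j += 64) sweep[j]++;
    cold += OneAddRef();   // increment right after eviction
  }
  std::printf("hot:  %.1f ns/AddRef\ncold: %.1f ns/AddRef\n",
              hot / kTrials, cold / kTrials);
}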

samply seems to trigger the issue more reliably; with the built-in profiler the effect doesn't seem as exaggerated.

And the problem is not limited to Windows or the specific model the original profile was captured on; I can reproduce it on a Mac as well.

Doing 40 iterations, then manually reloading the page and repeating that 5 times (so 200 iterations total) doesn't show the same amount of time spent AddRef'ing during these image notifications. So it feels like something about accumulated memory pressure.

This comment https://bugzilla.mozilla.org/show_bug.cgi?id=1927919#c3 sounds like it could be quite relevant here as well. Posting so I remember.

The evidence right now points towards the profiler's sampling overhead at 4000 Hz skewing the apparent impact of this code, because the code is cache-bound towards the end of the profile. That seems more like an S3.

Severity: S2 → S3