Open Bug 1925335 Opened 4 months ago Updated 2 months ago

Some sp3 tests (especially NewsSite) spend an appreciable amount of time in RasterImage::NotifyDecodeComplete ("notify images")

Categories

(Core :: Graphics: ImageLib, defect)

People

(Reporter: mstange, Unassigned, NeedInfo)

References

(Blocks 1 open bug)

Details

(Whiteboard: [sp3])

NewsSite-Next spends 4% of its time in RasterImage::NotifyDecodeComplete; NewsSite-Nuxt spends 1.1% of its time in it.
I can't find the equivalent work in the Chrome profiles anywhere.

NewsSite-Next: https://share.firefox.dev/3UcDiqv (4753 samples)
NewsSite-Nuxt: https://share.firefox.dev/3Ytgw06 (1051 samples)

Furthermore, some time is spent in AsyncNotifyRunnable::Run.

NewsSite-Next: https://share.firefox.dev/40b4hGy (345 samples)
NewsSite-Nuxt: https://share.firefox.dev/40azYju (309 samples)

If, somehow, we were able to eliminate all of this time completely, we would improve the overall sp3 score by 0.5%.

Nice find.

Wow, basically 3 seconds in AddRef is most of the first profile. Maybe our observers are better behaved now and we can avoid the AddRef'ing?
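
Not Gecko's actual ProgressTracker code, but a minimal sketch of the pattern in question. "Observer" is a hypothetical stand-in for imgINotificationObserver, with a thread-safe refcount of the kind NS_INLINE_DECL_THREADSAFE_REFCOUNTING would declare:

#include <atomic>
#include <vector>

struct Observer {
  std::atomic<long> mRefCnt{1};
  void AddRef() { mRefCnt.fetch_add(1, std::memory_order_relaxed); }
  void Release() {
    if (mRefCnt.fetch_sub(1, std::memory_order_acq_rel) == 1) {
      delete this;
    }
  }
  virtual void Notify(int aType) = 0;
  virtual ~Observer() = default;
};

// Current-style dispatch: pin each observer for the duration of its
// callback, paying an atomic AddRef/Release pair per notification.
void NotifyAll(const std::vector<Observer*>& aObservers, int aType) {
  for (Observer* obs : aObservers) {
    obs->AddRef();  // the hot spot in the profiles above
    obs->Notify(aType);
    obs->Release();
  }
}

// "Well behaved" alternative: if observers are guaranteed to outlive the
// notification loop (e.g. they only unregister synchronously on the same
// thread), the per-dispatch refcount traffic could be skipped entirely.
void NotifyAllUnpinned(const std::vector<Observer*>& aObservers, int aType) {
  for (Observer* obs : aObservers) {
    obs->Notify(aType);
  }
}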

Depends on: 1925423

These profiles have some surprising results. Filtering stacks on "addref" gives us about 17% of all samples (which seems reasonable). Of those samples, 87% are AddRef'ing while sending image notifications. It seems too high for image notifications to be responsible for 87% of all AddRef time.

I tried to replicate this myself. I used the built-in profiler on a Windows machine and a Mac, and did 100 iterations of NewsSite-Next and NewsSite-Nuxt. In both cases, AddRef'ing for image notifications takes far less of the time.

https://share.firefox.dev/4f8tgP4

https://share.firefox.dev/4eSmTQc

So something is different between the profiles I'm capturing and the ones posted in the first comment here. Maybe it's a difference in the hardware being used? Perhaps it's a difference between the built-in profiler and whatever profiler was used for the profiles in comment 0?

Markus, do you have any thoughts about that difference?

Flags: needinfo?(mstange.moz)

In fact, just AddRef'ing to send image notifications is fully 15% (72k of 483k samples) of the first profile in comment 0 (after expanding to show all samples). If that figure is accurate, is there something that makes these specific virtual AddRef calls more expensive than other virtual AddRef calls? Or is that the price we pay for all virtual AddRef calls?
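
For reference, a virtual thread-safe AddRef has the same shape everywhere. This sketch is modeled loosely on XPCOM-style refcounting (hypothetical types, not the real nsISupports code) and shows where the cost lives:

#include <atomic>

struct nsISupportsLike {
  virtual unsigned long AddRef() = 0;
  virtual unsigned long Release() = 0;
  virtual ~nsISupportsLike() = default;
};

struct RasterImageLike final : nsISupportsLike {
  std::atomic<unsigned long> mRefCnt{1};

  unsigned long AddRef() override {
    // One atomic read-modify-write; on x86 this is a `lock xadd` that
    // needs the cache line holding mRefCnt in exclusive state.
    return mRefCnt.fetch_add(1, std::memory_order_relaxed) + 1;
  }

  unsigned long Release() override {
    unsigned long count = mRefCnt.fetch_sub(1, std::memory_order_acq_rel) - 1;
    if (count == 0) {
      delete this;
    }
    return count;
  }
};

// Calling through the interface adds a vtable-pointer load and an indirect
// call in front of the atomic increment.
unsigned long Pin(nsISupportsLike* aObj) {
  return aObj->AddRef();
}

The instruction sequence is identical across call sites, so an unusually expensive AddRef usually means the vtable pointer or refcount cache line was cold, not that these calls are a different kind of call.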

That is very curious and I don't have an answer at the moment. I'll try to see if it's a machine issue by comparing a samply profile with a Gecko profiler profile.

What hardware was that profile captured on?

I will also try using samply to see if I get different results.

I tried on a very old, underpowered machine (on the theory that slower memory access, smaller CPU caches, and less intelligent prefetching might make virtual function calls slower), and as a percentage of total time the image notification work is actually smaller there.

I tried samply on macOS; its results are in line with what I see from the built-in profiler.

Severity: -- → S2
Depends on: 1863046
See Also: → 1927919

I tried the built-in profiler on what should be a machine identical to the one used for the profile in comment 0.

https://share.firefox.dev/3AkUB1R

The sum of time spent in stacks that include IProgressObserver or imgINotificationObserver (which double-counts a lot) is about 3%. This is in line with what I've been seeing locally.

And a profile with samply

https://share.firefox.dev/3YOHPlp

Sum of IProgressObserver or imgINotificationObserver stacks is about 4%. This number is 30% for the profile in comment 1.

And a samply profile with the recommended command line from Markus

/c/samply/samply/target/release/samply.exe record --browsers -r 4000 --breakpad-symbol-server https://symbols.mozilla.org/ ETW_ENABLED=1 JIT_OPTION_enableICFramePointers=true JIT_OPTION_onlyInlineSelfHosted=true JIT_OPTION_emitInterpreterEntryTrampoline=true "C:\Program Files\Firefox Nightly\firefox.exe" --wait-for-browser

https://share.firefox.dev/3UC08YI

Similar result, about 4% of time in IProgressObserver or imgINotificationObserver.

Working with Markus, it turned out the difference was the iteration count: I was using around 40, while Markus' profile used 200 iterations. Doing 200 iterations, I can reproduce a similar-looking profile.

See Also: → 1900470

Some observations.

The number of notifications per iteration stays the same whether we do 20 iterations or 200, so the extra time means the cost of a single notification increases with the total number of iterations. The time taken per notification seems to increase monotonically as the iterations progress.

It takes approximately 20,000 ms to do 10,000 discard notifications, i.e. 2 ms per notification (almost all of it spent doing AddRef). That 2 ms is the average over all 200 iterations; since the per-notification cost grows roughly monotonically from near zero, the average sits around the midpoint, so it's probably at least 4 ms per notification/AddRef towards the end of the iterations.

All of the discard notifications are for the same RasterImage object (which makes sense, since the test reuses the same image as a stand-in). That makes it less likely, though still possible, that the cost is long memory accesses fetching something that is in no cache level (due to memory contention/thrashing elsewhere). Perhaps the long cycle collections that bug 1900470 is about play a role here: towards the end of the iterations I can see that we are basically alternating a long CC with a long stretch in these notifications.
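
One way to poke at the eviction theory is a micro-benchmark along these lines. This is a hypothetical sketch, not Gecko code: a large memory sweep stands in for a long CC pass, and the steady_clock overhead is included in both numbers, so only the relative difference is meaningful:

#include <atomic>
#include <chrono>
#include <cstdio>
#include <vector>

using Clock = std::chrono::steady_clock;

std::atomic<long> gRefCnt{1};

// Time a single refcount increment, in nanoseconds.
static double OneAddRef() {
  auto t0 = Clock::now();
  gRefCnt.fetch_add(1, std::memory_order_relaxed);
  auto t1 = Clock::now();
  return std::chrono::duration<double, std::nano>(t1 - t0).count();
}

int main() {
  constexpr int kTrials = 50;
  // 64 MB should fall outside most last-level caches.
  std::vector<char> sweep(64 * 1024 * 1024);

  double hot = 0.0, cold = 0.0;
  for (int i = 0; i < kTrials; ++i) {
    gRefCnt.fetch_add(1);  // warm the refcount's cache line
    hot += OneAddRef();    // increment while the line is hot

    // Touch one byte per cache line, standing in for a long CC pass.
    for (size_t j = 0; j < sweep.size(); j += 64) sweep[j]++;
    cold += OneAddRef();   // increment right after eviction
  }
  std::printf("hot:  %.1f ns/AddRef\ncold: %.1f ns/AddRef\n",
              hot / kTrials, cold / kTrials);
}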

samply seems to trigger the issue more reliably; with the built-in profiler the effect doesn't seem as exaggerated.

And the problem is not limited to Windows or the specific model the original profile was captured on; I can reproduce it on a Mac as well.

Doing 40 iterations, then manually reloading the page and repeating that 5 times (so 200 iterations total) doesn't show the same amount of time spent AddRef'ing during these image notifications. So it feels like something about accumulated memory pressure.

This comment https://bugzilla.mozilla.org/show_bug.cgi?id=1927919#c3 sounds like it could be quite relevant here as well. Posting so I remember.

The evidence right now points towards the profiler's sampling overhead at 4000 Hz skewing the apparent impact of this code, because the code is cache-bound towards the end of the profile. That seems more like an S3.

Severity: S2 → S3