Closed Bug 1709679 Opened 2 years ago Closed 1 year ago

Some images in Google Docs sometimes disappear, turn white/blank

Categories

(Core :: Graphics: WebRender, defect, P1)

x86
Windows
defect

Tracking

()

VERIFIED FIXED
95 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- verified
firefox90 --- wontfix
firefox92 --- wontfix
firefox93 --- wontfix
firefox94 --- wontfix
firefox95 --- verified

People

(Reporter: cpeterson, Assigned: aosmond)

References

(Depends on 1 open bug, Regression)

Details

(Keywords: crash, regression)

Attachments

(5 files)

Attached image image.png

When reading Google Docs that contain images I've pasted into the docs, sometimes some of the images turn white/blank. The problem usually seems to affect images that I pasted into the doc minutes or hours earlier that day, not when I immediately paste the image into the doc. See the attached screenshot.

This problem seems to have gotten worse during the 90 Nightly cycle. I only see this problem when running a 32-bit Firefox build (on 64-bit Windows OS). I typically have about 5-10 Google Docs tabs open at the same time. Perhaps this is a graphics OOM issue like bug 1703839?

I see this problem with or without Fission.

Potentially you could use the dev tools to find the uri of the image in question and then use a verbose memory report to find that uri, it would have information about what imagelib thinks is the status of the image.

Blocks: gfx-triage
Severity: -- → S3
Component: Graphics → Graphics: WebRender

A full memory report would be interesting too.

If the raster image fails to decode because of an OOM in the content process, then the canvas probably shows nothing. Bug 1703839 and friends are a little different, where the GPU process ran out of virtual memory -- the image would have been decoded -- and we would crash the GPU process explicitly in that case.

Depends on: 1711137

(In reply to Timothy Nikkel (:tnikkel) from comment #1)

Potentially you could use the dev tools to find the uri of the image in question and then use a verbose memory report to find that uri, it would have information about what imagelib thinks is the status of the image.

The inspector shows the blank image element is:

<image xlink:href="blob:https://docs.google.com/f90b21d2-359c-4ef7-abc6-7d790fc2bc09" width="100%" height="100%" preserveAspectRatio="none" transform="scale(0.0005)" transform-origin="0 0"><title></title><desc></desc></image>

The web console warns:

Cannot access blob URL “blob:https://docs.google.com/f90b21d2-359c-4ef7-abc6-7d790fc2bc09” from a different agent cluster.
Cannot access blob URL “blob:https://docs.google.com/f90b21d2-359c-4ef7-abc6-7d790fc2bc09” from a different agent cluster.

Info about that blob URL in about:memory:

├───31,045,584 B (08.28%) -- images
│   ├──30,993,104 B (08.27%) -- content
│   │  ├──20,966,256 B (05.59%) -- raster/used
│   │  │  ├──14,882,720 B (03.97%) -- progress=18f
│   │  │  │  ├───6,279,440 B (01.68%) -- image(2048x503, blob:https://docs.google.com/f90b21d2-359c-4ef7-abc6-7d790fc2bc09)
│   │  │  │  │   ├──6,021,312 B (01.61%) -- unlocked/types=2000
│   │  │  │  │   │  ├──3,178,592 B (00.85%) -- surface(1797x442)
│   │  │  │  │   │  │  ├──3,178,496 B (00.85%) ── decoded-nonheap
│   │  │  │  │   │  │  └─────────96 B (00.00%) ── decoded-heap
│   │  │  │  │   │  └──2,842,720 B (00.76%) -- surface(1702x417)
│   │  │  │  │   │     ├──2,842,624 B (00.76%) ── decoded-nonheap
│   │  │  │  │   │     └─────────96 B (00.00%) ── decoded-heap
│   │  │  │  │   └────258,128 B (00.07%) ── source

...

514,942 B (100.0%) -- memory-blob-urls
└──514,942 B (100.0%) -- owner(https://docs.google.com/document/u/1/d/14THPtVTl3jwdvkrnuxybn678N3F0-84-VjsMR4wlq0c/edit)
   ├──257,471 B (50.00%) ── blob:https://docs.google.com/e88b8584-6cef-481a-9d88-2bf5fdce79d6
   └──257,471 B (50.00%) ── blob:https://docs.google.com/f90b21d2-359c-4ef7-abc6-7d790fc2bc09

Also, the GPU process section of about:memory warns:

WARNING: the following values are negative or unreasonably large.

    explicit/gfx
    explicit/gfx/webrender
    explicit/gfx/webrender/swgl
    explicit/heap-unclassified 

This indicates a defect in one or more memory reporters. The invalid values are highlighted.
Explicit Allocations

476,663,808 B (100.0%) -- explicit
├──1,371,045,144 B (287.63%) -- gfx [?!]
│  ├──1,371,045,144 B (287.63%) -- webrender [?!]
│  │  ├──1,343,361,312 B (281.83%) ── swgl [?!]
...
└──-903,281,400 B (-189.50%) ── heap-unclassified [?!]
Flags: needinfo?(aosmond)

(In reply to Jim Mathies [:jimm] from comment #2)

A full memory report would be interesting too.

Is attaching the full memory report to a public bug safe?

I saved a full memory report file, but the memory report is from my regular user profile, which is logged into my Google and Bugzilla accounts. I saved both a full and "anonymized" version of the memory report.

The memory report looks okay, I'm thinking the "different agent cluster" might be the reason. You could try flipping privacy.partition.bloburl_per_agent_cluster to false and see if it still happens.

(In reply to Chris Peterson [:cpeterson] from comment #4)

The web console warns:

Cannot access blob URL “blob:https://docs.google.com/f90b21d2-359c-4ef7-abc6-7d790fc2bc09” from a different agent cluster.
Cannot access blob URL “blob:https://docs.google.com/f90b21d2-359c-4ef7-abc6-7d790fc2bc09” from a different agent cluster.

@ Baku, could this "image pasted into a Google Doc is blank" bug be a privacy.partition.bloburl_per_agent_cluster bug?

Nika says: "If you're getting that error, it doesn't look like [a WebRender OOM]. Sounds like a privacy.partition.bloburl_per_agent_cluster bug. perhaps the blob url is being created by a gdocs service worker or shared worker."

kmag says: "I think BroadcastChannel may be a possibility too... If there are multiple docs tabs and they're trying to conserve resources."

Flags: needinfo?(amarchesini)

I have been testing with privacy.partition.bloburl_per_agent_cluster = false for two days and I just reproduced the blank images problem again. I did have Fission disabled for the last two days and I re-enabled it just this morning and minutes later reproduced this bug. Perhaps there is a connection with Fission, but I know I have seen this bug without Fission (comment 0).

No longer blocks: gfx-triage
Priority: -- → P3

I'll try to fix the memory reporter.

Flags: needinfo?(jmuizelaar)

I still see this disappearing image problem almost every day using a 32-bit build of Firefox Nightly on Windows 10 and 11 (with or without SW-WR), even when I don't have a ton of tabs or Google Docs open.

We limit the surface cache in image lib to 1GB in 32 bit builds

https://searchfox.org/mozilla-central/rev/45e308665eb9fc52fd21e2d4b3b959f3cec13ef1/image/SurfaceCache.cpp#1540

so you could be hitting that without hitting OOMs.

What if you set image.mem.surfacecache.max_size_kb to 1024*1024 (1 GB) or smaller in a 64 bit build, does that make it show up more often in 64 bit builds?

(In reply to Timothy Nikkel (:tnikkel) from comment #11)

What if you set image.mem.surfacecache.max_size_kb to 1024*1024 (1 GB) or smaller in a 64 bit build, does that make it show up more often in 64 bit builds?

I tested a 64-bit build with image.mem.surfacecache.max_size_kb = 512 * 1024 (smaller than the 32-bit default 1024 * 1024 to try to make the problem easier to reproduce) for three days and couldn't reproduce the disappearing image problem.

I wonder if the problem is related to 32-bit and multiple Fission content processes? I will test with 32-bit and Fission disabled.

I tested a 64-bit build with image.mem.surfacecache.max_size_kb = 512 * 1024 (smaller than the 32-bit default 1024 * 1024 to try to make the problem easier to reproduce) for three days and couldn't reproduce the disappearing image problem.

I did see some images disappear (in particular, the slide preview thumbnails in Google Presentations), but they always reloaded when I scrolled the page. In contrast, when images disappeared in 32-bit build, I had to reload the page to get them to reappear.

Summary: Some images in Google Docs sometimes turn white/blank → Some images in Google Docs sometimes disappear, turn white/blank

I reinstalled a 32-bit build and reproduced the problem in less than an hour of using Google Docs (pasting screenshots into a doc). I then disabled Fission (and reduced the e10s process limit from 8 to 1) and again reproduced the problem in less than an hour.

Perhaps this problem is related to 32-bit's image.mem.surfacecache.max_size_kb limit 1024 * 1024, but I wasn't able to reproduce the problem with a 64-bit build and image.mem.surfacecache.max_size_kb = 512 * 1024 after even a couple days of using Google Docs.

I still see this bug every day when using 32-bit Windows Firefox. This bug makes Google Docs practically unusable with 32-bit. I want to use 32-bit Nightly to help find unique bugs (like this one), but I have to revert to 64-bit to do work.

See Also: → 1683164

Can you get a new about:memory report when this happens? If we're still getting bad numbers for WebRender we should fix that.

Flags: needinfo?(cpeterson)
Flags: needinfo?(aosmond)

Comment 4 seems very fishy to me, as if we decoded it, but was unable to use it in the rendering pipeline in the GPU process. If we fail to map it in, we OOM abort the process, but maybe we don't even try that because something failed earlier in the pipeline?

Chris, if you could flip image.mem.debug-reporting to true, and capture the memory report again (check verbose as well) for a missing image, I would be most interested in its contents. There will be relevant bits included in the explicit/images section of the content process, and possibly in the gfx.webrender.images section of the compositor process.

To explain a bit more what I'm looking for:

│ │ │ │ ├────4,512 B (00.00%) -- image(16x16, resource://gre-resources/loading-image.png)
│ │ │ │ │ ├──4,240 B (00.00%) -- locked/types=400/surface(16x16, external_id:1200000001, compositor_ref:1)
│ │ │ │ │ │ ├──4,096 B (00.00%) ── decoded-nonheap
│ │ │ │ │ │ └────144 B (00.00%) ── decoded-heap
│ │ │ │ │ └────272 B (00.00%) ── source

The presence of an external_id and non-zero compositor_ref will at least tell us that it should be able to be mapped in by the compositor process.

Under gfx/webrender/images/owner_cache_missing in the content process, you will find references to images that the compositor process knows about but no longer have entries in the surface cache. These should go away with time, and a good idea to double check to make sure there aren't too many (which would suggest a leak). I would not expect this number to be zero due to the extra layer of caching by ImageContainers (which I will be removing!).

Under gfx/webrender/images/mapped_from_owner in the compositor process, you will find a list of all of the images mapped into the compositor process. You should be able to cross reference using the external_id to find the corresponding entry in the content process (either under explicit/images or gfx/webrender/images/owner_cache_missing). It would be interesting to see how many images / how much memory appears to be mapped in via the summary here as well.

Here is a memory report (with verbose and image.mem.debug-reporting = true) after some images disappeared on Google Slides.

Caveat: after I saved the memory report and switched back to the Google Slides tab, the images had reloaded. So I don't know if the memory report was taken before or after the images reloaded.

Flags: needinfo?(cpeterson) → needinfo?(aosmond)

(In reply to Chris Peterson [:cpeterson] from comment #20)

Created attachment 9246916 [details]
Google_Slides_missing_image_memory-report.json.gz

Here is a memory report (with verbose and image.mem.debug-reporting = true) after some images disappeared on Google Slides.

Caveat: after I saved the memory report and switched back to the Google Slides tab, the images had reloaded. So I don't know if the memory report was taken before or after the images reloaded.

In this report you have a bunch of memory under swgl. Are you intentionally using swgl?

Flags: needinfo?(jmuizelaar) → needinfo?(cpeterson)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #21)

In this report you have a bunch of memory under swgl. Are you intentionally using swgl?

Yes. I've been running with gfx.webrender.software = true since last year to dogfood test swgl (since most Nightly users will be using Accelerated WR and 64-bit builds). I can disable swgl and retest with Accelerated WR.

Flags: needinfo?(cpeterson)

I disabled swgl (by restting gfx.webrender.software = false and restarting) and I still reproduced the missing Google Slide images with Accelerated WR. Here's a new memory report with verbose, image.mem.debug-reporting, and Accelerated WR.

There is nothing suspicious in those memory reports unfortunately. My hypothesis that perhaps we were not getting the necessary bindings in the compositor process seems false. As part of my investigation, I filed bug 1737257 and bug 1737277, with patches to fix minor issues with SVGImageElement, which I suppose could be related, but unlikely.

Another thing you can try is flipping the pref image.mem.shared.unmap.force-enabled (restart required) when testing on 64-bit builds. On 32-bit builds we will unmap the memory for images out of the compositor process when we hit memory pressure and/or just hit the expiry timer for unused images. On 64-bit builds we do not do this, and as such as a difference between the two variants. You can force the same behaviour on with 64-bit builds via said pref. You won't hit the address space pressure, but it should unmap images over time.

Flags: needinfo?(aosmond) → needinfo?(cpeterson)

Also set image.mem.shared.unmap.min_threshold_kb to 0 to more closely mirror the 32-bit behaviour. image.mem.shared.unmap.min_expiration_ms should be the same on both, but perhaps you can lower the expiration period to be more aggressive about the unmapping to aid reproduction.

I was able to reproduce the problem easily using a 64-bit Firefox and just image.mem.shared.unmap.force-enabled = true, leaving image.mem.shared.unmap.min_threshold_kb, image.mem.shared.unmap.min_expiration_ms, and gfx.webrender.software at their default values. (I was also able to reproduce the problems a little more easily when changing those other prefs, but they weren't required.)

Here is a new memory report with just image.mem.shared.unmap.force-enabled = true.

Flags: needinfo?(cpeterson)

Okay, that's a useful smoking gun. I've already reviewed SourceSurfaceSharedDataWrapper/RenderSharedSurfaceTextureHost/SharedSurfacesParent, but clearly somewhere in those 3 classes something is broken.

Assignee: nobody → aosmond
Flags: needinfo?(amarchesini)
Regressed by: 1699224

When replaying blob recordings, we need to ensure that we map and unmap
the surface properly so that we don't accidently expire the mapping too
early to free up virtual memory on 32-bit Firefox. The code was written
but not activated because we failed to use the wrong type in the
SourceSurfaceSharedDataWrapper::GetType implementation.

The images not showing up is when we unmapped the image before we tried to create a Skia binding for the surface.

There must be Skia crash signatures related to this on the flip side, where it was still mapped in when we started replaying the blob, but it got unmapped in the middle of the Skia draw operation.

Has Regression Range: --- → yes
Has STR: --- → yes
Keywords: crash, regression
Pushed by aosmond@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/994d2dfcf812
Properly map/unmap SourceSurfaceSharedDataWrapper for blob recordings. r=jrmuizel

Comment on attachment 9247342 [details]
Bug 1709679 - Properly map/unmap SourceSurfaceSharedDataWrapper for blob recordings.

Beta/Release Uplift Approval Request

  • User impact if declined: Users on 32-bit operating systems may sometimes see missing images, or experience a low volume crash in the compositor process (either parent or GPU process).
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This patch forces us to explicitly map/unmap shared surfaces for blob images, effectively forcing a lock and unlock on the pixel data. This is a strict improvement over today where it is nondeterministic what will happen. The scope of code which is impacted is small and well understood.
  • String changes made/needed:
Attachment #9247342 - Flags: approval-mozilla-beta?

Comment on attachment 9247342 [details]
Bug 1709679 - Properly map/unmap SourceSurfaceSharedDataWrapper for blob recordings.

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: Poor experience, won't see all web content as expected, particular for users who load some content, tab away and later return to it.
  • User impact if declined: Users may not see images embedded in blob recordings (e.g. Google Slides is a common use case) or may experience low volume crashes as a result.
  • Fix Landed on Version: 95
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This patch forces us to explicitly map/unmap shared surfaces for blob images, effectively forcing a lock and unlock on the pixel data. This is a strict improvement over today where it is nondeterministic what will happen. The scope of code which is impacted is small and well understood.
  • String or UUID changes made by this patch:
Attachment #9247342 - Flags: approval-mozilla-esr91?

Upping the severity as I think this warrants more attention/prioritization now that I understand what is going wrong and how common this is likely to be for users with particular patterns (effectively open a tab with blob recordings with embedded images, move away from it, and return later, on 32-bit OSes).

Severity: S3 → S2
Priority: P3 → P1

Chris, now that the fix is in the autoland queue, I'd appreciate if you could confirm that it fixes your problem once it makes it into the next nightly/daily :).

Flags: needinfo?(cpeterson)
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 95 Branch

We're out of betas this cycle and this doesn't feel like an issue that warrants uplift directly into an RC without any bake time first. Let's let this ride 95 to release.

Comment on attachment 9247342 [details]
Bug 1709679 - Properly map/unmap SourceSurfaceSharedDataWrapper for blob recordings.

Leaving the ESR nomination for possible inclusion in the 91.4 release next cycle, however.

Attachment #9247342 - Flags: approval-mozilla-beta? → approval-mozilla-beta-

(In reply to Andrew Osmond [:aosmond] from comment #38)

Chris, now that the fix is in the autoland queue, I'd appreciate if you could confirm that it fixes your problem once it makes it into the next nightly/daily :).

LGTM. I've tested 32-bit Firefox for a day now and haven't seen any images disappear in Google Docs.

Thanks for tracking down this bug!

Status: RESOLVED → VERIFIED
Flags: needinfo?(cpeterson)

Comment on attachment 9247342 [details]
Bug 1709679 - Properly map/unmap SourceSurfaceSharedDataWrapper for blob recordings.

This has gotten a good amount of bake time with no known regressions and has been verified on 95. Approved for 91.4esr.

Attachment #9247342 - Flags: approval-mozilla-esr91? → approval-mozilla-esr91+
Flags: qe-verify+

Chris, could you please when time permits help us to verify this also on Firefox esr91?

Flags: needinfo?(cpeterson)

(In reply to Hani Yacoub from comment #45)

Chris, could you please when time permits help us to verify this also on Firefox esr91?

Hani, where can I find the esr91 build to test?

Flags: needinfo?(cpeterson) → needinfo?(hani.yacoub)

(In reply to Hani Yacoub from comment #45)

Chris, could you please when time permits help us to verify this also on Firefox esr91?

LGTM. I tested Google Docs in 91.4.0esr-candidates build1 for a few hours today. I didn't see any missing image, so I'll mark esr91 as verified fixed.

Flags: needinfo?(cpeterson)
Flags: qe-verify+
You need to log in before you can comment on or make changes to this bug.