Closed Bug 1882573 Opened 4 months ago Closed 3 months ago

Every open Google Docs tab shows "An error occurred" message and its canvas text goes blank

Categories

(Core :: Graphics: Canvas2D, defect)

Unspecified
All
defect

Tracking

()

RESOLVED FIXED
126 Branch
Tracking Status
firefox-esr115 --- unaffected
firefox123 - wontfix
firefox124 + wontfix
firefox125 + fixed
firefox126 + fixed

People

(Reporter: cpeterson, Assigned: aosmond)

References

(Regression)

Details

(Keywords: regression)

Attachments

(6 files)

Attached image screenshot.png

About once or twice a day, each of my open Google Docs tabs shows an "An error occurred" message and its canvas text goes blank (but not the doc's toolbars), all at the same time. Reloading each Google Docs page fixes the error (for at least a few minutes or hours before I see the error again). At the same time, my Google Calendar, Google Sheets, and Gmail pages are unaffected.

I don't know if this is a Google bug or a Firefox bug. The problem feels like a graphics OOM, as if the docs lost the canvas and can no longer render text. I'm using a 32-bit Firefox build on Windows and I have about a dozen Google Docs tabs open, so I thought this might be an OOM. But :gstoll says he's seen these errors even though he's using a 64-bit build.

I first noticed these errors around 2024-01-30 with Nightly 124. (I know because I commented in Slack). I still see these errors in Nightly 125.

Here is an about:memory dump from about a minute after the errors, while the affected Google Docs are still open. One thing that caught my eye was multiple copies of some 13 MB images with dimensions of 83 x 14,794. These are SVG image sheets for Google Docs' toolbars.

https://ssl.gstatic.com/docs/common/material_common_sprite647_blue.svg
https://ssl.gstatic.com/docs/common/material_common_sprite647_gm3_grey_medium.svg

I've heard from people on both Windows and macOS experiencing these errors, so this problem is not platform specific. Uncertain whether this is actually a Firefox bug or whether Google Docs is A/B testing some code that might also affect Chrome.

jblumberg says this error seemed to be triggered (intermittently) when trying to use Google Docs keyboard shortcuts.

OS: Unspecified → All

(In reply to Chris Peterson [:cpeterson] from comment #1)

Here is an about:memory dump from about a minute after the errors, while the affected Google Docs are still open. One thing that caught my eye was multiple copies of some 13 MB images with dimensions of 83 x 14,794. These are SVG image sheets for Google Docs' toolbars.

I think that's expected. Google docs uses a large sprite sheet. It's been like that for a while.

Attached file about:support.txt

Thanks. From the about:support:

Failure Log
(#0): GP+[GFX1-]: Failed to create a valid ShmemTextureHost
(#20): CP+[GFX1-]: CompositorBridgeChild receives IPC close with reason=AbnormalShutdown
....
(#31): CP+[GFX1-]: CompositorBridgeChild receives IPC close with reason=AbnormalShutdown
(#32): CP+[GFX1-]: Attempt to render into a Canvas2d after shutdown.
(#33) Error: Attempt to render into a Canvas2d after shutdown.
(#34): CP+[GFX1-]: Attempt to render into a Canvas2d after shutdown.
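
For anyone comparing failure logs across machines, here is a minimal sketch that splits lines like the ones above into an index, a tag, and a message, then tallies repeated messages. The `GfxLogEntry` shape and both function names are hypothetical, and the line format is inferred only from this excerpt, so treat the pattern as an assumption rather than a description of how about:support formats its log.

```typescript
// Hypothetical record shape; the field names are assumptions for illustration.
interface GfxLogEntry {
  index: number;   // the number inside "(#N)"
  tag: string;     // text before "[GFX1-]", e.g. "GP+" or "CP+" (empty if absent)
  message: string; // the remaining message text
}

// Parse one line such as
//   "(#0): GP+[GFX1-]: Failed to create a valid ShmemTextureHost".
// Lines that do not start with "(#N)" (for example "....") yield null.
function parseGfxLogLine(line: string): GfxLogEntry | null {
  const m = /^\(#(\d+)\):?\s+(?:(\S*)\[GFX1-\]:\s+)?(.*)$/.exec(line.trim());
  if (m === null) {
    return null;
  }
  return {
    index: parseInt(m[1] || "0", 10),
    tag: m[2] || "",
    message: m[3] || "",
  };
}

// Count how many times each distinct message text appears.
function countByMessage(lines: string[]): { [message: string]: number } {
  const counts: { [message: string]: number } = {};
  lines.forEach((line) => {
    const entry = parseGfxLogLine(line);
    if (entry !== null) {
      counts[entry.message] = (counts[entry.message] || 0) + 1;
    }
  });
  return counts;
}
```

Grouping by message text makes it easier to see whether two reports share the same sequence (a failed texture host, then abnormal shutdowns, then canvas draws after shutdown) or differ somewhere.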

I wonder if one of the processes involved crashed? Is there anything in about:crashes maybe?

Also, is there anything in the web console when this happens? Maybe open it after loading Google Docs and take a look to see what output happens when it's running normally so we can tell apart "normal" errors from ones that are specific to this failing case.

When you get the error, what do the two links ("see what else you can do" "help us improve") link to?

Answers to any of those questions for what other people are seeing on different machines or OSes could be useful to see what is common and what is different.

Flags: needinfo?(cpeterson)

(In reply to Timothy Nikkel (:tnikkel) from comment #5)

I wonder if one of the processes involved crashed? Is there anything in about:crashes maybe?

No, I don't see any crash reports in about:crashes during the period when I know I was seeing these errors daily.

Also, is there anything in the web console when this happens? Maybe open it after loading Google Docs and take a look to see what output happens when it's running normally so we can tell apart "normal" errors from ones that are specific to this failing case.

When you get the error, what do the two links ("see what else you can do" "help us improve") link to?

I'll check when I next see the error. I switched to a 64-bit build after filing this bug and haven't seen the errors since. I'll downgrade to 32-bit again.

Flags: needinfo?(cpeterson)

Thanks.

Does this happen in google docs and sheets or just docs?

If you inspect the dom after it happens what does it look like, is the canvas element still there? Can you inspect it, dump its contents as data url?

Of the people that can reproduce, are they all on nightly? I know you said you had the problem on nightly 124, so that would suggest it's on beta now, unless it is some setting that is gated to be nightly only.

(In reply to Timothy Nikkel (:tnikkel) from comment #7)

Does this happen in google docs and sheets or just docs?

I thought only Google Docs was affected, but now I see the same error on Google Sheets (but not Google Calendar or Gmail).

Of the people that can reproduce, are they all on nightly? I know you said you had the problem on nightly 124, so that would suggest it's on beta now, unless it is some setting that is gated to be nightly only.

3 of 3 people that told me they've seen these errors were using Nightly (124 at the time they saw the errors). So I don't know if this bug is reproducible in Beta 124. I can try testing in Beta 124 or even Firefox 123 release to help pinpoint if this is a Firefox regression riding the trains.

When you get the error, what do the two links ("see what else you can do" "help us improve") link to?

Neither link is helpful. The "See what else you can do to fix this error" link points to https://support.google.com/docs/answer/7505592. The "Help us improve" link opens a "Send feedback to Google" sidebar on the Google Docs page.

is there anything in the web console when this happens? Maybe open it after loading Google Docs and take a look to see what output happens when it's running normally so we can tell apart "normal" errors from ones that are specific to this failing case.

I see these JavaScript errors in the web console after the error happens. These errors don't happen in the regular use of Google Docs.

Uncaught 
Object { stack: "B.scale@https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3497:420\nB.hW@https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3217:292\nB.hW@https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3785:83\nB.hW@https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3781:370\nAae/<@https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3223:262\nxae.prototype.requestAnimationFrame/<@https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3219:453\nc@https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3315:104\n", message: 'Error in protected function: [Exception... "Failure"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame :: https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js :: B.scale :: line 3497"  data: no]', cause: NS_ERROR_FAILURE, D: true }
D: true
cause: NS_ERROR_FAILURE: 
columnNumber: 0
data: null
filename: "https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js"
lineNumber: 3497
message: ""
name: "NS_ERROR_FAILURE"
result: 2147500037
stack: "B.scale@https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3497:420\nB.hW@https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3217:292\nB.hW@https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3785:83\nB.hW@https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3781:370\nAae/<@https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3223:262\nxae.prototype.requestAnimationFrame/<@https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3219:453\nc@https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3315:104\n"
<prototype>: ExceptionPrototype { toString: toString(), name: Getter, message: Getter, … }
message: 'Error in protected function: [Exception... "Failure"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame :: https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js :: B.scale :: line 3497"  data: no]'
stack: "B.scale@https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3497:420\nB.hW@https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3217:292\nB.hW@https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3785:83\nB.hW@https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3781:370\nAae/<@https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3223:262\nxae.prototype.requestAnimationFrame/<@https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3219:453\nc@https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3315:104\n"
<prototype>: Object { constructor: dee(a)
, stack: "" }

Uncaught NS_ERROR_FAILURE: 
    scale https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3497
    hW https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3217
    hW https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3785
    hW https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3781
    Aae https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3223
    zae https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3220
102140562-client_js_prod_kix_core.js:3497
    scale https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3497
    hW https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3217
    hW https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3785
    hW https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3781
    Aae https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3223
    zae https://docs.google.com/static/document/client/js/102140562-client_js_prod_kix_core.js:3220
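
To cross-check the console output above, here is a small sketch that converts the decimal `result` field into the hex form shown in the message, and splits a Firefox-style `stack` string (`function@url:line:column` per line) into frames. The names `toHex32`, `StackFrameInfo`, and `parseFirefoxStack` are hypothetical helpers, not part of any Firefox or Google Docs API.

```typescript
// Format a 32-bit result code as eight hex digits, e.g. to compare the
// decimal "result" field with the hex value printed in the message.
function toHex32(code: number): string {
  const hex = (code >>> 0).toString(16).toUpperCase();
  return "0x" + ("00000000" + hex).slice(-8);
}

// Hypothetical frame shape for illustration.
interface StackFrameInfo {
  functionName: string;
  url: string;
  line: number;
  column: number;
}

// Split a stack string of "function@url:line:column" lines into frames.
// Lines that do not match that shape (such as the trailing empty line)
// are skipped.
function parseFirefoxStack(stack: string): StackFrameInfo[] {
  const parsed: StackFrameInfo[] = [];
  stack.split("\n").forEach((raw) => {
    const m = /^(.*?)@(.*):(\d+):(\d+)$/.exec(raw);
    if (m !== null) {
      parsed.push({
        functionName: m[1] || "",
        url: m[2] || "",
        line: parseInt(m[3] || "0", 10),
        column: parseInt(m[4] || "0", 10),
      });
    }
  });
  return parsed;
}
```

With the values from this comment, `toHex32(2147500037)` gives `0x80004005`, which matches the NS_ERROR_FAILURE code in the message, and the first parsed frame is `B.scale` at line 3497.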
Attached image canvas.png

If you inspect the dom after it happens what does it look like, is the canvas element still there? Can you inspect it, dump its contents as data url?

Here is the PNG image from the data: URI I copied from the doc's canvas element.

Attached image doc_screenshot.png

But a screenshot of the full page shows that, while the doc text is blank, some images pasted in the doc remain rendered.

I tried using 32-bit Firefox 123 Release for my daily Google Docs work on Tuesday and 32-bit Beta 124 on Wednesday. Even with dozens of extra Google Docs tabs open in a background window to increase memory pressure on my Google Docs content processes, I did not reproduce this Google Docs error. Of course, that's not proof that 123 and 124 don't have a bug that might trigger the error in the right conditions.

In contrast, I switched back to 32-bit Nightly 125 this morning and hit this error again after a few hours. So maybe there is some Nightly-only code that was active in Nightly 124 (when I and others first saw these errors) but not in Beta 124?

For what it's worth, I've seen this error multiple times a day on macOS last week, but it stopped days ago.
I wonder if this might be caused by a change on Google's side?

I got that error dialog on a Google Doc after it had been in a background tab for a long time, but the document still seemed to be functional: I was able to click on a link in the document before the dialog popped up, and the document was still drawn in the background. I deleted the dialog from the DOM using the inspector and the document seemed fine. I wonder if the problem here was that the network connection to Google got interrupted and it got confused at some point?

Severity: -- → S3

[Tracking Requested - why for this release]:
Impacting a big website

Severity: S3 → --
Severity: -- → S2
Component: Graphics → Graphics: Canvas2D

We're presumably hitting:
https://searchfox.org/mozilla-central/rev/109bb25545f0d2df31954dc0a9afbf30d900b6bb/dom/canvas/CanvasRenderingContext2D.cpp#2130

With

  if (!IsTargetValid()) {
    aError.Throw(NS_ERROR_FAILURE);
    return;
  }
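
From the page's side, that throw surfaces as an exception out of an ordinary drawing call such as `scale()`. A minimal sketch of how a caller could observe it without letting the exception escape a frame callback follows. `ScalableContext` and `tryScale` are hypothetical names used only so the sketch runs without a DOM; nothing here reflects what Google Docs actually does.

```typescript
// Minimal structural type so the sketch runs without a DOM. It mirrors the
// scale(x, y) method of a 2D canvas context.
interface ScalableContext {
  scale(x: number, y: number): void;
}

// Attempt the call and report the outcome instead of letting the exception
// propagate. In the failure discussed here, `error` would be the
// NS_ERROR_FAILURE exception thrown by the browser.
function tryScale(
  ctx: ScalableContext,
  x: number,
  y: number
): { ok: boolean; error: unknown } {
  try {
    ctx.scale(x, y);
    return { ok: true, error: undefined };
  } catch (err) {
    return { ok: false, error: err };
  }
}
```

A caller could use a failed result to skip the rest of the frame and wait for the canvas to become usable again, rather than treating the document as broken.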

This sounds like an OOM, which I suppose on 32-bit is all too common:

(#0): GP+[GFX1-]: Failed to create a valid ShmemTextureHost

I wonder if we are hitting OOM related edge cases with canvas, and it is showing up in strange ways. To my knowledge, there are no pref / #ifdef related differences between nightly and beta with respect to canvas.

Potential candidates on the OffscreenCanvas side that might be useful:

gfx.canvas.remote.allow-offscreen controls whether or not we use D2D canvas (on Windows) or accelerated canvas (on Linux/OSX/Android) with DOM workers. Flipping it to false will make us fall back to using Skia with OffscreenCanvas. Note that it is possible that we fall back from D2D or accelerated canvas without flipping this pref to false; that is just not the default.

From the about:support provided, I see no evidence we fell back to Skia, so it probably kept using D2D.

gfx.offscreencanvas.shared-provider controls whether or not we use the PersistentBufferProviderShared or PersistentBufferProviderBasic with Skia based OffscreenCanvas. We use the former by default. The previous behaviour was to use the latter. I think the memory leaks associated with it before should be fixed as well even if the pref is flipped to false.

If we put more information into the exception that we're throwing Google could presumably pass some of that information on to us.

I saw such an error message on Google Docs yesterday on macOS arm64. It could be a different issue though.

Depends on: 1885365

The bug is marked as tracked for firefox124 (beta) and tracked for firefox125 (nightly). However, the bug still isn't assigned.

:bhood, could you please find an assignee for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit BugBot documentation.

Flags: needinfo?(bhood)
Assignee: nobody → lsalzman
Flags: needinfo?(bhood)
Assignee: lsalzman → nobody

Chris, your about:support is saying that you're running Software WebRender. Is this on the 32 bit build or on the 64 bit build? That by itself looks a little bit suspicious. We currently disable Accelerated Canvas2D for other platforms in that scenario, but we still allow D2D Remote Canvas on Windows even in that case.

It would be worth trying setting gfx.canvas.remote, restarting, and see if the problem goes away?

While I don't immediately have a distinct answer to why SW-WR + Remote Canvas would be causative in this scenario, it might be worth just establishing if it is somehow implicated.

Flags: needinfo?(cpeterson)

(In reply to Lee Salzman [:lsalzman] from comment #20)

Chris, your about:support is saying that you're running Software WebRender. Is this on the 32 bit build or on the 64 bit build? That by itself looks a little bit suspicious. We currently disable Accelerated Canvas2D for other platforms in that scenario, but we still allow D2D Remote Canvas on Windows even in that case.

I manually enabled SW-WR long ago to help dogfood it, since few Nightly users probably have a hardware configuration that would require SW-WR. But I doubt that's related because I assume none of the other people in this bug who have seen this error message are using SW-WR. (That's also why I was purposely using 32-bit Firefox, even though I have Win64 OS.) Should I not manually set gfx.webrender.software = true?

It would be worth trying setting gfx.canvas.remote, restarting, and see if the problem goes away?

My gfx.canvas.remote pref is true.

Flags: needinfo?(cpeterson) → needinfo?(lsalzman)

(In reply to Chris Peterson [:cpeterson] from comment #21)

(In reply to Lee Salzman [:lsalzman] from comment #20)

Chris, your about:support is saying that you're running Software WebRender. Is this on the 32 bit build or on the 64 bit build? That by itself looks a little bit suspicious. We currently disable Accelerated Canvas2D for other platforms in that scenario, but we still allow D2D Remote Canvas on Windows even in that case.

I manually enabled Software WebRender long ago to help dogfood it since few Nightly users probably use such a configuration. (That's also why I was purposely using 32-bit Firefox, even though I have Win64 OS.) Should I not manually set gfx.webrender.software = true?

It would be worth trying setting gfx.canvas.remote, restarting, and see if the problem goes away?

My gfx.canvas.remote pref is true.

Oops, I meant set it to false.

Flags: needinfo?(lsalzman)

Google is seeing the newly introduced "Cannot use canvas after shutdown initiated." error.

I am reworking shutdown to avoid issues with GPU process crashes in bug 1886022.

Depends on: 1886022

I hit the newly introduced "Cannot use canvas after shutdown initiated." error in a 32-bit build of 126 Nightly even with gfx.webrender.software = false and gfx.canvas.remote = false.

I am reworking shutdown to avoid issues with GPU process crashes in bug 1886022.

My about:crashes lists this crash report for a GPU process OOM in WebRender code:

https://crash-stats.mozilla.org/report/index/1ef36db4-ce89-49e0-b657-f02f40240319

(In reply to Chris Peterson [:cpeterson] from comment #25)

My about:crashes lists this crash report for a GPU process OOM in WebRender code:

https://crash-stats.mozilla.org/report/index/1ef36db4-ce89-49e0-b657-f02f40240319

Do you have a way to get an about:memory report when you're nearing OOM in the GPU process?

Flags: needinfo?(cpeterson)

Yep, the OOM crashing the GPU process should hit the problem, even if remote canvas is disabled. Please retest when bug 1886022 lands which should hopefully solve it.

cpeterson, my attempted fix in bug 1886022 should be in the latest nightly. Please let us know if the problem is gone for you :).

(In reply to Jeff Muizelaar [:jrmuizel] from comment #26)

Do you have a way to get an about:memory report when you're nearing OOM in the GPU process?

I opened about 40 Google Docs tabs in a 32-bit Windows build and captured this about:memory report. I don't really know how close the GPU process is to OOMing and triggering the Google Docs error, but about:memory says the GPU process has about 2.5 GB of committed address space and WebRender is using about 1.5 GB.

My crash report bp-1ef36db4-ce89-49e0-b657-f02f40240319 says the GPU process is OOMing here when asked to allocate 12,423,840 bytes, even though the crash report also says the process had 2.67 GB of Available Virtual Memory:

https://hg.mozilla.org/mozilla-central/file/ca250739126614322a325fededc713174a6ad565/gfx/wr/webrender/src/renderer/upload.rs#l114
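
One possible reading, not confirmed anywhere in this bug, is address-space fragmentation: the total free virtual memory can be large while no single contiguous range is big enough for the request. The sketch below only illustrates that arithmetic. The function names and the free-range sizes are made up; only the 12,423,840-byte request comes from the crash report.

```typescript
// Each array entry is the size in bytes of one free, contiguous range.

// Sum of all free ranges, in bytes.
function totalFreeBytes(freeRanges: number[]): number {
  let total = 0;
  freeRanges.forEach((size) => {
    total += size;
  });
  return total;
}

// Size in bytes of the largest single free range (0 if there are none).
function largestFreeRange(freeRanges: number[]): number {
  let largest = 0;
  freeRanges.forEach((size) => {
    if (size > largest) {
      largest = size;
    }
  });
  return largest;
}

// A contiguous allocation succeeds only if one range can hold it.
function canAllocateContiguous(freeRanges: number[], requestBytes: number): boolean {
  return largestFreeRange(freeRanges) >= requestBytes;
}
```

For example, 300 free ranges of 8 MiB each add up to roughly 2.3 GiB of free space, yet a 12,423,840-byte (about 11.8 MiB) contiguous request would still fail under this model because no single range is large enough.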

Flags: needinfo?(cpeterson) → needinfo?(jmuizelaar)
Flags: needinfo?(jmuizelaar)
Keywords: regression
Regressed by: 1870957
Depends on: 1887729
Depends on: 1887950

This is a reminder regarding comment #19!

The bug is marked as tracked for firefox124 (release) and tracked for firefox125 (beta). We have limited time to fix this, the soft freeze is in 14 days. However, the bug still isn't assigned.

Now that bug 1887729 is in nightly (pending uplift to beta), we should have leapfrogged our historical behaviour to allow for a complete recovery from a GPU process crash on Google Docs. Any properly implemented web application that listens to oncontextlost/restored should be able to recover gracefully.
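
As a rough illustration of what listening for context loss and restoration can look like on the page side, here is a minimal sketch. It assumes the standard `contextlost` and `contextrestored` events fired at a canvas with a 2D context; `CanvasEventSource`, `installContextRecovery`, and `redrawAll` are hypothetical names, and the structural interface exists only so the sketch can run without a DOM.

```typescript
// Minimal structural type standing in for a canvas element, so the sketch
// can be exercised without a DOM.
interface CanvasEventSource {
  addEventListener(type: string, listener: () => void): void;
}

// Track whether the 2D context is currently lost, and redraw everything
// once the browser reports that it has been restored.
function installContextRecovery(
  canvas: CanvasEventSource,
  redrawAll: () => void
): { isLost(): boolean } {
  let lost = false;

  canvas.addEventListener("contextlost", () => {
    // Drawing calls are pointless until the context comes back.
    lost = true;
  });

  canvas.addEventListener("contextrestored", () => {
    // The restored context starts from a blank, default state, so the
    // application has to repaint its content itself.
    lost = false;
    redrawAll();
  });

  return { isLost: () => lost };
}
```

On a real page, the first argument would be the canvas element and `redrawAll` would repaint the whole surface from the application's own model, since the previous pixels are not expected to survive a GPU process crash.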

Note that I found from testing that Google Sheets specifically doesn't fully redraw without moving around the spreadsheet, but from what I can tell, that is because Google Sheets itself doesn't give us all the necessary instructions to redraw (unlike on Chrome).

Chris, have you seen improvements with the most recent Nightly builds since bug 1887729 landed?

Flags: needinfo?(cpeterson)

I guess as a more general point, do we have any real way of verifying that things are working better? Bug 1887729 seems like a pretty scary thing to uplift late in the cycle and any data we can find to demonstrate any improvement in the situation would be helpful for evaluating its risk vs. reward.

(In reply to Ryan VanderMeulen [:RyanVM] from comment #33)

I guess as a more general point, do we have any real way of verifying that things are working better? Bug 1887729 seems like a pretty scary thing to uplift late in the cycle and any data we can find to demonstrate any improvement in the situation would be helpful for evaluating its risk vs. reward.

We do. Killing the GPU process without bug 1887729 leaves Google Docs in a state where it doesn't think anything is wrong but nothing paints. This was caused by bug 1871467 but was covered up by bug 1870957.

Regressed by: 1871467

(In reply to Ryan VanderMeulen [:RyanVM] from comment #32)

Chris, have you seen improvements with the most recent Nightly builds since bug 1887729 landed?

I haven't seen any more Google Docs errors since then, though I've been using a 64-bit Windows build instead of a 32-bit build. I can try downgrading to a 32-bit build again.

Flags: needinfo?(cpeterson)

To update the status of this bug, we've gotten some positive feedback that the most recent fixes that landed on Nightly are showing improvements in the error rates they've been seeing on their end. The most recent fix will be included in 125.0b8 as well. We should probably wait for wider testing before closing this bug out, however.

This is a reminder regarding comment #19!

The bug is marked as tracked for firefox125 (beta) and tracked for firefox126 (nightly). We have limited time to fix this, the soft freeze is in 8 days. However, the bug still isn't assigned.

Assignee: nobody → aosmond

We've received further confirmation from Google that the error rate from 125 & 126 continues to show favorable results since this change was rolled out. Closing this bug based on that feedback. Feel free to reopen if we see indications that there's still an ongoing issue once 125 rides to release next week.

Status: NEW → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → 126 Branch
No longer depends on: 1887950
See Also: → 1887950
No longer blocks: gfx-triage