Closed Bug 1828073 Opened 2 years ago Closed 1 years ago

Startup Crash in [@ BaseNotification::SetContentString]

Categories

(Toolkit Graveyard :: Notifications and Alerts, defect)

Unspecified
Windows
defect

Tracking

(firefox-esr102 unaffected, firefox112 wontfix, firefox113 wontfix, firefox114 fixed, firefox115 fixed)

RESOLVED FIXED
115 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox112 --- wontfix
firefox113 --- wontfix
firefox114 --- fixed
firefox115 --- fixed

People

(Reporter: aryx, Assigned: nrishel)

References

(Regression)

Details

(Keywords: crash, regression, Whiteboard: [fidedi-notifications])

Crash Data

Attachments

(2 files)

250 crashes for Firefox 111.0.x on Windows, newer versions also affected. Firefox 111 had native Windows notifications turned on in bug 1497425.

Crash report: https://crash-stats.mozilla.org/report/index/9c773c95-e6cb-4531-9907-efd0b0230414

Reason: EXCEPTION_ACCESS_VIOLATION_READ

Top 10 frames of crashing thread:

0  wpnapps.dll  BaseNotification::SetContentString  onecoreuap\base\diagnosis\platform\notifications\developer\basenotification.cpp:659
0  wpnapps.dll  BaseNotification::Post  onecoreuap\base\diagnosis\platform\notifications\developer\basenotification.cpp:314
1  wpnapps.dll  <lambda_d9910e541950f8d272504b6cd8567d65>::operator const  onecoreuap\base\diagnosis\platform\notifications\developer\toast.cpp:669
2  wpnapps.dll  ToastNotifierImpl::Show  onecoreuap\base\diagnosis\platform\notifications\developer\toast.cpp:673
3  rpcrt4.dll  Invoke  
4  rpcrt4.dll  ?Ndr64StubWorker@@YAJPEAX0PEAU_RPC_MESSAGE@@PEAU_MIDL_SERVER_INFO_@@PEBQ6AJXZPEAU_MIDL_SYNTAX_INFO@@PEAK@Z  
5  rpcrt4.dll  NdrStubCall3  
6  combase.dll  CStdStubBuffer_Invoke  onecore\com\combase\ndr\ndrole\stub.cxx:1440
7  rpcrt4.dll  CStdStubBuffer_Invoke  
8  combase.dll  InvokeStubWithExceptionPolicyAndTracing::__l6::<lambda_c9f3956a20c9da92a64affc24fdd69ec>::operator const  onecore\com\combase\dcomrem\channelb.cxx:1279

:nrishel, since you are the author of the regressor, bug 1497425, could you take a look? Also, could you set the severity field?

For more information, please visit auto_nag documentation.

Flags: needinfo?(nrishel)

:aryx would this segfault have crashed the browser, or would we have recovered from it given the error is on a non-main thread? I'm assuming the former but I'm currently unable to recreate a crash.

Some hypotheses: might be an issue with our use of HString and HStringReference classes - possibly Windows violating "The Windows Runtime releases all references to the string reference when the call returns." - or an issue with the supplied AString not being valid UTF-16. I was thinking the former to me makes more sense wrt causing a null pointer segfault as the classes are likely nulling their content on destruction, but I'd suspect significantly more crashes if that were the case. Alternatively maybe an issue with UTF-16 4 byte characters, but testing with \ud834\udd1e didn't have any issue.

Flags: needinfo?(nrishel) → needinfo?(aryx.bugmail)

crash-stats lists the process type as parent - the browser crashed.

Flags: needinfo?(aryx.bugmail)

67% of the crashes are in the first minute after launch. These might be users with session restore enabled or who visit a page which sends notifications. In case the user submitted the url of the page which they had open (assumption here it's not necessarily the one which reques7 the notification), there are also established assets like Google, Spotify, and Facebook - unlikely they all provide invalid strings.

Firefox in Japanese represent 7% of the crashes, Spanish and French 5% - but English (US) 48% and English (UK) 5%. No pattern.

Summary: Crash in [@ BaseNotification::SetContentString] → Startup Crash in [@ BaseNotification::SetContentString]

:aryx Is there a way to query the matching bugs? From a random sampling I suspect they all contain OnWriteBitmapFinished or OnWriteImageFinished (same function renamed) but not OnWriteImageSuccess. This would indicate an issue for notifications with images that fail to save to disk.

Flags: needinfo?(aryx.bugmail)

(In reply to Nick Rishel [:nrishel] from comment #5)

:aryx Is there a way to query the matching bugs? From a random sampling I suspect they all contain OnWriteBitmapFinished or OnWriteImageFinished (same function renamed) but not OnWriteImageSuccess. This would indicate an issue for notifications with images that fail to save to disk.

Yes: if you click on the crash report and then either "More reports" or "Search", you can log in to crash-stats (which might require access?) and get much more information. I will note that of the public data I am looking at, 87% are on AMD64. That's suspicious.

But as I do this, I don't see how to query the entire stack for a) frames that are not top-most and b) frames from the main thread. Hopefully :aryx can guide us.

Update: it looks like "proto signature" is what we want, but when I try to search for, e.g., proto signature contains "mozilla::widget::ToastNotificationHandler::OnWriteBitmapFinished" I don't get sensible hits. I don't see how to view the "proto signature" in order to figure out what it actually contains so I'm stuck here.

I think proto signature only contains the crashing stack trace (you can aggregate on proto signature here to get an idea of what it contains). We need to match the stack trace of a thread blocked by the crashing thread.

Proto signature contains the top 15 frames of the crashing thread as far as I know.

Will, is there a way to search for crash reports which have a non-crashing thread with a substring in its stack?

Flags: needinfo?(aryx.bugmail) → needinfo?(willkg)

Documentation for the proto signature field: https://crash-stats.mozilla.org/documentation/datadictionary/dataset/processed/field/proto_signature

Non-crashing thread information isn't indexed, so I think the answer to your question is no.

Flags: needinfo?(willkg)

Do we have other metrics, e.g. can we check for low disk/memory availability? This is increasingly looking like something not being handled correctly Windows-side. When an image fails to write to disk we guard against using image toasts. Everything looks correctly guarded and I haven't seen any contradiction that this only occurs when either we're not able to write to disk (or retrieve the image surface). That suggests to me that perhaps either the disk/memory is at capacity and a null check isn't being performed.

Edit, if I'm reading the reports correctly I'm guessing they have sufficient disk/memory.

Assignee: nobody → nrishel

The severity field is not set for this bug.
:tspurway, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(tspurway)
Severity: -- → S2

Hi Nick, any news/updates on this bug?

Flags: needinfo?(nrishel)

:RyanVM No updates. I don't have any solid hypothesis what's triggering this other than it appears to correlate with failure to save an image to disk, but the control flow path that triggers are easy to reason about and don't have any apparent errors. Absent a MRE, I think this may be a Windows bug that triggers under some resource constrained scenario.

:aryx, missed NI for in my last comment's question.

Flags: needinfo?(nrishel) → needinfo?(aryx.bugmail)

bp-c646ac09-e1c9-48ce-9557-30d610230508 has "System Memory Use Percentage 41" - 59% of the memory are available. There is no telemetry about available disk space as far as I can see - both in the crash reports and locally at about:telemetry.

Flags: needinfo?(aryx.bugmail)

Yannis, does anything stand out to you in these crashes?

Flags: needinfo?(yjuglaret)

This crash happens while Microsoft's wpnapps.dll tries to convert the XML for a toast notification from UTF16 to UTF8. For some reason the translation fails. The Microsoft internal function wpnapps!ConvertUTF16ToUtf8 outputs a null pointer in error case, which wpnapps!BaseNotification::SetContentString will use without first checking that it is not null. This should result in a benign error instead of a crash if the check was there; so we can probably ping Microsoft about this part. However we should also investigate how a failing UTF16 sequence could be found there in the first place.

This analysis seems confirmed by the few crashes in Nightly (e.g. here), where the last error value is available and listed as ERROR_NO_UNICODE_TRANSLATION. This is because wpnapps!ConvertUTF16ToUtf8 calls WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS, lpWideCharStr, ...), which fails, resulting in this ERROR_NO_UNICODE_TRANSLATION last error code followed by the crash.

It is possible to get this last error code manually e.g. by breaking surrogate pair rules:

For UTF-16, a "surrogate pair" is required to represent a single supplementary character. The first (high) surrogate is a 16-bit code value in the range U+D800 to U+DBFF. The second (low) surrogate is a 16-bit code value in the range U+DC00 to U+DFFF.

wchar_t chars[]{ 0xD800, 0 }; // Use a high surrogate followed by anything *except* a low surrogate
auto result = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS, chars, -1, nullptr, 0, nullptr, nullptr);
printf("%d %d\n", result, GetLastError()); // Outputs: 0 1113

So far I haven't had success in reproducing the crash e.g. by modifying mTitle and mMsg in ToastNotificationHandler::CreateToastXmlDocument to try to introduce this kind of problem. But it does reproduce if I manually alter the UTF16 argument of wpnapps!ConvertUTF16ToUtf8 in a debugger. Maybe this gives more ideas of things to try?

This being a startup crash could be explained by users reloading their session after crashing, and the same notification getting dispatched again, thus re-crashing the browser like comment 4 suggests.

Flags: needinfo?(yjuglaret)

(In reply to Nick Rishel [:nrishel] from comment #5)

:aryx Is there a way to query the matching bugs? From a random sampling I suspect they all contain OnWriteBitmapFinished or OnWriteImageFinished (same function renamed) but not OnWriteImageSuccess. This would indicate an issue for notifications with images that fail to save to disk.

This is a very interesting observation, this crash could definitely be related to notifications with images. But I don't think we should assume that this is limited to the failure-to-save case. If I read the code correctly the notification is always sent from within OnWriteImageFinished even in case of success; OnWriteImageSuccess only modifies what will be sent in the notification. To me this additional information could make the call to AppendUTF8toUTF16 rather suspicious here.

After finding some hints in the user comments, using aggregation on crash reports revealed that 619 of the 634 crashes have a common addon, listed with identifier jid0-GjwrPchS3Ugt7xydvqVK4DQk8Ls@jetpack:1.0.5: Notifier for Gmail™, version 1.0.5. It appears to be this addon which reports the same id/name/version. This addon should be helpful to reproduce the issue -- which may be more general.

I installed it and created a custom Gmail account, but so far have no success in reproducing the crash. The notifications are always shown with the same image: the icon for the addon. I wonder if receiving simultaneously hundreds of notifications (for unread emails) all with the same image (the icon) could trigger the crash?

I also found that this particular crash has some limited heap information that is relevant: at address 000001598f09881c we have the beginning of the faulty UTF16 string:

<toast launch="program&#xA;firefox&#xA;profile&#xA;C:\Users\<removed>\AppData\Roaming\Mozilla\Firefox\Profiles\[removed]&#xA;windowsTag&#xA;[removed]"><visual><binding template="ToastImageAndText03"><image id="1" src="file:///C:/Users/[removed]/AppData/Local/Temp/notificationimages/0c63809e-10f7-45b9-8efd-0c07f10eb151.png"/><text id="1">Notifier for Gmail™

However the full string is not available in the crash report, but at least this shows that the problem is not with the image's URL like I suggested in comment 18. The string ends like this in raw bytes:

00000159`8f098ac0  3c 00 74 00 65 00 78 00-74 00 20 00 69 00 64 00  <.t.e.x.t. .i.d.
00000159`8f098ad0  3d 00 22 00 31 00 22 00-3e 00 4e 00 6f 00 74 00  =.".1.".>.N.o.t.
00000159`8f098ae0  69 00 66 00 69 00 65 00-72 00 20 00 66 00 6f 00  i.f.i.e.r. .f.o.
00000159`8f098af0  72 00 20 00 47 00 6d 00-61 00 69 00 6c 00 22 21  r. .G.m.a.i.l."!
00000159`8f098b00  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ???????????????? // remaining data not available in the crash dump

If I look at what the notifications from this addon look like on my own machine, I have the same start of UTF-16 string (and no crash):

<toast launch="program&#xA;firefox&#xA;profile&#xA;C:\Users\yjugl\AppData\Roaming\Mozilla\Firefox\Profiles\h8yxxtwx.default-release-5&#xA;windowsTag&#xA;2964831140"><visual><binding template="ToastImageAndText03"><image id="1" src="file:///C:/Users/yjugl/AppData/Local/Temp/notificationimages/5378b913-4a4f-4d7f-aaf7-84beb79820a3.png"/><text id="1">Notifier for Gmail™</text><text id="2">From: yjuglaret@whatever.com
Title: Hello
Summary: This is a test</text></binding></visual><actions><action content="Notification settings" arguments="program&#xA;firefox&#xA;profile&#xA;C:\Users\yjugl\AppData\Roaming\Mozilla\Firefox\Profiles\h8yxxtwx.default-release-5&#xA;windowsTag&#xA;2964831140&#xA;action&#xA;settings" placement="contextmenu"/></actions></toast>

(the line breaks are \r\n)

Here is what the raw bytes look like around the ™ (U+2122) on my machine, all seems normal:

00000248`bbb73930  70 00 6e 00 67 00 22 00-2f 00 3e 00 3c 00 74 00  p.n.g."./.>.<.t.
00000248`bbb73940  65 00 78 00 74 00 20 00-69 00 64 00 3d 00 22 00  e.x.t. .i.d.=.".
00000248`bbb73950  31 00 22 00 3e 00 4e 00-6f 00 74 00 69 00 66 00  1.".>.N.o.t.i.f.
00000248`bbb73960  69 00 65 00 72 00 20 00-66 00 6f 00 72 00 20 00  i.e.r. .f.o.r. .
00000248`bbb73970  47 00 6d 00 61 00 69 00-6c 00 22 21 3c 00 2f 00  G.m.a.i.l."!<./.
00000248`bbb73980  74 00 65 00 78 00 74 00-3e 00 3c 00 74 00 65 00  t.e.x.t.>.<.t.e.

Edit: Maybe the bad UTF-16 comes from the message title / summary. For example, maybe the summary can cut a message in the middle of a surrogate pair. I'll try that.

The addon from comment 19 contains code which looks like this:

// ...
app.notify = function(text, title, callback, buttons = []) {
  // ...
  chrome.notifications.create(null, options, id => {
    // ...
  });
};
// ...
        function shorten(str) {
          if (str.length < config.email.truncate) {
            return str;
          }
          return str.substr(0, config.email.truncate / 2) + '...' + str.substr(str.length - config.email.truncate / 2);
        }
        var report = tmp.map(e => config.notification.format
          .replace('[author_name]', e.author_name)
          .replace('[author_email]', e.author_email)
          .replace('[summary]', shorten(e.summary))
          .replace('[title]', shorten(e.title))
          .replace(/\[break\]/g, '\n'));
// ...
            app.notify(report, '', () => {
              // ...
            }, buttons);

Where config.email.truncate is 70 by default. And it seems that JS strings can indeed cut into the middle of a surrogate pair, as illustrated here in dev tools:

"𒐪".substr(0,1)
-> "\ud809"
"𒐪".substr(1,1)
-> "\udc2a"

Maybe the web notification path is safeguarded against the kind of strings described in comment 17 but not the Browser Extensions notification path? I'll do more tests in this direction next week.

I am able to reproduce the crash using the following code in a local HTML file:

<!DOCTYPE html>
<html>
<head>
<script type="text/javascript">
function notify(title) {
  new Notification(title)
}
function tryNotify(title) {
  if (!("Notification" in window)) {
    alert("This browser does not support desktop notification")
  } else if (Notification.permission === "granted") {
    notify(title)
  } else if (Notification.permission !== "denied") {
    Notification.requestPermission().then((permission) => {
      if (permission === "granted") {
        notify(title)
      }
    });
  }
}
window.onload = function() {
  var charCode = parseInt(prompt("Enter a char code", defaultValue="dfff"), 16);
  if (!isNaN(charCode)) {
      tryNotify(String.fromCharCode(charCode))
  }
};
</script>
</head>
</html>

Under the following conditions (otherwise a notification gets dispatched with no crash):

  • using a char code between DC00 and DFFF i.e. a lonely low surrogate;
  • assuming that this is the first notification dispatched by Firefox.

This seems to crash both in the title and body of the notification; and both using the web API and addon API.

The occasional presence of lonely low surrogates when using the addon mentioned in comment 19 is explained by the second substr call in the shorten function shown in comment 20. The first call can lead to lonely high surrogates, which somehow do not lead to a crash; while the second call can lead to lonely low surrogates, which somehow lead to a crash if this is the first notification getting dispatched.

[:nrishel], would you like to continue the investigation from here?

Flags: needinfo?(nrishel)

Much appreciated yannis! Yes, I'll take over from here.

Flags: needinfo?(nrishel)
Whiteboard: [fidedi-notifications]

Is this going to land shortly? Thanks

:pascalc I'm planning to try to land this tomorrow. No significant problems found in review, just needs a cleanup pass.

Pushed by nrishel@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/338450ea3be0 Fix invalid UTF-16 crashing Windows Toast notifications. r=nalexander https://hg.mozilla.org/integration/autoland/rev/62eed0dd28da Post: Add tests for invalid UTF-16 as arguments to `nsIAlertNotification`s. r=nalexander

Comment on attachment 9335363 [details]
Bug 1828073 - Fix invalid UTF-16 crashing Windows Toast notifications. r=nalexander

Beta/Release Uplift Approval Request

  • User impact if declined: Crashing under uncommon circumstances for Windows users.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Trivial to reason about, using existing functionality to clean up UTF-16 strings.
  • String changes made/needed:
  • Is Android affected?: No
Attachment #9335363 - Flags: approval-mozilla-beta?
Attachment #9335364 - Flags: approval-mozilla-beta?

Comment on attachment 9335363 [details]
Bug 1828073 - Fix invalid UTF-16 crashing Windows Toast notifications. r=nalexander

Fx114 is in RC, moving uplift request to release

Attachment #9335363 - Flags: approval-mozilla-beta? → approval-mozilla-release?
Attachment #9335364 - Flags: approval-mozilla-beta? → approval-mozilla-release?
Status: NEW → RESOLVED
Closed: 1 years ago
Resolution: --- → FIXED
Target Milestone: --- → 115 Branch

Comment on attachment 9335363 [details]
Bug 1828073 - Fix invalid UTF-16 crashing Windows Toast notifications. r=nalexander

Approved for 114.0.2 next week, thanks.

Attachment #9335363 - Flags: approval-mozilla-release? → approval-mozilla-release+
Attachment #9335364 - Flags: approval-mozilla-release? → approval-mozilla-release+
Flags: needinfo?(tspurway)
Product: Toolkit → Toolkit Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: