Closed Bug 1358159 Opened 6 years ago Closed 6 years ago

Crash in GetLocaleInfoHelper() called from :intl::OSPreferences::ReadDateTimePattern

Categories

(Core :: Internationalization, defect)

55 Branch
Unspecified
Windows 10
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox-esr52 --- unaffected
firefox53 --- unaffected
firefox54 --- unaffected
firefox55 --- fixed

People

(Reporter: marcia, Assigned: zbraniecki)

Details

(4 keywords, Whiteboard: [post-critsmash-triage])

Crash Data

Attachments

(1 file, 3 obsolete files)

This bug was filed from the Socorro interface and is 
report bp-d939c8a3-c208-4ad9-92d3-944ee0170420.
=============================================================

Seen while looking at nightly crash stats - crashes started using 20170419030223: http://bit.ly/2pj9fjX

Possible regression range based on Build ID: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=bb38d935d699e0529f9e0bb35578d381026415c4&tochange=c0ea5ed7f91a6be996a4a3c5ab25e2cdf6b4377e

ni on gandalf as maybe he can shed some light on the crash. Affects Tbird as well as FF.  Some of the comments mention crashing when downloading files.
Flags: needinfo?(gandalf)
Ouch, that's beyond Gecko. Not sure what's going on here. I'll look into known crashes in Windows API but also NI Jonathan who may shed some light on how are we triggering the crash.
Flags: needinfo?(jfkthame)
There's not much on the web about crashes in this function :(

Also, the range is strange. There's nothing related to OSPreferences in that range.

My guess would be either bug 1355308 which landed on the 10th and switched Windows to use UserLocale, or bug 1354445 which landed on the 8th and switched callsites to use mozIntl.DateTimeFormat.

Not sure why crashes started only on 19th.
Crashing with a write violation inside the call at https://hg.mozilla.org/mozilla-central/annotate/c0ea5ed7f91a/intl/locale/windows/OSPreferences_win.cpp#l172 suggests to me that the target string length was not long enough for the result that Windows is trying to write to it.

I wonder about the SetLength(len - 1) on the previous line; without some digging, I'm unsure whether this guarantees to allocate enough space for the null terminator as well as the actual character data. Maybe we need to SetLength(len) there, to ensure the terminator will fit when Windows writes the data to its output buffer, and then SetLength(len - 1) afterwards so that it isn't considered part of the actual string.
Flags: needinfo?(jfkthame)
What's the course of action then in such a case? I can't reproduce it locally, so I'd be operating blind if I were to write a patch that changes the SetLength.
Can you advise or NI the right people? I'm concerned that while :jfkthame indicated a potential fix, I cannot reason about whether it will be a complete fix and I don't know how to reproduce it.

It puts me in a really bad position to pursue patching it.
Flags: needinfo?(mozillamarcia.knous)
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #4)
> What's the course of action then in such a case? I can't reproduce it
> locally, so I'd be operating blind if I were to write a patch that changes
> the SetLength.

I took a (brief) look at the string code, and I think its behavior matches what the code here expects: SetLength(n) should ensure the string has enough capacity for n characters _plus_ a null terminator. If that's the case, my suggestion above may well be irrelevant anyway. :(
I think the best course is for someone to track down a better regression range. Adding the relevant keywords to trigger that action.
Flags: needinfo?(mozillamarcia.knous)
I don't believe that the regression window is going to surprise us. We know when we landed this change - bug 1308329 - we just don't know how this code causes the crash. :(
It looks as if Comment 0 is incorrect in the date I see crashes start - I see crashes in 20170416030209 build, which is a few days earlier than what I report in Comment 0.

All crashes so far are Windows 10. User comments mention:

* majorgeeks.com.
* crashes on any download update attempts of sites 
* unable to download updates for privaZer and smart defrag
* downloading files

ni on Andrei to see if someone in QA can reproduce on Windows 10.
Flags: needinfo?(gandalf) → needinfo?(andrei.vaida)
Lots of writes (and some reads) to bad addresses...  at least sec-high; could be sec-crit

First build with a hit is 20170419
Group: core-security
Summary: Crash in GetLocaleInfoHelper → Crash in GetLocaleInfoHelper() called from :intl::OSPreferences::ReadDateTimePattern
Group: core-security → dom-core-security
I'm still stuck with this. Not sure what should be the next step. We know it's a sec-high, we know when we introduced it, but we don't know how to fix it.
Flags: needinfo?(andrei.vaida)
I could not reproduce this issue, I tried downloading files and updates. I also noticed that the crash number remained the same (12). Am I missing something?
(In reply to Cristian Comorasu [:CristiComo] from comment #12)
> I could not reproduce this issue, I tried downloading files and updates. I
> also noticed that the crash number remained the same (12). Am I missing
> something?

I see 14 crashes/41 installations as of today.
I think Jonathan in comment 3 is right; the edge case here is that if len == 1, then we SetLength to zero, which sets the string to point at a shared, read-only buffer:

https://dxr.mozilla.org/mozilla-central/source/xpcom/string/nsTSubstring.cpp#684

So if we hand out a pointer to that, and GetLocaleInfoEx tries to write a null terminator to that string, we're going to try and write to read-only memory, which would be consistent with the crash reports here.

Zibi, can you implement the fix described in comment 3, please?
Flags: needinfo?(gandalf)
Debugging a minidump from any of these crashes should be enough to figure out what lengths we're setting the string to; I'd bet we're setting it to zero in all cases.
Attached patch bug1358159.diff (obsolete) — Splinter Review
Sure I can. Here's the patch.

My problem is that I'm operating blindly. I have no idea if this is the bug and if this is fixing it and I don't know how to test it :(
Flags: needinfo?(gandalf)
Attachment #8866474 - Flags: review?(jfkthame)
Comment on attachment 8866474 [details] [diff] [review]
bug1358159.diff

Review of attachment 8866474 [details] [diff] [review]:
-----------------------------------------------------------------

Can you modify your date/time format patterns in the Windows control panel so that they are just a single character each? It seems like that ought to reproduce this bug. (It seems a bit surprising if that would happen with any "normal" format settings, though.)

We should do the same thing at the other GetLocaleInfoEx() call-site in the `if (isTime) { ... }` block later in this method, too, as the same potential issue exists there AFAICS.
Tried that, changed my short and long date to "d" and my time to "t".

Then launched Firefox nightly with browser console and typed:

"(Services.intl.createDateTimeFormat('en-US', {dateStyle: 'short'})).format(0)"

tried with different date styles and time styles. No crash.
Attached patch bug1358159.diff v2 (obsolete) — Splinter Review
Updated patch.

As explained in comment 18, I failed to reproduce the crash with the shortest possible date/time format (1 char), so I'm not sure if this patch fixes anything tbh.
Attachment #8866474 - Attachment is obsolete: true
Attachment #8866474 - Flags: review?(jfkthame)
Attachment #8866518 - Flags: review?(jfkthame)
Comment on attachment 8866518 [details] [diff] [review]
bug1358159.diff v2

Review of attachment 8866518 [details] [diff] [review]:
-----------------------------------------------------------------

Yeah, in theory the problem would arise if you could set the format to an empty string (so the length returned by Windows, including the terminator, would be 1), but the control panel doesn't let me do that. So I haven't been able to reproduce a failure here.

Still, we should take this patch in case it does ever happen that windows returns an empty string.

I suggest marking the bug leave-open when this lands, until we see whether it does have any effect on crash-stats.
Attachment #8866518 - Flags: review?(jfkthame) → review+
Assignee: nobody → gandalf
Status: NEW → ASSIGNED
Keywords: checkin-needed
Attached patch bug1358159.diff v3 (obsolete) — Splinter Review
Now with 100% more commit meta-data!
Attachment #8866518 - Attachment is obsolete: true
Attachment #8866925 - Flags: review+
Now, with r= instead of r?

Sorry for the noise! I'm learning `hg export` instead of MR :)
Attachment #8866925 - Attachment is obsolete: true
Attachment #8866927 - Flags: review+
Seems like potentially this patch did fix this bug.

So far, since 13th no new reports except of one from 38esr which apparently had the same problem in the old code - https://crash-stats.mozilla.com/report/index/58f6a0a9-f347-49a4-919f-2d4ed0170515 :)

I'm going to keep watching this for another week and by Friday, if no new reports show up from recent builds I'll close it as fixed.
AFAICS, the code in 38esr didn't have exactly -this- issue (attempting to write a null terminator to a non-writable zero-length string), as it passed a fixed local buffer when calling the Win32 APIs (see dateBuffer and timeBuffer at https://hg.mozilla.org/releases/mozilla-esr38/annotate/6afcba951b0f/intl/locale/windows/nsDateTimeFormatWin.cpp#l96).

So I think the 38esr crash suggests a different failure mode than what we (theoretically, at least) fixed here, though I can't really tell what went wrong there.
I'm going to make it as fixed.

There are still new reports coming but all of them now are from 38 and 52/53. None from nightly and nothing related to the code we fixed here.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Group: dom-core-security → core-security-release
Flags: qe-verify-
Whiteboard: [post-critsmash-triage]
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.