Closed Bug 1415598 Opened 7 years ago Closed 7 years ago

Crash in nsTHashtable<T>::s_ClearEntry | PLDHashTable::RawRemove | mozilla::places::History::RegisterVisitedCallback

Categories

(Toolkit :: Places, defect, P1)

x86
Windows 10
defect

Tracking

()

RESOLVED FIXED
mozilla59
Tracking Status
firefox-esr52 58+ fixed
firefox57 --- wontfix
firefox58 + fixed
firefox59 + fixed

People

(Reporter: marcia, Assigned: mak)

References

Details

(4 keywords, Whiteboard: [fxsearch][adv-main58+][adv-esr52.6+][post-critsmash-triage])

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-a405876e-5e50-4088-b53c-8f8300171107.
=============================================================

Seen while looking at nightly crash stats - marking as security sensitive due to some signatures exhibiting possible UAFs: http://bit.ly/2yjJNvm

Crashes started on 58 using 20171027220059. Possible regression range based on Build ID: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=d734e6acf7778df7c933d33540203f08b44ff977&tochange=d58424c244c38f88357a26fb61c333d3c6e552d7

Possibly Bug 1412142? ni on :kit
Flags: needinfo?
Fixing ni for comment #0
Flags: needinfo? → needinfo?(kit)
That's...surprising. Bug 1412142 moved an existing block of code into a helper function, and the new `hashURL` method has no in-tree callers yet.

I don't understand what `RegisterVisitedCallback` does yet, or why moving a block of code in  around would cause this crash. Is this because the `KeyClass` doesn't live long enough now? Maybe `VisitedQuery::Start` is failing now...though, based on the comment and the stacks, we're always getting called with a null `aLink` "in IPC builds", which I'm guessing is e10s?

Mak, do you have any suggestions on where to start? I'm also happy to back out bug 1412142 and see what happens.
Flags: needinfo?(kit) → needinfo?(mak77)
(In reply to Kit Cambridge (he/him) [:kitcambridge] from comment #2)
> I don't understand what `RegisterVisitedCallback` does yet, or why moving a
> block of code in  around would cause this crash. Is this because the
> `KeyClass` doesn't live long enough now? Maybe `VisitedQuery::Start` is
> failing now...though, based on the comment and the stacks, we're always
> getting called with a null `aLink` "in IPC builds", which I'm guessing is
> e10s?

RegisterVisitedCallback is the way links on a page register with history to be updated (link coloring) when a page is visited. The content process passes a Link, while the parent process passes null (it doesn't have access to the link).

while Kit's change is the only one explicitly about an history change, there are another couple changes that could be related:
Bug 1408789 - Use VS2017 for official Windows builds
Bug 1385438 - AtomicRefCounted doesn't need to use sequential consistency

I must note that we have another crash bug pointing to History::KeyClass (Bug 1412647) so it's very likely there's a misuse here and it just became more evident. That crash also increased a lot recently, maybe after we enabled Stylo? Additionally bug 1382923 changed the relationship between History and Links and the crashes increased briefly after that landing.

I'm needinfo-ing Nathan, because he worked on the AtomicRefCounted, and he has a deep expertize in refcounted use patterns, I guess at this point any help is welcome and he may have ideas. We may need to completely re evaluate the history links relationship :(
I'll wrap again my head around this code and see if I can figure out something.
Flags: needinfo?(nfroyd)
Flags: needinfo?(mak77)
Priority: -- → P1
Whiteboard: [fxsearch]
taking to have it on my radar. As I said, feel free to evaluate or propose solutions if you see obvious culprits.
Assignee: nobody → mak77
Status: NEW → ASSIGNED
I don't have any super-specific refcounting advice to add; bug 1385438 should have had zero effect on things, other than making them ever-so-slightly faster.  We landed a virtually identical change in bug 1277709 that touched the XPCOM refcounting code, which is much more widely used than RefCounted, and that had no fallout that I know of.

The crash is pretty strange: we're removing an entry from the mObservers table, and we find that the entry (or perhaps memory the entry points to) has been freed.  But we were able to look up the entry without a problem earlier in the function, which suggests that either:

1) We have a race on modifying the mObservers table itself; or
2) Somebody is modifying the data in an mObservers entry out from underneath us.

Glancing through History.cpp, I don't see any evidence that mObservers is modified off the main thread.  Is that correct?  Is it possible that individual entries could be touched off the main thread?
Flags: needinfo?(nfroyd)
Another insteresting fact is bug 1372433 changed the way we use mObservers, before it was all synchronous, now we dispatch to the document group or to the main thread. Though, we should still be on the same thread, unless the dispatch does something magic with threads.
It seems to go through xpcom's SchedulerGroup, though from there it's unclear to me if things go to the main-thread or not. It looks like documents, not being chrome, may have their own event target?
Nathan do you know anything about this scheduling, or do you know who could help? Could the doc->Dispatch calls end up executing the runnables off the main thread? Then yes, they could remove the entry between the mainthread gets it and removes it and we may need a mutex.
Flags: needinfo?(nfroyd)
Dispatching to the document still ends up running things on the main thread.  No worries there.
Flags: needinfo?(nfroyd)
I was looking at aggregates, the only interesting thing is that uptime is, in most cases, very short: less than 20 seconds. So either for these users this hits so often that it crashes at the first page load, or it's related to some startup events.
I spent various hours checkin threads, hashtable usage and keys usage, I didn't find anything evident that may be causing this :(
The only things I found is:
1. nsURIHashTable has a public non-virtual destructor, that could be a footgun if someone deletes a derived entry from the base class. Though, it doesn't look like our case off-hand.
2. Looks like sometimes we could mutate an nsIURI (thus bug 677281), and then our code would fail. It's unclear if we could hit one of these cases.

I'm not sure how to proceed, apart from adding workarounds like converting the getEntry debug check just before the RemoveEntry call to an actual if, and maybe try to make nsURIHashTable destructor virtual.
(In reply to Nathan Froyd [:froydnj] from comment #5)
> 1) We have a race on modifying the mObservers table itself; or

Provided everything seems to run on the main-thread, the only thing that could modify mObservers between the put and the remove calls is VisitedQuery::Start. But that method always uses NS_DispatchToMainThread(), thus unless NS_DispatchToMainThread does magic things like invoking ->Run synchronously (doesn't look like and it would be dumb) it doesn't seem possible.

> 2) Somebody is modifying the data in an mObservers entry out from underneath
> us.

History is the only consumer of the entries in mObservers, and I couldn't find anything between get and put that modifies those.
(In reply to Marco Bonardo [::mak] from comment #8)
> I was looking at aggregates, the only interesting thing is that uptime is,
> in most cases, very short: less than 20 seconds. So either for these users
> this hits so often that it crashes at the first page load, or it's related
> to some startup events.

The dupe is a security researcher with a startup crash, in case that helps...
yep, it's mostly startup crashes. I guess for now I'll try to go with a stop-gap, because I can't explain this crash.
I noticed `NotifyVisited` and `NotifyVisitedForDocument` hold script blockers when they access `mObservers`. Is it possible a script is spinning the event loop and changing `mObservers`, and we need to add blockers to `RegisterVisitedCallback` and `UnregisterVisitedCallback`, too?
The reason for the script blockers is that notifying the visited status calls SetLinkState() that may cause a modal dialog to be shown, that spins the events loop, causing a nested notification. We had a clear stack in that case (bug 668245) showing the modal loop and the nested call. The stack here doesn't show a nested call unfortunately.
Though, you could be onto something, between the putEntry and the removeEntry calls, the nsTHashTable may realloc, and that usually happens with a nested loop. It's hard to verify though.
Marco - can you at least take a shot at a stopgap, since this is a 58 regression and we need to resolve this soon.

Thanks
Flags: needinfo?(mak77)
I'll make a stop-gap patch today, that hopefully will help.
Fwiw, this doesn't look like a topcrash, and it's not a new crash, since it existed before the regression. What regressed is the amount of times it hits.
Attached patch patch v1Splinter Review
Flags: needinfo?(mak77)
Attachment #8935373 - Flags: review?(gijskruitbosch+bugs)
Comment on attachment 8935373 [details] [diff] [review]
patch v1

Review of attachment 8935373 [details] [diff] [review]:
-----------------------------------------------------------------

Tentative rs=me. Maybe Doug Thayer could do a second review here in case there's anything else obvious to do/try?

::: toolkit/components/places/History.cpp
@@ +2705,5 @@
>      if (NS_FAILED(rv) || !aLink) {
>        // Remove our array from the hashtable so we don't keep it around.
>        MOZ_ASSERT(key == mObservers.GetEntry(aURI), "The URIs hash mutated!");
> +      // In some rare case this crashes for unknown reasons.  Our suspect is
> +      // that something between PutEntry and this call causes a nested loop

Nit: "this crashes" - could do with clarifying like "calling RemoveEntry() with the current value of `key` " or something like that.
Attachment #8935373 - Flags: review?(gijskruitbosch+bugs) → review+
(In reply to :Gijs from comment #19)
> Tentative rs=me. Maybe Doug Thayer could do a second review here in case
> there's anything else obvious to do/try?

We can continue the investigation in bug 1412647, that imo is the same exact thing wit ha slightly different signature, but I don't think there's anything obvious here, based on the hours I spent investigating it (surely I may have missed something, no doubt). The most likely theory (nested loop) looks hard to confirm and without a way to reproduce the crash (even by chance on Try!).
Blocks: 1412647
https://hg.mozilla.org/mozilla-central/rev/94f1fb1043a8
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
Comment on attachment 8935373 [details] [diff] [review]
patch v1

Approval Request Comment
[Feature/Bug causing the regression]: unknown
[User impact if declined]: UAF crash
[Is this code covered by automated tests?]: no
[Has the fix been verified in Nightly?]: not yet
[Needs manual test from QE? If yes, steps to reproduce]: no known steps
[List of other uplifts needed for the feature/fix]: none
[Is the change risky?]: no
[Why is the change risky/not risky?]: it's a simple stop-gap fix
[String changes made/needed]: none
Attachment #8935373 - Flags: approval-mozilla-beta?
Comment on attachment 8935373 [details] [diff] [review]
patch v1

sec-high, stop-gap fix for beta58

let's try this for 58.0b11
Attachment #8935373 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
https://hg.mozilla.org/releases/mozilla-beta/rev/ff2ac648087a

Does ESR52 have the same underlying defect (even if it's not actively crashing)?
Flags: needinfo?(mak77)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #25)
> https://hg.mozilla.org/releases/mozilla-beta/rev/ff2ac648087a
> 
> Does ESR52 have the same underlying defect (even if it's not actively
> crashing)?

the code is the same, the cause (nested loop or event loop spinning in a notification) may be missing.
Flags: needinfo?(mak77)
Comment on attachment 8935373 [details] [diff] [review]
patch v1

[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration: it's a safe stop-gap for a possible UAF
User impact if declined: possible crash
Fix Landed on Version: 58, 59
Risk to taking this patch (and alternatives if risky): none, it's quite simple
String or UUID changes made by this patch: none

See https://wiki.mozilla.org/Release_Management/ESR_Landing_Process for more info.
Attachment #8935373 - Flags: approval-mozilla-esr52?
Group: toolkit-core-security → core-security-release
Hi Al, Dan, do we need a sec-approval for this one? I'll do A+ for ESR52 as I am assuming this is something we need for 58 and ESR52.6
Flags: needinfo?(dveditz)
Flags: needinfo?(abillings)
Comment on attachment 8935373 [details] [diff] [review]
patch v1

Sec-high, ESR52+
Attachment #8935373 - Flags: approval-mozilla-esr52? → approval-mozilla-esr52+
(In reply to Ritu Kothari (:ritu) from comment #28)
> Hi Al, Dan, do we need a sec-approval for this one? I'll do A+ for ESR52 as
> I am assuming this is something we need for 58 and ESR52.6

Yes, this should have had sec-approval? and the template questions answered before it was checked in at all. We need this set and the questions answered.
Flags: needinfo?(abillings) → needinfo?(mak77)
Flags: needinfo?(dveditz)
Comment on attachment 8935373 [details] [diff] [review]
patch v1

[Security approval request comment]
How easily could an exploit be constructed based on the patch?
unlikely, the patch is a complete stop gap

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?
it surely makes clear where we crashed, but there are no known ways to cause the crash and the stop gap fix makes that not possible

Which older supported branches are affected by this flaw?
Likely all of them

If not all supported branches, which bug introduced the flaw?
unknown

Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?
The patch is trivial enough and not risky.

How likely is this patch to cause regressions; how much testing does it need?
Unlikely to cause regressions.
Flags: needinfo?(mak77)
Attachment #8935373 - Flags: sec-approval?
Sorry for not setting sec-approval, I wasn't well informed about its usage.
Attachment #8935373 - Flags: sec-approval? → sec-approval+
Whiteboard: [fxsearch] → [fxsearch][adv-main58+][adv-esr52.6+]
Flags: qe-verify-
Whiteboard: [fxsearch][adv-main58+][adv-esr52.6+] → [fxsearch][adv-main58+][adv-esr52.6+][post-critsmash-triage]
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: