Crash in nsDNSRecord::GetNextAddr

VERIFIED FIXED in Firefox 59

Status

Priority: P1
Severity: critical
Status: VERIFIED FIXED
Reported: a year ago
Last modified: a year ago

People

(Reporter: valentin, Assigned: valentin)

Tracking

({crash})

Version: Trunk
Target Milestone: mozilla59
Platform: Unspecified
OS: Windows 7
Points: ---

Firefox Tracking Flags

(firefox-esr52 unaffected, firefox57 unaffected, firefox58 unaffected, firefox59 verified)

Details

(Whiteboard: [necko-triaged], crash signature)

Attachments

(2 attachments)

(Assignee)

Description

a year ago
This bug was filed from the Socorro interface and is
report bp-3ffb699b-8630-466b-9fc5-8c7d90171130.
=============================================================

Top 10 frames of crashing thread:

0 xul.dll nsDNSRecord::GetNextAddr netwerk/dns/nsDNSService2.cpp:128
1 xul.dll mozilla::net::nsSocketTransport::RecoverFromError netwerk/base/nsSocketTransport2.cpp:1793
2 xul.dll mozilla::net::nsSocketTransport::OnSocketDetached netwerk/base/nsSocketTransport2.cpp:2332
3 xul.dll mozilla::net::nsSocketTransportService::DetachSocket netwerk/base/nsSocketTransportService2.cpp:259
4 xul.dll mozilla::net::nsSocketTransportService::DoPollIteration netwerk/base/nsSocketTransportService2.cpp:1190
5 xul.dll mozilla::net::nsSocketTransportService::Run netwerk/base/nsSocketTransportService2.cpp:921
6 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1033
7 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:508
8 xul.dll mozilla::ipc::MessagePumpForNonMainThreads::Run ipc/glue/MessagePump.cpp:334
9 xul.dll MessageLoop::RunHandler ipc/chromium/src/base/message_loop.cc:319

=============================================================

From what I can tell, this occurs when nsHostRecord::~nsHostRecord() is called on one thread, and GetNextAddr on another.
We can confirm by adding a lock to ~nsHostRecord when deleting addr_info.
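A minimal sketch of that diagnostic, assuming the record's existing addr_info_lock also guards teardown (the member names here follow the discussion in this bug, not the exact tree):

#include "mozilla/Mutex.h"

nsHostRecord::~nsHostRecord()
{
  // Hypothetical diagnostic: serialize destruction of addr_info against
  // readers like nsDNSRecord::GetNextAddr, which take addr_info_lock.
  mozilla::MutexAutoLock lock(addr_info_lock);
  delete addr_info;
  addr_info = nullptr;
}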

Comment 1

a year ago
What's the user impact? Are we clear to fix forward, or should we undo our bug 1420677 patch?

Comment 2

a year ago
Also, does this mean that we revealed a use-after-free bug? This code is full of surprises :P
Comment 3

a year ago
(In reply to Valentin Gosu [:valentin] from comment #0)
> From what I can tell, this occurs when nsHostRecord::~nsHostRecord() is
> called on one thread, and GetNextAddr on another.
> We can confirm by adding a lock to ~nsHostRecord when deleting addr_info.

Yes, because nsHostRecord is not refcounted!
(Assignee)

Comment 4

a year ago
Actually, nsHostRecord does have NS_INLINE_DECL_THREADSAFE_REFCOUNTING and is used with a RefPtr.
But I think I found the issue. In nsHostResolver::ResolveHost we don't get a lock before nulling/changing addr_info.
155 crashes / 109 installs so far. Some of the crash addresses look like use-after-free poison values, so I'm marking this as security sensitive.
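To illustrate the missing locking described above (a sketch only, not the attached patch; rec and the member names are assumptions carried over from this thread):

// Racy (what ResolveHost effectively did): mutate with no lock held.
//   rec->addr_info = newAddrInfo;

// Sketch of the fix: hold the record's lock across the swap so the
// socket thread's GetNextAddr never sees a half-updated pointer.
{
  mozilla::MutexAutoLock lock(rec->addr_info_lock);
  mozilla::net::AddrInfo* old = rec->addr_info;
  rec->addr_info = newAddrInfo;
  delete old;  // the old AddrInfo is freed while the lock is held
}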
Group: network-core-security
(Assignee)

Comment 6

a year ago
MozReview-Commit-ID: A6Qk6G3Y1SY
Attachment #8933627 - Flags: review?(honzab.moz)
Seems like a totally reasonable and likely explanation!
Comment on attachment 8933627 [details] [diff] [review]
Use lock when changing nsHostRecord.addr_info

Review of attachment 8933627 [details] [diff] [review]:
-----------------------------------------------------------------

OK, but there are at least 3 other places where we access that member without the lock being held (e.g. branching on its presence outside the lock and then accessing it under the lock without rechecking, or an unlocked check+dereference). We should probably audit this code in another bug. Privatizing addr_info (and any other members?) on nsHostRecord and adding an inline accessor that asserts lock ownership on the current thread would be a wise thing to do.

Note that we already do keep nsHostResolver mLock and nsHostRecord addr_info_lock in a cascade.
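A sketch of the accessor pattern suggested above (illustrative names, not the landed patch; both common Mozilla variants shown):

#include "mozilla/Mutex.h"

class nsHostRecord
{
public:
  // Variant 1: assert lock ownership at runtime (debug builds).
  mozilla::net::AddrInfo* GetAddrInfo() const
  {
    addr_info_lock.AssertCurrentThreadOwns();
    return addr_info;
  }

  // Variant 2: require proof of the lock at compile time by making the
  // caller pass the live auto-lock (the "useless argument" pattern).
  mozilla::net::AddrInfo* GetAddrInfo(const mozilla::MutexAutoLock&) const
  {
    return addr_info;
  }

private:
  mutable mozilla::Mutex addr_info_lock{"nsHostRecord.addr_info_lock"};
  mozilla::net::AddrInfo* addr_info = nullptr;
};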
Attachment #8933627 - Flags: review?(honzab.moz) → review+

Comment 10

a year ago
> we already do keep nsHostResolver mLock and nsHostRecord addr_info_lock in a cascade

What is a cascade? Googling "multithreaded cascade" gives me nothing.
Flags: needinfo?(honzab.moz)
(In reply to jthemphill from comment #10)
> > we already do keep nsHostResolver mLock and nsHostRecord addr_info_lock in a cascade
> 
> What is a cascade? Googling "multithreaded cascade" gives me nothing.

Keeping both, always nested in the same order:

{
  mozilla::MutexAutoLock lock(outer);    // take the outer lock first
  {
    mozilla::MutexAutoLock lock2(inner); // then the inner lock
    // do something...
  }
}

If you take them in the reverse order on some other thread:

{
  mozilla::MutexAutoLock lock(inner);    // inner first...
  {
    mozilla::MutexAutoLock lock2(outer); // ...then outer: opposite order
    // do something...
  }
}

you deadlock.
Flags: needinfo?(honzab.moz)
status-firefox57: --- → unaffected
status-firefox58: --- → unaffected
status-firefox-esr52: --- → unaffected
https://hg.mozilla.org/mozilla-central/rev/253690f0ab21
Status: NEW → RESOLVED
Last Resolved: a year ago
status-firefox59: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
There are still 177 crashes in nightly 59, starting with build ID 20171202100103.
Status: RESOLVED → REOPENED
status-firefox59: fixed → affected
Resolution: FIXED → ---
(Assignee)

Comment 14

a year ago
I really expected the patch to fix it.
At this point, there are a few possibilities:
 1. This is a race where the socket thread is calling GetNextAddr while some other thread nulls/changes a smart pointer, leading to the destruction of the objects holding the nsDNSRecord and nsHostRecord.
 2. There is an unpaired NS_RELEASE for nsHostRecord that leads to a double free (sketched below).
 3. [unlikely] Some add-on or DLL is doing something bad.

I will try one more fix; if that doesn't work, we probably should back out all the patches to avoid alienating users.
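For possibility 2, a contrived illustration of how an unpaired NS_RELEASE turns into a double free (hypothetical code, not taken from the tree; GetRecordSomehow is a stand-in):

RefPtr<nsHostRecord> rec = GetRecordSomehow();  // refcount == 1
nsHostRecord* raw = rec.get();
NS_RELEASE(raw);  // unpaired release: refcount drops to 0, object freed
// rec now points at freed memory; when it goes out of scope its
// destructor calls Release() again -> double free / use-after-free.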

Comment 15

a year ago
There's still a data race; I have a (somewhat large) patch in the works to harden the code with an RWLock. I'll post the patch when I get home, and if you want to push your own more conservative fix, that's fine.
(Assignee)

Comment 16

a year ago
Did you find the data race? I'm quite curious to see what that is.

Comment 17

a year ago
Attachment #8934043 - Flags: review?(valentin.gosu)
Attachment #8934043 - Flags: review?(honzab.moz)

Comment 18

a year ago
Yeah, so there are a few places where we look at addr_info but don't grab the lock. The short answer is that nsHostResolver::HasUsableResult needs a lock but doesn't look like it, which is why most of these spots are missing one. This patch converts addr_info into a private member and uses public getters which assert the lock, per Honza's suggestion. Note the places in this patch where I didn't remove an existing lock: those are the ones you can fix in a more conservative patch. Eventually I want to put an RWLock around the entire nsHostResolver struct.

For https://hg.mozilla.org/mozilla-central/file/709f355a7a8c/netwerk/dns/nsHostResolver.cpp#l860 , the only options were to add a helper function or another nested if. I detest heavily-nested if blocks, so my patch opts for the helper function.

I don't like grabbing locks multiple times within a function. I'd probably recommend redesigning this code so that we apply an RWLock to the entire nsHostRecord and grab the lock as few times as possible. I think the old code predates RWLock: addr_info was the only mutable data member, and the only tool the original authors had was a full-fledged Mutex, so they put this heavyweight lock on addr_info to emulate RWLock semantics.

I don't know much about writing multithreaded code. If this were Rust, I'd just put the resource inside an RWLock object, but that doesn't seem possible with the C++ XPCOM API. Is there an accepted pattern for locking resources? Is my getter-with-a-useless-argument pattern okay?
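For reference, the closest standard C++ analogue to Rust's RwLock is std::shared_mutex (C++17). A minimal sketch of what reader/writer access to addr_info could look like (not the attached patch; note this only protects the pointer swap itself, not the pointee's lifetime):

#include <shared_mutex>

class AddrInfo;  // stand-in for mozilla::net::AddrInfo

class HostRecordSketch
{
public:
  // Many readers may hold the lock concurrently.
  AddrInfo* GetAddrInfo() const
  {
    std::shared_lock<std::shared_mutex> lock(mLock);
    return mAddrInfo;  // caller must still keep the pointee alive
  }

  // A writer takes exclusive ownership, blocking all readers.
  void SetAddrInfo(AddrInfo* aNew)
  {
    std::unique_lock<std::shared_mutex> lock(mLock);
    mAddrInfo = aNew;
  }

private:
  mutable std::shared_mutex mLock;
  AddrInfo* mAddrInfo = nullptr;
};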

Try run here: https://treeherder.mozilla.org/#/jobs?repo=try&revision=d8df0c521d1c14e7ed325c0ff306a520b8cee85f

Updated

a year ago
Attachment #8934043 - Flags: review?(valentin.gosu)
Attachment #8934043 - Flags: review?(honzab.moz)

Comment 20

a year ago
Cool. I'll file another bug for the race condition, and polish my patch up a bit more.
Group: network-core-security → core-security-release
No more crashes in nightly 59.
Status: RESOLVED → VERIFIED
status-firefox59: fixed → verified
Group: core-security-release