Closed Bug 1422173 Opened 7 years ago Closed 7 years ago

Crash in nsDNSRecord::GetNextAddr

Categories

(Core :: Networking: DNS, defect, P1)

Unspecified
Windows 7
defect

Tracking

()

VERIFIED FIXED
mozilla59
Tracking Status
firefox-esr52 --- unaffected
firefox57 --- unaffected
firefox58 --- unaffected
firefox59 --- verified

People

(Reporter: valentin, Assigned: valentin)

References

Details

(Keywords: crash, Whiteboard: [necko-triaged])

Crash Data

Attachments

(2 files)

This bug was filed from the Socorro interface and is
report bp-3ffb699b-8630-466b-9fc5-8c7d90171130.
=============================================================

Top 10 frames of crashing thread:

0 xul.dll nsDNSRecord::GetNextAddr netwerk/dns/nsDNSService2.cpp:128
1 xul.dll mozilla::net::nsSocketTransport::RecoverFromError netwerk/base/nsSocketTransport2.cpp:1793
2 xul.dll mozilla::net::nsSocketTransport::OnSocketDetached netwerk/base/nsSocketTransport2.cpp:2332
3 xul.dll mozilla::net::nsSocketTransportService::DetachSocket netwerk/base/nsSocketTransportService2.cpp:259
4 xul.dll mozilla::net::nsSocketTransportService::DoPollIteration netwerk/base/nsSocketTransportService2.cpp:1190
5 xul.dll mozilla::net::nsSocketTransportService::Run netwerk/base/nsSocketTransportService2.cpp:921
6 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1033
7 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:508
8 xul.dll mozilla::ipc::MessagePumpForNonMainThreads::Run ipc/glue/MessagePump.cpp:334
9 xul.dll MessageLoop::RunHandler ipc/chromium/src/base/message_loop.cc:319

=============================================================

From what I can tell, this occurs when nsHostRecord::~nsHostRecord() is called on one thread, and GetNextAddr on another.
We can confirm by adding a lock to ~nsHostRecord when deleting addr_info.
What's the user impact? Are we clear to fix forward, or should we undo our bug 1420677 patch?
Also, does this mean that we revealed a use-after-free bug? This code is full of surprises :P
(In reply to Valentin Gosu [:valentin] from comment #0)
> From what I can tell, this occurs when nsHostRecord::~nsHostRecord() is
> called on one thread, and GetNextAddr on another.
> We can confirm by adding a lock to ~nsHostRecord when deleting addr_info.

Yes, because nsHostRecord is not refcounted!
Actually, nsHostRecord does have NS_INLINE_DECL_THREADSAFE_REFCOUNTING and is used with a RefPtr.
But I think I found the issue. In nsHostResolver::ResolveHost we don't get a lock before nulling/changing addr_info.
155 crashes/109 installs so far. There are some UAF addresses so marking this as security sensitive.
Group: network-core-security
MozReview-Commit-ID: A6Qk6G3Y1SY
Attachment #8933627 - Flags: review?(honzab.moz)
Seems like a totally reasonable and likely explanation!
Comment on attachment 8933627 [details] [diff] [review]
Use lock when changing nsHostRecord.addr_info

Review of attachment 8933627 [details] [diff] [review]:
-----------------------------------------------------------------

OK, but there are at least 3 other places we access that member w/o the lock being held (like branch on presence out the lock, later access under the lock w/o rechecking; unlocked check+dereference).  We should probably audit this code in another bug.  privatizing addr_info (any other members?) on nsHostRecord and have an inline accessor asserting the lock thread ownership would be a wise thing to do.

Note that we already do keep nsHostResolver mLock and nsHostRecord addr_info_lock in a cascade.
Attachment #8933627 - Flags: review?(honzab.moz) → review+
> we already do keep nsHostResolver mLock and nsHostRecord addr_info_lock in a cascade

What is a cascade? Googling "multithreaded cascade" gives me nothing.
Flags: needinfo?(honzab.moz)
(In reply to jthemphill from comment #10)
> > we already do keep nsHostResolver mLock and nsHostRecord addr_info_lock in a cascade
> 
> What is a cascade? Googling "multithreaded cascade" gives me nothing.

Keeping both:

{ 
  AutoLock lock(outer);
  {
     AutoLock lock2(inner);
     do something...
  }
}

If you do it in reverse order on some other thread:

{ 
  AutoLock lock(inner);
  {
     AutoLock lock2(outer);
     do something...
  }
}

you deadlock.
Flags: needinfo?(honzab.moz)
https://hg.mozilla.org/mozilla-central/rev/253690f0ab21
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
There still are 177 crashes in nightly 59 starting with buildid 20171202100103.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I really expected the patch would have fixed it.
At this point, there are two possibilities:
 1. this is a race where the socket thread is calling GetNextAddr and some other thread is nulling/changing a smart pointer that leads to the destruction of the objects holding the nsDNSRecord and nsHostRecord.
 2. There is an unpaired NS_RELEASE for nsHostRecord that is leading to a double free.
 3. [unlikely] some addon or DLL is doing some bad things

I will try one more fix; if that doesn't work, we probably should back out all the patches to avoid alienating users.
There’s still a data race; I have a (somewhat large) patch in the works to harden the code with an RWLock. I’ll postthe patch when I get home, and if you want to push your own more conservative fix that’s fine.
Did you find the data race? I'm quite curious to see what that is.
Attached patch sample-fix.patchSplinter Review
Attachment #8934043 - Flags: review?(valentin.gosu)
Attachment #8934043 - Flags: review?(honzab.moz)
Yeah, so there are a few places where we look at addr_info but don't grab the lock. Look at the places in this patch where I didn't remove an existing lock - those are the places you can fix in a more conservative patch. There's a pretty I want to RWLock the entire nsHostResolver struct  The short answer is that nsHostResolver::HasUsableResult needs a lock, but doesn't look like it. This patch converts addr_info into a private member and uses public getters which assert the lock, per Honza's suggestion. Note the places where I didn't remove a lock; I think all of them are missing because HasUsableResult doesn't look like it needs a lock.

For https://hg.mozilla.org/mozilla-central/file/709f355a7a8c/netwerk/dns/nsHostResolver.cpp#l860 , the only options were to add a helper functions or add another nested if. I detest heavily-nested if blocks, so my patch opts for the helper function.

I don't like grabbing locks multiple times within a function. I'd probably recommend redesigning this code so that we apply RWLock to the entire nsHostRecord, and grab the lock as few times as possible. I think the old code existed before RWLock - addr_info was the only mutable data member, and the only tool the original authors had was a full-fledged Mutex. So they put this heavyweight lock on addr_info to emulate RWLock semantics.

I don't know much about writing multithreaded code. If this were Rust, I'd just put the resource inside an RWLock object, but that doesn't seem possible with the C++ XPCOM API. Is there an accepted pattern for locking resources? Is my getter-with-a-useless-argument pattern okay?

Try run here: https://treeherder.mozilla.org/#/jobs?repo=try&revision=d8df0c521d1c14e7ed325c0ff306a520b8cee85f
Attachment #8934043 - Flags: review?(valentin.gosu)
Attachment #8934043 - Flags: review?(honzab.moz)
Cool. I'll file another bug for the race condition, and polish my patch up a bit more.
Group: network-core-security → core-security-release
No more crashes in nightly 59.
Status: RESOLVED → VERIFIED
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: