Closed Bug 1438947 Opened 5 years ago Closed 5 years ago

Crash in mozilla::net::TRR::Cancel


(Core :: Networking: DNS, defect, P1)

Windows 10



Tracking Status
firefox-esr52 --- unaffected
firefox58 --- unaffected
firefox59 --- unaffected
firefox60 --- fixed


(Reporter: calixte, Assigned: bagder)


(Blocks 1 open bug)


(Keywords: crash, regression, Whiteboard: [necko-triaged][trr])

Crash Data


(2 files, 1 obsolete file)

This bug was filed from the Socorro interface and is
report bp-d894c6f9-e5ef-4bb2-b529-c90f60180216.

Top 10 frames of crashing thread:

0 xul.dll mozilla::net::TRR::Cancel netwerk/dns/TRR.cpp:1036
1 xul.dll nsHostRecord::Cancel netwerk/dns/nsHostResolver.cpp:230
2 xul.dll nsHostResolver::ClearPendingQueue netwerk/dns/nsHostResolver.cpp:599
3 xul.dll nsHostResolver::Shutdown netwerk/dns/nsHostResolver.cpp:684
4 xul.dll nsDNSService::Shutdown netwerk/dns/nsDNSService2.cpp:676
5 xul.dll mozilla::net::nsIOService::SetOffline netwerk/base/nsIOService.cpp:1173
6 xul.dll mozilla::net::nsIOService::Observe netwerk/base/nsIOService.cpp:1488
7 xul.dll nsObserverList::NotifyObservers xpcom/ds/nsObserverList.cpp:112
8 xul.dll nsObserverService::NotifyObservers xpcom/ds/nsObserverService.cpp:296
9 xul.dll nsXREDirProvider::DoShutdown toolkit/xre/nsXREDirProvider.cpp:1095


There is 1 crash in nightly 60 with buildid 20180216104033. In analyzing the backtrace, the regression may have been introduced by patch [1] to fix bug 1434852.

Flags: needinfo?(daniel)
Clearly TRR related.
Assignee: nobody → daniel
Flags: needinfo?(daniel)
Priority: -- → P1
Whiteboard: [necko-triaged]
mTimeout is of type nsCOMPtr<nsITimer>
So why do we have to initialize it to nullptr? I would expect the default constructor to do that already.
Flags: needinfo?(daniel)
Ehm, right. I just checked TRR.cpp:1036 and figured it looked exactly like an uninitialized mTimeout in there... So its something else!
Flags: needinfo?(daniel)
Attachment #8951786 - Flags: review?(valentin.gosu)
It seems to me that it might already be nullptr, and maybe there's a race with it being null-ed on another thread.
That might be it, yes. I'll go the other way instead and avoid some of the early cancel+clears since they're done in the destructor anyway. Another take is coming.
Hm, I bet it isn't even necessary to Cancel it in the destructor!
(In reply to Daniel Stenberg [:bagder] from comment #7)
> Hm, I bet it isn't even necessary to Cancel it in the destructor!

No. The destructor for nsTimer does that anyway.
Comment on attachment 8951786 [details]
bug 1438947 - avoid unnecessary timer cancels + clears
Attachment #8951786 - Flags: review?(valentin.gosu) → review+
Pushed by
avoid unnecessary timer cancels + clears r=valentin
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla60
There still are 12 crashes in nightly 60 starting with buildid 20180217100053 so after the patch landed.
Resolution: FIXED → ---
In TRR::Cancel?
Yes, see to get the crash reports.
The crash reports indicate that there are at least two or three different ways that lead into the TRR::Cancel crash.

1 - (example Via nsHostResolver::FlushCache that gets called from nsDNSService::Observe() from mozilla::net::nsIOService::NotifyWakeup. This is a machine waking up and it gets a network-change event and flushes the DNS cache.

2 - (example Via nsHostRecord::Cancel() and nsHostResolver::Shutdown(). This is the browser shutting down.

3 - A variation of the first one (example Via nsHostRecord::Cancel and mozilla::net::nsIOService::SetOffline. This is the browser going offline and as a part of that, it flushes the DNS cache.

I can't see how any of them can be reached without TRR being activated, which is good since TRR is still preffed off by default.
I believe this might be a race when accessing/changing the mTrrA and mTrrAAAA in the nsHostRecord objects.

Partly this happens because the code currently cancels TRR resolves from nsHostRecord::RemoveOrRefresh, which I don't think it should since we don't need to refresh TRR name resolves due to network changes.

I'm testing out a fix for this locally which I'll submit for review as soon as I feel confident.
(reviewboard refuses me to push another patch to the same bug that already had a patch previously so I'm doing it the old style as a work-around)

This patch puts a mutex around all nsHostRecord::mTrrA[AAA] accesses since they might be accessed and updated from different threads.
... and remove Cancel() from nsHostRecord::RemoveOrRefresh since we don't need to cancel TRR resolves due to network changes.
Attachment #8952965 - Flags: review?(valentin.gosu)
Whiteboard: [necko-triaged] → [necko-triaged][trr]
Comment on attachment 8952965 [details] [diff] [review]

Review of attachment 8952965 [details] [diff] [review]:

::: netwerk/dns/nsHostResolver.cpp
@@ +1076,5 @@
>      RefPtr<nsHostRecord> rec(aRec);
>      mLock.AssertCurrentThreadOwns();
> +    {
> +        MutexAutoLock trrlock(rec->mTrr_lock);
> +        MOZ_ASSERT(!TRROutstanding());

Let's wrap this block in a #ifdef DEBUG, since we're only locking for this assert

::: netwerk/dns/nsHostResolver.h
@@ +189,5 @@
>      enum {
>      } mTrrAUsed, mTrrAAAAUsed;
> +    Mutex mTrr_lock; // lock when accessing the mTrrA[AAA] pointers

Let's call it mTrrLock or mTRRLock.
Attachment #8952965 - Flags: review?(valentin.gosu) → review+
Keywords: checkin-needed
Pushed by
mutex nsHostRecord::mTrrA[AAA] accesses r=valentin
Keywords: checkin-needed
Closed: 5 years ago5 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.