Closed Bug 1438947 Opened 2 years ago Closed 2 years ago

Crash in mozilla::net::TRR::Cancel

Categories

(Core :: Networking: DNS, defect, P1, critical)

Unspecified
Windows 10
defect

Tracking

()

RESOLVED FIXED
mozilla60
Tracking Status
firefox-esr52 --- unaffected
firefox58 --- unaffected
firefox59 --- unaffected
firefox60 --- fixed

People

(Reporter: calixte, Assigned: bagder)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression, Whiteboard: [necko-triaged][trr])

Crash Data

Attachments

(2 files, 1 obsolete file)

This bug was filed from the Socorro interface and is
report bp-d894c6f9-e5ef-4bb2-b529-c90f60180216.
=============================================================

Top 10 frames of crashing thread:

0 xul.dll mozilla::net::TRR::Cancel netwerk/dns/TRR.cpp:1036
1 xul.dll nsHostRecord::Cancel netwerk/dns/nsHostResolver.cpp:230
2 xul.dll nsHostResolver::ClearPendingQueue netwerk/dns/nsHostResolver.cpp:599
3 xul.dll nsHostResolver::Shutdown netwerk/dns/nsHostResolver.cpp:684
4 xul.dll nsDNSService::Shutdown netwerk/dns/nsDNSService2.cpp:676
5 xul.dll mozilla::net::nsIOService::SetOffline netwerk/base/nsIOService.cpp:1173
6 xul.dll mozilla::net::nsIOService::Observe netwerk/base/nsIOService.cpp:1488
7 xul.dll nsObserverList::NotifyObservers xpcom/ds/nsObserverList.cpp:112
8 xul.dll nsObserverService::NotifyObservers xpcom/ds/nsObserverService.cpp:296
9 xul.dll nsXREDirProvider::DoShutdown toolkit/xre/nsXREDirProvider.cpp:1095

=============================================================

There is 1 crash in nightly 60 with buildid 20180216104033. In analyzing the backtrace, the regression may have been introduced by patch [1] to fix bug 1434852.

[1] https://hg.mozilla.org/mozilla-central/rev?node=6776d69d2f03
Flags: needinfo?(daniel)
Clearly TRR related.
Assignee: nobody → daniel
Flags: needinfo?(daniel)
Priority: -- → P1
Whiteboard: [necko-triaged]
mTimeout is of type nsCOMPtr<nsITimer>
So why do we have to initialize it to nullptr? I would expect the default constructor to do that already.
Flags: needinfo?(daniel)
Ehm, right. I just checked TRR.cpp:1036 and figured it looked exactly like an uninitialized mTimeout in there... So its something else!
Flags: needinfo?(daniel)
Attachment #8951786 - Flags: review?(valentin.gosu)
It seems to me that it might already be nullptr, and maybe there's a race with it being null-ed on another thread.
That might be it, yes. I'll go the other way instead and avoid some of the early cancel+clears since they're done in the destructor anyway. Another take is coming.
Hm, I bet it isn't even necessary to Cancel it in the destructor!
(In reply to Daniel Stenberg [:bagder] from comment #7)
> Hm, I bet it isn't even necessary to Cancel it in the destructor!

No. The destructor for nsTimer does that anyway.
Comment on attachment 8951786 [details]
bug 1438947 - avoid unnecessary timer cancels + clears

https://reviewboard.mozilla.org/r/221054/#review226968
Attachment #8951786 - Flags: review?(valentin.gosu) → review+
Pushed by daniel@haxx.se:
https://hg.mozilla.org/integration/autoland/rev/09cda613b1c4
avoid unnecessary timer cancels + clears r=valentin
https://hg.mozilla.org/mozilla-central/rev/09cda613b1c4
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla60
There still are 12 crashes in nightly 60 starting with buildid 20180217100053 so after the patch landed.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
In TRR::Cancel?
Yes, see http://bit.ly/2HuBoKS to get the crash reports.
The crash reports indicate that there are at least two or three different ways that lead into the TRR::Cancel crash.

1 - (example https://crash-stats.mozilla.com/report/index/4d0fc238-85ed-444f-99a2-601e90180219) Via nsHostResolver::FlushCache that gets called from nsDNSService::Observe() from mozilla::net::nsIOService::NotifyWakeup. This is a machine waking up and it gets a network-change event and flushes the DNS cache.

2 - (example https://crash-stats.mozilla.com/report/index/5a160053-88e8-4416-80ee-de0890180219) Via nsHostRecord::Cancel() and nsHostResolver::Shutdown(). This is the browser shutting down.

3 - A variation of the first one (example https://crash-stats.mozilla.com/report/index/7e58d309-5468-4704-a008-f21760180219) Via nsHostRecord::Cancel and mozilla::net::nsIOService::SetOffline. This is the browser going offline and as a part of that, it flushes the DNS cache.

I can't see how any of them can be reached without TRR being activated, which is good since TRR is still preffed off by default.
I believe this might be a race when accessing/changing the mTrrA and mTrrAAAA in the nsHostRecord objects.

Partly this happens because the code currently cancels TRR resolves from nsHostRecord::RemoveOrRefresh, which I don't think it should since we don't need to refresh TRR name resolves due to network changes.

I'm testing out a fix for this locally which I'll submit for review as soon as I feel confident.
(reviewboard refuses me to push another patch to the same bug that already had a patch previously so I'm doing it the old style as a work-around)

This patch puts a mutex around all nsHostRecord::mTrrA[AAA] accesses since they might be accessed and updated from different threads.
    
... and remove Cancel() from nsHostRecord::RemoveOrRefresh since we don't need to cancel TRR resolves due to network changes.
Attachment #8952965 - Flags: review?(valentin.gosu)
Whiteboard: [necko-triaged] → [necko-triaged][trr]
Comment on attachment 8952965 [details] [diff] [review]
0001-bug-1438947-mutex-nsHostRecord-mTrrA-AAA-accesses-r-.patch

Review of attachment 8952965 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/dns/nsHostResolver.cpp
@@ +1076,5 @@
>      RefPtr<nsHostRecord> rec(aRec);
>      mLock.AssertCurrentThreadOwns();
> +    {
> +        MutexAutoLock trrlock(rec->mTrr_lock);
> +        MOZ_ASSERT(!TRROutstanding());

Let's wrap this block in a #ifdef DEBUG, since we're only locking for this assert

::: netwerk/dns/nsHostResolver.h
@@ +189,5 @@
>      enum {
>          INIT, STARTED, OK, FAILED
>      } mTrrAUsed, mTrrAAAAUsed;
>  
> +    Mutex mTrr_lock; // lock when accessing the mTrrA[AAA] pointers

Let's call it mTrrLock or mTRRLock.
Attachment #8952965 - Flags: review?(valentin.gosu) → review+
Keywords: checkin-needed
Pushed by archaeopteryx@coole-files.de:
https://hg.mozilla.org/integration/mozilla-inbound/rev/11edb4e3cdec
mutex nsHostRecord::mTrrA[AAA] accesses r=valentin
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/11edb4e3cdec
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.