Closed Bug 1438947 Opened 7 years ago Closed 7 years ago

Crash in mozilla::net::TRR::Cancel

Categories

(Core :: Networking: DNS, defect, P1)

Unspecified
Windows 10
defect

Tracking

()

RESOLVED FIXED
mozilla60
Tracking Status
firefox-esr52 --- unaffected
firefox58 --- unaffected
firefox59 --- unaffected
firefox60 --- fixed

People

(Reporter: calixte, Assigned: bagder)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression, Whiteboard: [necko-triaged][trr])

Crash Data

Attachments

(2 files, 1 obsolete file)

This bug was filed from the Socorro interface and is report bp-d894c6f9-e5ef-4bb2-b529-c90f60180216. ============================================================= Top 10 frames of crashing thread: 0 xul.dll mozilla::net::TRR::Cancel netwerk/dns/TRR.cpp:1036 1 xul.dll nsHostRecord::Cancel netwerk/dns/nsHostResolver.cpp:230 2 xul.dll nsHostResolver::ClearPendingQueue netwerk/dns/nsHostResolver.cpp:599 3 xul.dll nsHostResolver::Shutdown netwerk/dns/nsHostResolver.cpp:684 4 xul.dll nsDNSService::Shutdown netwerk/dns/nsDNSService2.cpp:676 5 xul.dll mozilla::net::nsIOService::SetOffline netwerk/base/nsIOService.cpp:1173 6 xul.dll mozilla::net::nsIOService::Observe netwerk/base/nsIOService.cpp:1488 7 xul.dll nsObserverList::NotifyObservers xpcom/ds/nsObserverList.cpp:112 8 xul.dll nsObserverService::NotifyObservers xpcom/ds/nsObserverService.cpp:296 9 xul.dll nsXREDirProvider::DoShutdown toolkit/xre/nsXREDirProvider.cpp:1095 ============================================================= There is 1 crash in nightly 60 with buildid 20180216104033. In analyzing the backtrace, the regression may have been introduced by patch [1] to fix bug 1434852. [1] https://hg.mozilla.org/mozilla-central/rev?node=6776d69d2f03
Flags: needinfo?(daniel)
Clearly TRR related.
Assignee: nobody → daniel
Flags: needinfo?(daniel)
Priority: -- → P1
Whiteboard: [necko-triaged]
mTimeout is of type nsCOMPtr<nsITimer> So why do we have to initialize it to nullptr? I would expect the default constructor to do that already.
Flags: needinfo?(daniel)
Ehm, right. I just checked TRR.cpp:1036 and figured it looked exactly like an uninitialized mTimeout in there... So its something else!
Flags: needinfo?(daniel)
Attachment #8951786 - Flags: review?(valentin.gosu)
It seems to me that it might already be nullptr, and maybe there's a race with it being null-ed on another thread.
That might be it, yes. I'll go the other way instead and avoid some of the early cancel+clears since they're done in the destructor anyway. Another take is coming.
Hm, I bet it isn't even necessary to Cancel it in the destructor!
(In reply to Daniel Stenberg [:bagder] from comment #7) > Hm, I bet it isn't even necessary to Cancel it in the destructor! No. The destructor for nsTimer does that anyway.
Attachment #8951786 - Flags: review?(valentin.gosu) → review+
Pushed by daniel@haxx.se: https://hg.mozilla.org/integration/autoland/rev/09cda613b1c4 avoid unnecessary timer cancels + clears r=valentin
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla60
There still are 12 crashes in nightly 60 starting with buildid 20180217100053 so after the patch landed.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
In TRR::Cancel?
Yes, see http://bit.ly/2HuBoKS to get the crash reports.
The crash reports indicate that there are at least two or three different ways that lead into the TRR::Cancel crash. 1 - (example https://crash-stats.mozilla.com/report/index/4d0fc238-85ed-444f-99a2-601e90180219) Via nsHostResolver::FlushCache that gets called from nsDNSService::Observe() from mozilla::net::nsIOService::NotifyWakeup. This is a machine waking up and it gets a network-change event and flushes the DNS cache. 2 - (example https://crash-stats.mozilla.com/report/index/5a160053-88e8-4416-80ee-de0890180219) Via nsHostRecord::Cancel() and nsHostResolver::Shutdown(). This is the browser shutting down. 3 - A variation of the first one (example https://crash-stats.mozilla.com/report/index/7e58d309-5468-4704-a008-f21760180219) Via nsHostRecord::Cancel and mozilla::net::nsIOService::SetOffline. This is the browser going offline and as a part of that, it flushes the DNS cache. I can't see how any of them can be reached without TRR being activated, which is good since TRR is still preffed off by default.
I believe this might be a race when accessing/changing the mTrrA and mTrrAAAA in the nsHostRecord objects. Partly this happens because the code currently cancels TRR resolves from nsHostRecord::RemoveOrRefresh, which I don't think it should since we don't need to refresh TRR name resolves due to network changes. I'm testing out a fix for this locally which I'll submit for review as soon as I feel confident.
(reviewboard refuses me to push another patch to the same bug that already had a patch previously so I'm doing it the old style as a work-around) This patch puts a mutex around all nsHostRecord::mTrrA[AAA] accesses since they might be accessed and updated from different threads. ... and remove Cancel() from nsHostRecord::RemoveOrRefresh since we don't need to cancel TRR resolves due to network changes.
Attachment #8952965 - Flags: review?(valentin.gosu)
Whiteboard: [necko-triaged] → [necko-triaged][trr]
Comment on attachment 8952965 [details] [diff] [review] 0001-bug-1438947-mutex-nsHostRecord-mTrrA-AAA-accesses-r-.patch Review of attachment 8952965 [details] [diff] [review]: ----------------------------------------------------------------- ::: netwerk/dns/nsHostResolver.cpp @@ +1076,5 @@ > RefPtr<nsHostRecord> rec(aRec); > mLock.AssertCurrentThreadOwns(); > + { > + MutexAutoLock trrlock(rec->mTrr_lock); > + MOZ_ASSERT(!TRROutstanding()); Let's wrap this block in a #ifdef DEBUG, since we're only locking for this assert ::: netwerk/dns/nsHostResolver.h @@ +189,5 @@ > enum { > INIT, STARTED, OK, FAILED > } mTrrAUsed, mTrrAAAAUsed; > > + Mutex mTrr_lock; // lock when accessing the mTrrA[AAA] pointers Let's call it mTrrLock or mTRRLock.
Attachment #8952965 - Flags: review?(valentin.gosu) → review+
Keywords: checkin-needed
Keywords: checkin-needed
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: