Crash in [@ AddrHostRecord::~AddrHostRecord]
Categories
(Core :: Networking: DNS, defect, P2)
Tracking
()
People
(Reporter: philipp, Assigned: kershaw)
References
Details
(Keywords: crash, csectype-other, sec-high, Whiteboard: [necko-triaged][sec-survey][adv-main91+r])
Crash Data
Attachments
(2 files, 1 obsolete file)
48 bytes,
text/x-phabricator-request
|
tjr
:
sec-approval+
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
dveditz
:
sec-approval+
|
Details | Review |
This bug is for crash report bp-e154ec3d-82cf-4cf6-85a8-e0c980190210.
Top 10 frames of crashing thread:
0 mozglue.dll arena_t::DallocSmall memory/build/mozjemalloc.cpp:3240
1 mozglue.dll BaseAllocator::free memory/build/mozjemalloc.cpp:4028
2 mozglue.dll je_free memory/build/malloc_decls.h:40
3 xul.dll void AddrHostRecord::~AddrHostRecord netwerk/dns/nsHostResolver.cpp:297
4 xul.dll AddrHostRecord::Release netwerk/dns/nsHostResolver.cpp:193
5 xul.dll nsHostResolver::FlushCache netwerk/dns/nsHostResolver.cpp:729
6 xul.dll nsDNSService::Observe netwerk/dns/nsDNSService2.cpp:1134
7 xul.dll nsObserverService::NotifyObservers xpcom/ds/nsObserverService.cpp:286
8 xul.dll nsNotifyAddrListener::ChangeEvent::Run netwerk/system/win32/nsNotifyAddrListener.cpp:471
9 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1157
this is a fairly low-volume windows crash signature, starting to show up more regularly in release builds since firefox 65. some comments from affected users are mentioning that they are disconnecting an usb modem, connect or dsiconnect to a vpn or just losing the connection at the time of the crash.
Updated•4 years ago
|
Reporter | ||
Comment 1•4 years ago
|
||
the issue seems to become more common again at the start of the 71.0b cycle.
Comment 2•3 years ago
|
||
Note: MOZ_CRASH_REASON is a double-free
Comment 3•3 years ago
|
||
Double-free; marking as a sec bug on principle (all double frees imply possible sec issues). Clearing priority for re-triage.
Comment 4•3 years ago
|
||
to be triaged
Assignee | ||
Comment 5•3 years ago
|
||
(In reply to [:philipp] from comment #1)
the issue seems to become more common again at the start of the 71.0b cycle.
Maybe this is related to turning on DOH?
Valentin, could you take a look at this? Thanks.
Updated•3 years ago
|
Comment 6•3 years ago
|
||
(In reply to Kershaw Chang [:kershaw] from comment #5)
(In reply to [:philipp] from comment #1)
the issue seems to become more common again at the start of the 71.0b cycle.
Maybe this is related to turning on DOH?
Valentin, could you take a look at this? Thanks.
Looking at the spikes in crashes in beta & release it seems that the crash rate increased in Firefox 71.
Looking at DNS related bugs that landed in 71, there's bug 1587875 and bug 1586845
It's rather clear from bug 1530175, bug 1543331 and similar that we have a bug when removing things from mRecordDB, and maybe bug 1587875 is exercising that code path more often.
Updated•3 years ago
|
Updated•3 years ago
|
Comment 7•3 years ago
|
||
This bug landed in 80 and the crashes seem to have gone away.
https://bugzilla.mozilla.org/show_bug.cgi?id=1649143
Comment 8•3 years ago
|
||
(In reply to Valentin Gosu [:valentin] (he/him) from comment #7)
This bug landed in 80 and the crashes seem to have gone away.
https://bugzilla.mozilla.org/show_bug.cgi?id=1649143
There's still one crash report in 80.0.1 🙁
https://crash-stats.mozilla.org/report/index/e4f35ca6-70dc-4af8-8ea1-cbf9c0200914
Comment hidden (obsolete) |
Comment hidden (obsolete) |
Comment hidden (obsolete) |
Comment 12•3 years ago
|
||
Talking with Valentin, there seem to be the theoretical possibility that double entries in mEvictionQ
could lead to this situation. In order to check or exclude this, we could promote this assert to a diagnostic assertion. See also bug 1665979.
Comment 13•3 years ago
|
||
My latest theory about this bug is that it's caused by unlocked removals from the LinkedList<RefPtr<nsHostRecord>>
While the refcounting of nsHostRecord is atomic, mNext and mPrev are unlocked pointers.
As such, we can get into a situation where:
T1: calls .remove on a hostRecord, that causes mNext and mPrev to be updated and decrements the refcount.
T2: calls .clear() which iterates through the linked list and removes each entry - but may be using old values of mNext so it may again decrement the refcount of the entry we just removed.
Later: we try to use the RefPtr we have from the hashtable, only to find that the memory has been freed already.
Comment 14•2 years ago
|
||
(In reply to Valentin Gosu [:valentin] (he/him) from comment #13)
My latest theory about this bug is that it's caused by unlocked removals from the
LinkedList<RefPtr<nsHostRecord>>
Is the code in question Windows only? Otherwise I think this should show up in TSan.
Comment 15•2 years ago
|
||
(In reply to Christian Holler (:decoder) from comment #14)
(In reply to Valentin Gosu [:valentin] (he/him) from comment #13)
My latest theory about this bug is that it's caused by unlocked removals from the
LinkedList<RefPtr<nsHostRecord>>
Is the code in question Windows only? Otherwise I think this should show up in TSan.
Yes. It only seems to happen on Windows.
But it's possible that D106415 or bug 1513519 might have fixed this too.
Assignee | ||
Comment 16•2 years ago
|
||
Assignee | ||
Comment 18•2 years ago
|
||
Comment on attachment 9224033 [details]
Bug 1544190 - Check if the record is in the queue, r=#necko
Security Approval Request
- How easily could an exploit be constructed based on the patch?: Unknown, since we still don't know the root cause of this crash. Also, the crash rate is really low.
- Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: No
- Which older supported branches are affected by this flaw?: all
- If not all supported branches, which bug introduced the flaw?: None
- Do you have backports for the affected branches?: No
- If not, how different, hard to create, and risky will they be?: The risk is low.
- How likely is this patch to cause regressions; how much testing does it need?: Should be no rick. This patch only adds some diagnostic assertions.
Comment 19•2 years ago
|
||
Comment on attachment 9224033 [details]
Bug 1544190 - Check if the record is in the queue, r=#necko
Approved to land and uplift.
![]() |
||
Comment 20•2 years ago
|
||
Check if the record is in the queue, r=necko-reviewers,valentin
https://hg.mozilla.org/integration/autoland/rev/65fbc94c6d9b92f87f7f605475ae044e22672c6c
https://hg.mozilla.org/mozilla-central/rev/65fbc94c6d9b
Assignee | ||
Comment 21•2 years ago
|
||
Sorry that I forgot to add leave-open
flag.
The landed patch only adds some diagnostic assertions, not really fix the issue.
Updated•2 years ago
|
Comment 22•2 years ago
|
||
Since these crashes only happen on release, do you think it might be a good idea to turn them into MOZ_RELEASE_ASSERT
conditioned on a pref?
Comment 23•2 years ago
|
||
uplift |
Comment 24•2 years ago
|
||
Comment on attachment 9224033 [details]
Bug 1544190 - Check if the record is in the queue, r=#necko
Clearing beta approval flag to get this off the needs-uplift queries.
Assignee | ||
Comment 25•2 years ago
|
||
(In reply to Valentin Gosu [:valentin] (he/him) from comment #22)
Since these crashes only happen on release, do you think it might be a good idea to turn them into
MOZ_RELEASE_ASSERT
conditioned on a pref?
Yes, I think this is a good idea. I'll prepare a patch for this. Thanks.
Assignee | ||
Comment 26•2 years ago
|
||
Assignee | ||
Comment 27•2 years ago
|
||
We've seen this crash in beta. This means that the previous patch with diagnostic assertions is not working. Maybe there is another reason behind this.
We are going to land a refactor patch in bug 1713796 and hope it can help us figure out the root cause of this crash.
![]() |
||
Comment 28•2 years ago
|
||
This causes also crashes with the signature [@ nsHostResolver::MaybeRenewHostRecordLocked]
, e.g. bp-e0f61756-b9d2-4aea-93b4-3a66a0210612. 4 crashes on Nightly so far.
Updated•2 years ago
|
Assignee | ||
Comment 29•2 years ago
|
||
Found a possible reason that causes a record be put into the high queue twice.
Imagine a flow below:
- When
TRRQuery::DispatchLookup
is called,mTrrA
andmTrrAAAA
are set and we put these two TRR requests in an array. - Right before dispatching TRR requests in the array, TRRQuery::Cancel is called, so both
mTrrA
andmTrrAAAA
are set to null. - The TRR requests in the array are still dispatched.
TRRQuery::CompleteLookup
is called bymTrrA
. SincemTrrA
andmTrrAAAA
are null, we assume there is no pending TRR request and mHostResolver->CompleteLookup is called.TRRQuery::CompleteLookup
is called bymTrrAAAA
again, and we'll hit the diagnostic assertion.
Assignee | ||
Comment 30•2 years ago
|
||
Assignee | ||
Comment 31•2 years ago
|
||
Comment on attachment 9227525 [details]
Bug 1544190 - Use a counter to track if there is a pending TRR request, r=#necko
Security Approval Request
- How easily could an exploit be constructed based on the patch?: Not easy, since this is most likely to be happened during shutdown.
- Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: No
- Which older supported branches are affected by this flaw?: all
- If not all supported branches, which bug introduced the flaw?: None
- Do you have backports for the affected branches?: No
- If not, how different, hard to create, and risky will they be?: The risk is low. This issue is already existed for a long time and the crash rate is low.
- How likely is this patch to cause regressions; how much testing does it need?: Low risk. We already have tests that exercise this call path.
Comment 32•2 years ago
|
||
Comment on attachment 9227525 [details]
Bug 1544190 - Use a counter to track if there is a pending TRR request, r=#necko
sec-approval = dveditz
Updated•2 years ago
|
![]() |
||
Comment 33•2 years ago
|
||
Use a counter to track if there is a pending TRR request, r=necko-reviewers,valentin
https://hg.mozilla.org/integration/autoland/rev/9169af3a686327a9b9763c50b9826c4dca5a95d3
https://hg.mozilla.org/mozilla-central/rev/9169af3a6863
Can this bug get closed?
Assignee | ||
Comment 34•2 years ago
|
||
(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #33)
Use a counter to track if there is a pending TRR request, r=necko-reviewers,valentin
https://hg.mozilla.org/integration/autoland/rev/9169af3a686327a9b9763c50b9826c4dca5a95d3
https://hg.mozilla.org/mozilla-central/rev/9169af3a6863Can this bug get closed?
I think we should wait a bit more to see if this crash is really fixed.
Comment 35•2 years ago
|
||
Removing the signatures that do not happen any more (or only with unsupported versions).
Assignee | ||
Comment 36•2 years ago
•
|
||
I think we can close this bug, since we added an MOZ_RELEASE_ASSERT
in bug 1717778. This release assertion will be triggered if we add the same record into the linked list twice. Before double releasing the host record, the release assertion should be hit first, so this bug can be closed.
Updated•2 years ago
|
Updated•2 years ago
|
Comment 37•2 years ago
|
||
As part of a security bug pattern analysis, we are requesting your help with a high level analysis of this bug. It is our hope to develop static analysis (or potentially runtime/dynamic analysis) in the future to identify classes of bugs.
Please visit this google form to reply.
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Assignee | ||
Updated•2 years ago
|
Updated•7 months ago
|
Description
•