Closed Bug 1544190 Opened 4 years ago Closed 2 years ago

Crash in [@ AddrHostRecord::~AddrHostRecord]

Categories

(Core :: Networking: DNS, defect, P2)

65 Branch
All
Windows
defect

Tracking

()

RESOLVED FIXED
91 Branch
Tracking Status
firefox-esr60 --- unaffected
firefox-esr78 --- wontfix
firefox66 --- wontfix
firefox67 --- wontfix
firefox68 --- wontfix
firefox70 --- wontfix
firefox71 --- wontfix
firefox72 --- wontfix
firefox89 --- wontfix
firefox90 --- wontfix
firefox91 --- fixed

People

(Reporter: philipp, Assigned: kershaw)

References

Details

(Keywords: crash, csectype-other, sec-high, Whiteboard: [necko-triaged][sec-survey][adv-main91+r])

Crash Data

Attachments

(2 files, 1 obsolete file)

This bug is for crash report bp-e154ec3d-82cf-4cf6-85a8-e0c980190210.

Top 10 frames of crashing thread:

0 mozglue.dll arena_t::DallocSmall memory/build/mozjemalloc.cpp:3240
1 mozglue.dll BaseAllocator::free memory/build/mozjemalloc.cpp:4028
2 mozglue.dll je_free memory/build/malloc_decls.h:40
3 xul.dll void AddrHostRecord::~AddrHostRecord netwerk/dns/nsHostResolver.cpp:297
4 xul.dll AddrHostRecord::Release netwerk/dns/nsHostResolver.cpp:193
5 xul.dll nsHostResolver::FlushCache netwerk/dns/nsHostResolver.cpp:729
6 xul.dll nsDNSService::Observe netwerk/dns/nsDNSService2.cpp:1134
7 xul.dll nsObserverService::NotifyObservers xpcom/ds/nsObserverService.cpp:286
8 xul.dll nsNotifyAddrListener::ChangeEvent::Run netwerk/system/win32/nsNotifyAddrListener.cpp:471
9 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1157

this is a fairly low-volume windows crash signature, starting to show up more regularly in release builds since firefox 65. some comments from affected users are mentioning that they are disconnecting an usb modem, connect or dsiconnect to a vpn or just losing the connection at the time of the crash.

Assignee: nobody → dd.mozilla
Status: NEW → ASSIGNED
Priority: -- → P3
Whiteboard: [necko-triaged]

the issue seems to become more common again at the start of the 71.0b cycle.

Crash Signature: [@ arena_t::DallocSmall | BaseAllocator::free | je_free | AddrHostRecord::~AddrHostRecord] → [@ arena_t::DallocSmall | BaseAllocator::free | je_free | AddrHostRecord::~AddrHostRecord] [@ arena_t::DallocSmall | arena_dalloc | AddrHostRecord::~AddrHostRecord ] [@ arena_t::DallocSmall | je_free | AddrHostRecord::~AddrHostRecord ]

Note: MOZ_CRASH_REASON is a double-free

Double-free; marking as a sec bug on principle (all double frees imply possible sec issues). Clearing priority for re-triage.

Group: core-security
Flags: needinfo?(nhnguyen)
Priority: P3 → --

to be triaged

Flags: needinfo?(nhnguyen) → needinfo?(kershaw)
Whiteboard: [necko-triaged]

(In reply to [:philipp] from comment #1)

the issue seems to become more common again at the start of the 71.0b cycle.

Maybe this is related to turning on DOH?
Valentin, could you take a look at this? Thanks.

Flags: needinfo?(kershaw) → needinfo?(valentin.gosu)
Group: core-security → network-core-security

(In reply to Kershaw Chang [:kershaw] from comment #5)

(In reply to [:philipp] from comment #1)

the issue seems to become more common again at the start of the 71.0b cycle.

Maybe this is related to turning on DOH?
Valentin, could you take a look at this? Thanks.

Looking at the spikes in crashes in beta & release it seems that the crash rate increased in Firefox 71.
Looking at DNS related bugs that landed in 71, there's bug 1587875 and bug 1586845

It's rather clear from bug 1530175, bug 1543331 and similar that we have a bug when removing things from mRecordDB, and maybe bug 1587875 is exercising that code path more often.

Flags: needinfo?(valentin.gosu)
Priority: -- → P2
Whiteboard: [necko-triaged]
Assignee: dd.mozilla → valentin.gosu

This bug landed in 80 and the crashes seem to have gone away.
https://bugzilla.mozilla.org/show_bug.cgi?id=1649143

(In reply to Valentin Gosu [:valentin] (he/him) from comment #7)

This bug landed in 80 and the crashes seem to have gone away.
https://bugzilla.mozilla.org/show_bug.cgi?id=1649143

There's still one crash report in 80.0.1 🙁

https://crash-stats.mozilla.org/report/index/e4f35ca6-70dc-4af8-8ea1-cbf9c0200914

Crash Signature: [@ arena_t::DallocSmall | BaseAllocator::free | je_free | AddrHostRecord::~AddrHostRecord] [@ arena_t::DallocSmall | arena_dalloc | AddrHostRecord::~AddrHostRecord ] [@ arena_t::DallocSmall | je_free | AddrHostRecord::~AddrHostRecord ] → [@ arena_t::DallocSmall | BaseAllocator::free | je_free | AddrHostRecord::~AddrHostRecord] [@ arena_t::DallocSmall | arena_dalloc | AddrHostRecord::~AddrHostRecord ] [@ arena_t::DallocSmall | je_free | AddrHostRecord::~AddrHostRecord ] [@ AddrHostRecor…

Talking with Valentin, there seem to be the theoretical possibility that double entries in mEvictionQ could lead to this situation. In order to check or exclude this, we could promote this assert to a diagnostic assertion. See also bug 1665979.

My latest theory about this bug is that it's caused by unlocked removals from the LinkedList<RefPtr<nsHostRecord>>
While the refcounting of nsHostRecord is atomic, mNext and mPrev are unlocked pointers.
As such, we can get into a situation where:
T1: calls .remove on a hostRecord, that causes mNext and mPrev to be updated and decrements the refcount.
T2: calls .clear() which iterates through the linked list and removes each entry - but may be using old values of mNext so it may again decrement the refcount of the entry we just removed.
Later: we try to use the RefPtr we have from the hashtable, only to find that the memory has been freed already.

(In reply to Valentin Gosu [:valentin] (he/him) from comment #13)

My latest theory about this bug is that it's caused by unlocked removals from the LinkedList<RefPtr<nsHostRecord>>

Is the code in question Windows only? Otherwise I think this should show up in TSan.

Flags: needinfo?(valentin.gosu)

(In reply to Christian Holler (:decoder) from comment #14)

(In reply to Valentin Gosu [:valentin] (he/him) from comment #13)

My latest theory about this bug is that it's caused by unlocked removals from the LinkedList<RefPtr<nsHostRecord>>

Is the code in question Windows only? Otherwise I think this should show up in TSan.

Yes. It only seems to happen on Windows.
But it's possible that D106415 or bug 1513519 might have fixed this too.

Flags: needinfo?(valentin.gosu)

Take this from Valentin.

Assignee: valentin.gosu → kershaw

Comment on attachment 9224033 [details]
Bug 1544190 - Check if the record is in the queue, r=#necko

Security Approval Request

  • How easily could an exploit be constructed based on the patch?: Unknown, since we still don't know the root cause of this crash. Also, the crash rate is really low.
  • Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: No
  • Which older supported branches are affected by this flaw?: all
  • If not all supported branches, which bug introduced the flaw?: None
  • Do you have backports for the affected branches?: No
  • If not, how different, hard to create, and risky will they be?: The risk is low.
  • How likely is this patch to cause regressions; how much testing does it need?: Should be no rick. This patch only adds some diagnostic assertions.
Attachment #9224033 - Flags: sec-approval?

Comment on attachment 9224033 [details]
Bug 1544190 - Check if the record is in the queue, r=#necko

Approved to land and uplift.

Attachment #9224033 - Flags: sec-approval?
Attachment #9224033 - Flags: sec-approval+
Attachment #9224033 - Flags: approval-mozilla-beta+
Group: network-core-security → core-security-release
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 91 Branch

Sorry that I forgot to add leave-open flag.
The landed patch only adds some diagnostic assertions, not really fix the issue.

Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---

Since these crashes only happen on release, do you think it might be a good idea to turn them into MOZ_RELEASE_ASSERT conditioned on a pref?

Flags: needinfo?(kershaw)

Comment on attachment 9224033 [details]
Bug 1544190 - Check if the record is in the queue, r=#necko

Clearing beta approval flag to get this off the needs-uplift queries.

Attachment #9224033 - Flags: approval-mozilla-beta+

(In reply to Valentin Gosu [:valentin] (he/him) from comment #22)

Since these crashes only happen on release, do you think it might be a good idea to turn them into MOZ_RELEASE_ASSERT conditioned on a pref?

Yes, I think this is a good idea. I'll prepare a patch for this. Thanks.

Flags: needinfo?(kershaw)

We've seen this crash in beta. This means that the previous patch with diagnostic assertions is not working. Maybe there is another reason behind this.
We are going to land a refactor patch in bug 1713796 and hope it can help us figure out the root cause of this crash.

This causes also crashes with the signature [@ nsHostResolver::MaybeRenewHostRecordLocked], e.g. bp-e0f61756-b9d2-4aea-93b4-3a66a0210612. 4 crashes on Nightly so far.

Attachment #9227190 - Attachment is obsolete: true

Found a possible reason that causes a record be put into the high queue twice.
Imagine a flow below:

  1. When TRRQuery::DispatchLookup is called, mTrrA and mTrrAAAA are set and we put these two TRR requests in an array.
  2. Right before dispatching TRR requests in the array, TRRQuery::Cancel is called, so both mTrrA and mTrrAAAA are set to null.
  3. The TRR requests in the array are still dispatched.
  4. TRRQuery::CompleteLookup is called by mTrrA. Since mTrrA and mTrrAAAA are null, we assume there is no pending TRR request and mHostResolver->CompleteLookup is called.
  5. TRRQuery::CompleteLookup is called by mTrrAAAA again, and we'll hit the diagnostic assertion.

Comment on attachment 9227525 [details]
Bug 1544190 - Use a counter to track if there is a pending TRR request, r=#necko

Security Approval Request

  • How easily could an exploit be constructed based on the patch?: Not easy, since this is most likely to be happened during shutdown.
  • Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: No
  • Which older supported branches are affected by this flaw?: all
  • If not all supported branches, which bug introduced the flaw?: None
  • Do you have backports for the affected branches?: No
  • If not, how different, hard to create, and risky will they be?: The risk is low. This issue is already existed for a long time and the crash rate is low.
  • How likely is this patch to cause regressions; how much testing does it need?: Low risk. We already have tests that exercise this call path.
Attachment #9227525 - Flags: sec-approval?

Comment on attachment 9227525 [details]
Bug 1544190 - Use a counter to track if there is a pending TRR request, r=#necko

sec-approval = dveditz

Attachment #9227525 - Flags: sec-approval? → sec-approval+

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #33)

Use a counter to track if there is a pending TRR request, r=necko-reviewers,valentin
https://hg.mozilla.org/integration/autoland/rev/9169af3a686327a9b9763c50b9826c4dca5a95d3
https://hg.mozilla.org/mozilla-central/rev/9169af3a6863

Can this bug get closed?

I think we should wait a bit more to see if this crash is really fixed.

Flags: needinfo?(kershaw)

Removing the signatures that do not happen any more (or only with unsupported versions).

Crash Signature: [@ arena_t::DallocSmall | BaseAllocator::free | je_free | AddrHostRecord::~AddrHostRecord] [@ arena_t::DallocSmall | arena_dalloc | AddrHostRecord::~AddrHostRecord ] [@ arena_t::DallocSmall | je_free | AddrHostRecord::~AddrHostRecord ] [@ AddrHostRecor… → [@ AddrHostRecord::~AddrHostRecord]
Summary: Crash in [@ arena_t::DallocSmall | BaseAllocator::free | je_free | AddrHostRecord::~AddrHostRecord] → Crash in [@ AddrHostRecord::~AddrHostRecord]

I think we can close this bug, since we added an MOZ_RELEASE_ASSERT in bug 1717778. This release assertion will be triggered if we add the same record into the linked list twice. Before double releasing the host record, the release assertion should be hit first, so this bug can be closed.

Status: ASSIGNED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED
Depends on: 1717778
Target Milestone: --- → 91 Branch

As part of a security bug pattern analysis, we are requesting your help with a high level analysis of this bug. It is our hope to develop static analysis (or potentially runtime/dynamic analysis) in the future to identify classes of bugs.

Please visit this google form to reply.

Flags: needinfo?(kershaw)
Whiteboard: [necko-triaged] → [necko-triaged][sec-survey]
QA Whiteboard: [post-critsmash-triage]
Flags: qe-verify-
Whiteboard: [necko-triaged][sec-survey] → [necko-triaged][sec-survey][adv-main90+r]
Whiteboard: [necko-triaged][sec-survey][adv-main90+r] → [necko-triaged][sec-survey][adv-main91+r]
Flags: needinfo?(kershaw)
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.