Closed Bug 1502641 Opened 7 years ago Closed 7 years ago

Crash in _purecall | nsHostRecord::~nsHostRecord

Categories

(Core :: Networking: DNS, defect, P2)

64 Branch
x86
Windows
defect

Tracking

()

RESOLVED FIXED
mozilla65
Tracking Status
firefox-esr60 --- unaffected
firefox63 --- unaffected
firefox64 --- wontfix
firefox65 --- fixed

People

(Reporter: philipp, Assigned: dragana)

Details

(Keywords: crash, regression, Whiteboard: [necko-triaged])

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is report bp-620d4c50-44cc-41e6-8bab-7c96c0181027. ============================================================= Top 10 frames of crashing thread: 0 vcruntime140.dll _purecall f:\dd\vctools\crt\vcruntime\src\misc\purevirt.cpp:29 1 xul.dll void nsHostRecord::~nsHostRecord netwerk/dns/nsHostResolver.cpp:215 2 xul.dll void AddrHostRecord::~AddrHostRecord netwerk/dns/nsHostResolver.cpp:323 3 xul.dll AddrHostRecord::Release netwerk/dns/nsHostResolver.cpp:296 4 xul.dll static void nsTHashtable<nsBaseHashtableET<nsGenericHashKey<nsHostKey>, RefPtr<nsHostRecord> > >::s_ClearEntry xpcom/ds/nsTHashtable.h:477 5 xul.dll PLDHashTable::Clear xpcom/ds/PLDHashTable.cpp:360 6 xul.dll nsHostResolver::Shutdown netwerk/dns/nsHostResolver.cpp:831 7 xul.dll nsDNSService::Shutdown netwerk/dns/nsDNSService2.cpp:770 8 xul.dll nsDNSService::Observe netwerk/dns/nsDNSService2.cpp:1253 9 xul.dll nsObserverService::NotifyObservers xpcom/ds/nsObserverService.cpp:295 ============================================================= this crash signature is newly showing up on 64.0b from 32bit windows installations, in low volume though. perhaps related to the changes from bug 1481251?
Flags: needinfo?(dd.mozilla)
Assignee: nobody → dd.mozilla
Status: NEW → ASSIGNED
Flags: needinfo?(dd.mozilla)
Priority: -- → P1
Whiteboard: [necko-triaged]
Assignee: dd.mozilla → honzab.moz
Status: ASSIGNED → NEW
Daniel, this has started on Oct 21. Do you happen to know about any changes in that code that may have caused this?
^
Flags: needinfo?(daniel)
There were no changes in DNS code immediately prior to October 21, and even the weeks before that, the DNS related changes were telemetry oriented. The crash seems to be shutdown-related. But I fail to see what in nsHostRecord::~nsHostRecord() that can turn into a NULL dereference (all crashes say address 0x0, so it doesn't seem to be 'mCallbacks' that is null) and how that happens! :-( Puzzling!
Flags: needinfo?(daniel)
Thanks, Daniel. As the crash rate is low and it's a shutdown crash only, I will lower the priority and look at this later. But it doesn't seem easily actionable, could be a symptom of a previews heap corruption elsewhere too.
Priority: P1 → P2
It may be some heap corruption that zeros out the vtable.
(In reply to Valentin Gosu [:valentin] from comment #5) > It may be some heap corruption that zeros out the vtable. _purecall is not a null deref, the crash is actually a C++ exception. this happens when you manage to call a virtual function that is unimplemented (= 0). hard to say how a vtable can start pointing there suddenly, need to study.
(In reply to Honza Bambas (:mayhemer) from comment #6) > (In reply to Valentin Gosu [:valentin] from comment #5) > > It may be some heap corruption that zeros out the vtable. > > _purecall is not a null deref, the crash is actually a C++ exception. this > happens when you manage to call a virtual function that is unimplemented (= > 0). hard to say how a vtable can start pointing there suddenly, need to > study. Status of bug?
Flags: needinfo?(honzab.moz)
Status of the bug is that I would love to try to find the regressing bug first. > this crash signature is newly showing up on 64.0b from 32bit windows installations This looks like a low-prob race/heap corruption and it's something we have landed on 64, it then started to manifest under a heavier load on beta. It correlates with the 22-10-2018 merge date (64: m-c -> m-b). Bit-ness suggests that this is likely related to pointer logic. There has been 11 bugs landed in Net:DNS for 64: https://bugzilla.mozilla.org/buglist.cgi?list_id=14423882&resolution=FIXED&query_format=advanced&bug_status=RESOLVED&bug_status=VERIFIED&bug_status=CLOSED&component=Networking%3A%20DNS&product=Core&target_milestone=mozilla64 One of them is probably causing this. Daniel, I will ask you to look at the list above, maybe you can spot something more quickly than me that could cause this crash. Then feel free to bounce back to me, if you don't find anything. Thanks.
Assignee: honzab.moz → daniel
Flags: needinfo?(honzab.moz)
this must be my bug 1481251 where I introduces to classes of hostRecord. I do know know how wa can en up with a virtual function that is unimplemented.
Assignee: daniel → dd.mozilla
Status: NEW → ASSIGNED
the crash is in AddrHostRecord destructor, I do not know how it can point to non implemented virtual function.
I agree that out of those bugs that is a likely candidate - but still not obvious what the problem is. I also noticed that some of the crashes are not shutdown triggered but instead network-change triggered. Like this: https://crash-stats.mozilla.com/report/index/d3fe2f33-4630-4a9f-bd2a-a5e110181106
My small research on _purecall suggests that this is a call on an object that is already destroyed or is in the phase between calls of derived/base destructors, while the outer dtor has already removed references to overridden methods from the vtable and made them point to pure calls. This can be either a thread race or simply a small chance to call a virtual function from dtor. Also could be "illusion of atomic reference counter" case.
(In reply to Honza Bambas (:mayhemer) from comment #12) > My small research on _purecall suggests that this is a call on an object > that is already destroyed or is in the phase between calls of derived/base > destructors, while the outer dtor has already removed references to > overridden methods from the vtable and made them point to pure calls. > > This can be either a thread race or simply a small chance to call a virtual > function from dtor. Also could be "illusion of atomic reference counter" > case. Yes, I know this all, but I am not sure what is going wrong in this case.No virtual function is called in destructor, actually no functions is called at all just cleaning up of variables. The only possibility is ref counting, that I did something wrong there.
There are a few issues here, but I'm not sure which one causes the crash: [1] nsHostRecord extends nsISupports, but NS_DECL_THREADSAFE_ISUPPORTS is done in AddrHostRecord/TypeHostRecord. Normally the base class declares and implements ISUPPORTS, and the child classes use NS_DECL_ISUPPORTS_INHERITED. [2] NS_IMPL_ISUPPORTS(AddrHostRecord, nsISupports, AddrHostRecord, nsHostRecord) is this definition recursive? Or maybe it's an issue that nsHostRecord has pure virtual methods. [3] We use nsCOMPtr<AddrHostRecord> and nsCOMPtr<TypeHostRecord>. Normally we should use RefPtr for concrete types, and use do_QueryObject to check their types. [1] https://searchfox.org/mozilla-central/rev/c0b26c40769a1e5607a1ae8be37fe64df64fc55e/netwerk/dns/nsHostResolver.h#78 [2] https://searchfox.org/mozilla-central/rev/c0b26c40769a1e5607a1ae8be37fe64df64fc55e/netwerk/dns/nsHostResolver.cpp#296,569
(In reply to Valentin Gosu [:valentin] from comment #14) > There are a few issues here, but I'm not sure which one causes the crash: > [1] nsHostRecord extends nsISupports, but NS_DECL_THREADSAFE_ISUPPORTS is > done in AddrHostRecord/TypeHostRecord. Normally the base class declares and > implements ISUPPORTS, and the child classes use NS_DECL_ISUPPORTS_INHERITED. > [2] NS_IMPL_ISUPPORTS(AddrHostRecord, nsISupports, AddrHostRecord, > nsHostRecord) is this definition recursive? Or maybe it's an issue that > nsHostRecord has pure virtual methods. > [3] We use nsCOMPtr<AddrHostRecord> and nsCOMPtr<TypeHostRecord>. Normally > we should use RefPtr for concrete types, and use do_QueryObject to check > their types. > > [1] > https://searchfox.org/mozilla-central/rev/ > c0b26c40769a1e5607a1ae8be37fe64df64fc55e/netwerk/dns/nsHostResolver.h#78 > [2] > https://searchfox.org/mozilla-central/rev/ > c0b26c40769a1e5607a1ae8be37fe64df64fc55e/netwerk/dns/nsHostResolver.cpp#296, > 569 I have change some stuff, maybe even too much...but better now than waiting for release crash stats.
Keywords: checkin-needed
Pushed by rvandermeulen@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/b0863b729ae9 Change ref-counting for nsHostRecord, change nsCOMPtr<*HostRecord> into RefPtr<*HostRecord>, remove pure virtual functions from nsHostRecord. r=valentin
Keywords: checkin-needed
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
Please nominate this for Beta approval when you get a chance.
Flags: needinfo?(dd.mozilla)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #19) > Please nominate this for Beta approval when you get a chance. I will, just waiting for a day for the patch to run on Nightly. Thanks
Comment on attachment 9025668 [details] Change ref-counting for nsHostRecord, change nsCOMPtr<*HostRecord> into RefPtr<*HostRecord>, remove pure virtual functions from nsHostRecord. [Beta/Release Uplift Approval Request] Feature/Bug causing the regression: Bug 1481251 User impact if declined: It cause crashes. Is this code covered by automated tests?: No Has the fix been verified in Nightly?: No Needs manual test from QE?: No If yes, steps to reproduce: List of other uplifts needed: None Risk to taking this patch: Medium Why is the change risky/not risky? (and alternatives if risky): The crash is probably connected with refCounting or pure virtual functions. The patch tries to clean all possible causes of the crash. The bug this is fixing only showed up on release, probably we will not notice any problems with the patch until release. String changes made/needed:
Flags: needinfo?(dd.mozilla)
Attachment #9025668 - Flags: approval-mozilla-beta?
Comment on attachment 9025668 [details] Change ref-counting for nsHostRecord, change nsCOMPtr<*HostRecord> into RefPtr<*HostRecord>, remove pure virtual functions from nsHostRecord. this is low enough volume that I think I'd prefer to let it ride the trains.
Attachment #9025668 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
If this is low vol on Beta it will definitely be higher volume on Release. I believe all affected branches should be fixed.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: