Closed Bug 1450630 Opened 6 years ago Closed 6 years ago

DoH crash in mozilla::ipc::MessageChannel::CxxStackFrame::CxxStackFrame | mozilla::ipc::MessageChannel::Send | mozilla::dom::PContentParent::SendDataStorageClear

Categories

(Core :: Networking: DNS, defect, P1)

Unspecified
All
defect

Tracking

()

RESOLVED FIXED
mozilla61
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- disabled
firefox59 --- unaffected
firefox60 --- disabled
firefox61 --- fixed

People

(Reporter: cpeterson, Assigned: bagder)

References

Details

(Keywords: crash, regression, Whiteboard: [necko-triaged][trr])

Crash Data

Attachments

(1 file)

Firefox crashes when changing network.trr.uri while loading a web page.

STR:
1. Open about:config in a new tab.
2. Set network.trr.mode pref = 2
3. Set network.trr.uri pref = https://mozilla.cloudflare-dns.com/dns-query
4. Load cnn.com (or any web page with many resources that are slow to load) in a new tab.
5. While cnn.com is loading, quickly jump back to the about:config tab and change network.trr.uri pref = https://dns.google.com/experimental
6. Back in the cnn.com tab, try refreshing the page while it is still loading or ctrl+clicking a link to open a new page in a new tab.

RESULT:
Firefox crashes!

I am running Firefox Nightly buildID 20180401220058 on Windows 10. I was able to reliably reproduce this crash many times in two different profiles.

bp-9c038307-3edb-4f30-84a7-81a080180402
bp-ba7846ac-69e2-41b2-88c6-ead470180402
bp-e70724cd-6eb0-453c-a218-d09200180402
bp-6c272764-6c61-470e-a0cb-ea9280180402
bp-da6cd0fd-7167-4456-953a-d1d060180402
bp-ea5a4375-fded-4dd7-b571-46ca40180402
bp-f4c29f38-8bbe-4219-829b-997970180402
bp-97a4def2-0c06-4f42-94d0-973dc0180402

This bug was filed from the Socorro interface and is
report bp-e70724cd-6eb0-453c-a218-d09200180402.
=============================================================

Top 10 frames of crashing thread:

0 xul.dll mozilla::ipc::MessageChannel::CxxStackFrame::CxxStackFrame ipc/glue/MessageChannel.cpp:247
1 xul.dll mozilla::ipc::MessageChannel::Send ipc/glue/MessageChannel.cpp:912
2 xul.dll mozilla::dom::PContentParent::SendDataStorageClear ipc/ipdl/PContentParent.cpp:1456
3 xul.dll static void mozilla::RunOnAllContentParents<<lambda_46c9c9cb7d4c5fbb2838091f4a692aa1> > security/manager/ssl/DataStorage.cpp:725
4 xul.dll mozilla::DataStorage::Clear security/manager/ssl/DataStorage.cpp:920
5 xul.dll mozilla::net::TRRService::IsTRRBlacklisted netwerk/dns/TRRService.cpp:383
6 xul.dll nsHostResolver::TrrLookup netwerk/dns/nsHostResolver.cpp:1117
7 xul.dll nsHostResolver::NameLookup netwerk/dns/nsHostResolver.cpp:1271
8 xul.dll nsHostResolver::ResolveHost netwerk/dns/nsHostResolver.cpp:941
9 xul.dll nsDNSService::AsyncResolveExtendedNative netwerk/dns/nsDNSService2.cpp:884

=============================================================
Blocks: 1434852
I can reproduce this crash on macOS 10.13.4, too:

bp-e439cc7f-fc3b-4bd4-8b53-c16f70180402
OS: Windows 10 → All
Assignee: nobody → daniel
Priority: -- → P1
See Also: → 1441131
Whiteboard: [necko-triaged][trr]
Ok, my proxying of the removal from TRR doesn't seem to have been enough to address this issue (and my manual "verifying" of my fix also doesn't seem to have been very good). Ehsan, now you have a renewed chance to take a look at this since you missed it before in bug 1441131! =)
Component: Networking: DNS → Security: PSM
Flags: needinfo?(ehsan)
Assignee: daniel → nobody
(In reply to Daniel Stenberg [:bagder] from comment #2)
> Ok, my proxying of the removal from TRR doesn't seem to have been enough to
> address this issue (and my manual "verifying" of my fix also doesn't seem to
> have been very good). Ehsan, now you have a renewed chance to take a look at
> this since you missed it before in bug 1441131! =)

Do you mind rephrasing this into a standalone question that I can answer please?  :-)  I'm lacking all context here (e.g. I don't know what TRR is, and how this is related to something that I have done in Necko in the past?)

Perhaps to state the obvious, the cause of the crash is that a mainthread API (PContentParent::SendDataStorageClear) is being called on a non-mainthread.  But the question David asked me in bug 1441131 comment 6 was confusing in that I don't think this is a new restriction.
Flags: needinfo?(ehsan)
(TRR is just a new name resolver in necko, it's not important here. But it is a new user of DataStorage and this crash is triggered by TRR trying to remove an entry.)

It is being called on a main-thread for all I can tell (that's what I did when I thought I fixed bug 1441131) and we still get an assert. The assert that triggers this crash says (seen in crash reports linked to above)

   MOZ_RELEASE_ASSERT(mWorkerThread == GetCurrentVirtualThread()) (not on worker thread!)

Here's the calling code: https://searchfox.org/mozilla-central/source/netwerk/dns/TRRService.cpp#443-455

I was hoping someone could tell me what a stupid mistake I've done so that I can fix it. The assert message isn't very easy for me to decipher into an actionable change on my behalf.
This crash keeps occurring. I would appreciate help and guidance on what to do here...
Can you throw in an earlier ASSERT that you're on main thread?  It sounds like you think you should be, but don't know at the assert level that you are?
I don't see how the TRRService code can be any more certain:

      if (!NS_IsMainThread()) {
        NS_DispatchToMainThread(runnable);
      } else {
        runnable->Run();
      }
Ehsan, how about now? As seen above we are calling this on the mainthread, but the assert that is triggered here seems to require something more specific that I don't know what I can do about?
Flags: needinfo?(ehsan)
I run into the same crash today every time I opened Firefox Beta b11 on macOS 10.13.3 with a MacBook 13 Retina 2017
When I start Firefox Tabliss (https://addons.mozilla.org/en-US/firefox/addon/tabliss/) opens his tab.
After changing network.trr.mode back to 0 in the prefs.js I could start and use it again without any problems.
I have the follow crashes:
https://crash-stats.mozilla.com/report/index/34148279-0559-4bad-b35b-7ebb80180412
https://crash-stats.mozilla.com/report/index/9a77e34f-e4a3-446a-a67d-298b70180412
https://crash-stats.mozilla.com/report/index/02f1951b-c485-4855-a3c3-846630180412

If you need any help, just ask :)
Sorry for the late reply again...  I need to get better at dealing with bugzilla notifications :/

(In reply to Daniel Stenberg [:bagder] from comment #4)
> (TRR is just a new name resolver in necko, it's not important here. But it
> is a new user of DataStorage and this crash is triggered by TRR trying to
> remove an entry.)
> 
> It is being called on a main-thread for all I can tell (that's what I did
> when I thought I fixed bug 1441131) and we still get an assert. The assert
> that triggers this crash says (seen in crash reports linked to above)
> 
>    MOZ_RELEASE_ASSERT(mWorkerThread == GetCurrentVirtualThread()) (not on
> worker thread!)
> 
> Here's the calling code:
> https://searchfox.org/mozilla-central/source/netwerk/dns/TRRService.cpp#443-
> 455

Sorry I'm not too familiar with this code but I think you have forgotten that *all* calls to DataStorage APIs require to be made on the main thread.

The code you linked to, at least right now takes me to <https://searchfox.org/mozilla-central/rev/78dbe34925f04975f16cb9a5d4938be714d41897/netwerk/dns/TRRService.cpp#443-455>.  There, you are certainly doing the right thing.  The problem is that as far as I can tell, that code has nothing to do with the crash.

Comment 0 points to this crash: https://crash-stats.mozilla.com/report/index/e70724cd-6eb0-453c-a218-d09200180402.  The call stack begins at PR_NativeRunThread, so we are *not* on the main thread.  Frame #5 on that crash is this code: <https://hg.mozilla.org/mozilla-central/annotate/ef717c03ff54d10b2e30df7e63fc11172c69db44/netwerk/dns/TRRService.cpp#l383>.  It's the same function as the one you linked to above, just a different part of it.  It looks to me that bug 1441131 fixed one half of the problem but forgot about the other half.

Also, the crashes in comment 9 are just bug 1441131 seen on beta, and we are crashing there presumably because Julien refused to approve the beta fix (https://bugzilla.mozilla.org/show_bug.cgi?id=1441131#c24) and if he did we would see this version of the crash on beta similar to Nightly...

FWIW there is also a call to DataStorage::Get() here which *currently* isn't bound to the main thread but that's a time bomb waiting to blow up.  If you can, it may be better to defer to entire IsTRRBlacklisted() method to the main thread so that you don't have to worry about each individual method call one by one.

Does this help?
Flags: needinfo?(ehsan) → needinfo?(daniel)
Yes thanks, it helps!

Just one question for the next time: where is this fact documented? It would've saved me a lot of time if I knew this already a few months ago.
Flags: needinfo?(daniel)
Assignee: nobody → daniel
Component: Security: PSM → Networking: DNS
(In reply to :Ehsan Akhgari from comment #10)
...
> Sorry I'm not too familiar with this code but I think you have forgotten
> that *all* calls to DataStorage APIs require to be made on the main thread.

I'm confused - DataStorage was written with the intent that it be safe to access from multiple threads (the documentation even says so at the top of DataStorage.h). From what I can tell from the code, the static DataStorage::Get* meta-APIs that operate on the different types of DataStorage we have are main-thread-only, but are the Get/Set/Remove APIs that operate on actual data restricted as well? If so, we've strayed from the original design requirements (e.g. nsSiteSecurityService is fundamentally broken if we can't access its backing DataStorage off the main thread).
Flags: needinfo?(ehsan)
(In reply to Daniel Stenberg [:bagder] from comment #11)
> Just one question for the next time: where is this fact documented? It
> would've saved me a lot of time if I knew this already a few months ago.

Which part of it?  I mostly found everything I typed into the comment above from searchfox, we're very poor when in comes to documentation.

(In reply to David Keeler [:keeler] (use needinfo) from comment #13)
> I'm confused - DataStorage was written with the intent that it be safe to
> access from multiple threads (the documentation even says so at the top of
> DataStorage.h). From what I can tell from the code, the static
> DataStorage::Get* meta-APIs that operate on the different types of
> DataStorage we have are main-thread-only, but are the Get/Set/Remove APIs
> that operate on actual data restricted as well? If so, we've strayed from
> the original design requirements (e.g. nsSiteSecurityService is
> fundamentally broken if we can't access its backing DataStorage off the main
> thread).

Hmm...  I'm not sure what the API requirements here are, I'm basing my judgement on what the code is actually doing.  All calls to RunOnAllContentParents() require to be on the main thread (in general, our IPC interfaces are not thread safe so if the caller requires thread safety it is its own responsibility to take care of that at layers higher than IPC.)

This code is currently called from Put(), Remove() and Clear() so per above I think you're suggesting that this has all been broken since bug 1215723 (Firefox 45)?  If yes, then the right fix isn't to fix the Necko consumers, but to rework the fix to bug 1215723 to make it not have a main-thread dependency.  I think doing so is actually relatively easy, all that should be required is to rewrite RunOnAllContentParents() to make it dispatch a runnable to the main thread if we aren't there currently, and make sure its callers capture their closure with a copy rather than by reference.
Flags: needinfo?(ehsan) → needinfo?(dkeeler)
I believe your analysis is correct, although I'm surprised this hasn't been more of a problem until now. Rewriting RunOnAllContentParents() to dispatch to the main thread sounds like a good solution. The data passed to these functions is size-limited, so capturing by copy rather than reference shouldn't be too big of a problem (I hope).
Flags: needinfo?(dkeeler)
Comment on attachment 8971937 [details]
bug 1450630 - use DataStorage in main thread only

https://reviewboard.mozilla.org/r/240698/#review246830
Attachment #8971937 - Flags: review?(mcmanus) → review+
Pushed by daniel@haxx.se:
https://hg.mozilla.org/integration/autoland/rev/de74e326165d
use DataStorage in main thread only r=mcmanus
https://hg.mozilla.org/mozilla-central/rev/de74e326165d
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
DoH is disabled by default in 60, updating status flags.
Crash Signature: [@ mozilla::ipc::MessageChannel::CxxStackFrame::CxxStackFrame | mozilla::ipc::MessageChannel::Send | mozilla::dom::PContentParent::SendDataStorageClear] → [@ mozilla::ipc::MessageChannel::CxxStackFrame::CxxStackFrame | mozilla::ipc::MessageChannel::Send | mozilla::dom::PContentParent::SendDataStorageClear] [@ libxul.so@0x926a0a | libxul.so@0x928180 | libxul.so@0xff9a9f | libxul.so@0x3a4b650 | libxul.so@0xb5…

Bugbug thinks this bug is a regression, but please revert this change in case of error.

Keywords: regression
You need to log in before you can comment on or make changes to this bug.