Closed Bug 1396307 Opened 2 years ago Closed 2 years ago

Crash in RtlEnterCriticalSection | mozilla::RecursiveMutexAutoLock::RecursiveMutexAutoLock

Categories

(Core :: Networking: HTTP, defect, P1, critical)

56 Branch
x86_64
Windows
defect

Tracking

()

RESOLVED FIXED
mozilla57
Tracking Status
firefox-esr52 --- unaffected
firefox55 --- unaffected
firefox56 --- fixed
firefox57 --- fixed

People

(Reporter: philipp, Assigned: valentin.gosu)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression, Whiteboard: [necko-active])

Crash Data

Attachments

(4 files, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-b5956322-c77f-403e-9f74-6cb3a0170831.
=============================================================
Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	ntdll.dll 	RtlEnterCriticalSection 	
1 	xul.dll 	mozilla::RecursiveMutexAutoLock::RecursiveMutexAutoLock(mozilla::RecursiveMutex&) 	obj-firefox/dist/include/mozilla/RecursiveMutex.h:77
2 	xul.dll 	nsInputStreamPump::Suspend() 	netwerk/base/nsInputStreamPump.cpp:233
3 	xul.dll 	mozilla::net::nsHttpChannel::ContinueProcessResponse2(nsresult) 	netwerk/protocol/http/nsHttpChannel.cpp:2546
4 	xul.dll 	mozilla::net::nsHttpChannel::ContinueProcessResponse1() 	netwerk/protocol/http/nsHttpChannel.cpp:2401
5 	xul.dll 	mozilla::net::nsHttpChannel::AsyncContinueProcessResponse() 	netwerk/protocol/http/nsHttpChannel.cpp:2315
6 	xul.dll 	mozilla::detail::RunnableMethodImpl<mozilla::net::nsHttpChannel*, void ( mozilla::net::nsHttpChannel::*)(void), 1, 0>::Run() 	obj-firefox/dist/include/nsThreadUtils.h:1172
7 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1446
8 	xul.dll 	mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp:97
9 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc:319
10 	xul.dll 	MessageLoop::Run() 	ipc/chromium/src/base/message_loop.cc:299
11 	xul.dll 	nsBaseAppShell::Run() 	widget/nsBaseAppShell.cpp:156
12 	xul.dll 	nsAppShell::Run() 	widget/windows/nsAppShell.cpp:278
13 	xul.dll 	nsAppStartup::Run() 	toolkit/components/startup/nsAppStartup.cpp:287
14 	xul.dll 	XREMain::XRE_mainRun() 	toolkit/xre/nsAppRunner.cpp:4608
15 	xul.dll 	XREMain::XRE_main(int, char** const, mozilla::BootstrapConfig const&) 	toolkit/xre/nsAppRunner.cpp:4772
16 	xul.dll 	XRE_main(int, char** const, mozilla::BootstrapConfig const&) 	toolkit/xre/nsAppRunner.cpp:4867
17 	firefox.exe 	NS_internal_main(int, char**, char**) 	browser/app/nsBrowserApp.cpp:309
18 	firefox.exe 	wmain 	toolkit/xre/nsWindowsWMain.cpp:115
19 	firefox.exe 	__scrt_common_main_seh 	f:/dd/vctools/crt/vcstartup/src/startup/exe_common.inl:253
20 	kernel32.dll 	BaseThreadInitThunk 	
21 	ntdll.dll 	RtlUserThreadStart

crash reports with this signature are newly showing up in firefox 56 on 64bit versions of the browser on windows.

the comment to the referenced crash report says "Systematically crashes when logging on to .htaccess password protected websites. Same problem on separate Win 10 PC's. Disabling add-ons resolves issue."

three quarters of the reports have the following extension present: https://addons.mozilla.org/en-US/firefox/addon/passifox/
I see mozilla::net::nsHttpChannel::ContinueProcessResponse2 in all of these stacks.
Component: XPCOM → Networking: HTTP
Gary, is it related to the fix in bug 1382178?
Flags: needinfo?(xeonchen)
If the password protected page could be served from the cache, the cause could be that the cache won and mTransactionPump was nulled out in nsHttpChannel::ReadFromCache. Valentin, could you please have a look at it?
Assignee: nobody → valentin.gosu
Whiteboard: [necko-active]
(In reply to Michal Novotny (:michal) from comment #3)
> If the password protected page could be served from the cache, the cause
> could be that the cache won and mTransactionPump was nulled out in
> nsHttpChannel::ReadFromCache. Valentin, could you please have a look at it?

Theoretically racing shouldn't be enabled for beta.
Maybe we are wrongly setting mRaceCacheWithNetwork to true, even though we shouldn't.
See Also: → 1397037
Comment on attachment 8904918 [details]
Bug 1396307 - Make sure we only set mRaceCacheWithNetwork to true when the feature is enabled

https://reviewboard.mozilla.org/r/176700/#review181676

::: netwerk/protocol/http/nsHttpChannel.cpp:3979
(Diff revision 1)
>  
>          if (!mCacheOpenDelay) {
>              MOZ_ASSERT(NS_IsMainThread(), "Should be called on the main thread");
>              mCacheAsyncOpenCalled = true;
>              if (mNetworkTriggered) {
> -                mRaceCacheWithNetwork = true;
> +                mRaceCacheWithNetwork = sRCWNEnabled;

How is this going to fix the crash once we enable racing?
Attachment #8904918 - Flags: review?(michal.novotny) → review-
(In reply to Shian-Yow Wu [:swu] (56 Regression Engineering support) from comment #2)
> Gary, is it related to the fix in bug 1382178?

It's unlikely related to bug 1382178, and looks like Michal and Valentin are working at this (thank you!).
Flags: needinfo?(xeonchen)
Valentin any updates here? We are going to run out of time for 56 soon.
Flags: needinfo?(valentin.gosu)
Michal, can we consider the patch for beta?
I haven't been able to figure out the exact flow that triggers this bug, and uplifting the fix in the last days of beta is quite risky anyway.
You are correct in saying that the patch is working around the bug of why we are setting mRaceCacheWithNetwork to true, but we will not be enabling RCWN on beta anyway.
Flags: needinfo?(valentin.gosu) → needinfo?(michal.novotny)
I'm still not sure how this patch will fix the problem.

If you mean that we set mRaceCacheWithNetwork incorrectly, so it's true even if we're not actually racing, then it won't fix the problem, because we would null out the pump only if we really race.

And if you mean that we really race in case we shouldn't (because it's disabled), then this patch would set mRaceCacheWithNetwork to false in case we really race and it's IMO pretty bad.
Flags: needinfo?(michal.novotny)
(In reply to Michal Novotny (:michal) from comment #10)
> I'm still not sure how this patch will fix the problem.

Just preventing mRaceCacheWithNetwork from being set to true when racing is not enabled.
We still need separate patch for fx57 to fix the root cause.

> If you mean that we set mRaceCacheWithNetwork incorrectly, so it's true even
> if we're not actually racing, then it won't fix the problem, because we
> would null out the pump only if we really race.

If we are setting mRacheCacheWithNetwork incorrectly then we would also null out the pump incorrectly in nsHttpChannel::ReadFromCache.

> And if you mean that we really race in case we shouldn't (because it's
> disabled), then this patch would set mRaceCacheWithNetwork to false in case
> we really race and it's IMO pretty bad.

I don't think that's the case. In any case, we should not be racing on beta anyway.
(In reply to Valentin Gosu [:valentin] from comment #11)
> > And if you mean that we really race in case we shouldn't (because it's
> > disabled), then this patch would set mRaceCacheWithNetwork to false in case
> > we really race and it's IMO pretty bad.
> 
> I don't think that's the case. In any case, we should not be racing on beta
> anyway.

Users can enable it manually and obviously some of them did it:

https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2017-09-11&keys=__none__!__none__!__none__&max_channel_version=beta%252F56&measure=NETWORK_RACE_CACHE_WITH_NETWORK_USAGE_2&min_channel_version=null&processType=*&product=Firefox&sanitize=0&sort_keys=submissions&start_date=2017-08-29&table=1&trim=1&use_submission_date=0
(In reply to Michal Novotny (:michal) from comment #12)
> (In reply to Valentin Gosu [:valentin] from comment #11)
> > > And if you mean that we really race in case we shouldn't (because it's
> > > disabled), then this patch would set mRaceCacheWithNetwork to false in case
> > > we really race and it's IMO pretty bad.
> > 
> > I don't think that's the case. In any case, we should not be racing on beta
> > anyway.
> 
> Users can enable it manually and obviously some of them did it:

We've fixed a bunch of other bugs with RCWN and not uplifted them to beta.
I think we should either disable it completely on beta (ignore the pref) - or let those users that flipped the pref deal with the crashes (as we do for other bugs).
Still, I want to make sure mRaceCacheWithNetwork is never true when it shouldn't. At the moment it seems that is a possibility.
(In reply to Valentin Gosu [:valentin] from comment #13)
> Still, I want to make sure mRaceCacheWithNetwork is never true when it
> shouldn't. At the moment it seems that is a possibility.

Why we don't simply add a null check as in bugs 1351340 and 1382178? AFAICS, Gary forgot to add a null check here in bug 1382178.
(In reply to Michal Novotny (:michal) from comment #14)
> (In reply to Valentin Gosu [:valentin] from comment #13)
> > Still, I want to make sure mRaceCacheWithNetwork is never true when it
> > shouldn't. At the moment it seems that is a possibility.
> 
> Why we don't simply add a null check as in bugs 1351340 and 1382178? AFAICS,
> Gary forgot to add a null check here in bug 1382178.

We can do that too. I'm worried there might be other crashes caused by setting mRaceCacheWithNetwork to true incorrectly.
So add MOZ_RELEASE_ASSERT(sRCWNEnabled) where we set mRaceCacheWithNetwork = true. Then we would know for sure there is some problem.
See Also: → 1399541
Bulk priority update: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P1
So, I managed to create a test case that triggers the cache. The stack trace is exactly as one on crash-stats.
This crash happens when we have a cached 200 response for a page that now returns a 401.
We race, network wins, but for some reason mRaceCacheWithNetwork isn't set.
Then OnCacheEntryCheck is triggered, with a valid cache entry. In OnCacheEntryAvailable we call ReadFromCache, which nulls out the transaction pump. When we get OnAuthAvailable, the pump is null, so we crash.
Blocks: RCWN
Comment on attachment 8908311 [details]
Bug 1396307 - Add null check in nsHttpChannel::OnAuthAvailable

https://reviewboard.mozilla.org/r/179940/#review185154
Attachment #8908311 - Flags: review?(michal.novotny) → review+
Comment on attachment 8904918 [details]
Bug 1396307 - Make sure we only set mRaceCacheWithNetwork to true when the feature is enabled

https://reviewboard.mozilla.org/r/176700/#review185160
Attachment #8904918 - Flags: review?(michal.novotny) → review+
Comment on attachment 8908322 [details]
Bug 1396307 - Set mCacheAsyncOpenCalled back to flase when AsyncOpenURI fails

https://reviewboard.mozilla.org/r/179960/#review185166
Attachment #8908322 - Flags: review?(michal.novotny) → review+
Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/77cd8fa838ba
Make sure we only set mRaceCacheWithNetwork to true when the feature is enabled r=michal
https://hg.mozilla.org/integration/autoland/rev/d4ec49c6206d
Add null check in nsHttpChannel::OnAuthAvailable r=michal
https://hg.mozilla.org/integration/autoland/rev/ea0aae8835fe
Set mCacheAsyncOpenCalled back to flase when AsyncOpenURI fails r=michal
Attachment #8907838 - Attachment is obsolete: true
Duplicate of this bug: 1397037
Crash Signature: [@ RtlEnterCriticalSection | mozilla::RecursiveMutexAutoLock::RecursiveMutexAutoLock] → [@ RtlEnterCriticalSection | mozilla::RecursiveMutexAutoLock::RecursiveMutexAutoLock] [@ RtlEnterCriticalSection | nsInputStreamPump::Resume ]
Comment on attachment 8908426 [details] [diff] [review]
[beta] Make sure we only set mRaceCacheWithNetwork to true when the feature is enabled

Approval Request Comment
[Feature/Bug causing the regression]: bug 1325081
[User impact if declined]: possible crashes when the race condition happens, most usually with pages that require user authentication.
[Is this code covered by automated tests?]: yes
[Has the fix been verified in Nightly?]: just landed
[Needs manual test from QE? If yes, steps to reproduce]: no, it is unlikely the race can be intentionally triggered.
[List of other uplifts needed for the feature/fix]: attachment 8908311 [details] and attachment 8908322 [details] in this same bug apply cleanly over this one.
[Is the change risky?]: no.
[Why is the change risky/not risky?]: each of these patches contains a small change that ensures we don't mistakenly set a variable to true, and that we check pointers before using them.
[String changes made/needed]: none
Attachment #8908426 - Flags: approval-mozilla-beta?
Comment on attachment 8908426 [details] [diff] [review]
[beta] Make sure we only set mRaceCacheWithNetwork to true when the feature is enabled

56 is on m-r now.
Attachment #8908426 - Flags: approval-mozilla-beta? → approval-mozilla-release?
(In reply to Michal Novotny (:michal) from comment #23)
> Comment on attachment 8904918 [details]
> Bug 1396307 - Make sure we only set mRaceCacheWithNetwork to true when the
> feature is enabled
> 
> https://reviewboard.mozilla.org/r/176700/#review185160

This patch was a hotfix for releases where is RCWN disabled. Valentin, please file a follow up bug to revert the change once RCWN is enabled.
Comment on attachment 8908426 [details] [diff] [review]
[beta] Make sure we only set mRaceCacheWithNetwork to true when the feature is enabled

This crash is not high volume, but it newly appeared in 56 and may become a worse problem with release users.  Seems worth the risk to take this last minute fix (for the 56 RC build)
Attachment #8908426 - Flags: approval-mozilla-release? → approval-mozilla-release+
(In reply to Valentin Gosu [:valentin] from comment #29)
> [Is this code covered by automated tests?]: yes
> [Has the fix been verified in Nightly?]: just landed
> [Needs manual test from QE? If yes, steps to reproduce]: no, it is unlikely
> the race can be intentionally triggered.

Setting qe-verify- based on Valentin's assessment on manual testing needs and the fact that this fix has automated coverage.
Flags: qe-verify-
If this is this bug: https://crash-stats.mozilla.com/report/index/1b27e2ca-5b2e-495d-84c2-fc9bf0170926#tab-bugzilla - I can reproduce this with 100% reliability on a TFS-system - thus if there is any way I can help to get this crunched, please feel free to contact me.
Blocks: 1403137
You need to log in before you can comment on or make changes to this bug.