Crash in RtlEnterCriticalSection | mozilla::RecursiveMutexAutoLock::RecursiveMutexAutoLock

RESOLVED FIXED in Firefox 56

Status

()

Core
Networking: HTTP
P1
critical
RESOLVED FIXED
2 months ago
28 days ago

People

(Reporter: philipp, Assigned: valentin)

Tracking

(Blocks: 2 bugs, {crash, regression})

56 Branch
mozilla57
x86_64
Windows
crash, regression
Points:
---
Dependency tree / graph
Bug Flags:
qe-verify -

Firefox Tracking Flags

(firefox-esr52 unaffected, firefox55 unaffected, firefox56 fixed, firefox57 fixed)

Details

(Whiteboard: [necko-active], crash signature)

MozReview Requests

()

Submitter Diff Changes Open Issues Last Updated
Loading...
Error loading review requests:

Attachments

(4 attachments, 1 obsolete attachment)

(Reporter)

Description

2 months ago
This bug was filed from the Socorro interface and is 
report bp-b5956322-c77f-403e-9f74-6cb3a0170831.
=============================================================
Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	ntdll.dll 	RtlEnterCriticalSection 	
1 	xul.dll 	mozilla::RecursiveMutexAutoLock::RecursiveMutexAutoLock(mozilla::RecursiveMutex&) 	obj-firefox/dist/include/mozilla/RecursiveMutex.h:77
2 	xul.dll 	nsInputStreamPump::Suspend() 	netwerk/base/nsInputStreamPump.cpp:233
3 	xul.dll 	mozilla::net::nsHttpChannel::ContinueProcessResponse2(nsresult) 	netwerk/protocol/http/nsHttpChannel.cpp:2546
4 	xul.dll 	mozilla::net::nsHttpChannel::ContinueProcessResponse1() 	netwerk/protocol/http/nsHttpChannel.cpp:2401
5 	xul.dll 	mozilla::net::nsHttpChannel::AsyncContinueProcessResponse() 	netwerk/protocol/http/nsHttpChannel.cpp:2315
6 	xul.dll 	mozilla::detail::RunnableMethodImpl<mozilla::net::nsHttpChannel*, void ( mozilla::net::nsHttpChannel::*)(void), 1, 0>::Run() 	obj-firefox/dist/include/nsThreadUtils.h:1172
7 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1446
8 	xul.dll 	mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp:97
9 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc:319
10 	xul.dll 	MessageLoop::Run() 	ipc/chromium/src/base/message_loop.cc:299
11 	xul.dll 	nsBaseAppShell::Run() 	widget/nsBaseAppShell.cpp:156
12 	xul.dll 	nsAppShell::Run() 	widget/windows/nsAppShell.cpp:278
13 	xul.dll 	nsAppStartup::Run() 	toolkit/components/startup/nsAppStartup.cpp:287
14 	xul.dll 	XREMain::XRE_mainRun() 	toolkit/xre/nsAppRunner.cpp:4608
15 	xul.dll 	XREMain::XRE_main(int, char** const, mozilla::BootstrapConfig const&) 	toolkit/xre/nsAppRunner.cpp:4772
16 	xul.dll 	XRE_main(int, char** const, mozilla::BootstrapConfig const&) 	toolkit/xre/nsAppRunner.cpp:4867
17 	firefox.exe 	NS_internal_main(int, char**, char**) 	browser/app/nsBrowserApp.cpp:309
18 	firefox.exe 	wmain 	toolkit/xre/nsWindowsWMain.cpp:115
19 	firefox.exe 	__scrt_common_main_seh 	f:/dd/vctools/crt/vcstartup/src/startup/exe_common.inl:253
20 	kernel32.dll 	BaseThreadInitThunk 	
21 	ntdll.dll 	RtlUserThreadStart

crash reports with this signature are newly showing up in firefox 56 on 64bit versions of the browser on windows.

the comment to the referenced crash report says "Systematically crashes when logging on to .htaccess password protected websites. Same problem on separate Win 10 PC's. Disabling add-ons resolves issue."

three quarters of the reports have the following extension present: https://addons.mozilla.org/en-US/firefox/addon/passifox/
I see mozilla::net::nsHttpChannel::ContinueProcessResponse2 in all of these stacks.
Component: XPCOM → Networking: HTTP

Comment 2

2 months ago
Gary, is it related to the fix in bug 1382178?
Flags: needinfo?(xeonchen)
If the password protected page could be served from the cache, the cause could be that the cache won and mTransactionPump was nulled out in nsHttpChannel::ReadFromCache. Valentin, could you please have a look at it?
Assignee: nobody → valentin.gosu
Whiteboard: [necko-active]
(Assignee)

Comment 4

2 months ago
(In reply to Michal Novotny (:michal) from comment #3)
> If the password protected page could be served from the cache, the cause
> could be that the cache won and mTransactionPump was nulled out in
> nsHttpChannel::ReadFromCache. Valentin, could you please have a look at it?

Theoretically racing shouldn't be enabled for beta.
Maybe we are wrongly setting mRaceCacheWithNetwork to true, even though we shouldn't.

Updated

2 months ago
See Also: → bug 1397037
Comment hidden (mozreview-request)

Comment 6

2 months ago
mozreview-review
Comment on attachment 8904918 [details]
Bug 1396307 - Make sure we only set mRaceCacheWithNetwork to true when the feature is enabled

https://reviewboard.mozilla.org/r/176700/#review181676

::: netwerk/protocol/http/nsHttpChannel.cpp:3979
(Diff revision 1)
>  
>          if (!mCacheOpenDelay) {
>              MOZ_ASSERT(NS_IsMainThread(), "Should be called on the main thread");
>              mCacheAsyncOpenCalled = true;
>              if (mNetworkTriggered) {
> -                mRaceCacheWithNetwork = true;
> +                mRaceCacheWithNetwork = sRCWNEnabled;

How is this going to fix the crash once we enable racing?
Attachment #8904918 - Flags: review?(michal.novotny) → review-
(In reply to Shian-Yow Wu [:swu] (56 Regression Engineering support) from comment #2)
> Gary, is it related to the fix in bug 1382178?

It's unlikely related to bug 1382178, and looks like Michal and Valentin are working at this (thank you!).
Flags: needinfo?(xeonchen)
Blocks: 1397037
Valentin any updates here? We are going to run out of time for 56 soon.
Flags: needinfo?(valentin.gosu)
(Assignee)

Comment 9

a month ago
Michal, can we consider the patch for beta?
I haven't been able to figure out the exact flow that triggers this bug, and uplifting the fix in the last days of beta is quite risky anyway.
You are correct in saying that the patch is working around the bug of why we are setting mRaceCacheWithNetwork to true, but we will not be enabling RCWN on beta anyway.
Flags: needinfo?(valentin.gosu) → needinfo?(michal.novotny)
I'm still not sure how this patch will fix the problem.

If you mean that we set mRaceCacheWithNetwork incorrectly, so it's true even if we're not actually racing, then it won't fix the problem, because we would null out the pump only if we really race.

And if you mean that we really race in case we shouldn't (because it's disabled), then this patch would set mRaceCacheWithNetwork to false in case we really race and it's IMO pretty bad.
Flags: needinfo?(michal.novotny)
(Assignee)

Comment 11

a month ago
(In reply to Michal Novotny (:michal) from comment #10)
> I'm still not sure how this patch will fix the problem.

Just preventing mRaceCacheWithNetwork from being set to true when racing is not enabled.
We still need separate patch for fx57 to fix the root cause.

> If you mean that we set mRaceCacheWithNetwork incorrectly, so it's true even
> if we're not actually racing, then it won't fix the problem, because we
> would null out the pump only if we really race.

If we are setting mRacheCacheWithNetwork incorrectly then we would also null out the pump incorrectly in nsHttpChannel::ReadFromCache.

> And if you mean that we really race in case we shouldn't (because it's
> disabled), then this patch would set mRaceCacheWithNetwork to false in case
> we really race and it's IMO pretty bad.

I don't think that's the case. In any case, we should not be racing on beta anyway.
(In reply to Valentin Gosu [:valentin] from comment #11)
> > And if you mean that we really race in case we shouldn't (because it's
> > disabled), then this patch would set mRaceCacheWithNetwork to false in case
> > we really race and it's IMO pretty bad.
> 
> I don't think that's the case. In any case, we should not be racing on beta
> anyway.

Users can enable it manually and obviously some of them did it:

https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2017-09-11&keys=__none__!__none__!__none__&max_channel_version=beta%252F56&measure=NETWORK_RACE_CACHE_WITH_NETWORK_USAGE_2&min_channel_version=null&processType=*&product=Firefox&sanitize=0&sort_keys=submissions&start_date=2017-08-29&table=1&trim=1&use_submission_date=0
(Assignee)

Comment 13

a month ago
(In reply to Michal Novotny (:michal) from comment #12)
> (In reply to Valentin Gosu [:valentin] from comment #11)
> > > And if you mean that we really race in case we shouldn't (because it's
> > > disabled), then this patch would set mRaceCacheWithNetwork to false in case
> > > we really race and it's IMO pretty bad.
> > 
> > I don't think that's the case. In any case, we should not be racing on beta
> > anyway.
> 
> Users can enable it manually and obviously some of them did it:

We've fixed a bunch of other bugs with RCWN and not uplifted them to beta.
I think we should either disable it completely on beta (ignore the pref) - or let those users that flipped the pref deal with the crashes (as we do for other bugs).
Still, I want to make sure mRaceCacheWithNetwork is never true when it shouldn't. At the moment it seems that is a possibility.
(In reply to Valentin Gosu [:valentin] from comment #13)
> Still, I want to make sure mRaceCacheWithNetwork is never true when it
> shouldn't. At the moment it seems that is a possibility.

Why we don't simply add a null check as in bugs 1351340 and 1382178? AFAICS, Gary forgot to add a null check here in bug 1382178.
(Assignee)

Comment 15

a month ago
(In reply to Michal Novotny (:michal) from comment #14)
> (In reply to Valentin Gosu [:valentin] from comment #13)
> > Still, I want to make sure mRaceCacheWithNetwork is never true when it
> > shouldn't. At the moment it seems that is a possibility.
> 
> Why we don't simply add a null check as in bugs 1351340 and 1382178? AFAICS,
> Gary forgot to add a null check here in bug 1382178.

We can do that too. I'm worried there might be other crashes caused by setting mRaceCacheWithNetwork to true incorrectly.
So add MOZ_RELEASE_ASSERT(sRCWNEnabled) where we set mRaceCacheWithNetwork = true. Then we would know for sure there is some problem.

Updated

a month ago
status-firefox56: affected → fix-optional
(Assignee)

Updated

a month ago
See Also: → bug 1399541
Bulk priority update: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P1
(Assignee)

Comment 18

a month ago
Created attachment 8907838 [details] [diff] [review]
Add test that triggers crash when RCWN returns cached 200 response for page that now returns 401 response and is in currently authenticating

So, I managed to create a test case that triggers the cache. The stack trace is exactly as one on crash-stats.
This crash happens when we have a cached 200 response for a page that now returns a 401.
We race, network wins, but for some reason mRaceCacheWithNetwork isn't set.
Then OnCacheEntryCheck is triggered, with a valid cache entry. In OnCacheEntryAvailable we call ReadFromCache, which nulls out the transaction pump. When we get OnAuthAvailable, the pump is null, so we crash.
(Assignee)

Updated

a month ago
Blocks: 1307504
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)

Comment 21

a month ago
mozreview-review
Comment on attachment 8908311 [details]
Bug 1396307 - Add null check in nsHttpChannel::OnAuthAvailable

https://reviewboard.mozilla.org/r/179940/#review185154
Attachment #8908311 - Flags: review?(michal.novotny) → review+
Comment hidden (mozreview-request)

Comment 23

a month ago
mozreview-review
Comment on attachment 8904918 [details]
Bug 1396307 - Make sure we only set mRaceCacheWithNetwork to true when the feature is enabled

https://reviewboard.mozilla.org/r/176700/#review185160
Attachment #8904918 - Flags: review?(michal.novotny) → review+
Comment hidden (mozreview-request)

Comment 25

a month ago
mozreview-review
Comment on attachment 8908322 [details]
Bug 1396307 - Set mCacheAsyncOpenCalled back to flase when AsyncOpenURI fails

https://reviewboard.mozilla.org/r/179960/#review185166
Attachment #8908322 - Flags: review?(michal.novotny) → review+

Comment 26

a month ago
Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/77cd8fa838ba
Make sure we only set mRaceCacheWithNetwork to true when the feature is enabled r=michal
https://hg.mozilla.org/integration/autoland/rev/d4ec49c6206d
Add null check in nsHttpChannel::OnAuthAvailable r=michal
https://hg.mozilla.org/integration/autoland/rev/ea0aae8835fe
Set mCacheAsyncOpenCalled back to flase when AsyncOpenURI fails r=michal
(Assignee)

Updated

a month ago
Attachment #8907838 - Attachment is obsolete: true
(Assignee)

Updated

a month ago
Duplicate of this bug: 1397037
(Assignee)

Updated

a month ago
Crash Signature: [@ RtlEnterCriticalSection | mozilla::RecursiveMutexAutoLock::RecursiveMutexAutoLock] → [@ RtlEnterCriticalSection | mozilla::RecursiveMutexAutoLock::RecursiveMutexAutoLock] [@ RtlEnterCriticalSection | nsInputStreamPump::Resume ]
(Assignee)

Comment 28

a month ago
Created attachment 8908426 [details] [diff] [review]
[beta] Make sure we only set mRaceCacheWithNetwork to true when the feature is enabled

MozReview-Commit-ID: FrLjfuExmYV
(Assignee)

Comment 29

a month ago
Comment on attachment 8908426 [details] [diff] [review]
[beta] Make sure we only set mRaceCacheWithNetwork to true when the feature is enabled

Approval Request Comment
[Feature/Bug causing the regression]: bug 1325081
[User impact if declined]: possible crashes when the race condition happens, most usually with pages that require user authentication.
[Is this code covered by automated tests?]: yes
[Has the fix been verified in Nightly?]: just landed
[Needs manual test from QE? If yes, steps to reproduce]: no, it is unlikely the race can be intentionally triggered.
[List of other uplifts needed for the feature/fix]: attachment 8908311 [details] and attachment 8908322 [details] in this same bug apply cleanly over this one.
[Is the change risky?]: no.
[Why is the change risky/not risky?]: each of these patches contains a small change that ensures we don't mistakenly set a variable to true, and that we check pointers before using them.
[String changes made/needed]: none
Attachment #8908426 - Flags: approval-mozilla-beta?
https://hg.mozilla.org/mozilla-central/rev/77cd8fa838ba
https://hg.mozilla.org/mozilla-central/rev/d4ec49c6206d
https://hg.mozilla.org/mozilla-central/rev/ea0aae8835fe
Status: NEW → RESOLVED
Last Resolved: a month ago
status-firefox57: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
Comment on attachment 8908426 [details] [diff] [review]
[beta] Make sure we only set mRaceCacheWithNetwork to true when the feature is enabled

56 is on m-r now.
Attachment #8908426 - Flags: approval-mozilla-beta? → approval-mozilla-release?
(In reply to Michal Novotny (:michal) from comment #23)
> Comment on attachment 8904918 [details]
> Bug 1396307 - Make sure we only set mRaceCacheWithNetwork to true when the
> feature is enabled
> 
> https://reviewboard.mozilla.org/r/176700/#review185160

This patch was a hotfix for releases where is RCWN disabled. Valentin, please file a follow up bug to revert the change once RCWN is enabled.
Comment on attachment 8908426 [details] [diff] [review]
[beta] Make sure we only set mRaceCacheWithNetwork to true when the feature is enabled

This crash is not high volume, but it newly appeared in 56 and may become a worse problem with release users.  Seems worth the risk to take this last minute fix (for the 56 RC build)
Attachment #8908426 - Flags: approval-mozilla-release? → approval-mozilla-release+

Comment 34

a month ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-beta/rev/a3e692e095c6 (FIREFOX_56b13_RELBRANCH)
https://hg.mozilla.org/releases/mozilla-release/rev/16a31f23b98a
status-firefox56: fix-optional → fixed
(In reply to Valentin Gosu [:valentin] from comment #29)
> [Is this code covered by automated tests?]: yes
> [Has the fix been verified in Nightly?]: just landed
> [Needs manual test from QE? If yes, steps to reproduce]: no, it is unlikely
> the race can be intentionally triggered.

Setting qe-verify- based on Valentin's assessment on manual testing needs and the fact that this fix has automated coverage.
Flags: qe-verify-

Comment 36

28 days ago
If this is this bug: https://crash-stats.mozilla.com/report/index/1b27e2ca-5b2e-495d-84c2-fc9bf0170926#tab-bugzilla - I can reproduce this with 100% reliability on a TFS-system - thus if there is any way I can help to get this crunched, please feel free to contact me.
(Assignee)

Updated

28 days ago
Blocks: 1403137
You need to log in before you can comment on or make changes to this bug.