Firefox instahang on start after landing patch from bug #1392841

VERIFIED FIXED in Firefox 59

Status

()

defect
P1
blocker
VERIFIED FIXED
2 years ago
Last month

People

(Reporter: Virtual, Assigned: michal)

Tracking

(5 keywords)

59 Branch
mozilla59
x86_64
Windows 7
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox-esr52 unaffected, firefox57 unaffected, firefox58 unaffected, firefox59+ verified)

Details

(Whiteboard: [necko-triaged])

Attachments

(7 attachments, 8 obsolete attachments)

99.36 KB, application/x-zip-compressed
Details
11.91 KB, application/x-zip-compressed
Details
fix
2.62 KB, patch
ttaubert
: review+
Details | Diff | Splinter Review
1.60 MB, text/plain
Details
4.95 KB, text/plain
Details
33.04 KB, text/plain
Details
2.80 KB, patch
mayhemer
: review+
Details | Diff | Splinter Review
[Tracking Requested - why for this release]: Regression

STR:
1. Update Mozilla Firefox Nightly 59.0a1 to latest one version (2017-11-18)
2. Restart browser
and enjoy hang

Workaround:
Set "network.http.rcwn.enabled" as "false" in about:config being under Safe Mode
Flags: needinfo?(michal.novotny)
Regression caused by:
bug #1392841

Regression pushlog:
https://hg.mozilla.org/mozilla-central/rev/64b1b02e184b
Has Regression Range: --- → yes
Has STR: --- → yes
What does instahang mean? does the UI freeze? High CPU usage? 
How do you get it to reproduce?
Do you mind providing some logs?

https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging
Use the following logging attributes: timestamp,sync,nsHttp:5,nsSocketTransport:5,nsStreamPump:5,nsHostResolver:5
Flags: needinfo?(Virtual)
(In reply to Valentin Gosu [:valentin] from comment #2)
> What does instahang mean? does the UI freeze? High CPU usage? 
It means nearly same like instacrash,
so when I start Firefox, I get immediately UI in non responding state

> How do you get it to reproduce?
To add, I get better reproducible STR:
1. Install Decentraleyes 2.0.1 extension ( https://addons.mozilla.org/en-US/firefox/addon/decentraleyes/ )
2. Enable "Disable link prefetching" in Decentraleyes options (if it is not)
3. Install uBlock Origin 1.14.19b7 extension ( https://addons.mozilla.org/en-US/firefox/addon/ublock-origin/ )
4. Enable these preferences in uBlock Origin options (if they are not):
- "Disable pre-fetching (to prevent any connection for blocked network requests)"
- "Disable hyperlink auditing"
- "Prevent WebRTC from leaking local IP addresses"
- "Block CSP reports"
5. Update Mozilla Firefox Nightly 59.0a1 to latest one version (2017-11-18)
6. Restart browser
and enjoy hang on browser restart

Basing on STR #2 and 4, these preferences could also be cause of hang (but I'm yet to find which one exactly):
- network.dns.disablePrefetch;true
- network.http.speculative-parallel-limit;0
- network.predictor.enabled;false
- network.prefetch-next;false
as these extensions change them.


(In reply to Valentin Gosu [:valentin] from comment #2)
> Do you mind providing some logs?
> 
> https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging
> Use the following logging attributes:
> timestamp,sync,nsHttp:5,nsSocketTransport:5,nsStreamPump:5,nsHostResolver:5

Done with
> cd c:\
> set MOZ_LOG=timestamp,sync,nsHttp:5,nsSocketTransport:5,nsStreamPump:5,nsHostResolver:5
> set MOZ_LOG_FILE=%TEMP%\log.txt 
> cd "Program Files\Nightly"
> .\firefox.exe
Flags: needinfo?(Virtual)
Posted file log.txt.child-2 (obsolete) —
log.txt.child-1 was empty, so omitting it
Assignee

Comment 7

2 years ago
(In reply to Virtual_ManPL [:Virtual] - (please needinfo? me - so I will see your comment/reply/question/etc.) from comment #4)
> Created attachment 8929836 [details]
> log.txt

Main thread waits for mRCWNLock, but it's not clear to me what happens on Cache I/O thread that OnCacheEntryCheck() didn't finish.

I'll try to reproduce with the mentioned addons...
Flags: needinfo?(michal.novotny)
Assignee

Comment 8

2 years ago
(In reply to Virtual_ManPL [:Virtual] - (please needinfo? me - so I will see your comment/reply/question/etc.) from > Done with
> > cd c:\
> > set MOZ_LOG=timestamp,sync,nsHttp:5,nsSocketTransport:5,nsStreamPump:5,nsHostResolver:5

Could you please add also cache2:5 to MOZ_LOG?
Flags: needinfo?(Virtual)
(In reply to Michal Novotny (:michal) from comment #8)
> (In reply to Virtual_ManPL [:Virtual] - (please needinfo? me - so I will see
> your comment/reply/question/etc.) from > Done with
> > > cd c:\
> > > set MOZ_LOG=timestamp,sync,nsHttp:5,nsSocketTransport:5,nsStreamPump:5,nsHostResolver:5
> 
> Could you please add also cache2:5 to MOZ_LOG?

Sure,
oddly, Firefox doesn't hang when I add cache2:5 to MOZ_LOG
> cd C:\
> set MOZ_LOG=timestamp,sync,nsHttp:5,nsSocketTransport:5,nsStreamPump:5,nsHostResolver:5,cache2:5
> set MOZ_LOG_FILE=%TEMP%\log.txt 
> cd "Program Files\Nightly"
> .\firefox.exe
Flags: needinfo?(Virtual)
Posted file log.txt.child-2 (obsolete) —
log.txt.child-1 was empty, so omitting it
Attachment #8929837 - Attachment is obsolete: true
Assignee

Comment 13

2 years ago
(In reply to Virtual_ManPL [:Virtual] - (please needinfo? me - so I will see your comment/reply/question/etc.) from comment #10)
> Created attachment 8930058 [details]
> log.txt
> 
> Sure,
> oddly, Firefox doesn't hang when I add cache2:5 to MOZ_LOG

So this log doesn't contain the hang, right? I don't see anything wrong in it.
Flags: needinfo?(Virtual)
Yes, as I said 
(Virtual_ManPL [:Virtual] wrote in comment #9)
> oddly, Firefox doesn't hang when I add cache2:5 to MOZ_LOG
Flags: needinfo?(Virtual)
I'm seeing the same problem, let me know if there's anything I can do to help narrow it down.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #15)
> I'm seeing the same problem, let me know if there's anything I can do to
> help narrow it down.

Can you reproduce the problem when enabling the log as stated in comment 2 & comment 8? If yes, providing such log would help.
Flags: needinfo?(emanuel.hoogeveen)
Assignee

Comment 17

2 years ago
It would be great to have call stacks from the process, but I don't know how to externally trigger a crash on Windows. Honza, is it possible on Windows? On Linux I can simply send SIGSEGV to the process and FF will send the crash to crash-stats.
Flags: needinfo?(honzab.moz)
Posted file log with cache2:5
I was able to generate a log by setting the log file location to a ramdisk (I guess writing it to an SSD prevents the race condition). This is using my main profile, I hope that doesn't make it more confusing - I can try the STR from a fresh profile if needed.

Interestingly, also putting my profile and firefox itself on the ramdisk seems to prevent the hang, I guess because the cache loads so much faster.
Flags: needinfo?(emanuel.hoogeveen)
(In reply to Michal Novotny (:michal) from comment #17)
> It would be great to have call stacks from the process, but I don't know how
> to externally trigger a crash on Windows. Honza, is it possible on Windows?
> On Linux I can simply send SIGSEGV to the process and FF will send the crash
> to crash-stats.

If the goal is to get stacks on hang, then attaching either visual studio [0] (preferred) or windbg both with added symbol server settings for moz [1] at Tools/Options/Debugging/Symbols to a nightly build running would do.  Then you need to download the correct source tar ball (see about:buildconfig or platform.ini in the installation dir) and unpack the files you expect to crawl through (e.g. just network, xpcom, mfbt - just to save time) to the right folder (visual studio will give a clue where it's trying to find the sources when you click a frame in the Call Stack window while already attached _and manually paused_.)

[0] https://www.visualstudio.com/thank-you-downloading-visual-studio/?sku=Community&rel=15#
[1] https://symbols.mozilla.org/
c19
Flags: needinfo?(honzab.moz)
I attached VS2017 to all 4 processes and exported the call stacks from the Threads window after pausing each process in the debugger.
(by "Named process" I mean the firefox.exe process called Nightly - the others had no name)
Assignee

Comment 23

2 years ago
Emanuel, this is awesome, thanks a lot!

So the problem is that main thread is waiting for the lock which is grabbed on cache I/O thread which is stuck due to initialization of NSS that sends sync event to main thread. Honza, I though we had similar problems in the past and we moved initialization of selected services before we start any HTTP activity, but I might be wrong. Anyway, we definitely need to initialize NSS earlier. Do you have any idea, how to fix this?

mozglue.dll!mozilla::detail::ConditionVariableImpl::wait(mozilla::detail::MutexImpl & lock) Line 58	 
xul.dll!mozilla::CondVar::Wait(unsigned int aInterval) Line 72	 
[Inline Frame] xul.dll!mozilla::Monitor::Wait(unsigned int) Line 40	 
[Inline Frame] xul.dll!mozilla::MonitorAutoLock::Wait(unsigned int) Line 88	 
xul.dll!mozilla::SyncRunnable::DispatchToThread(nsIEventTarget * aThread, bool aForceDispatch) Line 70	 
xul.dll!mozilla::SyncRunnable::DispatchToThread(nsIEventTarget * aThread, nsIRunnable * aRunnable, bool aForceDispatch) Line 99	 
xul.dll!EnsureNSSInitializedChromeOrContent() Line 107	 
xul.dll!mozilla::psm::Constructor<mozilla::psm::TransportSecurityInfo,0,1,1>(nsISupports * aOuter, const nsID & aIID, void * * aResult) Line 94	 
xul.dll!nsComponentManagerImpl::CreateInstance(const nsID & aClass, nsISupports * aDelegate, const nsID & aIID, void * * aResult) Line 1003	 
xul.dll!CallCreateInstance(const nsID & aCID, nsISupports * aIID, const nsID & aResult, void * *) Line 137	 
xul.dll!nsCreateInstanceByCID::operator()(const nsID & aIID, void * * aInstancePtr) Line 184	 
xul.dll!nsCOMPtr_base::assign_from_helper(const nsCOMPtr_helper & aHelper, const nsID & aIID) Line 131	 
[Inline Frame] xul.dll!nsCOMPtr<nsISupports>::{ctor}(const nsCOMPtr_helper &) Line 953	 
xul.dll!nsBinaryInputStream::ReadObject(bool aIsStrongRef, nsISupports * * aObject) Line 977	 
xul.dll!NS_DeserializeObject(const nsTSubstring<char> & str, nsISupports * * obj) Line 48	 
xul.dll!mozilla::net::CacheEntry::GetSecurityInfo(nsISupports * * aSecurityInfo) Line 1358	 
xul.dll!mozilla::net::nsHttpChannel::OpenCacheInputStream(nsICacheEntry * cacheEntry, bool startBuffering, bool checkingAppCacheEntry) Line 4783	 
xul.dll!mozilla::net::nsHttpChannel::OnCacheEntryCheck(nsICacheEntry * entry, nsIApplicationCache * appCache, unsigned int * aResult) Line 4312	 
xul.dll!mozilla::net::CacheEntry::InvokeCallback(mozilla::net::CacheEntry::Callback & aCallback) Line 760	 
xul.dll!mozilla::net::CacheEntry::InvokeCallbacks(bool aReadOnly) Line 687	 
xul.dll!mozilla::net::CacheEntry::InvokeCallbacks() Line 627	 
xul.dll!mozilla::net::CacheEntry::OnFileReady(nsresult aResult, bool aIsNew) Line 507	 
xul.dll!mozilla::net::CacheFile::OnMetadataRead(nsresult aResult) Line 671	 
xul.dll!mozilla::net::CacheFileMetadata::OnDataRead(mozilla::net::CacheFileHandle * aHandle, char * aBuf, nsresult aResult) Line 830	 
xul.dll!mozilla::net::ReadEvent::Run() Line 745	 
xul.dll!mozilla::net::CacheIOThread::LoopOneLevel(unsigned int aLevel) Line 573	 
xul.dll!mozilla::net::CacheIOThread::ThreadFunc() Line 511	 
xul.dll!mozilla::net::CacheIOThread::ThreadFunc(void * aClosure) Line 452	 
nss3.dll!_PR_NativeRunThread(void * arg) Line 419	 
nss3.dll!pr_root(void * arg) Line 96
Flags: needinfo?(honzab.moz)
you ran me over :)

We have to init PSM on the main thread when an https entry is being queried.
Flags: needinfo?(honzab.moz)
Assignee

Comment 25

2 years ago
(In reply to Honza Bambas (:mayhemer) from comment #24)
> you ran me over :)
> 
> We have to init PSM on the main thread when an https entry is being queried.

We already do it at https://searchfox.org/mozilla-central/rev/919dce54f43356c22d6ff6b81c07ef412b1bf933/netwerk/protocol/http/nsHttpHandler.cpp#2105

What am I missing?
Flags: needinfo?(honzab.moz)
I think the code in EnsureNSSInitializedChromeOrContent() is wrong.  We ALWAYS try to init nss when this is called on non-MT with a sync message.

The static Atomic<bool> initialized(false) should be made accessible to the main thread as well (be in scope of the function, not just the branch) and set to true when we reach https://dxr.mozilla.org/mozilla-central/rev/b056526be38e96b3e381b7e90cd8254ad1d96d9d/security/manager/ssl/nsNSSComponent.cpp#119

this will avoid this particular race.

note that local static 'init once' logic is threadsafe these days on all platforms.
Flags: needinfo?(honzab.moz)
Assignee

Comment 28

2 years ago
Posted patch fixSplinter Review
(In reply to Honza Bambas (:mayhemer) from comment #26)
> I think the code in EnsureNSSInitializedChromeOrContent() is wrong.  We
> ALWAYS try to init nss when this is called on non-MT with a sync message.

You're right. Good catch!

> The static Atomic<bool> initialized(false) should be made accessible to the
> main thread as well (be in scope of the function, not just the branch) and
> set to true when we reach
> https://dxr.mozilla.org/mozilla-central/rev/
> b056526be38e96b3e381b7e90cd8254ad1d96d9d/security/manager/ssl/nsNSSComponent.
> cpp#119

Shouldn't it be set in all cases where the function returns true?
Flags: needinfo?(honzab.moz)
Attachment #8930525 - Flags: review?(ttaubert)
(In reply to Michal Novotny (:michal) from comment #28)
> Shouldn't it be set in all cases where the function returns true?

Yes, in all cases we return "yes, initialization succeeded".
Flags: needinfo?(honzab.moz)
Tracking for 59. Are other branches not affected? Is there a definite regression window?
Oh, I read the comments but apparently not the title.  So, from landing the work from bug 1392841. So, only affecting 59 since RCWN is only enabled by default in Nightly.
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #31)
> Oh, I read the comments but apparently not the title.  So, from landing the
> work from bug 1392841. So, only affecting 59 since RCWN is only enabled by
> default in Nightly.

To be exact, this is a latent bug introduced by bug 842818.  RCWN code just hits the race.  But we didn't have any report of this before so far (which doesn't mean there were not a case this happened to someone).
(In reply to Honza Bambas (:mayhemer) from comment #32)
> (In reply to Liz Henry (:lizzard) (needinfo? me) from comment #31)
> > Oh, I read the comments but apparently not the title.  So, from landing the
> > work from bug 1392841. So, only affecting 59 since RCWN is only enabled by
> > default in Nightly.
> 
> To be exact, this is a latent bug introduced by bug 842818.  RCWN code just
> hits the race.  But we didn't have any report of this before so far (which
> doesn't mean there were not a case this happened to someone).

"this" means 'some code path that might block the main thread so that background init of nss would indefinitely wait'.
Assignee: nobody → michal.novotny
Priority: -- → P1
Whiteboard: [necko-triaged]
Comment on attachment 8930525 [details] [diff] [review]
fix

Review of attachment 8930525 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM. Thanks for the fix!
Attachment #8930525 - Flags: review?(ttaubert) → review+
(In reply to Michal Novotny (:michal) from comment #23)
> Emanuel, this is awesome, thanks a lot!

No problem, glad I could help :)

Should this be considered for uplift in case there were other ways to trigger the hang?
Assignee

Updated

2 years ago
Keywords: checkin-needed

Comment 36

2 years ago
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ec39af7d2914
Firefox instahang on start after landing patch from bug #1392841. r=ttaubert
Keywords: checkin-needed

Comment 37

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/ec39af7d2914
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
I'm confirming that bug is fixed, starting in Mozilla Firefox Nightly 59.0a1 (2017-11-23), so I'm marking this bug as VERIFIED.
Thank you very much! \o/
Status: RESOLVED → VERIFIED

Comment 39

2 years ago
I still have this problem on latest Nightlies from time to time on old profile (but I cannot reproduce this on new). 
network.http.rcwn.enabled;false fixed this problem.
set MOZ_LOG=timestamp,sync,nsHttp:5,nsSocketTransport:5,nsStreamPump:5,nsHostResolver:5,cache2:5
set MOZ_LOG_FILE=%TEMP%\fflog.txt

Comment 40

2 years ago
Posted file fflog.txt

Comment 41

2 years ago
Posted file fflog.txt.child-2

Comment 42

2 years ago
Posted file fflog.txt.child-3
Can you take a look?
Flags: needinfo?(michal.novotny)
Assignee

Comment 44

2 years ago
From the log it seems to be the same problem, i.e. nsHttpChannel::SetupTransaction on main thread and CacheFileMetadata::GetElement(security-info) on cache IO thread. That's strange. Are you sure you're running latest nightly?

John, could you please try to crash firefox when it's hung using crashfirefox.exe (https://releases.mozilla.org/pub/utilities/crashfirefox-intentionally/)? This should generate crash report which you'll find in about:crashes. Then we can see the callstacks on both threads. Thanks.
Flags: needinfo?(michal.novotny) → needinfo?(svanderger)
Assignee

Comment 45

2 years ago
The same problem can happen in https://searchfox.org/mozilla-central/rev/281d99b3d342b71d0653b872345761372d38f5c1/security/manager/ssl/nsNSSModule.cpp#82. AFAICS, there are 3 services whose initialization on Cache IO thread could lead to the deadlock, nsCertOverrideService, nsSiteSecurityService and CertBlocklist.
Flags: needinfo?(svanderger)
Assignee

Updated

2 years ago
Status: VERIFIED → REOPENED
Resolution: FIXED → ---
Assignee

Comment 46

2 years ago
Posted patch additional fix (obsolete) — Splinter Review
Attachment #8936438 - Flags: review?(honzab.moz)
Comment on attachment 8936438 [details] [diff] [review]
additional fix

Review of attachment 8936438 [details] [diff] [review]:
-----------------------------------------------------------------

good catch.  in other words, it's all with "MainThreadOnly" restriction from the following block:

https://searchfox.org/mozilla-central/rev/281d99b3d342b71d0653b872345761372d38f5c1/security/manager/ssl/nsNSSModule.cpp#165-222

could you add a comment above (or maybe - as well - below) to update the net_EnsurePSMInit() function when something with the main thread restriction is added?

thanks.
Attachment #8936438 - Flags: review?(honzab.moz) → review+

Comment 48

2 years ago
(In reply to Michal Novotny (:michal) from comment #44)
> John, could you please try to crash firefox when it's hung using
> crashfirefox.exe
> (https://releases.mozilla.org/pub/utilities/crashfirefox-intentionally/)?
> This should generate crash report which you'll find in about:crashes. Then
> we can see the callstacks on both threads. Thanks.
Nighly 20171214100315
https://crash-stats.mozilla.com/report/index/208a8d33-5ca0-46fa-92e7-7ae340171214
Assignee

Comment 49

2 years ago
Attachment #8936438 - Attachment is obsolete: true
Attachment #8937029 - Flags: review+
Assignee

Updated

2 years ago
Keywords: checkin-needed
Assignee

Comment 50

2 years ago
(In reply to John from comment #48)
> Nighly 20171214100315
> https://crash-stats.mozilla.com/report/index/208a8d33-5ca0-46fa-92e7-
> 7ae340171214

John, thanks for the crash report. My assumption that EnsureNSSInitializedChromeOrContent() is called when psm service is initialized was wrong. So the first patch only reduced the probability of hitting the deadlock. The additional patch will eliminate it because the code added to net_EnsurePSMInit() will call it.

Comment 51

2 years ago
Pushed by csabou@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5eff2c0351a9
Firefox instahang on start after landing patch from bug #1392841, r=honzab
Keywords: checkin-needed
Assignee

Comment 53

2 years ago
Honza, it's not clear to me why the service initialization failed, but I think we can skip checking initialization of these services. What do you think?

https://treeherder.mozilla.org/#/jobs?repo=try&revision=1ce81fa90271e514a8ff1edbed1967f59948c953
Attachment #8937029 - Attachment is obsolete: true
Flags: needinfo?(michal.novotny)
Attachment #8937605 - Flags: review?(honzab.moz)
Comment on attachment 8937605 [details] [diff] [review]
additional fix v3

Yep, I think it's OK to just try and forget. Thanks for the comment.
Attachment #8937605 - Flags: review?(honzab.moz) → review+
Assignee

Updated

2 years ago
Keywords: checkin-needed

Comment 55

2 years ago
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/adadbeff32ec
Firefox instahang on start after landing patch from bug #1392841.vr=honzab
Keywords: checkin-needed

Comment 56

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/adadbeff32ec
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED
Thank you very much for follow-up patch! \o/
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.