Closed Bug 1358043 Opened 3 years ago Closed Last year

Crash in nsCacheService::Init

Categories

(Core :: Networking: Cache, defect, P2, critical)

53 Branch
All
Android
defect

Tracking

RESOLVED FIXED
mozilla65
Tracking Status
firefox-esr60 --- wontfix
firefox53 --- wontfix
firefox54 --- wontfix
firefox55 - wontfix
firefox63 --- wontfix
firefox64 --- wontfix
firefox65 --- fixed

People

(Reporter: skywalker333, Assigned: mayhemer)

Details

(Keywords: crash, reproducible, Whiteboard: [necko-next])

Attachments

(1 file)

This bug was filed from the Socorro interface and is report bp-58815f30-8ed8-42ed-a856-3f3cf0170419.
=============================================================
Hardware: ARM → All
This affects Firefox as well, and I see one Android crash in 55 so setting it as affected. If you go back one month in crash stats there are about 21 crashes.

Do we have any steps to reproduce?
Crash volume is still pretty low for this signature, ni on reporter to see if there are any STR.
Flags: needinfo?(skywalker333)
Kind of low volume, not sure it will be useful for relman to track this.
Skywalker has been filing bugs from the crash-stats server. I don't think they have been encountering the crashes.
I don't have any particular steps to reproduce (STR).

The crash occurred after having recently installed Firefox Aurora (54.0a2 2017-04-18).

Install Age 	587 seconds since version was first installed (9 minutes and 47 seconds)

I was browsing bugzilla.mozilla.org at the time of the crashes. The first crash occurred when I was looking at Bug 1164027. I experienced a crash with signature [ ElfLoader::~ElfLoader ] bp-b61973c4-6efa-43ed-9d36-25f700170419. 

Uptime 	551 seconds (9 minutes and 11 seconds)
Install Age 	551 seconds since version was first installed (9 minutes and 11 seconds)
Install Time 	2017-04-19 02:20:57
Product 	FennecAndroid
Release Channel 	aurora
Version 	54.0a2
Build ID 	20170418074655
OS 	Android
OS Version 	0.0.0 Linux 3.4.0-1974790 #1 SMP PREEMPT Fri Oct 25 08:41:54 KST 2013 armv7l
Android Version 	18 (REL)
Build Architecture 	arm
Build Architecture Info 	ARMv7 Qualcomm Krait features: swp,half,thumb,fastmult,vfpv2,edsp,neon,vfpv3,tls,vfpv4,idiva,idivt | 4
Android Manufacturer 	samsung
Android Model 	SM-N900W8
Related Bugs Bug 1164027
NEW --- intermittent PROCESS-CRASH | autophone-s1s2 | application crashed [@ ElfLoader::~ElfLoader]


Then I restarted firefox and experienced a second crash, this time with signature [ nsCacheService::Init ] bp-58815f30-8ed8-42ed-a856-3f3cf0170419.

Uptime 	7 seconds
Last Crash 	36 seconds before submission
Install Age 	587 seconds since version was first installed (9 minutes and 47 seconds)
Startup Crash 	False
MOZ_CRASH Reason 	MOZ_CRASH(Can't create cache IO thread)
Crash Reason 	SIGSEGV
Crash Address 	0x0
App Notes 	
FP(D00-L1010-W00000000-T010) EGL? EGL+ GL Context? GL Context+ AdapterDescription: 'Model: SM-N900W8, Product: hltevl, Manufacturer: samsung, Hardware: qcom, OpenGL: Qualcomm -- Adreno (TM) 330 -- OpenGL ES 3.0 V@45.0 AU@04.03.00.125.097 RVADDULA_AU_LINUX_ANDROID_JB_3.1.2.04.03.00.125.097+PATCH[ES]_msm8974_JB_3.1.2_CL3905453_release_ENGG (CL@3905453)'
GL Layers? GL Layers+ 
samsung SM-N900W8
samsung/hltevl/hltecan:4.3/JSS15J/N900W8VLUBMJ4:user/release-keys
Processor Notes
processor_ip-172-31-11-82_1318; MozillaProcessorAlgorithm2015; skunk_classifier: reject - not a plugin hang


bp-58815f30-8ed8-42ed-a856-3f3cf0170419  4/18/17 10:30 PM
bp-b61973c4-6efa-43ed-9d36-25f700170419  4/18/17 10:30 PM
Flags: needinfo?(skywalker333)
Has STR: --- → no
See Also: → 1164027
Looking at https://crash-stats.mozilla.com/signature/?signature=nsCacheService%3A%3AInit&date=%3E%3D2016-11-09T09%3A21%3A17.000Z&date=%3C2017-05-09T09%3A21%3A17.000Z#graphs

The number of crashes per day increased from 1 (Mar-Apr) up to 10-15 starting over halfway through April (Apr20-23?). For FennecAndroid only.
Signature report for nsCacheService::Init

Showing results from a month ago

Operating System
Android 	173 	92.0%
Windows 7 	10 	5.3%
Windows 10 	3 	1.6%
Windows 8.1	1 	0.5%
Windows XP 	1 	0.5%

Product
FennecAndroid 	53.0.1 	33 	45.8% 	35
FennecAndroid 	53.0 	20 	27.8% 	16
FennecAndroid 	53.0.2 	7 	9.7% 	9
FennecAndroid 	54.0a2 	2 	2.8% 	2
FennecAndroid 	54.0b2 	2 	2.8% 	1
FennecAndroid 	54.0b4 	1 	1.4% 	1

Uptime Range
< 1 min 	72 	38.3%
> 1 hour 	57 	30.3%
15-60 min 	21 	11.2%
1-5 min 	20 	10.6%
5-15 min 	18 	9.6%

Architecture
arm 	161 	85.6%
x86 	26 	13.8%
amd64 	1 	0.5%

Flash Version
[blank] 	188 	100.0%
A report came in on webcompat.com regarding a crash in fennec while on the hacks blog.

Bug report:
https://webcompat.com/issues/9010

Site URL:
https://hacks.mozilla.org/2017/06/new-css-grid-layout-panel-in-firefox-nightly/

I can consistently reproduce the crash, even after a restart, though others on the webcompat team can't at all. This is in Firefox 55 and 57, only 1 tab open and no other running applications.

My device is a Nexus 6, here's a report:
https://crash-stats.mozilla.com/report/index/2cd376d0-ce87-4b0a-844f-ed9160170817

Is there anything I can do to help here, since my device reproduces the crash just by scrolling or interacting with the page?
Flags: needinfo?(mozillamarcia.knous)
I was able to reproduce this on a Nexus 6 device as well, running release. 

It looks as if this happens using Firefox as well, but much less frequently than Fennec. Because the crash reason is listed as MOZ_CRASH(Can't create cache IO thread), I moved it into what I think is a better component.
Component: General → Networking: Cache
Flags: needinfo?(mozillamarcia.knous)
Product: Firefox for Android → Core
This crash is because NS_NewNamedThread fails, which... ugh. Jason, who has a few cycles to look at this and either (1) reproduce, or (2) create a try build with some debugging for those who can reproduce?
Flags: needinfo?(jduell.mcbugs)
Whiteboard: [necko-next]
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P2
Flags: needinfo?(jduell.mcbugs)
(In reply to Marcia Knous [:marcia - needinfo? me] from comment #9)
> I was able to reproduce this on a Nexus 6 device as well, running release. 
> 
> It looks as if this happens using Firefox as well, but much less frequently
> than Fennec. Because the crash reason is listed as MOZ_CRASH(Can't create
> cache IO thread), I moved it into what I think is a better component.

Hi Marcia,

Do you remember how to reproduce this crash?
If yes, could you provide detailed steps?

Thanks.
Flags: needinfo?(mozillamarcia.knous)
(In reply to Kershaw Chang [:kershaw] from comment #12)
> (In reply to Marcia Knous [:marcia - needinfo? me] from comment #9)
> > I was able to reproduce this on a Nexus 6 device as well, running release. 
> > 
> > It looks as if this happens using Firefox as well, but much less frequently
> > than Fennec. Because the crash reason is listed as MOZ_CRASH(Can't create
> > cache IO thread), I moved it into what I think is a better component.
> 
> Hi Marcia,
> 
> Do you remember how to reproduce this crash?
> If yes, could you provide detailed steps?
> 
> Thanks.

Hello Kershaw - I don't recall how I was able to reproduce since it was so long ago - sorry.
Flags: needinfo?(mozillamarcia.knous)
Assignee: nobody → odvarko
Assignee: odvarko → honzab.moz
Note that when we fail to create an I/O thread in cache2, we switch to a memory-only mode. We fail at [1] and then, because gInstance is missing, we gracefully fail all I/O.

Surprisingly, *all* [2] the code in cache1 is already prepared for a missing I/O thread, and the cache2 links to cache1 handle it gracefully as well [3].

The fix here is simply to turn the crash into a warning that we can ignore and live with.


[1] https://searchfox.org/mozilla-central/rev/c0b26c40769a1e5607a1ae8be37fe64df64fc55e/netwerk/cache2/CacheFileIOManager.cpp#1216
[2] https://searchfox.org/mozilla-central/search?q=symbol:F_%3CT_nsCacheService%3E_mCacheIOThread&redirect=false
[3] https://searchfox.org/mozilla-central/rev/c0b26c40769a1e5607a1ae8be37fe64df64fc55e/netwerk/cache2/OldWrappers.cpp#714-734
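
A minimal standalone sketch of the change proposed above, not the actual patch. Every name here (`CacheServiceSketch`, `IoThread`, `ThreadFactory`, `Result`) is hypothetical, and plain C++ stands in for the Gecko thread and nsresult APIs. It only illustrates the idea: initialization logs a warning instead of aborting when the I/O thread cannot be created, and later dispatches fail gracefully on the missing thread.

```cpp
#include <cstdio>
#include <functional>
#include <memory>

// Hypothetical stand-ins for nsresult codes; not the Gecko definitions.
enum class Result { Ok, NotAvailable };

// Hypothetical stand-in for the cache I/O thread: runs work inline.
struct IoThread {
  void Dispatch(const std::function<void()>& aTask) { aTask(); }
};

// A factory that may fail (return nullptr), as thread creation can when the
// OS is out of resources.
using ThreadFactory = std::function<std::unique_ptr<IoThread>()>;

class CacheServiceSketch {
 public:
  // Before the fix, a failure here was fatal (MOZ_CRASH). In this sketch it
  // is only a warning and the service keeps running without the thread.
  Result Init(const ThreadFactory& aFactory) {
    mIoThread = aFactory();
    if (!mIoThread) {
      std::fprintf(stderr, "WARNING: can't create cache IO thread\n");
    }
    return Result::Ok;
  }

  // Every user of the thread must tolerate it being absent.
  Result DispatchToIoThread(const std::function<void()>& aTask) {
    if (!mIoThread) {
      return Result::NotAvailable;
    }
    mIoThread->Dispatch(aTask);
    return Result::Ok;
  }

 private:
  std::unique_ptr<IoThread> mIoThread;
};
```

The design choice is that `Init` still reports success, so startup continues, while each caller sees an ordinary error code rather than a null dereference.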
Status: NEW → ASSIGNED
Attached patch v1
Attachment #9025397 - Flags: review?(michal.novotny)
Michal, see comment 14 for the rationale.

There is no need to push this to try; there is no realistic scenario in which this could actually trigger on our test infra.
Attachment #9025397 - Flags: review?(michal.novotny) → review+
(In reply to Honza Bambas (:mayhemer) from comment #17)
> Just in case:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=27731e242a1d5980c8a0565b722b91aeb6c40cb1

To explain, this is a simulated push with the old cache io thread missing (being null).  I wanted to check for possible other crashes in case I missed any non-null checks.

Assertion failure: ((bool)(__builtin_expect(!!(!NS_FAILED_impl(rv)), 1))) (Unexpected state), at /builds/worker/workspace/build/src/netwerk/protocol/http/nsHttpChannel.cpp:855 is fine (we wait only for "normal" cache entry, no hangs expected)
No crashes on try.
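
The simulated push described above amounts to fault injection: force the thread pointer to null and exercise the code paths that use it. A rough standalone illustration follows, assuming hypothetical names (`OldCacheSketch`, `Status`, `CountUnguardedEntryPoints`) and only two representative entry points rather than the real call sites.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical status codes; not the Gecko nsresult values.
enum class Status { Ok, NotAvailable };

// Hypothetical stand-in for the old cache backend. mIoThread mimics the
// thread pointer; the simulated run leaves it null on purpose.
class OldCacheSketch {
 public:
  explicit OldCacheSketch(bool aSimulateMissingThread)
      : mIoThread(aSimulateMissingThread ? nullptr : &mThreadStorage) {}

  Status OpenEntry(const std::string& aKey) {
    if (!mIoThread) return Status::NotAvailable;  // non-null check under test
    mIoThread->push_back("open " + aKey);
    return Status::Ok;
  }

  Status EvictAll() {
    if (!mIoThread) return Status::NotAvailable;  // non-null check under test
    mIoThread->push_back("evict");
    return Status::Ok;
  }

 private:
  std::vector<std::string> mThreadStorage;  // queue standing in for a thread
  std::vector<std::string>* mIoThread;
};

// Runs every listed entry point against a cache whose thread is missing and
// counts the ones that did not fail gracefully.
int CountUnguardedEntryPoints() {
  OldCacheSketch cache(/* aSimulateMissingThread = */ true);
  std::vector<std::function<Status()>> entryPoints = {
      [&] { return cache.OpenEntry("https://example.org/"); },
      [&] { return cache.EvictAll(); },
  };
  int unguarded = 0;
  for (const auto& call : entryPoints) {
    if (call() != Status::NotAvailable) ++unguarded;
  }
  return unguarded;
}
```

In real code a missed check would show up as a crash on the null pointer rather than as a wrong return value, which is what the try run was looking for.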
Keywords: checkin-needed
Pushed by aciure@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d21e9cf5a196
Produce only warning when appcache/old cache backend I/O thread can't be created for lack of resources, r=michal
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/d21e9cf5a196
Status: ASSIGNED → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
Seems simple enough, please nominate this for Beta/ESR60 approval.
I'm not sure we want to pass this to ESR.  There still could be some corner case we haven't discovered yet that may cause a crash (or instability) somewhere in the cache or its consuming code when the thread is missing.  I'd rather push this only up to beta.  Note that this mainly affects Android, because of its lack of OS resources, and not desktop.
Flags: needinfo?(honzab.moz)
Comment on attachment 9025397 [details] [diff] [review]
v1

[Beta/Release Uplift Approval Request]

Feature/Bug causing the regression: none

User impact if declined: Early startup crash when the machine is out of memory/handles (on low end HW, specifically mobile)

Is this code covered by automated tests?: No

Has the fix been verified in Nightly?: Yes

Needs manual test from QE?: No

If yes, steps to reproduce: This is hard to repro.  You would need hardware with a low enough number of free thread handles to reproduce it, and then try to go on...

List of other uplifts needed: None

Risk to taking this patch: Medium

Why is the change risky/not risky? (and alternatives if risky): I would rather be a bit cautious here, since we may still be missing some code path or check, which would cause a crash or some unexpected state when the thread is missing.  Also, when we are that low on resources, we will likely crash somewhere else soon anyway...  maybe this was an accidental 'safe check' we just removed...

String changes made/needed: none
Attachment #9025397 - Flags: approval-mozilla-beta?
This is very low volume, we can let it ride the trains.
Attachment #9025397 - Flags: approval-mozilla-beta? → approval-mozilla-beta-