Closed Bug 1339617 Opened 7 years ago Closed 7 years ago

Crash in nsTArray_Impl<T>::InsertElementsAt<T> | nsCOMArray_base::InsertObjectsAt

Categories

(Core :: Networking: Cache, defect)

52 Branch
x86
Windows
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla54
Tracking Status
firefox-esr45 --- unaffected
firefox51 --- unaffected
firefox52 + fixed
firefox-esr52 --- fixed
firefox53 + fixed
firefox54 + fixed

People

(Reporter: philipp, Assigned: froydnj)

References

Details

(Keywords: crash, regression, Whiteboard: [necko-active])

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-0ef58075-e476-4e07-856c-c7b3a2170214.
=============================================================
Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	xul.dll 	nsTArray_Impl<nsISupports*, nsTArrayInfallibleAllocator>::InsertElementsAt<nsISupports*, nsTArrayInfallibleAllocator, nsTArrayInfallibleAllocator>(unsigned int, nsTArray_Impl<nsISupports*, nsTArrayInfallibleAllocator> const&) 	xpcom/glue/nsTArray.h:1521
1 	xul.dll 	nsCOMArray_base::InsertObjectsAt(nsCOMArray_base const&, int) 	xpcom/glue/nsCOMArray.cpp:163
2 	xul.dll 	nsDeleteDir::Shutdown(bool) 	netwerk/cache/nsDeleteDir.cpp:89
3 	xul.dll 	nsCacheService::Shutdown() 	netwerk/cache/nsCacheService.cpp:1265
4 	xul.dll 	nsCacheProfilePrefObserver::Observe(nsISupports*, char const*, char16_t const*) 	netwerk/cache/nsCacheService.cpp:410
5 	xul.dll 	nsObserverList::NotifyObservers(nsISupports*, char const*, char16_t const*) 	xpcom/ds/nsObserverList.cpp:112
6 	xul.dll 	nsObserverService::NotifyObservers(nsISupports*, char const*, char16_t const*) 	xpcom/ds/nsObserverService.cpp:281
7 	xul.dll 	mozilla::ShutdownXPCOM(nsIServiceManager*) 	xpcom/build/XPCOMInit.cpp:910
8 	xul.dll 	ScopedXPCOMStartup::~ScopedXPCOMStartup() 	toolkit/xre/nsAppRunner.cpp:1415
9 	xul.dll 	mozilla::UniquePtr<ScopedXPCOMStartup, mozilla::DefaultDelete<ScopedXPCOMStartup> >::reset(ScopedXPCOMStartup*) 	obj-firefox/dist/include/mozilla/UniquePtr.h:345
10 	xul.dll 	XREMain::XRE_main(int, char** const, mozilla::BootstrapConfig const&) 	toolkit/xre/nsAppRunner.cpp:4666
11 	xul.dll 	XRE_main(int, char** const, mozilla::BootstrapConfig const&) 	toolkit/xre/nsAppRunner.cpp:4729
12 	xul.dll 	mozilla::BootstrapImpl::XRE_main(int, char** const, mozilla::BootstrapConfig const&) 	toolkit/xre/Bootstrap.cpp:45
13 	firefox.exe 	wmain 	toolkit/xre/nsWindowsWMain.cpp:115
14 	firefox.exe 	__scrt_common_main_seh 	f:/dd/vctools/crt/vcstartup/src/startup/exe_common.inl:253
15 	kernel32.dll 	BaseThreadInitThunk 	
16 	ntdll.dll 	__RtlUserThreadStart 	
17 	ntdll.dll 	_RtlUserThreadStart

this is a regressing crash signature on shutdown which is currently only around in 32bit builds on windows. 
https://mozilla.github.io/stab-crashes/supergraph.html?s=nsTArray_Impl%3CT%3E%3A%3AInsertElementsAt%3CT%3E+%7C+nsCOMArray_base%3A%3AInsertObjectsAt

this first started to happen in 54.0a1 sometime around build 20170208030203, in 53.0a2 in build 20170210004018 and later and since 52.0b5. 
common patches that have landed in the three channels in this regression range would be:
bug 1330687 - Unaccessible
bug 1335054 - Intermittent dom/indexedDB/test/test_file_sharing.html | application crashed [@ RefPtr<mozilla::dom::BlobImpl>::operator->][@ ObjectStoreAddOrPutRequestOp::DoDatabaseWork]
bug 1325511 - Unaccessible
bug 1335349 - Allow opening a survey URL after a user chooses undo / "Don't Keep" after automatic migration of data from another browser
bug 995069 - Open /dev/urandom in JS engine conflicts with Sandbox
bug 1337301 - Telemetry to support background video decoder suspend: extend the resolution of VIDEO_AS_CONTENT_SOURCE telemetry
bug 1321707 - Intermittent w3c-css/submitted/ui3/box-sizing-replaced-001.xht == w3c-css/submitted/ui3/box-sizing-replaced-001-ref.xht | image comparison, max difference: 255, number of differing pixels: 900
bug 1336654 - Permafailing dom/security/test/contentverifier/browser_verify_content_about_newtab.js | Valid remote newtab page must have built-in CSP. - Got {}, expected {"csp-policies":[{"report-only":false,"script-src":["https://example.com","'unsafe-inline'"]
bug 1334972 - Crash in memmove | nsTArray_base<T>::ShiftData<T> | nsTArray_Impl<T>::InsertElementAt<T> | mozilla::a11y::Accessible::MoveChild
bug 1334093 - Test jobs should always extract the mach script and mozinfo.json (for manifestparser) from common.tests.zip
bug 1329752 - ESR - Configure e10s qualification criteria for ESR52
bug 1328643 - Forgive nsITimer::Cancel/Init on threads other than the target
bug 1322970 - High CPU usage on Kickstarter page, with SVG/SMIL animation (causing unresponsive scrollbar on Windows)
bug 1335193 - Switch the alert emails for scroll tracking histograms
Component: Untriaged → Networking: Cache
Bug 1339617 seems to be a regression from  bug 1328643's landing.  in the cache shutdown it's in a loop deleting timers, so this is almost certainly the cause of the regression.  The calling code may be making bad assumptions: -- that never happens. ;-)

Byron is on PTO the next few days, and time in 52 is short, so nathan - can you look at the calling code and crashes?  Thanks
Depends on: 1328643
Flags: needinfo?(nfroyd)
Flags: needinfo?(docfaraday)
Ugh.  So the problem here is that when we Cancel() the timer now, we null out any closure that's been passed along to the timer.  The code in the networking cache (the lone user of GetClosure!) is expecting that we keep the closure alive past the Cancel() call.

So the easy fix is to move to call to GetClosure prior to the Cancel() call.

Weren't we supposed to be using the new cache2 stuff everywhere now?
Flags: needinfo?(nfroyd) → needinfo?(honzab.moz)
(In reply to Nathan Froyd [:froydnj] from comment #2)
> Ugh.  So the problem here is that when we Cancel() the timer now, we null
> out any closure that's been passed along to the timer.  The code in the
> networking cache (the lone user of GetClosure!) is expecting that we keep
> the closure alive past the Cancel() call.
> 
> So the easy fix is to move to call to GetClosure prior to the Cancel() call.

I think it is.

> 
> Weren't we supposed to be using the new cache2 stuff everywhere now?

The old cache back end can be started by appcache (when hit on a web page) or (would have to check) can search for trashes regardless if triggered by appcache, which starts the timer.
Flags: needinfo?(honzab.moz)
This is just the fix that was described in comment 2.
Attachment #8838066 - Flags: review?(honzab.moz)
[Tracking Requested - why for this release]: crashes caused by uplifting a fix (bug 1328643) for other, unrelated crashes.
Tracking for 52/53/54 based on Comment 5.
Whiteboard: [necko-active]
Assignee: nobody → nfroyd
Looks like you guys have this under control!
Flags: needinfo?(docfaraday)
Comment on attachment 8838066 [details] [diff] [review]
move GetClosure call prior to canceling timer

Review of attachment 8838066 [details] [diff] [review]:
-----------------------------------------------------------------

thanks!
Attachment #8838066 - Flags: review?(honzab.moz) → review+
Pushed by nfroyd@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d540146b50b3
move GetClosure call prior to canceling timer; r=mayhemer
https://hg.mozilla.org/mozilla-central/rev/d540146b50b3
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Comment on attachment 8838066 [details] [diff] [review]
move GetClosure call prior to canceling timer

Approval Request Comment
[Feature/Bug causing the regression]: bug 1328643
[User impact if declined]: crashes
[Is this code covered by automated tests?]: if it is, it's not catching the particular scenario that's causing the crashes.
[Has the fix been verified in Nightly?]: Not yet.  The crashes that we've been seeing should disappear, though.
[Needs manual test from QE? If yes, steps to reproduce]: No known steps to reproduce.
[List of other uplifts needed for the feature/fix]: None.
[Is the change risky?]: No.
[Why is the change risky/not risky?]: This is a small and well-understood change.
[String changes made/needed]: None.
Attachment #8838066 - Flags: approval-mozilla-beta?
Attachment #8838066 - Flags: approval-mozilla-aurora?
Comment on attachment 8838066 [details] [diff] [review]
move GetClosure call prior to canceling timer

fix regression from recent uplift in aurora53 and beta52
Attachment #8838066 - Flags: approval-mozilla-beta?
Attachment #8838066 - Flags: approval-mozilla-beta+
Attachment #8838066 - Flags: approval-mozilla-aurora?
Attachment #8838066 - Flags: approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.