Closed Bug 1267820 Opened 8 years ago Closed 8 years ago

Intermittent leakcheck | default process: 2000 bytes leaked (CacheStorageService, CancelableRunnable, CondVar, Mutex, Runnable, ...)

Categories

(Core :: Networking: Cache, defect, P3)

49 Branch
defect

Tracking


RESOLVED FIXED
mozilla52
Tracking Status
firefox48 --- wontfix
firefox49 --- wontfix
firefox50 --- fixed
firefox51 --- fixed
firefox52 --- fixed

People

(Reporter: KWierso, Assigned: mayhemer)

References

Details

(Keywords: intermittent-failure, Whiteboard: [necko-active])

Attachments

(1 file)

Component: XPCOM → Networking: Cache
Honza, do you or Michal have the cycles to take this on (either during this work week, or after?)
Flags: needinfo?(honzab.moz)
Whiteboard: [necko-next]
New leaks and regressions should be active. It's OK to play hot potato with the owner, using jduell as a backstop if need be.
Assignee: nobody → honzab.moz
Whiteboard: [necko-next] → [necko-active]
I have a strong suspicion this is caused by your predictor stuff, Nick.  But I can take a look next week.
Flags: needinfo?(honzab.moz)
Yeah, that's possible (I'm not going to write it off entirely), though the predictor's use of the cache storage service hasn't changed at all, so it would have to be something more indirect (the same goes for the other leaked items - no direct changes in how the predictor uses those types). I do agree the timing is exceedingly suspicious, though.
Blocks: 1271944
Assignee: honzab.moz → hurley
Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
Summary: Intermittent tc-M(7) leakcheck | default process: 2000 bytes leaked (CacheStorageService, CancelableRunnable, CondVar, Mutex, Runnable, ...) → Intermittent leakcheck | default process: 2000 bytes leaked (CacheStorageService, CancelableRunnable, CondVar, Mutex, Runnable, ...)
:hurley, it has been almost 4 months and this is one of the top intermittents we have in all of Mozilla - what are the next steps here?

This got noticeably worse on Friday, August 5th, and as of Friday, August 12th, it remains a top issue.
Flags: needinfo?(hurley)
(In reply to Joel Maher ( :jmaher) from comment #47)
> :hurley, it has been almost 4 months and this is one of the top
> intermittents we have in all of Mozilla - what are the next steps here?

I was finally able to get a reproduction with refcounting logs yesterday (thanks to :swu), so I'm poring over those now, trying to make sense of things. For something so common on CI, this had been impossible for me to reproduce locally or on a loaner from releng.

> This got noticeably worse on Friday, August 5th, and as of Friday,
> August 12th, it remains a top issue.

Yeah, that's the weird thing - I didn't change anything then, but I did notice the spike. I presume it's just that something outside my control started hitting the path that causes the leak more often, but I have zero proof of that.
Flags: needinfo?(hurley)
Just looking at what is leaking, I think the most likely cause of this is code trying to dispatch a runnable that holds onto a CacheStorageService after the thread service has shut down.
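For illustration, the failure mode described here has roughly the following shape; everything below (SomeCacheService, ServiceJob, ScheduleWork) is a placeholder sketch, not actual Necko code:

#include "mozilla/RefPtr.h"
#include "nsIRunnable.h"
#include "nsISupportsImpl.h"
#include "nsIThread.h"
#include "nsThreadUtils.h"

// Stand-in for the refcounted service that ends up leaked.
class SomeCacheService {
public:
  NS_INLINE_DECL_THREADSAFE_REFCOUNTING(SomeCacheService)
  void DoWork() {}
private:
  ~SomeCacheService() = default;
};

// A runnable that keeps the service alive until it runs.
class ServiceJob final : public nsIRunnable {
public:
  NS_DECL_THREADSAFE_ISUPPORTS

  explicit ServiceJob(SomeCacheService* aService) : mService(aService) {}

  NS_IMETHOD Run() override {
    mService->DoWork();
    return NS_OK;  // the strong reference drops when the runnable is destroyed
  }

private:
  virtual ~ServiceJob() = default;
  RefPtr<SomeCacheService> mService;  // strong reference held by the runnable
};

NS_IMPL_ISUPPORTS(ServiceJob, nsIRunnable)

void ScheduleWork(nsIThread* aTargetThread, SomeCacheService* aService) {
  RefPtr<ServiceJob> job = new ServiceJob(aService);
  nsresult rv = aTargetThread->Dispatch(job.forget(), NS_DISPATCH_NORMAL);
  if (NS_FAILED(rv)) {
    // The target thread has already shut down: the runnable never runs, and
    // depending on how this failure is handled, the runnable and the service
    // it references can survive long enough for leakcheck to report them.
  }
}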
I triggered this locally today once by running the dom/workers/test/serviceworkers mochitests.  I'm not sure if it only triggers in this directory, but just mentioning it as a data point.

That being said, InterceptedChannel does have this:

https://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/InterceptedChannel.h#71

But I don't see any InterceptedChannel in the set of leaked objects.
Indeed, I've tracked it down to somewhere in the chain of stuff under Predictor::UpdateCacheability. That does indeed have a runnable associated with it, but it's currently unclear to me exactly where things go off the rails (for example, if the runnable has no real work to do, the leak never happens, even though the runnable runs). I'm inspecting and commenting out larger and larger swaths of code to hopefully track down the bits that get us into trouble.
Nick, the connection of these leaks to the predictor is a theory, right?  Have you considered carefully hard-disabling the predictor for a few days on m-c to see if these leaks really go away?  Maybe we are chasing something that isn't there.  There are almost two weeks until the next merge, so there is enough time to get data without letting this slip to m-a.
Flags: needinfo?(hurley)
(In reply to Honza Bambas (:mayhemer) from comment #53)
> Nick, the connection of these leaks to the predictor is a theory, right?
> Have you considered carefully hard-disabling the predictor for a few days
> on m-c to see if these leaks really go away?  Maybe we are chasing
> something that isn't there.  There are almost two weeks until the next
> merge, so there is enough time to get data without letting this slip to
> m-a.

These are specifically related to prefetch. I'm pretty comfortable in that evaluation. Prefetch is disabled on aurora, beta, and release, and none of these leaks show up on those branches - only on nightly.

Like I alluded to in comment 52, if I early-return from Predictor::UpdateCacheability, the leak goes away. Same if we never find the right thing to update the cacheability of. Now it's just a game of bisecting which bit of code triggered from UpdateCacheability is directly responsible for the leak so I can figure it out. It could be in the UpdateCacheability stack directly, or it could be a bug in the code that actually performs the prefetches (which only runs if UpdateCacheability runs).
Flags: needinfo?(hurley)
Do we have an xpcom shutdown observer that spins the event loop if necko work is still outstanding?  That's how we handle these issues in other subsystems.
(In reply to Ben Kelly [:bkelly] from comment #55)
> Do we have an xpcom shutdown observer that spins the event loop if necko
> work is still outstanding?  Thats how we handle these issues in other
> subsystems.

The work the predictor does directly is on the main thread, so as long as the main thread clears out its queue, that part is good. However, this code also interacts with the cache thread and the socket thread, and I don't know offhand if either of those have a shutdown observer to spin the loop. Honza, I know you'll know about the cache thread, and I suspect you'll have a good answer about the socket thread, too.
Flags: needinfo?(honzab.moz)
Now I don't follow.  Is it good to have an observer that spins or is it bad?  From my judgment it would do more harm than good.  Also, how would that be related to leaks?
(In reply to Honza Bambas (:mayhemer) from comment #57)
> Now I don't follow.  Is it good to have an observer that spins or is it bad?
> From my judgment it would do more harm than good.  Also, how would that be
> related to leaks?

AFAIK it's the only way to safely shut down async work.  It's what we do in our quota-based storage APIs:

https://dxr.mozilla.org/mozilla-central/source/dom/quota/ActorsParent.cpp#2449

It's related to leaks in that spinning the event loop there allows you to block shutdown until your async work is done.

Spinning is bad if it can be triggered directly from JS; it can break run-to-completion, etc.  Generally you don't want to spin.  However, spinning to clean up at shutdown seems like the one place it's necessary and acceptable to me.
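For reference, the pattern Ben describes usually has this shape (a minimal sketch with a placeholder class; the real quota code is behind the ActorsParent.cpp link above, and the exact shutdown topic used here is an assumption):

#include <string.h>

#include "nsIObserver.h"
#include "nsISupportsImpl.h"
#include "nsThreadUtils.h"

class ShutdownDrainer final : public nsIObserver {
public:
  NS_DECL_ISUPPORTS

  NS_IMETHOD Observe(nsISupports* /* aSubject */, const char* aTopic,
                     const char16_t* /* aData */) override {
    if (!strcmp(aTopic, "profile-before-change")) {
      // Block shutdown here until every pending async operation has posted
      // its completion back to the main thread and been processed.
      while (mPendingOperations > 0) {
        NS_ProcessNextEvent(nullptr /* current thread */, /* aMayWait */ true);
      }
    }
    return NS_OK;
  }

  // Bookkeeping, called on the main thread as async work starts/finishes.
  void OperationStarted() { ++mPendingOperations; }
  void OperationFinished() { --mPendingOperations; }

private:
  ~ShutdownDrainer() = default;
  uint32_t mPendingOperations = 0;
};

NS_IMPL_ISUPPORTS(ShutdownDrainer, nsIObserver)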
(In reply to Ben Kelly [:bkelly] from comment #58)
> (In reply to Honza Bambas (:mayhemer) from comment #57)
> > Now I don't follow.  Is it good to have an observer that spins or is it bad?
> > From my judgment it would do more harm than good.  Also, how would that be
> > related to leaks?
> 
> AFAIK it's the only way to safely shut down async work.  It's what we do
> in our quota-based storage APIs:
> 
> https://dxr.mozilla.org/mozilla-central/source/dom/quota/ActorsParent.cpp#2449

This might be needed for more complex stuff, yes (though it sounds to me a bit like a bad design).

> 
> It's related to leaks in that spinning the event loop there allows you to
> block shutdown until your async work is done.
> 
> Spinning is bad if it can be triggered directly from JS; it can break
> run-to-completion, etc.  Generally you don't want to spin.  However,
> spinning to clean up at shutdown seems like the one place it's necessary
> and acceptable to me.

Nothing like this is needed for the cache IO thread.  XPCOM shutdown spins the loop for us later anyway (correctly!), so anything scheduled from the IO thread to run back on the main thread will execute.  Before that, we block (no spinning) for the cache thread to finish.  And we don't intend to change that, because there is no need to: cache IO shutdown doesn't need to loop through the main thread to actually finish its work.
Flags: needinfo?(honzab.moz)
So I've narrowed the leak down to one line (within the predictor): http://searchfox.org/mozilla-central/rev/6b94deded39a868f248a492e74e05733d6c4ed48/netwerk/base/Predictor.cpp#2418

If I comment out that line, the leak goes away. If I leave it in, the leak stays. Even if I do *nothing* in the OnCacheEntryAvailable associated with that AsyncOpenURI, the leak stays. Even if I return ENTRY_NOT_WANTED in the OnCacheEntryCheck associated with that AsyncOpenURI, the leak stays.

This is starting to look more and more like a shutdown race as bkelly mentioned in comment 55 and comment 58. Honza, I know you said it didn't seem possible, but can you take a look at the line linked above and make sure I'm not doing anything stupid there, or something unexpected from the cache service's point of view? If the code in the predictor looks sane, I think we need to dig into timing in the cache, which I'll need your help with. I'll kick off a try run with cache logging enabled so we can get some logs from the leak in case we end up needing it.
Flags: needinfo?(honzab.moz)
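For context, the kind of call the Predictor.cpp line linked above makes has roughly this shape (a paraphrased sketch; the exact flags and id extension the predictor passes are not asserted here, and OpenForCacheabilityUpdate is a made-up name):

#include "nsCOMPtr.h"
#include "nsICacheEntryOpenCallback.h"
#include "nsICacheStorage.h"
#include "nsIURI.h"
#include "nsString.h"

nsresult OpenForCacheabilityUpdate(nsICacheStorage* aStorage, nsIURI* aURI,
                                   nsICacheEntryOpenCallback* aCallback) {
  // The callback's OnCacheEntryCheck/OnCacheEntryAvailable fire asynchronously
  // later; per the comment above, the leak shows up even when they do nothing.
  nsAutoCString idExtension;  // empty id extension
  return aStorage->AsyncOpenURI(aURI, idExtension,
                                nsICacheStorage::OPEN_READONLY, aCallback);
}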
Try run with cache logging mentioned above: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f56e3b5f35a8 (I'll likely need to retrigger some mochitest-1 jobs before failures with logs show up)
Nick, thanks!  Will look at the logs when available.
I think I know the cause.  We probably schedule a timer (purge timer) in the service after it has been shutdown.  I will have a patch soon.
Assignee: hurley → honzab.moz
Flags: needinfo?(honzab.moz)
As that seems unlikely (I already checked on that in the past and rejected it as a cause), I would like to see the log anyway.
Cause:

2016-09-19 18:57:42.255602 UTC - [Cache2 I/O]: D/cache2 CacheStorageService::SchedulePurgeOverMemoryLimit
2016-09-19 18:57:43.260031 UTC - [Cache2 I/O]: D/cache2 CacheStorageService::Notify
2016-09-19 18:57:46.154352 UTC - [Cache2 I/O]: D/cache2 CacheStorageService::ShutdownBackground
2016-09-19 18:57:46.154709 UTC - [Cache2 I/O]: D/cache2   mMemorySize=259780 (+172,-0)
2016-09-19 18:57:46.155910 UTC - [Cache2 I/O]: D/cache2 CacheStorageService::SchedulePurgeOverMemoryLimit
2016-09-19 18:57:46.155988 UTC - [Cache2 I/O]: D/cache2 CacheStorageService::SchedulePurgeOverMemoryLimit

The timer is never notified, nor is it canceled and released.  Hence the Timer<->Service cycle and the leak.


Nick, just FYI, using the --set-env try arg to set up logging is not the appropriate way.  Use the MOZ_LOG variable in testing/mochitest/runtests.py, which correctly handles the overlapping logs produced when multiple processes all see the same MOZ_LOG_FILE env var.  That was the source of my initial confusion here.
Status: NEW → ASSIGNED
Attached patch v1 (Splinter Review)
- no longer engage the purge timer after mShutdown = true (set in CacheStorageService::Shutdown, which is synchronized before CacheStorageService::ShutdownBackground); a simplified sketch of this guard follows below
- I also added locking to all places where we work with the timer (except the one non-null check; the comment already there explains it's intentional)

https://treeherder.mozilla.org/#/jobs?repo=try&revision=5532246b1395 (leak checking try push)
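Roughly, the guard described in the first bullet looks like this (a simplified stand-in class, not the actual patch; the real change lives in CacheStorageService):

#include "mozilla/Mutex.h"
#include "nsCOMPtr.h"
#include "nsComponentManagerUtils.h"
#include "nsITimer.h"

class PurgeScheduler final : public nsITimerCallback {
public:
  NS_DECL_THREADSAFE_ISUPPORTS

  PurgeScheduler() : mLock("PurgeScheduler.mLock"), mShutdown(false) {}

  NS_IMETHOD Notify(nsITimer*) override {
    mozilla::MutexAutoLock lock(mLock);
    mPurgeTimer = nullptr;  // firing breaks the timer<->callback cycle
    // ... purge over the memory limit here ...
    return NS_OK;
  }

  void Shutdown() {
    mozilla::MutexAutoLock lock(mLock);
    mShutdown = true;
    if (mPurgeTimer) {
      mPurgeTimer->Cancel();
      mPurgeTimer = nullptr;
    }
  }

  void SchedulePurgeOverMemoryLimit() {
    mozilla::MutexAutoLock lock(mLock);
    if (mShutdown) {
      // The fix: after Shutdown() nothing may arm the timer again; otherwise
      // the timer holds a strong ref to us (its callback), we hold the timer,
      // and with nobody left to Notify() or Cancel() it, both objects leak.
      return;
    }
    if (!mPurgeTimer) {
      mPurgeTimer = do_CreateInstance("@mozilla.org/timer;1");
      if (mPurgeTimer) {
        // Arbitrary one-second delay for the sketch.
        mPurgeTimer->InitWithCallback(this, 1000, nsITimer::TYPE_ONE_SHOT);
      }
    }
  }

private:
  ~PurgeScheduler() = default;
  mozilla::Mutex mLock;
  bool mShutdown;
  nsCOMPtr<nsITimer> mPurgeTimer;
};

NS_IMPL_ISUPPORTS(PurgeScheduler, nsITimerCallback)

Checking mShutdown under the same lock that Shutdown() takes is what closes the race: the IO thread can no longer arm the timer in the window between the shutdown flag being set and the existing timer being canceled.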
Attachment #8792887 - Flags: review?(michal.novotny)
Attachment #8792887 - Flags: review?(michal.novotny) → review+
(In reply to Honza Bambas (:mayhemer) from comment #78)
> Leak's gone

I'm wishing *so* hard for emoji support in bugzilla right now. Woo, Honza!
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/9a370db31b85
Fix HTTP cache2 leak after shutdown, don't setup purging timer after shutdown. r=michal
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/9a370db31b85
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Please request Aurora/Beta approval on this when you get a chance.
Flags: needinfo?(honzab.moz)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #84)
> Please request Aurora/Beta approval on this when you get a chance.

I'm not convinced we should do this - while this has been a latent bug, it's only (to our knowledge) triggered by predictor prefetch stuff, which is disabled everywhere but nightly - hence no leaks on aurora or beta in the orangefactor reports. I'm not convinced it's worth the risk of uplifting a patch. Then again, I didn't write the patch, so take what I say with somewhere between a grain of salt and an entire salt lick :D
OrangeFactor shows this hitting on Aurora50 throughout that cycle, so consider me confused then :)
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&endday=2016-09-19&startday=2016-08-01&tree=mozilla-aurora
This is independent of predictor/prefetch.  The predictor only helped to narrow it down because it apparently stresses the cache2 back end during shutdown much more than when it's not turned on.  Will request approvals.
Flags: needinfo?(honzab.moz)
Comment on attachment 8792887 [details] [diff] [review]
v1

Approval Request Comment
[Feature/regressing bug #]: HTTP cache2
[User impact if declined]: none (it's just a leak, not causing any secondary trouble), although, generally speaking, this frequent intermittent failure may well lead to overlooking something more important
[Describe test coverage new/current, TreeHerder]: just landed on m-c
[Risks and why]: zero, the patch just doesn't allow a certain non-critical action after the shutdown mark
[String/UUID change made/needed]: none
Attachment #8792887 - Flags: approval-mozilla-beta?
Attachment #8792887 - Flags: approval-mozilla-aurora?
Hi Wes, could you please confirm this fixes the intermittent? That will help me decide whether to uplift this or not. Thanks!
Flags: needinfo?(wkocher)
The green line at https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-08-01&endday=2016-09-21&tree=trunk would be the thing to track, but I think it's still probably too early to tell if this has been fixed. Leaving the ni to check back in a few days.
Ok. I'll wait for confirmation from Sheriffs before uplifting this one to Beta50.
Looks like it dropped to zero to me.
Flags: needinfo?(wkocher)
Comment on attachment 8792887 [details] [diff] [review]
v1

Verified by Wes to have fixed the intermittent failure, Aurora51+, Beta50+
Attachment #8792887 - Flags: approval-mozilla-beta?
Attachment #8792887 - Flags: approval-mozilla-beta+
Attachment #8792887 - Flags: approval-mozilla-aurora?
Attachment #8792887 - Flags: approval-mozilla-aurora+