Intermittent leakcheck | default process: 2000 bytes leaked (CacheStorageService, CancelableRunnable, CondVar, Mutex, Runnable, ...)

RESOLVED FIXED in Firefox 50

Status


P3
normal
RESOLVED FIXED
3 years ago
2 years ago

People

(Reporter: KWierso, Assigned: mayhemer)

Tracking

({intermittent-failure})

49 Branch
mozilla52
intermittent-failure
Points:
---

Firefox Tracking Flags

(firefox48 wontfix, firefox49 wontfix, firefox50 fixed, firefox51 fixed, firefox52 fixed)

Details

(Whiteboard: [necko-active])

Attachments

(1 attachment)

Component: XPCOM → Networking: Cache
Comment hidden (Intermittent Failures Robot)

Comment 2

3 years ago
Honza, do you or Michal have the cycles to take this on (either during this work week, or after)?
Flags: needinfo?(honzab.moz)
Whiteboard: [necko-next]
New leaks and regressions should be active. It's OK to play hot potato with the owner, using jduell as a backstop if need be.
Assignee: nobody → honzab.moz
Whiteboard: [necko-next] → [necko-active]
(Assignee)

Comment 4

3 years ago
I have a strong suspicion this is caused by your predictor stuff, Nick.  But I can take a look next week.
Flags: needinfo?(honzab.moz)

Comment 5

3 years ago
Yeah, that's possible (I'm not going to write it off entirely), though the predictor's use of the cache storage service hasn't changed at all, so it would have to be something more indirect (same for the other leaked items - no direct changes in usage of those types in the predictor). I do agree the timing is exceedingly suspicious, though.

Updated

3 years ago
Blocks: 1271944
Comment hidden (Intermittent Failures Robot)
(Assignee)

Updated

3 years ago
Duplicate of this bug: 1265089
Comment hidden (Intermittent Failures Robot)
status-firefox48: --- → affected
Comment hidden (Intermittent Failures Robot)
(Assignee)

Updated

3 years ago
Assignee: honzab.moz → hurley
Comment hidden (Intermittent Failures Robot)

Comment 39

2 years ago
Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
Comment hidden (Intermittent Failures Robot)
Summary: Intermittent tc-M(7) leakcheck | default process: 2000 bytes leaked (CacheStorageService, CancelableRunnable, CondVar, Mutex, Runnable, ...) → Intermittent leakcheck | default process: 2000 bytes leaked (CacheStorageService, CancelableRunnable, CondVar, Mutex, Runnable, ...)
Comment hidden (Intermittent Failures Robot)
:hurley, it has been almost 4 months, and this is one of the top intermittents we have in all of mozilla. What are the next steps here?

This got noticeably worse on Friday, August 5th, and as of Friday, August 12th it remains a top issue.
Flags: needinfo?(hurley)

Comment 48

2 years ago
(In reply to Joel Maher ( :jmaher) from comment #47)
> :hurley, it has been almost 4 months, this is one of the top intermittents
> we have in all of mozilla- what are the next steps here?

I was finally able to get a reproduction with refcounting logs yesterday (thanks to :swu), so I'm poring over those now, trying to make sense of things. For something so common on CI, this had been impossible for me to reproduce locally or on a loaner from releng.

> This gets noticeably worse on Friday August 5th, and by Friday August 12th,
> it remains a top issue.

Yeah, that's the weird thing - I didn't change anything then, but I did notice the spike. I presume it's just that something outside my control started hitting the path that causes the leak more often, but I have zero proof of that.
Flags: needinfo?(hurley)
Comment hidden (Intermittent Failures Robot)
Just looking at what is leaking, I think the most likely cause is code trying to dispatch a runnable that holds onto a CacheStorageService after the thread service has shut down.
I triggered this locally today once by running the dom/workers/test/serviceworkers mochitests. I'm not sure if it only triggers in this directory, but just mentioning it as a data point.

That being said, InterceptedChannel does have this:

https://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/InterceptedChannel.h#71

But I don't see any InterceptedChannel in the set of leaked objects.

Comment 52

2 years ago
Indeed, I've tracked it down to somewhere in the chain of stuff under Predictor::UpdateCacheability. That does have a runnable associated with it, but it's currently unclear to me exactly where things go off the rails (for example, if the runnable has no real work to do, the leak never happens, even though the runnable runs). I'm inspecting and commenting out larger and larger swaths of code to hopefully track down the bits that get us into trouble.
(Assignee)

Comment 53

2 years ago
Nick, the connection of these leaks to the predictor is a theory, right? Have you considered carefully hard-disabling the predictor for a few days on m-c to see if these leaks really go away? Maybe we're tracing something that isn't there. There are almost two weeks until the next merge, so there's enough time to get data without it slipping to m-a.
Flags: needinfo?(hurley)

Comment 54

2 years ago
(In reply to Honza Bambas (:mayhemer) from comment #53)
> Nick, connection of these leaks to the predictor is a theory, right?  Was
> you considering careful hard-disabling of the predictor for few days on m-c
> and see if these leaks really go away?  Maybe we trace something that isn't
> there.  There are almost two weeks to the next merge, so enough time to get
> data and not getting it slipped to m-a.

These are specifically related to prefetch. I'm pretty comfortable in that evaluation. Prefetch is disabled on aurora, beta, and release, and none of these leaks show up on those branches - only on nightly.

Like I alluded to in comment 52, if I early-return from Predictor::UpdateCacheability, the leak goes away. Same if we never find the right thing to update the cacheability of. Now it's just a game of bisecting which bit of what gets triggered from UpdateCacheability is directly responsible for the leak. It could be in the UpdateCacheability stack directly, or it could be a bug in the code that actually performs the prefetches (which only runs if UpdateCacheability runs).
Flags: needinfo?(hurley)
Do we have an xpcom shutdown observer that spins the event loop if necko work is still outstanding? That's how we handle these issues in other subsystems.

Comment 56

2 years ago
(In reply to Ben Kelly [:bkelly] from comment #55)
> Do we have an xpcom shutdown observer that spins the event loop if necko
> work is still outstanding?  Thats how we handle these issues in other
> subsystems.

The work the predictor does directly is on the main thread, so as long as the main thread clears out its queue, that part is good. However, this code also interacts with the cache thread and the socket thread, and I don't know offhand if either of those have a shutdown observer to spin the loop. Honza, I know you'll know about the cache thread, and I suspect you'll have a good answer about the socket thread, too.
Flags: needinfo?(honzab.moz)
(Assignee)

Comment 57

2 years ago
Now I don't follow. Is it good to have an observer that spins, or is it bad? In my judgment it would do more harm than good. Also, how would that be related to leaks?
(In reply to Honza Bambas (:mayhemer) from comment #57)
> Now I don't follow.  Is it good to have an observer that spins or is it bad?
> From my judgment it would do more harm than good.  Also, how would that be
> related to leaks?

AFAIK it's the only way to safely shut down async work. It's what we do in our quota-based storage APIs:

https://dxr.mozilla.org/mozilla-central/source/dom/quota/ActorsParent.cpp#2449

It's related to leaks in that spinning the event loop there allows you to block shutdown until your async work is done.

Spinning is bad if it can be triggered directly from JS; it can break run-to-completion, etc. Generally you don't want to spin. However, spinning to clean up at shutdown seems like the one place it's necessary and acceptable to me.
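The pattern described here can be sketched in miniature (hypothetical names; this models the quota-manager approach in general terms, not any real necko code): at shutdown, keep draining the event queue until the async work signals completion, instead of tearing down with tasks still pending.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Toy single-threaded event loop standing in for the XPCOM main-thread queue.
struct MiniEventLoop {
    std::deque<std::function<void()>> mQueue;

    void Dispatch(std::function<void()> task) {
        mQueue.push_back(std::move(task));
    }

    // "Spin at shutdown": run queued tasks until `done` flips true (or the
    // queue empties). Returns how many tasks had to run before completion.
    int SpinUntil(const bool& done) {
        int ran = 0;
        while (!done && !mQueue.empty()) {
            auto task = std::move(mQueue.front());
            mQueue.pop_front();
            task();
            ++ran;
        }
        return ran;
    }
};

// Simulated async teardown that needs two trips through the loop: the first
// task schedules a follow-up, and only the follow-up marks the work done.
int ShutdownDemo() {
    MiniEventLoop loop;
    bool done = false;
    loop.Dispatch([&] { loop.Dispatch([&] { done = true; }); });
    return loop.SpinUntil(done);
}
```

If shutdown simply returned without spinning, the follow-up task (and anything it holds a reference to) would still be sitting in the queue, which is the leak shape being discussed.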
Comment hidden (Intermittent Failures Robot)
(Assignee)

Comment 61

2 years ago
(In reply to Ben Kelly [:bkelly] from comment #58)
> (In reply to Honza Bambas (:mayhemer) from comment #57)
> > Now I don't follow.  Is it good to have an observer that spins or is it bad?
> > From my judgment it would do more harm than good.  Also, how would that be
> > related to leaks?
> 
> AFAIK its the only way to safely shutdown async work.  Its what we do in our
> quota-based storage APIs:
> 
> https://dxr.mozilla.org/mozilla-central/source/dom/quota/ActorsParent.
> cpp#2449

This might be needed for more complex stuff, yes (though it sounds to me a bit like bad design).

> 
> Its related to leaks in that spinning the event loop there allows you to
> block shutdown until your async work is done.
> 
> Spinning is bad if it can be triggered directly from JS; can break
> run-to-completion, etc.  Generally you don't want to spin.  However,
> spinning to cleanup at shutdown seems the one place its necessary and
> acceptable to me.

Nothing like this is needed for the cache IO thread. xpcom shutdown spins the loop for us later anyway (correctly!), so anything scheduled from the IO thread to run back on the main thread will execute. Before that, we do block (no spinning) for the cache thread to finish, and we don't intend to change that, because there is no need to. Cache IO shutdown doesn't need to loop through the main thread to actually finish its work.
Flags: needinfo?(honzab.moz)
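The blocking (non-spinning) approach described above can be sketched as a plain thread join, under the assumption that the IO thread simply finishes its queued work before the main thread proceeds (names are illustrative, not from the real tree):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// "Block, don't spin": the shutting-down thread waits for the worker to
// drain its remaining work; no main-thread events run in the meantime.
int BlockingShutdown(int pendingTasks) {
    std::atomic<int> processed{0};
    std::thread io([&] {
        for (int i = 0; i < pendingTasks; ++i) {
            ++processed;  // stands in for one queued cache IO task
        }
    });
    io.join();  // blocking wait for the IO thread to finish its queue
    return processed.load();
}
```

The trade-off versus the spinning approach: blocking is simpler and safe here because, as stated above, the cache IO thread never needs a round-trip through the main thread to complete its own work.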
Comment hidden (Intermittent Failures Robot)

Comment 64

2 years ago
So I've narrowed the leak down to one line (within the predictor): http://searchfox.org/mozilla-central/rev/6b94deded39a868f248a492e74e05733d6c4ed48/netwerk/base/Predictor.cpp#2418

If I comment out that line, the leak goes away. If I leave it in, the leak stays. Even if I do *nothing* in the OnCacheEntryAvailable associated with that AsyncOpenURI, the leak stays. Even if I return ENTRY_NOT_WANTED in the OnCacheEntryCheck associated with that AsyncOpenURI, the leak stays.

This is starting to look more and more like a shutdown race as bkelly mentioned in comment 55 and comment 58. Honza, I know you said it didn't seem possible, but can you take a look at the line linked above and make sure I'm not doing anything stupid there, or something unexpected from the cache service's point of view? If the code in the predictor looks sane, I think we need to dig into timing in the cache, which I'll need your help with. I'll kick off a try run with cache logging enabled so we can get some logs from the leak in case we end up needing it.
Flags: needinfo?(honzab.moz)

Comment 65

2 years ago
Try run with cache logging mentioned above: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f56e3b5f35a8 (I'll likely need to retrigger some mochitest-1 jobs before failures with logs show up)
(Assignee)

Comment 66

2 years ago
Nick, thanks!  Will look at the logs when available.
(Assignee)

Comment 67

2 years ago
I think I know the cause.  We probably schedule a timer (purge timer) in the service after it has been shutdown.  I will have a patch soon.
Assignee: hurley → honzab.moz
Flags: needinfo?(honzab.moz)
(Assignee)

Comment 68

2 years ago
As that seems unlikely (I already checked on that in the past and rejected as a cause), I would like to see the log anyway.
Comment hidden (Intermittent Failures Robot)
(Assignee)

Comment 76

2 years ago
Cause:

2016-09-19 18:57:42.255602 UTC - [Cache2 I/O]: D/cache2 CacheStorageService::SchedulePurgeOverMemoryLimit
2016-09-19 18:57:43.260031 UTC - [Cache2 I/O]: D/cache2 CacheStorageService::Notify
2016-09-19 18:57:46.154352 UTC - [Cache2 I/O]: D/cache2 CacheStorageService::ShutdownBackground
2016-09-19 18:57:46.154709 UTC - [Cache2 I/O]: D/cache2   mMemorySize=259780 (+172,-0)
2016-09-19 18:57:46.155910 UTC - [Cache2 I/O]: D/cache2 CacheStorageService::SchedulePurgeOverMemoryLimit
2016-09-19 18:57:46.155988 UTC - [Cache2 I/O]: D/cache2 CacheStorageService::SchedulePurgeOverMemoryLimit

The timer is never notified nor canceled and released.  Hence the Timer<->Service cycle and leak.
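A minimal model of that cycle (hypothetical names, with std::shared_ptr standing in for XPCOM refcounting): the timer's callback holds a strong reference to the service while the service holds the timer, so unless something cancels the timer, neither object can ever be released. That is exactly what happens when the timer is armed after Shutdown has already run.

```cpp
#include <cassert>
#include <functional>
#include <memory>

struct Timer;

struct Service : std::enable_shared_from_this<Service> {
    std::shared_ptr<Timer> mPurgeTimer;  // service -> timer edge of the cycle
    bool mShutdown = false;
    void SchedulePurgeTimer(bool checkShutdownFlag);
    void Shutdown();
};

struct Timer {
    std::function<void()> mCallback;        // timer -> service edge (captured ref)
    void Cancel() { mCallback = nullptr; }  // dropping the callback breaks the cycle
};

void Service::SchedulePurgeTimer(bool checkShutdownFlag) {
    if (checkShutdownFlag && mShutdown) {
        return;  // the fix: never arm the timer after shutdown
    }
    auto self = shared_from_this();
    mPurgeTimer = std::make_shared<Timer>();
    mPurgeTimer->mCallback = [self] { /* purge over memory limit */ };
}

void Service::Shutdown() {
    mShutdown = true;
    if (mPurgeTimer) {
        mPurgeTimer->Cancel();
        mPurgeTimer.reset();
    }
}

// Arms the purge timer *after* shutdown, drops the external reference, and
// reports how many owners the service still has: >0 means a leak.
long LeakDemo(bool fixed) {
    auto svc = std::make_shared<Service>();
    std::weak_ptr<Service> probe = svc;
    svc->Shutdown();
    svc->SchedulePurgeTimer(/*checkShutdownFlag=*/fixed);
    svc.reset();
    return probe.use_count();
}
```

Without the shutdown check, the late-armed timer and the service keep each other alive forever, matching the CacheStorageService entries in the leak report above.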


Nick, just FYI, using the --set-env try arg to set up logging is not the appropriate way. Use the MOZ_LOG var in testing/mochitest/runtests.py, which correctly solves the overlapping of logs produced by multiple processes all seeing the same MOZ_LOG_FILE env var. That was the source of my initial confusion here.
Status: NEW → ASSIGNED
(Assignee)

Comment 77

2 years ago
Created attachment 8792887 [details] [diff] [review]
v1

- No longer engage the purge timer after mShutdown = true (in CacheStorageService::Shutdown, synchronized before CacheStorageService::ShutdownBackground)
- I also added locking to all the places where we work with the timer (except the one non-null check; a comment is already there noting it's intentional)

https://treeherder.mozilla.org/#/jobs?repo=try&revision=5532246b1395 (leak checking try push)
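The shape of the fix described in the patch notes above can be sketched like this (illustrative names; the real patch guards an nsITimer inside CacheStorageService): the shutdown flag and the timer state are only ever touched under one lock, so a schedule call racing with shutdown can never arm the timer afterwards.

```cpp
#include <cassert>
#include <mutex>

class PurgeTimerGuard {
public:
    // Returns true if the timer was armed, false if shutdown won the race
    // (or the timer was already pending).
    bool SchedulePurge() {
        std::lock_guard<std::mutex> lock(mLock);
        if (mShutdown || mTimerArmed) {
            return false;
        }
        mTimerArmed = true;  // stands in for creating/arming the real timer
        return true;
    }

    void Shutdown() {
        std::lock_guard<std::mutex> lock(mLock);
        mShutdown = true;
        mTimerArmed = false;  // stands in for Cancel() + releasing the timer
    }

    bool TimerArmed() {
        std::lock_guard<std::mutex> lock(mLock);
        return mTimerArmed;
    }

private:
    std::mutex mLock;       // covers both flags, as the patch adds locking
    bool mShutdown = false;
    bool mTimerArmed = false;
};
```

Because the check and the arming happen under the same lock that Shutdown takes, the "schedule after shutdown" window from comment 76 is closed.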
Attachment #8792887 - Flags: review?(michal.novotny)
Attachment #8792887 - Flags: review?(michal.novotny) → review+

Comment 79

2 years ago
(In reply to Honza Bambas (:mayhemer) from comment #78)
> Leak's gone

I'm wishing *so* hard for emoji support in bugzilla right now. Woo, Honza!
Comment hidden (Intermittent Failures Robot)
status-firefox48: affected → wontfix
status-firefox49: affected → wontfix
status-firefox50: --- → affected
status-firefox51: --- → affected
status-firefox52: --- → affected

Comment 81

2 years ago
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/9a370db31b85
Fix HTTP cache2 leak after shutdown, don't setup purging timer after shutdown. r=michal
Keywords: checkin-needed

Comment 82

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/9a370db31b85
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
status-firefox52: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Comment hidden (Intermittent Failures Robot)
Please request Aurora/Beta approval on this when you get a chance.
Flags: needinfo?(honzab.moz)

Comment 85

2 years ago
(In reply to Ryan VanderMeulen [:RyanVM] from comment #84)
> Please request Aurora/Beta approval on this when you get a chance.

I'm not convinced we should do this - while this has been a latent bug, it's only (to our knowledge) triggered by predictor prefetch stuff, which is disabled everywhere but nightly - hence no leaks on aurora or beta in the orangefactor reports. I'm not convinced it's worth the risk of uplifting a patch. Then again, I didn't write the patch, so take what I say with somewhere between a grain of salt and an entire salt lick :D
OrangeFactor shows this hitting on Aurora50 throughout that cycle, so consider me confused then :)
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&endday=2016-09-19&startday=2016-08-01&tree=mozilla-aurora
(Assignee)

Comment 87

2 years ago
This is independent of predictor/prefetch. The predictor only helped narrow it down because it apparently stresses the cache2 back end during shutdown much more than when it's not turned on. Will request approvals.
Flags: needinfo?(honzab.moz)
(Assignee)

Comment 88

2 years ago
Comment on attachment 8792887 [details] [diff] [review]
v1

Approval Request Comment
[Feature/regressing bug #]: HTTP cache2
[User impact if declined]: none (it's just a leak not causing any secondary trouble); although, generally speaking, this frequent intermittent failure may well lead to something more important being overlooked
[Describe test coverage new/current, TreeHerder]: just landed on m-c
[Risks and why]: zero, the patch just doesn't allow a certain non-critical action after the shutdown mark
[String/UUID change made/needed]: none
Attachment #8792887 - Flags: approval-mozilla-beta?
Attachment #8792887 - Flags: approval-mozilla-aurora?
Hi Wes, could you please confirm this fixes the intermittent? That will help me decide whether to uplift this or not. Thanks!
Flags: needinfo?(wkocher)
(Reporter)

Comment 90

2 years ago
The green line at https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-08-01&endday=2016-09-21&tree=trunk would be the thing to track, but I think it's still probably too early to tell if this has been fixed. Leaving the ni to check back in a few days.
Ok. I'll wait for confirmation from Sheriffs before uplifting this one to Beta50.
(Reporter)

Comment 92

2 years ago
Looks like it dropped to zero to me.
Flags: needinfo?(wkocher)
Comment hidden (Intermittent Failures Robot)
Comment on attachment 8792887 [details] [diff] [review]
v1

Verified by Wes to have fixed the intermittent failure, Aurora51+, Beta50+
Attachment #8792887 - Flags: approval-mozilla-beta?
Attachment #8792887 - Flags: approval-mozilla-beta+
Attachment #8792887 - Flags: approval-mozilla-aurora?
Attachment #8792887 - Flags: approval-mozilla-aurora+
(Reporter)

Comment 95

2 years ago
bugherder uplift
https://hg.mozilla.org/releases/mozilla-aurora/rev/36e4081424b4
status-firefox51: affected → fixed
(Reporter)

Comment 96

2 years ago
bugherder uplift
https://hg.mozilla.org/releases/mozilla-beta/rev/a843ec667941
status-firefox50: affected → fixed