Intermittent leakcheck | default process: 2000 bytes leaked (CacheStorageService, CancelableRunnable, CondVar, Mutex, Runnable, ...)

RESOLVED FIXED in Firefox 50

Status

Core
Networking: Cache
P3
normal
RESOLVED FIXED
2 years ago
a year ago

People

(Reporter: KWierso, Assigned: mayhemer)

Tracking

({intermittent-failure})

49 Branch
mozilla52
intermittent-failure
Points:
---

Firefox Tracking Flags

(firefox48 wontfix, firefox49 wontfix, firefox50 fixed, firefox51 fixed, firefox52 fixed)

Details

(Whiteboard: [necko-active])

Attachments

(1 attachment)

Component: XPCOM → Networking: Cache

Comment 1

2 years ago
17 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 12
* fx-team: 4
* try: 1

Platform breakdown:
* linux64: 15
* linux32: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-04-25&endday=2016-05-01&tree=all
Honza, do you or Michal have the cycles to take this on (either during this work week or after)?
Flags: needinfo?(honzab.moz)
Whiteboard: [necko-next]
New leaks and regressions should be active.  It's OK to play hot potato with the owner, using jduell as a backstop if need be.
Assignee: nobody → honzab.moz
Whiteboard: [necko-next] → [necko-active]
(Assignee)

Comment 4

2 years ago
I have a strong suspicion this is caused by your predictor stuff, Nick.  But I can take a look next week.
Flags: needinfo?(honzab.moz)
Yeah, that's possible (I'm not going to write it off entirely), though the predictor's use of the cache storage service hasn't changed at all, so it would have to be something more indirect (same for the other leaked items - no direct changes in usage of those types in the predictor). I do agree the timing is exceedingly suspicious, though.

Comment 6

2 years ago
13 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 12
* ash: 1

Platform breakdown:
* linux64: 12
* linux32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-05-10&endday=2016-05-16&tree=all
(Assignee)

Updated

2 years ago
Duplicate of this bug: 1265089

Comment 8

2 years ago
16 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 14
* fx-team: 1
* ash: 1

Platform breakdown:
* linux64: 15
* linux32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-05-16&endday=2016-05-22&tree=all

Comment 9

2 years ago
6 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* fx-team: 3
* mozilla-inbound: 2
* try: 1

Platform breakdown:
* linux64: 5
* linux32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-05-23&endday=2016-05-29&tree=all
status-firefox48: --- → affected

Comment 10

2 years ago
7 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 6
* mozilla-central: 1

Platform breakdown:
* linux32: 4
* linux64: 3

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-05-30&endday=2016-06-05&tree=all
(Assignee)

Updated

2 years ago
Assignee: honzab.moz → hurley

Comment 11

2 years ago
13 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* fx-team: 6
* mozilla-inbound: 5
* try: 1
* mozilla-aurora: 1

Platform breakdown:
* linux64: 10
* linux32: 3

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-06-06&endday=2016-06-12&tree=all

Comment 12

2 years ago
7 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 6
* fx-team: 1

Platform breakdown:
* linux32: 3
* windows7-32: 2
* linux64: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-06-13&endday=2016-06-19&tree=all

Comment 13

2 years ago
15 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 8
* try: 3
* mozilla-central: 2
* fx-team: 2

Platform breakdown:
* windows7-32: 8
* linux64: 5
* linux32: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-06-20&endday=2016-06-26&tree=all

Comment 14

2 years ago
35 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 17
* autoland: 11
* fx-team: 5
* try: 1
* mozilla-aurora: 1

Platform breakdown:
* windows7-32: 12
* linux64: 10
* linux32: 10
* windows8-64: 3

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-06-27&endday=2016-07-03&tree=all

Comment 15

2 years ago
30 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 10
* autoland: 7
* try: 5
* fx-team: 5
* mozilla-central: 2
* ash: 1

Platform breakdown:
* linux64: 11
* linux32: 10
* windows7-32: 8
* windows8-64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-07-04&endday=2016-07-10&tree=all

Comment 16

2 years ago
34 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 12
* autoland: 9
* fx-team: 6
* mozilla-central: 4
* ash: 2
* mozilla-aurora: 1

Platform breakdown:
* linux64: 18
* linux32: 10
* windows7-32: 4
* windowsxp: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-07-11&endday=2016-07-17&tree=all

Comment 17

2 years ago
49 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 22
* autoland: 14
* fx-team: 8
* mozilla-aurora: 3
* mozilla-central: 2

Platform breakdown:
* linux64: 25
* linux32: 13
* windows7-32: 6
* windows8-64: 4
* osx-10-10: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-07-18&endday=2016-07-24&tree=all

Comment 18

2 years ago
27 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* autoland: 11
* mozilla-inbound: 10
* fx-team: 4
* try: 1
* mozilla-central: 1

Platform breakdown:
* linux64: 13
* linux32: 11
* windows7-32: 3

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-07-25&endday=2016-07-31&tree=all

Comment 19

2 years ago
34 automation job failures were associated with this bug yesterday.

Repository breakdown:
* autoland: 14
* mozilla-inbound: 13
* mozilla-central: 4
* fx-team: 3

Platform breakdown:
* linux32: 19
* linux64: 15

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-08-05&endday=2016-08-05&tree=all

Comment 20

2 years ago
82 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* autoland: 34
* mozilla-inbound: 31
* mozilla-central: 10
* fx-team: 6
* try: 1

Platform breakdown:
* linux64: 40
* linux32: 40
* windows7-32: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-08-01&endday=2016-08-07&tree=all

Comment 21

2 years ago
18 automation job failures were associated with this bug yesterday.

Repository breakdown:
* autoland: 15
* mozilla-inbound: 3

Platform breakdown:
* linux32: 13
* linux64: 5

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-08-11&endday=2016-08-11&tree=all

Comment 22

2 years ago
22 automation job failures were associated with this bug yesterday.

Repository breakdown:
* autoland: 21
* mozilla-central: 1

Platform breakdown:
* linux32: 12
* linux64: 10

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-08-12&endday=2016-08-12&tree=all

Comment 23

2 years ago
64 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* autoland: 45
* mozilla-inbound: 12
* mozilla-central: 4
* fx-team: 3

Platform breakdown:
* linux32: 33
* linux64: 29
* windows7-32: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-08-08&endday=2016-08-14&tree=all

Comment 24

2 years ago
33 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 16
* autoland: 15
* try: 1
* mozilla-central: 1

Platform breakdown:
* linux32: 18
* linux64: 15

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-08-15&endday=2016-08-15&tree=all

Comment 25

2 years ago
53 automation job failures were associated with this bug yesterday.

Repository breakdown:
* autoland: 22
* mozilla-inbound: 21
* try: 6
* fx-team: 4

Platform breakdown:
* linux32: 33
* linux64: 20

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-08-16&endday=2016-08-16&tree=all

Comment 26

2 years ago
38 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 15
* autoland: 14
* try: 5
* fx-team: 4

Platform breakdown:
* linux32: 23
* linux64: 15

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-08-17&endday=2016-08-17&tree=all

Comment 27

2 years ago
40 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 14
* autoland: 13
* try: 7
* fx-team: 4
* mozilla-central: 1
* mozilla-aurora: 1

Platform breakdown:
* linux64: 24
* linux32: 15
* windows8-64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-08-18&endday=2016-08-18&tree=all

Comment 28

2 years ago
53 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 24
* autoland: 17
* fx-team: 7
* mozilla-central: 5

Platform breakdown:
* linux64: 28
* linux32: 25

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-08-19&endday=2016-08-19&tree=all

Comment 29

2 years ago
294 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 125
* autoland: 100
* try: 33
* fx-team: 23
* mozilla-central: 11
* mozilla-aurora: 2

Platform breakdown:
* linux64: 152
* linux32: 141
* windows8-64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-08-15&endday=2016-08-21&tree=all

Comment 30

2 years ago
48 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 27
* autoland: 16
* mozilla-central: 3
* try: 1
* fx-team: 1

Platform breakdown:
* linux64: 25
* linux32: 23

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-08-22&endday=2016-08-22&tree=all

Comment 31

2 years ago
53 automation job failures were associated with this bug yesterday.

Repository breakdown:
* autoland: 26
* mozilla-inbound: 17
* fx-team: 7
* mozilla-central: 2
* try: 1

Platform breakdown:
* linux64: 30
* linux32: 22
* windows7-32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-08-23&endday=2016-08-23&tree=all

Comment 32

2 years ago
38 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 20
* autoland: 7
* try: 6
* fx-team: 3
* oak: 1
* mozilla-central: 1

Platform breakdown:
* linux64: 21
* linux32: 17

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-08-24&endday=2016-08-24&tree=all

Comment 33

2 years ago
68 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 36
* autoland: 25
* try: 3
* mozilla-central: 2
* mozilla-aurora: 1
* fx-team: 1

Platform breakdown:
* linux64: 44
* linux32: 24

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-08-25&endday=2016-08-25&tree=all

Comment 34

2 years ago
55 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 32
* autoland: 12
* fx-team: 6
* try: 3
* mozilla-central: 1
* mozilla-aurora: 1

Platform breakdown:
* linux64: 36
* linux32: 19

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-08-26&endday=2016-08-26&tree=all

Comment 35

2 years ago
280 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 133
* autoland: 88
* fx-team: 27
* try: 17
* mozilla-central: 12
* mozilla-aurora: 2
* oak: 1

Platform breakdown:
* linux64: 164
* linux32: 115
* windows7-32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-08-22&endday=2016-08-28&tree=all

Comment 36

2 years ago
44 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 20
* autoland: 14
* mozilla-central: 4
* try: 3
* fx-team: 3

Platform breakdown:
* linux64: 27
* linux32: 17

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-08-29&endday=2016-08-29&tree=all

Comment 37

2 years ago
54 automation job failures were associated with this bug yesterday.

Repository breakdown:
* autoland: 20
* mozilla-inbound: 16
* try: 7
* fx-team: 6
* mozilla-central: 5

Platform breakdown:
* linux64: 33
* linux32: 21

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-08-30&endday=2016-08-30&tree=all

Comment 38

2 years ago
51 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 23
* autoland: 18
* try: 6
* mozilla-central: 2
* fx-team: 2

Platform breakdown:
* linux64: 37
* linux32: 14

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-08-31&endday=2016-08-31&tree=all

Comment 39

2 years ago
Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3

Comment 40

2 years ago
62 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 24
* autoland: 21
* fx-team: 7
* mozilla-central: 6
* try: 3
* oak: 1

Platform breakdown:
* linux64: 45
* linux32: 17

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-09-01&endday=2016-09-01&tree=all

Comment 41

2 years ago
35 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 18
* autoland: 8
* mozilla-central: 5
* fx-team: 3
* try: 1

Platform breakdown:
* linux64: 25
* linux32: 10

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-09-02&endday=2016-09-02&tree=all

Comment 42

2 years ago
16 automation job failures were associated with this bug yesterday.

Repository breakdown:
* autoland: 8
* mozilla-inbound: 5
* mozilla-central: 2
* fx-team: 1

Platform breakdown:
* linux64: 8
* linux32: 8

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-09-03&endday=2016-09-03&tree=all

Comment 43

2 years ago
300 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 117
* autoland: 101
* try: 29
* mozilla-central: 28
* fx-team: 23
* oak: 1
* ash: 1

Platform breakdown:
* linux64: 197
* linux32: 103

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-08-29&endday=2016-09-04&tree=all
Summary: Intermittent tc-M(7) leakcheck | default process: 2000 bytes leaked (CacheStorageService, CancelableRunnable, CondVar, Mutex, Runnable, ...) → Intermittent leakcheck | default process: 2000 bytes leaked (CacheStorageService, CancelableRunnable, CondVar, Mutex, Runnable, ...)

Comment 44

2 years ago
22 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 9
* autoland: 8
* mozilla-central: 3
* try: 1
* fx-team: 1

Platform breakdown:
* linux64: 15
* linux32: 7

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-09-05&endday=2016-09-05&tree=all

Comment 45

2 years ago
54 automation job failures were associated with this bug yesterday.

Repository breakdown:
* autoland: 20
* mozilla-inbound: 18
* try: 11
* fx-team: 4
* mozilla-central: 1

Platform breakdown:
* linux64: 34
* linux32: 20

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-09-06&endday=2016-09-06&tree=all

Comment 46

2 years ago
53 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 29
* autoland: 18
* mozilla-central: 3
* try: 2
* fx-team: 1

Platform breakdown:
* linux64: 44
* linux32: 9

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-09-07&endday=2016-09-07&tree=all
:hurley, it has been almost 4 months; this is one of the top intermittents we have in all of Mozilla. What are the next steps here?

This gets noticeably worse on Friday August 5th, and by Friday August 12th, it remains a top issue.
Flags: needinfo?(hurley)
(In reply to Joel Maher ( :jmaher) from comment #47)
> :hurley, it has been almost 4 months; this is one of the top intermittents
> we have in all of Mozilla. What are the next steps here?

I was finally able to get a reproduction with refcounting logs yesterday (thanks to :swu), so I'm poring over those now, trying to make sense of things. For something so common on CI, this had been impossible for me to reproduce locally or on a loaner from releng.

> This gets noticeably worse on Friday August 5th, and by Friday August 12th,
> it remains a top issue.

Yeah, that's the weird thing - I didn't change anything then, but I did notice the spike. I presume it's just that something outside my control started hitting the path that causes the leak more often, but I have zero proof of that.
Flags: needinfo?(hurley)

Comment 49

2 years ago
42 automation job failures were associated with this bug yesterday.

Repository breakdown:
* autoland: 17
* mozilla-inbound: 15
* fx-team: 6
* try: 2
* mozilla-central: 2

Platform breakdown:
* linux64: 36
* linux32: 6

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-09-08&endday=2016-09-08&tree=all
Just looking at what is leaking, I think the most likely cause of this is code trying to dispatch a runnable that holds onto a CacheStorageService after the thread service has shut down.
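
A minimal sketch of that failure mode (illustrative names and simplified ownership only, not the actual cache code): a runnable carrying a strong reference to the service is dispatched to a thread that has already shut down, so Run() never fires and the reference is never dropped.

class PurgeRunnable final : public mozilla::Runnable
{
public:
  explicit PurgeRunnable(CacheStorageService* aService) : mService(aService) {}

  NS_IMETHOD Run() override
  {
    mService->PurgeOverMemoryLimit();  // hypothetical work item
    return NS_OK;
  }

private:
  // Strong reference: if Run() never executes, this is never released.
  RefPtr<CacheStorageService> mService;
};

// If the target thread is already gone, the event is dropped without
// running, and mService (plus its CondVar/Mutex members) leaks.
nsCOMPtr<nsIRunnable> event = new PurgeRunnable(service);
target->Dispatch(event.forget(), NS_DISPATCH_NORMAL);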

Comment 51

2 years ago
I triggered this locally today once by running the dom/workers/test/serviceworkers mochitests.  I'm not sure if it only triggers in this directory, but just mentioning it as a data point.

That being said, InterceptedChannel does have this:

https://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/InterceptedChannel.h#71

But I don't see any InterceptedChannel in the set of leaked objects.
Indeed, I've tracked it down to somewhere in the chain of stuff under Predictor::UpdateCacheability. That does have a runnable associated with it, but it's currently unclear to me exactly where things go off the rails (for example, if the runnable has no real work to do, the leak never happens, even though the runnable runs). I'm inspecting and commenting out larger and larger swaths of code to hopefully track down the bits that get us into trouble.
(Assignee)

Comment 53

2 years ago
Nick, the connection of these leaks to the predictor is a theory, right?  Have you considered carefully hard-disabling the predictor for a few days on m-c to see if these leaks really go away?  Maybe we're chasing something that isn't there.  There are almost two weeks until the next merge, so there's enough time to get data without letting this slip to m-a.
Flags: needinfo?(hurley)
(In reply to Honza Bambas (:mayhemer) from comment #53)
> Nick, the connection of these leaks to the predictor is a theory, right?
> Have you considered carefully hard-disabling the predictor for a few days
> on m-c to see if these leaks really go away?  Maybe we're chasing something
> that isn't there.  There are almost two weeks until the next merge, so
> there's enough time to get data without letting this slip to m-a.

These are specifically related to prefetch. I'm pretty comfortable in that evaluation. Prefetch is disabled on aurora, beta, and release, and none of these leaks show up on those branches - only on nightly.

Like I alluded to in comment 52, if I early-return from Predictor::UpdateCacheability, the leak goes away. Same if we never find the right thing to update the cacheability of. Now it's just a game of bisecting which bit of code triggered from UpdateCacheability is directly responsible for the leak so I can figure it out. It could be in the UpdateCacheability stack directly, or it could be a bug in the code that actually performs the prefetches (which only runs if UpdateCacheability runs).
Flags: needinfo?(hurley)

Comment 55

2 years ago
Do we have an xpcom shutdown observer that spins the event loop if necko work is still outstanding?  That's how we handle these issues in other subsystems.
(In reply to Ben Kelly [:bkelly] from comment #55)
> Do we have an xpcom shutdown observer that spins the event loop if necko
> work is still outstanding?  That's how we handle these issues in other
> subsystems.

The work the predictor does directly is on the main thread, so as long as the main thread clears out its queue, that part is good. However, this code also interacts with the cache thread and the socket thread, and I don't know offhand if either of those have a shutdown observer to spin the loop. Honza, I know you'll know about the cache thread, and I suspect you'll have a good answer about the socket thread, too.
Flags: needinfo?(honzab.moz)
(Assignee)

Comment 57

2 years ago
Now I don't follow.  Is it good to have an observer that spins or is it bad?  From my judgment it would do more harm than good.  Also, how would that be related to leaks?

Comment 58

2 years ago
(In reply to Honza Bambas (:mayhemer) from comment #57)
> Now I don't follow.  Is it good to have an observer that spins or is it bad?
> From my judgment it would do more harm than good.  Also, how would that be
> related to leaks?

AFAIK it's the only way to safely shut down async work.  It's what we do in our quota-based storage APIs:

https://dxr.mozilla.org/mozilla-central/source/dom/quota/ActorsParent.cpp#2449

It's related to leaks in that spinning the event loop there allows you to block shutdown until your async work is done.

Spinning is bad if it can be triggered directly from JS; it can break run-to-completion, etc.  Generally you don't want to spin.  However, spinning to clean up at shutdown seems like the one place it's necessary and acceptable to me.
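
For reference, a simplified sketch of that pattern (illustrative names and members; the real implementation is the quota manager code linked above): a shutdown observer spins the main-thread event loop until the outstanding async work has drained.

NS_IMETHODIMP
MyService::Observe(nsISupports* aSubject, const char* aTopic,
                   const char16_t* aData)
{
  if (!strcmp(aTopic, "profile-before-change")) {
    mShuttingDown = true;  // refuse new async work from here on
    nsIThread* mainThread = NS_GetCurrentThread();
    // Completion events for the in-flight work run on this thread and
    // decrement mPendingOps; keep pumping until everything has landed.
    while (mPendingOps > 0) {
      NS_ProcessNextEvent(mainThread, /* aMayWait = */ true);
    }
  }
  return NS_OK;
}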

Comment 59

2 years ago
24 automation job failures were associated with this bug yesterday.

Repository breakdown:
* autoland: 11
* mozilla-inbound: 9
* try: 2
* mozilla-aurora: 1
* fx-team: 1

Platform breakdown:
* linux64: 17
* linux32: 7

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-09-09&endday=2016-09-09&tree=all

Comment 60

2 years ago
231 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 95
* autoland: 77
* fx-team: 25
* try: 20
* mozilla-central: 13
* mozilla-aurora: 1

Platform breakdown:
* linux64: 173
* linux32: 58

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-09-05&endday=2016-09-11&tree=all
(Assignee)

Comment 61

2 years ago
(In reply to Ben Kelly [:bkelly] from comment #58)
> (In reply to Honza Bambas (:mayhemer) from comment #57)
> > Now I don't follow.  Is it good to have an observer that spins or is it bad?
> > From my judgment it would do more harm than good.  Also, how would that be
> > related to leaks?
> 
> AFAIK it's the only way to safely shut down async work.  It's what we do in our
> quota-based storage APIs:
> 
> https://dxr.mozilla.org/mozilla-central/source/dom/quota/ActorsParent.
> cpp#2449

This might be needed for more complex stuff, yes (though it sounds to me a bit like a bad design).

> 
> It's related to leaks in that spinning the event loop there allows you to
> block shutdown until your async work is done.
> 
> Spinning is bad if it can be triggered directly from JS; it can break
> run-to-completion, etc.  Generally you don't want to spin.  However,
> spinning to clean up at shutdown seems like the one place it's necessary and
> acceptable to me.

Nothing like this is needed for the cache IO thread.  XPCOM shutdown spins the loop for us later anyway (correctly!), so anything scheduled from the IO thread to run back on the main thread will execute.  Before that, we block (no spinning) waiting for the cache thread to finish, and we don't intend to change that, because there is no need to.  Cache IO shutdown doesn't need to loop through the main thread to actually finish its work.
Flags: needinfo?(honzab.moz)

Comment 62

2 years ago
39 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 15
* fx-team: 11
* autoland: 8
* mozilla-central: 3
* try: 1
* mozilla-aurora: 1

Platform breakdown:
* linux64: 22
* linux32: 17

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-09-12&endday=2016-09-12&tree=all

Comment 63

2 years ago
37 automation job failures were associated with this bug yesterday.

Repository breakdown:
* autoland: 12
* mozilla-inbound: 11
* try: 5
* mozilla-central: 5
* fx-team: 4

Platform breakdown:
* linux64: 26
* linux32: 11

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-09-13&endday=2016-09-13&tree=all
So I've narrowed the leak down to one line (within the predictor): http://searchfox.org/mozilla-central/rev/6b94deded39a868f248a492e74e05733d6c4ed48/netwerk/base/Predictor.cpp#2418

If I comment out that line, the leak goes away. If I leave it in, the leak stays. Even if I do *nothing* in the OnCacheEntryAvailable associated with that AsyncOpenURI, the leak stays. Even if I return ENTRY_NOT_WANTED in the OnCacheEntryCheck associated with that AsyncOpenURI, the leak stays.

This is starting to look more and more like a shutdown race as bkelly mentioned in comment 55 and comment 58. Honza, I know you said it didn't seem possible, but can you take a look at the line linked above and make sure I'm not doing anything stupid there, or something unexpected from the cache service's point of view? If the code in the predictor looks sane, I think we need to dig into timing in the cache, which I'll need your help with. I'll kick off a try run with cache logging enabled so we can get some logs from the leak in case we end up needing it.
Flags: needinfo?(honzab.moz)
Try run with cache logging mentioned above: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f56e3b5f35a8 (I'll likely need to retrigger some mochitest-1 jobs before failures with logs show up)
(Assignee)

Comment 66

2 years ago
Nick, thanks!  Will look at the logs when available.
(Assignee)

Comment 67

2 years ago
I think I know the cause.  We probably schedule a timer (the purge timer) in the service after it has been shut down.  I will have a patch soon.
Assignee: hurley → honzab.moz
Flags: needinfo?(honzab.moz)
(Assignee)

Comment 68

2 years ago
As that seems unlikely (I already checked on that in the past and rejected it as a cause), I would like to see the log anyway.

Comment 70

2 years ago
62 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 26
* autoland: 21
* fx-team: 6
* try: 5
* mozilla-central: 4

Platform breakdown:
* linux64: 48
* linux32: 14

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-09-14&endday=2016-09-14&tree=all

Comment 71

2 years ago
48 automation job failures were associated with this bug yesterday.

Repository breakdown:
* autoland: 23
* mozilla-inbound: 16
* fx-team: 5
* mozilla-central: 3
* mozilla-aurora: 1

Platform breakdown:
* linux64: 32
* linux32: 16

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-09-15&endday=2016-09-15&tree=all

Comment 72

2 years ago
65 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 29
* autoland: 22
* fx-team: 8
* try: 3
* mozilla-central: 3

Platform breakdown:
* linux64: 37
* linux32: 27
* osx-10-10: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-09-16&endday=2016-09-16&tree=all
23 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 8
* autoland: 6
* mozilla-central: 5
* fx-team: 3
* try: 1

Platform breakdown:
* linux32: 13
* linux64: 10

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-09-17&endday=2016-09-17&tree=all
294 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 111
* autoland: 96
* fx-team: 43
* mozilla-central: 25
* try: 17
* mozilla-aurora: 2

Platform breakdown:
* linux64: 189
* linux32: 104
* osx-10-10: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-09-12&endday=2016-09-18&tree=all
41 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 16
* autoland: 10
* fx-team: 6
* try: 4
* mozilla-central: 3
* mozilla-aurora: 2

Platform breakdown:
* linux64: 29
* linux32: 12

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-09-19&endday=2016-09-19&tree=all
(Assignee)

Comment 76

a year ago
Cause:

2016-09-19 18:57:42.255602 UTC - [Cache2 I/O]: D/cache2 CacheStorageService::SchedulePurgeOverMemoryLimit
2016-09-19 18:57:43.260031 UTC - [Cache2 I/O]: D/cache2 CacheStorageService::Notify
2016-09-19 18:57:46.154352 UTC - [Cache2 I/O]: D/cache2 CacheStorageService::ShutdownBackground
2016-09-19 18:57:46.154709 UTC - [Cache2 I/O]: D/cache2   mMemorySize=259780 (+172,-0)
2016-09-19 18:57:46.155910 UTC - [Cache2 I/O]: D/cache2 CacheStorageService::SchedulePurgeOverMemoryLimit
2016-09-19 18:57:46.155988 UTC - [Cache2 I/O]: D/cache2 CacheStorageService::SchedulePurgeOverMemoryLimit

The timer is never notified or canceled, and thus never released.  Hence the Timer<->Service cycle and the leak.
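
In other words (a simplified sketch, not the literal code): the timer holds its callback (the service) alive, and the service holds the timer as a member, so once the timer is armed after shutdown nothing ever fires or cancels it and both objects leak.

// CacheStorageService acts as the nsITimerCallback and keeps the timer
// as a member, so Timer -> Service and Service -> Timer form a cycle
// that is only broken when the timer fires or is canceled.
void CacheStorageService::SchedulePurgeOverMemoryLimit()
{
  mPurgeTimer = do_CreateInstance(NS_TIMER_CONTRACTID);
  if (mPurgeTimer) {
    // When this runs after ShutdownBackground(), nothing will ever
    // notify or cancel the timer, so the cycle is never broken.
    mPurgeTimer->InitWithCallback(this, 1000, nsITimer::TYPE_ONE_SHOT);
  }
}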


Nick, just FYI, using the --set-env try arg to set up logging is not the appropriate way.  Use the MOZ_LOG var in testing/mochitest/runtests.py, which correctly handles the overlap of logs produced by multiple processes that all see the same MOZ_LOG_FILE env var.  That was the source of my initial confusion here.
Status: NEW → ASSIGNED
(Assignee)

Comment 77

a year ago
Created attachment 8792887 [details] [diff] [review]
v1

- no longer engage the purge timer after mShutdown = true (in CacheStorageService::Shutdown, synchronized before CacheStorageService::ShutdownBackground)
- I also added locking to all places where we work with the timer (except the one non-null check; the comment already there notes it's intentional)

https://treeherder.mozilla.org/#/jobs?repo=try&revision=5532246b1395 (leak checking try push)
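
Roughly the shape of the fix per the description above (a sketch against this comment, not the literal patch; member names are assumed): bail out under the lock once mShutdown is set, so the purge timer can never be armed after shutdown.

void CacheStorageService::SchedulePurgeOverMemoryLimit()
{
  mozilla::MutexAutoLock lock(mLock);

  if (mShutdown) {
    // Shutdown() has already run; arming the timer now would recreate
    // the Timer <-> Service cycle with nothing left to break it.
    return;
  }

  if (!mPurgeTimer) {
    mPurgeTimer = do_CreateInstance(NS_TIMER_CONTRACTID);
    if (mPurgeTimer) {
      mPurgeTimer->InitWithCallback(this, 1000, nsITimer::TYPE_ONE_SHOT);
    }
  }
}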
Attachment #8792887 - Flags: review?(michal.novotny)
Attachment #8792887 - Flags: review?(michal.novotny) → review+
(In reply to Honza Bambas (:mayhemer) from comment #78)
> Leak's gone

I'm wishing *so* hard for emoji support in bugzilla right now. Woo, Honza!
46 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 20
* autoland: 16
* mozilla-central: 7
* try: 3

Platform breakdown:
* linux64: 29
* linux32: 17

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-09-20&endday=2016-09-20&tree=all
status-firefox48: affected → wontfix
status-firefox49: affected → wontfix
status-firefox50: --- → affected
status-firefox51: --- → affected
status-firefox52: --- → affected

Comment 81

a year ago
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/9a370db31b85
Fix HTTP cache2 leak after shutdown, don't setup purging timer after shutdown. r=michal
Keywords: checkin-needed

Comment 82

a year ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/9a370db31b85
Status: ASSIGNED → RESOLVED
Last Resolved: a year ago
status-firefox52: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
31 automation job failures were associated with this bug yesterday.

Repository breakdown:
* autoland: 16
* try: 6
* mozilla-inbound: 5
* mozilla-central: 2
* fx-team: 2

Platform breakdown:
* linux64: 20
* linux32: 11

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-09-21&endday=2016-09-21&tree=all
Please request Aurora/Beta approval on this when you get a chance.
Flags: needinfo?(honzab.moz)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #84)
> Please request Aurora/Beta approval on this when you get a chance.

I'm not convinced we should do this - while this has been a latent bug, it's only (to our knowledge) triggered by predictor prefetch stuff, which is disabled everywhere but nightly - hence no leaks on aurora or beta in the orangefactor reports. I'm not sure it's worth the risk of uplifting a patch. Then again, I didn't write the patch, so take what I say with somewhere between a grain of salt and an entire salt lick :D
OrangeFactor shows this hitting on Aurora50 throughout that cycle, so consider me confused then :)
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&endday=2016-09-19&startday=2016-08-01&tree=mozilla-aurora
(Assignee)

Comment 87

a year ago
This is independent of the predictor/prefetch.  The predictor only helped to narrow it down because it apparently exhausts the cache2 back end during shutdown much more than when it's not turned on.  Will request approvals.
Flags: needinfo?(honzab.moz)
(Assignee)

Comment 88

a year ago
Comment on attachment 8792887 [details] [diff] [review]
v1

Approval Request Comment
[Feature/regressing bug #]: HTTP cache2
[User impact if declined]: none (it's just a leak, not causing any secondary trouble), although generally speaking this frequent intermittent failure may well lead to overlooking something more important
[Describe test coverage new/current, TreeHerder]: just landed on m-c
[Risks and why]: zero, the patch just doesn't allow a certain non-critical action after the shutdown mark
[String/UUID change made/needed]: none
Attachment #8792887 - Flags: approval-mozilla-beta?
Attachment #8792887 - Flags: approval-mozilla-aurora?
Hi Wes, could you please confirm this fixes the intermittent? That will help me decide whether to uplift this or not. Thanks!
Flags: needinfo?(wkocher)
(Reporter)

Comment 90

a year ago
The green line at https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-08-01&endday=2016-09-21&tree=trunk would be the thing to track, but I think it's still probably too early to tell if this has been fixed. Leaving the ni to check back in a few days.
Ok. I'll wait for confirmation from Sheriffs before uplifting this one to Beta50.
(Reporter)

Comment 92

a year ago
Looks like it dropped to zero to me.
Flags: needinfo?(wkocher)
136 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* autoland: 45
* mozilla-inbound: 42
* try: 17
* mozilla-central: 16
* fx-team: 13
* mozilla-aurora: 2
* mozilla-beta: 1

Platform breakdown:
* linux64: 92
* linux32: 44

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-09-19&endday=2016-09-25&tree=all
Comment on attachment 8792887 [details] [diff] [review]
v1

Verified by Wes to have fixed the intermittent failure, Aurora51+, Beta50+
Attachment #8792887 - Flags: approval-mozilla-beta?
Attachment #8792887 - Flags: approval-mozilla-beta+
Attachment #8792887 - Flags: approval-mozilla-aurora?
Attachment #8792887 - Flags: approval-mozilla-aurora+
(Reporter)

Comment 95

a year ago
bugherder uplift
https://hg.mozilla.org/releases/mozilla-aurora/rev/36e4081424b4
status-firefox51: affected → fixed
(Reporter)

Comment 96

a year ago
bugherder uplift
https://hg.mozilla.org/releases/mozilla-beta/rev/a843ec667941
status-firefox50: affected → fixed