Closed Bug 1395233 Opened 7 years ago Closed 7 years ago

JSBC leaks channels when enabled on AWSY

Categories

(Core :: DOM: Core & HTML, defect, P2)

RESOLVED DUPLICATE of bug 1411697

People

(Reporter: nbp, Assigned: nbp)

Details

(Whiteboard: [MemShrink:P1])

Attachments

(1 file)

== Change summary for alert #8954 (as of August 22 2017 19:26 UTC) ==

Regressions:

  3%  Heap Unclassified summary windows10-64 pgo      47,291,882.52 -> 48,864,383.51
  3%  Heap Unclassified summary windows10-64 opt      47,473,708.38 -> 49,053,052.95

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=8954

Looking at profiles from try results:
(without JSBC) https://treeherder.mozilla.org/#/jobs?repo=try&revision=ea7790a1b292c1178c723ccc46fb358caadd70be
(with    JSBC) https://treeherder.mozilla.org/#/jobs?repo=try&revision=2fb961dd06f21b98acafa38b1bca22d109f0c1e5

Looking at TabsClosedForcedGC suggests that some memory is held over for longer than expected:

Explicit Allocations
1.47 MB (100.0%) -- explicit
├──1.11 MB (75.27%) -- heap-overhead
│  ├──1.13 MB (76.64%) ── bin-unused
├──0.38 MB (26.06%) ── heap-unclassified
Depends on: 1395540
As I do not know which code path might be causing this issue, I tried stopping the execution at different points along the code path.

For the moment, I can claim that this is not related to the encoding of the bytecode.  I see no (noticeable) memory overhead if I do not send any bytes to necko and constantly re-encode the bytecode at each page load.

I will add a mode that always encodes and saves the bytecode in necko, but never requests it.
Assignee: nobody → nicolas.b.pierron
Status: NEW → ASSIGNED
I tried multiple configurations:
  0. As of today.
  1. JSBC enabled.
  2. Eager mode.
  3. fetch-count threshold to 4000.
  4. Without OpenAlternativeOutputStream.
  5. Without Finalizer & OpenAlternativeOutputStream.
  6. Always encode and save in the cache.
  7. Always fall back, if we ever load some bytecode.

Measurements of the Heap Unclassified reports:
  0% regression with: 0, 4, 7.
  2% regression with: 5.
  4% regression with: 1, 3.
 12% regression with: 2, 6.

I do not know how AWSY works internally: (erahm?)
 a. Whether the profile is reused, i.e. one which has already visited the pages more than 4000 times (3).
 b. Whether the profile is clean.
 c. Whether the pages are loaded N times before closing and re-opening Firefox to make the measurements.
 d. Whether we load the pages multiple times and take a mean of all the reports. (3 times)

If hypothesis (a) & (d) are true, …
· 1 & 3 seem to have the same regression.  If (a) is true, then we are encoding the bytecode on the first visit, in which case I would expect to see the same results as 2 & 6.  The only way to explain this would be if we were doing a mean over 3 runs (d), in which case the numbers would match (4% = 12% / 3).

· The fact that there is no regression for 4 suggests that the problem is not from generating the content for the cached data.

· 7 is hard to explain.  7 encodes the bytecode all the time and saves it to the cache at every request, like 2 & 6, but falls back on loading the source.  Falling back causes a new request to be made with a different alternative data type.

So, my current guess would be that we somehow keep some data around, which makes us cache the bytecode, but only if we keep requesting the same alternative data type.  If we start requesting a different alternative data type, the memory is no longer kept alive. (valentin?)
Flags: needinfo?(valentin.gosu)
Flags: needinfo?(erahm)
(In reply to Nicolas B. Pierron [:nbp] from comment #2)
> So, my current guess would be that we somehow keep some data around,
> which makes us cache the bytecode, but only if we keep requesting the same
> alternative data type.  If we start requesting a different alternative data
> type, the memory is no longer kept alive. (valentin?)

At first I thought it might be because we keep the cache entry alive through nsHttpChannel::mAltDataCacheEntry, when we have a non-empty mPreferredCachedAltDataType. But since we don't leak when requesting a different alt data type, I'm not very sure what could cause it.
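For illustration, here is a minimal standalone sketch of the keep-alive pattern described above. This is not the actual nsHttpChannel code; apart from the two member names quoted above, every name is hypothetical. The point it shows is that the entry is only retained when a preferred alt-data type has been set on the channel, and then stays alive for as long as the channel object itself does:

#include <memory>
#include <string>
#include <utility>

// Stand-in for the cache entry; the real type lives in the necko cache.
struct CacheEntry {
  std::string altData;  // e.g. previously cached JS bytecode
};

class ChannelSketch {
 public:
  void SetPreferredAltDataType(const std::string& aType) {
    mPreferredCachedAltDataType = aType;
  }

  void OnCacheEntryAvailable(std::shared_ptr<CacheEntry> aEntry) {
    // Keep the entry only if the consumer asked for alt-data, so that an
    // alternative output stream can later be opened on it.  While this
    // member holds the entry, the entry cannot be freed.
    if (!mPreferredCachedAltDataType.empty()) {
      mAltDataCacheEntry = std::move(aEntry);
    }
  }

 private:
  std::string mPreferredCachedAltDataType;
  std::shared_ptr<CacheEntry> mAltDataCacheEntry;
};

In this picture, whatever finally releases the channel (or explicitly clears the reference) is what lets the entry go; if that never happens, the memory stays reachable and shows up in the reports.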

Michal, is it possible that the memory cache holds the extra alt-data and that is causing the extra memory to be reported?
Flags: needinfo?(valentin.gosu) → needinfo?(michal.novotny)
(In reply to Valentin Gosu [:valentin] from comment #3)
> Michal, is it possible that the memory cache holds the extra alt-data and
> that is causing the extra memory to be reported?

The memory used by the cache would be under network/cache2.
Flags: needinfo?(michal.novotny)
(In reply to Nicolas B. Pierron [:nbp] from comment #2)
> I do not know how AWSY works internally: (erahm?)

AWSY has the following checkpoints:
  #1 'Fresh Start' - Open clean browser (empty profile)
  #2 'Fresh Start [+30]' - Wait 30 seconds
  #3 'Tabs Open' - Load 100 pages of TP5 in 30 tabs in a round-robin fashion, waiting 10 seconds after each page load
  #4 'Tabs Open [+30]' - Wait 30 seconds
  #5 'Tabs Open [+30, forced GC]' - Force a GC
  #6 'Tabs Closed Extra Processes' - Close all but one tab and navigate to about:blank
  #7 'Tabs Closed' - Kill any lingering preloaded browsers
  #8 'Tabs Closed [+30]' - Wait 30 seconds
  #9 'Tabs Closed [+30, forced GC]' - Force a GC

Repeat 2 more times, skipping the first 2 measurements. So that's 300 total page loads.

The 'summary' measurement is the geometric average of the first two checkpoints for startup and the last 7 checkpoints for the final iteration.
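A minimal sketch of that aggregation, assuming all checkpoint values are positive byte counts (the real harness is Python; this is only meant to show the arithmetic):

#include <cmath>
#include <vector>

// Geometric mean of the per-checkpoint measurements, accumulated in log
// space so that multiplying many large byte counts together does not overflow.
double GeometricMean(const std::vector<double>& checkpoints) {
  double logSum = 0.0;
  for (double value : checkpoints) {
    logSum += std::log(value);
  }
  return std::exp(logSum / static_cast<double>(checkpoints.size()));
}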

If you take a look at the subtests [1] you can see that everything but the startup measurements regressed, by more like 5% to 7%.

[1] https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=mozilla-inbound&originalRevision=dd0ce550bf64fa9f41c7f90329102d5c246ecaad&newProject=mozilla-inbound&newRevision=f139e3a8bb1592bc49806745633cb07aaae1fc8f&originalSignature=5964f29b714f0a6137fbe30d55cd91fe03430ce6&newSignature=5964f29b714f0a6137fbe30d55cd91fe03430ce6&framework=4
Flags: needinfo?(erahm)
Priority: -- → P2
Blocks: 1405738
Whiteboard: [MemShrink]
Attached file DMD report from AWSY
Using patches from bug 1395540, I can see that the StartLoad function creates new channels, and that these channels are responsible for the leaks that we are seeing.

For the moment, I do not know what part of the code keeps these channels alive.
I did a simple test which replaces all the tp5n URLs tested by AWSY with "about:blank".
This test highlighted that the harness (marionette) does not seem to cause the increase in memory.

Thus the issue noted here is related to the tp5n loaded pages, and the apparent leak increases when reducing MAX_TABS [1].

[1] http://searchfox.org/mozilla-central/source/testing/awsy/awsy/__init__.py#2
I talked this over a bit with erahm and nbp. It sounds like there's also a memory usage regression of around 5 MB, which is far more concerning than the regression in heap-unclassified. It might not be a problem per se, but it is hard to know without more information about where the memory is going.

Looking at the DMD report from comment 6, it sounds like there are 44 channels still alive, which makes me concerned that we're starting to leak channels. There's also a bunch of data held as strings in nsHttpHeaderArray::nsEntry, which is associated with some kind of HTTP chunk decoder. I don't know if that is expected or not. Maybe those are held alive by the channels.

Do you have any thoughts on this, Valentin? Thanks.
Flags: needinfo?(valentin.gosu)
> It sounds like there are 44 channels still alive, which makes me concerned
> that we're starting to leak channels.

These are all channels from ScriptLoader::StartLoad, right?  In general, we keep alive all sorts of channels (e.g. the channel each document is loaded from stays alive for the lifetime of the document).  But I would not have expected that for script load channels.  In particular, ScriptLoader::StartLoad doesn't store the channel anywhere I can see, apart from the stack.

> There's also a bunch of data held as strings in nsHttpHeaderArray::nsEntry, which is associated with
> some kind of http chunk decoder.

Right, that would be the HTTP response headers.  If the channels are alive, they would certainly be keeping this alive.
Apart from LOAD_DOCUMENT_URI we also keep the HttpChannelChild alive when mPreferredCachedAltDataType is not empty - which would be the case here, since opening the alternative output stream is dependent on the HttpChannelParent having a cache entry available.

One thing I just noticed is that in this case we never call SendDocumentChannelCleanup - which could definitely be the cause of this leak. I filed bug 1411697 for this and I'll push a patch ASAP.
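To make the suspected mechanism concrete, here is a hedged standalone sketch with hypothetical names (only SendDocumentChannelCleanup and the alt-data members are taken from the comments above): the child-side channel holds a strong reference so that alt-data can still be written, and only drops it when a cleanup notification arrives; if that notification is never sent, the reference lives as long as the channel object does.

#include <memory>
#include <utility>

struct CacheEntry {};

// Hypothetical child-side channel.  The real cleanup is an IPC message
// (the DocumentChannelCleanup mentioned above); here it is just a method.
class ChildChannelSketch {
 public:
  void OnCacheEntryAvailable(std::shared_ptr<CacheEntry> aEntry) {
    mAltDataCacheEntry = std::move(aEntry);  // held so alt-data can be written
  }

  void RecvCleanup() {
    // Never calling this is exactly the leak pattern described above.
    mAltDataCacheEntry.reset();
  }

 private:
  std::shared_ptr<CacheEntry> mAltDataCacheEntry;
};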
Depends on: 1411697
Flags: needinfo?(valentin.gosu)
(In reply to Boris Zbarsky [:bz] (no decent commit message means r-) from comment #9)
> In particular, ScriptLoader::StartLoad doesn't store the channel anywhere I
> can see, apart from the stack.

The channel is kept alive by the ScriptLoadRequest mCacheInfo field [1], if we intend to save the bytecode.

[1] http://searchfox.org/mozilla-central/rev/dd47bee6468de7e1221b4d006342ad6b9813d0e5/dom/script/ScriptLoadHandler.cpp#379

It is cleared at the end of the evaluation, or after saving the bytecode to the cache.
I will double-check all the error paths that I might have missed.
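A minimal standalone sketch of that lifetime rule, with hypothetical names (only mCacheInfo matches the real field): the request holds a strong reference which keeps the channel alive while the bytecode might still be written, and the reference has to be dropped on every exit path, error paths included, otherwise the channel leaks as observed here.

#include <memory>

struct CacheInfoChannel {};  // stands in for the nsICacheInfoChannel reference

struct ScriptLoadRequestSketch {
  // Strong reference that keeps the underlying channel alive so the encoded
  // bytecode can still be written to the cache after script evaluation.
  std::shared_ptr<CacheInfoChannel> mCacheInfo;

  void FinishBytecodeEncoding(bool aSucceeded) {
    if (aSucceeded) {
      // ... write the encoded bytecode through mCacheInfo ...
    }
    // Success or failure, the reference must be dropped here; missing this on
    // an error path keeps the channel (and its response headers) alive.
    mCacheInfo.reset();
  }
};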
I'm going to update the summary to be clearer about what the problem is.
Summary: JSBC: 3% Heap Unclassified summary windows10-64 → JSBC leaks channels when enabled on AWSY
Nicolas, can you retest with JSBC enabled now that bug 1411697 has landed?
Flags: needinfo?(nicolas.b.pierron)
Whiteboard: [MemShrink] → [MemShrink:P1]
(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #13)
> Nicolas, can you retest with JSBC enabled now that bug 1411697 has landed?

I tried on Sunday, and this seems to have fixed the issue I was noticing on AWSY:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=5b78bb355d4d&newProject=try&newRevision=00471247c1056a0c26679f1b900e510cd16f9aca&framework=4
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=5b78bb355d4d&newProject=try&newRevision=00471247c1056a0c26679f1b900e510cd16f9aca&originalSignature=cf8c73114da80e660331e208dcca846248e060f9&newSignature=cf8c73114da80e660331e208dcca846248e060f9&framework=4

I was unable to notice any difference in the number of channels created larger than the run-to-run variation between two runs with the same configuration.

(note: these builds take much more memory because DMD is enabled)
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Flags: needinfo?(nicolas.b.pierron)
Resolution: FIXED → DUPLICATE
Component: DOM → DOM: Core & HTML