Closed Bug 1013333 Opened 6 years ago Closed 6 years ago

HTTP cache v2: ~15MB bump in RSS memory usage, ~11MB bump in Explicit memory usage on mobile

Categories

(Core :: Networking: Cache, defect)

All
Android
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla32

People

(Reporter: mayhemer, Assigned: michal)

References

(Blocks 1 open bug)

Details

(Whiteboard: [MemShrink])

Attachments

(4 files)

(In reply to Mark Finkle (:mfinkle) from bug 913806 comment #13)
> When the new cache was turned on we saw a ~15MB bump in RSS memory usage

What exactly does this measure?

> and
> ~11MB bump in Explicit memory usage:
> https://areweslimyet.com/mobile/
> 
> It looks like the Explicit jump is due to network/cache2/disk-storage()
> which wasn't reported before turning on, but is now ~10MB. Is this memory
> that just wasn't reported before with the old cache?

This (network/cache2/disk-storage()) is an area used only by the new cache - as the name suggests.  So, it was 0 with cache2=off.

This, however, should not be 11MB.  We have a limit to keep only metadata, not data, of disk entries in memory, and the limit is 250kB, so way below 11MB.

Possible causes:
- simply a bug in the limit counter, but then why don't I see this on desktop too?
- a failure to write data, in which case we may keep the data in memory (have to check the code and with Michal, though)
- extremely slow writes to the media; everything received from the network is buffered and then written in the background, so when writes are slow, this may pile up
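The slow-writes possibility can be illustrated with a minimal sketch (plain Python, not Firefox code; the numbers are made up): a producer buffers network chunks in memory while a background writer drains only a bounded amount per tick.

```python
def peak_backlog(chunks_in, writes_per_tick):
    """Peak number of buffered chunks when a background writer
    flushes at most writes_per_tick chunks per tick."""
    backlog = peak = 0
    for arrived in chunks_in:
        backlog += arrived                        # network data is buffered
        peak = max(peak, backlog)
        backlog -= min(backlog, writes_per_tick)  # slow disk drains some
    return peak

# A disk that keeps up stays flat; a slow one lets memory pile up.
print(peak_backlog([4] * 100, 4))  # 4
print(peak_backlog([4] * 100, 1))  # 301
```

With writes as fast as arrivals the backlog never exceeds one tick's input; at a quarter of the arrival rate it grows roughly linearly with time, which is the pile-up suspected here.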
Summary: [MemShrink] HTTP cache v2: ~15MB bump in RSS memory usage, ~11MB bump in Explicit memory usage on mobile → HTTP cache v2: ~15MB bump in RSS memory usage, ~11MB bump in Explicit memory usage on mobile
Whiteboard: [MemShrink]
One possibility is that this is a problem in the harness itself - I use a canned profile for the Fennec run, and I only re-generate the profile periodically. It could be that the profile being used here is too old and not playing well with the new code. I'll re-generate the profile and re-run the before/after builds on this regression to confirm or eliminate that as a possible source of problems.
Looking into the pushlogs at the edges of the bumps, there are no cache2 patches at all.  Or am I looking at the wrong data?  (I've checked that the changesets are all expanded.)
Also for reference the raw data from the before run is at [1] and the raw data from the after run is at [2]. In this case the relevant files are the unified-memory-report-Tabs*.json.gz files, and they can be loaded into about:memory to investigate and compare.

[1] http://areweslimyet.mobi/data/mozilla-inbound/1400171192/
[2] http://areweslimyet.mobi/data/mozilla-inbound/1400171971/
Attached image AWSY memory increase
(In reply to Honza Bambas (:mayhemer) from comment #2)
> Looking into the pushlogs at the edges of the bumps, there are no cache2
> patches at all.  Or am I looking at the wrong data?  (I've checked that the
> changesets are all expanded.)

I'm not sure which part you're looking at, but I'm attaching a screenshot of the bump in question. You have to zoom in around may 14-15, and then clicking on the data point with the increase will give you changeset info.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #4)
> I'm not sure which part you're looking at, but I'm attaching a screenshot of
> the bump in question. You have to zoom in around may 14-15, and then
> clicking on the data point with the increase will give you changeset info.

Aha.  Got it.  https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=8d7dedb27da9&tochange=4e38abbbd329

So, it's cache2.
From the data in comment 3, I can see one strange thing:

├───10.12 MB (07.81%) -- network
│   ├──10.12 MB (07.81%) -- cache2
│   │  ├───9.98 MB (07.70%) ── disk-storage()
│   │  └───0.14 MB (00.11%) -- (3 tiny)
│   │      ├──0.07 MB (00.05%) ── io
│   │      ├──0.06 MB (00.05%) ── index
│   │      └──0.01 MB (00.01%) ── service


The index is practically 0, as is all the rest.


Typical for my desktop is:

├────3,672,024 B (00.83%) -- network
│    ├──3,015,078 B (00.68%) -- cache2
│    │  ├──1,041,472 B (00.24%) ── index
│    │  ├──1,027,718 B (00.23%) ── disk-storage()
│    │  ├────936,512 B (00.21%) ── memory-storage(/M)
│    │  ├──────6,288 B (00.00%) ── io
│    │  ├──────2,720 B (00.00%) ── service
│    │  └────────368 B (00.00%) ── disk-storage(a,)


Michal, any idea in what state the mobile could be?
Flags: needinfo?(michal.novotny)
It is hard to guess without any additional data. If the about:memory status was taken soon after startup, the cache index could be corrupted and the rebuild process might not have started yet, so the cache index would contain only entries opened in this session. It is also possible that there really are just a few big entries.
Flags: needinfo?(michal.novotny)
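The corrupted-index scenario described above can be illustrated with a toy model (purely illustrative — the class and method names here are invented, not the actual cache2 CacheIndex logic):

```python
class ToyCacheIndex:
    """After a corrupt read, the index starts empty and only learns
    about entries opened this session, until a background rebuild
    rescans the entries on disk."""
    def __init__(self, corrupted):
        self.known = set()
        self.rebuilt = not corrupted

    def on_entry_opened(self, key):
        self.known.add(key)                   # entries touched this session

    def rebuild(self, entries_on_disk):
        self.known.update(entries_on_disk)    # background rescan of the disk
        self.rebuilt = True

idx = ToyCacheIndex(corrupted=True)
idx.on_entry_opened("a")
print(len(idx.known))                # 1: near-zero before the rebuild
idx.rebuild({"a", "b", "c", "d"})
print(len(idx.known))                # 4: full picture after the rebuild
```

This is why an about:memory snapshot taken right after startup can show an index size close to zero, as in the mobile report above.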
I regenerated the profile and ran the before/after inbound builds 5 times each. Got these results for the resident memory numbers:

Start
  Before		Average: 132492492.80	Stddev: 2036496.27	
  After			Average: 133459148.80	Stddev:  748003.62	
StartSettled
  Before		Average: 150307635.20	Stddev: 2182000.98	
  After			Average: 151481548.80	Stddev:  102429.49	
TabsOpen
  Before		Average: 280479334.40	Stddev: 1239386.52	
  After			Average: 297619456.00	Stddev: 4021658.80	
TabsOpenSettled
  Before		Average: 280418713.60	Stddev: 1080066.66	
  After			Average: 279783014.40	Stddev: 32026614.90	
TabsOpenForceGC
  Before		Average: 269561856.00	Stddev: 1382852.31	
  After			Average: 270358118.40	Stddev: 28334823.45	
TabsClosed
  Before		Average: 257646592.00	Stddev:  633729.43	
  After			Average: 260224614.40	Stddev: 23949774.22	
TabsClosedSettled
  Before		Average: 185928908.80	Stddev: 1057543.73	
  After			Average: 187814707.20	Stddev: 2717032.63	
TabsClosedForceGC
  Before		Average: 185603686.40	Stddev:  957155.89	
  After			Average: 186922598.40	Stddev: 3907056.38	

There still appears to be a ~16MB regression in the "TabsOpen" stage but that seems to disappear later. I can attach the full data as well.
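The averages and standard deviations in tables like the one above can be reproduced from the raw per-run byte counts along these lines (a sketch with made-up sample values; whether the harness uses population or sample stddev is an assumption here):

```python
import statistics

def summarize(samples):
    """Mean and population standard deviation of per-run byte counts."""
    return statistics.mean(samples), statistics.pstdev(samples)

def delta_mb(before, after):
    """Mean difference in MB; positive means a memory regression."""
    return (statistics.mean(after) - statistics.mean(before)) / (1024 ** 2)

# Hypothetical 5-run samples, in bytes (not the actual harness data):
before = [280_000_000, 281_000_000, 280_500_000, 279_500_000, 280_200_000]
after = [297_000_000, 298_000_000, 297_500_000, 297_200_000, 297_800_000]
print(round(delta_mb(before, after), 1))  # 16.5 (MB regression)
```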
Full data in the tarball at people.mozilla.com/~kgupta/tmp/awsy-cache2.tgz - the 1400171192-* folders are from before the cache was enabled and the 1400171971-* folders are from after.
Kartikaya, can we get a NSPR_LOG_MODULES=cache2:5 log?
Yup, here is the logcat from a full run with cache2 logging turned on. In this run the resident memory on the TabsOpen was 294162432 bytes.
Thanks!  We believe we know the cause now, bug in our preload logic.  Fix is one line.  Michal will take this now.
Assignee: nobody → michal.novotny
Status: NEW → ASSIGNED
Attached patch patch v1Splinter Review
Attachment #8425765 - Flags: review?(honzab.moz)
Attachment #8425765 - Flags: review?(honzab.moz) → review+
I ran the two try builds you listed above through the harness and got the results below. "Before" is pure m-c, "after" is with the patch. In general the patch seems to make it worse? However, the "before" numbers are now also lower than what I reported in comment 8, and looking at areweslimyet.com/mobile over the last day or so there's a lot of fluctuation - it looks like bimodal data now. The patch does at least seem to reduce the standard deviation of the numbers, so maybe that will make it easier to pin down the numbers with the patch in place?

Start
  Before		Average: 135365427.20	Stddev: 1187691.40	
  After			Average: 134992691.20	Stddev: 1244891.32	
StartSettled
  Before		Average: 152170496.00	Stddev:  327935.90	
  After			Average: 152095129.60	Stddev:  384261.56	
TabsOpen
  Before		Average: 276848640.00	Stddev: 19287159.80	
  After			Average: 286408704.00	Stddev: 4176588.99	
TabsOpenSettled
  Before		Average: 222875648.00	Stddev: 31899915.70	
  After			Average: 281151897.60	Stddev: 5074918.94	
TabsOpenForceGC
  Before		Average: 221931110.40	Stddev: 30593746.35	
  After			Average: 276035174.40	Stddev: 3689857.77	
TabsClosed
  Before		Average: 219013939.20	Stddev: 26508946.75	
  After			Average: 264722022.40	Stddev: 3794672.52	
TabsClosedSettled
  Before		Average: 184007884.80	Stddev: 7964210.58	
  After			Average: 191765708.80	Stddev: 3975449.73	
TabsClosedForceGC
  Before		Average: 184693555.20	Stddev: 7287549.70	
  After			Average: 191699353.60	Stddev: 4163810.43
Kats, the patch should have completely fixed it.  Can you again produce a cache2:5 log please?  This is weird.  Thanks.
And for the '"before" is pure m-c' run, what cset is that?  You can also easily turn the cache off on any changeset: make netwerk/cache2/CacheObserver.cpp!CacheObserver::UseNewCache() return false immediately.
(In reply to Honza Bambas (:mayhemer) from comment #17)
> Kats, the patch should have completely fixed it.  Can you again produce a
> cache2:5 log please?  This is weird.  Thanks.

Will do.

(In reply to Honza Bambas (:mayhemer) from comment #18)
> And for the '"before" is pure m-c' run, what cset is that?

I just used the builds from your try pushes, so cset 85c54f4e09c4. The ARMv6 builds, specifically.
Ugh, I just realized I've been running with the cache2:5 logging enabled since yesterday - I forgot to turn it off after getting that log you wanted. The logging may have affected the numbers, so I'll re-run with that disabled. Here's a log from one of the runs I just did (comment 16) which has the patch on this bug applied.
85c54f4e09c4 - it's mozilla-central with cache2 ENABLED, so it should manifest this bug (add 15MB/11MB regressions)
d0afe20c43a1 - it's the same + patch for this bug, so it should revert to what you have seen before enabling cache2 on may 15

Haven't you switched the changesets?
(In reply to Honza Bambas (:mayhemer) from comment #21)
> Haven't you switched the changesets?

I double-checked to make sure I hadn't switched them. But yes I agree the new numbers are not what I would expect, and don't make much sense. Let me finish re-running them with the logging disabled and see if that makes any difference.

Also I have noticed in the past that try builds in general have different baseline values than inbound builds, so it's possible that the patch does regress things more, but the baseline is different so you can't compare the numbers in comment 16 to the numbers in comment 8.

Did you take a look at the log in comment 20 to see if it demonstrates the desired behaviour from the patch on this bug?
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #22)
> (In reply to Honza Bambas (:mayhemer) from comment #21)
> > Haven't you switched the changesets?
> 
> I double-checked to make sure I hadn't switched them. But yes I agree the
> new numbers are not what I would expect, and don't make much sense. Let me
> finish re-running them with the logging disabled and see if that makes any
> difference.

Logging is not the cause here, I'm pretty sure.

> 
> Also I have noticed in the past that try builds in general have different
> baseline values than inbound builds, so it's possible that the patch does
> regress things more, but the baseline is different so you can't compare the
> numbers in comment 16 to the numbers in comment 8.

Of course, I thought you were comparing the two runs from comment 15 and not try/d0afe20c43a1 vs m-c.

> 
> Did you take a look at the log in comment 20 to see if it demonstrates the
> desired behaviour from the patch on this bug?

Yes, and it seems fixed.  Although the log is not complete (the shutdown lines are missing), apparently we release all data chunks now.  Maybe the leak is elsewhere?
Here is the data without logging. "Before" is 85c54f4e09c4 and "after" is d0afe20c43a1. It does show an improvement in TabsOpen but it's worse in a few of the others. The reduction in stddev is still definitely present with the patch, so that's good at least.

Start
  Before		Average: 134497894.40	Stddev:  722326.32	
  After			Average: 134868992.00	Stddev:  894887.42	
StartSettled
  Before		Average: 151438131.20	Stddev:  534325.67	
  After			Average: 151687987.20	Stddev:  697650.65	
TabsOpen
  Before		Average: 297408921.60	Stddev: 2990509.32	
  After			Average: 287926681.60	Stddev: 3421254.81	
TabsOpenSettled
  Before		Average: 264090419.20	Stddev: 40713536.12	
  After			Average: 272293068.80	Stddev: 29536189.76	
TabsOpenForceGC
  Before		Average: 255603507.20	Stddev: 37161970.37	
  After			Average: 264262451.20	Stddev: 25701413.40	
TabsClosed
  Before		Average: 247148544.00	Stddev: 32913039.80	
  After			Average: 253178675.20	Stddev: 21840942.43	
TabsClosedSettled
  Before		Average: 194768076.80	Stddev: 11626403.00	
  After			Average: 188168601.60	Stddev:  4334807.15	
TabsClosedForceGC
  Before		Average: 194641920.00	Stddev: 11505177.69	
  After			Average: 188130918.40	Stddev:  3387477.29
BTW, the about:memory logs would be good - how do I get the link from AWSY?
about:memory logs for the runs in comment 24 are at http://people.mozilla.org/~kgupta/tmp/awsy-cache2.tgz (since I'm running these from try builds the data isn't automatically pushed anywhere publicly accessible).
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #26)
> about:memory logs for the runs in comment 24 are at
> http://people.mozilla.org/~kgupta/tmp/awsy-cache2.tgz (since I'm running
> these from try builds the data isn't automatically pushed anywhere publicly
> accessible).

Thanks.  According to these logs, this bug is clearly fixed at the cache2 level.

I was checking explicit/network/cache2/disk-storage().  In the 85c54f4e09c4 reports (unfixed m-c with cache2 on) there are almost always >10MB used.  In *all* of the d0afe20c43a1 (fixed cache2) it is ~1MB, which is an expected overhead for the 250kB metadata limit.
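For illustration, a metadata cap like the 250kB limit discussed in this bug could behave along these lines (a hedged sketch; the class, the FIFO eviction order, and the per-entry sizes are all assumptions, not the actual cache2 implementation):

```python
from collections import OrderedDict

METADATA_LIMIT = 250 * 1024  # the 250kB cap mentioned above, in bytes

class MetadataPool:
    """Toy model: tracks in-memory metadata sizes per entry and
    evicts oldest-first once the running total exceeds the limit."""
    def __init__(self, limit=METADATA_LIMIT):
        self.limit = limit
        self.sizes = OrderedDict()  # entry key -> metadata bytes
        self.used = 0

    def add(self, key, size):
        self.sizes[key] = size
        self.used += size
        # Drop the oldest entries' metadata until back under the limit.
        while self.used > self.limit and len(self.sizes) > 1:
            _, freed = self.sizes.popitem(last=False)
            self.used -= freed

pool = MetadataPool()
for i in range(2000):
    pool.add(f"entry-{i}", 1024)  # 1 kB of (hypothetical) metadata per entry
print(pool.used <= METADATA_LIMIT)  # True: total stays at the cap
```

Under such a cap the steady-state memory use hovers around the limit regardless of how many entries pass through, which is consistent with the ~1MB total observed in the fixed builds versus the >10MB in the unfixed ones.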
Ok, in that case please feel free to land this patch. At the very least it fixes one known regression. I'm interested in seeing the data from inbound after this lands and comparing to before cache2 was turned on and/or looking for other sources of regressions.
Also a note that the old cache was eating some memory (for performance reasons) as well, but that was never reported.  Now we do report it, hence more memory shows up :)  cache2 might just be more honest!
https://hg.mozilla.org/mozilla-central/rev/004ae4f486c2
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla32
I did some more analysis on the TabsClosedForceGC data point for the 200 inbound csets around the cache2 enabling (100 csets before, 100 csets after). The graph data is at [1] although if you're unfamiliar with bug 1000268 the graph data may be hard to interpret (well it's hard to interpret regardless).

The upshot of all this is that there were significant jumps in both explicit/network and explicit/heap-overhead/waste at that time.

[1] http://areweslimyet.mobi/cache2/plotter/
Sorry for the delay here but AWSY was broken for a bit. Now that the fix above has had time to sit, I did another analysis of the 100 csets before and after the fix landed to see what changed. Data is at http://areweslimyet.mobi/cache2/plotter2/

At around the time this fix landed, there were a lot of significant changes. Here are the first few I found (not necessarily the largest in terms of byte-count changes):

* decommitted/js-non-window/gc-heap/decommitted-arenas stopped being hexa-modal and settled to a stable low value (dropped by ~5MB on average)
* explicit/heap-overhead/waste jumped by about 4MB
* explicit/images/content/raster/unused/raw jumped by about 600k
* explicit/js-non-window/gc-heap/chunk-admin stopped being hexa-modal and dropped by ~80k on average
* general reductions across explicit/js-non-window/zones/ totalling 150k
* explicit/storage/sqlite went up by 600k
* many more..

It'll take a while to go through all of these; some of them are probably from other changesets that landed at around the same time. Really I'd rather wait until bug 1000268 is fixed so we get some tools that generate all this data rather than looking through it manually.