Track AWSY regressions more precisely by using leaf node data from about:memory rather than high-level data

RESOLVED WONTFIX

Status

defect
RESOLVED WONTFIX
5 years ago
2 years ago

People

(Reporter: kats, Assigned: erahm)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [MemShrink:P1])

Take a look at the graph at [1]. This is the graph of "explicit" memory usage in Fennec over 175 inbound changesets sometime in early February. The problem with this graph is that the data is too noisy, and small regressions can sneak in without us being able to do a whole lot about it.

Now look at the graph at [2]. This is a graph of "explicit/atom-tables" memory usage over the same period of time. Admire the lack of noise in the graph - the two changes that caused regressions here can be easily identified.

In fact, most of the "leaf" nodes in the about:memory dump have very little noise, and it's very easy to pinpoint changes that increase them. But if you look at the big picture there's too much noise and it's hard to tell what's going on. At the risk of overwhelming you, the page at [3] contains links to all "interesting" (i.e. non-flat) graphs over the same period as [1] and [2]. The hover tooltip for each graph shows what the graph is for - it's pretty clear the leaf nodes are generally either bimodal or clean, and in either case regressions are very easy to detect. In the rare cases where the leaf data is still noisy, I think we can try to replace that memory reporter in Gecko with smaller reporters so as to get clean data.

So basically what I would like to do is:
(1) Break down noisy leaf-node memory reporters until all of our leaf-node memory reporters are providing a low-noise data stream.
(2) Add a system to track the leaf-node memory data and alert on significant regressions.
(3) Take a step further and automatically provide patch authors with a "memory impact" notice of their changes, which they can look at to ensure their change didn't do something unexpected and that the impact is acceptable.
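
As a rough illustration of step (2), here is a minimal sketch of a step detector for one leaf-node series. It is not AWSY code: the function name, the `window` and `min_increase` parameters, and the assumption that a series is a plain list of byte counts ordered by changeset are all hypothetical. It only works on low-noise series, which is why step (1) matters.

```python
def find_step_increases(series, window=3, min_increase=0.02):
    """Return indices where a clean series steps up and stays up.

    An index i is reported when every one of the `window` samples starting
    at i is larger than every one of the `window` samples before i by more
    than `min_increase` (a fraction of the earlier maximum).
    """
    hits = []
    for i in range(window, len(series) - window + 1):
        before = series[i - window:i]
        after = series[i:i + window]
        # Requiring min(after) > max(before) ignores one-off spikes and
        # reports a sustained step exactly once, at the first raised sample.
        if min(after) > max(before) * (1 + min_increase):
            hits.append(i)
    return hits


# Hypothetical "explicit/atom-tables" readings with two regressions.
atom_tables = [100] * 5 + [105] * 5 + [120] * 5
regressing_indices = find_step_increases(atom_tables)
```

Each reported index maps back to the changeset that produced that sample, which is the kind of pointer a "memory impact" notice in step (3) would need. A noisy or bimodal series will mostly produce no hits with this test, so it would need a different detector.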

[1] http://areweslimyet.mobi/plotter-results/graph-a45c2264b8cab0f1e0637f6c44501c0f34f7e227.png
[2] http://areweslimyet.mobi/plotter-results/graph-72af7cd253eff03ac05bfa55acf7cc4012a33140.png
[3] http://areweslimyet.mobi/plotter-results/
AWSY keeps all reporter dumps around (currently in a massive per-month sqlite DB) -- editing the export list [1] and re-running the cron job is all that's needed to export lines to graphs, e.g. the "misc" graph. This also gives you flat series in the full-resolution JSON files that could be scanned for regressions.

(Of course, none of this code is very sophisticated, so if you'd like to take a shot at writing something more comprehensive I'm all for it.)

[1] https://github.com/Nephyrin/MozAreWeSlimYet/blob/master/create_graph_json.py#L36
Note that we can't have a fixed list of leaf nodes, because over time things like new jsm files and compartments and windows appear, and we want to be able to detect those as well.
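
One way to avoid a fixed list is to derive the leaf set from each snapshot and diff it against the previous one. The sketch below assumes a snapshot is available as a nested mapping of names to either subtrees or byte counts; that layout and the function names are illustrative assumptions, not the format AWSY actually stores.

```python
def leaf_paths(tree, prefix=""):
    """Flatten a nested {name: subtree-or-bytes} mapping into {path: bytes}."""
    leaves = {}
    for name, value in tree.items():
        path = f"{prefix}/{name}" if prefix else name
        if isinstance(value, dict):
            leaves.update(leaf_paths(value, path))
        else:
            leaves[path] = value
    return leaves


def diff_leaves(old_tree, new_tree):
    """Return (added, removed) leaf paths between two snapshots."""
    old, new = leaf_paths(old_tree), leaf_paths(new_tree)
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    return added, removed


# Hypothetical snapshots: a new jsm compartment appears in the second one.
before = {"explicit": {"atom-tables": 100, "js": {"a.jsm": 5}}}
after = {"explicit": {"atom-tables": 100, "js": {"a.jsm": 5, "b.jsm": 7}}}
added, removed = diff_leaves(before, after)
```

Newly added paths can then start their own series and be reported as new consumers of memory, rather than being silently dropped because they were not on a list.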
Whiteboard: [MemShrink] → [MemShrink:P1]
I think this is a fantastic idea. (As is the idea of being able to diff AWSY snapshots.)

kats, what is required to move it forward? I know we have an excellent historical record on AWSY, but I'd be content to lose some of our historical data if it meant we could get better data in the recent past and moving forward.
I think in terms of data we already have the data we need to get started. It just needs some dedicated time from somebody. I'd be happy to do this but can only devote time to it on an ad-hoc basis right now.

The first step should be to separate the "clean" and "noisy" leaf nodes in the data. We can file bugs for the "noisy" data (to break down the memory reporters further) and write some tools to detect new regressions in the "clean" data. As we clean up the noisy stuff we can keep adding it to the clean data and report a wider variety of regressions.
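
A minimal sketch of that first step, assuming each leaf node's history is a list of byte counts keyed by its path. The noise metric (median change between consecutive samples, relative to the median level) and the 1% cutoff are my own assumptions for illustration; the median is used so that a few genuine step regressions do not make a clean series look noisy.

```python
from statistics import median


def noise_level(series):
    """Median sample-to-sample change as a fraction of the median level."""
    diffs = [abs(b - a) for a, b in zip(series, series[1:])]
    level = median(series) if series else 0
    if not diffs or level == 0:
        return 0.0
    return median(diffs) / level


def split_clean_noisy(named_series, max_noise=0.01):
    """Partition leaf-node paths into (clean, noisy) lists by noise level."""
    clean, noisy = [], []
    for name, series in named_series.items():
        bucket = clean if noise_level(series) <= max_noise else noisy
        bucket.append(name)
    return sorted(clean), sorted(noisy)


# Hypothetical data: one stepped-but-clean series and one jittery series.
history = {
    "explicit/atom-tables": [100] * 5 + [110] * 5,
    "explicit/heap-unclassified": [100, 130, 90, 140, 95, 135, 100, 125],
}
clean, noisy = split_clean_noisy(history)
```

The "noisy" list is the set of reporters to file bugs against; the "clean" list is what a regression detector can be pointed at right away. A bimodal series that flips often will land in the noisy bucket with this metric, so it may deserve a separate category.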

Initially I think we should use something like the "Explicit" memory report subtree on the StartSettled data snapshot because that tends to be the most stable. Once we have that under control we can start expanding the scope to other data snapshots and resident memory as well.
Assignee: nobody → erahm
Depends on: 1024249
We're pushing data into Perfherder now with automated regression tracking, so I don't think this is worth pursuing.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → WONTFIX