Last Comment Bug 492185 - Find classes that are wasting memory due to alignment
: Find classes that are wasting memory due to alignment
Status: NEW
[MemShrink:P3]
: footprint
Product: Core
Classification: Components
Component: Rewriting and Analysis (show other bugs)
: Trunk
: All All
: -- normal with 5 votes (vote)
: ---
Assigned To: Nobody; OK to take it and work on it
:
Mentors:
Depends on: 492257 509380 745758 747516
Blocks: analyses 583074 492376 492400 492510 492520 492531
  Show dependency treegraph
 
Reported: 2009-05-09 03:06 PDT by Arpad Borsos [:Swatinem]
Modified: 2014-07-01 13:04 PDT (History)
47 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
dehydra script (3.19 KB, application/javascript)
2009-05-09 03:06 PDT, Arpad Borsos [:Swatinem]
no flags Details
List of badly aligned classes (ods) (71.73 KB, application/vnd.oasis.opendocument.spreadsheet)
2009-05-09 03:08 PDT, Arpad Borsos [:Swatinem]
no flags Details
dehydra script, v2 (3.64 KB, application/javascript)
2009-05-10 05:45 PDT, Arpad Borsos [:Swatinem]
no flags Details
list of badly aligned classes (csv) (110.51 KB, text/plain)
2009-05-11 01:56 PDT, Arpad Borsos [:Swatinem]
no flags Details
dehydra script, v3 (4.01 KB, application/javascript)
2009-05-11 01:59 PDT, Arpad Borsos [:Swatinem]
no flags Details
list of badly aligned classes (csv) (78.83 KB, text/plain)
2009-05-11 09:14 PDT, Arpad Borsos [:Swatinem]
no flags Details
dehydra script (4.37 KB, application/javascript)
2009-05-11 10:34 PDT, Arpad Borsos [:Swatinem]
no flags Details
dehydra script (4.88 KB, application/javascript)
2010-08-18 04:43 PDT, Arpad Borsos [:Swatinem]
no flags Details
Vava prog for parsing output into cvs (1.42 KB, text/x-vala)
2010-08-18 04:45 PDT, Arpad Borsos [:Swatinem]
no flags Details
fix classes wasting memory (6.66 KB, patch)
2010-08-20 03:27 PDT, Arpad Borsos [:Swatinem]
dolske: review-
Details | Diff | Splinter Review
Python script to aggregate clang -Wpadded warnings (1003 bytes, text/x-python)
2014-02-19 02:39 PST, Arpad Borsos [:Swatinem]
no flags Details

Description Arpad Borsos [:Swatinem] 2009-05-09 03:06:12 PDT
Created attachment 376545 [details]
dehydra script

I have been hacking up a dehydra script that is calculating a classes "real" sizeof and comparing it to the final sizeof to determine if that class is wasting memory due to alignment.
The script works quite well as far as I can tell.
Comment 1 Arpad Borsos [:Swatinem] 2009-05-09 03:08:41 PDT
Created attachment 376546 [details]
List of badly aligned classes (ods)

This is a filtered list of badly aligned classes. I will skim through it and try to fix some of the classes.
Comment 2 (dormant account) 2009-05-09 14:01:43 PDT
This awesome stuff. Who would've thought that getting this number is as simple as subtracting sizes of members from the struct size :) Whenever I thought about this issue I forgot that dehydra provides size_of.

/**
 * Size of a pointer, assume as 8, TODO: how to get the *real* pointer size?
 */
var pointerSize = 8;

I presume this is 64bit

	if(type.isPointer)
	{
		size.real = pointerSize;
I think you can do size.real = type.precision/8

For lazy bums like me, csv files are easier to open.

As I said before, awesome!

David, I believe you wanted to see some static analysis results on accessibility, this spreadsheet looks like a goldmine.
Comment 3 Arpad Borsos [:Swatinem] 2009-05-09 14:57:21 PDT
>         size.real = pointerSize;
> I think you can do size.real = type.precision/8

Yes indeed. I use pointerSize mainly for adding the size of the vtable pointer to classes.
Will see what else I can do there.

I wonder how much freedom I have in reordering members. I suppose I can't change anything in NSPR/NSS/frozen linkage due to ABI compatibility. Other classes should be fine though?

A quick look over some classes also revealed more optimization opportunities by changing away from PRBool (4byte) to PRPackedBool (1byte) or even straight to bool, as that is now used in some places in the codebase already.
Comment 4 (dormant account) 2009-05-09 15:14:19 PDT
> I wonder how much freedom I have in reordering members. I suppose I can't
> change anything in NSPR/NSS/frozen linkage due to ABI compatibility. Other
> classes should be fine though?

Think so.

> 
> A quick look over some classes also revealed more optimization opportunities by
> changing away from PRBool (4byte) to PRPackedBool (1byte) or even straight to
> bool, as that is now used in some places in the codebase already.

PRPackedBool is probably easier since you don't have to add casts everywhere to switch to bool. I'm also not sure if all modules allow bool usage.

While you are at it could you do a check similar to the one mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=376434

Basically scan for structs that have both signed and unsigned bitfields and warn because MSVC packs them wrong.
Comment 5 David Bolter [:davidb] 2009-05-09 16:46:47 PDT
(In reply to comment #2)
> David, I believe you wanted to see some static analysis results on
> accessibility, this spreadsheet looks like a goldmine.

Wow, crazy! Great work. What happens next?
Comment 6 (dormant account) 2009-05-09 17:09:29 PDT
(In reply to comment #5)
> (In reply to comment #2)
> > David, I believe you wanted to see some static analysis results on
> > accessibility, this spreadsheet looks like a goldmine.
> 
> Wow, crazy! Great work. What happens next?

Would be nice if you could go through accessibility ones to verify the results are correct and maybe pick a couple that are frequently allocated as candidates to fixing.
Comment 7 Brendan Eich [:brendan] 2009-05-09 21:39:27 PDT
Arpad: excellent work -- any clues from your tool how JSRuntime loses 92 bytes to internal fragmentation?

/be
Comment 8 Arpad Borsos [:Swatinem] 2009-05-10 01:26:03 PDT
(In reply to comment #7)
> Arpad: excellent work -- any clues from your tool how JSRuntime loses 92 bytes
> to internal fragmentation?
> 
> /be


My tool currently doesn't print a list of members, but that would be easy to add.
Looking at the code, I can see some possible problems:

>    uint32              gcMallocBytes;
>    JSGCArenaInfo       *gcUntracedArenaStackTop;

>    JSBool              rngInitialized;
>    int64               rngMultiplier;

>    uint32              requestCount;
>    JSThread            *gcThread;

And quite a few more... I'm on x86_64, and every time a 4byte member is followed by a 8byte member (or pointer) its wasting 4 bytes.
Comment 9 Arpad Borsos [:Swatinem] 2009-05-10 04:02:49 PDT
Started a thread on dev.platform: http://groups.google.com/group/mozilla.dev.platform/browse_thread/thread/52bc09157428dadb#
Comment 10 Arpad Borsos [:Swatinem] 2009-05-10 05:45:21 PDT
Created attachment 376620 [details]
dehydra script, v2
Comment 11 Jesse Ruderman 2009-05-10 12:31:39 PDT
On 32-bit x86, misaligned doubles are allowed but slow down the program considerably.  It might be useful to run this script with pointerSize=4 to find those as well.
Comment 12 Arpad Borsos [:Swatinem] 2009-05-11 01:56:26 PDT
Created attachment 376682 [details]
list of badly aligned classes (csv)

This list includes a possible size if using C++ bools (which are 1byte as are PRPackedBools) and a list of direct members with their sizes (does not include inherited members from parent classes).
Comment 13 Arpad Borsos [:Swatinem] 2009-05-11 01:59:14 PDT
Created attachment 376683 [details]
dehydra script, v3
Comment 14 Arpad Borsos [:Swatinem] 2009-05-11 09:14:50 PDT
Created attachment 376728 [details]
list of badly aligned classes (csv)

The script had some issues with inheritance and vtables, probably still has, I'm not quite sure.
Comment 15 Zack Weinberg (:zwol) 2009-05-11 10:17:53 PDT
You might find this document helpful, if you haven't seen it already:

http://www.codesourcery.com/public/cxx-abi/abi.html#layout

It lays out things like the rules for where the vtable pointer appears and under what circumstances a derived class's members can be packed into the base class's tail padding.  (Ignore mention of Itanium; GCC uses this ABI on all architectures nowadays, modulo things like pointer size and minimum required alignment.)
Comment 16 Arpad Borsos [:Swatinem] 2009-05-11 10:34:41 PDT
Created attachment 376742 [details]
dehydra script

Still has problems with inheritance and vtable pointers though :(
Comment 17 Boris Zbarsky [:bz] 2009-05-11 22:08:20 PDT
Quick comments on looking at the list:

1)  A number of these are services or one-per-document objects.  Savings from
    those will be small no matter what.  It might not be worth making the
    changes if it decreases readability by separating related members.
2)  You might want to get an NSS person to look at those SEC* things near the
    top of the list (as well as the other ones on it).
3)  A bunch of per-element accessibility stuff here that could really use
    fixing, if the numbers are right.
4)  The frame and element classes in that list should be a high priority.
5)  I'm not sure where some of these differences are coming from.  For example,
    The one for nsHTMLFormElement seems way too big; I would have expected
    about 2 bytes of difference there.  The class does have 5 vtable pointers
    in addition to the ones inherited from the superclass; is that where the
    other 40 bytes come from?
Comment 18 Vladimir Vukicevic [:vlad] [:vladv] 2009-05-13 03:36:08 PDT
Minor output nit:  can you include a percentage, or better yet just stick the results in a google spreadsheet and link to them?

Tying these results to trace-refcnt output from a Tp run would also be interesting, as far as looking at the maximum number of bytes saved during runtime.
Comment 19 Arpad Borsos [:Swatinem] 2009-05-13 08:00:47 PDT
This time I tried a different approach. Instead of taking the compiled sizeof() and messing around with vtable pointers, I rather look at the members and see how many GAPs are between them. I then calculate how much one can win by reordering the members or using 1byte bools.

This time, there aren't that many wins to be made. More wins could be made by cutting down the inheritance tree, consolidating Interfaces and remove unnecessary ones.

Here is the spreadsheet: http://spreadsheets.google.com/ccc?key=rVIrgklQt9Y2H1hnS87L1HA
Comment 20 Brendan Eich [:brendan] 2009-07-17 17:16:04 PDT
The desire for a cheap, always-on memory profiler that puts allocations in coarse, non-hierarchical buckets (JS, DOM, HTML, XUL, CSS, ...) came up again, Ben Galbraith is leading the charge.

This goal reminded me of SGI's Power Bloat Vision(tm) tool, a graphical line-chart or (default UI) bar with color-coded segments jumping around as different kernel identified resource users of memory changed their shares. PBV really helped users zero in on bloat in their apps, as well as in Motif/X, the SGI graphical desktop, and Irix ;-).

Thinking about how to do it super-cheaply, I thought of this bug. We don't want to add malloc interception by default, but we use jemalloc on Windows and Linux, and may on Mac. We can hack jemalloc. Problem is we don't want to change malloc, new, etc. signatures to carry "bucket ids", nor do we wish to walk the stack on every allocation, even by one frame.

But suppose each bucket (subsystem) happened to allocate unique sizes? Then we could statically analyze the sizes and produce a mapping to burn into the build.

Ok, I'm sure different buckets have their own structs or classes that happen to have the same size. Still, we could find such types with static analysis, and even generate a patch to farble the size of one or the other to differ.

If this wastes space, we could do something trickier, like farble the size passed to malloc in a way that did not waste space (e.g., subtract from 0 to 3 to help distinguish 4 buckets' uses of the same size).

What say you static analysis folks? We need your northern-style kung fu!

/be
Comment 21 Brendan Eich [:brendan] 2009-07-17 17:20:21 PDT
I will file a new bug if needed, I didn't mean to hijack this one -- sorry. Want to get a quick sanity check, figured this bug could use some action ;-).

/be
Comment 22 Arpad Borsos [:Swatinem] 2010-08-18 04:43:21 PDT
Created attachment 466989 [details]
dehydra script
Comment 23 Arpad Borsos [:Swatinem] 2010-08-18 04:45:17 PDT
Created attachment 466990 [details]
Vava prog for parsing output into cvs

I have been running this again after a long time, there are quite a few offenders left, see the bug url field.
Comment 24 Vladimir Vukicevic [:vlad] [:vladv] 2010-08-18 14:37:26 PDT
Assuming we fixed everything and every structure was tightly packed, what's the total net savings?  (Both absolute and %)
Comment 25 (dormant account) 2010-08-18 14:40:01 PDT
(In reply to comment #24)
> Assuming we fixed everything and every structure was tightly packed, what's the
> total net savings?  (Both absolute and %)

Savings would require instrumenting the build to figure out how often these datastructures are used.
Comment 26 Benjamin Smedberg AWAY UNTIL 2-AUG-2016 [:bsmedberg] 2010-08-19 08:18:19 PDT
Can we get the list sorted by win size or win%? I suspect that most of these classes are very infrequently used, but a few seem like they could be big wins. Some suggestions, if you want to do tryserver runs on a limited set:

nsHtml5StackNode (not actually stack-allocated, despite its name)
nsStyleVisibility
mozilla::dom::Link (there can be thousands of these per page, might improve Tp times)
nsHtml5TreeOperation
nsHtml5ElementName
nsHTMLElement (!)

For all of the IPDL-generated classes (P*), you should probably just file a bug on the code generator to emit the members in a better order.

It also looks like you're listing typedefs of classes, which adds a lot of duplicate data.
Comment 27 Arpad Borsos [:Swatinem] 2010-08-19 09:07:29 PDT
(In reply to comment #26)
> Can we get the list sorted by win size or win%?

"usingbool" means wins if PRBools are changed to PRPackedBools (or C++ bools) and is thus a better sorting column, especially for bool heavy Classes like nsHtml5StackNode. In extreme cases, using a bitfield may yield even better wins. What is the policy regarding such changes?

> It also looks like you're listing typedefs of classes, which adds a lot of
> duplicate data.

Yes, the script is far from being perfect :(
Comment 28 Benjamin Smedberg AWAY UNTIL 2-AUG-2016 [:bsmedberg] 2010-08-19 09:10:17 PDT
We're not going to switch to real bool until after FF4. Can we fix the easy wins I mentioned now?
Comment 29 (dormant account) 2010-08-19 09:45:21 PDT
Just FYI, Julian has a decent list of distributions of allocated objects from bug 551477. I'm gonna try cross-referencing that with this.
Comment 30 Arpad Borsos [:Swatinem] 2010-08-20 03:27:28 PDT
Created attachment 467717 [details] [diff] [review]
fix classes wasting memory

(In reply to comment #28)
> We're not going to switch to real bool until after FF4. Can we fix the easy
> wins I mentioned now?

mozilla::dom::Link does already use bool, and according to dehydra, they are 1 bit types instead of 1 byte, so there are possibly more wins to have. Will update my script accordingly.

The easy cases you mentioned are included in the patch, except for nsHTMLElement, which would require a rewrite of all the static initialization.
Comment 31 Arpad Borsos [:Swatinem] 2010-08-20 05:45:32 PDT
(In reply to comment #30)
> mozilla::dom::Link does already use bool, and according to dehydra, they are 1
> bit types instead of 1 byte, so there are possibly more wins to have. Will
> update my script accordingly.

Dehydra reports a precision of 1bit for C++ bools, but GCC does not automatically fold adjacent bools into bitfields, would have to be done manually. I updated the table to assume bitfield packing for bools, also I hope to have fixed the duplicate typedefs.
Comment 32 Benjamin Smedberg AWAY UNTIL 2-AUG-2016 [:bsmedberg] 2010-08-22 14:58:11 PDT
You cannot and should not fold bools into bitfields because you can take their address. In some cases it can also be a performance hit because it involves a full read-write.
Comment 33 Justin Dolske [:Dolske] 2012-04-15 18:04:34 PDT
Comment on attachment 467717 [details] [diff] [review]
fix classes wasting memory

I'm going through unassigned review -- and am going to take the liberty of r-ing this based on comment 32.

Please update patch and request review again if this is still an issue.
Comment 34 Arpad Borsos [:Swatinem] 2012-04-16 09:09:42 PDT
Last time I tried, like half a year ago, neither my original dehydra script worked with the latest gcc/dehydra, nor pahole (http://lwn.net/Articles/206805/).
If someone else wants to take over, feel free to do so.
Comment 35 Arpad Borsos [:Swatinem] 2014-02-19 02:39:40 PST
Created attachment 8378185 [details]
Python script to aggregate clang -Wpadded warnings

So apparently compilers have had `-Wpadded` warnings for some time. GCCs warnings are not that useful, those from clang are more so, they look like this:

(this is from `mach warnings-list`)
```
js/src/vm/Runtime.h:693:25 [-Wpadded] padding struct 'JSRuntime' with 7 bytes to align 'operationCallback'
js/src/vm/Runtime.h:764:12 [-Wpadded] padding struct 'JSRuntime' with 6 bytes to align 'numExclusiveThreads'
js/src/vm/Runtime.h:814:11 [-Wpadded] padding struct 'JSRuntime' with 4 bytes to align 'ownerThread_'
js/src/vm/Runtime.h:958:25 [-Wpadded] padding struct 'JSRuntime' with 3 bytes to align 'gcChunkSet'
js/src/vm/Runtime.h:983:25 [-Wpadded] padding struct 'JSRuntime' with 4 bytes to align 'gcMarker'
[…]
```

So I just wrote a small python script (attached) to sum up all those numbers, they then look like this:
```
144.0	JSRuntime	js/src/vm/Runtime.h
92.0	icu_52::DateFormatSymbols	intl/icu/source/i18n/unicode/dtfmtsym.h
58.625	mozilla::net::nsHttpHandler	netwerk/protocol/http/nsHttpHandler.h
52.0	webrtc::acm1::AudioCodingModuleImpl	media/webrtc/trunk/webrtc/modules/audio_coding/main/source/audio_coding_module_impl.h
48.0	mozilla::net::WebSocketChannel	netwerk/protocol/websocket/WebSocketChannel.h
47.0	fsmdef_dcb_t	media/webrtc/signaling/src/sipcc/core/gsm/h/fsm.h
42.0	nsFtpState	netwerk/protocol/ftp/nsFtpConnectionThread.h
42.0	mozilla::WebGLContext	content/canvas/src/WebGLContext.h
42.0	mozilla::net::BackgroundFileSaver	netwerk/base/src/BackgroundFileSaver.h
40.0	AffixMgr	extensions/spellcheck/hunspell/src/affixmgr.hxx
```

I have no idea how clang handles unions in this case though, so this might have some double counting and false positives. Haven’t done any research on that yet.
Comment 36 Nicholas Nethercote [:njn] 2014-02-19 14:44:02 PST
Arpad: nice! I suggest that you post the full list to dev-platform to get feedback on which of these objects are likely to matter. For example, I know that the padding in JSRuntime isn't important, because we have 1+N JSRuntime objects live at any one time, where N is the number of workers, which is typically 3. But I don't know about any of those others.

Note You need to log in before you can comment on or make changes to this bug.