Closed Bug 492185 Opened 15 years ago Closed 4 years ago

Find classes that are wasting memory due to alignment

Categories

(Developer Infrastructure :: Source Code Analysis, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: Swatinem, Unassigned)

References

Details

(Keywords: memory-footprint, Whiteboard: [MemShrink:P3])

Attachments

(1 file, 10 obsolete files)

Attached file dehydra script (obsolete) —
I have been hacking up a dehydra script that is calculating a classes "real" sizeof and comparing it to the final sizeof to determine if that class is wasting memory due to alignment.
The script works quite well as far as I can tell.
Attached file List of badly aligned classes (ods) (obsolete) —
This is a filtered list of badly aligned classes. I will skim through it and try to fix some of the classes.
This awesome stuff. Who would've thought that getting this number is as simple as subtracting sizes of members from the struct size :) Whenever I thought about this issue I forgot that dehydra provides size_of.

/**
 * Size of a pointer, assume as 8, TODO: how to get the *real* pointer size?
 */
var pointerSize = 8;

I presume this is 64bit

	if(type.isPointer)
	{
		size.real = pointerSize;
I think you can do size.real = type.precision/8

For lazy bums like me, csv files are easier to open.

As I said before, awesome!

David, I believe you wanted to see some static analysis results on accessibility, this spreadsheet looks like a goldmine.
>         size.real = pointerSize;
> I think you can do size.real = type.precision/8

Yes indeed. I use pointerSize mainly for adding the size of the vtable pointer to classes.
Will see what else I can do there.

I wonder how much freedom I have in reordering members. I suppose I can't change anything in NSPR/NSS/frozen linkage due to ABI compatibility. Other classes should be fine though?

A quick look over some classes also revealed more optimization opportunities by changing away from PRBool (4byte) to PRPackedBool (1byte) or even straight to bool, as that is now used in some places in the codebase already.
> I wonder how much freedom I have in reordering members. I suppose I can't
> change anything in NSPR/NSS/frozen linkage due to ABI compatibility. Other
> classes should be fine though?

Think so.

> 
> A quick look over some classes also revealed more optimization opportunities by
> changing away from PRBool (4byte) to PRPackedBool (1byte) or even straight to
> bool, as that is now used in some places in the codebase already.

PRPackedBool is probably easier since you don't have to add casts everywhere to switch to bool. I'm also not sure if all modules allow bool usage.

While you are at it could you do a check similar to the one mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=376434

Basically scan for structs that have both signed and unsigned bitfields and warn because MSVC packs them wrong.
(In reply to comment #2)
> David, I believe you wanted to see some static analysis results on
> accessibility, this spreadsheet looks like a goldmine.

Wow, crazy! Great work. What happens next?
(In reply to comment #5)
> (In reply to comment #2)
> > David, I believe you wanted to see some static analysis results on
> > accessibility, this spreadsheet looks like a goldmine.
> 
> Wow, crazy! Great work. What happens next?

Would be nice if you could go through accessibility ones to verify the results are correct and maybe pick a couple that are frequently allocated as candidates to fixing.
Arpad: excellent work -- any clues from your tool how JSRuntime loses 92 bytes to internal fragmentation?

/be
(In reply to comment #7)
> Arpad: excellent work -- any clues from your tool how JSRuntime loses 92 bytes
> to internal fragmentation?
> 
> /be


My tool currently doesn't print a list of members, but that would be easy to add.
Looking at the code, I can see some possible problems:

>    uint32              gcMallocBytes;
>    JSGCArenaInfo       *gcUntracedArenaStackTop;

>    JSBool              rngInitialized;
>    int64               rngMultiplier;

>    uint32              requestCount;
>    JSThread            *gcThread;

And quite a few more... I'm on x86_64, and every time a 4byte member is followed by a 8byte member (or pointer) its wasting 4 bytes.
Depends on: 492257
Attached file dehydra script, v2 (obsolete) —
Attachment #376545 - Attachment is obsolete: true
On 32-bit x86, misaligned doubles are allowed but slow down the program considerably.  It might be useful to run this script with pointerSize=4 to find those as well.
Attached file list of badly aligned classes (csv) (obsolete) —
This list includes a possible size if using C++ bools (which are 1byte as are PRPackedBools) and a list of direct members with their sizes (does not include inherited members from parent classes).
Attachment #376546 - Attachment is obsolete: true
Attached file dehydra script, v3 (obsolete) —
Attachment #376620 - Attachment is obsolete: true
Attached file list of badly aligned classes (csv) (obsolete) —
The script had some issues with inheritance and vtables, probably still has, I'm not quite sure.
Attachment #376682 - Attachment is obsolete: true
Attachment #376728 - Attachment mime type: application/octet-stream → text/plain
Blocks: 492376
You might find this document helpful, if you haven't seen it already:

http://www.codesourcery.com/public/cxx-abi/abi.html#layout

It lays out things like the rules for where the vtable pointer appears and under what circumstances a derived class's members can be packed into the base class's tail padding.  (Ignore mention of Itanium; GCC uses this ABI on all architectures nowadays, modulo things like pointer size and minimum required alignment.)
Attached file dehydra script (obsolete) —
Still has problems with inheritance and vtable pointers though :(
Attachment #376683 - Attachment is obsolete: true
Quick comments on looking at the list:

1)  A number of these are services or one-per-document objects.  Savings from
    those will be small no matter what.  It might not be worth making the
    changes if it decreases readability by separating related members.
2)  You might want to get an NSS person to look at those SEC* things near the
    top of the list (as well as the other ones on it).
3)  A bunch of per-element accessibility stuff here that could really use
    fixing, if the numbers are right.
4)  The frame and element classes in that list should be a high priority.
5)  I'm not sure where some of these differences are coming from.  For example,
    The one for nsHTMLFormElement seems way too big; I would have expected
    about 2 bytes of difference there.  The class does have 5 vtable pointers
    in addition to the ones inherited from the superclass; is that where the
    other 40 bytes come from?
Blocks: 492400
Blocks: 492510
Blocks: 492520
Blocks: 492531
Minor output nit:  can you include a percentage, or better yet just stick the results in a google spreadsheet and link to them?

Tying these results to trace-refcnt output from a Tp run would also be interesting, as far as looking at the maximum number of bytes saved during runtime.
This time I tried a different approach. Instead of taking the compiled sizeof() and messing around with vtable pointers, I rather look at the members and see how many GAPs are between them. I then calculate how much one can win by reordering the members or using 1byte bools.

This time, there aren't that many wins to be made. More wins could be made by cutting down the inheritance tree, consolidating Interfaces and remove unnecessary ones.

Here is the spreadsheet: http://spreadsheets.google.com/ccc?key=rVIrgklQt9Y2H1hnS87L1HA
Attachment #376728 - Attachment is obsolete: true
The desire for a cheap, always-on memory profiler that puts allocations in coarse, non-hierarchical buckets (JS, DOM, HTML, XUL, CSS, ...) came up again, Ben Galbraith is leading the charge.

This goal reminded me of SGI's Power Bloat Vision(tm) tool, a graphical line-chart or (default UI) bar with color-coded segments jumping around as different kernel identified resource users of memory changed their shares. PBV really helped users zero in on bloat in their apps, as well as in Motif/X, the SGI graphical desktop, and Irix ;-).

Thinking about how to do it super-cheaply, I thought of this bug. We don't want to add malloc interception by default, but we use jemalloc on Windows and Linux, and may on Mac. We can hack jemalloc. Problem is we don't want to change malloc, new, etc. signatures to carry "bucket ids", nor do we wish to walk the stack on every allocation, even by one frame.

But suppose each bucket (subsystem) happened to allocate unique sizes? Then we could statically analyze the sizes and produce a mapping to burn into the build.

Ok, I'm sure different buckets have their own structs or classes that happen to have the same size. Still, we could find such types with static analysis, and even generate a patch to farble the size of one or the other to differ.

If this wastes space, we could do something trickier, like farble the size passed to malloc in a way that did not waste space (e.g., subtract from 0 to 3 to help distinguish 4 buckets' uses of the same size).

What say you static analysis folks? We need your northern-style kung fu!

/be
I will file a new bug if needed, I didn't mean to hijack this one -- sorry. Want to get a quick sanity check, figured this bug could use some action ;-).

/be
Depends on: 509380
Attached file dehydra script (obsolete) —
Attachment #376742 - Attachment is obsolete: true
Attached file Vava prog for parsing output into cvs (obsolete) —
I have been running this again after a long time, there are quite a few offenders left, see the bug url field.
Assuming we fixed everything and every structure was tightly packed, what's the total net savings?  (Both absolute and %)
(In reply to comment #24)
> Assuming we fixed everything and every structure was tightly packed, what's the
> total net savings?  (Both absolute and %)

Savings would require instrumenting the build to figure out how often these datastructures are used.
Can we get the list sorted by win size or win%? I suspect that most of these classes are very infrequently used, but a few seem like they could be big wins. Some suggestions, if you want to do tryserver runs on a limited set:

nsHtml5StackNode (not actually stack-allocated, despite its name)
nsStyleVisibility
mozilla::dom::Link (there can be thousands of these per page, might improve Tp times)
nsHtml5TreeOperation
nsHtml5ElementName
nsHTMLElement (!)

For all of the IPDL-generated classes (P*), you should probably just file a bug on the code generator to emit the members in a better order.

It also looks like you're listing typedefs of classes, which adds a lot of duplicate data.
(In reply to comment #26)
> Can we get the list sorted by win size or win%?

"usingbool" means wins if PRBools are changed to PRPackedBools (or C++ bools) and is thus a better sorting column, especially for bool heavy Classes like nsHtml5StackNode. In extreme cases, using a bitfield may yield even better wins. What is the policy regarding such changes?

> It also looks like you're listing typedefs of classes, which adds a lot of
> duplicate data.

Yes, the script is far from being perfect :(
We're not going to switch to real bool until after FF4. Can we fix the easy wins I mentioned now?
Just FYI, Julian has a decent list of distributions of allocated objects from bug 551477. I'm gonna try cross-referencing that with this.
Attached patch fix classes wasting memory (obsolete) — Splinter Review
(In reply to comment #28)
> We're not going to switch to real bool until after FF4. Can we fix the easy
> wins I mentioned now?

mozilla::dom::Link does already use bool, and according to dehydra, they are 1 bit types instead of 1 byte, so there are possibly more wins to have. Will update my script accordingly.

The easy cases you mentioned are included in the patch, except for nsHTMLElement, which would require a rewrite of all the static initialization.
Attachment #467717 - Flags: review?
(In reply to comment #30)
> mozilla::dom::Link does already use bool, and according to dehydra, they are 1
> bit types instead of 1 byte, so there are possibly more wins to have. Will
> update my script accordingly.

Dehydra reports a precision of 1bit for C++ bools, but GCC does not automatically fold adjacent bools into bitfields, would have to be done manually. I updated the table to assume bitfield packing for bools, also I hope to have fixed the duplicate typedefs.
You cannot and should not fold bools into bitfields because you can take their address. In some cases it can also be a performance hit because it involves a full read-write.
Blocks: 583074
Whiteboard: [MemShrink:P3]
OS: Linux → All
Hardware: x86_64 → All
Comment on attachment 467717 [details] [diff] [review]
fix classes wasting memory

I'm going through unassigned review -- and am going to take the liberty of r-ing this based on comment 32.

Please update patch and request review again if this is still an issue.
Attachment #467717 - Flags: review? → review-
Depends on: 745758
Last time I tried, like half a year ago, neither my original dehydra script worked with the latest gcc/dehydra, nor pahole (http://lwn.net/Articles/206805/).
If someone else wants to take over, feel free to do so.
Depends on: 747516
So apparently compilers have had `-Wpadded` warnings for some time. GCCs warnings are not that useful, those from clang are more so, they look like this:

(this is from `mach warnings-list`)
```
js/src/vm/Runtime.h:693:25 [-Wpadded] padding struct 'JSRuntime' with 7 bytes to align 'operationCallback'
js/src/vm/Runtime.h:764:12 [-Wpadded] padding struct 'JSRuntime' with 6 bytes to align 'numExclusiveThreads'
js/src/vm/Runtime.h:814:11 [-Wpadded] padding struct 'JSRuntime' with 4 bytes to align 'ownerThread_'
js/src/vm/Runtime.h:958:25 [-Wpadded] padding struct 'JSRuntime' with 3 bytes to align 'gcChunkSet'
js/src/vm/Runtime.h:983:25 [-Wpadded] padding struct 'JSRuntime' with 4 bytes to align 'gcMarker'
[…]
```

So I just wrote a small python script (attached) to sum up all those numbers, they then look like this:
```
144.0	JSRuntime	js/src/vm/Runtime.h
92.0	icu_52::DateFormatSymbols	intl/icu/source/i18n/unicode/dtfmtsym.h
58.625	mozilla::net::nsHttpHandler	netwerk/protocol/http/nsHttpHandler.h
52.0	webrtc::acm1::AudioCodingModuleImpl	media/webrtc/trunk/webrtc/modules/audio_coding/main/source/audio_coding_module_impl.h
48.0	mozilla::net::WebSocketChannel	netwerk/protocol/websocket/WebSocketChannel.h
47.0	fsmdef_dcb_t	media/webrtc/signaling/src/sipcc/core/gsm/h/fsm.h
42.0	nsFtpState	netwerk/protocol/ftp/nsFtpConnectionThread.h
42.0	mozilla::WebGLContext	content/canvas/src/WebGLContext.h
42.0	mozilla::net::BackgroundFileSaver	netwerk/base/src/BackgroundFileSaver.h
40.0	AffixMgr	extensions/spellcheck/hunspell/src/affixmgr.hxx
```

I have no idea how clang handles unions in this case though, so this might have some double counting and false positives. Haven’t done any research on that yet.
Attachment #466989 - Attachment is obsolete: true
Attachment #466990 - Attachment is obsolete: true
Arpad: nice! I suggest that you post the full list to dev-platform to get feedback on which of these objects are likely to matter. For example, I know that the padding in JSRuntime isn't important, because we have 1+N JSRuntime objects live at any one time, where N is the number of workers, which is typically 3. But I don't know about any of those others.
Attachment #467717 - Attachment is obsolete: true
Product: Core → Firefox Build System

We already have this in place with one of our static-analysis tools.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WORKSFORME
Product: Firefox Build System → Developer Infrastructure
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: