Created attachment 376545 [details]
I have been hacking up a dehydra script that is calculating a classes "real" sizeof and comparing it to the final sizeof to determine if that class is wasting memory due to alignment.
The script works quite well as far as I can tell.
Created attachment 376546 [details]
List of badly aligned classes (ods)
This is a filtered list of badly aligned classes. I will skim through it and try to fix some of the classes.
This awesome stuff. Who would've thought that getting this number is as simple as subtracting sizes of members from the struct size :) Whenever I thought about this issue I forgot that dehydra provides size_of.
* Size of a pointer, assume as 8, TODO: how to get the *real* pointer size?
var pointerSize = 8;
I presume this is 64bit
size.real = pointerSize;
I think you can do size.real = type.precision/8
For lazy bums like me, csv files are easier to open.
As I said before, awesome!
David, I believe you wanted to see some static analysis results on accessibility, this spreadsheet looks like a goldmine.
> size.real = pointerSize;
> I think you can do size.real = type.precision/8
Yes indeed. I use pointerSize mainly for adding the size of the vtable pointer to classes.
Will see what else I can do there.
I wonder how much freedom I have in reordering members. I suppose I can't change anything in NSPR/NSS/frozen linkage due to ABI compatibility. Other classes should be fine though?
A quick look over some classes also revealed more optimization opportunities by changing away from PRBool (4byte) to PRPackedBool (1byte) or even straight to bool, as that is now used in some places in the codebase already.
> I wonder how much freedom I have in reordering members. I suppose I can't
> change anything in NSPR/NSS/frozen linkage due to ABI compatibility. Other
> classes should be fine though?
> A quick look over some classes also revealed more optimization opportunities by
> changing away from PRBool (4byte) to PRPackedBool (1byte) or even straight to
> bool, as that is now used in some places in the codebase already.
PRPackedBool is probably easier since you don't have to add casts everywhere to switch to bool. I'm also not sure if all modules allow bool usage.
While you are at it could you do a check similar to the one mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=376434
Basically scan for structs that have both signed and unsigned bitfields and warn because MSVC packs them wrong.
(In reply to comment #2)
> David, I believe you wanted to see some static analysis results on
> accessibility, this spreadsheet looks like a goldmine.
Wow, crazy! Great work. What happens next?
(In reply to comment #5)
> (In reply to comment #2)
> > David, I believe you wanted to see some static analysis results on
> > accessibility, this spreadsheet looks like a goldmine.
> Wow, crazy! Great work. What happens next?
Would be nice if you could go through accessibility ones to verify the results are correct and maybe pick a couple that are frequently allocated as candidates to fixing.
Arpad: excellent work -- any clues from your tool how JSRuntime loses 92 bytes to internal fragmentation?
(In reply to comment #7)
> Arpad: excellent work -- any clues from your tool how JSRuntime loses 92 bytes
> to internal fragmentation?
My tool currently doesn't print a list of members, but that would be easy to add.
Looking at the code, I can see some possible problems:
> uint32 gcMallocBytes;
> JSGCArenaInfo *gcUntracedArenaStackTop;
> JSBool rngInitialized;
> int64 rngMultiplier;
> uint32 requestCount;
> JSThread *gcThread;
And quite a few more... I'm on x86_64, and every time a 4byte member is followed by a 8byte member (or pointer) its wasting 4 bytes.
Started a thread on dev.platform: http://groups.google.com/group/mozilla.dev.platform/browse_thread/thread/52bc09157428dadb#
Created attachment 376620 [details]
dehydra script, v2
On 32-bit x86, misaligned doubles are allowed but slow down the program considerably. It might be useful to run this script with pointerSize=4 to find those as well.
Created attachment 376682 [details]
list of badly aligned classes (csv)
This list includes a possible size if using C++ bools (which are 1byte as are PRPackedBools) and a list of direct members with their sizes (does not include inherited members from parent classes).
Created attachment 376683 [details]
dehydra script, v3
Created attachment 376728 [details]
list of badly aligned classes (csv)
The script had some issues with inheritance and vtables, probably still has, I'm not quite sure.
You might find this document helpful, if you haven't seen it already:
It lays out things like the rules for where the vtable pointer appears and under what circumstances a derived class's members can be packed into the base class's tail padding. (Ignore mention of Itanium; GCC uses this ABI on all architectures nowadays, modulo things like pointer size and minimum required alignment.)
Created attachment 376742 [details]
Still has problems with inheritance and vtable pointers though :(
Quick comments on looking at the list:
1) A number of these are services or one-per-document objects. Savings from
those will be small no matter what. It might not be worth making the
changes if it decreases readability by separating related members.
2) You might want to get an NSS person to look at those SEC* things near the
top of the list (as well as the other ones on it).
3) A bunch of per-element accessibility stuff here that could really use
fixing, if the numbers are right.
4) The frame and element classes in that list should be a high priority.
5) I'm not sure where some of these differences are coming from. For example,
The one for nsHTMLFormElement seems way too big; I would have expected
about 2 bytes of difference there. The class does have 5 vtable pointers
in addition to the ones inherited from the superclass; is that where the
other 40 bytes come from?
Minor output nit: can you include a percentage, or better yet just stick the results in a google spreadsheet and link to them?
Tying these results to trace-refcnt output from a Tp run would also be interesting, as far as looking at the maximum number of bytes saved during runtime.
This time I tried a different approach. Instead of taking the compiled sizeof() and messing around with vtable pointers, I rather look at the members and see how many GAPs are between them. I then calculate how much one can win by reordering the members or using 1byte bools.
This time, there aren't that many wins to be made. More wins could be made by cutting down the inheritance tree, consolidating Interfaces and remove unnecessary ones.
Here is the spreadsheet: http://spreadsheets.google.com/ccc?key=rVIrgklQt9Y2H1hnS87L1HA
The desire for a cheap, always-on memory profiler that puts allocations in coarse, non-hierarchical buckets (JS, DOM, HTML, XUL, CSS, ...) came up again, Ben Galbraith is leading the charge.
This goal reminded me of SGI's Power Bloat Vision(tm) tool, a graphical line-chart or (default UI) bar with color-coded segments jumping around as different kernel identified resource users of memory changed their shares. PBV really helped users zero in on bloat in their apps, as well as in Motif/X, the SGI graphical desktop, and Irix ;-).
Thinking about how to do it super-cheaply, I thought of this bug. We don't want to add malloc interception by default, but we use jemalloc on Windows and Linux, and may on Mac. We can hack jemalloc. Problem is we don't want to change malloc, new, etc. signatures to carry "bucket ids", nor do we wish to walk the stack on every allocation, even by one frame.
But suppose each bucket (subsystem) happened to allocate unique sizes? Then we could statically analyze the sizes and produce a mapping to burn into the build.
Ok, I'm sure different buckets have their own structs or classes that happen to have the same size. Still, we could find such types with static analysis, and even generate a patch to farble the size of one or the other to differ.
If this wastes space, we could do something trickier, like farble the size passed to malloc in a way that did not waste space (e.g., subtract from 0 to 3 to help distinguish 4 buckets' uses of the same size).
What say you static analysis folks? We need your northern-style kung fu!
I will file a new bug if needed, I didn't mean to hijack this one -- sorry. Want to get a quick sanity check, figured this bug could use some action ;-).
Created attachment 466989 [details]
Created attachment 466990 [details]
Vava prog for parsing output into cvs
I have been running this again after a long time, there are quite a few offenders left, see the bug url field.
Assuming we fixed everything and every structure was tightly packed, what's the total net savings? (Both absolute and %)
(In reply to comment #24)
> Assuming we fixed everything and every structure was tightly packed, what's the
> total net savings? (Both absolute and %)
Savings would require instrumenting the build to figure out how often these datastructures are used.
Can we get the list sorted by win size or win%? I suspect that most of these classes are very infrequently used, but a few seem like they could be big wins. Some suggestions, if you want to do tryserver runs on a limited set:
nsHtml5StackNode (not actually stack-allocated, despite its name)
mozilla::dom::Link (there can be thousands of these per page, might improve Tp times)
For all of the IPDL-generated classes (P*), you should probably just file a bug on the code generator to emit the members in a better order.
It also looks like you're listing typedefs of classes, which adds a lot of duplicate data.
(In reply to comment #26)
> Can we get the list sorted by win size or win%?
"usingbool" means wins if PRBools are changed to PRPackedBools (or C++ bools) and is thus a better sorting column, especially for bool heavy Classes like nsHtml5StackNode. In extreme cases, using a bitfield may yield even better wins. What is the policy regarding such changes?
> It also looks like you're listing typedefs of classes, which adds a lot of
> duplicate data.
Yes, the script is far from being perfect :(
We're not going to switch to real bool until after FF4. Can we fix the easy wins I mentioned now?
Just FYI, Julian has a decent list of distributions of allocated objects from bug 551477. I'm gonna try cross-referencing that with this.
Created attachment 467717 [details] [diff] [review]
fix classes wasting memory
(In reply to comment #28)
> We're not going to switch to real bool until after FF4. Can we fix the easy
> wins I mentioned now?
mozilla::dom::Link does already use bool, and according to dehydra, they are 1 bit types instead of 1 byte, so there are possibly more wins to have. Will update my script accordingly.
The easy cases you mentioned are included in the patch, except for nsHTMLElement, which would require a rewrite of all the static initialization.
(In reply to comment #30)
> mozilla::dom::Link does already use bool, and according to dehydra, they are 1
> bit types instead of 1 byte, so there are possibly more wins to have. Will
> update my script accordingly.
Dehydra reports a precision of 1bit for C++ bools, but GCC does not automatically fold adjacent bools into bitfields, would have to be done manually. I updated the table to assume bitfield packing for bools, also I hope to have fixed the duplicate typedefs.
You cannot and should not fold bools into bitfields because you can take their address. In some cases it can also be a performance hit because it involves a full read-write.
Comment on attachment 467717 [details] [diff] [review]
fix classes wasting memory
I'm going through unassigned review -- and am going to take the liberty of r-ing this based on comment 32.
Please update patch and request review again if this is still an issue.
Last time I tried, like half a year ago, neither my original dehydra script worked with the latest gcc/dehydra, nor pahole (http://lwn.net/Articles/206805/).
If someone else wants to take over, feel free to do so.
Created attachment 8378185 [details]
Python script to aggregate clang -Wpadded warnings
So apparently compilers have had `-Wpadded` warnings for some time. GCCs warnings are not that useful, those from clang are more so, they look like this:
(this is from `mach warnings-list`)
js/src/vm/Runtime.h:693:25 [-Wpadded] padding struct 'JSRuntime' with 7 bytes to align 'operationCallback'
js/src/vm/Runtime.h:764:12 [-Wpadded] padding struct 'JSRuntime' with 6 bytes to align 'numExclusiveThreads'
js/src/vm/Runtime.h:814:11 [-Wpadded] padding struct 'JSRuntime' with 4 bytes to align 'ownerThread_'
js/src/vm/Runtime.h:958:25 [-Wpadded] padding struct 'JSRuntime' with 3 bytes to align 'gcChunkSet'
js/src/vm/Runtime.h:983:25 [-Wpadded] padding struct 'JSRuntime' with 4 bytes to align 'gcMarker'
So I just wrote a small python script (attached) to sum up all those numbers, they then look like this:
144.0 JSRuntime js/src/vm/Runtime.h
92.0 icu_52::DateFormatSymbols intl/icu/source/i18n/unicode/dtfmtsym.h
58.625 mozilla::net::nsHttpHandler netwerk/protocol/http/nsHttpHandler.h
52.0 webrtc::acm1::AudioCodingModuleImpl media/webrtc/trunk/webrtc/modules/audio_coding/main/source/audio_coding_module_impl.h
48.0 mozilla::net::WebSocketChannel netwerk/protocol/websocket/WebSocketChannel.h
47.0 fsmdef_dcb_t media/webrtc/signaling/src/sipcc/core/gsm/h/fsm.h
42.0 nsFtpState netwerk/protocol/ftp/nsFtpConnectionThread.h
42.0 mozilla::WebGLContext content/canvas/src/WebGLContext.h
42.0 mozilla::net::BackgroundFileSaver netwerk/base/src/BackgroundFileSaver.h
40.0 AffixMgr extensions/spellcheck/hunspell/src/affixmgr.hxx
I have no idea how clang handles unions in this case though, so this might have some double counting and false positives. Haven’t done any research on that yet.
Arpad: nice! I suggest that you post the full list to dev-platform to get feedback on which of these objects are likely to matter. For example, I know that the padding in JSRuntime isn't important, because we have 1+N JSRuntime objects live at any one time, where N is the number of workers, which is typically 3. But I don't know about any of those others.