Open Bug 1254777 Opened 4 years ago Updated 11 months ago

[meta] Reduce size of writable static data

Categories

(Core :: General, defect)

defect
Not set

Tracking

()

ASSIGNED
Tracking Status
firefox48 --- affected

People

(Reporter: njn, Assigned: njn)

References

(Depends on 2 open bugs, Blocks 1 open bug)

Details

(Keywords: meta, Whiteboard: [overhead:meta])

Attachments

(2 files, 6 obsolete files)

Writable static data (e.g. non-const static variables, vtables, etc.) can take up significant memory, and it is duplicated in every process. This is a meta-bug for improving this.

How I'm finding these (on Linux), thanks to glandium's help:

- `readelf -W -l libxul.so` gives the sections; from the LOAD RW section I can
see the relevant segments.

- `objdump -t -j .dynamic -j .got -j .got.plt -j .data -j .jcr -j .tm_clone_table -j .fini_array -j .init_array -j .kPStaticModules -j .data.rel.ro.local -j .data.rel.ro -j .bss libxul.so` gives all the writable data symbols. (The segments were obtained from the previous command.)

- `sort --key=5,6 -r -n` sorts the objdump output so the largest symbols are shown first.
Attached file Biggest 1000 symbols in libxul.so (obsolete) —
Attached file Biggest 1000 data symbols in libxul.so (obsolete) —
Here's the same data but with symbol names demangled (using objdump's -C flag).
Attachment #8728168 - Attachment is obsolete: true
Depends on: 1254780
This one actually is demangled.
Attachment #8728170 - Attachment is obsolete: true
Those ff_{sin,cos} symbols come from libav here:
https://dxr.mozilla.org/mozilla-central/rev/4657041c6f77b36b1fb9647c28f53f4f51757360/media/libav/libavcodec/fft_template.c#37
https://dxr.mozilla.org/mozilla-central/rev/4657041c6f77b36b1fb9647c28f53f4f51757360/media/libav/libavcodec/rdft.c#31

They are literally sine and cosine lookup tables, but they're readable because they're calculated at runtime. Upstream libav has host programs that can generate these tables at build-time, which makes them const:
https://github.com/libav/libav/blob/master/libavcodec/cos_tablegen.c

Integrating that into our build system is probably a PITA because we don't have a good way to wire up a HOST_PROGRAM to GENERATED_FILES, but that program is also pretty simple and we could almost certainly rewrite it in Python.
Is the first column size? Because it doesn't seem to be sorted correctly. 000000000028b850 is followed by 00000000050753d0, for instance. Sorry if I'm confused about something here.

I looked at sAttributes_specs. It is static const, and is marked as .data.rel.ro.local. According to a Stack Overflow answer I found [1], it sounds like this is because it contains function pointers, which can be relocated. Is this something that we can or should change?

[1] http://stackoverflow.com/questions/7029734/what-is-the-data-rel-ro-used-for
The fifth column is size; the first column is the address of the symbol within the library.
(In reply to Andrew McCreight [:mccr8] from comment #5)
> I looked at sAttributes_specs. It is static const, and is marked as
> .data.rel.ro.local. According to a Stack Overflow answer I found [1], it
> sounds like this is because it contains function pointers, which can be
> relocated. Is this something that we can or should change?

We have to hand off function pointers to the JS engine at some point.  Bug 786819 comment 4 lays out a scheme for having fewer function pointers, though.

There's a bug I filed on how to lay out some of the datstructures so we stop having so many nullptrs everwhere.  I can't find it right now, though. :(
Whiteboard: [MemShrink]
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #4)
> Those ff_{sin,cos} symbols come from libav here:

One complication: the latest attachment shows symbols from all the .so files in $OBJDIR/dist/bin/ on a Linux64 build. But some of them are dynamically loaded on demand. E.g. avcodec_options is in libmozavcodec.so which is dynamically loaded when we play a VP9 video.

ff_{sin,cos} are in liblgpllibs.so, I'm not sure how that one gets loaded.
Here's a description of the different segment types.

* .text: read-only code.

* .rodata: non-zero read-only data.

* .bss: initially-all-zero writable data.

* .data (no pointers): non-zero writable data. Often it's actually read-only in
practice, in which case |const| can be added, changing it to .rodata, which is
better than .data because it allows it to be shared between processes.

* .data (with pointers): like ".data (no pointers)", but relocatable. If |const|
is added it becomes .data.rel.ro, which is better than .data because (a) the
compiler might be able to optimize it, and (b) the data cannot be modified
accidentally/maliciously (if the platform supports this).

* .data.rel.ro, .data.rel.ro.local: read-only data that contains pointers and so
is relocatable.

There are a few other kinds, but they don't take up much space.

> -----------------------------------------------------------------------------
> Type            Uses disk  Uses address  Uses physical  Shared between
>                 space?     space?        memory?        processes?
> -----------------------------------------------------------------------------
> .text           Y          Y             if touched     Y
> .rodata         Y          Y             if touched     Y
> .bss            -          Y             if touched     -
> .data(no ptrs)  Y          Y             probably[1]    -
> .data(w/ptrs)   Y          Y             always[2]      -
> .data.rel.ro*   Y          Y             always         -
> -----------------------------------------------------------------------------
[1]: because it probably shares pages with ".data(w/ptrs)" symbols
[2]: because applying relocations requires touching the memory
Depends on: 1255239
Attached file Script to get symbols (obsolete) —
Here's the script I'm using to get these results. I've tweaked it to:

- include all symbols except for .text ones (.rodata ones are interesting, even if they can be shared)

- cut out the boring stuff at the start of each line

Just run it in a directory containing .so files, such as $OBJDIR/dist/bin/, and it'll put the results in .so.syms files, and a summary of everything in all.syms.
Linux only!
Updated version, obtained with the attached script.
Attachment #8728341 - Attachment is obsolete: true
> .data.rel.ro.local	0000000000012310              kSTSPreloadList

We ought to be able to significantly reduce the size of this table by applying the same techniques the effective tld service uses, which will also make it shareable across processes.  Will eliminate a number of relocations, too.
(In reply to Nicholas Nethercote [:njn] from comment #8)
> ff_{sin,cos} are in liblgpllibs.so, I'm not sure how that one gets loaded.

It's just linked directly to libxul:

$ LD_LIBRARY_PATH=`pwd` ldd libxul.so | grep liblgpl
	liblgpllibs.so => /home/luser/firefox/liblgpllibs.so (0x00007f1a7c41a000)

(built as a standalone lib because it contains LGPL code)
Depends on: 1255425
Depends on: 1255655
I filed an upstream bug report (https://ssl.icu-project.org/trac/ticket/12390) to shrink the silly gDigitCount lookup table in ICU.
A fair chunk of the static data is from DOM bindings:

- sAttributes_specs/sAttributes_ids/sAttributes (and similar ones for methods, constants, etc)

- InterfaceObjectClass and PrototypeClass (which incorporate js::Class)

- sNativeProperties (which is usually mostly nullptrs)

- There is lots of duplicated data across worker and non-worker code, e.g. |namespace URLBinding| and |namespace
  URLBinding_workers|

- sNativePropertyHooks

Bug 1011826 is about shrinking Prefable, which shrinks sAttributes/sMethods/sConstants/etc.
Depends on: 1011826
Depends on: 1258924
Depends on: 1259194
Depends on: 1259182
froydnj: as a follow-up to our conversation today I looked more closely at vtables. There are 12,276 of them. They take up 2,564,936 bytes! That's roughly half of our static data.

I looked for frequently-occurring patterns in the names, which suggest possibly
fruitful inheritance hierarchies. Here's a list...

- Foo:cycleCollection            770
- Ends with "Event"              607
- Ends with "Impl"               566
- Ends with "Accessible          365
- Ends with "Parent"             353
- Ends with "Child"              320
- nsRunnableMethodImpl<T>        285
- Ends with "Element"            278
- Ends with "Callback"           269
- PDocAccessible::{Msg,Reply}_*  259
- Ends with "Runnable"           245
- PContent::*                    233
- Ends with "Frame"              199
- Ends with "Listener"           174
- Ends with "Stream"             165
- MozPromise<T>                  152
- Ends with "Channel"            149
- Ends with "Request"            143
- Ends with "Handler"            141
- Runnable{Function,Method}      126
- Ends with "Effect"             102
- Foo::Compiler                  100
- Ends with "Command"             74
- Ends with "Layer"               70
- Ends with "Type"                63
- Ends with "Document"            53
- PointerClearer<T>               53
- Ends with "Controller"          50
(needinfo'ing to ensure you see comment 16)
Flags: needinfo?(nfroyd)
I also looked more carefully at the DOM bindings structs. bz, you'll be interested in this.

- DOMIfaceAndProtoJSClass sInterfaceObjectClass: 748 x 240 B = 179,520 B (.d.r.ro.l)
- DOMIfaceAndProtoJSClass sPrototypeClass:       794 x 240 B = 190,560 B (.d.r.ro.l)
- DOMJSClass sClass:                             773 x 264 B = 204,072 B (.d.r.ro.l)
  - These sizes are *after* bug 1259124's patches are applied, which saved
    over 200,000 B overall.
  - The same trick from that bug should be applicable a couple more times,
    saving that much again and more.

- s{Attributes,Methods,...}_specs: 414,888 (.data.rel.ro.local)
  - Depends on JSPropertySpec; shrinking that looks tricky

- s{Attributes,Methods,...}_ids: 81,152 (.bss)
  - Each one is only a word, so shrinking seems difficult

- s{Attributes,Methods,...}: 49,388 (.data)
  - Bug 1011826 shrank this by 2/3 already, hard to do more

- s{Attributes,Methods,...}_disablers: 11,720 (.data)
  - Bug 1011826 introduced these

- sNativeProperties: 739 x 176 B == 130,064 B total (.data.rel.ro.local)
  - An inefficient structure. Making it variable-length should reduce it to
    1/3 its current size or less, saving 90--100 KB.

- *_[gs]etterinfo: 16 x 6796 (in total) = 108,736 (.data.rel.ro.local)
  - Looks hard to shrink

- sNativePropertyHooks: 792 x 48 B = 38,016 B (.data.rel.ro.local)
  - Looks hard to shrink

That gives a total of 1,408,116 bytes, and that's after cutting about 300 KB in bug
1259124 (pending) and bug 1011826 (landed). The savings identified above should give
us another 300 KB or more. I'd love to hear any other ideas.
Flags: needinfo?(bzbarsky)
Looks like I missed all the 'sChrome' variants of the DOM bindings stuff, so some of the items above will be under-measured.
Depends on: 1259428
Depends on: 1259429
Depends on: 1259445
>- DOMIfaceAndProtoJSClass sInterfaceObjectClass: 748 x 240 B = 179,520 B (.d.r.ro.l)
>- DOMIfaceAndProtoJSClass sPrototypeClass:       794 x 240 B = 190,560 B (.d.r.ro.l)

What's particularly annoying is that the JSClass parts of these fall into one of a small number of buckets.  But we can't easily share those, because the JSClass* is what the JS engine is pointing to, so they need to be distinct pointers for cases when our non-JSClass bits differ.

>  - These sizes are *after* bug 1259124's patches are applied, which saved
>    over 200,000 B overall.

I assume you mean bug 1259194?  That sort of approach does seem likely to be fruitful if we can figure out which things are typically the same across all this stuff.

>- s{Attributes,Methods,...}_ids: 81,152 (.bss)
>  - Each one is only a word, so shrinking seems difficult

So... I _think_ this is only used for Xray code, and only used for finding "the index of the thing with the given jsid, so we can use that as an index into our JSPropertySpec/JSFunctionSpec array".

If we had a way to walk a JSPropertySpec[] or JSFunctionSpec[] and do fast-ish compares of a given jsid to the const char* name in there , I think we could nix the *_ids arrays completely.  We'd need to be a little careful about symbol IDs, but I think that should not be too bad.

One other thing we could do if it's worth it: the mPrototypeID in DOMIfaceAndProtoJSClass is the same as mPrototypeID in NativePropertyHooks as far as I can tell.  So we could nix it from DOMIfaceAndProtoJSClass and have an extra pointer-chase at the consumers, if they're not too hot (and I expect they're not).  We could also move mDepth into the NativePropertyHooks so we don't have two copies of it lying around in the common case (one in the interface object class, one in the prototype object class).  Again, not sure whether this helps in practice given how DOMIfaceAndProtoJSClass packs.
Flags: needinfo?(bzbarsky)
A non-exhaustive checking of these:

(In reply to Nicholas Nethercote [:njn] from comment #16)
> - Foo:cycleCollection            770

These are likely pretty tight; I think glandium spent some time optimizing these.

> - Ends with "Event"              607

These don't seem to fit nicely into one category: there's mozilla::dom::Event subclasses, which I'm not sure we can do a whole lot about, things that are sort of like nsIRunnable, but more specialized, and WebIDL events (?), which are also similarly intractable.

> - Ends with "Impl"               566

A fair number of these seem to be *JSImpl things from the WebIDL bindings, which are not easily minimized.  Some of these also hail from webrtc:: as well.

> - Ends with "Accessible          365

I do not get nearly that many:

froydnj@thor:/opt/build/froydnj/build-debug$ readelf -sW toolkit/library/libxul.so |c++filt|egrep -c 'vtable.*Accessible$'
93

Am I doing something wrong?

> - Ends with "Parent"             353
> - Ends with "Child"              320

I mentioned in our conversion that IPDL/IPC code sets up its class hierarchies a little oddly; we could probably eliminate some of these, or at least trim them down, but it would take some work.

> - nsRunnableMethodImpl<T>        285

Not much to be done here, I think.

> - Ends with "Element"            278
> - Ends with "Callback"           269
> - PDocAccessible::{Msg,Reply}_*  259

I filed bug 1259428 for this and similar things.  I think we can get a nice win here.

> - Ends with "Channel"            149

I'm not seeing this many, and the ones that I do appear to be mostly nsI* classes.  We don't care much about those vtables, as --gc-sections (not turned on for local builds, AIUI) will take care of those.

> - Ends with "Request"            143
> - Ends with "Handler"            141
> - Runnable{Function,Method}      126

Not a lot we can do here.  It would be nice if the C++ side of things didn't need full nsIRunnable objects, but could instead get by with objects having just a Run() method (I know the graphics folks mentioned this when I asked why they implemented a separate thread pool instead of reusing nsThreadPool...), but I think that's intractable.
Flags: needinfo?(nfroyd)
> and WebIDL events (?), which are also similarly intractable.

Wait.  Shouldn't WebIDL events just have the same vtable a dom::Event?  I guess not in the cases when they need to do their own cycle collection...  I just checked and for Web IDL events we have 32 that have something to cycle collect and 42 that don't have anything to cycle collect, if I'm measuring right.  Also, I guess virtual destructors could cause problems here.  :(
> > - Ends with "Accessible          365
> 
> I do not get nearly that many:
> 
> froydnj@thor:/opt/build/froydnj/build-debug$ readelf -sW
> toolkit/library/libxul.so |c++filt|egrep -c 'vtable.*Accessible$'
> 93
> 
> Am I doing something wrong?

For all my searches I did "Foo\>" rather than "Foo$". For a lot of the Accessible ones I think there was a template parameter like "Accessible<T>".
(In reply to Nathan Froyd [:froydnj] from comment #21)
> A non-exhaustive checking of these:
> 
> (In reply to Nicholas Nethercote [:njn] from comment #16)
> > - Foo:cycleCollection            770
> 
> These are likely pretty tight; I think glandium spent some time optimizing
> these.

I optimized the macros. I didn't look from the angle of static data usage.
Depends on: 1260653
> - sNativeProperties: 739 x 176 B == 130,064 B total (.data.rel.ro.local)
>   - An inefficient structure. Making it variable-length should reduce it to
>     1/3 its current size or less, saving 90--100 KB.

I filed bug 1260653 for this. The draft patch saves 110 KB.
Depends on: 1260984
Depends on: 1261129
Depends on: 1261720
Depends on: 1261723
Depends on: 1261726
> >- s{Attributes,Methods,...}_ids: 81,152 (.bss)
> >  - Each one is only a word, so shrinking seems difficult
> 
> So... I _think_ this is only used for Xray code, and only used for finding
> "the index of the thing with the given jsid, so we can use that as an index
> into our JSPropertySpec/JSFunctionSpec array".
> 
> If we had a way to walk a JSPropertySpec[] or JSFunctionSpec[] and do
> fast-ish compares of a given jsid to the const char* name in there , I think
> we could nix the *_ids arrays completely.  We'd need to be a little careful
> about symbol IDs, but I think that should not be too bad.

I filed bug 1261726 for this one, with extra information from our IRC discussion. I doubt I will get to this one any time soon, bz, so please take it if you want it.
Depends on: 1262826
Another one worth noting, even though it's a read-only table and thus shared...

In media/libjpeg/, jpeg_nbits_table[] is 64 KiB, and it is just a table implementation of the CLZ intrinsic. The comment USE_CLZ_INTRINSIC documents this, and why the intrinsic is only used on ARM by default (some x86 implementations are slow?). This option is GCC/clang-only, there's nothing supporting use of intrinsics for MSVC.
Another read-only one:

There's an option SMALL_TABLE which reduces the prime_tab table from 51.1 KiB to 1 KiB. I don't know how hot this table is and thus whether it would hurt performance. Also, it's in freebl, which is in NSS, so setting that constant may be difficult.
Depends on: 1269490
I think `codesighs` used to report metrics like this, but it rotted due to toolchain changes over time. Would it be useful to generate this data as part of the build, and report the sum to perfherder so we could graph it over time?
(In reply to Ted Mielczarek [:ted.mielczarek] (Vacation May 4th-11th) from comment #29)
> I think `codesighs` used to report metrics like this, but it rotted due to
> toolchain changes over time. Would it be useful to generate this data as
> part of the build, and report the sum to perfherder so we could graph it
> over time?

Yes!  How difficult is that to do?
Just running |size| on libxul.so on Linux and parsing the output should be pretty easy.
Attached file Script to get symbols (obsolete) —
Here's a slightly tweaked version of the script.
Attachment #8728737 - Attachment is obsolete: true
Depends on: 1292039
Attachment #8921707 - Attachment is obsolete: true
Attachment #8921707 - Flags: feedback?(nfroyd)
Attached file Script to get symbols
This version of the script has better comments about how to use it.
Attachment #8769526 - Attachment is obsolete: true
It may be worth revisiting whether VS2017 has fixed the "const data goes in a writable section" bug and/or whether new instances of it have popped up.
Depends on: 1334254
Attachment #8956657 - Attachment mime type: application/x-shellscript → text/plain
Depends on: 1463296
Depends on: 1464501
Summary: Reduce size of writable static data → [meta] Reduce size of writable static data
Whiteboard: [overhead:meta]
You need to log in before you can comment on or make changes to this bug.