Bug 912735 (hub-headers)

Fix the "hub" headers that include and are included by many other headers

RESOLVED DUPLICATE of bug 785103

Status


Core
General

People

(Reporter: bjacob, Unassigned)

Tracking

(Depends on: 3 bugs, Blocks: 1 bug)

Trunk



Attachments

(1 attachment)

75.63 KB, application/vnd.oasis.opendocument.spreadsheet
Created attachment 799792 [details]
headers spreadsheet

It's not easy to tell automatically where to start fixing the headers (to get faster builds).

IWYU is great at one end of the spectrum of header problems. At the other end of the spectrum, we have some big "hub" headers that include many other headers, are included by many other headers, and contain a lot of code themselves. These headers need a suitable combination of
 - getting simplified themselves, so they're cheaper to include
 - being included less often, by replacing includes with forward declarations, or by moving constants/enums into separate headers.

These big hub headers are not easy to fix automatically, but with some manual work (refactoring), progress can be made. An example was bug 912042 which improved the situation around GLContext.h.

To discover them, a naive approach that may be sufficient as a quick heuristic is to just grep to discover the direct include relationships. This script prints for each header: its path, the number of other headers including it, the number of other headers that it includes, and its number of lines of code (before preprocessing).

  $ for h in `find . -type f -name '*.h'`; do
      base=$(echo $h | sed 's|.*/||g')
      included_by=$(find . -type f -name '*.h' \
        | xargs egrep "#include.*[^a-zA-Z0-9_]$base[^a-zA-Z0-9_]" | wc -l)
      includes=$(grep '#include' $h | wc -l)
      lines=$(wc -l < $h)
      echo "$h,$included_by,$includes,$lines"
    done | tee headers.csv

Attaching the result, post-processed a bit in LibreOffice to sort it by "interestingness" (the product of the number of headers it includes and the number of headers including it).
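The sorting step can also be done without a spreadsheet. A minimal sketch, assuming the CSV column order produced by the script above (the paths and counts below are made-up sample data, not real measurements):

```python
import csv
import io

# Hypothetical sample rows in the script's CSV format:
# path, headers including it, headers it includes, lines of code
sample = """\
./gfx/gl/GLContext.h,120,45,2000
./xpcom/glue/nsCOMPtr.h,900,5,1500
./mfbt/Assertions.h,700,3,400
"""

def rank_by_interestingness(csv_text):
    """Sort headers by (included-by count) * (includes count), descending."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return sorted(rows, key=lambda r: int(r[1]) * int(r[2]), reverse=True)

for path, included_by, includes, lines in rank_by_interestingness(sample):
    print(path, int(included_by) * int(includes))
# → ./gfx/gl/GLContext.h 5400
#   ./xpcom/glue/nsCOMPtr.h 4500
#   ./mfbt/Assertions.h 2100
```

The product metric deliberately favors headers that are bad in both directions at once, which is exactly what makes a header a "hub".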
(Reporter)

Updated

5 years ago
Depends on: 912042

Comment 1

5 years ago
More possibly useful data is here https://dl.dropboxusercontent.com/u/74741329/includes.csv, that is the size of each header post-preprocessing (in bytes). Note that the headers with very small numbers are probably errors - I'm still working on making the processing better. The total size of includes is about 5GB.

I was also trying to figure out how to identify these 'hub' headers. One way would be to instrument Clang or GCC to get the time spent on each header but that sounds difficult for someone (like me) who has no knowledge of their internals. My guess was that time would correlate with size, thus why I went after post-processing size. But I'm not sure about that. I plan to do some measurements soon.

Updated

5 years ago
Blocks: 785103

Comment 2

5 years ago
(In reply to Nick Cameron [:nrc] from comment #1)
> More possibly useful data is here
> https://dl.dropboxusercontent.com/u/74741329/includes.csv, that is the size
> of each header post-preprocessing (in bytes). Note that the headers with
> very small numbers are probably errors - I'm still working on making the
> processing better. The total size of includes is about 5GB.
> 
> I was also trying to figure out how to identify these 'hub' headers. One way
> would be to instrument Clang or GCC to get the time spent on each header but
> that sounds difficult for someone (like me) who has no knowledge of their
> internals. My guess was that time would correlate with size, thus why I went
> after post-processing size. But I'm not sure about that. I plan to do some
> measurements soon.

The post-processing size is not necessarily a good measure.  Our goal is to minimize the number of cpp files that need to be rebuilt when a given header is touched, so I think Benoit's measure is probably better.

Joshua has done something similar as well, in another bug which I cannot find right now.
(Reporter)

Comment 3

5 years ago
> The post-processing size is not necessarily a good measure.  Our goal is to
> minimize the number of cpp files that need to be rebuilt when a given header
> is touched, so I think Benoit's measure is probably better.

Well, by that point of view, the right measure would be 'total number of files recursively including that header' while my naive script just gives the number of /headers/ /directly/ including that header.

Anyway, I'm not really out there to find the best possible metric. I want to fix hub headers one by one, so I just need a decent heuristic to discover hub headers. My idea was that people would glance at that spreadsheet, pick a hub header in their area of expertise, fix it, and repeat.
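For what it's worth, the "total number of files recursively including that header" measure mentioned above amounts to a breadth-first search over the reversed include graph. A minimal sketch on a toy graph (all file names here are hypothetical):

```python
from collections import defaultdict, deque

# Toy include graph: file -> headers it directly includes.
includes = {
    "a.cpp": ["hub.h"],
    "b.cpp": ["mid.h"],
    "mid.h": ["hub.h"],
    "hub.h": ["leaf.h"],
    "leaf.h": [],
}

def transitive_includers(graph, header):
    """Return every file that directly or indirectly includes `header`."""
    # Invert the graph: header -> files that directly include it.
    rev = defaultdict(list)
    for f, deps in graph.items():
        for d in deps:
            rev[d].append(f)
    # BFS over reversed edges collects all transitive includers.
    seen, queue = set(), deque([header])
    while queue:
        for f in rev[queue.popleft()]:
            if f not in seen:
                seen.add(f)
                queue.append(f)
    return seen

print(sorted(transitive_includers(includes, "hub.h")))
# → ['a.cpp', 'b.cpp', 'mid.h']
```

Touching hub.h rebuilds a.cpp directly and b.cpp via mid.h, which is the rebuild cost the metric is meant to capture.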
(Reporter)

Updated

5 years ago
Alias: hub-headers

Comment 4

5 years ago
(In reply to comment #3)
> > The post-processing size is not necessarily a good measure.  Our goal is to
> > minimize the number of cpp files that need to be rebuilt when a given header
> > is touched, so I think Benoit's measure is probably better.
> 
> Well, by that point of view, the right measure would be 'total number of files
> recursively including that header' while my naive script just gives the number
> of /headers/ /directly/ including that header.

Yeah, I _think_ that's what Joshua's script used to do?  But I may be wrong.

> Anyway, I'm not really out there to find the best possible metric. I want to
> fix hub headers one by one, so I just need a decent heuristic to discover hub
> headers. My idea was that people would glance at that spreadsheet, pick a hub
> header in their area of expertise, fix it, and repeat.

Agreed!  This metric is clearly on the right path; it doesn't matter that it is not the ideal metric!

Comment 5

5 years ago
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #2)
> (In reply to Nick Cameron [:nrc] from comment #1)
> > More possibly useful data is here
> > https://dl.dropboxusercontent.com/u/74741329/includes.csv, that is the size
> > of each header post-preprocessing (in bytes). Note that the headers with
> > very small numbers are probably errors - I'm still working on making the
> > processing better. The total size of includes is about 5GB.
> > 
> > I was also trying to figure out how to identify these 'hub' headers. One way
> > would be to instrument Clang or GCC to get the time spent on each header but
> > that sounds difficult for someone (like me) who has no knowledge of their
> > internals. My guess was that time would correlate with size, thus why I went
> > after post-processing size. But I'm not sure about that. I plan to do some
> > measurements soon.
> 
> The post-processing size is not necessarily a good measure.  Our goal is to
> minimize the number of cpp files that need to be rebuilt when a given header
> is touched, so I think Benoit's measure is probably better.
> 

It would also be nice to reduce total build time when a cpp is touched and for a clobber. It is not clear to me how much we can affect that by optimising includes. But anecdotally a dumb IWYU pass for gfx/layers improved build times by 12.5%. If we could do something similar for the whole tree, that would be a real benefit. By that measure total size might be what we want, but maybe not. Perhaps total number of includes is important. I guess it depends if compilation performance is IO or CPU/memory bound.

(In reply to Benoit Jacob [:bjacob] from comment #3)
> > The post-processing size is not necessarily a good measure.  Our goal is to
> > minimize the number of cpp files that need to be rebuilt when a given header
> > is touched, so I think Benoit's measure is probably better.
> 
> Well, by that point of view, the right measure would be 'total number of
> files recursively including that header' while my naive script just gives
> the number of /headers/ /directly/ including that header.
> 

I think I can get this number with a relatively simple tweak to my scripts.
(Reporter)

Updated

5 years ago
Depends on: 912974

Comment 6

5 years ago
(In reply to Benoit Jacob [:bjacob] from comment #3)
> Well, by that point of view, the right measure would be 'total number of
> files recursively including that header' while my naive script just gives
> the number of /headers/ /directly/ including that header.
> 

Here you go: https://dl.dropboxusercontent.com/u/74741329/includes2.csv

The first column is the header name, the second is the number of headers transitively included by that header, and the third column is the number of times that header is included by another header (note that this does not count includes from cpp files, so a header could be included by every cpp in our tree and still get zero here). Also note that the same accuracy disclaimer applies as to the first set of data - it's a work in progress. (I think the second number is fairly accurate, but the first can be heavily skewed depending on how it was compiled; if that number is unusually small, then it was probably compiled without some system headers.)
(Reporter)

Comment 7

5 years ago
Thanks, very useful. Is the 3rd column counting all transitive includes or only direct includes?
(Reporter)

Comment 8

5 years ago
...self-answering: looking at the values, it has got to be transitive includes. Very useful, as my own data only had direct includes.

Comment 9

5 years ago
> ...self-answering: looking at the values, it has got to be transitive
> includes. Very useful, as my own data only had direct includes.

Yep, both columns are transitive.
Note that the only difference between this file and the naive spreadsheet that I attached is that your file has transitive includes, while mine only had direct includes.

That is very interesting data, but somehow, the results are not so obvious to act on: it shows what we indirectly include, but we can only act on what we directly include.

I feel that these two spreadsheets have different uses: my spreadsheet allows identifying what particular files we should focus on fixing (e.g. it currently prominently shows Layers.h, which is far less prominent in your spreadsheet) while your spreadsheet, if you regenerate it after we've made some fixes, is useful to measure how much progress we've actually made.

Comment 11

5 years ago
jcranmer's includes tool is in bug 901132.
(Reporter)

Updated

5 years ago
Depends on: 913603
(Reporter)

Updated

5 years ago
Depends on: 913847
(Reporter)

Updated

5 years ago
Depends on: 913852
(Reporter)

Updated

5 years ago
Depends on: 913868
(Reporter)

Updated

5 years ago
Depends on: 913869
(Reporter)

Updated

5 years ago
Depends on: 913872

Updated

5 years ago
Depends on: 908678
(Reporter)

Updated

5 years ago
Depends on: 918330
(Reporter)

Updated

5 years ago
Depends on: 919219
This bug is really a duplicate of includehell, duping.
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 785103