Closed Bug 608196 Opened 14 years ago Closed 14 years ago

Incremental genxref implementation: unit at a time

Categories

(Webtools Graveyard :: MXR, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: timeless, Assigned: timeless)

References

Details

TOC:
Comment 0 has a description of how things work and summary of the feature
Comment 1 has instructions for testing the feature
Comment 2 has timing numbers for an incremental run
Comment 3 has timing numbers for a normal run

because otherwise they'd be lost in this bug :)

Description:
Currently MXR must rebuild its entire index each time update-xref.pl runs, which is typically after a run of update-src.pl.

It's possible for very little of a tree to change.

The implementation I have isn't ready for use by mozilla-central because the database loses too much knowledge, but this is part one.

currently mxr does 2 source mode passes:
pass 1...: Collect identifier definitions.
pass 2...: Generate reference statistics.

plus a dump phase (stage 3).

basically you need to have a list of identifiers before you can decide which tokens in source files are references, so you need to do pass 1 completely first.

for a simple tree (security) with three directories (dbm, nspr, security), this looks like:

pass 1 collect dbm
pass 1 collect nspr
pass 1 collect security
-at this point we have a list of definitions from dbm+nspr+security-
pass 2 find refs dbm
pass 2 find refs nspr
pass 2 find refs security
-at this point we have a list of references from dbm+nspr+security-
stage 3 dump database
-at this point we dump the single database containing identifiers from dbm+nspr+security with consumers found in all of them-

With the code I have, the following happens instead:

if dbm is unchanged, copy dbm database, else:
pass 1 collect dbm
-at this point we have a list of definitions from dbm-
pass 2 find refs dbm
-at this point we have a list of references of dbm defined identifiers in dbm + things the indexer decides are identifiers anyway-
stage 3 dump dbm database
-at this point we dump a database containing identifiers from dbm with consumers found in dbm-

if nspr is unchanged, copy nspr database, else:
pass 1 collect nspr
-at this point we have a list of definitions from nspr-
pass 2 find refs nspr
-at this point we have a list of references of nspr defined identifiers in nspr + things the indexer decides are identifiers anyway-
-we did not find nspr references to dbm defined identifiers-
stage 3 dump nspr database
-at this point we dump a database containing identifiers from nspr with consumers found in nspr-

if security is unchanged, copy security database, else:
pass 1 collect security
-at this point we have a list of definitions from security-
pass 2 find refs security
-at this point we have a list of references of security defined identifiers in security + things the indexer decides are identifiers anyway-
-we did not find security references to dbm or nspr defined identifiers-
stage 3 dump security database
-at this point we dump a database containing identifiers from security with consumers found in security-

"pass" 4 merge databases
-we merge the dbm, nspr and security databases-

The neat part about this is that each database is self contained, and there's logic to check to see if the source code for a database is unchanged from a reference directory. If it is, then instead of generating an index (reading each file twice), we can simply borrow the existing database. We only need to rebuild databases for directories which have had changes.

For certain cases (addons) this is actually a very good thing, because we keep much smaller databases in memory and the directories are actually independent.

People can test this on konigsberg:
http://mxr-test.konigsberg.mozilla.org/security (currently indexed using the non incremental indexer, but it had a pass using incremental which was used by security-copy for testing purposes)
http://mxr-test.konigsberg.mozilla.org/security-copy (indexed using the incremental indexer)

I *believe* there would be differences in C++ cases. To show this I'd need to run against m-c or a similar tree. Unfortunately konigsberg also has a prototype for a different parsing system too which I believe causes more missed than the incremental code for this bug.

This is obviously not ideal for mozilla-central (or even security).

Future plans (to be refiled in another bug at a later date):
The longer version of this will instead rely on distinct identifier and reference databases. For directories which haven't changed, we would skip the identifier search.

if dbm is unchanged, copy dbm-ident database, else:
pass 1 collect dbm
-at this point we have a list of definitions from dbm-
dump dbm-ident database

if nspr is unchanged, copy nspr-ident database, else:
pass 1 collect nspr
-at this point we have a list of definitions from nspr-
dump nspr-ident database

if security is unchanged, copy security-ident database, else:
pass 1 collect security
-at this point we have a list of definitions from security-
dump security-ident database

if all -ident databases are unchanged, copy *-ref databases, else:

if dbm is unchanged, copy dbm database, else:
"pass 4" merge *-ident databases

pass 2 find refs dbm
-at this point we have a list of references of dbm defined identifiers in dbm+nspr+security-
stage 3 dump dbm-ref database
-at this point we dump a database containing identifiers from dbm,nspr,security with consumers found in dbm-

pass 2 find refs nspr
-at this point we have a list of references of nspr defined identifiers in dbm+nspr+security-
stage 3 dump nspr-ref database
-at this point we dump a database containing identifiers from dbm,nspr,security with consumers found in nspr-

pass 2 find refs security
-at this point we have a list of references of security defined identifiers in dbm+nspr+security-
stage 3 dump security-ref database
-at this point we dump a database containing identifiers from dbm,nspr,security with consumers found in security-

stage 4 merge databases
-we merge the dbm-ident, dbm-ref, nspr-dent, nspr-ref, security-ident and security-ref databases-
Basic comparison testing:
For simple C identifiers such as PR_Malloc, I don't see any differences:
http://mxr-test.konigsberg.mozilla.org/security/ident?i=PR_Malloc
http://mxr-test.konigsberg.mozilla.org/security-copy/ident?i=PR_Malloc
This instance is a case where there are no changes to any of the indexed directories between security/ and security-copy/, so this is just measuring phase 4: merging 3 databases (dbm, nspr, security).
[timeless@konigsberg mxr-test]$ ./update-xref.pl -incremental security-copy/
real	0m5.937s
user	0m4.351s
sys	0m1.517s
This is a normal complete index rebuild of the entire security root using the normal non incremental system
[timeless@konigsberg mxr-test]$ ./update-xref.pl security
real	8m46.718s
user	6m59.780s
sys	1m46.517s
Summary: Initial incremental genxref implementation → Incremental genxref implementation: unit at a time
Blocks: 608236
http://hg.mozilla.org/webtools/mxr/rev/892637a51793
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: Webtools → Webtools Graveyard
You need to log in before you can comment on or make changes to this bug.