1652158 - make_dafsa.py: replace usage of tuple-based node representation to new class-based node representation

Assignee

Description

•

4 years ago

In this ticket, a new Node representation and algorithm was implemented for building a dafsa. However, to reduce the patch's impact, it didn't replace the old dafsa-to-code logic, and had to keep the legacy tuple-based node representation around.

By removing the tuple-based representation and using the class-based representation everywhere, we're going to have two benefits:

Performance will be improved, since there won't be a conversion from class-based to tuple-based.
Old code and documentation around the tuple-based representation will be scrap-able

Ricky Stewart

Comment 1

•

4 years ago

Practically speaking how much of a performance hit is the conversion?

Mitchell Hentges [:mhentges] 🦀

Assignee

Comment 2

•

4 years ago

•

Edited

IIRC it was roughly (EDIT: ~0.6) seconds.

EDIT:
Looking again, the stats we're looking at is:

Overall make_dafsa.py invocation on my machine is ~4.5 seconds
The conversion from class-based dafsa to tuple dafsa is ~0.6 seconds
There might be additional performance improvements to be found in the dafsa-to-cpp code when using the new class-based nodes, which have additional information (such as parent links)

Eric Rahm [:erahm]

Comment 3

•

4 years ago

So I came here to say that we completely control the format of the data that gets input to make_dafsa and friends, so that might open up chances for more optimizations.

But then that made it click for me that we're generating a dafsa on the fly for every build from data that is a) static (nobody hand edits it) and b) updated and checked in by a script. WDYT about just generating the eTLD dafsa header when we udpate the dat file and checking in the inc file? We could also do the same for the STS preload list.

Mitchell Hentges [:mhentges] 🦀

Assignee

Comment 4

•

4 years ago

That's a good call! I think that this work here would still be helpful and relatively low-investment, but reducing the burden on mach export would be helpful.
We'd probably want to enforce that the generated files are up-to-date when a dat/STS file is changed though, right? Do you know how we could do that?

Mitchell Hentges [:mhentges] 🦀

Assignee

Updated

•

4 years ago

Flags: needinfo?(erahm)

Eric Rahm [:erahm]

Comment 5

•

4 years ago

Agreed, this would be a separate issue from what this bug is about. periodic_file_updates.sh is the only thing that updates them, so we'd just modify that to do both in lockstep (we could potentially get rid of the intermediary file as well). I suppose we could add a phabricator rule that yells at you if you try to check in a modified file.

Flags: needinfo?(erahm)

Mike Hommey [:glandium]

Comment 6

•

4 years ago

•

Edited

The generated file is double the size, though. It's also obfuscated and not friendly for modifications. Considering the time it now takes to generate it, I don't think it's worth the inconvenience to move it to update-time.

Mike Hommey [:glandium]

Comment 7

•

4 years ago

The generated file is double the size

Also, the differences between each generation of the file are going to be much larger, which will add some serious weight to the mercurial repo.

Mike Hommey [:glandium]

Comment 8

•

4 years ago

•

Edited

(In reply to Mike Hommey [:glandium] from comment #7)

The generated file is double the size

Also, the differences between each generation of the file are going to be much larger, which will add some serious weight to the mercurial repo.

To give an idea how bad this would look, if I take the 482 versions of nsSTSPreloadList.inc from the point we started using make_dafsa (when the file was only 385KB) until now, but excluding all the updates from beta, release or esr (there are 638 more in those branches), and create a fresh new mercurial repo with both the .inc and the result from make_dafsa and commit each revision, the size of the mercurial backing file is 2.3M for the .inc and ... 204M for the .h. It also doesn't compress great (a zstd mercurial bundle is not smaller than the sum of those). A zstd bundle of mozilla-central right now is 1.3GB, that would be 13% of its size (204M/(1.3G+204M)). For unified, this would be even more.

Mitchell Hentges [:mhentges] 🦀

Assignee

Comment 9

•

4 years ago

Attached file Bug 1652158: Removes usage of old tuple-based node system r?rstewart — Details

Refactors dafsa-encoding logic to use class-based node representation instead of the tuple-based
representation

Phabricator Automation

Updated

•

4 years ago

Assignee: nobody → mhentges

Status: NEW → ASSIGNED

Mitchell Hentges [:mhentges] 🦀

Assignee

Comment 10

•

4 years ago

•

Edited

There's some potential cleanup around the dafsa-to-bytes encoding logic here. Specifically, I was interested in refactoring the "guess and loop" algorithm used to create links.

To refactor this logic, I'd like the new output to match up one-to-one with the old output so that spotting regressions would be easy. However, this wouldn't necessarily be easy, since the existing logic doesn't behave consistently.

For some backstory, when the dafsa is encoded, it encodes children like this:

... 0xE1, 0x02, 0x83, 0x62, 0x61, 0x81, 0x63 ...
... "a"   jump  then, "b"   ...   ...   "c"  ...
          2     jump
                3

So, in this case, from "a", it has two children. The first child is "2 jumps [right]", then the second child is "3 [more] jumps [right]".
So, the children would we be:

(2 bytes to the right of the 0x02 byte is) 0x62: the character "b"
(3 more bytes to the right of the last jump (0x62) is) 0x63: the character "c"

The spot where the existing logic doesn't produce consistent output is around the "jump" bytes. This is because:

since the size of a "jump" can be larger than the maximum value that can be encoded in a byte, a "jump" can be represented with multiple bytes
When a "jump" is represented with multiples jumps, the jump happens from the first byte in the jump sequence

This means that the same jump can be represented in two different ways:

0xBF ... [62 nodes] ... [target]
jump 
 63 
bytes 
    -----[63 bytes]----> ✔
-----------------------------------------------
0x40  0x40 ... [62 nodes] ... target
jump   64
  bytes
    --------[64 bytes]-------> ✔

This, in tandem with the "guess-and-loop" logic, means that the same byte gap can be represented with different "jumps".
Additionally, during this investigation, I found that there were negligible performance improvements to be found in refactoring this logic.

In summary, I don't think refactoring the existing encoding logic is worth it right now for two reasons:

Verifying correctness will be inconvenient since the new output is expected to be different (more consistent) than the old output
There aren't performance gains to be found by doing this refactoring

Nathan Froyd [:froydnj]

Comment 11

•

4 years ago

(In reply to Mike Hommey [:glandium] from comment #8)

(In reply to Mike Hommey [:glandium] from comment #7)

The generated file is double the size

Also, the differences between each generation of the file are going to be much larger, which will add some serious weight to the mercurial repo.

To give an idea how bad this would look, if I take the 482 versions of nsSTSPreloadList.inc from the point we started using make_dafsa (when the file was only 385KB) until now, but excluding all the updates from beta, release or esr (there are 638 more in those branches), and create a fresh new mercurial repo with both the .inc and the result from make_dafsa and commit each revision, the size of the mercurial backing file is 2.3M for the .inc and ... 204M for the .h. It also doesn't compress great (a zstd mercurial bundle is not smaller than the sum of those). A zstd bundle of mozilla-central right now is 1.3GB, that would be 13% of its size (204M/(1.3G+204M)). For unified, this would be even more.

What if we generated binary data directly, and used assembly files to .incbin and define all the necessary symbols we needed?

Mike Hommey [:glandium]

Comment 12

•

4 years ago

If would be smaller, but the diffs would still be very large. The diffs might actually be even worse because I think mercurial's filelog diffs are between "lines" (i.e. binary stuff separated by \n)

Eric Rahm [:erahm]

Comment 13

•

4 years ago

That's fair enough, the generation time is much shorter now. Combined with Mitch's proposals and maybe just generating a binary blob like Nathan suggests rather than a rather large header that needs to be parsed/compiled (perhaps worth filing a follow-up) we're probably good enough. Sorry for hijacking your bug Mitch!

Mike Hommey [:glandium]

Comment 14

•

4 years ago

.incbin has other problems, I don't think it's worth the effort. It's not going to save make_dafsa time, and it's not going to make a noticeable difference for the compiler.

Pulsebot

Comment 15

•

4 years ago

Pushed by mhentges@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a3508068b70d
Removes usage of old tuple-based node system r=firefox-build-system-reviewers,rstewart

Atila Butkovits

Comment 16

•

4 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/a3508068b70d

Status: ASSIGNED → RESOLVED

Closed: 4 years ago

status-firefox80: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla80