Closed Bug 1581052 Opened 3 months ago Closed 2 months ago

[BinAST] Make getHuffmanLookup faster

Categories

(Core :: JavaScript Engine, task, P2)

Tracking

RESOLVED FIXED
mozilla71
Tracking Status
firefox71 --- fixed

People

(Reporter: Yoric, Assigned: a.munoz3327)

References

(Blocks 1 open bug)

Details

Attachments

(5 files)

Currently, on my benchmarks, getHuffmanLookup shows up as taking 6.5% of parsing time. For a tiny method that only performs a few arithmetic operations, that's too much :)

So what can we do about it?

  • Look at readByte and either make it faster or call it only once?
  • Somehow make the per-byte bit inversion faster?
  • Somehow get rid of the per-byte bit inversion?
Priority: -- → P2

I see two ways to make getHuffmanLookup faster:

  1. we know that we're only ever going to read at most 4 bytes, so we could specialize the code for 1, 2, 3 or 4 bytes and unroll the loop;
  2. we don't actually need to always have MAX_CODE_BIT_LENGTH bits available; we just need the number of bits necessary for the next lookup, which is quite often very small, so we could rewrite the method as getHuffmanLookup(uint8_t bitsNeeded)
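As a rough sketch of idea 1 (all names here are illustrative, not the actual SpiderMonkey code): since at most 4 bytes are ever read, the read loop can be unrolled into a fixed switch with fall-through cases, something like:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: read up to 4 bytes with the loop unrolled.
// `data`/`len`/`pos` stand in for the real stream state.
inline uint64_t readUpTo4Bytes(const uint8_t* data, size_t len, size_t pos) {
  size_t n = len - pos < 4 ? len - pos : 4;  // bytes actually available
  uint64_t result = 0;
  switch (n) {  // unrolled: each case adds one byte, then falls through
    case 4: result |= uint64_t(data[pos + 3]);       [[fallthrough]];
    case 3: result |= uint64_t(data[pos + 2]) << 8;  [[fallthrough]];
    case 2: result |= uint64_t(data[pos + 1]) << 16; [[fallthrough]];
    case 1: result |= uint64_t(data[pos]) << 24;
  }
  return result;  // first byte lands in the most significant position
}
```

The compiler then sees a fixed sequence of ORs and shifts instead of a data-dependent loop.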

Hi David, I'd like to work on this. I don't know how this is typically benchmarked though. I'm familiar with profilers and have started playing with the Firefox profiler, but wanted to check if you use something else to benchmark this.

We'll be happy to have help, but this is clearly not an easy bug. If you're ok with this, I can send you the benchmarking data.

Flags: needinfo?(a.munoz3327)

(In reply to David Teller [:Yoric] (please use "needinfo") from comment #3)

> We'll be happy to have help, but this is clearly not an easy bug. If you're ok with this, I can send you the benchmarking data.

That's fine with me. Can you send the benchmark data?

Flags: needinfo?(a.munoz3327) → needinfo?(dteller)

Ok, let's do this :)

Sharing the benchmark data through Firefox Send as this is a 700 MB+ patch (almost entirely data), not meant to be landed into Firefox: https://send.firefox.com/download/3ebd6d0d98776f50/#FtmQsHCYWXjn1cecS7zvEw .

This is very rough around the edges as I only ever intended to run these benchmarks myself.

You'll need to configure SpiderMonkey with options --enable-optimize --disable-debug for benchmarking purposes.

To run the benchmarks, go to js/src and run ./OPT/dist/bin/jsapi-tests testBinASTReaderContextRealJSSamples (replace OPT with the name of your build directory). The benchmarks can take a while to run, in particular if you're running them from a profiler. In the profile, we are only interested in code that is executed inside BinASTParserPerTokenizer::parse().

On my machine, the worst offenders are currently summarized at https://bugzilla.mozilla.org/show_bug.cgi?id=1577764#c5 .

If you're still with me, I'll be happy to chat about possible optimizations :)

Flags: needinfo?(dteller)

Thanks for sending the benchmarking stuff, I really appreciate all the help.

I started looking at making the per-byte bit inversion in getHuffmanLookup faster by masking and shifting bits in parallel as described here: http://graphics.stanford.edu/~seander/bithacks.html#ReverseParallel. I'm attaching a patch of what I tried in case that isn't clear. The bit inversion itself takes about half as many operations as the original way, but the speed increase is pretty small and probably negligible. I don't think there's a way to do the bit inversion with fewer operations without using lookup tables, and even those probably wouldn't be much faster.
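The masking-and-shifting technique from the bithacks page boils down to this (a standalone sketch, not the exact patch): three passes swap adjacent bits, then 2-bit pairs, then nibbles, and stopping before the byte-level swap reverses the bit order inside each byte while keeping the bytes themselves in place.

```cpp
#include <cstdint>

// Reverse the bit order within each byte of a 32-bit word, in parallel.
inline uint32_t reverseBitsPerByte(uint32_t v) {
  v = ((v >> 1) & 0x55555555u) | ((v & 0x55555555u) << 1);  // swap odd/even bits
  v = ((v >> 2) & 0x33333333u) | ((v & 0x33333333u) << 2);  // swap bit pairs
  v = ((v >> 4) & 0x0F0F0F0Fu) | ((v & 0x0F0F0F0Fu) << 4);  // swap nibbles
  return v;  // a fourth pass would swap bytes; we deliberately stop here
}
```

This handles 4 bytes per call in a constant number of operations, versus a per-bit loop for each byte.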

From your suggestions in comment #1 I think specializing the code for the number of bytes and unrolling the loop sounds pretty straightforward, but I don't quite understand the second idea. What do you mean by "we just need the number of bits necessary for the next lookup"? I'm still going through the file you pointed out and BinASTParser.cpp to get some context, but just wanted to make sure I'm looking in the right places.

Flags: needinfo?(dteller)

Doing the per-byte bit inversion on the lower 4 bytes of this->bits in place
after reading data is noticeably faster.

(In reply to Ayrton Muñoz from comment #8)

> Created attachment 9094094 [details]
> Bug 1581052 - [BinAST] Make getHuffmanLookup faster r?Yoric
>
> Doing the per-byte bit inversion on the lower 4 bytes of this->bits in place
> after reading data is noticeably faster.

Ah, interesting. Does this pass tests?

For testing, I suggest having a second build with --enable-debug --disable-optimize in a different build directory.

Then, after rebuilding,

  • ./DEBUG/dist/bin/jsapi-tests testBinASTReaderContextECMAScript2 for the quick tests; or
  • ./DEBUG/dist/bin/jsapi-tests testBinASTReaderContextRealJSSamples to use the ~700 MB samples in the patch I sent you for testing.

> From your suggestions in comment #1 I think specializing the code for the number of bytes and unrolling the loop sounds pretty straightforward, but I don't quite understand the second idea. What do you mean by "we just need the number of bits necessary for the next lookup"? I'm still going through the file you pointed out and BinASTParser.cpp to get some context, but just wanted to make sure I'm looking in the right places.

To do this, the idea is

1/ to add a method maxBitLength() to classes HuffmanTableImplementation*, just to return maxBitLength (you can of course just make the field public for experimentation);
2/ rewrite getHuffmanLookup(BinASTTokenReaderContext& owner) into getHuffmanLookup(BinASTTokenReaderContext& owner, uint8_t requiredBitLength);
3/ replace the MAX_PREFIX_BIT_LENGTH in getHuffmanLookup() with requiredBitLength;
4/ fix all the call sites of getHuffmanLookup() to pass table.maxBitLength() as argument requiredBitLength.

On the upside, you shouldn't need to fix BinASTParser.cpp :)
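As a toy model of steps 1-4 (all names illustrative, and simplified down from the real declarations), the shape of the change is: the table exposes its maximum code length, and the lookup only refills the bit buffer up to that many bits instead of always topping up to the fixed maximum.

```cpp
#include <cstddef>
#include <cstdint>

// Simplified stand-in for the reader state; assumes enough data remains.
struct BitBuffer {
  uint64_t bits = 0;
  uint8_t bitLength = 0;
  const uint8_t* data;
  size_t len;
  size_t pos = 0;

  // Steps 2/3: take requiredBitLength instead of MAX_PREFIX_BIT_LENGTH.
  uint64_t getHuffmanLookup(uint8_t requiredBitLength) {
    while (bitLength < requiredBitLength && pos < len) {
      bits = (bits << 8) | data[pos++];  // per-byte inversion elided here
      bitLength += 8;
    }
    return bits >> (bitLength - requiredBitLength);  // top bits only
  }
};

struct HuffmanTable {
  uint8_t maxBitLength_;
  // Step 1: expose the table's maximum code length.
  uint8_t maxBitLength() const { return maxBitLength_; }
};
```

Step 4 is then mechanical: each call site passes table.maxBitLength() as requiredBitLength.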

Flags: needinfo?(dteller) → needinfo?(a.munoz3327)

So after taking a closer look at the test results I realized my previous patch was actually failing a lot of tests (I thought it was running suspiciously fast...). I fixed the problem by reading the data into newBits while shifting this->bits to make room for the data. Then newBits is inverted in place outside the loop and added to this->bits. This version passed all the quick tests and my profiles show it spends 15% less time in getHuffmanLookup. Some of the tests in your benchmarking data failed, but I get the same thing when I run it with an unmodified build so I'm guessing it's ok.
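In a simplified model (illustrative names and buffer layout, not the actual BinASTTokenReaderContext code), the fix looks like this: raw bytes accumulate in newBits while the main buffer is shifted to make room, and the per-byte inversion happens once, outside the loop.

```cpp
#include <cstddef>
#include <cstdint>

struct Reader {
  uint64_t bits = 0;
  uint8_t bitLength = 0;

  void refill(const uint8_t* data, size_t len, size_t& pos, uint8_t needed) {
    uint32_t newBits = 0;
    while (bitLength < needed && pos < len) {
      newBits = (newBits << 8) | data[pos++];  // raw bytes, not yet inverted
      bits <<= 8;                              // make room in the main buffer
      bitLength += 8;
    }
    // Single parallel pass: reverse the bit order within each byte of newBits.
    newBits = ((newBits >> 1) & 0x55555555u) | ((newBits & 0x55555555u) << 1);
    newBits = ((newBits >> 2) & 0x33333333u) | ((newBits & 0x33333333u) << 2);
    newBits = ((newBits >> 4) & 0x0F0F0F0Fu) | ((newBits & 0x0F0F0F0Fu) << 4);
    bits |= newBits;  // add the inverted bytes to the main buffer
  }
};
```

This gives the same buffer contents as inverting each byte inside the loop, but pays the inversion cost only once per refill.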

I also modified getHuffmanLookup() to take the number of bits required like you suggested. It's passing all tests, but with both this and my other change it's only spending 10% less time in getHuffmanLookup than the unmodified build. I only recently got to this point so I think I need to take a closer look to figure out why we're getting this unexpected result.

Flags: needinfo?(a.munoz3327)

Yeah, you should use the testBinASTReaderContextECMAScript2 suite for testing. The other one makes use of features that have not been implemented yet, so it's not a good test.

Now, if your change works better without my suggestion, that's ok with me :) Mine was just a hunch; if the hunch proves wrong, don't use it!

Here's a new version that only reads data once to avoid the while loop. It spends 26% less time in getHuffmanLookup than the unmodified code and passes all tests on my machine :)

However, I don't think it would work on big-endian machines because I'm casting a uint8_t array to uint64_t *. It'd be pretty straightforward to modify for big endian by adding something like

#if defined(MOZ_BIG_ENDIAN)
  // call the original readBuf() here
#elif defined(MOZ_LITTLE_ENDIAN)
  // call my new readBufReverse() here
#endif

but I don't know if these are always defined or if this is something we'd like to avoid which is why I'm attaching it here. The patch has updated comments and uses an extended example to make things clear, but lmk if something doesn't make sense.
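A portable alternative (just a sketch, and not what the patch currently does) would be to assemble the value with shifts instead of casting the pointer; that behaves identically on any host, so no endianness branch is needed at all. The little-endian interpretation below is only for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Assemble a uint64_t from 8 bytes, interpreting them as little-endian,
// independent of the host's byte order.
inline uint64_t loadLittleEndian64(const uint8_t* p) {
  uint64_t v = 0;
  for (size_t i = 0; i < 8; i++) {
    v |= uint64_t(p[i]) << (8 * i);  // byte i lands in bits [8i, 8i+8)
  }
  return v;
}
```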

Flags: needinfo?(dteller)
Flags: needinfo?(dteller) → needinfo?(a.munoz3327)

(In reply to David Teller [:Yoric] (please use "needinfo") from comment #13)

Would any of the code in https://searchfox.org/mozilla-central/source/mfbt/EndianUtils.h be helpful?

Thanks! that's exactly what I was looking for

Flags: needinfo?(a.munoz3327)
Flags: needinfo?(a.munoz3327)
Attachment #9094094 - Attachment description: Bug 1581052 - [BinAST] Make getHuffmanLookup faster r?Yoric → Bug 1581052 - [BinAST] Make getHuffmanLookup faster by inverting bits after reading data r?Yoric

Reading data with only one call to readBuf() to avoid the while loop reduces
time spent in getHuffmanLookup() by 26%. Also added readBufReverse() to
return the data in reverse order if MOZ_LITTLE_ENDIAN is set.

Depends on D46559

Just split the patch into two like you suggested and I'm currently working on revisions to the new patch.

Flags: needinfo?(a.munoz3327)

Thanks :)

Ping me when you're ready for me to review! In the meantime, I'm assigning the bug to you :)

Assignee: nobody → a.munoz3327
Flags: needinfo?(a.munoz3327)

It's ready for review

Flags: needinfo?(a.munoz3327) → needinfo?(dteller)
Flags: needinfo?(dteller)

So, tests seem to pass :)
If you're ready, we can land the code and see if it sticks!

Flags: needinfo?(a.munoz3327)

(In reply to David Teller [:Yoric] (please use "needinfo") from comment #20)

> So, tests seem to pass :)
> If you're ready, we can land the code and see if it sticks!

yep I think it's ready to land :)

Flags: needinfo?(a.munoz3327)

"We're sorry, Autoland could not rebase your commits for you automatically. Please manually rebase your commits and try again. applying /tmp/tmpbt18xX js/src/frontend/BinASTTokenReaderContext.cpp Hunk #2 FAILED at 1276. 1 out of 2 hunks FAILED -- saving rejects to file js/src/frontend/BinASTTokenReaderContext.cpp.rej abort: patch command failed: exited with status 256"

Could you pull the latest mozilla-central and rebase your code?

Flags: needinfo?(a.munoz3327)

(In reply to David Teller [:Yoric] (please use "needinfo") from comment #22)

> "We're sorry, Autoland could not rebase your commits for you automatically. Please manually rebase your commits and try again. applying /tmp/tmpbt18xX js/src/frontend/BinASTTokenReaderContext.cpp Hunk #2 FAILED at 1276. 1 out of 2 hunks FAILED -- saving rejects to file js/src/frontend/BinASTTokenReaderContext.cpp.rej abort: patch command failed: exited with status 256"
>
> Could you pull the latest mozilla-central and rebase your code?

done

Flags: needinfo?(a.munoz3327)
Pushed by dteller@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b33ee74d78c3
[BinAST] Make getHuffmanLookup faster by inverting bits after reading data r=Yoric
https://hg.mozilla.org/integration/autoland/rev/28a5e5744f1f
[BinAST] Make getHuffmanLookup faster by calling readBuf once r=Yoric
Status: NEW → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla71

Well, this seems to have worked \o/

Thanks for the patch, Ayrton. If you're interested, I can suggest further work.

Flags: needinfo?(a.munoz3327)

(In reply to David Teller [:Yoric] (please use "needinfo") from comment #26)

> Well, this seems to have worked \o/
>
> Thanks for the patch, Ayrton. If you're interested, I can suggest further work.

Yeah I'm definitely interested in finding something else to work on. I've been reading up on the BinAST proposal but I could use some pointers on what I could work on.

Also I just saw someone suggested removing the type-punning in my last patch so I'm submitting a revised version. I ran the tests and performance is about the same as before.

Status: RESOLVED → REOPENED
Flags: needinfo?(a.munoz3327) → needinfo?(dteller)
Resolution: FIXED → ---
Pushed by dteller@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/7c6cde9fc234
[BinAST] removed type-punning in getHuffmanLookup r=tcampbell,Yoric
Status: REOPENED → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED

(In reply to Ayrton Muñoz from comment #27)

> (In reply to David Teller [:Yoric] (please use "needinfo") from comment #26)
>
> > Well, this seems to have worked \o/
> >
> > Thanks for the patch, Ayrton. If you're interested, I can suggest further work.
>
> Yeah I'm definitely interested in finding something else to work on. I've been reading up on the BinAST proposal but I could use some pointers on what I could work on.
>
> Also I just saw someone suggested removing the type-punning in my last patch so I'm submitting a revised version. I ran the tests and performance is about the same as before.

Thanks, by the way :)

Flags: needinfo?(dteller)
Regressions: 1588761

marauder, that regression sounds extremely unlikely, given that the affected code is preffed off.

Flags: needinfo?(marian.raiciof)

Thanks David! Marked it as invalid.

Flags: needinfo?(marian.raiciof)