Bug 1624468 Comment 3 Edit History

Thanks for the overview, Glenn.
My first reaction when looking at the code was a bit of surprise :) ... the gradient shader that we cache isn't very complex or heavy on ALU, so I wouldn't expect it to be ALU-bound and thus beaten by sampling from a 512x16 cached color strip.
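
To make the comparison concrete, here's a minimal CPU-side sketch of the two per-pixel strategies, written in Rust for readability -- this is only an illustration of the idea, not the actual WebRender shader or cache code, and all names are made up:

    // (a) "ALU" path: evaluate a simple two-stop gradient directly.
    fn eval_gradient(t: f32, c0: [f32; 4], c1: [f32; 4]) -> [f32; 4] {
        let t = t.clamp(0.0, 1.0);
        [
            c0[0] + (c1[0] - c0[0]) * t,
            c0[1] + (c1[1] - c0[1]) * t,
            c0[2] + (c1[2] - c0[2]) * t,
            c0[3] + (c1[3] - c0[3]) * t,
        ]
    }

    // (b) "cache" path: nearest-texel lookup into a pre-rendered
    // 512-entry color strip (the 512x16 cached strip above).
    fn sample_strip(t: f32, strip: &[[f32; 4]; 512]) -> [f32; 4] {
        let i = (t.clamp(0.0, 1.0) * 511.0).round() as usize;
        strip[i]
    }

The point being that (a) is only a handful of ALU ops per pixel, while (b) trades those for a texture fetch.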

So the first thing I did was set up some benchmarks with four large gradients: two cacheable and two that aren't (one due to a slight angle, the other due to hard stops).  I checked in RenderDoc that the caching kicks in as expected.
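
For context on what "cacheable" means here, a simplified sketch of the criteria the tests target (hypothetical names, not WebRender's real API): an axis-aligned gradient without hard stops can take the cached path, while a slight angle or a hard stop (two stops at the same offset) forces the regular shader:

    struct Stop { offset: f32 /* color omitted */ }

    // Hypothetical predicate mirroring the two "not cacheable" cases
    // used in the benchmark: a slight angle, and hard stops.
    fn is_cacheable(angle_degrees: f32, stops: &[Stop]) -> bool {
        let axis_aligned = angle_degrees.rem_euclid(90.0) == 0.0;
        let has_hard_stop = stops.windows(2).any(|w| w[0].offset == w[1].offset);
        axis_aligned && !has_hard_stop
    }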

The result seems to be that there's not an awful lot between cache-vs-no-cache -- not enough to distinguish it from the normal noise inside a run (the attached HTML has some charts; the error line on each bar is min-max) or between runs (the five groups of bars).

In fact, if anything, the non-cached version seems to be consistently faster on one of the tests for me :/

Disclaimer though: I am on a Quadro P400, which has an extremely low memory bandwidth of 32 GB/s (!).  A single 1:1 blit of a 3000x1500 RGBA8 surface takes over 1 ms.  It's possible that sampling is actually worse than calculating the gradient (even though the ALU path also needs to get its data from a LUT, so, I don't know -- perhaps raw loads come through faster than a bottlenecked texture sampler?).  I'll see if I can test on my laptop as well.
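
For reference, a quick back-of-the-envelope check of that blit number (assuming the blit simply reads and writes each pixel once, ignoring any compression):

    fn main() {
        let bytes = 3000u64 * 1500 * 4;     // RGBA8 surface: 18 MB
        let traffic = (2 * bytes) as f64;   // read + write
        let bandwidth = 32.0e9;             // ~32 GB/s on the Quadro P400
        let ms = traffic / bandwidth * 1e3;
        println!("~{:.1} ms per blit", ms); // ~1.1 ms, consistent with "over 1 ms"
    }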

I guess for more complex gradients the balance could shift a bit; but even then, we're ad-hoc reimplementing a caching mechanism. You may be right that investing that time in making picture caching more powerful, and/or unifying some of these gradient types to have fewer batch breaks, could pay off more.

Anyway let me know if there's a flaw in my thinking or benchmarking here :)  I'll try to grab a few more data points.  We can chat a bit next week.  Cheers!