Bug 1624468 Comment 3 Edit History

Thanks for the overview, Glenn.
My first reaction when looking at the code was a bit of surprise :) ... the gradient shader that we cache isn't very complex or heavy on ALU, so I wouldn't expect it to be ALU-bound and thus beaten by sampling from a 512x16 cached color strip.
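
To make the comparison concrete, here's a minimal CPU-side sketch of the two per-pixel strategies, written in Rust for readability -- this is only an illustration of the idea, not the actual WebRender shader or cache code, and all names are made up:

    // (a) "ALU" path: evaluate a simple two-stop gradient directly.
    fn eval_gradient(t: f32, c0: [f32; 4], c1: [f32; 4]) -> [f32; 4] {
        let t = t.clamp(0.0, 1.0);
        [
            c0[0] + (c1[0] - c0[0]) * t,
            c0[1] + (c1[1] - c0[1]) * t,
            c0[2] + (c1[2] - c0[2]) * t,
            c0[3] + (c1[3] - c0[3]) * t,
        ]
    }

    // (b) "cache" path: nearest-texel lookup into a pre-rendered
    // 512-entry color strip (the 512x16 cached strip above).
    fn sample_strip(t: f32, strip: &[[f32; 4]; 512]) -> [f32; 4] {
        let i = (t.clamp(0.0, 1.0) * 511.0).round() as usize;
        strip[i]
    }

The point being that (a) is only a handful of ALU ops per pixel, while (b) trades those for a texture fetch.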

So the first thing I did was set up some benchmarks with four large gradients: two cacheable and two that aren't (one due to a slight angle, the other due to hard stops).  I checked in RenderDoc that the caching kicks in as expected.
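
For context on what "cacheable" means here, a simplified sketch of the criteria the tests target (hypothetical names, not WebRender's real API): an axis-aligned gradient without hard stops can take the cached path, while a slight angle or a hard stop (two stops at the same offset) forces the regular shader:

    struct Stop { offset: f32 /* color omitted */ }

    // Hypothetical predicate mirroring the two "not cacheable" cases
    // used in the benchmark: a slight angle, and hard stops.
    fn is_cacheable(angle_degrees: f32, stops: &[Stop]) -> bool {
        let axis_aligned = angle_degrees.rem_euclid(90.0) == 0.0;
        let has_hard_stop = stops.windows(2).any(|w| w[0].offset == w[1].offset);
        axis_aligned && !has_hard_stop
    }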

The result seems to be that there's not an awful lot between cache-vs-no-cache -- not enough to distinguish it from the normal noise inside a run (the attached HTML has some charts; the error line on each bar is min-max) or between runs (the five groups of bars).

In fact, if anything, the non-cached version seems to be consistently faster on one of the tests for me :/

Disclaimer though: I am on a Quadro P400, which has an extremely low memory bandwidth of 32 GB/s (!).  A single 1:1 blit of a 3000x1500 RGBA8 surface takes over 1 ms.  It's possible that sampling is actually worse than calculating the gradient (even though the ALU path also needs to get its data from a LUT, so, I don't know -- perhaps raw loads come through faster than a bottlenecked texture sampler?).  I'll see if I can test on my laptop as well.
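
For reference, a quick back-of-the-envelope check of that blit number (assuming the blit simply reads and writes each pixel once, ignoring any compression):

    fn main() {
        let bytes = 3000u64 * 1500 * 4;     // RGBA8 surface: 18 MB
        let traffic = (2 * bytes) as f64;   // read + write
        let bandwidth = 32.0e9;             // ~32 GB/s on the Quadro P400
        let ms = traffic / bandwidth * 1e3;
        println!("~{:.1} ms per blit", ms); // ~1.1 ms, consistent with "over 1 ms"
    }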

I guess for more complex gradients the balance could shift a bit; but even then, we're ad-hoc reimplementing a caching mechanism. You may be right that investing that time in making picture caching more powerful, and/or unifying some of these gradient types to have fewer batch breaks, could pay off more.

Anyway let me know if there's a flaw in my thinking or benchmarking here :)  I'll try to grab a few more data points.  We can chat a bit next week.  Cheers!