On integrated GPUs, we are typically completely bound by memory
bandwidth and the number of pixels that get written / blended.
On real world pages, it's often the case that we end up with
clip tasks that are long in one dimension but not the other, due
to box-shadow edges, clip mask segments etc. When this occurs,
the logic that tries to get a small 'used_rect' to clear targets
to fails, since the union of those ends up being a very large
rect that covers (most of) the surface. This can cost a lot of
GPU time on some integrated chipsets.
Instead, it appears to be much faster to issue multiple clears,
one for each clip mask region, which is typically < 10% of the
surface we were clearing previously.
However, we can also restore an old optimization we used to have
which means we can skip clears altogether in the common case. The
first mask in a clip task will write to all the pixels in the mask,
so we can draw that with blending disabled (also a significant win
on integrated GPUs) and skip the clear in these cases. With this
functionality in place, the multiplicative blend mode is only
enabled for any clips other than the first in a mask (this is
quite a rare case - most clip tasks end up with a single mask).
On low end GPUs driving a 4k screen, I've measured GPU wins of up
to 5 ms/frame on some real world pages with this change.