Bug 1762603 Comment 0 Edit History


Anti-aliasing is currently computed by inflating the quad in local space by two pixels, which is large enough for most cases. The fragment shader then compares the local-space fragment position with a local-space representation of the rectangle and computes a simple distance to determine the AA. The scaling factor between local and screen space is compensated for by taking fwidth into account.
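For reference, the current approach looks roughly like the following fragment-shader pseudo-code (names are illustrative, not the actual WebRender shader):

```C
    // v_local_pos and v_local_rect are interpolated in local space;
    // the quad was inflated by two local pixels in the vertex shader.
    vec2 d = min(v_local_pos - v_local_rect.min, v_local_rect.max - v_local_pos);
    float dist = min(d.x, d.y);
    // fwidth approximates the local-to-screen scale, but only as a single
    // factor, which is what breaks down under non-uniform scaling.
    float aa_range = fwidth(dist);
    float alpha = clamp(dist / aa_range + 0.5, 0.0, 1.0);
```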

The main problem is that the way the scaling factor is computed only works when the scaling is mostly uniform in x and y. If the scaling is stronger on one axis than the other, the AA gets squashed on one axis and very blurry on the other. (See bug 1759526.)

In addition, it is more complicated and expensive than I'd want it to be.

The advantage is that the AA does not require extra geometry (although without extra geometry we miss the opportunity to move pixels to the opaque pass).

# Extruding the AA in screen space

I propose to replace this AA method with one where all the work is done in the vertex shader. The AA is always handled by extra geometry.

Quad vertices are extruded *in screen space* by half a pixel inward/outward in order to generate thin bands that handle the anti-aliasing. Since the extrusion happens in screen space, these bands are always exactly one pixel wide, which means no conservative inflation, and the result is correct under any (invertible) transform.

The transparency falloff is no longer computed via a distance function; instead it is baked into a vertex varying (0.0 for vertices extruded outward, 1.0 for vertices extruded inward). The alpha value is obtained from the interpolation of this varying, so no computation is needed in the fragment shader.
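In pseudo-code, the falloff then reduces to (variable names are hypothetical):

```C
    // Vertex shader: outward-extruded band vertices start fully transparent,
    // inward-extruded ones fully opaque.
    v_alpha = outward ? 0.0 : 1.0;

    // Fragment shader: the hardware interpolation of v_alpha across the
    // one-pixel band does all the work.
    oFragColor = color * v_alpha;
```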

Since the fragment shader also needs some information interpolated in local space (for example, UVs), the vertex shader does the following steps in pseudo-code:

```C
#ifdef WR_FEATURE_ANTIALIASING
    vec2 local_normal = get_normal(edge_flags, aPosition);
    // Adjust the length after normalization depending on whether the normal
    // is aligned with the rectangle edges or is a diagonal (corner) normal.
    float screen_length = local_normal.x == 0.0 || local_normal.y == 0.0 ? 1.0 : sqrt(2.0) / 2.0;
    float sign = outward ? 1.0 : -1.0;
    vec2 screen_normal = transform.transform_vector(local_normal).normalize() * screen_length;
    vec2 screen_position = transform.transform_point(local_position) + screen_normal * 0.5 * sign;
    // Have to reproject the adjusted screen position into local space.
    vec2 adjusted_local_position = inv_transform.transform_point(screen_position);
#else
    vec2 adjusted_local_position = local_position;
#endif

v_uv = adjusted_local_position;
```

There are more matrix transformations happening in the vertex shader. I don't expect them to make a big difference, although we could have fast paths for when we know the transform is axis-aligned or scales uniformly. I think it's more likely that the speedup from having no work in the fragment shader will outweigh the extra ALU in the vertex shader.

The potentially more complicated part of implementing this is that the batching code now has to output an extra quad for each edge that needs anti-aliasing. I don't think it changes much from what is currently done, but we have to make sure it integrates properly with how brush segments already generate multiple quads per primitive. One of the questions to answer is: should the AA bands be segments themselves?

An idea:
Currently we submit 2-triangle/4-vertex instances. In practice, GPUs today don't put multiple instances into the same wavefronts in vertex shaders, so we have very low vertex shader occupancy. Maybe we could make our instances always be 10-triangle/12-vertex (2 tri/4 verts for the interior + 8 tri/8 verts for the AA bands). When an edge does not need AA, it is not extruded, which results in the AA triangles having no area and not contributing to the final image. When we want to render the interior in the opaque pass, we can squash it into a zero-area quad in the instance that renders the AA in the alpha pass (hence the center area not sharing its vertices with the AA bands).
Perhaps that won't add noticeable overhead, since we are adding geometry in currently unused lanes, and all of the extra lanes will be fetching the same data from the GPU cache (so I don't expect extra latency). It would make it very easy to integrate the AA into the batching code. Opaque batches could remain at two triangles per instance.
That is a good candidate for a prototype outside of webrender.