Open Bug 1592654 Opened 4 months ago Updated 3 months ago

Support accessible screenshots via embedded screen readerable content

Categories

(Firefox :: Screenshots, enhancement)

People

(Reporter: _6a68, Unassigned)

Details

(Keywords: access)

Attachments

(1 file)

Really interesting concept: embed alt text and bits of page structure in screenshots, to make them accessible to screen reader users.

From the abstract at [1] (implementation at [2]):

"Using the X-Ray screenshot tool, semantic information is captured and stored in the Exif data of the resulting image, allowing it to "tag along" as the image is shared and reposted. We demonstrate that our approach retains accessibility for screen reader users via a study with five blind participants. More generally, our approach suggests a method for embedding accessibility metadata into otherwise inaccessible formats, enabling them to retain the more accessible representations that are present at capture time."

[1] https://dl.acm.org/citation.cfm?id=3353808
[2] https://gitlab.com/sujeathpareddy/xray

Hi,

I am the first author on the linked paper. I've considered this and I am hopeful. There are lots of questions and architecture-level decisions that need to be made. Here are my considerations:

1. At what level of abstraction is the view hierarchy placed into the image?

Since FF can only take screenshots of the web page as opposed to the surrounding UI or background content, HTML elements could work as well as accessibility nodes. HTML nodes would be nicely OS independent, allowing screenshots taken on one OS to be interpreted on another OS, including mobile ones. However, I think the broader accessibility ecosystem would want to participate as well. I would love for example, for a screenshot of an Android app to be readable when embedded inside a webpage, something that is not elegantly possible with HTML. If we use lower-level representations this would be possible, but I would need to understand how Firefox interfaces with OS accessibility infrastructure and how to make this cross-platform.

2. Privacy issues

This is a big one and one that I think should definitely be addressed. We raised this issue in our paper and we are working on approaches that may work. Essentially, we encrypt the nodes of the view hierarchy using a key computed from pixels underlying the bounding box of each node. At access time, when the user tries to access the view hierarchy, they need to first decrypt the nodes by computing the key from the pixels.

If the pixels are lost by cropping or blacking out in a photo editor, then the key cannot be computed. This approach is elegant, but more work is needed to verify that it doesn't have security issues. A further extension, handling cropping and zooming, is solved by a related technique.
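To make the idea concrete, here is a minimal sketch of the key derivation step only. It is not the scheme from the paper: the use of SHA-256, the `region_key` helper, and the flat list-of-tuples image format are all assumptions made for illustration, and a real design would feed the key into an authenticated cipher.

```python
import hashlib

def region_key(pixels, width, box):
    """Derive a 32-byte key from the RGB pixels inside a bounding box.

    Hypothetical helper: `pixels` is a flat, row-major list of (r, g, b)
    tuples for an image `width` pixels wide, and `box` is
    (left, top, right, bottom) with right and bottom exclusive.
    """
    left, top, right, bottom = box
    digest = hashlib.sha256()
    for y in range(top, bottom):
        for x in range(left, right):
            digest.update(bytes(pixels[y * width + x]))
    return digest.digest()

# A 4x2 test image whose pixel values are all distinct.
width, height = 4, 2
image = [(x * 10, y * 10, 7) for y in range(height) for x in range(width)]
box = (1, 0, 3, 2)

key_at_capture = region_key(image, width, box)

# An untouched copy yields the same key, so the node could be decrypted.
key_at_access = region_key(list(image), width, box)

# Blacking out one pixel inside the box yields a different key.
redacted = list(image)
redacted[0 * width + 1] = (0, 0, 0)
key_after_redaction = region_key(redacted, width, box)
```

An exact hash like this would also change under lossy recompression or rescaling, so it only illustrates the "lost pixels, lost key" property, not a deployable design.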

3. Interop with other file metadata

The current PNG and JPEG formats don't have a specified tag for accessibility information. We currently hijack a bunch of string-type fields and store our metadata there. I'm worried that we might break somebody's workflow by not using a "standard" tag. That said, the standardization seems to be iffy anyway. The tag descriptions at https://www.sno.phy.queensu.ca/~phil/exiftool/TagNames/EXIF.html show a bunch of manufacturer- and company-specific tags, so other people's code might already be pretty robust against this.
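For comparison, PNG also allows arbitrary keyword/text pairs in `tEXt` chunks, which decoders that don't recognize the keyword simply skip. The sketch below is not what X-Ray does (it uses Exif fields); it only shows, with the standard library, how a string could ride along in a PNG. The `a11y-tree` keyword and the JSON payload are made up for the example.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def make_chunk(chunk_type, data):
    """Build one PNG chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)

def iter_chunks(png):
    """Yield (type, data) for each chunk in a PNG byte string."""
    if png[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    offset = 8
    while offset < len(png):
        (length,) = struct.unpack(">I", png[offset:offset + 4])
        chunk_type = png[offset + 4:offset + 8]
        data = png[offset + 8:offset + 8 + length]
        yield chunk_type, data
        offset += 12 + length  # 4 length + 4 type + data + 4 CRC

def add_text_chunk(png, keyword, text):
    """Return a copy of png with a tEXt chunk inserted just before IEND."""
    payload = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    out = PNG_SIGNATURE
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"IEND":
            out += make_chunk(b"tEXt", payload)
        out += make_chunk(chunk_type, data)
    return out

def read_text_chunks(png):
    """Return {keyword: text} for every tEXt chunk in png."""
    found = {}
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"tEXt":
            keyword, _, text = data.partition(b"\x00")
            found[keyword.decode("latin-1")] = text.decode("latin-1")
    return found

# A minimal 1x1 grayscale PNG to experiment with.
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
idat = zlib.compress(b"\x00\x00")  # filter byte + one gray pixel
tiny_png = (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))

# "a11y-tree" is a hypothetical keyword, not a registered one.
tagged = add_text_chunk(tiny_png, "a11y-tree", '{"role": "button", "name": "Submit"}')
```

`tEXt` is limited to Latin-1; text in other scripts would need the `iTXt` chunk, which carries UTF-8.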

4. Usability issues

Our current system causes the accessibility cursor to move directly into the image without providing context or warning. I am a very novice screen reader user but I find it pretty confusing. Ideally, we'd want to provide something like "This screenshot is accessible: Double-tap/click to enter inside." before mounting the contained view hierarchy.

You can also see a video of it working at https://www.youtube.com/watch?v=59J0Xy7lt7s.

(In reply to Sujeath from comment #2)

Since FF can only take screenshots of the web page as opposed to the surrounding UI or background content, HTML elements could work as well as accessibility nodes. HTML nodes would be nicely OS independent, allowing screenshots taken on one OS to be interpreted on another OS, including mobile ones. However, I think the broader accessibility ecosystem would want to participate as well. I would love for example, for a screenshot of an Android app to be readable when embedded inside a webpage, something that is not elegantly possible with HTML. If we use lower-level representations this would be possible, but I would need to understand how Firefox interfaces with OS accessibility infrastructure and how to make this cross-platform.

Firefox builds an a11y tree in C++ containing Accessible objects, with subclasses for various implementations. For the most part, these are built from the DOM and layout. It's theoretically possible to create Accessible classes which take data from some other source. However, there are a bunch of technical concerns: it'd involve writing an entirely new C++ layer to build the tree from this new data source; we'd have to create some layer to get this from JS into C++; it would mean maintaining another layer of code which is only used for this one case; etc. At least for Firefox, I think it'd make sense to render this in HTML + ARIA, allowing our existing accessibility engine to be used as-is.

Of course, we could take some lower level representation and render it into HTML + ARIA for Firefox. However, I wonder whether it would be better to use HTML + ARIA across the board. It's true that rendering an Android app into HTML isn't elegant, but on the flip side, HTML + ARIA is a common language, is mapped and supported across all platforms and allows for very rich representation of semantics. If we used some lower level representation, we'd need to devise an abstract representation and then map it to native APIs on every platform. This doesn't seem efficient given that this has already been done (and is continually being worked on) for HTML + ARIA.

Thanks for the reply, Sujeath! There's a lot to unpack here (:asa, any thoughts on the right way to break off comment #2 into metabug + actionable sub-bugs?), but I did want to add on to comment #4, about HTML+ARIA vs internal C++ representations:

A key advantage of keeping the implementation in unprivileged JS, using HTML + ARIA, is that it'd be portable to other browsers and extensions, as well as other tools, like automated testing tools (selenium can take screenshots, for instance).

It might well make sense to publish a JS library independently of Firefox, to make it easy for existing tools to add accessibility support.

Flags: needinfo?(asa)

I've been talking with other people in the community and I think I've come around to using HTML + ARIA. I will take some time this week and build a proof-of-concept converter from Android's DOM to HTML, just to make sure this is not unexpectedly hard. I think the process is made easier since each element now has a fixed bounding box that does not adapt to screen dimensions.
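As a rough illustration of why fixed bounding boxes help, here is a small sketch that turns a captured node tree into absolutely positioned HTML + ARIA. It is not the converter described above: the node dictionary shape (`role`, `name`, `bounds`, `children`) and the `render_node` helper are assumptions for the example.

```python
from html import escape

def render_node(node):
    """Render one node (a dict) as an absolutely positioned div with ARIA attributes.

    Hypothetical node format: "bounds" is (left, top, width, height) in pixels.
    """
    left, top, width, height = node["bounds"]
    style = f"position:absolute;left:{left}px;top:{top}px;width:{width}px;height:{height}px"
    attrs = [f'role="{escape(node["role"])}"', f'style="{style}"']
    if node.get("name"):
        # escape() also escapes quotes, so names are safe inside attributes.
        attrs.append(f'aria-label="{escape(node["name"])}"')
    children = "".join(render_node(child) for child in node.get("children", []))
    return f"<div {' '.join(attrs)}>{children}</div>"

tree = {
    "role": "group",
    "name": "Login screen",
    "bounds": (0, 0, 320, 480),
    "children": [
        {"role": "button", "name": 'Sign "in"', "bounds": (40, 400, 240, 48)},
    ],
}

markup = render_node(tree)
```

Because every box is already in image coordinates, the output can be layered over the screenshot without any layout logic; mapping platform widget types to ARIA roles is the part that would need real design work.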

I agree with Jared that, for now, the rendering side should be an independent JS library. I will report back with the results by the end of the weekend.

Sounds great! Let me know if I can help at all.

(In reply to Jared Hirsch [:_6a68] [:jhirsch] (Needinfo please) from comment #5)

Thanks for the reply, Sujeath! There's a lot to unpack here (:asa, any thoughts on the right way to break off comment #2 into metabug + actionable sub-bugs?)

I think this could become the metabug and we could file bugs for code changes as dependencies to this.

Flags: needinfo?(asa)

I agree that this should be a metabug. I am new to Bugzilla and I'm not sure

I have built a very basic proof of concept, available at https://github.com/SujeathPareddy/xray-mozilla-poc. I'm still adding things like support for image scaling, so bear with me. It seems to work fine with VoiceOver on macOS Catalina.
