I am the first author on the linked paper. I've considered this and I am hopeful. There are lots of open questions and architecture-level decisions that need to be made. Here are my considerations:
1. At what level of abstraction is the view hierarchy placed into the image?
Since Firefox can only take screenshots of the web page, as opposed to the surrounding UI or background content, HTML elements could work as well as accessibility nodes. HTML nodes would be nicely OS-independent, allowing screenshots taken on one OS to be interpreted on another OS, including mobile ones. However, I think the broader accessibility ecosystem would want to participate as well. I would love, for example, for a screenshot of an Android app to be readable when embedded inside a webpage, something that is not elegantly possible with HTML. If we use lower-level representations this would be possible, but I would need to understand how Firefox interfaces with the OS accessibility infrastructure and how to make this cross-platform.
2. Privacy issues
This is a big one and one that I think should definitely be addressed. We raised this issue in our paper and we are working on approaches that may work. Essentially, we encrypt the nodes of the view hierarchy using a key computed from pixels underlying the bounding box of each node. At access time, when the user tries to access the view hierarchy, they need to first decrypt the nodes by computing the key from the pixels.
If the pixels are lost, for example by cropping or blacking out a region in a photo editor, then the key cannot be computed and the corresponding nodes stay unreadable. This approach is elegant, but more work is needed to verify that it doesn't have security issues. A further extension is handling cropping and zooming, which is solved by a related technique.
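To make the idea concrete, here is a rough sketch in Python. It is not the implementation from the paper. The names (`key_from_pixels`, `encrypt_node`, `decrypt_node`), the node layout, and the choice to leave the bounding box unencrypted are all assumptions for illustration. The cipher is a toy SHA-256 keystream standing in for a real authenticated cipher, so it should not be treated as secure.

```python
import hashlib
import hmac
import json

def key_from_pixels(pixels, box):
    """Hash the RGB pixels inside box = (left, top, right, bottom) into a key.

    `pixels` is assumed to be a list of rows, each row a list of (r, g, b) tuples.
    """
    left, top, right, bottom = box
    digest = hashlib.sha256()
    for y in range(top, bottom):
        for x in range(left, right):
            digest.update(bytes(pixels[y][x]))
    return digest.digest()

def _keystream(key, length):
    # Toy keystream: SHA-256 in counter mode. Illustration only, not vetted crypto.
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])

def encrypt_node(node, pixels):
    """Encrypt a node's payload with the key derived from its own pixels."""
    key = key_from_pixels(pixels, node["box"])
    plaintext = json.dumps(node["payload"]).encode("utf-8")
    stream = _keystream(key, len(plaintext))
    ciphertext = bytes(a ^ b for a, b in zip(plaintext, stream))
    # The tag lets the reader detect that the pixels (and so the key) changed.
    tag = hmac.new(key, ciphertext, hashlib.sha256).digest()
    return {"box": node["box"], "ciphertext": ciphertext, "tag": tag}

def decrypt_node(encrypted, pixels):
    """Return the payload, or None if the underlying pixels no longer match."""
    key = key_from_pixels(pixels, encrypted["box"])
    expected = hmac.new(key, encrypted["ciphertext"], hashlib.sha256).digest()
    if not hmac.compare_digest(expected, encrypted["tag"]):
        return None
    stream = _keystream(key, len(encrypted["ciphertext"]))
    plaintext = bytes(a ^ b for a, b in zip(encrypted["ciphertext"], stream))
    return json.loads(plaintext.decode("utf-8"))
```

With this shape, blacking out even one pixel inside a node's bounding box changes the derived key, so `decrypt_node` returns `None` for that node while untouched nodes still decrypt. The sketch also assumes the pixels survive byte-for-byte, which would not hold after lossy recompression.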
3. Interop with other file metadata
The current PNG and JPEG formats don't have a specified tag for accessibility information. We currently hijack a bunch of string-type fields and store our metadata there. I'm worried that we might break somebody's workflow by not using a "standard" tag. That said, the standardization seems to be iffy anyway. The tag descriptions at https://www.sno.phy.queensu.ca/~phil/exiftool/TagNames/EXIF.html show a bunch of manufacturer- and company-specific tags, so other people's code might already be pretty robust against this.
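As one possible alternative to reusing EXIF string fields, PNG allows arbitrary keyword/text pairs in `tEXt` chunks. The sketch below is not what our system does today. It shows how a serialized hierarchy could be stored that way using only the Python standard library. The keyword `ViewHierarchy` and the function names are made up for illustration.

```python
import json
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def make_chunk(chunk_type, data):
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)

def iter_chunks(png):
    """Yield (type, data) for every chunk in a PNG byte string."""
    if png[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        chunk_type = png[pos + 4:pos + 8]
        data = png[pos + 8:pos + 8 + length]
        yield chunk_type, data
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC

def embed_hierarchy(png, hierarchy, keyword=b"ViewHierarchy"):
    """Return a copy of the PNG with the hierarchy in a tEXt chunk before IEND."""
    # json.dumps escapes non-ASCII and NUL by default, so the text is tEXt-safe.
    text = json.dumps(hierarchy).encode("ascii")
    new_chunk = make_chunk(b"tEXt", keyword + b"\x00" + text)
    out = PNG_SIGNATURE
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"IEND":
            out += new_chunk
        out += make_chunk(chunk_type, data)
    return out

def read_hierarchy(png, keyword=b"ViewHierarchy"):
    """Return the embedded hierarchy, or None if the image carries none."""
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"tEXt":
            key, _, text = data.partition(b"\x00")
            if key == keyword:
                return json.loads(text.decode("ascii"))
    return None
```

Decoders are expected to skip ancillary chunks they don't recognize, which is the kind of robustness I am hoping for, but editors that re-encode the image may silently drop the chunk. JPEG would need a different container, so this only covers half of the problem.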
4. Usability issues
Our current system causes the accessibility cursor to move directly into the image without providing context or warning. I am a very novice screen reader user but I find it pretty confusing. Ideally, we'd want to provide something like "This screenshot is accessible: Double-tap/click to enter inside." before mounting the contained view hierarchy.