Closed Bug 1937438 Opened 11 months ago Closed 28 days ago

Support tagged math in PDF documents

Categories

(Firefox :: PDF Viewer, enhancement, P1)

Tracking

RESOLVED FIXED
146 Branch
Tracking Status
relnote-firefox --- nightly+
firefox146 --- fixed

People

(Reporter: Jamie, Assigned: calixte)

References

(Depends on 1 open bug)

Details

(Keywords: access)

Attachments

(2 files)

Years ago, PDF described a mechanism to add MathML to PDF documents in order to make math content accessible. Unfortunately, this mechanism is difficult for software to generate, so it wasn't widely used. More recently, PDF v2 added a much simpler mechanism: associated files. A recent update to LaTeX implements this, so there will soon be a lot of PDF documents generated from LaTeX that contain MathML. Foxit already supports this. It would be good if we could support this too, to enable access to math in these documents.

We would expose the math as a <math> subtree in the struct tree. Any accessibility client which supports MathML would thus be able to access the math content just as it does for MathML on the web.

Neil Soiffer outlined this new development in an NVDA issue, including the relevant parts of the spec. I've quoted the relevant information below:

Section 14.13 of the ISO 32000-2 spec discusses associated files. Here are some relevant quotes from the spec:

Associated files provide a means to associate content in other formats with objects of a PDF file and to
identify the relationship between them. Such associated files are designated using file specification
dictionaries (see 7.11.3, "File specification dictionaries"), and AF keys are used in object dictionaries to
connect the associated file’s specification dictionaries with those objects.
For associated files, their associated file specification dictionaries should include the AFRelationship
key indicating one of several possible relationships that the file has to the associated PDF object
The file specification for an associated file represents either a file external to the PDF file or an
embedded file stream (see 7.11.4, "Embedded file streams") within the PDF file.
It should always be the case that the MathML is an embedded file stream, not an external file.

...the resulting PDF document might contain the following embedded files: ...MathML version of the equation embedded with an AFRelationship value Supplement, and associated using a structure element or a form XObject depending on how the equation is rendered in the page’s content stream.

14.13.6 Associated files linked to structure elements
One or more files may be associated with structure elements (see 14.7.2, "Structure hierarchy") to
accommodate content that spans pages such as in an article, section or table, in which cases logical
structural elements should be used to make an association with files. This entry represents the
associated files for the entire structure element. To associate files with structure elements, the
structure element dictionary shall contain an AF entry which represents the associated files for that
structure element. The relationship that the associated files have to the structure element is supplied
by the AFRelationship key in each file specification dictionary.

Other potential places in the spec for info:

* Table 43 discusses the "AFRelationship" key along with the potential values  ("Supplement" being the important one).
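
As a rough illustration of the spec text quoted above, the lookup could be sketched like this in Python. The PDF objects are modeled as plain dictionaries, which is an assumption made for the example and not a real PDF library. `AF`, `AFRelationship`, `EF`, `F` and `Subtype` are PDF key names; `find_mathml_supplements`, the `data` field and the MIME-type check are hypothetical.

```python
from typing import List

MATHML_MIME = "application/mathml+xml"  # assumed MIME type of the embedded file


def find_mathml_supplements(struct_elem: dict) -> List[str]:
    """Return the MathML text of embedded files attached to a structure element."""
    found = []
    for file_spec in struct_elem.get("AF", []):
        # Only files flagged as a supplement to the element are of interest.
        if file_spec.get("AFRelationship") != "Supplement":
            continue
        # EF is present only for embedded files; external files are skipped.
        stream = file_spec.get("EF", {}).get("F")
        if stream is None or stream.get("Subtype") != MATHML_MIME:
            continue
        found.append(stream["data"].decode("utf-8"))
    return found


# Hypothetical structure element with three associated files.
formula = {
    "S": "Formula",
    "AF": [
        {"AFRelationship": "Source",
         "EF": {"F": {"Subtype": "application/x-tex", "data": b"x^2"}}},
        {"AFRelationship": "Supplement",
         "EF": {"F": {"Subtype": MATHML_MIME,
                      "data": b"<math><msup><mi>x</mi><mn>2</mn></msup></math>"}}},
        {"AFRelationship": "Supplement", "F": "external.xml"},  # no EF: external
    ],
}

mathml = find_mathml_supplements(formula)
```

Only the second entry is returned here: the first has a different relationship and the third is not an embedded file stream.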

There are several PDF files demonstrating this technique here:
https://github.com/latex3/tagging-project/discussions/56

Marking this as an enhancement for now: even though this has been possible for years, support has only recently been implemented elsewhere and it isn't in widespread use yet. That said, this will eventually become an S2 defect (and a reason for users to use other PDF tools) once other tools start implementing it and more PDF files utilise it.

Type: defect → enhancement

Note that in addition to associated files, MathML can also be inlined more directly as structure elements. Recent versions of Microsoft Word also embed MathML via an MSFT_MathML attribute (which is otherwise structurally the same as using an associated file). The NVDA screen reader can (via MathCAT) read mathematics in any of these formats in PDF readers that pass on the MathML (currently Acrobat and Foxit, although AF doesn't work in Acrobat).

NVDA's code as used with Acrobat and Foxit, which shows the handling of AF, structure elements and MSFT_MathML:

https://github.com/nvaccess/nvda/blob/master/source/NVDAObjects/IAccessible/adobeAcrobat.py#L173

Attached file equation1.pdf

:David, it's pretty interesting.
I attached a PDF with a basic formula generated with Word (version 2506), and in the struct tree we have an entry containing the corresponding MathML and an alt text describing it (6 sum from n equals 1 to infinity of 1 over n squared, equals pi squared).
Do you know if there's something similar in the TeX/LaTeX world ?

Flags: needinfo?(davidc)

I'll flag that the actual MathML is far more useful than the alt text for screen reader users, since it enables navigation/exploration, pausing, math braille, etc. Obviously, something is better than nothing though.

(In reply to Calixte Denizet (:calixte) from comment #3)

> Created attachment 9498878 [details]
> equation1.pdf
>
> :David, it's pretty interesting.
> I attached a pdf with a basic formula generated with Word (version 2506) and in the struct tree we've an entry containing the corresponding mathml and an alt text describing it (6 sum from n equals 1 to infinity of 1 over n squared , equals pi squared).
> Do you know if there's something similar in the TeX/LaTeX world ?

Yes, LaTeX will generate MathML and embed it using the standard structure element or associated file mechanisms, either of which will be suitably read or converted to braille by NVDA/MathCAT.

https://latex3.github.io/tagging-project/documentation/wtpdf/

The resulting PDF can be read using Acrobat or Foxit, but currently not by any of the browser-based PDF readers, which is a shame.

Flags: needinfo?(davidc)

The main difficulty is to make sure we don't inject bad data into the DOM.
Right now there is no API in the content thread to do that, but we should have the Sanitizer class in the coming months (see bug 1954437).
In privileged JS, we could rely on:
https://searchfox.org/mozilla-central/source/parser/html/nsIParserUtils.idl#18
and once it has been sanitized, we can remove the elements with a non-MathML namespace and the attributes we don't want (see https://github.com/WICG/sanitizer-api/blob/main/builtins/safe-default-configuration.txt#L171).
In the pdf.js code, we have an XML parser we could use to sanitize the MathML strings ourselves, but in terms of security, I don't think it's a good idea.
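
To make the filtering step concrete, here is a minimal Python sketch of the idea of dropping elements outside the MathML namespace and attributes that are not on an allowlist. It is not the pdf.js or Sanitizer implementation, and as the comment above notes, a platform sanitizer is preferable to hand-rolled filtering. The function name `sanitize_mathml` and both allowlists are hypothetical and far smaller than a real safe-default configuration.

```python
import xml.etree.ElementTree as ET
from typing import Optional

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
# Illustrative allowlists only (assumptions made for this sketch).
ALLOWED_ELEMENTS = {"math", "mrow", "mi", "mn", "mo", "msup", "msub", "mfrac"}
ALLOWED_ATTRIBUTES = {"display", "mathvariant"}  # null-namespace names

# Serialize MathML elements with a default xmlns instead of an ns0: prefix.
ET.register_namespace("", MATHML_NS)


def _is_allowed(elem: ET.Element) -> bool:
    """True if the element is in the MathML namespace and on the allowlist."""
    tag = elem.tag
    if not isinstance(tag, str) or not tag.startswith("{"):
        return False
    namespace, _, local_name = tag[1:].partition("}")
    return namespace == MATHML_NS and local_name in ALLOWED_ELEMENTS


def _clean(elem: ET.Element) -> None:
    # Namespaced attributes appear as "{ns}name", so they never match.
    for name in list(elem.attrib):
        if name not in ALLOWED_ATTRIBUTES:
            del elem.attrib[name]
    for child in list(elem):
        if _is_allowed(child):
            _clean(child)
        else:
            # Drops the child, its subtree and any text trailing it.
            elem.remove(child)


def sanitize_mathml(source: str) -> Optional[str]:
    """Return filtered MathML, or None if the input is not a <math> document."""
    try:
        root = ET.fromstring(source)
    except ET.ParseError:
        return None
    if root.tag != "{%s}math" % MATHML_NS:
        return None
    _clean(root)
    return ET.tostring(root, encoding="unicode")


untrusted = (
    '<math xmlns="http://www.w3.org/1998/Math/MathML" display="block" '
    'onclick="alert(1)"><mi>x</mi>'
    '<script xmlns="http://www.w3.org/1999/xhtml">alert(1)</script>'
    '<mo href="javascript:alert(1)">=</mo><mn>1</mn></math>'
)
clean = sanitize_mathml(untrusted)
```

In this example the `onclick` and `href` attributes and the XHTML `script` element are removed, while the `mi`, `mo` and `mn` elements and the `display` attribute survive. Python's standard XML parser is not hardened against malicious input, so this is only a demonstration of the filtering logic.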

Depends on: 1954437
Assignee: nobody → cdenizet
Status: NEW → ASSIGNED
Priority: -- → P1

Release Note Request (optional, but appreciated)
[Why is this notable]: MathML elements in PDFs will now be accessible through screen readers and other accessibility tools
[Affects Firefox for Android]: Yes
[Suggested wording]: Improved support for screen readers to access mathematical formulas in PDFs
[Links (documentation, blog post, etc)]: Maybe we could link to one of the example PDFs using MathML

relnote-firefox: --- → ?

Hi Calixte! I was just looking at this patch, because I was curious about the Sanitizer usage. I think there is a small bug here: the attributes should be in the null (default) namespace.

I was also a bit curious how you handle the Sanitizer being missing? It's only enabled in Nightly for now.

Flags: needinfo?(cdenizet)

This is really welcome news. If I understand the PR correctly, Firefox (via PDF.js) will support MathML via associated files (and Microsoft's MSFT_MathML syntactic variant of that). Does the above sanitizer comment mean that you also support MathML structure elements, where the math is tagged directly in the structure tree?

(In reply to Tom Schuster (MoCo) from comment #9)

> Hi Calixte! I was just looking at this patch, because I was curious about the Sanitizer usage. I think there is a small bug here, the attributes should be in the null (default) namespace.

Ah yes, that's correct: I'll fix that, thank you.
That said, it would be nice to have some kind of shortcut to avoid having to maintain the list of tags/attributes.

> I was also a bit curious how you handle the Sanitizer being missing? It's only enabled in Nightly for now.

We use the Sanitizer only if the feature is available:

When do you plan to release this nice and expected feature :) ?

Flags: needinfo?(cdenizet)

(In reply to David Carlisle from comment #10)

> This is really welcome news. If I understand the PR correctly Firefox (via PDF.js) will support MathML via associated files (and Microsoft's MSFT_MathML syntactic variant of that) does the above sanitizer comment mean that you also support MathML Structure Elements where the Math is tagged directly in the structure tree?

:David Carlisle, yes, we support MathML tags when they're in the struct tree, and in this case we don't need the Sanitizer:
https://github.com/mozilla/pdf.js/blob/7fc5706e16ad76e04b66257882764caa2c9db7bd/web/struct_tree_layer_builder.js#L329-L330

and there is a test for it:
https://github.com/mozilla/pdf.js/blob/7fc5706e16ad76e04b66257882764caa2c9db7bd/test/integration/accessibility_spec.mjs#L349-L375

If you see anything wrong or if I missed something please tell me.

I missed that the embedded file is using UTF-8.
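
A small Python sketch of why the encoding matters. The comment does not say how the bytes were decoded before the fix, so Latin-1 is used here only as a plausible illustration of a wrong guess.

```python
# Bytes of an embedded MathML file that is stored as UTF-8.
raw = "<mi>π</mi>".encode("utf-8")

decoded_as_utf8 = raw.decode("utf-8")      # round-trips the original text
decoded_as_latin1 = raw.decode("latin-1")  # never fails, but garbles non-ASCII
```

Decoding as Latin-1 succeeds silently, so the mistake only shows up as wrong symbols in the exposed math.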

Blocks: 1997343
Status: ASSIGNED → RESOLVED
Closed: 28 days ago
Resolution: --- → FIXED
Target Milestone: --- → 146 Branch

I'm a bit confused, does this depend on the Nightly-only functionality added by bug 1954437? Seems very relevant from a release note perspective :-)

Flags: needinfo?(mcastelluccio)

Yes, I think we must wait for that to ride to Release, and then we can add this to release notes.

Flags: needinfo?(mcastelluccio)

Thanks, added to the Nightly release notes and I've added a dependency on the Sanitizer API ship bug.
