PDF editor removes tags from tagged PDFs
Categories: Firefox :: PDF Viewer, defect, P1
People: Reporter: aroselli, Assigned: calixte
Keywords: access
Attachments: 3 files
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/111.0
Steps to reproduce:
When I opened Firefox 111 it prompted me to try editing a PDF. I did so using a PDF that I had tagged (a PDF/UA-conformant accessible PDF).
- Open a tagged PDF in Firefox;
- Make any edit (text box, drawing);
- Save PDF;
- Open the newly-saved PDF in Adobe Acrobat.
Actual results:
All pre-existing tags are gone.
Expected results:
No tags should have been removed.
Possibly related (for background on PDF/UA if nothing else): https://bugzilla.mozilla.org/show_bug.cgi?id=861157
The attached image shows how a PDF that had been tagged has had all its tags removed after adding a drawing and text comment on the first page.
Comment 1 (Assignee) • 2 years ago
I tried with this PDF: https://accessinghigherground.org/wp/wp-content/uploads/2015/09/The-BasicsOfTaggedPDF20161.pdf and all the tags are still there after I modified and saved it in Firefox Nightly (113); the result is the same in release (111).
Would it be possible to share the PDF, or even just create a basic one containing a few tags?
Comment 2 (Reporter) • 2 years ago
I used the PDF you linked, edited it, and opened the edited file in Acrobat Pro and the tags were retained.
I created a new Microsoft Word document, dumped content from Wikipedia in there as plain text (used Notepad as an intermediary), formatted it, exported as PDF (using Create PDF/XPS Document), confirmed the tags in Acrobat Pro, opened in Firefox 111, edited, saved, and then opened the new file in Acrobat Pro and all the tags were removed.
My original PDF: https://adrianroselli.com/files/xfr/PDF-UA.pdf
My edited PDF: https://adrianroselli.com/files/xfr/PDF-UA_edited.pdf
To speculate, Firefox may be struggling with certain PDF files as a function of how they are encoded.
Comment 3 (Reporter) • 2 years ago
Quick follow-up: I noted the AHG PDF (which retained its tags) was created in Word 2016 and mine (which lost its tags) was created in Word 2021. No idea if/how that factors.
Comment 4 • 2 years ago
I can confirm that tags are removed when editing and saving the PDF-UA.pdf problem file linked above in Comment 2.
I did some further fiddling, and found that if I simply added the PDF/UA metadata identifier to that PDF, and then edited the file in Firefox 111, the tags would be preserved this time.
Unfortunately, the "Basics of Tagged PDF" file linked in Comment 1 does not have the PDF/UA metadata either, so the PDF/UA identifier cannot be the only issue causing this bug.
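For anyone repeating the check in this comment, here is a minimal sketch (not from pdf.js or Acrobat) that looks for the PDF/UA identifier in a file's XMP metadata. It assumes the metadata stream is stored uncompressed, which is common but not guaranteed, and the function name `pdfua_part` is hypothetical.

```python
import re
from typing import Optional

# The PDF/UA identifier lives in XMP metadata, either as an element
# (<pdfuaid:part>1</pdfuaid:part>) or as an attribute (pdfuaid:part="1").
# Assumption: the XMP packet is not compressed, so a raw byte search can see it.
_PDFUA_PART = re.compile(
    rb"""pdfuaid:part\s*(?:>\s*(\d+)\s*<|=\s*["'](\d+)["'])"""
)


def pdfua_part(pdf_bytes: bytes) -> Optional[int]:
    """Return the declared PDF/UA part number, or None if no identifier is found."""
    match = _PDFUA_PART.search(pdf_bytes)
    if match is None:
        return None
    # Exactly one of the two groups is set, depending on which form matched.
    return int(match.group(1) or match.group(2))
```

Running it on the bytes of both test files (for example `pdfua_part(open("PDF-UA.pdf", "rb").read())`) would show whether either one declares the identifier; a `None` result only means the marker was not visible to a raw byte search.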
Comment 5 (Assignee) • 2 years ago
If I understand the PDF specification correctly, PDF-UA.pdf is a "hybrid-reference" file: it contains an xref table with some deleted elements and an xref stream that references those deleted elements.
When we write data to the PDF, we use an xref stream, but its Prev entry points to the xref table rather than to the previous xref stream:
https://github.com/mozilla/pdf.js/blob/b1e0253f29176751c9762f88b5b9765fcf9fc07c/src/core/writer.js#L285
However, the specification describes the Prev entry of an xref stream as follows:
The byte offset in the decoded stream from the beginning of the file to the
beginning of the previous cross-reference stream. This entry has the same
function as the Prev entry in the trailer dictionary (Table 15).
Consequently, the fix is to reference the previous xref stream instead of the previous xref table.
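The diagnosis above depends on telling three cases apart: a classic xref table, an xref stream, and a hybrid-reference file whose trailer carries an XRefStm entry. The following is a rough sketch of that classification, written in Python rather than taken from pdf.js; the function names are hypothetical, and it only inspects the last cross-reference section without following the Prev chain or tolerating stray whitespace at the offset.

```python
import re
from typing import Optional


def last_startxref(pdf_bytes: bytes) -> Optional[int]:
    """Return the offset that follows the last 'startxref' keyword, if any."""
    offsets = re.findall(rb"startxref\s+(\d+)", pdf_bytes)
    return int(offsets[-1]) if offsets else None


def describe_xref(pdf_bytes: bytes) -> dict:
    """Classify the last cross-reference section as table, hybrid, or stream."""
    offset = last_startxref(pdf_bytes)
    result = {"kind": "unknown", "startxref": offset, "xref_stm": None}
    if offset is None:
        return result
    if pdf_bytes[offset:offset + 4] == b"xref":
        # Classic table: its trailer dictionary sits between the keywords
        # 'trailer' and 'startxref'.
        result["kind"] = "table"
        start = pdf_bytes.find(b"trailer", offset)
        end = pdf_bytes.find(b"startxref", start) if start != -1 else -1
        if end != -1:
            # An XRefStm entry in the trailer marks a hybrid-reference file.
            stm = re.search(rb"/XRefStm\s+(\d+)", pdf_bytes[start:end])
            if stm:
                result["kind"] = "hybrid"
                result["xref_stm"] = int(stm.group(1))
    elif re.match(rb"\d+\s+\d+\s+obj", pdf_bytes[offset:offset + 32]):
        # An indirect object at the offset means an xref stream.
        result["kind"] = "stream"
    return result
```

If this reports `"hybrid"` for a file, the `xref_stm` value is the offset of the existing xref stream, which is the section the comment above says a newly written xref stream should chain to.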
Comment 6 (Assignee) • 2 years ago
Comment 7 • 2 years ago