Closed Bug 916883 Opened 11 years ago Closed 3 years ago

find functionality creates terms in pdf viewer pages

Categories

(Firefox :: PDF Viewer, defect, P1)

RESOLVED FIXED

People

(Reporter: karlden, Assigned: calixte)

Details

(Whiteboard: [bugday-20140113][pdfjs-ux][pdfjs-text-search])

Attachments

(1 file)

User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20130905 Firefox/23.0 (Nightly/Aurora)
Build ID: 20130905203958

Steps to reproduce:
1) Using a fresh load of Firefox 23.0.1, open the page at: http://archive.dovebid.com/brochure/bro1514.pdf.
2) Open the find functionality.
3) Enter "WorkBenches" in the find box.
4) Enter "Work Benches" in the find box.

Actual results:
The term "WorkBenches" is found for step 3, and "Work Benches" is not found anywhere in the document for step 4.

Expected results:
The term "WorkBenches" should not be found in step 3, because it does not exist as a term in the document. The phrase "Work Benches" should be found at the same place where "WorkBenches" is currently found.

Details:
Other such concatenations can be found. Apparently, the internal text handling of the PDF viewer concatenates terms under some circumstances that include transitions to a new line.
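
The hypothesis in the Details above can be illustrated with a small sketch. This is not pdf.js code: the chunk list, the `ends_line` flag, and both helper functions are hypothetical, chosen only to show how joining extracted text chunks without a separator at a line transition would produce a term like "WorkBenches" and hide "Work Benches".

```python
# Hypothetical text chunks as (text, ends_line) pairs. This is an
# illustrative structure, NOT the data model pdf.js actually uses.
chunks = [
    ("Shop Vacs, Work", True),        # last chunk on its line
    ("Benches, Hand Tools", False),   # first chunk on the next line
]


def join_naive(chunks):
    """Concatenate chunks with no separator (reproduces the reported symptom)."""
    return "".join(text for text, _ in chunks)


def join_with_line_breaks(chunks):
    """Insert a single space after every chunk that ends a line."""
    parts = []
    for text, ends_line in chunks:
        parts.append(text)
        if ends_line:
            parts.append(" ")
    return "".join(parts).strip()


naive = join_naive(chunks)
fixed = join_with_line_breaks(chunks)

print("WorkBenches" in naive)    # True: a term that is not in the document
print("Work Benches" in naive)   # False: the real phrase cannot be found
print("Work Benches" in fixed)   # True once line transitions add whitespace
```

A real fix would also have to decide what to do with words hyphenated across a line break, which this sketch ignores.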
Attached image WorkBenches.png
The highlighting of the search text "WorkBenches" vs "Work Benches" in Firefox vs Adobe Reader is also not the same, as shown in the attached WorkBenches.png. Also, the text "WorkBenches" cannot be found in this test document in Adobe Reader.
Status: UNCONFIRMED → RESOLVED
Closed: 11 years ago
Component: Untriaged → PDF Viewer
Resolution: --- → WORKSFORME
Whiteboard: [bugday-20140113]
Some points:

1. In the current FF beta 25 this still works as described in the bug; it is not fixed there.

2. Using PDF Viewer 0.8.870 (today's development build) or 0.8.851 from a couple of days ago, "Work Benches" is not found in the PDF mentioned in the bug. Note that testing in these builds is difficult for stability reasons, since other things are going on with find. With some care, including a fresh load of the PDF, the "Work" in the occurrence of "Work Benches" can be found, but never "Work Benches" itself using the same procedure. I did these tests using a FF 25 beta build with the development builds of pdfjs identified above. If this failure to find "Work Benches" persists as this code stabilizes, then this bug is not fixed. This raises the question: where is this working?

3. Using those same builds, a find for "WorkBenches" also turns up nothing. Hopefully this means that the term "WorkBenches" and other non-existent terms are not being created going forward.

Thanks for the attention to this issue. I appreciate it, and I was concerned from the silence that nothing was happening with it. Nevertheless, a quiet "WORKSFORME" resolution months after the bug was entered, especially when it is still broken in the current FF beta build, does not seem to be what a WORKSFORME resolution ought to imply, and until now I have inferred something different when I saw such a resolution on other people's bugs in this database.
Sorry, I don't remember what my reason was for closing this. Maybe it was because of the "WorkBenches" term, which is no longer found. I can confirm that "Work" from "Work Benches" is not found in Firefox. I think it's because of the fragmented rendering of the word.
Status: RESOLVED → UNCONFIRMED
Resolution: WORKSFORME → ---
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Windows 7 → All
Hardware: x86_64 → All
Version: 23 Branch → Trunk
Priority: -- → P4
Whiteboard: [bugday-20140113] → [bugday-20140113][pdfjs-c-ux][pdfjs-d-text-search]
The original PDF used to describe this bug can no longer be accessed online (apparently link rot). However, I can reproduce this bug in Firefox 42.0 using the following URI:

http://www.boeing.com/commercial/aeromagazine/articles/qtr_4_07/AERO_Q407.pdf

Search for "aisleTwin". Note that the search is successful in pdf.js, but there is no term "aisleTwin" in the document. A search for "aisleTwin" fails in Adobe Reader, while a search for "aisle Twin" succeeds.

Perhaps related in its cause: a search for "aisle 226" succeeds in Adobe Reader, but both "aisle226" and "aisle 226" fail to find a match in pdf.js.

7 years later, this bug is still present.

Using:
Firefox 79.0 on desktop
Windows 10

Here is a link to the original document using archive.org's Wayback Machine:

https://web.archive.org/web/20060818161558/http://archive.dovebid.com/brochure/bro1514.pdf

Using the PDF bar at the top (not using Firefox Menubar -> Edit -> Find in This Page)

"WorkBenches" triggers a find for Work [wrap to new line, then separate word] Benches.
"WorkBenches" with the search bar option "Whole words" ticked on does not yield any results.

Copying and pasting from the PDF in Firefox into a text editor gives the following text:

"BENCH-TOP ARBOR PRESSES, (2) FAMCO NO. 2, (2) SHELDON NO. 2Pedestal Fans, Portable Heaters, Washers, Sweepers, Shop Vacs, WorkBenches, Hand Tools, Vises, Supply Cabinets, Flammable Safety StorageCabinets, Tool Storage Cabinets, Extension & Safety Ladders, Carts, PalletJacks, Work Fixtures, Horses, Stands, Etc."

Not familiar enough with PDFs or Firefox's rendering to posit any explanations, but hopefully this will be useful (someday) for someone to continue diagnosing the issue.

Have a wonderful day.

Still reproducible.

Severity: normal → --
Priority: P4 → P3
Severity: -- → S3
Assignee: nobody → cdenizet
Status: NEW → ASSIGNED
Whiteboard: [bugday-20140113][pdfjs-c-ux][pdfjs-d-text-search] → [bugday-20140113][pdfjs-ux][pdfjs-text-search]
Priority: P3 → P1

The "Work Benches" bug from the first PDF is fixed. "aisle Twin" bug from the second PDF is fixed too. "aisle 226" from the second PDF is not fixed (though I'm not sure what's the correct behavior).
Calixte, could you check with Adobe Reader?

Flags: needinfo?(cdenizet)

"aisle Cross Section 226" finds a result ("aisle" and "226" are on the 7th page, on two consecutive lines; "Cross Section" is on the 6th page).

(In reply to Marco Castelluccio [:marco] from comment #8)

"aisle 226" from the second PDF is not fixed (though I'm not sure what's the correct behavior).

Interestingly that can also be reproduced in PDFium (in Google Chrome), and it seems that the problem isn't related to the search functionality as such but rather to the actual contents of the textLayer.
From a very cursory look, it appears that some of the textContent is being positioned (by the PDF document itself) in such a way that it ends up outside of the visible pages. This is apparently affecting both PDF.js and PDFium, but not Adobe Reader as far as I can tell.

I think we should skip strings which are not in the page bounding box when we're creating text chunks.
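
A minimal sketch of that idea, assuming each text item carries the coordinates of its origin and the page exposes a rectangular bounding box. The `TextItem` type, the `inside_page` and `visible_items` helpers, the page size, and all coordinates are hypothetical and are not taken from pdf.js or from the PDF discussed here.

```python
from typing import NamedTuple


class TextItem(NamedTuple):
    """Hypothetical text item: a string plus the position of its origin."""
    text: str
    x: float
    y: float


def inside_page(item, page_box):
    """Return True if the item's origin lies within the page bounding box."""
    x0, y0, x1, y1 = page_box
    return x0 <= item.x <= x1 and y0 <= item.y <= y1


def visible_items(items, page_box):
    """Keep only items whose origin is inside the page; skip the rest."""
    return [item for item in items if inside_page(item, page_box)]


# Assumed page box (US Letter in points) and made-up coordinates.
page_box = (0.0, 0.0, 612.0, 792.0)
items = [
    TextItem("visible A", 72.0, 700.0),
    TextItem("off-page", 72.0, 900.0),   # above the top edge of the page
    TextItem("visible B", 72.0, 686.0),
]

kept = visible_items(items, page_box)
print([item.text for item in kept])   # ['visible A', 'visible B']
```

This checks only the origin point of each string. A string that starts inside the page but extends past its edge would still be kept, so a fuller version would compare the whole extent of the string against the box.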

Flags: needinfo?(cdenizet)
Blocks: 1755201
Status: ASSIGNED → RESOLVED
Closed: 11 years ago → 3 years ago
Resolution: --- → FIXED