Open Bug 839096 Opened 7 years ago Updated 6 years ago

Searching for consecutive words is broken in pdf.js

Categories

(Firefox :: PDF Viewer, defect, P3)

Tracking Status
firefox19 - ---

People

(Reporter: adalucinet, Unassigned)

Details

(Whiteboard: [pdfjs-c-ux][pdfjs-d-text-search])

Reproducible on the latest Beta (BuildID: 20130206083616): Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:19.0) Gecko/20100101 Firefox/19.0
Reproducible on the latest Aurora (BuildID: 20130207042017): Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:20.0) Gecko/20130207 Firefox/20.0
Reproducible on the latest Nightly (BuildID: 20130207030936): Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20130207 Firefox/21.0

Steps to reproduce:
1. Start Firefox and make sure pdf.js is enabled.
2. Navigate to http://www.selab.isti.cnr.it/ws-mate/example.pdf.
3. Search for at least two consecutive words (e.g., "how files should").

Expected results: At step 3, the searched text is highlighted and displayed accordingly.

Actual results: At step 3, no results are found when searching for words that exist in the PDF.

Note: 
1. In Nightly 16.0a1 (2012-06-08), with the "pdfjs.disabled" pref set to false so that pdf.js is enabled, I'm not able to search for even a single word in that specific PDF. I'll investigate this issue tomorrow and come back with more details.
Does Adobe Reader work in this case as well?
This seems to fail on the whitespace.

Using the first sentence as an example:
"This is a short example to show the basics of using the ENTCS style macro files."

With PDF.js, searching for "how" highlights "how" in "show", but searching for "how " (with a trailing space) finds nothing. It works fine in Adobe Reader 11 though.
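
To make the suspected failure concrete, here is a minimal sketch. It assumes (this is a guess based on the behaviour described in this thread, not taken from the PDF.js source) that the page text is built by joining the extracted text runs with no separator. The `RUNS` list and the `page_text` and `find` helpers are hypothetical names used only for illustration.

```python
# Hypothetical text runs for the start of the first sentence. Assumption:
# each run holds a word or word fragment and no run contains whitespace.
RUNS = ["This", "is", "a", "short", "example", "to", "sho", "w", "the", "basics"]


def page_text(runs, separator=""):
    """Join extracted runs into the string that a search would scan."""
    return separator.join(runs)


def find(query, text):
    """Case-insensitive substring search; returns the match offset or -1."""
    return text.lower().find(query.lower())


joined = page_text(RUNS)
print(joined)                # Thisisashortexampletoshowthebasics
print(find("how", joined))   # 22, inside "show"
print(find("how ", joined))  # -1, the joined text contains no space
```

Under that assumption a single word (or a fragment such as "how") can still match, while any query containing a space cannot, which is consistent with the results above.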

I also noticed another bug in testing this. Copying that first sentence in PDF.js and pasting it in a text field results in the following:
This
is
a
short
example
to
sho
w
the
basics
of
using
the
ENTCS
st
yle
macro
les.

Doing the same from Adobe Reader 11 results in:
This is a short example to show the basics of using the ENTCS style macro les.

I can file this separately if you think it warrants a bug report, Yury.
Nominating this for tracking since searching a document is a fairly common use case.
(In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #2)
 
> I can file this separately if you think it warrants a bug report, Yury.

I think you have encountered the same issue as in bug 819636 . Could you please take a look just to be sure?
I can reproduce this bug on Firefox 16, so it landed with the PDF.js feature and is not a regression. In Firefox 18, a push made the Find bar active, but search only works for one word at a time.
I can see no other differences between current builds and the Firefox 16 build.

Please let me know if I can help further.

Note:
1. The word "file" cannot be searched at all even in Firefox 18.
(In reply to Alexandra Lucinet [QA] from comment #4)
> (In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #2)
>  
> > I can file this separately if you think it warrants a bug report, Yury.
> 
> I think you have encountered the same issue as in bug 819636 . Could you
> please take a look just to be sure?

I can neither confirm nor deny as I do not have access to that bug.
Searching in PDFs is not an exact science. If you try this PDF in Adobe Reader, you will not be able to find "how files" or "files" either.

Two ways to address the issue: a) close as WONTFIX; b) wait until somebody thinks of a better search algorithm.
Despite Adobe Reader's in-line search failings, I think PDF.js is worse in its current form. At the very least we should be able to parse whitespace in a query. I think we should strive to be on par with Adobe Reader, if not better. I'd rather see this moved to the Enhancements bucket than wontfixed if this is deemed to be unimportant.
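
One possible way to handle whitespace in a query, sketched below purely for illustration: turn each run of whitespace in the query into an optional-whitespace pattern, so the query matches whether or not the extracted text kept the spaces. This is not how PDF.js or Adobe Reader is known to work; `whitespace_tolerant_pattern` and `search` are hypothetical helper names.

```python
import re


def whitespace_tolerant_pattern(query):
    """Compile a pattern where each whitespace run in the query is optional."""
    words = query.split()
    return re.compile(r"\s*".join(re.escape(word) for word in words), re.IGNORECASE)


def search(query, text):
    """Return (start, end) of the first match, or None if there is none."""
    if not query.strip():
        return None  # an empty query would otherwise match at offset 0
    match = whitespace_tolerant_pattern(query).search(text)
    return (match.start(), match.end()) if match else None


# Text extracted without spaces (assumed) and the same text with spaces.
print(search("show the", "Thisisashortexampletoshowthebasics"))  # (21, 28)
print(search("show the", "to show the basics"))                  # (3, 11)
```

The trade-off is that word boundaries are ignored, so a query such as "is a" would also match inside an unrelated word containing "isa". A real fix would more likely need reliable spacing information from the text extraction itself.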
Not tracking for FF19's release, because this isn't a widespread issue with PDFs in general. Our own testing shows that multi-word searches typically work.
(In reply to Alex Keybl [:akeybl] from comment #9)
> Our own testing shows that multi-word searches typically work.

I'm wondering if this is something unique to certain PDFs and if there's a characteristic that breaks multi-word searches.

All multi-word searches I tried in this PDF work:
http://www.irs.gov/pub/irs-pdf/fw4.pdf

None of the multi-word searches I tried in this PDF work:
http://www.selab.isti.cnr.it/ws-mate/example.pdf
(In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #6)
> (In reply to Alexandra Lucinet [QA] from comment #4)
> > (In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #2)
> >  
> > > I can file this separately if you think it warrants a bug report, Yury.
> > 
> > I think you have encountered the same issue as in bug 819636 . Could you
> > please take a look just to be sure?
> 
> I can neither confirm nor deny as I do not have access to that bug.

Sorry, I've mistyped the bug number. Anthony, please take a look at bug 810636.
Priority: -- → P3
Whiteboard: [pdfjs-c-ux][pdfjs-d-text-search]
(In reply to Alexandra Lucinet [QA] from comment #11)
> Sorry, I've mistyped the bug number. Anthony, please take a look at bug
> 810636.

Looks like it, thanks.
I could still reproduce the issue on the latest Aurora (20130916004002) on Win7 and Ubuntu 64bit.
We found that searching for phrases containing words with different font styles returns 0 results.

E.g., in http://fzs.sve-mo.ba/sites/default/files/dokumenti-vijesti/sample.pdf, searches for "Do use" or "you must", both of which appear in chapter 1.2.3 of that document, return no results.

Reproduces from Firefox 19.0.2 to Nightly 32.0a1 2014-06-01 on all platforms.

Yury, please let me know if a separate bug should be opened for this.
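
A rough sketch of why mixed font styles could produce 0 results, assuming each font style change starts a new text run and that matching is done one run at a time (both are assumptions, not confirmed from the PDF.js source). The run contents below are made up, apart from the two phrases quoted above, and `build_index` and `runs_for_match` are hypothetical helpers. Searching the joined text and mapping the match back to run indices would let a phrase span several runs.

```python
from bisect import bisect_right

# Hypothetical (text, font style) runs; only "Do use" and "you must" come
# from the report, the rest is invented for the example.
RUNS = [("Do", "bold"), ("use", "regular"), ("this style;", "regular"),
        ("you", "regular"), ("must", "italic")]


def build_index(runs, separator=" "):
    """Join run texts and record the start offset of each run in the result."""
    starts, parts, pos = [], [], 0
    for text, _style in runs:
        starts.append(pos)
        parts.append(text)
        pos += len(text) + len(separator)
    return separator.join(parts), starts


def runs_for_match(query, runs):
    """Return the indices of the runs touched by the first match, or []."""
    if not query:
        return []
    text, starts = build_index(runs)
    begin = text.lower().find(query.lower())
    if begin < 0:
        return []
    end = begin + len(query) - 1  # offset of the last matched character
    first = bisect_right(starts, begin) - 1
    last = bisect_right(starts, end) - 1
    return list(range(first, last + 1))


# Matching run by run finds nothing, because no single run holds the phrase.
print(any("do use" in text.lower() for text, _style in RUNS))  # False
print(runs_for_match("Do use", RUNS))    # [0, 1]
print(runs_for_match("you must", RUNS))  # [3, 4]
```

The returned indices identify which runs would need highlighting, so a match that crosses a bold or italic boundary can still be shown.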