Copying text from PDF document inserts random whitespace or newline.




6 years ago


(Reporter: ishikawa, Unassigned)


Firefox Tracking Flags

(Not tracked)




6 years ago
When I copy text from the inline PDF viewer under Windows
(enabled automatically for me in FF 19.0
when FF updated itself from FF 18) and paste it into another
text-processing program, I get random whitespace or
newlines at unexpected places.

For example,
(This is the first hit when I searched for "pdf test" using Google.)
If I copy the first paragraph in this PDF and paste it into memo pad,
the word "computer" is somehow split into "comput" and "er" with a line break in between.

With the following Japanese PDF (from the Information Processing Society of Japan),
if I copy and paste the first paragraph starting with (1) (about five lines into the main text), voila! Every character ends up on its own line: each pasted line
consists of a single character. This is not what I expected. Copy and paste is useless
in this case.

Funnily enough, another Japanese PDF (a government leaflet), given below,
allows copy and paste to work as expected.
If you copy the remark on the title page, in the central light-colored rounded rectangle, it pastes correctly. I also tested a paragraph a few pages later, and copy and paste worked as expected again.

Maybe there is a method of PDF creation that would allow PDF.js to handle
copy and paste correctly, but we can't dictate how PDFs are created. Not all the PDFs on the Internet are created equal in this regard, so PDF.js needs to pay a
little more attention to this issue, IMHO.
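
To illustrate the kind of attention meant here, below is a minimal sketch of joining positioned text fragments into lines by baseline instead of emitting one line per fragment. Everything in it is an assumption made for illustration: the Fragment type, the join_fragments function, and both thresholds are hypothetical and are not PDF.js's API or its actual algorithm. It also assumes that y grows downward and that all fragments come from a single column of horizontal text.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical positioned text run (not a PDF.js type)."""
    text: str
    x: float      # left edge, in page units
    y: float      # baseline position; assumed to grow downward
    width: float  # advance width of the run


def join_fragments(fragments, y_tolerance=2.0, gap_threshold=1.5):
    """Join fragments into lines of text.

    Fragments whose baselines differ by at most y_tolerance are treated
    as one line. Within a line, a space is inserted only when the
    horizontal gap between neighbours exceeds gap_threshold.
    Both thresholds are illustrative guesses.
    """
    # Group fragments into lines by baseline.
    lines = []
    for frag in sorted(fragments, key=lambda f: f.y):
        if lines and abs(frag.y - lines[-1][0].y) <= y_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    # Order each line left to right and join its fragments.
    out = []
    for line in lines:
        line.sort(key=lambda f: f.x)
        parts = [line[0].text]
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            if gap > gap_threshold:
                parts.append(" ")
            parts.append(cur.text)
        out.append("".join(parts))
    return "\n".join(out)


# "comput" and "er" are separate runs on (almost) the same baseline.
sample = [
    Fragment("er", 46.0, 100.3, 12.0),
    Fragment("comput", 10.0, 100.0, 36.0),
    Fragment("science", 62.0, 100.0, 40.0),
    Fragment("next", 10.0, 114.0, 20.0),
]
print(join_fragments(sample))
```

With this grouping, a word stored as two adjacent runs is pasted as one word, and a line stored as one run per character is pasted as one line. Real PDFs (rotated text, multiple columns, vertical Japanese text) would need considerably more care than this sketch shows.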

BTW, I tried copy and paste operations on the last two Japanese PDFs using Adobe Acrobat 10.1.6 and they work as expected.
But I was surprised to find that
the Yukon education PDF test file (the first PDF) does not seem to allow
the copy operation (!). How interesting that PDF.js allows us to bypass the
light-hearted copy prohibition :-)

Bug 429859 may be related to the issue at hand, although it is an old issue.
(copying text from some pages inserts garbage characters between the characters on the page)
Back then it was reported that a NUL character was inserted after every character.
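
For illustration only: I do not know the actual cause of Bug 429859, but one well-known way to get a NUL after every character is to read UTF-16 (little-endian) text as if it were an 8-bit encoding. The sketch below reproduces that pattern and shows a crude cleanup of pasted text. The sample string is made up.

```python
# ASCII text encoded as UTF-16-LE stores each character as two bytes:
# the character code followed by a zero byte.
text = "PDF test"
raw = text.encode("utf-16-le")

# Misreading those bytes as an 8-bit encoding (Latin-1 here) turns
# every zero byte into a NUL character after each real character.
misread = raw.decode("latin-1")
print(repr(misread))

# Crude cleanup for pasted text: drop the NUL characters.
cleaned = misread.replace("\x00", "")
print(cleaned)
```

This only demonstrates the symptom. Whether that old bug had this cause is an assumption on my part.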


Comment 1

6 years ago
My bad, I forgot to insert the IPSJ URL for the second PDF.

Someone reported that he/she could not even read this page (it was not displayed correctly), but at least on my PC with the said OS, it is readable.

Last Resolved: 6 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 810636