Closed Bug 844006 Opened 11 years ago Closed 11 years ago

Copying text from PDF document inserts random whitespace or newline.

Categories

(Firefox :: PDF Viewer, defect)

x86
Windows XP
defect
Not set
normal

RESOLVED DUPLICATE of bug 810636

People

(Reporter: ishikawa, Unassigned)

Details

When I copy text from the inline PDF viewer under Windows
(automagically enabled for me in FF 19.0
when FF updated itself from FF 18) and paste it into another
text-processing program, I get random whitespace or
newlines at unexpected places.

For example,
http://www.education.gov.yk.ca/pdf/pdf-test.pdf
(This is the first hit when I searched for "pdf test" on Google.)
If I copy the first paragraph of this PDF and paste it into Notepad, the
word "computer" is somehow split into "comput" and "er" with a line break in between.

With a Japanese PDF from the Information Processing Society of Japan (URL in my follow-up comment below),
if I copy and paste the first paragraph, starting with "(1)" (about five lines into the main text), voila! Every character ends up on its own line:
the pasted text has one character per line. This is not what I expected. Copy & paste is useless
in this case.
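
For anyone trying to reproduce this, the one-character-per-line symptom can be checked mechanically on the pasted text. The following is only a rough sketch: the function names and the 0.9 threshold are assumptions made for illustration, not anything taken from PDF.js.

```python
def single_char_line_ratio(pasted: str) -> float:
    """Fraction of non-empty lines that hold exactly one character."""
    lines = [ln.strip() for ln in pasted.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return sum(1 for ln in lines if len(ln) == 1) / len(lines)


def looks_split_per_character(pasted: str, threshold: float = 0.9) -> bool:
    """Heuristic (threshold is an arbitrary assumption): True when nearly
    every non-empty line of the pasted text is a single character."""
    return single_char_line_ratio(pasted) >= threshold
```

Pasting the copied paragraph into a string and calling looks_split_per_character() on it gives a quick yes/no for this symptom. It does not detect the milder case of a single word being split across two lines.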

Funnily enough, copy and paste works as expected with another
Japanese PDF (a government leaflet):
http://www.meti.go.jp/statistics/toppage/topics/pamphlet/pdf/h21shokai.pdf
If you copy the remark in the light-colored rounded rectangle in the center of the title page, it pastes correctly. I also tested a paragraph a few pages later, and that works as expected too.

Maybe there is a way of creating PDFs that lets PDF.js handle
copy and paste correctly, but we can't dictate how PDFs are created. Not all PDFs on the Internet are created equal in this regard, so PDF.js needs to pay a
little more attention to this issue IMHO.

BTW, I tried copying and pasting from the two Japanese PDFs using Adobe Acrobat 10.1.6, and both work as expected.
But I was surprised to find that
the Yukon education PDF test file (the first PDF) does not seem to allow
the copy operation in Acrobat (!). How interesting that PDF.js allows us to bypass the
light-hearted copy prohibition :-)

Bug 429859 may be related to the issue at hand, although it is an old bug.
(copying text from some pages inserts garbage characters between the characters on the page)
Back then it was reported that a NUL character was inserted after every character.
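
The cause of that older bug is not stated here. Purely as an illustration of how a "NUL after every character" pattern can arise in general, the sketch below shows ASCII text encoded as UTF-16LE and then misread one byte at a time. Whether this has anything to do with bug 429859 is an assumption, not something confirmed in this report.

```python
text = "PDF"

# UTF-16LE stores each ASCII character as its byte followed by 0x00.
raw = text.encode("utf-16-le")

# Reading those bytes with a single-byte encoding keeps every 0x00,
# so each visible character is followed by a NUL character.
misread = raw.decode("latin-1")

print(repr(misread))
```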

TIA
My bad, I forgot to include the IPSJ URL for the second PDF.
http://ime.nu/www.itscj.ipsj.or.jp/nenkan/nenkan08.pdf

Someone reported that they could not even read this page (it was not displayed correctly), but at least on my PC with the said OS, it is readable.

TIA
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE