Closed Bug 810636 Opened 12 years ago Closed 3 years ago

Poor copy & paste behavior with pdf.js

Categories

(Firefox :: PDF Viewer, defect, P2)

Tracking

VERIFIED FIXED
95 Branch
Tracking Status
firefox-esr91 --- verified
firefox95 --- fixed

People

(Reporter: RyanVM, Unassigned)

Details

(Whiteboard: [pdfjs-c-ux][pdfjs-d-text-search][pdfjs-d-text-selection])

Attachments

(7 files)

Attached file Blah
Attached are a couple PDFs that are producing copy & paste behavior that doesn't seem to be correct.

Blah is a Word document I created myself and converted to PDF using PDFCreator. When I open it with Adobe Reader, I can copy & paste the text into a text editor exactly as it shows. When I open it with pdf.js, the third line pastes with an extra space at the end of it. To summarize:

Expected:
Blah
Blah
Blah

Actual:
Blah
Blah
Blah <-- Extra space


Actiontec is a PDF from the manufacturer of my wireless router. It behaves much worse. When I try to copy & paste "Verizon FiOS Router" at the top, there are several issues. First, it is very difficult to select just that text without also selecting much of the body text. Second, it seems to miss the last letter of each line. Third, it inserts a new paragraph after each character. Adobe Reader is able to copy & paste the text fine.

Expected:
Verizon
FiOS
Router

Actual:
V
e
r
i
z
o
n
F
i
O
S
R
o
u
t
e
Attached file Actiontec
Attached image Actiontec screenshot
Priority: -- → P2
Whiteboard: [pdfjs-c-ux][pdfjs-d-text-search][pdfjs-d-text-selection]
Build identifier: Mozilla/5.0 (X11; Linux i686; rv:19.0) Gecko/20130103 Firefox/19.0

Reproducible on Linux (Ubuntu 12.10), as well.
OS: Windows 7 → All
Attached file Reduced testcase
I think pdf.js should use span elements rather than div elements to create dummy layers for selection.
As of 2013-1-11, as reported in the "dupe" bug 829686, it is breaking on words, not individual letters. Hopefully this is just progress or some quirk of the PDF, and the fix for both scenarios is the same.
(In reply to Ryan VanderMeulen [:RyanVM] from comment #0)
> into a text editor exactly as it shows. When I open it with pdf.js, the
> third line pastes with an extra space at the end of it. To summarize:

Now every line has an extra trailing space.
ATM we use a lot of single divs that hold the text on the PDF page. There can be multiple divs per line or even per word. If you then select a line there is an extra newline inserted between each div. That's just how the spec goes ;)

To improve this, we need to compute the string copied to the clipboard ourselves (it's not as complicated as it may sound, since we already have the layout data in PDF.JS). This requires support for the copy/cut event clipboardData, which lets you change the content copied or cut from a page (as we need to do here). Support for the copy/cut event clipboardData is very close to landing in bug #407983. Once that's in, I will do the necessary bits in PDF.JS.
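
A rough sketch of the idea above, not the actual PDF.JS code: given positioned text fragments (one per div), group them into lines by vertical position and build the clipboard string from that. The `TextFragment` shape, the `fragmentsToClipboardText` name, and the `tolerance` parameter are all made up for illustration, and it assumes word spaces are already part of the fragment text.

```typescript
// Hypothetical shape for one positioned text run (one text-layer div).
// This is an illustration only, not PDF.JS's real data model.
interface TextFragment {
  text: string;
  x: number; // left edge, in page units
  y: number; // baseline position, in page units (assumed: larger = lower on page)
}

interface Line {
  y: number; // baseline of the first fragment placed on this line
  frags: TextFragment[];
}

// Group fragments whose baselines are within `tolerance` of each other,
// order each group left to right, and join the groups with one newline each.
function fragmentsToClipboardText(
  fragments: TextFragment[],
  tolerance: number = 2
): string {
  // Sort top to bottom, then left to right, without mutating the input.
  const sorted = fragments.slice().sort((a, b) => a.y - b.y || a.x - b.x);
  const lines: Line[] = [];
  for (const frag of sorted) {
    const current = lines[lines.length - 1];
    if (current && Math.abs(current.y - frag.y) <= tolerance) {
      current.frags.push(frag);
    } else {
      lines.push({ y: frag.y, frags: [frag] });
    }
  }
  return lines
    .map((line) =>
      line.frags
        .sort((a, b) => a.x - b.x)
        .map((f) => f.text)
        .join("")
        .replace(/\s+$/, "") // drop trailing whitespace at the end of a line
    )
    .join("\n");
}
```

In a page, a `copy` event listener could then pass the result to `event.clipboardData.setData("text/plain", text)` and call `event.preventDefault()` so the browser's default per-div output is not used.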
Status: NEW → ASSIGNED
Depends on: 407983
Flags: needinfo?
I am using FF 19, and now selecting text does not select all the text. It'll select most of it, but it seems that the last word on every line is cut off, as well as a few in the middle.
Flags: needinfo?
Well, I agree that copying selected text out via the clipboard results in an incredibly badly formatted mess. Fixing it would be really great; right now it is almost unusable for regular text extraction... :-(
I have just seen this behaviour with this pdf file:
http://www.mairie-rochecorbon.fr/pdf/reglement.pdf

Hope a fix will come soon
I've run into this recently as well. Tested on both v24 and a several day old nightly.

http://alstomsignalingsolutions.com/Data/Documents/VCS_April_16_2013.pdf

See attached screen shot. A copy of the selected region resulted
in the following text in the clipboard.

"nt sensing require"

It should have been:

"sensing requirements of the"

Though it's hard to be sure because the displayed selection box is a little
sloppy.
Just to be clear, the warning from NoScript is because I have Javascript from the website disabled at that point, pdf.js is enabled. I later enabled the website's JS and it had no effect.
Oh, and I'm on a Mac running OS X 10.7.5.
Here's another example from a current version of Firefox 27.0.1 on Mac OSX 10.9.2

Using the PDF viewer built into Firefox on this document:

http://www.institutional-economics.com/images/uploads/randreview.pdf

An attempt to highlight and then copy the text shown in the attached image, produced the errant highlight
shown in that image and resulted in the following truncated text in the clipboard

Tyler Cowen’s recent book,
Create Your Own Econ

A cut/paste from the same document viewed in Preview did the right thing. The highlight in the view and the text in the clipboard were correct.
Attached image TextSelectionError.png
Are there any plans to fix this issue in the near future?
I hope this gets fixed one day; it's really annoying to be forced to download and open this file in Adobe Reader just to copy text... (which is also not great, since Adobe Reader is exposed to lots of attacks because of exploits)
at least it keeps the formatting unlike Adobe Reader (fonts and stuff) if you copy it into Word (Office)
Any updates regarding this bug ?
Yeah, I hope this is not expected behaviour :/
An update on the current state would be nice. Is anybody interested in providing a patch, maybe? (I would like to, but I don't have any knowledge about this at all.)
Especially since https://bugzilla.mozilla.org/show_bug.cgi?id=407983 is now integrated already :)
There's a thread started at mozillaZine recently regarding this: http://forums.mozillazine.org/viewtopic.php?f=38&t=2961299

Firefox 40.0.3 in Win 7 Pro also has wacky issues copying and pasting from PDFs using the built-in viewer.
No assignee, updating the status.
Status: ASSIGNED → NEW
When trying to copy text:

The in-and-out breaths, are bodily formation.
Thinking and pondering are verbal formation.
Perception and feeling are thought formation.

from

http://www.themindingcentre.org/dharmafarer/wp-content/uploads/2013/04/40a.9-Culavedalla-S-m44-piya.pdf

produces

The in
-
and
-
out breath
s
,
are
bodily formation.
120
Thinking and 
ponder
ing 
are
verbal formation.
121
Perception
and
feeling 
are 
thought formation
.‖
12

"Opened 7 years ago".

This is the kind of thing that gets people to switch to Chrome.

It should be fixed thanks to https://github.com/mozilla/pdf.js/pull/13424.
pdf.js has been updated in m-c (see bug 1737299) so the fix is available in nightly.

Status: NEW → RESOLVED
Closed: 3 years ago
Depends on: 1737299
Resolution: --- → FIXED
Depends on: 1748536
Target Milestone: --- → 95 Branch

I have reproduced this issue in ESR v91.4.1esr and verified the fix in ESR v91.5.0esr, Release v95.0.2, and Nightly v97.0a1 on Windows 10, Mac OS 11.6.2, and Ubuntu 20.04.3 LTS.

Status: RESOLVED → VERIFIED