Open Bug 18012 Opened 25 years ago Updated 16 days ago

Support tables in plaintext output

Categories

(Core :: DOM: Serializers, enhancement, P5)

Tracking

()

Future

People

(Reporter: BenB, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: helpwanted)

This is a "sub-bug" of bug #16800.

I would like to see at least the most essential parts of tables (basic col and
row formatting, colspan and rowspan, caption, maybe summary) supported.

This wouldn't be easy, because it means (one more) rewrite of the line functions
of nsHTMLToTXTSinkStream.

I tried to look at how lynx does it, but it doesn't support them either :-(.
Depends on: 17723
Target Milestone: M12
Usual disclaimer: I don't know if I will really implement this. I'm just
looking.
FWIW, this would be most useful for the simplest cases - 2-4 columns only,
no nested tables, no fanciness at all. It is easy to forget the days before
DTP and the web, but as little as 15 years ago the majority of tables that
were not professionally produced or made by computer programs were either
composed at a typewriter or at a word-processing station of some kind connected
to a letter-quality printer, i.e. a typewriter with a data cable.

I am sure that if you get code for this working at all there will be pressure
to support more and more subtleties, but the ability to view simple tables
is worth enough to draw a line around what is reasonably possible.
Status: NEW → ASSIGNED
Priority: P3 → P2
Target Milestone: M12 → M20
Assignee: mozilla → akkana
Status: ASSIGNED → NEW
Summary: Support tables in plaintext output → [HELP WANTED] Support tables in plaintext output
Whiteboard: [HELP WANTED]
I don't think I'll work on that anytime soon. HELP WANTED.
No longer depends on: 17723
Removing dependency: formatting tables would be nice regardless of anything
else.

Cc'ing Daniel in case he has any interest in this.
Keywords: helpwanted
Status: NEW → ASSIGNED
Summary: [HELP WANTED] Support tables in plaintext output → Support tables in plaintext output
Whiteboard: [HELP WANTED]
Bulk move of all "Output" component bugs to new "DOM to Text Conversion"
component.  Output will be deleted as a component.
Component: Output → DOM to Text Conversion
moving to future milestone
Assignee: akkana → beppe
Status: ASSIGNED → NEW
moving back to previous owner
Assignee: beppe → akkana
Target Milestone: M20 → Future
Re-accepting.
Status: NEW → ASSIGNED
This bug is far from trivial: it means caching the content of table cells. I see
two ways:
- Ignore all formatting inside table cells.
This is against the HTML 4.0 spec, which allows both inline and block tags
inside table cells. Assuming the table is more important than the formatting
inside it, this would be an improvement, but one of doubtful value. If we want
to output commercial web pages, this approach would make the situation worse,
as tables are often used for large-scale layout (see e.g.
<http://www.mozilla.org> :-( ).
- Set up a new output sink for each cell.
This would preserve all formatting inside cells (even nested cells :) ), and is
IMO a logical solution, but it is more work (see below) and might have some
performance problems (dunno for sure).
Implementation:
  - On <td>/<th>, go into table cell mode.
  - While in table cell mode, record all input up to the </td>/</th> corresponding
to the <td>/<th> above. This includes both leaves and tags! I have no idea how to
do this; it would take some output sink magic. I think this is the hard part. Akk?
  - Do the above for all cells until </table>.
  - Compare the lengths of the concatenated leaves* for all cells (columns?), and
calculate column widths.
  - For each closed table cell, create a new HTML->TXT sink and feed it the
recorded data. The wrap column is set according to the calculated column width.
Fill the lines with spaces up to the wrap column (we could add a new mode to the
HTML->TXT for that). Record the output.
  - Lay the output out in a table (line 1 of cell 1 + "|" + line 1 of cell 2,
etc.).

*Strictly speaking, we would have to compare the lengths of the TXT output, but
we don't know the column widths yet, so we would have to run the HTML->TXT
twice. The length of the concatenated leaves is a close approximation, and
should be enough.
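
The last three steps above (estimate column widths from the text length, wrap each cell to its column, then join the cell lines with "|") could look roughly like the Python sketch below. It is only an illustration, not Mozilla code: the function names, the 72-column default and the proportional width split are all assumptions, and it takes cells that have already been reduced to plain strings, so the recording step is not shown.

```python
import textwrap

def column_widths(rows, total_width=72, sep="|"):
    """Estimate column widths from the length of each cell's text."""
    ncols = max(len(r) for r in rows)
    # Longest cell text per column: the "length of the leaves" approximation.
    longest = [max((len(r[c]) for r in rows if c < len(r)), default=0)
               for c in range(ncols)]
    avail = total_width - len(sep) * (ncols - 1)
    if sum(longest) <= avail:
        return [max(w, 1) for w in longest]
    # Too wide: share the available width proportionally (assumed policy),
    # keeping every column at least one character wide.
    total = sum(longest)
    return [max(1, avail * w // total) for w in longest]

def render_cell(text, width):
    """Wrap one cell to its column width and pad each line with spaces."""
    lines = textwrap.wrap(text, width=width) or [""]
    return [line.ljust(width) for line in lines]

def render_table(rows, total_width=72, sep="|"):
    """Lay out wrapped cells side by side, one output line per cell line."""
    widths = column_widths(rows, total_width, sep)
    out = []
    for row in rows:
        cells = [render_cell(row[c] if c < len(row) else "", w)
                 for c, w in enumerate(widths)]
        height = max(len(cell) for cell in cells)
        for i in range(height):
            parts = [cell[i] if i < len(cell) else " " * w
                     for cell, w in zip(cells, widths)]
            out.append(sep.join(parts).rstrip())
    return "\n".join(out)

print(render_table([["one two three", "x"]], total_width=8))
```

With an 8-character budget the first column gets 6 characters, so "one two three" wraps onto three lines while the second column stays one character wide. colspan, rowspan and nested tables are deliberately left out.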
w3m's table algorithm:
<http://ei5nazha.yz.yamagata-u.ac.jp/~aito/w3m/eng/STORY.html>.


Good news: it seems we will switch to direct DOM->text output sometime
(currently, we do DOM->XIF/HTML->text). That means we can navigate through the
document (back and forth), so we don't have to do the caching described above.
Adding dependency on bug 51308.
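
To illustrate why DOM access removes the need for caching: with a tree in hand, the serializer can walk the table once to measure and a second time to emit. The sketch below uses Python's xml.dom.minidom purely as a stand-in for the real DOM; the helper names are hypothetical, it does no wrapping, and it does not handle nested tables, colspan or rowspan.

```python
from xml.dom import minidom

def cell_text(node):
    """Concatenate all text leaves below a node, collapsing whitespace."""
    parts = []
    def walk(n):
        if n.nodeType == n.TEXT_NODE:
            parts.append(n.data)
        for child in n.childNodes:
            walk(child)
    walk(node)
    return " ".join("".join(parts).split())

def cells_of(tr):
    """Direct <td>/<th> children of a row."""
    return [c for c in tr.childNodes
            if c.nodeType == c.ELEMENT_NODE and c.tagName in ("td", "th")]

def measure(table):
    """First walk: widest cell text per column."""
    widths = []
    for tr in table.getElementsByTagName("tr"):
        for i, cell in enumerate(cells_of(tr)):
            n = len(cell_text(cell))
            if i == len(widths):
                widths.append(n)
            else:
                widths[i] = max(widths[i], n)
    return widths

def emit(table, widths):
    """Second walk: pad each cell to its column width and join with '|'."""
    lines = []
    for tr in table.getElementsByTagName("tr"):
        texts = [cell_text(c) for c in cells_of(tr)]
        texts += [""] * (len(widths) - len(texts))
        lines.append("|".join(t.ljust(w) for t, w in zip(texts, widths)).rstrip())
    return "\n".join(lines)

doc = minidom.parseString(
    "<table><tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td>Apple <b>pie</b></td><td>2</td></tr></table>")
table = doc.documentElement
print(emit(table, measure(table)))
```

Nothing is recorded between the two walks except the list of column widths.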
Depends on: 51308
Anthonyd is taking over Output bugs, so he's the default owner for RFEs like
this one.
Assignee: akkana → anthonyd
Status: ASSIGNED → NEW
Status: NEW → ASSIGNED
--> brade
Assignee: anthonyd → brade
Status: ASSIGNED → NEW
This is a serializer bug; it needs to be reassigned to the module owner for DOM 
to Text Conversion.
Assignee: brade → anthonyd
reassigning to cmanske.
Assignee: anthonyd → cmanske
Severity: normal → enhancement
Over to serializer owner.
Assignee: cmanske → tmutreja
Table support should be optional. Many webpages add a lot of cruft in the left
and right columns, and having it before/after the real text makes it *much*
easier to remove it from the resulting document later than if it were next to
the real content.
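
A switch of this kind might look like the following Python sketch. The `format_tables` flag and the function name are made up for illustration and do not correspond to any existing serializer option. With the flag off, cells are emitted one per line in document order, so sidebar text ends up before or after the main text instead of beside it.

```python
def serialize_rows(rows, format_tables=True, sep="|"):
    """Serialize rows of plain-text cells, as a grid or linearly."""
    if not format_tables:
        # Linear mode: one non-empty cell per line, in document order.
        return "\n".join(cell for row in rows for cell in row if cell)
    ncols = max((len(r) for r in rows), default=0)
    widths = [max((len(r[c]) for r in rows if c < len(r)), default=0)
              for c in range(ncols)]
    lines = []
    for row in rows:
        padded = list(row) + [""] * (ncols - len(row))
        lines.append(sep.join(t.ljust(w) for t, w in zip(padded, widths)).rstrip())
    return "\n".join(lines)

page = [["Nav", "Body one"], ["", "Body two"]]
print(serialize_rows(page, format_tables=False))
print(serialize_rows(page, format_tables=True))
```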
*** Bug 143151 has been marked as a duplicate of this bug. ***
Blocks: 192458
QA Contact: sujay → dom-to-text
Moving to p3 because no activity for at least 1 year(s).
See https://github.com/mozilla/bug-handling/blob/master/policy/triage-bugzilla.md#how-do-you-triage for more information
Priority: P2 → P3

The bug assignee didn't login in Bugzilla in the last 7 months, so the assignee is being reset.

Assignee: t_mutreja → nobody
Severity: normal → S3
Priority: P3 → P5