Closed Bug 44439 Opened 25 years ago Closed 25 years ago

Support headers

Status:

VERIFIED FIXED

Milestone:

M18

People

(Reporter: BenB, Assigned: BenB)

References

Details

(Whiteboard: Fixed.)

Attachments

(10 files)

Simple testcase 25 years ago Ben Bucksch (:BenB) 174 bytes, text/html
Fix, version 1 25 years ago Ben Bucksch (:BenB) 3.67 KB, patch
Fix, version 2 (forgot the .h) 25 years ago Ben Bucksch (:BenB) 4.55 KB, patch
A harder testcase 25 years ago Ben Bucksch (:BenB) 247 bytes, text/html
Fix, version 2 25 years ago Ben Bucksch (:BenB) 10.14 KB, patch
Fix, version 3 25 years ago Ben Bucksch (:BenB) 10.31 KB, patch
Changes to all.js (default prefs) 25 years ago Ben Bucksch (:BenB) 947 bytes, patch
Fix, version 4 25 years ago Ben Bucksch (:BenB) 13.98 KB, patch
Fix, version 5 25 years ago Ben Bucksch (:BenB) 14.65 KB, patch
Default prefs, version 2 25 years ago Ben Bucksch (:BenB) 976 bytes, patch

Ben Bucksch (:BenB)

Assignee

Description

•

25 years ago

Reproduce:
1. Load the testcase in the editor
2. Debug|OutputText

Actual result:
<<h1 foo h2 foo h1 foo h2 foo h3 foo h3 foo h4 foo h5 foo h6 foo >>

Expected result: Something more intelligent, e.g. Lynx' output:
bash$ lynx -dump header.html
h1 foo h2 foo h1 foo h2 foo h3 foo h3 foo h4 foo h5 foo h6 foo

Or something with numbers, like
1. h1 foo
1.1 h2 foo
2. h1 foo
2.1 h2 foo
2.1.1 h3 foo
2.1.2 h3 foo
2.1.2.1. h4 foo
2.1.2.1.1. h5 foo
2.1.2.1.1.1. h6 foo
etc.

Since we output normal text at column 0, I tend towards the numbering scheme. Suggestions welcome, but fast, please :).

Ben Bucksch (:BenB)

Assignee

Updated

•

25 years ago

Status: NEW → ASSIGNED

Target Milestone: --- → M17

Ben Bucksch (:BenB)

Assignee

Comment 1

•

25 years ago

Attached file: Simple testcase

Ben Bucksch (:BenB)

Assignee

Comment 2

•

25 years ago

No proposed rendering in the HTML 4.0 spec, but <quote src="http://www.w3.org/TR/REC-html40/struct/global.html#h-7.5.5"> HTML does not itself cause section numbers to be generated from headings. This facility may be offered by user agents, however. </quote>

Ben Bucksch (:BenB)

Assignee

Comment 3

•

25 years ago

Fixed. Used the numbering version. akk, can you review, please?

Keywords: patch, review

Whiteboard: Fixed.

Ben Bucksch (:BenB)

Assignee

Comment 4

•

25 years ago

Attached patch: Fix, version 1

Ben Bucksch (:BenB)

Assignee

Comment 5

•

25 years ago

Attached patch: Fix, version 2 (forgot the .h)

Ben Bucksch (:BenB)

Assignee

Updated

•

25 years ago

Summary: Support headers better → Support headers

Daniel Bratell

Comment 6

•

25 years ago

I'm not sure that numbering headers is the best approach, since it inserts content with a meaning that the author might not have intended. Imagine the following:

----------
A sends B an HTML mail with headers, converted to plain text. The mail gets numbered headers. B answers: "I agree with you. Especially 1.1.3 was very interesting". This will confuse A very much, since he didn't send anything with "1.1.3".
------------

Or consider just mailing any web page where the author has used <h#> tags for formatting rather than for logically organizing the content. Then our numbering could be all wrong with respect to the content of the mail.

I think your other proposal, using indentation, is better. In the case of wrapping text, it should even be simple to center a header. (Left as an exercise for the interested reader.)

Ben Bucksch (:BenB)

Assignee

Comment 7

•

25 years ago

Attached file: A harder testcase

Ben Bucksch (:BenB)

Assignee

Comment 8

•

25 years ago

Daniel, the problem with the Lynx style is that it does not scale. Lynx output of a harder testcase:

foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo

The structure is completely unreadable. And this example is not even far-fetched, I write such documents. OK, not that intense, and you can use the content to guess the structure, but that defeats the purpose. Structure should be clearly and unambiguously *visible*.

As we write normal text at column 0, we have to indent headers more. This would look like:

foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo

A bit better for normal text plus headers, but even less readable for headers plus indented content (<blockquote>, <dl>, <ul> etc.).

As for your first example, there are two cases: Send as Plaintext and multipart/alternative. In the former case, the sender might have a copy looking the same in his Sent folder, so he can see the numbering there. Unfortunate, but possible. In the second case, the recipient used a non-HTML-compliant mailer. He should be used to multipart msgs and know that he might see different output than the sender, so he will probably refer to the header name, not the number.

As for your second example, I was aware of that. If people abuse HTML, there is not much we can do. The problem with abusing HTML is *exactly* that it won't work well with configurations different from the author's. If authors become aware of that: fine with me :). Note that the Composer says "Heading 1" etc. in the UI, not "bigger" or so. I will add a comment to the askSendFormat dialog that the plaintext version might look different from what the author saw in the composer.

The HTML 4.0 spec authors were well aware of abuse of HTML. Nevertheless, they explicitly allow numbering of headers.

Ben Bucksch (:BenB)

Assignee

Comment 9

•

25 years ago

Ah, sample output of my implementation:
1. foo foo
1.1. foo foo
2. foo foo
2.1. foo foo
2.1.1. foo foo
2.1.2. foo foo
2.1.2.1. foo foo foo foo foo
2.1.2.1.1. foo foo
2.1.2.1.1.1. foo foo
(We don't support <dl> yet.)

Note, that my implementation is done. I don't really want to dump it, make correct HTML unreadable, just to make illegal HTML more readable.
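For readers who want to see how such labels can be produced, here is a minimal sketch in Python. It is not the code from the attached patch. The function name `number_headings`, the `(level, text)` input format, and the heading sequence assumed for the simple testcase are all illustrative assumptions. Skipped levels (for example an h3 directly under an h1) are not treated specially and would show up as 0.

```python
def number_headings(headings):
    """Prefix each (level, text) heading with a hierarchical number.

    `headings` is a list of (level, text) pairs, where level is 1..6
    for <h1>..<h6>. Hypothetical helper, not taken from the patch.
    """
    counters = [0] * 6                      # one counter per heading level
    out = []
    for level, text in headings:
        counters[level - 1] += 1            # next sibling at this level
        for deeper in range(level, 6):      # restart numbering below it
            counters[deeper] = 0
        label = "".join(f"{n}." for n in counters[:level])
        out.append(f"{label} {text}")
    return out


# Assumed heading order of the simple testcase (h1, h2, h1, h2, h3, h3, h4, h5, h6).
simple_testcase = [(1, "h1"), (2, "h2"), (1, "h1"), (2, "h2"), (3, "h3"),
                   (3, "h3"), (4, "h4"), (5, "h5"), (6, "h6")]
for line in number_headings(simple_testcase):
    print(line)
```

The one design point worth noting is the reset of all deeper counters whenever a heading is seen, which is what makes the second h1 start again at 2.1. rather than continuing at 2.2.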

Daniel Bratell

Comment 10

•

25 years ago

I saw your implementation and it looked good, but IMHO this doesn't improve Mozilla. That the W3C allows a user agent to number headers is not enough reason to do it. Numbered headers can be useful, but it's a feature that should be controlled by the author. In this case we insert numbers with no feedback to the sender until it's done and sent, and not even then if the user doesn't look in the sent mail folder. And honestly, how many people check their mails there after they are sent?

Your examples were quite unrealistic, you know. There is no text, and the headers consist solely of the word foo. Normally the text and the headers contribute to the understanding of the logical structure. Indentation could be a help to that understanding. A help that is discreet, but it's there and it won't disturb the content.

(Another example would be what would happen if someone already numbered headers manually before they are sent.)

Ben Bucksch (:BenB)

Assignee

Comment 11

•

25 years ago

> Your examples were quite unrealistic you know.

I filed and fixed this bug exactly because 4.x doesn't convert headers well and the plain text versions of my texts were not easily parsable/understandable.* I often have paragraphs < 1 line, and I happily use headers, lists etc. in the wildest combinations. In mail/news. So, it was a drastic example, but it has a valid, realistic and severe point.

> Normally the text and the headers contribute to the
> understanding of the logical structure.

I intentionally include and rely on structure in my composition. You will have a hard time understanding my texts if the structural information is not preserved.

*This gives me 3 options:
- Directly composing in plain text (including manual wrapping for lists)
- Writing the HTML version with the plain text version in mind and crippling the HTML version by doing so. (This requires detailed knowledge about the HTML->TXT converter, something only power users have.)
- Not caring about the plain text version.

Surely, none of these are acceptable.

Akkana Peck

Comment 12

•

25 years ago

I would not want the output system numbering headers which did not appear numbered in the html. That doesn't make sense at all. If I wanted it numbered, I'd use something like a numbered list -- or I'd just put in numbers and make them part of the header. I agree that the old code isn't working -- the text and the header shouldn't appear on the same line, we should have a newline, at least -- but adding numbers doesn't seem right. As an alternate suggestion, we could add an identifier like "*" (since html headers usually appear bold by default), or perhaps something new, like "<<" and ">>", around headers when converting to formatted plaintext, so <h1>foo</h1> would look like <<foo>>. You could even add more of them depending on the level, e.g. <foo> is an h4, <<<<foo>>>> is an h1, etc.

Ben Bucksch (:BenB)

Assignee

Comment 13

•

25 years ago

> I would not want the output system numbering headers which did not appear
> numbered in the html. That doesn't make sense at all.

It does make sense. Both the HTML spec authors and I came up with the same proposal. It might not make sense for *your* documents, which is still a legitimate argument. Can you explain why you think this makes no sense? The two reasons Daniel gave were valid, but IMO no reason to drop numbers.

BTW: I just added a clear statement to the askSendFormat dialog: "[...] the plaintext version might look different from what you saw in the composer". Suggestions for rephrasing welcome.

> If I wanted it
> numbered, I'd use something like a numbered list -- or I'd just put in numbers
> and make them part of the header.

See <http://www.bucksch.org/1/projects/mozilla/31906> rendered with that patch. Although I do use numbers for headers in one section (creating 2 numbers for each of those headers), it still looks better than rendered by lynx.

> You could even add more of them depending on the
> level, e.g. <foo> is an h4, <<<<foo>>>> is an h1, etc.

This is not obvious. This was exactly the problem which led me to my proposal: Making the hierarchy obvious (without looking at the content). This is a *requirement*.

Akkana Peck

Comment 14

•

25 years ago

The point of doing formatted plaintext is to make the plaintext output look as much as possible like the html that produced it, so that we're as close to wysiwyg as possible.

Ben Bucksch (:BenB)

Assignee

Comment 15

•

25 years ago

IMO wrong. Impossible. How do you show bold in plaintext? Font sizes? You will lose a lot of information if you try to emulate the look of a graphical display. The goal should be to carry over as much *information* as possible and output it in a way as if a human had written it directly in plaintext. And RFCs, a good example of formatted plaintext, use numbers for headers. Maybe we should discuss this in a newsgroup?

Akkana Peck

Comment 16

•

25 years ago

Yes, bring it up in a newsgroup (mailnews, certainly, and probably crosspost to editor). If a majority of people say they want numbers added to their headers, I'll go along, though I still won't like it myself (I'll probably stop using headers and use bold instead, which isn't a big deal).

Ben Bucksch (:BenB)

Assignee

Comment 17

•

25 years ago

Posted to .mail-news and .editor.

Ben Bucksch (:BenB)

Assignee

Comment 18

•

25 years ago

Attached patch: Fix, version 2

Ben Bucksch (:BenB)

Assignee

Comment 19

•

25 years ago

I gave in and implemented a more Lynx-like rendering method. I output 2 lines before and 2 after the header and indent the header text 2x columns for h(x). I.e. I decided not to indent h1 for now, because the current implementation
- is more consistent across header levels
- reduces the risk of confusion between h1 and h6
- is easier to implement

I did *not* insert any new characters like e.g. an underline. So, the current implementation bears the risk of confusion between indentation and headers (as already pointed out). Hopefully, that is not that bad in practice, since we output 2 lines before the header. It will be a problem if either the user or the editor inserts two lines before a normal indentation.

I added a pref ("network.converter.html2txt.numbered_headers", default off for now) to switch to the implementation with numbered headers.

The patch also contains a pref ("network.converter.html2txt.structs", default on for now) for the output of structured phrases (strong, em, code (new), sub, sup, b, i, u), i.e. either the 4.x or the previous Mozilla behaviour.

The patch also changes the <img> implementation: We no longer output both the alt and src (URI) attributes (if existent), but the alt, title *or* src attribute (in decreasing order of preference, depending on what exists).
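As a rough illustration of the indentation rule described above (2x columns for h(x), with blank lines around the header), here is a small Python sketch. It is not the patch itself. The function name `render_heading` and its parameters are hypothetical, "lines before/after" is read as blank lines, and the default of one blank line after the header follows the correction posted later in this thread.

```python
def render_heading(level, text, blank_before=2, blank_after=1):
    """Render one <hN> heading as indented plain text.

    Hypothetical helper: indents the heading text by 2 * level columns
    and surrounds it with blank lines. Assumes the preceding output
    already ended with a newline.
    """
    indent = " " * (2 * level)
    return "\n" * blank_before + indent + text + "\n" + "\n" * blank_after


# An h3 ends up indented by 6 columns.
print(repr(render_heading(3, "foo")))
```

With this rule even h1 is indented by 2 columns, so every header level is set off from body text that starts at column 0.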

Ben Bucksch (:BenB)

Assignee

Comment 20

•

25 years ago

Attached patch: Fix, version 3

Ben Bucksch (:BenB)

Assignee

Comment 21

•

25 years ago

akk, can you review now, please? Example output with numbered headers pref off: <example content="simple_testcase"> h1 foo h2 foo h1 foo h2 foo h3 foo h3 foo h4 foo h5 foo h6 foo </example> <example content="harder_testcase"> foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo </example>

Ben Bucksch (:BenB)

Assignee

Comment 22

•

25 years ago

> I output 2 lines before and 2 after the header

s/2 after/1 after

> I.e. I decided not to indent h1 for now

s/indent/center

Note that I output brackets around the alt/title/src attribute of <img> (as I did before).

Ben Bucksch (:BenB)

Assignee

Comment 23

•

25 years ago

Attached patch: Changes to all.js (default prefs)

Ben Bucksch (:BenB)

Assignee

Comment 24

•

25 years ago

I added some support for definition lists (<dl>, <dt>, <dd>), <th> and <q> (hardcoding western quotation marks for the latter). Also fixed a bug where the converter gets confused if a normal <blockquote> is inside a <blockquote type=cite> (or the other way around, I think), due to a completely broken algorithm (caught this while reading source).

While reading the HTML 4.0 spec, I noticed that the "alt" attribute is *required* to be specified in the document and *required* to be rendered, if the img is not rendered. Interesting.

<quote src="http://www.w3.org/TR/REC-html40/struct/objects.html#edef-IMG"> User agents must render alternate text when they cannot support images, they cannot support a certain image type or when they are configured not to display images. </quote>

I changed the rendering of <img> again so we don't output anything if the value for alt is empty (|alt=""|) - I read that somewhere. I still put "["/"]" out around non-empty alt or title text.

I noticed that unknown tags are considered blocks, i.e. "unknown" inline tags are rendered with linebreaks. I didn't fix that yet.

Sorry for overloading this patch, but this is the result if the stuff lies around for so long.
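The <img> rules collected over the last few comments can be summarised in a short Python sketch. This is an illustration only, not the converter's code. The function name `img_to_text` and the dict-based attribute input are hypothetical. Bracketing the src fallback and treating an empty title as absent are assumptions based on the earlier comments.

```python
def img_to_text(attrs):
    """Return the plain-text stand-in for one <img> tag.

    `attrs` maps attribute names to values. Hypothetical helper that
    follows the rules described in this thread.
    """
    if attrs.get("alt") == "":
        return ""                       # explicitly empty alt: output nothing
    for name in ("alt", "title", "src"):    # decreasing order of preference
        value = attrs.get(name)
        if value:
            return f"[{value}]"         # brackets around whatever is used
    return ""                           # no usable attribute at all


print(img_to_text({"alt": "Mozilla logo", "src": "logo.png"}))
print(img_to_text({"src": "logo.png"}))
```

Checking for the empty alt first matters: otherwise an image marked as purely decorative with alt="" would fall through to its title or src.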

Ben Bucksch (:BenB)

Assignee

Comment 25

•

25 years ago

Attached patch: Fix, version 4

Ben Bucksch (:BenB)

Assignee

Comment 26

•

25 years ago

Akk, I also uncommented the following code:

    // Else make sure we'll separate block level tags,
    // even if we're about to leave before doing any other formatting.
    // Oddly, I can't find a case where this actually makes any difference.
    //else if (IsBlockLevel(type))
    //  EnsureVerticalSpace(0);

I *did* see a difference for <dt> and <dl>.

Ah, and I added some inline tags, so that we don't output linebreak around them, i.e. hotfixed the default-block problem described above. Note that this problem exists right now in the tree, it is *not* introduced by the change above.

Ben Bucksch (:BenB)

Assignee

Comment 27

•

25 years ago

Attached patch: Fix, version 5

Ben Bucksch (:BenB)

Assignee

Comment 28

•

25 years ago

Attached patch: Default prefs, version 2

Ben Bucksch (:BenB)

Assignee

Comment 29

•

25 years ago

Added new mode for akk, Joe et al: No indentation at all. Per akk, removed "network." from prefs. See default pref diff for details.

Ben Bucksch (:BenB)

Assignee

Updated

•

25 years ago

Keywords: review → approval

Ben Bucksch (:BenB)

Assignee

Comment 30

•

25 years ago

Checked in. Summary: Support headers (3 modes), <dd>/<dt>, <q>, <code>, <th>. Improved <img>, </blockquote>. Improved "unknown" block and (some) inline tags. Pref for structured phrases. Opinion: Now tables and some other minor improvements, and we have a decent HTML->TXT converter :).

Status: ASSIGNED → RESOLVED

Closed: 25 years ago

Resolution: --- → FIXED

Target Milestone: M17 → M18

Ben Bucksch (:BenB)

Assignee

Comment 31

•

25 years ago

*** Bug 41952 has been marked as a duplicate of this bug. ***

sujay

Comment 32

•

25 years ago

verified in 9/13 build.

Status: RESOLVED → VERIFIED
