Closed
Bug 44439
Opened 24 years ago
Closed 24 years ago
Support headers
Categories
(Core :: DOM: Serializers, defect, P3)
Tracking
()
VERIFIED
FIXED
M18
People
(Reporter: BenB, Assigned: BenB)
References
Details
(Whiteboard: Fixed.)
Attachments
(10 files)
174 bytes, text/html
3.67 KB, patch
4.55 KB, patch
247 bytes, text/html
10.14 KB, patch
10.31 KB, patch
947 bytes, patch
13.98 KB, patch
14.65 KB, patch
976 bytes, patch
Reproduce:
1. Load the testcase in the editor
2. Debug|OutputText
Actual result:
<<h1
foo h2
foo h1
foo h2
foo h3
foo h3
foo h4
foo h5
foo h6
foo
>>
Expected result:
Something more intelligent, e.g. Lynx' output:
bash$ lynx -dump header.html
h1
foo
h2
foo
h1
foo
h2
foo
h3
foo
h3
foo
h4
foo
h5
foo
h6
foo
Or something with numbers, like
1. h1
foo
1.1 h2
foo
2. h1
foo
2.1 h2
foo
2.1.1 h3
foo
2.1.2 h3
foo
2.1.2.1. h4
foo
2.1.2.1.1. h5
foo
2.1.2.1.1.1. h6
foo
etc. Since we output normal text at column 0, I tend towards the numbering
scheme. Suggestions welcome, but fast, please :).
Assignee | ||
Updated•24 years ago
Status: NEW → ASSIGNED
Target Milestone: --- → M17
Assignee | ||
Comment 1•24 years ago
Assignee | ||
Comment 2•24 years ago
No proposed rendering in the HTML 4.0 spec, but <quote src="http://www.w3.org/TR/REC-html40/struct/global.html#h-7.5.5"> HTML does not itself cause section numbers to be generated from headings. This facility may be offered by user agents, however. </quote>
Assignee | ||
Comment 3•24 years ago
Fixed. Used the numbering version. akk, can you review, please?
Assignee | ||
Comment 4•24 years ago
Assignee | ||
Comment 5•24 years ago
Assignee | ||
Updated•24 years ago
Summary: Support headers better → Support headers
Comment 6•24 years ago
I'm not sure that numbering headers is the best approach, since it inserts content with a meaning that the author might not have intended. Imagine the following:
----------
A sends an HTML mail with headers, converted to plain text, to B. The mail gets numbered headers. B answers: "I agree with you. Especially 1.1.3 was very interesting". This will confuse A very much, since he didn't send anything with "1.1.3".
------------
Or consider just mailing any web page where the author has used <h#> tags for formatting rather than for logically organizing the content. Then our numbering could be all wrong with respect to the content of the mail.
I think your other proposal, using indentation, is better. In the case of wrapping text, it should even be simple to center a header. (Left as an exercise for the interested reader.)
Assignee | ||
Comment 7•24 years ago
Assignee | ||
Comment 8•24 years ago
Daniel, the problem with the lynx style is that it does not scale. Lynx output of a harder testcase:
foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo
The structure is completely unreadable. And this example is not even far-fetched; I write such documents. OK, not that intense, and you can use the content to guess the structure, but that defeats the purpose. Structure should be clearly and unambiguously *visible*.
As we write normal text at column 0, we have to indent headers more. This would look like:
foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo
A bit better for normal text plus headers, but even less readable for headers plus indented content (<blockquote>, <dl>, <ul> etc.).
As for your first example, there are two cases: Send as Plaintext and multipart/alternative. In the former case, the sender might have a copy looking the same in his Sent folder, so he can see the numbering there. Unfortunate, but possible. In the second case, the recipient used a non-HTML-compliant mailer. He should be used to multipart msgs and know that he might see different output than the sender, so he will probably refer to the header name, not number.
As for your second example, I was aware of that. If people abuse HTML, there is not much we can do. The problem with abusing HTML is *exactly* that it won't work well with configurations different from the author's. If authors become aware of that: fine with me :). Note that the Composer says "Heading 1" etc. in the UI, not "bigger" or so.
I will add a comment to the askSendFormat dialog that the plaintext version might look different from what the author saw in the composer.
The HTML 4.0 spec authors were well aware of abuse of HTML. Nevertheless, they explicitly allow numbering of headers.
Assignee | ||
Comment 9•24 years ago
Ah, sample output of my implementation:
1. foo foo
1.1. foo foo
2. foo foo
2.1. foo foo
2.1.1. foo foo
2.1.2. foo foo
2.1.2.1. foo foo foo foo foo
2.1.2.1.1. foo foo
2.1.2.1.1.1. foo foo
(We don't support <dl> yet.)
Note that my implementation is done. I don't really want to dump it and make correct HTML unreadable, just to make illegal HTML more readable.
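For readers following the thread, the numbering scheme shown in this sample output can be sketched in a few lines of Python. This is only an illustration, not the code from the attached patches: the function name `number_headers` and the `(level, text)` input format are assumptions made for the example.

```python
def number_headers(headers):
    """Prefix each header with a hierarchical number such as "2.1.2.".

    `headers` is a list of (level, text) pairs, with level 1..6 standing
    for <h1>..<h6>. Hypothetical helper, not taken from the Mozilla patch.
    """
    counters = [0] * 6  # one running counter per header level
    numbered = []
    for level, text in headers:
        if not 1 <= level <= 6:
            raise ValueError("header level must be between 1 and 6")
        counters[level - 1] += 1
        # A new header at this level restarts the numbering of all deeper levels.
        for deeper in range(level, 6):
            counters[deeper] = 0
        label = ".".join(str(n) for n in counters[:level]) + "."
        numbered.append(f"{label} {text}")
    return numbered


if __name__ == "__main__":
    # Header levels of the simple testcase from the bug description.
    testcase = [(1, "h1"), (2, "h2"), (1, "h1"), (2, "h2"), (3, "h3"),
                (3, "h3"), (4, "h4"), (5, "h5"), (6, "h6")]
    for line in number_headers(testcase):
        print(line)
```

One limitation of this sketch: a document that skips a level (say, an h3 directly under an h1) gets a 0 in the label. How the real converter treats that case is not stated in this bug.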
Comment 10•24 years ago
I saw your implementation and it looked good, but IMHO this doesn't improve Mozilla. That W3C allows a user agent to number headers is not enough reason to do it. Numbered headers can be useful, but it's a feature that should be controlled by the author. In this case we insert numbers with no feedback to the sender until it's done and sent, and not even then if the user doesn't look in the sent mail folder. And honestly, how many people check their mails there after they are sent?
Your examples were quite unrealistic, you know. No text or header consists solely of the word foo. Normally the text and the headers contribute to the understanding of the logical structure. Indentation could be a help to that understanding. A help that is discreet, but it's there and it won't disturb the content.
(Another example would be what would happen if someone already numbered headers manually before they are sent.)
Assignee | ||
Comment 11•24 years ago
> Your examples were quite unrealistic you know.
I filed and fixed this bug exactly because 4.x doesn't convert headers well and the plain text versions of my texts were not easily parsable/understandable.* I often have paragraphs < 1 line and I happily use headers, lists etc. in the wildest combinations. In mail/news. So, it was a drastic example, but it has a valid, realistic and severe point.
> Normally the text and the headers contribute to the
> understanding of the logical structure.
I intentionally include and rely on structure in my composition. You will have a hard time understanding my texts if the structural information is not preserved.
*This gives me 3 options:
- Directly composing in plain text (including manual wrapping for lists)
- Writing the HTML version with the plain text version in mind and crippling the HTML version by doing so. (This requires detailed knowledge about the HTML->TXT converter - something only power users have.)
- Not caring about the plain text version.
Surely, none of these are acceptable.
Comment 12•24 years ago
I would not want the output system numbering headers which did not appear numbered in the html. That doesn't make sense at all. If I wanted it numbered, I'd use something like a numbered list -- or I'd just put in numbers and make them part of the header.
I agree that the old code isn't working -- the text and the header shouldn't appear on the same line, we should have a newline, at least -- but adding numbers doesn't seem right.
As an alternate suggestion, we could add an identifier like "*" (since html headers usually appear bold by default), or perhaps something new, like "<<" and ">>", around headers when converting to formatted plaintext, so <h1>foo</h1> would look like <<foo>>. You could even add more of them depending on the level, e.g. <foo> is an h4, <<<<foo>>>> is an h1, etc.
Assignee | ||
Comment 13•24 years ago
> I would not want the output system numbering headers which did not appear
> numbered in the html. That doesn't make sense at all.
It does make sense. Both the HTML spec authors and I came up with the same proposal. It might not make sense for *your* documents, which is still a legitimate argument. Can you explain why you think this makes no sense? The two reasons Daniel gave were valid, but IMO no reason to drop numbers.
BTW: I just added a clear statement to the askSendFormat dialog: "[...] the plaintext version might look different from what you saw in the composer". Suggestions for rephrasing welcome.
> If I wanted it
> numbered, I'd use something like a numbered list -- or I'd just put in numbers
> and make them part of the header.
See <http://www.bucksch.org/1/projects/mozilla/31906> rendered with that patch. Although I do use numbers for headers in one section (creating 2 numbers for each of those headers), it still looks better than rendered by lynx.
> You could even add more of them depending on the
> level, e.g. <foo> is an h4, <<<<foo>>>> is an h1, etc.
This is not obvious. This was exactly the problem which led me to my proposal: Making the hierarchy obvious (without looking at the content). This is a *requirement*.
Comment 14•24 years ago
The point of doing formatted plaintext is to make the plaintext output look as much as possible as the html that produced it, so that we're as close to wysiwyg as possible.
Assignee | ||
Comment 15•24 years ago
IMO wrong. Impossible. How do you show bold in plaintext? Font sizes? You will lose a lot of information if you try to emulate the look of a graphical display. The goal should be to carry over as much *information* as possible and output it in a way as if a human had written it directly in plaintext. And RFCs, a good example of formatted plaintext, use numbers for headers.
Maybe we should discuss this in a newsgroup?
Comment 16•24 years ago
Yes, bring it up in a newsgroup (mailnews, certainly, and probably crosspost to editor). If a majority of people say they want numbers added to their headers, I'll go along, though I still won't like it myself (I'll probably stop using headers and use bold instead, which isn't a big deal).
Assignee | ||
Comment 17•24 years ago
Posted to .mail-news and .editor.
Assignee | ||
Comment 18•24 years ago
Assignee | ||
Comment 19•24 years ago
I gave in and implemented a more Lynx-like rendering method. I output 2 lines before and 2 after the header and indent the header text 2x columns for h(x). I.e. I decided not to indent h1 for now, because the current implementation
- is more consistent across header levels
- reduces the risk of confusion between h1 and h6
- is easier to implement
I did *not* insert any new characters, e.g. an underline. So, the current implementation bears the risk of confusion between indentation and headers (as already pointed out). Hopefully, that is not that bad in practice, since we output 2 lines before the header - it will be a problem if either the user or the editor inserts two lines before a normal indentation.
I added a pref ("network.converter.html2txt.numbered_headers", default off for now) to switch to the implementation with numbered headers.
The patch also contains a pref ("network.converter.html2txt.structs", default on for now) for the output of structured phrases (strong, em, code (new), sub, sup, b, i, u), i.e. either the 4.x or the previous Mozilla behaviour.
The patch also changes the <img> implementation: We no longer output both the alt and src (URI) attributes (if existent), but the alt, title *or* src attribute (in decreasing order of preference, depending on what exists).
Assignee | ||
Comment 20•24 years ago
Assignee | ||
Comment 21•24 years ago
akk, can you review now, please? Example output with numbered headers pref off:
<example content="simple_testcase">
h1 foo h2 foo h1 foo h2 foo h3 foo h3 foo h4 foo h5 foo h6 foo
</example>
<example content="harder_testcase">
foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo
</example>
Assignee | ||
Comment 22•24 years ago
> I output 2 lines before and 2 after the header
s/2 after/1 after
> I.e. I decided not to indent h1 for now
s/indent/center
Note that I output brackets around the alt/title/src attribute of <img> (as I did before).
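As a rough illustration of the indentation mode described in comment 19, with the "1 after" correction from this comment applied, a header could be rendered as in the Python sketch below. It is not the patch's code. The name `render_indented_header` is hypothetical, and reading "2 lines before" and "1 after" as blank lines is an assumption.

```python
def render_indented_header(level, text):
    """Return the output lines for one header in the indentation mode.

    Assumptions for this sketch: "2 lines before" and "1 after" mean blank
    lines, and h(x) is indented by 2*x columns (so h1 starts at column 2).
    """
    if not 1 <= level <= 6:
        raise ValueError("header level must be between 1 and 6")
    indent = " " * (2 * level)
    return ["", "", indent + text, ""]


if __name__ == "__main__":
    # Render an h1 followed by an h2, with body text at column 0.
    output = []
    for level, title, body in [(1, "h1", "foo"), (2, "h2", "foo")]:
        output.extend(render_indented_header(level, title))
        output.append(body)
    print("\n".join(output))
```

Because no marker characters are added, a header is distinguished from other indented content only by the blank lines around it, which is the risk of confusion comment 19 points out.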
Assignee | ||
Comment 23•24 years ago
Assignee | ||
Comment 24•24 years ago
I added some support for definition lists (<dl>, <dt>, <dd>), <th> and <q> (hardcoding western quotation marks for the latter).
Also fixed a bug where the converter gets confused if a normal <blockquote> is inside a <blockquote type=cite> (or the other way around, I think), due to a completely broken algorithm (caught this while reading source).
While reading the HTML 4.0 spec, I noticed that the "alt" attribute is *required* to be specified in the document and *required* to be rendered if the img is not rendered. Interesting.
<quote src="http://www.w3.org/TR/REC-html40/struct/objects.html#edef-IMG"> User agents must render alternate text when they cannot support images, they cannot support a certain image type or when they are configured not to display images. </quote>
I changed the rendering of <img> again so we don't output anything if the value for alt is empty (|alt=""|) - I read that somewhere. I still put "["/"]" around non-empty alt or title text.
I noticed that unknown tags are considered blocks, i.e. "unknown" inline tags are rendered with linebreaks. I didn't fix that yet.
Sorry for overloading this patch, but this is the result if the stuff lies around for so long.
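The <img> rules collected across comments 19, 22 and 24 can be summarized in a small Python sketch. It is an interpretation of the comments, not the converter's actual code. The function name `img_to_text` and the attribute dictionary are hypothetical, and putting brackets around a fallback src value follows comment 22 but is not confirmed by this comment.

```python
def img_to_text(attrs):
    """Plain-text stand-in for an <img>, given its attributes as a dict.

    Order of preference: alt, then title, then src. An explicitly empty
    alt (alt="") suppresses all output, as described in comment 24.
    """
    alt = attrs.get("alt")
    if alt is not None:
        # alt is present: render it in brackets, or nothing if it is empty.
        return f"[{alt}]" if alt else ""
    title = attrs.get("title")
    if title:
        return f"[{title}]"
    src = attrs.get("src")
    if src:
        # Assumption: src is bracketed like alt and title (per comment 22).
        return f"[{src}]"
    return ""


if __name__ == "__main__":
    print(img_to_text({"alt": "Mozilla logo", "src": "logo.png"}))
    print(img_to_text({"title": "Chart", "src": "chart.png"}))
    print(repr(img_to_text({"alt": "", "src": "spacer.gif"})))
```

Treating a missing alt differently from an empty one matters here: an empty alt is commonly used for purely decorative images, so emitting the src URI for them would only add noise to the plain text.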
Assignee | ||
Comment 25•24 years ago
Assignee | ||
Comment 26•24 years ago
Akk, I also uncommented the following code:
// Else make sure we'll separate block level tags,
// even if we're about to leave before doing any other formatting.
// Oddly, I can't find a case where this actually makes any difference.
//else if (IsBlockLevel(type))
//  EnsureVerticalSpace(0);
I *did* see a difference for <dt> and <dl>.
Ah, and I added some inline tags, so that we don't output linebreaks around them, i.e. hotfixed the default-block problem described above. Note that this problem exists right now in the tree; it is *not* introduced by the change above.
Assignee | ||
Comment 27•24 years ago
Assignee | ||
Comment 28•24 years ago
Assignee | ||
Comment 29•24 years ago
Added new mode for akk, Joe et al: No indentation at all. Per akk, removed "network." from prefs. See default pref diff for details.
Assignee | ||
Updated•24 years ago
Assignee | ||
Comment 30•24 years ago
Checked in. Summary: Support headers (3 modes), <dd>/<dt>, <q>, <code>, <th>. Improved <img>, </blockquote>. Improved "unknown" block and (some) inline tags. Pref for structured phrases. Opinion: Now tables and some other minor improvements, and we have a decent HTML->TXT converter :).
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Target Milestone: M17 → M18
Assignee | ||
Comment 31•24 years ago
*** Bug 41952 has been marked as a duplicate of this bug. ***