Bug 44439: Support headers
Status: VERIFIED FIXED (opened 25 years ago, closed 25 years ago)
Categories: Core :: DOM: Serializers, defect, P3
Target Milestone: M18
People: Reporter: BenB, Assigned: BenB
Whiteboard: Fixed.
Attachments (10 files):
- 174 bytes, text/html
- 3.67 KB, patch
- 4.55 KB, patch
- 247 bytes, text/html
- 10.14 KB, patch
- 10.31 KB, patch
- 947 bytes, patch
- 13.98 KB, patch
- 14.65 KB, patch
- 976 bytes, patch
Reproduce:
1. Load the testcase in the editor
2. Debug|OutputText
Actual result:
<<h1
foo h2
foo h1
foo h2
foo h3
foo h3
foo h4
foo h5
foo h6
foo
>>
Expected result:
Something more intelligent, e.g. Lynx's output:
bash$ lynx -dump header.html
h1
foo
h2
foo
h1
foo
h2
foo
h3
foo
h3
foo
h4
foo
h5
foo
h6
foo
Or something with numbers, like
1. h1
foo
1.1 h2
foo
2. h1
foo
2.1 h2
foo
2.1.1 h3
foo
2.1.2 h3
foo
2.1.2.1. h4
foo
2.1.2.1.1. h5
foo
2.1.2.1.1.1. h6
foo
etc. Since we output normal text at column 0, I tend towards the numbering
scheme. Suggestions welcome, but fast, please :).
Updated (Assignee) • 25 years ago
Status: NEW → ASSIGNED
Target Milestone: --- → M17
Comment 1 (Assignee) • 25 years ago
Comment 2 (Assignee) • 25 years ago
No proposed rendering in the HTML 4.0 spec, but
<quote src="http://www.w3.org/TR/REC-html40/struct/global.html#h-7.5.5">
HTML does not itself cause section numbers to be generated from headings. This
facility may be offered by user agents, however.
</quote>
Comment 3 (Assignee) • 25 years ago
Fixed. Used the numbering version.
akk, can you review, please?
Comment 4 (Assignee) • 25 years ago
Comment 5 (Assignee) • 25 years ago
Updated (Assignee) • 25 years ago
Summary: Support headers better → Support headers
Comment 6 • 25 years ago
I'm not sure that numbering headers is the best approach, since it inserts content with
a meaning that the author might not have intended.
Imagine the following:
----------
A sends an HTML mail with headers, converted to plain text, to B. The mail gets
numbered headers.
B answers: "I agree with you. Especially 1.1.3 was very interesting".
This will confuse A very much, since he didn't send anything with "1.1.3".
------------
Or consider mailing any web page where the author has used <h#> tags for formatting
rather than for logically organizing the content. Then our numbering could be
all wrong with respect to the content of the mail.
I think your other proposal, using indentation, is better. In the case of
wrapping text, it should even be simple to center a header. (Left as an exercise
for the interested reader).
Comment 7 (Assignee) • 25 years ago
Comment 8 (Assignee) • 25 years ago
Daniel,
the problem with the Lynx style is that it does not scale. Lynx output of a
harder testcase:
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
The structure is completely unreadable. And this example is not even
far-fetched; I write such documents. OK, not that intense, and you can use the
content to guess the structure, but that defeats the purpose. Structure should
be clearly and unambiguously *visible*.
As we write normal text at column 0, we have to indent headers more. This would
look like:
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
A bit better for normal text plus headers, but even less readable for headers
plus indented content (<blockquote>, <dl> , <ul> etc.).
As for your first example, there are two cases: Send as Plaintext and
multipart/alternative.
In the former case, the sender might have a copy looking the same in his Sent
folder, so he can see the numbering there. Unfortunate, but possible.
In the second case, the recipient used a non-HTML-compliant mailer. He should be
used to multipart msgs and know that he might see different output than the
sender, so he will probably refer to the header name, not number.
As for your second example, I was aware of that. If people abuse HTML, there is
not much we can do. The problem with abusing HTML is *exactly* that it won't
work well with configurations different from the author's. If authors become aware of
that: fine with me :). Note that the Composer says "Heading 1" etc. in the UI,
not "bigger" or so. I will add a comment to the askSendFormat dialog that the
plaintext version might look different from what the author saw in the composer.
The HTML 4.0 spec authors were well aware of abuse of HTML. Nevertheless, they
explicitly allow numbering of headers.
Comment 9 (Assignee) • 25 years ago
Ah, sample output of my implementation:
1. foo
foo
1.1. foo
foo
2. foo
foo
2.1. foo
foo
2.1.1. foo
foo
2.1.2. foo
foo
2.1.2.1. foo
foo
foo
foo
foo
2.1.2.1.1. foo
foo
2.1.2.1.1.1. foo
foo
(We don't support <dl> yet.)
Note that my implementation is done. I don't really want to dump it and make
correct HTML unreadable just to make invalid HTML more readable.
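For readers following the thread, the numbering scheme shown in the sample output above can be sketched roughly as follows. This is a minimal Python illustration, not the code from the patch: the function name `number_headings` and the `(level, text)` input format are assumptions made for the example, with level 0 standing for body text.

```python
def number_headings(items):
    """Render (level, text) pairs; level 0 is body text, 1-6 is <h1>..<h6>.

    Hypothetical helper for illustration only, not taken from the patch.
    """
    counters = [0] * 6  # one running counter per heading level h1..h6
    lines = []
    for level, text in items:
        if level == 0:
            # Body text is emitted unchanged at column 0.
            lines.append(text)
            continue
        counters[level - 1] += 1
        # A new heading restarts the numbering of every deeper level.
        for deeper in range(level, 6):
            counters[deeper] = 0
        # Join the counters down to this level, each followed by a dot.
        prefix = "".join(f"{n}." for n in counters[:level])
        lines.append(f"{prefix} {text}")
    return lines
```

One consequence of this approach: a skipped level (an h3 directly under an h1) would produce a zero component such as `1.0.1.`. How the actual patch handles that case is not stated in this bug.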
Comment 10 • 25 years ago
I saw your implementation and it looked good, but IMHO this doesn't improve
Mozilla. That the W3C allows a user agent to number headers is not enough reason to
do it.
Numbered headers can be useful, but it's a feature that should be controlled by
the author. In this case we insert numbers with no feedback to the sender until
it's done and sent, and not even then if the user doesn't look in the sent mail
folder. And honestly, how many people check their mails there after they are sent?
Your examples were quite unrealistic, you know. No real text or header consists
solely of the word foo. Normally the text and the headers contribute to the
understanding of the logical structure. Indentation could be a help to that
understanding: a help that is discreet, but it's there and it won't disturb the
content.
(Another example would be what happens if someone has already numbered the headers
manually before they are sent.)
Comment 11 (Assignee) • 25 years ago
> Your examples were quite unrealistic you know.
I filed and fixed this bug exactly because 4.x doesn't convert headers well and
the plain text version of my texts was not easily parsable/understandable.* I
often have paragraphs < 1 line and I happily use headers, lists etc. in the
wildest combinations. In mail/news. So, it was a drastic example, but it makes a
valid, realistic and serious point.
> Normally the text and the headers contribute to the
> understanding of the logical structure.
I intentionally include and rely on structure in my composition. You will have a hard
time understanding my texts if the structural information is not preserved.
*This gives me 3 options:
- Directly composing in plain text (including manual wrapping for lists)
- Writing the HTML version with the plain text version in mind and crippling the
HTML version by doing so. (This requires detailed knowledge about the HTML->TXT
converter - something only power users have.)
- Not caring about the plain text version.
Surely, none of these are acceptable.
Comment 12 • 25 years ago
I would not want the output system numbering headers which did not appear
numbered in the html. That doesn't make sense at all. If I wanted it numbered,
I'd use something like a numbered list -- or I'd just put in numbers and make
them part of the header.
I agree that the old code isn't working -- the text and the header shouldn't
appear on the same line, we should have a newline, at least -- but adding
numbers doesn't seem right.
As an alternate suggestion, we could add an identifier like "*" (since html
headers usually appear bold by default), or perhaps something new, like "<<" and
">>", around headers when converting to formatted plaintext, so <h1>foo</h1>
would look like <<foo>>. You could even add more of them depending on the
level, e.g. <foo> is an h4, <<<<foo>>>> is an h1, etc.
Comment 13 (Assignee) • 25 years ago
> I would not want the output system numbering headers which did not appear
> numbered in the html. That doesn't make sense at all.
It does make sense. Both the HTML spec authors and I came up with the same
proposal.
It might not make sense for *your* documents, which is still a legitimate argument.
Can you explain why you think this makes no sense? The two reasons Daniel gave
were valid, but IMO no reason to drop numbers.
BTW: I just added a clear statement to the askSendFormat dialog: "[...] the
plaintext version might look different from what you saw in the composer".
Suggestions for rephrasing welcome.
> If I wanted it
> numbered, I'd use something like a numbered list -- or I'd just put in numbers
> and make them part of the header.
See <http://www.bucksch.org/1/projects/mozilla/31906> rendered with that patch.
Although I do use numbers for headers in one section (creating 2 numbers for
each of those headers), it still looks better than rendered by lynx.
> You could even add more of them depending on the
> level, e.g. <foo> is an h4, <<<<foo>>>> is an h1, etc.
This is not obvious. This was exactly the problem that led me to my proposal:
making the hierarchy obvious (without looking at the content). This is a
*requirement*.
Comment 14 • 25 years ago
The point of doing formatted plaintext is to make the plaintext output look as
much as possible like the HTML that produced it, so that we're as close to WYSIWYG
as possible.
Comment 15 (Assignee) • 25 years ago
IMO wrong. Impossible. How do you show bold in plaintext? Font sizes? You will
lose a lot of information if you try to emulate the look of a graphical
display.
The goal should be to carry over as much *information* as possible and output it
in a way as if a human had written it directly in plaintext.
And RFCs, a good example of formatted plaintext, use numbers for headers.
Maybe we should discuss this in a newsgroup?
Comment 16 • 25 years ago
Yes, bring it up in a newsgroup (mailnews, certainly, and probably crosspost to
editor). If a majority of people say they want numbers added to their headers,
I'll go along, though I still won't like it myself (I'll probably stop using
headers and use bold instead, which isn't a big deal).
Comment 17 (Assignee) • 25 years ago
Posted to .mail-news and .editor.
Comment 18 (Assignee) • 25 years ago
Comment 19 (Assignee) • 25 years ago
I gave in and implemented a more Lynx-like rendering method. I output 2 lines
before and 2 after the header and indent the header text 2x columns for h(x).
I.e. I decided not to indent h1 for now, because the current implementation
- is more consistent across header levels
- reduces the risk of confusion between h1 and h6
- is easier to implement
I did *not* insert any new characters like e.g. an underline. So, the current
implementation bears the risk of confusion between indentation and headers (as
already pointed out). Hopefully, that is not that bad in practice, since we
output 2 lines before the header. It will be a problem if either the user or
the editor inserts two lines before a normal indentation.
I added a pref ("network.converter.html2txt.numbered_headers", default off for
now) to switch to the implementation with numbered headers.
The patch also contains a pref ("network.converter.html2txt.structs", default on
for now) for the output of structured phrases (strong, em, code (new), sub, sup,
b, i, u), i.e. either the 4.x or the previous Mozilla behaviour.
The patch also changes the <img> implementation: we no longer output both the
alt and src (URI) attributes (if existent), but the alt, title *or* src
attribute (in decreasing order of preference, depending on what exists).
Comment 20 (Assignee) • 25 years ago
Comment 21 (Assignee) • 25 years ago
akk, can you review now, please?
Example output with numbered headers pref off:
<example content="simple_testcase">
h1
foo
h2
foo
h1
foo
h2
foo
h3
foo
h3
foo
h4
foo
h5
foo
h6
foo
</example>
<example content="harder_testcase">
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
foo
</example>
Comment 22 (Assignee) • 25 years ago
> I output 2 lines before and 2 after the header
s/2 after/1 after
> I.e. I decided not to indent h1 for now
s/indent/center
Note that I output brackets around the alt/title/src attribute of <img> (as I
did before).
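The indentation rule from comment 19, with the corrections in this comment applied, can be sketched as follows. This is an illustrative Python sketch rather than the patch itself: `render_heading` is a hypothetical name, and reading "2 lines before and 1 after" as blank lines is an assumption.

```python
def render_heading(level, text):
    """Return the output lines for one <hN> heading, where N = level (1-6).

    Hypothetical helper for illustration only, not taken from the patch.
    """
    if not 1 <= level <= 6:
        raise ValueError("heading level must be between 1 and 6")
    # h(x) is indented 2x columns: h1 -> 2, h2 -> 4, ..., h6 -> 12.
    indent = " " * (2 * level)
    # Assumed reading: two blank lines before the heading, one after it.
    return ["", "", indent + text, ""]
```

Because body text starts at column 0, every heading level is offset from it under this rule, which matches the stated goal of reducing confusion between h1 and h6.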
Comment 23 (Assignee) • 25 years ago
Comment 24 (Assignee) • 25 years ago
I added some support for definition lists (<dl>, <dt>, <dd>), <th> and <q>
(hardcoding western quotation marks for the latter).
Also fixed a bug where the converter gets confused if a normal
<blockquote> is inside a <blockquote type=cite> (or the other way
around, I think), due to a completely broken algorithm (caught this
while reading source).
While reading the HTML 4.0 spec, I noticed that the "alt" attribute is
*required* to be specified in the document and *required* to be
rendered, if the img is not rendered. Interesting.
<quote
src="http://www.w3.org/TR/REC-html40/struct/objects.html#edef-IMG">
User agents must render alternate text when they cannot support images,
they cannot support a certain image type or when they are configured not
to display images.
</quote>
I changed the rendering of <img> again so we don't output anything if the value
for alt is empty (|alt=""|) - I read that somewhere. I still put "["/"]" out
around non-empty alt or title text.
I noticed that unknown tags are considered blocks, i.e. "unknown" inline tags are
rendered with linebreaks. I didn't fix that yet.
Sorry for overloading this patch, but this is the result, if the stuff
lies around for so long.
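The <img> handling described in this comment and in comments 19 and 22 can be sketched as follows. This is a hedged Python illustration, not the patch: `img_to_text` and the dict-of-attributes input are hypothetical, and two details are assumptions, namely that src is bracketed the same way as alt and title (as comment 22 suggests) and that an empty title falls through to src.

```python
def img_to_text(attrs):
    """Return the plain-text stand-in for an <img>, given its attributes as a dict.

    Hypothetical helper for illustration only, not taken from the patch.
    """
    if attrs.get("alt") == "":
        # An explicitly empty alt suppresses all output for this image.
        return ""
    # Otherwise use alt, title or src, in decreasing order of preference.
    for name in ("alt", "title", "src"):
        value = attrs.get(name)
        if value:
            return f"[{value}]"
    # No usable attribute at all: emit nothing.
    return ""
```

For example, `img_to_text({"title": "Logo", "src": "logo.png"})` yields `[Logo]`, while an image carrying `alt=""` produces no text even when a title or src is present.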
Comment 25 (Assignee) • 25 years ago
Comment 26 (Assignee) • 25 years ago
Akk, I also uncommented the following code:
// Else make sure we'll separate block level tags,
// even if we're about to leave before doing any other formatting.
// Oddly, I can't find a case where this actually makes any difference.
//else if (IsBlockLevel(type))
// EnsureVerticalSpace(0);
I *did* see a difference for <dt> and <dl>.
Ah, and I added some inline tags, so that we don't output linebreaks around them,
i.e. hotfixed the default-block problem described above. Note that this problem
exists right now in the tree; it is *not* introduced by the change above.
Comment 27 (Assignee) • 25 years ago
Comment 28 (Assignee) • 25 years ago
Comment 29 (Assignee) • 25 years ago
Added new mode for akk, Joe et al: no indentation at all. Per akk, removed
"network." from prefs. See default pref diff for details.
Comment 30 (Assignee) • 25 years ago
Checked in.
Summary:
Support headers (3 modes), <dd>/<dt>, <q>, <code>, <th>.
Improved <img>, </blockquote>.
Improved "unknown" block and (some) inline tags.
Pref for structured phrases.
Opinion:
Now tables and some other minor improvements, and we have a decent HTML->TXT
converter :).
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
Target Milestone: M17 → M18
Comment 31 (Assignee) • 25 years ago
*** Bug 41952 has been marked as a duplicate of this bug. ***