Closed
Bug 775952
Opened 12 years ago
Closed 10 years ago
Unclosed <a> tags are repeated throughout document by html5lib & ckeditor
Categories
(developer.mozilla.org Graveyard :: Editing, defect, P2)
developer.mozilla.org Graveyard
Editing
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: sheppy, Unassigned)
References
Details
(From an email from bz):
I'm looking at https://developer.mozilla.org/en/Mozilla/WebIDL_bindings and comparing it to https://developer-new.mozilla.org/en-US/docs/Mozilla/WebIDL_bindings and in particular looking at the section heading "[Creator]".
In the former, the source (via view source) looks like this:
<h3 class="editable"><a name="Creator"></a>[Creator]</h3>
<p>Used to flag methods or attributes ...</p>
In the new wiki, the source looks like this:
<h3 id="[Creator]"><a>[Creator]</a></h3><a>
<p>Used to flag methods or attributes ...</p>
Worse yet, if I actually go to edit that section, it gets turned into:
<h3 id="[Creator]"><a name="Creator">[Creator]</a></h3>
<p><a name="Creator"> </a></p>
<p><a name="Creator">Used to flag methods or attributes ...</a></p>
and in fact that <a name="Creator"> ends up scattered through each paragraph in that section. That's consistent with parsing the above source string with a totally broken HTML parser. A correct HTML5 parser would allow <a> to contain <p> instead of doing the weirdness above.
And even worse yet, the actual link to that section in the document (in the note near the beginning) is broken, because _that_ says <a href="#Creator"> but that anchor target has gone completely awol. Though for some reason it reappears when I'm editing the section; it's just not there when looking at the actual page?
Comment 1•12 years ago
|
||
Sheppy: Could you provide a tldr for quick reference?
Reporter | ||
Comment 2•12 years ago
|
||
Looks like the name attribute is being handled wring during import; it may be a Bleach issue. This may be a significant issue. I don't know.
Comment 3•12 years ago
|
||
Splitting off the anchor issue as bug 776703.
Updated•12 years ago
|
Updated•12 years ago
|
Whiteboard: s=2012-08-01
Updated•12 years ago
|
Whiteboard: s=2012-08-01
Reporter | ||
Comment 4•12 years ago
|
||
For what it's worth, I have no idea why this page has these anchors in the first place. However they got there, they shouldn't be there. Unfortunately, it doesn't seem to be possible to remove them right now due to the way the HTML gets borked when you try to edit the page.
Comment 5•12 years ago
|
||
Which anchors?
The initial source had the <a name="Creator"></a> so that I could link to them.
In the imported source, the <a> after the </h3> looks like an attempt by a buggy parser to reopen the <a> that got closed _before_ the </h3>. Why that <a> didn't get closed when it should have (before the '[' of "[Creator]") is a bit of a mystery.
Reporter | ||
Comment 6•12 years ago
|
||
(In reply to Boris Zbarsky (:bz) [In and out Aug 1 - 10, out Aug 11-20] from comment #5)
> The initial source had the <a name="Creator"></a> so that I could link to
> them.
The wiki (the old one, as well as the new one) automatically creates anchors at rendering time (that is, it inserts appropriate HTML to create them when sending the content over the network to readers) for all headings with the same name as the text of the heading. It wasn't necessary to create them manually. Doing so is probably why the import got confused; it didn't know to look for this being done, so it completely mishandled it (that's not an excuse, just an explanation -- it's definitely a problem that it mishandled this so badly).
Comment 7•12 years ago
|
||
Yes, I know the wiki does that. The problem with using those anchors is that if someone changes the heading text all links to that section break. I've had that happen to me a few times before I wised up. So for my purposes (creating stable in-document links and being able to link to the wiki from other places) the autogenerated anchors are unfortunately useless. :(
Reporter | ||
Comment 8•12 years ago
|
||
(In reply to Boris Zbarsky (:bz) [In and out Aug 1 - 10, out Aug 11-20] from comment #7)
> Yes, I know the wiki does that. The problem with using those anchors is
> that if someone changes the heading text all links to that section break.
> I've had that happen to me a few times before I wised up. So for my
> purposes (creating stable in-document links and being able to link to the
> wiki from other places) the autogenerated anchors are unfortunately useless.
Hm. I can see that; however, editing of headings is really uncommon once an article has stabilized. Still, I will work toward getting this resolved properly.
Comment 9•12 years ago
|
||
(In reply to Boris Zbarsky (:bz) [In and out Aug 1 - 10, out Aug 11-20] from comment #7)
> Yes, I know the wiki does that. The problem with using those anchors is
> that if someone changes the heading text all links to that section break.
> I've had that happen to me a few times before I wised up. So for my
> purposes (creating stable in-document links and being able to link to the
> wiki from other places) the autogenerated anchors are unfortunately useless.
> :(
FWIW: The new Kuma wiki will use a manually-supplied name="" attribute on a header to override auto-generation of id="" attributes from header text.
Example:
"Header text differs from ID"
https://developer-new.mozilla.org/en-US/docs/User:lmorchard#header-test-1
So, this:
<h3 name="foo">Bar Baz Quux</h3>
Should result in this when rendered:
<h3 name="foo" id="foo">Bar Baz Quux</h3>
Comment 10•12 years ago
|
||
Ah, very nice!
Reporter | ||
Comment 11•12 years ago
|
||
:lorchard for the win!
Reporter | ||
Comment 12•12 years ago
|
||
:lorchard - is there a way to fix this page automatically, or do we need to rebuild it by hand?
Comment 13•12 years ago
|
||
I've yet to figure out what the underlying issue with the anchors inside headers is, so fastest would be to hand-correct that markup with what I mentioned in comment 9.
Comment 14•12 years ago
|
||
If it helps, probably groovecoder or I could run the migration tool on prod to re-migrate this one page
Reporter | ||
Comment 15•12 years ago
|
||
OK, I will just redo the page from scratch with the content from the old site. This will work better.
Comment 16•12 years ago
|
||
So... flakey internet aside, I think I managed to fix it with a bit of source search & replace form the original DB:
https://developer-new.mozilla.org/en-US/docs/Mozilla/WebIDL_bindings
Still not sure what in the parser might have done it, but self-closed <a name="" /> seems to trip it up. We're using Bleach, which in turn uses html5lib in Python - but there's also whatever we do for header processing.
If I get some time tonight, I might dig into that processing chain and see what might be at fault. In the meantime, I think the immediately practical fix is to just avoid <a name="" /> inside headers
Reporter | ||
Comment 17•12 years ago
|
||
Thanks, Les! Well done!
I'm converting this bug into a "don't get confused by <a name=''> in headers" bug.
Summary: Kuma: Page contents messed up by import to Kuma? → Kuma: Using <a name="..."> in headers trips up Bleach in a bad way
Comment 18•12 years ago
|
||
> but self-closed <a name="" /> seems to trip it up
Er.. that's an _unclosed_ <a> in HTML, so yes, it could lead to the observed behavior when parsed with an HTML5 parser like html5lib.
But the actual markup I used on the original doesn't include unclosed <a> like that; I carefully listed ed tags for my anchors. Where is the "self-closed" bit coming from?
Comment 19•12 years ago
|
||
(In reply to Boris Zbarsky (:bz) [In and out Aug 1 - 10, out Aug 11-20] from comment #18)
> > but self-closed <a name="" /> seems to trip it up
>
> Er.. that's an _unclosed_ <a> in HTML, so yes, it could lead to the observed
> behavior when parsed with an HTML5 parser like html5lib.
>
> But the actual markup I used on the original doesn't include unclosed <a>
> like that; I carefully listed ed tags for my anchors. Where is the
> "self-closed" bit coming from?
That's from content I pulled directly out of the MindTouch MySQL DB (ie. current prod).
That makes me think this bug is actually an interaction between a MindTouch bug and a Kuma bug, mediated by migration. Nightmare fuel, right there.
I hope Kuma can somehow be tweaked able to handle this gracefully. But, whatever you typed into MindTouch (ie. <a name=""></a>) got turned into <a name="" /> when it saved
Comment 20•12 years ago
|
||
> But, whatever you typed into MindTouch (ie. <a name=""></a>) got turned into <a name="" />
And it kept showing me the former in the source view. :( Yeah, fail. :(
Comment 21•12 years ago
|
||
And thank you very much for sorting through this!
Updated•12 years ago
|
Priority: -- → P1
Updated•12 years ago
|
Whiteboard: p=2
Updated•12 years ago
|
Assignee: nobody → lorchard
Comment 22•12 years ago
|
||
Okay, so after spending some time spelunking through the guts of html5lib, I'm tempted to WONTFIX this.
The exact problem condition is where someone sticks an unclosed <a> or XHTML-style self-closing <a/> into the document (which is also an unclosed tag).
The HTML5 parser & serializer seems to helpfully carry that unclosed <a> forward throughout the document, "fixing" things by wrapping chunks in the <a> wherever valid.
I'm not sure how to actually prevent this, without bringing in a different parser or... resorting to regexes. On the other hand, this should be a very rare condition, and requires using the source view of the editor to induce.
Going to poke a bit more, but the temptation to wave the white flag is strong
Reporter | ||
Comment 23•12 years ago
|
||
(In reply to Les Orchard [:lorchard] from comment #22)
> I'm not sure how to actually prevent this, without bringing in a different
> parser or... resorting to regexes. On the other hand, this should be a very
> rare condition, and requires using the source view of the editor to induce.
The problem is that we have several users, particularly certain engineers, who exclusively work in source view, or do so on a regular basis. We simply can't rely on people to use the WYSIWYG and assume that nobody will do things the HTML parser doesn't like. :(
However, fixing bug 779881 would help alleviate this, because it would remove one of the common use cases for doing so.
Comment 24•12 years ago
|
||
(In reply to Eric Shepherd [:sheppy] from comment #23)
> (In reply to Les Orchard [:lorchard] from comment #22)
>
> > I'm not sure how to actually prevent this, without bringing in a different
> > parser or... resorting to regexes. On the other hand, this should be a very
> > rare condition, and requires using the source view of the editor to induce.
>
> The problem is that we have several users, particularly certain engineers,
> who exclusively work in source view, or do so on a regular basis. We simply
> can't rely on people to use the WYSIWYG and assume that nobody will do
> things the HTML parser doesn't like. :(
Well, it's not so much that the HTML parser doesn't like this case. It's that this is actually the HTML5 spec-defined behavior for handling unclosed formatting tags (of which <a> is one):
http://dev.w3.org/html5/spec/parsing.html#the-list-of-active-formatting-elements
I'm still poking, and maybe I can come up with a non-standard HTML5 serializer that breaks from the spec yet does something less weird for our use case.
> However, fixing bug 779881 would help alleviate this, because it would
> remove one of the common use cases for doing so.
Summary: Kuma: Using <a name="..."> in headers trips up Bleach in a bad way → Kuma: Unclosed <a> tags are repeated throughout document by html5lib
Comment 25•12 years ago
|
||
To be clear, in this case I did _not_ manually add the unclosed <a>. The original source on the old wiki had <a></a>.
So somewhere during the import process that got turned into either <a> or <a /> or something, after which just parsing with an HTML5 parser will in fact give the observed behavior. But the real issue was the fact that an originally-closed <a></a> got turned into just an open tag somewhere in the pipeline.
Comment 26•12 years ago
|
||
(In reply to Boris Zbarsky (:bz) from comment #25)
> To be clear, in this case I did _not_ manually add the unclosed <a>. The
> original source on the old wiki had <a></a>.
>
> So somewhere during the import process that got turned into either <a> or <a
> /> or something, after which just parsing with an HTML5 parser will in fact
> give the observed behavior. But the real issue was the fact that an
> originally-closed <a></a> got turned into just an open tag somewhere in the
> pipeline.
Yeah, I think that was MindTouch, which appeared to collapse <a></a> into <a/> in the stored data. In Kuma, I'm poking around at trying to catch that and do something more desirable
Comment 27•12 years ago
|
||
I might have a fix, but it seems too easy to be true:
https://github.com/mozilla/kuma/pull/585
Comment 28•12 years ago
|
||
Looks like CKEditor is causing a mess with these unclosed elements, also. So, the html5lib serializer hack is only part of the solution...
https://github.com/mozilla/kuma/pull/585#issuecomment-8479180
http://screencast.com/t/n0xKHlSCl
Updated•12 years ago
|
Summary: Kuma: Unclosed <a> tags are repeated throughout document by html5lib → Kuma: Unclosed <a> tags are repeated throughout document by html5lib & ckeditor
Comment 29•12 years ago
|
||
Yeah, looks like CKEditor turns this:
<h3><a name="x-1">Foo</h3>
<p>Lorem ipsum</p>
<p>Dolor amet</p>
<h3><a name="x-2"/>Bar</h3>
<p>Bacon bacon bacon</p>
<h3><a name="x-3">Baz</h3>
<p>Tofu tofu tofu</p>
<h3><a name="x-4"></a>Quux</h3>
<p>Peas peas peas</p>
Into this:
<h3><a name="x-1">Foo</a></h3>
<p><a name="x-1"> </a></p>
<p><a name="x-1">Lorem ipsum</a></p>
<p><a name="x-1">Dolor amet</a></p>
<h3><a name="x-1"></a><a name="x-2">Bar</a></h3>
<p><a name="x-2"> </a></p>
<p><a name="x-2">Bacon bacon bacon</a></p>
<h3><a name="x-2"></a><a name="x-3">Baz</a></h3>
<p><a name="x-3"> </a></p>
<p><a name="x-3">Tofu tofu tofu</a></p>
<h3><a name="x-3"></a><a name="x-4"></a>Quux</h3>
<p>Peas peas peas</p>
Can confirm this on the CKEditor demo page, independent of Kuma. http://ckeditor.com/demo
So, it's an upstream bug. Continuing to dig through the guts of their HTML parser and serializer, not *entirely* understanding what I'm looking at. Maybe I'll work something out
Reporter | ||
Comment 30•12 years ago
|
||
Do we want to try to ping the CKEditor folks and see if we can get an assist on this problem? At least some advice on how to tackle the problem?
Comment 31•12 years ago
|
||
(In reply to Eric Shepherd [:sheppy] from comment #30)
> Do we want to try to ping the CKEditor folks and see if we can get an assist
> on this problem? At least some advice on how to tackle the problem?
Looks like there's an existing ticket, albeit 18 months old: http://dev.ckeditor.com/ticket/7543
Not sure who to contact for an update on this, thinking a fix and upstream patch is the best bet for our purposes. I'll dig some more.
Comment 32•12 years ago
|
||
The upstream bug on CKEditor has been marked "closed" and "invalid", basically with the advice of "close your tags".
I'll see if I can get a CKEditor dev env going, so as to make one more swing at fixing this myself.
Assignee | ||
Updated•12 years ago
|
Version: Kuma → unspecified
Assignee | ||
Updated•12 years ago
|
Component: Website → Landing pages
Updated•12 years ago
|
Assignee: lorchard → nobody
QA Contact: website
Whiteboard: p=2 → p=3 s=2012-10-02
Comment 33•12 years ago
|
||
So, I've spent about 2 weeks solidly looking into this and have found that pretty much every part of the stack does something other than what we'd like for the purposes of this bug - from Firefox to CKEditor to html5lib.
Not any closer to resolving this, really. So, I think I need to take a break from this one and try tackling a few other bugs. In the meantime, I guess it's just a known bug that leaving inline tags like <a> open, through whatever chain of events intentional or otherwise, leads to undesirable results.
Updated•12 years ago
|
Priority: P1 → P3
Comment 34•12 years ago
|
||
Removing the s= tag to work around Scrumbugs issue 99:
https://github.com/mozilla/scrumbugz/issues/99
Whiteboard: p=3 s=2012-10-02 → p=3
Updated•12 years ago
|
Component: Landing pages → Editing
Whiteboard: p=3 → p=3 c=Editing
Updated•12 years ago
|
Priority: P3 → P2
Updated•11 years ago
|
Summary: Kuma: Unclosed <a> tags are repeated throughout document by html5lib & ckeditor → Unclosed <a> tags are repeated throughout document by html5lib & ckeditor
Whiteboard: p=3 c=Editing
Comment 35•10 years ago
|
||
I'm not able to reproduce this with the new CKEditor. Boris, please reopen if this is still an issue.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
Updated•5 years ago
|
Product: developer.mozilla.org → developer.mozilla.org Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•