Closed Bug 775952 Opened 8 years ago Closed 5 years ago

Unclosed <a> tags are repeated throughout document by html5lib & ckeditor

Categories

(developer.mozilla.org :: Editing, defect, P2)

defect

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: sheppy, Unassigned)

References

(Blocks 1 open bug)

Details

(From an email from bz):

I'm looking at https://developer.mozilla.org/en/Mozilla/WebIDL_bindings and comparing it to https://developer-new.mozilla.org/en-US/docs/Mozilla/WebIDL_bindings and in particular looking at the section heading "[Creator]".

In the former, the source (via view source) looks like this:

  <h3 class="editable"><a name="Creator"></a>[Creator]</h3>
  <p>Used to flag methods or attributes ...</p>

In the new wiki, the source looks like this:

  <h3 id="[Creator]"><a>[Creator]</a></h3><a>
  <p>Used to flag methods or attributes ...</p>

Worse yet, if I actually go to edit that section, it gets turned into:

  <h3 id="[Creator]"><a name="Creator">[Creator]</a></h3>
  <p><a name="Creator"> </a></p>
  <p><a name="Creator">Used to flag methods or attributes ...</a></p>

and in fact that <a name="Creator"> ends up scattered through each paragraph in that section.  That's consistent with parsing the above source string with a totally broken HTML parser.  A correct HTML5 parser would allow <a> to contain <p> instead of doing the weirdness above.

And even worse yet, the actual link to that section in the document (in the note near the beginning) is broken, because _that_ says <a href="#Creator"> but that anchor target has gone completely awol. Though for some reason it reappears when I'm editing the section; it's just not there when looking at the actual page?
Sheppy: Could you provide a tldr for quick reference?
Looks like the name attribute is being handled wring during import; it may be a Bleach issue. This may be a significant issue. I don't know.
Splitting off the anchor issue as bug 776703.
Blocks: 756263
Blocks: 773295
No longer blocks: 756263
Whiteboard: s=2012-08-01
Whiteboard: s=2012-08-01
For what it's worth, I have no idea why this page has these anchors in the first place. However they got there, they shouldn't be there. Unfortunately, it doesn't seem to be possible to remove them right now due to the way the HTML gets borked when you try to edit the page.
Which anchors?

The initial source had the <a name="Creator"></a> so that I could link to them.

In the imported source, the <a> after the </h3> looks like an attempt by a buggy parser to reopen the <a> that got closed _before_ the </h3>.  Why that <a> didn't get closed when it should have (before the '[' of "[Creator]") is a bit of a mystery.
(In reply to Boris Zbarsky (:bz) [In and out Aug 1 - 10, out Aug 11-20] from comment #5)

> The initial source had the <a name="Creator"></a> so that I could link to
> them.

The wiki (the old one, as well as the new one) automatically creates anchors at rendering time (that is, it inserts appropriate HTML to create them when sending the content over the network to readers) for all headings with the same name as the text of the heading. It wasn't necessary to create them manually. Doing so is probably why the import got confused; it didn't know to look for this being done, so it completely mishandled it (that's not an excuse, just an explanation -- it's definitely a problem that it mishandled this so badly).
Yes, I know the wiki does that.  The problem with using those anchors is that if someone changes the heading text all links to that section break.  I've had that happen to me a few times before I wised up.  So for my purposes (creating stable in-document links and being able to link to the wiki from other places) the autogenerated anchors are unfortunately useless.  :(
(In reply to Boris Zbarsky (:bz) [In and out Aug 1 - 10, out Aug 11-20] from comment #7)
> Yes, I know the wiki does that.  The problem with using those anchors is
> that if someone changes the heading text all links to that section break. 
> I've had that happen to me a few times before I wised up.  So for my
> purposes (creating stable in-document links and being able to link to the
> wiki from other places) the autogenerated anchors are unfortunately useless.

Hm. I can see that; however, editing of headings is really uncommon once an article has stabilized. Still, I will work toward getting this resolved properly.
(In reply to Boris Zbarsky (:bz) [In and out Aug 1 - 10, out Aug 11-20] from comment #7)
> Yes, I know the wiki does that.  The problem with using those anchors is
> that if someone changes the heading text all links to that section break. 
> I've had that happen to me a few times before I wised up.  So for my
> purposes (creating stable in-document links and being able to link to the
> wiki from other places) the autogenerated anchors are unfortunately useless.
> :(

FWIW: The new Kuma wiki will use a manually-supplied name="" attribute on a header to override auto-generation of id="" attributes from header text.

Example:
    "Header text differs from ID"
    https://developer-new.mozilla.org/en-US/docs/User:lmorchard#header-test-1

So, this:
    <h3 name="foo">Bar Baz Quux</h3>

Should result in this when rendered:
    <h3 name="foo" id="foo">Bar Baz Quux</h3>
Ah, very nice!
:lorchard for the win!
:lorchard - is there a way to fix this page automatically, or do we need to rebuild it by hand?
I've yet to figure out what the underlying issue with the anchors inside headers is, so fastest would be to hand-correct that markup with what I mentioned in comment 9.
If it helps, probably groovecoder or I could run the migration tool on prod to re-migrate this one page
OK, I will just redo the page from scratch with the content from the old site. This will work better.
So... flakey internet aside, I think I managed to fix it with a bit of source search & replace form the original DB:

    https://developer-new.mozilla.org/en-US/docs/Mozilla/WebIDL_bindings

Still not sure what in the parser might have done it, but self-closed <a name="" /> seems to trip it up. We're using Bleach, which in turn uses html5lib in Python - but there's also whatever we do for header processing. 

If I get some time tonight, I might dig into that processing chain and see what might be at fault. In the meantime, I think the immediately practical fix is to just avoid <a name="" /> inside headers
Thanks, Les! Well done!

I'm converting this bug into a "don't get confused by <a name=''> in headers" bug.
Summary: Kuma: Page contents messed up by import to Kuma? → Kuma: Using <a name="..."> in headers trips up Bleach in a bad way
> but self-closed <a name="" /> seems to trip it up

Er.. that's an _unclosed_ <a> in HTML, so yes, it could lead to the observed behavior when parsed with an HTML5 parser like html5lib.

But the actual markup I used on the original doesn't include unclosed <a> like that; I carefully listed ed tags for my anchors.  Where is the "self-closed" bit coming from?
(In reply to Boris Zbarsky (:bz) [In and out Aug 1 - 10, out Aug 11-20] from comment #18)
> > but self-closed <a name="" /> seems to trip it up
> 
> Er.. that's an _unclosed_ <a> in HTML, so yes, it could lead to the observed
> behavior when parsed with an HTML5 parser like html5lib.
> 
> But the actual markup I used on the original doesn't include unclosed <a>
> like that; I carefully listed ed tags for my anchors.  Where is the
> "self-closed" bit coming from?

That's from content I pulled directly out of the MindTouch MySQL DB (ie. current prod).

That makes me think this bug is actually an interaction between a MindTouch bug and a Kuma bug, mediated by migration. Nightmare fuel, right there.

I hope Kuma can somehow be tweaked able to handle this gracefully. But, whatever you typed into MindTouch (ie. <a name=""></a>) got turned into <a name="" /> when it saved
No longer blocks: 773295
> But, whatever you typed into MindTouch (ie. <a name=""></a>) got turned into <a name="" />

And it kept showing me the former in the source view.  :(  Yeah, fail.  :(
And thank you very much for sorting through this!
Priority: -- → P1
Whiteboard: p=2
Assignee: nobody → lorchard
Okay, so after spending some time spelunking through the guts of html5lib, I'm tempted to WONTFIX this.

The exact problem condition is where someone sticks an unclosed <a> or XHTML-style self-closing <a/> into the document (which is also an unclosed tag). 

The HTML5 parser & serializer seems to helpfully carry that unclosed <a> forward throughout the document, "fixing" things by wrapping chunks in the <a> wherever valid.

I'm not sure how to actually prevent this, without bringing in a different parser or... resorting to regexes. On the other hand, this should be a very rare condition, and requires using the source view of the editor to induce.

Going to poke a bit more, but the temptation to wave the white flag is strong
(In reply to Les Orchard [:lorchard] from comment #22)

> I'm not sure how to actually prevent this, without bringing in a different
> parser or... resorting to regexes. On the other hand, this should be a very
> rare condition, and requires using the source view of the editor to induce.

The problem is that we have several users, particularly certain engineers, who exclusively work in source view, or do so on a regular basis. We simply can't rely on people to use the WYSIWYG and assume that nobody will do things the HTML parser doesn't like. :(

However, fixing bug 779881 would help alleviate this, because it would remove one of the common use cases for doing so.
(In reply to Eric Shepherd [:sheppy] from comment #23)
> (In reply to Les Orchard [:lorchard] from comment #22)
> 
> > I'm not sure how to actually prevent this, without bringing in a different
> > parser or... resorting to regexes. On the other hand, this should be a very
> > rare condition, and requires using the source view of the editor to induce.
> 
> The problem is that we have several users, particularly certain engineers,
> who exclusively work in source view, or do so on a regular basis. We simply
> can't rely on people to use the WYSIWYG and assume that nobody will do
> things the HTML parser doesn't like. :(

Well, it's not so much that the HTML parser doesn't like this case. It's that this is actually the HTML5 spec-defined behavior for handling unclosed formatting tags (of which <a> is one):

http://dev.w3.org/html5/spec/parsing.html#the-list-of-active-formatting-elements

I'm still poking, and maybe I can come up with a non-standard HTML5 serializer that breaks from the spec yet does something less weird for our use case.

> However, fixing bug 779881 would help alleviate this, because it would
> remove one of the common use cases for doing so.
Summary: Kuma: Using <a name="..."> in headers trips up Bleach in a bad way → Kuma: Unclosed <a> tags are repeated throughout document by html5lib
To be clear, in this case I did _not_ manually add the unclosed <a>.  The original source on the old wiki had <a></a>.

So somewhere during the import process that got turned into either <a> or <a /> or something, after which just parsing with an HTML5 parser will in fact give the observed behavior.  But the real issue was the fact that an originally-closed <a></a> got turned into just an open tag somewhere in the pipeline.
(In reply to Boris Zbarsky (:bz) from comment #25)
> To be clear, in this case I did _not_ manually add the unclosed <a>.  The
> original source on the old wiki had <a></a>.
> 
> So somewhere during the import process that got turned into either <a> or <a
> /> or something, after which just parsing with an HTML5 parser will in fact
> give the observed behavior.  But the real issue was the fact that an
> originally-closed <a></a> got turned into just an open tag somewhere in the
> pipeline.

Yeah, I think that was MindTouch, which appeared to collapse <a></a> into <a/> in the stored data. In Kuma, I'm poking around at trying to catch that and do something more desirable
I might have a fix, but it seems too easy to be true:

https://github.com/mozilla/kuma/pull/585
Looks like CKEditor is causing a mess with these unclosed elements, also. So, the html5lib serializer hack is only part of the solution...

https://github.com/mozilla/kuma/pull/585#issuecomment-8479180
http://screencast.com/t/n0xKHlSCl
Summary: Kuma: Unclosed <a> tags are repeated throughout document by html5lib → Kuma: Unclosed <a> tags are repeated throughout document by html5lib & ckeditor
Yeah, looks like CKEditor turns this:

        <h3><a name="x-1">Foo</h3>
        <p>Lorem ipsum</p>
        <p>Dolor amet</p>
        <h3><a name="x-2"/>Bar</h3>
        <p>Bacon bacon bacon</p>
        <h3><a name="x-3">Baz</h3>
        <p>Tofu tofu tofu</p>
        <h3><a name="x-4"></a>Quux</h3>
        <p>Peas peas peas</p>

Into this:

<h3><a name="x-1">Foo</a></h3>
<p><a name="x-1"> </a></p>
<p><a name="x-1">Lorem ipsum</a></p>
<p><a name="x-1">Dolor amet</a></p>
<h3><a name="x-1"></a><a name="x-2">Bar</a></h3>
<p><a name="x-2"> </a></p>
<p><a name="x-2">Bacon bacon bacon</a></p>
<h3><a name="x-2"></a><a name="x-3">Baz</a></h3>
<p><a name="x-3"> </a></p>
<p><a name="x-3">Tofu tofu tofu</a></p>
<h3><a name="x-3"></a><a name="x-4"></a>Quux</h3>
<p>Peas peas peas</p>

Can confirm this on the CKEditor demo page, independent of Kuma. http://ckeditor.com/demo

So, it's an upstream bug. Continuing to dig through the guts of their HTML parser and serializer, not *entirely* understanding what I'm looking at. Maybe I'll work something out
Do we want to try to ping the CKEditor folks and see if we can get an assist on this problem? At least some advice on how to tackle the problem?
(In reply to Eric Shepherd [:sheppy] from comment #30)
> Do we want to try to ping the CKEditor folks and see if we can get an assist
> on this problem? At least some advice on how to tackle the problem?

Looks like there's an existing ticket, albeit 18 months old: http://dev.ckeditor.com/ticket/7543

Not sure who to contact for an update on this, thinking a fix and upstream patch is the best bet for our purposes. I'll dig some more.
The upstream bug on CKEditor has been marked "closed" and "invalid", basically with the advice of "close your tags".

I'll see if I can get a CKEditor dev env going, so as to make one more swing at fixing this myself.
Version: Kuma → unspecified
Component: Website → Landing pages
Assignee: lorchard → nobody
QA Contact: website
Whiteboard: p=2 → p=3 s=2012-10-02
So, I've spent about 2 weeks solidly looking into this and have found that pretty much every part of the stack does something other than what we'd like for the purposes of this bug - from Firefox to CKEditor to html5lib. 

Not any closer to resolving this, really. So, I think I need to take a break from this one and try tackling a few other bugs. In the meantime, I guess it's just a known bug that leaving inline tags like <a> open, through whatever chain of events intentional or otherwise, leads to undesirable results.
Priority: P1 → P3
Removing the s= tag to work around Scrumbugs issue 99:

https://github.com/mozilla/scrumbugz/issues/99
Whiteboard: p=3 s=2012-10-02 → p=3
Component: Landing pages → Editing
Whiteboard: p=3 → p=3 c=Editing
Priority: P3 → P2
Summary: Kuma: Unclosed <a> tags are repeated throughout document by html5lib & ckeditor → Unclosed <a> tags are repeated throughout document by html5lib & ckeditor
Whiteboard: p=3 c=Editing
Blocks: 1032827
I'm not able to reproduce this with the new CKEditor. Boris, please reopen if this is still an issue.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.