Closed Bug 840973 Opened 12 years ago Closed 12 years ago

update bleach to v1.2.1

Categories

(support.mozilla.org :: General, defect, P3)

defect

Tracking

(Not tracked)

RESOLVED FIXED
2013Q1

People

(Reporter: willkg, Assigned: rrosario)

References

Details

(Whiteboard: u=dev c=infrastructure p=2 s=2013.8)

We're currently using bleach 1.1.5 and should upgrade to 1.2.
Summary: update bleach → update bleach to v1.2
Putting this in the 2013.4 sprint, but it's possible that the changes between 1.1.5 and 1.2 aren't so compelling that we couldn't push it off a sprint if need be.
Whiteboard: u=dev c=infrastructure p= s=2013.4
Priority: -- → P4
Target Milestone: --- → 2013Q1
I think this is probably between a 1 pointer and a 2 pointer. It's a small version jump, but I vaguely remember there being some API changes involved. James: Do you know offhand if this is a trivial project or not?
I missed the CHANGES file: Version 1.2 ----------- - Bleach will no longer consider unacceptable protocols when linkifying. - linkify() now takes a tokenizer argument that allows it to skip sanitization. - delinkify() is gone. - Removed exception handling from _render. clean() and linkify() may now throw. - linkify() correctly ignores case for protocols and domain names. - linkify() correctly handles markup within an <a> tag. We don't use delinkify, so I think it shouldn't be a hard upgrade. Estimating at 1 point.
Whiteboard: u=dev c=infrastructure p= s=2013.4 → u=dev c=infrastructure p=1 s=2013.4
Assigning to backlog for an appropriate sprint.
Whiteboard: u=dev c=infrastructure p=1 s=2013.4 → u=dev c=infrastructure p=1 s=2013.backlog
Looked into it briefly. Updating bleach requires us to fix py-wikimarkup because it's passing a nofollow argument into linkify which bleach 1.2 no longer has. py-wikimarkup needs to be fixed. Something like this: diff --git a/setup.py b/setup.py index c312255..dd0cb66 100644 --- a/setup.py +++ b/setup.py @@ -13,6 +13,6 @@ setup( url='http://www.github.com/dcramer/py-wikimarkup/', zip_safe=False, include_package_data=True, - install_requires=['bleach>=1.1.4'], + install_requires=['bleach>=1.2'], package_data = { '': ['README.rst'] }, ) diff --git a/wikimarkup/parser.py b/wikimarkup/parser.py index 65d1dd0..a42e7e8 100644 --- a/wikimarkup/parser.py +++ b/wikimarkup/parser.py @@ -1713,7 +1713,11 @@ class Parser(BaseParser): if utf8: text.encode("utf-8") # Pass output through bleach and linkify - text = bleach.linkify(text, nofollow=nofollow, tokenizer=HTMLTokenizer) + if nofollow: + callbacks = [bleach.linkify_callbacks.nofollow] + else: + callbacks = [] + text = bleach.linkify(text, callbacks=callbacks, tokenizer=HTMLTokenizer) return bleach.clean(text, tags=self.tags, attributes=attributes, styles=styles, strip_comments=False) After that, all the tests in kitsune pass except one: FAIL: kitsune.apps.sumo.tests.test_parser:TestWikiImageTags.test_page_link_caption vim +432 apps/sumo/tests/test_parser.py # test_page_link_caption eq_('my caption', caption) vim +31 vendor/packages/nose/nose/tools.py # eq_ assert a == b, msg or "%r != %r" % (a, b) AssertionError: 'my caption' != '<img class=" "\n src="/media/uploads/gallery/images/2013-02-19-11-53-00-098f6b.jpg"\n alt="my caption"\n /> my caption' The html that gets returned is pretty crazy. Seems like something is horribly wrong there.
Would love to do this in next sprint.
Whiteboard: u=dev c=infrastructure p=1 s=2013.backlog → u=dev c=infrastructure p=1 s=2013.6
This requires patching py-wikimarkup. I'm pretty sure James said we're the only users of that, so we can probably just patch it without really thinking about anyone else. I did some work on this already, but it probably needs to be a 2 pointer.
(In reply to Will Kahn-Greene [:willkg] from comment #8) > I did some work on this already, but it probably needs to be a 2 pointer. Changing to 2pts. Moving back to the backlog for now.
Whiteboard: u=dev c=infrastructure p=1 s=2013.6 → u=dev c=infrastructure p=2 s=2013.backlog
(In reply to Will Kahn-Greene [:willkg] from comment #8) > This requires patching py-wikimarkup. I'm pretty sure James said we're the > only users of that, so we can probably just patch it without really thinking > about anyone else. AFAIK we're still using Paul C's fork, and we're the only users of that. And I have commit. So if there are patches just @jsocol ping me on GitHub.
Depends on: 862120
Blocks: 862120
No longer depends on: 862120
Putting into current sprint and upping the priority.
Priority: P4 → P3
Whiteboard: u=dev c=infrastructure p=2 s=2013.backlog → u=dev c=infrastructure p=2 s=2013.8
Assignee: nobody → rrosario
v1.2.1 is the latest latest
Summary: update bleach to v1.2 → update bleach to v1.2.1
S(In reply to Will Kahn-Greene [:willkg] from comment #5) > FAIL: > kitsune.apps.sumo.tests.test_parser:TestWikiImageTags.test_page_link_caption > vim +432 apps/sumo/tests/test_parser.py # test_page_link_caption > eq_('my caption', caption) > vim +31 vendor/packages/nose/nose/tools.py # eq_ > assert a == b, msg or "%r != %r" % (a, b) > AssertionError: 'my caption' != '<img class=" "\n > src="/media/uploads/gallery/images/2013-02-19-11-53-00-098f6b.jpg"\n > alt="my caption"\n /> my caption' I've been able to reduce this down to the following difference between old and new bleach. Bleach 1.1.x: In [5]: bleach.linkify('<a href="/"><noscript><img src="img.jpg" alt=""/></noscript></a>', tokenizer=HTMLTokenizer) Out[5]: u'<a href="/" rel="nofollow"><noscript><img src="img.jpg" alt=""/></noscript></a>' Bleach 1.2.1: In [37]: bleach.linkify('<a href="/"><noscript><img src="img.jpg" alt=""/></noscript></a>', tokenizer=HTMLTokenizer) Out[37]: u'<a href="/" rel="nofollow"><noscript>&lt;img src="img.jpg" alt=""/&gt;</noscript></a>' For some reason, the inner <img/> is getting escaped. If I don't include the outter <a ...>...</a>, everything is fine. BTW `<a href="/"><noscript><img src="img.jpg" alt=""/></noscript></a>` is valid HTML according to the W3C validator.
So, I took a look at bleach and html5lib.. From what I can tell, it is an issue with the parser treating the contents of <noscript/> as raw text instead of HTML. Then when it goes to render it back, it escapes everything screwing up the HTML inside. We don't have this problem in older bleach, because bleach used to just write back the text inside an <a/> without parsing: https://github.com/jsocol/bleach/commit/fb404b2cc0bc22daa3642f214cdda852855d2553
OK, a quick reading of the spec... http://www.w3.org/TR/html5/scripting-1.html#the-noscript-element confuses the hell out of me. I think it's to do with html5lib. I *guess* that html5lib is parsing it as if scripting was enabled? But that's more of a thing to test than anything else, and I don't know how to test it.
(In reply to James Socol [:jsocol, :james] from comment #16) > OK, a quick reading of the spec... > http://www.w3.org/TR/html5/scripting-1.html#the-noscript-element confuses > the hell out of me. I think it's to do with html5lib. I *guess* that > html5lib is parsing it as if scripting was enabled? But that's more of a > thing to test than anything else, and I don't know how to test it. Yep, I ended up on that page last night too and was scratching my head. I think you are right, it is parsing as if scripting was enabled. In that case, it is following the spec.
This is all related to how we render out images in the wiki so that they are lazy loaded and still work without js. The markup has something like: ``` <a href=...> <img class="lazy-img" data-src=... /> <noscript> <img src=... /> </noscript> </a> ``` we could just change that to: ``` <a href=...> <img class="lazy-img" data-src=... /> </a> <noscript> <a href=...> <img src=... /> </a> </noscript> ``` And we'd probably be fine. If we don't care about bleach/html5lib being perfect for this super edge case, I can just make this change instead.
In a pull request: https://github.com/mozilla/kitsune/pull/1305 I went with the workaround from Comment 18
Rehan deployed this to prod this morning.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.