Closed Bug 794958 Opened 12 years ago Closed 8 years ago

Use whitelist element and attribute filtering for reader mode

Categories

(Toolkit :: Reader Mode, defect, P5)

defect
Points:
3

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: mgoodwin, Unassigned)

References

(Blocks 1 open bug)

Details

Issue:
Currently, reader mode uses a blacklist for the element and attribute filtering. In the interest of making the protection afforded by this filtering more robust, it'd be desirable to replace this with a whitelist (list of known good) filter.

See https://wiki.mozilla.org/Security/Reviews/ReaderMode for more informaion.
OS: Mac OS X → Android
Priority: -- → P1
Hardware: x86 → All
Blocks: 785077
Blocks: reader
Assignee: lucasr.at.mozilla → nobody
Blocks: readerv2
No longer blocks: reader, 785077
Blocks: fix-readability
No longer blocks: readerv2
Component: Reader View → Reader Mode
Product: Firefox for Android → Toolkit
Flags: qe-verify?
Flags: firefox-backlog+
Blocks: 1132074
mgoodwin/bnicholson, can you help me understand what would go into fixing this bug? 
Is the main concern here that pages could invent their own tags with garbage, and we wouldn't end up removing those?

Looking at our current Readability.js, I see us stripping out tags that we definitely don't want (e.g. "object", "iframe"), and then conditionally stripping other types of tags (e.g. "p", "img") depending on the result of that first pass, as well as some other heuristics about those elements.

To implement a whitelist approach, I imagine we would keep that first pass to get rid of definitely unwanted junk, and then we would look for known good elements, such as "p" and "img", etc.

I'm worried that this change would be a large fundamental change to how we do our readability parsing, and it seems like this could be risky for Fx38.

I'm wondering how much this is actually an issue in practice. Yes, we don't want pages working against us to "fool" the parser, but couldn't a page just put its "unwanted" content inside an element that we think looks good? I think that no matter what we do, we still need heuristics to analyze what's actually inside these elements.
Flags: needinfo?(mgoodwin)
Flags: needinfo?(bnicholson)
Points: --- → 3
(In reply to :Margaret Leibovic from comment #1)
> mgoodwin/bnicholson, can you help me understand what would go into fixing
> this bug? 
> Is the main concern here that pages could invent their own tags with
> garbage, and we wouldn't end up removing those?

Yes, or new tags, or existing things in ways that we don't currently envisage (it's difficult to enumerate badness).

> To implement a whitelist approach, I imagine we would keep that first pass
> to get rid of definitely unwanted junk, and then we would look for known
> good elements, such as "p" and "img", etc.

Yes, that sounds sensible.

> 
> I'm worried that this change would be a large fundamental change to how we
> do our readability parsing, and it seems like this could be risky for Fx38.
> 
> I'm wondering how much this is actually an issue in practice. Yes, we don't
> want pages working against us to "fool" the parser, but couldn't a page just
> put its "unwanted" content inside an element that we think looks good? I
> think that no matter what we do, we still need heuristics to analyze what's
> actually inside these elements.

This bug was less about reader mode being fooled into displaying different content and more about preventing reader mode XSS; we've had bugs of this kind before (see bug 778582) so the more we can do to reduce the likelihood and impact, the better.

I think freddy has (casually) been looking into reader mode stuff recently; perhaps he has more to add?
Flags: needinfo?(mgoodwin) → needinfo?(fbraun)
It's worth noting here that we already use nsIParserUtils.sanitize() to strip unsafe elements. The filtering that happens in Readability.js maybe helps to remove any other safe elements we don't want people to see in the article, but I'm not sure it's valuable for much else.

I wouldn't think this would be a high priority bug unless we don't trust our built-in sanitization for some reason. And if that were the case, I think we'd have bigger problems!
Flags: needinfo?(bnicholson)
Updating stale priorities to reflect priority in current reading list / reader view work.

Comment 3 makes sense to me. I'd further posit that we shouldn't be relying on Readability to handle security issues, as it's a complex piece of very heuristicy code. The transform it's doing should be orthogonal to security (ie, security should happen at another layer or in a sandbox).
OS: Android → All
Priority: P1 → P5
No longer blocks: 1132074
(In reply to Brian Nicholson (:bnicholson) from comment #3)
> It's worth noting here that we already use nsIParserUtils.sanitize() to
> strip unsafe elements. 

(In reply to Justin Dolske [:Dolske] from comment #4)
> The transform it's doing should be orthogonal to security

Note that if we ever officially plan to create a lib reusable outside FFx, that could become an issue. Eg. my readable-proxy[0] had to implement its own sanitization mechanism, and I bet other potential consumers would expect that we ensure sanitization natively as well. It's a little hard to ask lib users to handle basic security on their own, though that's still an option. At the very least that should he stated in blinking bold =)

That raises the question of what are our plans atm for Readability now, and for the longer term. I'd personally tend to favor federating contributions from the outside world, and to support more use cases that our very owns, but that's obviously just an opinion.

[0]: https://github.com/n1k0/readable-proxy/
(In reply to Nicolas Perriault (:NiKo`) — needinfo me if you need my attention from comment #5)
> That raises the question of what are our plans atm for Readability now, and
> for the longer term. I'd personally tend to favor federating contributions
> from the outside world, and to support more use cases that our very owns,
> but that's obviously just an opinion.

I agree 100% with that long term plan. We may not be able to fully invest in that in the very near term, given the need to get this feature shipped, though.
Comment 3 answers the requested needinfo: We should use nsIParserUtils.sanitize() on top of what Readability gives us.
Flags: needinfo?(fbraun)
(In reply to Frederik Braun [:freddyb] [unavailable June 4th to August 29th] from comment #7)
> Comment 3 answers the requested needinfo: We should use
> nsIParserUtils.sanitize() on top of what Readability gives us.

We already do this now. Sandboxing is bug 1204818.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.