Closed Bug 236002 Opened 21 years ago Closed 7 years ago

Memory expands further than the system will allow with tags such as <s^g^ggfdf%> parsed as <s>

Categories

(Core :: DOM: HTML Parser, defect)

Product:

Component:

Type:

defect

Priority:

Not set

Severity:

critical

Tracking

()

Status:

RESOLVED INCOMPLETE

People

(Reporter: superdug, Assigned: choess)

References

(
URL
)

Details

Attachments

(1 obsolete file)

Patch to give us IE tokenization behavior here 21 years ago Boris Zbarsky [:bzbarsky] 11.38 KB, patch	choess : review+ jst : superreview+	Details \| Diff \| Splinter Review

Reporter

Description

•

21 years ago

User-Agent: Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.6) Gecko/20040206 Firefox/0.8 Screen captures in XP can be found at http://www.dugnet.com/dug/oldstuff/000090.html the 17 megabyte file takes more than 1.12 gigs of system memory and doesnt completely load. Reproducible: Always Steps to Reproduce: 1.go to page http://www.chigc.com/et/ss-current/moz-kill.htm 2.watch memory usage of process 3.mozilla crashes Actual Results: Windows produced an error explaining that the system memory was critically low and exited the firefox process. Expected Results: Internet Explorer displayed the page correctly in under 100 megs of system resources.

Boris Zbarsky [:bzbarsky]

Comment 1

•

21 years ago

Typical line from that file: <p> <a href=pb000106.htm target=_blank>000106</a> "^YF^<s^ht^YF^<n^hg^Yr^<z^_[^jAd" (W) GUID=fb9205635216121e04601afc1bd20ac6(VALID:247) [2004.02.12 00:28:41] This keeps coming up in various places. In view of our residual style handling being as memory-intensive as it is, I propose making almost all characters allowed in tagnames in the HTML parser (basically disallow whitespace, '/', '>'). That would solve issues like crap sites doing <style="something"> and ending up with the whole page as part of the stylesheet, too. Thoughts?

Kevin McCormick

Updated

•

21 years ago

Blocks: 26003

Ben Bucksch (:BenB)

Comment 2

•

21 years ago

Confirmed on linux. Browser continued to run, but system gets very sluggish when memory is exhausted (and another mozilla process (my Mailnews) crashed, probably because of out mem). Browser does *not* free memory after page close, which is probably a bug in itself. Mozilla Seamonkey 1.4.x on SuSE 9.

Status: UNCONFIRMED → NEW

Ever confirmed: true

Boris Zbarsky [:bzbarsky]

Comment 3

•

21 years ago

> Browser does *not* free memory after page close Actually, that's not a Mozilla bug... see bug 130157 and discussion therein. That said, could we stay somewhat focused here? There's really no need to comment on this bug by anyone not directly involved in parser development or standards-compliance QA..... (in other words, I'd like to hear from Ian and choess on the feasibility of what I suggested).

Kevin McCormick

Updated

•

21 years ago

No longer blocks: 26003

Hixie (not reading bugmail)

Comment 4

•

21 years ago

The following: "^YF^<s^ht^YF^<n^hg^Yr^<z^_[^jAd" (W) ...should parse to: text: |"^YF^| element: <s^ht^YF^> + element: <n^hg^Yr^> + element: <z^_[^jAd">; attribute: |(W)|, value: || ...or some such. That is, <foo<bar> ...is valid HTML (ignoring the fact that the tag names are from some future version of HTML that we don't know yet) and is exactly equivalent to <foo><bar>. I don't exactly understand what you are proposing though. Could you explain it in more detail? What is the problem you are solving?

Boris Zbarsky [:bzbarsky]

Comment 5

•

21 years ago

Ian, '^' is not a valid character in an element name, as far as I understand. At the moment we parse that as follows: "^YF^<s^ht^YF^<n^hg^Yr^<z^_[^jAd" (W) parses to text: |"^YF^| element: <s>; attribute: |^ht^YF^|, value: || + element: <n>; attribute: |^hg^Yr^|, value:|| + element: <z>; attribute: |^_[^jAd"|, value: ||; attribute: |(W)|, value: || There may be more attributes; I'm not sure how '^' affects attribute parsing and it does not matter for purposes of this testcase. Now notice that we ended up with an <s> tag. And now we have unclose <s> tags, and residual style kicks in, so we keep reopening them inside every <p>, leading to a DOM tree that's several orders of magnitude bigger than it "should" be. That is the problem I am trying to solve. My proposal is to allow '^' in tagnames so that we will parse as you propose in comment 4. That would mean we don't have an <s> tag anymore. Since just allowing '^' is hacky, I propose allowing all chars that are not whitespace, '>', or '/' in a tagname (that is, the tagname would go from the '<' to the first whitespace, '/', or '>' char).

Hixie (not reading bugmail)

Comment 6

•

21 years ago

(In reply to comment #5) > Ian, '^' is not a valid character in an element name, as far as I understand. Well, no, but it's not allowed in attribute names either, as I understand it. > Now notice that we ended up with an <s> tag. And now we have unclose <s> tags, > and residual style kicks in, so we keep reopening them inside every <p>, > leading to a DOM tree that's several orders of magnitude bigger than it > "should" be. Oh, I see. Ok, I now understand what you were talking about before when you mentioned the residual style thing. So yes, your proposal makes sense to me. Does it match what IE does, do we know? I can make test cases if you want, but I don't have IE here.

Boris Zbarsky [:bzbarsky]

Comment 7

•

21 years ago

IE allows '^' in tagnames as far as I recall from the testing I had people do before this bug got filed. If you can make some testcases for random chars, I'm sure we can find IE-using volunteers to test IE's behavior. ;)

Hixie (not reading bugmail)

Comment 8

•

21 years ago

Ok. Testcase: http://junkyard.damowmow.com/122 The test consists of taking the markup "<p>aaa<s^bbb</p>" and seeing what innerHTML thinks it is. IE6: aaa<S^BBB< p> IE appears to treat the whole thing as an element, although it also seems to get rather confused about the end tag. It also capitalised, go figure. Safari 1.2: aaa Safari appears to drop unknown HTML elements altogether, it thinks the first element contains no unknown element on that test. Mozilla: aaa<s ^bbb=""></s> As noted above, we treat the ^ bit as part of the attribute. (But note that View Source is different than either innerHTML _or_ the actual source.) Opera: aaa<s/> I don't even know where to begin. In conclusion, I think your proposal is the best thing.

Boris Zbarsky [:bzbarsky]

Comment 9

•

21 years ago

Ian, what we actually need are tests that will test at least the following characters as part of "tag names": ! @ # $ % ^ & * ( ) _ + - = { } | [ ] \ : " ; ' , . ? < > / We could also use tests that use non-ascii chars in tag-soup tagnames. At the moment, we allow ':', '_', '-', and '.' in tagnames but no other punctuation and we don't allow non-ascii. The question is what punctuation to allow and whether to allow other random non-ascii stuff in there too. That is, do we keep our current "chars are not allowed in a tagname unless we explicitly say they are" approach, or do we go with the "chars are allowed unless we explicitly say they are not" approach?

Hixie (not reading bugmail)

Comment 10

•

21 years ago

The < and > characters are special. <t<t> is the same as <t><t> <t>t> is the same as <t>t> So I didn't test those. Testcase: http://www.hixie.ch/tests/adhoc/html/parsing/compat/016.html It seems IE does indeed treat them all as valid characters.

Boris Zbarsky [:bzbarsky]

Comment 11

•

21 years ago

We don't really want to allow '/' in tagnames because then <img/> will not be an <img> tag.... What about non-ascii?

Hixie (not reading bugmail)

Comment 12

•

21 years ago

Updated http://www.hixie.ch/tests/adhoc/html/parsing/compat/016.html

Boris Zbarsky [:bzbarsky]

Comment 13

•

21 years ago

OK, my informants tell me that on that testcase IE has exactly the behavior the testcase describes. I think we should do the same.

Hixie (not reading bugmail)

Comment 14

•

21 years ago

agreed

Christopher Hoess (gone)

Assignee

Comment 15

•

21 years ago

Hmmm. What if we modified CStartToken::Consume after the call to GetIdentifier and peeked at the next character? If it's not whitespace, "<", ">", or "/", then instead of calling LookupTag to set the TypeID, we could just set it to eHTMLTag_userdefined. That way we'd still keep identifiers to name characters, per spec, but it *should*--I'd want to check CNavDTD first for surprises--keep residual style from being activated on the tags, since they'll no longer be recognized as <s>, <p>, or whatever.

Hixie (not reading bugmail)

Comment 16

•

21 years ago

"per spec"?

Boris Zbarsky [:bzbarsky]

Comment 17

•

21 years ago

That sounds pretty fragile to me.... And we'd apply the style anyway, even if non-residually (since for style the tagname is all that matters). I think that keeping to the spec on this aspect of tag-soup is just a lost cause....

Boris Zbarsky [:bzbarsky]

Comment 18

•

21 years ago

Attached patch Patch to give us IE tokenization behavior here (obsolete) — Details — Splinter Review

Incidentally, this patch refixes bug 23791. The funny thing is, this doesn't make us use non-ridiculous amounts of memory on the page, since the page also contains about 250 lines that look like: <p> <a href=pb000194.htm target=_blank>000194</a> "^8<<I am a Target>>" (W) GUID=18a6a80c8604653277b2929c7352b857(?) [2004.02.16 21:10:19] and <i am a target> is a perfectly valid HTML tag that we parse as such and which triggers residual style. (And 125-fold increase in the amount of memory needed by the page. We need about 200 megs RAM to load the page if I remove the 'I' from there, so figure we'd need 25GB RAM to load the thing with this patch. Without this patch, toss in another factor of 5 or so, since there are around 850 lines that have a "<s".)

Boris Zbarsky [:bzbarsky]

Comment 19

•

21 years ago

Comment on attachment 143225 [details] [diff] [review] Patch to give us IE tokenization behavior here I think this is the right thing to do anyway.... we have separate bugs on the fact that residual style is really bloaty in cases when we _do_ need it.

Attachment #143225 - Flags: superreview?(jst)

Attachment #143225 - Flags: review?(choess)

Hixie (not reading bugmail)

Comment 20

•

21 years ago

> ...perfectly valid HTML tag... I think you may be stretching the words "perfectly" and "valid" here... :-P + case '\b': + case '\t': + case '\v': Why do we search for those, but not, say, the form feed character?

Boris Zbarsky [:bzbarsky]

Comment 21

•

21 years ago

It's valid from a tokenization standpoint. Because I forgot to add \f to the list. Consider it added.

Hixie (not reading bugmail)

Comment 22

•

21 years ago

Ok, what about a zero width space? or a zero width non-joiner? or...?

Boris Zbarsky [:bzbarsky]

Comment 23

•

21 years ago

If there is a simple way to express things that should terminate tagnames, I welcome you telling me what it is. The ugly thing in the XML spec is not something I'm going to stick into this code (never mind that we are in fact violating it). Though I would love to hear what IE does on those testcases.

Hixie (not reading bugmail)

Comment 24

•

21 years ago

Try it (warning: takes several minutes to complete): http://www.hixie.ch/tests/adhoc/html/parsing/compat/017.html

Boris Zbarsky [:bzbarsky]

Comment 25

•

21 years ago

Apparently IE totally fails to load that testcase, Ian.

Michael Lefevre

Comment 26

•

21 years ago

Hixie updated the test to work. This is the output from IE 6.0: Test complete. 65536 characters tested. 0 characters not mentioned below were treated as part of tag names. U+0 - U+8 created no element U+9 - U+d delimits tag names from attributes U+e - U+1f created no element U+20 delimits tag names from attributes U+21 - U+2e created no element U+2f delimits tag names from attributes U+30 - U+3d created no element U+3e delimits tag names from attributes U+3f - U+d7ff created no element U+e000 - U+ffff created no element

Hixie (not reading bugmail)

Comment 27

•

21 years ago

bz: So in IE, "spaces" are U+0009 to U+000D, U+0020, U+002F (/), and U+003E (>). I have no clue why IE failed to created characters for the other cases. Probably some silly bug.

Johnny Stenback (:jst)

Comment 28

•

21 years ago

Comment on attachment 143225 [details] [diff] [review] Patch to give us IE tokenization behavior here sr=jst

Attachment #143225 - Flags: superreview?(jst) → superreview+

Boris Zbarsky [:bzbarsky]

Comment 29

•

21 years ago

choess, any chance of a review? I'd sorta like to land this as early in 1.8 as I can....

Christopher Hoess (gone)

Assignee

Comment 30

•

20 years ago

Comment on attachment 143225 [details] [diff] [review] Patch to give us IE tokenization behavior here r=choess, but could you leave this bug on my plate when you check it in? I'd still like to consider an alternative solution that avoids the "arbitrariness" of the current behavior, but it will take a while to be sure it's not fragile.

Attachment #143225 - Flags: review?(choess) → review+

Boris Zbarsky [:bzbarsky]

Comment 31

•

20 years ago

Checked that patch in to trunk. Over to choess, per request.

Assignee: parser → choess

Boris Zbarsky [:bzbarsky]

Updated

•

20 years ago

Attachment #143225 - Attachment is obsolete: true

Blake Kaplan (:mrbkap) (inactive)

Comment 32

•

20 years ago

*** Bug 278055 has been marked as a duplicate of this bug. ***

Blake Kaplan (:mrbkap) (inactive)

Comment 33

•

20 years ago

*** Bug 290795 has been marked as a duplicate of this bug. ***

Phil Ringnalda (:philor)

Updated

•

15 years ago

QA Contact: parser

Henri Sivonen (:hsivonen) (temporarily away from Bugzilla)

Updated

•

7 years ago

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → INCOMPLETE

You need to log in before you can comment on or make changes to this bug.