Closed
Bug 236002
Opened 21 years ago
Closed 7 years ago
Memory expands further than the system will allow with tags such as <s^g^ggfdf%> parsed as <s>
Categories
(Core :: DOM: HTML Parser, defect)
Core
DOM: HTML Parser
Tracking
()
RESOLVED
INCOMPLETE
People
(Reporter: superdug, Assigned: choess)
References
()
Details
Attachments
(1 obsolete file)
User-Agent:
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.6) Gecko/20040206 Firefox/0.8
Screen captures in XP can be found at http://www.dugnet.com/dug/oldstuff/000090.html
the 17 megabyte file takes more than 1.12 gigs of system memory and doesnt
completely load.
Reproducible: Always
Steps to Reproduce:
1.go to page http://www.chigc.com/et/ss-current/moz-kill.htm
2.watch memory usage of process
3.mozilla crashes
Actual Results:
Windows produced an error explaining that the system memory was critically low
and exited the firefox process.
Expected Results:
Internet Explorer displayed the page correctly in under 100 megs of system
resources.
Comment 1•21 years ago
|
||
Typical line from that file:
<p> <a href=pb000106.htm target=_blank>000106</a>
"^YF^<s^ht^YF^<n^hg^Yr^<z^_[^jAd" (W)
GUID=fb9205635216121e04601afc1bd20ac6(VALID:247) [2004.02.12 00:28:41]
This keeps coming up in various places. In view of our residual style handling
being as memory-intensive as it is, I propose making almost all characters
allowed in tagnames in the HTML parser (basically disallow whitespace, '/',
'>'). That would solve issues like crap sites doing
<style="something">
and ending up with the whole page as part of the stylesheet, too. Thoughts?
Comment 2•21 years ago
|
||
Confirmed on linux. Browser continued to run, but system gets very sluggish when
memory is exhausted (and another mozilla process (my Mailnews) crashed, probably
because of out mem). Browser does *not* free memory after page close, which is
probably a bug in itself. Mozilla Seamonkey 1.4.x on SuSE 9.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Comment 3•21 years ago
|
||
> Browser does *not* free memory after page close
Actually, that's not a Mozilla bug... see bug 130157 and discussion therein.
That said, could we stay somewhat focused here? There's really no need to
comment on this bug by anyone not directly involved in parser development or
standards-compliance QA..... (in other words, I'd like to hear from Ian and
choess on the feasibility of what I suggested).
Comment 4•21 years ago
|
||
The following:
"^YF^<s^ht^YF^<n^hg^Yr^<z^_[^jAd" (W)
...should parse to:
text: |"^YF^|
element: <s^ht^YF^>
+ element: <n^hg^Yr^>
+ element: <z^_[^jAd">; attribute: |(W)|, value: ||
...or some such. That is,
<foo<bar>
...is valid HTML (ignoring the fact that the tag names are from some future
version of HTML that we don't know yet) and is exactly equivalent to <foo><bar>.
I don't exactly understand what you are proposing though. Could you explain it
in more detail? What is the problem you are solving?
Comment 5•21 years ago
|
||
Ian, '^' is not a valid character in an element name, as far as I understand.
At the moment we parse that as follows:
"^YF^<s^ht^YF^<n^hg^Yr^<z^_[^jAd" (W)
parses to
text: |"^YF^|
element: <s>; attribute: |^ht^YF^|, value: ||
+ element: <n>; attribute: |^hg^Yr^|, value:||
+ element: <z>; attribute: |^_[^jAd"|, value: ||; attribute: |(W)|,
value: ||
There may be more attributes; I'm not sure how '^' affects attribute parsing and
it does not matter for purposes of this testcase.
Now notice that we ended up with an <s> tag. And now we have unclose <s> tags,
and residual style kicks in, so we keep reopening them inside every <p>, leading
to a DOM tree that's several orders of magnitude bigger than it "should" be.
That is the problem I am trying to solve. My proposal is to allow '^' in
tagnames so that we will parse as you propose in comment 4. That would mean we
don't have an <s> tag anymore. Since just allowing '^' is hacky, I propose
allowing all chars that are not whitespace, '>', or '/' in a tagname (that is,
the tagname would go from the '<' to the first whitespace, '/', or '>' char).
Comment 6•21 years ago
|
||
(In reply to comment #5)
> Ian, '^' is not a valid character in an element name, as far as I understand.
Well, no, but it's not allowed in attribute names either, as I understand it.
> Now notice that we ended up with an <s> tag. And now we have unclose <s> tags,
> and residual style kicks in, so we keep reopening them inside every <p>,
> leading to a DOM tree that's several orders of magnitude bigger than it
> "should" be.
Oh, I see. Ok, I now understand what you were talking about before when you
mentioned the residual style thing.
So yes, your proposal makes sense to me. Does it match what IE does, do we know?
I can make test cases if you want, but I don't have IE here.
Comment 7•21 years ago
|
||
IE allows '^' in tagnames as far as I recall from the testing I had people do
before this bug got filed.
If you can make some testcases for random chars, I'm sure we can find IE-using
volunteers to test IE's behavior. ;)
Comment 8•21 years ago
|
||
Ok. Testcase: http://junkyard.damowmow.com/122
The test consists of taking the markup "<p>aaa<s^bbb</p>" and seeing what
innerHTML thinks it is.
IE6: aaa<S^BBB< p>
IE appears to treat the whole thing as an element, although it also seems to get
rather confused about the end tag. It also capitalised, go figure.
Safari 1.2: aaa
Safari appears to drop unknown HTML elements altogether, it thinks the first
element contains no unknown element on that test.
Mozilla: aaa<s ^bbb=""></s>
As noted above, we treat the ^ bit as part of the attribute. (But note that View
Source is different than either innerHTML _or_ the actual source.)
Opera: aaa<s/>
I don't even know where to begin.
In conclusion, I think your proposal is the best thing.
Comment 9•21 years ago
|
||
Ian, what we actually need are tests that will test at least the following
characters as part of "tag names":
! @ # $ % ^ & * ( ) _ + - = { } | [ ] \ : " ; ' , . ? < > /
We could also use tests that use non-ascii chars in tag-soup tagnames.
At the moment, we allow ':', '_', '-', and '.' in tagnames but no other
punctuation and we don't allow non-ascii. The question is what punctuation to
allow and whether to allow other random non-ascii stuff in there too. That is,
do we keep our current "chars are not allowed in a tagname unless we explicitly
say they are" approach, or do we go with the "chars are allowed unless we
explicitly say they are not" approach?
Comment 10•21 years ago
|
||
The < and > characters are special.
<t<t> is the same as <t><t>
<t>t> is the same as <t>t>
So I didn't test those.
Testcase: http://www.hixie.ch/tests/adhoc/html/parsing/compat/016.html
It seems IE does indeed treat them all as valid characters.
Comment 11•21 years ago
|
||
We don't really want to allow '/' in tagnames because then <img/> will not be an
<img> tag....
What about non-ascii?
Comment 12•21 years ago
|
||
Comment 13•21 years ago
|
||
OK, my informants tell me that on that testcase IE has exactly the behavior the
testcase describes.
I think we should do the same.
Comment 14•21 years ago
|
||
agreed
Assignee | ||
Comment 15•21 years ago
|
||
Hmmm. What if we modified CStartToken::Consume after the call to GetIdentifier
and peeked at the next character? If it's not whitespace, "<", ">", or "/", then
instead of calling LookupTag to set the TypeID, we could just set it to
eHTMLTag_userdefined. That way we'd still keep identifiers to name characters,
per spec, but it *should*--I'd want to check CNavDTD first for surprises--keep
residual style from being activated on the tags, since they'll no longer be
recognized as <s>, <p>, or whatever.
Comment 16•21 years ago
|
||
"per spec"?
Comment 17•21 years ago
|
||
That sounds pretty fragile to me.... And we'd apply the style anyway, even if
non-residually (since for style the tagname is all that matters).
I think that keeping to the spec on this aspect of tag-soup is just a lost cause....
Comment 18•21 years ago
|
||
Incidentally, this patch refixes bug 23791.
The funny thing is, this doesn't make us use non-ridiculous amounts of memory
on the page, since the page also contains about 250 lines that look like:
<p> <a href=pb000194.htm target=_blank>000194</a> "^8<<I am a Target>>" (W)
GUID=18a6a80c8604653277b2929c7352b857(?) [2004.02.16 21:10:19]
and <i am a target> is a perfectly valid HTML tag that we parse as such and
which triggers residual style. (And 125-fold increase in the amount of memory
needed by the page. We need about 200 megs RAM to load the page if I remove
the 'I' from there, so figure we'd need 25GB RAM to load the thing with this
patch. Without this patch, toss in another factor of 5 or so, since there are
around 850 lines that have a "<s".)
Comment 19•21 years ago
|
||
Comment on attachment 143225 [details] [diff] [review]
Patch to give us IE tokenization behavior here
I think this is the right thing to do anyway.... we have separate bugs on the
fact that residual style is really bloaty in cases when we _do_ need it.
Attachment #143225 -
Flags: superreview?(jst)
Attachment #143225 -
Flags: review?(choess)
Comment 20•21 years ago
|
||
> ...perfectly valid HTML tag...
I think you may be stretching the words "perfectly" and "valid" here... :-P
+ case '\b':
+ case '\t':
+ case '\v':
Why do we search for those, but not, say, the form feed character?
Comment 21•21 years ago
|
||
It's valid from a tokenization standpoint.
Because I forgot to add \f to the list. Consider it added.
Comment 22•21 years ago
|
||
Ok, what about a zero width space? or a zero width non-joiner? or...?
Comment 23•21 years ago
|
||
If there is a simple way to express things that should terminate tagnames, I
welcome you telling me what it is. The ugly thing in the XML spec is not
something I'm going to stick into this code (never mind that we are in fact
violating it).
Though I would love to hear what IE does on those testcases.
Comment 24•21 years ago
|
||
Try it (warning: takes several minutes to complete):
http://www.hixie.ch/tests/adhoc/html/parsing/compat/017.html
Comment 25•21 years ago
|
||
Apparently IE totally fails to load that testcase, Ian.
Comment 26•21 years ago
|
||
Hixie updated the test to work. This is the output from IE 6.0:
Test complete. 65536 characters tested. 0 characters not mentioned below were
treated as part of tag names.
U+0 - U+8 created no element
U+9 - U+d delimits tag names from attributes
U+e - U+1f created no element
U+20 delimits tag names from attributes
U+21 - U+2e created no element
U+2f delimits tag names from attributes
U+30 - U+3d created no element
U+3e delimits tag names from attributes
U+3f - U+d7ff created no element
U+e000 - U+ffff created no element
Comment 27•21 years ago
|
||
bz: So in IE, "spaces" are U+0009 to U+000D, U+0020, U+002F (/), and U+003E (>).
I have no clue why IE failed to created characters for the other cases. Probably
some silly bug.
Comment 28•21 years ago
|
||
Comment on attachment 143225 [details] [diff] [review]
Patch to give us IE tokenization behavior here
sr=jst
Attachment #143225 -
Flags: superreview?(jst) → superreview+
Comment 29•21 years ago
|
||
choess, any chance of a review? I'd sorta like to land this as early in 1.8 as
I can....
Assignee | ||
Comment 30•20 years ago
|
||
Comment on attachment 143225 [details] [diff] [review]
Patch to give us IE tokenization behavior here
r=choess, but could you leave this bug on my plate when you check it in? I'd
still like to consider an alternative solution that avoids the "arbitrariness"
of the current behavior, but it will take a while to be sure it's not fragile.
Attachment #143225 -
Flags: review?(choess) → review+
Comment 31•20 years ago
|
||
Checked that patch in to trunk. Over to choess, per request.
Assignee: parser → choess
Updated•20 years ago
|
Attachment #143225 -
Attachment is obsolete: true
Comment 32•20 years ago
|
||
*** Bug 278055 has been marked as a duplicate of this bug. ***
Comment 33•20 years ago
|
||
*** Bug 290795 has been marked as a duplicate of this bug. ***
Updated•15 years ago
|
QA Contact: parser
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE
You need to log in
before you can comment on or make changes to this bug.
Description
•