Closed Bug 177047 Opened 22 years ago Closed 22 years ago

use regexes when testing attribute values

Categories

(SeaMonkey :: Page Info, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: db48x, Assigned: db48x)

References

()

Details

Attachments

(2 files, 9 obsolete files)

I mainly need to do this so that odd whitespace doesn't cause page info to miss things.
Attached patch I think this fits the bill (obsolete) — Splinter Review
mozilla should have given me a syntax error.
Attachment #104343 - Attachment is obsolete: true
Comment on attachment 104368 [details] [diff] [review] thanks timeless for catching that :) >Index: xpfe/browser/resources/content/pageInfo.js >=================================================================== >+const XHTMLNSre = "http\:\/\/www\.w3\.org\/1999\/xhtml"; >+const XHTML2NSre = "http\:\/\/www\.w3\.org\/2002\/06\/xhtml2"; >+const XHTMLre = RegExp(XHTMLNSre + "|" + XHTML2NSre, "i"); A namespace is an opaque string (in other words, we can't say anything about what's in it), "A" and " A " are two different namespaces, as are "A" and "a", so the only valid comparison is |rawnamespace1 == rawnamespace2|. No toLowerCase, no case insensitive regexp matching, no stripping of whitespace. Then for the attribute values you shouldn't have to strip whitespace in most cases, the parser should do that for you. Where it is needed (this is where you'll wanna ask someone else) you don't want to do /bar/.test(str), since that will return true if str = "XXXbarXXX", not what you want in most cases.
Attachment #104368 - Flags: needs-work+
Comment on attachment 104368 [details] [diff] [review] thanks timeless for catching that :) +const XLinkNS = "http://www.w3.org/1999/xlink"; +const XULNS = "http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xul"; +const XMLNS = "http://www.w3.org/XML/1998/namespace"; +const XHTMLNS = "http://www.w3.org/1999/xhtml"; +const XHTML2NS = "http://www.w3.org/2002/06/xhtml2" These should not be duplicated here, they are already in at least one place in the C++, can't we avoid code duplication? This is going to get out of hand -- where's the SVG and MathML namespaces, for example -- as people add namespaces in one place and not the other. + if (ns) // if !ns then it's html, and won't have a namespace What about unnamespaced elements? e.g. the root element in the following XML document: <test/> + if (!XHTMLre.test(elem.namespaceURI)) // otherwise, it has to be XHTML This would match "http://www.w3.org/1999/xhtml-this-is-not", as well as " http://www.w3.org/1999/xhtml ", neither of which are XHTML. + if (/icon/i.test(rel)) This is extremely wrong, as are several other tests here. It matches "lexicon", for example. "rel" is a space separated list. It should be treated as such. "next icon" is an "icon". "stylesheet alternate" is an alternate stylesheet, as is "stylesheet appendix alternate". If there is whitespace in CDATA attributes it should be dealt with by the parser, not the page info code.
See also bug 177069, which could provide a single API for handling <link> elements. I think there may be value in moving a lot of the page info code out of the page info UI and into a separate component, and simply making the page info code be a consumer of that component.
Attached patch address issues (obsolete) — Splinter Review
ok, the namespace checks should be more correct, duplication of effort not withstanding. If the namespace is null the only thing left to check is the mime type, which wouldn't be bad I suppose. Also, I changed the checks to test for equality with uppercase element names, since html documents allways seem to have uppercase nodenames, whereas xml doc do not. It's faster than using .toLowerCase() too. I changed the expression to correctly match space seperated lists when they should, and single words only in the other cases by testing for word boundaries and beginning/end of line. bug 177069 will be great, but until then... Also, you'll have to forgive me, but I went and fixed another bug in this file while I was waiting. I should seperate the two diffs, but I'm running short on time this morning. I'll see what I can do in a few hours at work.
Attachment #104368 - Attachment is obsolete: true
HTML element names use capital letters, XHTML elements are lower case. XML in general is case sensitive, so it could be either upper or lower case or both. That is to say, in X(HT)ML, <img> and <IMG> are very different elements. - if ((nn == "link" || nn == "input" || nn == "img" || isBG) && - isProtocolAllowed) + if ((/LINK|INPUT|IMG/.test(nn) || isBG) && isProtocolAllowed) This will also catch, e.g., <BLINK>. You need to dot "^...$" around a lot of you regexps, I think. Why use regular expressions though? I'd have thought direct string comparisons were faster. (They feel more right to me too.)
Suprisingly enough, case insensitive regexes are around 7x faster than using .toLowerCase() to make the equality test case insensitive. If you make them case sensitive the regex doesn't speed up, and the equality test is 35x faster than it was before. My theory is that .toLowerCase() has to go through some unicode crap to find out what letters to substitute, while regexes assume it's ANSI or the C locale or whatever. The sad thing is, XHTML documents are XML, so mozilla has lowercase .nodeNames for them, which means this last patch breaks page info for XHTML documents. This means that the best way to do it is to test for both cases: (nn == "foo" || nn == "FOO" || nn == "bar" ...) It's faster than regexes because it avoids the call to .toLowerCase(); I'm assuming we don't normalize attribute values, so those I'll leave as regexes. Also, I'm assuming that since XHTML is supposed to use lowercase nodes (<img> vs <iMg> or <IMG>), missing an <iMg> or similar won't upset anyone.
Attachment #104490 - Attachment is obsolete: true
suprisingly enough, it doesn't even really slow it down more than half a second or so, which isn't very much compared to the amount to time it generally takes to begin with. Provided I've gotten everything just right, this will prevent a node like <html:LINK> from showing up in the links tab for an xml document, since it's not a valid xhtml link element (it should be <html:link>). In fact, initial tests prove this to be the case. (see http://db48x.dyndns.org/pageInfo/tests/case.xml)
Attachment #104537 - Attachment is obsolete: true
You use .nodeName, I think you were right when you said you had to use localName. Note: this code, I think, still treats unnamespaced XML as HTML, so: <foo xmlns=""> <LINK href="fail"/> </foo> ...will find a link when it shouldn't. No? Also, is there a bug on the analyse-in-the-background issue?
yes, but I don't think there's anything I can do about it. and I'm not sure there's a bug for the background processing, but I can look and then file one. that and I just noticed that that's not the right diff, I'll correct that.
Attached patch correct diff (obsolete) — Splinter Review
the previous one didn't have the very last change I made, which fixed the calls to getAbsoluteURL(). I'm adding those calls in this patch rather than the one for bug 153545 because the two conflict badly. I'm still fixing getAbsoluteURL() in bug 153545 though. as for <LINK> in an xml doc, this code thinks it's an html link node, but then goes to get .rel, but that doesn't exist. The same will happen with other node types as well, because rather than being HTMLLinkNodes or whatever, they're just Elements. So, maybe the correct thing to do is call toString() on the node to get it's class name, and match on that as well? Seems silly, but it does expose the information we need that way...
Attachment #104608 - Attachment is obsolete: true
Can't you check to see if the document is HTML or XHTML? There really should be a way. I know there is at the C++ level, because I've done it. <?xml version="1.0" ?> <LINK rel="stylesheet" href="fail"/> ...must not be recognised as any kind of link
well, with this patch it isn't really recognized as a rel link. In fact, it shows up as a rev link, because someone else determined that it wasn't a real link, so it's just an Element, not an nsIHTMLLinkElement or whatever, therefore the test for link.rel fails, and it's inserted in as a link.rev, but with no value for any of the columns except type. Similar things happen for other kinds of nodes. I was thinking of fixing this by calling .toString() on the object, and greping that for HTML and the element name, but this doesn't work for anchors, which return the url rather than a string like [object Foo]. I could explicitly test for the presence of certain properties, and just not add to the tree if they're missing. In the case of link nodes, I could do something like this: if ("rel" in elem && elem.rel) // it's real else if (elem.rev) // it's real but reverse else // it's fake so do nothing but that's a pain, and really doesn't solve the problem of false positives from isa(). Is there a way we can possibly add an attribute to the idl for Element that indicates if this is an html or xhtml node? That would clear this right up.
looks like |elem instanceof Components.interfaces.nsIDOMHTMLAnchorElement| would be a valid way to test a node to see if it's an html anchor element. I can just remove isa() and use that for all of the tests too.
Excellent.
Attached patch well, it works (obsolete) — Splinter Review
Hixie, sure you don't want to review it, or at least test it? You're qualified.
Attachment #104673 - Attachment is obsolete: true
Comment on attachment 104869 [details] [diff] [review] well, it works - //XXX When Java is enabled, the DOM model for <APPLET> is broken. Bug #59686. ... get a real fix on the trunk. how's that going? useless style nit, you seemed to use if (cond), and then there was an if(cond) in @@ -595,15 +595,+14 @@ else after return in +function FormControlFilter(node) + if (!item instanceof Components.interfaces.nsIDOMHTMLInputElement) i don't think !a instanceof b behaves the way you want (so use parens): js> ({}) instanceof Function false js> !({}) instanceof Function false ^^ this is the only real bug i found in your code. everything else in this comment is some sort of minor comment. you might consider doing something like this: const inputElement = Components.interfaces.nsIDOMHTMLInputElement; it would make things easier to read/write, but it might cost more. > imageView.addRow([getAbsoluteURL(elem.code || elem.object, elem), gStrings.mediaApplet, "",elem]); please add a space after "",
Attachment #104869 - Flags: review+
I'm afraid I don't have a build environment at the moment. And I'm definitely not qualified, I hardly know JS and I've never seen this area of code before.
Attached patch fix stuff (obsolete) — Splinter Review
the applet stuff seems to have been dormant for some time now, with talks of rewriting something or other. not a good sign, as far as I'm concerned.
Attachment #104869 - Attachment is obsolete: true
Comment on attachment 104874 [details] [diff] [review] fix stuff >Index: xpfe/browser/resources/content/pageInfo.js >=================================================================== >RCS file: /cvsroot/mozilla/xpfe/browser/resources/content/pageInfo.js,v >retrieving revision 1.44 >diff -u -r1.44 pageInfo.js >--- xpfe/browser/resources/content/pageInfo.js 1 Nov 2002 02:12:10 -0000 1.44 >+++ xpfe/browser/resources/content/pageInfo.js 1 Nov 2002 19:35:48 -0000 >@@ -212,11 +212,30 @@ > // do nothing, later code will handle the error > } > >+// interfaces for the different html elements >+const anchorElement = Components.interfaces.nsIDOMHTMLAnchorElement ... So typically to make it harder to clash with variable names, and to make the code a little easier to read, I use const nsIDOMHTMLAnchorElement = Components.interfaces.nsIDOMHTMLAnchorElement. That way I know I'm comparing against an interface when I'm reading the code below. Other than that, I like this. sr=jag
Attachment #104874 - Flags: superreview+
Attachment #104874 - Attachment is obsolete: true
- var formfields = form.elements; + var formfields = findFormControls(form); This will not find the controls on many HTML pages where the parser fixes up the HTML and moves the controls outside the form... (it keeps form.elements in sync, though). Simple testcase: <table><form><tr><td> <input> </td></tr></form></table>
I'm rather inclined to leave that as an evang issue.
I think you underestimate the number of sites doing that... it's the only way to keep <form> from screwing with layout in NS4. There's a reason we go to such pains to support it.... But it's your call, of course. ;) I can land this as-is if you want to keep it as-is
Go ahead and check it in, I'll see what I can do later today. Does form.elements always come out in a particular order? The document order maybe?
Yeah, it comes out in document order. I'll land this when I get home this afternoon.
Good to see instanceof being put to good use.
hmm.. .that patch does not apply to a clean pageInfo.js....
where did it fail? I cannot get a clean copy, since cvs-mirror is gone and lxr is broken. maybe you could mail me one :P
Attached patch clean diff? (obsolete) — Splinter Review
actually, lxr seems to have caught up, how does this patch look?
That applies. ;) I assume you tested it and such? If so, I'll land it....
Attached patch clean diffSplinter Review
Attachment #105750 - Attachment is obsolete: true
Yes, I was getting to that. It does apear that I've made one small mistake, nsiAppletElement on line 222 should be changed to nsIAppletElement. Then there's just bug 122125... :)
> var physWidth = physHeight = 0; I changed this to var physWidth = 0, physHeight = 0; to avoid the strict warning. Checked in with that.
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Product: Browser → Seamonkey
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: