Closed Bug 37941 Opened 24 years ago Closed 22 years ago

[RFE] Regular Expression Searches

Categories

(SeaMonkey :: UI Design, enhancement, P3)

enhancement

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 32641
Future

People

(Reporter: danielpeng, Assigned: sfraser_bugs)

Details

(Keywords: helpwanted)

Find sections in a page based on a regular expression.
->sfraser
Assignee: don → sfraser
Status: UNCONFIRMED → NEW
Ever confirmed: true
M20
Status: NEW → ASSIGNED
Target Milestone: --- → M20
moving to future milestone
Assignee: sfraser → beppe
Status: ASSIGNED → NEW
moving back to previous owner
Assignee: beppe → sfraser
Target Milestone: M20 → Future
adding help wanted keyword
Status: NEW → ASSIGNED
Keywords: helpwanted
I've been thinking of working on something along these lines.  Any ideas on how
to do regexp stuff in C++ (since the search is currently in C++)?
You'll need a RegExp implementation. The JS engine has one, but I don't believe 
it's exposed; doing that would probably require some refactoring.
Who needs it in C++?  We're all about XPCOM, baby!

You need an nsIRegExp interface, a tiny JS implementation, and wham-o: Henry
Spencer's your uncle!
The JS RegExp stuff can be exposed.  Some of it is.  You'll need a JSContext*,
which you can get from an nsIScriptContext, which you can get from a DOM global
object.  Where would the call(s) to compile and test a regexp come from?

Here's a question: does the search have to convert the entire document into a
string to find a match?  Or is there an iterator that can be used character by
character?

/be
*** Bug 118507 has been marked as a duplicate of this bug. ***
mass moving open bugs pertaining to find in page/frame to pmac@netscape.com as
qa contact.

to find all bugspam pertaining to this, set your search string to
"AppleSpongeCakeWithCaramelFrosting".
QA Contact: sairuh → pmac
I have bug 32641, to implement simple wildcard searching.  The request in that
bug asks specifically that full regexps NOT be implemented.

We should choose one of these two options, and resolve one of these bugs as a
dup of the other.  I should probably own the resulting bug, unless Simon
specifically wants this one.

Simon (or anyone on the cc list), what do you think?  Wildcard or regexp?  And
please feel free to reassign this one to me, or dup it to 32641.

Unfortunately, we can't just plug in a regexp library, as our searches have to
be able to span multiple DOM nodes while iterating backward and forward in the
DOM and skipping over invisible nodes.
I have to say that I find regexp searching infinitely more useful than wildcard
searching.  I also agree (unfortunately) that Brian has a point in bug 32641 --
wildcard searching may be a lot more likely to be understandable to the average
"power user" who is used to dealing with ? and * in shells.

I was wondering whether it would be possible to make the code flexible enough
that the matching engine could be swapped out (so that someone could write a
"regexp search" xpi) while keeping it fast (and whether we really care, I guess).
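
A rough sketch of what "swappable matching engine" could mean, written in Python
for brevity rather than in the C++ of the find code. It assumes the text to
search is already available as one flat string, which sidesteps the
node-spanning problem mentioned above. The names literal_matcher,
regexp_matcher and find_all are made up for illustration and are not existing
Mozilla APIs.

```python
import re
from typing import Callable, List, Optional, Tuple

# A matcher takes the text and a start index and returns the (start, end)
# span of the first match at or after that index, or None.
Span = Tuple[int, int]
Matcher = Callable[[str, int], Optional[Span]]


def literal_matcher(needle: str) -> Matcher:
    """Plain substring search (hypothetical helper)."""
    def find(text: str, start: int = 0) -> Optional[Span]:
        i = text.find(needle, start)
        return None if i < 0 else (i, i + len(needle))
    return find


def regexp_matcher(pattern: str) -> Matcher:
    """Regular expression search behind the same interface."""
    compiled = re.compile(pattern)

    def find(text: str, start: int = 0) -> Optional[Span]:
        m = compiled.search(text, start)
        return None if m is None else m.span()
    return find


def find_all(text: str, matcher: Matcher) -> List[Span]:
    """Collect every non-overlapping match, whichever engine is plugged in."""
    spans, pos = [], 0
    while pos <= len(text):
        span = matcher(text, pos)
        if span is None:
            break
        spans.append(span)
        pos = max(span[1], span[0] + 1)  # always advance, even on empty matches
    return spans
```

With this shape, find_all("foo bar foo", literal_matcher("foo")) and
find_all("foo bar foo", regexp_matcher(r"fo+")) both return [(0, 3), (8, 11)];
only the engine differs.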

I've been thinking some more....

For a typical document with text (which is what one mostly views with a web
browser) the only wildcard that's really useful is "?"...  "*" would have a
strong tendency to match something like half the document.  (And it sounds like
bug 32641 is asking for "*foo*" to act as the "\b\w*foo\w*\b" regexp, which is,
imo, not at all intuitive even for someone used to wildcards.)

Also, on Unix, regular expression searches are the standard for tools that
manipulate text and allow searching -- wildcards are only really used by shells.
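
To make the "*foo*" reading above concrete, here is a small Python sketch of
that translation. The function name wildcard_to_regexp is hypothetical, and
mapping "?" to a single word character is an assumption; the thread only spells
out the behaviour of "*".

```python
import re


def wildcard_to_regexp(pattern: str) -> str:
    """Translate a word-bounded wildcard pattern into a regexp string."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w*")        # any run of word characters, within one word
        elif ch == "?":
            parts.append(r"\w")         # assumption: exactly one word character
        else:
            parts.append(re.escape(ch))  # everything else matches literally
    # Anchor both ends at word boundaries, as in the reading of bug 32641 above.
    return r"\b" + "".join(parts) + r"\b"
```

Here wildcard_to_regexp("*foo*") produces \b\w*foo\w*\b, which picks out
"foolish", "buffoon" and "food" from "a foolish buffoon eats food".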
> Unfortunately, we can't just plug in a regexp library,
> as our searches have to be able to span multiple DOM
> nodes while iterating backward and forward in the
> dom and skipping over invisible nodes.


How does the literal text matching engine do it?

This might be Too Much Bloat, but what if a search command caused a plain text
version to be assembled on the fly, with an associated table of relations
between positions in the plain text version and location in the document? The
table would not need to be complete, only enough to be able to construct
information to highlight "this much off the end of that node, these nodes, and
that much off the start of the next node".

The text equivalent and table would be built only when the search was initiated
(and cached until some DHTML whatzit rendered the table obsolete). The plain
text could be searched by any regex library which could return start/end of
match indices which would then be translated into a useful result relative to
the actual page.

If the literal text search doesn't already do something like this, it *could* be
doing something like this, and the whole thing could be very pluggable. One
could have different options for search: literal text, wildcards, regular
expressions, soundex, whatever.

-matt
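
A minimal Python sketch of the plain-text-plus-table idea above, not a
description of how find actually works. TextNode is a hypothetical stand-in for
a DOM text node, and flatten, locate and regexp_find are illustrative names.
The table holds only the start offset of each visible node, which is enough to
translate a match back into "node, offset" pairs.

```python
import re
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class TextNode:
    """Hypothetical stand-in for a DOM text node."""
    text: str
    visible: bool = True


def flatten(nodes):
    """Build the plain text plus a table of where each visible node starts."""
    pieces, starts, indices, pos = [], [], [], 0
    for i, node in enumerate(nodes):
        if not node.visible or not node.text:
            continue  # invisible or empty nodes contribute nothing
        starts.append(pos)   # offset of this node in the plain text
        indices.append(i)    # which node that offset belongs to
        pieces.append(node.text)
        pos += len(node.text)
    return "".join(pieces), starts, indices


def locate(offset, starts, indices):
    """Map a plain-text offset to (node index, offset inside that node)."""
    k = bisect_right(starts, offset) - 1
    return indices[k], offset - starts[k]


def regexp_find(nodes, pattern):
    """Return ((node, offset), (node, offset)) for the first match, or None."""
    text, starts, indices = flatten(nodes)
    m = re.search(pattern, text)
    if m is None or not starts:
        return None
    start = locate(m.start(), starts, indices)
    if m.end() > m.start():
        # Locate the last matched character so the end stays in its own node.
        node, off = locate(m.end() - 1, starts, indices)
        end = (node, off + 1)
    else:
        end = start
    return start, end
```

For example, with nodes "Hello, wor", a hidden node, "ld of reg" and "exps",
regexp_find(nodes, r"w\w+d") returns ((0, 7), (2, 2)): the match starts 7
characters into node 0 and ends 2 characters into node 2, skipping the hidden
node in between.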
akkana: feel free to take this. I'd also strongly recommend that you future it  :)
I'm going to dup this to bug 32641.  Those arguing for regexps rather than
wildcards, discuss it there where the pro-wildcard folks are.

Boris:
> I was wondering whether it would be possible to make the code flexible enough
> that the matching engine could be swapped out

Unfortunately not; it isn't really possible, for the reason given below:

m_mozilla:
> what if a search command caused a plain text  
> version to be assembled on the fly

That's what the previous version of find did, and that's why it was up to an
order of magnitude slower on big documents.  It's not reasonable on big
documents.  Part of the problem was that it had to be redone for every search,
because we have no way of knowing whether the document changed since the last
search.

In various attempts at rewriting this code, I tried several different approaches
involving combining text from several text nodes together and then calling the
built-in searches in our string classes (which would also have allowed for
calling regexp comparisons), but the result wasn't fast enough, and I never came
up with a satisfactory answer to the question of "How do you determine how many
nodes you have to convert to plaintext before you have enough to call the
pattern search on it?"  I suppose you could just keep building the string as you
iterate through the document, re-doing the regexp search each time.
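
The last suggestion (keep building the string while iterating, re-doing the
regexp search each time) might look roughly like this in Python. It is only a
sketch: incremental_regexp_find is a made-up name, and the chunks are assumed
to be the visible text nodes' contents in document order.

```python
import re


def incremental_regexp_find(chunks, pattern):
    """Append one text chunk at a time and re-run the search after each.

    Returns (chunks consumed, (start, end) in the accumulated text) for the
    first match found, or None if the chunks run out without a match.
    """
    compiled = re.compile(pattern)
    buffer = ""
    for count, chunk in enumerate(chunks, start=1):
        buffer += chunk
        m = compiled.search(buffer)  # rescans the whole buffer every time
        if m:
            return count, m.span()
    return None
```

Two costs show up even in this toy version. The whole buffer is rescanned after
every node, so the work grows roughly quadratically with document size. And
stopping at the first hit can cut a match short: for the chunks "Hello, wor",
"ld of reg", "exps", the pattern reg\w* matches (16, 19) after two chunks, while
a search over the full text would match (16, 23). That is the "how many nodes
are enough" question in miniature.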

*** This bug has been marked as a duplicate of 32641 ***
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → DUPLICATE
Product: Core → Mozilla Application Suite