Closed Bug 116242 Opened 23 years ago Closed 15 years ago

[mozTXTToHTMLConv] Function: Find URL in plaintext string

Categories

(Core :: Networking, enhancement)

enhancement
Not set
normal

RESOLVED FIXED
mozilla1.5beta

People

(Reporter: BenB, Assigned: BenB)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

Several features in Mozilla (e.g. the spellchecker, an "open selection as URL"
context menu item, maybe the urlbar etc.) need to find a URL in a plaintext
string. I.e. you have a string and you suspect there is a URL in it, but you
don't know where it starts and ends.

The task is difficult. But we have code in Mozilla which performs exactly that,
namely in the TXT->HTML converter called mozTXTToHTMLConv in netwerk. I think
this code works fairly reliably, so we should reuse it in the other parts of the
app.

The subject of this bug is to create an XPCOM- or C++-function suitable for use
in these other features. The signature could look like

void findURLinPlaintext(in string text, out long start, out long end)

If a URL has been found, the function returns NS_SUCCESS and fills |start| and
|end| with the indices of the start and end of the first URL found (|end| is the
index of the last character of the URL, not the char after it). If no URL could
be found, it returns a certain (non-fatal) error code.
Keywords: mozilla1.0
Summary: Find URL in string → Function: Find URL in plaintext string
Could we also provide an nsAString version that will take start and end 
iterators and adjust them to point to the url (a la FindInReadable)?
Boris Zbarsky, do you have a concrete use in mind?
Yes.  The concrete use is if I have a unicode string and don't want to
UTF8-encode it, make a copy, send it through findURLinPlaintext, take the
substring and convert it back into UCS2....

This is most likely to be needed by the spellchecker, since I presume the
message being composed is in UCS2 internally...
I intended to use 16bit wide strings anyway. If for no other reason, then
because indices in utf8 are harder (do they mean the char-index or the
byte-index? ...).
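
To illustrate the index ambiguity mentioned above, here is a minimal sketch
using standard C++ strings rather than Mozilla's string classes. The sample
text and the helper names (sampleUtf8, schemeOffsetUtf8, etc.) are made up
for this illustration. The same visible text yields a different offset for
the URL depending on whether it is stored as UTF-8 bytes or as 16-bit units.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sample: "Grüße: http://example.com" encoded as UTF-8.
// The two non-ASCII letters take two bytes each. The literal is split so the
// hex escape does not swallow the following 'e'.
std::string sampleUtf8() {
  return "Gr\xC3\xBC\xC3\x9F" "e: http://example.com";
}

// The same text as 16-bit code units (one unit per character here).
std::u16string sampleUtf16() {
  return u"Gr\u00FC\u00DF" u"e: http://example.com";
}

// Offset of the scheme, counted in bytes of the UTF-8 string.
std::size_t schemeOffsetUtf8(const std::string& text) {
  return text.find("http");
}

// Offset of the scheme, counted in 16-bit code units.
std::size_t schemeOffsetUtf16(const std::u16string& text) {
  return text.find(u"http");
}
```

With this sample, the UTF-8 offset counts bytes and the 16-bit offset counts
code units, so the two disagree as soon as non-ASCII text precedes the URL.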
Ah. That was not clear from the proposed prototype... :)

In that case, what you have is likely fine. It _does_ require a flat string, but
that can be worked around...

Iterators are just more convenient than numeric indices for a lot of string
work, which is why I suggested that.
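
A rough sketch of what such an iterator-based variant might look like, written
against std::u16string instead of nsAString so that it stands alone. The names
findURLInRange and firstURL are hypothetical, and the detection is a toy
stand-in (it only looks for "http://" and stops at the next space), not the
mozTXTToHTMLConv logic. On success the two iterators are narrowed to bracket
the match, with the end iterator pointing one past the last character.

```cpp
#include <algorithm>
#include <string>

using Iter = std::u16string::const_iterator;

// Narrows [searchStart, searchEnd) to the first "http://..." run it finds.
// Returns false and leaves both iterators untouched if nothing matches.
bool findURLInRange(Iter& searchStart, Iter& searchEnd) {
  static const std::u16string scheme = u"http://";
  Iter hit = std::search(searchStart, searchEnd, scheme.begin(), scheme.end());
  if (hit == searchEnd) {
    return false;
  }
  // Toy rule: the URL runs until the next space or the end of the range.
  Iter stop = std::find(hit, searchEnd, u' ');
  searchStart = hit;
  searchEnd = stop;
  return true;
}

// Convenience wrapper: returns the matched text, or an empty string.
std::u16string firstURL(const std::u16string& text) {
  Iter start = text.begin();
  Iter end = text.end();
  if (!findURLInRange(start, end)) {
    return std::u16string();
  }
  return std::u16string(start, end);
}
```

The caller never has to compute a length from two indices; the substring is
built directly from the iterator pair.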
Blocks: 10080
Blocks: 172186
Summary: Function: Find URL in plaintext string → [mozTXTToHTMLConv] Function: Find URL in plaintext string
This is the function:
  /**
     Pass a plaintext string to it and it will try to find/recognize the
     first URL in it (possibly abbr. URL and buried like in
     "foo@example.com." or augmented like in "<http://www.example.com>")
     and return the loadable URL (e.g. "mailto:foo@example.com" or
     "http://www.example.com", respectively).

     @param text  search for the URL here
     @param startPos  first character of the URL in |text|
     @param endPos  last character of the URL in |text|
     @param url  loadable URL (the URL in |text|, as returned by start/endPos,
                 may be abbreviated). You have to nsMemory::Free() this.
     @return  URL found
   */
 boolean findURLTXT([const] in wstring text,
		     out long startPos, out long endPos, out wstring url);


The trigger was bug 172186, but that doesn't need the url param. However, other
expected users of the function, e.g. load selection as URL or the URLbar, will
need it.
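
To make the difference between the span in |text| and the loadable URL
concrete, here is a small standalone sketch. It is not the code from the
patch: toLoadableURL is a hypothetical helper, it uses std::u16string rather
than wstring, and it handles only the two cases named in the doc comment (a
bare mail address and a URL that already carries a scheme). It takes the
inclusive start/end positions and returns what a caller could actually load.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: turn the span text[startPos..endPos] (endPos is the
// index of the last character, inclusive) into a loadable URL.
std::u16string toLoadableURL(const std::u16string& text,
                             int startPos, int endPos) {
  // Reject positions that do not describe a span inside |text|.
  if (startPos < 0 || endPos < startPos ||
      static_cast<std::size_t>(endPos) >= text.size()) {
    return std::u16string();
  }
  // Inclusive end, so the length is end - start + 1.
  std::u16string span = text.substr(startPos, endPos - startPos + 1);

  // Already has a scheme: usable as-is.
  if (span.find(u"://") != std::u16string::npos ||
      span.compare(0, 7, u"mailto:") == 0) {
    return span;
  }
  // Looks like a bare mail address: complete it.
  if (span.find(u'@') != std::u16string::npos) {
    return u"mailto:" + span;
  }
  // Anything else is returned unchanged in this sketch.
  return span;
}
```

For "Write to foo@example.com." the span covers positions 9 to 23 and excludes
the trailing period; for "<http://www.example.com>" it covers 1 to 22 and
excludes the angle brackets.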

And this url out param is also what required quite some reworking of the
mozTXTToHTMLConv class. FindURL previously generated HTML, but here I need the
real, valid, completed URL, so I had to reorganize the functions to get the
HTML stuff out of FindURL.

While I was at it, I also fixed a number of other things, like
- warnings
- a function rename (ShouldLinkify() -> HasValidScheme())
- lines > 80 chars
- comments

I tested this against my old test cases (created when I initially wrote the
class / the converter), and they still all work fine.

The new IDL function is not yet tested, though.
-mozilla1.0 keyword. I guess I missed that target.
Keywords: mozilla1.0
Target Milestone: --- → mozilla1.5beta
Why wstring instead of nsIURI?
No hard reason. I think I could use nsIURI, but I'd have to change a number of
internal function signatures.
conversion from nsIURI to wstring is lossy, which sometimes matters.  prefer
nsIURI whenever possible.  if nothing else it helps avoid string copies :)
-> qawanted.

This is interesting stuff, but until the ns and mz stuff is unified, I'm going
to focus on other technologies, since my time is really limited right now. As I
recall, the mail used mz, chatzilla used ns. I guess it doesn't matter what NIM
uses anymore...
Keywords: qawanted
QA Contact: benc → nobody
Blocks: 227922
Bug 254913 (anti-phishing) is another potential user of this function.
No longer blocks: 227922
QA Contact: nobody → networking
This was fixed 2004-02-19 12:44 as part of bug 234936 and bug 172186.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Current API:
 @param a wide string to scan for the presence of a URL.
 @param aLength --> the length of the buffer to be scanned
 @param aPos --> the position in the buffer to start scanning for a url

 aStartPos --> index into the start of a url (-1 if no url found)
 aEndPos --> index of the last character in the url (-1 if no url found)

void findURLInPlaintext(in wstring text, in long aLength, in long aPos, out long aStartPos, out long aEndPos);
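
A hedged sketch of how a caller might work with these conventions (-1 for
"not found", inclusive end index). The function findURLInPlaintextSketch below
is a toy stand-in with the same parameter shape, not the real implementation:
it only recognizes "http://" followed by non-whitespace, and it assumes that
aPos is simply the offset where searching begins, which the comment above does
not spell out. collectURLs is likewise a made-up helper.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in mirroring the IDL parameter list. Sets both out params to -1
// when nothing is found; otherwise aEndPos is the last character of the URL.
void findURLInPlaintextSketch(const std::u16string& text, int aLength,
                              int aPos, int& aStartPos, int& aEndPos) {
  aStartPos = -1;
  aEndPos = -1;
  if (aLength < 0 || aPos < 0 || aPos >= aLength ||
      static_cast<std::size_t>(aLength) > text.size()) {
    return;
  }
  const std::u16string scheme = u"http://";
  const std::size_t limit = static_cast<std::size_t>(aLength);
  std::size_t start = text.find(scheme, static_cast<std::size_t>(aPos));
  if (start == std::u16string::npos || start + scheme.size() > limit) {
    return;
  }
  // Extend the match until whitespace or the end of the scanned buffer.
  std::size_t stop = start;
  while (stop < limit && text[stop] != u' ' && text[stop] != u'\t' &&
         text[stop] != u'\n') {
    ++stop;
  }
  aStartPos = static_cast<int>(start);
  aEndPos = static_cast<int>(stop) - 1;
}

// Collect every URL by restarting the scan just past the previous match.
std::vector<std::u16string> collectURLs(const std::u16string& text) {
  std::vector<std::u16string> urls;
  const int length = static_cast<int>(text.size());
  int pos = 0;
  while (pos < length) {
    int start = -1;
    int end = -1;
    findURLInPlaintextSketch(text, length, pos, start, end);
    if (start < 0) {
      break;  // -1 means no further URL
    }
    urls.push_back(text.substr(start, end - start + 1));
    pos = end + 1;
  }
  return urls;
}
```

Because the end index is inclusive, the substring length is end - start + 1
and the next scan starts at end + 1.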

<http://mxr.mozilla.org/comm-central/source/mozilla/netwerk/streamconv/public/mozITXTToHTMLConv.idl#111> (m-c)
<http://mxr.mozilla.org/seamonkey/source/netwerk/streamconv/public/mozITXTToHTMLConv.idl#111> (CVS)
Keywords: qawanted