Closed Bug 116242 Opened 24 years ago Closed 16 years ago

[mozTXTToHTMLConv] Function: Find URL in plaintext string

Categories

(Core :: Networking, enhancement)

enhancement
Not set
normal

RESOLVED FIXED
mozilla1.5beta

People

(Reporter: BenB, Assigned: BenB)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

Several features in Mozilla (e.g. the spellchecker, an "open selection as URL" context menu item, maybe the urlbar etc.) need to find a URL in a plaintext string. I.e. you have a string and you suspect there is a URL in it, but you don't know where it starts and ends.

The task is difficult. But we have code in Mozilla which performs exactly that, namely in the TXT->HTML converter called mozTXTToHTMLConv in netwerk. I think this code works fairly reliably, so we should reuse it in the other parts of the app.

The subject of this bug is to create an XPCOM or C++ function suitable for use in these other features. The signature could look like

  void findURLinPlaintext(in string text, out long start, out long end)

If a URL has been found, the function returns NS_SUCCESS and fills |start| and |end| with the indices of the start and end of the first URL found (|end| is the index of the last character of the URL, not the char after it). If no URL could be found, it returns a certain (non-fatal) error code.
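To make the proposed index convention concrete, here is a minimal standalone C++ sketch. It is not the mozTXTToHTMLConv logic: sketchFindURL and sketchExtract are hypothetical names, and the "detection" only looks for a literal "http://" and runs to the next whitespace. The point is the inclusive |end| index, which makes the URL length end - start + 1.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the proposed function; NOT the mozTXTToHTMLConv
// logic. Returns true and fills start/end (end inclusive) when a literal
// "http://" URL is present, false otherwise.
bool sketchFindURL(const std::u16string& text, long& start, long& end) {
  const std::size_t pos = text.find(u"http://");
  if (pos == std::u16string::npos) {
    return false;  // plays the role of the non-fatal "not found" result
  }
  // Extend the match to the next whitespace or to the end of the string.
  std::size_t stop = text.find_first_of(u" \t\r\n", pos);
  if (stop == std::u16string::npos) {
    stop = text.size();
  }
  start = static_cast<long>(pos);
  end = static_cast<long>(stop) - 1;  // index of the last character
  return true;
}

// With an inclusive end index, the substring length is end - start + 1.
std::u16string sketchExtract(const std::u16string& text, long start, long end) {
  assert(start >= 0 && end >= start);
  return text.substr(static_cast<std::size_t>(start),
                     static_cast<std::size_t>(end - start + 1));
}
```

A caller that wants the URL text has to remember the + 1; an exclusive end index would avoid that, but the proposal above chooses the inclusive form.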
Keywords: mozilla1.0
Summary: Find URL in string → Function: Find URL in plaintext string
Could we also provide an nsAString version that will take start and end iterators and adjust them to point to the url (a la FindInReadable)?
Boris Zbarsky, do you have a concrete use in mind?
Yes. The concrete use is if I have a unicode string and don't want to UTF8-encode it, make a copy, send it through findURLinPlaintext, take the substring and convert it back into UCS2.... This is most likely to be needed by the spellchecker, since I presume the message being composed is in UCS2 internally...
I intended to use 16bit wide strings anyway. If for no other reason, then because indices in utf8 are harder (do they mean the char-index or the byte-index? ...).
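A small standalone C++ sketch of why UTF-8 indices are ambiguous: the same position in a text has one index in 16-bit code units and a different byte offset in UTF-8 as soon as a non-ASCII character precedes it. utf16IndexToUtf8Offset is a hypothetical helper written for this illustration and assumes well-formed UTF-16 input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Converts an index counted in UTF-16 code units into the byte offset of the
// same position in the UTF-8 encoding of the text.
// Assumption: the input is well-formed UTF-16.
std::size_t utf16IndexToUtf8Offset(const std::u16string& text,
                                   std::size_t index) {
  assert(index <= text.size());
  std::size_t bytes = 0;
  for (std::size_t i = 0; i < index; ++i) {
    const char16_t unit = text[i];
    if (unit < 0x80) {
      bytes += 1;  // ASCII
    } else if (unit < 0x800) {
      bytes += 2;
    } else if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < index) {
      bytes += 4;  // surrogate pair: two UTF-16 units, four UTF-8 bytes
      ++i;
    } else {
      bytes += 3;
    }
  }
  return bytes;
}
```

For a text such as "café http://example.com", the URL starts at code-unit index 5 but at byte offset 6, so an index returned for a UTF-8 string would have to state which of the two it means.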
Ah. That was not clear from the proposed prototype... :) In that case, what you have is likely fine. It _does_ require a flat string, but that can be worked around... Iterators are just more convenient than numeric indices for a lot of string work, which is why I suggested that.
Blocks: 10080
Blocks: 172186
Summary: Function: Find URL in plaintext string → [mozTXTToHTMLConv] Function: Find URL in plaintext string
This is the function:

  /**
    Pass a plaintext string to it and it will try to find/recognize the first URL in it (possibly an abbr. URL and buried like in "foo@example.com." or augmented like in "<http://www.example.com>") and return the loadable URL (e.g. "mailto:foo@example.com" or "http://www.example.com", respectively).

    @param text      search for the URL here
    @param startPos  first character of the URL in |text|
    @param endPos    last character of the URL in |text|
    @param url       loadable URL (the URL in |text|, as returned by start/endPos, may be abbreviated). You have to nsMemory::Free() this
    @return          URL found
   */
  boolean findURLTXT([const] in wstring text, out long startPos, out long endPos, out wstring url);

The trigger was bug 172186, but that doesn't need the url param. However, other expected users of the function, e.g. load selection as URL or the URLbar, will need it. And this url out param is also what required quite some reworking of the mozTXTToHTMLConv class. FindURL previously generated HTML, but here I need the real, valid, completed URL, so I had to reorganize the functions to get the HTML stuff out of FindURL.

While I was at it, I also fixed a number of other things, like
- warnings
- a function rename (ShouldLinkify() -> HasValidScheme())
- lines > 80 chars
- comments

I tested this against my old test cases (created when I initially wrote the class / the converter), and they still all work fine. The new IDL function is not yet tested, though.
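The difference between the text found at start/endPos and the |url| out param can be sketched in standalone C++. This is only an illustration under simple assumptions, not the code in the patch: sketchCompleteURL is a hypothetical helper, it expects the URL text to have been isolated already (without the trailing "." or the surrounding "<" and ">"), and it only knows the two cases named in the comment above.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not the mozTXTToHTMLConv code: turns URL text that
// was already isolated from the plaintext into a loadable URL.
std::u16string sketchCompleteURL(const std::u16string& found) {
  assert(!found.empty());
  // Already has a scheme: return it unchanged.
  if (found.find(u"://") != std::u16string::npos ||
      found.compare(0, 7, u"mailto:") == 0) {
    return found;
  }
  // Bare e-mail address: make it a mailto: URL.
  if (found.find(u'@') != std::u16string::npos) {
    return u"mailto:" + found;
  }
  // Abbreviated web address: assume http.
  if (found.compare(0, 4, u"www.") == 0) {
    return u"http://" + found;
  }
  return found;  // nothing this sketch knows how to complete
}
```

With this split, a caller such as the spellchecker can use only the positions, while "load selection as URL" would use the completed string.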
-mozilla1.0 keyword. I guess I missed that target.
Keywords: mozilla1.0
Target Milestone: --- → mozilla1.5beta
Why wstring instead of nsIURI?
No hard reason. I think I could use nsIURI, but I'd have to change a number of internal function signatures.
conversion from nsIURI to wstring is lossy, which sometimes matters. prefer nsIURI whenever possible. if nothing else it helps avoid string copies :)
-> qawanted. This is interesting stuff, but until the ns and mz stuff is unified, I'm going to focus on other technologies, since my time is really limited right now. As I recall, the mail used mz, chatzilla used ns. I guess it doesn't matter what NIM uses anymore...
Keywords: qawanted
QA Contact: benc → nobody
Blocks: 227922
Bug 254913 (anti-phishing) is another potential user of this function.
No longer blocks: 227922
QA Contact: nobody → networking
This was fixed 2004-02-19 12:44 as part of bug 234936 and bug 172186.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Current API:

  /**
    @param a wide string to scan for the presence of a URL.
    @param aLength --> the length of the buffer to be scanned
    @param aPos --> the position in the buffer to start scanning for a url

    aStartPos --> index into the start of a url (-1 if no url found)
    aEndPos --> index of the last character in the url (-1 if no url found)
   */
  void findURLInPlaintext(in wstring text, in long aLength, in long aPos, out long aStartPos, out long aEndPos);

<http://mxr.mozilla.org/comm-central/source/mozilla/netwerk/streamconv/public/mozITXTToHTMLConv.idl#111> (m-c)
<http://mxr.mozilla.org/seamonkey/source/netwerk/streamconv/public/mozITXTToHTMLConv.idl#111> (CVS)
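The aPos parameter and the -1 results suggest a simple loop for collecting every URL in a buffer. The standalone C++ sketch below shows that calling pattern only. toyFindURLInPlaintext is a toy stand-in that mirrors the parameter shape of the IDL method but merely matches a literal "http://" up to the next space, and collectURLRanges is a hypothetical caller; neither is Mozilla code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in with the same parameter shape as the IDL method. It only
// recognizes "http://" followed by non-space characters.
void toyFindURLInPlaintext(const char16_t* text, int32_t aLength, int32_t aPos,
                           int32_t* aStartPos, int32_t* aEndPos) {
  assert(text && aStartPos && aEndPos && aLength >= 0 && aPos >= 0);
  *aStartPos = -1;  // -1 means "no url found"
  *aEndPos = -1;
  if (aPos >= aLength) return;
  const std::u16string view(text, static_cast<std::size_t>(aLength));
  const std::size_t pos = view.find(u"http://", static_cast<std::size_t>(aPos));
  if (pos == std::u16string::npos) return;
  std::size_t stop = view.find(u' ', pos);
  if (stop == std::u16string::npos) stop = view.size();
  *aStartPos = static_cast<int32_t>(pos);
  *aEndPos = static_cast<int32_t>(stop) - 1;  // last character, inclusive
}

// Collects (start, end) pairs by restarting the scan after each match.
std::vector<std::pair<int32_t, int32_t>> collectURLRanges(
    const std::u16string& text) {
  std::vector<std::pair<int32_t, int32_t>> ranges;
  const int32_t length = static_cast<int32_t>(text.size());
  int32_t pos = 0;
  while (pos < length) {
    int32_t start = -1;
    int32_t end = -1;
    toyFindURLInPlaintext(text.c_str(), length, pos, &start, &end);
    if (start < 0) break;  // no further URL
    ranges.emplace_back(start, end);
    pos = end + 1;  // end is inclusive, so resume one past it
  }
  return ranges;
}
```

Because aEndPos is the index of the last character, the next scan has to start at end + 1; resuming at end would rescan the final character of the previous match.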
Keywords: qawanted