Closed Bug 19251 Opened 21 years ago Closed 20 years ago

improve way to recognize URLs in messages

Categories

(MailNews Core :: MIME, enhancement, P1)

enhancement

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: warrensomebody, Assigned: BenB)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Attachments

(6 files)

The current mechanism for recognizing URLs in mail messages is hard coded to
only look for a few URL schemes. We need to make this extensible so that URLs
associated with all protocol plugins are recognized. For instance, jar: URLs
aren't recognized right now.

To do this, I think all we have to do is first detect something that looks like
a protocol scheme (e.g. "foo:") and then take the text up to the next
whitespace character and hand it to nsIOService::NewURI. If this successfully
constructs a URL, then we know that the protocol scheme does correspond to an
installed protocol plugin, and that the URL should be converted into an actual
link in the text.
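
A minimal sketch of the approach proposed above, for illustration only. The names FindLinkCandidates and UriValidator are hypothetical; the validator callback is a stand-in for handing the candidate to nsIOService::NewURI and checking whether a URL could be constructed. The sketch also assumes that the scheme starts a whitespace-delimited token, which the description does not require.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for the Necko check: returns true when a URI object
// could be constructed for the candidate (i.e. a protocol handler exists).
using UriValidator = std::function<bool(const std::string&)>;

// Collects every whitespace-delimited token that begins with something that
// looks like a scheme ("foo:") and that the validator accepts.
std::vector<std::string> FindLinkCandidates(const std::string& text,
                                            const UriValidator& isValidUri) {
  assert(static_cast<bool>(isValidUri));  // a validator callback is required
  std::vector<std::string> urls;
  std::size_t pos = 0;
  while (pos < text.size()) {
    // Skip whitespace between tokens.
    while (pos < text.size() &&
           std::isspace(static_cast<unsigned char>(text[pos]))) {
      ++pos;
    }
    // The token runs up to the next whitespace character.
    std::size_t end = pos;
    while (end < text.size() &&
           !std::isspace(static_cast<unsigned char>(text[end]))) {
      ++end;
    }
    const std::string token = text.substr(pos, end - pos);
    // Scheme test: one or more alphanumerics followed by ':'.
    std::size_t i = 0;
    while (i < token.size() &&
           std::isalnum(static_cast<unsigned char>(token[i]))) {
      ++i;
    }
    if (i > 0 && i < token.size() && token[i] == ':' && isValidUri(token)) {
      urls.push_back(token);
    }
    pos = end;
  }
  return urls;
}
```

Because the decision is delegated to the callback, a token such as "foo:bar" is only turned into a link if a handler for "foo:" is actually installed.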
Assignee: phil → rhp
Reassign to rhp
Ben had been working on this code (actually working on a rewrite of some of
these routines) so he would be the person to look at this.

- rhp
That's:

Ben Bucksch
http://www.bucksch.org
Assignee: rhp → mozilla
Status: ASSIGNED → NEW
Status: NEW → ASSIGNED
Accepting
Severity: normal → enhancement
Component: Front End → MIME
OS: other → All
Priority: P3 → P1
Target Milestone: M19 → M12
The most basic recognition functionality seems to work. Need to do more testing.

Not working: mailto, abbreviated URLs.
All the other functions in the class (ParseURL etc.) still don't use Necko and
need to be rewritten (by me).
Some description of my code:
It is mode-based: the modes are tested in sequence (the order is defined by a
const) and the first successful one wins.
Modes are the following (copied from source code):
         RFC1738,          /* Check, if RFC1738, APPENDIX compliant,
                              like <URL:http://www.mozilla.org>. */
         RFC2396,          /* RFC2396, APPENDIX E allows anglebrackets (like
                              <http://www.mozilla.org>) or quotation marks
                              (like "http://www.mozilla.org") (w/o "URL:"). */
         freetext          /* assume heading scheme
                              with "[a-zA-Z0-9]*:" like "news:".
                              Certain characters (see code) or any whitespace
                              (including linebreaks) end the URL.
                              Other certain (punctuation) characters (see code)
                              at the end are stripped off.
                           */
  /*     RFC1738 and RFC2396 type URLs may use multiple lines,
         whitespace is stripped. Special characters like ',' stay intact.*/
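
The mode sequence described above could be sketched roughly as follows. This is not the attached code: FindUrl, Delimited, FreeText, HasScheme and kModeOrder are hypothetical names, the set of stripped trailing punctuation is an assumption (the original only says "see code"), and each delimited mode only looks at the first opening delimiter. The real FindURL leaves the validity check to Necko afterwards; that step is omitted here.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

enum class Mode { RFC1738, RFC2396, freetext };

// Modes are tried in this order; the first one that yields a URL wins.
const Mode kModeOrder[] = {Mode::RFC1738, Mode::RFC2396, Mode::freetext};

static bool IsSpace(char c) {
  return std::isspace(static_cast<unsigned char>(c)) != 0;
}
static bool IsAlnum(char c) {
  return std::isalnum(static_cast<unsigned char>(c)) != 0;
}

// True if s begins with one or more alphanumerics followed by ':'.
static bool HasScheme(const std::string& s) {
  std::size_t i = 0;
  while (i < s.size() && IsAlnum(s[i])) ++i;
  return i > 0 && i < s.size() && s[i] == ':';
}

// Delimited modes: take the text between `open` and the next `close`,
// drop all whitespace (the URL may span lines), and require a scheme.
static std::string Delimited(const std::string& text, const std::string& open,
                             char close) {
  const std::size_t start = text.find(open);
  if (start == std::string::npos) return std::string();
  const std::size_t from = start + open.size();
  const std::size_t end = text.find(close, from);
  if (end == std::string::npos) return std::string();
  std::string url;
  for (std::size_t i = from; i < end; ++i) {
    if (!IsSpace(text[i])) url += text[i];
  }
  return HasScheme(url) ? url : std::string();
}

// Free-text mode: scheme, then everything up to the next whitespace,
// with trailing punctuation stripped.
static std::string FreeText(const std::string& text) {
  const std::string kTrailing = ".,;:!?";  // assumed set, not from the source
  for (std::size_t colon = text.find(':'); colon != std::string::npos;
       colon = text.find(':', colon + 1)) {
    std::size_t start = colon;
    while (start > 0 && IsAlnum(text[start - 1])) --start;
    std::size_t end = colon + 1;
    while (end < text.size() && !IsSpace(text[end])) ++end;
    while (end > colon + 1 &&
           kTrailing.find(text[end - 1]) != std::string::npos) {
      --end;
    }
    // Need at least one scheme character and something after the ':'.
    if (start < colon && end > colon + 1) {
      return text.substr(start, end - start);
    }
  }
  return std::string();
}

// Returns the first URL found, or an empty string if no mode succeeds.
std::string FindUrl(const std::string& text) {
  for (Mode mode : kModeOrder) {
    std::string url;
    switch (mode) {
      case Mode::RFC1738:
        url = Delimited(text, "<URL:", '>');
        break;
      case Mode::RFC2396:
        url = Delimited(text, "<", '>');
        if (url.empty()) url = Delimited(text, "\"", '"');
        break;
      case Mode::freetext:
        url = FreeText(text);
        break;
    }
    if (!url.empty()) return url;
  }
  return std::string();
}
```

Trying RFC1738 before RFC2396 matters: otherwise "<URL:http://www.mozilla.org>" would be matched by the plain angle-bracket rule and keep its "URL:" prefix.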
Sounds like you're saying that you wrote your own recognizer based on the
specs, but I'd rather see us use what necko has for consistency. That way if
the thing is highlighted, we'll be assured that we can handle it.

If necko's url parsing doesn't meet the specs specified, then we should fix it.
Warren,
all my function does is decide where the URL starts and ends. I don't know
of a Necko function doing this. After that is done, I leave it up to Necko
(NS_NewURI) to decide whether the result is valid or not.

I'll attach the current code (work in progress). If you still think it should
be moved to Necko, please provide me with the necessary background (knowledge)
and I'll integrate it.
I forgot to mention: function in question is FindURL.
Depends on: 19313
Blocks: 18410
Blocks: 5351
*** Bug 7176 has been marked as a duplicate of this bug. ***
Blocks: 19992
The patches/files create a new defunct stream converter with an XPCOM interface
and 3 (static) functions: ScanTXT, ScanHTML and CiteLevel. The latter is
currently unused, ScanTXT is used by mimetpla.cpp and mimetplf.cpp, ScanHTML by
nsMsgSendPart. I changed these functions to use the new class and removed the
old functions from nsMimeURLUtils.

I will ask Shaver whether the licence is OK.

The callers still need some work for I18N and perf checking to pass the right
modes to the functions, but most if not all points are marked with a XXX
comment.

rhp, can you please review the mime parts and the 3 functions?
valeski, can you please review the converter and its integration in Necko? Is
it OK that it registers for text/plain? If not, can you make it register with
the Factory, so libmime can access it? Tnx.
Typo: "pref(erences) checking", not "perf checking"
Everyone,
It probably makes sense for one person to land all of these changes. If you
want, I can step up and take that role. I will probably get this stuff ready to
rock over the weekend and look for a Monday landing.

If anyone objects, please let me know.

- rhp

PS: Warren: this will include the other changes we talked about today.
Note that the license for moz(I)TXTToHTMLConv is possibly invalid. I may
release it under a different licence (e.g. a modified MPL or a new BSD-style
license w/o ad restriction).
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
Ok gang, this is all checked in now. There seems to be an issue with the
emoticon detection that Ben is working on, but other than that, we seem to be
working.

- rhp
Judging from Ben's comment about just looking at where the url starts and ends
and then calling NS_NewURI, I'm happy. I haven't looked at the code though.

One thing I'd like to see though that I always considered broken in 4.x
releases: If a url is broken across a line, there should be heuristics that
recognize that fact, and pick up the rest of the url as the continuation, e.g.:

bla bla bla bla bla bla bla bla bla bla bla bla bla http://listings.ebay.com/aw/
listings/list/category1497/index.html bla bla bla bla bla bla

The recognizer should notice the "<text>*:" as the start of the url, then notice
that it ends with the newline, and then look on the next line for a string of
text containing slashes, dots, etc. and include it in the url string.
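
A rough sketch of that continuation heuristic, under stated assumptions: ExtendAcrossLineBreak and FirstToken are hypothetical names, the caller is assumed to have already established that the URL ran up to the newline, and "looks like a continuation" is reduced to "the first token of the next line contains a slash". A real heuristic would need more care to avoid swallowing ordinary text.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Returns the text from the start of `line` up to the first whitespace
// character. A line that begins with whitespace yields an empty token.
static std::string FirstToken(const std::string& line) {
  std::size_t end = 0;
  while (end < line.size() &&
         !std::isspace(static_cast<unsigned char>(line[end]))) {
    ++end;
  }
  return line.substr(0, end);
}

// `urlAtLineEnd` is a URL that ended exactly at a newline; `nextLine` is the
// line that follows it. If the next line starts with a path-like token
// (assumed here: it contains '/'), the token is appended to the URL.
std::string ExtendAcrossLineBreak(const std::string& urlAtLineEnd,
                                  const std::string& nextLine) {
  const std::string token = FirstToken(nextLine);
  if (token.find('/') == std::string::npos) {
    return urlAtLineEnd;  // nothing that looks like a continuation
  }
  return urlAtLineEnd + token;
}
```

With the example above, the first line ends in "http://listings.ebay.com/aw/" and the next line starts with "listings/list/category1497/index.html", so the two pieces are joined; a next line starting with "bla" leaves the URL unchanged.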
Warren, bug #5351 (dependent on this) addresses the linebreaks in URLs.
Blocks: 21564
No longer blocks: 21564
QA Contact: lchiang → esther
Can anyone give me some test URLs for this bug? I have verified that a mailto
link works OK (comment #9) and that a long URL with text in front of and behind
it is sent and received as a URL OK (comment #22). I'm not sure what all the
protocols mentioned in the original description are, so if someone can help
with this I would appreciate it.
Esther, this code is so old and so visible that you can fairly securely mark
this verified. (I won't do so, because I am the one who fixed it.)
Thanks Ben!
Verified.
Status: RESOLVED → VERIFIED
Product: MailNews → Core
Product: Core → MailNews Core