Closed Bug 182366 Opened 23 years ago Closed 16 years ago

Using machine learning to order autocomplete results

Categories: SeaMonkey :: Autocomplete
Type: defect
Priority: Not set
Severity: normal

Tracking: Not tracked
Status: RESOLVED FIXED

People

(Reporter: nisheeth_mozilla, Assigned: nisheeth_mozilla)

References

Details

Attachments

(4 files, 9 obsolete files)

The attached patch is a first cut at using machine learning techniques to better order URLs in the urlbar autocomplete dropdown. The current implementation uses a simple linear classifier (a perceptron) which is trained implicitly each time the user selects a URL from the autocomplete dropdown.

A set of numeric features (X1, X2, ... Xn) is calculated for each URL and fed to the perceptron along with the boolean value (B) of whether the URL was selected by the user. The perceptron maintains an array of numeric weights (W1, W2, ... Wn), one per input feature, and updates them as it sees training data, attempting to arrive at a set of weights such that:

  sigmoid( sum(i = 1 to n) of Wi * Xi ) = B,   where sigmoid(x) = 1 / (1 + exp(-x))

The goal is to train the perceptron with enough training tuples <X1, X2, ... Xn, B> that it converges on a set of weights which correctly outputs B for any new instance <X1, X2, ... Xn>. Once such a set of weights is reached, the perceptron can be used to predict whether a given URL will be selected by the user or not.

This is a very simple implementation and many ideas have been left out of this short description. I'll add more information here as questions are raised...
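For readers who prefer code to notation, here is a minimal, self-contained sketch of that idea: online gradient descent on a single sigmoid unit. The class and member names are placeholders for illustration, not the patch's actual nsSigmoidPerceptron, and the update rule shown is one standard way to realize the training described above rather than the patch's exact code.

#include <math.h>

class SigmoidPerceptron {
public:
  explicit SigmoidPerceptron(int aNumFeatures)
    : mNumWeights(aNumFeatures), mWeights(new double[aNumFeatures]) {
    for (int i = 0; i < mNumWeights; ++i)
      mWeights[i] = 0.0;
  }
  ~SigmoidPerceptron() { delete [] mWeights; }

  // Predicted probability of selection: sigmoid( sum(Wi * Xi) )
  double Test(const double* aInputs) const {
    double sum = 0.0;
    for (int i = 0; i < mNumWeights; ++i)
      sum += mWeights[i] * aInputs[i];
    return 1.0 / (1.0 + exp(-sum));
  }

  // One online training step toward the target output B (0 or 1).
  void Train(const double* aInputs, double aTargetOutput, double aLearnRate) {
    double output = Test(aInputs);
    double delta  = aTargetOutput - output;
    // output * (1 - output) is the derivative of the sigmoid at this point
    for (int i = 0; i < mNumWeights; ++i)
      mWeights[i] += aLearnRate * delta * output * (1.0 - output) * aInputs[i];
  }

private:
  int     mNumWeights;
  double* mWeights;
};

Each time the user picks (or ignores) a URL from the dropdown, Train() is called once per candidate URL with B = 1 for the selected URL and B = 0 for the rest; Test() can then be used to rank candidates by predicted selection probability.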
Attached patch Patch v0.1 (obsolete) — — Splinter Review
Version 0.1 of patch.
Comment on attachment 107650 [details] [diff] [review]
Patch v0.1

Heikki, please review the attached patch. Thanks!
Attachment #107650 - Flags: review?(heikki)
Adding Heikki to the cc list.
Joe Hewitt is the owner of urlbar autocomplete. CCing him as well. Joe, I'll get in touch with you to go over this patch interactively...
There is a patch in bug 180336 that should give you some pointers on how to replace the standard C++ file handling with Mozilla versions.
Heikki, I'm about to attach a revision of the patch that addresses the following comments from your past review:
- In FillInputFeatures, use NS_NewURL more for parsing the url.
- Class names, member variables, and function arguments should follow Mozilla naming conventions.
- File saving and loading should use nsIFile.
- Since the output format is so close to XML, make it XML.
- Fix tabbing.
Status: NEW → ASSIGNED
Attached patch Patch v0.2 (obsolete) — — Splinter Review
Fixes to first review from Heikki.
Attachment #107650 - Attachment is obsolete: true
Attached patch Patch v0.3 (obsolete) — — Splinter Review
IDs are now output to the data file to identify a URL in case the user doesn't want to output the url path.
Attachment #108111 - Attachment is obsolete: true
Attachment #108254 - Flags: review?(heikki)
Comment on attachment 108254 [details] [diff] [review]
Patch v0.3

You should try to test build these changes on all the major platforms at least, because you use more math etc. stuff than most of the code I've seen. Some math stuff we have in NSPR actually; I'll try to see if I can spot any which could be changed to NSPR.

>Index: autocomplete/resources/content/autocomplete.xml
>===================================================================

>Index: bookmarks/public/nsIBookmarksService.idl
>===================================================================
>+ void getAddDate(in string aBookmarkURL, out long long aAddDate);

Is there some Date data type that you could use instead? |long long| does not look like a date to me (although I know how it works, but still :). Also I'd like to see the javadoc-style comments for members in IDL files.

>Index: bookmarks/src/nsBookmarksService.cpp
>===================================================================

Will continue...
Comment on attachment 108254 [details] [diff] [review]
Patch v0.3

sr=me
Attachment #108254 - Flags: superreview+
> You should try to test build these changes on all the major platforms at least

I've started a checkout on Linux and I'm building on Windows.

> Is there some Date data type that you could use instead, |long long| does not
> look like a date to me

I agree it looks strange. I looked for an nsDate in xpcom/ds and the closest thing they have there is class nsTime, which uses the same PRTime (#defined to PRInt64) parameters that I am using. I couldn't find any usage of nsTime as a parameter in IDL files. I could use it in nsIBookmarksService.idl, but then I'd have to use the native qualifier and make the method unscriptable.

I did find usage of "long long" in IDL files to denote time (see the LXR snippet below). One way to fix the strangeness might be to rename the method from getAddDate() to getAddTime(). That's actually closer to what is really going on anyway... What do you think?

/xpfe/appshell/public/nsITimingService.idl, line 52 -- long long getElapsedTime(in string timername);
/xpfe/appshell/public/nsITimingService.idl, line 60 -- void setTimer(in string timername, in long long starttime);
/xpfe/components/download-manager/public/nsIDownloadManager.idl, line 80 -- in long long startTime,
/mailnews/base/public/nsIMsgGroupRecord.idl, line 49 -- in long long time,
/mailnews/base/public/nsIMsgGroupRecord.idl, line 61 -- in long long time,
/mailnews/base/public/nsIMsgGroupRecord.idl, line 85 -- readonly attribute long long addTime;

> Also I'd like to see the javadoc style comments for members in IDL files

Consider it done! :-)
Comment on attachment 108254 [details] [diff] [review]
Patch v0.3

Putting into approval queue as well... I've sent email out about this separately to drivers@mozilla.org.
Attachment #108254 - Flags: approval1.3a?
OK, the patch builds fine on Linux...
And it builds fine on the Mac OSX build as well... So, build wise we are all set for all the platforms built on the main tinderbox page...
Comment on attachment 108254 [details] [diff] [review]
Patch v0.3

>Index: history/src/nsGlobalHistory.cpp
>===================================================================
>+#define AC_NUM_URL_FEATURES 44

It would be nice to comment what these mean in practice:

>+const PRFloat64 LEARN_RATE = 0.5;
>+const PRFloat64 HISTORY_FAST_DECAY_CONSTANT = 0.2;
>+const PRFloat64 HISTORY_SLOW_DECAY_CONSTANT = 0.8;
>+const PRFloat64 BOOKMARK_FAST_DECAY_CONSTANT = 0.2;
>+const PRFloat64 BOOKMARK_SLOW_DECAY_CONSTANT = 0.8;

>+//----------------------------------------------------------------------
>+// Sigmoid perceptron definitions and implementation
>+// XXX The class definitions need to go in a separate header file
>+
>+class nsPerceptron

The comment talks only about the sigmoid perceptron but you have the plain one here as well, and it's first. Also you may want to change the name to tell which/what perceptron, but since it is the only one for now it doesn't matter that much.

>+ nsPerceptron(int aNumFeatures);

You might want to put the default constructor (without args) into the private members without an implementation, which would guarantee that this constructor will always be called or it won't compile.

>+ if (mNumWeights > 0 && mWeights != NULL)

We tend to group things explicitly, so add some parens.

>+ virtual Train(PRFloat64* aInputs, int aNumInputs, PRFloat64 aTargetOutput);
>+ virtual Test(PRFloat64* aInputs, int aNumInputs, PRFloat64* aOutput);
>+ int mNumWeights;

|PRInt32| instead of |int|? Same in nsSigmoidPerceptron.

>+static nsSigmoidPerceptron sAutoCompleteLearner(AC_NUM_URL_FEATURES);

Static objects are bad. Make this into a service or a normal member variable of the history/autocomplete service.

>+nsPerceptron::nsPerceptron(int aNumFeatures)
>+{
>+ mNumWeights = aNumFeatures;
>+ mWeights = NULL;
>+ if (mNumWeights > 0)
>+ {
>+ mWeights = new PRFloat64[aNumFeatures];
>+ }

Maybe you should have an Init method or something, because that |new| could fail and you can't inform the caller if you do this in the constructor. Also, set mNumWeights after the |new| succeeded; otherwise those two member variables are out of sync and you might cause a crash. Do not use |NULL|, use |nsnull| or |0|.

>+ double output = 0.0;

Strange, I don't seem to be able to find any PRDouble types. Maybe double is good then.
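A minimal sketch of the fallible Init() pattern suggested above. The names and exact layout here are placeholders for illustration, not the patch's code:

#include "nscore.h"
#include "nsError.h"
#include "prtypes.h"

class nsPerceptron {
public:
  nsPerceptron() : mNumWeights(0), mWeights(nsnull) {}
  ~nsPerceptron() { delete [] mWeights; }

  // Report allocation failure to the caller instead of hiding it in the
  // constructor; the caller can then disable learning gracefully.
  nsresult Init(PRInt32 aNumFeatures) {
    if (aNumFeatures <= 0)
      return NS_ERROR_INVALID_ARG;
    mWeights = new PRFloat64[aNumFeatures];
    if (!mWeights)
      return NS_ERROR_OUT_OF_MEMORY;
    mNumWeights = aNumFeatures;   // set only after the allocation succeeded
    return NS_OK;
  }

private:
  PRInt32    mNumWeights;
  PRFloat64* mWeights;
};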
Comment on attachment 108254 [details] [diff] [review]
Patch v0.3

>Index: history/src/nsGlobalHistory.cpp
>===================================================================
>+#define AC_NUM_URL_FEATURES 44

The way the features and this define are connected is quite loose, but I couldn't come up with a nice way to do this at the moment.

>+ for (int i = 0; i < mNumWeights; i++)

|int| to |PRInt32|; there seem to be more of these elsewhere as well. Also, you should be careful with initializing variables like that in the for loop; some compilers don't get the scope right, so if you use |i| later in the function it might have different values than you expect.

>+ nsresult rv = NS_GetSpecialDirectory(NS_APP_USER_PROFILE_50_DIR, getter_AddRefs(file));

Is that the correct profile dir nowadays? Didn't it change to ".../Profiles/..."?

>+ nsCOMPtr<nsILocalFile> local_file(do_QueryInterface(file));

We tend to use interCaps -> localFile.

>+ fscanf(from, "%lf", mWeights[i]);

I'm really suspicious of all file reads. Maybe this is good enough for now, though...

>+ local_file->AppendNative(NS_LITERAL_CSTRING("ac-weights.txt"));

Could you make that string a define or const or something, since you are using it in more than one location?

>+PRFloat64 nsSigmoidPerceptron::Sigmoid(PRFloat64 d)

We usually prepend 'a' to the interCaps parameter name. There are also other functions where you don't follow this convention.

>+ PRFloat64 m;

Btw, it seems PRFloat64 is a typedef for double, just so you know...

>+ nsAutoString url;
>+ CopyASCIItoUCS2(nsDependentCString(aURL), url);

The URL is UTF8. Use NS_ConvertUTF8toUCS2? This pattern is repeated later as well.

>+nsGlobalHistory::SetRowValue(nsIMdbRow *aRow, mdb_column aCol, const PRFloat64 aValue)

The last param doesn't really need to be const.

>+ if (err != 0) return NS_ERROR_FAILURE;

The convention is: if (err) return NS_ERROR_FAILURE;

>+ if (id == 0)

Either that or !id.

>+ mACFeatures = new PRFloat64[AC_NUM_URL_FEATURES];

Check if new failed and do something if it did.

>+ (*aResult)->AddRef();

Don't call AddRef directly, because that will confuse the refcount debugging tools that rely on the macros. Use NS_ADDREF().

>+// Feature 41: Is this a google search url?
>+// Feature 42: Is this a netscape search url?
>+// Feature 43: Is this a yahoo search url?

These URLs might obviously change in the future, so if these things were to be turned on indefinitely they would need to come from prefs.

>+ do_GetService("@mozilla.org/browser/bookmarks-service;1", &rv);

If there is a define for this progid you should use it.

>+ if (HasCell(mEnv, row, kToken_TypedColumn))
>+ aFeatures[2] = 1;
>+ else
>+ aFeatures[2] = 0;

For boolean settings you could simply do:

aFeatures[2] = HasCell(mEnv, row, kToken_TypedColumn);

>+ nsCAutoString curl;
>+ NS_NewURI(getter_AddRefs(uri), aUrl);
>+ uri->GetSpec(curl);

Check if you got the URI object before using it. Also, now that you have the URI object, don't use the spec for everything!

>+ nsAutoString url(NS_ConvertASCIItoUCS2(curl).get());
>+ ToLowerCase(url);

The URL is not ASCII, it is UTF8.

>+ if (FindInReadable(NS_LITERAL_STRING(".htm"), start, end))

uri->GetPath() and then look for that.

>+ if (FindInReadable(NS_LITERAL_STRING(".com"), start, end))
>+ if (FindInReadable(NS_LITERAL_STRING(".edu"), start, end))
>+ if (FindInReadable(NS_LITERAL_STRING(".org"), start, end))
>+ if (FindInReadable(NS_LITERAL_STRING(".net"), start, end))
>+ if (FindInReadable(NS_LITERAL_STRING(".gov"), start, end))

uri->GetHost() and then look for those.

>+ if (FindInReadable(NS_LITERAL_STRING("~"), start, end))

Look at the path.

>+ // Feature 18: Does the URL end in a two letter country code?
>+ if (url.RFindChar('.') == ((url.Length() - 1) - 2))

Look at the host.

>+ PRUint32 size, i;
>+ for ( ; start != end; start.advance(size))
>+ {
>+ const PRUnichar* buf = start.get();
>+ size = start.size_forward();

Uh, that looks needlessly complicated. Wouldn't you be able to do that in a single loop, starting with:

for ( ; start != end; ++start)

>+ fprintf(mURLDataFile, " path='%s'", NS_LossyConvertUCS2toASCII(aURL).get());

Use NS_ConvertUCS2toUTF8?

> nsGlobalHistory::OnAutoComplete(const PRUnichar *searchString,
> nsIAutoCompleteResults *previousSearchResult,
> nsIAutoCompleteListener *listener)
> {
>- return NS_OK;
>+ nsCOMPtr<nsISupportsArray> results;
>+ nsCOMPtr<nsIAutoCompleteItem> item;
>+ PRUint32 count;
>+ nsAutoString value;
>+ PRBool found = PR_FALSE;
>+ PRUint32 i;
>+ nsresult rv = NS_OK;
>+
>+ if (mLearningMode == ACL_NONE && mDataCaptureMode == UDC_NONE)
>+ return rv;

Declare the variables as close to the place where you use them.

>+ previousSearchResult->GetItems(getter_AddRefs(results));
>+ results->Count(&count);

Shouldn't you check if results exists before using it? There are other places with this same function call.

NS_LossyConvertUCS2toASCII(searchString).get());

UTF8.

>+ PRUnichar *comment;
...
>+ aItem->GetComment(&comment);
...
>+ nsMemory::Free(comment);

You might want to use nsXPIDLString, which would take care of the memory management.

>Index: history/src/nsGlobalHistory.h
>===================================================================

Done. Fix those and r=heikki. Please note that many of the things I commented on are fairly generic, so check for that same pattern elsewhere.
Attachment #108254 - Flags: review?(heikki) → review+
Attached patch Patch v0.4 incorporates Heikki's comments (obsolete) — — Splinter Review
Here's a new patch with the following changes:
- Javadoc comments for the new method in the bookmark service IDL file.
- Renamed the method to getAddTime and its parameter to aAddTime.
- sAutoCompleteLearner isn't a global static any more. It's now mAutoCompleteLearner, a member of nsGlobalHistory.
- Commented the constants in nsGlobalHistory.cpp.
- Added a private default constructor to nsPerceptron and nsSigmoidPerceptron.
- Replaced int with PRInt32.

>Maybe you should have Init method or something, because that |new| could fail
>and you can't inform the caller if you do this in the constructor.

This is something I'd like to do once and for all when I move nsPerceptron out into separate IDL, .h, and .cpp files. Then I'll have return error values for each function. Right now, since the perceptron will be off by default anyway, please let me check in without handling the edge case of running out of memory.

>Also, set the mNumWeights after the |new| succeeded, otherwise those two member
>variables are out of sync and you might cause crash. Do not use |NULL|, use
>|nsnull| or |0|.

Done!
Attachment #108254 - Attachment is obsolete: true
Attachment #108415 - Flags: superreview?(hewitt)
Attachment #108415 - Flags: review?(heikki)
Attachment #108415 - Flags: approval1.3a?
Attachment #108415 - Flags: superreview?(hewitt)
Attachment #108415 - Flags: review?(heikki)
Attachment #108415 - Flags: approval1.3a?
>Right now, since the perceptron will be off by default anyway, please let me
>check in without handling the edge case of running out of memory.

OK, but at least add an XXX comment saying we need to check for out-of-memory and move this to an Init() method.
I'm about to attach a patch that addresses everything in your comments above except for the following points, discussed below:

> Is that the correct profile dir nowadays? Didn't it change to
> ".../Profiles/..."?

I've seen the same usage in other places, and I've tested it and made sure that the file is created in the right directory...

>>+ PRUint32 size, i;
>>+ for ( ; start != end; start.advance(size))
>>+ {
>>+ const PRUnichar* buf = start.get();
>>+ size = start.size_forward();
>
> Uh, that looks needlessly complicated. Wouldn't you be able to do that in a
> single loop, starting with:
>
> for ( ; start != end; ++start)

Actually, the Mozilla string guide recommends the more complicated code because it is more performant. See http://www.mozilla.org/projects/xpcom/string-guide.html#Looping_Iterators.
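For reference, here is a small self-contained sketch of the fragment-at-a-time loop that guide describes. The character being counted ('/') is just an illustration, not one of the patch's actual features:

#include "nsString.h"

PRUint32 CountSlashes(const nsAString& aString)
{
  nsAString::const_iterator start, end;
  aString.BeginReading(start);
  aString.EndReading(end);

  PRUint32 count = 0;
  while (start != end) {
    const PRUnichar* buf = start.get();     // current underlying fragment
    PRUint32 size = start.size_forward();   // characters left in this fragment
    for (PRUint32 i = 0; i < size; ++i) {
      if (buf[i] == PRUnichar('/'))
        ++count;
    }
    start.advance(size);                    // hop to the next fragment
  }
  return count;
}

The per-fragment form touches the underlying buffers directly instead of paying the iterator bookkeeping cost on every character, which is the performance argument the string guide makes.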
This patch is what I will check into 1.3 alpha...
Attachment #108415 - Attachment is obsolete: true
Attachment #108254 - Flags: approval1.3a?
Comment on attachment 108426 [details] [diff] [review]
Patch v.05 incorporates 2nd batch of comments from Heikki

Moving review and superreview forward and adding a=asa for checkin to 1.3a.
Attachment #108426 - Flags: superreview+
Attachment #108426 - Flags: review+
Attachment #108426 - Flags: approval1.3a+
I just checked in Patch 108426 (v 0.5). Keeping this bug open because more work needs to happen here...
OS: Windows 2000 → All
Hardware: PC → All
So... is this actually enabled? Based on the checkins that were required to get this compiling (including rev 1.67 of nsGlobalHistory.cpp especially), I have to wonder how much testing this received... (the code rev 1.67 changed could not possibly have worked properly).
I'm sorry, but I backed out the checkin (after fruitless attempts to locate nisheeth last night and earlier today). This checkin was responsible for the crash on shutdown on the tinderboxes running the bloat and leak tests.

In a debug build on Linux, I saw the following on shutdown with this patch in my tree (prior to the crash):

###!!! ASSERTION: can't get xpt search path!: 'Error', file /home/bzbarsky/mozilla/profile/mozilla/xpcom/reflect/xptinfo/src/xptiInterfaceInfoManager.cpp, line 67
###!!! ASSERTION: null monitor: 'mMonitor', file ../../dist/include/xpcom/nsAutoLock.h, line 240
###!!! ASSERTION: null monitor: 'mMonitor', file ../../dist/include/xpcom/nsAutoLock.h, line 248
###!!! ASSERTION: Expected only one call!: '!gCallCount++', file /home/bzbarsky/mozilla/profile/mozilla/xpcom/reflect/xptinfo/src/xptiInterfaceInfoManager.cpp, line 232

These were repeated 5 or 6 times before the actual crash. Oh, and there was a startup assertion:

###!!! ASSERTION: nsDependentCString must wrap only null-terminated strings: '!*aEndPtr', file ../../../../dist/include/string/nsDependentString.h, line 160
Break: at file ../../../../dist/include/string/nsDependentString.h, line 160

Finally, please fix the perceptron constructor to _always_ initialize mNumWeights... the current code will do hideous things if '0' is passed to the constructor and then Train is called.
Comment on attachment 108426 [details] [diff] [review]
Patch v.05 incorporates 2nd batch of comments from Heikki

Please bear with me, this is the first time I've paid any attention to this bug. First, comments on comments:

>> + nsresult rv = NS_GetSpecialDirectory(NS_APP_USER_PROFILE_50_DIR, getter_AddRefs(file));
> Is that the correct profile dir nowadays?

No. That class set is going away (for more info see bug 156733); the correct API is to use the directory service. Sample code can be found all over the place. (There's currently a broken consumer in the xp filepicker, but there will be a correct patch in bug 125324 which you could work from.)

Nit #1: please don't check in trailing whitespace.

>+// Autocomplete learning modes
>+#define ACL_NONE 0

Please rename the ACL things to anything but ACL; ACL should mean Access Control List. AUTOCOMPLETE_NO_LEARNING, AUTOCOMPLETE_ENABLE_TRAINING, AUTOCOMPLETE_AFFECT_URL_LIST seem perfectly reasonable.

>+#define ACL_ENABLE_TRAINING 1
>+#define ACL_AFFECT_URL_LIST 2
>+#define UDC_NONE 0
>+#define UDC_WITHOUT_URL_INFO 1
>+#define UDC_WITH_URL_INFO 2

I can't remember what UDC stands for; please replace it with a longer string that is meaningful. Why not use consts, or perhaps IDL consts?

>+const PRInt32 AC_NUM_URL_FEATURES = 44;

Yuck. Consts are kInitialCaps. 44 appears to be a magic number; if it is, then please decompose it or comment it.

>+// XXX The implementations need to be moved out to their own .cpp file

This stuff should have been checked in *very* early in the alpha cycle, not during the freeze at the end.

>+// XXX This should move to an Init method so that error handling can happen.

This isn't acceptable.

>+nsPerceptron::nsPerceptron(PRInt32 aNumFeatures)
>+{

Why take a signed int?

>+ mWeights = nsnull;
>+ if (aNumFeatures > 0)
>+ {
>+ mWeights = new PRFloat64[aNumFeatures];
>+ }

>+nsPerceptron::Train(PRFloat64* aInputs, PRInt32 aNumInputs, PRFloat64 aTargetOutput)
>+{
>+ double output = 0.0;
>+ double delta = 0.0;

You just switched from NSPR types to compiler types...

>+nsPerceptron::LoadWeights()
>+{
>+ nsCOMPtr<nsIFile> file;
>+ FILE* from = 0;
>+ nsresult rv = NS_GetSpecialDirectory(NS_APP_USER_PROFILE_50_DIR, getter_AddRefs(file));

Again, this should be replaced with calls to the directory service.

>+ localFile->OpenANSIFileDesc("r", &from);

I think we prefer for people to use OpenNSPRFileDesc (and I'll note that the example patch heikki pointed to *did* use the NSPRFileDesc method) and the PR file APIs.

>- return PR_TRUE;
>+ return PR_TRUE;

You added trailing whitespace here -- I won't repeat the nit, but I do want to point out that it means blame was transferred a few times for something that never should have changed.

>+ fflush(mURLDataFile);

Flushing on the main thread is really mean when it's possible the file might live on an AFS volume (although I suppose a decent AFS would try to cache; I'm not certain NFS can, since the idea of flush is to guarantee success).

>+ nsresult rv = NS_OK;
>+ mURLID = PR_Now();

This should be a do {} while() loop.

>+ while (NS_SUCCEEDED(rv))
>+ {
>+ rv = FindRow(kToken_URLIDColumn, ++mURLID, getter_AddRefs(oldRow));
>+ }

>+ mACFeatures = NULL;

heikki's comments said not to use NULL.

>+ nsresult rv;
>+ rv = FindRow(aCol, aValue, aResult);

Early return is preferred, i.e.:

if (NS_FAILED(rv))
  return rv;

>+ if (NS_SUCCEEDED(rv))
>+ {
>+ rv = GetRowValue(*aResult, kToken_URLIDColumn, aRowID);
>+ }
>+ return rv;
>+}

This stuff would make a great javadoc comment:

/**
 * javadoc comments
 * ...
 */

>+// The input features into the autocomplete perceptron are as follows:
>+//
>+// Features 1 = Frequency and recency metric for page in history
>+// (domain = positive real numbers)
>+// Value decays fast with age of page
>+// Uses HISTORY_FAST_DECAY_CONSTANT

>+ // Now, calculate a bunch of url related features
>+ nsCOMPtr<nsIURI> uri;
>+ nsCAutoString curl, chost, cpath;
>+ rv = NS_NewURI(getter_AddRefs(uri), aUrl);
>+ if (NS_SUCCEEDED(rv) && uri)
>+ {
>+ uri->GetSpec(curl);
>+ uri->GetHost(chost);
>+ uri->GetPath(cpath);
>+ }

This is very very very wasteful and silly.

>+ nsAutoString url(NS_ConvertUTF8toUCS2(curl).get());
>+ nsAutoString path(NS_ConvertUTF8toUCS2(cpath).get());
>+ nsAutoString host(NS_ConvertUTF8toUCS2(chost).get());

You just converted things from char* to PRUnichar*...

>+ ToLowerCase(url);
>+ ToLowerCase(host);
>+ ToLowerCase(path);

...and now you're lowercasing them...

>+ // Feature 6: Whether url ends in .htm or .html
>+ nsAString::const_iterator start, end;
>+
>+ path.BeginReading(start);
>+ path.EndReading(end);

...and now you're comparing them with things which by all rights should be char*.

>+ aFeatures[5] = FindInReadable(NS_LITERAL_STRING(".htm"), start, end);

Please please please save on conversions. Don't promote to UCS2 unless you have a reason. In fact, you shouldn't be using FindInReadable; you should be comparing the last n chars with the strings.

>+ // Feature 7: Is it a .com URL?
>+ host.BeginReading(start);
>+ host.EndReading(end);
>+ aFeatures[6] = FindInReadable(NS_LITERAL_STRING(".com"), start, end);
>+
>+ // Feature 8: Is it a .edu URL?
>+ host.BeginReading(start);
>+ host.EndReading(end);
>+ aFeatures[7] = FindInReadable(NS_LITERAL_STRING(".edu"), start, end);
>+
>+ // Feature 9: Is it a .org URL?
>+ host.BeginReading(start);
>+ host.EndReading(end);
>+ aFeatures[8] = FindInReadable(NS_LITERAL_STRING(".org"), start, end);
>+
>+ // Feature 10: Is it a .net URL?
>+ host.BeginReading(start);
>+ host.EndReading(end);
>+ aFeatures[9] = FindInReadable(NS_LITERAL_STRING(".net"), start, end);
>+
>+ // Feature 11: Is it a .gov URL?
>+ host.BeginReading(start);
>+ host.EndReading(end);
>+ aFeatures[10] = FindInReadable(NS_LITERAL_STRING(".gov"), start, end);
>+

This is the first place that actually cares about a non-terminal:

>+ // Feature 12: Does the URL contain a ~ ?
>+ path.BeginReading(start);
>+ path.EndReading(end);
>+ aFeatures[11] = FindInReadable(NS_LITERAL_STRING("~"), start, end);

The next few things obviously care about beginning chars only. However, I wonder if there's some other way to do it; this doesn't scale well at all.

>+ // Feature 13: Does the URL start with http:// ?
>+ PRBool isScheme;
>+ aFeatures[12] = aFeatures[13] = aFeatures[14] = aFeatures[15] = aFeatures[16] = 0;
>+ if (NS_SUCCEEDED(uri->SchemeIs("http", &isScheme)))
>+ {
>+ aFeatures[12] = isScheme;
>+ }
...
>+
>+ // Feature 18: Does the URL end in a two letter country code?

Your comment's wrong. It's not the URL, it's the host.

>+ aFeatures[17] = (host.RFindChar('.') == ((host.Length() - 1) - 2));

Instead of doing a search which could cost O(N) for something like:

http://myververyverydshjfhsdkjfhsdakjlfhsalkjfhlsdjkahflkjsadhflkjsdhfljaksdfhlsdkjfhaslkdfhlonglocaldomain

also consider that the following URL should be considered valid:

http://www.state.ca.us./state/portal/myca_homepage.jsp

Ignoring that fact (because it isn't my job to write your code), the pseudo code to efficiently check for the case you cared about would be something like:

x = host.Length();
(x > 3 && host[x-3] == '.' && host[x-2] != '.' && host[x-1] != '.')

(There's a very simple way to handle the trailing '.'.)

I'm not sure if it makes sense to do the next counting using PRUnichar... (said counting snipped)

>+ // Calculate a bunch of hostname related features.
>+ // Feature 33: Number of .s in the hostname if we omit initial "www." or "ftp." (if any)
>+ // Feature 35: Number of .s in the hostname if we omit ending initial "www." or "ftp"

This comment looks wrong.

>+ if (chost.Find("www.") == 0 || chost.Find("ftp.") == 0)

Again, *please* don't use Find when you want to know if a specific set of characters match. Use Substring+Equals or strcmp.

>+ if (chost.RFindChar('.') == ((chost.Length() - 1) - 2))

Again, the same complaint about the tail... and isn't there a boolean lying around that you could check instead of recalculating? If not, then this stuff just isn't local to similar stuff... you should group related comparisons and share conditions.

>+ // Feature 41: Is this a google search url?

http://www.google.com/mac?q=is+a+google+search+url and I use it. XXX fix your code :)

>+ aFeatures[40] = FindInReadable(NS_LITERAL_STRING("http://www.google.com/search"), start, end);

XXX Find is bad.

>+ // Feature 43: Is this a yahoo search url?

There have to be other yahoo search URLs...

>+ aFeatures[42] = FindInReadable(NS_LITERAL_STRING("http://search.yahoo.com/bin/search"), start, end);

>+ // If searchString found in the previous search results, assume
>+ // that the user selected that url (searchString) from the previous
>+ // list of autocomplete results.

I wonder if you could ask the search service instead of relying on the evil hard coding you have...

>+ fprintf(mURLDataFile, " url='%s'", NS_ConvertUCS2toUTF8(searchString));

You just converted to char*, eek. No, you just converted to char* once.

>+ if (NS_SUCCEEDED(FindRowAndID(kToken_URLColumn,
>+ NS_ConvertUCS2toUTF8(searchString).get(),
>+ getter_AddRefs(row), &rowID)))

And you just did it again. 1. I don't know if you ever needed PRUnichar*, but if you did, please save a conversion by not doing it again.

--SX MARK i'll be back

>+ {
>+ if (rowID == 0)
>+ {
>+ AssignUniqueURLID(row);
>+ rowID = mURLID;
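To make that tail check concrete, here is a small standalone sketch (illustrative only, not code from the patch, and the helper name is made up) of a constant-cost check for a two-character final label, including the trailing-dot case mentioned above:

#include "plstr.h"
#include "prtypes.h"

// Does the host end in a two-character label (e.g. ".us"),
// tolerating a single trailing '.' as in "www.state.ca.us." ?
static PRBool EndsInTwoLetterLabel(const char* aHost)
{
  PRUint32 len = PL_strlen(aHost);
  if (len && aHost[len - 1] == '.')   // ignore one trailing dot
    --len;
  return len > 3 &&
         aHost[len - 3] == '.' &&
         aHost[len - 2] != '.' &&
         aHost[len - 1] != '.';
}

Like the pseudo code, this only looks at the last few characters of the host, so its cost does not grow with the length of the hostname.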
Attachment #108426 - Flags: review-
Attachment #108426 - Flags: review+
Attachment #108426 - Flags: approval1.3a+
Blocks: 183901
timeless, thanks for your comments. I'm about to attach a patch that addresses most of your concerns. First, some general replies related to your comments:

Your points about optimizing FillInputFeatures() make sense. But all the code in FillInputFeatures is only executed when a pref is turned on by the users that will participate in data collection. Also, all the code in FillInputFeatures will be discarded once we've collected the data. So I don't want to spend more time optimizing that code. Even though a lot of extra work is being done, there is no perceptible delay on my 750 MHz laptop. I'm hoping that users who participate in the data collection will have equal or better machines and will also not feel a lag in performance. I promise that if we decide to keep the code around or enable it by default, I will performance-optimize it to everyone's satisfaction.

I know that the perceptron code also needs more work. As with the data capture code, a perceptron is only created and used if a pref is set by the user. Please ignore the XXX comments, as they will be removed when the perceptron becomes its own component.

More specific replies to your comments follow:

>>> + nsresult rv = NS_GetSpecialDirectory(NS_APP_USER_PROFILE_50_DIR, getter_AddRefs(file));
>> Is that the correct profile dir nowadays?
> no. that class set is going away (for more info see bug 156733) the correct api
> is to use to directoryservice. sample code can be found all over the place.
> (there's currently a broken consumer in the xp filepicker, but there will be a
> correct patch in bug 125324 which you could work from).

I double-checked with Conrad Carlen about the above usage and he was fine with both NS_GetSpecialDirectory and NS_APP_USER_PROFILE_50_DIR. This code is temporary anyway, so it doesn't really matter.

> nit#1, please don't check in trailing whitespace

Agreed! I will exorcise these before checking in.

>>+// Autocomplete learning modes
>>+#define ACL_NONE 0
> please rename the ACL things to anything but ACL. acl should mean Access
> Control List. AUTOCOMPLETE_NO_LEARNING, AUTOCOMPLETE_ENABLE_TRAINING,
> AUTOCOMPLETE_AFFECT_URL_LIST seem perfectly reasonable.
>>+#define ACL_ENABLE_TRAINING 1
>>+#define ACL_AFFECT_URL_LIST 2
>>+#define UDC_NONE 0
>>+#define UDC_WITHOUT_URL_INFO 1
>>+#define UDC_WITH_URL_INFO 2
> I can't remember what UDC stands for, please replace it with a longer string
> that is meaningful.

Done! I've replaced them with more descriptive names.

> why not use consts, or perhaps idl consts?

I prefer #defines mostly because I want to use all-caps names without prefixing a "k". I don't want to use IDL constants because these aren't public constants intended for consumers other than nsGlobalHistory.

>>+ nsresult rv = NS_OK;
>>+ mURLID = PR_Now();
>>+ while (NS_SUCCEEDED(rv))
>>+ {
>>+ rv = FindRow(kToken_URLIDColumn, ++mURLID, getter_AddRefs(oldRow));
>>+ }
>
> this should be a do {}while() loop.

Not necessarily, because rv is initialized to NS_OK in the beginning, so NS_SUCCEEDED(rv) will always return TRUE and the loop will get executed at least once. do {} while() has the same semantics. I've changed this to a do {} while nevertheless, because we do save a condition test with the do-while...

>>+nsPerceptron::Train(PRFloat64* aInputs, PRInt32 aNumInputs, PRFloat64 aTargetOutput)
>>+{
>>+ double output = 0.0;
>>+ double delta = 0.0;
> you just switched from nspr types to compiler types...

I tried to use PRFloat64 but ran into weird behavior (don't remember exactly, calculations weren't happening properly or something) in my debugger. Switching to native types fixed the problem and I didn't explore it further. I'll fix this when I make the perceptron a proper component.

>>+ localFile->OpenANSIFileDesc("r", &from);
> I think we prefer for people to use OpenNSPRFileDesc (and i'll note that the
> example patch heikki pointed to *did* use the NSPRFileDesc method) and the
> prfile apis

The OpenANSIFileDesc method fit better with the code that I'd already written. All the data capture code is going to get thrown away once we've gotten data back from the users. Please ignore this nit for now.

>>+ fflush(mURLDataFile);
> flushing on the main thread is really mean when it's possible the file might
> live on an AFS volume (although i suppose a decent AFS would try to cache i'm
> not certain NFS can since the idea of flush is to guarantee success).

This flushing only happens when the user enables a pref for data capture. Users who participate in data collection can turn off the pref if they experience a significant slowdown. This code will go away after we've gotten data back from some users.

>>+ mACFeatures = NULL;
> heikki's comments said not to use NULL.

Fixed. I missed this one in my last patch.
Boris, thanks a lot for your comments. I've fixed the shutdown crash. It wasn't a problem with my code but was unmasked by it. I've also verified that my code doesn't cause new debug assertions to fire. >Finally, please fix the perceptron constructor to _always_ initialize >mNumWeights... the current code will do hideous things if '0' is passed to the >constructor and then Train is called. Done!
I've been biting my tongue on this comment, but I can't hold myself back any longer. I have to ask: is this amazing over-engineering worth the trouble? Can it really offer appreciably better results than the existing frequency-weighted order? (I find the existing autocomplete dropdown ordering to be thoroughly excellent.)

There are probably plenty of places in Mozilla where machine learning can be applied fruitfully. If the goal of this bug is to prove the mechanisms before applying them to bigger matters, then that sounds quite fair to me. Otherwise this seems like a large effort targeting a nonexistent problem.
Adam, you are exactly right. The long-term goal is to expand the use of machine learning to other parts of Mozilla. Autocomplete is a small baby step. I don't know whether we'll end up with better results than the existing mechanism. We might not. But I'm treating this as a research project. Once we have collected the data, we'll see if we can come up with a learning algorithm that performs better than the current algorithm.

Another goal is to get folks in the machine learning community to take notice of Mozilla. Hopefully, as we spread the word about small machine learning projects like this one, we will spark more ambitious ideas in the heads of machine learning gurus in academia. If even a few university professors start working on implementing their ideas on top of Mozilla, we will all benefit.

If you have ideas about where we can apply machine learning within Mozilla, let's start discussing them in the Mozilla newsgroups. Some ideas we've had till now are:

- Currently, Mozilla does LINK attribute based prefetching of web pages. Can we use a user's past surfing behavior to predict what to prefetch?
- Can we algorithmically identify ad images on a web page and block them out?
- How do the activities of email, instant messaging, newsgroup reading, and web browsing intersect and overlap with each other? Can we apply machine learning techniques to learn a user's habits and use that knowledge to filter out unnecessary information and reduce information overload as they conduct these activities? The Bayesian spam filter recently added to Mozilla is a big step in this direction.
- We run security hole scanners on the Mozilla codebase today and get a lot of false positives. Can we apply machine learning to automatically flag these mistaken errors and save developer time?
- Is it possible to identify security holes in the Mozilla codebase more intelligently than grep-like scanning by using artificial intelligence techniques?

Some of these ideas are hard research problems that are unlikely to have clear solutions. Exploring these ideas might not lead to shippable code. But we should still take a shot at it for the sake of learning more about difficult problems. If we come up with an interesting insight, we will be able to apply it not only to Mozilla but to other applications as well.
Some more replies to comments from timeless:

>>+ if (chost.RFindChar('.') == ((chost.Length() - 1) - 2))
> again, the same complaint about the tail... and isn't there a boolean lying
> around that you could check instead of recalculating? if not then this stuff
> just isn't local to similar stuff... you should group related comparisons and
> share conditions.

Good point. I'm now re-using the earlier boolean value.

> also consider that the following url should be considered valid:
> http://www.state.ca.us./state/portal/myca_homepage.jsp

I've added a check for this case in FillInputFeatures, although I doubt if there are many urls on the web with a trailing period in their hostname.
This patch contains the stuff I said I'd do in comments 26, 27 and 30. It also includes the following:

- We now get one instead of three autocomplete calls for each autocomplete selection.
- Fixed debug asserts that fired on shutdown in nsXPTIInterfaceManager.
- We now output the referrer url for each visited url to the data file.
- We now output Mozilla startup and exit times to the data file.
- Removed mURLID from nsGlobalHistory. It was unnecessary.
- nsGlobalHistory::HidePage outputs to the data file. We need this to weed out (i)frame loads within a page.
- Made all the changes needed to build the patch on all platforms when it was first checked in.
- Made the comment before FillInputFeatures a javadoc comment.
- With rjc@netscape.com's help, reworked code so that we no longer need to add the GetAddTime method to nsIBookmarksService. Now, pretty much all the code in this patch is in nsGlobalHistory.h/.cpp and a few lines in nsDocShell.h/.cpp.
Attachment #108426 - Attachment is obsolete: true
Attachment #110621 - Flags: review?(heikki)
Comment on attachment 110621 [details] [diff] [review]
Patch v0.6: Incorporates timeless and boris' comments

Sorry to be anal here, but in nsGlobalHistory.cpp, could you use this style:

  if (foo) {
    ...
  }
  else {
    ...
  }

rather than

  if (foo)
  {
    ...
  }
  else
  {
    ...
  }

I know the latter is prevalent in docshell (I really wish that mozilla.org would adopt a standard style that everyone should follow. Having source with multiple styles is worse IMO than reading a file that is not your preferred style), but nsGlobalHistory.cpp uses the former exclusively (except for one-liners, which have no braces) and "when-in-Rome" should apply here.
- Got rid of gcc warnings.
- Numeric url features are now output with 2 decimal places rather than 6.
Attachment #110621 - Attachment is obsolete: true
Attachment #110803 - Flags: review?(heikki)
Chris, I will fix the if-else syntax to match the rest of the file when I update the patch after heikki's review. I agree that the when-in-rome rule applies here. I've also compiled and run with my patch on Mac Classic without any problems.
Comment on attachment 110803 [details] [diff] [review]
Patch v0.7: Created after compile and test cycle on Linux.

>Index: history/public/nsIBrowserHistory.idl
>===================================================================
>+ /**
>+ * setReferrerURL
>+ * Sets the referrer url for a url in global history

I am concerned about this, even though our implementation does not do what I am really concerned about (namely, sending the referrer on URLs that the user goes to from global history, because that would be a privacy/security issue). It might be sufficient to add notes/warnings to the IDL and implementation that the referrer will/must not be sent when traversing to the URL.

(But in that light it begs the question: why is there a method for the referrer when it must not be used for its "intended" purpose? Maybe you need to add some verbiage to the IDL comment saying something about it being used for internal purposes or something like that. Am I being too picky?)

>Index: history/src/nsGlobalHistory.cpp
>===================================================================
>+ if (id == 0)

Why not:

if (!id)

This is a general pattern. If you find yourself comparing something to 0/null, you should change it to |if (foo)| or |if (!foo)|.

>+ // Sanity check the URL
>+ PRInt32 len = PL_strlen(aURL);
>+ NS_ASSERTION(len != 0, "no URL");
>+ if (! len)
>+ return NS_ERROR_INVALID_ARG;

That could simply be written as:

if (!*aURL)
  return NS_ERROR_INVALID_ARG;

>+ // XXX Handle out of memory condition
>+ mAutoCompleteLearner = new nsSigmoidPerceptron(AC_NUM_URL_FEATURES);
>+ mACFeatures = new PRFloat64[AC_NUM_URL_FEATURES];
>+
>+ if (!mAutoCompleteLearner || !mACFeatures)
>+ {
>+ // Disable learning
>+ mLearningMode = AUTOCOMPLETE_NO_LEARNING;
>+ mAutoCompleteLearner = nsnull;
>+ mACFeatures = nsnull;
>+ }

This is sort of handling OOM conditions already, but doing it in a buggy way. If mACFeatures failed, then we will leak mAutoCompleteLearner, and vice versa. Please fix this.

>- (*aResult)->AddRef();
>+ NS_ADDREF(*aResult);

If you see anyone calling Release() directly, fix it with NS_RELEASE at the same time. Otherwise this change will show up as a new "leak".

>+nsGlobalHistory::FindRow(mdb_column aCol,
>+ err = mStore->FindRow(mEnv, kToken_HistoryRowScope,
>+ aCol, &yarn,
>+ &rowId, getter_AddRefs(row));

You are not doing anything with |err| after this. Please fix.

>+ nsCAutoString curl, chost, cpath;
>+ rv = NS_NewURI(getter_AddRefs(uri), aUrl);
>+ if (NS_SUCCEEDED(rv) && uri)
>+ {
>+ uri->GetSpec(curl);
>+ uri->GetHost(chost);
>+ uri->GetPath(cpath);
>+ }
>+ nsAutoString url(NS_ConvertUTF8toUCS2(curl).get());
>+ nsAutoString path(NS_ConvertUTF8toUCS2(cpath).get());
>+ nsAutoString host(NS_ConvertUTF8toUCS2(chost).get());
>+ // Feature 6: Whether url ends in .htm or .html
>+ nsAString::const_iterator start, end;
>+
>+ path.BeginReading(start);
>+ path.EndReading(end);
>+ aFeatures[5] = FindInReadable(NS_LITERAL_STRING(".htm"), start, end);

Rather than go through all this trouble, don't we have nsACString equivalents? If not, I am fine with it.

>+ // Feature 12: Does the URL contain a ~ ?
>+ path.BeginReading(start);
>+ path.EndReading(end);
>+ aFeatures[11] = FindInReadable(NS_LITERAL_STRING("~"), start, end);

I just realized that a path would start with a tilde, wouldn't it? A tilde anywhere else in the path shouldn't be anything special. But a minor point, fine either way (I don't expect to see tildes anywhere else in paths).

>Index: history/src/nsGlobalHistory.h
>===================================================================
>+ if ((mNumWeights > 0) && (mWeights != NULL))
>+ {
>+ delete [] mWeights;
>+ }

NULL again. Besides, mNumWeights shouldn't matter. Suppose we allocated mWeights but mNumWeights is, or becomes, 0; we'd leak. Write the if like this:

>+ if (mWeights)

Done. After those changes, r=heikki
Attachment #110803 - Flags: review?(heikki) → review+
About to attach a patch that addresses Heikki's review. Heikki, here are my responses to your comments:

>Index: history/public/nsIBrowserHistory.idl
>===================================================================
>+ /**
>+ * setReferrerURL
>+ * Sets the referrer url for a url in global history
>
>I am concerned about this, even though our implementation does not do what I am
>really concerned about (namely, sending the referrer on URLs that the user goes
>to from global history, because that would be a privacy/security issue). It
>might be sufficient to add notes/warnings to the IDL and implementation that
>the referrer will/must not be sent when traversing to the URL.
>
>(But in that light it begs the question: why is there a method for referrer
>when it must not used for its "intended" purpose? Maybe you need to add some
>verbiage to the IDL comment saying something about used for internal purposes
>or something like that. Am I being too picky?)

No, you aren't being too picky. This method is a big hack. I should really use a standalone service for logging data as needed and call that service from the places where I need to output data. But this method will go away after the data collection phase, so I'm taking a shortcut. I've added a comment to this effect in the IDL file and renamed the method to OutputReferrerURL.

>Index: history/src/nsGlobalHistory.cpp
>===================================================================
>+ if (id == 0)
>
>Why not:
>
>if (!id)
>
>This is a general pattern. If you find yourself comparing something to 0/null,
>you should change it to |if(foo)| or |if(!foo)|.

Done.

>+ // Sanity check the URL
>+ PRInt32 len = PL_strlen(aURL);
>+ NS_ASSERTION(len != 0, "no URL");
>+ if (! len)
>+ return NS_ERROR_INVALID_ARG;
>
>That could simply be written as:
>
>if (!*aURL)
> return NS_ERROR_INVALID_ARG;

Agreed. I had cut and pasted this from existing code in AddPageToDatabase(). I'm changing my function but leaving AddPageToDatabase() alone.

>+ // XXX Handle out of memory condition
>+ mAutoCompleteLearner = new nsSigmoidPerceptron(AC_NUM_URL_FEATURES);
>+ mACFeatures = new PRFloat64[AC_NUM_URL_FEATURES];
>+
>+ if (!mAutoCompleteLearner || !mACFeatures)
>+ {
>+ // Disable learning
>+ mLearningMode = AUTOCOMPLETE_NO_LEARNING;
>+ mAutoCompleteLearner = nsnull;
>+ mACFeatures = nsnull;
>+ }
>
>This is sorta handling OOM conditions already, but doing it in a buggy way. If
>mACFeatures failed, then we will leak mAutoCompleteLearner, and vice versa.
>Please fix this.

Good catch. Fixed.

>- (*aResult)->AddRef();
>+ NS_ADDREF(*aResult);
>
>If you see anyone calling Release() directly, fix it with NS_RELEASE at the
>same time. Otherwise this change will show up as new "leak".

I didn't find any place in nsGlobalHistory.cpp. I'll check other consumers of FindRow() and fix if necessary.

>+nsGlobalHistory::FindRow(mdb_column aCol,
>+ err = mStore->FindRow(mEnv, kToken_HistoryRowScope,
>+ aCol, &yarn,
>+ &rowId, getter_AddRefs(row));
>
>You are not doing anything with |err| after this. Please fix.

This code is cut and pasted from the other FindRow methods in nsGlobalHistory.cpp. I don't know enough about the error values returned from the mork database store to make this change. If the other FindRows are ok with ignoring err (which is of type mdb_err, not nsresult, so I can't use the NS_SUCCEEDED macro on it), then I am too. What do you think?

>+ path.BeginReading(start);
>+ path.EndReading(end);
>+ aFeatures[5] = FindInReadable(NS_LITERAL_STRING(".htm"), start, end);
>
>Rather than go through all this trouble, don't we have nsACString equivalents?
>If not, I am fine with it.

I remember trying to look for them but couldn't find them easily, so I resorted to the nsAString methods. I'll leave this as is for now but fix it up if we decide to enable this code path by default.

>Index: history/src/nsGlobalHistory.h
>===================================================================
>+ if ((mNumWeights > 0) && (mWeights != NULL))
>+ {
>+ delete [] mWeights;
>+ }
>
>NULL again. Besides, mNumWeights shouldn't matter. Suppose we allocated
>mWeights but mNumWeights is, or becomes 0; we'd leak. Write the if like this:
>
>+ if (mWeights)

Oops, sorry for the NULL. I found that another one had crept in as well and changed it too. This habit is gonna die hard! I've also changed the if condition like you suggested.
>I didn't find any place in nsGlobalHistory.cpp. I'll check other consumers of >FindRow() and fix if necessary. Duh, the only possible consumers of nsGlobalHistory::FindRow() are in nsGlobalHistory.cpp because FindRow() is a protected method. So, we are ok.
Fixed the if-else syntax and made the changes I mentioned in comment 36.
Attachment #110803 - Attachment is obsolete: true
Attachment #110901 - Flags: superreview?(hewitt)
Attachment #110901 - Flags: review?(heikki)
Attachment #110901 - Flags: approval1.0.x?
While it's great that you're experimenting with how learning approaches could improve the user experience, mozilla needs to keep the core stable. Perhaps this work would be better suited to a branch, extensions/ (not in default build), or mozdev? This would also allow you to avoid running the r/sr gauntlet while doing the fine-tuning. Once you've got a technique working nicely and have enough testing to have confidence in it we'd of course welcome it in the mozilla tree. This need for testing/confidence especially applies to the 1.0.x branch, which I notice you've flagged this patch for.
Oops, sorry, tor. I am new to the Bugzilla review/approval process. I thought that I needed to get approval for getting this checked into the trunk. Heikki tells me that I don't. I'll go remove the approval request.
Attachment #110901 - Flags: approval1.0.x?
tor, I just realized that I didn't address the second part of your comment. The reason I'm not doing this work on a branch is that I want this to be a part of the 1.3 beta release which will get into the hands of lots of Mozilla community members. I then hope to convince some fraction of the downloaders to turn on the data capture pref, run with it on for a couple of weeks, and then send me back the data file. The data collection phase of this project needs wide distribution and an easy turn on/off mechanism for data capture. I won't be able to meet these requirements if I don't ship this in the default mozilla builds. I remember that I sent out an email to drivers about this. I'm going to paste that email on this bug because it provides relevant info for people cc'd here.
<snippet> The long term work plan is to use machine learning techniques to better order the autocomplete results in the url bar. Right now, the goal is to capture data (when the user chooses to turn on a pref) about urls visited, urls seen in the autocomplete dropdown, and urls selected in the autocomplete session by the user. When we get the data back, Prof. Andrew Ng (the CS professor I'm working with at Stanford) and I will try to make sense of it and identify the set of url features and the learning algorithm that best predict which url the user will visit next. There are obvious privacy concerns about this. To deal with them, there are two data capture modes. In one, url data is captured. In the other, only numeric features about the url are captured so that it is impossible to determine which url the person visited. The url data files are human readable so people who participate in the data gathering phase will be able to see exactly what they are sending back to us. </snippet> I hope to weed out bugs over the next couple of weeks and then include a blurb in the 1.3 beta release notes requesting Mozilla volunteers to help in data collection. Any suggestions on this plan are very welcome!
Attached patch Patch v0.9: Does better out of memory checking. (obsolete) — — Splinter Review
Heikki, the only thing that I want you to take a look at in this new patch is the following code snippet, which handles the out of memory condition for the heap allocated variables:

+  if (mLearningMode > AUTOCOMPLETE_NO_LEARNING ||
+      mDataCaptureMode > URLDATACAPTURE_NONE) {
+    // Create perceptron and feature array.
+    mAutoCompleteLearner = new nsSigmoidPerceptron(AC_NUM_URL_FEATURES);
+    if (mAutoCompleteLearner) {
+      mACFeatures = new PRFloat64[AC_NUM_URL_FEATURES];
+      if (!mACFeatures) {
+        delete mAutoCompleteLearner;
+        mAutoCompleteLearner = nsnull;
+        mLearningMode = AUTOCOMPLETE_NO_LEARNING;
+      }
+    }
+    else {
+      mLearningMode = AUTOCOMPLETE_NO_LEARNING;
+    }
+  }
+  else {
+    mAutoCompleteLearner = nsnull;
+    mACFeatures = nsnull;
+  }

Apart from this, your review+ of the previous patch carries forward to this one. Thanks!
Attachment #110901 - Attachment is obsolete: true
Attachment #110916 - Flags: superreview?(hewitt)
The above code snippet is making the assumption that mAutoCompleteLearner and mACFeatures are initialized to nsnull in the constructor.
Comment on attachment 110916 [details] [diff] [review]
Patch v0.9: Does better out of memory checking.

More drive-by comments:

Operators |delete| and |delete[]| are both null-safe, so there is no need to do:

if (foo)
  delete foo;

Rather, just do:

delete foo;

Also, you missed switching some braces in

nsresult
nsGlobalHistory::GetRowValue(nsIMdbRow *aRow, mdb_column aCol, PRFloat64 *aResult)

In the middle of

+nsresult
+nsGlobalHistory::FillInputFeatures(nsAString &aUrl,
+                                   PRFloat64 *aFeatures)

around this code:

+  nsCOMPtr<nsIRDFDataSource> bookmarkDS = do_QueryInterface(bs, &rv);
+  if (NS_SUCCEEDED(rv) && bookmarkDS) {
^^^^ Evil tab characters, please remove in favor of spaces.
+    nsCOMPtr<nsIRDFNode> nodeType;

In NS_IMETHODIMP nsGlobalHistory::OnAutoComplete(const PRUnichar *searchString, nsIAutoCompleteResults *previousSearchResult, nsIAutoCompleteListener *listener):

+  nsCOMPtr<nsISupportsArray> results;

nsISupportsArray is considered deprecated and efforts to remove it from the tree have already been started. Looking at your usage, you want to use nsCOMArray instead.

Still in the same method:

+  // If searchString found in the previous search results, assume
+  // that the user selected that url (searchString) from the previous
+  // list of autocomplete results.
+  if (found)
+    {

Please fix the brace. And towards the end of that method:

+    if (mLearningMode >= AUTOCOMPLETE_ENABLE_TRAINING)
+      {
+        mAutoCompleteLearner->SaveWeights();
+      }
+  }
+
+  return rv;

Again the brace.

Also, on a separate note, you may wish to consider using ++foo rather than foo++ as it is more efficient, and you use those quite frequently.
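For what it's worth, a rough sketch of what the suggested nsCOMArray usage could look like. This is assumed usage for illustration (the function name is made up), not code from the patch:

#include "nsCOMArray.h"
#include "nsIAutoCompleteItem.h"

// The array AddRefs on append and Releases on destruction,
// so no manual reference counting or QueryInterface per element is needed.
void Example(nsIAutoCompleteItem* aItem)
{
  nsCOMArray<nsIAutoCompleteItem> items;
  items.AppendObject(aItem);                          // AddRefs aItem

  for (PRInt32 i = 0; i < items.Count(); ++i) {
    nsIAutoCompleteItem* item = items.ObjectAt(i);    // non-owning pointer
    // ... inspect |item| here ...
  }
}                                                     // items released here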
Comment on attachment 110916 [details] [diff] [review]
Patch v0.9: Does better out of memory checking.

sr=me
Attachment #110916 - Flags: superreview?(hewitt) → superreview+
Nisheeth, that looks ok once you remove the tabs.
I would expect that the same group of people who would be willing to mail back a data collection file would also be willing to download a special testing build.
Attached patch Patch v1.0: Final patch — — Splinter Review
This patch contains the following changes:
- Fixed tab characters and if-else braces pointed out by Chris and Heikki.
- Did not replace nsISupportsArray because it is a parameter in a method in nsIAutoCompleteResults, an interface that existed before my patch.
- Replaced post-increments with pre-increments in FillInputFeatures().
Attachment #110916 - Attachment is obsolete: true
Patch v1.0 checked in.
ugh. this is way cool but it bloated us by a whopping 22k of static footprint. I really like the concept but I'm really disappointed in the bloat. I think this one of those features where we need to decide if its worth the feature for the footprint hit? how about runtime footprint, do we have any idea what the impact was there?
Alec, runtime footprint isn't affected for the user unless some prefs are enabled that switch on data capture or learning. The 21k static footprint increase is temporary. Once we've received data back from enough mozilla users who participate in the data collection phase, we'll remove the data capture code. My hope is to take the data capture code out after the 1.3 beta release and before the 1.3 final release. The learning code might also be taken out if data analysis indicates that the same set of feature weights can be used for all users. Then, we will simply use the hard coded weights rather than dynamically adjust them for each user. So, the question before us is if we can accept a temporary 21 K increase in static footprint for this research experiment. What do you think?
Thanks for the clarification - I'm fine with the temporary code inclusion (and in fact, I'm now interested in participating!), but there are only a few days left to mozilla1.3 and I'd hate to see experimental/temporary code bloating a release - can we make sure this gets backed out by the 1.3 beta release? Is a week of data gathering enough? (hopefully!)
If this needs community participation in a hurry, better write up a web page on mozilla.org that describes how to turn this on in its various modes (I can't figure it out) and push it to mozillazine.org etc.
People are waiting for your signal, Nisheeth :) Just tell us what to do...
Alec Flett:
> can we make sure this gets backed out by the 1.3 beta release? Is a week
> of data gathering enough? (hopefully!)

I was under the impression that the plan was to include this in 1.3 Beta (see comment 42).

Adam D. Moss:
> If this needs community participation in a hurry, better write up a web page
> on mozilla.org that describes how to turn this on in its various modes (I
> can't figure it out) and push it to mozillazine.org etc.

A MozillaZine article is an excellent idea and should get a lot of people participating.
I stumbled across some of this code in Purify. Specifically, mDataCaptureMode and mLearningMode are not initialized in nsGlobalHistory::Init (I don't have these prefs) and they are used later on. Not sure what the state of this bug is, but I thought I'd post here since it's still open. Let me know if I need to file a separate bug.
David, thanks for the heads up. I'm looking into the bug. Bug 184167 is probably a result of this...
David, I've checked in a fix. Thanks for finding the bug! I hope to have a document describing this stuff written up for mozillazine and mozilla.org in time for the 1.3 beta release. Till then, if you can run the nightly builds with the "browser.history.url.datacapture.mode" pref set to 2, you will start seeing a file called "url-data.txt" in your profile directory. If some of you can try this out and report problems back to me, that would be great! I'd really like to ferret out lurking bugs before the 1.3 beta release. Thanks!
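For convenience, here are the two prefs mentioned in this bug as they would appear in a profile's prefs.js (or user.js). The data capture pref and its values 1/2 come straight from the comments here; treating 2 as the capture mode to try first and adding the learning pref alongside it is just an example setup, not an official recommendation:

// Capture autocomplete training data into url-data.txt in the profile
// directory; 1 and 2 select the two capture modes described on the project page.
user_pref("browser.history.url.datacapture.mode", 2);

// Optional: also enable the experimental learning-based ordering of results.
user_pref("browser.urlbar.autocomplete.learning.mode", 2);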
Alright, I've been running CVS-HEAD (with mode set to 1, not 2) for a few days now. Haven't seen any problems. I'm turning it off for the moment because url-data.txt is starting to uncomfortably fill up my /home partition. :) Is there a place where I can usefully send my url-data.txt or should I start over when the data collection 'officially' begins?
Adam, I'd love to take a peek at the data. Would you zip it up and email it to me, please? Thanks a lot!

Not sure if I'll be able to get to this, but ideally there should be a web page so that the data file can be submitted anonymously. Anyone up for creating and hosting such a thing? It would probably be a simple CGI that accepts the data file and emails it to me. If it can double-check that the data file is zipped and, if not, zip it up before emailing it, even better!
> Not sure if I'll be able to get to this, but ideally there should be a web
> page so that the data file can be submitted anonymously. Anyone up for
> creating and hosting such a thing? It would probably be a simple CGI that
> accepts the data file and emails it to me. If it can double-check that the
> data file is zipped and, if not, zip it up before emailing it, even better!

You mean like this? http://www.mozillazine.org/misc/bug-182366-submit-anon-url-data.html

It'll need more testing before going into production and I'd have to check with Kerz that it's okay for MozillaZine to host it. The wording, of course, can be updated (and probably should be).

The script (it's PHP, not CGI) just grabs the zip file and mails it to nisheeth@netscape.com as an attachment named url-data.zip. The script checks that the data is zipped (i.e. that it's receiving a file of the type application/x-zip-compressed) but it doesn't zip up the data if it isn't. It also doesn't check that the zip file contains a url-data.txt file. But essentially, it should work (though, as I said, it will need testing).

How big are these (zipped) url-data files going to be? I think the form's limit is 2MB (though I believe this can be changed).
Here's the PHP script so you can all suggest any necessary changes (or just mock my coding skills).
Attachment #111782 - Attachment description: THe PHP script behind the anonymous form → The PHP script behind the anonymous form
Alex, this is really cool! Thanks a lot for whipping this up! Adam, please try and submit your data file from Alex's page. Let the testing begin!
I can't use the form to submit the data, because I deleted the data after I emailed it to you. :)
patch v0.9 caused an embedding smoketest blocker today (see bug 189222). not sure why it took so long to see the regression.
Alex, there is an error in your script. You should probably change the line

} elseif ($_FILES['data']['type'] != "application/x-zip-compressed") {

to accept the "application/zip" type too. I tried to send my data today and received this error: "You forgot to zip up your data before sending it! We told you that it wouldn't work. The type of file you tried to send was application/zip. A zip file is application/x-zip-compressed."
> Alex, there is an error in your script. You should probably change the line
>
> } elseif ($_FILES['data']['type'] != "application/x-zip-compressed") {
>
> to accept the "application/zip" type too.

Whoops... should be fixed now. Try sending your data again. I don't suppose there are any other MIME types that zip files can be?
well, if you're unlucky, they can be application/octet-stream, I suppose
Will this bug better sort the autocomplete results when starting to type in a recipient's name? (I read most of the bug and did a "Find" for "mail" but found nothing conclusive.)
Peter, you should consider looking at the patch. This bug has nothing to do with mail so far (though the results could perhaps be applied there sometime in the future if the experiment is a success).
As promised, here's the link to the project page titled "Applying machine learning to autocomplete" on mozilla.org. Any feedback on the page is very welcome! Please participate in the data gathering process! Thanks! http://www.mozilla.org/projects/ml/autocomplete/index.html
OK, I've updated the anon form to link to the page. I've checked with Kerz and he's okay with MozillaZine hosting the form as long as it doesn't spike the traffic too much, so I'll be keeping an eye on that over the next few days.

As the form hasn't really been tested much, it would be great if anyone here who tries it could post a comment or email me about whether it a) worked or b) didn't, at least for the first few days. Sorry, should have sorted this earlier.

Nisheeth, do you want me to modify the form so that it also emails the data to ang@cs.stanford.edu? Also, if there are any other changes you want made, just yell.
Thanks for updating the text on the page, Alex! I think it looks great! Yes, please add ang@cs.stanford.edu to the recipient list for submitted data files.
> Yes, please add ang@cs.stanford.edu to the recipient list for submitted data
> files.

OK, the script should now Cc all the data files to him. I've also updated all the mailto: links to include ang@cs.stanford.edu.

> Any feedback on the page is very welcome!

A couple of things:

* You variously use "I" and "us" on the page. Probably best if you spell out who you mean, as nobody reads the authors line at the top.

* In point 3 of the 'Data Collection' section, you could tell people to use about:config rather than manually altering prefs.js (which is much easier to screw up):

===
Type about:config in Mozilla's Location Bar to bring up a list of all your preferences. Right-click on any of them (Ctrl + Click if you're using a Mac) to bring up the context menu and then select 'Integer' from the 'New' submenu. Enter the following as the preference name: <code>browser.history.url.datacapture.mode</code>. When asked for the value, enter <code>2</code> or <code>1</code>. Both settings will dump data into a file called url-data.txt in your profile directory. The different data capture modes are explained below.
===

* If you really wanted, you could include the anonymous submission form directly on the page:

<form action="http://www.mozillazine.org/misc/bug-182366-submit-anon-url-data.html" enctype="multipart/form-data" method="POST">
<p><input type="file" name="data"> <input type="submit" name="send" value="Send"></p>
</form>

That should work, assuming I'm not forgetting about any security measures surrounding uploaded data from a form on one domain being processed by a script on another.

* You might want to include a brief explanation of exactly what machine learning is.
so are we backing this out, now that 1.3beta has shipped?
Oh. My. God. Before you try to make autocomplete be all DWIMmy AI, how about first making it be something (anything) less than completely incomprehensible? See very simple and obvious proposal in bug http://bugzilla.mozilla.org/show_bug.cgi?id=175725
Alex, I just got this message from your script:

> The type of file you tried to send was application/zip.
> A zip file is application/x-zip-compressed or application/zip.

Nice, huh?
Alex, I updated http://www.mozilla.org/projects/ml/autocomplete/index.html with your prose about "about:config".

alecf, yes, the plan is still to back this out for 1.3 final. Bug 193347 is a 1.3 final blocker and will take care of the code bloat.
> Alex, I just got this message from your script:
> > The type of file you tried to send was application/zip.
> > A zip file is application/x-zip-compressed or application/zip.
>
> Nice, huh?

Sorry, looking into it.
Andreas, do you want to try sending your file again? I think the problem's fixed now.
Alex: ok, same file worked now.
> Alex: ok, same file worked now.

Cool. It was a bracket thing. I didn't have enough brackets so the NOT operator's scope was restricted to just the left-hand side of the... wait, nobody cares!
What's the status of backing this out? At this point it needs to come out of both 1.3 branch and 1.4a trunk.....
> What's the status of backing this out? At this point it needs to come out of
> both 1.3 branch and 1.4a trunk.....

See bug 193347. I believe the fix landed on the trunk before 1.3 was branched.
I have enabled ("browser.urlbar.autocomplete.learning.mode", 2) and the most evident change is a delay when typing in the location bar. In particular, the first character of any URL freezes the browser (ID:2003021008, Linux) for several seconds. Perhaps the search space would be much smaller if the proposed URL were only calculated once at least two or three characters are already known (apart from http(s)://, www, etc.). For those first characters, the classical algorithm could still be applied.
Gunnar, I've received the performance complaint from a couple of other people as well who've had large history files. The learning code wasn't streamlined for performance at all and was meant only as a fun prototype for people to play around with. Once we get the data back from Mozilla volunteers, and if we find that a learning algorithm works better than the current algorithm, we'll implement it with performance in mind.

It would really help us to know the processor speed and memory of your Linux box. It would also help to get our hands on your history file so that we can test our learning algorithm against it...
Is the autocomplete code still in the Mozilla builds? I downloaded RC1 and tried to activate it, but the feature seems to be broken. Or was the code removed for the 1.3 release?
no, it was removed from all builds after the 1.3beta release.
Does the project still need additional learning data? If so, is it possible to get a post-1.3 version with this turned on?
Gordon, the project has received enough data (around 400 data files), so there isn't any need to put the code back in. That said, more data is always good, so if you want to participate in this for a couple of weeks, please use the 1.3 beta builds for that time to collect data. We should still be able to use whatever data you submit to us a couple of weeks from now. Thanks!
This project is very important; in fact, it is the first solid step towards creating a really intelligent user interface for Mozilla applications. It is a strategic project, so the code shouldn't be removed, even if there were some bugs.

As for training data: I think the data should mostly come from the user, as the result of user-specific experience: Mozilla applications can collect data from the user and use it for autocompletion. Thanks, Nisheeth!
Blocks: 168902
Comment on attachment 110901 [details] [diff] [review] Patch v0.8: Patch that addresses Heikki and Chris Aillon's comments removing obsolete review request
Attachment #110901 - Flags: superreview?(hewitt)
What's going on with this? Is it planned for 1.4? 1.5?
We're in the middle of data analysis on the data that we got back. A quick update: we have written a Perl script to convert the data file into a stream of URL feature vectors, and a Matlab script to run a softmax regression probabilistic model on it. The short-term plan is to compare how this model does against the current Mozilla algorithm on the data. If anyone is interested in more detail about what we are doing, please contact me. I think results from this work will start showing up in the 1.5 timeframe.
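For readers unfamiliar with the term, softmax regression is the multi-class generalization of logistic regression. The formula below is the standard textbook form, not necessarily the exact parameterization used in the Matlab script; how candidate URLs and their feature vectors are mapped onto the classes is described in the report:

P(y = c \mid x) = \frac{\exp(w_c^\top x)}{\sum_{j=1}^{k} \exp(w_j^\top x)}, \qquad c = 1, \ldots, k

where x is the feature vector and w_1, ..., w_k are the learned per-class weight vectors. With k = 2 this reduces to plain logistic regression (the sigmoid 1 / (1 + exp(-w^T x))).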
Blocks: 205158
Came across this from bug 177360. I've heard a couple people ask about it in the mozillazine forums, and 1.5a is out. Inquiring minds want to know: what's up with this project?
Is this working now in moz 1.5? I didn't get any url-data.txt file or any effect by enabling user_pref("browser.urlbar.autocomplete.learning.mode", 2);
>Is this working now in moz 1.5? no
How about setting a target milestone for this project? Prog.
the author/driver behind this project is gone now. I suggest we mark this WONTFIX.
Actually, he just needs to update his bugmail address, I think. I talked to Nisheeth recently and he is still working on this at Stanford, though I'm not sure what pace or timeframe he is looking at, or if he is doing it purely for academic reasons. I'll see if I can get him to comment here.
I would still be interested to see his results, although I am quite happy with how the address bar works at the moment. I also fear that it would make Mozilla a bit slower again.
caillon: Any progress in getting Nisheeth's status? Is this still to be considered an 'assigned' bug?
Sorry for the long silence. I was back in school and had very little time away from course work. I worked on this off and on last year and then continued this work as part of my class project for the machine learning class I took earlier this year.

Happily, we are now at a stage where we can make a call on whether or not to use the machine learning approach. The final iteration of the machine learning approach saves the user an average of one keystroke per autocomplete event (on a data set of 237 users) as compared to the current Mozilla algorithm. I am about to upload the project report and some slides to mozilla.org that describe the different ideas we tried and the results we achieved.

The final result is good but not great. The question we need to discuss now is whether a one-keystroke-per-event advantage justifies changing the current Mozilla autocomplete code... I'd love some input on this question from all of you who've been following the progress of this bug. I will post links to the project report and the slides (which basically summarize the report) once I've uploaded them. Please read them and post your comments to this bug. Thanks a lot!
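As one plausible reading of that number (this is an illustrative definition of the metric, not necessarily the one used in the report): if for autocomplete event e the user must type k_cur(e) characters before the intended URL reaches the top of the dropdown under the current algorithm, and k_ml(e) characters under the learned model, then the average saving over N events is

\text{saving} = \frac{1}{N} \sum_{e=1}^{N} \bigl( k_{\mathrm{cur}}(e) - k_{\mathrm{ml}}(e) \bigr)

so "one keystroke per autocomplete event" means this average is roughly one character of typing.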
Thanks for the update. In my experience, Mozilla already has a feature that provides about a one-keystroke-per-event advantage compared to its default autocomplete: it's called "Autocomplete best match as you type" (browser.urlbar.autoFill=true). The only problem I find with this feature is that it's disabled by default, and this brings me to where machine learning could really help Mozilla: better default prefs. I firmly believe that gathering information from users (about their browsing patterns) can help Mozilla choose better defaults. I won't tire anyone with this off-topic issue, as it probably requires a new bug. Prog.
Thanks for the update. I look forward to seeing what you have.
One keystroke isn't much, but it may still feel like it works better than the default behaviour or the autoFill behaviour. I would still like to experiment with it. Maybe it should be an installable addon?
Just uploaded the report at http://www.mozilla.org/projects/ml/autocomplete/ac-report.html. Also, a PDF version of the slides that summarize the report is at http://www.mozilla.org/projects/ml/autocomplete/ac-preso.pdf.
I tried to configure the Mozilla web browser for data capture mode '2' to participate in this autocomplete project, but the file "url-data.txt" was not created.
Product: Core → Mozilla Application Suite
Nisheeth, are you still working on this?
QA Contact: claudius → nobody
Now that we have the smart location bar (T3h Awesomebar) with the Places frecency algorithm, I'm marking this bug FIXED and closing.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
No longer blocks: 205158