Closed Bug 101086 Opened 23 years ago Closed 23 years ago

[regression] Minimal length for LDAP search should reflect script differences

Categories

(MailNews Core :: LDAP Integration, defect, P2)

Tracking

(Not tracked)

VERIFIED FIXED
mozilla0.9.5

People

(Reporter: momoi, Assigned: dmosedale)

Details

(Keywords: intl, regression, Whiteboard: regression, [PDT+], need approvals)

Attachments

(1 file, 1 obsolete file)

** Observed with 9/21/2001 Wi32 trunk build **

This bug comes out of discussion in Bug 100645. In there
dmose@netscape.com mentions that:

"The minimum number of characters required to do a search against a given server
is settable by hidden preference, though:
{directoryServerPrefix}.autoComplete.minStringLength"

I experimented with this and has found that this minimum
string length rule applies indiscriminately whether or
not the matching strings represent Alphabetic characters
or non-Alphabetic CJK ideographs. Apparently the length is
determined by a character according to UCS2 reperesentation.

This length seems to be set to 2 currently so that a 
match will not be returned unless there is 2 Unicode character 
sequence match. This makes sense for scripts which use Alphabetic characters
(via case insensitive matching) whose repertoires 
probably do not exceed 35-40 characters when case distinctions 
are eliminated.
This will help such script users from seeing too many matches
at the 1st character. This is good.

This, however, is not true in scripts for languages such
as Chinese & Japanese. These languages at minimum use 2000
Han/Kanji characters or more for personal names. Thus,
the 1st character match is much more useful and expected.

There can also be many unqiue 1-st character matches if the 
Chinese characters are from less frequently occurring 
repertoire such as the 1st character of my name, 
MOMO in Momoi, and even against a failry large databse of 
10,000 or more names. (In my name's 1st character, probably
at most a few matches in 100,000 Japanese name database.)

In my opinion, current indiscriminate limitation of 1 character
makes the auto-complete feature much less useful for CJ(K) scripts.

If LDAP autocomlete is meant as an enterprise feature, I 
suggest we correct this problem before shipping the next
release.

My proposal would be to be set the limit to 1 Unicode
character when it falls within the CJK Unifided ideograph range 
and to the current 2-characters when it falls outside of this
range.

The CJK Unified Ideograph range by Unicode 2.0 is:  

\u4E00 - \u9FFF
 
This should catch most modern CJK Han characters used for names. 

Maybe there are other ranges or Unicode 3.1 update to this, if 
so, someone, please make additional suggestions.

Let me hear others' opnions on this.
It turns out that this is a regression from Mozilla 0.9.2/NS6.1.
There we were probably doing indiscriminate 1-char matching.

The current change may help Alphabetic script users but will 
inconvenience CJ(K) users who benefit from 1-char matching.
For the latter class of users, this is a regression from
familiar behvaior in Mozilla 0.9.2/NS6.1.

Nominating for nsenterprise for these reasons.
Keywords: nsenterprise
Summary: Minimal length for LDAP search should reflect script differences → [regression] Minimal length for LDAP search should reflect script differences
Whiteboard: regression
Severity: normal → major
MScott- This is a regression. Pls triage. IS this a stop ship?
I object to PDT asking a coding engineer whether or not 
this is a stop-ship bug. This is a feature-level decision.
We should be shipping a product that serves users' needs
and unless a coding engineer understands script/language
needs issues well, that is not the right person to ask
this question. 
The reason this is a regression is that whoever made this
change did not take into account script differences 
and how the feature in question serves users of different
languages and what would be the best way to serve their
needs.
 
This is interesting stuff.  The limit of 2 characters was actually not an
arbitrary choice, but has to do with how some LDAP servers do indexing.  See bug
76593 and bug 84566 for some of the history here.  Relevant stuff:

* Unindexed searches have the potential to be much slower than indexed searches,
and put significantly more load on the server.

* If a beginning of the search string does not have a wildcard, that means that
it's "anchored" and the anchor counts as one of the 3 characters that are
required to not cause an indexed search (which is why 2 was chosen; all our
searches currently are anchored).

* The LDAP wire protocol encodes everything in UTF8, and I assume (but don't
actually know) that servers will typically store their disk databases in UTF8 as
well.  This begs the question of what how LDAP servers count "characters" for
doing unindexed searches: bytes?  UTF8 chars?  something else?  mcs / mhein; can
you chime in here?

* Communicator 4.x did not try and limit the server load and search time this
way (ie it essentially had a minStringLength of 0), which is I think what 6.1
had also.  Whether this is a bug or a feature depends on your perspective (cf
bug 84566).
More thoughts: 

* Dropping only the default minStringLength will cause the hang in bug 84566 to
come back. We may be able to work around that by also dropping the default max
search entries from 100 down to something much smaller.

* Bug 84566 is really just a manifestation of bug 50104: mozilla's event queues
don't prioritize events at all, so they're easy to flood.  If the above strategy
doesn't work, it's possible that changing from async proxy events to sync proxy
events on the LDAP thread could workaround 50104.
momoi: it would be easy to add a preference for an installation-wide default of
the string length, which you could then set in the default JS prefs for
localized builds.  Does that work for you?
Assignee: srilatha → dmose
OS: Windows 2000 → All
Priority: -- → P2
Hardware: PC → All
Target Milestone: --- → mozilla0.9.5
How about the following fix? Characters in range of (0x2e80, 0xd7ff) are all 
belong to a rather big repertoire. I have no idea about how LDAP server do 
index/unindex search. Any utf8 search that is not done base on characters is 
meaningless and is a BUG that should be fixed. 

Index: nsLDAPAutoCompleteSession.cpp
===================================================================
RCS file: /cvsroot/mozilla/xpfe/components/autocomplete/src/nsLDAPAutoCompleteSe
ssion.cpp,v
retrieving revision 1.18
diff -u -r1.18 nsLDAPAutoCompleteSession.cpp
--- nsLDAPAutoCompleteSession.cpp       2001/09/17 23:35:02     1.18
+++ nsLDAPAutoCompleteSession.cpp       2001/09/24 22:33:39
@@ -64,6 +64,8 @@
     }
 }

+#define IS_CJK_CHAR_FOR_LDAP(PRUnichar u)  (0x2e80 <= (u) && (u) <= 0xd7ff)
+
 /* void onStartLookup (in wstring searchString, in nsIAutoCompleteResults previ
ousSearchResult, in nsIAutoCompleteListener listener); */
 NS_IMETHODIMP
 nsLDAPAutoCompleteSession::OnStartLookup(const PRUnichar *searchString,
@@ -100,7 +102,8 @@
     if (searchString[0] == 0 ||
         nsDependentString(searchString).FindChar(PRUnichar('@'), 0) !=
         kNotFound ||
-        mMinStringLength && nsCRT::strlen(searchString) < mMinStringLength)  {
+        mMinStringLength && nsCRT::strlen(searchString) < mMinStringLength
+                         && (!IS_CJK_CHAR_FOR_LDAP(searchString[0])))  {

         FinishAutoCompleteLookup(nsIAutoCompleteStatus::ignored);
         return NS_OK;
> it would be easy to add a preference for an installation-wide 
> default of the string length, which you could then set in the 
> default JS prefs for localized builds.  Does that work for you?

This provides only a limited workaround. We have an internationalized
and multilingual browser in Mozilla. For such software, this cannot
be a correct solution. Making distinctions based on script differences
is a requirement of a multilingual browser and User Experience needs to 
reflect that quality as well.
Let's put in a real fix.

Some info. on how Netscape (and the iPlanet) Directory Servers work:

As Dan said, unindexed searches should be avoided because they load the server 
and are too slow for something like autocomplete, but you can set a time limit 
to make sure the server does not spend too much time servicing an autocomplete 
request (maybe this is already being done?  As a user, I would say that waiting 
more than a few seconds for autocomplete is too long).

On the wire with LDAPv3 and inside the Directory Server, everything is processed 
as UTF-8.  I believe that the substring indexing (used for filters such as 
cn=Ma*) counts bytes when it creates the index.  What that means is that the 
number of characters is not the important thing when avoiding unindexed 
searches, but the number of UTF-8 bytes is the important thing.  If I am 
correct, then to work best with the Netscape and iPlanet servers, the 
minStringLength preference should count UTF-8 bytes.  This would be easy for 
someone to try.

I think it is may be a bug in the server to use UTF-8 byte counts when breaking 
a string up into substrings though, because the substring index keys may not 
contain complete characters.  But I think that happens to work okay inside the 
server because of the way the substring matching is done.
According to mark's comment, my patch should be fine. All CJK character in my 
CJK testing range use more than one byte. We don't want to use UTF8 byte count 
here because:
1, It takes several more steps to calculate UTF8 byte count, 
2, It is undesired for user to receive a lot of matches. If we use UTF8 byte 
count as criteria, that might happen. Suppose user's language is russian, and 
all russian characters are encoded using more than one byte. That will lead to 
searching with one character, and might have a lot of matches because russian 
character repertorie is small. 
what are the chances of getting this fixed for the 0.9.4 branch?
Whiteboard: regression → regression, [PDT], [ETA ?]
Kat, what is the default stock behaviour of Communicator 4.78 [en] on a CJK system?
jussi, with Comm 4.78, I get matches with 1 character prompt 
for both ASCII and non-ASCII type entries.
Whiteboard: regression, [PDT], [ETA ?] → regression, [PDT], [Need ETA]
shanjian: this seems like the wrong fix, because it makes it impossible for the
user to override the minStringLength for CJK chars via preference.
blizzard, do you have any thoughts on this?  My current thinking is that it
would be better to split minStringLength up into two preferences: one that
applies to strings beginning with CJK characters, and one that applies to
everything else.
So if we take any fix that sets minStringLength to 0 for some set of
autocompletions, we also risk exposing people to bug 88427 (in addition to the
already mentioned bug 84566).
Status: NEW → ASSIGNED
OK, having thought about this for a whole 30 seconds here's what I see:

It seems pretty obvious to me that you need to have preferences for the minimum
length of what gets sent to the server, no matter what language it is.  Also, it
would be nice to have a way to recognize a language character that is more
signifigant than latin languages.  So, personally I would recommend using the
check for the character type with two prefs, one for latin, one for CJKV languages.
OK, I've attached a patch that I think is reasonable.  It uses the
cjkMinStringLength pref (defaulted to 0) for CJK chars, and the minStringLength
pref (defaulted to 2) for everything else.  Shanjian and Leif, can you review
this?  Also, if someone who knows how to enter these characters and has an
appropriate server can build with this patch and actually test it, I'd be very
grateful.
Whiteboard: regression, [PDT], [Need ETA] → regression, [PDT], have patch, need review and testing
I'm sure shanjian can help with the patch testing.
I myselft can test this on a trunk build tonight. For a branch
build, tomorrow morning by 11am or so. Is that soon enough?
momoi: sure; that should be fine (I still need the reviews anyhow).  Thanks!
I tried your patches (4 in all) on today's trunk build.
It works as intended on 2 LDAP servers I tried. Required
2 characters for non-CJK matches (Latin1, Turkish, Czech, 
Greek, Russian) but for Chinese, Japanese, and Korean,
I got 1 character matching. So this is working as intended
on a fresh profile on the current trunk build.
We don't have samples for Vietnamese Han characters
but I expect those falling outside of the user-defined area
to be the same as CJK languages as far as the minimum
length is concerned since they fall within the Chinese
range of Unicode standard. Those in the user-defined
range, I don't know if there is any easy way to test
them. For now, I think we can safely ignore the issue
and come back to it when needed.
Question: I think this is unrelated to your patch but 
          I have not seen any CJK matches for First Name
          entries. Last or full name matches only for them.
          For non-CJK entries, I am able to match with 
          First, Last or Full names. Is this by design? It's
          been like this for some time now.
          
Dan, you patch looks reasonable, but the if part might need to be rewritten to 
save an additional judgement. 
I have another suggestion for your consideration. Instead of creating a new 
preference, I would suggest to derive CjkMinStringLength  from mMinStringLength. 
That is to set CjkMinStringLength to be half of mMinStringLength. The arguments:
. (Use english and chinese as an example), first 2 english letters has 26*26=676 
combination, and first 3 has 26*26*26=17576. GB2312 encode 7445 code points. So 
user will have a consistent experience working with either CJK or non-CJK.
. Typical CJK users/programmers still believe a CJK character's string length is 
two, since most CJK encoding use 2 byte to encode a character. With monospace 
font, a CJK character does occupy double space comparing with latin latters. 
. We only need on preference. (Is it an advantage?)

Part of the patch should be like:
-        mMinStringLength && nsCRT::strlen(searchString) < mMinStringLength)  {
+        (nsCRT::strlen(searchString) < 
+           (IS_CJK_CHAR_FOR_LDAP(searchString[0])?
+                  mMinStringLength/2 : mMinStringLength))){
On today's branch pull, dmose's pataches work as expected
but there is one major difference between the branch 
and trunk builds.
This may be an area that mscott worked on.

Problem in branch with dmose's patch:

1. If there is a single CJK match, that selection displays OK.
2. Now if you erase over this selection and start over within the
   same address field, no selection is displayed if there is only
   one match. In other words, you can try other entries which
   have only a single match but none will be displayed. 
   If you hit CR, however, the correct entry will be supplied.
3. If you enter the same character that failed in step 2 on 
   another address field, it will succeed again.

This problem is not observed in the trunk build. Step 2 is something
I do fairly often. I probably should file another bug for it. 
The problem described above does not seem to be due to dmose's
patches and so was filed as a new Bug 102954.
shanjian: I prefer the patch as it stands, because it gives the user more
control over the behavior.  Can you r= the existing patch?
dan, I would like to see following modification to your exist patch if you decide to keep 2 preference. We don't want to end 
up with 2 "IS_CJK_CHAR_FOR_LDAP(searchString[0])" judgement. 

-        mMinStringLength && nsCRT::strlen(searchString) < mMinStringLength)  {
+        (nsCRT::strlen(searchString) < 
+           (IS_CJK_CHAR_FOR_LDAP(searchString[0])?
+                  mCjkMinStringLength : mMinStringLength))){
shanjian: the change you suggest will hurt performance in many cases, because it
will always call nsCRT::strcmp().  The patch as written avoids calling
nsCRT::strcmp() unless it absolutely has to.  So although the code appears
redundant, that's actually intentional.
Is that so? Then how about this:

-        mMinStringLength && nsCRT::strlen(searchString) < mMinStringLength)  {
+        ( IS_CJK_CHAR_FOR_LDAP(searchString[0])?
+           mMinStringLength && nsCRT::strlen(searchString) < mMinStringLength :
+           mCjkMinStringLength && nsCRT::strlen(searchString) < mCjkMinStringLength ) ) {
Sold, although it looks like a ! is missing, or else the ? : clauses are
misordered.  New patch coming right up.
Attachment #51800 - Attachment is obsolete: true
Comment on attachment 51988 [details] [diff] [review]
CJK autocomplete regression fix, v2

sr=blizzard
Attachment #51988 - Flags: superreview+
Attachment #51988 - Flags: review+
Whiteboard: regression, [PDT], have patch, need review and testing → regression, [PDT], need approvals
Comment on attachment 51988 [details] [diff] [review]
CJK autocomplete regression fix, v2

a=asa (on behalf of drivers) for checkin to 0.9.5
Attachment #51988 - Flags: approval+
I used dmose's 2nd patch with the current trunk build (10/4/2001)
on Windows. The patch(es) are working as intended. The results
I got were the same as before when I tried the 1st patch. 

In case you want to try CJK input without a CJK system, here's
what you can do:

1. On your Windows (NT/2000/XP preferred but 98/95 is OK, too), get
   yourself a multilingual font:

   ftp://ftp.netscape.com/pub/communicator/extras/fonts/windows/Cyberbit.ZIP

   After unarchiving it, install this font into your system using the
   Font utility on the Control Panel.

2. Use Comm 4.75 or later. First, set the font "Bitstream Cyberbit" in
   step 1 for "Edit | Prefs | Appearance | Fonts | for the encoding | 
   Unicode". (Don't worry about setting for Mozilla. It will automatically
   use it if it is on your system.)
3. Using the AdBook on 4.7x, set up the LDAP server which contains
   multilingual entries. (NS-internal one: polyglot)
4. Type in "*" into the query field. This will get all the entries
   in the database. With the multilingual font above, all entries
   should display properly. 
5. Pick any CJK entry and double-click on it. This will open a browser 
   window and displays the content of the LDAP URL.
6. Copy/paste parts of Last name from the browser window 
   into Mozilla address field for Mail compose window.

(Note: ldap url does not seem to work for Mozilla -- a Windows
       dialog window comes up instead when I used LDAP URL. )
momoi & shanjian: thanks for all the help.

The fix has been checked into the trunk.

My manager has informed me that PDT decided that this didn't meet their criteria
for branch checkin, so I'm gonna resolve this as FIXED.  If anyone wants to go
fight PDT about this, feel free to re-open...
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
> My manager has informed me that PDT decided that this didn't meet 
> their criteria for branch checkin, so I'm gonna resolve this as 
> FIXED.  If anyone wants to go fight PDT about this, feel free 
> to re-open.

If this next release is really going to be used in 
corportations with DS data in CJK, you're going to get a
lot of complaints. But I have no hard data if such target 
users are known at this point. Those who are planning
the next NS release should know. Please be aware that you have
been warned. 

I'm going to re-open this for 0.9.4 branch. 

I have to say the following even if PDT decides to ignore it because 
it seems like people have not understood the full implication of 
this bug for CJK users.

1. Some servers don't return results when you query for first names
   in CJK. So far the NS 3.1.x DS server we use for testing is like this. 
   Maybe it is our bug but this means that if a person's last name contains 1   
   character only, there will no be results returned. There is a 
   sizable percentage of Japanese names which contain only 1 Chinese 
   character. These people will not be found by entering CJK input because 
   you need 2 characters minimum! 

   The LDAP autocomplete is completely broken for them.

2. All Chinese last names as far as I know contain only 1 Chinese character.
   So if you have one of these NS DS servers, NOT a single entry
   will be returned!

   So this feature is broken for nearly 100% of Chinese search!

3. Many Korean last names begin with only 1 character. These names will not
   be found.

4. Most users expect to find CJK names with 1 character input. It should be
   product requirement for this feature -- not an option. This is a 
   regression from NS 6.1 and Comm 4.7x also. 

Basically this feature is broken for a huge percentage of CJK users names.
How can anyone claim that LDAP search is enabled for CJK when it is
THIS broken?

If you really going to ship like this, I would out out a warning that
says that CJK search is not working and avoid all criticisms.

CC'ing PDT people for reconsideration. At least you should know all 
the facts as to how broken LDAP autocomplete is for CJK without this
fix.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
ji or julian, please verify the fix on tomorrow's trunk build.
Verified with 2001-10-05-06-trunk build, it's fixed on the trunk.
pls check this into the branch - PDT+
Whiteboard: regression, [PDT], need approvals → regression, [PDT+], need approvals
For testers :

When you test for the branch build fix, please note that
I just added several 1-char last name/1-char first name entries in CJK to 
polyglot (o=Airius.com). When testing for this fix, please make sure that
you contrast the pre-fix version and the fixed version with regard to
these 1-char CJK names. With the former, you will not get any return
but with the latter, you should be able to get all the matches returned.

Also you might experience a problem mentioned in Bug 102954
when you erase over the CJK input and try it the 2nd time. 
The above contains workaround suggestions.
Fix checked in on embedding (0.9.4) branch.
Status: REOPENED → RESOLVED
Closed: 23 years ago23 years ago
Resolution: --- → FIXED
Verified with 10/08 branch builds. It's fixed.
verified with 20011008 branch build against polyglot.mcom.com
Status: RESOLVED → VERIFIED
Adding cwozniak to CC for possible documentation of hidden pref.
Product: MailNews → Core
Product: Core → MailNews Core
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: