Closed Bug 86999 (detectHebrew) Opened 23 years ago Closed 19 years ago

Hebrew support for Universal (All) Autodetect

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking


VERIFIED FIXED
mozilla1.8beta4

People

(Reporter: bobj, Assigned: shooshx)

References

Details

(Keywords: intl)

Attachments

(2 files, 4 obsolete files)

The Universal Autodetect should support Hebrew, since 6.1 now includes support
for Hebrew pages.
We also need to detect both logical Hebrew and visual Hebrew, since the current
statistical model will give different outcomes depending on whether the text is
visual or logical.

Roy is moving Shanjian's code to extensions/chardet right now. He said it will
take < 2 weeks to do so. After that is done, we can ask simon@softel.co.il to
give us some Hebrew data. (He already gave me the whole Hebrew OT Bible in
logical order, but I am not sure it is good enough for modern content.)
> Roy is moving Shanjian's code to extensions/chardet right now. He said it
> will take < 2 weeks to do so. After that is done, we can ask
> simon@softel.co.il to give us some Hebrew data. (He already gave me the
> whole Hebrew OT Bible in logical order, but I am not sure it is good
> enough for modern content.)

There is no dependency on moving the code to start creating the data.
Maybe if Shanjian (who is still on paternity leave) could tell someone
else how to create the data, this work could be started sooner.
It is not just data. Data is important, but someone still needs to write some
code for it, as I understand it.
Target Milestone: --- → Future
In an email to me, Shanjian wrote:
> As for Arabic and Hebrew, I am not sure if IBM can provide us with at
> least 10M of plain text for each language. If so, I can build our own
> language model and add such support in our commercial build.

But I don't know what it means to build the language model...
Keywords: intl
QA Contact: andreasb → giladehven
> But I don't know what it means to build the language model...

This is explained in a paper Shanjian and I will be
presenting at the 19th Unicode Conference in San Jose.
We will make it available Netscape-internally soon, once the
final paper has been submitted.
Adding support for a language to my new universal detector is a fairly simple
process (a hypothetical sketch of step 2 follows below):
1. Prepare at least 10M of plain text, preferably from various sources and styles.
2. Run my tools to generate 2 tables (the language model).
3. Put the 2 tables into a new header file.
4. Make some small changes to the makefile and the main control module.

In the worst case, if the language is strange enough that my current algorithm
cannot handle it well, I would need to modify the other files as well. Normally
that can be done by adjusting some constants. This situation has not come up
recently, so most likely we will not have to do it.
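
For illustration, a minimal sketch of what step 2 might involve, assuming a
simple bigram counter over the corpus. This is not the actual tool (which was
never checked in), and every name and the output format are invented:

#include <algorithm>
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

int main(int argc, char** argv) {
  if (argc < 2) {
    std::fprintf(stderr, "usage: %s corpus.txt\n", argv[0]);
    return 1;
  }
  std::FILE* f = std::fopen(argv[1], "rb");
  if (!f) {
    std::fprintf(stderr, "cannot open %s\n", argv[1]);
    return 1;
  }

  unsigned long charFreq[256] = {0};                      // raw byte counts
  std::map<std::pair<int, int>, unsigned long> pairFreq;  // byte-pair counts

  int prev = -1;
  int cur;
  while ((cur = std::fgetc(f)) != EOF) {
    charFreq[cur]++;
    if (prev >= 0)
      pairFreq[std::make_pair(prev, cur)]++;
    prev = cur;
  }
  std::fclose(f);

  // Table 1: rank the byte values by frequency, most frequent first. The
  // detector only models the top few dozen characters of the alphabet.
  std::vector<int> order(256);
  for (int i = 0; i < 256; ++i)
    order[i] = i;
  std::sort(order.begin(), order.end(),
            [&](int a, int b) { return charFreq[a] > charFreq[b]; });
  for (int rank = 0; rank < 64; ++rank)
    std::printf("rank %2d: byte 0x%02X  count %lu\n",
                rank, order[rank], charFreq[order[rank]]);

  // Table 2 would then bucket the pairFreq counts of the top-ranked bytes
  // into a few likelihood classes, producing a 64x64 precedence matrix.
  return 0;
}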
Status: NEW → ASSIGNED
BTW, it's possible to distinguish between logical and visual Hebrew in the
following way: if a final-form letter (used quite frequently in Hebrew text)
occurs right after a word delimiter, it's visual Hebrew; otherwise it's logical.
Of course, I don't know whether this is compatible with Shanjian's algorithm.
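
A rough sketch of that heuristic, assuming a windows-1255/ISO-8859-8 byte
buffer (the two encodings share the same positions for the Hebrew letters);
the function names and the voting scheme here are illustrative only:

#include <cstddef>

// Final Kaf, Mem, Nun, Pe and Tsadi in windows-1255 / ISO-8859-8.
static bool IsFinalHebrewLetter(unsigned char c) {
  return c == 0xEA || c == 0xED || c == 0xEF || c == 0xF3 || c == 0xF5;
}

// Positive result suggests logical order, negative suggests visual order.
int ScoreLogicalVsVisual(const unsigned char* buf, std::size_t len) {
  int score = 0;
  unsigned char prev = ' ';  // pretend the buffer starts at a word boundary
  for (std::size_t i = 0; i < len; ++i) {
    unsigned char cur = buf[i];
    // Final letter right after a delimiter: the word starts with a final
    // form, so the text is probably laid out visually (i.e. reversed).
    if (prev == ' ' && IsFinalHebrewLetter(cur))
      score--;
    // Final letter right before a delimiter: the word ends with a final
    // form, which is where it belongs in logically ordered text.
    if (cur == ' ' && IsFinalHebrewLetter(prev))
      score++;
    prev = cur;
  }
  return score;
}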
 
QA Contact: giladehven → zach
Blocks: 115714
From comment #6:
> Prepare at least 10M of plain text, preferably from various sources and styles.

I have most of it here, and can get up to 10MB easily. Is there anyone out there
who can do something with it? I can send it to him/her by email (as it is text,
it should compress well).
*** Bug 222440 has been marked as a duplicate of this bug. ***
Shanjian Li wrote:
> 1. Prepare at least 10M of plain text, preferably from various sources and styles.

We got it, see comment #8. Now, what does it take to go through to the next
three steps?

Prog.
Kat, are the instructions for building the language model and any necessary code
available anywhere? Maybe we can work on this together.
Simon, there is a description of one such model in the Li and Momoi paper at:

http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html

and the example that should provide a model for you is the Russian detector,
which uses the 2-char sequence method (see section 4.7). If Shanjian can spend
a bit of time, that would be best, because he knows how to make a language
model, and the best info available is only a description of the process rather
than a how-to.

You might be able to come up with your own language model based on the
collected Hebrew data and the Russian-related files contained in the tree at:

mozbuild\mozilla\intl\chardet\tools

(I could be wrong about this, but there seems to be no "automatic" way to
generate a language model from a set of data. You need to apply
language-specific logic to your model generator, a la the Russian model, and
then decide how you want to define the Hebrew model.)

Shanjian, are you around to get Simon started on this?
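
For anyone approaching this cold, here is a sketch of how the 2-char sequence
method might score a buffer at detection time, reconstructed from the paper's
description; the table layout, the class values, and all names here are
assumptions, not the actual chardet code:

#include <cstddef>

// Scores a buffer against one language model. charToOrder maps each byte to
// its frequency rank in the model's training corpus (table 1); precedence
// classifies each pair of top-64 ranks by how often that pair occurred
// (table 2). Class 3 is assumed here to mean "frequent pair".
float ScoreBuffer(const unsigned char* buf, std::size_t len,
                  const unsigned char charToOrder[256],
                  const unsigned char precedence[64][64]) {
  unsigned long total = 0, positive = 0;
  int prevOrder = -1;
  for (std::size_t i = 0; i < len; ++i) {
    int order = charToOrder[buf[i]];
    if (order < 64) {            // byte is one of the 64 modeled characters
      if (prevOrder >= 0) {
        ++total;
        if (precedence[prevOrder][order] == 3)
          ++positive;
      }
      prevOrder = order;
    } else {
      prevOrder = -1;            // an unmodeled byte breaks the sequence
    }
  }
  // Crude confidence: the fraction of frequent pairs among modeled pairs.
  return total ? static_cast<float>(positive) / total : 0.0f;
}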
The tools to build the language model haven't been checked in. Part of the
reason is that I didn't find time to write the documentation, and they are
rather useless without any docs. Kat talked to me before about further
improving the detection of single-byte charsets; if I had time, I would opt to
go in that direction. The existing detector does not do a good job among the
similar Latin-language charsets. Hebrew can probably be covered by just using
the existing algorithm.

Here is what I suggest we do:
1. (Shoshannah Forbes?) Please send me the text you collected, via email, or
let me ftp it.
2. I will prepare a language model and add a module for Hebrew. I might need
to talk to Simon to get basic knowledge about Hebrew.
3. I can explain what I did and let Simon tune the code. The good thing about
the universal charset detector is that it can be easily extended, and any kind
of knowledge can be applied to the detector to improve the detection results.
This part needs a good knowledge of the language, and Simon will be the
perfect candidate.
the Hebrew text file can be downloaded from:
http://www.xslf.com/temp/hbtext.tar.bz
Blocks: 240501
(In reply to comment #14)
> the Hebrew text file can be downloaded from:
> http://www.xslf.com/temp/hbtext.tar.bz

The link seems to be dead :(
(In reply to comment #15)

> The link seems to be dead :(


Yes, they have been removed from the server, as I needed the disk space and a
few months have passed since I uploaded them.

If someone is planning to work on this bug, I'll be glad to upload them again
(but not for an unlimited amount of time).
OK, let's proceed like this:

Shosh, if you can upload the file again, or email it to me, I'll start working
on some documentation about Hebrew script which can be used as background for
preparing the language model. There are several issues that need to be
considered: a few that come to mind are

* pointed vs. unpointed text
* logical vs. visual
* Hebrew script used for other languages
  * with similar writing systems (e.g. Aramaic)
  * with radically different writing systems (e.g. Ladino, Yiddish, etc.)
(In reply to comment #17)
 
> Shosh, if you can upload the file again

The file is back up at http://www.xslf.com/temp/hbtext.tar.bz for a limited
amount of time.
Is there any progress on this one?

Prog.
OS: other → All
*** Bug 276271 has been marked as a duplicate of this bug. ***
Hello,
I would like to resume work on this bug.
Does anyone know where I can find the code of the tool to build the language
model?
So far, Katsuhiko Momoi doesn't seem to be responsive. Does this mean we're
stuck without tools to generate language support for any language other than
the existing ones?
Shanjian emailed me this week the original tools used to construct the language
model, so we should be able to start moving forward.
Attached patch Preliminary patch (obsolete) — Splinter Review
I made this patch by running Shanjian's tool on Shosh's Hebrew text file. It
needs a lot more testing and refining - firstly we need to allow for the
possibility of Visual Hebrew, and secondly the 64x64 matrix is really overkill
for Hebrew, and we should probably make the matrix size variable.
Comment on attachment 171414 [details] [diff] [review]
Preliminary patch

Sorry, I just noticed the patch is incomplete.
Attachment #171414 - Attachment is obsolete: true
Here are some test cases that I have been collecting. Whenever I came across a
Hebrew page without a valid encoding declaration in the last few months, I
bookmarked it. (Of course, we also need to test pages which are not in Hebrew,
for false positives.)

Logical Hebrew:

http://et.hopeways.org/
http://www.mibereshit.org/Hpash/104.html
http://www.usajewish.com/rotblit.html
http://www.daat.ac.il/mishpat-ivri/havat/7-2.htm
http://www.eventact.com/stier/computax04/
http://morshem.free.fr/lamed/lamed.htm

Visual Hebrew:

http://www.hitechjob.co.il/
http://www.torahcodes.co.il/havlin.htm
Including everything this time...
Shanjian has not been working on Mozilla for 2 years, and these bugs are still
here. Marking them WONTFIX. If you want to reopen this, find a good owner first.
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → WONTFIX
ftang!
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Assignee: shanjian → smontagu
Status: REOPENED → NEW
(In reply to comment #27)
> Created an attachment (id=171436) [edit]
> Preliminary patch

Simon, assuming your patch hasn't bit-rotted yet, how about getting it reviewed?

Thanks,

Prog.
(In reply to comment #30)
> Simon, assuming your patch hasn't bit-rotted yet, how about getting it reviewed?

It needs a little work yet (see comment 24) and I understand that Shy (shoosh)
was planning to work on it. If not, I can carry on with it myself.
I've been working on this for a few days and made a new patch based on the
preliminary patch by Simon.
This patch contains:

- The win1255 Hebrew language model produced by the model-producing utility
  (from Simon's patch).

- Fully featured identification of visual Hebrew (ISO-8859-8) as opposed to
  logical Hebrew (windows-1255).
  This addition involved the following:
  * writing nsHebrewProber, which makes the visual Hebrew vs. logical Hebrew
    decision by looking at ending letters and where they are found in the text
  * adding to the general Single Byte CharSet Prober the ability to look up
    pairs of characters in reverse, and the ability to take a 'helper prober'
    for the name decision, which is used in GetCharSetName()
  * adding to the SBCSGroupProber an SBCSProber for visual Hebrew and a
    nsHebrewProber
  See more about how it works in the comment in nsHebrewProber.h, and the
  sketch at the end of this comment.

- Refactoring of FilterWithEnglishLetters() and FilterWithoutEnglishLetters():
  they appeared in two places, one of them not up to date; they now live only
  in nsCharSetProber.cpp (a new file) and are up to date.

- Reorganization of the debug output lines to make them more useful and
  readable.

- Lots of documentation about Hebrew identification, and in general where it
  was missing.

Testing:
In addition to the patch, I'm attaching an HTML page containing a lot of links
to test-case sites I've found. These include sites in both visual and logical
Hebrew without any meta charset tag, which appear in the wrong encoding in
1.0.4 and the current alpha.
I've tested the patch on all of these sites and it worked well (see the
comments near the links where it didn't, and why).

An easy method of finding pages in visual Hebrew is: not using Google :)
A Hebrew query in Google gets searched both "logically" and "visually"
(backwards), so searching for backwards Hebrew words doesn't work in Google.
In Yahoo search, however, it works well.

Possible further testing might include manufacturing pages in Russian and
Greek without a meta charset tag and making sure they do not get identified as
Hebrew. So far I've tested only a handful of Russian pages.

I would really like to get this patch through, so I would really appreciate
any comments and reviews.
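
For reference, the visual-vs-logical name decision described above boils down
to something like the following. This is a paraphrase based on this comment
and the code quoted in the review below; the threshold value and the fallback
to the two probers' confidences are illustrative, not the exact patch code:

static const int MIN_FINAL_CHAR_DISTANCE = 5;  // illustrative value only

// finalCharLogicalScore / finalCharVisualScore are the final-letter votes
// accumulated by nsHebrewProber while scanning the text.
const char* GetHebrewCharSetName(int finalCharLogicalScore,
                                 int finalCharVisualScore,
                                 float logicalProberConfidence,
                                 float visualProberConfidence) {
  int finalsub = finalCharLogicalScore - finalCharVisualScore;
  // Strong final-letter evidence decides on its own.
  if (finalsub >= MIN_FINAL_CHAR_DISTANCE)
    return "windows-1255";   // logical Hebrew
  if (finalsub <= -MIN_FINAL_CHAR_DISTANCE)
    return "ISO-8859-8";     // visual Hebrew
  // Otherwise fall back to the model scores of the two single-byte probers
  // (the logical one reads pairs forwards, the visual one in reverse).
  return (logicalProberConfidence > visualProberConfidence)
             ? "windows-1255"
             : "ISO-8859-8";
}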
Attachment #188869 - Flags: review?(smontagu)
Attached file test cases
Test case URLs
Shy, is there any chance you could provide a build for Windows? I don't mind
hosting it and it will help many other users test it. Needless to say, I don't
mind hosting builds for other platforms as well.

Thanks,

Prog.
Of course. Here's a link to a Windows build containing my patch:
http://www.shiny.co.il/shooshx/86999build.zip

This is a debug build of a more or less recent Deer Park Alpha 2 checkout,
patched with my proposed patch. I've changed the #define in nsCharSetProber.h
to print the debug output.
This was built using VC71. There are some additional VC71 DLLs needed by the
executable in:
http://www.shiny.co.il/shooshx/vc71runtime.zip
I didn't test this on a machine without VC71 (or in fact on any machine other
than my own), so if anyone is missing anything, let me know and I'll upload it.

To make this work:
- Unzip everything to a directory somewhere.
- Open a command line and set the following:
set XPCOM_DEBUG_BREAK=warn
This makes assertions not show up as annoying message boxes.
- Run firefox.exe.
In the console you can see all of its debug output, as well as the charset
scores.
(Don't forget to select View->Character Encoding->Auto Detect->Universal.)
- To run Deer Park in addition to an already-running Firefox, one should also
set:
set MOZ_NO_REMOTE=1
and make Deer Park use a different profile than the one used by Firefox, using
the command line argument -ProfileManager or -p someprofile.
Enjoy.
Corrected patch, after a review from timeless@bemail.org:

- lots of bad English fixes.
- checking for memory allocation failures.
- licence updated with my name.

Notice: the diff of nsSBCSGroupProber.cpp is a bit non-trivial to follow,
due to two changes made in the same place: the removal of
FilterWithoutEnglishLetters and the change in Reset for memory checking. It's
best to view this file patched and then look at the diff.
Attachment #188869 - Attachment is obsolete: true
Attachment #189608 - Flags: review?(smontagu)
Comment on attachment 189608 [details] [diff] [review]
proposed patch - full hebrew identification, logical and visual

>+ * Contributor(s):
>+ *          Shy Shalom <shooshX@gmail.com>

Please add me and Shosh Forbes to the contributor lists of LangHebrewModel.cpp:

   Simon Montagu <smontagu@smontagu.org>
   Shoshannah Forbes <xslf@xslf.com>

>+#define FINAL_PEH ('\xf3')
>+#define NORMAL_PEH ('\xf4')
>+#define FINAL_TSADIK ('\xf5')
>+#define NORMAL_TSADIK ('\xf6')

Nit: it's easier for non-Hebrew speakers to follow if you use the Unicode names
PE and TSADI (KAF MEM and NUN are the same)

>+  // The normal Tsadik is not a good Non-Final letter due to words like 'lechotet' (to chat)

Nit: can you format at least the long comments into 80-character lines? Did I
mention that you *rock* for writing these comments?

>+  // The letters Peh and Kaf rarely displays a related behavior of not being a good Non-Final letter.

s/displays/display/

>+  char cur, prev = ' ', beforePrev = ' ';
>+  // prev and beforePrev are initialized to space in order to simulate a word delimiter at the 
>+  // beginning of the buffer

I wonder if that's really what you want to do. Except for the first buffer
received, the beginning of the buffer isn't necessarily a word boundary.

>+  if ((finalsub >= MIN_FINAL_CHAR_DISTANCE) || (finalsub <= -(MIN_FINAL_CHAR_DISTANCE)))
>+  {
>+    if (finalsub > 0) return LOGICAL_HEBREW_NAME;
>+    return VISUAL_HEBREW_NAME;
>+  }

I would like this better in separate ifs:

  if (finalsub >= MIN_FINAL_CHAR_DISTANCE)
    return LOGICAL_HEBREW_NAME;
  if (finalsub <= -(MIN_FINAL_CHAR_DISTANCE))
    return VISUAL_HEBREW_NAME;

and the same for the next condition.

>+void  nsHebrewProber::DumpStatus()
>+{
>+  printf("  HP score: %d - %d\r\n", mFinalCharLogicalScore, mFinalCharVisualScore);
>+}

This rubric could be made more informative, like "Hebrew logical - visual
score"

>+ * It is the renderer responsibility to display the text from right to left. 

"renderer's"

>+ * To sum all of the above, the Hebrew probing mechanism knows about two charsets:

"sum up"

>+ *    line order is natural. For charset recognition purposes the lines order is unimportant
>+ *    (In fact, for this implementation, even words order is unimportant).

"line order" both times, and "word order"

r=me with all those nits picked.
Attachment #189608 - Flags: review?(smontagu) → review+
(In reply to comment #37)
> >+ * It is the renderer responsibility to display the text from right to left. 
> 
> "renderer's"

As an inanimate object, "renderer" shouldn't take possession either (unless you
wish to personify it). "The responsibility of the renderer" is better IMHO.

Prog.
(In reply to comment #38)
> As an inanimate object, "renderer" shouldn't take possession either (unless you
> wish to personify it).

Do you have a source for this principle? I never heard it before (which doesn't
mean it's wrong).
You can find about 6000 such sources here:

http://www.google.com/search?hl=en&q=inanimate+objects+possession+apostrophe

Prog.
(In reply to comment #40)
> You can find about 6000 such sources here:
> 
> http://www.google.com/search?hl=en&q=inanimate+objects+possession+apostrophe

Judging by a small sample, not all 6000 state it as a hard and fast rule ;-) In
any case, attributing responsibility to a renderer seems to me to be
personification already.
Alias: detectHebrew
Comment on attachment 189608 [details] [diff] [review]
proposed patch - full hebrew identification, logical and visual

i'm not sure i agree with these licenses,

>Index: extensions/universalchardet/src/LangHebrewModel.cpp
>+ * The Original Code is Mozilla Communicator client code.
change this to something about the hebrew-prober-autodetect or something like
that (applies to all relevant files)

>+ * The Initial Developer of the Original Code is
>+ * Netscape Communications Corporation.
>+ * Portions created by the Initial Developer are Copyright (C) 1998
>+ * the Initial Developer. All Rights Reserved.
this should implicate smontagu and 2004

>+#include "nsSBCharSetProber.h"
>+
>+
>+
do you need 3 blank lines? :)
>+//Windows-1255 language model

>Index: extensions/universalchardet/src/nsCharSetProber.cpp
>+ * The Original Code is mozilla.org code.
this should really be changed to something remotely meaningful, but we can
ignore it.

>+ * The Initial Developer of the Original Code is
>+ * Netscape Communications Corporation.
>+ * Portions created by the Initial Developer are Copyright (C) 1998
>+ * the Initial Developer. All Rights Reserved.
i bet cvs archeology will show that the original code is post 1998, oh well.

>+      else //ignore current segment. (either because it is just a symbol or just an English word
please add a )
>+  newLen = newptr - *newBuf;
>+
>+  return PR_TRUE;
>+}
>+
>+
please remove the trailing lines.

>Index: extensions/universalchardet/src/nsCharSetProber.h
>+  // Helper functions used in the Latin1 and Group probers.
>+  // Allocates a new buffer for newBuf. This buffer should be freed by the caller using PR_FREEIF.
perhaps "They allocate" instead of "Allocates"?

>+  // Both functions return PR_FALSE in case of memory allocation failure.
>+  static PRBool FilterWithoutEnglishLetters(const char* aBuf, PRUint32 aLen, char** newBuf, PRUint32& newLen);
>+  static PRBool FilterWithEnglishLetters(const char* aBuf, PRUint32 aLen, char** newBuf, PRUint32& newLen);

>Index: extensions/universalchardet/src/nsHebrewProber.cpp
>+ * The Original Code is mozilla.org code.
change this to something about the hebrew-prober-autodetect or something like
that (applies to all relevant files)

it looks like this file is yours, in which case the following blob is wrong
(you should be initial and the current year, 2005, should be listed):
>+ * The Initial Developer of the Original Code is
>+ * Netscape Communications Corporation.
>+ * Portions created by the Initial Developer are Copyright (C) 1998
>+ * the Initial Developer. All Rights Reserved.

>+  // The letters Peh and Kaf rarely displays a related behavior of not being a good Non-Final letter.
display

>+ * 3) A word longer than 1 letter, starting with a final letter. Final letters should not 
>+ *    appear in the  beginning of a word. This is an indication that the text is laid out backwards.
appear at the beginning...

>+ * This method relies on the input buffer being an output of FilterWithoutEnglishLetters().
smontagu, brad, and I decided (at our varying times of night/day) that we
disliked this sentence and any rewording we could find, and that it's better to
simply rely on the fact that the caller documents this behavior.

for reference, here's a one sided synopsis:
<smontagu> This method expects an input buffer which has been passed through
FilterWithoutEnglishLetters() and so contains only Hebrew characters and ASCII
space characters, without multiple spaces between words.
<smontagu> but "Hebrew characters" is wrong, because that's what we are trying
to establish.
* smontagu doesn't know a good term for "characters in the range 0x80-0xFF"
<smontagu> I hate "high ascii"

<blockquote quoter=smontagu saying="there is a comment elsewhere which says">
+ * The nsSBCSGroupProber is responsible for stripping the original text of
HTML tags, English characters,
+ * numbers, low-ASCII punctuation characters, spaces and new lines. It reduces
any sequence of such
+ * characters to a single space. The buffer fed to each prober in the SBCS
group prober is pure text in
+ * high-ASCII.
so maybe it doesn't need to be said again
</blockquote>

>Index: extensions/universalchardet/src/nsHebrewProber.h
>+ * The Original Code is mozilla.org code.
change this to something about the hebrew-prober-autodetect or something like
that (applies to all relevant files)

it looks like this file is yours, in which case the following blob is wrong
(you should be initial and the current year, 2005, should be listed):
>+ * The Initial Developer of the Original Code is
>+ * Netscape Communications Corporation.
>+ * Portions created by the Initial Developer are Copyright (C) 1998
>+ * the Initial Developer: All Rights Reserved.
>+ *
>+ * Contributor(s):
>+ *          Shy Shalom <shooshX@gmail.com>

>+ * Texts in x-mac-hebrew are almost impossible to find in the internet. From what little evidence 
i'd write "on the internet" (i might capitalize Internet, but we think that
might be a lost cause)

>+#endif /* nsHebrewProber_h__ */
>+
>+
>+
soo many blank lines at the end :p

in the place where you hard code 10/11/12 you should probably make a note so
that people are careful not to mess up your slot numbers :)
Corrected patch, after reviews from smontagu and timeless:

- corrected comments according to your suggestions
- updated licence details
- in nsHebrewProber, moved prev and beforePrev from HandleData to being class
members. This allows the state maintained in HandleData to be preserved across
buffers.

Thanks for the reviews.
Attachment #189608 - Attachment is obsolete: true
Attachment #189781 - Flags: review?(smontagu)
Attachment #189781 - Flags: review?(timeless)
Attachment #188869 - Flags: review?(smontagu)
Comment on attachment 189781 [details] [diff] [review]
proposed patch - full hebrew identification, logical and visual

note that i'm not a module owner, r=smontagu or equivalent is still required.
Attachment #189781 - Flags: review?(timeless) → review+
Comment on attachment 189781 [details] [diff] [review]
proposed patch - full hebrew identification, logical and visual

>+  printf("  HEB: %d - %d [Logical-Visusal score]\r\n", mFinalCharLogicalScore, mFinalCharVisualScore);

No need for a new patch, but the typo "Visusal" should be corrected before
checking in.
Attachment #189781 - Flags: review?(smontagu) → review+
Attachment #189781 - Flags: superreview?(blizzard)
Attachment #189781 - Flags: superreview?(blizzard) → superreview?(roc)
Comment on attachment 189781 [details] [diff] [review]
proposed patch - full hebrew identification, logical and visual

looks nice. The code you moved has some style issues but you don't need to fix
them.
Attachment #189781 - Flags: superreview?(roc) → superreview+
Attachment #189781 - Flags: approval1.8b4?
Assignee: smontagu → shoosh20012001
Target Milestone: Future → mozilla1.8beta4
Comment on attachment 189781 [details] [diff] [review]
proposed patch - full hebrew identification, logical and visual

When requesting approval to check in at this late stage, please include an
explanation of the value of the change, the risk associated with the change and
any testing that's been done on the change. Thanks.
Comment on attachment 189781 [details] [diff] [review]
proposed patch - full hebrew identification, logical and visual

This patch has almost no effect on anything but Hebrew autodetection, which
isn't supported at all right now. It should be safe enough.
Let's see now.

Value of the change: As of now, Hebrew pages with no charset specification are
either displayed as gibberish or, if the Universal charset detector is on,
recognized as Russian or Greek. This patch solves that problem with the added
recognition of Hebrew.

Risk: As already commented, virtually none. There is a certain theoretical
chance that with this patch, Greek or Russian pages will be identified by the
Universal charset detector as Hebrew. I have not seen any evidence that this
can happen in practice.
Also, since the Hebrew probers are run on every page without a charset
specification, this patch has some theoretical performance hit on page load
time. The performance hit is already there for the entire Universal charset
detector, and this patch, adding two more probers to it, adds some additional
load.

Testing: I have a wide variety of visual and logical Hebrew pages as test
cases; most of them are in the attached HTML file. On all of these pages the
patch works with expected and almost perfect results. I have also tested it on
a number of Russian and Greek pages, which didn't seem to be remotely affected
by the change; in fact, the Hebrew probers were nowhere near identifying them
as Hebrew.

I think this patch is important, since it provides a genuine advantage to
Hebrew-reading users, making their lives somewhat happier :).
I definitely agree that there is great value in including Hebrew in the
Universal Auto-detect. I just want to make sure that the existing
auto-detection for CJK still works and that someone has tested it.
Attachment #189781 - Flags: approval1.8b4? → approval1.8b4+
checked in by timeless.
Status: NEW → RESOLVED
Closed: 19 years ago19 years ago
Resolution: --- → FIXED
I've tested many of the links in attachment 188870 [details], and the patch
seems to be very effective with most of them. A few pages have the wrong
encoding specified in the HTML code, but are actually iso-8859-8:

"X-MAC-HEBREW"
http://www.escape.com/~elyaqim/music/jew/yiddishlyrics/yontevdi.html

"WINDOWS-1255"
http://www.tau.ac.il/humanities/publications/zmanim/amiragel/gel61_3.htm

"iso-8859-1"
http://www.jewishmag.com/nlpjerusalem/hebnews1.htm

"Hebrew Alphabet (ISO-Visual)"
http://www.haifamuseums.org.il/japintr1.htm

"ISO-8859-8-i"
http://www.arkline.com/arik/Eitai.html

Since the browser shouldn't override pages that explicitly specify encoding, the
above is not a bug in the auto-detection code.

In addition, the following page had text which was replaced with question marks
(probably when submitted to the site). Not much can be done here.

http://www.icq.com/whitepages/wwp.php?Uin=303856019

Verified with Firefox-trunk/20050809/WinXP.

Prog.
Status: RESOLVED → VERIFIED
Comment on attachment 189781 [details] [diff] [review]
proposed patch - full hebrew identification, logical and visual

mozilla/extensions/universalchardet/src/nsSBCharSetProber.h	1.8
mozilla/extensions/universalchardet/src/nsSBCharSetProber.cpp	1.9
mozilla/extensions/universalchardet/src/nsSBCSGroupProber.h	1.8
mozilla/extensions/universalchardet/src/nsSBCSGroupProber.cpp	1.12
mozilla/extensions/universalchardet/src/nsMBCSGroupProber.cpp	1.8
mozilla/extensions/universalchardet/src/nsLatin1Prober.h	1.3 
mozilla/extensions/universalchardet/src/nsLatin1Prober.cpp	1.6
mozilla/extensions/universalchardet/src/nsHebrewProber.h	1.1
mozilla/extensions/universalchardet/src/nsHebrewProber.cpp	1.1
mozilla/extensions/universalchardet/src/nsCharSetProber.h	1.9
mozilla/extensions/universalchardet/src/nsCharSetProber.cpp	1.1
mozilla/extensions/universalchardet/src/Makefile.in	1.11
mozilla/extensions/universalchardet/src/LangHebrewModel.cpp	1.1
mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp 	1.20
mozilla/extensions/universalchardet/src/nsUdetXPCOMWrapper.cpp	1.2
mozilla/extensions/universalchardet/src/nsHebrewProber.cpp	1.2
Attachment #189781 - Attachment is obsolete: true
Depends on: 304951
Testing with 1.5 Beta 2, there is a regression from 1.0.7 in showing Hebrew
pages. See for example
http://www.haaretz.co.il/hasite/pages/LiArtEc.jhtml?contrassID=4&subContrassID=0&sbSubContrass=0&from=hasot
(or any http://www.haaretz.co.il economics section).
That page has a meta tag with charset=windows-1255. However, if I select the
"universal" character encoding, FF 1.5b2 switches to visual Hebrew, resulting
in about 30% of the page showing reversed characters. FF 1.0.7, OTOH, marks it
windows-1255, as it should be (although a small currency exchange table may
then show the Cyrillic charset instead of Hebrew). Please confirm and see if
the bug needs to be reopened.
(In reply to comment #54)
> Testing with 1.5 Beta 2, there is a regression from 1.0.7 in showing Hebrew
> pages. See for example
> http://www.haaretz.co.il/hasite/pages/LiArtEc.jhtml?contrassID=4&subContrassID=0&sbSubContrass=0&from=hasot
> (or any http://www.haaretz.co.il economics section)

The issue you are describing is bug 308187, which is a tech evangelism issue
with Haaretz, and unrelated to this bug. Autodetection is not even activated in
this case, because the site explicitly specifies a (wrong) charset in the HTTP
headers (this also overrides the correct charset specified in the META tag). The
reason the problem was not evident in 1.0 was bug 244964 (fixed for 1.5).