Closed
Bug 86999
(detectHebrew)
Opened 23 years ago
Closed 19 years ago
Hebrew support for Universal (All) Autodetect
Categories
(Core :: Internationalization, defect)
Tracking
VERIFIED
FIXED
mozilla1.8beta4
People
(Reporter: bobj, Assigned: shooshx)
References
Details
(Keywords: intl)
Attachments
(2 files, 4 obsolete files)
16.29 KB, patch | Details | Diff | Splinter Review
18.77 KB, text/html | Details
The Universal Autodetect should support Hebrew since 6.1 now includes support for Hebrew pages.
Comment 1•23 years ago
We also need to detect both logical Hebrew and visual Hebrew, since the current statistical model will have a different outcome depending on whether the text is visual or logical. Roy is moving Shanjian's code to extensions/chardet right now. He said it will take < 2 weeks to do so. After that is done, we can ask simon@softel.co.il to give us some Hebrew data. (He already gave me the whole Hebrew OT bible in logical order, but I am not sure it is good enough for modern content.)
> Roy is moving Shanjian's code to extensions/chardet right now. He said it
> will take < 2 weeks to do so. After that is done, we can ask
> simon@softel.co.il to give us some Hebrew data. (He already gave me the
> whole Hebrew OT bible in logical order, but I am not sure it is good
> enough for modern content.)
There is no dependency on moving the code to start creating the data.
Maybe if Shanjian (who is still on paternity leave) could tell someone
else how to create the data, this work could be started sooner.
Comment 3•23 years ago
It is not just data. Data is important. But someone still needs to write some code for it, as I understand it.
Target Milestone: --- → Future
In an email to me, Shanjian wrote:
> As for Arabic and Hebrew, I am not sure if IBM can provide us with at
> least 10M of plain text for each languages. If so, I can build our own
> language model and add such support in our commercial build.
But I don't know what it means to build the language model...
Comment 5•23 years ago
> But I don't know what it means to build the language model...
This is explained in a paper Shanjian and I will be
presenting at the 19th Unicode Conf in San Jose.
We will soon make it available Netscape-internally when the
final paper has been submitted.
Comment 6•23 years ago
To add support for a new language to my universal detector, it should be a simple process:
1. Prepare at least 10M of plain text, preferably from various sources and styles.
2. Run my tools and generate 2 tables (the language model).
3. Put the 2 tables into a new header file.
4. Make some small changes in the makefile and the main control module.
In the worst case, if the language is strange enough that my current algorithm cannot handle it well, I would need to modify the other files. Normally that can be done by adjusting some constants. This situation has not happened recently, so most likely we will not have to do it.
Status: NEW → ASSIGNED
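Shanjian's model-building tool was never attached to this bug, but step 2 of the process above essentially means counting character-pair frequencies over the sample text and ranking the most common pairs. A minimal sketch of that counting step, under the assumption that the model is 2-character-sequence based (the function name and table layout here are illustrative, not the actual tool):

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Count 2-character (bigram) sequences in sample text. A real model builder
// would rank the most frequent pairs and emit the resulting tables as a C
// header file (step 3 of the process described above).
std::map<std::pair<unsigned char, unsigned char>, int>
BuildBigramCounts(const std::string& text) {
  std::map<std::pair<unsigned char, unsigned char>, int> counts;
  for (std::size_t i = 0; i + 1 < text.size(); ++i) {
    // Key is the adjacent byte pair; value is how often it occurred.
    ++counts[{(unsigned char)text[i], (unsigned char)text[i + 1]}];
  }
  return counts;
}
```

Running this over ~10M of Hebrew text would give the raw frequencies from which the two model tables are derived.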
Comment 7•23 years ago
BTW, it's possible to distinguish between logical and visual Hebrew in the following way: if a final-form letter (pretty frequently used in Hebrew script) occurs after a word delimiter, it's visual Hebrew; otherwise, it's logical. Of course, I don't know whether this is compatible with Shanjian's algorithm...
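This heuristic is essentially what the eventual nsHebrewProber does. A minimal sketch, assuming windows-1255/ISO-8859-8 byte values for the five final-form letters (the function names and scoring scheme are illustrative, not the patch's actual code):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Final-form Hebrew letters in windows-1255 / ISO-8859-8:
// final kaf 0xEA, final mem 0xED, final nun 0xEF, final pe 0xF3, final tsadi 0xF5.
static bool IsFinalLetter(unsigned char c) {
  return c == 0xEA || c == 0xED || c == 0xEF || c == 0xF3 || c == 0xF5;
}

// In logical Hebrew a final letter ends a word (delimiter follows it);
// in visual (reversed) Hebrew it shows up right after a delimiter instead.
// Positive result suggests logical order, negative suggests visual.
int LogicalMinusVisualScore(const std::string& buf) {
  int logical = 0, visual = 0;
  char prev = ' ';  // pretend a word delimiter precedes the buffer
  for (std::size_t i = 0; i < buf.size(); ++i) {
    unsigned char cur = (unsigned char)buf[i];
    if (IsFinalLetter(cur)) {
      if (prev == ' ')
        ++visual;   // final letter opens a word: text laid out backwards
      else if (i + 1 == buf.size() || buf[i + 1] == ' ')
        ++logical;  // final letter closes a word: natural logical order
    }
    prev = buf[i];
  }
  return logical - visual;
}
```

For example, nun + final-pe + space ("\xF0\xF3 ") scores logical, while the reversed byte order scores visual.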
Updated•23 years ago
QA Contact: giladehven → zach
Comment 8•21 years ago
From comment #6:
> Prepare at least 10M of plain text, better from various source and style.
I have most of it here, and can get up to 10 MB easily. Is there anyone out there who can do something with it? I can send it to him/her by email (as it is text, it should compress well).
*** Bug 222440 has been marked as a duplicate of this bug. ***
Comment 10•21 years ago
Shanjian Li wrote:
> 1, Prepare at least 10M of plain text, better from various source and style.
We got it, see comment #8. Now, what does it take to go through the next three steps? Prog.
Comment 11•21 years ago
Kat, are the instructions for building the language model and any necessary code available anywhere? Maybe we can work on this together.
Comment 12•21 years ago
Simon, there is a description of one such model in the Li and Momoi paper at http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html, and the example that should provide a model for you is the Russian detector, which uses the 2-char sequence method (see section 4.7). If Shanjian can spend a bit of time, that would be best, because he knows how to make a language model, and the best info available is only a description of the process rather than a how-to.

You might be able to come up with your own language model based on the collected Hebrew data and the Russian-related files contained in the tree at mozbuild\mozilla\intl\chardet\tools. (I could be wrong about this, but there seems to be no "automatic" way to generate a language model from a set of data. You need to apply language-specific logic to your model generator, a la the Russian model, and then decide how you want to define the Hebrew model.)

Shanjian, are you around to get Simon started on this?
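For readers unfamiliar with the 2-char sequence method mentioned above: at detection time, each byte is mapped to a frequency order and a precomputed matrix says how typical each adjacent pair is for the language; confidence is the fraction of "very common" pairs. A toy sketch of that scoring step (the tiny 4x4 matrix here is invented for illustration; the real Russian and Hebrew models use 64x64 matrices built from ~10M of text):

```cpp
#include <cassert>

enum { kAlphabetSize = 4, kPositive = 3 };

// kToyModel[a][b] in 0..3: 3 = very common pair in the language, 0 = never seen.
// These values are made up purely to demonstrate the mechanism.
static const unsigned char kToyModel[kAlphabetSize][kAlphabetSize] = {
  {3, 3, 1, 0},
  {3, 2, 1, 0},
  {1, 1, 1, 0},
  {0, 0, 0, 0},
};

// Fraction of adjacent pairs that the model considers "very common".
double SequenceConfidence(const unsigned char* orders, int len) {
  if (len < 2) return 0.0;
  int total = 0, positive = 0;
  for (int i = 0; i + 1 < len; ++i) {
    ++total;
    if (kToyModel[orders[i]][orders[i + 1]] == kPositive) ++positive;
  }
  return (double)positive / total;
}
```

Text made of common pairs scores near 1.0; text made of never-seen pairs scores near 0.0, which is how one charset's model beats another's.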
Comment 13•21 years ago
The tools to build a language model haven't been checked in. Part of the reason is I didn't find time to write the documentation, and they are rather useless without any docs. Kat talked to me before about further improving the detection of single-byte charsets; if I had time, I would opt to go in that direction. The existing detector does not do a good job among those similar Latin-language charsets. Hebrew can probably be covered by just using the existing algorithm. Here is what I suggest:
1. (Shoshannah Forbes?) Please send me the text you collected, via email, or let me ftp it.
2. I will prepare a language model and add a module for Hebrew. I might need to talk to Simon to get basic knowledge about Hebrew.
3. I can explain what I did and let Simon tune the code.
The good thing about the universal charset detector is that it can be easily extended, and any kind of knowledge can be applied to the detector to improve the detection results. This part needs a good knowledge of the language, and Simon will be the perfect candidate.
Comment 14•21 years ago
the Hebrew text file can be downloaded from: http://www.xslf.com/temp/hbtext.tar.bz
Comment 15•20 years ago
(In reply to comment #14)
> the Hebrew text file can be downloaded from:
> http://www.xslf.com/temp/hbtext.tar.bz
The link seems to be dead :(
Comment 16•20 years ago
(In reply to comment #15)
> The link seems to be dead :(
Yes, they have been removed from the server as I needed the disk space, and a few months have passed since I uploaded them. If someone is planning to work on this bug, I'll be glad to upload them again (but not for an unlimited amount of time).
Comment 17•20 years ago
OK, let's proceed like this: Shosh, if you can upload the file again, or email it to me, I'll start working on some documentation about Hebrew script which can be used as background for preparing the language model. There are several issues that need to be considered; a few that come to mind are:
* pointed vs. unpointed text
* logical vs. visual
* Hebrew script used for other languages
  * with similar writing systems (e.g. Aramaic)
  * with radically different writing systems (e.g. Ladino, Yiddish, etc.)
Comment 18•20 years ago
(In reply to comment #17)
> Shosh, if you can upload the file again
The file is back up at http://www.xslf.com/temp/hbtext.tar.bz for a limited amount of time.
Comment 19•20 years ago
Is there any progress on this one? Prog.
Updated•20 years ago
OS: other → All
Comment 20•20 years ago
*** Bug 276271 has been marked as a duplicate of this bug. ***
Assignee
Comment 21•20 years ago
Hello, I would like to resume work on this bug. Does anyone know where I can find the code of the tool that builds the language model?
Comment 22•19 years ago
So far, Katsuhiko Momoi doesn't seem to be responsive. Does this mean we're stuck without tools to generate language support for any other language than the existing ones?
Comment 23•19 years ago
Shanjian emailed me this week the original tools used to construct the language model, so we should be able to start moving forwards.
Comment 24•19 years ago
I made this patch by running Shanjian's tool on Shosh's Hebrew text file. It needs a lot more testing and refining - firstly we need to allow for the possibility of Visual Hebrew, and secondly the 64x64 matrix is really overkill for Hebrew, and we should probably make the matrix size variable.
Comment 25•19 years ago
Comment on attachment 171414 [details] [diff] [review]
Preliminary patch

Sorry, I just noticed the patch is incomplete.
Attachment #171414 -
Attachment is obsolete: true
Comment 26•19 years ago
Here are some test cases that I have been collecting. Whenever I came across a Hebrew page without a valid encoding declaration in the last few months, I bookmarked it (of course we also need to test pages which are not in Hebrew for false positives).

Logical Hebrew:
http://et.hopeways.org/
http://www.mibereshit.org/Hpash/104.html
http://www.usajewish.com/rotblit.html
http://www.daat.ac.il/mishpat-ivri/havat/7-2.htm
http://www.eventact.com/stier/computax04/
http://morshem.free.fr/lamed/lamed.htm

Visual Hebrew:
http://www.hitechjob.co.il/
http://www.torahcodes.co.il/havlin.htm
Comment 27•19 years ago
Including everything this time...
Comment 28•19 years ago
Shanjian has not been working on Mozilla for 2 years and these bugs are still here. Marking them WONTFIX. If you want to reopen one, find a good owner first.
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → WONTFIX
Updated•19 years ago
Assignee: shanjian → smontagu
Status: REOPENED → NEW
Comment 30•19 years ago
(In reply to comment #27)
> Created an attachment (id=171436) [edit]
> Preliminary patch
Simon, assuming your patch hasn't bit-rotted yet, how about getting it reviewed? Thanks, Prog.
Comment 31•19 years ago
(In reply to comment #30)
> Simon, assuming your patch hasn't bit-rotted yet, how about getting it reviewed?
It needs a little work yet (see comment 24), and I understand that Shy (shoosh) was planning to work on it. If not, I can carry on with it myself.
Assignee
Comment 32•19 years ago
I've been working on this for a few days and made a new patch based on the preliminary patch by Simon. This patch contains:
- A windows-1255 Hebrew language model produced by the model-producing utility (from Simon's patch).
- Fully featured identification of visual Hebrew (ISO-8859-8) as opposed to logical Hebrew (windows-1255). This addition involved the following:
  * writing nsHebrewProber, which makes the visual-Hebrew-or-logical-Hebrew decision by looking at final letters and where they are found in the text;
  * adding to the general single-byte charset prober the ability to look up pairs of characters in reverse, and the ability to take a 'helper prober' for the name decision, which is used in GetCharSetName();
  * adding to the SBCSGroupProber an SBCSProber for visual Hebrew and a nsHebrewProber. See more about how it works in the comment in nsHebrewProber.h.
- Refactoring: FilterWithEnglishLetters() and FilterWithoutEnglishLetters() appeared in two places, one of them not up to date; they now appear only in nsCharSetProber.cpp (new file) and are up to date.
- Reorganization of the debug output lines to make them more useful and readable.
- Lots of documentation about Hebrew identification, and in general where it was missing.

Testing: in addition to the patch, I'm adding an HTML page containing a lot of links to test-case sites I've found. These include sites in both visual and logical Hebrew without any meta charset tag which appear in the wrong charset in 1.0.4 and the current alpha. I've tested the patch on all of these sites and it worked well (see comments near the links where it didn't, and why). An easy method of finding pages in visual Hebrew is: not using Google :) A Hebrew query in Google gets searched both "logically" and "visually" (backwards), so searching for backwards Hebrew words doesn't work in Google. In Yahoo search, however, it works well.

Possible further testing might include manufacturing pages in Russian and Greek without a meta charset tag and making sure they do not get identified as Hebrew. So far I've tested only a handful of Russian pages. I would really like to get this patch through, so I would really appreciate any comments and reviews.
Attachment #188869 -
Flags: review?(smontagu)
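The visual-vs-logical decision the patch describes boils down to comparing the two final-letter scores and only calling a winner when they differ by a minimum distance. A sketch of that decision, with a placeholder threshold and return values (the real constants and fallback behavior live in nsHebrewProber):

```cpp
#include <cassert>
#include <cstring>

// Illustrative threshold; the actual MIN_FINAL_CHAR_DISTANCE value in the
// patch may differ.
static const int MIN_FINAL_CHAR_DISTANCE = 5;

// logicalScore counts final letters found at word ends; visualScore counts
// final letters found at word starts (see the heuristic in comment 7).
const char* PickHebrewCharset(int logicalScore, int visualScore) {
  int finalsub = logicalScore - visualScore;
  if (finalsub >= MIN_FINAL_CHAR_DISTANCE)
    return "windows-1255";   // logical Hebrew
  if (finalsub <= -MIN_FINAL_CHAR_DISTANCE)
    return "ISO-8859-8";     // visual Hebrew
  return "inconclusive";     // real code falls back to the SBCS probers' scores
}
```

This mirrors the separate-ifs structure the reviewer asks for in comment 37.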
Assignee
Comment 33•19 years ago
test cases URLs
Comment 34•19 years ago
Shy, is there any chance you could provide a build for Windows? I don't mind hosting it and it will help many other users test it. Needless to say, I don't mind hosting builds for other platforms as well. Thanks, Prog.
Assignee
Comment 35•19 years ago
Of course. Here's a link to a Windows build containing my patch: http://www.shiny.co.il/shooshx/86999build.zip
This is a debug build of a more or less recent Deer Park alpha 2 checkout, patched with my proposed patch. I've changed the #define in nsCharSetProber.h to print the debug output. It was built using vc71; some additional vc71 DLLs needed by the executable are in http://www.shiny.co.il/shooshx/vc71runtime.zip. I didn't test this on a machine without vc71 (or in fact on any machine other than my own), so if anyone is missing anything, let me know and I'll upload it.
To make this work:
- Unzip everything to a directory somewhere.
- Open a command line and set the following:
  set XPCOM_DEBUG_BREAK=warn
  This makes assertions not show up as annoying message boxes.
- Run firefox.exe. In the console you can see all its debug output as well as the charset scores. (Don't forget View->Character Encoding->Auto-Detect->Universal.)
- To run Deer Park in addition to an already-running Firefox, also set:
  set MOZ_NO_REMOTE=1
  and make Deer Park use a different profile than the one used by Firefox, using the command line argument -ProfileManager or -p someprofile.
Enjoy.
Assignee
Comment 36•19 years ago
Corrected patch, after a review from timeless@bemail.org:
- lots of bad English fixed.
- checking for memory allocation failures.
- license updated with my name.
Notice: the diff of nsSBCSGroupProber.cpp is a bit non-trivial to follow due to two changes made in the same place: the removal of FilterWithoutEnglishLetters and the change in Reset for memory checking. It's best to view this file patched and then look at the diff.
Attachment #188869 -
Attachment is obsolete: true
Attachment #189608 -
Flags: review?(smontagu)
Comment 37•19 years ago
Comment on attachment 189608 [details] [diff] [review]
proposed patch - full hebrew identification, logical and visual

>+ * Contributor(s):
>+ * Shy Shalom <shooshX@gmail.com>

Please add me and Shosh Forbes to the contributor lists of LangHebrewModel.cpp:
Simon Montagu <smontagu@smontagu.org>
Shoshannah Forbes <xslf@xslf.com>

>+#define FINAL_PEH ('\xf3')
>+#define NORMAL_PEH ('\xf4')
>+#define FINAL_TSADIK ('\xf5')
>+#define NORMAL_TSADIK ('\xf6')

Nit: it's easier for non-Hebrew speakers to follow if you use the Unicode names PE and TSADI (KAF, MEM and NUN are the same).

>+ // The normal Tsadik is not a good Non-Final letter due to words like 'lechotet' (to chat)

Nit: can you format at least the long comments into 80-character lines? Did I mention that you *rock* for writing these comments?

>+ // The letters Peh and Kaf rarely displays a related behavior of not being a good Non-Final letter.

s/displays/display/

>+ char cur, prev = ' ', beforePrev = ' ';
>+ // prev and beforePrev are initialized to space in order to simulate a word delimiter at the
>+ // beginning of the buffer

I wonder if that's really what you want to do. Except for the first buffer received, the beginning of the buffer isn't necessarily a word boundary.

+ if ((finalsub >= MIN_FINAL_CHAR_DISTANCE) || (finalsub <= -(MIN_FINAL_CHAR_DISTANCE)))
+ {
+   if (finalsub > 0) return LOGICAL_HEBREW_NAME;
+   return VISUAL_HEBREW_NAME;
+ }

I would like this better in separate ifs:

if (finalsub >= MIN_FINAL_CHAR_DISTANCE)
  return LOGICAL_HEBREW_NAME;
if (finalsub <= -(MIN_FINAL_CHAR_DISTANCE))
  return VISUAL_HEBREW_NAME;

and the same for the next condition.

>+void nsHebrewProber::DumpStatus()
>+{
>+ printf(" HP score: %d - %d\r\n", mFinalCharLogicalScore, mFinalCharVisualScore);
>+}

This rubric could be made more informative, like "Hebrew logical - visual score".

>+ * It is the renderer responsibility to display the text from right to left.

"renderer's"

>+ * To sum all of the above, the Hebrew probing mechanism knows about two charsets:

"sum up"

>+ * line order is natural. For charset recognition purposes the lines order is unimportant
>+ * (In fact, for this implementation, even words order is unimportant).

"line order" both times, and "word order"

r=me with all those nits picked.
Attachment #189608 -
Flags: review?(smontagu) → review+
Comment 38•19 years ago
(In reply to comment #37) > >+ * It is the renderer responsibility to display the text from right to left. > > "renderer's" As an inanimate object, "renderer" shouldn't take possession either (unless you wish to personify it). "The responsibility of the renderer" is better IMHO. Prog.
Comment 39•19 years ago
(In reply to comment #38) > As an inanimate object, "renderer" shouldn't take possession either (unless you > wish to personify it). Do you have a source for this principle? I never heard it before (which doesn't mean it's wrong).
Comment 40•19 years ago
You can find about 6000 such sources here: http://www.google.com/search?hl=en&q=inanimate+objects+possession+apostrophe Prog.
Comment 41•19 years ago
(In reply to comment #40) > You can find about 6000 such sources here: > > http://www.google.com/search?hl=en&q=inanimate+objects+possession+apostrophe Judging by a small sample, not all 6000 state it as a hard and fast rule ;-) In any case, attributing responsibility to a renderer seems to me to be personification already.
Updated•19 years ago
Alias: detectHebrew
Comment 42•19 years ago
Comment on attachment 189608 [details] [diff] [review]
proposed patch - full hebrew identification, logical and visual

I'm not sure I agree with these licenses.

>Index: extensions/universalchardet/src/LangHebrewModel.cpp
>+ * The Original Code is Mozilla Communicator client code.

Change this to something about the hebrew-prober-autodetect or something like that (applies to all relevant files).

>+ * The Initial Developer of the Original Code is
>+ * Netscape Communications Corporation.
>+ * Portions created by the Initial Developer are Copyright (C) 1998
>+ * the Initial Developer. All Rights Reserved.

This should implicate smontagu and 2004.

>+#include "nsSBCharSetProber.h"
>+
>+
>+

Do you need 3 blank lines? :)

>+//Windows-1255 language model

>Index: extensions/universalchardet/src/nsCharSetProber.cpp
>+ * The Original Code is mozilla.org code.

This should really be changed to something remotely meaningful, but we can ignore it.

>+ * The Initial Developer of the Original Code is
>+ * Netscape Communications Corporation.
>+ * Portions created by the Initial Developer are Copyright (C) 1998
>+ * the Initial Developer. All Rights Reserved.

I bet cvs archeology will show that the original code is post 1998, oh well.

>+ else //ignore current segment. (either because it is just a symbol or just an English word

Please add a )

>+ newLen = newptr - *newBuf;
>+
>+ return PR_TRUE;
>+}
>+
>+

Please remove the trailing lines.

>Index: extensions/universalchardet/src/nsCharSetProber.h
>+ // Helper functions used in the Latin1 and Group probers.
>+ // Allocates a new buffer for newBuf. This buffer should be freed by the caller using PR_FREEIF.

Perhaps "They allocate" instead of "Allocates"?

>+ // Both functions return PR_FALSE in case of memory allocation failure.
>+ static PRBool FilterWithoutEnglishLetters(const char* aBuf, PRUint32 aLen, char** newBuf, PRUint32& newLen);
>+ static PRBool FilterWithEnglishLetters(const char* aBuf, PRUint32 aLen, char** newBuf, PRUint32& newLen);

>Index: extensions/universalchardet/src/nsHebrewProber.cpp
>+ * The Original Code is mozilla.org code.

Change this to something about the hebrew-prober-autodetect or something like that (applies to all relevant files). It looks like this file is yours, in which case the following blob is wrong (you should be the initial developer and the current year, 2005, should be listed):

>+ * The Initial Developer of the Original Code is
>+ * Netscape Communications Corporation.
>+ * Portions created by the Initial Developer are Copyright (C) 1998
>+ * the Initial Developer. All Rights Reserved.

>+ // The letters Peh and Kaf rarely displays a related behavior of not being a good Non-Final letter.

display

>+ * 3) A word longer than 1 letter, starting with a final letter. Final letters should not
>+ * appear in the beginning of a word. This is an indication that the text is laid out backwards.

appear at the beginning...

>+ * This method relies on the input buffer being an output of FilterWithoutEnglishLetters().

smontagu, brad, and myself decided that we disliked this sentence and any rewording we could find (at our varying times of night/day), and that it's better to simply rely on the fact that the caller documents this behavior. For reference, here's a one-sided synopsis:

<smontagu> This method expects an input buffer which has been passed through FilterWithoutEnglishLetters() and so contains only Hebrew characters and ASCII space characters, without multiple spaces between words.
<smontagu> but "Hebrew characters" is wrong, because that's what we are trying to establish.
* smontagu doesn't know a good term for "characters in the range 0x80-0xFF"
<smontagu> I hate "high ascii"

smontagu also notes there is a comment elsewhere which says:

+ * The nsSBCSGroupProber is responsible for stripping the original text of HTML tags, English characters,
+ * numbers, low-ASCII punctuation characters, spaces and new lines. It reduces any sequence of such
+ * characters to a single space. The buffer fed to each prober in the SBCS group prober is pure text in
+ * high-ASCII.

so maybe it doesn't need to be said again.

>Index: extensions/universalchardet/src/nsHebrewProber.h
>+ * The Original Code is mozilla.org code.

Change this to something about the hebrew-prober-autodetect or something like that (applies to all relevant files). It looks like this file is yours, in which case the following blob is wrong (you should be the initial developer and the current year, 2005, should be listed):

>+ * The Initial Developer of the Original Code is
>+ * Netscape Communications Corporation.
>+ * Portions created by the Initial Developer are Copyright (C) 1998
>+ * the Initial Developer. All Rights Reserved.
>+ *
>+ * Contributor(s):
>+ * Shy Shalom <shooshX@gmail.com>

>+ * Texts in x-mac-hebrew are almost impossible to find in the internet. From what little evidence

I'd write "on the internet" (I might capitalize Internet, but we think that might be a lost cause).

>+#endif /* nsHebrewProber_h__ */
>+
>+
>+

Soo many blank lines at the end :p

In the place where you hard-code 10/11/12 you should probably make a note so that people are careful not to mess up your slot numbers :)
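For readers following the review, the stripping behavior quoted above from the nsSBCSGroupProber comment (reduce any run of low-ASCII filler to a single space, keep only high bytes) can be sketched roughly as follows. This is an illustration of the documented behavior under that description, not the real FilterWithoutEnglishLetters implementation:

```cpp
#include <cassert>
#include <string>

// Collapse every run of bytes below 0x80 (letters, digits, punctuation,
// whitespace) into a single space, keeping only the 0x80-0xFF characters
// the single-byte probers actually score. Leading filler is dropped.
std::string FilterHighBytes(const std::string& in) {
  std::string out;
  bool pendingSpace = false;
  for (char c : in) {
    if ((unsigned char)c < 0x80) {
      pendingSpace = !out.empty();  // remember the run, emit one space later
    } else {
      if (pendingSpace) out += ' ';
      out += c;
      pendingSpace = false;
    }
  }
  return out;
}
```

So mixed input like ASCII text interleaved with high-byte letters comes out as pure "high-ASCII" words separated by single spaces, which is the invariant the Hebrew prober's documentation relies on.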
Assignee
Comment 43•19 years ago
Corrected patch, after reviews from smontagu and timeless:
- corrected comments according to your suggestions.
- updated license details.
- in nsHebrewProber, moved prev and beforePrev from HandleData to class members. This allows the state maintained in HandleData to be saved across buffers.
Thanks for the reviews.
Attachment #189608 -
Attachment is obsolete: true
Attachment #189781 -
Flags: review?(smontagu)
Assignee
Updated•19 years ago
Attachment #189781 -
Flags: review?(timeless)
Attachment #188869 -
Flags: review?(smontagu)
Comment 44•19 years ago
Comment on attachment 189781 [details] [diff] [review]
proposed patch - full hebrew identification, logical and visual

Note that I'm not a module owner; r=smontagu or equivalent is still required.
Attachment #189781 -
Flags: review?(timeless) → review+
Comment 45•19 years ago
Comment on attachment 189781 [details] [diff] [review]
proposed patch - full hebrew identification, logical and visual

>+ printf(" HEB: %d - %d [Logical-Visusal score]\r\n", mFinalCharLogicalScore, mFinalCharVisualScore);

No need for a new patch, but the typo "Visusal" should be corrected before checking in.
Attachment #189781 -
Flags: review?(smontagu) → review+
Assignee
Updated•19 years ago
Attachment #189781 -
Flags: superreview?(blizzard)
Assignee
Updated•19 years ago
Attachment #189781 -
Flags: superreview?(blizzard) → superreview?(roc)
Comment on attachment 189781 [details] [diff] [review]
proposed patch - full hebrew identification, logical and visual

Looks nice. The code you moved has some style issues, but you don't need to fix them.
Attachment #189781 -
Flags: superreview?(roc) → superreview+
Updated•19 years ago
Attachment #189781 -
Flags: approval1.8b4?
Updated•19 years ago
Assignee: smontagu → shoosh20012001
Target Milestone: Future → mozilla1.8beta4
Comment 47•19 years ago
Comment on attachment 189781 [details] [diff] [review]
proposed patch - full hebrew identification, logical and visual

When requesting approval to check in at this late stage, please include an explanation of the value of the change, the risk associated with the change, and any testing that's been done on the change. Thanks.
Comment 48•19 years ago
Comment on attachment 189781 [details] [diff] [review]
proposed patch - full hebrew identification, logical and visual

This patch has almost no effect on anything but Hebrew autodetect, which isn't supported at all right now. Should be safe enough.
Assignee
Comment 49•19 years ago
Let's see now.

Value of the change: as of now, Hebrew pages with no charset specification are either displayed as gibberish or, if the universal charset detector is on, are recognized as Russian or Greek. This patch solves that problem by adding recognition of Hebrew.

Risk: as already commented, virtually none. There is a certain theoretical chance that with this patch, Greek or Russian pages will be identified by the universal charset detector as Hebrew. I have not seen any evidence that this can happen in practice. Also, since the Hebrew probers are run on every page without a charset specification, this patch has some theoretical performance hit on page load time. The performance hit is already there for the entire universal charset detector, and this patch, adding two more probers to it, adds some additional load.

Testing: I have a wide variety of visual and logical Hebrew pages as test cases; most of them are in the attached HTML file. On all of these pages the patch works with expected and almost perfect results. I have also tested it on a number of Russian and Greek pages, which didn't seem to be remotely affected by the change; in fact the Hebrew probers were nowhere near identifying them as Hebrew.

I think this patch is important since it gives a genuine advantage to Hebrew-reading users, making their lives somewhat happier :).
Comment 50•19 years ago
I definitely agree there is great value in including Hebrew in the Universal auto-detect. I just want to make sure that the existing auto-detection for CJK still works and that someone has tested it.
Updated•19 years ago
Attachment #189781 -
Flags: approval1.8b4? → approval1.8b4+
Comment 51•19 years ago
checked in by timeless.
Status: NEW → RESOLVED
Closed: 19 years ago → 19 years ago
Resolution: --- → FIXED
Comment 52•19 years ago
I've tested many of the links in attachment 188870 [details], and the patch seems to be very effective with most of them. A few pages have wrong encoding specified in the HTML code, but are actually iso-8859-8: "X-MAC-HEBREW" http://www.escape.com/~elyaqim/music/jew/yiddishlyrics/yontevdi.html "WINDOWS-1255" http://www.tau.ac.il/humanities/publications/zmanim/amiragel/gel61_3.htm "iso-8859-1" http://www.jewishmag.com/nlpjerusalem/hebnews1.htm "Hebrew Alphabet (ISO-Visual)" http://www.haifamuseums.org.il/japintr1.htm "ISO-8859-8-i" http://www.arkline.com/arik/Eitai.html Since the browser shouldn't override pages that explicitly specify encoding, the above is not a bug in the auto-detection code. In addition, the following page had text which was replaced with question marks (probably when submitted to the site). Not much can be done here. http://www.icq.com/whitepages/wwp.php?Uin=303856019 Verified with Firefox-trunk/20050809/WinXP. Prog.
Status: RESOLVED → VERIFIED
Comment 53•19 years ago
Comment on attachment 189781 [details] [diff] [review]
proposed patch - full hebrew identification, logical and visual

mozilla/extensions/universalchardet/src/nsSBCharSetProber.h 1.8
mozilla/extensions/universalchardet/src/nsSBCharSetProber.cpp 1.9
mozilla/extensions/universalchardet/src/nsSBCSGroupProber.h 1.8
mozilla/extensions/universalchardet/src/nsSBCSGroupProber.cpp 1.12
mozilla/extensions/universalchardet/src/nsMBCSGroupProber.cpp 1.8
mozilla/extensions/universalchardet/src/nsLatin1Prober.h 1.3
mozilla/extensions/universalchardet/src/nsLatin1Prober.cpp 1.6
mozilla/extensions/universalchardet/src/nsHebrewProber.h 1.1
mozilla/extensions/universalchardet/src/nsHebrewProber.cpp 1.1
mozilla/extensions/universalchardet/src/nsCharSetProber.h 1.9
mozilla/extensions/universalchardet/src/nsCharSetProber.cpp 1.1
mozilla/extensions/universalchardet/src/Makefile.in 1.11
mozilla/extensions/universalchardet/src/LangHebrewModel.cpp 1.1
mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp 1.20
mozilla/extensions/universalchardet/src/nsUdetXPCOMWrapper.cpp 1.2
mozilla/extensions/universalchardet/src/nsHebrewProber.cpp 1.2
Attachment #189781 -
Attachment is obsolete: true
Comment 54•19 years ago
Testing with 1.5 Beta 2, there is a regression from 1.0.7 in showing Hebrew pages. See for example http://www.haaretz.co.il/hasite/pages/LiArtEc.jhtml?contrassID=4&subContrassID=0&sbSubContrass=0&from=hasot (or any http://www.haaretz.co.il economics section). That page has a meta tag with charset=windows-1255. However, if I select the "Universal" character encoding, FF 1.5b2 switches to visual Hebrew, resulting in about 30% of the page showing reversed characters. FF 1.0.7, OTOH, marks it windows-1255, as it should (although a small currency exchange table may then show the Cyrillic charset instead of Hebrew). Please confirm and see if the bug needs to be reopened.
Comment 55•19 years ago
(In reply to comment #54)
> Testing with 1.5 Beta 2, there is a regression from 1.0.7 showing Hebrew pages.
> See for example
> http://www.haaretz.co.il/hasite/pages/LiArtEc.jhtml?contrassID=4&subContrassID=0&sbSubContrass=0&from=hasot
> (or any http://www.haaretz.co.il economic section)
The issue you are describing is bug 308187, which is a tech evangelism issue with Haaretz, and unrelated to this bug. Autodetection is not even activated in this case, because the site explicitly specifies a (wrong) charset in the HTTP headers (this also overrides the correct charset specified in the META tag). The reason the problem was not evident in 1.0 was bug 244964 (fixed for 1.5).