Closed Bug 304149 Opened 19 years ago Closed 18 years ago

Until we can switch to UTF-8 fully, make new data coming into bugzilla.mozilla.org be UTF-8

Categories

(bugzilla.mozilla.org :: General, defect, P1)

VERIFIED WONTFIX

People

(Reporter: Wurblzap, Assigned: justdave)

Attachments

(2 files)

This is inspired by bug 135762.

The current plan in bug 280633 is to convert data to UTF-8 more or less
one-by-one. I think this task might be made easier if we stopped the heap from
growing and made sure we knew the character set of new data coming in.
This is a patch against HEAD showing how I think this might be facilitated. From the
time this is active, longdescs are consistently ISO-8859-1 and may be converted
to UTF-8 as a whole.
Summary: Make new data coming into bugzilla.mozilla.org be ISO-8859-1 → Until we can switch to UTF-8, make new data coming into bugzilla.mozilla.org be ISO-8859-1
ummm...  we're already forcing new data to be UTF-8.  Why backpedal on it? 
Changing existing bugs is the only time there's ambiguity right now.
oh, I see what you're heading for, now that I've looked at the patch. :)  Can we
do that with UTF-8 instead of ISO-8859-1?  We're already sending a content-type
header with UTF-8 in it on enter_bug, create_account, and so forth (pages where
you create new bugs, accounts, and so forth).  But we're not forcing the
character set on the forms themselves at all.
Enforcing UTF-8 would essentially be the same thing, but I think ISO-8859-1
might be the better choice because it wouldn't clash as hard with current
"legacy" data. That's me assuming it to be ISO-8859-1 mostly by a wide margin.
Provided my assumption is close enough, bugs containing mixed
legacy/ISO-enforced data would display better.

The full switch to UTF-8 wouldn't be much harder if we went for ISO-8859-1 for
the meantime. We'd need an additional step to update all longdescs having come
in after a certain date, that's all.
(In reply to comment #4)
> Enforcing UTF-8 would essentially be the same thing, but I think ISO-8859-1
> might be the better choice because it wouldn't clash as hard with current
> "legacy" data. That's me assuming it to be ISO-8859-1 mostly by a wide margin.
> Provided my assumption is close enough, bugs containing mixed
> legacy/ISO-enforced data would display better.

Yeah, unfortunately, that doesn't seem to be a close assumption.  From random
samplings and the problems I've already run into, the majority of the data that
isn't plain ASCII seems to be Windows-1252.
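
To make the difference concrete, here is a minimal Python sketch (illustrative only, not part of either patch; the sample bytes are made up). Bytes in the range 0x80-0x9F are printable punctuation in Windows-1252 but C1 control codes in ISO-8859-1, so data that is really Windows-1252 comes out wrong when it is treated as ISO-8859-1:

```python
# Hypothetical sample: "smart quotes", an en dash and an accented letter,
# as a Windows client might have submitted them.
raw = b"\x93quoted\x94 \x96 caf\xe9"

as_cp1252 = raw.decode("cp1252")      # curly quotes and en dash come out as intended
as_latin1 = raw.decode("iso-8859-1")  # 0x93/0x94/0x96 become invisible C1 controls

print(as_cp1252)
print(repr(as_latin1))
```

Only the bytes below 0x80 and from 0xA0 upwards decode identically in both character sets, which is why the accented letter survives either way while the quotes do not.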
I gather we hurt no matter which way we turn, so we can make the forms send
UTF-8 right away. (Note that this patch includes account/prefs/prefs.html.tmpl
which I missed in the previous patch.)

Assume we apply the UTF-8 patch. We'd end up with new bugs begging for
charset=UTF-8 headers, old bugs covering their eyes and wanting no such
headers, and active bugs (containing older and newer comments) partly borked up.
What's the plan here? We could send charset=UTF-8 headers for bugs created
after today, and cut over to sending such headers for all bugs at some time in
the future when we start feeling it'll hurt less if we do.
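
As a rough Python sketch of that interim rule (the cut-over date and the function name are hypothetical, and Bugzilla itself is written in Perl), the header choice could depend on when the bug was created:

```python
from datetime import date

# Hypothetical cut-over date: the day the forms start forcing UTF-8.
CUTOVER = date(2005, 8, 10)

def content_type_for_bug(created: date, cutover: date = CUTOVER) -> str:
    """Pick the Content-Type header for show_bug based on the bug's creation date."""
    if created >= cutover:
        # New bugs only ever received UTF-8 data, so it is safe to say so.
        return "text/html; charset=UTF-8"
    # Older bugs hold data of unknown charset; leave detection to the browser.
    return "text/html"
```

Active bugs created before the cut-over would still mix encodings under this rule, which is the "partly borked up" case described above.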
what I want to do is get some sort of utility I can run that will translate a
specific single comment or bug field from whatever character set it's in to
UTF-8.  It should probably do this via a browser so we can make use of the
browser's autodetect to do the translation.  i.e. have a text field to enter a
character set into on an otherwise-blank page that shows only the contents of
that field, and have javascript check the charset and fill it into the text
field.  Then you can use the View menu in the browser to change the character
set until it looks right if the auto-detect got it wrong, then submit it and
have Perl's Encode module use the character set from the text field as the
source character set translating it to UTF-8.  Once we have a utility like that,
we can just force UTF-8 via the headers on everything except that translation
form, and then go use that form on things people report that don't look right.
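
The server-side half of such a utility can be sketched as follows. This is a Python illustration under stated assumptions, not the proposed implementation: the comment above suggests Perl's Encode module, and the function name to_utf8 and its parameters are hypothetical. The source charset is whatever the operator confirmed in the text field.

```python
import codecs

def to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Re-encode one comment or field from the operator-chosen charset to UTF-8."""
    try:
        codecs.lookup(source_charset)  # reject charset names Python does not know
    except LookupError:
        raise ValueError(f"unknown charset: {source_charset!r}")
    # Decode strictly: fail loudly instead of storing silently mangled text.
    text = raw.decode(source_charset, errors="strict")
    return text.encode("utf-8")
```

For example, to_utf8(b"caf\xe9", "windows-1252") returns b"caf\xc3\xa9". Strict decoding means a wrong guess that hits an undefined byte raises an error, so the operator can pick another charset and retry.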
That sounds very sensible to me. And it's covered by bug 280633, too, which is good.

Until we have such a tool, how about we make life a little easier for ourselves
in the future and provide for a means along the lines of one of the attached
patches, so that we don't need to use this comment-by-comment tool on more
comments than we already have on the heap?  :)
sounds good to me.  what version of Bugzilla is this patch against?  b.m.o is
likely getting upgraded within the next week.  I'll be upgrading it to the tip
of the 2.20 branch, so that would be a good place for the patch to be built
against. :)
The patch is against HEAD, but it applies to the 2.20 branch as well.

Unless common browsers' charset auto-detection works really well, it seems to me
we'll want to start sending charset=UTF-8 HTTP headers for show_bug.cgi not long
after this, at least for new bugs.
See also bug 304944...
Morphing to ask for UTF-8 instead of ISO-8859-1 as per comment 3 ff.
Assignee: justdave → wurblzap
Summary: Until we can switch to UTF-8, make new data coming into bugzilla.mozilla.org be ISO-8859-1 → Until we can switch to UTF-8 fully, make new data coming into bugzilla.mozilla.org be UTF-8
hmmm perhaps modify the patch 'Make forms force UTF-8' to also add
accept-charset="UTF-8" (the acceptCharset property) to be compatible with old
Microsoft browsers?

Microsoft explanation:

http://msdn.microsoft.com/workshop/author/dhtml/reference/properties/acceptcharset.asp
Assignee: wurblzap → justdave
Priority: -- → P1
Non-UTF-8-characters in https://bugzilla.mozilla.org/show_bug.cgi?id=323905#c17 show why I think we should do this soon.

Dave?
Assignee: justdave → justdave
No longer blocks: bmo-upgrade-051022
*** Bug 345678 has been marked as a duplicate of this bug. ***
>  We're already sending a content-type
> header with UTF-8 in it on enter_bug, create_account, and so forth (pages where
> you create new bugs, accounts, and so forth).  

Is that true? 'enter_bug' at b.m.o still just sends 'Content-Type: text/html'. Now that localization bugs are tracked at b.m.o, I think it's very urgent to emit 'text/html; charset=UTF-8' for enter_bug and create_account.


I was asked by dynamis, pikemac, and chofmann (among others) to comment on this
bug about our experience with Bugzilla-ja.

Bugzilla-ja is one implementation of an i18n-ed version of Bugzilla. It is
currently running on bugzilla.mozilla.gr.jp (known as bugzilla-jp), which
handles bugs about Mozilla products in Japanese.
# However, we have not tested the Bugzilla-ja code with other double-byte languages, sorry.

Our requirements were the following:
1. Treat all longdescs as UTF-8 instead of EUC-JP (which was used in 2.16-ja)
2. Handle MIME when sending bugmails
3. Allow the character encoding to be changed account by account (e.g. ISO-2022-JP / JIS)
4. Display UTF-8 properly in buglists and similar pages

The 1st was easy: we wrote a program to transcode the database contents.
Saved searches aside, we could do it with mysqldump and nkf, but saved searches
(and some other fields) contain %xx-encoded words.
# This might come from 2.16-ja's code...
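
The percent-encoded case can be sketched in Python as follows. This is not the Bugzilla-ja transcoding program; it assumes the stored value is a URL query string whose %xx sequences are EUC-JP bytes, and the function name recode_query is hypothetical. A plain byte-level transcoder such as nkf does not see through the %xx escapes, so they have to be decoded and re-encoded explicitly:

```python
from urllib.parse import parse_qsl, urlencode

def recode_query(query: str, source_charset: str = "euc_jp") -> str:
    """Rewrite %xx escapes in a stored query string from source_charset to UTF-8."""
    # Decode each key and value using the legacy charset; fail on invalid bytes.
    pairs = parse_qsl(query, keep_blank_values=True,
                      encoding=source_charset, errors="strict")
    # Re-encode the same pairs with UTF-8 percent-escapes.
    return urlencode(pairs, encoding="utf-8")
```

Pure-ASCII parameters pass through unchanged, so the function can be run over every saved search without first checking which ones contain Japanese text.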

For the 2nd and 3rd, Bugzilla-ja modified Bugzilla/BugMail.pm to handle MIME for
bugmails, and for the 3rd we added a new column to the profiles table in the DB
(to store the character encoding setting for each account).

For the 4th, we first thought of using Text::I18NWrap (or WrapI18N), but I could
not get that module to work well on our system. Together with another serious
problem in buglists, this made us change our approach.
Bugzilla-ja has new code called cutStringUTF8 to manage the width of UTF-8
strings, and we use it for buglists and bugmails.
For the comment input area in show_bug, we set wrap=hard to wrap the text before
the comment data is stored in the DB. This should be treated as a bug, but we
could not manage to wrap text easily in the Perl/template code.
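
The idea behind width-aware cutting can be sketched in Python like this. It is not the actual cutStringUTF8 code, which is not shown here; the function names are hypothetical and the sketch assumes that East Asian wide and fullwidth characters occupy two display columns and everything else one:

```python
import unicodedata

def display_width(ch: str) -> int:
    """Return 2 for East Asian Wide/Fullwidth characters, 1 otherwise."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def cut_string(text: str, max_cols: int) -> str:
    """Truncate text so that it fits in max_cols display columns."""
    cols = 0
    kept = []
    for ch in text:
        width = display_width(ch)
        if cols + width > max_cols:
            break  # the next character would overflow the column budget
        kept.append(ch)
        cols += width
    return "".join(kept)
```

Counting columns rather than bytes or characters keeps buglist cells aligned when Japanese and ASCII text are mixed. Combining characters are counted as one column here, which a real implementation would have to treat more carefully.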

Our Bugzilla-ja patch against Bugzilla 2.20 or 2.20.1 is distributed at
ftp://ftp.mozilla.gr.jp as a diff patch.
why does each user have an encoding setting!? what's that used for?
Japanese has many historical encodings, such as Shift-JIS, EUC-JP, and ISO-2022-JP, plus modified versions of these like cp932 or euc-jp-ms.
Currently the standard encoding for e-mail in Japanese is not UTF-8 but ISO-2022-JP (known as JIS encoding). Some web-based mail services, such as Yahoo, don't recognize UTF-8, and many bugzilla-jp users use them.
So we need a feature to switch the e-mail encoding account by account.
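
A per-account mail encoding might look roughly like this in Python. This is only a sketch: Bugzilla-ja does this in Perl inside Bugzilla/BugMail.pm, and the function name, the mail_charset parameter, and the idea that it is read from the account's profile row are assumptions made for illustration.

```python
from email.header import Header
from email.mime.text import MIMEText

def build_bugmail(subject: str, body: str, mail_charset: str = "utf-8") -> MIMEText:
    """Build a bugmail whose body and subject use the recipient's preferred charset."""
    # mail_charset is assumed to come from the account's stored preference.
    msg = MIMEText(body, "plain", mail_charset)
    # Encode the subject in the same charset as a MIME encoded-word.
    msg["Subject"] = Header(subject, mail_charset)
    return msg
```

A user on a mail service that only understands JIS would be sent build_bugmail(subject, body, "iso-2022-jp"), while everyone else keeps the UTF-8 default.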
Comment on attachment 192256 [details] [diff] [review]
Make forms force UTF-8

All I can find about acceptcharset= was a Struts bug where they messed up their content.
Attachment #192256 - Flags: review?(justdave)
Looks like we'll be migrating to UTF-8 after the next upgrade anyway...

Gerv
Yeah, we'll be almost completely UTF8 after the next upgrade, which will be happening within the next couple weeks.  This isn't worth the effort now.
Status: NEW → RESOLVED
Closed: 18 years ago
Resolution: --- → WONTFIX
Attachment #192256 - Flags: review?(justdave)
Status: RESOLVED → VERIFIED
Component: Bugzilla: Other b.m.o Issues → General
Product: mozilla.org → bugzilla.mozilla.org