304149 - Until we can switch to UTF-8 fully, make new data coming into bugzilla.mozilla.org be UTF-8

Reporter

Description

20 years ago

This is inspired by bug 135762. The current plan in bug 280633 is to convert data to UTF-8 more or less one-by-one. I think this task might be made easier if we stopped the heap from growing and made sure we knew the character set of new data coming in.

Marc Schumann [:Wurblzap]

Reporter

Comment 1

20 years ago

Attached patch: Make forms force ISO-8859-1

This is a patch against HEAD showing how I think this might be facilitated. From the time it is active, longdescs are consistently ISO-8859-1 and can be converted to UTF-8 as a whole.

Marc Schumann [:Wurblzap]

Reporter

Updated

20 years ago

Summary: Make new data coming into bugzilla.mozilla.org be ISO-8859-1 → Until we can switch to UTF-8, make new data coming into bugzilla.mozilla.org be ISO-8859-1

Dave Miller [:justdave]

Assignee

Comment 2

20 years ago

ummm... we're already forcing new data to be UTF-8. Why backpedal on it? Changing existing bugs is the only time there's ambiguity right now.

Dave Miller [:justdave]

Assignee

Comment 3

20 years ago

oh, I see what you're heading for, now that I've looked at the patch. :) Can we do that with UTF-8 instead of ISO-8859-1? We're already sending a content-type header with UTF-8 in it on enter_bug, create_account, and so forth (pages where you create new bugs, accounts, and so forth). But we're not forcing the character set on the forms themselves at all.

Marc Schumann [:Wurblzap]

Reporter

Comment 4

20 years ago

Enforcing UTF-8 would essentially be the same thing, but I think ISO-8859-1 might be the better choice because it wouldn't clash as hard with current "legacy" data. That assumes the legacy data is ISO-8859-1 by a wide margin. Provided my assumption is close enough, bugs containing mixed legacy/ISO-enforced data would display better. The full switch to UTF-8 wouldn't be much harder if we went for ISO-8859-1 in the meantime. We'd need an additional step to update all longdescs that came in after a certain date, that's all.

Dave Miller [:justdave]

Assignee

Comment 5

20 years ago

(In reply to comment #4)
> Enforcing UTF-8 would essentially be the same thing, but I think ISO-8859-1
> might be the better choice because it wouldn't clash as hard with current
> "legacy" data. That's me assuming it to be ISO-8859-1 mostly by a wide margin.
> Provided my assumption is close enough, bugs containing mixed
> legacy/ISO-enforced data would display better.

Yeah, unfortunately, that doesn't seem to be a close assumption. From random samplings and the problems I've already run into, the majority of the data that isn't plain ASCII seems to be Windows-1252.
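
To illustrate why the ISO-8859-1 versus Windows-1252 distinction matters, here is a minimal Python sketch. It is not taken from either patch, and the sample bytes are invented. Bytes in the 0x80-0x9F range are printable punctuation in Windows-1252 but C1 control characters in ISO-8859-1, and such bytes are generally not valid UTF-8 either.

```python
# Invented sample: a comment typed on Windows with "smart quotes".
raw = b"\x93fixed\x94 caf\xe9"

as_cp1252 = raw.decode("cp1252")    # curly quotes come out as intended
as_latin1 = raw.decode("latin-1")   # same bytes become C1 control characters

print(repr(as_cp1252))
print(repr(as_latin1))

# Bytes above 0x9F agree in both charsets; only 0x80-0x9F differ.
assert as_cp1252[-1] == as_latin1[-1] == "\u00e9"

# The raw bytes are not valid UTF-8, so a UTF-8 page cannot show them as-is.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)
```

Both legacy decodes succeed without error, which is why a wrong charset guess tends to show up only as odd-looking text rather than as a failure.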

Marc Schumann [:Wurblzap]

Reporter

Comment 6

20 years ago

Attached patch: Make forms force UTF-8

I gather we hurt no matter which way we turn, so we can make the forms send UTF-8 right away. (Note that this patch includes account/prefs/prefs.html.tmpl, which I missed in the previous patch.)

Assume we apply the UTF-8 patch. We'd end up with new bugs begging for charset=UTF-8 headers, old bugs covering their eyes and wanting no such headers, and active bugs (containing older and newer comments) partly borked up. What's the plan here? We could send charset=UTF-8 headers for bugs created after today, and cut over to sending such headers for all bugs at some time in the future when we start feeling it'll hurt less if we do.

Dave Miller [:justdave]

Assignee

Comment 7

20 years ago

What I want to do is get some sort of utility I can run that will translate a specific single comment or bug field from whatever character set it's in to UTF-8. It should probably do this via a browser so we can make use of the browser's autodetect to do the translation, i.e. have an otherwise-blank page that shows only the contents of that field plus a text field to enter a character set into, and have JavaScript check the detected charset and fill it into the text field. Then you can use the View menu in the browser to change the character set until it looks right if the auto-detect got it wrong, then submit it and have Perl's Encode module use the character set from the text field as the source character set when translating it to UTF-8.

Once we have a utility like that, we can just force UTF-8 via the headers on everything except that translation form, and then go use that form on things people report that don't look right.
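
The server-side half of such a utility can be sketched as follows. This is a hedged illustration in Python rather than Perl's Encode module, and the function names `to_utf8` and `is_utf8` and the sample bytes are hypothetical, not part of Bugzilla.

```python
def to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Re-encode one stored field from source_charset to UTF-8.

    Raises UnicodeDecodeError if the bytes cannot be decoded in that
    charset, or LookupError if the charset name is unknown. A wrong but
    decodable guess still has to be caught by eye.
    """
    text = raw.decode(source_charset)  # strict decoding, no silent replacement
    return text.encode("utf-8")


def is_utf8(raw: bytes) -> bool:
    """True if the field is already valid UTF-8 and needs no conversion."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


legacy = b"na\xefve \x96 caf\xe9"          # invented Windows-1252 comment
if not is_utf8(legacy):
    converted = to_utf8(legacy, "cp1252")   # charset picked by the reviewer
    print(converted.decode("utf-8"))
```

Checking for valid UTF-8 first keeps already-converted fields from being re-encoded a second time, which would otherwise produce double-encoded text.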

Marc Schumann [:Wurblzap]

Reporter

Comment 8

20 years ago

That sounds very sensible to me. And it's covered by bug 280633, too, which is good. Until we have such a tool, how about we make life a little easier for ourselves in the future and provide for a means along the lines of one of the attached patches, so that we don't need to use this comment-by-comment tool on more comments than we already have on the heap? :)

Dave Miller [:justdave]

Assignee

Comment 9

20 years ago

sounds good to me. what version of Bugzilla is this patch against? b.m.o is likely getting upgraded within the next week. I'll be upgrading it to the tip of the 2.20 branch, so that would be a good place for the patch to be built against. :)

Dave Miller [:justdave]

Assignee

Updated

20 years ago

Blocks: bmo-upgrade-051022

Marc Schumann [:Wurblzap]

Reporter

Comment 10

20 years ago

The patch is against HEAD, but it applies to the 2.20 branch as well. Unless common browsers' charset auto-detection works really well, it seems to me we'll want to start sending charset=UTF-8 HTTP headers for show_bug.cgi not long after this, at least for new bugs.

Marc Schumann [:Wurblzap]

Reporter

Comment 11

20 years ago

Comment 12

20 years ago

Morphing to ask for UTF-8 instead of ISO-8859-1 as per comment 3 ff.

Assignee: justdave → wurblzap

Summary: Until we can switch to UTF-8, make new data coming into bugzilla.mozilla.org be ISO-8859-1 → Until we can switch to UTF-8 fully, make new data coming into bugzilla.mozilla.org be UTF-8

François Gagné

Comment 13

20 years ago

Hmmm, perhaps modify the 'Make forms force UTF-8' patch to also add acceptcharset="UTF-8", to be compatible with old Microsoft browsers? Microsoft's explanation: http://msdn.microsoft.com/workshop/author/dhtml/reference/properties/acceptcharset.asp

Dave Miller [:justdave]

Assignee

Updated

20 years ago

Assignee: wurblzap → justdave

Priority: -- → P1

Marc Schumann [:Wurblzap]

Reporter

Comment 14

20 years ago

Non-UTF-8-characters in https://bugzilla.mozilla.org/show_bug.cgi?id=323905#c17 show why I think we should do this soon. Dave?

Dave Miller [:justdave]

Assignee

Updated

20 years ago

Assignee: justdave → justdave

Dave Miller [:justdave]

Assignee

Updated

19 years ago

No longer blocks: bmo-upgrade-051022

Dave Miller [:justdave]

Assignee

Comment 15

19 years ago

*** Bug 345678 has been marked as a duplicate of this bug. ***

Jungshik Shin

Comment 16

19 years ago

> We're already sending a content-type
> header with UTF-8 in it on enter_bug, create_account, and so forth (pages where
> you create new bugs, accounts, and so forth).

Is it true? 'enter_bug' at b.m.o still just sends 'Content-Type: text/html'. Now that localization bugs are tracked at b.m.o, I think it's very urgent to emit 'text/html; charset=UTF-8' for enter_bug and create_account.

A. Shimono [:himorin]

Comment 17

19 years ago

I was asked (by dynamis, pikemac, chofmann, or so) to comment on this bug about our experience with Bugzilla-ja. Bugzilla-ja is one implementation of an i18n-ed version of Bugzilla. It currently runs on bugzilla.mozilla.gr.jp (known as bugzilla-jp), which handles bugs about Mozilla products in Japanese.
# But we have not tested the Bugzilla-ja code with other double-byte languages, sorry.

Our requirements were the following:
1. Treat all longdescs as UTF-8 instead of EUC-JP (which was used in 2.16-ja).
2. Handle MIME when sending bugmails.
3. Make the character encoding changeable account by account (e.g. ISO-2022-JP / JIS).
4. Display UTF-8 well in buglists and so on.

The 1st was easy: we wrote a program to transcode the database contents. Apart from saved searches, we could do it with mysqldump and nkf. But saved searches (and some other fields) contain some %xx-encoded words.
# This might come from 2.16-ja's code...

For the 2nd and 3rd, Bugzilla-ja modified Bugzilla/BugMail.pm to handle MIME in bugmails, and for the 3rd we added a new column to DB/profiles (to store the encoding setting account by account).

For the 4th, we first thought of using Text::I18NWrap (or WrapI18N), but I could not get that module to work well on our system, and because of another serious problem with buglists we changed our approach. Bugzilla-ja has new code called cutStringUTF8 to manage the width of UTF-8 strings, and we use it for buglists and bugmails. For show_bug's comment input area, we set wrap=hard to wrap the text before the comment data goes into the DB. This should be treated as a bug, but we could not manage to wrap text easily in Perl/template code.

Our Bugzilla-ja patch against Bugzilla 2.20 or 2.20.1 is distributed at ftp://ftp.mozilla.gr.jp as a diff patch.

Christian :Biesinger (don't email me, ping me on IRC)

Comment 18

19 years ago

why does each user have an encoding setting!? what's that used for?

A. Shimono [:himorin]

Comment 19

19 years ago

Japanese has many historical encodings such as Shift-JIS, EUC-JP, and ISO-2022-JP, plus modified versions of these like cp932, euc-jp-ms, and so on. Currently, the standard encoding for e-mail in Japanese is not UTF-8 but ISO-2022-JP (known as JIS encoding), and some web-based mail services, such as Yahoo, do not recognize UTF-8. Many bugzilla-jp users use them, so we need a feature to switch the e-mail encoding account by account.
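
A per-account mail encoding could look roughly like the Python sketch below. This is an illustration only: the function name `encode_mail_body` and the fallback-to-UTF-8 policy are assumptions, not how Bugzilla-ja's BugMail.pm changes work.

```python
def encode_mail_body(body: str, preferred: str) -> tuple[str, bytes]:
    """Encode body in the account's preferred charset, else fall back to UTF-8.

    Returns (charset actually used, encoded bytes).
    """
    try:
        return preferred, body.encode(preferred)
    except UnicodeEncodeError:
        # The text contains characters the preferred charset cannot represent.
        return "utf-8", body.encode("utf-8")


charset, payload = encode_mail_body("バグを修正しました", "iso-2022-jp")
print(charset, payload.isascii())
```

ISO-2022-JP output consists only of 7-bit bytes, which is part of why it remained common for Japanese e-mail. The returned charset name would still have to be declared in the message's Content-Type header.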

timeless

Comment 20

19 years ago

Comment on attachment 192256: Make forms force UTF-8

all i can find about acceptcharset= was a struts bug where they messed up their content.

Attachment #192256 - Flags: review?(justdave)

Gervase Markham [:gerv]

Comment 21

19 years ago

Looks like we'll be migrating to UTF-8 after the next upgrade anyway...

Gerv

Dave Miller [:justdave]

Assignee

Comment 22

19 years ago

Yeah, we'll be almost completely UTF8 after the next upgrade, which will be happening within the next couple weeks. This isn't worth the effort now.

Status: NEW → RESOLVED

Closed: 19 years ago

Resolution: --- → WONTFIX

Dave Miller [:justdave]

Assignee

Updated

19 years ago

Attachment #192256 - Flags: review?(justdave)

timeless

Updated

19 years ago

Status: RESOLVED → VERIFIED

Nobody; OK to take it and work on it

Updated

15 years ago

Component: Bugzilla: Other b.m.o Issues → General

Product: mozilla.org → bugzilla.mozilla.org

Attached patch: Make forms force ISO-8859-1 (6.27 KB), Marc Schumann [:Wurblzap], 20 years ago
Attached patch: Make forms force UTF-8 (7.00 KB), Marc Schumann [:Wurblzap], 20 years ago