280633 - (bz-recode) Tools to migrate existing legacy encoded database to UTF-8 (Unicode)

Reporter

Description

•

20 years ago

This is a spin off bug from bug 126266 (Use UTF-8 (Unicode) charset encoding for pages and email). Tools and/or documentation are required to provide a method for users of bugzilla databases containing legacy data encodings to move to a UTF-8 encoded database. The tools will enable them to automatically transcode known encodings into UTF-8, or statistically detect encoding and where confidence is high, transcode to UTF-8. This will enable existing bugzilla installations to upgrade to the use of UTF-8 encoding throughout the entire database.

Dave Miller [:justdave]

Comment 1

•

20 years ago

I have a work-in-progress CGI that does manual re-encoding of individual comments. It's not exactly a bulk conversion tool, but I was figuring on throwing the "not valid utf-8" detection stuff into show_bug and adding a button for privileged users next to such comments to allow them to recode them. It uses the web browser's charset detection stuff to decide which charset to recode from (because Mozilla's charset detection usually seems to work better than the stuff included with Perl, and it also lets the user override it if it's wrong and they can figure it out on their own), but uses the Perl Encode module to do the recoding once the user submits which character set to use for the source.

Stephen Lee

Comment 2

•

20 years ago

Would adding a field for charset to the longdescs table be appropriate? ... and then it could be converted to UTF-8 (or another selected charset) on the fly as the page is viewed. This would avoid having to touch the underlying data, unless a specific comment must be converted to UTF-8 in the DB too (e.g. for searchability), in which case bug 540 would seem to provide an appropriate mechanism that a recoding CGI could tack on to.

Jungshik Shin

Comment 3

•

20 years ago

Copied from bug 126266 comment #195 (there are some overlaps with comment #1 here)

1) Send emails to those with their names stored in ISO-8859-1 (I've never seen anyone use non-ASCII characters in encodings other than ISO-8859-1 for their names at bugzilla.mozilla.org) to update their account info in UTF-8.

2) Begin to emit 'charset=UTF-8' for bugs filed after a certain day (say, 2005-03-01). Do the same for current bugs with ASCII characters alone in their comments and title.

3) For existing bugs, add a very prominent warning right above the 'additional comments' text area that 'View | Character Encoding' should be set to UTF-8 if one's comment includes non-ASCII characters. This is to ensure existing bugs are not further 'polluted' with non-UTF-8 comments. We may go a step further and add a server-side check for 'UTF8ness' of a new comment. If it's not UTF-8, send them back with the following message:

 - Your comment contains non-ASCII characters, but it's not in UTF-8. Please, go back, set your browser to use 'UTF-8' and enter your comment again....
   The actual content of a comment : blahblah.....

4) If we really want to migrate all existing bugs to UTF-8, add a button to each comment to indicate the current character encoding. If necessary, this button can be made available only to the select group of people knowledgeable enough to identify encodings reliably (and perhaps the authors of comments).

5) search/query may need some more tinkering...

As for the charset detection, I wouldn't trust it much for a short span of text like bugzilla comments. If it's not automatic and merely given as a hint to the privileged users (as described in comment #1), that would be all right.

Anne (:annevk)

Comment 4

•

20 years ago

Sorry for putting it on the wrong bug. Bug 126266 comment 198:

# A while back someone proposed on IRC that we turned on UTF-8 for every bug ID
# that was greater than a certain number. Although that isn't a perfect solution
# (still mixed character encoding in the database) it would at least make all new
# bugs and all comments on new bugs forward compatible.

Jungshik Shin

Comment 5

•

20 years ago

(In reply to bug 126266 comment #202)
> (In reply to bug 126266 comment #201)
> > Anne: that breaks things when you have content from multiple bugs on the same
> > page, such as buglists or longlist.cgi output.

This is not new, either. It was given as a reason for not doing that in 2002(?). Result? We've kept accumulating bugs with mixed encodings for 2+ more years.

> So does the current solution of allowing any encoding.

Absolutely.

> Sending pages that have content from multiple bugs as UTF-8 and sending all bugs
> with ID > current bug number as UTF-8 seems like a reasonable start to me.

Can we start sometime soon? At minimum, a prominent warning (in red or whatever is the most clear way) just after 'Additional Comments' that reads "Set View | Encoding to UTF-8 before adding a comment with non-ASCII characters" should be added now.

> We'll stop accumulating content of unknown encoding in new bugs, and there will
> still be a way to view the content on the older bugs (by viewing the bug as its
> own page) if there's some content that isn't UTF-8.

For list pages with multiple encodings, one can always override the encoding emitted by HTTP/meta manually. Needless to say, only bugs with that encoding will be shown correctly at a time while all other bugs in the list will be mangled. However, that's not any worse than what we have now. Currently we don't emit MIME charset so that the default char. encoding of a user is used (or with a browser with autodetector like Mozilla, whatever autodetector detects). With UTF-8 emitted, UTF-8 will be used by default.

Henri Sivonen (:hsivonen) (temporarily away from Bugzilla)

Comment 6

•

20 years ago

Does anyone have a hunch about the number of bugs where the following would be wrong:

1) If a given comment is a valid UTF-8 byte sequence, assume it is UTF-8.
2) Else, assume it is Windows-1252-encoded and convert to UTF-8.
3) Label the output as UTF-8.

?

(FWIW, when I assessed whether that approach would work for www.mozilla.org, it appeared that it was ok for pretty much everything except i18n test cases and evang letter translations.)
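The three steps above can be sketched as follows. This is only an illustration in Python (Bugzilla itself is Perl), and the function name `to_utf8` and its `fallback` parameter are hypothetical. It assumes that bytes undefined in the fallback encoding may be replaced with U+FFFD, which is itself a lossy choice, and a wrong fallback guess gets baked into the stored bytes.

```python
def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return raw as UTF-8 bytes, following the proposed steps."""
    try:
        # Step 1: a valid UTF-8 byte sequence is assumed to be UTF-8.
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Step 2: otherwise assume the fallback encoding (Windows-1252 here).
        # Bytes that the fallback does not define become U+FFFD (assumption).
        return raw.decode(fallback, errors="replace").encode("utf-8")


already_utf8 = "café".encode("utf-8")
legacy = b"caf\xe9"  # "café" as a Windows-1252 / ISO-8859-1 byte string

converted = to_utf8(legacy)
# Step 3: everything returned by to_utf8() can be labelled as UTF-8.
```

Because step 1 runs first, calling `to_utf8` again on its own output leaves the data unchanged, so the conversion can be re-run safely.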

Jungshik Shin

Comment 7

•

20 years ago

(In reply to comment #6)
> Does anyone have a hunch about the number of bugs where the following would be
> wrong:

I can't give you the number, but there are quite a lot of them.

> 1) If a given comment is a valid UTF-8 byte sequence, assume it is UTF-8.
> 2) Else, assume it is Windows-1252-encoded and convert to UTF-8.

Step 2 is dangerous and we should never take that step. We don't have to do that, either. What justdave and I sketched in comment #1 and comment #3 (step 4) is a more sensible approach. We can live with mixed encodings in the output of lists and in existing bugs, but we cannot afford to lose/damage data.

> (FWIW, when I assessed, whether that approach would work for www.mozilla.org,

www.mozilla.org and bugzilla.mozilla.org are totally different beasts. In the former, most, if not all, documents had been in ISO-8859-1 (actually US-ASCII). If somebody wanted to add content beyond ISO-8859-1, they used UTF-8. So, it's not surprising that it worked well for www.mozilla.org. bugzilla.mozilla.org has a number of bugs with comments in KOI8-R, Windows-1251, Shift_JIS, EUC-KR, GB2312, Big5, etc.
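To make the concern concrete, here is a small Python sketch (illustrative only; the sample word and variable names are made up for this example) of what happens when a KOI8-R comment is run through a blanket "assume Windows-1252" step. In this particular case the damage can still be undone, but only because every byte involved happens to be defined in Windows-1252 and the true encoding is known afterwards.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "Привет"
raw = text.encode("koi8-r")  # bytes as stored for a KOI8-R comment

# The bytes are not valid UTF-8, so the blanket rule would fall through
# to the Windows-1252 assumption and produce Latin-letter mojibake.
wrong = raw.decode("cp1252")

# Undoing the wrong guess: back to the guessed encoding's bytes, then
# forward with the correct encoding.
recovered = wrong.encode("cp1252").decode("koi8-r")
```

If the wrongly converted text is stored without recording which encoding was guessed, the second step has nothing to work from.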

Nick Barnes

Comment 8

•

20 years ago

For b.m.o, do we have any statistics on how many old non-UTF8 comments we have? Or how many bugs are affected? Various people have asserted "many", "thousands", "hundreds", etc. What's the actual number? I favour the idea of a run-once script to convert to UTF-8 (using a per-comment guessed encoding based on an operator-set default), and providing an administrator mode to specify an encoding to re-convert any given comment (i.e. back from UTF-8 to the guessed encoding, then forward to the newly-specified encoding). How about a tool which allows the administrator to specify a small number of encodings and then request "show me what this comment would have looked like if I had converted it in each of these encodings". Pick one and you're done. We would have to store the guessed encoding for each comment, probably in a separate table. Yes, I know that b.m.o (for instance) has a bajillion non-ASCII comments (although I don't know, and it looks like nobody knows, how big the "bajillion" actually is). I don't know how many of those can't be correctly guessed using an informed encoding guesser. Does anyone? Of course, we can't automatically tell that an encoding guesser has got it wrong: they need eyeballing. My suspicion is that b.m.o has a few million comments, of which a few thousand are non-ISO-8859-15, and maybe a hundred of those are hard to guess. But that's really just a wild guess.
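The "show me what this comment would have looked like in each of these encodings" tool could look roughly like this. It is a Python sketch rather than Bugzilla code, and `preview_decodings` is a hypothetical name; it only produces the candidate renderings and leaves the choice to the administrator.

```python
from typing import Dict, List, Optional


def preview_decodings(raw: bytes, candidates: List[str]) -> Dict[str, Optional[str]]:
    """Decode raw with each candidate encoding; None marks a failed decode."""
    previews: Dict[str, Optional[str]] = {}
    for name in candidates:
        try:
            previews[name] = raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Either the bytes are invalid in this encoding or the
            # encoding name is unknown to Python.
            previews[name] = None
    return previews


raw = "Привет".encode("koi8-r")
previews = preview_decodings(raw, ["utf-8", "koi8-r", "cp1251"])
for name, rendering in previews.items():
    print(name, "->", rendering if rendering is not None else "(does not decode)")
```

Note that several candidates can decode without error and still be wrong, which is why the renderings need eyeballing.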

Gervase Markham [:gerv]

Comment 9

•

20 years ago

Nick: if you can write some SQL query which will tell us, I'm happy to run it. I could also dump the comments to a file and run a script over that if necessary. Gerv

Stephen Lee

Comment 10

•

20 years ago

Just a quick thought that occurred to me... (and looking back appears to have occurred to me in a different form in comment 2) A column could be added to the longdescs table recording the browser's charset (with NULL for unknown, and all existing data) with almost immediate effect, i.e. no conversions, no UI changes etc. but just recording the data. This way, we would at least know that all future data has the capability to be reliably converted to UTF-8 when the rest of the code catches up. It could also act as a hint for existing data on the same bug, or by the same user.
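A rough sketch of how such a column might be used, in Python for illustration only. The row layout `(author, thetext, charset)` and the function names are assumptions made for this example, not the real longdescs schema; `None` stands in for SQL NULL.

```python
from collections import Counter
from typing import List, Optional, Tuple

# Hypothetical row shape: (author, raw comment bytes, recorded charset or None)
Row = Tuple[str, bytes, Optional[str]]


def charset_hint(author: str, rows: List[Row]) -> Optional[str]:
    """Most common charset recorded for this author, if any."""
    seen = Counter(cs for who, _, cs in rows if who == author and cs is not None)
    return seen.most_common(1)[0][0] if seen else None


def decode_comment(row: Row, rows: List[Row]) -> Tuple[Optional[str], bool]:
    """Return (text, guessed). text is None when nothing usable is known."""
    author, raw, charset = row
    guessed = charset is None
    if guessed:
        charset = charset_hint(author, rows)  # hint from the same user
    if charset is None:
        return None, True
    try:
        return raw.decode(charset), guessed
    except UnicodeDecodeError:
        return None, guessed


rows: List[Row] = [
    ("a@example.com", "привет".encode("koi8-r"), "koi8-r"),
    ("a@example.com", "мир".encode("koi8-r"), None),  # old row, charset unknown
    ("b@example.com", b"caf\xe9", None),              # no hint available
]
```

The `guessed` flag keeps recorded charsets separate from hinted ones, so a hinted decode can be shown as provisional.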

Dave Miller [:justdave]

Comment 11

•

20 years ago

Attached file: script to count non-utf8 data (obsolete)

OK, here's the statistics from b.m.o as of 2005-03-24 23:00 PST:

                          total rows  non-ascii  non-utf8
attachments.description:      178536        524       199
attachments.filename:         178536         56        51
attachments.mimetype:         178536          6         0
bugs.alias:                   287484          3         3
bugs.bug_file_loc:            287484         86        57
bugs.short_desc:              287484        563       455
bugs.status_whiteboard:       287484          3         2
longdescs.thetext:           2454941      40674     13631
namedqueries.name:             11757         24        23
namedqueries.query:            11757          0         0
quips.quip:                     1805         13         6
series.name:                     788          0         0
series.query:                    788          0         0
series_categories.name:          143          0         0
whine_events.subject:              9          0         0
whine_events.body:                 9          0         0
whine_queries.query_name:          7          0         0
whine_queries.title:               7          0         0

These numbers were generated with the attached script (though I prettied it up a little bit for this bug comment).
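For readers without access to the attachment, the three columns can be computed along these lines. This is a Python sketch of the counting logic only, not the attached script; `count_column` is a hypothetical name, and fetching the column values from the database is left out.

```python
from typing import Iterable, Tuple


def count_column(values: Iterable[bytes]) -> Tuple[int, int, int]:
    """Return (total rows, non-ascii rows, non-utf8 rows) for one column."""
    total = non_ascii = non_utf8 = 0
    for value in values:
        total += 1
        if all(byte < 0x80 for byte in value):
            continue  # pure ASCII is valid in every encoding of interest
        non_ascii += 1
        try:
            value.decode("utf-8")
        except UnicodeDecodeError:
            non_utf8 += 1  # has high bytes that are not a valid UTF-8 sequence
    return total, non_ascii, non_utf8


sample = [b"plain", "café".encode("utf-8"), b"caf\xe9"]
print(count_column(sample))
```

The non-utf8 count is a lower bound on the legacy-encoded rows, since some legacy byte sequences also happen to be valid UTF-8.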

Dave Miller [:justdave]

Comment 12

•

20 years ago

of course, I swiped the header off of another file and forgot to fix the contributor line. oops :)

Dave Miller [:justdave]

Comment 13

•

20 years ago

Ack, I missed one: profiles.realname: total: 192221 non-ascii: 3688 non-utf8: 3618

Dave Miller [:justdave]

Comment 14

•

20 years ago

Attached file: script to count non-utf8 data

Here's a fixed up copy of the script. Includes the column I forgot, fixes the license header, and the output actually looks like what I posted to the bug now.

Dave Miller [:justdave]

Updated

•

20 years ago

Attachment #178544 - Attachment is obsolete: true

Gervase Markham [:gerv]

Comment 15

•

20 years ago

Dave,

Nice work :-) We need to remember, of course, that just because something decodes as UTF-8 doesn't necessarily mean that it is.

It would be good to have figures for open bugs only - that gives us a better handle on the scale of the problem in practice. I would do it myself but the script is owned by root and not writable by other users.

Looking at the data, the big problems are the values which no-one can fix up afterwards - attachment descriptions and comments. No surprise there.

Gerv

Gervase Markham [:gerv]

Comment 16

•

20 years ago

Here's some data for open bugs, given that I think we care a lot less if resolved bugs get a bit mangled.

                          total rows  non-ascii  non-utf8
attachments.description:       31153        195        75
attachments.filename:          31153         27        23
attachments.mimetype:          31153          4         0
bugs.alias:                    53971          1         1
bugs.bug_file_loc:             53971         20        14
bugs.short_desc:               53971         99        68
bugs.status_whiteboard:        53971          0         0
longdescs.thetext:            415466       8580      3021

This does seem to make the problem a lot less scary. The comments and the attachment descriptions are the key ones, as people can't fix up manually if we mess them up. 75 attachment descriptions isn't really very many. I bet if we reduced it to non-obsolete attachments it would be even fewer.

What's the next step? Evaluate the performance of some charset-guessing Perl modules on the relevant comments?

Wild idea: could we detect the Accept-Charset of each user, store it, and use it as a first guess for the comments they've added?

Gerv

Olav Vitters

Comment 17

•

20 years ago

What about attachments.thedata? In an installation I help to admin there are still 6560 non-utf8 attachments.

Dave Miller [:justdave]

Comment 18

•

20 years ago

(In reply to comment #17)
> What about attachments.thedata? In an installation I help to admin there are
> still 6560 non-utf8 attachments.

Attachments can be binary. Attachments are arbitrary data, and it doesn't really matter what charset they are. Whether it matters or not probably depends a lot on the mime type.

Jungshik Shin

Comment 19

•

20 years ago

(In reply to comment #17)
> What about attachments.thedata? In an installation I help to admin there are
> still 6560 non-utf8 attachments.

We MUST leave them alone other than fixing MIME type if necessary. ('text/*' => 'text/*; charset=XYZ')

(In reply to comment #16)
> Here's some data for open bugs, given that I think we care a lot less if
> resolved bugs get a bit mangled.
>
> total rows non-ascii non-utf8
> attachments.description: 31153 195 75

> mess them up. 75 attachment descriptions isn't really very many. I bet if we
> reduced it to non-obsolete attachments it would be even fewer.
>
> What's the next step? Evaluate the performance of some charset-guessing Perl
> modules on the relevant comments?

If it's only 75 (or 195), why bother with evaluating Perl charset-guessing modules? Just doing it manually would be easier even if that means writing to those who attached them to ask them to identify the charset. My point is that I wouldn't use any charset-guessing (other than UTF-8 vs non-UTF-8, which is not fool-proof either as you wrote) for this kind of task. It can be pretty much automated if a form mail is sent to those who attached them asking them to identify the charset. Hmm, with this possibility, the number of non-ASCII attachment descriptions matters less (probably, the more there are, the lower the response rate will be because some people are not around any more).

> Wild idea: could we detect the Accept-Charset of each user, store it, and use it
> as a first guess for the comments they've added?

That may or may not work. 'Accept-lang' + the most widely used charset for the language could help, too. Again, with only 75, I don't see much point in tinkering with ideas like that.

Gervase Markham [:gerv]

Comment 20

•

20 years ago

jshin: the ideas I mentioned about charset guessing and accept-lang were really ideas for the 3000+ comments, not the attachment descriptions. I agree that 75 attachment descriptions could be fixed manually - or even, perhaps, just ignored. Gerv

Jungshik Shin

Comment 21

•

20 years ago

Sorry for misunderstanding. As for 3000+ comments, I would try automating the conversion the way I described in my previous comment (send out form mails replies to which can be automatically validated and processed with the help of Perl's encoding module) instead of relying (solely) on the not-so-reliable perl encoding guessing module (ok, it should be quantified, but my hunch is that it's not so good).

Jungshik Shin

Comment 22

•

20 years ago

(In reply to comment #21)
> Sorry for misunderstanding. As for 3000+ comments, I would try automating the
> conversion the way I described in my previous comment (send out form mails
> replies to which can be automatically validated and processed with the help of

Obviously, we don't have to do things the way we would have done before the web came out. Instead of sending out form mails, a simple web interface can be set up and the link to it can be mailed so that those who wrote comments can identify the encoding of their comments. The response can be processed either in batch or on-line ('in situ'). I don't expect everyone to respond for every comment, but this will significantly reduce the need to rely on the manual fixing or encoding guessing.

Stephen Lee

Comment 23

•

20 years ago

(In reply to comment #22) > link to it can be mailed ... or initially, just included in the bugzilla page header/footer when any logged in user has comments that need converting. It doesn't seem to me to be an important enough issue to mass mail everyone about. In most cases a user's comments will be in the same charset, so might be useful to offer bulk conversion (select charset for one comment, and confirm that all the others display correctly). Hmmmm.... wonder what it currently makes of these: "«ö¦¹¶i¤J°T®§½s¿è..." "¸Þ½ÃÁö¸¦ ÆíÁýÇÏ·Á¸é ÀÌ°÷À» Å¬¸¯ÇÏ½Ã¿À..." "ƒGƒfƒBƒbƒgƒ�ƒbƒZ�[ƒW‚ð“¾‚é‚É‚Í�A‚±‚±‚ðƒNƒŠƒbƒN‚µ‚Ä‚‚¾‚³‚¢..."

Jungshik Shin

Comment 24

•

20 years ago

(In reply to comment #23)
> (In reply to comment #22)
> > link to it can be mailed
>
> ... or initially, just included in the bugzilla page header/footer when any
> logged in user has comments that need converting.
>
> It doesn't seem to me to be an important enough issue to mass mail everyone
> about.

What I have in mind is to send a *single* email to each user to alert about comments that need to be converted with the link to a single *dynamic* web page that lists links to all the comments (not the actual content) made by her or him that have not yet been converted to UTF-8. One can identify the character encoding of comments (s)he made at her/his leisure.

> In most cases a user's comments will be in the same charset, so might be useful
> to offer bulk conversion (select charset for one comment, and confirm that all
> the others display correctly).

That's a nice idea, but should offer an option to do that one by one because it's not always the case.

Stephen Lee

Comment 25

•

20 years ago

(In reply to comment #24)
> What I have in mind is to send a *single* email to each user to alert about

[how a user might read such a mail if this is done naïvely]

Dear user, you made one or two comments in bugs a couple of years back, and these bugs have still not been fixed... and although you never used bugzilla again, nor even asked to be kept informed about the status of the bug, you had the audacity to enter one or two non-ASCII characters in a couple of comments. So now we want you to dig out your login details and come back to the site just to save us having to figure out what the characters you meant to enter were, even though this should be obvious to anyone that cares. Not sure if anyone else will even read your comment again after you've done this, but we do like to keep our database in order, and obviously this is far more important than fixing the 2-year old bug.

Even the banner I suggested would be quite in-your-face, but at least in this case the user is already logged in, so we know they can deal with it in a couple of clicks...

Many comments will be barely worth 10 seconds of a user's time to fix as it will be obvious to anyone reading what the missing character is. Take bug 187403 comment 10 as a random example.

> That's a nice idea, but should offer an option to do that one by one because
> it's not always the case.

Ummm... let's try "(select charset for one comment, and confirm [by checking a box next to each one] that all the others display correctly)".

Jungshik Shin

Comment 26

•

20 years ago

(In reply to comment #25)
> (In reply to comment #24)
> > What I have in mind is to send a *single* email to each user to alert about
>
> [how a user might read such a mail if this is done naïvely]
>
> Dear user, you made one or two comments in bugs a couple of years back, and
> these bugs have still not been fixed... and although you never used
> bugzilla again, nor even asked to be kept informed about the status of

Depending on how it's worded, that's certainly possible. There's a trade-off between two approaches. Your approach would have a higher response rate among those who're asked to identify the char. encoding, but it leaves out those who don't log on on a regular basis but who may be interested enough to do the chore. Two approaches can be combined. Try your approach for a certain period (say, a month) and then mail to the rest of people.

Stephen Lee

Comment 27

•

20 years ago

Idea!!! Just thought of a way this can be done so that ALL pages are displayed in UTF-8 without having to change a single byte of existing comments... if a comment doesn't decode as valid UTF-8, then display it something like this:

----- Additional Comment #99 From user@example.com 2001-01-01 01:01 -----
[ No charset specified. Ambiguous characters shown as #. _View_Raw_Comment_ ]

I'm getting a sense of d#j# vu about this.

(or perhaps using some other character such as ' ', U+2588, '?', or the 'empty box' character that windows uses when a font is missing a character, rather than '#', and the placeholder character stylised (e.g. linkified, bold, etc.) to make it look different from the same character actually appearing in the comment)

The original comment author, and any sufficiently empowered user, should probably also get a link to a page allowing them to select the correct character set and convert to UTF-8 next to the View Raw Comment link.

For "View Raw Comment", it would serve a page containing JUST that comment and no charset header, and use the browser's charset detection as at present.
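The display rule described above could be sketched like this. It is an illustration in Python with a hypothetical function name, `render_comment`; it assumes that every byte outside ASCII counts as "ambiguous", so a multi-byte legacy character shows up as more than one placeholder.

```python
from typing import Tuple


def render_comment(raw: bytes, placeholder: str = "#") -> Tuple[str, bool]:
    """Return (display text, needs_notice) for a stored comment.

    The stored bytes are never modified; only the rendering changes.
    """
    try:
        return raw.decode("utf-8"), False
    except UnicodeDecodeError:
        # Not valid UTF-8: keep ASCII as-is and mask every high byte.
        text = "".join(chr(b) if b < 0x80 else placeholder for b in raw)
        return text, True


legacy = b"I'm getting a sense of d\xe9j\xe0 vu about this."
text, needs_notice = render_comment(legacy)
if needs_notice:
    print("[ No charset specified. Ambiguous characters shown as #. ]")
print(text)
```

The masked output is pure ASCII, so it can be embedded in a UTF-8 page regardless of what the original encoding was.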

Gervase Markham [:gerv]

Comment 28

•

20 years ago

slee: you're a genius! :-) That's exactly what we should do. We make a best-efforts rendering in the body of the bug (perhaps using charset guessing, perhaps not) but allow the raw comment to be served alone if necessary. Gerv

Stephen Lee

Comment 29

•

20 years ago

Comment on attachment 178546 [details] script to count non-utf8 data Shouldn't this also be checking: ['bugs_activity','added'], ['bugs_activity','removed'], ... and possibly also admin-defined data such as product/keyword/flag descriptions.

Gervase Markham [:gerv]

Comment 30

•

20 years ago

More specifically, we should add a show_comment.cgi script which takes a bug number and comment number, and just displays the raw text of the comment, with no processing or hyperlinking, and no charset header. A link to this could be embedded in all the comments which we couldn't convert. Gerv

Marc Schumann [:Wurblzap]

Comment 31

•

19 years ago

Attached file: Script to convert single-encoding 2.19.3 databases to UTF-8 (obsolete)

For people who are lucky enough to have a database in a single encoding, here is a hack to convert an unsuspecting iso-8859-1 encoded 2.19.3 Bugzilla database to UTF-8.

Dave Miller [:justdave]

Comment 32

•

19 years ago

(In reply to comment #28)
> slee: you're a genius! :-) That's exactly what we should do. We make a
> best-efforts rendering in the body of the bug (perhaps using charset guessing,
> perhaps not) but allow the raw comment to be served alone if necessary.

I'm with Gerv on that. I want it!!! :) :)

Marc Schumann [:Wurblzap]

Updated

•

19 years ago

Attachment #185456 - Attachment description: Script to convert single-encoding databases to UTF-8 → Script to convert single-encoding 2.19.3 databases to UTF-8

Marc Schumann [:Wurblzap]

Comment 33

•

19 years ago

*** Bug 304944 has been marked as a duplicate of this bug. ***

Joel Peshkin

Comment 34

•

19 years ago

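Nice script.

Any reason not to add a safety check like....

    # Convert only if $new is not already valid UTF-8.
    # LEAVE_SRC keeps decode() from modifying $new in place.
    eval { Encode::decode("utf-8", $new, Encode::FB_CROAK | Encode::LEAVE_SRC); };
    if ($@) {
        Encode::from_to($new, "iso-8859-1", "utf-8");
    }

so that it will not reconvert anything that is already converted?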

Joel Peshkin

Comment 35

•

19 years ago

A few comments....

1) VERY NEAT TOOL
2) It would be best to use Bugzilla::DB to open the database and get the handle. That will make it work regardless of port/socket options.

Can anyone think of any cases where this conversion could lose something? I can't unless it were run on a database that already has UTF8 data in it.

Joel Peshkin

Comment 36

•

19 years ago

By the way... it looks like even bugzillas that specify ISO8859-1 wind up with special "windows" characters. Probably, this means that we need to convert from "cp1252" rather than "iso8859-1". It is "supposed" to be a superset. http://czyborra.com/charsets/codepages.html#CP1252
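A quick way to see the difference, sketched in Python for illustration (the sample bytes are made up): the two encodings agree on bytes 0xA0-0xFF, but in the 0x80-0x9F range ISO-8859-1 yields C1 control characters while cp1252 yields printable characters such as curly quotes.

```python
import unicodedata

raw = b"\x93quoted\x94"  # "smart quotes" as typed on a Windows client

as_latin1 = raw.decode("iso-8859-1")
as_cp1252 = raw.decode("cp1252")

# ISO-8859-1 maps 0x93 to an invisible C1 control character ...
print(unicodedata.category(as_latin1[0]))
# ... while cp1252 maps it to LEFT DOUBLE QUOTATION MARK.
print(unicodedata.name(as_cp1252[0]))

# For the upper range the two encodings give identical results.
upper = bytes(range(0xA0, 0x100))
same_upper = upper.decode("iso-8859-1") == upper.decode("cp1252")
```

Converting such data from "iso-8859-1" succeeds without error but stores control characters, which is why the mistake is easy to miss.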

script to count non-utf8 data, 20 years ago, Dave Miller [:justdave], 2.88 KB, text/plain
script to count non-utf8 data, 20 years ago, Dave Miller [:justdave], 2.95 KB, text/plain
Script to convert single-encoding 2.19.3 databases to UTF-8, 19 years ago, Marc Schumann [:Wurblzap], 4.08 KB, text/plain
Work In Progress, 18 years ago, Max Kanat-Alexander, 9.03 KB, patch
v1: contrib/recode.pl, 18 years ago, Max Kanat-Alexander, 7.29 KB, text/plain
v2, 18 years ago, Max Kanat-Alexander, 8.80 KB, text/plain
v3, 18 years ago, Max Kanat-Alexander, 9.75 KB, text/plain
v3.1, 18 years ago, Max Kanat-Alexander, 9.75 KB, text/plain
v3.2, 18 years ago, Max Kanat-Alexander, 9.77 KB, text/plain
v4, 18 years ago, Max Kanat-Alexander, 10.59 KB, text/plain, justdave: review+
Additional Fix for v4, 18 years ago, Max Kanat-Alexander, 1.44 KB, patch