Bug 280633 (bz-recode)

Tools to migrate existing legacy encoded database to UTF-8 (Unicode)

RESOLVED FIXED in Bugzilla 3.0

Status

enhancement
P2
normal
RESOLVED FIXED
14 years ago
11 years ago

People

(Reporter: bmo, Assigned: mkanat)

Tracking

unspecified
Bugzilla 3.0
Bug Flags:
approval +

Details

(Whiteboard: [Implementation: Comment 27])

Attachments

(3 attachments, 8 obsolete attachments)

(Reporter)

Description

14 years ago
This is a spin off bug from bug 126266 (Use UTF-8 (Unicode) charset encoding for
pages and email). 

Tools and/or documentation are required to provide a method for users of
bugzilla databases containing legacy data encodings to move to a UTF-8 encoded
database. The tools will enable them to automatically transcode known encodings
into UTF-8, or statistically detect encoding and where confidence is high,
transcode to UTF-8. This will enable existing bugzilla installations to upgrade
to the use of UTF-8 encoding throughout the entire database.

Comment 1

I have a work-in-progress CGI that does manual re-encoding of individual
comments.  It's not exactly a bulk conversion tool, but I was figuring on
throwing the "not valid utf-8" detection stuff into show_bug and adding a button
for privileged users next to such comments to allow them to recode them.  It
uses the web browser's charset detection stuff to decide which charset to recode
from (because Mozilla's charset detection usually seems to work better than the
stuff included with Perl, and it also lets the user override it if it's wrong
and they can figure it out on their own), but uses the Perl Encode module to do
the recoding once the user submits which character set to use for the source.
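
The flow described here (flag comments that are not valid UTF-8, then recode them from a charset the user picks) might look roughly like the following. This is a Python sketch for illustration only: the CGI in question uses Perl's Encode module, and the function names `needs_recoding` and `recode_comment` as well as the sample data are hypothetical.

```python
def needs_recoding(raw: bytes) -> bool:
    """Return True when the stored bytes are not valid UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


def recode_comment(raw: bytes, source_charset: str) -> bytes:
    """Recode bytes from the user-selected charset to UTF-8.

    Raises UnicodeDecodeError or LookupError if the charset does not
    fit the data or is unknown, so the caller can ask the user again.
    """
    return raw.decode(source_charset).encode("utf-8")


# Hypothetical sample: a Shift_JIS comment stored as raw bytes.
stored = "こんにちは".encode("shift_jis")
if needs_recoding(stored):
    stored = recode_comment(stored, "shift_jis")
```

Leaving the charset choice to a person, as proposed, means the code only has to validate and transcode; it never guesses.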

Comment 2

14 years ago
Would adding a field for charset to the longdescs table be appropriate?

... and then it could be converted to UTF-8 (or another selected charset) on 
the fly as the page is viewed.

This would avoid having to touch the underlying data, unless a specific 
comment must be converted to UTF-8 in the DB too (e.g. for searchability), in 
which case bug 540 would seem to provide an appropriate mechanism that a 
recoding CGI could tack on to.

Comment 3

14 years ago
Copied from bug 126266 comment #195 (there are some overlaps with comment #1 here)

 1) send emails to those with their names stored in
ISO-8859-1 (I've never seen anyone use non-ASCII characters in encodings other
than ISO-8859-1 for their names at bugzilla.mozilla.org) to update their account
info in UTF-8. 
 2) Begin to emit 'charset=UTF-8' for bugs filed after a certain
day.(say, 2005-03-01). Do the same for current bugs with ASCII characters alone
in their comments and title. 
 3) For existing bugs, add a very prominent warning right above 'additional
comments' text area that 'View | Character Encoding'
should be set to UTF-8 if one's comment includes non-ASCII characters. This is
to ensure existing bugs are not further 'polluted' with non-UTF-8 comments. We
may go a step further and add a server-side check for 'UTF8ness' of a new
comment. If it's not UTF-8, send them back with the following message:

   - Your comment contains non-ASCII characters, but it's not in UTF-8. Please,
go back, set your browser to use 'UTF-8' and enter your comment again....

     The actual content of a comment : blahblah.....

 
 4) If we really want to migrate all existing bugs to UTF-8, add a button to
each comment to indicate the current character encoding. If necessary, this
button can be made available only to the select group of people knowledgeable
enough to identify encodings reliably (and perhaps the authors of comments) 

 5) search/query may need some more tinkering...

As for the charset detection, I wouldn't trust it much for a short span of text
like bugzilla comments. If it's not automatic and merely given as a hint to the
privileged users (as described in comment #1), that would be all right.

Comment 4

14 years ago
Sorry for putting it on the wrong bug. Bug 126266 comment 198:

# A while back someone proposed on IRC that we turned on UTF-8 for every bug ID
# that was greater than a certain number. Although that isn't a perfect solution
# (still mixed character encoding in the database) it would at least make all new
# bugs and all comments on new bugs forward compatible.

Comment 5

14 years ago
(In reply to bug 126266 comment #202)
> (In reply to bug 126266 comment #201)
> > Anne: that breaks things when you have content from multiple bugs on the same
> > page, such as buglists or longlist.cgi output.

This is not new, either. It was given as a reason for not doing that in 2002(?).
Result? We've kept accumulating bugs with mixed encodings for 2+ more years.

> So does the current solution of allowing any encoding.

Absolutely.
 
> Sending pages that have content from multiple bugs as UTF-8 and sending all bugs
> with ID > current bug number as UTF-8 seems like a reasonable start to me.

Can we start sometime soon? At minimum, a prominent warning (in red or whatever
is the most clear way) just after 'Additional Comments' that reads "Set View |
Encoding to UTF-8 before adding a comment with non-ASCII characters" should be
added now. 
 
> We'll stop accumulating content of unknown encoding in new bugs, and there will
> still be a way to view the content on the older bugs (by viewing the bug as its
> own page) if there's some content that isn't UTF-8.

For list pages with multiple encodings, one can always override the encoding
emitted by HTTP/meta manually. Needless to say, only bugs with that encoding
will be shown correctly at a time while all other bugs in the list will be
mangled. However, that's not any worse than what we have now. Currently we don't
emit MIME charset so that the default char. encoding of a user is used (or with
a browser with autodetector like Mozilla, whatever autodetector detects). With
UTF-8 emitted, UTF-8 will be used by default. 

Comment 6

Does anyone have a hunch about the number of bugs where the following would be
wrong:
1) If a given comment is a valid UTF-8 byte sequence, assume it is UTF-8.
2) Else, assume it is Windows-1252-encoded and convert to UTF-8.
3) Label the output as UTF-8.
?

(FWIW, when I assessed, whether that approach would work for www.mozilla.org, it
appeared that it was ok for pretty much everything except i18n test cases and
evang letter translations.)
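
The three steps above can be sketched in a few lines. This is an illustrative Python sketch, not code from any Bugzilla script; the function name `to_utf8_guess` and the sample strings are assumptions. It also shows the failure mode the heuristic invites: bytes in another legacy encoding usually decode as Windows-1252 without any error, so the wrong result is produced silently.

```python
def to_utf8_guess(raw: bytes) -> str:
    """Steps 1 and 2: trust valid UTF-8, otherwise assume Windows-1252."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" because five byte values are undefined in cp1252.
        return raw.decode("cp1252", errors="replace")


# A Windows-1252 comment is recovered correctly.
print(to_utf8_guess("déjà vu".encode("cp1252")))

# A KOI8-R comment is not valid UTF-8 either, so it falls through to
# step 2 and comes out as Latin-looking text instead of Cyrillic.
koi8 = "Привет".encode("koi8_r")
print(to_utf8_guess(koi8))
```

Step 3 (labelling the output as UTF-8) is then just a matter of encoding the returned string as UTF-8 when the page is served.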

Comment 7

14 years ago
(In reply to comment #6)
> Does anyone have a hunch about the number of bugs where the following would be
> wrong:

  I can't give you the number, but there are quite a lot of them. 

> 1) If a given comment is a valid UTF-8 byte sequence, assume it is UTF-8.
> 2) Else, assume it is Windows-1252-encoded and convert to UTF-8.

Step 2 is dangerous and we should never take that step. We don't have to do
that, either. What justdave and I sketched in comment #1 and comment #4 (step 4)
is a more sensible approach. We can live with mixed encodings in the output of
lists and in existing bugs, but we cannot afford to lose/damage data. 

> (FWIW, when I assessed, whether that approach would work for www.mozilla.org,
 
www.mozilla.org and bugzilla.mozilla.org are totally different beasts. In the
former, most, if not all, documents had been in ISO-8859-1 (actually US-ASCII).
If somebody wanted to add content beyond ISO-8859-1, they used UTF-8. So, it's
not surprising that it worked well for www.mozilla.org. bugzilla.mozilla.org has
a number of bugs with comments in KOI8-R, Windows-1251, Shift_JIS, EUC-KR,
GB2312, Big5, etc.  

Comment 8

14 years ago
For b.m.o, do we have any statistics on how many old non-UTF8 comments we have?
 Or how many bugs are affected?  Various people have asserted "many",
"thousands", "hundreds", etc.  What's the actual number?

I favour the idea of a run-once script to convert to UTF-8 (using a per-comment
guessed encoding based on an operator-set default), and providing an
administrator mode to specify an encoding to re-convert any given comment (i.e.
back from UTF-8 to the guessed encoding, then forward to the newly-specified
encoding).  How about a tool which allows the administrator to specify a small
number of encodings and then request "show me what this comment would have
looked like if I had converted it in each of these encodings".  Pick one and
you're done.

We would have to store the guessed encoding for each comment, probably in a
separate table.
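
A rough sketch of the re-conversion idea follows, in Python for illustration. The names `reconvert` and `preview` and the sample comment are hypothetical. The round trip only restores the original bytes if the guessed encoding maps every byte value (ISO-8859-1 does), which is one reason the guessed encoding has to be stored per comment.

```python
def reconvert(text, guessed, corrected):
    """Undo a wrong guess: back to the original bytes, then decode again."""
    original_bytes = text.encode(guessed)
    return original_bytes.decode(corrected)


def preview(text, guessed, candidates):
    """Show what the comment would look like under each candidate encoding."""
    results = {}
    for charset in candidates:
        try:
            results[charset] = reconvert(text, guessed, charset)
        except UnicodeError:
            results[charset] = "(not decodable)"
    return results


# Hypothetical case: a KOI8-R comment was converted as ISO-8859-1 by mistake.
wrong = "Привет".encode("koi8_r").decode("iso-8859-1")
options = preview(wrong, "iso-8859-1", ["koi8_r", "cp1251", "utf-8"])
for charset, shown in options.items():
    print(charset, "->", shown)
```

The administrator would pick the candidate that reads correctly, and the stored guess would be updated to match.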

Yes, I know that b.m.o (for instance) has a bajillion non-ASCII comments
(although I don't know, and it looks like nobody knows, how big the "bajillion"
actually is).  I don't know how many of those can't be correctly guessed using
an informed encoding guesser.  Does anyone?  Of course, we can't automatically
tell that an encoding guesser has got it wrong: they need eyeballing.  My
suspicion is that b.m.o has a few million comments, of which a few thousand are
non-ISO-8859-15, and maybe a hundred of those are hard to guess.  But that's
really just a wild guess.
Nick: if you can write some SQL query which will tell us, I'm happy to run it. I
could also dump the comments to a file and run a script over that if necessary.

Gerv

Comment 10

14 years ago
Just a quick thought that occurred to me... (and looking back appears to have 
occurred to me in a different form in comment 2)


A column could be added to the longdescs table recording the browser's charset 
(with NULL for unknown, and all existing data) with almost immediate effect.

i.e. no conversions, no UI changes etc. but just recording the data.

This way, we would at least know that all future data has the capability to be 
reliably converted to UTF-8 when the rest of the code catches up. It could also 
act as a hint for existing data on the same bug, or by the same user.
Posted file script to count non-utf8 data (obsolete) —
OK, here's the statistics from b.m.o as of 2005-03-24 23:00 PST:

                          total rows  non-ascii  non-utf8
attachments.description:      178536        524       199
attachments.filename:         178536         56        51
attachments.mimetype:         178536          6         0
bugs.alias:                   287484          3         3
bugs.bug_file_loc:            287484         86        57
bugs.short_desc:              287484        563       455
bugs.status_whiteboard:       287484          3         2
longdescs.thetext:           2454941      40674     13631
namedqueries.name:             11757         24        23
namedqueries.query:            11757          0         0
quips.quip:                     1805         13         6
series.name:                     788          0         0
series.query:                    788          0         0
series_categories.name:          143          0         0
whine_events.subject:              9          0         0
whine_events.body:                 9          0         0
whine_queries.query_name:          7          0         0
whine_queries.title:               7          0         0

These numbers were generated with the attached script (though I prettied it up
a little bit for this bug comment).
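
For readers without access to the attachment, the three columns can be reproduced with a classification along these lines. This is a Python sketch under assumptions, not the attached script: `classify`, `count_column` and the sample rows are hypothetical, and it presumes that "non-utf8" rows are counted as a subset of "non-ascii" rows, which is what the figures above suggest.

```python
def classify(value: bytes) -> str:
    """Classify a stored value as 'ascii', 'utf8', or 'non-utf8'."""
    if all(byte < 0x80 for byte in value):
        return "ascii"
    try:
        value.decode("utf-8")
        return "utf8"
    except UnicodeDecodeError:
        return "non-utf8"


def count_column(values):
    """Return (total rows, non-ascii, non-utf8) for one column's values."""
    total = non_ascii = non_utf8 = 0
    for value in values:
        total += 1
        kind = classify(value)
        if kind != "ascii":
            non_ascii += 1
        if kind == "non-utf8":
            non_utf8 += 1
    return total, non_ascii, non_utf8


# Hypothetical sample rows standing in for one database column.
rows = [b"plain", "déjà".encode("utf-8"), "déjà".encode("iso-8859-1")]
print(count_column(rows))
```
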
of course, I swiped the header off of another file and forgot to fix the
contributor line.  oops :)
Ack, I missed one:

profiles.realname: total: 192221  non-ascii:  3688  non-utf8: 3618
Here's a fixed up copy of the script.  Includes the column I forgot, fixes the
license header, and the output actually looks like what I posted to the bug
now.
Attachment #178544 - Attachment is obsolete: true
Dave,

Nice work :-) We need to remember, of course, that just because something
decodes as UTF-8 doesn't necessarily mean that it is.

It would be good to have figures for open bugs only - that gives us a better
handle on the scale of the problem in practice. I would do it myself but the
script is owned by root and not writable by other users.

Looking at the data, the big problems are the values which no-one can fix up
afterwards - attachment descriptions and comments. No surprise there.

Gerv

Comment 16

Here's some data for open bugs, given that I think we care a lot less if
resolved bugs get a bit mangled.

                        total rows  non-ascii  non-utf8
attachments.description:    31153       195        75
attachments.filename:       31153        27        23
attachments.mimetype:       31153         4         0
bugs.alias:                 53971         1         1
bugs.bug_file_loc:          53971        20        14
bugs.short_desc:            53971        99        68
bugs.status_whiteboard:     53971         0         0
longdescs.thetext:         415466      8580      3021

This does seem to make the problem a lot less scary. The comments and the
attachment descriptions are the key ones, as people can't fix up manually if we
mess them up. 75 attachment descriptions isn't really very many. I bet if we
reduced it to non-obsolete attachments it would be even fewer.

What's the next step? Evaluate the performance of some charset-guessing Perl
modules on the relevant comments?

Wild idea: could we detect the Accept-Charset of each user, store it, and use it
as a first guess for the comments they've added?

Gerv

Comment 17

14 years ago
What about attachments.thedata? In an installation I help to admin there are
still 6560 non-utf8 attachments.

Comment 18

(In reply to comment #17)
> What about attachments.thedata? In an installation I help to admin there are
> still 6560 non-utf8 attachments.

Attachments can be binary.  Attachments are arbitrary data, and it doesn't
really matter what charset they are.  Whether it matters or not probably depends
a lot on the mime type.

Comment 19

14 years ago
(In reply to comment #17)
> What about attachments.thedata? In an installation I help to admin there are
> still 6560 non-utf8 attachments.

We MUST leave them alone other than fixing MIME type if necessary. ('text/*' =>
'text/*; charset=XYZ') 

(In reply to comment #16)
> Here's some data for open bugs, given that I think we care a lot less if
> resolved bugs get a bit mangled.
> 
>                         total rows  non-ascii  non-utf8
> attachments.description:    31153       195        75

> mess them up. 75 attachment descriptions isn't really very many. I bet if we
> reduced it to non-obsolete attachments it would be even fewer.
> 
> What's the next step? Evaluate the performance of some charset-guessing Perl
> modules on the relevant comments?

If it's only 75 (or 195), why bother with evaluating Perl charset-guessing
modules? Just doing it manually would be easier even if that means writing to
those who attached them to ask them to identify the charset.  My point is that I
wouldn't use any charset-guessing (other than UTF-8 vs non-UTF-8, which is not
fool-proof either as you wrote) for this kind of task. It can be pretty much
automated if a form mail is sent to those who attached them asking them to
identify charset. Hmm, with this possibility, the number of non-ASCII attachment
descriptions matters less (probably, the higher the number, the lower the response
ratio will be, because some people are not around any more).


> Wild idea: could we detect the Accept-Charset of each user, store it, and use it
> as a first guess for the comments they've added?

That may or may not work. 'Accept-lang' + the most widely used charset for the
language could help, too. Again, with only 75, I don't see much point in
tinkering with ideas like that.

Comment 20

jshin: the ideas I mentioned about charset guessing and accept-lang were really
ideas for the 3000+ comments, not the attachment descriptions. I agree that 75
attachment descriptions could be fixed manually - or even, perhaps, just ignored.

Gerv

Comment 21

14 years ago
Sorry for misunderstanding. As for 3000+ comments, I would try automating the
conversion the way I described in my previous comment (send out form mails
replies to which can be automatically validated and processed with the help of
Perl's encoding module) instead of relying (solely) on the not-so-reliable perl
encoding guessing module (ok, it should be quantified, but my hunch is that it's
not so good).

Comment 22

14 years ago
(In reply to comment #21)
> Sorry for misunderstanding. As for 3000+ comments, I would try automating the
> conversion the way I described in my previous comment (send out form mails
> replies to which can be automatically validated and processed with the help of

 Obviously, we don't have to do things the way we would have done before the web
came out. Instead of sending out form mails, a simple web interface can be set
up and the link to it can be mailed so that those who wrote comment can identify
the encoding of their comments. The response can be processed either in batch or
on-line('in situ'). I don't expect everyone to respond for every comment, but
this will significantly reduce the need to rely on the manual fixing or encoding
guessing  

Comment 23

14 years ago
(In reply to comment #22)
> link to it can be mailed

... or initially, just included in the bugzilla page header/footer when any 
logged in user has comments that need converting.

It doesn't seem to me to be an important enough issue to mass mail everyone 
about.

In most cases a user's comments will be in the same charset, so might be useful 
to offer bulk conversion (select charset for one comment, and confirm that all 
the others display correctly).

Hmmmm.... wonder what it currently makes of these:
 "«ö¦¹¶i¤J°T®§½s¿è..."
 "¸Þ½ÃÁö¸¦ ÆíÁýÇÏ·Á¸é ÀÌ°÷À» Ŭ¸¯ÇϽÿÀ..."
 "ƒGƒfƒBƒbƒgƒ�ƒbƒZ�[ƒW‚𓾂é‚É‚Í�A‚±‚±‚ðƒNƒŠƒbƒN‚µ‚Ä‚­‚¾‚³‚¢..."

Comment 24

14 years ago
(In reply to comment #23)
> (In reply to comment #22)
> > link to it can be mailed
> 
> ... or initially, just included in the bugzilla page header/footer when any 
> logged in user has comments that need converting.
> 
> It doesn't seem to me to be an important enough issue to mass mail everyone 
> about.

What I have in mind is to send a *single* email to each user to alert about
comments that need to be converted with the link to a single *dynamic* web page
that lists links to all the comments (not the actual content) made by her or him
that have not yet been converted to UTF-8. One can identify the character
encoding of comments (s)he made at her/his leisure. 

> In most cases a user's comments will be in the same charset, so might be useful 
> to offer bulk conversion (select charset for one comment, and confirm that all 
> the others display correctly).

That's a nice idea, but should offer an option to do that one by one because
it's not always the case.

Comment 25

14 years ago
(In reply to comment #24)
> What I have in mind is to send a *single* email to each user to alert about

[how a user might read such a mail if this is done naïvely]

Dear user, you made one or two comments in bugs a couple of years back, and
these bugs have still not been fixed... and although you never used
bugzilla again, nor even asked to be kept informed about the status of
the bug, you had the audacity to enter one or two non-ASCII characters
in a couple of comments. So now we want you to dig out your login details
and come back to the site just to save us having to figure out what the
characters you meant to enter were, even though this should be obvious
to anyone that cares. Not sure if anyone else will even read your comment
again after you've done this, but we do like to keep our database in order,
and obviously this is far more important than fixing the 2-year old bug.


Even the banner I suggested would be quite in-your-face, but at least in this 
case the user is already logged in, so we know they can deal with it in a 
couple of clicks...

Many comments will be barely worth 10 seconds of a user's time to fix as it will 
be obvious to anyone reading what the missing character is. Take bug 187403 
comment 10 as a random example.


> That's a nice idea, but should offer an option to do that one by one because
> it's not always the case.

Ummm... let's try "(select charset for one comment, and confirm [by checking a 
box next to each one] that all the others display correctly)".

Comment 26

14 years ago
(In reply to comment #25)
> (In reply to comment #24)
> > What I have in mind is to send a *single* email to each user to alert about
> 
> [how a user might read such a mail if this is done naïvely]
> 
> Dear user, you made one or two comments in bugs a couple of years back, and
> these bugs have still not been fixed... and although you never used
> bugzilla again, nor even asked to be kept informed about the status of

Depending on how it's worded, that's certainly possible. There's a trade-off
between two approaches. Your approach would have a higher response rate among
those who're asked to identify the char. encoding, but it leaves out those who
don't log on on a regular basis but who may be interested enough to do the
chore.  Two approaches can be combined. Try your approach for a certain period
(say, a month) and then mail to the rest of people. 

Comment 27

14 years ago
Idea!!! Just thought of a way this can be done so that ALL pages are displayed 
in UTF-8 without having to change a single byte of existing comments...

if a comment doesn't decode as valid UTF-8, then display it something like this:

  ----- Additional Comment #99 From user@example.com 2001-01-01 01:01 -----
  [ No charset specified. Ambiguous characters shown as #. _View_Raw_Comment_ ]

  I'm getting a sense of d#j# vu about this.

(or perhaps using some other character such as ' ', U+2588, '?', or the 'empty 
box' character that windows uses when a font is missing a character, rather 
than '#', and the placeholder character stylised (e.g. linkified, bold, etc.) 
to make it look different from the same character actually appearing in the 
comment)

The original comment author, and any sufficiently empowered user, should 
probably also get a link to a page allowing them to select the correct character set 
and convert to UTF-8 next to the View Raw Comment link.


For "View Raw Comment", it would serve a page containing JUST that comment and 
no charset header, and use the browser's charset detection as at present.
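
The display rule proposed in this comment could be sketched as follows. This is illustrative Python, not Bugzilla code; `render_comment` is a hypothetical name, and masking every non-ASCII byte individually is an assumption about how "ambiguous characters" would be chosen.

```python
def render_comment(raw: bytes, placeholder: str = "#"):
    """Return (text, is_ambiguous) for display on a UTF-8 page."""
    try:
        return raw.decode("utf-8"), False
    except UnicodeDecodeError:
        # Keep ASCII bytes, mask every other byte with the placeholder.
        masked = "".join(chr(b) if b < 0x80 else placeholder for b in raw)
        return masked, True


legacy = "I'm getting a sense of déjà vu about this.".encode("iso-8859-1")
text, ambiguous = render_comment(legacy)
if ambiguous:
    print("[ No charset specified. Ambiguous characters shown as #. ]")
print(text)
```

With a multi-byte legacy encoding such as Shift_JIS, this byte-wise masking would show one placeholder per byte rather than per character, and some trailing bytes that fall in the ASCII range would be shown as ordinary letters, so the raw-comment view remains the only faithful rendering.
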
Comment 28

slee: you're a genius! :-) That's exactly what we should do. We make a
best-efforts rendering in the body of the bug (perhaps using charset guessing,
perhaps not) but allow the raw comment to be served alone if necessary.

Gerv

Comment 29

14 years ago
Comment on attachment 178546 [details]
script to count non-utf8 data

Shouldn't this also be checking:

  ['bugs_activity','added'],
  ['bugs_activity','removed'],


... and possibly also admin-defined data such as product/keyword/flag
descriptions.
More specifically, we should add a show_comment.cgi script which takes a bug
number and comment number, and just displays the raw text of the comment, with
no processing or hyperlinking, and no charset header. A link to this could be
embedded in all the comments which we couldn't convert.

Gerv
For people who are lucky enough to have a database in a single encoding, here
is a hack to convert an unsuspecting iso-8859-1 encoded 2.19.3 Bugzilla
database to UTF-8.
(In reply to comment #28)
> slee: you're a genius! :-) That's exactly what we should do. We make a
> best-efforts rendering in the body of the bug (perhaps using charset guessing,
> perhaps not) but allow the raw comment to be served alone if necessary.

I'm with Gerv on that.  I want it!!! :) :)

Attachment #185456 - Attachment description: Script to convert single-encoding databases to UTF-8 → Script to convert single-encoding 2.19.3 databases to UTF-8
*** Bug 304944 has been marked as a duplicate of this bug. ***

Comment 34

14 years ago
Nice script

Any reason not to add a safety check like....

eval { Encode::decode("utf-8", $new, Encode::FB_CROAK | Encode::LEAVE_SRC); };
if ($@) {
    # Not valid UTF-8, so it has not been converted yet.
    Encode::from_to($new, "iso-8859-1", "utf-8");
}

so that it will not reconvert anything that is already converted?

Comment 35

14 years ago
A few comments....

1) VERY NEAT TOOL

2) It would be best to use Bugzilla::DB to open the database and get the handle.
That will make it work regardless of port/socket options.

Can anyone think of any cases where this conversion could lose something?  I
can't unless it were run on a database that already has UTF8 data in it.

Comment 36

14 years ago
By the way...   it looks like even Bugzillas that specify ISO-8859-1 wind up with
special "windows" characters.   Probably, this means that we need to convert
from "cp1252" rather than "iso8859-1".  It is "supposed" to be a superset.

http://czyborra.com/charsets/codepages.html#CP1252
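
The difference between the two source encodings is confined to bytes 0x80 to 0x9F, which a short Python sketch can make visible. The sample bytes are made up for illustration.

```python
raw = b"\x93quoted\x94 text \x96 with Windows punctuation"

as_latin1 = raw.decode("iso-8859-1")
as_cp1252 = raw.decode("cp1252")

# ISO-8859-1 maps 0x80-0x9F to invisible C1 control characters,
# while Windows-1252 maps most of them to printable punctuation.
print([hex(ord(ch)) for ch in as_latin1 if ord(ch) > 0x7F])
print([hex(ord(ch)) for ch in as_cp1252 if ord(ch) > 0x7F])
```

Converting from ISO-8859-1 would therefore turn curly quotes and dashes typed on Windows into control characters, whereas converting from cp1252 keeps them. Bytes 0xA0 to 0xFF decode identically under both.
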
(Assignee)

Updated

14 years ago
Whiteboard: [Implementation: Comment 27]
Target Milestone: --- → Bugzilla 2.24

Comment 37

14 years ago
*** Bug 311398 has been marked as a duplicate of this bug. ***

Comment 38

14 years ago
With newer versions of MySQL (4.1 and up for sure), it seems that it is not necessary to do this.   MySQL already knows that all the tables are Latin1.  If we tell MySQL which tables/fields we want to change to UTF8 and which we want to change to binary, then it will handle sorting, regexp matches, and length() properly, and it will convert the existing data automatically.  In fact, there is even a warning in the documentation that the only way to suppress the automatic conversion is to change a table to BINARY first and then to the new encoding.

I have not experimented with this yet (I used a hacked version of the tool here for my site), but it looks like the way to go.
Assigning this to me on my work account to make sure it stays on my plate prior to the b.m.o upgrade.    It'll be at least a few weeks before I get to it though, I can promise you that.  If someone else wants to grab it between now and then, please do.  The sooner this is done the sooner we can upgrade b.m.o :)
Assignee: nobody → justdave
Priority: -- → P2
(Assignee)

Comment 40

13 years ago
Posted patch Work In Progress (obsolete)
Okay, things are looking good in this department. :-) It turns out there's an Encode::Detect module that hooks into the Gecko Universal Charset Detector. :-)

The code I've posted works extremely well. It just doesn't actually update the database yet. But you can see in the GUI that the conversions work really well.
Assignee: justdave → mkanat
Status: NEW → ASSIGNED
(Assignee)

Comment 41

13 years ago
Posted file v1: contrib/recode.pl (obsolete) —
Okay, here's a script that does the conversion. Its guessing is usually pretty good. If you run it, it will explain how it works.
Attachment #185456 - Attachment is obsolete: true
Attachment #239676 - Attachment is obsolete: true
Attachment #239884 - Flags: review?
(Assignee)

Updated

13 years ago
Attachment #239884 - Flags: review? → review?(justdave)
(Assignee)

Comment 42

13 years ago
*** Bug 135762 has been marked as a duplicate of this bug. ***
(Assignee)

Updated

13 years ago
Alias: bz-utf8-migrate
(Assignee)

Comment 43

13 years ago
Posted file v2 (obsolete) —
Okay, now it has a --dry-run argument.

I'd be really interested in seeing the results of --dry-run from a very large, old Bugzilla database like bmo.

FWIW, it works extremely well on landfill's bugzilla-tip.
Attachment #239884 - Attachment is obsolete: true
Attachment #240080 - Flags: review?
Attachment #239884 - Flags: review?(justdave)
Attachment #239884 - Attachment is patch: true
(Assignee)

Updated

13 years ago
Attachment #239884 - Attachment is patch: false
(Assignee)

Updated

13 years ago
Attachment #240080 - Attachment is patch: false
(Assignee)

Updated

13 years ago
Attachment #240080 - Flags: review? → review?(justdave)
(Assignee)

Comment 44

13 years ago
Posted file v3 (obsolete) —
Okay, here's version 3. I've improved the failure detection slightly (it now only says it failed when it really failed) and I've added the ability to override the encodings for certain values.
Attachment #240080 - Attachment is obsolete: true
Attachment #240080 - Flags: review?(justdave)
(Assignee)

Comment 45

13 years ago
Posted file v3.1 (obsolete) —
I fixed the output format slightly.
Attachment #240093 - Attachment is obsolete: true
Attachment #240094 - Flags: review?(justdave)
(Assignee)

Comment 46

13 years ago
Posted file v3.2 (obsolete) —
I realized why my "eval use" wasn't working--because eval { use } never works! :-) So I fixed that here.
Attachment #240094 - Attachment is obsolete: true
Attachment #240251 - Flags: review?(justdave)
Attachment #240094 - Flags: review?(justdave)
(Assignee)

Updated

13 years ago
Alias: bz-utf8-migrate → bz-recode
(Assignee)

Comment 47

13 years ago
Posted file v4
I improved the script again. Now, if we fail a guess, but the data is valid UTF-8, we never attempt to convert it (we don't even use the fallback encoding). This produced much better results on landfill.
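
The ordering described here can be pictured with a small Python sketch. It is not taken from recode.pl: `choose_conversion` and `no_guess` are hypothetical names, the detector is passed in as a stand-in callable, and what the real script does when the guess succeeds is an assumption.

```python
def is_valid_utf8(raw: bytes) -> bool:
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def choose_conversion(raw: bytes, guess, fallback: str):
    """Return the source charset to convert from, or None to leave as is.

    `guess` is a hypothetical callable returning a charset name or None.
    """
    guessed = guess(raw)
    if guessed is not None:
        return guessed
    if is_valid_utf8(raw):
        # Guess failed but the data already decodes as UTF-8: do not touch it.
        return None
    return fallback


def no_guess(raw):
    """Stand-in detector that always fails, to exercise the other paths."""
    return None


print(choose_conversion("déjà".encode("utf-8"), no_guess, "iso-8859-2"))
print(choose_conversion("déjà".encode("iso-8859-1"), no_guess, "iso-8859-2"))
```

The point of the middle branch is that the fallback encoding is only ever applied to data that cannot already be read as UTF-8.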

I also fixed up the POD.
Attachment #240251 - Attachment is obsolete: true
Attachment #241153 - Flags: review?(justdave)
Attachment #240251 - Flags: review?(justdave)
Things I noticed in the POD:

 contrib/recode.pl [--guess [--show-failures]] [--charset=iso-8859-2]
                   [--overrides=file_name]

^^^ Should you demonstrate all options here, including --dry-run?

"Don't modify the database, just print out what the conversions will be." <-- s/,/;/

"character set into the UTF-8." <-- s/into the/into/
(Assignee)

Updated

13 years ago
Blocks: 304550
(Assignee)

Comment 49

13 years ago
Comment on attachment 241153 [details]
v4

I can't wait any longer for a review on this script--the freeze is too soon.
Attachment #241153 - Flags: review?(justdave) → review?(LpSolit)
(Assignee)

Updated

13 years ago
Attachment #241153 - Flags: review?(LpSolit) → review?(justdave)
Comment on attachment 241153 [details]
v4

OK, this looks really good...  I've run it three or four times on the production database, and for what it's supposed to do, it works well, and I don't see any problems sticking out at me looking at the source.

There's probably still room for improvement, but it's good enough to include in a release, and we can continue bugfixing it as we encounter issues.
Attachment #241153 - Flags: review?(justdave) → review+

Comment 51

13 years ago
(In reply to comment #47)
> Created an attachment (id=241153) [edit]
> v4

If Encode-Detect is not installed, the script fails to print out the correct error message. Instead it aborts with

# contrib/recode.pl --dry-run
Bareword "ROOT_USER" not allowed while "strict subs" in use at contrib/recode.pl line 141.
Execution of contrib/recode.pl aborted due to compilation errors.

Comment 52

13 years ago
Is the script expected to work with a 2.22.1 database? When I run it in --dry-run mode on a 2.22.1 DB, it fails with:

Converting attachments.description...
Converting attachments.mimetype...
Converting attachments.filename...
[lots of further output]
Converting logincookies.ipaddr...
Converting longdescs.thetext...
Use of uninitialized value in concatenation (.) or string at contrib/recode.pl line 209.
Use of uninitialized value in split at contrib/recode.pl line 213.
DBD::mysql::st execute failed: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'FROM longdescs 
                                      WHERE thetext IS NOT NULL
' at line 1 [for Statement "SELECT thetext,  FROM longdescs 
                                      WHERE thetext IS NOT NULL
                                            AND thetext != ''"] at contrib/recode.pl line 218
(Assignee)

Comment 53

13 years ago
(In reply to comment #52)
> Is the script expected to work with a 2.22.1 database?

  No. Nor is it expected to work with 2.22.1 code.
(Assignee)

Comment 54

13 years ago
RCS file: /cvsroot/mozilla/webtools/bugzilla/contrib/recode.pl,v
done
Checking in contrib/recode.pl;
/cvsroot/mozilla/webtools/bugzilla/contrib/recode.pl,v  <--  recode.pl
initial revision: 1.1
done
Status: ASSIGNED → RESOLVED
Last Resolved: 13 years ago
Resolution: --- → FIXED
(Assignee)

Comment 55

13 years ago
Okay, so the script wasn't working when the UTF-8 parameter was turned on. I've fixed this with this patch.

The script also wasn't running if you set shutdownhtml, which didn't make sense, so I exempted it from shutdownhtml.

Checking in Bugzilla.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla.pm,v  <--  Bugzilla.pm
new revision: 1.53; previous revision: 1.52
done
Checking in contrib/recode.pl;
/cvsroot/mozilla/webtools/bugzilla/contrib/recode.pl,v  <--  recode.pl
new revision: 1.2; previous revision: 1.1
done

Comment 56

12 years ago
Hi!

I just tried recode.pl and got the following errors: 
#./recode.pl --dry-run
Can't find param named user_verify_class at ../Bugzilla/Config.pm line 171.
BEGIN failed--compilation aborted at ../Bugzilla/Auth.pm line 43.
Compilation failed in require at ../Bugzilla.pm line 28.
BEGIN failed--compilation aborted at ../Bugzilla.pm line 28.
Compilation failed in require at ./recode.pl line 26.
BEGIN failed--compilation aborted at ./recode.pl line 26.

What I did was to copy recode.pl version 4 into the contrib directory and run it. 
Is there something else I have to do? 
I checked the whole bug and did not find any instructions. 

Regards
Werner
(Assignee)

Comment 57

12 years ago
(In reply to comment #56)
> What I did was to copy recode.pl version 4 into the contrib directory and run
> it. 

  Well, that won't work! :-) You have to actually upgrade all of Bugzilla--you can't just take this script.

  This is a support question--for any more details, please ask on the support list, described here:

  http://www.bugzilla.org/support/

Comment 58

12 years ago
Which version is it supposed to run with? 
Is it just the 2.23.4 or should it run with the new release 2.22.2?

I'll ask all further questions there. 
Regards
Werner
Blocks: 229010