Closed Bug 1045303 Opened 10 years ago Closed 10 years ago

encoding problem in fetch-adi-from-hive

Categories

(Socorro :: Database, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rhelmer, Assigned: rhelmer)

Latest version of fetch-adi-from-hive is failing in prod:

2014-07-28 16:05:49,959 ERROR  - MainThread - Exception raised during socorro.external.postgresql.connection_context transaction
Traceback (most recent call last):
  File "/data/socorro/socorro-virtualenv/lib/python2.6/site-packages/crontabber/transaction_executor.py", line 46, in __call__
    result = function(connection, *args, **kwargs)
  File "/data/socorro/application/socorro/cron/jobs/fetch_adi_from_hive.py", line 204, in run
    'count'
DataError: invalid byte sequence for encoding "UTF8": 0xc0 0xaf
CONTEXT:  COPY raw_adi_logs, line 55517
If the string is u'\xc0 \xaf' in unicode, when encoded back into UTF-8 it should be '\xc3\x80 \xc2\xaf'.

Perhaps the hive DB was too heavily mocked for realistic output of unicode strings.
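
A quick way to check the encoding point above, written as a Python 3 sketch (the job in this bug ran on Python 2.6, so the literals differ slightly there). The escaped string '\xc3\x80 \xc2\xaf' is what the two code points U+00C0 and U+00AF look like once encoded as UTF-8. The 0xc0 0xaf in the DataError is a bare two-byte sequence, an overlong form of '/', which no valid UTF-8 text contains. The helper name `is_valid_utf8` is made up for illustration.

```python
# Two code points, U+00C0 and U+00AF, separated by a space.
text = "\u00c0 \u00af"

# Encoding to UTF-8 turns each code point into a two-byte sequence.
encoded = text.encode("utf-8")

# The two raw bytes named in the DataError message.
raw = b"\xc0\xaf"


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Here `is_valid_utf8(encoded)` is true and `is_valid_utf8(raw)` is false, which would be consistent with the bytes reaching Postgres without ever having been decoded and re-encoded.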
(In reply to Peter Bengtsson [:peterbe] from comment #1)
> If the string is u'\xc0 \xaf' in unicode, when encoded back into ascii it
> should be '\xc3\x80 \xc2\xaf'. 
> 
> Perhaps the hive DB was too heavily mocked for realistic output of unicode
> strings.

Hmm, so the DataError is really coming from psycopg2 when it's doing the COPY, so I think we're hitting a case where Python and Postgres disagree about what's acceptable as UTF-8 (again :( - we already special-cased this for the NULL byte).

I wonder if psycopg2 or Postgres provides something we can use to just ignore any values it can't encode. I'd like to stop playing whack-a-mole here. I'll look into it.
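
One way to stop the whack-a-mole on the Python side, rather than relying on psycopg2 or Postgres, might be to clean every field once before it goes anywhere near COPY. The following is a minimal Python 3 sketch under stated assumptions: it is not the change that landed for this bug, `sanitize_field` is a hypothetical name, and stripping every C0 control character (including tab and newline, since each value is one column of a tab-separated row) is an assumed policy.

```python
import re

# NUL, tab, newline and the other C0 control characters, plus DEL.
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")


def sanitize_field(raw_bytes):
    """Decode one field as UTF-8, dropping anything Postgres would reject.

    Bytes that are not valid UTF-8 are silently discarded, then control
    characters are removed from the decoded text.
    """
    text = raw_bytes.decode("utf-8", errors="ignore")
    return _CONTROL_CHARS.sub("", text)
```

For example, `sanitize_field(b"Firefox\xc0\xaf31.0")` gives `"Firefox31.0"`. Using `errors="replace"` instead of `"ignore"` would keep a visible marker where bytes were dropped, at the cost of storing replacement characters.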
Status: NEW → ASSIGNED
(In reply to Peter Bengtsson [:peterbe] from comment #1)
> If the string is u'\xc0 \xaf' in unicode, when encoded back into ascii it
> should be '\xc3\x80 \xc2\xaf'. 
> 
> Perhaps the hive DB was too heavily mocked for realistic output of unicode
> strings.

What gets me here is: why doesn't this work?

https://github.com/mozilla/socorro/blob/master/socorro/cron/jobs/fetch_adi_from_hive.py#L180

Shouldn't that be ignoring anything that isn't UTF-8 before we even try to send it to Postgres?
I wonder if we're using it correctly. Like, why do we bother turning the string into a byte string? Why can't it be left as a unicode string? psycopg2 is good at that kind of stuff.
OK well this failed pretty consistently on prod, but *not* on stage at all... they are both pointing to the same Hive host, so not sure what's going on with that! I am having trouble reproducing locally, as well.
(In reply to Peter Bengtsson [:peterbe] from comment #4)
> I wonder if we're using it correctly. Like why do we bother turning the
> string into a byte string? Why can't it be left as a unicode string.
> psycopg2 is good at that kind of stuff.

The problem is that apparently it's not unicode that we're getting back from Hive, or somehow we're not writing unicode to the temp file that then gets passed to psycopg2's copy_from().
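
To rule out the temp file as the culprit, one option is to open it with an explicit encoding so that only text can be written and the bytes on disk are guaranteed to be UTF-8. This is a Python 3 sketch under assumptions, not the code in fetch_adi_from_hive.py: `write_rows_as_utf8` and the row layout are hypothetical, and every field is assumed to be already-decoded text with no embedded tabs or newlines.

```python
import io
import tempfile


def write_rows_as_utf8(rows):
    """Write rows of text fields to a temp file as tab-separated UTF-8.

    Returns the path of the file. Passing a byte string as a field raises
    TypeError in the join, so undecoded data fails here, not inside COPY.
    """
    fd, path = tempfile.mkstemp(suffix=".tsv")
    # io.open takes ownership of the descriptor and closes it on exit.
    with io.open(fd, "w", encoding="utf-8", newline="\n") as handle:
        for row in rows:
            handle.write("\t".join(row) + "\n")
    return path
```

The resulting file could then be reopened and handed to the cursor's copy_from(), with any encoding failure surfacing in Python first, where the offending row is easier to log.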
Commits pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/28f7a61512f15a48728fbb887ee012e8058a2acc
fix bug 1045303 - handle UTF-8 and control codes consistently

https://github.com/mozilla/socorro/commit/eafd45604681c39c03bbe790c4cb8de0206f9ae9
Merge pull request #2257 from rhelmer/bug1045303-hive-encoding-problem

fix bug 1045303 - handle UTF-8 and control codes consistently
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → 97
Target Milestone: 97 → 98