encoding problem in fetch-adi-from-hive

RESOLVED FIXED in 98

Status

RESOLVED FIXED

People

(Reporter: rhelmer, Assigned: rhelmer)

Tracking

unspecified

Firefox Tracking Flags

(Not tracked)


(Assignee)

Description

4 years ago
Latest version of fetch-adi-from-hive is failing in prod:

2014-07-28 16:05:49,959 ERROR  - MainThread - Exception raised during socorro.external.postgresql.connection_context transaction
Traceback (most recent call last):
  File "/data/socorro/socorro-virtualenv/lib/python2.6/site-packages/crontabber/transaction_executor.py", line 46, in __call__
    result = function(connection, *args, **kwargs)
  File "/data/socorro/application/socorro/cron/jobs/fetch_adi_from_hive.py", line 204, in run
    'count'
DataError: invalid byte sequence for encoding "UTF8": 0xc0 0xaf
CONTEXT:  COPY raw_adi_logs, line 55517
Comment 1

4 years ago

If the string is u'\xc0 \xaf' in unicode, when encoded back into ascii it should be '\xc3\x80 \xc2\xaf'.

Perhaps the hive DB was too heavily mocked for realistic output of unicode strings.
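
For reference, a minimal sketch (plain Python 2, not from the Socorro code) of the byte sequences involved. The "ascii" above presumably means UTF-8, and the 0xc0 0xaf pair from the Postgres error is an overlong UTF-8 sequence that both Python and Postgres reject:

# Proper UTF-8 encoding of the unicode string u'\xc0 \xaf'
# (A-grave, space, macron) gives the bytes described above:
assert u'\xc0 \xaf'.encode('utf-8') == '\xc3\x80 \xc2\xaf'

# The raw byte pair from the COPY error is not valid UTF-8 at all:
try:
    '\xc0\xaf'.decode('utf-8')
except UnicodeDecodeError as exc:
    print 'rejected, just as Postgres rejects it: %s' % exc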
(Assignee)

Comment 2

4 years ago
(In reply to Peter Bengtsson [:peterbe] from comment #1)
> If the string is u'\xc0 \xaf' in unicode, when encoded back into ascii it
> should be '\xc3\x80 \xc2\xaf'. 
> 
> Perhaps the hive DB was too heavily mocked for realistic output of unicode
> strings.

Hmm, so the DataError is really coming from psycopg2 when it does the COPY, so I think we're hitting a case where Python and Postgres disagree about what's acceptable UTF-8 encoding (again :( - we already special-cased this for the NULL byte).

I wonder if psycopg2 or Postgres provides something we can use to just ignore any values it can't encode... I'd like to stop playing whack-a-mole here. I'll look into it.
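
To make the whack-a-mole concrete, a hedged sketch of the kind of per-value cleanup being described (the helper name is hypothetical, not the actual Socorro code): strip the NUL byte that was already special-cased, and drop any byte sequences that are not valid UTF-8 before the row ever reaches the COPY:

def clean_for_copy(raw_bytes):
    # Postgres text columns never accept NUL; this is the existing special case.
    without_nul = raw_bytes.replace('\x00', '')
    # Round-trip through UTF-8, silently dropping invalid sequences such as
    # the overlong 0xc0 0xaf pair from the error above.
    return without_nul.decode('utf-8', 'ignore').encode('utf-8')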
(Assignee)

Updated

4 years ago
Status: NEW → ASSIGNED
(Assignee)

Comment 3

4 years ago
(In reply to Peter Bengtsson [:peterbe] from comment #1)
> If the string is u'\xc0 \xaf' in unicode, when encoded back into ascii it
> should be '\xc3\x80 \xc2\xaf'. 
> 
> Perhaps the hive DB was too heavily mocked for realistic output of unicode
> strings.

What gets me here is: why doesn't this work?

https://github.com/mozilla/socorro/blob/master/socorro/cron/jobs/fetch_adi_from_hive.py#L180

Shouldn't that be ignoring anything that isn't UTF-8, before we even try to send it to Postgres?
Comment 4

4 years ago

I wonder if we're using it correctly. Like why do we bother turning the string into a byte string? Why can't it be left as a unicode string. psycopg2 is good at that kind of stuff.
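
For comparison, a hedged sketch of that alternative (the column names and connection details are assumptions, not the real schema or job): hand psycopg2 unicode objects and let its adapter encode them for the connection's client encoding. Note that the crontabber job goes through a temp file and copy_from(), which streams raw bytes and bypasses this adaptation, so the question of what ends up in that file still matters:

import psycopg2

conn = psycopg2.connect('dbname=breakpad')  # connection details assumed
cursor = conn.cursor()
cursor.execute(
    "INSERT INTO raw_adi_logs (product_name, count) VALUES (%s, %s)",
    (u'Firef\xf6x', 42)  # unicode value; psycopg2 encodes it for the connection
)
conn.commit()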
(Assignee)

Comment 5

4 years ago
OK, well, this failed pretty consistently on prod, but *not* on stage at all... they are both pointing at the same Hive host, so I'm not sure what's going on with that! I am having trouble reproducing locally as well.
(Assignee)

Comment 6

4 years ago
(In reply to Peter Bengtsson [:peterbe] from comment #4)
> I wonder if we're using it correctly. Like why do we bother turning the
> string into a byte string? Why can't it be left as a unicode string.
> psycopg2 is good at that kind of stuff.

The problem is that apparently it's not unicode that we're getting back from Hive, or somehow we're not writing unicode to the temp file that then gets passed to psycopg2.copy_from().
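
A hedged sketch (the file path, the hive_rows() stub, and the assumption that rows are already text are all illustrative, not the real job) of forcing the intermediate file through an explicit UTF-8 encoder, so whatever comes back from Hive is dealt with in Python before copy_from() streams it to Postgres:

import io
import psycopg2

def hive_rows():
    # Stand-in for iterating the Hive cursor; yields rows of text values.
    yield [u'2014-07-28', u'Firef\xf6x', u'42']

with io.open('/tmp/adi.tsv', 'w', encoding='utf-8', errors='replace') as out:
    for row in hive_rows():
        out.write(u'\t'.join(row) + u'\n')

conn = psycopg2.connect('dbname=breakpad')  # connection details assumed
cursor = conn.cursor()
with open('/tmp/adi.tsv', 'rb') as infile:
    cursor.copy_from(infile, 'raw_adi_logs')
conn.commit()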

Comment 7

4 years ago
Commits pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/28f7a61512f15a48728fbb887ee012e8058a2acc
fix bug 1045303 - handle UTF-8 and control codes consistently

https://github.com/mozilla/socorro/commit/eafd45604681c39c03bbe790c4cb8de0206f9ae9
Merge pull request #2257 from rhelmer/bug1045303-hive-encoding-problem

fix bug 1045303 - handle UTF-8 and control codes consistently
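
The commits above are the authoritative change; as a rough approximation of what "handle UTF-8 and control codes consistently" means in practice (the function name and character ranges here are illustrative, not the committed code), the idea is a single normalization step applied to every field:

import re

# Strip C0 control codes but keep tab, newline and carriage return,
# which the tab-separated temp file still needs.
CONTROL_CHARS = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f]')

def normalize_field(raw):
    # Accept bytes or unicode, drop invalid UTF-8, then strip control codes.
    if isinstance(raw, str):
        raw = raw.decode('utf-8', 'ignore')
    return CONTROL_CHARS.sub(u'', raw)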

Updated

4 years ago
Status: ASSIGNED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED

Updated

4 years ago
Target Milestone: --- → 97

Updated

4 years ago
Target Milestone: 97 → 98