Closed Bug 1045303 (opened 10 years ago, closed 10 years ago)

encoding problem in fetch-adi-from-hive

Categories: Socorro :: Database, task
Tracking: (Not tracked)
Status: RESOLVED FIXED
Target Milestone: 98
People: (Reporter: rhelmer, Assigned: rhelmer)
Description

Latest version of fetch-adi-from-hive is failing in prod:

2014-07-28 16:05:49,959 ERROR - MainThread - Exception raised during socorro.external.postgresql.connection_context transaction
Traceback (most recent call last):
  File "/data/socorro/socorro-virtualenv/lib/python2.6/site-packages/crontabber/transaction_executor.py", line 46, in __call__
    result = function(connection, *args, **kwargs)
  File "/data/socorro/application/socorro/cron/jobs/fetch_adi_from_hive.py", line 204, in run
    'count'
DataError: invalid byte sequence for encoding "UTF8": 0xc0 0xaf
CONTEXT:  COPY raw_adi_logs, line 55517
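Editor's note on the error: the byte pair 0xC0 0xAF is an "overlong" two-byte UTF-8 encoding of "/" (U+002F), which the UTF-8 specification forbids, so both PostgreSQL and Python reject it. A minimal sketch (modern Python 3 syntax, not Socorro code) reproducing the decode failure:

```python
# 0xC0 0xAF is an overlong two-byte encoding of "/" (U+002F).
# Overlong forms are forbidden by UTF-8, so decoding them fails,
# just as PostgreSQL's COPY refuses them as "invalid byte sequence".
data = b"\xc0\xaf"
try:
    data.decode("utf-8")
    print("decoded")  # never reached
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```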
Comment 1 • 10 years ago

If the string is u'\xc0 \xaf' in unicode, when encoded back into UTF-8 it should be '\xc3\x80 \xc2\xaf'. Perhaps the hive DB was too heavily mocked for realistic output of unicode strings.
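Comment 1's arithmetic checks out: the codepoints U+00C0 and U+00AF, when UTF-8-encoded, become the byte pairs C3 80 and C2 AF. A quick verification in Python 3 syntax (an illustrative check, not from the bug):

```python
# u'\xc0 \xaf' from comment 1: codepoints U+00C0, a space, and U+00AF.
s = "\xc0 \xaf"
encoded = s.encode("utf-8")
# Each codepoint above U+007F becomes a two-byte UTF-8 sequence.
assert encoded == b"\xc3\x80 \xc2\xaf"
print(encoded)
```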
Assignee
Comment 2 • 10 years ago

(In reply to Peter Bengtsson [:peterbe] from comment #1)
> If the string is u'\xc0 \xaf' in unicode, when encoded back into ascii it
> should be '\xc3\x80 \xc2\xaf'.
>
> Perhaps the hive DB was too heavily mocked for realistic output of unicode
> strings.

Hmm, so the DataError is really coming from psycopg2 when it's doing COPY, so I think we're hitting something where Python and Postgres are disagreeing about what's acceptable in UTF-8 encoding (again :( - we already special-cased this for the NULL byte).

I wonder if psycopg2 or Postgres provides something we can use to just ignore any values it can't encode... I'd like to stop playing whack-a-mole here. I'll look into it.
Assignee
Updated • 10 years ago
Status: NEW → ASSIGNED
Assignee
Comment 3 • 10 years ago

(In reply to Peter Bengtsson [:peterbe] from comment #1)
> If the string is u'\xc0 \xaf' in unicode, when encoded back into ascii it
> should be '\xc3\x80 \xc2\xaf'.
>
> Perhaps the hive DB was too heavily mocked for realistic output of unicode
> strings.

What gets me here is why doesn't this work?:
https://github.com/mozilla/socorro/blob/master/socorro/cron/jobs/fetch_adi_from_hive.py#L180

Shouldn't that be ignoring anything that isn't UTF-8, before we even try to send it to Postgres?
Comment 4 • 10 years ago

I wonder if we're using it correctly. Why do we bother turning the string into a byte string? Why can't it be left as a unicode string? psycopg2 is good at that kind of stuff.
Assignee
Comment 5 • 10 years ago

OK, well, this failed pretty consistently on prod, but *not* on stage at all... they are both pointing to the same Hive host, so I'm not sure what's going on with that! I am having trouble reproducing locally as well.
Assignee
Comment 6 • 10 years ago

(In reply to Peter Bengtsson [:peterbe] from comment #4)
> I wonder if we're using it correctly. Like why do we bother turning the
> string into a byte string? Why can't it be left as a unicode string.
> psycopg2 is good at that kind of stuff.

The problem is that it's not unicode that we're getting back from Hive apparently, or somehow we're not writing unicode to the temp file that then gets passed to psycopg2.copy_from()
Comment 7 • 10 years ago

Commits pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/28f7a61512f15a48728fbb887ee012e8058a2acc
fix bug 1045303 - handle UTF-8 and control codes consistently

https://github.com/mozilla/socorro/commit/eafd45604681c39c03bbe790c4cb8de0206f9ae9
Merge pull request #2257 from rhelmer/bug1045303-hive-encoding-problem
fix bug 1045303 - handle UTF-8 and control codes consistently
Updated • 10 years ago
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED

Updated • 10 years ago
Target Milestone: --- → 97

Updated • 10 years ago
Target Milestone: 97 → 98