Closed Bug 1045303 Opened 10 years ago Closed 10 years ago

encoding problem in fetch-adi-from-hive

Tracking

(Not tracked)

Status:

RESOLVED FIXED

Milestone:

People

(Reporter: rhelmer, Assigned: rhelmer)

Robert Helmer [:rhelmer]

Assignee

Description

10 years ago

Latest version of fetch-adi-from-hive is failing in prod:

2014-07-28 16:05:49,959 ERROR  - MainThread - Exception raised during socorro.external.postgresql.connection_context transaction
Traceback (most recent call last):
  File "/data/socorro/socorro-virtualenv/lib/python2.6/site-packages/crontabber/transaction_executor.py", line 46, in __call__
    result = function(connection, *args, **kwargs)
  File "/data/socorro/application/socorro/cron/jobs/fetch_adi_from_hive.py", line 204, in run
    'count'
DataError: invalid byte sequence for encoding "UTF8": 0xc0 0xaf
CONTEXT:  COPY raw_adi_logs, line 55517

Peter Bengtsson [:peterbe]

Comment 1

10 years ago

If the string is u'\xc0 \xaf' in unicode, when encoded back into UTF-8 it should be '\xc3\x80 \xc2\xaf'.

Perhaps the hive DB was too heavily mocked for realistic output of unicode strings.
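For reference, the difference between those code points and the bytes in the error can be checked directly. This is a minimal sketch in Python 3 (the job itself ran on Python 2.6, so the literals differ slightly), and the variable names are illustrative only. Note that the raw byte pair 0xc0 0xaf is an overlong encoding of "/" and is not valid UTF-8 on its own.

```python
# Python 3 sketch; the production job ran on Python 2.6.
text = "\xc0 \xaf"              # code points U+00C0 and U+00AF, separated by a space
encoded = text.encode("utf-8")
print(encoded)                  # b'\xc3\x80 \xc2\xaf'

raw = b"\xc0\xaf"               # the byte sequence reported in the DataError
try:
    raw.decode("utf-8")
    raw_is_valid = True
except UnicodeDecodeError:
    # 0xc0 can never start a valid UTF-8 sequence
    raw_is_valid = False
print(raw_is_valid)             # False
```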

Robert Helmer [:rhelmer]

Assignee

Comment 2

10 years ago

(In reply to Peter Bengtsson [:peterbe] from comment #1)
> If the string is u'\xc0 \xaf' in unicode, when encoded back into UTF-8 it
> should be '\xc3\x80 \xc2\xaf'.
> 
> Perhaps the hive DB was too heavily mocked for realistic output of unicode
> strings.

Hmm so the DataError is really coming from psycopg2 when it's doing COPY, so I think we're hitting something where python and postgres are disagreeing about what's acceptable in UTF-8 encoding (again :( - we already special-cased this for the NULL byte).

I wonder if psycopg2 or postgres provide something we can use to just ignore any values it can't encode. I'd like to stop playing whack-a-mole here. I'll look into it.
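One way to ignore undecodable values on the Python side, before anything reaches COPY, might look like the sketch below. It is an assumption for illustration and not necessarily what the eventual fix does: `sanitize` is a hypothetical helper, written for Python 3, that drops invalid UTF-8 sequences and the NULL byte mentioned above.

```python
def sanitize(raw: bytes) -> str:
    """Decode bytes as UTF-8, silently dropping invalid sequences and NUL bytes."""
    # errors="ignore" discards any byte that cannot be part of valid UTF-8
    text = raw.decode("utf-8", errors="ignore")
    # PostgreSQL text columns cannot store NUL, so strip it as well
    return text.replace("\x00", "")


print(sanitize(b"foo\xc0\xafbar\x00"))  # foobar
```

Using `errors="replace"` instead would keep a visible U+FFFD marker where bytes were dropped, which may be preferable if silent data loss is a concern.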

Robert Helmer [:rhelmer]

Assignee

Updated

10 years ago

Status: NEW → ASSIGNED

Robert Helmer [:rhelmer]

Assignee

Comment 3

10 years ago

(In reply to Peter Bengtsson [:peterbe] from comment #1)
> If the string is u'\xc0 \xaf' in unicode, when encoded back into UTF-8 it
> should be '\xc3\x80 \xc2\xaf'.
> 
> Perhaps the hive DB was too heavily mocked for realistic output of unicode
> strings.

What gets me here is: why doesn't this work?

https://github.com/mozilla/socorro/blob/master/socorro/cron/jobs/fetch_adi_from_hive.py#L180

Shouldn't that be ignoring anything that isn't UTF-8, before we even try to send it to postgres?

Peter Bengtsson [:peterbe]

Comment 4

10 years ago

I wonder if we're using it correctly. Like, why do we bother turning the string into a byte string? Why can't it be left as a unicode string? psycopg2 is good at that kind of stuff.

Robert Helmer [:rhelmer]

Assignee

Comment 5

10 years ago

OK well this failed pretty consistently on prod, but *not* on stage at all... they are both pointing to the same Hive host, so not sure what's going on with that! I am having trouble reproducing locally, as well.

Robert Helmer [:rhelmer]

Assignee

Comment 6

10 years ago

(In reply to Peter Bengtsson [:peterbe] from comment #4)
> I wonder if we're using it correctly. Like, why do we bother turning the
> string into a byte string? Why can't it be left as a unicode string?
> psycopg2 is good at that kind of stuff.

The problem is that apparently it's not unicode that we're getting back from hive, or somehow we're not writing unicode to the temp file that then gets passed to psycopg2's cursor.copy_from().
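A rough sketch of cleaning each value before it is written to the buffer handed to `copy_from()` is below. It assumes Python 3 and an in-memory buffer rather than a temp file; `build_copy_buffer` and `load_rows` are hypothetical names, and the only detail taken from this bug is the `raw_adi_logs` table named in the traceback.

```python
import io


def build_copy_buffer(rows):
    """Return a text buffer of tab-separated lines suitable for COPY ... FROM STDIN."""
    buf = io.StringIO()
    for row in rows:
        fields = []
        for value in row:
            if isinstance(value, bytes):
                # drop byte sequences that are not valid UTF-8
                value = value.decode("utf-8", errors="ignore")
            value = str(value).replace("\x00", "")  # NUL is not storable in text columns
            # escape characters that are special in COPY's text format
            value = (value.replace("\\", "\\\\")
                          .replace("\t", "\\t")
                          .replace("\n", "\\n")
                          .replace("\r", "\\r"))
            fields.append(value)
        buf.write("\t".join(fields) + "\n")
    buf.seek(0)
    return buf


def load_rows(cursor, rows, table="raw_adi_logs"):
    """Hypothetical wrapper: cursor is an open psycopg2 cursor."""
    cursor.copy_from(build_copy_buffer(rows), table)
```

Keeping the buffer construction separate from the database call means the cleaning step can be exercised in tests with raw byte strings, without a mocked Hive connection.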

[github robot]

Comment 7

10 years ago

Commits pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/28f7a61512f15a48728fbb887ee012e8058a2acc
fix bug 1045303 - handle UTF-8 and control codes consistently

https://github.com/mozilla/socorro/commit/eafd45604681c39c03bbe790c4cb8de0206f9ae9
Merge pull request #2257 from rhelmer/bug1045303-hive-encoding-problem

fix bug 1045303 - handle UTF-8 and control codes consistently

[github robot]

Updated

10 years ago

Status: ASSIGNED → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Lonnen :lonnen

Updated

10 years ago

Target Milestone: --- → 97

Lonnen :lonnen

Updated

10 years ago

Target Milestone: 97 → 98

Categories

(Socorro :: Database, task)
