Closed Bug 1168511 Opened 9 years ago Closed 9 years ago

crontabber failing due to invalid unicode in JSON

Categories

(Socorro :: Backend, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rhelmer, Assigned: lars)

Details

GraphicsDeviceCronApp is failing with:

Traceback (most recent call last):
  File "/data/socorro/socorro-virtualenv/lib/python2.6/site-packages/crontabber/transaction_executor.py", line 46, in __call__
    result = function(connection, *args, **kwargs)
  File "/data/socorro/socorro-virtualenv/lib/python2.6/site-packages/socorro-master-py2.6.egg/socorro/cron/jobs/matviews.py", line 52, in run
    self.run_proc(connection, [target_date])
  File "/data/socorro/socorro-virtualenv/lib/python2.6/site-packages/socorro-master-py2.6.egg/socorro/cron/jobs/matviews.py", line 28, in run_proc
    cursor.callproc(self.get_proc_name(), signature)
DataError: invalid input syntax for type json
DETAIL:  low order surrogate must follow a high order surrogate.
CONTEXT:  JSON data, line 1: ...gress": "xpcom-shutdown", "TelemetryEnvironment":...
Assignee: nobody → rhelmer
Status: NEW → ASSIGNED
OK, this has been failing for several days. I can repro it with:

SELECT
      uuid
    , json_object_field_text(r.raw_crash, 'IsGarbageCollecting') as is_garbage_collecting
FROM
    raw_crashes r
WHERE
    date_processed BETWEEN '2015-05-22'::timestamptz
        AND '2015-05-22'::timestamptz + '1 day'::interval
OK here's an example containing the data that Postgres is unhappy with:

37b76c55-1dc6-4c65-9159-3f04d2150522
I archived (on sp-admin01) this crash, and removed it to see if we can get reporting to continue for now:

breakpad=# delete from raw_crashes where uuid = '37b76c55-1dc6-4c65-9159-3f04d2150522';
DELETE 1
Lars, I've emailed you the raw JSON for the crash in comment 3. Judging from the PG error message, I believe the data in "TelemetryEnvironment" contains the problematic invalid Unicode character.
The actual problematic character in this case seems to be the Unicode-escaped null byte (PG's representation of this character is '\u0000'). I've been finding and removing these like this:

DELETE FROM raw_crashes WHERE raw_crash::text LIKE '%\u0000%'::text AND date_processed >= '2015-07-21';
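
For anyone who wants to check a saved raw crash before it reaches Postgres, here is a minimal sketch of how the offending fields could be located. It is not Socorro code: the function name find_null_byte_paths and the sample document are hypothetical, and it is written for Python 3 rather than the Python 2.6 in the traceback above.

```python
import json


def find_null_byte_paths(value, path=""):
    """Yield the location of every key or string value that contains U+0000.

    Hypothetical helper for illustration only; not part of Socorro.
    """
    if isinstance(value, dict):
        for key, item in value.items():
            child = "%s.%s" % (path, key) if path else key
            if "\x00" in key:
                # Report keys with repr() so the null byte is visible.
                yield "key %r" % child
            for found in find_null_byte_paths(item, child):
                yield found
    elif isinstance(value, list):
        for index, item in enumerate(value):
            for found in find_null_byte_paths(item, "%s[%d]" % (path, index)):
                yield found
    elif isinstance(value, str) and "\x00" in value:
        yield path


# Made-up sample; the JSON escape \u0000 decodes to a real null byte.
raw_crash_text = '{"ProductName": "Firefox", "TelemetryEnvironment": "abc\\u0000def"}'
print(list(find_null_byte_paths(json.loads(raw_crash_text))))
```

This only looks for null bytes. It does not check for unpaired surrogates, which is what the DETAIL line of the Postgres error literally complains about.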
Assignee: rhelmer → lars
Commit pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/1c8cd57f6702e4f837d32a077b84c1bf78fd57ad
Merge pull request #2915 from twobraids/remove-null-byte-from-raw-crashes

fixes Bug 1168511 - filter all strings in raw crash input to remove \x00
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
This shipped, but looks like we're still seeing the problem in raw_crashes.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This happened again today - Lars I saved it and emailed it to you this time.
Commit pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/89d3f394d8b083a061f8800b9dad472d7268b6cf
Merge pull request #2950 from twobraids/zero-theorem

Fixes Bug 1168511 (again) - add ability to test keys for null bytes
Status: REOPENED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Commit pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/d38cffb48fa8a46792e12473f38530145c9af088
Merge pull request #2957 from twobraids/more-unicode-crap

more Fixes Bug 1168511 - separate unicode & str cases in cleaning of null bytes.
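
For readers who do not want to dig through the three pull requests, the general idea in the commit messages (remove \x00 from strings, cover keys as well as values, and treat text and byte strings separately) can be sketched roughly as follows. This is an illustration under assumptions, not the code that was merged: strip_null_bytes is a hypothetical name, and the sketch targets Python 3, where the unicode/str split of Python 2 corresponds to str/bytes.

```python
def strip_null_bytes(value):
    """Return a copy of value with null bytes removed from all strings.

    Hypothetical helper for illustration only; not the merged Socorro code.
    """
    if isinstance(value, str):
        # Text strings: drop the U+0000 code point.
        return value.replace("\x00", "")
    if isinstance(value, bytes):
        # Byte strings are handled separately from text strings.
        return value.replace(b"\x00", b"")
    if isinstance(value, dict):
        # Clean keys as well as values; the first fix covered only values.
        return {strip_null_bytes(k): strip_null_bytes(v) for k, v in value.items()}
    if isinstance(value, list):
        return [strip_null_bytes(item) for item in value]
    # Numbers, booleans, and None pass through unchanged.
    return value


cleaned = strip_null_bytes({"Note\x00s": "abc\x00def", "Items": [1, "x\x00", b"y\x00"]})
print(cleaned)
```

One caveat with this approach: if two keys differ only by a null byte, they collapse into a single key after cleaning and one of the values is lost.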