Closed Bug 934501 Opened 12 years ago Closed 8 years ago

Allow the collector to log and then remove bogus UTF8 characters in value portion of key/value pairs in raw_crash

Categories

(Socorro :: Backend, task)

Platform: x86 macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

Status: RESOLVED FIXED

People

(Reporter: selenamarie, Unassigned)

References

Details

From lars: right now, the data within a raw_crash is not interpreted or validated. All values are left as strings. Postgres is the first crashstore implementation that cares about validation of strings. To prevent errors like this, we'll need to validate fields before they get inserted into Postgres.

This cleansing of the data could happen either just for PG or for the whole system. I'd rather have all crash stores that save raw_crashes save the same thing. The PG raw crash shouldn't end up being different from the FS or HB raw crash. I guess that means I'm in favor of an all-system sort of validation. That has to happen at the collector level (here? https://github.com/mozilla/socorro/blob/master/socorro/collector/wsgi_breakpad_collector.py#L47). I'm loath to give that responsibility to the collector, but really that's the one time to 'nip this problem in the bud'. If we were to do it there, then we ensure that all storage schemes get clean data (even if they don't really care themselves).

Validation could be broken into two types, structural and semantic, with collector validation only caring about structural validation.

Structural: is this string a properly formed UTF-8 character stream? At this level we don't care what the strings mean, just that we've got valid characters. I can't offhand think of other structural validations.

Semantic: is this string a proper representation of a timestamp, integer, or whatever? This type of validation shouldn't happen at the collector level, as we might find that it is a rabbit hole of issues that could make the collector very inefficient. The processor is already doing that sort of validation as it converts strings into their semantically useful forms.

So let's say the collector is doing structural validation and it finds this problem within the value of a field. What does it do? Drop the field? Replace the field with a default? Try to correct the problem?

From selena: I'd say log it and replace with a default that indicates it's been modified. Like '[malformed data]' or something.
Depends on: 934498
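A minimal sketch of the collector-level structural validation lars describes, assuming a raw_crash dict of unicode values and the '[malformed data]' default selena suggests; the names (scrub_raw_crash, SURROGATE_RE) are hypothetical, not Socorro's actual code, and it's written Python 2 style to match the rest of this thread:

    import logging
    import re

    logger = logging.getLogger(__name__)

    # Surrogate code units (U+D800-U+DFFF) are never valid on their own
    # in a UTF-8 stream; stray ones are the "bogus UTF8" this bug is about.
    SURROGATE_RE = re.compile(u'[\ud800-\udfff]')

    def scrub_raw_crash(raw_crash):
        # Log and replace any value containing surrogate code units.
        # Note: on narrow Python 2 builds this also flags legitimate
        # astral characters (stored as surrogate pairs); a stricter
        # version would verify pairing, as in the sketch further down.
        for key, value in raw_crash.items():
            if isinstance(value, unicode) and SURROGATE_RE.search(value):
                logger.warning('malformed unicode in %r; value replaced', key)
                raw_crash[key] = u'[malformed data]'
        return raw_crash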
selena, would you please add an attachment to this bug with the offending data? I need the test case to make sure that it is validating correctly.
(In reply to K Lars Lohn [:lars] [:klohn] from comment #1)
> selena, would you please add an attachment to this bug with the offending
> data? I need the test case to make sure that it is validating correctly.

From: https://github.com/selenamarie/socorro/commit/669f17b6915d857511ccd3248b70e70eff0a2eb4

> 'badstuff': u'\udc02',

The explanation from Andrew Dunstan is as follows:

> Unicode characters in the range U+d800 .. U+dfff are surrogate pair
> values, used for encoding values above the Unicode basic multilingual
> plane. They MUST be used in pairs, the first in the range U+d800 ..
> U+dbff and the second in the range U+dc00 .. U+dfff.
> So the error you're getting is 100% correct, as your JSON does contain
> sequences that don't obey these rules. The error message seems to have
> hit the wrong spot, but it is 100% correct apart from that; it's your
> JSON that's at fault.

Here are the characters surrounding this bogus character:

> \uf602\ufe02\ufe02\udc02\u0202\udc03\u0202
My point in including more than just that single character is that surrogate characters come in *pairs*, so we need to test for the pairing issue Andrew described rather than just the single character that I put into the failing test in my PR.
I'm not sure that I know how to detect the problem. Python seems to think that the string of bogosity is just fine:

>>> s = u"\uf602\ufe02\ufe02\udc02\u0202\udc03\u0202"
>>> s
u'\uf602\ufe02\ufe02\udc02\u0202\udc03\u0202'
>>> print s
︂︂�Ȃ�Ȃ
>>> a = s.encode('utf-8')
>>> a
'\xef\x98\x82\xef\xb8\x82\xef\xb8\x82\xed\xb0\x82\xc8\x82\xed\xb0\x83\xc8\x82'
>>> print a
︂︂�Ȃ�Ȃ
>>> s2 = a.decode('utf-8')
>>> s2
u'\uf602\ufe02\ufe02\udc02\u0202\udc03\u0202'
>>> print s2
︂︂�Ȃ�Ȃ

Got any idea how to figure out if the string is bad?
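One possible answer (an illustrative sketch, not what Socorro ultimately shipped): Python 2's UTF-8 codec is lenient and happily round-trips lone surrogates, which is exactly what the transcript above shows, so the check has to look for the surrogate ranges explicitly and honor the pairing rule Andrew described. The helper name here is hypothetical:

    def find_lone_surrogates(s):
        # Yield the index of each surrogate code unit that is not part of
        # a valid pair (high: U+D800-U+DBFF, low: U+DC00-U+DFFF).
        i = 0
        while i < len(s):
            ch = s[i]
            if u'\ud800' <= ch <= u'\udbff':
                # High surrogate: valid only if a low surrogate follows.
                if i + 1 < len(s) and u'\udc00' <= s[i + 1] <= u'\udfff':
                    i += 2
                    continue
                yield i
            elif u'\udc00' <= ch <= u'\udfff':
                # Stray low surrogate with no preceding high surrogate.
                yield i
            i += 1

Applied to the string above, this flags indexes 3 and 5, i.e. the lone u'\udc02' and u'\udc03'.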
I don't know how to determine if the string is bad in Python. In Postgres, this will do it:

> SELECT json_object_field_text('{"bad_data": "\uf602\ufe02\ufe02\udc02\u0202\udc03\u0202"}', 'bad_data');
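For contrast, Python's json module is just as lenient as its UTF-8 codec (a quick check, same Python 2 as above), which is why data like this sails through the collector and only blows up once it reaches Postgres:

    >>> import json
    >>> json.loads('{"bad_data": "\\udc02"}')
    {u'bad_data': u'\udc02'}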
I can't tell if this is relevant anymore. I'm pretty sure we're not saving crashes to Postgres anymore, so this might be moot?
We just did a rewrite of the collector. The old collector removed these values, and the new one (Antenna) does, too. I think we're good here regardless of whether we're saving things to Postgres anymore.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED