Bug 1420363 (Open) - Opened 2 years ago, updated 12 days ago

Write out the .extra file in JSON format

Categories

(Toolkit :: Crash Reporting, enhancement)


Tracking


ASSIGNED

People

(Reporter: gsvelto, Assigned: gsvelto)

References

(Blocks 4 open bugs)

Details

Attachments

(10 files, 2 obsolete files)

As per bug 1395507 comment 1, .extra files are currently in an unspecified, INI-ish format. Turning them into JSON presents certain challenges but would have the important benefit of letting us throw away all the custom code for parsing them, escaping line breaks, embedded JSON entries, and so on. The biggest downsides are that if we get interrupted while "appending" to the file it might end up broken, and we'll need to update the tooling too.
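To make the difference concrete, here's a rough sketch (in Python, with made-up annotation names and a deliberately naive parser that ignores the real escaping rules) of converting the legacy "Key=Value" format into a single JSON object:

```python
import json

# Hypothetical legacy-style .extra contents: one "Key=Value" pair per line,
# with newlines in values escaped as "\n". The real escaping rules are more
# involved; this only illustrates the shape of the conversion.
legacy_extra = (
    "ProductName=Firefox\n"
    "ReleaseChannel=nightly\n"
    "Notes=line one\\nline two\n"
)

def legacy_to_json(text):
    annotations = {}
    for line in text.splitlines():
        key, _, value = line.partition("=")
        # Un-escape the line breaks that the INI-ish format had to encode.
        annotations[key] = value.replace("\\n", "\n")
    return json.dumps(annotations)

print(legacy_to_json(legacy_extra))
```

With JSON, the escaping is handled by the serializer itself instead of custom code scattered around the tree.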

I've started thinking about how to do this and there's an issue that's been nagging me: we currently write out the entire .extra file every time we add, modify or remove a crash annotation. Since this generates a fair amount of write traffic and a handful of syscalls it has performance implications, though they're usually in the noise. If we keep the same approach once we turn the .extra file into JSON we'll also be serializing a bunch of annotations to JSON on every change, which might further increase the cost of the operation.

What if we only wrote out the .extra file in the exception handler? For content processes this is already happening since crash annotations are sent via IPC so the change would be needed mostly for the main process. We won't be able to allocate memory so we'll need our own little specialized writer that would only use a fixed buffer + syscalls but apart from that I don't see any particular issues.
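The "fixed buffer + syscalls" idea can be sketched like this (in Python for brevity - the real writer would be C++ and async-signal-safe, which Python can't actually guarantee; the class name and buffer size here are made up):

```python
import os
import tempfile

# Loose sketch of a writer that never allocates after construction: bytes are
# staged in a preallocated buffer and flushed with raw os.write() syscalls.
class FixedBufferWriter:
    def __init__(self, fd, size=512):
        self.fd = fd
        self.buf = bytearray(size)  # allocated once, up front
        self.used = 0

    def write(self, data):
        for b in data:
            if self.used == len(self.buf):
                self.flush()
            self.buf[self.used] = b
            self.used += 1

    def flush(self):
        os.write(self.fd, self.buf[:self.used])
        self.used = 0

fd, path = tempfile.mkstemp()
w = FixedBufferWriter(fd)
w.write(b'{"ProductName":"Firefox"}')
w.flush()
os.close(fd)
print(open(path, "rb").read())
```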

The only other negative thing I can think of is that if the crashing code overwrites some of the in-memory annotations we'll end up writing out garbage. I'm not too worried about this though, because I've already seen it happen with the current model, so it's not like we have a bulletproof solution today. Last but not least, once we move the whole thing OOP this will become a non-issue.

Ted can you think of any other downsides to this approach?

Blocks: 1416078
Flags: needinfo?(ted)

I think this is probably fine. I don't recall why I originally implemented the "serialize everything when we change an annotation" model, probably because I wasn't sure about the safety of something in the exception handler? If we can safely iterate over annotations there, and safely serialize to JSON, that should be fine.

Flags: needinfo?(ted)
Depends on: 1533732
Depends on: 1547698
Blocks: 1553536

Taking this since we decided this is the best approach, here's a quick rundown of what needs to be done:

  • Modify the main process exception handler to write out JSON using a combination of the exception-safe PlatformWriter and the JSONWriter
  • Bump the event file format version and adapt the CrashManager to parse it properly
  • Modify the crashreporter client to parse, modify and submit the JSON payload
  • Modify content crash report generation to also write out JSON
  • Modify the CrashService to properly parse the new JSON payload
  • Rip out the now useless code to escape/un-escape/re-escape JSON fields and the pseudo-INI-parser used to read the .extra files
Assignee: nobody → gsvelto
Status: NEW → ASSIGNED
Summary: Evaluate writing out the .extra file in JSON format → Write out the .extra file in JSON format

Aaron I'm curious why the DLL blocklist uses the gWriteFn function pointer [1] instead of calling InternalWriteNotes directly to write out its crash annotations? I'm rewriting this code and wanted to be sure I won't break stuff if I refactor it.

Flags: needinfo?(aklotz)

(In reply to Gabriele Svelto [:gsvelto] from comment #3)

  • Modify the crashreporter client to parse, modify and submit the JSON payload
  • Modify content crash report generation to also write out JSON

Am I right in thinking that, with an eye on bug 1553536, we'd need to read, modify, and rewrite the JSON data in question to do any appending once the file has been written once?

Flags: needinfo?(gsvelto)

(In reply to :Gijs (he/him) from comment #5)

Am I right in thinking that, with an eye on bug 1553536, we'd need to read, modify, and rewrite the JSON data in question to do any appending once the file has been written once?

Yes. The minidump-analyzer is already using jsoncpp so reading, modifying & writing back the file should be trivial. This has the added advantage that we don't care about escaping which has been a never-ending source of problems when mixing the newline-separated annotations and JSON containing newlines.
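The read/modify/write-back cycle looks roughly like this (sketched in Python; the minidump-analyzer would do the equivalent with jsoncpp, and the file name and annotation keys here are invented for the example):

```python
import json
import os
import tempfile

# A made-up .extra file path for the example.
path = os.path.join(tempfile.mkdtemp(), "deadbeef.extra")

# The exception handler has already written the initial annotations.
with open(path, "w") as f:
    json.dump({"ProductName": "Firefox"}, f)

# "Appending" with JSON means: parse the whole file, modify the object in
# memory, and rewrite it - there is no in-place append.
with open(path) as f:
    annotations = json.load(f)
annotations["StackTraces"] = {"status": "OK"}
with open(path, "w") as f:
    json.dump(annotations, f)

with open(path) as f:
    print(json.load(f))
```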

Flags: needinfo?(gsvelto)

(In reply to Gabriele Svelto [:gsvelto] from comment #4)

Aaron I'm curious why the DLL blocklist uses the gWriteFn function pointer [1] instead of calling InternalWriteNotes directly to write out its crash annotations? I'm rewriting this code and wanted to be sure I won't break stuff if I refactor it.

The legacy DLL blocklist in mozglue.dll is not used in the parent process when the launcher process is enabled. In that case, the code accessing the blocklist notes resides in firefox.exe. We needed to redirect that function call depending on whether we are using legacy DLL blocking or the new launcher process DLL blocking.

Flags: needinfo?(aklotz)

(In reply to Aaron Klotz [:aklotz] (PTO May 29 - June 5) from comment #7)

The legacy DLL blocklist in mozglue.dll is not used in the parent process when the launcher process is enabled. In that case, the code accessing the blocklist notes resides in firefox.exe. We needed to redirect that function call depending on whether we are using legacy DLL blocking or the new launcher process DLL blocking.

I see, that's tricky. Right now we add those annotations in the main process' exception handler, which is the most unwieldy place I can think of. I don't know why that is, but I suspect they were put there so that they would be recorded even in early-startup crashes. If there were a better place to add those annotations, I'd be very keen on moving them.

Depends on: 1555917
Depends on: 1555954

This is a WIP vertical slice of the functionality. It changes the exception handler to write out the .extra file in JSON format, then updates the minidump-analyzer and crashreporter client to deal with it. The CrashManager doesn't support the JSON-based .extra files yet, and neither do the content crash handlers. Linux is also the only platform this is enabled on; Windows and macOS will require some platform-dependent code.

With this patch applied I can send JSON-ified crash reports to Socorro but they are discarded so there must be something wrong in the format or in the POST command.

Attached file Crash reporter HTTP POST payload (obsolete) —

Will, I tried to send my first JSON-infused crash report but Socorro rejected it with a "Discarded=1" response. I've attached the contents of the HTTP POST message and everything looks fine but maybe I'm missing something.

Flags: needinfo?(willkg)

I'm guessing it's getting throttled since it's not nightly or beta. Try switching to ReleaseChannel nightly and see if that works.

Flags: needinfo?(willkg)

I gave it a spin but it didn't help :-( Is there a way to get a more detailed response from Socorro in these cases? I've tried to launch an instance locally this afternoon but I wasn't very successful.

I pulled out the extra.json part of the payload and saved that locally, then I ran Antenna in one terminal and in another terminal, did this:

python -m testlib.mini_poster --url http://localhost:8000/submit --raw_crash extra.json --use-json

That works. It's possible it's a bad test, but I'm not sure what would be wrong about it.

One thing I noticed about the attachment is that it suggests it's posting to / and not /submit. If that's true, try /submit and see if that works. If not, ping me on IRC and we'll work through it.

So I tried sending the same JSON blob and minidump with testlib.mini_poster and with the crash reporter client: the former is accepted while the latter is not. If I intercept the HTTP POSTs with nc they are almost identical. The only difference is that testlib.mini_poster rearranges the fields in the JSON payload and adds some whitespace, but there's absolutely no difference in the content. The URL we send the POST to is correct; it's the same both with and without my patch applied. I'll try with content crashes - which aren't supported yet - and see what happens there. Maybe I'm doing something wrong, but if I am I'm pretty sure it's either something very simple or very obscure.
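For reference, the overall shape of the payload being discussed - a multipart/form-data POST with exactly one JSON part carrying all the annotations plus the minidump - can be sketched with the stdlib. The part names and values here are illustrative, not lifted from the actual crashreporter client:

```python
import json
import uuid

def build_payload(annotations, minidump):
    # Hand-rolled multipart/form-data body: one application/json part for the
    # annotations, one binary part for the minidump, nothing else.
    boundary = uuid.uuid4().hex
    parts = [
        b"--" + boundary.encode() + b"\r\n"
        b'Content-Disposition: form-data; name="extra"; filename="extra.json"\r\n'
        b"Content-Type: application/json\r\n\r\n"
        + json.dumps(annotations).encode() + b"\r\n",
        b"--" + boundary.encode() + b"\r\n"
        b'Content-Disposition: form-data; name="upload_file_minidump"; '
        b'filename="crash.dmp"\r\n'
        b"Content-Type: application/octet-stream\r\n\r\n"
        + minidump + b"\r\n",
    ]
    body = b"".join(parts) + b"--" + boundary.encode() + b"--\r\n"
    content_type = "multipart/form-data; boundary=" + boundary
    return content_type, body

ctype, body = build_payload({"ProductName": "Firefox"}, b"MDMP...")
print(ctype)
```

A collector that sees any form entry beyond these two parts would have grounds to reject the report.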

The collector will now return "Discarded=REASON" for rejections. That should help figure out what's going on.

Thanks, I'll try it again today. In the meantime I have a working theory of why this might happen - and I should be able to test it. I haven't yet tried to change the user agent when testing with curl because I thought it didn't matter. But then it occurred to me that maybe something in between the client and the collector is rewriting the contents of the message if it sees it coming from Breakpad. Anyway, this is easy enough to test so I'll give it a spin today.

The reason why the report is discarded is Discarded=has_json_and_kv which seems odd because when I intercepted the message with nc there was just the minidump entry and the JSON blob. I will try tweaking the user agent since that's the only thing I haven't tried while investigating this.

So it seems that when the POST message is parsed here:

https://github.com/mozilla-services/antenna/blob/master/antenna/breakpad_resource.py#L288

The resulting fs object contains a field that is neither the minidump nor the JSON blob.

I think there are a couple of possibilities:

  1. Another annotation field is getting in there sometimes somehow. I haven't read your code changes, so I have no idea how possible this is.
  2. There's a bug in the Antenna code that iterates over fields in the resulting FieldStorage where FieldStorage sometimes spits out a nothing or something like that and the Antenna loop handles that poorly.
  3. cgi.FieldStorage has a bug where it thinks there's another field when there isn't.

If there's a way we can reproduce this in a local dev environment, that'd be super handy. If you can reproduce it, it'd be great if you could add a print line in that loop that says what's in each field it parses. Maybe something like this:

print(item_name, fs.item_type)

I wrote a comment that sometimes the field name is None and that is super fishy. I wonder if that's what we're seeing.
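To show what a well-formed payload should parse into on the server side, here's a sketch using the stdlib email parser rather than cgi.FieldStorage (which is what Antenna actually uses) - a spurious part would show up as an extra or name-less entry in this list. The boundary and field names are made up for the example:

```python
import email
import email.policy

# A minimal two-part crash submission: the JSON annotations and a minidump.
raw = (
    b"Content-Type: multipart/form-data; boundary=XBOUND\r\n\r\n"
    b"--XBOUND\r\n"
    b'Content-Disposition: form-data; name="extra"; filename="extra.json"\r\n'
    b"Content-Type: application/json\r\n\r\n"
    b'{"ProductName": "Firefox"}\r\n'
    b"--XBOUND\r\n"
    b'Content-Disposition: form-data; name="upload_file_minidump"\r\n\r\n'
    b"MDMP...\r\n"
    b"--XBOUND--\r\n"
)
msg = email.message_from_bytes(raw, policy=email.policy.HTTP)
# Collect the form field name of every part; anything besides the two
# expected names would be the kind of spurious field being discussed.
names = [part.get_param("name", header="content-disposition")
         for part in msg.iter_parts()]
print(names)
```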

The cgi.FieldStorage code is a really old part of the Python stdlib; it has issues and might get removed. We might be seeing one of those issues here. I could rework the code to not use it, but that's a non-trivial project. I'm pretty sure I have a bug for it somewhere.

Let me know if you can reproduce it and if you can what the output of that print line is. That'll illuminate us as to whether this is a bug I should fix in Antenna or not.

Locally it's not happening: the docker-launched antenna instance processes the HTTP POST just fine and there are no spurious fields showing up. What if - as a temporary workaround - you changed the code to ignore any other form entries once a JSON one is found? I know it's not pretty to silently ignore potentially malformed messages, but since we don't know where the extra field is coming from it might be the only solution for now. I will have tests in mozilla-central that parse the HTTP POST and ensure it only contains the minidump file and the JSON payload, so we will enforce that assumption on the Gecko side at least.

Mmm... I'll write up a bug for this. Sure seems like this is a bug in Antenna. Probably easiest to treat it as such until we know it's definitely not.

Antenna (collector) should be fixed now. I verified it with the test script I've been using. I think your test should work now. If not, let me know and we'll tear it all apart to find the bugs!

This is the first content crash successfully submitted with a JSON payload:

https://crash-stats.mozilla.org/report/index/ffa37399-dc2f-477b-89bc-167a00190628

And this is the first main process crash:

https://crash-stats.mozilla.org/report/index/030241a7-c8b0-43bf-8278-da9050190628

Things are looking good but I have to modify our fake crash server to accommodate the new reports before I can fix all the tests.

I also have to test crash reports with multiple minidumps attached but looking at antenna's code I believe those should work out-of-the-box.

I've managed to add Windows and macOS support and submitted crashes for both. I'll be adding GeckoView support next week then the patch-set should be ready for review. Testing will take a while because I have a huge matrix of manual tests to run to ensure every platform is behaving correctly.

While adjusting the GeckoView/Fennec crash reporting machinery I noticed that when submitting crash reports we usually send a field called minidumpSuccess that indicates whether a minidump was obtained successfully for the crash. However, I haven't found any mention of that field in antenna or breakpad; is it still used or can I just drop it?

^^^

Flags: needinfo?(willkg)

It looks like Antenna treats it like other fields and it gets added to the raw crash. For example, bp-41dc7dbe-9355-4a71-aa3e-2ff0c0190722 has:

"minidumpSuccess": "True"

in the raw crash data.

It's not searchable and I'd never heard of it before. Nothing in Socorro uses it and I've never heard of anything else using it, either. I don't know if it's helpful for crash reporting diagnostics or not.

Flags: needinfo?(willkg)

Thanks Will, I'll file a separate bug to remove that field from GeckoView's crash reporter since we don't use it. As for the rest, I'm still ironing out issues in the patch set; it should be ready for review soon.

Depends on: 1570789

This is taking far, far longer than I had anticipated. I just cleared up a problem with GeckoView and I need to green up a couple more plugin hang tests before it's ready for review.

Depends on D46852

Depends on D46853

Depends on D46854

Depends on D46855

Depends on D46856

Attachment #9070249 - Attachment is obsolete: true
Attachment #9070283 - Attachment is obsolete: true

Note to the reviewers: I split the patch into reasonable chunks, trying to group affected functionality as much as possible. All the patches need to be applied together for this to work, though. While the changes are largely mechanical - we're only switching the format of the .extra files - there are areas of the code that have undergone deeper changes for two main reasons:

  • Certain synchronous code paths are now asynchronous
  • In some cases the new JSON format means stricter requirements and thus tighter checks

I find the latter a very nice improvement as this code has been notoriously brittle and on more than one occasion has worked by chance.

Cleaning up the DLL blocklist annotation turned out to be a can of worms and it's taking far longer than anticipated. However, given that this paves the way for the future rewrite I think it's worth the effort to clean up that part before proceeding further.

I've finally updated all the patches, addressing the review comments. As for the DLL blocklist, I've exposed the abstract annotation writer so that it can be used directly without leaking the implementation. Since writing the list would still require some sort of "streaming" access I'm now writing it to a fixed-length string. The compromise is that if the list is too long it will be truncated, but I guess we can live with that.

I've also switched the annotation writers from using NUL-terminated strings to optionally accepting a length. This is required to allow NUL characters in the annotations which - while unlikely - are possible and we don't want to miss some crash reports just because of that.

Finally, I've rolled my own very simple JSON-writing logic because JSONWriter uses allocations to escape characters, which wouldn't work in the exception handler.

With these changes in place, no matter what ends up in the annotations, they should be written out into valid JSON.
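The escaping behavior such a writer needs can be sketched like this (in Python; the real code is C++ and does the expansion without dynamic allocation - this only shows which characters have to be rewritten, including embedded NUL bytes):

```python
# Characters JSON strings can't carry verbatim, mapped to their short escapes.
_ESCAPES = {'"': '\\"', "\\": "\\\\", "\n": "\\n", "\r": "\\r",
            "\t": "\\t", "\b": "\\b", "\f": "\\f"}

def escape_json_string(s):
    out = []
    for ch in s:
        if ch in _ESCAPES:
            out.append(_ESCAPES[ch])
        elif ord(ch) < 0x20:  # remaining control characters, including NUL
            out.append("\\u%04x" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)

print(escape_json_string('line1\nwith "quotes"'))
```

Round-tripping the escaped string through a JSON parser recovers the original annotation value exactly, NULs included.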
