Closed Bug 1420363 Opened 3 years ago Closed 7 months ago

Write out the .extra file in JSON format

Categories

(Toolkit :: Crash Reporting, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla73
Tracking Status
firefox73 --- fixed

People

(Reporter: gsvelto, Assigned: gsvelto)

References

(Blocks 2 open bugs)

Details

Attachments

(1 file, 11 obsolete files)

47 bytes, text/x-phabricator-request
Details | Review
As per bug 1395507 comment 1, .extra files are currently in an unspecified, INI-ish format. Turning them into JSON presents certain challenges but would have the important benefit of throwing away all the custom code for parsing them, escaping line-breaks, JSON entries, etc... The biggest downsides are that if we get interrupted while "appending" to the file it might end up being broken and we'll need to update the tooling too.

I've started thinking how to do this and there's an issue that's been nagging me: we currently fully write out the .extra file every time we add, modify or remove a crash annotation. Since this generates a lot of write traffic and a handful of syscalls it has performance implications but it's usually in the noise. If we use the same approach once we turn the .extra file into JSON we'll also be serializing a bunch of annotations into JSON which might further increase the performance cost of annotating this operation.

What if we only wrote out the .extra file in the exception handler? For content processes this is already happening since crash annotations are sent via IPC so the change would be needed mostly for the main process. We won't be able to allocate memory so we'll need our own little specialized writer that would only use a fixed buffer + syscalls but apart from that I don't see any particular issues.

The only other negative thing I can think of is if the crashing code overwrites some of the in-memory annotations we'll end up writing out garbage. I'm not too worried about this too because I've already seen it happen with the current model so it's not like we have a bulletproof solution. Last but not least once we move the whole thing OOP this will become a non-issue.

Ted can you think of any other downsides to this approach?

Blocks: 1416078
Flags: needinfo?(ted)

I think this is probably fine. I don't recall why I originally implemented the "serialize everything when we change an annotation" model, probably because I wasn't sure about the safety of something in the exception handler? If we can safely iterate over annotations there, and safely serialize to JSON, that should be fine.

Flags: needinfo?(ted)
Depends on: 1533732
Depends on: 1547698
Blocks: 1553536

Taking this since we decided this is the best approach, here's a quick rundown of what needs to be done:

  • Modify the main process exception handler to write out JSON using a combination of the exception-safe PlatformWriter and the JSONWriter
  • Bump the event file format version and adapt the CrashManager to parse it properly
  • Modify the crashreporter client to parse, modify and submit the JSON payload
  • Modify content crash report generation to also write out JSON
  • Modify the CrashService to properly parse the new JSON payload
  • Rip out the now useless code to escape/un-escape/re-escape JSON fields and the pseudo-INI-parser used to read the .extra files
Assignee: nobody → gsvelto
Status: NEW → ASSIGNED
Summary: Evaluate writing out the .extra file in JSON format → Write out the .extra file in JSON format

Aaron I'm curious why the DLL blocklist uses the gWriteFn function pointer [1] instead of calling InternalWriteNotes directly to write out its crash annotations? I'm rewriting this code and wanted to be sure I won't break stuff if I refactor it.

Flags: needinfo?(aklotz)

(In reply to Gabriele Svelto [:gsvelto] from comment #3)

  • Modify the crashreporter client to parse, modify and submit the JSON payload
  • Modify content crash report generation to also write out JSON

Am I right in thinking that, with an eye on bug 1553536, we'd need to read, modify, and rewrite the JSON data in question to do any appending once the file has been written once?

Flags: needinfo?(gsvelto)

(In reply to :Gijs (he/him) from comment #5)

Am I right in thinking that, with an eye on bug 1553536, we'd need to read, modify, and rewrite the JSON data in question to do any appending once the file has been written once?

Yes. The minidump-analyzer is already using jsoncpp so reading, modifying & writing back the file should be trivial. This has the added advantage that we don't care about escaping which has been a never-ending source of problems when mixing the newline-separated annotations and JSON containing newlines.

Flags: needinfo?(gsvelto)

(In reply to Gabriele Svelto [:gsvelto] from comment #4)

Aaron I'm curious why the DLL blocklist uses the gWriteFn function pointer [1] instead of calling InternalWriteNotes directly to write out its crash annotations? I'm rewriting this code and wanted to be sure I won't break stuff if I refactor it.

The legacy DLL blocklist in mozglue.dll is not used in the parent process when the launcher process is enabled. In that case, the code accessing the blocklist notes resides in firefox.exe. We needed to redirect that function call depending on whether we are using legacy DLL blocking or the new launcher process DLL blocking.

Flags: needinfo?(aklotz)

(In reply to Aaron Klotz [:aklotz] (PTO May 29 - June 5) from comment #7)

The legacy DLL blocklist in mozglue.dll is not used in the parent process when the launcher process is enabled. In that case, the code accessing the blocklist notes resides in firefox.exe. We needed to redirect that function call depending on whether we are using legacy DLL blocking or the new launcher process DLL blocking.

I see, that's tricky. Right now we add those annotations in the main process' exception handler which is the most unwieldy place I can think of. I don't know why that is but I suspect it was put in there so that the annotation would be recorded even in early startup crashes. If there were a better place where to move adding those annotations to I'd be very keen on doing it.

Depends on: 1555917
Depends on: 1555954

This is a WIP vertical-slice of the functionality. It changes the exception handler to write out the .extra file in JSON format then updates the minidump-analyzer and crashreporter client to deal with it. The CrashManager doesn't support the .extra-based JSON yet nor the content crash handlers. Linux is also the only platform this is enabled on, Windows and macOS will require some platform-dependent code.

With this patch applied I can send JSON-ified crash reports to Socorro but they are discarded so there must be something wrong in the format or in the POST command.

Attached file Crash reporter HTTP POST payload (obsolete) —

Will, I tried to send my first JSON-infused crash report but Socorro rejected it with a "Discarded=1" response. I've attached the contents of the HTTP POST message and everything looks fine but maybe I'm missing something.

Flags: needinfo?(willkg)

I'm guessing it's getting throttled since it's not nightly or beta. Try switching to ReleaseChannel nightly and see if that works.

Flags: needinfo?(willkg)

I gave it a spin but it didn't help :-( Is there a way to get a more detailed response from Socorro in these cases? I've tried to launch an instance locally this afternoon but I wasn't very successful.

I pulled out the extra.json part of the payload and saved that locally, then I ran Antenna in one terminal and in another terminal, did this:

python -m testlib.mini_poster --url http://localhost:8000/submit --raw_crash extra.json --use-json

That works. It's possible it's a bad test, but i'm not sure what would be wrong about it.

One thing I noticed about the attachment is that it suggests it's posting to / and not /submit. If that's true, try /submit and see if that works. If not, ping me on IRC and we'll work through it.

So I tried sending the same JSON blob and minidump with testlib.mini_poster and the crash reporter client and the former is accepted while the latter is not. If I intercept the HTTP POST with nc they are almost identical. The only difference is that testlib.mini_poster rearrange the fields in the JSON payloads and adds some space but there's absolutely no difference in the content. The URL we send the POST too is correct, it's the same both w/ and w/o my patch applied. I'll try with content crashes - which I aren't supported yet - and see what happens there. Maybe I'm doing something wrong, but if I am I'm pretty sure it's either something very simple or very obscure.

The collector will now return "Discarded=REASON" for rejections. That should help figure out what's going on.

Thanks, I'll try it again today. In the meantime I have a working theory of why this might happen - and I should be able to test it. I haven't yet tried to change the user agent when testing with curl because I thought it didn't matter. But then it occurred to me that maybe something in between the client and the collector is rewriting the contents of the message if it sees it coming from Breakpad. Anyway, this is easy enough to test so I'll give it a spin today.

The reason why the report is discarded is Discarded=has_json_and_kv which seems odd because when I intercepted the message with nc there was just the minidump entry and the JSON blob. I will try tweaking the user agent since that's the only thing I haven't tried while investigating this.

So it seems that when the POST message is parsed here:

https://github.com/mozilla-services/antenna/blob/master/antenna/breakpad_resource.py#L288

The resulting fs object contains a field that is neither the minidump nor the JSON blob.

I think there are a couple of possibilities:

  1. Another annotation field is getting in there sometimes somehow. I haven't read your code changes, so I have no idea how possible this is.
  2. There's a bug in the Antenna code that iterates over fields in the resulting FieldStorage where FieldStorage sometimes spits out a nothing or something like that and the Antenna loop handles that poorly.
  3. cgi.FieldStorage has a bug where it thinks there's another field when there isn't.

If there's a way we can reproduce this in a local dev environment, that'd be super handy. If you can reproduce it, it'd be great if you could add a print line in that loop that says what's in each field it parses. Maybe something like this:

print(item_name, fs.item_type)

I wrote a comment that sometimes the field name is None and that is super fishy. I wonder if that's what we're seeing.

The cgi.FieldStorage code is really old part of Python stdlib and might get removed and has issues. We might be seeing an issue with that here. I could rework the code to remove that, but that's a non-trivial project. Pretty sure I have a bug for it somewhere.

Let me know if you can reproduce it and if you can what the output of that print line is. That'll illuminate us as to whether this is a bug I should fix in Antenna or not.

Locally it's not happening, the docker-launched antenna processes the HTTP POST just fine and there's no spurious fields showing up. What if - as a temporary workaround - you could change the code to ignore any other form entries once a JSON one is found? I know it's not pretty to silently ignore potentially malformed messages but since we don't know where it's coming from it might be the only solution for now. I will have tests in mozilla-central that parse the HTTP POST and ensure it only contains the minidump file and JSON payload so we will enforce that assumption on the gecko side at least.

Mmm... I'll write up a bug for this. Sure seems like this is a bug in Antenna. Probably easiest to treat it as such until we know it's definitely not.

Antenna (collector) should be fixed now. I verified it with the test script I've been using. I think your test should work now. If not, let me know and we'll tear it all apart to find the bugs!

This is the first content crash successfully submitted with a JSON payload:

https://crash-stats.mozilla.org/report/index/ffa37399-dc2f-477b-89bc-167a00190628

And this is the first main process crash:

https://crash-stats.mozilla.org/report/index/030241a7-c8b0-43bf-8278-da9050190628

Things are looking good but I have to modify our fake crash server to accommodate for the new reports before I can fix all the tests.

I also have to test crash reports with multiple minidumps attached but looking at antenna's code I believe those should work out-of-the-box.

I've managed to add Windows and macOS support and submitted crashes for both. I'll be adding GeckoView support next week then the patch-set should be ready for review. Testing will take a while because I have a huge matrix of manual tests to run to ensure every platform is behaving correctly.

While adjusting the GeckoView/Fennec crash reporting machinery I noticed that when submitting crash reports we usually send a filed called minidumpSuccess that indicates whether a minidump was obtained successfully or not for the crash. However I haven't found mention of that field in antenna and breakpad, is it still used or can I just drop it?

^^^

Flags: needinfo?(willkg)

It looks like Antenna treats it like other fields and it gets added to the raw crash. For example, bp-41dc7dbe-9355-4a71-aa3e-2ff0c0190722 has:

"minidumpSuccess": "True"

in the raw crash data.

It's not searchable and I'd never heard of it before. Nothing in Socorro uses it and I've never heard of anything else using it, either. I don't know if it's helpful for crash reporting diagnostics or not.

Flags: needinfo?(willkg)

Thanks Will, I'll file a separate bug to remove that filed from GeckoView's crash reporter since we don't use it. As for the rest I'm still ironing out issues in the patch set. It should be ready for review soon.

Depends on: 1570789

This is taking far, far longer than I had anticipated. I just cleared up a problem with GeckoView and I need to green up a couple more plugin hang tests before it's ready for review.

Depends on D46848

Depends on D46852

Depends on D46853

Depends on D46854

Depends on D46855

Attached file Bug 1420363 - GeckoView support; r=agi (obsolete) —

Depends on D46856

Attachment #9070249 - Attachment is obsolete: true
Attachment #9070283 - Attachment is obsolete: true

Note to the reviewers: I split the patch in reasonable chunks trying to group affected functionality as much as possible. All the patches need to be applied for this to work though. While the changes are largely mechanical as we're only switching the format of the .extra files there's areas of the code that have undergone more changes for two main reasons:

  • Certain synchronous code paths are now asynchronous
  • In some cases the new JSON format means stricter requirements and thus tighter checks

I find the latter a very nice improvement as this code has been notoriously brittle and in more than one occasion has worked out of chance.

Cleaning up the DLL blocklist annotation turned out to be a can of worms and it's taking far longer than anticipated. However, given that this paves the way for the future rewrite I think it's worth the effort to clean up that part before proceeding further.

I've finally updated all the patches, addressing the review comments. As for the DLL blocklist I've exposed the abstract annotation write so that it can be used directly without spilling out the implementation. Since writing the list would still require some sort of "streaming" access I'm now writing it to a fixed-length string. The compromise here is that if it's too long it will be truncated but I guess we can live with that.

I've also switched the annotation writers from using NUL-terminated strings to optionally accepting a length. This is required to allow NUL characters in the annotations which - while unlikely - are possible and we don't want to miss some crash reports just because of that.

Finally I've rolled my own, very simple JSON-writing logic because JSONWriter used allocations to escape characters which wouldn't work in the exception handler.

With these changes in place, no matter what ends up in the annotations, they should be written out into valid JSON.

Thanks for the final reviews Nathan. I've squashed all the patches together, manually re-tested it on Linux and sent it to, try. This is the result:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=605f3eacfa7e43509f3325c77c11593f8a7403a0

Mostly green but there's a few things that still need fixing:

  • mingw doesn't have some of Windows' overloads so I have to fix the blocklist code
  • The GeckoView linter is telling me that the minimum Android API we support is 16 (that's Android 4.1!) and I'm using functions that are too modern. I have to double-check this because this wiki page says that only applies to Fennec, GeckoView is on 18 and Fenix 21.
  • These two failures seem genuine, but I haven't run non-e10s tests in so long it's hard to tell. I'll try and fix them anyway.
  • This macOS failure also seems genuine so I'll look into it
  • I also need to re-run manual tests on Windows, macOS and GeckoView before I can land this

I fixed the remaining API issues in GeckoView and updated the API file since the crash reporter interface is exposed. I've manually tested the change on Linux, Windows and GeckoView, carefully comparing the resulting crashes and crash pings and I've identified and fixed an issue with the crash notes on GeckoView. I still have to test macOS and iron out the non-e10s test failures as well as the one on macOS. If I can sort them out I feel confident landing it tomorrow.

Yet one more rebase. I think that the problem I'm seeing in the macOS crash reporter client stems from the buffer of a C++ string not being retained correctly once I turn it into an NSMutableString and thus causing freed memory to be accessed afterwards. I'll try to fix it by leaking the string... and if it works I swear we'll leak the thrice-damned string and I'll land this. Only 5 months after the first draft.

I've identified the issue in the macOS crashreporter; I was using an autorelease'd string where I should have used a manually allocated one. Apple's documentation was not particularly helpful in letting me figure out this one since it doesn't mention what the various creation/initialization functions do with the object's reference count. Oh well. I've also rebased and adjusted the API changelog for Android. There's still one last test that's failing on Linux and once that fixed we should be good to land.

I've fixed all the remaining issue. I'll abandon the patch-set and upload a squashed patch for review. The changes compared to the original patch are minimal so it should only need rubber-stamping.

Attachment #9094722 - Attachment is obsolete: true
Attachment #9094723 - Attachment is obsolete: true
Attachment #9094724 - Attachment is obsolete: true
Attachment #9094725 - Attachment is obsolete: true
Attachment #9094726 - Attachment is obsolete: true
Attachment #9094727 - Attachment is obsolete: true
Attachment #9094728 - Attachment is obsolete: true
Attachment #9094729 - Attachment is obsolete: true
Attachment #9094730 - Attachment is obsolete: true
Attachment #9094731 - Attachment is obsolete: true
Attachment #9094722 - Attachment description: Bug 1420363 - Switch the DLL blocklist annotations to JSON; r=froydnj → Bug 1420363 - Write crash annotations as JSON
Attachment #9094722 - Attachment is obsolete: false

Agi, can you r+ the squashed patch (attachment 9094722 [details]) so that I can land it? Sorry for the noise but I couldn't land this as a series because all the patches need to be together to work correctly.

Flags: needinfo?(agi)

Done!

Flags: needinfo?(agi)
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1603f5abc56e
Write crash annotations as JSON r=froydnj,agi

Investigating this; there seems to be a problem when dealing with non-ASCII characters in the telemetry environment field which is causing the .extra file for that particular test not to be parsed correctly.

Flags: needinfo?(gsvelto)

I went on a wild-goose chase around UTF-16-to-UTF-8 conversion only to realize that the problem was caused by a promise chain I had replaced with synchronous code but forgetting to remove the promises. I've got a couple more try runs going before I try to land again.

Still can't land: I found a very subtle issue that was already present but this patch made more apparent. In mochitests we wait for the crashes to be fully registered by the CrashManager before parsing the .extra file if the test is expecting crashes. For crashes that might happen in tests that do not expect them we don't wait for them to be processed and we can hit a race.

Depends on: 1598005

(In reply to Gabriele Svelto [:gsvelto] from comment #55)

In mochitests we wait for the crashes to be fully registered by the CrashManager before parsing the .extra file if the test is expecting crashes.

This is interesting. I wonder if we should also use that in those Marionette tests which explicitly check for expected crashes. Do you have some more information about that, or just links? Thanks.

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #56)

This is interesting. I wonder if we should also use that in those Marionette tests which explicitly check for expected crashes. Do you have some more information about that, or just links? Thanks.

It's pretty simple to use, given one or more crash IDs you call CrashManager.ensureCrashIsPresent() and wait on the promises it returns. This is how it's done in SpecialPowersParent.jsm, we gather promises for every crash here:

https://searchfox.org/mozilla-central/rev/652014ca1183c56bc5f04daf01af180d4e50a91c/testing/specialpowers/content/SpecialPowersParent.jsm#776-779

Then we wait for them before touching the .dmp/.extra files:

https://searchfox.org/mozilla-central/rev/652014ca1183c56bc5f04daf01af180d4e50a91c/testing/specialpowers/content/SpecialPowersChild.jsm#711-713

BrowserTestUtils.jsm does something similar:

https://searchfox.org/mozilla-central/rev/652014ca1183c56bc5f04daf01af180d4e50a91c/testing/mochitest/BrowserTestUtils/BrowserTestUtils.jsm#1686-1756

Depends on: 1598258

Great. Thank you! I filed bug 1598274.

Try is nice and green, the tree are (should be?) open so let's try and land this for good.

Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/995ac9a0ab02
Write crash annotations as JSON r=froydnj,agi
Status: ASSIGNED → RESOLVED
Closed: 7 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla73
Regressions: 1280561
Regressions: 1603201
Blocks: 1603499
Regressions: 1627360
You need to log in before you can comment on or make changes to this bug.