Write out the .extra file in JSON format
Categories
(Toolkit :: Crash Reporting, enhancement)
Tracking
()
Tracking | Status | |
---|---|---|
firefox73 | --- | fixed |
People
(Reporter: gsvelto, Assigned: gsvelto)
References
(Blocks 2 open bugs, Regressed 1 open bug)
Details
Attachments
(1 file, 11 obsolete files)
Assignee | ||
Comment 1•6 years ago
|
||
I've started thinking how to do this and there's an issue that's been nagging me: we currently fully write out the .extra file every time we add, modify or remove a crash annotation. Since this generates a lot of write traffic and a handful of syscalls it has performance implications but it's usually in the noise. If we use the same approach once we turn the .extra file into JSON we'll also be serializing a bunch of annotations into JSON which might further increase the performance cost of annotating this operation.
What if we only wrote out the .extra file in the exception handler? For content processes this is already happening since crash annotations are sent via IPC so the change would be needed mostly for the main process. We won't be able to allocate memory so we'll need our own little specialized writer that would only use a fixed buffer + syscalls but apart from that I don't see any particular issues.
The only other negative thing I can think of is if the crashing code overwrites some of the in-memory annotations we'll end up writing out garbage. I'm not too worried about this too because I've already seen it happen with the current model so it's not like we have a bulletproof solution. Last but not least once we move the whole thing OOP this will become a non-issue.
Ted can you think of any other downsides to this approach?
Comment 2•6 years ago
|
||
I think this is probably fine. I don't recall why I originally implemented the "serialize everything when we change an annotation" model, probably because I wasn't sure about the safety of something in the exception handler? If we can safely iterate over annotations there, and safely serialize to JSON, that should be fine.
Assignee | ||
Comment 3•5 years ago
|
||
Taking this since we decided this is the best approach, here's a quick rundown of what needs to be done:
- Modify the main process exception handler to write out JSON using a combination of the exception-safe PlatformWriter and the JSONWriter
- Bump the event file format version and adapt the CrashManager to parse it properly
- Modify the crashreporter client to parse, modify and submit the JSON payload
- Modify content crash report generation to also write out JSON
- Modify the CrashService to properly parse the new JSON payload
- Rip out the now useless code to escape/un-escape/re-escape JSON fields and the pseudo-INI-parser used to read the .extra files
Assignee | ||
Comment 4•5 years ago
|
||
Aaron I'm curious why the DLL blocklist uses the gWriteFn
function pointer [1] instead of calling InternalWriteNotes
directly to write out its crash annotations? I'm rewriting this code and wanted to be sure I won't break stuff if I refactor it.
Comment 5•5 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #3)
- Modify the crashreporter client to parse, modify and submit the JSON payload
- Modify content crash report generation to also write out JSON
Am I right in thinking that, with an eye on bug 1553536, we'd need to read, modify, and rewrite the JSON data in question to do any appending once the file has been written once?
Assignee | ||
Comment 6•5 years ago
|
||
(In reply to :Gijs (he/him) from comment #5)
Am I right in thinking that, with an eye on bug 1553536, we'd need to read, modify, and rewrite the JSON data in question to do any appending once the file has been written once?
Yes. The minidump-analyzer is already using jsoncpp so reading, modifying & writing back the file should be trivial. This has the added advantage that we don't care about escaping which has been a never-ending source of problems when mixing the newline-separated annotations and JSON containing newlines.
Comment 7•5 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #4)
Aaron I'm curious why the DLL blocklist uses the
gWriteFn
function pointer [1] instead of callingInternalWriteNotes
directly to write out its crash annotations? I'm rewriting this code and wanted to be sure I won't break stuff if I refactor it.
The legacy DLL blocklist in mozglue.dll
is not used in the parent process when the launcher process is enabled. In that case, the code accessing the blocklist notes resides in firefox.exe
. We needed to redirect that function call depending on whether we are using legacy DLL blocking or the new launcher process DLL blocking.
Assignee | ||
Comment 8•5 years ago
|
||
(In reply to Aaron Klotz [:aklotz] (PTO May 29 - June 5) from comment #7)
The legacy DLL blocklist in
mozglue.dll
is not used in the parent process when the launcher process is enabled. In that case, the code accessing the blocklist notes resides infirefox.exe
. We needed to redirect that function call depending on whether we are using legacy DLL blocking or the new launcher process DLL blocking.
I see, that's tricky. Right now we add those annotations in the main process' exception handler which is the most unwieldy place I can think of. I don't know why that is but I suspect it was put in there so that the annotation would be recorded even in early startup crashes. If there were a better place where to move adding those annotations to I'd be very keen on doing it.
Assignee | ||
Comment 9•5 years ago
|
||
This is a WIP vertical-slice of the functionality. It changes the exception handler to write out the .extra file in JSON format then updates the minidump-analyzer and crashreporter client to deal with it. The CrashManager doesn't support the .extra-based JSON yet nor the content crash handlers. Linux is also the only platform this is enabled on, Windows and macOS will require some platform-dependent code.
With this patch applied I can send JSON-ified crash reports to Socorro but they are discarded so there must be something wrong in the format or in the POST command.
Assignee | ||
Comment 10•5 years ago
|
||
Will, I tried to send my first JSON-infused crash report but Socorro rejected it with a "Discarded=1" response. I've attached the contents of the HTTP POST message and everything looks fine but maybe I'm missing something.
Comment 11•5 years ago
|
||
I'm guessing it's getting throttled since it's not nightly or beta. Try switching to ReleaseChannel nightly and see if that works.
Assignee | ||
Comment 12•5 years ago
|
||
I gave it a spin but it didn't help :-( Is there a way to get a more detailed response from Socorro in these cases? I've tried to launch an instance locally this afternoon but I wasn't very successful.
Comment 13•5 years ago
|
||
I pulled out the extra.json part of the payload and saved that locally, then I ran Antenna in one terminal and in another terminal, did this:
python -m testlib.mini_poster --url http://localhost:8000/submit --raw_crash extra.json --use-json
That works. It's possible it's a bad test, but i'm not sure what would be wrong about it.
One thing I noticed about the attachment is that it suggests it's posting to / and not /submit. If that's true, try /submit and see if that works. If not, ping me on IRC and we'll work through it.
Assignee | ||
Comment 14•5 years ago
|
||
So I tried sending the same JSON blob and minidump with testlib.mini_poster
and the crash reporter client and the former is accepted while the latter is not. If I intercept the HTTP POST with nc
they are almost identical. The only difference is that testlib.mini_poster
rearrange the fields in the JSON payloads and adds some space but there's absolutely no difference in the content. The URL we send the POST too is correct, it's the same both w/ and w/o my patch applied. I'll try with content crashes - which I aren't supported yet - and see what happens there. Maybe I'm doing something wrong, but if I am I'm pretty sure it's either something very simple or very obscure.
Comment 15•5 years ago
|
||
The collector will now return "Discarded=REASON" for rejections. That should help figure out what's going on.
Assignee | ||
Comment 16•5 years ago
|
||
Thanks, I'll try it again today. In the meantime I have a working theory of why this might happen - and I should be able to test it. I haven't yet tried to change the user agent when testing with curl because I thought it didn't matter. But then it occurred to me that maybe something in between the client and the collector is rewriting the contents of the message if it sees it coming from Breakpad. Anyway, this is easy enough to test so I'll give it a spin today.
Assignee | ||
Comment 17•5 years ago
|
||
The reason why the report is discarded is Discarded=has_json_and_kv
which seems odd because when I intercepted the message with nc
there was just the minidump entry and the JSON blob. I will try tweaking the user agent since that's the only thing I haven't tried while investigating this.
Assignee | ||
Comment 18•5 years ago
|
||
So it seems that when the POST message is parsed here:
https://github.com/mozilla-services/antenna/blob/master/antenna/breakpad_resource.py#L288
The resulting fs
object contains a field that is neither the minidump nor the JSON blob.
Comment 19•5 years ago
|
||
I think there are a couple of possibilities:
- Another annotation field is getting in there sometimes somehow. I haven't read your code changes, so I have no idea how possible this is.
- There's a bug in the Antenna code that iterates over fields in the resulting FieldStorage where FieldStorage sometimes spits out a nothing or something like that and the Antenna loop handles that poorly.
- cgi.FieldStorage has a bug where it thinks there's another field when there isn't.
If there's a way we can reproduce this in a local dev environment, that'd be super handy. If you can reproduce it, it'd be great if you could add a print line in that loop that says what's in each field it parses. Maybe something like this:
print(item_name, fs.item_type)
I wrote a comment that sometimes the field name is None
and that is super fishy. I wonder if that's what we're seeing.
The cgi.FieldStorage code is really old part of Python stdlib and might get removed and has issues. We might be seeing an issue with that here. I could rework the code to remove that, but that's a non-trivial project. Pretty sure I have a bug for it somewhere.
Let me know if you can reproduce it and if you can what the output of that print line is. That'll illuminate us as to whether this is a bug I should fix in Antenna or not.
Assignee | ||
Comment 20•5 years ago
|
||
Locally it's not happening, the docker-launched antenna processes the HTTP POST just fine and there's no spurious fields showing up. What if - as a temporary workaround - you could change the code to ignore any other form entries once a JSON one is found? I know it's not pretty to silently ignore potentially malformed messages but since we don't know where it's coming from it might be the only solution for now. I will have tests in mozilla-central that parse the HTTP POST and ensure it only contains the minidump file and JSON payload so we will enforce that assumption on the gecko side at least.
Comment 21•5 years ago
|
||
Mmm... I'll write up a bug for this. Sure seems like this is a bug in Antenna. Probably easiest to treat it as such until we know it's definitely not.
Comment 22•5 years ago
|
||
Antenna (collector) should be fixed now. I verified it with the test script I've been using. I think your test should work now. If not, let me know and we'll tear it all apart to find the bugs!
Assignee | ||
Comment 23•5 years ago
|
||
This is the first content crash successfully submitted with a JSON payload:
https://crash-stats.mozilla.org/report/index/ffa37399-dc2f-477b-89bc-167a00190628
And this is the first main process crash:
https://crash-stats.mozilla.org/report/index/030241a7-c8b0-43bf-8278-da9050190628
Things are looking good but I have to modify our fake crash server to accommodate for the new reports before I can fix all the tests.
I also have to test crash reports with multiple minidumps attached but looking at antenna's code I believe those should work out-of-the-box.
Assignee | ||
Comment 24•5 years ago
|
||
I've managed to add Windows and macOS support and submitted crashes for both. I'll be adding GeckoView support next week then the patch-set should be ready for review. Testing will take a while because I have a huge matrix of manual tests to run to ensure every platform is behaving correctly.
Assignee | ||
Comment 25•5 years ago
|
||
While adjusting the GeckoView/Fennec crash reporting machinery I noticed that when submitting crash reports we usually send a filed called minidumpSuccess
that indicates whether a minidump was obtained successfully or not for the crash. However I haven't found mention of that field in antenna and breakpad, is it still used or can I just drop it?
Comment 27•5 years ago
|
||
It looks like Antenna treats it like other fields and it gets added to the raw crash. For example, bp-41dc7dbe-9355-4a71-aa3e-2ff0c0190722 has:
"minidumpSuccess": "True"
in the raw crash data.
It's not searchable and I'd never heard of it before. Nothing in Socorro uses it and I've never heard of anything else using it, either. I don't know if it's helpful for crash reporting diagnostics or not.
Assignee | ||
Comment 28•5 years ago
|
||
Thanks Will, I'll file a separate bug to remove that filed from GeckoView's crash reporter since we don't use it. As for the rest I'm still ironing out issues in the patch set. It should be ready for review soon.
Assignee | ||
Comment 29•5 years ago
|
||
This is taking far, far longer than I had anticipated. I just cleared up a problem with GeckoView and I need to green up a couple more plugin hang tests before it's ready for review.
Assignee | ||
Comment 30•5 years ago
|
||
Assignee | ||
Comment 31•5 years ago
|
||
Assignee | ||
Comment 32•5 years ago
|
||
Depends on D46848
Assignee | ||
Comment 33•5 years ago
|
||
Depends on D46849
Assignee | ||
Comment 34•5 years ago
|
||
Depends on D46850
Assignee | ||
Comment 35•5 years ago
|
||
Depends on D46851
Assignee | ||
Comment 36•5 years ago
|
||
Depends on D46852
Assignee | ||
Comment 37•5 years ago
|
||
Depends on D46853
Assignee | ||
Comment 38•5 years ago
|
||
Depends on D46854
Assignee | ||
Comment 39•5 years ago
|
||
Depends on D46855
Assignee | ||
Comment 40•5 years ago
|
||
Depends on D46856
Assignee | ||
Updated•5 years ago
|
Assignee | ||
Updated•5 years ago
|
Assignee | ||
Comment 41•5 years ago
|
||
Note to the reviewers: I split the patch in reasonable chunks trying to group affected functionality as much as possible. All the patches need to be applied for this to work though. While the changes are largely mechanical as we're only switching the format of the .extra files there's areas of the code that have undergone more changes for two main reasons:
- Certain synchronous code paths are now asynchronous
- In some cases the new JSON format means stricter requirements and thus tighter checks
I find the latter a very nice improvement as this code has been notoriously brittle and in more than one occasion has worked out of chance.
Assignee | ||
Comment 42•5 years ago
|
||
Cleaning up the DLL blocklist annotation turned out to be a can of worms and it's taking far longer than anticipated. However, given that this paves the way for the future rewrite I think it's worth the effort to clean up that part before proceeding further.
Assignee | ||
Comment 43•5 years ago
|
||
I've finally updated all the patches, addressing the review comments. As for the DLL blocklist I've exposed the abstract annotation write so that it can be used directly without spilling out the implementation. Since writing the list would still require some sort of "streaming" access I'm now writing it to a fixed-length string. The compromise here is that if it's too long it will be truncated but I guess we can live with that.
I've also switched the annotation writers from using NUL-terminated strings to optionally accepting a length. This is required to allow NUL characters in the annotations which - while unlikely - are possible and we don't want to miss some crash reports just because of that.
Finally I've rolled my own, very simple JSON-writing logic because JSONWriter used allocations to escape characters which wouldn't work in the exception handler.
With these changes in place, no matter what ends up in the annotations, they should be written out into valid JSON.
Assignee | ||
Comment 44•5 years ago
|
||
Thanks for the final reviews Nathan. I've squashed all the patches together, manually re-tested it on Linux and sent it to, try. This is the result:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=605f3eacfa7e43509f3325c77c11593f8a7403a0
Mostly green but there's a few things that still need fixing:
- mingw doesn't have some of Windows' overloads so I have to fix the blocklist code
- The GeckoView linter is telling me that the minimum Android API we support is 16 (that's Android 4.1!) and I'm using functions that are too modern. I have to double-check this because this wiki page says that only applies to Fennec, GeckoView is on 18 and Fenix 21.
- These two failures seem genuine, but I haven't run non-e10s tests in so long it's hard to tell. I'll try and fix them anyway.
- This macOS failure also seems genuine so I'll look into it
- I also need to re-run manual tests on Windows, macOS and GeckoView before I can land this
Assignee | ||
Comment 45•5 years ago
|
||
I fixed the remaining API issues in GeckoView and updated the API file since the crash reporter interface is exposed. I've manually tested the change on Linux, Windows and GeckoView, carefully comparing the resulting crashes and crash pings and I've identified and fixed an issue with the crash notes on GeckoView. I still have to test macOS and iron out the non-e10s test failures as well as the one on macOS. If I can sort them out I feel confident landing it tomorrow.
Assignee | ||
Comment 46•5 years ago
|
||
Yet one more rebase. I think that the problem I'm seeing in the macOS crash reporter client stems from the buffer of a C++ string not being retained correctly once I turn it into an NSMutableString and thus causing freed memory to be accessed afterwards. I'll try to fix it by leaking the string... and if it works I swear we'll leak the thrice-damned string and I'll land this. Only 5 months after the first draft.
Assignee | ||
Comment 47•5 years ago
|
||
I've identified the issue in the macOS crashreporter; I was using an autorelease'd string where I should have used a manually allocated one. Apple's documentation was not particularly helpful in letting me figure out this one since it doesn't mention what the various creation/initialization functions do with the object's reference count. Oh well. I've also rebased and adjusted the API changelog for Android. There's still one last test that's failing on Linux and once that fixed we should be good to land.
Assignee | ||
Comment 48•5 years ago
|
||
I've fixed all the remaining issue. I'll abandon the patch-set and upload a squashed patch for review. The changes compared to the original patch are minimal so it should only need rubber-stamping.
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Assignee | ||
Comment 49•5 years ago
|
||
Agi, can you r+ the squashed patch (attachment 9094722 [details]) so that I can land it? Sorry for the noise but I couldn't land this as a series because all the patches need to be together to work correctly.
Comment 51•5 years ago
|
||
Comment 52•5 years ago
|
||
Backed out for perma fails on test_busy_hang.xul.
Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=276559675&repo=autoland&lineNumber=6589
Backout: https://hg.mozilla.org/integration/autoland/rev/ab0b99b9fbe77e8494816e20472ea593a29b598a
Assignee | ||
Comment 53•5 years ago
|
||
Investigating this; there seems to be a problem when dealing with non-ASCII characters in the telemetry environment field which is causing the .extra file for that particular test not to be parsed correctly.
Assignee | ||
Comment 54•5 years ago
|
||
I went on a wild-goose chase around UTF-16-to-UTF-8 conversion only to realize that the problem was caused by a promise chain I had replaced with synchronous code but forgetting to remove the promises. I've got a couple more try runs going before I try to land again.
Assignee | ||
Comment 55•5 years ago
|
||
Still can't land: I found a very subtle issue that was already present but this patch made more apparent. In mochitests we wait for the crashes to be fully registered by the CrashManager
before parsing the .extra file if the test is expecting crashes. For crashes that might happen in tests that do not expect them we don't wait for them to be processed and we can hit a race.
Comment 56•5 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #55)
In mochitests we wait for the crashes to be fully registered by the
CrashManager
before parsing the .extra file if the test is expecting crashes.
This is interesting. I wonder if we should also use that in those Marionette tests which explicitly check for expected crashes. Do you have some more information about that, or just links? Thanks.
Assignee | ||
Comment 57•5 years ago
|
||
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #56)
This is interesting. I wonder if we should also use that in those Marionette tests which explicitly check for expected crashes. Do you have some more information about that, or just links? Thanks.
It's pretty simple to use, given one or more crash IDs you call CrashManager.ensureCrashIsPresent()
and wait on the promises it returns. This is how it's done in SpecialPowersParent.jsm
, we gather promises for every crash here:
Then we wait for them before touching the .dmp/.extra files:
BrowserTestUtils.jsm
does something similar:
Comment 58•5 years ago
|
||
Great. Thank you! I filed bug 1598274.
Assignee | ||
Comment 59•5 years ago
|
||
Try is nice and green, the tree are (should be?) open so let's try and land this for good.
Comment 60•5 years ago
|
||
Comment 61•5 years ago
|
||
bugherder |
Description
•