Closed Bug 803209 Opened 12 years ago Closed 11 years ago

Create table and stored procedure to track volume of GC crashes over time

Categories

(Socorro :: Webapp, task)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: billm, Assigned: selenamarie)

References

Details

(Whiteboard: [DB Changes][qa-])

Attachments

(1 file, 2 obsolete files)

A while ago we added an annotation to crash dumps to say whether the crash happened during a GC. However, we don't have a way to access that data. I think the minimum for what we'd need would be a report that would tell you, for each version, what percentage of crashes happened during GC. Or maybe it would be better to know how many GC crashes there were per ADU. The latter seems more useful to me, but it seems like Socorro always uses the former, so maybe that's easier to obtain.

Doing a breakdown by version would at least allow us to see the effect on crash rates of, say, incremental GC. However, it would also be nice to see it broken down by version and buildid, since that would allow us to see spikes caused by particular patches.

Finally, it would be great if we could get a report like this mailed out to certain people every week or so.
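To make the difference concrete, the two metrics only differ in the denominator. A throwaway sketch with made-up numbers (not real data):

    # Hypothetical numbers for one build, just to contrast the two metrics.
    total_crashes = 20000      # all crashes reported for the build
    gc_crashes = 30            # crashes that happened during GC
    adu = 2400000              # active daily users for the build

    pct_of_crashes = 100.0 * gc_crashes / total_crashes   # % of crashes during GC
    gc_per_100_adu = 100.0 * gc_crashes / adu              # GC crashes per 100 ADU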
This still would be really useful. Benjamin, are you the only person who could work on this, or are there other people I should ask for help?
(In reply to Bill McCloskey (:billm) from comment #1)
> This still would be really useful. Benjamin, are you the only person who
> could work on this, or are there other people I should ask for help?

I could have a look, although I have a bit of a backlog. :)

Do you happen to know what an annotation for a crash that happens during a GC looks like?
(In reply to Selena Deckelmann :selenamarie :selena from comment #2)
> I could have a look, although I have a bit of a backlog. :)

Awesome! thanks.

> Do you happen to know what an annotation for a crash that happens during a
> GC looks like?

Yes, the .json file includes "IsGarbageCollecting": 1. Crashes that are not doing GC don't have this field at all in the json file.
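So a consumer only needs to test for the presence of that key; a minimal sketch (the file path is just an illustration):

    import json

    # The raw crash metadata is the submitted .json file parsed into a dict.
    with open("raw_crash.json") as f:
        raw_crash = json.load(f)

    # Presence of the key is the signal; crashes outside of GC simply omit it.
    is_gc_crash = "IsGarbageCollecting" in raw_crash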
Changing this to Webapp since what this is asking for is a recurring report. A one-off report would be better than nothing, though, if we could do that quickly.
Component: Data request → Webapp
This is pretty straightforward, and we might be able to do it with ES.
OS: Linux → All
Priority: -- → P3
Hardware: x86_64 → All
Whiteboard: [search]
Thinking about it, this is not a search-related problem. It seems it could easily be added to the signature summary report. Brandon, thoughts?
Priority: P3 → --
Whiteboard: [search]
This will probably require changes through our full stack, but it shouldn't be impossible and will be faster than waiting for ES. Laura approved such a change. Let's get it on the schedule.
As I explained on IRC, I don't think Signature Summary helps in what the JS team wants to get out of this. We already pretty much know if a signature happens mostly in GC or not.
What we want to have is some tracking of the per-build-date development of the sum of all GC crashes (i.e. those with IsGarbageCollecting=1), so that we see when we have a regression, esp. on Nightly, that causes more such crashes to happen. With that, we can track down what changes have been made between "better" and "worse" builds, and potentially track down sources of memory corruption (which are extremely hard, if not impossible, to track down from the actual crash reports).

So, ideally, we'd get some graph of per-build-date total amount of GC crashes for all versions that have per-build-date reporting enabled (i.e. right now Nightly and Aurora).
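Conceptually, the aggregation behind such a graph is just counting the annotated crashes per build id; a rough sketch (raw_crashes here stands in for however the crashes actually get pulled out of storage):

    from collections import Counter

    def gc_crashes_per_build(raw_crashes):
        # raw_crashes: iterable of raw-crash dicts (a placeholder, not the real store).
        counts = Counter()
        for crash in raw_crashes:
            if "IsGarbageCollecting" in crash and "buildid" in crash:
                counts[crash["buildid"]] += 1
        return counts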
lars: need to get this into the DB so we can aggregate on it.  Work with Selena to get that sorted out.
Assignee: nobody → lars
Target Milestone: --- → 38
Depends on: 843788
Depends on: 847939
Target Milestone: 38 → 39
Assignee: lars → sdeckelmann
Target Milestone: 39 → 40
Target Milestone: 40 → Future
Next step for me is to drum up some stored procedures and a report.
:selena, the next step in this process is to get raw crashes flowing from the crashmovers into Postgres so that we can start accessing the GC information.  That'll entail getting a network route from the collector boxes to Postgres in both staging and production.

Of course, we ought to do this in dev first.  Since we've got the new style crashmovers running in both stage and production now, this data flow can be enabled entirely through configuration.  

We need these questions answered and tasks done:

0) fix the cron submitter problem so we can actually test
1) do we need to route to Postgres through PGBouncer?
2) enable the route
3) configure and test the impact in dev
4) configure and test the impact in staging 
5) make it happen in prod
6) of course, then we need a way to actually report on and display this data, but that's a collaboration with our front-end people.

Let's work together on this.
Target Milestone: Future → 46
Depends on: 866960
Release 45 includes the changes to get the JSON into PostgreSQL. That should ship this week.
Depends on: 869221
Hi Bill,

Could you provide an example crash with 'isGarbageCollecting': 1? I can't seem to fetch any crashes with this item in the raw crash JSON in the last week.

-selena
Flags: needinfo?(wmccloskey)
Note that it's studlycaps "IsGarbageCollecting". Looking for a sample using jydoop now.
(In reply to Selena Deckelmann :selenamarie :selena from comment #13)
> Could you provide an example crash with 'isGarbageCollecting': 1? I can't
> seem to fetch any crashes with this item in the raw crash JSON in the last
> week.

I'm not Bill, but here are a few:

bp-aac80d2f-5ec1-430d-a55f-7d5ef2130507
bp-435913f4-62bd-428c-bf47-b43cb2130507
bp-61d5b275-b5a5-49d3-a4e0-ca5412130507
bp-8eaecf65-6482-4fe2-96e5-2babd2130507

(I just looked for two obvious GC crash signatures - in the marking phase - and randomly selected two crashes of each.)
Target Milestone: 46 → 47
Thanks, Selena! I don't understand the format of the file, though. The first half doesn't contain any percentages. Should that be ignored?

Also, could you post the code you used to generate this? I'd like to be able to run it myself, if that's okay.
(In reply to Bill McCloskey (:billm) from comment #18)
> Thanks, Selena! I don't understand the format of the file, though. The first
> half doesn't contain any percentages. Should that be ignored?

I'll figure out what's wrong with the file. I probably just messed up a test on the incoming data from raw_adu.

> Also, could you post the code you used to generate this? I'd like to be able
> to run it myself, if that's okay.

I'll post the code, but you can't run it yourself at this point. 

I'm sorry this isn't easier. What I did was out-of-band; getting it working inside Socorro itself typically takes several weeks because it needs access to ADI/ADU and to raw crashes from HBase. Because of access controls, I ran the map-reduce on sp-admin01 and then ran the SQL query to get ADU on the database server.
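For reference, the final combination step is essentially the following (a sketch only; gc_counts and adu_counts stand in for the map-reduce output and the raw_adu query result, which aren't reproduced here):

    def gc_rate_per_adu(gc_counts, adu_counts):
        # gc_counts and adu_counts are assumed to be dicts keyed by buildid.
        rows = []
        for buildid, crashes in sorted(gc_counts.items()):
            adu = adu_counts.get(buildid)
            if not adu:
                continue  # no ADU reported for this build; skip rather than divide by zero
            rows.append((buildid, 100.0 * crashes / adu, adu, crashes))
        return rows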
Sent updated code and raw_adu details to :billm offline.

Asked whether we should have divided up the reports based on platform.
Re-running reports based on buildid only!
Thanks, Selena. This is really starting to come together. A couple questions:

1. Are all the lines from nightly, or are other channels included? There seems to be huge variation in the number of ADUs. It's okay to include non-nightly builds as long as they're separated somehow.

2. Some of the lines seem a little off, like this one:

20130409194949 136049.127% 2414872 1775

Could you investigate?

Thanks again for doing this. Once we get it nailed down, would it be possible to run it on crashes as far back as we have?
(In reply to Bill McCloskey (:billm) from comment #23)
> Thanks, Selena. This is really starting to come together. A couple questions:
> 
> 1. Are all the lines from nightly, or are other channels included? There
> seems to be huge variation in the number of ADUs. It's okay to include
> non-nightly builds as long as they're separated somehow.
> 
> 2. Some of the lines seem a little off, like this one:
> 
> 20130409194949 136049.127% 2414872 1775
> 
> Could you investigate?

Yes! 

> Thanks again for doing this. Once we get it nailed down, would it be
> possible to run it on crashes as far back as we have?

We have ADU going back to 12-1-2010.

Do you happen to know in which version of Firefox we started reporting IsGarbageCollecting?
(In reply to Bill McCloskey (:billm) from comment #23)
> Thanks, Selena. This is really starting to come together. A couple questions:
> 
> 1. Are all the lines from nightly, or are other channels included? There
> seems to be huge variation in the number of ADUs. It's okay to include
> non-nightly builds as long as they're separated somehow.

I did not differentiate the channels. I will try to do that now.

> 2. Some of the lines seem a little off, like this one:
> 
> 20130409194949 136049.127% 2414872 1775

I had an error in my SQL - I was still grouping by date as well as buildid. I've fixed that and removed some of these weird lines. 

There seems to be only one version left that reports a very small ADU but has a huge number of crashes by comparison:

20130215125331 38100.000% 2667 7
Thanks, Selena! I just had a chance to look at this data closely, and it looks really useful. When I sort by buildid and exclude the builds from before 4/1, there's a very clear pattern. The baseline crash rate is about 0.15% (in crashes per ADU). There are spikes at 4/3-4/4 and 4/20-4/22 and 4/30. I investigated the spike at 4/20 and it was caused by bug 860145, which landed on 4/19 and was backed out on 4/22.

So I think this is working as we might expect, which is fantastic! I do have a few more requests, if you have time:

1. The original bug I was interested in was bug 868369. I forgot that it's actually about beta and not release. So it would be nice to incorporate other channels, as long as the channel is included in the data. I think you could change the find-isgc.py script like so:

    if 'IsGarbageCollecting' in raw_crash and 'buildid' in raw_crash and 'version' in raw_crash:
        context.write('\t'.join((raw_crash['version'], raw_crash['buildid'])), 1)

and then similar changes to the ADU query.

2. Could you run the script on more data? It would be great to have data from, maybe, 2/1 to the current date.
Style note on comment 27: I wouldn't use '\t'.join in the mapper; it's more efficient to use a tuple for the key:

context.write((raw_crash['version'], raw_crash['buildid']), 1)

And the CSV output at the end will also come out right automatically.
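Putting comment 27 and comment 28 together, the whole mapper would look roughly like this (a sketch in the style of the snippets above, not the actual find-isgc.py):

    def map(raw_crash, context):
        # Emit one count per (version, buildid) for crashes annotated as
        # happening during garbage collection; all other crashes are skipped.
        if ('IsGarbageCollecting' in raw_crash
                and 'buildid' in raw_crash
                and 'version' in raw_crash):
            context.write((raw_crash['version'], raw_crash['buildid']), 1)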
Target Milestone: 47 → 49
Summary: Track volume of GC crashes over time → Create table and stored procedure to track volume of GC crashes over time
Blocks: 875084
Target Milestone: 49 → 50
Depends on: 871997
Depends on: 880744
Whiteboard: [DB Changes]
Target Milestone: 50 → 52
Commits pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/459939e3578afcb9cc370178b0b0bdd440a91671
bug 803209 Support for garbage collection reporting

https://github.com/mozilla/socorro/commit/6c0e607b4c2d0ba487370f9b956ac6a1d4e58ba7
bug 803209 - code review fixes

https://github.com/mozilla/socorro/commit/8d8432b9d3230239da606bea815c62609d69ccfe
Merge pull request #1286 from selenamarie/bug803209-garbage-collecting-report

Bug803209 support for garbage collecting report in database
Target Milestone: 52 → 51
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Whiteboard: [DB Changes] → [DB Changes][qa-]
Where can I find a graph or so for total volume of GC crashes over time? IIRC, that was the request here for a report, and I can't find it right now.
Blocks: 915317
