Closed Bug 803209 Opened 12 years ago Closed 11 years ago

Create table and stored procedure to track volume of GC crashes over time

Categories

(Socorro :: Webapp, task)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: billm, Assigned: selenamarie)

References

Details

(Whiteboard: [DB Changes][qa-])

Attachments

(1 file, 2 obsolete files)

A while ago we added an annotation to crash dumps to say whether the crash happened during a GC. However, we don't have a way to access that data. I think the minimum for what we'd need would be a report that would tell you, for each version, what percentage of crashes happened during GC. Or maybe it would be better to know how many GC crashes there were per ADU. The latter seems more useful to me, but it seems like Socorro always uses the former, so maybe that's easier to obtain.

Doing a breakdown by version would at least allow us to see the effect on crash rates of, say, incremental GC. However, it would also be nice to see it broken down by version and buildid, since that would allow us to see spikes caused by particular patches.

Finally, it would be great if we could get a report like this mailed out to certain people every week or so.
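To make the difference concrete, the two metrics only differ in the denominator. A throwaway sketch with made-up numbers (not real data):

    # Hypothetical numbers for one build, just to contrast the two metrics.
    total_crashes = 20000      # all crashes reported for the build
    gc_crashes = 30            # crashes that happened during GC
    adu = 2400000              # active daily users for the build

    pct_of_crashes = 100.0 * gc_crashes / total_crashes   # % of crashes during GC
    gc_per_100_adu = 100.0 * gc_crashes / adu              # GC crashes per 100 ADU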
This still would be really useful. Benjamin, are you the only person who could work on this, or are there other people I should ask for help?
(In reply to Bill McCloskey (:billm) from comment #1)
> This still would be really useful. Benjamin, are you the only person who
> could work on this, or are there other people I should ask for help?

I could have a look, although I have a bit of a backlog. :)

Do you happen to know what an annotation for a crash that happens during a GC looks like?
(In reply to Selena Deckelmann :selenamarie :selena from comment #2)
> I could have a look, although I have a bit of a backlog. :)

Awesome! thanks.

> Do you happen to know what an annotation for a crash that happens during a
> GC looks like?

Yes, the .json file includes "IsGarbageCollecting": 1. Crashes that are not doing GC don't have this field at all in the json file.
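So a consumer only needs to test for the presence of that key; a minimal sketch (the file path is just an illustration):

    import json

    # The raw crash metadata is the submitted .json file parsed into a dict.
    with open("raw_crash.json") as f:
        raw_crash = json.load(f)

    # Presence of the key is the signal; crashes outside of GC simply omit it.
    is_gc_crash = "IsGarbageCollecting" in raw_crash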
Changing this to Webapp since what this is asking for is a recurring report. A one-off report would be better than nothing, though, if we could do that quickly.
Component: Data request → Webapp
This is pretty straightforward, and we might be able to do it with ES.
OS: Linux → All
Priority: -- → P3
Hardware: x86_64 → All
Whiteboard: [search]
Thinking about it, this is not a search-related problem. It seems it could easily be added to the signature summary report. Brandon, thoughts?
Priority: P3 → --
Whiteboard: [search]
This will probably require changes through our full stack, but it shouldn't be impossible and will be faster than waiting for ES. Laura approved such a change. Let's get it on the schedule.
As I explained on IRC, I don't think Signature Summary helps in what the JS team wants to get out of this. We already pretty much know if a signature happens mostly in GC or not.
What we want to have is some tracking of the per-build-date development of the sum of all GC crashes (i.e. those with IsGarbageCollecting=1), so that we see when we have a regression, esp. on Nightly, that causes more such crashes to happen. With that, we can track down what changes have been made between "better" and "worse" builds, and potentially track down sources of memory corruption (which are extremely hard, if not impossible, to track down from the actual crash reports).

So, ideally, we'd get some graph of per-build-date total amount of GC crashes for all versions that have per-build-date reporting enabled (i.e. right now Nightly and Aurora).
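Conceptually, the aggregation behind such a graph is just counting the annotated crashes per build id; a rough sketch (raw_crashes here stands in for however the crashes actually get pulled out of storage):

    from collections import Counter

    def gc_crashes_per_build(raw_crashes):
        # raw_crashes: iterable of raw-crash dicts (a placeholder, not the real store).
        counts = Counter()
        for crash in raw_crashes:
            if "IsGarbageCollecting" in crash and "buildid" in crash:
                counts[crash["buildid"]] += 1
        return counts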
lars: need to get this into the DB so we can aggregate on it.  Work with Selena to get that sorted out.
Assignee: nobody → lars
Target Milestone: --- → 38
Depends on: 843788
Depends on: 847939
Target Milestone: 38 → 39
Assignee: lars → sdeckelmann
Target Milestone: 39 → 40
Target Milestone: 40 → Future
Next step for me is to drum up some stored procedures and a report.
:selena, the next step in this process is to get raw crashes flowing from the crashmovers into Postgres so that we can start accessing the GC information.  That'll entail getting a network route from the collector boxes to Postgres in both staging and production.

Of course, we ought to do this in dev first.  Since we've got the new style crashmovers running in both stage and production now, this data flow can be enabled entirely through configuration.  

We need these questions answered and tasks done:

0) fix the cron submitter problem so we can actually test
1) do we need to route to Postgres through PGBouncer?
2) enable the route
3) configure and test the impact in dev
4) configure and test the impact in staging 
5) make it happen in prod
6) of course, then we need a way to actually report on and display this data, but that's a collaboration with our front-end people.

Let's work together on this.
Target Milestone: Future → 46
Depends on: 866960
Release 45 includes the changes to get the JSON into PostgreSQL. That should ship this week.
Depends on: 869221
Hi Bill,

Could you provide an example crash with 'isGarbageCollecting': 1? I can't seem to fetch any crashes with this item in the raw crash JSON in the last week.

-selena
Flags: needinfo?(wmccloskey)
Note that it's studlycaps "IsGarbageCollecting". Looking for a sample using jydoop now.
(In reply to Selena Deckelmann :selenamarie :selena from comment #13)
> Could you provide an example crash with 'isGarbageCollecting': 1? I can't
> seem to fetch any crashes with this item in the raw crash JSON in the last
> week.

I'm not Bill, but here are a few:

bp-aac80d2f-5ec1-430d-a55f-7d5ef2130507
bp-435913f4-62bd-428c-bf47-b43cb2130507
bp-61d5b275-b5a5-49d3-a4e0-ca5412130507
bp-8eaecf65-6482-4fe2-96e5-2babd2130507

(I just looked for two obvious GC crash signatures - in the marking phase - and randomly selected two crashes of each.)
Target Milestone: 46 → 47
Thanks, Selena! I don't understand the format of the file, though. The first half doesn't contain any percentages. Should that be ignored?

Also, could you post the code you used to generate this? I'd like to be able to run it myself, if that's okay.
(In reply to Bill McCloskey (:billm) from comment #18)
> Thanks, Selena! I don't understand the format of the file, though. The first
> half doesn't contain any percentages. Should that be ignored?

I'll figure out what's wrong with the file. I probably just messed up a test on the incoming data from raw_adu.

> Also, could you post the code you used to generate this? I'd like to be able
> to run it myself, if that's okay.

I'll post the code, but you can't run it yourself at this point. 

I'm sorry this isn't easier. What I did was out-of-band; getting it working inside Socorro itself typically takes several weeks because it needs access to ADI/ADU and to raw crashes from HBase. Because of access controls, I ran the map-reduce on sp-admin01 and then ran the SQL query to get ADU on the database server.
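For reference, the final combination step is essentially the following (a sketch only; gc_counts and adu_counts stand in for the map-reduce output and the raw_adu query result, which aren't reproduced here):

    def gc_rate_per_adu(gc_counts, adu_counts):
        # gc_counts and adu_counts are assumed to be dicts keyed by buildid.
        rows = []
        for buildid, crashes in sorted(gc_counts.items()):
            adu = adu_counts.get(buildid)
            if not adu:
                continue  # no ADU reported for this build; skip rather than divide by zero
            rows.append((buildid, 100.0 * crashes / adu, adu, crashes))
        return rows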
Sent updated code and raw_adu details to :billm offline.

Asked whether we should have divided up the reports based on platform.
Re-running reports based on buildid only!
Thanks, Selena. This is really starting to come together. A couple questions:

1. Are all the lines from nightly, or are other channels included? There seems to be huge variation in the number of ADUs. It's okay to include non-nightly builds as long as they're separated somehow.

2. Some of the lines seem a little off, like this one:

20130409194949 136049.127% 2414872 1775

Could you investigate?

Thanks again for doing this. Once we get it nailed down, would it be possible to run it on crashes as far back as we have?
(In reply to Bill McCloskey (:billm) from comment #23)
> Thanks, Selena. This is really starting to come together. A couple questions:
> 
> 1. Are all the lines from nightly, or are other channels included? There
> seems to be huge variation in the number of ADUs. It's okay to include
> non-nightly builds as long as they're separated somehow.
> 
> 2. Some of the lines seem a little off, like this one:
> 
> 20130409194949 136049.127% 2414872 1775
> 
> Could you investigate?

Yes! 

> Thanks again for doing this. Once we get it nailed down, would it be
> possible to run it on crashes as far back as we have?

We have ADU going back to 12-1-2010.

Do you happen to know in which version of Firefox we started reporting IsGarbageCollecting?
(In reply to Bill McCloskey (:billm) from comment #23)
> Thanks, Selena. This is really starting to come together. A couple questions:
> 
> 1. Are all the lines from nightly, or are other channels included? There
> seems to be huge variation in the number of ADUs. It's okay to include
> non-nightly builds as long as they're separated somehow.

I did not differentiate the channels. I will try to do that now.

> 2. Some of the lines seem a little off, like this one:
> 
> 20130409194949 136049.127% 2414872 1775

I had an error in my SQL - I was still grouping by date as well as buildid. I've fixed that and removed some of these weird lines. 

There seems to be only one version left that reports a very small ADU but has a huge number of crashes by comparison:

20130215125331 38100.000% 2667 7
Thanks, Selena! I just had a chance to look at this data closely, and it looks really useful. When I sort by buildid and exclude the builds from before 4/1, there's a very clear pattern. The baseline crash rate is about 0.15% (in crashes per ADU). There are spikes at 4/3-4/4 and 4/20-4/22 and 4/30. I investigated the spike at 4/20 and it was caused by bug 860145, which landed on 4/19 and was backed out on 4/22.

So I think this is working as we might expect, which is fantastic! I do have a few more requests, if you have time:

1. The original bug I was interested in was bug 868369. I forgot that it's actually about beta and not release. So it would be nice to incorporate other channels, as long as the channel is included in the data. I think you could change the find-isgc.py script like so:

    if 'IsGarbageCollecting' in raw_crash and 'buildid' in raw_crash and 'version' in raw_crash:
        context.write('\t'.join((raw_crash['version'], raw_crash['buildid'])), 1)

and then similar changes to the ADU query.

2. Could you run the script on more data? It would be great to have data from, maybe, 2/1 to the current date.
Style note on comment 27: I wouldn't use '\t'.join in the mapper; it's more efficient to use a tuple for the key:

context.write((raw_crash['version'], raw_crash['buildid']), 1)

And the CSV output at the end will also come out right automatically.
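Putting comment 27 and comment 28 together, the whole mapper would look roughly like this (a sketch in the style of the snippets above, not the actual find-isgc.py):

    def map(raw_crash, context):
        # Emit one count per (version, buildid) for crashes annotated as
        # happening during garbage collection; all other crashes are skipped.
        if ('IsGarbageCollecting' in raw_crash
                and 'buildid' in raw_crash
                and 'version' in raw_crash):
            context.write((raw_crash['version'], raw_crash['buildid']), 1)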
Target Milestone: 47 → 49
Summary: Track volume of GC crashes over time → Create table and stored procedure to track volume of GC crashes over time
Blocks: 875084
Target Milestone: 49 → 50
Depends on: 871997
Depends on: 880744
Whiteboard: [DB Changes]
Target Milestone: 50 → 52
Commits pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/459939e3578afcb9cc370178b0b0bdd440a91671
bug 803209 Support for garbage collection reporting

https://github.com/mozilla/socorro/commit/6c0e607b4c2d0ba487370f9b956ac6a1d4e58ba7
bug 803209 - code review fixes

https://github.com/mozilla/socorro/commit/8d8432b9d3230239da606bea815c62609d69ccfe
Merge pull request #1286 from selenamarie/bug803209-garbage-collecting-report

Bug803209 support for garbage collecting report in database
Target Milestone: 52 → 51
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Whiteboard: [DB Changes] → [DB Changes][qa-]
Where can I find a graph or so for total volume of GC crashes over time? IIRC, that was the request here for a report, and I can't find it right now.
Blocks: 915317
