Closed Bug 797068 Opened 12 years ago Closed 12 years ago

Report: correlation of Flash crashes against graphics card information

Categories

(Socorro :: Data request, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: benjamin, Assigned: selenamarie)

Details

Attachments

(1 file, 9 obsolete files)

2.79 KB, text/plain
Details
Some (many?) of the Flash 11.3+ crashes may be specific to certain graphics configurations. Adobe has asked us to run correlation reports for the top 20 or so Flash crash signatures to see if particular graphics hardware or drivers correlate highly to any of these crashes.

The raw data contains values such as these:

AdapterVendorID: 0x1002, AdapterDeviceID: 0x5159, AdapterSubsysID: 2030148c, AdapterDriverVersion: 6.14.10.6462

We need clarification from the gfx team on the structure and meaning of this data, and how we can group/correlate hardware and software drivers separately or together.

Please treat this as a high priority request. We'd like to have data within a week if possible. Assuming the report is useful we will probably want this to be a recurring report, but we should validate it first.
Indeed, a lot of gfx information is present in the AppNotes. Here's the explanation of the different gfx-related "fields" in AppNotes on Windows.

Note that this is platform-specific -- the info here only applies to Windows.

AdapterVendorID is the PCI vendor ID of the GPU. The usual values are:
   0x1002: ATI (AMD)
   0x10de: NVIDIA
   0x8086: Intel

AdapterDeviceID is the PCI device ID of the GPU. It is nontrivial to map this to commercial GPU names in an efficient way, but for a given AdapterDeviceID, you can find that out by some googling (e.g. google PCI device ID Intel 0x29a2)

AdapterSubsysID is supposed to be OEM-specific information, IIUC; jrmuizel would know more. This is where we can figure if e.g. the crashes are specific to Dell computers.

AdapterDriverVersion is the driver version (in the sense of DLL version). Generally different from user-facing version numbers.

Sometimes you'll see '2'-suffixed fields, like AdapterVendorID2. That's the second GPU info, in systems with 2 GPUs. However the info we get here can be unreliable, i.e. if the user manually replaced his GPU without reinstalling Windows, we expect that in some cases it will be reported as a two-GPU system.

The above info is obtained from the Windows registry.

Next up are flags such as "D3D10 Layers? D3D Layers+". These inform us of Gfx features that have been attempted during the Firefox session that crashed. The features include "D3D10 Layers", "D3D9 Layers", "D2D", "DWrite", and "WebGL". They are followed by a single characted like "?" or "+" which has the following meaning:

   ?     the feature was attempted
   +     the feature was successfully initialized
   -     the feature failed to initialized (typically due to blacklisting or insufficient hardware)
   !     the feature was force-enabled (the crash report is from a ricer, ignore them).

For example, "D3D10 Layers? D3D Layers- D3D9 Layers? D3D9 Layers+" means that we tried to initialize D3D10 Layers, that failed, then we fell back to D3D9 Layers and that succeeded.

A ? not matched with a + or - indicates a crash during the initialization of that feature.
Selena: If you can't do this in Postgres, ought to be doable with pig.  Check in with rhelmer if you need help.
Component: General → Data request
Assignee: nobody → sdeckelmann
Note that *some* of this information is also available as separate fields in the raw JSON, which is much nicer for running analysis against. Unfortunately, some of the info is only available in the free-text mashup "app notes" field in the way described by bjacob.
From IRC: bsmedberg comments that this report should be Windows only. 

To identify Flash crashes there, the presence of the Flash version is enough.
(In reply to Benoit Jacob [:bjacob] from comment #1)

Thanks for writing this all out! Very useful.

> Indeed, a lot of gfx information is present in the AppNotes. Here's the
> explanation of the different gfx-related "fields" in AppNotes on Windows.
> 
> Note that this is platform-specific -- the info here only applies to Windows.
> 
> AdapterVendorID is the PCI vendor ID of the GPU. The usual values are:
>    0x1002: ATI (AMD)
>    0x10de: NVIDIA
>    0x8086: Intel
> 
> AdapterDeviceID is the PCI device ID of the GPU. It is nontrivial to map
> this to commercial GPU names in an efficient way, but for a given
> AdapterDeviceID, you can find that out by some googling (e.g. google PCI
> device ID Intel 0x29a2)
> 
> AdapterSubsysID is supposed to be OEM-specific information, IIUC; jrmuizel
> would know more. This is where we can figure if e.g. the crashes are
> specific to Dell computers.
> 
> AdapterDriverVersion is the driver version (in the sense of DLL version).
> Generally different from user-facing version numbers.

Given all this -- do you want these grouped in any particular way? (following up on Benjamin's question earlier)

It's not too hard to create mappings for all these things, just a bit time consuming if the idea is to google all the GPUs, for example. :)  I imagine it may be more useful to have some mapping be available, but for the report requester to be able to perform their own aggregations. 

I'm happy to provide mapping information when available in the appropriate column, or however you'd like it.
And I sent a draft of a query to rhelmer for review this evening. Will forward a sample report to the bug reporters in the morning if it looks ok to him. :)
(In reply to Selena Deckelmann :selena from comment #5)
> Given all this -- do you want these grouped in any particular way?
> (following up on Benjamin's question earlier)

I would make most sense to try to correlate Flash crashiness to:
 * (VendorID, DeviceID) pair
 * (VendorID, DriverVersion) pair
to start with.

Only if that doesn't show anything conclusive, turn to other factors:
 * SubsysID
 * presence of 2 GPUs
 * Gfx features used --- especially "D2D", "D3D10 Layers" and "D3D9 Layers".


> 
> It's not too hard to create mappings for all these things, just a bit time
> consuming if the idea is to google all the GPUs, for example. :)  I imagine
> it may be more useful to have some mapping be available, but for the report
> requester to be able to perform their own aggregations. 

Indeed, device ID grouping is problematic.

For Intel GPUs, we actually have the grouping into families here:

http://hg.mozilla.org/mozilla-central/file/e679d2e1d02d/widget/xpwidgets/GfxDriverInfo.cpp#l96

This groups known Intel device IDs into families such as "IntelGMA500", etc.

For NVIDIA we have partial lists in this file:
https://bug605749.bugzilla.mozilla.org/attachment.cgi?id=484688

For ATI/AMD we have nothing.

Maybe it makes most sense to start without any grouping, hope that any device ID correlation will still be apparent even without grouping, and only start identifying device groups once a short list of interesting devices is found. In this way, you would avoid wasting time on device IDs that don't turn out to be relevant for this bug.
I'd like to have us do a proper correlation (or signature summary) report against adapter vendor/id (at least) in the future (and for that we'll need to have everything this is done for as a separate field outside of app notes), but this here should be a one-off thing, I don't think Selena wants to do hundreds or thousands of lists for this. Let's start with a single clear objective without a "if this isn't conclusive" and dig down further with additional one-off analyses in cases where we think this is needed.
I'm not a statistician, so I can't comment on how exactly the "correlation" should be done; but I assumed, perhaps naively, that for a given factor (e.g. "device ID is equal to 0x4321")  you could associate a "correlation" which would be a single number, where higher means more correlated to Flash crashiness. So, if we agree to start with trying to correlate to deviceID and to DriverVersion, then the result would be two tables:
 * one table giving for each deviceID its correlation with flash crashiness
 * one table giving for each DriverVersion its correlation with flash crashiness
But I might be completely off here.
From email convo: 

Benjamin said: 

1) The Firefox version is probably not important to this query.
2) This gives us the counts for each crash, but it doesn't tell us the 
total population to compare against. Basically we need the population 
sizes of (VendorID/DeviceID) for all of the Windows Crashes in the same 
period so that we can correlate the populations for a specific signature 
against the general population. [Benoit said the same thing]

====

Progress report:
After running this properly with the correct attributes, I think that including the AdapterDeviceID in my first pass at the report is too much detail.  The report is somewhere around 4GB (.csv!) for 9/27-10/1!

Whereas reporting just on the AdapterVendorID is ~250k. After a few passes over the report, I believe we'll want to figure out which AdapterVendorIDs are interesting, and get reports on each of those with more detail.
Sent new revision of report for review this morning to all.
Benoit - anything more you'd like on this ticket?
Benoit - anything more you'd like on this ticket?
Attached file Flash crashes by AdapterDeviceID (obsolete) —
Uploading to bugzilla the other of Selena's spreadsheets -- Flash crashes by Device ID. xz-compressed.
Selena: can you please explain what is meant by "crashes by deviceid" and "total adapter device crashes"?

I had assumed that the former was flash crashes only and the latter was total crashes regardless of crash cause. So that the quotient of the two would give the Flash crashiness on that particular setup. Is that correct?
(In reply to Benoit Jacob [:bjacob] from comment #17)
> Selena: can you please explain what is meant by "crashes by deviceid" and
> "total adapter device crashes"?
> 
> I had assumed that the former was flash crashes only and the latter was
> total crashes regardless of crash cause. So that the quotient of the two
> would give the Flash crashiness on that particular setup. Is that correct?

No, that's not correct. 

Perhaps we should have voice conversation to talk through how to aggregate data, because it seems that via email, SQL definitions and bug is not working.
(Talking only about AdapterDeviceID correlation here. Same applies to other factors).

It seems that I was confused earlier, and my above interpretation of these numbers (comment 17) was wrong.

As I understand it now:
 - crashes_by_deviceID is the number of crashes with given signature, flash version, device ID, etc.
 - total_adapter_device_crashes is EITHER the number of all crashes involving this device, OR the number of Flash crashes involving the device. I'm not sure. I think the latter.

In either case, there is does not seem to be enough data here to identify (as I now understand is our goal), for each signature, if it is correlated with the device ID.

Here is what I think we need to achieve that goal.

First we need to compute, for each (signature, deviceID) pair, its overall prevalence over all crashes:

In other words, this is the probability for a randomly picked crash to have this signature and deviceID. Let's call that P(signature, deviceID).

                Number of crashes with given (signature, deviceID) pair
P(sig, devID) = -------------------------------------------------------
                Total number of crashes

Next, we need to compare this to what this would be if signature were completely independent of deviceID. Independence of random events means that the probability of the coincidence of the two events is the product of the probability of the individual events. So we need to compute the probability of a given devID across all crash signatures,

           Number of crashes with given deviceID, across all signatures
P(devID) = ------------------------------------------------------------
           Total number of crashes

and the probability of a given signature across all device IDs,

           Number of crashes with given signature, across all device IDs
P(sig) = ---------------------------------------------------------------
           Total number of crashes


And we need to look at the ratio

             P(sig, devID)
         --------------------------
           P(sig)  *  P(devID)

If signature and deviceID are independent (not correlated) then this should be near 1.0  --- except for noise, which will be prevalent when the sample size is too small, so we may have some filtering/grouping to do, but we'll see when we're there.

When this ration is significantly higher than 1.0  --- I'd say at least 4 --- for a non-negligible set of crashes, we'll know that there is a correlation.
(In reply to Selena Deckelmann :selena from comment #18)
> (In reply to Benoit Jacob [:bjacob] from comment #17)
> > Selena: can you please explain what is meant by "crashes by deviceid" and
> > "total adapter device crashes"?
> > 
> > I had assumed that the former was flash crashes only and the latter was
> > total crashes regardless of crash cause. So that the quotient of the two
> > would give the Flash crashiness on that particular setup. Is that correct?
> 
> No, that's not correct. 
> 
> Perhaps we should have voice conversation to talk through how to aggregate
> data, because it seems that via email, SQL definitions and bug is not
> working.

There definitely has been a lot of miscommunication here, so yes that would help. But we also need this bug to be well documented, for the benefit of other people reading it, so can you please start by describing what the columns mean in your spreadsheet and then we can see from there if further conversation is needed?
Columns in the spreadsheet: 

signature: a signature from the Top 30 crashes that contain AdapterVendorID information in app_notes, happened on Windows and have a flash_version
flash_version: flash version
product_name: product name
os_name: operating system name
adaptervendorid: AdapterVendorID pulled from app_notes field
adapterdeviceid: AdapterDeviceID pulled from app_notes field
crashes_by_deviceid: Crashes for a VendorID/DeviceID pair, grouped by signature, flash_version, product_name, os_name and limited to Windows crashes
total_adapter_device_crashes: Crashes for a VendorID/DeviceID pair, limited to crashes with a flash_version and happened on Windows
total_adapter_vendor_crashes: Crashes for a Vendor, limited to crashes that we know occurred for that vendor and happened on Windows
(In reply to Benoit Jacob [:bjacob] from comment #20)
 
> There definitely has been a lot of miscommunication here, so yes that would
> help. But we also need this bug to be well documented, for the benefit of
> other people reading it, so can you please start by describing what the
> columns mean in your spreadsheet and then we can see from there if further
> conversation is needed?

Please see comment #21 for the detail on the columns. 

I don't fully understand the requirements detailed in comment #19, so I hope that we can talk today rather than delay further.
So, to back up. When I say "correlation report" I basically mean something similar to 

https://crash-analysis.mozilla.com/crash_analysis/20121008/20121008_Firefox_17.0a2-interesting-addons.txt

This is the signature-to-addon report that Socorro already produces. For each crash signature we list the addons which correlate highly with that crash. I'm not sure where the code lives which creates that report. From having worked with the results for a while, I believe that the math is pretty simple:

Let PSIG be the % of users with addon X for the population of crashes with that signature (column 1)
Let PPOP be the % of users with addon X for the entire crash population (column 2)
Let CORRELATION be PSIG / PPOP

Show the rows where correlation is greater than 2, sort by correlation descending.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #23)
> So, to back up. When I say "correlation report" I basically mean something
> similar to 
> 
> https://crash-analysis.mozilla.com/crash_analysis/20121008/
> 20121008_Firefox_17.0a2-interesting-addons.txt
> 
> This is the signature-to-addon report that Socorro already produces. For
> each crash signature we list the addons which correlate highly with that
> crash. I'm not sure where the code lives which creates that report.

http://hg.mozilla.org/users/dbaron_mozilla.com/crash-data-tools/ specifically http://hg.mozilla.org/users/dbaron_mozilla.com/crash-data-tools/file/default/per-crash-interesting-modules.py

I am working on integrating this more directly into Socorro but at the moment the stdout of that script is what is pushed to crash-analysis.
Attached file Crashes by deviceid (obsolete) —
Attachment #669216 - Attachment is obsolete: true
Attachment #669285 - Attachment is obsolete: true
Attachment #669606 - Attachment is obsolete: true
Questions/comments about this spreadsheet:
1) In adapterdeviceid, you want to format all entries in one standard form (choose either 0x???? or ???? form). Just in case that might lead to wrongly treat as separate devices that are really the same (0x1234 vs 1234). For example, I see '46' vs '0x0046'. So one issue is the 0x, another is leading 0's. Finally, some entries have trailing non-hex characters, like in 
2) some entries have a adapterdeviceid that suggests it's not Windows, like 'samsung'.
3) How do signature_id numbers map to actual signatures?
4) the probabilities columns all show '0' for nearly all entries. Now, this is normal: these probabilities are indeed very small. But it seems that some kind of rounding took place here, leaving these values as exactly 0. I'd suggest "exponent notation" for these numbers, for example, if a probability is 0.000034, you could write it as 3.4e-5. That's what I meant when I asked for "2 significant digits". Alternatively you could also not round at all.
5) The title of column F, "P(sig, devid) / P(sig) * P(devid)" written in this way, taken literally, would mean that P(devid) is multiplied on the numerator. Additional brackets are needed for it to mean what I meant in comment 19: "P(sig, devid) / (P(sig) * P(devid))". Please confirm that that's what you meant --- I can't confirm it with the current data because of issue 4).
First, sorry I didn't provide the SQL for this earlier. 

Here's my branch with the SQL: https://github.com/selenamarie/socorro/tree/797068-c19-video-card-crash-reports

1) No problem. I added a regex to group these.
2) My error! Fixed.
3) I have put the full signatures into the report
4) I have adapted this to show exponent notation. However, I have not figured out how to reasonably show a mix of scientific notation and not. We've reached the limits of available functions in pure PostgreSQL. Given that more work will go into a different style of report, I'm going to cut short my research short, and refrain from writing a function for now. In the future, I'll make it a bit nicer when it becomes part of a report.
5) I have updated the column so that it shows the proper brackets, and yes, I performed the actual arithmetic with the proper bracketing.

The sample set is crashes that:

* Have an AdapterID 
* Are identified as 'Windows NT' in the reports table (this captures all windows versions)
* Were processed after 10/7/2012
Attachment #670641 - Attachment is obsolete: true
No worries about always using scientific notation, that works for me.

But there seems to be a problem in the computation:

At the first row we have:

 P(sig, devid),   P(sig),    P(devid),   P(sig, devid) / (P(sig) * P(devid))
 5.8e-05,         1.2e-04,   1.7e+00,    1.7e-03

If I compute P(sig, devid) / (P(sig) * P(devid)) myself from this data, I get:

   5.8e-05 / (1.2e-04 * 1.7e+00) = 2.8e-1

So I get 2.8e-1 and not 1.7e-03 as in your spreadsheet. What's wrong?
Oh, also, there is another thing that's wrong here:

>  P(sig, devid),   P(sig),    P(devid),   P(sig, devid) / (P(sig) * P(devid))
>  5.8e-05,         1.2e-04,   1.7e+00,    1.7e-03

Here we have a probability, P(devid) = 1.7e+00 = 1.7 , that's larger than 1. That's impossible. All probabilities should be numbers between 0 and 1.

Furthermore, probabilities should add up to 1. So if you take all the P(sig) for all signatures, the sum should be 1, and likewise if you take the sum of all the P(devid) for all devid, and likewise for the sum of all P(sig, devid) for all (sig, devid) pairs.
(In reply to Benoit Jacob [:bjacob] from comment #29)

> Here we have a probability, P(devid) = 1.7e+00 = 1.7 , that's larger than 1.
> That's impossible. All probabilities should be numbers between 0 and 1.

:) These are all multiplied by 100 to get a percentage. I will take that out if you prefer.
Ah! Yes, I would prefer that since we are doing probability computations here like P(sig, devid) / (P(sig) * P(devid)) that have a simpler form when the probabilities are raw probabilities between 0 and 1 (otherwise you have to adjust all these computations).

Also that doesn't explain the oddity (1.7 vs 2.8e-1) noted in comment 29?
(In reply to Benoit Jacob [:bjacob] from comment #28)
> No worries about always using scientific notation, that works for me.
> 
> But there seems to be a problem in the computation:
> 
> At the first row we have:
> 
>  P(sig, devid),   P(sig),    P(devid),   P(sig, devid) / (P(sig) * P(devid))
>  5.8e-05,         1.2e-04,   1.7e+00,    1.7e-03
> 
> If I compute P(sig, devid) / (P(sig) * P(devid)) myself from this data, I
> get:
> 
>    5.8e-05 / (1.2e-04 * 1.7e+00) = 2.8e-1
> 
> So I get 2.8e-1 and not 1.7e-03 as in your spreadsheet. What's wrong?

Thank you for checking. I typed the formula in wrong in my SQL - did not actually divide again by the total crashes. :/ 

Fixed it now. But the final P() values look weird. I double checked the calculation, and included the raw numbers, and it is the number that it should be.
Attachment #670883 - Attachment is obsolete: true
Thanks! This is starting to look really useful!

Further requests:

Can you please confirm that this is only Flash crashes? I see a lot of signatures here that don't seem flash-related and that weren't found in the earlier tables. This suggests that you might have accidentally dropped the filtering for Flash crashes only.

But, this is actually interesting, as it confirms that our approach to find GPU-related crashes works. Indeed, if I look at the signatures in your table that have the highest deviceID correlation and a significant sample set, many are clearly gfx-related signatures, such as sgfxu64.dll, atidxx32.dll, and gfxFontGroup::MakeTextRun. This confirms that we are on the right track. For these gfx-related crashes, we get correlations > 1e+4 with particular devices. This makes me think that we are on to something here. It is entirely possible that people have been banging their heads on these crashes and never realized the correlation with particular devices. The work you're doing here could well be important for many Gfx bugs.

Another thing that I would like, would be a clear description at the top of the table, of the meaning of columns. This latter table doesn't seem to have column titles?
(Was talking about part 1 of 2 here)
Yes -- (In reply to Benoit Jacob [:bjacob] from comment #34)

> Can you please confirm that this is only Flash crashes? I see a lot of
> signatures here that don't seem flash-related and that weren't found in the
> earlier tables. This suggests that you might have accidentally dropped the
> filtering for Flash crashes only.

Well, this was a tricky thing. I requested information on how to filter these earlier, and was told that it was not straightforward to filter this. Specifically, we asked if we selected crashes that started with 'F...' and were told that this was not a reliable indicator.

Do you have a recommended filter to try?

> But, this is actually interesting, as it confirms that our approach to find
> GPU-related crashes works. Indeed, if I look at the signatures in your table
> that have the highest deviceID correlation and a significant sample set,
> many are clearly gfx-related signatures, such as sgfxu64.dll, atidxx32.dll,
> and gfxFontGroup::MakeTextRun. This confirms that we are on the right track.
> For these gfx-related crashes, we get correlations > 1e+4 with particular
> devices. This makes me think that we are on to something here. It is
> entirely possible that people have been banging their heads on these crashes
> and never realized the correlation with particular devices. The work you're
> doing here could well be important for many Gfx bugs.

\o/ yay! :)

> Another thing that I would like, would be a clear description at the top of
> the table, of the meaning of columns. This latter table doesn't seem to have
> column titles?

Yes, certainly. 

I had to split the file because I am only allowed to upload 4MB at a time.  Once you let me know how you'd like me to filter for flash-only crashes, I will create the remaining reports. 

I have to hop on a plane in a few hours, but will try to get these done before the weekend.
Please just do this for processtype=plugin crashes. Since Flash accounts for basically all of the top plugin crashes, that's the easiest/best way to do this.
I have no idea myself how to filter for Flash crashes alone. One thing you could do is filter plugin crashes: I believe, from looking at the crashdata csv files (the csv files with crash data on FTP) that there is a "plugin" flag on crash reports, at least these csv files have such a flag. I would suppose that Flash crashes would be the overwhelming majority of plugin crashes.
What Benjamin said.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #37)
> Please just do this for processtype=plugin crashes. Since Flash accounts for
> basically all of the top plugin crashes, that's the easiest/best way to do
> this.

The latest upload reflects this change. I've also pushed the change to the query up to the repo. We're so close! :)

Benoit - let me know if the header information is explicit enough now.
(In reply to Selena Deckelmann :selena from comment #36)
> Well, this was a tricky thing. I requested information on how to filter
> these earlier, and was told that it was not straightforward to filter this.
> Specifically, we asked if we selected crashes that started with 'F...' and
> were told that this was not a reliable indicator.

The indicator we use everywhere else and that is most reliable is if the Flash version field is non-empty, as we should be able to detect the version for all Flash crashes and not for others.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #43)
> (In reply to Selena Deckelmann :selena from comment #36)
> > Well, this was a tricky thing. I requested information on how to filter
> > these earlier, and was told that it was not straightforward to filter this.
> > Specifically, we asked if we selected crashes that started with 'F...' and
> > were told that this was not a reliable indicator.
> 
> The indicator we use everywhere else and that is most reliable is if the
> Flash version field is non-empty, as we should be able to detect the version
> for all Flash crashes and not for others.

Ah, I now see that a foreign key generated for flash_version_id contains an ID of '1' when the flash version is NULL, or 215 when it is '[blank]'.

Another thing to document.... :)
Hey, I took a look at the data and it looks really good now!

Just a question -- sorry for being a bit stubborn -- how confident are you that this includes only Flash crashes? The signature at the top, 

"js::mjit::Recompiler::expandInlineFrames(JSCompartment*, js::StackFrame*, js::mjit::CallSite*, js::StackFrame*, js::VMFrame*)"

is clearly a crash in our Javascript engine, however, of course, I can't rule out the possibility that something weird would be happening here and it would actually be a Flash crash.

To see how usefule the data is, I started using the data to correlate Flash crashes to GPU device IDs, and filed bug 802105. This is a Flash crash, and it's 100% correlated to a single GPU, the Intel i815. Yay! That's exactly what we were looking for.

All I did was filter for Column C (crashes for given device and signature) >= 100 and then sort by decreasing Column J (correlation ratio). That crash then came at the top.

Clearly, this would benefit a lot from aggregating more data. I understand that your table has 3 days of data; it would be really useful to aggregate a whole month.

Also, I don't understand why/how you divide the data into 2 separate spreadsheets? If it's just to work around a Bugzilla attachment size limit, maybe just upload the data to some FTP server rather than attaching it?

Also, an idea: in addition to the Device ID, it would actually be very useful to do the same for the Vendor ID. The idea is that often, a crash will affect a whole range of GPUs, but they will all be from the same Vendor. So by using Vendor IDs instead of device IDs, we would immediately see the correlation. I'm suggesting that you add new columns alongside existing ones, NOT that you modify existing ones. New columns for: "vendor ID" , "Crash count: by signature and VendorID", "crash count: VendorID", "P(vendor)", "P(sig, vendor)", and "P(sig,vendor)/(P(sig) * P(vendor))".

Finally I have to admit something: my spreadsheet, LibreOffice Calc, is too stupid to correctly parse the exponent-notation numbers. In the end, it would be more convenient for me if your spreadsheet just had plain un-rounded numbers like 0.12345678. Sorry for the U-turn on this.
(In reply to Benoit Jacob [:bjacob] from comment #45)
> Hey, I took a look at the data and it looks really good now!

Thanks for reviewing! I'm working on incorporating all of your observations in a new version.

> Just a question -- sorry for being a bit stubborn -- how confident are you
> that this includes only Flash crashes? The signature at the top, 
> 
> "js::mjit::Recompiler::expandInlineFrames(JSCompartment*, js::StackFrame*,
> js::mjit::CallSite*, js::StackFrame*, js::VMFrame*)"
> 
> is clearly a crash in our Javascript engine, however, of course, I can't
> rule out the possibility that something weird would be happening here and it
> would actually be a Flash crash.

I'm going to use what KaiRo said earlier and filter by flash_version_id 1 and 215, corresponding to '' and '[blank]' in those fields. Unfortunately, I don't know much beyond what KaiRo and Benjamin have suggested in this ticket.

> Clearly, this would benefit a lot from aggregating more data. I understand
> that your table has 3 days of data; it would be really useful to aggregate a
> whole month.

I will create a report for that and find a place to stash the results.

> Also, I don't understand why/how you divide the data into 2 separate
> spreadsheets? If it's just to work around a Bugzilla attachment size limit,
> maybe just upload the data to some FTP server rather than attaching it?

I'm asking around for a place to stash this stuff.

> Also, an idea: in addition to the Device ID, it would actually be very
> useful to do the same for the Vendor ID. The idea is that often, a crash
> will affect a whole range of GPUs, but they will all be from the same
> Vendor. So by using Vendor IDs instead of device IDs, we would immediately
> see the correlation. I'm suggesting that you add new columns alongside
> existing ones, NOT that you modify existing ones. New columns for: "vendor
> ID" , "Crash count: by signature and VendorID", "crash count: VendorID",
> "P(vendor)", "P(sig, vendor)", and "P(sig,vendor)/(P(sig) * P(vendor))".

Certainly.   I will include this in my next reports.

> Finally I have to admit something: my spreadsheet, LibreOffice Calc, is too
> stupid to correctly parse the exponent-notation numbers. In the end, it
> would be more convenient for me if your spreadsheet just had plain
> un-rounded numbers like 0.12345678. Sorry for the U-turn on this.

Ok! No problem.
(In reply to Benoit Jacob [:bjacob] from comment #45)

> All I did was filter for Column C (crashes for given device and signature)
> >= 100 and then sort by decreasing Column J (correlation ratio). That crash
> then came at the top.

Thanks for the detailed information about how you were looking at the data. This is invaluable information!

Is this a good way to publish this data? It's an easy operation to perform on the output.

(and I'm running the report since 9/1/2012 for both deviceid and vendorid now)
Status: NEW → ASSIGNED
(In reply to Selena Deckelmann :selena from comment #47)
> (In reply to Benoit Jacob [:bjacob] from comment #45)
> 
> > All I did was filter for Column C (crashes for given device and signature)
> > >= 100 and then sort by decreasing Column J (correlation ratio). That crash
> > then came at the top.
> 
> Thanks for the detailed information about how you were looking at the data.
> This is invaluable information!

Hey, don't hesitate to ask, I had no idea really how useful it would be to mention this.

> 
> Is this a good way to publish this data? It's an easy operation to perform
> on the output.

The sorting by decreasing correlation ratio (column J) is definitely wanted, yes.

I am less sure about the filtering (like, crashes >= 100) because it is a destructive operation. It is definitely needed to do something to cut off the noise at some point, but doing it in the report itself means that there is no way to recover that data, so it's a tough question.

I'll try to think of a better approach here, than just brutally cutting off small values. Maybe the ordering-by-column-J could be adjusted in some meaningful way to account for the fact that larger sample sets mean we have higher confidence in the correlation. I will ask people who know about statistics (I don't).

> (and I'm running the report since 9/1/2012 for both deviceid and vendorid
> now)

Thanks!
KaiRo's version is not sufficient, because all kinds of crashes can have a Flash version *present* which are just browser crashes. Please limit this to type=plugin crashes, and also exclude hangs if we're not doing that already.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #49)
> KaiRo's version is not sufficient, because all kinds of crashes can have a
> Flash version *present* which are just browser crashes.

KaiRo's suggestion did reduce the run-time for the report significantly. Are you saying also that you would like me to *add records without a flash version recorded* back into the report? 

> Please limit this to type=plugin crashes,

Just for completeness: I did continue to limit the report to the plugin crashes, in addition to filtering out records which had no flash version recorded.

>  and also exclude hangs if we're not doing that already.

Would this be any record that has a 'hangid' in the reports table? (looks like a uuid in there)
> KaiRo's suggestion did reduce the run-time for the report significantly. Are
> you saying also that you would like me to *add records without a flash
> version recorded* back into the report?

No, that's not necessary.

> Would this be any record that has a 'hangid' in the reports table? (looks
> like a uuid in there)

Yes.(In reply to Selena Deckelmann :selena from comment #50)

> Just for completeness: I did continue to limit the report to the plugin
> crashes, in addition to filtering out records which had no flash version
> recorded.

Hrm, I maybe misunderstood comment 45: if the js::mjit::Recompiler::expandInlineFrames signature is showing up then there is a filtering error because that signature never happens in type=plugin reports
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #51)

> > Just for completeness: I did continue to limit the report to the plugin
> > crashes, in addition to filtering out records which had no flash version
> > recorded.
> 
> Hrm, I maybe misunderstood comment 45: if the
> js::mjit::Recompiler::expandInlineFrames signature is showing up then there
> is a filtering error because that signature never happens in type=plugin
> reports

You understood correctly -- I meant in the new versions of the report I am currently generating.

Thanks for clarifying! :)
Here are the latest copies of the reports: 
https://github.com/selenamarie/socorro/raw/797068-c19-video-card-crash-reports/sql/crash_sql/crash_by_deviceid.csv.bz2
https://github.com/selenamarie/socorro/raw/797068-c19-video-card-crash-reports/sql/crash_sql/crash_by_vendorid.csv.bz2

I left some of the scientific notation in some columns because I didn't have an easy way to convert the 'e-06' numbers. Let me know if that's a problem, and I'll write a function to un-scienceify the numbers.

Once we're settled here, I'll start working on implementing a report like these in Socorro in a separate ticket.
Flags: needinfo?(bjacob)
If I understand correctly, all the numbers smaller than 1e-4 are still written in exponent notation while numbers greater than 1e-4 are in decimal notation.

Again, normally I wouldn't care either way (or would prefer exponent notation) but since we know that major spreadsheet programs (such as LibreOffice) have trouble automatically understanding that 1.2e-05 is a number, it is really preferable to have all numbers in decimal notation. Please do write that conversion function if needed, it's useful :) I don't know what tool you are using, but I thought that any programming language would offer a built-in method to do that. For example, in C, printf("%f", value) would do that (%f gives decimal notation whereas %e gives exponent notation).
Flags: needinfo?(bjacob)
Also, thinking about bug 802105, it would be a great improvement if you could automatically remove "duplicates" so that we wouldn't again be fooled into believing that there exists a strong correlation when it all comes from a single user who had many crashes.

One way to define duplicates is: same signature AND same Install Time AND same Build Architecture Info.

By Install Time and Build Architecture Info, I mean the fields that go by these names in crash reports such as:
https://crash-stats.mozilla.com/report/index/575e2ecb-7767-4e98-ae1e-2a8692120810
   Install Time			2012-08-09 15:48:49
   Build Architecture Info	GenuineIntel family 6 model 42 stepping 7

My understanding is that Build Architecture Info gives the CPU of the user.

If two crashes have the same install time and same Build Architecture Info, it's very likely that they are from the same user.

_Within_ a single signature, it makes sense to count each user only once. That will remove a lot of noise, if my experience trying to find correlations two days ago is any indication.

Initially I considered that Signature and Install Time by itself would be enough to define duplicates. However, there are only 1e6 seconds in a month, so, according to the 'Birthday Paradox' theorem,
  http://en.wikipedia.org/wiki/Birthday_problem
(The same paradox that makes it likely that 2 members in a 23-person soccer team have the same birthday).
The probability of having some false positive would become large as soon as we are considering roughly 1000 users. It's very possible that more than 1000 users would run into the same Flash bug on the stable channel, so I feel more comfortable about adding another criterion into the definition of a duplicate, whence the additional consideration of the Build Architecture Info.

Does that make sense?
(In reply to Benoit Jacob [:bjacob] from comment #55)
> My understanding is that Build Architecture Info gives the CPU of the user.

Yes, it's simply the processed output of CPUID. I'm not sure why it got renamed from CPU Architecture Info, I think that must have been collateral damage from renaming "CPU Architecture" to "Build Architecture". (Which makes more sense, as it's simply the CPU Architecture of the binary we're running.)
What is the status of this bug? I would like to have something useful to Adobe fairly quickly, even if it requires them to review the results manually for deduplication after the report is run. Worse is better!
So, the latest data that we have is Selena's links in comment 53.

Theoretically the data is enough, but in practice, it's still a bit too hard to get significant conclusions from it (filter out noise).

In comment 45 -- 46 we agreed that a good way to get significant results easily was to add VendorID correlations in addition to DeviceID correlations. Indeed, if a crash signature is correlated to e.g. some NVIDIA GPU, in particular it will be correlated to the VendorID being NVIDIA, and that will be far less noisy as there only are 3 vendor IDs as opposed to 1000+ device IDs.

Another issue I raised in comment 54 is that my spreadsheet application doesn't parse exponent notation well so I actually need numbers to be given in plain decimal notation, un-rounded.

Once these 2 issues are addressed, I would gladly go over the data and tell you if I think we have any chance of a correlation there.

As discussed in comment 55, another extra, nice-to-have thing would be to filter out duplicates i.e. when a single user gave a ton of crash reports. That would greatly reduce noise levels, but is not necessary --- I'm not blocked on that part.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #57)
> What is the status of this bug? I would like to have something useful to
> Adobe fairly quickly, even if it requires them to review the results
> manually for deduplication after the report is run. Worse is better!

As bjacob said, you can take the report in comment 53. I can do more work on this next week, but not today.

The deduplication problem is quite complex and several people worked on it many months ago (before my time). I plan to work on that from the Socorro side, but the root issue is that we are getting duplicate crash reports from breakpad. We have quite a few other more pressing issues, IMO, so I think solving deduplication should be left alone for now.
Yeah, deduplication is more difficult. But as I said in comment 58, it's less necessary. If you could just add the VendorID correlations and revert to plain decimal numbers early next week, that would be enough to allow me to try to answer Benjamin's question.
Attachment #671008 - Attachment is obsolete: true
Attachment #671009 - Attachment is obsolete: true
(In reply to Benoit Jacob [:bjacob] from comment #55)
> Also, thinking about bug 802105, it would be a great improvement if you
> could automatically remove "duplicates" so that we wouldn't again be fooled
> into believing that there exists a strong correlation when it all comes from
> a single user who had many crashes.
> 
> One way to define duplicates is: same signature AND same Install Time AND
> same Build Architecture Info.

I started looking into this:

* Signature is easy
* InstallTime is not recorded in reports, but a derived number install_age is recorded which is the difference between InstallTime and CrashTime.
* Build Architecture Info/CPUID is normalized somewhat in reports and not all in a single field - some research needs to be done in order to denormalize it

We may wish to perform the de-duplication work in a stored procedure, or as an update to the reports_duplicates() function that already does related/same/similar work.

I'll see if I can wedge it into the query I'm working with right now.
So as much as I'd love a deduplicated report, I rather more urgently would like something I can hand to adobe which is readable and indicates possible correlations for followup.

Note that there are two different kinds of "deduplication" anyway: in one case we actually receive multiple copies of the exact same crash report. That's not what bjacob is talking about, though: he really wants to discount a single machine crashing multiple times and skewing the correlation.
Agree --- as said in comment 58, de-duplication falls in the "extra, nice-to-have" category.
I've updated and pushed these files: 

https://github.com/selenamarie/socorro/raw/797068-c19-video-card-crash-reports/sql/crash_sql/crash_by_deviceid.csv.bz2

https://github.com/selenamarie/socorro/raw/797068-c19-video-card-crash-reports/sql/crash_sql/crash_by_vendorid.csv.bz2

I was a little confused by the request for "adding the vendorid correlations". There are two files containing, respectively: the AdapterDeviceID and then the AdapterVendorID correlations as requested.

I have not moved ahead with any deduplication.
Selena, are you going to turn the raw data into a report? Or is bjacob doing that?
I'll see if I can create something based on the notes bjacob left about an earlier version of the report.

I just found another formatting problem too, so this will give me a chance to fully test this version of the output.
I've regenerated the two reports to have what Benoit asked for about the scientific notation. These are crashes by AdapterDeviceID and crashes by AdapterVendorID (per the ticket). 

https://github.com/selenamarie/socorro/raw/797068-c19-video-card-crash-reports/sql/crash_sql/crash_by_deviceid.csv.bz2

https://github.com/selenamarie/socorro/raw/797068-c19-video-card-crash-reports/sql/crash_sql/crash_by_vendorid.csv.bz2

Regarding generating a report for Adobe, I added a sheet to the Crash By DeviceID report, that's filtered for signatures with more than 100 crashes and sorted to the devices which had a P(sig,deviceid)/(P(sig) * P(deviceid)) > 80.  I used the procedure Benoit outlined in comment 45 and spent about an hour with the data.  I looked up the deviceIDs in the PCI Vendor Database by hand, and did not look up all the Dual GPUs - apparently the first GPU ID reported is bogus, and you're supposed to actually look at the second GPU for the real DeviceID. I imagine what I'll do in a production version of this report is store a local copy of that data. To find the rest of the information, someone could just look up the Signatures and examine the app_notes field for the second GPU ID.

Github doesn't seem to want to allow the raw download for the .ods, so readers of this ticket will need to checkout the branch to get a copy. (I sent a copy directly to Benoit and Benjamin)

We're operating on the edge of my understanding of how this information is now being used, so if you wanted something different out of this, please let me know. Looking up GPU information is pretty horribly time consuming. I will rewrite this report to find the second GPU in the future.

I also looked at the VendorID data, and the top few crashes reported are the same as in the DeviceID report, and the remaining correlations are not as strong as what was found for AdapterDeviceIDs.
Added bug 807076 to track implementing this style of correlation report in PostgreSQL.

Closing this bug!
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
This is a snapshot of just the top interesting results from the spreadsheet Selena created. I'm going to look through the results and file individual bugs in detail if a manual sanity-check shows nothing wrong.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: