Closed Bug 593117 Opened 12 years ago Closed 12 years ago

mine crash data to come up with rough number of Firefox users using CPUs that don't support SSE2

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: beltzner, Assigned: benjamin)

Details

Attachments

(1 file)

We're trying to get a sense of what %age of our existing users run Firefox on CPUs that don't support SSE2. The best way we know of to do this is to mine crash data, which includes information about the CPU vendor/family.

Please run a query against all crash reports, segmented by active major Firefox release (Firefox 3.5.*, Firefox 3.6.*, Firefox 4.0.*) which will return the percentage of the reports that were submitted indicating CPUs that match the following criteria:

if vendor == AuthenticAMD and family >= 15
if vendor == GenuineIntel and (family >= 15 or (family == 6 and (model == 9 or model > 11)))

This should yield a close-enough estimate of the %age of crash reports submitted for CPUs that support SSE2.
(Not at all sure this is the right component, please feel free to move it!)
I think this is likely best answered from HBase. Daniel?
We could write a mapreduce for it.  We would need to schedule that in amongst our 1.9 dev work.  What is the priority/turn-around time for this request?
Can someone else write a mapreduce for it, if you're busy?  Christian might well be able to, for example.
For that he'd need to be cc'd on the bug!

Turnaround time is: soon. If we could have this for Wednesday next, that'd be great. I'm sorry, I should have filed sooner. Thought it would be a simpler task than it is, obviously :/
This is a very complicated task. We have family/model/stepping numbers, but there's no single map which says whether each of those combos has SSE2 or not. I spent almost a week looking up that data for 3.5 the last time we had this problem. (At the time, it was processors that had SSE2 but not CMOV, IIRC).
The query itself came from Jeff, and while he admits it's by no means exhaustive, he feels that it's a close enough approximation from which to reason. Are you saying that's not true?
bsmedberg thinks we should also track the VIA chipset without SSE2 that was used in some Netbooks. Please alter the query to be:

if vendor == AuthenticAMD and family >= 15
if vendor == GenuineIntel and (family >= 15 or (family == 6 and (model == 9 or model > 11)))
if vendor == CentaurHauls and family >= 6 and model >= 10
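
A minimal sketch of the criteria above as a single predicate, assuming the vendor is the CPUID vendor string and family/model are plain integers. The function name `has_sse2` is hypothetical, and the rules are the approximation from this bug, not an exhaustive CPU table. The Intel rule is parenthesized so that the family/model test only applies to GenuineIntel.

```python
def has_sse2(vendor: str, family: int, model: int) -> bool:
    """Approximate SSE2 support from CPUID vendor/family/model.

    Heuristic only: mirrors the three rules listed in this bug.
    """
    if vendor == "AuthenticAMD":
        return family >= 15
    if vendor == "GenuineIntel":
        return family >= 15 or (family == 6 and (model == 9 or model > 11))
    if vendor == "CentaurHauls":
        return family >= 6 and model >= 10
    # Any other vendor is treated as not matching the criteria.
    return False
```

For example, `has_sse2("GenuineIntel", 6, 8)` is false under these rules, while `has_sse2("GenuineIntel", 6, 9)` is true.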
I'll take this, I think I can do it from the nightly dump files.
Assignee: server-ops → benjamin
ok, pulling the .jsonz files to do this locally is going about 3/second, which will take more than a day. Alternately, let's try running this query against production:

SELECT cpu_info, COUNT(*) FROM reports
WHERE date_processed BETWEEN '2010-08-31 00:00:00' AND '2010-09-02 00:00:00'
GROUP BY cpu_info;

And attach it here (tab-delimited is fine). There's nothing private/sensitive about the result set.
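
A rough sketch of how the tab-delimited result could be bucketed once it is attached. It assumes each row is `cpu_info<TAB>count` and that `cpu_info` looks like `GenuineIntel family 6 model 23 stepping 10`; that layout, and the names `classify` and `summarize`, are assumptions for illustration. Rows that do not parse (for example NULL values) are counted as unknown.

```python
import re
from collections import Counter

# Assumed cpu_info layout: "<vendor> family <n> model <n> stepping <n>".
CPU_RE = re.compile(r"^(\w+) family (\d+) model (\d+)")


def classify(cpu_info: str) -> str:
    """Return 'sse2', 'no_sse2', or 'unknown' for one cpu_info string."""
    match = CPU_RE.match(cpu_info)
    if match is None:
        return "unknown"  # unparseable value, e.g. a NULL column
    vendor = match.group(1)
    family, model = int(match.group(2)), int(match.group(3))
    if vendor == "AuthenticAMD":
        ok = family >= 15
    elif vendor == "GenuineIntel":
        ok = family >= 15 or (family == 6 and (model == 9 or model > 11))
    elif vendor == "CentaurHauls":
        ok = family >= 6 and model >= 10
    else:
        ok = False  # vendors outside the criteria count as non-SSE2
    return "sse2" if ok else "no_sse2"


def summarize(lines):
    """Sum the report counts per bucket from 'cpu_info<TAB>count' rows."""
    totals = Counter()
    for line in lines:
        cpu_info, _, count = line.rstrip("\n").rpartition("\t")
        if not count.strip().isdigit():
            continue  # skip header or malformed rows
        totals[classify(cpu_info)] += int(count)
    return totals
```

Dividing each bucket by `sum(totals.values())` then gives the percentages.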
Anurag or Xavier, could one of you take a look at whether we can quickly run an MR for this?
Daniel - If I understand correctly, we are planning to query HBase @ prod for the above dates, look at processed_data:json for the CPU, and then build the report.
I can start writing the MR job. I can't guarantee results by EOD, but you will have them by tomorrow (Friday morning).
Got results, from three days of crashes:

Unknown: 2%
Has SSE2: 90%
No SSE2: 6%

I can break down "No SSE2" more if you need it.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Why don't the percentages add up? We're losing a full 2% of the data.

99% of the unknowns are "\N", so it might be worth taking a look at some of those crash stacks to see what may be the cause.

The processors representing a significant number of crashes are split between the Athlon XP (40738 crashes) and the Intel Pentium III (16664 crashes). Everything else is not significant (some in my non-significant category could be uncommon steppings of the XP or PIII, I didn't check fully).

All in all, I think it's amazing to see that this many people are still running really old CPUs, especially those PIIIs.

Could it be that their really old, never-updated computers keep crashing Firefox?
Maybe it would be useful to cross-reference those results with the Firefox version used.
The integer percentages were truncated instead of rounded:

Yes: 90.2%
No: 6.9%
Unknown: 2.9%
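
As a quick check of the arithmetic, using only the three figures above: truncating each share reproduces the 90/6/2 split that sums to 98, while rounding gives a total of 100.

```python
import math

# Shares from this comment, in percent.
shares = {"Has SSE2": 90.2, "No SSE2": 6.9, "Unknown": 2.9}

# Truncation drops the fractional part; rounding goes to the nearest integer.
truncated = {name: math.floor(value) for name, value in shares.items()}
rounded = {name: round(value) for name, value in shares.items()}

print(truncated, sum(truncated.values()))  # sums to 98
print(rounded, sum(rounded.values()))      # sums to 100
```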

I'll see if I can pull down a version breakdown for the two interesting cases.
   product   |   version    | total 
-------------+--------------+-------
 Firefox     | 3.6.8        | 14441
 Firefox     | 4.0b4        |  1430
 Firefox     | 3.5.11       |  1208
 Firefox     | 3.0.19       |   698
 Firefox     | 3.6.3        |   662
 Firefox     | 3.6.9        |   523
 Firefox     | 3.6.6        |   521
 Thunderbird | 3.1.2        |   460
 Firefox     | 3.6          |   347
 Firefox     | 4.0b5pre     |   237
 Firefox     | 3.6.4        |   185
 SeaMonkey   | 2.0.6        |   177
 Firefox     | 3.5.9        |   136
 Thunderbird | 3.0.6        |   120
 Firefox     | 4.0b3        |   109
 Firefox     | 4.0b1        |    85
 Firefox     | 4.0b2        |    82
 Firefox     | 3.6.7        |    64
 Firefox     | 3.6.2        |    60
 Firefox     | 3.0          |    51
 Firefox     | 3.1b3        |    48

So, fairly current, and even some beta testers.
(In reply to comment #15)
> 99% of the unknowns are "\N", so it might be worth taking a look at some of those
> crash stacks to see what may be the cause.

This is probably just minidumps that we can't process. (Malformed or empty.) It's a known issue.
What was the overall sample size? Are the 20,000 or so reports indicated in comment 17 the 6.9% "nos" or is that the total sample size?
Total sample size is 900k, I think... three days of crashes. Comment 17 was a different sample, one or two days I think.
I've put the data in a Google Doc, which makes it simpler to do some stats:
https://spreadsheets.google.com/ccc?key=0AvcJfo-nsaDFdE1uTm9hX0I4SjZtZ3N6Q2hMUWlVWFE

This document is world read/write.

I used http://www.cpu-world.com/cgi-bin/CPUID.pl to identify exactly which processors the family 6 results for AMD and Intel correspond to, and also the non-zero number of K6-2 results.

There's a crash result for a Pentium Pro model without MMX! (That's probably why it crashed!)

There are also 5 crashes for a processor nobody knows about (Intel family 6 model 2; every reference I've found lists nothing between the model 1 Pentium Pro and the model 3 PII).

From the data above, the sample size was around 925k identified CPUs, of which 2.05% were non-SSE2 PII/PIII Intel processors and 4.9% were non-SSE2 Athlon/Athlon XP AMD processors, 6.1% total.

I'm afraid it's clearly too much to ignore.
Sorry, 6.95% total. That is already more than the 6.9% total announced, because I compensated for the 3% of unidentified results, which are actually stacks that could not be processed, by removing them from the total count of results.
Do we know if the type of crashes being seen by non-SSE2 processors are unique to those families? Is it the case that they are crashing *because* they don't support SSE2? We currently don't run tests on non-SSE2 architecture, so it's entirely possible that they're over-representing, here, too.
That might be hard to determine without diagnosing a big list of crashes, given long-tail effects. What we could do is produce a top 300 (or so) list of non-SSE2 crashes, then compare the signatures to the overall (mostly SSE2) top crash list. I don't have SSE2 data in the .csv files, but if someone produces a sample covering a day's worth of data I could have a look.
If someone dumps a day's worth of data for all the non-SSE2 crashes in the tab-separated format

signature \t uuid_url

that would be enough to do some interesting comparisons.
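
A minimal sketch of that comparison, assuming both dumps use the `signature \t uuid_url` layout above, one crash per line. The helper names `top_signatures` and `only_in_first` are hypothetical, and the default of 300 is the "top 300 (or so)" figure suggested earlier.

```python
from collections import Counter


def top_signatures(lines, n=300):
    """Return the n most frequent signatures from 'signature<TAB>uuid_url' rows."""
    counts = Counter()
    for line in lines:
        signature, sep, _uuid_url = line.rstrip("\n").partition("\t")
        if sep and signature:  # ignore rows without a tab or a signature
            counts[signature] += 1
    return [signature for signature, _ in counts.most_common(n)]


def only_in_first(non_sse2_top, overall_top):
    """Signatures in the non-SSE2 top list that are missing from the overall one."""
    overall = set(overall_top)
    return [signature for signature in non_sse2_top if signature not in overall]
```

Signatures returned by `only_in_first` would be the candidates for crashes specific to non-SSE2 hardware; anything shared with the overall list is less likely to be SSE2-related.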

I've also added a bug to get this permanently added to the .csv files so we could continue to run checks.
(In reply to comment #25)
> I've also added a bug to get this permanently added to the .csv files so we
> could continue to run checks.

Bug 594160.
(In reply to comment #23)
> Do we know if the type of crashes being seen by non-SSE2 processors are unique
> to those families? Is it the case that they are crashing *because* they don't
> support SSE2? We currently don't run tests on non-SSE2 architecture, so it's
> entirely possible that they're over-representing, here, too.

If we're doing bad SSE detection, I'd expect these to be mostly EXCEPTION_ILLEGAL_INSTRUCTION / SIGILL. (We saw that in the past with the CMOV crash.)
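
One way to test that expectation, sketched under the assumption that a list of crash reason strings is available for the non-SSE2 reports. The function name `illegal_instruction_share` is hypothetical; the two reason names are the ones mentioned above.

```python
ILLEGAL_REASONS = ("EXCEPTION_ILLEGAL_INSTRUCTION", "SIGILL")


def illegal_instruction_share(reasons):
    """Fraction of crash reasons that indicate an illegal instruction."""
    reasons = list(reasons)
    if not reasons:
        return 0.0
    hits = sum(1 for reason in reasons
               if any(tag in reason for tag in ILLEGAL_REASONS))
    return hits / len(reasons)
```

A share for non-SSE2 reports that is much higher than the share for all reports would point toward bad SSE detection.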
Product: mozilla.org → mozilla.org Graveyard