Closed Bug 544583 Opened 14 years ago Closed 14 years ago

Identify source of TCBS data changes between 3/5 and 3/4

Categories

(Socorro :: General, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ozten, Unassigned)

Details

Attachments

(1 file)

Ken Kovash noticed that the ADU numbers retrieved today (3/5) differ from the ones retrieved yesterday.

Date	Firefox 3.5.7 Crashes (retrieved today)	Firefox 3.5.7 Crashes (retrieved yesterday)
1/22/10	15,535	15,535
1/23/10	14,911	122,160
1/24/10	118,930	14,392
1/25/10	16,389	125,948
1/26/10	16,727	16,727
1/27/10	112,084	112,084
1/28/10	16,116	126,000
1/29/10	11,347	91,573
1/30/10	6,718	6,718
1/31/10	6,873	6,873
2/1/10	38,016	4,771
2/2/10	12,626	12,626
2/3/10	10,431	16,087

ADU data comes from:
raw_adu and our top_crashes_by_signature (TCBS) tables.
TCBS is calculated on date_processed on an hourly basis.

Once created, this data does not change.

Is any of the above incorrect?
Did we rerun TCBS reports for 1/23, 1/24, 1/25, 1/28, 1/29, and 2/1 to backfill missing data after our data recovery efforts?
Odd that 1/23, 1/25 and 1/28 grow by 8x
and 1/24 and 2/1 shrink by 8x

(not exactly 8, no I'm not doing drugs)
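To put a number on the "roughly 8x" observation, the ratios can be computed straight from the table in this comment. This is only a minimal Python sketch: the names `crashes` and `changed_ratios` are made up for illustration, and the figures are copied from the table above.

```python
# Figures copied from the table above:
# date -> (crashes retrieved today, crashes retrieved yesterday)
crashes = {
    "1/22/10": (15535, 15535),
    "1/23/10": (14911, 122160),
    "1/24/10": (118930, 14392),
    "1/25/10": (16389, 125948),
    "1/26/10": (16727, 16727),
    "1/27/10": (112084, 112084),
    "1/28/10": (16116, 126000),
    "1/29/10": (11347, 91573),
    "1/30/10": (6718, 6718),
    "1/31/10": (6873, 6873),
    "2/1/10": (38016, 4771),
    "2/2/10": (12626, 12626),
    "2/3/10": (10431, 16087),
}


def changed_ratios(rows):
    """Return {date: yesterday / today} for dates whose two retrievals differ."""
    return {
        date: yesterday / today
        for date, (today, yesterday) in rows.items()
        if today != yesterday
    }


for date, ratio in changed_ratios(crashes).items():
    print(f"{date}: {ratio:.2f}x")
```

Run against the table, this flags 2/3 as changed too, although by a much smaller factor than the other dates.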
Adding link to daily ADU report, which is the report referenced in Comment 0.
This does look like a change in the contents of the top_crashes_by_signature table to me. No way to directly examine that. I suspect that if there was a change, Aravind had a hand in doing it. Added him to the CC list.
Filed a bug to refresh our development copy of production data.
Depends on: 544608
Is this fixed?
I'm also trying to figure out if this is behind some of the spikes we see for some signatures over the last few days for bugs like https://bugzilla.mozilla.org/show_bug.cgi?id=542203

day               count  signature

20100122-crashdata 146 nsXHREventTarget::GetParentObject
20100123-crashdata 148 nsXHREventTarget::GetParentObject
20100124-crashdata 145 nsXHREventTarget::GetParentObject
20100125-crashdata 1950 nsXHREventTarget::GetParentObject
20100126-crashdata 11731 nsXHREventTarget::GetParentObject
20100127-crashdata 10300 nsXHREventTarget::GetParentObject
20100129-crashdata 1143 nsXHREventTarget::GetParentObject
20100130-crashdata 467 nsXHREventTarget::GetParentObject
20100131-crashdata 444 nsXHREventTarget::GetParentObject
172:crashdata chofmann$ ./stacktrend.sh nsXHREventTarget::GetParentObject 201002*

date nsXHREventTarget::GetParentObjectcrashes
20100201-crashdata 761 nsXHREventTarget::GetParentObject
20100202-crashdata 527 nsXHREventTarget::GetParentObject
20100203-crashdata 406 nsXHREventTarget::GetParentObject
20100204-crashdata 328 nsXHREventTarget::GetParentObject
these are counts out of the .csv files.
I've traced the web service call that populates this data...

SQL for 1/24 outputs
2010-01-24 00:00:00	Linux	20
2010-01-24 00:00:00	Mac	8295
2010-01-24 00:00:00	Windows	61472
2010-01-24 00:00:00	Windows	12

Service responds with
Windows - 12 crashes 57434300 users
Linux - 20 crashes 673199 users
Mac - 8295 crashes 4025750 users

Why the fluctuation? It's possible that the order in which the results come back is nondeterministic and the service layer only uses the first row per OS.
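The hypothesis can be illustrated with the 1/24 rows from the SQL output above. This is a minimal Python sketch, not the actual Socorro service code: `first_row_per_os` is a hypothetical stand-in for the suspected behavior, and `sum_per_os` is one possible remedy. Whether the duplicate Windows rows should really be summed is an assumption.

```python
# Rows copied from the SQL output for 1/24 above: (os_name, crash count).
rows = [
    ("Linux", 20),
    ("Mac", 8295),
    ("Windows", 61472),
    ("Windows", 12),
]


def first_row_per_os(rows):
    """Suspected behavior (assumption): keep only the first row seen per OS."""
    result = {}
    for os_name, count in rows:
        result.setdefault(os_name, count)  # later rows for the same OS are ignored
    return result


def sum_per_os(rows):
    """Order-independent alternative (assumption): add up all rows per OS."""
    totals = {}
    for os_name, count in rows:
        totals[os_name] = totals.get(os_name, 0) + count
    return totals


print(first_row_per_os(rows))        # Windows taken from the 61472 row
print(first_row_per_os(rows[::-1]))  # same rows, reversed: Windows taken from the 12 row
print(sum_per_os(rows))              # same totals regardless of row order
```

With the rows in the reversed order, the first-row approach reports 12 Windows crashes, which matches what the service returned. Summing per OS gives the same answer whatever order the rows arrive in.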
Depends on: 545035
I looked at the hourly rates for total, 3.6, and 3.5.7 for recent days of the .csv files that have been produced after the system downtime on 2010 01 28.

overall crash volume, in total crashes per hour, is down, 3.5.7 is down as well, and 3.6 is up a little.  this would be explained by people moving from 3.5.7 to 3.6, and by the fact that 3.6 throttling was adjusted from 25% to 15% coming out of the Socorro 1.4 update on 1/28.

so overall these numbers seem to be reasonable.

that would leave the strange spikes in some signatures like in comment 6 unexplained.
How are we doing here?  Are the numbers consistent?  Can Ken do the things he needs to do to compare 3.5 w/ 3.6 yet?  Who owns this bug?
The numbers I get using data from the .csv files are looking great for firefox 3.6

https://wiki.mozilla.org/CrashKill/Crashr#3.6_RC1.2C_RC2.2C_Final  shows 3.4-3.8 crashes per 100 users, which would make it more stable than 3.0.x

there is one outlier on 2010 02 05, but it's possible that we just didn't get all the data on that day.  there are about 12k fewer crashes in the .csv file for that day than I would have expected,
and it looks like the .csv file for that date only contains data up to 4pm

crash count  date_processed

7186 2010020500
7943 2010020501
8314 2010020502
8628 2010020503
9963 2010020504
11252 2010020505
12234 2010020506
12553 2010020507
13030 2010020508
12912 2010020509
12535 2010020510
12127 2010020511
12052 2010020512
11297 2010020513
10350 2010020514
9634 2010020515
8166 2010020516
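The "only up to 4pm" observation can be checked mechanically from the listing above. This is a minimal Python sketch, assuming each line is a count followed by a date_processed bucket in YYYYMMDDHH form (inferred from the listing). The name `missing_hours` is made up for illustration.

```python
# Hourly counts for 2010 02 05, copied from the listing above.
hourly = """\
7186 2010020500
7943 2010020501
8314 2010020502
8628 2010020503
9963 2010020504
11252 2010020505
12234 2010020506
12553 2010020507
13030 2010020508
12912 2010020509
12535 2010020510
12127 2010020511
12052 2010020512
11297 2010020513
10350 2010020514
9634 2010020515
8166 2010020516
"""


def missing_hours(text):
    """Return the hours of the day (0-23) that have no row in the listing."""
    seen = set()
    for line in text.splitlines():
        _count, stamp = line.split()
        seen.add(int(stamp[-2:]))  # last two digits are the hour (assumed format)
    return [hour for hour in range(24) if hour not in seen]


print(missing_hours(hourly))
```

Any hour reported as missing is one for which the .csv file has no rows at all, which would account for a shortfall in the daily total.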
was this resolved?  it'd be nice to be able to do analysis with our crash data and know that we're 100% sure of the underlying data.
The fix is on stage at http://crash-stats.stage.mozilla.com/ and is scheduled for production Thursday night.

Please help test the fix (Bug#545035) on stage.
Looking good in production.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Component: Socorro → General
Product: Webtools → Socorro