Closed Bug 704176 Opened 13 years ago Closed 12 years ago

Telemetry sends bad buckets sometimes

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla12

People

(Reporter: xstevens, Assigned: froydnj)

References

Details

Attachments

(3 files, 3 obsolete files)

histogram example 13 years ago Xavier Stevens [:xstevens] 7.71 KB, text/plain		Details
don't reflect corrupt histograms 13 years ago Nathan Froyd [:froydnj] 3.27 KB, patch	taras.mozilla : feedback+	Details \| Diff \| Splinter Review
part 1 - add GetHistogramEnumId 13 years ago Nathan Froyd [:froydnj] 2.45 KB, patch	taras.mozilla : review+	Details \| Diff \| Splinter Review
part 2 - don't reflect corrupt histograms 13 years ago Nathan Froyd [:froydnj] 8.70 KB, patch	taras.mozilla : review+	Details \| Diff \| Splinter Review
part 2 - don't reflect corrupt histograms, v2 13 years ago Nathan Froyd [:froydnj] 8.94 KB, patch	froydnj : review+	Details \| Diff \| Splinter Review
part 2 - don't reflect corrupt histograms, v3 13 years ago Nathan Froyd [:froydnj] 8.80 KB, patch	froydnj : review+	Details \| Diff \| Splinter Review

Xavier Stevens [:xstevens]

Reporter

Description

•

13 years ago

I've been doing some analysis for taras and Telemetry sends bad buckets for some histograms which is causing the telemetry dashboard to try to display many more buckets than are possible.

One example from 10/23/2011 is on Firefox 8.0 build ID 20111011182523. There was a single doc that came up with buckets 6,19,59 for EARLY_GLUESTARTUP_READ_TRANSFER. No other builds on that version, that day used those buckets. I'll attach my data files.

I've also seen this happen in other versions and other histograms as well. Firefox 7.0.1 and CYCLE_COLLECTOR for instance.

Xavier Stevens [:xstevens]

Reporter

Comment 1

•

13 years ago

Attached file histogram example — Details

Xavier Stevens [:xstevens]

Reporter

Comment 2

•

13 years ago

Also here are the CYCLE_COLLECTOR bucket counts by version by day. This just shows what versions are over their bucket_count (should be 50) for that histogram.

http://people.mozilla.org/~xstevens/telemetry/704176/

Xavier Stevens [:xstevens]

Reporter

Comment 3

•

13 years ago

cc'ing taras and glandium.

glandium: taras asked me to see if you could look into this since he's out this week.

(dormant account)

Updated

•

13 years ago

Assignee: nobody → tglek

(dormant account)

Comment 4

•

13 years ago

Xavier, I could give you a list of buckets in every histogram, but we could also port the bucket generation algorithm to java so we dont have to do this every time we add a histogram.

Nathan Froyd [:froydnj]

Assignee

Comment 6

•

13 years ago

We're going to piggyback on this feature to help detect corruption or machine mismatches when reading in histograms in bug 707320.

Assignee: taras.mozilla → nfroyd

Blocks: 707320

Nathan Froyd [:froydnj]

Assignee

Comment 7

•

13 years ago

Attached patch don't reflect corrupt histograms (obsolete) — Details — Splinter Review

Here's a patch to do some corruption detection on the client side.  We'll just not reflect things into JS that we know to be bad.

I'm a little unsure of what to do for JSHistogram_Snapshot.  Giving back garbage data (as we might do before) isn't desirable, but just punting and handing back nothing doesn't seem like the right idea either.  I've opted to punt and hand back nothing in this patch.  (In any event, that only really matters for cloned histograms and user-created histograms, which are not the focus here.)

Attachment #586512 - Flags: feedback?(taras.mozilla)

(dormant account)

Comment 8

•

13 years ago

Comment on attachment 586512 [details] [diff] [review]
don't reflect corrupt histograms

s/SKIP/CORRUPT/. I think this should throw an exception instead of returning an empty snapshot object which could lead to undefined values propagating.

You should also add a histogram to track detected corruption. Might make sense to clear the histogram after detecting corruption too.

f+ since this is relatively close to correct.

Attachment #586512 - Flags: feedback?(taras.mozilla) → feedback+

Nathan Froyd [:froydnj]

Assignee

Comment 9

•

13 years ago

Histogram::FindCorruption actually logs some instances of corruption (not all) in secret histograms.  So those would show up in telemetry pings, I guess?

I am a little leery of updating the histogram while you're looping over them in GetHistogramSnapshots.  One could make a pre-pass to find corrupted instances and then only reflect the ones that are OK, though...do we have proper bit vectors somewhere?

(dormant account)

Comment 10

•

13 years ago

(In reply to Nathan Froyd (:froydnj) from comment #9)
> Histogram::FindCorruption actually logs some instances of corruption (not
> all) in secret histograms.  So those would show up in telemetry pings, I
> guess?

sounds good

> 
> I am a little leery of updating the histogram while you're looping over them
> in GetHistogramSnapshots.  One could make a pre-pass to find corrupted
> instances and then only reflect the ones that are OK, though...do we have
> proper bit vectors somewhere?

not aware of any(short of vector<bool>)

Nathan Froyd [:froydnj]

Assignee

Comment 11

•

13 years ago

Attached patch part 1 - add GetHistogramEnumId — Details — Splinter Review

Part 2 is going to maintain a corruption flag for every statically defined histogram.  To do that, we need some way of mapping from a histogram (histogram name) to the ID associated with it.  This function is the way to do that.

I know that this function is logically a boolean function, but kept the nsresult return code so that we get more precise errors in GetHistogramByName.

Attachment #587056 - Flags: review?(taras.mozilla)

Nathan Froyd [:froydnj]

Assignee

Comment 12

•

13 years ago

Attached patch part 2 - don't reflect corrupt histograms (obsolete) — Details — Splinter Review

OK, so this addresses feedback comments.  We now identify all corrupt (static) histograms prior to reflection and mark them as such.  We don't reflect Histogram's secret histograms; we instead maintain our own for the different types of corruption that FindCorruption can report.

I chose not to accumulate histograms in ReflectHistogramSnapshot because a) we're already doing it in HistogramSnapshots and friends; and b) if we're not calling ReflectHistogramSnapshot from HistogramSnapshots, then we're calling it for a user-defined histogram, which we'll just report an error for.  (And with an eye to the future, when we call ReflectHistogramSnapshot when reifying snapshots that came from disk, we don't want to measure corruption in those, just the ones we're accumulating this session.)

Attachment #586512 - Attachment is obsolete: true

Attachment #587057 - Flags: review?(taras.mozilla)

(dormant account)

Updated

•

13 years ago

Attachment #587056 - Flags: review?(taras.mozilla) → review+

(dormant account)

Comment 13

•

13 years ago

Comment on attachment 587057 [details] [diff] [review]
part 2 - don't reflect corrupt histograms

don't really like that histograms are looked up by string.

You are doing 2x as many string comparisons than sensible. 
strcmp(name, "Histogram.InconsistentCountHigh") check should go into NS_FAILED, but I guess that lazily creates histograms, so up to you if you want to create an immutable version of GetHistogramEnumId.

Attachment #587057 - Flags: review?(taras.mozilla) → review+

(dormant account)

Comment 14

•

13 years ago

This could benefit from some sort of test coverage, but I don't see how you'd do that.

Nathan Froyd [:froydnj]

Assignee

Comment 15

•

13 years ago

Attached patch part 2 - don't reflect corrupt histograms, v2 (obsolete) — Details — Splinter Review

Address review comments, in particular moving strcmps out of the common path.  Tidied comments, too.

Attachment #587057 - Attachment is obsolete: true

Attachment #587322 - Flags: review+

Nathan Froyd [:froydnj]

Assignee

Updated

•

13 years ago

Keywords: checkin-needed

Nathan Froyd [:froydnj]

Assignee

Updated

•

13 years ago

Keywords: checkin-needed

Nathan Froyd [:froydnj]

Assignee

Updated

•

13 years ago

Keywords: checkin-needed

Dão Gottwald [:dao]

Comment 16

•

13 years ago

http://hg.mozilla.org/integration/mozilla-inbound/rev/465bacf25d61
http://hg.mozilla.org/integration/mozilla-inbound/rev/47c464033fa8

Keywords: checkin-needed

Target Milestone: --- → mozilla12

Marco Bonardo [:mak] (Away Apr 25 - May 5)

Comment 17

•

13 years ago

backed out due to build bustage

../components/telemetry/Telemetry.o: In function `IdentifyCorruptHistograms':
/builds/slave/m-in-lnx/build/obj-firefox/toolkit/components/telemetry/../../../../toolkit/components/telemetry/Telemetry.cpp:529: undefined reference to `(anonymous namespace)::TelemetryImpl::gCorruptHistograms'
/builds/slave/m-in-lnx/build/obj-firefox/toolkit/components/telemetry/../../../../toolkit/components/telemetry/Telemetry.cpp:552: undefined reference to `(anonymous namespace)::TelemetryImpl::gCorruptHistograms'
../components/telemetry/Telemetry.o: In function `(anonymous namespace)::TelemetryImpl::GetHistogramSnapshots(JSContext*, JS::Value*)':
/builds/slave/m-in-lnx/build/obj-firefox/toolkit/components/telemetry/../../../../toolkit/components/telemetry/Telemetry.cpp:601: undefined reference to `(anonymous namespace)::TelemetryImpl::gCorruptHistograms'
collect2: ld returned 1 exit status

Nathan Froyd [:froydnj]

Assignee

Comment 18

•

13 years ago

Attached patch part 2 - don't reflect corrupt histograms, v3 — Details — Splinter Review

Things work better if you actually declare storage for static variables.

Attachment #587322 - Attachment is obsolete: true

Attachment #588594 - Flags: review+

Nathan Froyd [:froydnj]

Assignee

Updated

•

13 years ago

Keywords: checkin-needed

Mozilla RelEng Bot

Comment 19

•

12 years ago

Try run for cb131e9edb0b is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=cb131e9edb0b
Results (out of 277 total builds):
    exception: 2
    success: 231
    warnings: 28
    failure: 16
Builds (or logs if builds failed) available at:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/philringnalda@gmail.com-cb131e9edb0b

Phil Ringnalda (:philor)

Comment 20

•

12 years ago

Pfft. Only 46 failures? I'll push that.

https://hg.mozilla.org/integration/mozilla-inbound/rev/33e46ccc6d5c
https://hg.mozilla.org/integration/mozilla-inbound/rev/af6585a74bb9

Keywords: checkin-needed

Justin Wood (:Callek)

Comment 21

•

12 years ago

https://hg.mozilla.org/mozilla-central/rev/33e46ccc6d5c
https://hg.mozilla.org/mozilla-central/rev/af6585a74bb9

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Justin Lebar (not reading bugmail)

Comment 22

•

12 years ago

Are we doing any kind of consistency checking on the server side?  This is important, per the end-to-end principal.

(dormant account)

Updated

•

12 years ago

Blocks: 721150

You need to log in before you can comment on or make changes to this bug.