Closed Bug 704176 Opened 11 years ago Closed 11 years ago

Telemetry sends bad buckets sometimes

Categories

(Toolkit :: Telemetry, defect)

x86
All
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla12

People

(Reporter: xstevens, Assigned: froydnj)

References

Details

Attachments

(3 files, 3 obsolete files)

I've been doing some analysis for taras and Telemetry sends bad buckets for some histograms which is causing the telemetry dashboard to try to display many more buckets than are possible.

One example from 10/23/2011 is on Firefox 8.0 build ID 20111011182523. There was a single doc that came up with buckets 6,19,59 for EARLY_GLUESTARTUP_READ_TRANSFER. No other builds on that version, that day used those buckets. I'll attach my data files.

I've also seen this happen in other versions and other histograms as well. Firefox 7.0.1 and CYCLE_COLLECTOR for instance.
Attached file histogram example
Also here are the CYCLE_COLLECTOR bucket counts by version by day. This just shows what versions are over their bucket_count (should be 50) for that histogram.

http://people.mozilla.org/~xstevens/telemetry/704176/
cc'ing taras and glandium.

glandium: taras asked me to see if you could look into this since he's out this week.
Assignee: nobody → tglek
Xavier, I could give you a list of buckets in every histogram, but we could also port the bucket generation algorithm to java so we dont have to do this every time we add a histogram.
We're going to piggyback on this feature to help detect corruption or machine mismatches when reading in histograms in bug 707320.
Assignee: taras.mozilla → nfroyd
Blocks: 707320
Attached patch don't reflect corrupt histograms (obsolete) — Splinter Review
Here's a patch to do some corruption detection on the client side.  We'll just not reflect things into JS that we know to be bad.

I'm a little unsure of what to do for JSHistogram_Snapshot.  Giving back garbage data (as we might do before) isn't desirable, but just punting and handing back nothing doesn't seem like the right idea either.  I've opted to punt and hand back nothing in this patch.  (In any event, that only really matters for cloned histograms and user-created histograms, which are not the focus here.)
Attachment #586512 - Flags: feedback?(taras.mozilla)
Comment on attachment 586512 [details] [diff] [review]
don't reflect corrupt histograms

s/SKIP/CORRUPT/. I think this should throw an exception instead of returning an empty snapshot object which could lead to undefined values propagating.

You should also add a histogram to track detected corruption. Might make sense to clear the histogram after detecting corruption too.

f+ since this is relatively close to correct.
Attachment #586512 - Flags: feedback?(taras.mozilla) → feedback+
Histogram::FindCorruption actually logs some instances of corruption (not all) in secret histograms.  So those would show up in telemetry pings, I guess?

I am a little leery of updating the histogram while you're looping over them in GetHistogramSnapshots.  One could make a pre-pass to find corrupted instances and then only reflect the ones that are OK, though...do we have proper bit vectors somewhere?
(In reply to Nathan Froyd (:froydnj) from comment #9)
> Histogram::FindCorruption actually logs some instances of corruption (not
> all) in secret histograms.  So those would show up in telemetry pings, I
> guess?

sounds good

> 
> I am a little leery of updating the histogram while you're looping over them
> in GetHistogramSnapshots.  One could make a pre-pass to find corrupted
> instances and then only reflect the ones that are OK, though...do we have
> proper bit vectors somewhere?

not aware of any(short of vector<bool>)
Part 2 is going to maintain a corruption flag for every statically defined histogram.  To do that, we need some way of mapping from a histogram (histogram name) to the ID associated with it.  This function is the way to do that.

I know that this function is logically a boolean function, but kept the nsresult return code so that we get more precise errors in GetHistogramByName.
Attachment #587056 - Flags: review?(taras.mozilla)
OK, so this addresses feedback comments.  We now identify all corrupt (static) histograms prior to reflection and mark them as such.  We don't reflect Histogram's secret histograms; we instead maintain our own for the different types of corruption that FindCorruption can report.

I chose not to accumulate histograms in ReflectHistogramSnapshot because a) we're already doing it in HistogramSnapshots and friends; and b) if we're not calling ReflectHistogramSnapshot from HistogramSnapshots, then we're calling it for a user-defined histogram, which we'll just report an error for.  (And with an eye to the future, when we call ReflectHistogramSnapshot when reifying snapshots that came from disk, we don't want to measure corruption in those, just the ones we're accumulating this session.)
Attachment #586512 - Attachment is obsolete: true
Attachment #587057 - Flags: review?(taras.mozilla)
Attachment #587056 - Flags: review?(taras.mozilla) → review+
Comment on attachment 587057 [details] [diff] [review]
part 2 - don't reflect corrupt histograms

don't really like that histograms are looked up by string.

You are doing 2x as many string comparisons than sensible. 
strcmp(name, "Histogram.InconsistentCountHigh") check should go into NS_FAILED, but I guess that lazily creates histograms, so up to you if you want to create an immutable version of GetHistogramEnumId.
Attachment #587057 - Flags: review?(taras.mozilla) → review+
This could benefit from some sort of test coverage, but I don't see how you'd do that.
Address review comments, in particular moving strcmps out of the common path.  Tidied comments, too.
Attachment #587057 - Attachment is obsolete: true
Attachment #587322 - Flags: review+
Keywords: checkin-needed
Keywords: checkin-needed
Keywords: checkin-needed
backed out due to build bustage

../components/telemetry/Telemetry.o: In function `IdentifyCorruptHistograms':
/builds/slave/m-in-lnx/build/obj-firefox/toolkit/components/telemetry/../../../../toolkit/components/telemetry/Telemetry.cpp:529: undefined reference to `(anonymous namespace)::TelemetryImpl::gCorruptHistograms'
/builds/slave/m-in-lnx/build/obj-firefox/toolkit/components/telemetry/../../../../toolkit/components/telemetry/Telemetry.cpp:552: undefined reference to `(anonymous namespace)::TelemetryImpl::gCorruptHistograms'
../components/telemetry/Telemetry.o: In function `(anonymous namespace)::TelemetryImpl::GetHistogramSnapshots(JSContext*, JS::Value*)':
/builds/slave/m-in-lnx/build/obj-firefox/toolkit/components/telemetry/../../../../toolkit/components/telemetry/Telemetry.cpp:601: undefined reference to `(anonymous namespace)::TelemetryImpl::gCorruptHistograms'
collect2: ld returned 1 exit status
Things work better if you actually declare storage for static variables.
Attachment #587322 - Attachment is obsolete: true
Attachment #588594 - Flags: review+
Keywords: checkin-needed
Try run for cb131e9edb0b is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=cb131e9edb0b
Results (out of 277 total builds):
    exception: 2
    success: 231
    warnings: 28
    failure: 16
Builds (or logs if builds failed) available at:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/philringnalda@gmail.com-cb131e9edb0b
Are we doing any kind of consistency checking on the server side?  This is important, per the end-to-end principal.
Blocks: 721150
You need to log in before you can comment on or make changes to this bug.