Closed
Bug 707320
Opened 13 years ago
Closed 13 years ago
Allow histograms to persist to disk
Categories
(Toolkit :: Telemetry, defect)
Toolkit
Telemetry
Tracking
()
RESOLVED
FIXED
mozilla13
Tracking | Status | |
---|---|---|
firefox12 | + | unaffected |
firefox13 | --- | fixed |
People
(Reporter: taras.mozilla, Assigned: froydnj)
References
(Blocks 1 open bug)
Details
(Whiteboard: [qa-])
Attachments
(5 files, 13 obsolete files)
1.23 KB,
patch
|
froydnj
:
review+
|
Details | Diff | Splinter Review |
27.59 KB,
patch
|
taras.mozilla
:
review+
akeybl
:
approval-mozilla-aurora+
mccr8
:
checkin+
|
Details | Diff | Splinter Review |
13.44 KB,
patch
|
froydnj
:
review+
|
Details | Diff | Splinter Review |
10.34 KB,
patch
|
froydnj
:
review+
|
Details | Diff | Splinter Review |
2.53 KB,
patch
|
froydnj
:
review+
|
Details | Diff | Splinter Review |
This way we can track stuff on shutdown.
Assignee | ||
Comment 1•13 years ago
|
||
I guess this would block bug 706647; we're not going to want to add a telemetry ping to shutdown to report on how long it took to shutdown plugins?
Blocks: 706647
Reporter | ||
Comment 2•13 years ago
|
||
I was talking to Nathan/Sid about this. Seems like the way to do this is to serialize data on shutdown, then
on startup:
a) if there is no serialized telemetry data, send a ping same as we do now
b) if there is serialized data:
b1) send serialized data
b2) reset UID, wipe all histograms
Keywords: privacy-review-needed
Summary: Allow certain histograms to persist to disk → Allow histograms to persist to disk
Assignee | ||
Comment 3•13 years ago
|
||
This is basically what I'm implementing. I'm planning on marking certain histograms as "persistent" so we're not saving several tens of histograms to disk. Then we can send those the data from those histograms as an initial ping after startup, along with the saved UUID. We don't have to worry about combining histograms--I'm planning on just reflecting the serialized data into JS--and we don't have to deal with clearing histograms, either.
Reporter | ||
Comment 4•13 years ago
|
||
Nathan, after reflecting on this for a while: what's the benefit of not storing certain histograms on disk? Disk used by histograms should be pretty minimal, we can further reduce it with zlib compression(if needed).
Assignee | ||
Comment 5•13 years ago
|
||
Posting a WIP. This patch, however trivial, is ready for review.
This patch is driven by a couple of observations:
1. We're only going to serialize SampleSets at shutdown;
2. We don't want to mix previous-session SampleSets and current-session SampleSets;
3. We want to reflect previous-session information into JS similarly to how histogramSnapshots works now.
All of these mean that we need an interface to reflect SampleSets that didn't come from Histogram::SnapshotSample. This patch provides that interface.
Attachment #581294 -
Flags: review?(mozilla)
Assignee | ||
Comment 6•13 years ago
|
||
This patch implements the actual save-and-load histogram bits. It persists all histograms to disk, rather than marking specific histograms as persistent. Back-of-the-envelope calculations suggest that the persistent file is going to be 10s of K, so not too bad.
I haven't tested it because I don't have the JavaScript bits done yet; those bits would be the next part. Should be about 50 lines or so.
Feedback on the approach welcome; if you think it's good enough for r+, please indicate as such.
Attachment #581296 -
Flags: feedback?(mozilla)
Reporter | ||
Comment 7•13 years ago
|
||
Comment on attachment 581294 [details] [diff] [review]
0. provide reflection for separate Histograms and SampleSets
should ss be const?
Attachment #581294 -
Flags: review?(taras.mozilla) → review+
Reporter | ||
Comment 8•13 years ago
|
||
Comment on attachment 581296 [details] [diff] [review]
1. C++ interface bits for saving histograms to disk
>+ Histogram::SampleSet ss;
>+ h->SnapshotSample(&ss);
>+
>+ // Write out explicit lengths to improve corruption detection.
>+ size_t name_length = strlen(gHistograms[i].id);
>+ if (!(pickle.WriteData(gHistograms[i].id, name_length)
>+ && ss.Serialize(&pickle))) {
>+ return false;
>+ }
>+ }
>+
>+ return true;
Don't need an if statement there. Same elsewhere
>+}
>+
>+bool
>+DeserializeHistogramData(AutoFDClose &fd, Pickle &pickle, const char *uuid,
>+ void **iter, JSContext *cx, jsval *ret)
>+{
>+ size_t n_histograms = ArrayLength(gHistograms);
Use HistogramCount here and elsewhere
Attachment #581296 -
Flags: feedback?(taras.mozilla)
Reporter | ||
Comment 9•13 years ago
|
||
Comment on attachment 581296 [details] [diff] [review]
1. C++ interface bits for saving histograms to disk
You should make sure that you either don't serialize empty/non-existent histograms or don't deserialize them
Reporter | ||
Comment 10•13 years ago
|
||
Comment on attachment 581296 [details] [diff] [review]
1. C++ interface bits for saving histograms to disk
Should also read/write off main thread on the C++ side. ie have the api be read(callback). For writes we can just copy data into a buffer, fire off a thread and forget about it.
Assignee | ||
Comment 11•13 years ago
|
||
(In reply to Taras Glek (:taras) from comment #8)
> >+ Histogram::SampleSet ss;
> >+ h->SnapshotSample(&ss);
> >+
> >+ // Write out explicit lengths to improve corruption detection.
> >+ size_t name_length = strlen(gHistograms[i].id);
> >+ if (!(pickle.WriteData(gHistograms[i].id, name_length)
> >+ && ss.Serialize(&pickle))) {
> >+ return false;
> >+ }
> >+ }
> >+
> >+ return true;
>
> Don't need an if statement there. Same elsewhere
Why not? We had an error, no sense in writing the rest of the file if there were problems.
Assignee | ||
Comment 12•13 years ago
|
||
(In reply to Taras Glek (:taras) from comment #10)
> Should also read/write off main thread on the C++ side. ie have the api be
> read(callback). For writes we can just copy data into a buffer, fire off a
> thread and forget about it.
I assume for reads, you mean something like NS_DispatchToCurrentThread? For writes, will there be problems using the JSContext we received on a different thread?
Reporter | ||
Comment 13•13 years ago
|
||
(In reply to Nathan Froyd (:froydnj) from comment #11)
> (In reply to Taras Glek (:taras) from comment #8)
> > >+ Histogram::SampleSet ss;
> > >+ h->SnapshotSample(&ss);
> > >+
> > >+ // Write out explicit lengths to improve corruption detection.
> > >+ size_t name_length = strlen(gHistograms[i].id);
> > >+ if (!(pickle.WriteData(gHistograms[i].id, name_length)
> > >+ && ss.Serialize(&pickle))) {
> > >+ return false;
> > >+ }
> > >+ }
> > >+
> > >+ return true;
> >
> > Don't need an if statement there. Same elsewhere
>
> Why not? We had an error, no sense in writing the rest of the file if there
> were problems.
return pickle.WriteData(gHistograms[i].id, name_length) && ss.Serialize(&pickle);
>
> I assume for reads, you mean something like NS_DispatchToCurrentThread? For
> writes, will there be problems using the JSContext we received on a
> different thread?
Yes and yes, you should extract data out of js on main thread.
Assignee | ||
Comment 14•13 years ago
|
||
(In reply to Taras Glek (:taras) from comment #13)
> (In reply to Nathan Froyd (:froydnj) from comment #11)
> > Why not? We had an error, no sense in writing the rest of the file if there
> > were problems.
>
> return pickle.WriteData(gHistograms[i].id, name_length) &&
> ss.Serialize(&pickle);
That if statement lives in a loop over all histograms; we don't want to return after writing the first one.
Assignee | ||
Comment 15•13 years ago
|
||
Totally redid I/O to defer it to a more convenient time.
TelemetrySessionData is an XPCOM object for convenience reasons; if we really want to, I suppose it could be turned into a JS object instead.
Attachment #581296 -
Attachment is obsolete: true
Attachment #584610 -
Flags: review?(taras.mozilla)
Assignee | ||
Comment 16•13 years ago
|
||
Hooks TelemetryPing up to the new C++ functionality.
Attachment #584611 -
Flags: review?(taras.mozilla)
Assignee | ||
Comment 17•13 years ago
|
||
Basic tests that callbacks and such work.
Attachment #584613 -
Flags: review?(taras.mozilla)
Reporter | ||
Comment 18•13 years ago
|
||
Comment on attachment 584610 [details] [diff] [review]
1. C++ interface bits for saving histograms to disk
This seems a bit too complicated, but I guess using xpcom to save function references does that.
TelemetrySessionData::DeserializeHistogramData
needs assertions/comments. Eg it's not clear why we'd have an entry serialized, but not read it in.
Don't use new/delete, use nsAutoPtr, nsAutoPtr::forget to transfer ownership.
I don't think you need two separate completion interfaces. One interfaces with "saved" "loaded" would do. I think xpcom allows leaving js methods unimplemented (ie they just return some nsresult failure code).
Attachment #584610 -
Flags: review?(taras.mozilla) → review-
Reporter | ||
Comment 19•13 years ago
|
||
Comment on attachment 584611 [details] [diff] [review]
2. JS bits to send previously-saved session bits
This is pretty simple(in a good way).
A couple of observations:
1) I think we should set a custom reason string for the persisted data since it should be somewhat different from short-termed idle-daily data. Call it "unpickled" or something.
2) What happens when a) submission for old data fails, b) what if we shutdown before resubmitting data. Might be worth making it more persistent..ie keep last x sessions?
r-ing to address 1. Can do 2 in a followup bug.
Attachment #584611 -
Flags: review?(taras.mozilla) → review-
Reporter | ||
Comment 20•13 years ago
|
||
Comment on attachment 584613 [details] [diff] [review]
3. Tests for C++ load/save
I'm not sure if tests can pollute the temp dir. Can you ask on #developers if that's ok?
Other than that looks good.
Attachment #584613 -
Flags: review?(taras.mozilla) → review+
Assignee | ||
Comment 21•13 years ago
|
||
(In reply to Taras Glek (:taras) from comment #18)
> I don't think you need two separate completion interfaces. One interfaces
> with "saved" "loaded" would do. I think xpcom allows leaving js methods
> unimplemented (ie they just return some nsresult failure code).
My impression from chatting on #developers about how to do callbacks suggested that the [function] interfaces needed to be single-method so that XPCOM knows which method the callback is meant to implement. We do have an instance or two of multi-method [function] interfaces in the tree, but I don't know how those work or if they work at all (might be dead code?).
(In reply to Taras Glek (:taras) from comment #20)
> I'm not sure if tests can pollute the temp dir. Can you ask on #developers
> if that's ok?
The tests for the StartupCache touch the temp dir, so there's at least precedent. The removeFile do_register_cleanup was intended to attempt to clean up after ourselves.
Assignee | ||
Comment 22•13 years ago
|
||
(In reply to Taras Glek (:taras) from comment #19)
> 1) I think we should set a custom reason string for the persisted data since
> it should be somewhat different from short-termed idle-daily data. Call it
> "unpickled" or something.
AFAICS, the reason parameter to send() is never actually sent; it's just used to determine whether to send to a fixed URL or a session-unique UUID URL. Is that what you were thinking of changing? If so, did you want to include that in the ping payload?
Reporter | ||
Comment 23•13 years ago
|
||
(In reply to Nathan Froyd (:froydnj) from comment #22)
> (In reply to Taras Glek (:taras) from comment #19)
> > 1) I think we should set a custom reason string for the persisted data since
> > it should be somewhat different from short-termed idle-daily data. Call it
> > "unpickled" or something.
>
> AFAICS, the reason parameter to send() is never actually sent; it's just
> used to determine whether to send to a fixed URL or a session-unique UUID
> URL. Is that what you were thinking of changing? If so, did you want to
> include that in the ping payload?
You are right.
http://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/TelemetryPing.js#210 reason is sent here.
Still seems like we should change it for picked packets.
re callbacks: It's a minor enough api, existing approach is fine.
Comment 24•13 years ago
|
||
removing privacy-review-needed keyword as this has been updated in the privacy review for Telemetry. froydnj: is your code still following the procedure outlined in comment 2?
Keywords: privacy-review-needed
Assignee | ||
Comment 25•13 years ago
|
||
No, it's not. The scheme now is:
On shutdown, save all histograms we've collected and the UUID of the session to disk.
When sending a telemetry ping, check for saved data. If there is any, also send a ping with the saved data associated with the old session's UUID.
Comment 26•13 years ago
|
||
Okay, so that results in two pings if there's saved data. Is the saved data and UUID erased after it is successfully transmitted?
-Sid
Assignee | ||
Comment 27•13 years ago
|
||
Mmmm, the saved data is erased from memory (insofar as references to it go away), but not from disk. Was going to rely on browser shutdown overwriting saved data with new saved data.
Assignee | ||
Comment 28•13 years ago
|
||
New C++ patch, with some comments addressed and a few major changes.
Since pickling blasts raw memory out to disk, we need to care about endianness and word size. Therefore, as StartupCache does, we need to make the saved datafile dependent on both of those characteristics. The easiest way to do this is to move the filename bits into C++ and out of JS. I haven't quite figured whether testing is feasible with this change or not (our tests were constructing the saved filename manually before).
This version also changes how things are written out to disk.
- Version number for the file format, just in case.
- Instead of saving all histograms (and possibly creating a bunch in the process), we now take the path of GetHistogramSnapshots and just iterate over the ones that have explicitly collected data. This makes reading somewhat more complicated, of course.
This second change also improves the situation we'd find ourselves when reading histograms in. As noted in TelemetrySessionData::ReflectSamples, we need to construct Histograms for every saved histogram we read in, by virtue of needing ranges and whatnot. By only saving the "interesting" histograms, we reduce the chance of constructing new histograms at load time (since histograms that were present in the previous session are most likely present in the current session). I don't know of a good way around this, short of rewriting a fair amount of code.
Attachment #584610 -
Attachment is obsolete: true
Attachment #586442 -
Flags: review?(taras.mozilla)
Assignee | ||
Comment 29•13 years ago
|
||
Updated JS bits for review comments and for moving filename bits into C++.
Attachment #584611 -
Attachment is obsolete: true
Attachment #586444 -
Flags: review?(taras.mozilla)
Comment 30•13 years ago
|
||
(In reply to Nathan Froyd (:froydnj) from comment #27)
> Mmmm, the saved data is erased from memory (insofar as references to it go
> away), but not from disk. Was going to rely on browser shutdown overwriting
> saved data with new saved data.
Please delete it from disk after it's successfully sent. This will minimize the chance we'll send it again by accident and minimizes the time data is persisted to disk.
Also, is any of the data saved on shutdown merged with new data collected after re-start, or are the sessions (UUID and data) kept completely separate?
Assignee | ||
Comment 31•13 years ago
|
||
(In reply to Sid Stamm [:geekboy] from comment #30)
> Please delete it from disk after it's successfully sent. This will minimize
> the chance we'll send it again by accident and minimizes the time data is
> persisted to disk.
Can do.
> Also, is any of the data saved on shutdown merged with new data collected
> after re-start, or are the sessions (UUID and data) kept completely separate?
We only persist one session at the moment; there's no merging involved. bug 715299 concerns persisting multiple sessions, if you'd like to go comment prior to investigation on what to watch out for (or if the idea is undesirable from a privacy perspective)
Reporter | ||
Comment 32•13 years ago
|
||
Comment on attachment 586442 [details] [diff] [review]
1. C++ interface bits for saving histograms to disk, v2
old patch
Attachment #586442 -
Flags: review?(taras.mozilla)
Reporter | ||
Comment 33•13 years ago
|
||
Comment on attachment 586444 [details] [diff] [review]
2. JS bits to send previously-saved session bits, v2
old patch
Attachment #586444 -
Flags: review?(taras.mozilla)
Assignee | ||
Comment 34•13 years ago
|
||
Rebasing against HEAD. Carrying over r+.
Attachment #581294 -
Attachment is obsolete: true
Attachment #589269 -
Flags: review+
Assignee | ||
Comment 35•13 years ago
|
||
Refresh against HEAD. We now drop histograms that we don't recognize on the floor during deserialization--assuming that we can read their data. Partly need to do this so that tests work, but it's a good thing to do in general.
Attachment #586442 -
Attachment is obsolete: true
Attachment #589270 -
Flags: review?(taras.mozilla)
Assignee | ||
Comment 36•13 years ago
|
||
Refresh against HEAD. I moved the wordsize, endian-naming bits into JS; I think it's worthwhile to keep them even with bug 704176 in the tree. Plus it's somewhat less complicated in JS. Doing naming in JS means that it's easier to test, too; less clutter in the nsITelemetry interface.
Attachment #586444 -
Attachment is obsolete: true
Attachment #589271 -
Flags: review?(taras.mozilla)
Assignee | ||
Comment 37•13 years ago
|
||
Refresh against HEAD and with everything else. Carrying over r+.
Attachment #584613 -
Attachment is obsolete: true
Attachment #589272 -
Flags: review+
Reporter | ||
Comment 38•13 years ago
|
||
Comment on attachment 589270 [details] [diff] [review]
part 2 - C++ bits for saving histograms to disk, v3
This looks good. idl docs should mention writing/reading occurs on a different thread.
Reporter | ||
Updated•13 years ago
|
Attachment #589270 -
Flags: review?(taras.mozilla) → review+
Reporter | ||
Updated•13 years ago
|
Attachment #589271 -
Flags: review?(taras.mozilla) → review+
Reporter | ||
Comment 39•13 years ago
|
||
Having r+ed above, I worry that this creates a race condition(ie if we quit before we get to the async save thread). I think saveHistograms should have a sync parameter which can be set when calling it on shutdown.
I think we should also do async saves periodically in case the browser crashes(ie while polling for memory usage, we can also trigger a save once an hour).
Assignee | ||
Comment 40•13 years ago
|
||
Adding isSynchronous flag to saveHistograms and tweaking IDL comments, as requested. Nothing major, so carrying over r+.
Attachment #589270 -
Attachment is obsolete: true
Attachment #589901 -
Flags: review+
Assignee | ||
Comment 41•13 years ago
|
||
Synchronously save histograms at quit-application-granted. Carrying over r+.
Attachment #589271 -
Attachment is obsolete: true
Attachment #589902 -
Flags: review+
Assignee | ||
Comment 42•13 years ago
|
||
Check for the existence of the temporary file before deleting it. Nothing big, carrying over r+.
Attachment #589272 -
Attachment is obsolete: true
Attachment #589903 -
Flags: review+
Assignee | ||
Updated•13 years ago
|
Keywords: checkin-needed
Comment 43•13 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/5a315a55ea7e
https://hg.mozilla.org/integration/mozilla-inbound/rev/6c2b90a11ea8
https://hg.mozilla.org/integration/mozilla-inbound/rev/5c40a413d9a9
https://hg.mozilla.org/integration/mozilla-inbound/rev/51bffa83d14f
Keywords: checkin-needed
OS: Windows 7 → All
Hardware: x86 → All
Target Milestone: --- → mozilla12
Version: unspecified → Trunk
Comment 44•13 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/5a315a55ea7e
https://hg.mozilla.org/mozilla-central/rev/6c2b90a11ea8
https://hg.mozilla.org/mozilla-central/rev/5c40a413d9a9
https://hg.mozilla.org/mozilla-central/rev/51bffa83d14f
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated•13 years ago
|
Comment 45•13 years ago
|
||
Backing this out due to startup crashes (particularly, bug 722545).
Depends on: 722625
Comment 46•13 years ago
|
||
Backed out from m-c before the uplift to FF13. But this backout may not be part of the uplift, depending on where we branch. So we'll have to keep an eye on it.
https://hg.mozilla.org/mozilla-central/rev/d74a924a149b
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 47•13 years ago
|
||
(In reply to Justin Lebar [:jlebar] from comment #46)
> Backed out from m-c before the uplift to FF13. But this backout may not be
> part of the uplift, depending on where we branch. So we'll have to keep an
> eye on it.
>
> https://hg.mozilla.org/mozilla-central/rev/d74a924a149b
This backout has been undone on m-c since it caused M-oth orange:
https://tbpl.mozilla.org/?rev=d74a924a149b
https://hg.mozilla.org/mozilla-central/rev/974b0efab5c8
akeybl is going to revert the backout on aurora as well, since d74a924a149b (the backout) made it there, but 974b0efab5c8 didn't.
The startup crashes in comment 45 therefore will need to be resolved on both m-c and aurora.
Assignee | ||
Comment 48•13 years ago
|
||
Not quite sure how to handle this, but here's a patch that correctly backs out these patches on trunk. It's not a straight revision because TelemetryPing.getSessionPayloadAndSlug is also used in the observe method.
TBPL is here: https://tbpl.mozilla.org/?tree=Try&rev=bd4b87fee4f4
M-oth contains the bits that wind up calling getSessionPayloadAndSlug (and were turned orange on the first backout), and the xpcshell tests hit all the Telemetry tests.
Attachment #593599 -
Flags: review?(taras.mozilla)
Reporter | ||
Updated•13 years ago
|
Attachment #593599 -
Flags: review?(taras.mozilla) → review+
Comment 49•13 years ago
|
||
Comment 50•13 years ago
|
||
I just hit an exception in getSessionPayloadAndSlug using Nightly, sync'd from mozilla-central a few days ago:
Timestamp: 12-02-03 08:04:09 AM
Error: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsITelemetrySessionData.snapshots]" nsresult: "0x80004005 (NS_ERROR_FAILURE)" location: "JS frame :: file:///...objdir.../dist/bin/components/TelemetryPing.js :: getSessionPayloadAndSlug :: line 432" data: no]
Source File: file:///...objdir.../dist/bin/components/TelemetryPing.js
Line: 432
Here's the code at and around line 432:
getSessionPayloadAndSlug: function getSessionPayloadAndSlug(reason) {
// Use a deterministic url for testing.
let isTestPing = (reason == "test-ping");
let havePreviousSession = !!this._prevSession;
let slug = (isTestPing
? reason
: (havePreviousSession
? this._prevSession.uuid
: this._uuid));
let payloadObj = {
ver: PAYLOAD_VERSION,
// Send a different reason string for previous session data.
info: this.getMetadata(havePreviousSession ? "saved-session" : reason),
};
if (havePreviousSession) {
>>432>> payloadObj.histograms = this.getHistograms(this._prevSession.snapshots);
}
else {
payloadObj.simpleMeasurements = getSimpleMeasurements();
payloadObj.histograms = this.getHistograms(Telemetry.histogramSnapshots);
payloadObj.slowSQL = Telemetry.slowSQL;
}
return { previous: !!havePreviousSession, slug: slug, payload: JSON.stringify(payloadObj) };
},
Comment 51•13 years ago
|
||
(In reply to Vladan Djeric (:vladan) from comment #49)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/67a219c9fb12
Merge of backout:
https://hg.mozilla.org/mozilla-central/rev/67a219c9fb12
Updated•13 years ago
|
Target Milestone: mozilla12 → ---
Assignee | ||
Comment 52•13 years ago
|
||
Comment on attachment 593599 [details] [diff] [review]
fixed backout patch
[Approval Request Comment]
Regression caused by (bug #): This bug.
User impact if declined: Crashes on startup if telemetry is enabled.
Testing completed (on m-c, etc.): Patch survived testing on mozilla-inbound, recently checked into mozilla-central.
Risk to taking this patch (and alternatives if risky): Epsilon.
String changes made by this patch: None.
Attachment #593599 -
Flags: approval-mozilla-aurora?
Comment 53•13 years ago
|
||
Comment on attachment 593599 [details] [diff] [review]
fixed backout patch
[Triage Comment]
Approved for Aurora.
Attachment #593599 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Updated•13 years ago
|
Assignee | ||
Updated•13 years ago
|
Attachment #593599 -
Flags: checkin?
Assignee | ||
Comment 54•13 years ago
|
||
Please checkin the backout patch on aurora. Adding the a=akeybl would be great too, if you'd be so kind.
Keywords: checkin-needed
Comment 55•13 years ago
|
||
Comment on attachment 593599 [details] [diff] [review]
fixed backout patch
backed out checked in.
https://hg.mozilla.org/releases/mozilla-aurora/rev/26c731b682ce
Attachment #593599 -
Flags: checkin? → checkin+
Assignee | ||
Comment 57•13 years ago
|
||
The reason that we had debug-only crashes was that the Histogram class is, by design, not thread-safe. Each Histogram, in addition to maintaining counts for its buckets, also maintains a "redundant count", a sum of all the counts we have accumulated thus far. This produces a situation where we have a race between two threads, A and B:
A: Inserts a count into a histogram H
B: Snapshots histogram H
A: Updates the redundant count of H
B now has data where sum(counts) != redundant_count. If we write the snapshot that B took to disk and read it later, we hit a debug-only check in the deserialization code that sum(counts) == redundant_count. This check was the cause of bugs like bug 722545. On opt builds, this degrades to a mere indication that deserialization failed. We take this indication to mean that *all* of the saved histogram data is corrupt.
This is clearly not a good situation. Note, however, that the histogram code does allow for this possibility when checking for corruption and permits the two count measures to differ by no more than 1 when declaring a histogram not corrupt. So there is a bit of inconsistency here: the code is OK with drift in-memory, but has zero tolerance when it comes to serialization.
Options for fixing this:
1) One big Telemetry lock that you have to grab before accumulating data or snapshotting data. This is suboptimal.
Note that we snapshot histograms in more places than just persisting them to disk; we snapshot them for reflection into JS, for instance.
2) Per-histogram locks, effectively making accumulation and snapshotting thread-safe. The methods on Histogram for accumulating data are virtual for precisely this sort of thing, though I haven't looked to see if you can extend them easily in practice.
3) Some sort of reader-writer locking scheme where you can accumulate into any number of histograms simultaneously, but snapshotting is atomic with respect to accumulation. I like this one, but it'd be a good deal more complicated, I think, than any of the others.
Note that this would leave us open to inconsistent sum(counts) and redundant_count, as two threads could be accumulating into the same histogram simultaneously.
4) Changing how snapshots are serialized, either by not serializing redundant_count or by inventing some checksum to make sure that counts and redundant_count didn't get corrupted with respect to each other. Then we can remove the sum(counts) == redundant_count check. This leaves the race in place, but makes the serialization process saner, IMHO. It's OK to read in slightly corrupted histograms, as we tolerate slightly corrupted histograms in-memory (see above discussion).
This option requires hacking on chromium-imported code, which may or may not make life more difficult for future merges. At the very least, I think distinguishing between the "I didn't read all of the saved data" versus "the data looks corrupt" in the deserialization code would be useful.
I think options 2 and 4 are the most reasonable; I'm leaning slightly towards option 4, but I'm curious whether anybody else has an opinion.
Assignee | ||
Comment 59•13 years ago
|
||
Rebasing.
Attachment #589902 -
Attachment is obsolete: true
Attachment #600913 -
Flags: review+
Assignee | ||
Updated•13 years ago
|
Attachment #589901 -
Attachment is obsolete: true
Assignee | ||
Updated•13 years ago
|
Whiteboard: [autoland-try:589269,600912,600913,589903]
Updated•13 years ago
|
Whiteboard: [autoland-try:589269,600912,600913,589903] → [autoland-in-queue]
Comment 60•13 years ago
|
||
Autoland Patchset:
Patches: 589269, 600912, 600913, 589903
Branch: mozilla-central => try
Error applying patch 589903 to mozilla-central.
patching file toolkit/components/telemetry/tests/unit/test_nsITelemetry.js
Hunk #1 FAILED at 99
1 out of 1 hunks FAILED -- saving rejects to file toolkit/components/telemetry/tests/unit/test_nsITelemetry.js.rej
abort: patch failed to apply
Could not apply and push patchset:
Updated•13 years ago
|
Whiteboard: [autoland-in-queue]
Assignee | ||
Comment 61•13 years ago
|
||
Update All The Patches!
Attachment #589903 -
Attachment is obsolete: true
Attachment #600921 -
Flags: review+
Assignee | ||
Updated•13 years ago
|
Whiteboard: [autoland-try:589269,600912,600913,600921]
Updated•13 years ago
|
Whiteboard: [autoland-try:589269,600912,600913,600921] → [autoland-in-queue]
Comment 62•13 years ago
|
||
Autoland Patchset:
Patches: 589269, 600912, 600913, 600921
Branch: mozilla-central => try
Destination: http://hg.mozilla.org/try/pushloghtml?changeset=8f0fae38e776
Try run started, revision 8f0fae38e776. To cancel or monitor the job, see: https://tbpl.mozilla.org/?tree=Try&rev=8f0fae38e776
Assignee | ||
Comment 63•13 years ago
|
||
The try run looks OK, despite going for 24 hours. Let's try to get this in and have some soak time before the Aurora uplift.
Keywords: checkin-needed
Comment 64•13 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/a853405443e7
https://hg.mozilla.org/integration/mozilla-inbound/rev/229fe7214b40
https://hg.mozilla.org/integration/mozilla-inbound/rev/be00254af671
https://hg.mozilla.org/integration/mozilla-inbound/rev/f9eb5f6242f7
Keywords: checkin-needed
Comment 65•13 years ago
|
||
Try run for 8f0fae38e776 is complete.
Detailed breakdown of the results available here:
https://tbpl.mozilla.org/?tree=Try&rev=8f0fae38e776
Results (out of 181 total builds):
exception: 1
success: 144
warnings: 20
failure: 16
Builds (or logs if builds failed) available at:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/autolanduser@mozilla.com-8f0fae38e776
Timed out after 12 hours without completing.
Updated•13 years ago
|
Whiteboard: [autoland-in-queue]
Comment 66•13 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/a853405443e7
https://hg.mozilla.org/mozilla-central/rev/229fe7214b40
https://hg.mozilla.org/mozilla-central/rev/be00254af671
https://hg.mozilla.org/mozilla-central/rev/f9eb5f6242f7
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla13
Comment 67•13 years ago
|
||
I've been told that this bug prevents us from getting post-startup telemetry data. If that's the case, we should consider uplifting to Aurora 12 as soon as this has had a week or so to bake. We need telemetry data to know whether or not we have perf regressions across releases.
Assignee | ||
Comment 68•13 years ago
|
||
(In reply to Alex Keybl [:akeybl] from comment #67)
> I've been told that this bug prevents us from getting post-startup telemetry
> data.
This bug prevents us from receiving data collected in the last ~day before a running Firefox instance is shutdown. It does *not* affect normal telemetry gathering; we can still gather telemetry data for perf regression information across releases without this bug being fixed in 12.
I'm inclined to not nominate this for Aurora 12. Given the problems that showed up last time, I'd like to let this bake for a complete release cycle.
Reporter | ||
Comment 69•13 years ago
|
||
This was backed out of 12, so it doesn't affect 12.
Comment 70•13 years ago
|
||
(In reply to Nathan Froyd (:froydnj) from comment #68)
> This bug prevents us from receiving data collected in the last ~day before a
> running Firefox instance is shutdown. It does *not* affect normal telemetry
> gathering; we can still gather telemetry data for perf regression
> information across releases without this bug being fixed in 12.
>
> I'm inclined to not nominate this for Aurora 12. Given the problems that
> showed up last time, I'd like to let this bake for a complete release cycle.
My concern here is that we're many times investigating performance regressions that accumulate over time and cause Firefox users to frequently restart their browser - from talking with Lawrence it sounds like telemetry data on these users would be lost.
Is there any targeted QA that we could do to land this fix a cycle early? Telemetry and other data gathering are the places that we're willing to take on a little more risk since it could be eye opening in our understanding of perf regressions.
If this carries too much risk, I don't mean to force the issue - I just want to make sure we're on the same page that this is the only external performance data we collect and can use to resolve perf regressions (seen in both FF10 and FF11 most recently).
Reporter | ||
Comment 71•13 years ago
|
||
(In reply to Alex Keybl [:akeybl] from comment #70)
> (In reply to Nathan Froyd (:froydnj) from comment #68)
> > This bug prevents us from receiving data collected in the last ~day before a
> > running Firefox instance is shutdown. It does *not* affect normal telemetry
> > gathering; we can still gather telemetry data for perf regression
> > information across releases without this bug being fixed in 12.
> >
> > I'm inclined to not nominate this for Aurora 12. Given the problems that
> > showed up last time, I'd like to let this bake for a complete release cycle.
>
> My concern here is that we're many times investigating performance
> regressions that accumulate over time and cause Firefox users to frequently
> restart their browser - from talking with Lawrence it sounds like telemetry
> data on these users would be lost.
Telemetry only records first 5-10minutes of browsing(+ every 24hours) because of the way our idle-daily notification works.
>
> Is there any targeted QA that we could do to land this fix a cycle early?
> Telemetry and other data gathering are the places that we're willing to take
> on a little more risk since it could be eye opening in our understanding of
> perf regressions.
>
> If this carries too much risk, I don't mean to force the issue - I just want
> to make sure we're on the same page that this is the only external
> performance data we collect and can use to resolve perf regressions (seen in
> both FF10 and FF11 most recently).
The fact that we are tracking ff10 perf is news to me.
This functionality has a long teething period. Both of the 2 landings had bugs. Second landing did not regress anything, hopefully things are now working with bug 732970 fixed.
We can only consider uplifting this once this has baked on trunk for a while and we have serverside evidence that things work. It takes about 7 days to get sufficient confidence that telemetry is working as expected.
Comment 72•13 years ago
|
||
Is there anything QA can do to verify this fix?
Assignee | ||
Comment 73•13 years ago
|
||
This was all disabled in bug 756152.
Comment 74•13 years ago
|
||
Thanks Nathan. Marking this [qa-]. We will verify-disabled in bug 756152. If there is anything you can add to that bug to aid in our verification, please comment there. Thanks.
Whiteboard: [qa-]
You need to log in
before you can comment on or make changes to this bug.
Description
•