Last Comment Bug 707320 - Allow histograms to persist to disk
: Allow histograms to persist to disk
Status: RESOLVED FIXED
[qa-]
:
Product: Toolkit
Classification: Components
Component: Telemetry (show other bugs)
: Trunk
: All All
: -- normal (vote)
: mozilla13
Assigned To: Nathan Froyd [:froydnj]
:
Mentors:
Depends on: 704176 722545 722625 726761 733977
Blocks: 736268 708674 711217 715299 719167 719437 723802
  Show dependency treegraph
 
Reported: 2011-12-02 14:27 PST by (dormant account)
Modified: 2012-05-29 13:00 PDT (History)
16 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
+
unaffected
fixed


Attachments
0. provide reflection for separate Histograms and SampleSets (1.21 KB, patch)
2011-12-13 09:11 PST, Nathan Froyd [:froydnj]
taras.mozilla: review+
Details | Diff | Splinter Review
1. C++ interface bits for saving histograms to disk (6.52 KB, patch)
2011-12-13 09:17 PST, Nathan Froyd [:froydnj]
no flags Details | Diff | Splinter Review
1. C++ interface bits for saving histograms to disk (10.98 KB, patch)
2011-12-28 12:10 PST, Nathan Froyd [:froydnj]
taras.mozilla: review-
Details | Diff | Splinter Review
2. JS bits to send previously-saved session bits (6.12 KB, patch)
2011-12-28 12:11 PST, Nathan Froyd [:froydnj]
taras.mozilla: review-
Details | Diff | Splinter Review
3. Tests for C++ load/save (2.04 KB, patch)
2011-12-28 12:12 PST, Nathan Froyd [:froydnj]
taras.mozilla: review+
Details | Diff | Splinter Review
1. C++ interface bits for saving histograms to disk, v2 (14.53 KB, patch)
2012-01-06 08:23 PST, Nathan Froyd [:froydnj]
no flags Details | Diff | Splinter Review
2. JS bits to send previously-saved session bits, v2 (5.70 KB, patch)
2012-01-06 08:24 PST, Nathan Froyd [:froydnj]
no flags Details | Diff | Splinter Review
part 1 - provide reflection for separate Histograms and SampleSets (1.23 KB, patch)
2012-01-17 12:45 PST, Nathan Froyd [:froydnj]
nfroyd: review+
Details | Diff | Splinter Review
part 2 - C++ bits for saving histograms to disk, v3 (11.98 KB, patch)
2012-01-17 12:48 PST, Nathan Froyd [:froydnj]
taras.mozilla: review+
Details | Diff | Splinter Review
part 3 - JS bits to send previously-saved session bits, v3 (7.55 KB, patch)
2012-01-17 12:51 PST, Nathan Froyd [:froydnj]
taras.mozilla: review+
Details | Diff | Splinter Review
part 4 - tests for C++ load/save (2.02 KB, patch)
2012-01-17 12:52 PST, Nathan Froyd [:froydnj]
nfroyd: review+
Details | Diff | Splinter Review
part 2 - C++ bits for saving histograms to disk, v3 (13.72 KB, patch)
2012-01-19 10:26 PST, Nathan Froyd [:froydnj]
nfroyd: review+
Details | Diff | Splinter Review
part 3 - JS bits to send previously-saved session bits, v4 (9.90 KB, patch)
2012-01-19 10:27 PST, Nathan Froyd [:froydnj]
nfroyd: review+
Details | Diff | Splinter Review
part 4 - tests for C++ load/save, v4 (2.57 KB, patch)
2012-01-19 10:28 PST, Nathan Froyd [:froydnj]
nfroyd: review+
Details | Diff | Splinter Review
fixed backout patch (27.59 KB, patch)
2012-02-01 13:52 PST, Nathan Froyd [:froydnj]
taras.mozilla: review+
akeybl: approval‑mozilla‑aurora+
continuation: checkin+
Details | Diff | Splinter Review
part 2 - C++ bits for saving histograms to disk (13.44 KB, patch)
2012-02-27 07:53 PST, Nathan Froyd [:froydnj]
nfroyd: review+
Details | Diff | Splinter Review
part 3 - JS bits to send previously-saved session bits (10.34 KB, patch)
2012-02-27 07:54 PST, Nathan Froyd [:froydnj]
nfroyd: review+
Details | Diff | Splinter Review
part 4 - tests for C++ load/save (2.53 KB, patch)
2012-02-27 08:26 PST, Nathan Froyd [:froydnj]
nfroyd: review+
Details | Diff | Splinter Review

Description (dormant account) 2011-12-02 14:27:48 PST
This way we can track stuff on shutdown.
Comment 1 Nathan Froyd [:froydnj] 2011-12-05 08:58:23 PST
I guess this would block bug 706647; we're not going to want to add a telemetry ping to shutdown to report on how long it took to shutdown plugins?
Comment 2 (dormant account) 2011-12-12 13:50:39 PST
I was talking to Nathan/Sid about this. Seems like the way to do this is to serialize data on shutdown, then
on startup:
a) if there is no serialized telemetry data, send a ping same as we do now
b) if there is serialized data:
b1) send serialized data
b2) reset UID, wipe all histograms
Comment 3 Nathan Froyd [:froydnj] 2011-12-12 14:00:35 PST
This is basically what I'm implementing.  I'm planning on marking certain histograms as "persistent" so we're not saving several tens of histograms to disk.  Then we can send those the data from those histograms as an initial ping after startup, along with the saved UUID.  We don't have to worry about combining histograms--I'm planning on just reflecting the serialized data into JS--and we don't have to deal with clearing histograms, either.
Comment 4 (dormant account) 2011-12-12 14:31:42 PST
Nathan, after reflecting on this for a while: what's the benefit of not storing certain histograms on disk? Disk used by histograms should be pretty minimal, we can further reduce it with zlib compression(if needed).
Comment 5 Nathan Froyd [:froydnj] 2011-12-13 09:11:00 PST
Created attachment 581294 [details] [diff] [review]
0. provide reflection for separate Histograms and SampleSets

Posting a WIP.  This patch, however trivial, is ready for review.

This patch is driven by a couple of observations:

1. We're only going to serialize SampleSets at shutdown;
2. We don't want to mix previous-session SampleSets and current-session SampleSets;
3. We want to reflect previous-session information into JS similarly to how histogramSnapshots works now.

All of these mean that we need an interface to reflect SampleSets that didn't come from Histogram::SnapshotSample.  This patch provides that interface.
Comment 6 Nathan Froyd [:froydnj] 2011-12-13 09:17:21 PST
Created attachment 581296 [details] [diff] [review]
1. C++ interface bits for saving histograms to disk

This patch implements the actual save-and-load histogram bits.  It persists all histograms to disk, rather than marking specific histograms as persistent.  Back-of-the-envelope calculations suggest that the persistent file is going to be 10s of K, so not too bad.

I haven't tested it because I don't have the JavaScript bits done yet; those bits would be the next part.  Should be about 50 lines or so.

Feedback on the approach welcome; if you think it's good enough for r+, please indicate as such.
Comment 7 (dormant account) 2011-12-13 14:35:25 PST
Comment on attachment 581294 [details] [diff] [review]
0. provide reflection for separate Histograms and SampleSets

should ss be const?
Comment 8 (dormant account) 2011-12-13 14:59:15 PST
Comment on attachment 581296 [details] [diff] [review]
1. C++ interface bits for saving histograms to disk

>+    Histogram::SampleSet ss;
>+    h->SnapshotSample(&ss);
>+
>+    // Write out explicit lengths to improve corruption detection.
>+    size_t name_length = strlen(gHistograms[i].id);
>+    if (!(pickle.WriteData(gHistograms[i].id, name_length)
>+          && ss.Serialize(&pickle))) {
>+      return false;
>+    }
>+  }
>+
>+  return true;

Don't need an if statement there. Same elsewhere

>+}
>+
>+bool
>+DeserializeHistogramData(AutoFDClose &fd, Pickle &pickle, const char *uuid,
>+                         void **iter, JSContext *cx, jsval *ret)
>+{
>+  size_t n_histograms = ArrayLength(gHistograms);

Use HistogramCount here and elsewhere
Comment 9 (dormant account) 2011-12-13 15:00:21 PST
Comment on attachment 581296 [details] [diff] [review]
1. C++ interface bits for saving histograms to disk

You should make sure that you either don't serialize empty/non-existent histograms or don't deserialize them
Comment 10 (dormant account) 2011-12-13 15:05:58 PST
Comment on attachment 581296 [details] [diff] [review]
1. C++ interface bits for saving histograms to disk

Should also read/write off main thread on the C++ side. ie have the api be read(callback). For writes we can just copy data into a buffer, fire off a thread and forget about it.
Comment 11 Nathan Froyd [:froydnj] 2011-12-21 08:45:54 PST
(In reply to Taras Glek (:taras) from comment #8)
> >+    Histogram::SampleSet ss;
> >+    h->SnapshotSample(&ss);
> >+
> >+    // Write out explicit lengths to improve corruption detection.
> >+    size_t name_length = strlen(gHistograms[i].id);
> >+    if (!(pickle.WriteData(gHistograms[i].id, name_length)
> >+          && ss.Serialize(&pickle))) {
> >+      return false;
> >+    }
> >+  }
> >+
> >+  return true;
> 
> Don't need an if statement there. Same elsewhere

Why not?  We had an error, no sense in writing the rest of the file if there were problems.
Comment 12 Nathan Froyd [:froydnj] 2011-12-21 09:12:52 PST
(In reply to Taras Glek (:taras) from comment #10)
> Should also read/write off main thread on the C++ side. ie have the api be
> read(callback). For writes we can just copy data into a buffer, fire off a
> thread and forget about it.

I assume for reads, you mean something like NS_DispatchToCurrentThread?  For writes, will there be problems using the JSContext we received on a different thread?
Comment 13 (dormant account) 2011-12-21 10:42:13 PST
(In reply to Nathan Froyd (:froydnj) from comment #11)
> (In reply to Taras Glek (:taras) from comment #8)
> > >+    Histogram::SampleSet ss;
> > >+    h->SnapshotSample(&ss);
> > >+
> > >+    // Write out explicit lengths to improve corruption detection.
> > >+    size_t name_length = strlen(gHistograms[i].id);
> > >+    if (!(pickle.WriteData(gHistograms[i].id, name_length)
> > >+          && ss.Serialize(&pickle))) {
> > >+      return false;
> > >+    }
> > >+  }
> > >+
> > >+  return true;
> > 
> > Don't need an if statement there. Same elsewhere
> 
> Why not?  We had an error, no sense in writing the rest of the file if there
> were problems.

return pickle.WriteData(gHistograms[i].id, name_length) && ss.Serialize(&pickle); 

> 
> I assume for reads, you mean something like NS_DispatchToCurrentThread?  For
> writes, will there be problems using the JSContext we received on a
> different thread?

Yes and yes, you should extract data out of js on main thread.
Comment 14 Nathan Froyd [:froydnj] 2011-12-21 11:03:18 PST
(In reply to Taras Glek (:taras) from comment #13)
> (In reply to Nathan Froyd (:froydnj) from comment #11)
> > Why not?  We had an error, no sense in writing the rest of the file if there
> > were problems.
> 
> return pickle.WriteData(gHistograms[i].id, name_length) &&
> ss.Serialize(&pickle); 

That if statement lives in a loop over all histograms; we don't want to return after writing the first one.
Comment 15 Nathan Froyd [:froydnj] 2011-12-28 12:10:50 PST
Created attachment 584610 [details] [diff] [review]
1. C++ interface bits for saving histograms to disk

Totally redid I/O to defer it to a more convenient time.

TelemetrySessionData is an XPCOM object for convenience reasons; if we really want to, I suppose it could be turned into a JS object instead.
Comment 16 Nathan Froyd [:froydnj] 2011-12-28 12:11:56 PST
Created attachment 584611 [details] [diff] [review]
2. JS bits to send previously-saved session bits

Hooks TelemetryPing up to the new C++ functionality.
Comment 17 Nathan Froyd [:froydnj] 2011-12-28 12:12:53 PST
Created attachment 584613 [details] [diff] [review]
3. Tests for C++ load/save

Basic tests that callbacks and such work.
Comment 18 (dormant account) 2012-01-03 17:13:09 PST
Comment on attachment 584610 [details] [diff] [review]
1. C++ interface bits for saving histograms to disk

This seems a bit too complicated, but I guess using xpcom to save function references does that.
TelemetrySessionData::DeserializeHistogramData

needs assertions/comments. Eg it's not clear why we'd have an entry serialized, but not read it in.

Don't use new/delete, use nsAutoPtr, nsAutoPtr::forget to transfer ownership.

I don't think you need two separate completion interfaces. One interfaces with "saved" "loaded" would do. I think xpcom allows leaving js methods unimplemented (ie they just return some nsresult failure code).
Comment 19 (dormant account) 2012-01-03 17:27:04 PST
Comment on attachment 584611 [details] [diff] [review]
2. JS bits to send previously-saved session bits

This is pretty simple(in a good way).
A couple of observations:
1) I think we should set a custom reason string for the persisted data since it should be somewhat different from short-termed idle-daily data. Call it "unpickled" or something.

2) What happens when a) submission for old data fails, b) what if we shutdown before resubmitting data. Might be worth making it more persistent..ie keep last x sessions?

r-ing to address 1. Can do 2 in a followup bug.
Comment 20 (dormant account) 2012-01-03 17:33:16 PST
Comment on attachment 584613 [details] [diff] [review]
3. Tests for C++ load/save

I'm not sure if tests can pollute the temp dir. Can you ask on #developers if that's ok?

Other than that looks good.
Comment 21 Nathan Froyd [:froydnj] 2012-01-04 06:18:59 PST
(In reply to Taras Glek (:taras) from comment #18)
> I don't think you need two separate completion interfaces. One interfaces
> with "saved" "loaded" would do. I think xpcom allows leaving js methods
> unimplemented (ie they just return some nsresult failure code).

My impression from chatting on #developers about how to do callbacks suggested that the [function] interfaces needed to be single-method so that XPCOM knows which method the callback is meant to implement.  We do have an instance or two of multi-method [function] interfaces in the tree, but I don't know how those work or if they work at all (might be dead code?).

(In reply to Taras Glek (:taras) from comment #20)
> I'm not sure if tests can pollute the temp dir. Can you ask on #developers
> if that's ok?

The tests for the StartupCache touch the temp dir, so there's at least precedent.  The removeFile do_register_cleanup was intended to attempt to clean up after ourselves.
Comment 22 Nathan Froyd [:froydnj] 2012-01-04 09:10:59 PST
(In reply to Taras Glek (:taras) from comment #19)
> 1) I think we should set a custom reason string for the persisted data since
> it should be somewhat different from short-termed idle-daily data. Call it
> "unpickled" or something.

AFAICS, the reason parameter to send() is never actually sent; it's just used to determine whether to send to a fixed URL or a session-unique UUID URL.  Is that what you were thinking of changing?  If so, did you want to include that in the ping payload?
Comment 23 (dormant account) 2012-01-04 15:39:23 PST
(In reply to Nathan Froyd (:froydnj) from comment #22)
> (In reply to Taras Glek (:taras) from comment #19)
> > 1) I think we should set a custom reason string for the persisted data since
> > it should be somewhat different from short-termed idle-daily data. Call it
> > "unpickled" or something.
> 
> AFAICS, the reason parameter to send() is never actually sent; it's just
> used to determine whether to send to a fixed URL or a session-unique UUID
> URL.  Is that what you were thinking of changing?  If so, did you want to
> include that in the ping payload?

You are right.

http://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/TelemetryPing.js#210 reason is sent here. 

Still seems like we should change it for picked packets.

re callbacks: It's a minor enough api, existing approach is fine.
Comment 24 Sid Stamm [:geekboy or :sstamm] 2012-01-05 09:58:35 PST
removing privacy-review-needed keyword as this has been updated in the privacy review for Telemetry.  froydnj: is your code still following the procedure outlined in comment 2?
Comment 25 Nathan Froyd [:froydnj] 2012-01-05 10:11:06 PST
No, it's not.  The scheme now is:

On shutdown, save all histograms we've collected and the UUID of the session to disk.

When sending a telemetry ping, check for saved data.  If there is any, also send a ping with the saved data associated with the old session's UUID.
Comment 26 Sid Stamm [:geekboy or :sstamm] 2012-01-05 10:37:02 PST
Okay, so that results in two pings if there's saved data.  Is the saved data and UUID erased after it is successfully transmitted?

-Sid
Comment 27 Nathan Froyd [:froydnj] 2012-01-05 11:49:54 PST
Mmmm, the saved data is erased from memory (insofar as references to it go away), but not from disk.  Was going to rely on browser shutdown overwriting saved data with new saved data.
Comment 28 Nathan Froyd [:froydnj] 2012-01-06 08:23:44 PST
Created attachment 586442 [details] [diff] [review]
1. C++ interface bits for saving histograms to disk, v2

New C++ patch, with some comments addressed and a few major changes.

Since pickling blasts raw memory out to disk, we need to care about endianness and word size.  Therefore, as StartupCache does, we need to make the saved datafile dependent on both of those characteristics.  The easiest way to do this is to move the filename bits into C++ and out of JS.  I haven't quite figured whether testing is feasible with this change or not (our tests were constructing the saved filename manually before).

This version also changes how things are written out to disk.

- Version number for the file format, just in case.
- Instead of saving all histograms (and possibly creating a bunch in the process), we now take the path of GetHistogramSnapshots and just iterate over the ones that have explicitly collected data.  This makes reading somewhat more complicated, of course.

This second change also improves the situation we'd find ourselves when reading histograms in.  As noted in TelemetrySessionData::ReflectSamples, we need to construct Histograms for every saved histogram we read in, by virtue of needing ranges and whatnot.  By only saving the "interesting" histograms, we reduce the chance of constructing new histograms at load time (since histograms that were present in the previous session are most likely present in the current session).  I don't know of a good way around this, short of rewriting a fair amount of code.
Comment 29 Nathan Froyd [:froydnj] 2012-01-06 08:24:58 PST
Created attachment 586444 [details] [diff] [review]
2. JS bits to send previously-saved session bits, v2

Updated JS bits for review comments and for moving filename bits into C++.
Comment 30 Sid Stamm [:geekboy or :sstamm] 2012-01-06 09:13:55 PST
(In reply to Nathan Froyd (:froydnj) from comment #27)
> Mmmm, the saved data is erased from memory (insofar as references to it go
> away), but not from disk.  Was going to rely on browser shutdown overwriting
> saved data with new saved data.

Please delete it from disk after it's successfully sent.  This will minimize the chance we'll send it again by accident and minimizes the time data is persisted to disk.

Also, is any of the data saved on shutdown merged with new data collected after re-start, or are the sessions (UUID and data) kept completely separate?
Comment 31 Nathan Froyd [:froydnj] 2012-01-06 09:40:50 PST
(In reply to Sid Stamm [:geekboy] from comment #30)
> Please delete it from disk after it's successfully sent.  This will minimize
> the chance we'll send it again by accident and minimizes the time data is
> persisted to disk.

Can do.

> Also, is any of the data saved on shutdown merged with new data collected
> after re-start, or are the sessions (UUID and data) kept completely separate?

We only persist one session at the moment; there's no merging involved.  bug 715299 concerns persisting multiple sessions, if you'd like to go comment prior to investigation on what to watch out for (or if the idea is undesirable from a privacy perspective)
Comment 32 (dormant account) 2012-01-09 15:06:48 PST
Comment on attachment 586442 [details] [diff] [review]
1. C++ interface bits for saving histograms to disk, v2

old patch
Comment 33 (dormant account) 2012-01-09 15:07:07 PST
Comment on attachment 586444 [details] [diff] [review]
2. JS bits to send previously-saved session bits, v2

old patch
Comment 34 Nathan Froyd [:froydnj] 2012-01-17 12:45:54 PST
Created attachment 589269 [details] [diff] [review]
part 1 - provide reflection for separate Histograms and SampleSets

Rebasing against HEAD.  Carrying over r+.
Comment 35 Nathan Froyd [:froydnj] 2012-01-17 12:48:16 PST
Created attachment 589270 [details] [diff] [review]
part 2 - C++ bits for saving histograms to disk, v3

Refresh against HEAD.  We now drop histograms that we don't recognize on the floor during deserialization--assuming that we can read their data.  Partly need to do this so that tests work, but it's a good thing to do in general.
Comment 36 Nathan Froyd [:froydnj] 2012-01-17 12:51:30 PST
Created attachment 589271 [details] [diff] [review]
part 3 - JS bits to send previously-saved session bits, v3

Refresh against HEAD.  I moved the wordsize, endian-naming bits into JS; I think it's worthwhile to keep them even with bug 704176 in the tree.  Plus it's somewhat less complicated in JS.  Doing naming in JS means that it's easier to test, too; less clutter in the nsITelemetry interface.
Comment 37 Nathan Froyd [:froydnj] 2012-01-17 12:52:33 PST
Created attachment 589272 [details] [diff] [review]
part 4 - tests for C++ load/save

Refresh against HEAD and with everything else.  Carrying over r+.
Comment 38 (dormant account) 2012-01-17 14:22:12 PST
Comment on attachment 589270 [details] [diff] [review]
part 2 - C++ bits for saving histograms to disk, v3

This looks good. idl docs should mention writing/reading occurs on a different thread.
Comment 39 (dormant account) 2012-01-17 14:29:08 PST
Having r+ed above, I worry that this creates a race condition(ie if we quit before we get to the async save thread). I think saveHistograms should have a sync parameter which can be set when calling it on shutdown. 
I think we should also do async saves periodically in case the browser crashes(ie while polling for memory usage, we can also trigger a save once an hour).
Comment 40 Nathan Froyd [:froydnj] 2012-01-19 10:26:31 PST
Created attachment 589901 [details] [diff] [review]
part 2 - C++ bits for saving histograms to disk, v3

Adding isSynchronous flag to saveHistograms and tweaking IDL comments, as requested.  Nothing major, so carrying over r+.
Comment 41 Nathan Froyd [:froydnj] 2012-01-19 10:27:18 PST
Created attachment 589902 [details] [diff] [review]
part 3 - JS bits to send previously-saved session bits, v4

Synchronously save histograms at quit-application-granted.  Carrying over r+.
Comment 42 Nathan Froyd [:froydnj] 2012-01-19 10:28:05 PST
Created attachment 589903 [details] [diff] [review]
part 4 - tests for C++ load/save, v4

Check for the existence of the temporary file before deleting it.  Nothing big, carrying over r+.
Comment 45 Justin Lebar (not reading bugmail) 2012-01-31 07:42:04 PST
Backing this out due to startup crashes (particularly, bug 722545).
Comment 46 Justin Lebar (not reading bugmail) 2012-01-31 08:13:33 PST
Backed out from m-c before the uplift to FF13.  But this backout may not be part of the uplift, depending on where we branch.  So we'll have to keep an eye on it.

https://hg.mozilla.org/mozilla-central/rev/d74a924a149b
Comment 47 Ed Morley [:emorley] 2012-01-31 11:41:36 PST
(In reply to Justin Lebar [:jlebar] from comment #46)
> Backed out from m-c before the uplift to FF13.  But this backout may not be
> part of the uplift, depending on where we branch.  So we'll have to keep an
> eye on it.
> 
> https://hg.mozilla.org/mozilla-central/rev/d74a924a149b

This backout has been undone on m-c since it caused M-oth orange:
https://tbpl.mozilla.org/?rev=d74a924a149b

https://hg.mozilla.org/mozilla-central/rev/974b0efab5c8

akeybl is going to revert the backout on aurora as well, since d74a924a149b (the backout) made it there, but 974b0efab5c8 didn't.

The startup crashes in comment 45 therefore will need to be resolved on both m-c and aurora.
Comment 48 Nathan Froyd [:froydnj] 2012-02-01 13:52:33 PST
Created attachment 593599 [details] [diff] [review]
fixed backout patch

Not quite sure how to handle this, but here's a patch that correctly backs out these patches on trunk.  It's not a straight revision because TelemetryPing.getSessionPayloadAndSlug is also used in the observe method.

TBPL is here: https://tbpl.mozilla.org/?tree=Try&rev=bd4b87fee4f4
M-oth contains the bits that wind up calling getSessionPayloadAndSlug (and were turned orange on the first backout), and the xpcshell tests hit all the Telemetry tests.
Comment 50 Mark Reid [:mreid] 2012-02-03 05:29:43 PST
I just hit an exception in getSessionPayloadAndSlug using Nightly, sync'd from mozilla-central a few days ago:

Timestamp: 12-02-03 08:04:09 AM
Error: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsITelemetrySessionData.snapshots]"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame :: file:///...objdir.../dist/bin/components/TelemetryPing.js :: getSessionPayloadAndSlug :: line 432"  data: no]
Source File: file:///...objdir.../dist/bin/components/TelemetryPing.js
Line: 432

Here's the code at and around line 432:

  getSessionPayloadAndSlug: function getSessionPayloadAndSlug(reason) {
    // Use a deterministic url for testing.
    let isTestPing = (reason == "test-ping");
    let havePreviousSession = !!this._prevSession;
    let slug = (isTestPing
                ? reason
                : (havePreviousSession
                   ? this._prevSession.uuid
                   : this._uuid));
    let payloadObj = {
      ver: PAYLOAD_VERSION,
      // Send a different reason string for previous session data.
      info: this.getMetadata(havePreviousSession ? "saved-session" : reason),
    };
    if (havePreviousSession) {
>>432>>      payloadObj.histograms = this.getHistograms(this._prevSession.snapshots);
    }
    else {
      payloadObj.simpleMeasurements = getSimpleMeasurements();
      payloadObj.histograms = this.getHistograms(Telemetry.histogramSnapshots);
      payloadObj.slowSQL = Telemetry.slowSQL;
    }
    return { previous: !!havePreviousSession, slug: slug, payload: JSON.stringify(payloadObj) };
  },
Comment 51 Ed Morley [:emorley] 2012-02-03 11:09:00 PST
(In reply to Vladan Djeric (:vladan) from comment #49)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/67a219c9fb12

Merge of backout:
https://hg.mozilla.org/mozilla-central/rev/67a219c9fb12
Comment 52 Nathan Froyd [:froydnj] 2012-02-03 11:41:46 PST
Comment on attachment 593599 [details] [diff] [review]
fixed backout patch

[Approval Request Comment]
Regression caused by (bug #): This bug.
User impact if declined: Crashes on startup if telemetry is enabled.
Testing completed (on m-c, etc.): Patch survived testing on mozilla-inbound, recently checked into mozilla-central.
Risk to taking this patch (and alternatives if risky): Epsilon.
String changes made by this patch: None.
Comment 53 Alex Keybl [:akeybl] 2012-02-05 13:57:18 PST
Comment on attachment 593599 [details] [diff] [review]
fixed backout patch

[Triage Comment]
Approved for Aurora.
Comment 54 Nathan Froyd [:froydnj] 2012-02-09 06:02:17 PST
Please checkin the backout patch on aurora.  Adding the a=akeybl would be great too, if you'd be so kind.
Comment 55 Andrew McCreight [:mccr8] 2012-02-09 06:29:50 PST
Comment on attachment 593599 [details] [diff] [review]
fixed backout patch

backed out checked in.

https://hg.mozilla.org/releases/mozilla-aurora/rev/26c731b682ce
Comment 56 Andrew McCreight [:mccr8] 2012-02-09 06:30:20 PST
back out checked in, rather...
Comment 57 Nathan Froyd [:froydnj] 2012-02-10 10:15:47 PST
The reason that we had debug-only crashes was that the Histogram class is, by design, not thread-safe.  Each Histogram, in addition to maintaining counts for its buckets, also maintains a "redundant count", a sum of all the counts we have accumulated thus far.  This produces a situation where we have a race between two threads, A and B:

A: Inserts a count into a histogram H
B: Snapshots histogram H
A: Updates the redundant count of H

B now has data where sum(counts) != redundant_count.  If we write the snapshot that B took to disk and read it later, we hit a debug-only check in the deserialization code that sum(counts) == redundant_count.  This check was the cause of bugs like bug 722545.  On opt builds, this degrades to a mere indication that deserialization failed.  We take this indication to mean that *all* of the saved histogram data is corrupt.

This is clearly not a good situation.  Note, however, that the histogram code does allow for this possibility when checking for corruption and permits the two count measures to differ by no more than 1 when declaring a histogram not corrupt.  So there is a bit of inconsistency here: the code is OK with drift in-memory, but has zero tolerance when it comes to serialization.

Options for fixing this:

1) One big Telemetry lock that you have to grab before accumulating data or snapshotting data.  This is suboptimal.

Note that we snapshot histograms in more places than just persisting them to disk; we snapshot them for reflection into JS, for instance.

2) Per-histogram locks, effectively making accumulation and snapshotting thread-safe.  The methods on Histogram for accumulating data are virtual for precisely this sort of thing, though I haven't looked to see if you can extend them easily in practice.

3) Some sort of reader-writer locking scheme where you can accumulate into any number of histograms simultaneously, but snapshotting is atomic with respect to accumulation.  I like this one, but it'd be a good deal more complicated, I think, than any of the others.

Note that this would leave us open to inconsistent sum(counts) and redundant_count, as two threads could be accumulating into the same histogram simultaneously.

4) Changing how snapshots are serialized, either by not serializing redundant_count or by inventing some checksum to make sure that counts and redundant_count didn't get corrupted with respect to each other.  Then we can remove the sum(counts) == redundant_count check.  This leaves the race in place, but makes the serialization process saner, IMHO.  It's OK to read in slightly corrupted histograms, as we tolerate slightly corrupted histograms in-memory (see above discussion).

This option requires hacking on chromium-imported code, which may or may not make life more difficult for future merges.  At the very least, I think distinguishing between the "I didn't read all of the saved data" versus "the data looks corrupt" in the deserialization code would be useful.

I think options 2 and 4 are the most reasonable; I'm leaning slightly towards option 4, but I'm curious whether anybody else has an opinion.
Comment 58 Nathan Froyd [:froydnj] 2012-02-27 07:53:36 PST
Created attachment 600912 [details] [diff] [review]
part 2 - C++ bits for saving histograms to disk

Rebasing.
Comment 59 Nathan Froyd [:froydnj] 2012-02-27 07:54:09 PST
Created attachment 600913 [details] [diff] [review]
part 3 - JS bits to send previously-saved session bits

Rebasing.
Comment 60 Mozilla RelEng Bot 2012-02-27 08:14:31 PST
Autoland Patchset:
	Patches: 589269, 600912, 600913, 589903
	Branch: mozilla-central => try
Error applying patch 589903 to mozilla-central.
patching file toolkit/components/telemetry/tests/unit/test_nsITelemetry.js
Hunk #1 FAILED at 99
1 out of 1 hunks FAILED -- saving rejects to file toolkit/components/telemetry/tests/unit/test_nsITelemetry.js.rej
abort: patch failed to apply

Could not apply and push patchset:
Comment 61 Nathan Froyd [:froydnj] 2012-02-27 08:26:26 PST
Created attachment 600921 [details] [diff] [review]
part 4 - tests for C++ load/save

Update All The Patches!
Comment 62 Mozilla RelEng Bot 2012-02-27 08:31:14 PST
Autoland Patchset:
	Patches: 589269, 600912, 600913, 600921
	Branch: mozilla-central => try
	Destination: http://hg.mozilla.org/try/pushloghtml?changeset=8f0fae38e776
Try run started, revision 8f0fae38e776. To cancel or monitor the job, see: https://tbpl.mozilla.org/?tree=Try&rev=8f0fae38e776
Comment 63 Nathan Froyd [:froydnj] 2012-02-28 07:31:16 PST
The try run looks OK, despite going for 24 hours.  Let's try to get this in and have some soak time before the Aurora uplift.
Comment 65 Mozilla RelEng Bot 2012-02-28 08:46:01 PST
Try run for 8f0fae38e776 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=8f0fae38e776
Results (out of 181 total builds):
    exception: 1
    success: 144
    warnings: 20
    failure: 16
Builds (or logs if builds failed) available at:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/autolanduser@mozilla.com-8f0fae38e776
 Timed out after 12 hours without completing.
Comment 67 Alex Keybl [:akeybl] 2012-02-29 11:37:45 PST
I've been told that this bug prevents us from getting post-startup telemetry data. If that's the case, we should consider uplifting to Aurora 12 as soon as this has had a week or so to bake. We need telemetry data to know whether or not we have perf regressions across releases.
Comment 68 Nathan Froyd [:froydnj] 2012-03-06 07:22:14 PST
(In reply to Alex Keybl [:akeybl] from comment #67)
> I've been told that this bug prevents us from getting post-startup telemetry
> data.

This bug prevents us from receiving data collected in the last ~day before a running Firefox instance is shutdown.  It does *not* affect normal telemetry gathering; we can still gather telemetry data for perf regression information across releases without this bug being fixed in 12.

I'm inclined to not nominate this for Aurora 12.  Given the problems that showed up last time, I'd like to let this bake for a complete release cycle.
Comment 69 (dormant account) 2012-03-06 10:23:38 PST
This was backed out of 12, so it doesn't affect 12.
Comment 70 Alex Keybl [:akeybl] 2012-03-06 11:45:02 PST
(In reply to Nathan Froyd (:froydnj) from comment #68)
> This bug prevents us from receiving data collected in the last ~day before a
> running Firefox instance is shutdown.  It does *not* affect normal telemetry
> gathering; we can still gather telemetry data for perf regression
> information across releases without this bug being fixed in 12.
> 
> I'm inclined to not nominate this for Aurora 12.  Given the problems that
> showed up last time, I'd like to let this bake for a complete release cycle.

My concern here is that we're many times investigating performance regressions that accumulate over time and cause Firefox users to frequently restart their browser - from talking with Lawrence it sounds like telemetry data on these users would be lost.

Is there any targeted QA that we could do to land this fix a cycle early? Telemetry and other data gathering are the places that we're willing to take on a little more risk since it could be  eye opening in our understanding of perf regressions.

If this carries too much risk, I don't mean to force the issue - I just want to make sure we're on the same page that this is the only external performance data we collect and can use to resolve perf regressions (seen in both FF10 and FF11 most recently).
Comment 71 (dormant account) 2012-03-06 13:24:45 PST
(In reply to Alex Keybl [:akeybl] from comment #70)
> (In reply to Nathan Froyd (:froydnj) from comment #68)
> > This bug prevents us from receiving data collected in the last ~day before a
> > running Firefox instance is shutdown.  It does *not* affect normal telemetry
> > gathering; we can still gather telemetry data for perf regression
> > information across releases without this bug being fixed in 12.
> > 
> > I'm inclined to not nominate this for Aurora 12.  Given the problems that
> > showed up last time, I'd like to let this bake for a complete release cycle.
> 
> My concern here is that we're many times investigating performance
> regressions that accumulate over time and cause Firefox users to frequently
> restart their browser - from talking with Lawrence it sounds like telemetry
> data on these users would be lost.

Telemetry only records first 5-10minutes of browsing(+ every 24hours) because of the way our idle-daily notification works.

> 
> Is there any targeted QA that we could do to land this fix a cycle early?
> Telemetry and other data gathering are the places that we're willing to take
> on a little more risk since it could be  eye opening in our understanding of
> perf regressions.
> 
> If this carries too much risk, I don't mean to force the issue - I just want
> to make sure we're on the same page that this is the only external
> performance data we collect and can use to resolve perf regressions (seen in
> both FF10 and FF11 most recently).

The fact that we are tracking ff10 perf is news to me.

This functionality has a long teething period. Both of the 2 landings had bugs. Second landing did not regress anything, hopefully things are now working with bug 732970 fixed.

We can only consider uplifting this once this has baked on trunk for a while and we have serverside evidence that things work. It takes about 7 days to get sufficient confidence that telemetry is working as expected.
Comment 72 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2012-05-29 11:38:24 PDT
Is there anything QA can do to verify this fix?
Comment 73 Nathan Froyd [:froydnj] 2012-05-29 12:31:00 PDT
This was all disabled in bug 756152.
Comment 74 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2012-05-29 13:00:21 PDT
Thanks Nathan. Marking this [qa-]. We will verify-disabled in bug 756152. If there is anything you can add to that bug to aid in our verification, please comment there. Thanks.

Note You need to log in before you can comment on or make changes to this bug.