Closed Bug 707320 Opened 13 years ago Closed 12 years ago

Allow histograms to persist to disk

Categories

(Toolkit :: Telemetry, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla13
Tracking Status
firefox12 + unaffected
firefox13 --- fixed

People

(Reporter: taras.mozilla, Assigned: froydnj)

References

(Blocks 1 open bug)

Details

(Whiteboard: [qa-])

Attachments

(5 files, 13 obsolete files)

1.23 KB, patch
froydnj
: review+
Details | Diff | Splinter Review
27.59 KB, patch
taras.mozilla
: review+
mccr8
: checkin+
Details | Diff | Splinter Review
13.44 KB, patch
froydnj
: review+
Details | Diff | Splinter Review
10.34 KB, patch
froydnj
: review+
Details | Diff | Splinter Review
2.53 KB, patch
froydnj
: review+
Details | Diff | Splinter Review
This way we can track stuff on shutdown.
I guess this would block bug 706647; we're not going to want to add a telemetry ping to shutdown to report on how long it took to shutdown plugins?
Blocks: 706647
Blocks: 708674
I was talking to Nathan/Sid about this. Seems like the way to do this is to serialize data on shutdown, then
on startup:
a) if there is no serialized telemetry data, send a ping same as we do now
b) if there is serialized data:
b1) send serialized data
b2) reset UID, wipe all histograms
Summary: Allow certain histograms to persist to disk → Allow histograms to persist to disk
This is basically what I'm implementing.  I'm planning on marking certain histograms as "persistent" so we're not saving several tens of histograms to disk.  Then we can send those the data from those histograms as an initial ping after startup, along with the saved UUID.  We don't have to worry about combining histograms--I'm planning on just reflecting the serialized data into JS--and we don't have to deal with clearing histograms, either.
Nathan, after reflecting on this for a while: what's the benefit of not storing certain histograms on disk? Disk used by histograms should be pretty minimal, we can further reduce it with zlib compression(if needed).
Posting a WIP.  This patch, however trivial, is ready for review.

This patch is driven by a couple of observations:

1. We're only going to serialize SampleSets at shutdown;
2. We don't want to mix previous-session SampleSets and current-session SampleSets;
3. We want to reflect previous-session information into JS similarly to how histogramSnapshots works now.

All of these mean that we need an interface to reflect SampleSets that didn't come from Histogram::SnapshotSample.  This patch provides that interface.
Attachment #581294 - Flags: review?(mozilla)
This patch implements the actual save-and-load histogram bits.  It persists all histograms to disk, rather than marking specific histograms as persistent.  Back-of-the-envelope calculations suggest that the persistent file is going to be 10s of K, so not too bad.

I haven't tested it because I don't have the JavaScript bits done yet; those bits would be the next part.  Should be about 50 lines or so.

Feedback on the approach welcome; if you think it's good enough for r+, please indicate as such.
Attachment #581296 - Flags: feedback?(mozilla)
Comment on attachment 581294 [details] [diff] [review]
0. provide reflection for separate Histograms and SampleSets

should ss be const?
Attachment #581294 - Flags: review?(taras.mozilla) → review+
Comment on attachment 581296 [details] [diff] [review]
1. C++ interface bits for saving histograms to disk

>+    Histogram::SampleSet ss;
>+    h->SnapshotSample(&ss);
>+
>+    // Write out explicit lengths to improve corruption detection.
>+    size_t name_length = strlen(gHistograms[i].id);
>+    if (!(pickle.WriteData(gHistograms[i].id, name_length)
>+          && ss.Serialize(&pickle))) {
>+      return false;
>+    }
>+  }
>+
>+  return true;

Don't need an if statement there. Same elsewhere

>+}
>+
>+bool
>+DeserializeHistogramData(AutoFDClose &fd, Pickle &pickle, const char *uuid,
>+                         void **iter, JSContext *cx, jsval *ret)
>+{
>+  size_t n_histograms = ArrayLength(gHistograms);

Use HistogramCount here and elsewhere
Attachment #581296 - Flags: feedback?(taras.mozilla)
Comment on attachment 581296 [details] [diff] [review]
1. C++ interface bits for saving histograms to disk

You should make sure that you either don't serialize empty/non-existent histograms or don't deserialize them
Comment on attachment 581296 [details] [diff] [review]
1. C++ interface bits for saving histograms to disk

Should also read/write off main thread on the C++ side. ie have the api be read(callback). For writes we can just copy data into a buffer, fire off a thread and forget about it.
Blocks: 711217
(In reply to Taras Glek (:taras) from comment #8)
> >+    Histogram::SampleSet ss;
> >+    h->SnapshotSample(&ss);
> >+
> >+    // Write out explicit lengths to improve corruption detection.
> >+    size_t name_length = strlen(gHistograms[i].id);
> >+    if (!(pickle.WriteData(gHistograms[i].id, name_length)
> >+          && ss.Serialize(&pickle))) {
> >+      return false;
> >+    }
> >+  }
> >+
> >+  return true;
> 
> Don't need an if statement there. Same elsewhere

Why not?  We had an error, no sense in writing the rest of the file if there were problems.
(In reply to Taras Glek (:taras) from comment #10)
> Should also read/write off main thread on the C++ side. ie have the api be
> read(callback). For writes we can just copy data into a buffer, fire off a
> thread and forget about it.

I assume for reads, you mean something like NS_DispatchToCurrentThread?  For writes, will there be problems using the JSContext we received on a different thread?
(In reply to Nathan Froyd (:froydnj) from comment #11)
> (In reply to Taras Glek (:taras) from comment #8)
> > >+    Histogram::SampleSet ss;
> > >+    h->SnapshotSample(&ss);
> > >+
> > >+    // Write out explicit lengths to improve corruption detection.
> > >+    size_t name_length = strlen(gHistograms[i].id);
> > >+    if (!(pickle.WriteData(gHistograms[i].id, name_length)
> > >+          && ss.Serialize(&pickle))) {
> > >+      return false;
> > >+    }
> > >+  }
> > >+
> > >+  return true;
> > 
> > Don't need an if statement there. Same elsewhere
> 
> Why not?  We had an error, no sense in writing the rest of the file if there
> were problems.

return pickle.WriteData(gHistograms[i].id, name_length) && ss.Serialize(&pickle); 

> 
> I assume for reads, you mean something like NS_DispatchToCurrentThread?  For
> writes, will there be problems using the JSContext we received on a
> different thread?

Yes and yes, you should extract data out of js on main thread.
(In reply to Taras Glek (:taras) from comment #13)
> (In reply to Nathan Froyd (:froydnj) from comment #11)
> > Why not?  We had an error, no sense in writing the rest of the file if there
> > were problems.
> 
> return pickle.WriteData(gHistograms[i].id, name_length) &&
> ss.Serialize(&pickle); 

That if statement lives in a loop over all histograms; we don't want to return after writing the first one.
Totally redid I/O to defer it to a more convenient time.

TelemetrySessionData is an XPCOM object for convenience reasons; if we really want to, I suppose it could be turned into a JS object instead.
Attachment #581296 - Attachment is obsolete: true
Attachment #584610 - Flags: review?(taras.mozilla)
Hooks TelemetryPing up to the new C++ functionality.
Attachment #584611 - Flags: review?(taras.mozilla)
Attached patch 3. Tests for C++ load/save (obsolete) — Splinter Review
Basic tests that callbacks and such work.
Attachment #584613 - Flags: review?(taras.mozilla)
Comment on attachment 584610 [details] [diff] [review]
1. C++ interface bits for saving histograms to disk

This seems a bit too complicated, but I guess using xpcom to save function references does that.
TelemetrySessionData::DeserializeHistogramData

needs assertions/comments. Eg it's not clear why we'd have an entry serialized, but not read it in.

Don't use new/delete, use nsAutoPtr, nsAutoPtr::forget to transfer ownership.

I don't think you need two separate completion interfaces. One interfaces with "saved" "loaded" would do. I think xpcom allows leaving js methods unimplemented (ie they just return some nsresult failure code).
Attachment #584610 - Flags: review?(taras.mozilla) → review-
Comment on attachment 584611 [details] [diff] [review]
2. JS bits to send previously-saved session bits

This is pretty simple(in a good way).
A couple of observations:
1) I think we should set a custom reason string for the persisted data since it should be somewhat different from short-termed idle-daily data. Call it "unpickled" or something.

2) What happens when a) submission for old data fails, b) what if we shutdown before resubmitting data. Might be worth making it more persistent..ie keep last x sessions?

r-ing to address 1. Can do 2 in a followup bug.
Attachment #584611 - Flags: review?(taras.mozilla) → review-
Comment on attachment 584613 [details] [diff] [review]
3. Tests for C++ load/save

I'm not sure if tests can pollute the temp dir. Can you ask on #developers if that's ok?

Other than that looks good.
Attachment #584613 - Flags: review?(taras.mozilla) → review+
(In reply to Taras Glek (:taras) from comment #18)
> I don't think you need two separate completion interfaces. One interfaces
> with "saved" "loaded" would do. I think xpcom allows leaving js methods
> unimplemented (ie they just return some nsresult failure code).

My impression from chatting on #developers about how to do callbacks suggested that the [function] interfaces needed to be single-method so that XPCOM knows which method the callback is meant to implement.  We do have an instance or two of multi-method [function] interfaces in the tree, but I don't know how those work or if they work at all (might be dead code?).

(In reply to Taras Glek (:taras) from comment #20)
> I'm not sure if tests can pollute the temp dir. Can you ask on #developers
> if that's ok?

The tests for the StartupCache touch the temp dir, so there's at least precedent.  The removeFile do_register_cleanup was intended to attempt to clean up after ourselves.
(In reply to Taras Glek (:taras) from comment #19)
> 1) I think we should set a custom reason string for the persisted data since
> it should be somewhat different from short-termed idle-daily data. Call it
> "unpickled" or something.

AFAICS, the reason parameter to send() is never actually sent; it's just used to determine whether to send to a fixed URL or a session-unique UUID URL.  Is that what you were thinking of changing?  If so, did you want to include that in the ping payload?
Blocks: 715299
(In reply to Nathan Froyd (:froydnj) from comment #22)
> (In reply to Taras Glek (:taras) from comment #19)
> > 1) I think we should set a custom reason string for the persisted data since
> > it should be somewhat different from short-termed idle-daily data. Call it
> > "unpickled" or something.
> 
> AFAICS, the reason parameter to send() is never actually sent; it's just
> used to determine whether to send to a fixed URL or a session-unique UUID
> URL.  Is that what you were thinking of changing?  If so, did you want to
> include that in the ping payload?

You are right.

http://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/TelemetryPing.js#210 reason is sent here. 

Still seems like we should change it for picked packets.

re callbacks: It's a minor enough api, existing approach is fine.
removing privacy-review-needed keyword as this has been updated in the privacy review for Telemetry.  froydnj: is your code still following the procedure outlined in comment 2?
No, it's not.  The scheme now is:

On shutdown, save all histograms we've collected and the UUID of the session to disk.

When sending a telemetry ping, check for saved data.  If there is any, also send a ping with the saved data associated with the old session's UUID.
Okay, so that results in two pings if there's saved data.  Is the saved data and UUID erased after it is successfully transmitted?

-Sid
Mmmm, the saved data is erased from memory (insofar as references to it go away), but not from disk.  Was going to rely on browser shutdown overwriting saved data with new saved data.
New C++ patch, with some comments addressed and a few major changes.

Since pickling blasts raw memory out to disk, we need to care about endianness and word size.  Therefore, as StartupCache does, we need to make the saved datafile dependent on both of those characteristics.  The easiest way to do this is to move the filename bits into C++ and out of JS.  I haven't quite figured whether testing is feasible with this change or not (our tests were constructing the saved filename manually before).

This version also changes how things are written out to disk.

- Version number for the file format, just in case.
- Instead of saving all histograms (and possibly creating a bunch in the process), we now take the path of GetHistogramSnapshots and just iterate over the ones that have explicitly collected data.  This makes reading somewhat more complicated, of course.

This second change also improves the situation we'd find ourselves when reading histograms in.  As noted in TelemetrySessionData::ReflectSamples, we need to construct Histograms for every saved histogram we read in, by virtue of needing ranges and whatnot.  By only saving the "interesting" histograms, we reduce the chance of constructing new histograms at load time (since histograms that were present in the previous session are most likely present in the current session).  I don't know of a good way around this, short of rewriting a fair amount of code.
Attachment #584610 - Attachment is obsolete: true
Attachment #586442 - Flags: review?(taras.mozilla)
Updated JS bits for review comments and for moving filename bits into C++.
Attachment #584611 - Attachment is obsolete: true
Attachment #586444 - Flags: review?(taras.mozilla)
(In reply to Nathan Froyd (:froydnj) from comment #27)
> Mmmm, the saved data is erased from memory (insofar as references to it go
> away), but not from disk.  Was going to rely on browser shutdown overwriting
> saved data with new saved data.

Please delete it from disk after it's successfully sent.  This will minimize the chance we'll send it again by accident and minimizes the time data is persisted to disk.

Also, is any of the data saved on shutdown merged with new data collected after re-start, or are the sessions (UUID and data) kept completely separate?
(In reply to Sid Stamm [:geekboy] from comment #30)
> Please delete it from disk after it's successfully sent.  This will minimize
> the chance we'll send it again by accident and minimizes the time data is
> persisted to disk.

Can do.

> Also, is any of the data saved on shutdown merged with new data collected
> after re-start, or are the sessions (UUID and data) kept completely separate?

We only persist one session at the moment; there's no merging involved.  bug 715299 concerns persisting multiple sessions, if you'd like to go comment prior to investigation on what to watch out for (or if the idea is undesirable from a privacy perspective)
Depends on: 704176
Comment on attachment 586442 [details] [diff] [review]
1. C++ interface bits for saving histograms to disk, v2

old patch
Attachment #586442 - Flags: review?(taras.mozilla)
Comment on attachment 586444 [details] [diff] [review]
2. JS bits to send previously-saved session bits, v2

old patch
Attachment #586444 - Flags: review?(taras.mozilla)
Rebasing against HEAD.  Carrying over r+.
Attachment #581294 - Attachment is obsolete: true
Attachment #589269 - Flags: review+
Refresh against HEAD.  We now drop histograms that we don't recognize on the floor during deserialization--assuming that we can read their data.  Partly need to do this so that tests work, but it's a good thing to do in general.
Attachment #586442 - Attachment is obsolete: true
Attachment #589270 - Flags: review?(taras.mozilla)
Refresh against HEAD.  I moved the wordsize, endian-naming bits into JS; I think it's worthwhile to keep them even with bug 704176 in the tree.  Plus it's somewhat less complicated in JS.  Doing naming in JS means that it's easier to test, too; less clutter in the nsITelemetry interface.
Attachment #586444 - Attachment is obsolete: true
Attachment #589271 - Flags: review?(taras.mozilla)
Attached patch part 4 - tests for C++ load/save (obsolete) — Splinter Review
Refresh against HEAD and with everything else.  Carrying over r+.
Attachment #584613 - Attachment is obsolete: true
Attachment #589272 - Flags: review+
Comment on attachment 589270 [details] [diff] [review]
part 2 - C++ bits for saving histograms to disk, v3

This looks good. idl docs should mention writing/reading occurs on a different thread.
Attachment #589270 - Flags: review?(taras.mozilla) → review+
Attachment #589271 - Flags: review?(taras.mozilla) → review+
Having r+ed above, I worry that this creates a race condition(ie if we quit before we get to the async save thread). I think saveHistograms should have a sync parameter which can be set when calling it on shutdown. 
I think we should also do async saves periodically in case the browser crashes(ie while polling for memory usage, we can also trigger a save once an hour).
Blocks: 719167
Blocks: 719437
Adding isSynchronous flag to saveHistograms and tweaking IDL comments, as requested.  Nothing major, so carrying over r+.
Attachment #589270 - Attachment is obsolete: true
Attachment #589901 - Flags: review+
Synchronously save histograms at quit-application-granted.  Carrying over r+.
Attachment #589271 - Attachment is obsolete: true
Attachment #589902 - Flags: review+
Check for the existence of the temporary file before deleting it.  Nothing big, carrying over r+.
Attachment #589272 - Attachment is obsolete: true
Attachment #589903 - Flags: review+
Keywords: checkin-needed
No longer blocks: 706647
No longer blocks: 722545
Depends on: 722545
Backing this out due to startup crashes (particularly, bug 722545).
Depends on: 722625
Backed out from m-c before the uplift to FF13.  But this backout may not be part of the uplift, depending on where we branch.  So we'll have to keep an eye on it.

https://hg.mozilla.org/mozilla-central/rev/d74a924a149b
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Justin Lebar [:jlebar] from comment #46)
> Backed out from m-c before the uplift to FF13.  But this backout may not be
> part of the uplift, depending on where we branch.  So we'll have to keep an
> eye on it.
> 
> https://hg.mozilla.org/mozilla-central/rev/d74a924a149b

This backout has been undone on m-c since it caused M-oth orange:
https://tbpl.mozilla.org/?rev=d74a924a149b

https://hg.mozilla.org/mozilla-central/rev/974b0efab5c8

akeybl is going to revert the backout on aurora as well, since d74a924a149b (the backout) made it there, but 974b0efab5c8 didn't.

The startup crashes in comment 45 therefore will need to be resolved on both m-c and aurora.
Not quite sure how to handle this, but here's a patch that correctly backs out these patches on trunk.  It's not a straight revision because TelemetryPing.getSessionPayloadAndSlug is also used in the observe method.

TBPL is here: https://tbpl.mozilla.org/?tree=Try&rev=bd4b87fee4f4
M-oth contains the bits that wind up calling getSessionPayloadAndSlug (and were turned orange on the first backout), and the xpcshell tests hit all the Telemetry tests.
Attachment #593599 - Flags: review?(taras.mozilla)
Attachment #593599 - Flags: review?(taras.mozilla) → review+
Blocks: 723802
I just hit an exception in getSessionPayloadAndSlug using Nightly, sync'd from mozilla-central a few days ago:

Timestamp: 12-02-03 08:04:09 AM
Error: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsITelemetrySessionData.snapshots]"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame :: file:///...objdir.../dist/bin/components/TelemetryPing.js :: getSessionPayloadAndSlug :: line 432"  data: no]
Source File: file:///...objdir.../dist/bin/components/TelemetryPing.js
Line: 432

Here's the code at and around line 432:

  getSessionPayloadAndSlug: function getSessionPayloadAndSlug(reason) {
    // Use a deterministic url for testing.
    let isTestPing = (reason == "test-ping");
    let havePreviousSession = !!this._prevSession;
    let slug = (isTestPing
                ? reason
                : (havePreviousSession
                   ? this._prevSession.uuid
                   : this._uuid));
    let payloadObj = {
      ver: PAYLOAD_VERSION,
      // Send a different reason string for previous session data.
      info: this.getMetadata(havePreviousSession ? "saved-session" : reason),
    };
    if (havePreviousSession) {
>>432>>      payloadObj.histograms = this.getHistograms(this._prevSession.snapshots);
    }
    else {
      payloadObj.simpleMeasurements = getSimpleMeasurements();
      payloadObj.histograms = this.getHistograms(Telemetry.histogramSnapshots);
      payloadObj.slowSQL = Telemetry.slowSQL;
    }
    return { previous: !!havePreviousSession, slug: slug, payload: JSON.stringify(payloadObj) };
  },
Target Milestone: mozilla12 → ---
Comment on attachment 593599 [details] [diff] [review]
fixed backout patch

[Approval Request Comment]
Regression caused by (bug #): This bug.
User impact if declined: Crashes on startup if telemetry is enabled.
Testing completed (on m-c, etc.): Patch survived testing on mozilla-inbound, recently checked into mozilla-central.
Risk to taking this patch (and alternatives if risky): Epsilon.
String changes made by this patch: None.
Attachment #593599 - Flags: approval-mozilla-aurora?
Comment on attachment 593599 [details] [diff] [review]
fixed backout patch

[Triage Comment]
Approved for Aurora.
Attachment #593599 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #593599 - Flags: checkin?
Please checkin the backout patch on aurora.  Adding the a=akeybl would be great too, if you'd be so kind.
Keywords: checkin-needed
Comment on attachment 593599 [details] [diff] [review]
fixed backout patch

backed out checked in.

https://hg.mozilla.org/releases/mozilla-aurora/rev/26c731b682ce
Attachment #593599 - Flags: checkin? → checkin+
back out checked in, rather...
Keywords: checkin-needed
The reason that we had debug-only crashes was that the Histogram class is, by design, not thread-safe.  Each Histogram, in addition to maintaining counts for its buckets, also maintains a "redundant count", a sum of all the counts we have accumulated thus far.  This produces a situation where we have a race between two threads, A and B:

A: Inserts a count into a histogram H
B: Snapshots histogram H
A: Updates the redundant count of H

B now has data where sum(counts) != redundant_count.  If we write the snapshot that B took to disk and read it later, we hit a debug-only check in the deserialization code that sum(counts) == redundant_count.  This check was the cause of bugs like bug 722545.  On opt builds, this degrades to a mere indication that deserialization failed.  We take this indication to mean that *all* of the saved histogram data is corrupt.

This is clearly not a good situation.  Note, however, that the histogram code does allow for this possibility when checking for corruption and permits the two count measures to differ by no more than 1 when declaring a histogram not corrupt.  So there is a bit of inconsistency here: the code is OK with drift in-memory, but has zero tolerance when it comes to serialization.

Options for fixing this:

1) One big Telemetry lock that you have to grab before accumulating data or snapshotting data.  This is suboptimal.

Note that we snapshot histograms in more places than just persisting them to disk; we snapshot them for reflection into JS, for instance.

2) Per-histogram locks, effectively making accumulation and snapshotting thread-safe.  The methods on Histogram for accumulating data are virtual for precisely this sort of thing, though I haven't looked to see if you can extend them easily in practice.

3) Some sort of reader-writer locking scheme where you can accumulate into any number of histograms simultaneously, but snapshotting is atomic with respect to accumulation.  I like this one, but it'd be a good deal more complicated, I think, than any of the others.

Note that this would leave us open to inconsistent sum(counts) and redundant_count, as two threads could be accumulating into the same histogram simultaneously.

4) Changing how snapshots are serialized, either by not serializing redundant_count or by inventing some checksum to make sure that counts and redundant_count didn't get corrupted with respect to each other.  Then we can remove the sum(counts) == redundant_count check.  This leaves the race in place, but makes the serialization process saner, IMHO.  It's OK to read in slightly corrupted histograms, as we tolerate slightly corrupted histograms in-memory (see above discussion).

This option requires hacking on chromium-imported code, which may or may not make life more difficult for future merges.  At the very least, I think distinguishing between the "I didn't read all of the saved data" versus "the data looks corrupt" in the deserialization code would be useful.

I think options 2 and 4 are the most reasonable; I'm leaning slightly towards option 4, but I'm curious whether anybody else has an opinion.
Depends on: 726761
Rebasing.
Attachment #589902 - Attachment is obsolete: true
Attachment #600913 - Flags: review+
Attachment #589901 - Attachment is obsolete: true
Whiteboard: [autoland-try:589269,600912,600913,589903]
Whiteboard: [autoland-try:589269,600912,600913,589903] → [autoland-in-queue]
Autoland Patchset:
	Patches: 589269, 600912, 600913, 589903
	Branch: mozilla-central => try
Error applying patch 589903 to mozilla-central.
patching file toolkit/components/telemetry/tests/unit/test_nsITelemetry.js
Hunk #1 FAILED at 99
1 out of 1 hunks FAILED -- saving rejects to file toolkit/components/telemetry/tests/unit/test_nsITelemetry.js.rej
abort: patch failed to apply

Could not apply and push patchset:
Whiteboard: [autoland-in-queue]
Update All The Patches!
Attachment #589903 - Attachment is obsolete: true
Attachment #600921 - Flags: review+
Whiteboard: [autoland-try:589269,600912,600913,600921]
Whiteboard: [autoland-try:589269,600912,600913,600921] → [autoland-in-queue]
Autoland Patchset:
	Patches: 589269, 600912, 600913, 600921
	Branch: mozilla-central => try
	Destination: http://hg.mozilla.org/try/pushloghtml?changeset=8f0fae38e776
Try run started, revision 8f0fae38e776. To cancel or monitor the job, see: https://tbpl.mozilla.org/?tree=Try&rev=8f0fae38e776
The try run looks OK, despite going for 24 hours.  Let's try to get this in and have some soak time before the Aurora uplift.
Keywords: checkin-needed
Try run for 8f0fae38e776 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=8f0fae38e776
Results (out of 181 total builds):
    exception: 1
    success: 144
    warnings: 20
    failure: 16
Builds (or logs if builds failed) available at:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/autolanduser@mozilla.com-8f0fae38e776
 Timed out after 12 hours without completing.
Whiteboard: [autoland-in-queue]
I've been told that this bug prevents us from getting post-startup telemetry data. If that's the case, we should consider uplifting to Aurora 12 as soon as this has had a week or so to bake. We need telemetry data to know whether or not we have perf regressions across releases.
(In reply to Alex Keybl [:akeybl] from comment #67)
> I've been told that this bug prevents us from getting post-startup telemetry
> data.

This bug prevents us from receiving data collected in the last ~day before a running Firefox instance is shutdown.  It does *not* affect normal telemetry gathering; we can still gather telemetry data for perf regression information across releases without this bug being fixed in 12.

I'm inclined to not nominate this for Aurora 12.  Given the problems that showed up last time, I'd like to let this bake for a complete release cycle.
This was backed out of 12, so it doesn't affect 12.
(In reply to Nathan Froyd (:froydnj) from comment #68)
> This bug prevents us from receiving data collected in the last ~day before a
> running Firefox instance is shutdown.  It does *not* affect normal telemetry
> gathering; we can still gather telemetry data for perf regression
> information across releases without this bug being fixed in 12.
> 
> I'm inclined to not nominate this for Aurora 12.  Given the problems that
> showed up last time, I'd like to let this bake for a complete release cycle.

My concern here is that we're many times investigating performance regressions that accumulate over time and cause Firefox users to frequently restart their browser - from talking with Lawrence it sounds like telemetry data on these users would be lost.

Is there any targeted QA that we could do to land this fix a cycle early? Telemetry and other data gathering are the places that we're willing to take on a little more risk since it could be  eye opening in our understanding of perf regressions.

If this carries too much risk, I don't mean to force the issue - I just want to make sure we're on the same page that this is the only external performance data we collect and can use to resolve perf regressions (seen in both FF10 and FF11 most recently).
(In reply to Alex Keybl [:akeybl] from comment #70)
> (In reply to Nathan Froyd (:froydnj) from comment #68)
> > This bug prevents us from receiving data collected in the last ~day before a
> > running Firefox instance is shutdown.  It does *not* affect normal telemetry
> > gathering; we can still gather telemetry data for perf regression
> > information across releases without this bug being fixed in 12.
> > 
> > I'm inclined to not nominate this for Aurora 12.  Given the problems that
> > showed up last time, I'd like to let this bake for a complete release cycle.
> 
> My concern here is that we're many times investigating performance
> regressions that accumulate over time and cause Firefox users to frequently
> restart their browser - from talking with Lawrence it sounds like telemetry
> data on these users would be lost.

Telemetry only records first 5-10minutes of browsing(+ every 24hours) because of the way our idle-daily notification works.

> 
> Is there any targeted QA that we could do to land this fix a cycle early?
> Telemetry and other data gathering are the places that we're willing to take
> on a little more risk since it could be  eye opening in our understanding of
> perf regressions.
> 
> If this carries too much risk, I don't mean to force the issue - I just want
> to make sure we're on the same page that this is the only external
> performance data we collect and can use to resolve perf regressions (seen in
> both FF10 and FF11 most recently).

The fact that we are tracking ff10 perf is news to me.

This functionality has a long teething period. Both of the 2 landings had bugs. Second landing did not regress anything, hopefully things are now working with bug 732970 fixed.

We can only consider uplifting this once this has baked on trunk for a while and we have serverside evidence that things work. It takes about 7 days to get sufficient confidence that telemetry is working as expected.
Depends on: 733977
Blocks: 736268
Is there anything QA can do to verify this fix?
This was all disabled in bug 756152.
Thanks Nathan. Marking this [qa-]. We will verify-disabled in bug 756152. If there is anything you can add to that bug to aid in our verification, please comment there. Thanks.
Whiteboard: [qa-]
You need to log in before you can comment on or make changes to this bug.