Closed Bug 1345153 Opened 3 years ago Closed 3 years ago

pingSender assumes the Data Pipeline deduplicates crash pings

Categories

(Toolkit :: Telemetry, defect, P1)

defect
Points:
3

Tracking

()

VERIFIED FIXED
mozilla55
Tracking Status
firefox-esr45 --- unaffected
firefox52 --- unaffected
firefox-esr52 --- unaffected
firefox53 --- unaffected
firefox54 --- fixed
firefox55 --- fixed

People

(Reporter: chutten, Assigned: Dexter)

References

Details

(Whiteboard: [measurement:client])

Attachments

(4 files, 2 obsolete files)

pingSender tries once to send one "crash" ping when Firefox abnormally exits. Whether it succeeds or not, the crash manager will send a "crash" ping with an identical id when Firefox is restarted... whenever the user gets around to it.

This results in about a 39% duplication rate of main-process "crash" pings since the Feb 15 build (Firefox 54) that introduced pingSender.
Whiteboard: [measurement:client:tracking]
Blocks: 1310703
See Also: → 1342111
See Also: → 1343504
So I'm going to go out on a limb here and suggest that there might be something that can be done client-side... gsvelto? Alessio? Who wants it?
Flags: needinfo?(gsvelto)
Flags: needinfo?(alessio.placitelli)
Two things would be needed for that:

- Waiting to discover if the pingsender did successfully send the ping
- Communicating this success reliably to Firefox when it restarts

Both are problematic. Discovering if the ping was successfully sent or not means in the worst case waiting for a timeout of the HTTP POST which can take a while and we don't want the user to wait for that before we restart Firefox. Communicating this information back means writing something in the user profile - something that the crashreporter is already doing in the form of the event file, but without blocking to wait for anything. The reason we want to write out the event file rapidly is again that we don't want Firefox to restart and pick up an incomplete one. The only exception to this is when the user submits the crash report in which case we write out the submission event only after we confirmed the submission, but we're making the user wait before restarting Firefox in that case.

In a nutshell, it's possible to do it, but it would be a brittle, race-prone solution.
Flags: needinfo?(gsvelto)
What about the proposal from bug 1343504:
- immediately tag the event file as "ping handled"
- always write a crash ping to disk (similar to the crash report locations)
- pass the path to ping sender
- ping sender deletes crash ping on success
- we check back on the remaining duplicate rate

The remaining duplicates should be mostly from immediate Firefox restarts and thus catchable with bug 1342111.
Flags: needinfo?(alessio.placitelli)
(In reply to Georg Fritzsche [:gfritzsche] from comment #3)
> What about the proposal from bug 1343504:
> - immediately tag the event file as "ping handled"
> - always write a crash ping to disk (similar to the crash report locations)
> - pass the path to ping sender
> - ping sender deletes crash ping on success
> - we check back on the remaining duplicate rate

so if I understand correctly your idea we'd put the path to the crash ping into the event file; then when Firefox restarts it check if the file is there or not. If it isn't then we've successfully sent the ping already and we don't send another one, if it is then we send the crash ping and delete the file, correct? In this case we'd leave the ping somewhere inside the "Crash Reports" folder instead of the user profile so that we don't have to deal with concurrent access to it.

> The remaining duplicates should be mostly from immediate Firefox restarts
> and thus catchable with bug 1342111.

Indeed.
(In reply to Gabriele Svelto [:gsvelto] from comment #4)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #3)
> > What about the proposal from bug 1343504:
> > - immediately tag the event file as "ping handled"
> > - always write a crash ping to disk (similar to the crash report locations)
> > - pass the path to ping sender
> > - ping sender deletes crash ping on success
> > - we check back on the remaining duplicate rate
> 
> so if I understand correctly your idea we'd put the path to the crash ping
> into the event file; then when Firefox restarts it check if the file is
> there or not. If it isn't then we've successfully sent the ping already and
> we don't send another one, if it is then we send the crash ping and delete
> the file, correct?

I don't think we need to put a path into the event file for that.
Just mark the event file as "processed" (or delete already? does anything else need it?).
Independently of that, on Firefox startup, send any pings left in the "Crash Pings" folder.

> In this case we'd leave the ping somewhere inside the
> "Crash Reports" folder instead of the user profile so that we don't have to
> deal with concurrent access to it.

Right, although i think we should use a separate directory.
Presumably bug bug 1342111 will handle the pipeline de-duping side.
Lets make this bug about the client-side handling.

Are there follow-up bugs for the crash data jobs that need to be filed to clean up the data?
Flags: needinfo?(chutten)
gsvelto is planning to work on the client-side.
Component: Metrics: Pipeline → Telemetry
Priority: -- → P3
Product: Cloud Services → Toolkit
Whiteboard: [measurement:client:tracking] → [measurement:client]
Well, crash_summary is just a list of crashpings, so dedupe wouldn't be part of its mandate.

crash_aggregates, on the other hand, is trying to count events, not pings, so maybe it needs a dedupe pass for the last month's worth of data. +ni? mdoglio, who owns crash_aggregates for what bugs he might like, and if he can think of other things that might want to dedupe out these main-process "crash" pings
Flags: needinfo?(chutten) → needinfo?(mdoglio)
See Also: → 1345901
I filed bug 1345901 and bug 1345903 to cover the deduping and backfill of crash_aggregates and crash_summary.
Flags: needinfo?(mdoglio)
See Also: → 1345903
Assignee: nobody → gsvelto
Blocks: 1343504
Status: NEW → ASSIGNED
Blocks: 1348250
No longer blocks: 1343504
Duplicate of this bug: 1343504
Blocks: 1336360
Points: --- → 3
Priority: P3 → P1
Not ready for review, works fine on Linux, still needs to be tested on Windows/Mac.
Comment on attachment 8851005 [details] [diff] [review]
[PATCH WIP] Save crash pings to disk so that if they could not be sent by the pingsender they're persisted for later

Review of attachment 8851005 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/pingsender/pingsender.cpp
@@ +49,2 @@
>  
> +  if (argc == 3) {

Gabriele, is this still using the pipes? Since we're saving the ping to the disk, let's just load it from a file and remove the piping.
Currently yes but I will get rid of them in the final patch. I first want to make sure that everything else works fine.
Attachment #8851005 - Attachment is obsolete: true
OK, here's the patch which I hope works fine, but if it doesn't feel free to pick it up, polish and land it because I'll be away for the next two weeks :-) Long story short, this makes the pingsender pick up the crash ping from disk instead of from the pipe and deletes the ping file if it's sent successfully. It does integrate a lot of the code from bug 1343504 including the tests so that Telemetry and the crashreporter client are on the same page as far as using the pingsender goes. I've also modified the CrashManager only when the event file is not flagged with the CrashPingUUID field. If that field is present it means that we've already generated a ping and we can avoid sending another one. I've also ripped out the code that allowed us to override the ping UUID in Telemtry since it wasn't used anymore.

The saved pings will end in the application directory (i.e. ~/.mozilla/firefox, etc...) under the 'Pending Pings' folder. This is communicated to the crashreporter via environment variables like the rest of the stuff.

I've built and tested it on Windows and Linux (both 64-bit) and I hope it works elsewhere too.

Note that this patch is only part of the full fix with the missing part being bug 1348250 which will need to pick up the leftover pings and send them.
Comment on attachment 8851633 [details]
Bug 1345153 - When the pingsender fails to send a ping, persist it to disk so that it can be sent later via regular telemetry;

https://reviewboard.mozilla.org/r/123892/#review126396

I gave an high level look to the Telemetry changes, leaving out all the crashreporter/crashmanager changes.

::: toolkit/components/telemetry/TelemetryController.jsm:202
(Diff revision 1)
>     * @param {Boolean} [aOptions.addClientId=false] true if the ping should contain the client
>     *                  id, false otherwise.
>     * @param {Boolean} [aOptions.addEnvironment=false] true if the ping should contain the
>     *                  environment data.
>     * @param {Object}  [aOptions.overrideEnvironment=null] set to override the environment data.
> -   * @param {String} [aOptions.overridePingId=null] set to override the
> +   * @param {Boolean} [aOptions.usePingSender=false] if true, send the ping using the PingSender.

I think we can drop all the unrelated changes to TelemetryController.jsm, except for the removal of overridePingId.

::: toolkit/components/telemetry/TelemetrySend.jsm:211
(Diff revision 1)
>    shutdown() {
>      return TelemetrySendImpl.shutdown();
>    },
>  
>    /**
>     * Submit a ping for sending. This will:

I think we can drop all the changes to this file.

::: toolkit/components/telemetry/docs/internals/pingsender.rst:6
(Diff revision 1)
>  Ping Sender
>  ===========
>  
>  The ping sender is a minimalistic program whose sole purpose is to deliver a
> -telemetry ping. It accepts a single parameter which is the URL the ping will
> -be sent to (as an HTTP POST command) and once started it will wait to read the
> +telemetry ping. It accepts two parameters, the first one holds the URL the ping
> +will be sent to (as an HTTP POST command) and the second one the path to a file

"...the path to an uncompressed file ..."

::: toolkit/components/telemetry/pingsender/pingsender.cpp:11
(Diff revision 1)
>  #include <cstdlib>
>  #include <ctime>
> +#include <fstream>
>  #include <iomanip>
>  #include <string>
>  #include <cstdio>

Do we still need cstdio?

::: toolkit/components/telemetry/tests/unit/test_PingSender.js
(Diff revision 1)
>  
>    Assert.equal(req.getHeader("User-Agent"), "pingsender/1.0",
>                 "Should have received the correct user agent string.");
>    Assert.equal(req.getHeader("X-PingSender-Version"), "1.0",
>                 "Should have received the correct PingSender version string.");
> -  Assert.equal(req.getHeader("Content-Encoding"), "gzip",

Why are you removing this check?
Comment on attachment 8851633 [details]
Bug 1345153 - When the pingsender fails to send a ping, persist it to disk so that it can be sent later via regular telemetry;

https://reviewboard.mozilla.org/r/123892/#review126396

> Why are you removing this check?

Mistake on my part :)
Comment on attachment 8851633 [details]
Bug 1345153 - When the pingsender fails to send a ping, persist it to disk so that it can be sent later via regular telemetry;

https://reviewboard.mozilla.org/r/123892/#review126724

Ok, the Telemetry related changes look ok to me with the following changes. I think Gabriele is on PTO from today, so I'll take the bug from here and flag :chutten for review on the Telemetry bits.

::: toolkit/components/telemetry/Telemetry.cpp
(Diff revision 2)
>    if (NS_WARN_IF(NS_FAILED(rv))) {
>      return NS_ERROR_FAILURE;
>    }
>  
> -  // Create a pipe to send the ping contents to the ping sender
> -  ScopedPRFileDesc pipeRead;

Since we're removing these, can we also remove the *ScopedPRFileDesc* definition and the "mozilla/Scoped.h" include?

::: toolkit/components/telemetry/docs/internals/pingsender.rst:5
(Diff revision 2)
>  Ping Sender
>  ===========
>  
>  The ping sender is a minimalistic program whose sole purpose is to deliver a
> -telemetry ping. It accepts a single parameter which is the URL the ping will
> +telemetry ping. It accepts two parameters, the first one holds the URL the ping

Let's use bullets for the list of parameters, for readability.

::: toolkit/components/telemetry/nsITelemetry.idl:553
(Diff revision 2)
>     *
> -   * @param aUrl The telemetry server URL
> -   * @param aPing A string holding the ping contents
> +   * @param {String} aUrl The telemetry server URL
> +   * @param {String} aPingFilePath The path to the file holding the ping
> +   *        contents, if if sent successfully the pingsender will delete it.
>     *
>     * @throws NS_ERROR_FAILURE if we couldn't run the pingsender or if we

Let's remove "...or if we couldn't hand it the ping data" and mention that the function can fail also if the pingsender executable cannot be found.

::: toolkit/components/telemetry/pingsender/pingsender.cpp:184
(Diff revision 2)
>  
>    if (!Post(url, gzipPing)) {
>      exit(EXIT_FAILURE);
>    }
>  
> +  // If the path to a ping was provided, delete the file.

Since the path is always provided now, let's update the comment: "If the ping was successfully sent, delete the file."
Comment on attachment 8851633 [details]
Bug 1345153 - When the pingsender fails to send a ping, persist it to disk so that it can be sent later via regular telemetry;

Taking this over as Gabriele is on PTO.
Attachment #8851633 - Attachment is obsolete: true
Attachment #8851633 - Flags: review?(ted)
Attachment #8851633 - Flags: review?(alessio.placitelli)
Assignee: gsvelto → alessio.placitelli
Comment on attachment 8851953 [details]
Bug 1345153 - When the pingsender fails to send a ping, persist it to disk so that it can be sent later via regular telemetry;

https://reviewboard.mozilla.org/r/124200/#review126774

It is possible that I've missed something as I am very new to this patch. However, I was able to follow it and understand what was going on with only a little bothering of :Dexter on IRC, so I guess it can't be all bad :)

Certainly the logic on the Telemetry side seems in order. Exception handlers and crash reporters I'm a little more fuzzy on.

::: toolkit/crashreporter/client/crashreporter.h:161
(Diff revision 1)
>  std::ofstream* UIOpenWrite(const std::string& filename,
>                             bool append=false,
>                             bool binary=false);
>  void UIPruneSavedDumps(const std::string& directory);
>  
> -// Run the program specified by exename, passing it the parameter in arg and
> +// Run the program specified by exename, passing it the parameters in arg.

nit: I believe it's called "args" now.
Attachment #8851953 - Flags: review?(chutten) → review+
Comment on attachment 8851953 [details]
Bug 1345153 - When the pingsender fails to send a ping, persist it to disk so that it can be sent later via regular telemetry;

https://reviewboard.mozilla.org/r/124200/#review127200

I only really looked at the C++ code, hopefully that's all you needed! If you don't feel comfortable making the changes I suggested (given that you didn't write the original patch) none of them are really critical, they're just suggestions that I think would make things nicer.

::: toolkit/components/telemetry/Telemetry.cpp:2896
(Diff revision 1)
>    if (!attr) {
>      return NS_ERROR_FAILURE;
>    }
>  
> -  // Connect the pingsender standard input to the pipe and launch it
> -  PR_ProcessAttrSetStdioRedirect(attr, PR_StandardInput, pipeRead);
> +  // We pretend we're redirecting stdout to force NSPR not to show a
> +  // command prompt when launching the program.

We could also build the program as `-SUBSYSTEM:WINDOWS` on Windows to avoid this, but it's probably not worthwhile to get rid of this one line.

::: toolkit/crashreporter/client/crashreporter_unix_common.cpp:84
(Diff revision 1)
>    if (pid == -1) {
>      return false;
>    } else if (pid == 0) {
>      // Child
> -    if (usePipes) {
> -      if (dup2(fd[0], STDIN_FILENO) == -1) {
> +    size_t argvLen = args.size() + 2;
> +    char** argv = new char*[argvLen];

I would probably just use a `std::vector<char*>` here instead.

::: toolkit/crashreporter/client/ping.cpp:287
(Diff revision 1)
> +  ofstream* f = UIOpenWrite(aPath.c_str());
> +  bool success = false;
> +
> +  if (f->is_open()) {
> +    *f << aPing;
> +    f->flush();

Is the flush necessary for the good below to work? Otherwise I assume it will get flushed on close.

::: toolkit/crashreporter/nsExceptionHandler.cpp:1992
(Diff revision 1)
>  
> -  _wputenv(dataDirEnv.get());
> +  dirEnv.append(*directoryPath);
> +  delete directoryPath;
> +
> +#if defined(XP_WIN32)
> +  _wputenv(dirEnv.c_str());

I think `SetEnvironmentVariableW` would be preferable here:
https://msdn.microsoft.com/en-us/library/windows/desktop/ms686206(v=vs.85).aspx

::: toolkit/crashreporter/nsExceptionHandler.cpp:1996
(Diff revision 1)
> +#if defined(XP_WIN32)
> +  _wputenv(dirEnv.c_str());
>  #else
> -  // Save this path in the environment for the crash reporter application.
> -  nsAutoCString dataDirEnv("MOZ_CRASHREPORTER_DATA_DIRECTORY=");
> +  XP_CHAR* str = new XP_CHAR[dirEnv.size() + 1];
> +  strncpy(str, dirEnv.c_str(), dirEnv.size() + 1);
> +  PR_SetEnv(str);

Since we're handling Windows separately, we might as well just use `setenv` directly here, which doesn't require allocating a new buffer. It would require slight tweaking since `setenv` takes the variable and value separately.
Attachment #8851953 - Flags: review?(ted) → review+
Comment on attachment 8851953 [details]
Bug 1345153 - When the pingsender fails to send a ping, persist it to disk so that it can be sent later via regular telemetry;

https://reviewboard.mozilla.org/r/124200/#review127200

Thanks :ted! Your feedback makes sense. Unfortunately, I wasn't able to test the changes on Windows and, since I don't know too much about that section of Firefox, I'd rather not change that code :-) I'll leave a ni? for when Gabriele comes back so that he can follow-up!

I also think that I'll have to "drop" the issues otherwise MozReview won't let me land this :-\
Gabriele, would you mind taking a look at Ted's feedback from comment 23 when you're back? If that's reasonable, could you please file a bug to address it?
Flags: needinfo?(gsvelto)
Adding duplicate monitoring by docType Bug 1351773
Flags: needinfo?(ted)
Flags: needinfo?(gsvelto)
Flags: needinfo?(alessio.placitelli)
Comment on attachment 8852786 [details]
Bug 1345153 - Suppress valgrind warnings for PR_SetEnv in SetupCrashReporterDirectory.

https://reviewboard.mozilla.org/r/124958/#review127500
Attachment #8852786 - Flags: review?(n.nethercote) → review+
This is the try-push for both the patches: the valgrind failure seems to be fixed.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=1e7a8e1b2f3583fd17f9b5443c70a6d1720ff610
Pushed by alessio.placitelli@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/23cf03d6e45a
When the pingsender fails to send a ping, persist it to disk so that it can be sent later via regular telemetry; r=chutten,ted
https://hg.mozilla.org/integration/autoland/rev/a4791cc6fc74
Suppress valgrind warnings for PR_SetEnv in SetupCrashReporterDirectory. r=njn
Whiteboard: [measurement:client] → [measurement:client][measurement:client:uplift]
https://hg.mozilla.org/mozilla-central/rev/23cf03d6e45a
https://hg.mozilla.org/mozilla-central/rev/a4791cc6fc74
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Abe, this will be in today's Nightly. Would you please perform QA coverage on it as we need to uplift ASAP?

I think we should at least cover the following edge cases (if they're not already in the crash ping test plan!):

1) Firefox crashes and the crashreporter sends the crash ping through the pingsender, but the server is not available. We expect the crash ping to be saved, uncompressed, in %AppData%/Mozilla/Firefox/Pending Pings on windows (check if it's a valid crash ping!) and in /home/user/.mozilla/Firefox/Pending Pings on Linux.

2) Like case 1, with a change: the server is available. The ping will still be saved to disk in the aforementioned directories but it will be deleted from the disk right after being received from the server. Please check that the received ping has the correct structure (and conforms to the crash ping schema).

The other test cases from the crash ping test plan apply as well.
Flags: needinfo?(amasresha)
Ah, I almost forgot the most important thing in comment 34 :-)

After we restart Firefox in both cases (1) and (2) from the previous comment, no new crash ping must be generated and sent (which is the whole point of this bug :-D).
We have tested this on latest Nightly. 
The only concern we have is: when the crash ping is validated against the schema [1] using an online tool [2], validation fails on creation date.  
The error is " String '2017-04-03T15:44:02Z' does not match regex pattern '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{3}Z$'"

Screen capture: https://testing-1.tinytake.com/sf/MTQ2ODE5NV81MTg5NTc5
Test cases are here: https://public.etherpad-mozilla.org/p/1345153

It is also reproducible on nightly builds of 2017-03-01. 

[1] https://github.com/mozilla-services/mozilla-pipeline-schemas/blob/master/telemetry/crash.schema.json
[2] http://www.jsonschemavalidator.net/
Flags: needinfo?(amasresha) → needinfo?(alessio.placitelli)
Depends on: 1353106
Depends on: 1353141
Depends on: 1353451
(In reply to Abe - QA (:Abe_LV) from comment #36)
> We have tested this on latest Nightly. 
> The only concern we have is: ...
> 
> It is also reproducible on nightly builds of 2017-03-01. 

Since this is also happening on older nightlies, it is not related to this bug. I've filed bug 1353451 to look into it.
Flags: needinfo?(alessio.placitelli)
The ping validation issue will be addressed by Bug 1353451. 
The rest is all green, see test cases in comment 36. 
Closing as verified-fixed.
Status: RESOLVED → VERIFIED
Comment on attachment 8851953 [details]
Bug 1345153 - When the pingsender fails to send a ping, persist it to disk so that it can be sent later via regular telemetry;

Approval Request Comment
[Feature/Bug causing the regression]: 1310703
[User impact if declined]: We will double the amount of "crash" pings sent by the client, skewing server side analyses and increasing the storage costs once the regressing bug hits channels with larger populations.
[Is this code covered by automated tests?]: Yes
[Has the fix been verified in Nightly?]: Yes, see comment 38
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: Both patches in this bug, bug 1348250 and bug 1345108.
[Is the change risky?]: Low risk
[Why is the change risky/not risky?]: This is making "crash" ping sending more reliable by saving them locally before sending them through the pingsender. If anything goes wrong, we loose the crash pings sent by the crashreporter. This is unlikely, as these patches have both QA and test coverage.
[String changes made/needed]: None
Attachment #8851953 - Flags: approval-mozilla-aurora?
Comment on attachment 8851953 [details]
Bug 1345153 - When the pingsender fails to send a ping, persist it to disk so that it can be sent later via regular telemetry;

Fix a duplicated crash ping issue. Aurora54+.
Attachment #8851953 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Needs rebasing for Aurora uplift.
Flags: needinfo?(alessio.placitelli)
Flags: needinfo?(alessio.placitelli)
Attachment #8854751 - Attachment is patch: true
Attachment #8854750 - Attachment is patch: true
Depends on: 1354468
Whiteboard: [measurement:client][measurement:client:uplift] → [measurement:client]
(In reply to Alessio Placitelli [:Dexter] from comment #25)
> Gabriele, would you mind taking a look at Ted's feedback from comment 23
> when you're back? If that's reasonable, could you please file a bug to
> address it?

Filed bug 1357688 to address it.
Flags: needinfo?(gsvelto)
You need to log in before you can comment on or make changes to this bug.