Closed Bug 1310703 Opened 8 years ago Closed 7 years ago

Create pingsender executable to handle crash ping transmission

Categories

(Firefox :: General, defect)

defect
Not set
normal

Tracking

()

VERIFIED FIXED
Firefox 54
Tracking Status
firefox54 --- fixed

People

(Reporter: ddurst, Assigned: gsvelto)

References

(Depends on 1 open bug, Blocks 2 open bugs)

Details

(Whiteboard: [fce-active-legacy])

Attachments

(5 files, 6 obsolete files)

59 bytes, text/x-review-board-request
ted
: review+
Details
59 bytes, text/x-review-board-request
ted
: review+
Dexter
: review+
Details
59 bytes, text/x-review-board-request
Dexter
: review+
Details
59 bytes, text/x-review-board-request
Dexter
: review+
Details
163.30 KB, application/json
Details
Once we are submitting crash pings for browser and content crashes alike, we want to implement a separate process to submit those pings that exists apart from Gecko.

This should allow crash pings to be send as soon as possible, rather than requiring a restart. The existing process for sending these will still exist and be implemented, in case something goes awry, and we will rely on its deduplication.
Depends on: 1293253
This is a clunky prototype that uses libcurl under Linux and macOS to implement the ping sender plus some logic in gecko and the crashreporter client to populate the ping more or less correctly. It's still largely incomplete and there's no Windows support yet.
Comment on attachment 8817389 [details] [diff] [review]
[PATCH WIP] Prototype implementation

Review of attachment 8817389 [details] [diff] [review]:
-----------------------------------------------------------------

Nice to see a prototype that quickly!

::: toolkit/crashreporter/nsExceptionHandler.cpp
@@ +992,5 @@
> +      if (currentClientId) {
> +        WriteAnnotation(eventFile, "TelemetryClientId", currentClientId);
> +      }
> +      if (currentTelemetryServer) {
> +        WriteAnnotation(eventFile, "TelemetryServerURL", currentTelemetryServer);

I'm not sure that we should store the server URL in the event file.
If we update the Telemetry server URL in Firefox, any "old" event files would still point to the old servers.
(In reply to Georg Fritzsche [:gfritzsche] from comment #3)
> I'm not sure that we should store the server URL in the event file.
> If we update the Telemetry server URL in Firefox, any "old" event files
> would still point to the old servers.

We have the same problem for the crash submission URL (ServerURL annotation) which we store in the extra file. This will only be used for sending the crash ping when the crashreporter is invoked so it's unlikely to pick up an older file.

As for the ping sender I'll just make the URL one of the parameters passed by the caller so that when calling it from Firefox one can always use the up-to-date URL. This is still very much a prototype and it's hardcoded right now :)
This patch contains the ping assembly part in the crashreporter as well as a couple of fixes here and there. The sender is still not functional, the crash id is still not read back by Firefox, the client is still not calling the server, there's no Windows support and it's pretty ugly. But the ping contents are complete :)
Attachment #8817389 - Attachment is obsolete: true
Status: NEW → ASSIGNED
This is the first version of this patch that works end-to-end (on Linux and Mac): after being invoked, the crashreporter client will extract the stack trace from the minidump, then assemble a full crash ping and use the ping sender to send it to the specified address.

The code is still very much disorganized so don't pay attention too much attention to it (it lacks Windows support too), so the feedback request is about the overall architecture.

Some notes:

- The ping sender is just a dump program that takes a URL and a JSON-formatted ping and sends the latter to the former using libcurl (on Linux and Mac) and soon using WinINet on Windows. Being stand-along it can be called by Gecko too during shutdown, etc...

- The ping assembled in the crashreporter is 100% identical to a regular crash ping. I'm using jsoncpp to assemble it since we already have it in the tree, it makes things a hell of a lot simpler.

- There's no handling of proxies for now; this is a deliberate choice since I want to fix the proxy issue in the crashreporter client too by using a common framework. This will happen at a later time.

- There's practically no error reporting and error handling policy is silently ignoring them. The reason for this is that we really don't care. If the ping gets out - and in most cases there's no reason why it shouldn't - then it gets out and we're all happy. If it hits a snag somewhere in-between we're not going to do heroics to get it out, re-try the transmission, etc... Since it will provide statistical data about crashes we can afford losing some, especially if it greatly simplifies complexity.
Attachment #8820686 - Attachment is obsolete: true
Attachment #8821182 - Flags: feedback?(ted)
Attachment #8821182 - Flags: feedback?(gfritzsche)
Comment on attachment 8821182 [details] [diff] [review]
[PATCH WIP v3] Prototype implementation

Review of attachment 8821182 [details] [diff] [review]:
-----------------------------------------------------------------

About to go on PTO, moving the feedback to Alessio.
Attachment #8821182 - Flags: feedback?(gfritzsche) → feedback?(alessio.placitelli)
Comment on attachment 8821182 [details] [diff] [review]
[PATCH WIP v3] Prototype implementation

Review of attachment 8821182 [details] [diff] [review]:
-----------------------------------------------------------------

The overall architecture makes sense to me, good job with the prototype. Can't wait to see the version with Windows support.

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +198,5 @@
>  /**
>   * Annotate the current session ID with the crash reporter to map potential
>   * crash pings with the related main ping.
> + *
> + * @param sessionId {String} The telemetry session ID

nit: these should be inverted, e.g. "@param {String} sessionId"
Attachment #8821182 - Flags: feedback?(alessio.placitelli) → feedback+
Updated patch with Windows support (though only partially working), some cleanups and some refactoring. What's missing now is a little bit more testing on Windows and Mac to ensure it's reliable and a way to communicate the crash ping UUID back to gecko for ping deduplication then it should be ready for review.
Attachment #8821182 - Attachment is obsolete: true
Attachment #8821182 - Flags: feedback?(ted)
I've done some further Windows testing and figured out why it worked only intermittently: CreateProcess() accepts commands up to 32767 characters long, but no more. When sending regular pings I didn't hit this limitation but when sending crash pings with stack traces I blew past it big time. This is extremely annoying since I need a Windows-specific, ad-hoc solution to passing the ping payload to the sender.
Any reason not to pipe the crash ping to the sender's stdin?

Also: I seriously wonder if this shouldn't be written in Rust. It's a standalone program that's not using any Gecko internals but is doing nontrivial work.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #12)
> Any reason not to pipe the crash ping to the sender's stdin?

Yeah, I think I'll do that.

> Also: I seriously wonder if this shouldn't be written in Rust. It's a
> standalone program that's not using any Gecko internals but is doing
> nontrivial work.

I started writing it in Rust but I didn't have a good option to access HTTP functionality from there (which meant accessing libcurl on Linux/Mac and WININET functions on Windows). There's crates with binding for both available but I didn't want to pull yet more code into gecko. I'm not sufficiently familiar with Rust to write a minimal binding for both and there's no native Rust functionality for dlopen()ing (though again there's a couple of crates for that).

So my idea was to start writing this in C++ so that we'd get the functionality right away (and the data too!) and then switch to Rust after all the other bits have fallen into place.
hyper has worked pretty well for Rust HTTP for me in the past: https://github.com/hyperium/hyper

The only fiddly bit is that by default it wants to use OpenSSL for SSL support, which is a pain on non-Linux. There is a rust-native-tls crate that supports using platform-native Crypto APIs for SSL, which supposedly works with hyper, but I haven't tried it: https://github.com/sfackler/rust-native-tls

rustup has code that can (optionally) use hyper+native-tls:
https://github.com/rust-lang-nursery/rustup.rs/blob/27575af269673c956b0bb32cb7b9a9300a861950/src/download/src/lib.rs#L221
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #14)
> hyper has worked pretty well for Rust HTTP for me in the past:
> https://github.com/hyperium/hyper
> 
> The only fiddly bit is that by default it wants to use OpenSSL for SSL
> support, which is a pain on non-Linux. There is a rust-native-tls crate that
> supports using platform-native Crypto APIs for SSL, which supposedly works
> with hyper, but I haven't tried it:
> https://github.com/sfackler/rust-native-tls
> 
> rustup has code that can (optionally) use hyper+native-tls:
> https://github.com/rust-lang-nursery/rustup.rs/blob/
> 27575af269673c956b0bb32cb7b9a9300a861950/src/download/src/lib.rs#L221

I've evaluated Hyper when writing a Rust-based prototype and it does everything I need. However it's a very large project and I didn't feel it would be right to pull it into our dependencies just to be able to do an HTTP POST. Since the ping sender needs minimal, best effort functionality I'd rather stick with using libcurl where it's available (which covers pretty much all Unixes with a handful of lines of code) and WININET on Windows.

I'll definitely end up rewriting it in Rust but I'd really like to have it working ASAP because it will allow us to get telemetry pings (including crashes w/ stacks) both during startup and shutdown which is an area where we currently have very little visibility.
I hear you I just have built up a huge bias against adding new C++ code. :)

I don't know what the full dependency stack looks like but there is also tokio-curl + tokio-tls nowadays (built on tokio-core which uses Rust futures):
https://github.com/tokio-rs/tokio-curl
https://github.com/tokio-rs/tokio-tls

I imagine we'll wind up pulling in a dependency on tokio for Quantum stuff in the near future anyway.
Depends on: 1328657
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #16)
> I hear you I just have built up a huge bias against adding new C++ code. :)

I understand; this was supposed to be Rust but then the schedules didn't allow it. Anyway, it will become Rust soon enough.

> I don't know what the full dependency stack looks like but there is also
> tokio-curl + tokio-tls nowadays (built on tokio-core which uses Rust
> futures):
> https://github.com/tokio-rs/tokio-curl
> https://github.com/tokio-rs/tokio-tls
> 
> I imagine we'll wind up pulling in a dependency on tokio for Quantum stuff
> in the near future anyway.

That'd be great. If I already had something like that in mozilla-central writing this in Rust would have been a no-brainer.
This is the complete patch, it still needs some testing though so I won't be asking for review straight away.
Attachment #8823118 - Attachment is obsolete: true
Comment on attachment 8824466 [details] [diff] [review]
[PATCH] Add the crash ping sender and the machinery to use it

Consider this almost a review request in the sense that the patch is pretty much finalized but I've hit a snag on Windows with the logic I've added to ensure ping deduplication works. Everything works fine on Linux and Mac though. Here's a detailed description of everything the patch does:

- Introduce a pingsender executable that is executes an HTTP POST with the data it's provided on stdin to a URL specified as a parameter. This uses curl on Linux and Mac, and WININET on Windows. It's used to send crash pings from the crashreporter client and at a later by the telemetry code. Note that I haven't yet added the machinery to use this from the telemetry code; I'll add it in a follow-up.

- I've modified the telemetry code to annotate the crash report with all the information needed to assemble the crash ping. The telemetry client ID and server URL are added to the .extra file for each crash.

- I've modified the crashreporter client to assemble a full crash ping and send it to the appropriate URL via the pingsender. Once the ping is sent the crashreporter client will add its UUID to the crash event file. This allows the CrashManager to pick up this UUID and use it when sending its own copy of the crash ping. This is required to ensure that the telemetry server will properly deduplicate the ping.

- The telemetry code now has an option to override the ping's generated UUID with one provided externally. This is used as described above.
Attachment #8824466 - Flags: feedback?(ted)
Attachment #8824466 - Flags: feedback?(alessio.placitelli)
Comment on attachment 8824466 [details] [diff] [review]
[PATCH] Add the crash ping sender and the machinery to use it

Georg is back, moving the feedback back to him.
Attachment #8824466 - Flags: feedback?(alessio.placitelli) → feedback?(gfritzsche)
OK, this is it. I've fixed the last remaining issues and rewrote some of the code to be more readable, now everything seems functional:

- Upon a crash this reliably sends a fully-formed crash ping directly from the crashreporter client

- The UUID of the ping is reported back to Gecko and used for the crash ping sent from Gecko itself, which should allow de-dup'ing the pings on the server-side

- I've tested it using a local server on Linux, Windows and Mac and it seems to be working fine everywhere. I haven't tested it yet against our infrastructure because I didn't want to pollute it with my tries
Attachment #8824466 - Attachment is obsolete: true
Attachment #8824466 - Flags: feedback?(ted)
Attachment #8824466 - Flags: feedback?(gfritzsche)
Attachment #8825414 - Flags: review?(ted)
Attachment #8825414 - Flags: review?(gfritzsche)
Gabriele, did you already think about how to add test coverage for this in the automation (if that's possible at all)?
Flags: needinfo?(gsvelto)
(In reply to Alessio Placitelli [:Dexter] from comment #22)
> Gabriele, did you already think about how to add test coverage for this in
> the automation (if that's possible at all)?

I didn't because it's not currently possible for the crashreporter client path. The issue stems from the fact that we don't have a way of testing the crashreporter client and in my patch the ping sender is only called from there. There's a few bugs open already for this and we'll need to add coverage but it's a fairly complex issue.
Flags: needinfo?(gsvelto)
(In reply to Gabriele Svelto [:gsvelto] from comment #23)
> I didn't because it's not currently possible for the crashreporter client
> path. The issue stems from the fact that we don't have a way of testing the
> crashreporter client and in my patch the ping sender is only called from
> there.

Maybe an easy start would be to test the ping sender specifically from our Telemetry xpcshell test setup?
I.e. write a simple xpcshell test that triggers the ping sender with a test ping, pointing it to our test server:
https://dxr.mozilla.org/mozilla-central/rev/2963cf6be7f830c0d2155e2968cfc53585868a76/toolkit/components/telemetry/tests/unit/head.js#38
... then wait for a ping to be received and check the contents etc.
(In reply to Georg Fritzsche [:gfritzsche] from comment #24)
> Maybe an easy start would be to test the ping sender specifically from our
> Telemetry xpcshell test setup?
> I.e. write a simple xpcshell test that triggers the ping sender with a test
> ping, pointing it to our test server:
> https://dxr.mozilla.org/mozilla-central/rev/
> 2963cf6be7f830c0d2155e2968cfc53585868a76/toolkit/components/telemetry/tests/
> unit/head.js#38
> ... then wait for a ping to be received and check the contents etc.

Yeah that should be possible, though I'm not exposing the pingsender to the Telemetry code just yet. I was planning on doing that in a separate bug so it might be appropriate to do that right away so that we have some testing available.
Comment on attachment 8825414 [details] [diff] [review]
[PATCH v2] Add the crash ping sender and the machinery to use it

Review of attachment 8825414 [details] [diff] [review]:
-----------------------------------------------------------------

Nice, this looks pretty good overall!

While i didn't review everything in detail, i have some comments on this.
After clearing those up i think Alessio should take a detailed pass on the Telemetry changes, ping generation and submission logic (as i'm out on PTO to Jan 24).

There is a question on whether we need to rebuild the crash ping in Firefox JS code. We could potentially just submit the same ping if you store it the right way, see the comment in crashreporter.cpp.

We should document the details of the data flow and ping handling.
I think this probably goes into:
- crashreporter/docs for the crash ping generation details
- crash-ping.rst for describing where pings are triggered from (this is to enable analysists etc. to understand the semantics)
Ideally this lands together with the code here, maybe in a separate patch?

How do we assure that the ping sender and crash reporter are working properly, is there a plan?
QA & automation coverage? (We have a QA engineer now that is the contact for Telemetry QA, Justin Williams.)
Could we tell if the crash reporter submissions from our populations suddenly drop?
Can we validate the crash ping numbers from crash reports?

::: toolkit/components/crashes/CrashManager.jsm
@@ +54,5 @@
>    let value = null;
>  
>    if (field in obj) {
>      if (!parseAsJson) {
> +      value = obj[field].trim();

What is the trim() needed for?
Does the new code push data that is not trimmed already?

@@ +638,5 @@
>      let sessionId = parseAndRemoveField(reportMeta, "TelemetrySessionId",
>                                          /* parseAsJson */ false);
>      let stackTraces = parseAndRemoveField(reportMeta, "StackTraces");
> +    let pingId = parseAndRemoveField(reportMeta, "CrashPingUUID",
> +                                     /* parseAsJson */ false);

What about old pings that don't have this value yet?
Is there test coverage for both cases?

::: toolkit/components/telemetry/TelemetryController.jsm
@@ +197,5 @@
>     * @param {Boolean} [aOptions.addEnvironment=false] true if the ping should contain the
>     *                  environment data.
>     * @param {Object}  [aOptions.overrideEnvironment=null] set to override the environment data.
> +   * @param {String} [aOptions.overridePingId] set to override the generate
> +   *                 ping id.

[aOptions.overridePingId=null]
Nit: "generated"
Ditto for the changes below.

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +1479,5 @@
>      // the very same value for |_sessionStartDate|.
>      this._sessionStartDate = this._subsessionStartDate;
>  
> +    // Annotate crash reports using the cached client ID which can be accessed
> +    // synchronously, we'll update it later

Nit: "... synchronously. If it is not available yet, we'll update it later."

@@ +1536,5 @@
>          }
>  
>          Telemetry.asyncFetchTelemetryData(function() {});
>  
> +        // Update the crash annotation with the proper client ID

Nit: Add trailing ".".

::: toolkit/crashreporter/client/crashreporter.cpp
@@ +643,5 @@
> +    // Assemble and send the crash ping
> +    string pingUuid;
> +
> +    if (SendCrashPing(queryParameters, pingUuid)) {
> +      AppendToEventFile("CrashPingUUID", pingUuid);

Do we need to let the JS code regenerate the ping?
We could:
- assemble crash ping
- save ping to pending pings directory in profile
- on success, append crash ping uuid to event file
- try to submit ping
- if submit successful, remove ping from pending directory

That way, if anything fails, Telemetry would just pick the ping up in the next Firefox run & submit it.

::: toolkit/crashreporter/client/ping.cpp
@@ +30,5 @@
> +    uint16_t m2;
> +    uint8_t  m3[8];
> +};
> +
> +// Generates an UUID; the code here is mostly copied from nsUUIDGenerator.cpp

Is it possible to share this code?

@@ +93,5 @@
> +  return str;
> +}
> +
> +// ISO 8601 date format: YYYY-MM-DD
> +#define ISO8601_DATE "%F"

Do these need to be defines?
Can't they just be `const std::string`, `const char[]` or so?
Can we list the constants in one section at the topic instead of strewn in between the functions?

@@ +110,5 @@
> +}
> +
> +#define TELEMETRY_CLIENT_ID    "TelemetryClientId"
> +#define TELEMETRY_SERVER_URL   "TelemetryServerURL"
> +#define TELEMETRY_SESSION_ID   "TelemetrySessionId"

Ditto, do these need to be defines?

@@ +111,5 @@
> +
> +#define TELEMETRY_CLIENT_ID    "TelemetryClientId"
> +#define TELEMETRY_SERVER_URL   "TelemetryServerURL"
> +#define TELEMETRY_SESSION_ID   "TelemetrySessionId"
> +#define TELEMETRY_VERSION      (4)

const unsigned kTelemetryVersion = ...

@@ +118,5 @@
> +// from the .extra file
> +static Json::Value
> +CreateMetadataNode(StringTable& strings)
> +{
> +  // The following list should be kept in sync with the one in CrashManager.jsm

How can we ensure that this is not overlooked?
Is there test coverage?

@@ +161,5 @@
> +}
> +
> +// Create the payload node of the crash ping
> +static Json::Value
> +CreatePayloadNoad(StringTable& strings, const string& aSessionId)

Typo? CreatePayloadNode()?

@@ +256,5 @@
> +
> +  return root;
> +}
> +
> +// Generates the URL used to submit the crash ping, see TelemetrySend.jsm

Also see: https://wiki.mozilla.org/CloudServices/DataPipeline/HTTPEdgeServerSpecification

@@ +302,5 @@
> +    return false;
> +  }
> +
> +  Json::Value root = CreateRootNode(strings, uuid, clientId, sessionId,
> +                                    name, version, channel, buildId);

We definitely will want test coverage for this generating the correct format and sanely formed value.

::: toolkit/crashreporter/pingsender/pingsender.cpp
@@ +29,5 @@
> +namespace CrashReporter {
> +
> +const char* kUserAgent = "pingsender/1.0";
> +
> +#if defined(XP_UNIX) || defined(XP_MACOSX)

Could we break this into pingsender_win.cpp and pingsender_unix.cpp?
ifdefs that long are hard to discover, especially when reviewing future changes.

@@ +31,5 @@
> +const char* kUserAgent = "pingsender/1.0";
> +
> +#if defined(XP_UNIX) || defined(XP_MACOSX)
> +
> +// Machinery used to send the crash ping to

... to?

@@ +80,5 @@
> +bool
> +CurlWrapper::Init()
> +{
> +  // libcurl might show up under different names, try them all until we find it
> +  const char* libcurlNames[] = {

Is this the best way to do it?
Maybe flag someone more knowledgeable on this to check?

@@ +301,5 @@
> +    url = argv[1];
> +  }
> +
> +  if (url.empty()) {
> +    exit(EXIT_FAILURE);

Printing something descriptive would be helpful for anyone testing the behavior of the binary separately.
Ditto for the other exit(failure) paths.

@@ +305,5 @@
> +    exit(EXIT_FAILURE);
> +  }
> +
> +  string ping(ReadPingFromStdin());
> +  Post(url, ping);

With these two command line arguments, we can already add test coverage for the ping sender.
E.g. write a simple xpcshell test (or more suitable test harness?) that:
- builds a ping
- invoke the pingsender, pointing URL to a local PingServer [0]
- wait for ping submission
- validate submission url, ping structure & contents

0: https://dxr.mozilla.org/mozilla-central/rev/b1c31c4a0a678194931779e0f13fba7b508eb109/toolkit/components/telemetry/tests/unit/head.js#40
Attachment #8825414 - Flags: review?(gfritzsche) → feedback+
(In reply to Georg Fritzsche [:gfritzsche] [away Jan 14 - 24] from comment #26)
> Comment on attachment 8825414 [details] [diff] [review]
> [PATCH v2] Add the crash ping sender and the machinery to use it
> 
> Review of attachment 8825414 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Nice, this looks pretty good overall!

Thanks for the feedback Georg, here's a few notes:

> While i didn't review everything in detail, i have some comments on this.
> After clearing those up i think Alessio should take a detailed pass on the
> Telemetry changes, ping generation and submission logic (as i'm out on PTO
> to Jan 24).

Alright, I'll redirect to him.

> There is a question on whether we need to rebuild the crash ping in Firefox
> JS code. We could potentially just submit the same ping if you store it the
> right way, see the comment in crashreporter.cpp.

Good point. I hadn't thought about that. I guess it depends on the complexity of storing the ping from the crashreporter code. The crashreporter code isn't pretty so adding more complexity to it isn't great, and we've been discussing rewriting it at some point (in Rust no less) together with the rest of this machinery.

> We should document the details of the data flow and ping handling.
> I think this probably goes into:
> - crashreporter/docs for the crash ping generation details
> - crash-ping.rst for describing where pings are triggered from (this is to
> enable analysists etc. to understand the semantics)
> Ideally this lands together with the code here, maybe in a separate patch?

Good point. Since I'm at it I'll split up the patch into smaller bits and push it via mozreview. It's high time I'd do that.

> How do we assure that the ping sender and crash reporter are working
> properly, is there a plan?

Yes, we need actual integration tests for it. The issue we're facing right now is that there's not tests covering the crashreporter client AFAIK so we have to write those firsts.

> QA & automation coverage? (We have a QA engineer now that is the contact for
> Telemetry QA, Justin Williams.)

That's a start, Abe Masresha has been testing the inclusion of stack traces in the pings and I suppose he'll be testing this too.

> Could we tell if the crash reporter submissions from our populations
> suddenly drop?
> Can we validate the crash ping numbers from crash reports?

We should be able to do that though we still might be missing some pieces, bug 1322611 for example.

> What is the trim() needed for?
> Does the new code push data that is not trimmed already?

I hadn't noticed before but when reading back the .extra file on Windows hosts we're left with an invisible CR character at the end of the strings (since we're splitting at LF). It was irrelevant before but now it caused issues when the pingId was used in an URL. Note that if we'd store the crashreporter's ping like you suggested then this code wouldn't be needed.

> What about old pings that don't have this value yet?

They use the generated UUID.

> Is there test coverage for both cases?

Nope, and I should add it.

> ::: toolkit/crashreporter/client/crashreporter.cpp
> @@ +643,5 @@
> > +    // Assemble and send the crash ping
> > +    string pingUuid;
> > +
> > +    if (SendCrashPing(queryParameters, pingUuid)) {
> > +      AppendToEventFile("CrashPingUUID", pingUuid);
> 
> Do we need to let the JS code regenerate the ping?
> We could:
> - assemble crash ping
> - save ping to pending pings directory in profile
> - on success, append crash ping uuid to event file
> - try to submit ping
> - if submit successful, remove ping from pending directory
> 
> That way, if anything fails, Telemetry would just pick the ping up in the
> next Firefox run & submit it.

Yeah, we could do it that way and it would be useful until we have proper proxy support for the ping sender. I don't know how complex it'd be but I'll evaluate it (we might do it in a follow-up).

> ::: toolkit/crashreporter/client/ping.cpp
> @@ +30,5 @@
> > +    uint16_t m2;
> > +    uint8_t  m3[8];
> > +};
> > +
> > +// Generates an UUID; the code here is mostly copied from nsUUIDGenerator.cpp
> 
> Is it possible to share this code?

Yeah, I hope to add this to MFBT so that we can use it both from gecko and outside of it but I'd do it in a follow.

> @@ +93,5 @@
> > +  return str;
> > +}
> > +
> > +// ISO 8601 date format: YYYY-MM-DD
> > +#define ISO8601_DATE "%F"
> 
> Do these need to be defines?
> Can't they just be `const std::string`, `const char[]` or so?
> Can we list the constants in one section at the topic instead of strewn in
> between the functions?

Yes, good idea.

> @@ +118,5 @@
> > +// from the .extra file
> > +static Json::Value
> > +CreateMetadataNode(StringTable& strings)
> > +{
> > +  // The following list should be kept in sync with the one in CrashManager.jsm
> 
> How can we ensure that this is not overlooked?
> Is there test coverage?

We have test coverage for the crash manager but not for the crashreporter client. Since this list needs to be used in multiple places (and both from C++ and JS) I was planning on making it generated from a file so that we can also automatically generate documentation for all the fields (now one needs to manually update crash-ping.rst every time we do a change).

> Could we break this into pingsender_win.cpp and pingsender_unix.cpp?
> ifdefs that long are hard to discover, especially when reviewing future
> changes.

Yes, I'll do that.

> @@ +80,5 @@
> > +bool
> > +CurlWrapper::Init()
> > +{
> > +  // libcurl might show up under different names, try them all until we find it
> > +  const char* libcurlNames[] = {
> 
> Is this the best way to do it?
> Maybe flag someone more knowledgeable on this to check?

AFAIK it is. We don't want to bundle libcurl with the pingsender just like we don't bundle it with the crashreporter client so the best approach is to try and find a version we can use. I've discussed it directly with Daniel Stenberg which is one of our network guys (and the author of libcurl) before starting work on the bug.

> Printing something descriptive would be helpful for anyone testing the
> behavior of the binary separately.
> Ditto for the other exit(failure) paths.

Yeah, just like with the minidump-analyzer I've left error handling to a minimum but it might be better to make it a little clearer.

> With these two command line arguments, we can already add test coverage for
> the ping sender.
> E.g. write a simple xpcshell test (or more suitable test harness?) that:
> - builds a ping
> - invoke the pingsender, pointing URL to a local PingServer [0]
> - wait for ping submission
> - validate submission url, ping structure & contents
> 
> 0:
> https://dxr.mozilla.org/mozilla-central/rev/
> b1c31c4a0a678194931779e0f13fba7b508eb109/toolkit/components/telemetry/tests/
> unit/head.js#40

I'll try to do that. What my plan was - and I think I'll do it in this bug in the end - is to add functionality so that the telemetry code can easily call the pingsender from JS. I'd use this for testing too.
> > With these two command line arguments, we can already add test coverage for
> > the ping sender.
> > E.g. write a simple xpcshell test (or more suitable test harness?) that:
> > - builds a ping
> > - invoke the pingsender, pointing URL to a local PingServer [0]
> > - wait for ping submission
> > - validate submission url, ping structure & contents
> > 
> > 0:
> > https://dxr.mozilla.org/mozilla-central/rev/
> > b1c31c4a0a678194931779e0f13fba7b508eb109/toolkit/components/telemetry/tests/
> > unit/head.js#40
> 
> I'll try to do that. What my plan was - and I think I'll do it in this bug
> in the end - is to add functionality so that the telemetry code can easily
> call the pingsender from JS. I'd use this for testing too.

I can write the test Georg suggested if that would help!
(In reply to Alessio Placitelli [:Dexter] from comment #28)
> I can write the test Georg suggested if that would help!

Thanks, I'll ping you if I need help with that. I'm already adding support for calling the sender from Telemetry so I'll be adding the test too since otherwise I can't tell if it works :)
Quick update: I've prepared all the necessary functionality to invoke the pingsender from JS code by adding it to the main Telemetry object and I've written an xpcshell test using PingServer to verify everything's working fine... however the test isn't working and I still haven't figured out why because everything else seems to be working fine. I'll upload a WIP ASAP.
Updated the bug name to reflect the fact that we've generalized this for all kind of pings. I've got the tests working but I still need to ensure that the Telemetry part works correctly on Windows and Mac.
Summary: Create CrashSender executable to handle crash ping transmission → Create pingsender executable to handle crash ping transmission
Blocks: 1333128
After a lot of wrangling with the code I've got a patch-set ready (now split in four). It's been manually tested on Linux, Windows and Mac and it has some new tests included but more are needed for the crashreporter client. I'll add them in a separate bug since there's no tests at all covering it (for now).
Attachment #8825414 - Attachment is obsolete: true
Attachment #8825414 - Flags: review?(ted)
Comment on attachment 8830722 [details]
Bug 1310703 - Use the ID generated by the crashreporter client when sending a crash ping so that server-side deduplication works correctly;

https://reviewboard.mozilla.org/r/107460/#review108874

This looks good with the comments below addressed.

::: toolkit/components/crashes/CrashManager.jsm:58
(Diff revision 1)
>  function parseAndRemoveField(obj, field, parseAsJson = true) {
>    let value = null;
>  
>    if (field in obj) {
>      if (!parseAsJson) {
> -      value = obj[field];
> +      value = obj[field].trim();

nit: can you add a comment about why this is needed? Something along the lines of "Needed to remove the invisible CR character at the end of the strings (since we're splitting at LF), on Windows"

::: toolkit/components/crashes/CrashManager.jsm:641
(Diff revision 1)
>      let crashEnvironment = parseAndRemoveField(reportMeta,
>                                                 "TelemetryEnvironment");
>      let sessionId = parseAndRemoveField(reportMeta, "TelemetrySessionId",
>                                          /* parseAsJson */ false);
>      let stackTraces = parseAndRemoveField(reportMeta, "StackTraces");
> +    let pingId = parseAndRemoveField(reportMeta, "CrashPingUUID",

nit: can you add something like "If CrashPingUUID is not available, Telemetry will generate a new UUID".

::: toolkit/components/telemetry/tests/unit/test_TelemetryController.js:94
(Diff revision 1)
>    // Check the clientId and environment fields, as needed.
>    Assert.equal("clientId" in aPing, aHasClientId);
>    Assert.equal("environment" in aPing, aHasEnvironment);
>  }
>  
> +function generateUUID() {

We already have this function [here](http://searchfox.org/mozilla-central/rev/7da3c9dcf467964f2fb82f3a4c63972ee79bf696/toolkit/components/telemetry/tests/unit/test_TelemetrySession.js#77). You could move it to head.js and use it from both files to avoid code duplication.
Attachment #8830722 - Flags: review?(alessio.placitelli) → review+
Comment on attachment 8830723 [details]
Bug 1310703 - Add support for the pingsender to the Telemetry code;

https://reviewboard.mozilla.org/r/107462/#review108880

Thanks for this patch! I'm wondering, would it be helpful to split this into two parts: one adding the ability to spawn the pingsender and the rest with the additions to TelemetryController/TelemetrySend.

We could then move the test that you added to a separate file, manually spawn the sender (passing some ping data) and verifying that it reaches the PingServer.

We could leave dealing with the rest of the Telemetry integration as a follow-up bug (since this is now only being used by the test).

::: toolkit/components/telemetry/Telemetry.cpp:2677
(Diff revision 1)
> +  xreAppDistDir->GetPath(aPath);
> +  return NS_OK;
> +}
> +
> +NS_IMETHODIMP
> +TelemetryImpl::RunPingSender(const nsACString& aUrl, const nsACString& aPing)

Just asking... :-) can this be done in a JSM? Or we're lacking XPCOM APIs for some functionality?

::: toolkit/components/telemetry/Telemetry.cpp:2696
(Diff revision 1)
> +  if (PR_CreatePipe(&pipeRead, &pipeWrite) != PR_SUCCESS) {
> +    return NS_ERROR_FAILURE;
> +  }
> +
> +  if (PR_SetFDInheritable(pipeRead, PR_TRUE) != PR_SUCCESS) {
> +    PR_Close(pipeRead);

You could probably get rid of all this PR\_Close calls (also below) by using a custom deletion policy with UniquePtr (see [here](http://searchfox.org/mozilla-central/rev/8fa84ca6444e2e01fb405e078f6d2c8da0e55723/mfbt/UniquePtr.h#115))

::: toolkit/components/telemetry/Telemetry.cpp:2718
(Diff revision 1)
> +
> +  // Connect the pingsender standard input to the pipe and launch it
> +  PR_ProcessAttrSetStdioRedirect(attr, PR_StandardInput, pipeRead);
> +
> +  char* args[] = {
> +    ToNewCString(path),

If you do soemthing like |UniquePtr<char[]> arg0(ToNewCString(path));| and then use it in args through |.get()|, you won't need to worry about the free

::: toolkit/components/telemetry/TelemetryController.jsm:488
(Diff revision 1)
>      // following operations to keep this function synchronous for the majority of the calls.
>      let archivePromise = TelemetryArchive.promiseArchivePing(pingData)
>        .catch(e => this._log.error("submitExternalPing - Failed to archive ping " + pingData.id, e));
>      let p = [ archivePromise ];
>  
> -    p.push(TelemetrySend.submitPing(pingData));
> +    let options = aOptions.usePingSender ? { usePingSender: true } : {};

What about |const options = { usePingSender: !!aOptions.usePingSender };|?

::: toolkit/components/telemetry/TelemetrySend.jsm:210
(Diff revision 1)
> +   * @param {Boolean} [options.usePingSender=false] set to use the pingsender
> +   *                  program for sending the ping.
>     * @return {Promise} Test-only - a promise that is resolved when the ping is sent or saved.
>     */
> -  submitPing(ping) {
> -    return TelemetrySendImpl.submitPing(ping);
> +  submitPing(ping, options = {}) {
> +    return TelemetrySendImpl.submitPing(ping, options);

nit: let's do |options.usePingSender = options.usePingSender || false;| to sanitize the input parameter as we do in TelemetryController.jsm.

::: toolkit/components/telemetry/TelemetrySend.jsm:712
(Diff revision 1)
>        this._log.trace("submitPing - can't send ping now, persisting to disk - " +
>                        "canSendNow: " + this.canSendNow);
>        return savePing(ping);
>      }
>  
> +    if (options.usePingSender) {

Unfortunately, as implemented, you wouldn't be able to send pings on shutdown.

canSendNow is false when shutting down, so pings get persisted to the disk rather then sent.

::: toolkit/components/telemetry/TelemetrySend.jsm:719
(Diff revision 1)
> +      this._log.trace("submitPing - can send pings, using the pingsender program");
> +
> +      const url = this._server + this._getSubmissionPath(ping) + "?v=" +
> +                  (isV4PingFormat(ping) ? PING_FORMAT_VERSION : 1);
> +
> +      Telemetry.runPingSender(url, JSON.stringify(ping));

The implementation is returning NS_ERROR\_\*, which AFAIK means the JS API call can throw.

This would only happen in case of errors when transferring the ping data to the pingsender. I think we should handle the exception here and make sure we don't loose the ping data.

::: toolkit/components/telemetry/nsITelemetry.idl:529
(Diff revision 1)
> +   * as soon as the contents of the ping have been handed over to the ping
> +   * sender.
> +   *
> +   * @param aUrl The telemetry server URL
> +   * @param aPing A string holding the ping contents
> +   */

Can you document that the function throws NS_ERROR_FAILURE and why?

::: toolkit/components/telemetry/tests/unit/test_TelemetrySend.js:207
(Diff revision 1)
>  });
>  
>  add_task(function* test_sendDateHeader() {
>    fakeNow(new Date(Date.UTC(2011, 1, 1, 11, 0, 0)));
>    yield TelemetrySend.reset();
> +  fakePingId("g", 0);

Why is this needed here?

::: toolkit/components/telemetry/tests/unit/test_TelemetrySend.js:243
(Diff revision 1)
> +  const payload = { foo: "bar" };
> +  let pingId = yield TelemetryController.submitExternalPing(
> +    "test-ping-sender", payload, { usePingSender: true }
> +  );
> +  let req = yield PingServer.promiseNextRequest();
> +  let ping = decodeRequestPayload(req);

You could merge this and the previous line to: const ping = yield PingServer.promiseNextPing();

::: toolkit/components/telemetry/tests/unit/test_TelemetrySend.js:245
(Diff revision 1)
> +    "test-ping-sender", payload, { usePingSender: true }
> +  );
> +  let req = yield PingServer.promiseNextRequest();
> +  let ping = decodeRequestPayload(req);
> +
> +  Assert.equal(ping.id, pingId, "Should have received the correct ping id.");

Can you also assert that ping.type == "test-ping-sender"?
Thanks for the review, I'm addressing your comments and will upload the new patches soon, here's a few notes:

(In reply to Alessio Placitelli [:Dexter] from comment #38)
> Thanks for this patch! I'm wondering, would it be helpful to split this into
> two parts: one adding the ability to spawn the pingsender and the rest with
> the additions to TelemetryController/TelemetrySend.

As you prefer; I wanted to implement them together to ensure that they'd work correctly.

> We could then move the test that you added to a separate file, manually
> spawn the sender (passing some ping data) and verifying that it reaches the
> PingServer.

I've tried it but the code in head.js requires some setup from the Telemetry code (e.g. it calls TelemetrySend.shutdown unconditionally) so putting together a new test would require duplicating most of what is already in test_TelemetrySend.js.

> Just asking... :-) can this be done in a JSM? Or we're lacking XPCOM APIs
> for some functionality?

The closes we have is nsIFile.launch() which doesn't have the right semantics (e.g. you can't redirect output and it's blocking IIRC).

> You could probably get rid of all this PR\_Close calls (also below) by using
> a custom deletion policy with UniquePtr (see
> [here](http://searchfox.org/mozilla-central/rev/
> 8fa84ca6444e2e01fb405e078f6d2c8da0e55723/mfbt/UniquePtr.h#115))

Actually we had a facility in MFBT to easily declare RAII auto-close PRFiles; I couldn't find it before but now I've updated the patch to use it so I got rid of all the explicit closes.

> What about |const options = { usePingSender: !!aOptions.usePingSender };|?

Sure, I have no particular preference.

> Unfortunately, as implemented, you wouldn't be able to send pings on
> shutdown.
> 
> canSendNow is false when shutting down, so pings get persisted to the disk
> rather then sent.

Good point, I'll fix it and cover this with a test. That's one of the reasons why I wanted to implement both things together.

> The implementation is returning NS_ERROR\_\*, which AFAIK means the JS API
> call can throw.
> 
> This would only happen in case of errors when transferring the ping data to
> the pingsender. I think we should handle the exception here and make sure we
> don't loose the ping data.

Right, I will.

> Why is this needed here?

Without this the test would print out a bunch of warnings which I found annoying because the storage logic would find a duplicate ping. I think this makes the test output cleaner (and the test saner since it's not hitting an unrelated corner-case anymore).
As per discussion on IRC I'll split up the last patch removing the telemetry-specific JS code and leaving only C++ changes plus tests for a standalone pingsender execution.
Comment on attachment 8830723 [details]
Bug 1310703 - Add support for the pingsender to the Telemetry code;

https://reviewboard.mozilla.org/r/107462/#review108880

I've just uploded a new patch with the TelemetryController/TelemetrySend integration removed and only a stand-alone test for running the pingsender using Telemetry.runPingSender().
Comment on attachment 8830723 [details]
Bug 1310703 - Add support for the pingsender to the Telemetry code;

https://reviewboard.mozilla.org/r/107462/#review109558

::: toolkit/components/telemetry/Telemetry.cpp:89
(Diff revision 2)
>  #include "nsPrintfCString.h"
>  #endif // MOZ_GECKO_PROFILER
>  
> +namespace mozilla {
> +  // Scoped auto-close for PRFileDesc file descriptors
> +MOZ_TYPE_SPECIFIC_SCOPED_POINTER_TEMPLATE(ScopedPRFileDesc,

nit: indentation

::: toolkit/components/telemetry/Telemetry.cpp:2724
(Diff revision 2)
> +  char* args[] = {
> +    arg0.get(),
> +    arg1.get(),
> +    nullptr,
> +  };
> +  Unused << NS_WARN_IF(PR_CreateProcessDetached(args[0], args, nullptr, attr));

Is there a reason why we're ignoring the return value here? AFAIK NS_WARN_IF is also evaluated on Release: does it make sense to continue on the lines  below instead of simply bailing out if this failed?

::: toolkit/components/telemetry/Telemetry.cpp:2735
(Diff revision 2)
> +
> +  while (length > 0) {
> +    int result = PR_Write(pipeWrite, s, length);
> +
> +    if (result <= 0) {
> +      break;

Instead of breaking, should we return a failure code here?

::: toolkit/components/telemetry/TelemetrySend.jsm:540
(Diff revision 2)
>  var TelemetrySendImpl = {
>    _sendingEnabled: false,
>    // Tracks the shutdown state.
>    _shutdown: false,
> +  // True if setup has run, cleared upon shutdown
> +  _setup: false,

Why is this needed? Ditto the other uses below.

::: toolkit/components/telemetry/tests/unit/test_PingSender.js:34
(Diff revision 2)
> +
> +  const url = "http://localhost:" + PingServer.port + "/submit/telemetry/";
> +  const data = {
> +    type: "test-pingsender-type",
> +    id: generateUUID(),
> +    creationDate: (new Date(1485810000)).toISOString(),

nit: why are you specifying 1485810000?
Comment on attachment 8830722 [details]
Bug 1310703 - Use the ID generated by the crashreporter client when sending a crash ping so that server-side deduplication works correctly;

https://reviewboard.mozilla.org/r/107460/#review109590

::: toolkit/components/telemetry/tests/unit/head.js:304
(Diff revision 2)
> +
>    });
>  }
>  
> +// Generate a UUID, used for the ping ID
> +function generateUUID() {

We have `TelemetryUtils.generateUUID()`, lets just use that?
Comment on attachment 8830723 [details]
Bug 1310703 - Add support for the pingsender to the Telemetry code;

https://reviewboard.mozilla.org/r/107462/#review109620

I almost forgot (thanks Georg!) about the docs: would you mind also adding some in-tree documentation about this? toolkit/components/telemetry/docs/internals might be a good location

::: toolkit/components/telemetry/Telemetry.cpp:2699
(Diff revision 2)
> +
> +  // Create a pipe to send the ping contents to the ping sender
> +  ScopedPRFileDesc pipeRead;
> +  ScopedPRFileDesc pipeWrite;
> +
> +  if (PR_CreatePipe(&pipeRead.rwget(), &pipeWrite.rwget()) != PR_SUCCESS) {

Gabriele, I'm never used the PR\_\* functions. Do you know anyone who could review their usage?
Comment on attachment 8830723 [details]
Bug 1310703 - Add support for the pingsender to the Telemetry code;

https://reviewboard.mozilla.org/r/107462/#review109620

I just noticed Georg already talked about the docs. Quoting him:

"We should document the details of the data flow and ping handling.
I think this probably goes into:
- crashreporter/docs for the crash ping generation details
- crash-ping.rst for describing where pings are triggered from (this is to enable analysists etc. to understand the semantics)
Ideally this lands together with the code here, maybe in a separate patch?"
Comment on attachment 8830723 [details]
Bug 1310703 - Add support for the pingsender to the Telemetry code;

https://reviewboard.mozilla.org/r/107462/#review109558

> Is there a reason why we're ignoring the return value here? AFAIK NS_WARN_IF is also evaluated on Release: does it make sense to continue on the lines  below instead of simply bailing out if this failed?

PR_CreateProcessDetached() is an asynchronous function, it does not create the process itself so it can return a successful value even though the process failed to start. IMHO it's better to just ignore the return value, everything below that will have no effect anyway if the process didn't start or if the function failed internally for any other reason.

> Instead of breaking, should we return a failure code here?

Sure, that's a good indiciation that the ping hasn't been sent correctly.

> Why is this needed? Ditto the other uses below.

Because head.js calls TelemetrySend.shutdown() unconditionally:

https://dxr.mozilla.org/mozilla-central/rev/1fe66bd0efba89df59d2046e8c91418eb5ae10b8/toolkit/components/telemetry/tests/unit/head.js#319

so if TelemetrySend hadn't been setup before (such as in the test I added) that causes a failure.

> nit: why are you specifying 1485810000?

It's just a fixed date, nothing special. It could have been Date.now() or anything else.
Comment on attachment 8830723 [details]
Bug 1310703 - Add support for the pingsender to the Telemetry code;

https://reviewboard.mozilla.org/r/107462/#review109558

> Because head.js calls TelemetrySend.shutdown() unconditionally:
> 
> https://dxr.mozilla.org/mozilla-central/rev/1fe66bd0efba89df59d2046e8c91418eb5ae10b8/toolkit/components/telemetry/tests/unit/head.js#319
> 
> so if TelemetrySend hadn't been setup before (such as in the test I added) that causes a failure.

Do you have the failure output? I wonder why this is failing, since we have a [similar](http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/tests/unit/test_TelemetryHistograms.js) test that doesn't init Telemetry but still works.
Comment on attachment 8830723 [details]
Bug 1310703 - Add support for the pingsender to the Telemetry code;

https://reviewboard.mozilla.org/r/107462/#review109558

> Do you have the failure output? I wonder why this is failing, since we have a [similar](http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/tests/unit/test_TelemetryHistograms.js) test that doesn't init Telemetry but still works.

This is from the test you pointed out, I think it's not causing the test to fail because all the tests are explicitly catching exceptions while my test did not.

0:01.10 LOG: Thread-1 INFO "CONSOLE_MESSAGE: (error) [JavaScript Error: "1485879598785	Toolkit.Telemetry	ERROR	TelemetrySend::shutdown - failed to remove observer for idle-daily: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIObserverService.removeObserver]"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame :: resource://gre/modules/TelemetrySend.jsm :: TelemetrySendImpl.shutdown< :: line 644"  data: no] Stack trace: TelemetrySendImpl.shutdown<()@resource://gre/modules/TelemetrySend.jsm:644 < TaskImpl_run()@resource://gre/modules/Task.jsm:319 < TaskImpl()@resource://gre/modules/Task.jsm:277 < asyncFunction()@resource://gre/modules/Task.jsm:252 < shutdown()@resource://gre/modules/TelemetrySend.jsm:197 < /home/gsvelto/projects/build/firefox/_tests/xpcshell/toolkit/components/telemetry/tests/unit/head.js:328 < _execute_test()@/home/gsvelto/projects/mozilla-central/testing/xpcshell/head.js:604 < -e:1" {file: "resource://gre/modules/Log.jsm" line: 748}]
App_append@resource://gre/modules/Log.jsm:748:9
log@resource://gre/modules/Log.jsm:386:7
getLoggerWithMessagePrefix/proxy.log@resource://gre/modules/Log.jsm:501:44
error@resource://gre/modules/Log.jsm:394:5
TelemetrySendImpl.shutdown<@resource://gre/modules/TelemetrySend.jsm:646:9
TaskImpl_run@resource://gre/modules/Task.jsm:319:42
TaskImpl@resource://gre/modules/Task.jsm:277:3
asyncFunction@resource://gre/modules/Task.jsm:252:14
shutdown@resource://gre/modules/TelemetrySend.jsm:197:12
@/home/gsvelto/projects/build/firefox/_tests/xpcshell/toolkit/components/telemetry/tests/unit/head.js:328:29
_execute_test@/home/gsvelto/projects/mozilla-central/testing/xpcshell/head.js:604:16
@-e:1:1
"
Comment on attachment 8830723 [details]
Bug 1310703 - Add support for the pingsender to the Telemetry code;

https://reviewboard.mozilla.org/r/107462/#review109620

Good point, I thought of doing it and then forgot.

> Gabriele, I'm never used the PR\_\* functions. Do you know anyone who could review their usage?

Maybe :chutten can you help you with that? They're just portable wrappers for common I/O and process functionality. We've got some decent documentation available: https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSPR/Reference
BTW I'm closing the issues on mozreview as I fix them locally, then I'll upload the updated patch. Let me know if you prefer the reverse (I upload the patch first then mark the issues fixed).
(In reply to Gabriele Svelto [:gsvelto] from comment #55)
> Comment on attachment 8830720 [details]
> Bug 1310703 - Introduce the pingsender executable;
> 
> Review request updated; see interdiff:
> https://reviewboard.mozilla.org/r/107456/diff/2-3/

Ugh, for some reason mozreview picked up the contents of another patch that's already in central; I'll see what I can do to fix it.
(In reply to Gabriele Svelto [:gsvelto] from comment #59)
> Ugh, for some reason mozreview picked up the contents of another patch
> that's already in central; I'll see what I can do to fix it.

Actually not, they appear only in the interdiff, if I look at the full diff only my changes are present. This is weird, maybe a bug in mozreview? Or is it because I've rebased my patches before pushing? Either way the new patches have most of Alessio's comments addressed plus the extra documentation; I've added it in the second patch as they're related to the changes in that patch.
Comment on attachment 8830723 [details]
Bug 1310703 - Add support for the pingsender to the Telemetry code;

https://reviewboard.mozilla.org/r/107462/#review109558

> This is from the test you pointed out, I think it's not causing the test to fail because all the tests are explicitly catching exceptions while my test did not.
> 
> 0:01.10 LOG: Thread-1 INFO "CONSOLE_MESSAGE: (error) [JavaScript Error: "1485879598785	Toolkit.Telemetry	ERROR	TelemetrySend::shutdown - failed to remove observer for idle-daily: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIObserverService.removeObserver]"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame :: resource://gre/modules/TelemetrySend.jsm :: TelemetrySendImpl.shutdown< :: line 644"  data: no] Stack trace: TelemetrySendImpl.shutdown<()@resource://gre/modules/TelemetrySend.jsm:644 < TaskImpl_run()@resource://gre/modules/Task.jsm:319 < TaskImpl()@resource://gre/modules/Task.jsm:277 < asyncFunction()@resource://gre/modules/Task.jsm:252 < shutdown()@resource://gre/modules/TelemetrySend.jsm:197 < /home/gsvelto/projects/build/firefox/_tests/xpcshell/toolkit/components/telemetry/tests/unit/head.js:328 < _execute_test()@/home/gsvelto/projects/mozilla-central/testing/xpcshell/head.js:604 < -e:1" {file: "resource://gre/modules/Log.jsm" line: 748}]
> App_append@resource://gre/modules/Log.jsm:748:9
> log@resource://gre/modules/Log.jsm:386:7
> getLoggerWithMessagePrefix/proxy.log@resource://gre/modules/Log.jsm:501:44
> error@resource://gre/modules/Log.jsm:394:5
> TelemetrySendImpl.shutdown<@resource://gre/modules/TelemetrySend.jsm:646:9
> TaskImpl_run@resource://gre/modules/Task.jsm:319:42
> TaskImpl@resource://gre/modules/Task.jsm:277:3
> asyncFunction@resource://gre/modules/Task.jsm:252:14
> shutdown@resource://gre/modules/TelemetrySend.jsm:197:12
> @/home/gsvelto/projects/build/firefox/_tests/xpcshell/toolkit/components/telemetry/tests/unit/head.js:328:29
> _execute_test@/home/gsvelto/projects/mozilla-central/testing/xpcshell/head.js:604:16
> @-e:1:1
> "

That failure is caught [here](http://searchfox.org/mozilla-central/rev/4e0c5c460318fb9ef7d92b129ac095ce04bc4795/toolkit/components/telemetry/TelemetrySend.jsm#643), as we don't expect observers to be registered all the times for all the tests.

This should be true for your test as well! If it doesn't, we should probably figure out why it isn't :(
Comment on attachment 8830723 [details]
Bug 1310703 - Add support for the pingsender to the Telemetry code;

https://reviewboard.mozilla.org/r/107462/#review109620

> Maybe :chutten can you help you with that? They're just portable wrappers for common I/O and process functionality. We've got some decent documentation available: https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSPR/Reference

:chutten summoned me here to take a look at this function, and it looks good to me.
I do want to note that my only qualification is that I have used the NSPR process library before (we use it in the update driver code).
Comment on attachment 8830723 [details]
Bug 1310703 - Add support for the pingsender to the Telemetry code;

https://reviewboard.mozilla.org/r/107462/#review109558

> That failure is caught [here](http://searchfox.org/mozilla-central/rev/4e0c5c460318fb9ef7d92b129ac095ce04bc4795/toolkit/components/telemetry/TelemetrySend.jsm#643), as we don't expect observers to be registered all the times for all the tests.
> 
> This should be true for your test as well! If it doesn't, we should probably figure out why it isn't :(

You're right, I just checked and my test doesn't fail, but the error annoyed me. The only reason why removing the observer throws is that TelemetrySend.setup() has not been called so I went ahead and fixed that. If you don't deem that necessary I'll take it out of the patch.
Comment on attachment 8830723 [details]
Bug 1310703 - Add support for the pingsender to the Telemetry code;

https://reviewboard.mozilla.org/r/107462/#review109558

> You're right, I just checked and my test doesn't fail, but the error annoyed me. The only reason why removing the observer throws is that TelemetrySend.setup() has not been called so I went ahead and fixed that. If you don't deem that necessary I'll take it out of the patch.

Yes, please remove that. If you want, kindly file a bug in Toolkit::Telemetry about it: a patch just landed to reduce the log noise, this might be a good follow up.
Comment on attachment 8830723 [details]
Bug 1310703 - Add support for the pingsender to the Telemetry code;

https://reviewboard.mozilla.org/r/107462/#review109558

> Yes, please remove that. If you want, kindly file a bug in Toolkit::Telemetry about it: a patch just landed to reduce the log noise, this might be a good follow up.

OK, updated patch coming. I'll rebase to make sure I pick up the recent changes.
Yet another set of patches, I've done some small changes to the pingsender so that it correctly picks up a 32-bit libcurl library when it's running as a 32-bit executable on a 64-bit multi-lib host. The try run is here:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=9f34f4d994973784fdd4aabdfc5ce55aac836198
Comment on attachment 8830723 [details]
Bug 1310703 - Add support for the pingsender to the Telemetry code;

https://reviewboard.mozilla.org/r/107462/#review109890

::: toolkit/components/telemetry/tests/unit/xpcshell.ini:66
(Diff revision 4)
>  [test_TelemetryScalars.js]
>  [test_TelemetryTimestamps.js]
>  skip-if = toolkit == 'android'
>  [test_TelemetryCaptureStack.js]
>  [test_TelemetryEvents.js]
> +[test_PingSender.js]

Let's explicitly disable android, add |skip-if = toolkit == 'android'| just under this test.
As we don't need to support the pingsender on Android, can we explicitly disable building it there etc.?
(In reply to Georg Fritzsche [:gfritzsche] from comment #72)
> As we don't need to support the pingsender on Android, can we explicitly
> disable building it there etc.?

OK, I'll file a bug to add support at a later stage.
Blocks: 1335917
I've disabled the telemetry test on both Fennec - because we don't support it (yet?) - and on Linux 32-bit because the VMs we run tests on do not include a 32-bit version of libcurl which is required for the pingsender to function.
Comment on attachment 8830723 [details]
Bug 1310703 - Add support for the pingsender to the Telemetry code;

https://reviewboard.mozilla.org/r/107462/#review110284

::: toolkit/components/telemetry/tests/unit/xpcshell.ini:67
(Diff revision 5)
>  [test_TelemetryTimestamps.js]
>  skip-if = toolkit == 'android'
>  [test_TelemetryCaptureStack.js]
>  [test_TelemetryEvents.js]
> +[test_PingSender.js]
> +skip-if = (os == "android") || (os == "linux" && bits == 32)

Could you explain/document why we're skipping on linux 32 bits?
Comment on attachment 8830723 [details]
Bug 1310703 - Add support for the pingsender to the Telemetry code;

https://reviewboard.mozilla.org/r/107462/#review110284

> Could you explain/document why we're skipping on linux 32 bits?

libcurl is not available on our 32-bit test runners, see my comment on Bugzilla.
Comment on attachment 8830723 [details]
Bug 1310703 - Add support for the pingsender to the Telemetry code;

https://reviewboard.mozilla.org/r/107462/#review110292

This looks good now, thanks! Please let's not land this without the documentation mentioned in the previous comments.
Attachment #8830723 - Flags: review?(alessio.placitelli) → review+
(In reply to Gabriele Svelto [:gsvelto] from comment #80)
> Comment on attachment 8830723 [details]
> Bug 1310703 - Add support for the pingsender to the Telemetry code;
> 
> https://reviewboard.mozilla.org/r/107462/#review110284
> 
> > Could you explain/document why we're skipping on linux 32 bits?
> 
> libcurl is not available on our 32-bit test runners, see my comment on
> Bugzilla.

From a conversation in #releng, it looks like we could file a bug in Taskcluster::Task Configuration if we want to add the missing library. The VM configuration is stored in-tree as a docker file [1]. However, it seems we do 32 bit builds on 64 bit images, so this might complicate things a bit.

Here's some docs [2] about how we generate the images.

[1] - https://dxr.mozilla.org/mozilla-central/source/taskcluster/docker/desktop-build
[2] - http://gecko.readthedocs.io/en/latest/taskcluster/taskcluster/docker-images.html
(In reply to Alessio Placitelli [:Dexter] from comment #81)
> This looks good now, thanks! Please let's not land this without the
> documentation mentioned in the previous comments.

The docs are here:

https://reviewboard.mozilla.org/r/107458/diff/2-3/

(In reply to Alessio Placitelli [:Dexter] from comment #82)
> From a conversation in #releng, it looks like we could file a bug in
> Taskcluster::Task Configuration if we want to add the missing library. The
> VM configuration is stored in-tree as a docker file [1]. However, it seems
> we do 32 bit builds on 64 bit images, so this might complicate things a bit.
> 
> Here's some docs [2] about how we generate the images.
> 
> [1] -
> https://dxr.mozilla.org/mozilla-central/source/taskcluster/docker/desktop-
> build
> [2] -
> http://gecko.readthedocs.io/en/latest/taskcluster/taskcluster/docker-images.
> html

Good, a 64-bit host is fine as long as we have 32-bit libcurl available.
While doing try runs I've hit an annoying snag: the asan build includes the --disable-crashreporter option which also disables the pingsender. Now, the pingsender lives "accidentally" within the crashreporter code so it's disabled in that build. I'm wondering if I should just move it somewhere else since it has no real dependency with the crashreporter code, only the reverse is true since the crashreporter client uses it.
Georg, because of the issue I described in comment 85 I think it's better if I let the pingsender sources live under toolkit/components/telemetry. It makes more sense too, would you be OK with it?
Flags: needinfo?(gfritzsche)
(In reply to Gabriele Svelto [:gsvelto] from comment #86)
> Georg, because of the issue I described in comment 85 I think it's better if
> I let the pingsender sources live under toolkit/components/telemetry. It
> makes more sense too, would you be OK with it?

Yes, that makes more sense. I'm good with that as long as it lives in a subdirectory (.../pingsender/ ?).

Let's add docs describing it in "toolkit/components/telemetry/docs/internals/". I'd be fine with seeing that as a follow-up.
Flags: needinfo?(gfritzsche)
(In reply to Georg Fritzsche [:gfritzsche] from comment #87)
> Yes, that makes more sense. I'm good with that as long as it lives in a
> subdirectory (.../pingsender/ ?).

Yes, I've got a patch almost ready with the sources under toolkit/components/telemetry/pingsender

> Let's add docs describing it in
> "toolkit/components/telemetry/docs/internals/". I'd be fine with seeing that
> as a follow-up.

I think I'll do it right away, I'm still waiting for the try runs and my local builds to finish so that I can test it anyway :)
Updated patch-set with the pingsender moved under toolkit/components/telemetry/pingsender and some additional documentation. I've redirected the review of the first patch from Ted to Alessio because of the path change, I hope not to have wasted anyone's time because of this.
The try run is here and it's looking green (fingers crossed):

https://treeherder.mozilla.org/#/jobs?repo=try&revision=7105583ce2fb8d9ee99507d12f711c79977ea44d
Comment on attachment 8830720 [details]
Bug 1310703 - Introduce the pingsender executable;

https://reviewboard.mozilla.org/r/107456/#review111590

Ted agreed to take on this review as well, as he has substantially more experience with the build system & tools.

::: toolkit/components/telemetry/docs/internals/pingsender.rst:4
(Diff revision 6)
> +Ping Sender
> +===========
> +
> +The ping sender is a minimalistic program whose sole purpose is to deliver a

Please mention that it's not enabled on Android (and link to the bug number as well).
Attachment #8830720 - Flags: review?(ted)
Comment on attachment 8830721 [details]
Bug 1310703 - Make the crashreporter client send a crash ping via the ping sender executable;

https://reviewboard.mozilla.org/r/107458/#review111238

I'd feel better if someone who knew our telemetry setup reviewed the `SendCrashPing` code for schema correctness etc. Otherwise this patch looks fine. Can we file a bug on getting this all rewritten in Rust, though? I understand there are some issues that made it not straightforward at the moment, but I'd feel a lot better if we could spell them out, get them sorted, and not have to keep adding more C++ code.

::: toolkit/components/telemetry/TelemetrySession.jsm:219
(Diff revision 6)
>      const cr = Cc["@mozilla.org/toolkit/crash-reporter;1"];
>      if (cr) {
> -      cr.getService(Ci.nsICrashReporter).setTelemetrySessionId(sessionId);
> +      const crs = cr.getService(Ci.nsICrashReporter);
> +
> +      crs.setTelemetrySessionId(sessionId);
> +      crs.annotateCrashReport("TelemetrySessionId", sessionId);

Any reason this couldn't just happen inside `setTelemetrySessionId`?

::: toolkit/crashreporter/client/crashreporter_unix_common.cpp:131
(Diff revision 6)
> +      close(fd[0]);
> +      close(fd[1]);
> +    }
> +
> +    if (wait) {
> -    waitpid(pid, nullptr, 0);
> +      waitpid(pid, nullptr, 0);

Should we be doing anything with the exit status here?

::: toolkit/crashreporter/client/crashreporter_win.cpp:1581
(Diff revision 6)
>    STARTUPINFO si = {};
> +  si.cb = sizeof(si);
> +
> +  if (usePipes) {
> +    si.dwFlags |= STARTF_USESTDHANDLES;
> +    si.hStdInput = childStdinPipeRead;

I think you need to set hStdOutput and hStdError here when you set `STARTF_USESTDHANDLES`, don't you? You can probably just call `GetStdHandle` for the values.
Attachment #8830721 - Flags: review?(ted) → review+
Comment on attachment 8830720 [details]
Bug 1310703 - Introduce the pingsender executable;

https://reviewboard.mozilla.org/r/107456/#review112524

Sadness about C++ aside this looks fine.

::: browser/installer/package-manifest.in:786
(Diff revision 6)
>  #ifdef MOZ_CRASHREPORTER_INJECTOR
>  @BINPATH@/breakpadinjector.dll
>  #endif
>  #endif
>  
> +; [ Ping Sender ]

You don't really need the comment bits here, they're historical and not super useful.

::: toolkit/components/telemetry/pingsender/pingsender.h:20
(Diff revision 6)
> +
> +// System-specific function to make an HTTP POST operation
> +bool Post(const std::string& url, const std::string& payload);
> +
> +// System-specific function to read the ping payload from stdin
> +std::string ReadPingFromStdin();

You could just use libc for this instead of having separate implementations, couldn't you? `fread(STDIN...)` ought to work everywhere. It's not much code, but minimizing the platform-specific code would be great.

::: toolkit/components/telemetry/pingsender/pingsender_unix_common.cpp:84
(Diff revision 6)
> +    "/usr/lib32/libcurl.so",
> +    "/usr/lib32/libcurl.so.4",
> +    "/usr/lib32/libcurl-gnutls.so.4",
> +    "/usr/lib32/libcurl.so.3",
> +#endif
> +    // macOS

We have code in the crashreporter to POST using native OS X frameworks. Is that more of a hassle to do here?
Attachment #8830720 - Flags: review?(ted) → review+
Comment on attachment 8830721 [details]
Bug 1310703 - Make the crashreporter client send a crash ping via the ping sender executable;

https://reviewboard.mozilla.org/r/107458/#review111238

> Any reason this couldn't just happen inside `setTelemetrySessionId`?

None if I pay attention that it gets written to the .extra file. I'll fix it.

> I think you need to set hStdOutput and hStdError here when you set `STARTF_USESTDHANDLES`, don't you? You can probably just call `GetStdHandle` for the values.

The MSDN documentation isn't very clear on that so I'll go ahead and add them; better be safe than sorry.
Blocks: 1339035
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #96)
> I'd feel better if someone who knew our telemetry setup reviewed the
> `SendCrashPing` code for schema correctness etc.

I think Alessio already had a look but better err on the side of caution. BTW I've checked both the telemetry URLs and ping contents side-by-side between those sent by the crashreporter and those sent by Firefox to ensure they're identical and they seem to be.

> Otherwise this patch looks
> fine. Can we file a bug on getting this all rewritten in Rust, though? I
> understand there are some issues that made it not straightforward at the
> moment, but I'd feel a lot better if we could spell them out, get them
> sorted, and not have to keep adding more C++ code.

Here you go: bug 1339035
Flags: needinfo?(alessio.placitelli)
Comment on attachment 8830720 [details]
Bug 1310703 - Introduce the pingsender executable;

https://reviewboard.mozilla.org/r/107456/#review112524

> You could just use libc for this instead of having separate implementations, couldn't you? `fread(STDIN...)` ought to work everywhere. It's not much code, but minimizing the platform-specific code would be great.

You're right, there's no need for platform-specific code here, I just went along with the flow since I was forced to do some stuff in a platform-specific way without the help of our infrastructure.

> We have code in the crashreporter to POST using native OS X frameworks. Is that more of a hassle to do here?

Yes, it's significantly more convoluted and plus it's Objective-C. libcurl ships with macOS by default now so it's easier to use that on both Linux and Mac.
I've updated the patches with comments addressed and I threw in a couple more tests to ensure that if telemetry is disabled the TelemetryServerURL is not added to the .extra file (and it is when telemetry is enabled). A little extra coverage can't hurt.

The try run is here:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=39143516587e790917c3d11545a6ddd32398276b
(In reply to Gabriele Svelto [:gsvelto] from comment #99)
> (In reply to Ted Mielczarek [:ted.mielczarek] from comment #96)
> > I'd feel better if someone who knew our telemetry setup reviewed the
> > `SendCrashPing` code for schema correctness etc.
> 
> I think Alessio already had a look but better err on the side of caution.
> BTW I've checked both the telemetry URLs and ping contents side-by-side
> between those sent by the crashreporter and those sent by Firefox to ensure
> they're identical and they seem to be.

I didn't :(

Gabriele, would you mind attaching a sample ping generated by that chunk of code? You can redact the clientID/PingID, etc.
Flags: needinfo?(alessio.placitelli) → needinfo?(gsvelto)
Attached file Sample ping
Here you go: when doing tests I've compared it side-by-side with pings generated by Firefox and it ends up identical save for the ping creation time.
Flags: needinfo?(gsvelto)
Comment on attachment 8830721 [details]
Bug 1310703 - Make the crashreporter client send a crash ping via the ping sender executable;

https://reviewboard.mozilla.org/r/107458/#review113630

::: toolkit/components/telemetry/docs/data/crash-ping.rst:27
(Diff revision 7)
>  Structure:
>  
>  .. code-block:: js
>  
>      {
>        version: 1,

This is an error, as the version field for the crash payload is within the "payload" section, [see here]( http://searchfox.org/mozilla-central/rev/d3307f19d5dac31d7d36fc206b00b686de82eee4/toolkit/components/crashes/CrashManager.jsm#647). 

I know that you didn't add this error, but would you mind moving this version inside payload: ?
Attachment #8830721 - Flags: review+
(In reply to Gabriele Svelto [:gsvelto] from comment #107)
> Created attachment 8837061 [details]
> Sample ping
> 
> Here you go: when doing tests I've compared it side-by-side with pings
> generated by Firefox and it ends up identical save for the ping creation
> time.

It looks good to me, thanks! See comment 108 for a little nit. Let's land this! \o/
(In reply to Alessio Placitelli [:Dexter] from comment #108)
> Comment on attachment 8830721 [details]
> Bug 1310703 - Make the crashreporter client send a crash ping via the ping
> sender executable;
> 
> https://reviewboard.mozilla.org/r/107458/#review113630
> 
> ::: toolkit/components/telemetry/docs/data/crash-ping.rst:27
> (Diff revision 7)
> >  Structure:
> >  
> >  .. code-block:: js
> >  
> >      {
> >        version: 1,
> 
> This is an error, as the version field for the crash payload is within the
> "payload" section, [see here](
> http://searchfox.org/mozilla-central/rev/
> d3307f19d5dac31d7d36fc206b00b686de82eee4/toolkit/components/crashes/
> CrashManager.jsm#647). 
> 
> I know that you didn't add this error, but would you mind moving this
> version inside payload: ?

To be more precise, let's move [1] where the second highlighted line is at [1].

[1] - http://searchfox.org/mozilla-central/rev/d3307f19d5dac31d7d36fc206b00b686de82eee4/toolkit/components/telemetry/docs/data/crash-ping.rst#16,23
(In reply to Alessio Placitelli [:Dexter] from comment #110)
> To be more precise, let's move [1] where the second highlighted line is at
> [1].
> 
> [1] -
> http://searchfox.org/mozilla-central/rev/
> d3307f19d5dac31d7d36fc206b00b686de82eee4/toolkit/components/telemetry/docs/
> data/crash-ping.rst#16,23

Thanks, I'll file a bug to fix the crash manager too.
Scratch my previous comment, the ping is fine with the |version| field stored in the payload, it's just the documentation that was written improperly, I'll fix that and land.
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ca283ec01ae1
Introduce the pingsender executable; r=ted
https://hg.mozilla.org/integration/mozilla-inbound/rev/ad13217f5e81
Make the crashreporter client send a crash ping via the ping sender executable; r=ted
https://hg.mozilla.org/integration/mozilla-inbound/rev/6e93c3fe76f8
Use the ID generated by the crashreporter client when sending a crash ping so that server-side deduplication works correctly; r=Dexter
https://hg.mozilla.org/integration/mozilla-inbound/rev/30262fea813b
Add support for the pingsender to the Telemetry code; r=Dexter
Blocks: 1339428
Depends on: 1339635
Depends on: 1339712
Blocks: 1341282
Blocks: 1341349
Blocks: 1340631
Depends on: 1343504
QA verified as fixed.
Status: RESOLVED → VERIFIED
Depends on: 1345153
Depends on: 1333125
Depends on: 1368455
I have a developer friend who let me know that pingsender scared him and few of his co-workers since it triggers firewall alerts of programs doing funky stuff.
I don't know if normal users encounter this, however, what are the option to mitigate this?
(In reply to Armen [:armenzg] from comment #116)
> I have a developer friend who let me know that pingsender scared him and few
> of his co-workers since it triggers firewall alerts of programs doing funky
> stuff.
> I don't know if normal users encounter this, however, what are the option to
> mitigate this?

Normal users haven't encountered this problem, without knowing more about the type of firewall involved and how it was configured it's difficult to say how we could help. pingsender can be turned off though, it's useful but not required to gather telemetry. Also it would be best if we opened a separate bug for this specific issue.
I've filed bug 1410249.
Depends on: 1410249
Whiteboard: [fce-active] → [fce-active-legacy]
Regressions: 1609649
You need to log in before you can comment on or make changes to this bug.