Closed Bug 1330833 Opened 7 years ago Closed 7 years ago

Add the new "modules" (DLLs) ping type

Categories

(Toolkit :: Telemetry, defect, P3)

defect

Tracking

()

RESOLVED FIXED
mozilla54
Tracking Status
firefox54 --- fixed

People

(Reporter: marco, Assigned: marco)

References

Details

(Whiteboard: [measurement:client])

Attachments

(4 files, 4 obsolete files)

We are lacking data on how many users have a given module. It can be useful when we have to make an informed decision on whether to block or not a module with our blocklist.

If it is possible to add this information, I think we should at least have the module name and version.
I'm not sure that we need this in the environment data to make decisions (the environment data is included with different ping types too).
This could probably just be in the main ping if you need data on it, e.g. in keyed scalars [0]?
Maybe data collection peers [1] have a more informed opinion on this though.

Are you planning to work on this or have an engineer that can look into it?

0: https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/collection/scalars.html
1: https://wiki.mozilla.org/Firefox/Data_Collection
Flags: needinfo?(mcastelluccio)
(In reply to Georg Fritzsche [:gfritzsche] [away Jan 14 - 24] from comment #1)
> I'm not sure that we need this in the environment data to make decisions
> (the environment data is included with different ping types too).
> This could probably just be in the main ping if you need data on it, e.g. in
> keyed scalars [0]?
> Maybe data collection peers [1] have a more informed opinion on this though.

I thought adding it to the environment data was cheaper, as the list of modules might be long (e.g. take a look at https://crash-stats.mozilla.com/report/index/76545abd-fb4a-4bb4-9450-741342170113#tab-modules) but it doesn't really change often.

> Are you planning to work on this or have an engineer that can look into it?

I can look into it, but I'll probably need some guidance.
Flags: needinfo?(mcastelluccio)
Benjamin, do you have some feedback on this from your initiatives?
From that and the data collection perspective, would we want this in the environment or is the main data payload sufficient?

(In reply to Marco Castelluccio [:marco] from comment #2)
> > Are you planning to work on this or have an engineer that can look into it?
> 
> I can look into it, but I'll probably need some guidance.

Great, Alessio and me can get you started with this.
Flags: needinfo?(benjamin)
Priority: -- → P3
Whiteboard: [measurement:client]
I don't think this should be in the environment for size/cost reasons. I rather think I already filed this bug, but I can't find it right now; in any case I would strongly recommend doing a completely separate ping for this which is sent much less often: e.g. once per week.
Priority: P3 → --
Whiteboard: [measurement:client]
Argh, midair. Although I don't know that you setting P3 on this if Marco's working on it makes much sense.
Flags: needinfo?(benjamin)
(In reply to Benjamin Smedberg [:bsmedberg] from comment #4)
> I rather think I already filed this bug, but I can't find it right now; in any
> case I would strongly recommend doing a completely separate ping for this
> which is sent much less often: e.g. once per week.

I think there are different trade-offs to this.
If we expect to usually analyze this data on its own, against a few dimensions that are submitted in a separate ping, then thats fine.
If we expect to correlate with multiple datapoints in the main ping regularly and the size of this is limited, maybe it is better to submit it in the main ping (maybe only on shutdown or so).

For scheduling a new ping e.g. once per week, i think having the scheduling logic work fine across Firefox sessions has some complexities.
For experiments i think we solved that by using the update timer:
https://dxr.mozilla.org/mozilla-central/rev/b1c31c4a0a678194931779e0f13fba7b508eb109/browser/experiments/Experiments.manifest#3
I'm not sure if that's the best solution though.
Attached patch WIP Patch (obsolete) — Splinter Review
This is a WIP patch, which introduces the "modules" ping.

The format of the ping is quite simple. It doesn't contain the environment, it contains the client ID, and its payload is:

    {
      version: 1,
      modules: [{
        name: <string>, // Name of the module file (e.g. xul.dll)
        version: <string>, // Version of the module (only available on Windows, it's an empty string on other platforms)
        debug_id: <string>, // ID of the debug information file (compatible with the ID on Socorro)
      },
      ...
      ],
    } 

The ping is sent once a week.

I still have to write some documentation. Is a file under toolkit/components/telemetry/docs/data/ enough?
Assignee: nobody → mcastelluccio
Status: NEW → ASSIGNED
Attachment #8828767 - Flags: feedback?(alessio.placitelli)
Yes, that's a fine place to put the docs. I would recommend adding the environment, though, since it will let us correlate graphics config, detailed OS versions etc.
Also would recommend adding debug_name (PDB name)
Sorry for the fragmented comments: if there is no version information (on non-Windows), we probably shouldn't include that in the JSON at all. On Windows when there is no DLL version, it would be good to make that null instead of the empty string. I can't recall whether there are DLLs with no debug ID, but if so that should be null if missing also, instead of empty. That will make it easier to analyze on the backend in presto.
Comment on attachment 8828767 [details] [diff] [review]
WIP Patch

Review of attachment 8828767 [details] [diff] [review]:
-----------------------------------------------------------------

I had a first look at the Telemetry bits and they look ok!

Yes, as Benjamin said, that's the right location for the ping documentation. Make sure to link it where appropriate (see http://searchfox.org/mozilla-central/search?q=crash-ping&case=false&regexp=false&path= for an example).

::: toolkit/components/telemetry/Telemetry.cpp
@@ +73,5 @@
>  #include "mozilla/IOInterposer.h"
>  #include "mozilla/PoisonIOInterposer.h"
>  #include "mozilla/StartupTimeline.h"
>  #include "mozilla/HangMonitor.h"
> +#include "shared-libraries.h"

Is this safe to be moved out of the MOZ_ENABLE_PROFILER_SPS? Does this break compilation if we have an empty .mozconfig?

::: toolkit/components/telemetry/TelemetryModules.jsm
@@ +21,5 @@
> +];
> +
> +this.TelemetryModules = function() {
> +  // The default is 1 week.
> +  let interval = Preferences.get("telemetry.modules_ping.interval", 604800);

nit:s/let/const ?

nit#2: Also, could you move 604800 in a separate "const MODULES_PING_INTERVAL_SECONDS = 604800;" ? Or, if you want to keep it here, please at least mention that it's in seconds.

@@ +22,5 @@
> +
> +this.TelemetryModules = function() {
> +  // The default is 1 week.
> +  let interval = Preferences.get("telemetry.modules_ping.interval", 604800);
> +  gUpdateTimerManager.registerTimer("telemetry_modules_ping", this, interval);

Can this fail/throw?

@@ +24,5 @@
> +  // The default is 1 week.
> +  let interval = Preferences.get("telemetry.modules_ping.interval", 604800);
> +  gUpdateTimerManager.registerTimer("telemetry_modules_ping", this, interval);
> +
> +  this._pingPromise = null;

Is this used only for testing? If so, please mark it as such and add a comment to explain why.

@@ +32,5 @@
> +  _sendPing() {
> +    return TelemetryController.submitExternalPing("modules",
> +      {
> +        version: 1,
> +        modules: Telemetry.getLoadedModules(),

Can Telemetry.getLoadedModules() throw?

@@ +35,5 @@
> +        version: 1,
> +        modules: Telemetry.getLoadedModules(),
> +      },
> +      {
> +        retentionDays: 180,

We don't have any "retentionDays" option (see https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/collection/custom-pings.html). You can drop it

@@ +42,5 @@
> +      }
> +    );
> +  },
> +
> +  notify() {

nit: could you please add a comment to the top of this function mentioning that it gets called by nsIUpdateTimerManager when the timer expires?

::: toolkit/components/telemetry/nsITelemetry.idl
@@ +180,5 @@
>    [implicit_jscontext]
>    jsval snapshotCapturedStacks([optional] in boolean clear);
>  
> +  [implicit_jscontext]
> +  jsval getLoadedModules();

Could you please add a comment here explaining what does it do?
Also under which circumstances it can fail (and the reported error codes).

::: toolkit/components/telemetry/tests/unit/test_TelemetryModules.js
@@ +23,5 @@
> +
> +  let m = new TelemetryModules();
> +  Assert.ok(!!m, "TelemetryModules can be created.");
> +
> +  while (!m._pingPromise) {

You can get rid of the test promise and directly wait for the ping to arrive using the PingServer. It's used here: http://searchfox.org/mozilla-central/rev/30fcf167af036aeddf322de44a2fadd370acfd2f/toolkit/components/telemetry/tests/unit/test_TelemetryController.js#94

Just init Telemetry:

  yield TelemetryController.testSetup();

And shut down the ping server after the last test: http://searchfox.org/mozilla-central/rev/30fcf167af036aeddf322de44a2fadd370acfd2f/toolkit/components/telemetry/tests/unit/test_TelemetryController.js#518

@@ +64,5 @@
> +    libxul_version = "";
> +  } else {
> +    return;
> +  }
> + 

nit: whitespace
Attachment #8828767 - Flags: feedback?(alessio.placitelli) → feedback+
(In reply to Benjamin Smedberg [:bsmedberg] from comment #8)
> Yes, that's a fine place to put the docs. I would recommend adding the
> environment, though, since it will let us correlate graphics config,
> detailed OS versions etc.

Yeah, I agree, I was thinking about it while writing the docs.

(In reply to Benjamin Smedberg [:bsmedberg] from comment #9)
> Also would recommend adding debug_name (PDB name)

What should we do on Linux and Mac, where the debug name would be the same as the module name?
Should we add it anyway for simplicity? Or not, to avoid adding useless data to the ping?

(In reply to Benjamin Smedberg [:bsmedberg] from comment #10)
> Sorry for the fragmented comments: if there is no version information (on
> non-Windows), we probably shouldn't include that in the JSON at all. On
> Windows when there is no DLL version, it would be good to make that null
> instead of the empty string.

OK.

> I can't recall whether there are DLLs with no debug ID, but if so that should
> be null if missing also, instead of empty.
> That will make it easier to analyze on the backend in presto.

I've seen some DLLs without a PDB and debug ID on crash-stats, e.g. "l3codecx.ax" (https://crash-stats.mozilla.com/report/index/8acaab7b-a158-4141-ae94-0ed792170120#tab-modules).

(In reply to Alessio Placitelli [:Dexter] from comment #11)
> ::: toolkit/components/telemetry/Telemetry.cpp
> @@ +73,5 @@
> >  #include "mozilla/IOInterposer.h"
> >  #include "mozilla/PoisonIOInterposer.h"
> >  #include "mozilla/StartupTimeline.h"
> >  #include "mozilla/HangMonitor.h"
> > +#include "shared-libraries.h"
> 
> Is this safe to be moved out of the MOZ_ENABLE_PROFILER_SPS? Does this break
> compilation if we have an empty .mozconfig?

I think I've moved everything that was needed out of MOZ_ENABLE_PROFILER_SPS.
I haven't tested compiling without MOZ_ENABLE_PROFILER_SPS yet though (it is defined by default and there isn't an option to disable it, so I'll have to change https://dxr.mozilla.org/mozilla-central/rev/aa3e49299a3aa5cb0db570532e3df9e75d30c2d1/toolkit/moz.configure#38 to force it to disabled).

> @@ +22,5 @@
> > +
> > +this.TelemetryModules = function() {
> > +  // The default is 1 week.
> > +  let interval = Preferences.get("telemetry.modules_ping.interval", 604800);
> > +  gUpdateTimerManager.registerTimer("telemetry_modules_ping", this, interval);
> 
> Can this fail/throw?

AFAICT, it can't.

> @@ +24,5 @@
> > +  // The default is 1 week.
> > +  let interval = Preferences.get("telemetry.modules_ping.interval", 604800);
> > +  gUpdateTimerManager.registerTimer("telemetry_modules_ping", this, interval);
> > +
> > +  this._pingPromise = null;
> 
> Is this used only for testing? If so, please mark it as such and add a
> comment to explain why.

Yes, I'll add more comments and explanations in the next iteration.

> @@ +32,5 @@
> > +  _sendPing() {
> > +    return TelemetryController.submitExternalPing("modules",
> > +      {
> > +        version: 1,
> > +        modules: Telemetry.getLoadedModules(),
> 
> Can Telemetry.getLoadedModules() throw?

Yes, I'll add a try-catch here. I'll not send the ping at all in case of exceptions, as it would be useless.

> 
> @@ +35,5 @@
> > +        version: 1,
> > +        modules: Telemetry.getLoadedModules(),
> > +      },
> > +      {
> > +        retentionDays: 180,
> 
> We don't have any "retentionDays" option (see
> https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/
> telemetry/collection/custom-pings.html). You can drop it

Oh, I got it from one example usage in CrashManager.jsm. I guess we can remove it from there too.
I filed bug 1332644 to do it (and marked it as a good-first-bug for new contributors).
Attached patch Patch (obsolete) — Splinter Review
I've decided to only enable the TelemetryModules code when MOZ_ENABLE_PROFILER_SPS is enabled as it is safer [1].
It is always enabled on our supported platforms anyway [2], so it shouldn't matter.

[1]: https://hg.mozilla.org/mozilla-central/file/3cedab21a7e65e6a1c4c2294ecfb5502575a46e3/tools/profiler/public/shared-libraries.h#l10
[2]: https://hg.mozilla.org/mozilla-central/file/aa3e49299a3aa5cb0db570532e3df9e75d30c2d1/toolkit/moz.configure#l38
Attachment #8828767 - Attachment is obsolete: true
Attachment #8829398 - Flags: review?(benjamin)
Attachment #8829398 - Flags: review?(alessio.placitelli)
Comment on attachment 8829398 [details] [diff] [review]
Patch

Review of attachment 8829398 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good to me overall, thanks for refactoring the test so that it doesn't rely on the promise anymore. I'd love to take one last look after the changes.

I only reviewed the stuff that lives in toolkit/components/telemetry, I'm not sure who should be reviewing the other bits of this patch.

A couple of additional questions:
- Could you please attach a sample ping (you can redact the clientId/environment for privacy)?
- Did you check with :mreid the storage/impact of this new ping? I don't think there's a problem given the frequency, but he should at the very least be aware that there's a new ping coming.
- Could you please create a schema for the new ping type? You file a follow-up bug for that or do it in that bug. But please do not forget about doing that :-) The new schema should probably live here: https://github.com/mozilla-services/mozilla-pipeline-schemas/tree/master/telemetry

::: browser/base/content/browser.js
@@ +1446,5 @@
>        }
>      });
>  
> +    // Report via Telemetry the modules loaded in the Firefox process.
> +    setTimeout(() => new TelemetryModules(), 1000 * 60);

Since TelemetryModules.jsm lives in toolkit/components/telemetry and we're doing delayed Telemetry init after 60 seconds anyway, wouldn't it be better to do that here: http://searchfox.org/mozilla-central/rev/ed04f7a44925dace34f2d2e15ef9c0f2809d70b0/toolkit/components/telemetry/TelemetryController.jsm#736

::: toolkit/components/telemetry/Telemetry.cpp
@@ +1656,5 @@
> +  SharedLibraryInfo rawModules = SharedLibraryInfo::GetInfoForSelf();
> +  for (unsigned int i = 0, n = rawModules.GetSize(); i != n; i++) {
> +    const SharedLibrary &info = rawModules.GetEntry(i);
> +
> +    // XXX: Convert to lowercase?

Do we need to do that?

@@ +1680,5 @@
> +    }
> +
> +    // Module name.
> +    JS::RootedString moduleName(cx, JS_NewStringCopyZ(cx, basename.c_str()));
> +    if (!moduleName) {

nit: I guess this could be |(!moduleName || !JS_DefineProperty(cx, moduleObj, "name", moduleName, JSPROP_ENUMERATE)) {|, since both checks are returning the same error value. Ditto below.

@@ +1689,5 @@
> +    }
> +
> +    // Module debug name.
> +    JS::RootedString moduleDebugName(cx, JS_NewStringCopyZ(cx, debug_basename.c_str()));
> +    if (!moduleName) {

Should this be |moduleDebugName|?

@@ +1708,5 @@
> +      id.setString(str_id);
> +    } else {
> +      id.setNull();
> +    }
> +    

nit: whitespaces

@@ +1725,5 @@
> +      version.setString(v);
> +    } else {
> +      version.setNull();
> +    }
> +    

nit:whitespaces

@@ +1737,5 @@
> +  }
> +
> +  ret.setObject(*moduleArray);
> +#endif // MOZ_ENABLE_PROFILER_SPS
> +  return NS_OK;

I wonder if we should return an empty array here when MOZ_ENABLE_PROFILE_SPS is not enabled. You can achieve this simply by moving this #endif before |ret.setObject()| and the |#if defined| right before the |SharedLibraryInfo ...|, so that the array is only filled if SPS is enabled.

That would reduce the possibility of breaking analyses due to not having the right type in this field.

@@ +3027,5 @@
>  
>  #ifdef MOZ_ENABLE_PROFILER_SPS
>    for (unsigned i = 0, n = rawModules.GetSize(); i != n; ++i) {
>      const SharedLibrary &info = rawModules.GetEntry(i);
> +    const std::string &name = info.GetDebugName();

Does this break any assumption server side (I'm not sure how and if this data is used at all)?

::: toolkit/components/telemetry/TelemetryModules.jsm
@@ +37,5 @@
> +  if (!AppConstants.MOZ_ENABLE_PROFILER_SPS) {
> +    return;
> +  }
> +
> +  gUpdateTimerManager.registerTimer(

nit: please add a comment mentioning that this timer survives across sessions

::: toolkit/components/telemetry/tests/unit/test_TelemetryModules.js
@@ +28,5 @@
> +  Assert.ok(!!m, "TelemetryModules can be created.");
> +
> +  let found = yield PingServer.promiseNextPing();
> +  Assert.ok(!!found, "Telemetry ping submitted.");
> +  Assert.ok(found.environment, "'modules' ping has an environment.");

Please also assert that found.type contains the correct ping type, i.e. "modules"
Attachment #8829398 - Flags: review?(alessio.placitelli) → feedback+
Summary: Add modules (DLLs) to the main ping environment data → Add the new "modules" (DLLs) ping type
Is this an opt-in or opt-out ping? What is the expected size range for these pings?
Attachment #8829842 - Attachment description: modules_ping.txt → modules ping (Linux machine)
Blocks: 1333405
(In reply to Mark Reid [:mreid] from comment #15)
> Is this an opt-in or opt-out ping? What is the expected size range for these
> pings?

It is an opt-out ping.
I've collected two example pings on my machine (Linux and Windows). The size of the uncompressed JSON is ~30 kB, gzip-compressed it's ~6 kB.
I wouldn't expect the number of modules to be wildly different between users, but I don't know for sure.
Comment on attachment 8829398 [details] [diff] [review]
Patch

># HG changeset patch
># User Marco Castelluccio <mcastelluccio@mozilla.com>
># Parent  7a4f03169c77e4618312be942f7c5fa17f7bcb68
>
>diff --git a/browser/base/content/browser.js b/browser/base/content/browser.js

>+    // Report via Telemetry the modules loaded in the Firefox process.
>+    setTimeout(() => new TelemetryModules(), 1000 * 60);

It seems strange/unidiomatic to launch this via side effects of a constructor. Either (new TelemetryModules()).start() or just startTelemetryModules().

Why is this on a settimeout? This seems like it could cause weird racy problems if the user shuts down immediately and this timeout hits during shutdown. This method by itself doesn't do anything other than call into the update timer manager, right?

>diff --git a/toolkit/components/telemetry/Telemetry.cpp b/toolkit/components/telemetry/Telemetry.cpp

>+NS_IMETHODIMP
>+TelemetryImpl::GetLoadedModules(JSContext *cx, JS::MutableHandle<JS::Value> ret)
>+{
>+#if defined(MOZ_ENABLE_PROFILER_SPS)
>+  JS::RootedObject moduleArray(cx, JS_NewArrayObject(cx, 0));
>+  if (!moduleArray) {
>+    return NS_ERROR_FAILURE;
>+  }
>+
>+  SharedLibraryInfo rawModules = SharedLibraryInfo::GetInfoForSelf();

Does this incur I/O? Doing blocking I/O on the main thread especially near startup seems like a bad idea. If there is I/O, can we do this async and resolve via a promise?

>+  for (unsigned int i = 0, n = rawModules.GetSize(); i != n; i++) {
>+    const SharedLibrary &info = rawModules.GetEntry(i);
>+
>+    // XXX: Convert to lowercase?
>+    std::string basename = info.GetName();

>+    // Module name.
>+    JS::RootedString moduleName(cx, JS_NewStringCopyZ(cx, basename.c_str()));

What is the charset here? This at least deserves a comment about charset handling. Although I'm worried about the charset particularly on windows where native path handling is lossy and you really need to use UCS2/char16_t.

>+    if (!moduleName) {

Is JSAPI really still fallible like this?

I don't think I'm the right person to review the JSAPI usage here, because that's so particular. Please ask a JSAPI reviewer (jorendorff or a delegate). I feel like there's a better way to handle all the rooted-things here rather than a bunch of separate Rooted<> locals.


> static bool
> IsValidBreakpadId(const std::string &breakpadId) {

Style nit, opening brace of a function goes in column 0 of the next line.

>diff --git a/toolkit/components/telemetry/TelemetryModules.jsm b/toolkit/components/telemetry/TelemetryModules.jsm

Is this new file being jslinted automatically?


>diff --git a/toolkit/components/telemetry/docs/data/modules-ping.rst b/toolkit/components/telemetry/docs/data/modules-ping.rst

>@@ -0,0 +1,41 @@
>+
>+"modules" ping
>+============
>+
>+This ping is captured once a week and includes the modules loaded in the Firefox process.

s/captured/sent/


>+Sometimes the debug name and debug ID are missing for Windows modules (often with malware).

In this case will they be null or completely undefined?
Flags: needinfo?(mcastelluccio)
Attachment #8829398 - Flags: review?(benjamin) → review-
In the example Linux ping, what's up with the empty modules:

      {
        "name": "",
        "debugName": "",
        "debugID": null,
        "version": null
      },
      {
        "name": "",
        "debugName": "",
        "debugID": null,
        "version": null
      },
(In reply to Benjamin Smedberg [:bsmedberg] from comment #19)
> >+    // Report via Telemetry the modules loaded in the Firefox process.
> >+    setTimeout(() => new TelemetryModules(), 1000 * 60);
> 
> It seems strange/unidiomatic to launch this via side effects of a
> constructor. Either (new TelemetryModules()).start() or just
> startTelemetryModules().

I'll change it.

> Why is this on a settimeout? This seems like it could cause weird racy
> problems if the user shuts down immediately and this timeout hits during
> shutdown. This method by itself doesn't do anything other than call into the
> update timer manager, right?

I've moved it to TelemetryController now. It's still deferred to 60 seconds after startup (without a setTimeout though).
As it is now, it shouldn't be expensive (as you're saying, it's just setting up a timer), but I chose to run it after 60 seconds just in case in the future we somehow change it making it more expensive.

> 
> >diff --git a/toolkit/components/telemetry/Telemetry.cpp b/toolkit/components/telemetry/Telemetry.cpp
> 
> >+NS_IMETHODIMP
> >+TelemetryImpl::GetLoadedModules(JSContext *cx, JS::MutableHandle<JS::Value> ret)
> >+{
> >+#if defined(MOZ_ENABLE_PROFILER_SPS)
> >+  JS::RootedObject moduleArray(cx, JS_NewArrayObject(cx, 0));
> >+  if (!moduleArray) {
> >+    return NS_ERROR_FAILURE;
> >+  }
> >+
> >+  SharedLibraryInfo rawModules = SharedLibraryInfo::GetInfoForSelf();
> 
> Does this incur I/O? Doing blocking I/O on the main thread especially near
> startup seems like a bad idea. If there is I/O, can we do this async and
> resolve via a promise?

It does incur I/O. It's rarely going to happen during startup though, as it happens when the weekly update timer expires.
If it's worth it anyway, I can move it to a separate thread.

> 
> >+  for (unsigned int i = 0, n = rawModules.GetSize(); i != n; i++) {
> >+    const SharedLibrary &info = rawModules.GetEntry(i);
> >+
> >+    // XXX: Convert to lowercase?
> >+    std::string basename = info.GetName();
> 
> >+    // Module name.
> >+    JS::RootedString moduleName(cx, JS_NewStringCopyZ(cx, basename.c_str()));
> 
> What is the charset here? This at least deserves a comment about charset
> handling. Although I'm worried about the charset particularly on windows
> where native path handling is lossy and you really need to use UCS2/char16_t.

On Windows, I've converted the UTF16 string to UTF8 in shared-libraries-win32.cpp (using NS_ConvertUTF16toUTF8).
Is that correct? Or should I use the UTF16 string directly and call JS_NewUCStringCopyZ instead?

> 
> >+    if (!moduleName) {
> 
> Is JSAPI really still fallible like this?

I don't know, I've followed the examples from https://developer.mozilla.org/en-US/docs/Mozilla/Projects/SpiderMonkey/JSAPI_Cookbook#Creating_an_Array, but they might be outdated.

> I don't think I'm the right person to review the JSAPI usage here, because
> that's so particular. Please ask a JSAPI reviewer (jorendorff or a
> delegate). I feel like there's a better way to handle all the rooted-things
> here rather than a bunch of separate Rooted<> locals.

Will do.

> >diff --git a/toolkit/components/telemetry/TelemetryModules.jsm b/toolkit/components/telemetry/TelemetryModules.jsm
> 
> Is this new file being jslinted automatically?

Yes.

> >+Sometimes the debug name and debug ID are missing for Windows modules (often with malware).
> 
> In this case will they be null or completely undefined?

They will be null, I'm adding the info to the rst file.
> It does incur I/O. It's rarely going to happen during startup though, as it
> happens when the weekly update timer expires.
> If it's worth it anyway, I can move it to a separate thread.

That timer will fire somewhere near the startup path most of the time, since most users don't leave their browser open for long periods. I'm at least concerned about this. What is the typical time for this operation? If it's >50ms then it shouldn't be on the main thread, or if common worst-case is >100ms.


> 
> > 
> > >+  for (unsigned int i = 0, n = rawModules.GetSize(); i != n; i++) {
> > >+    const SharedLibrary &info = rawModules.GetEntry(i);
> > >+
> > >+    // XXX: Convert to lowercase?
> > >+    std::string basename = info.GetName();
> > 
> > >+    // Module name.
> > >+    JS::RootedString moduleName(cx, JS_NewStringCopyZ(cx, basename.c_str()));
> > 
> > What is the charset here? This at least deserves a comment about charset
> > handling. Although I'm worried about the charset particularly on windows
> > where native path handling is lossy and you really need to use UCS2/char16_t.
> 
> On Windows, I've converted the UTF16 string to UTF8 in
> shared-libraries-win32.cpp (using NS_ConvertUTF16toUTF8).
> Is that correct? Or should I use the UTF16 string directly and call
> JS_NewUCStringCopyZ instead?

If this is always UTF8, then you should be using the UTF8 variants e.g. JS_NewStringCopyUTF8Z; the char* variants just byte-expand to a JS string. I think we generally assume that the native charset on linux is always utf8-compatible nowadays, but that's not a completely safe assumption; if incorrect utf8 would trigger release asserts, we should probably do native conversion using NS_CopyNativeToUnicode.
(In reply to Benjamin Smedberg [:bsmedberg] from comment #20)
> In the example Linux ping, what's up with the empty modules:
> 
>       {
>         "name": "",
>         "debugName": "",
>         "debugID": null,
>         "version": null
>       },
>       {
>         "name": "",
>         "debugName": "",
>         "debugID": null,
>         "version": null
>       },

shared-libraries-linux.cc is using dl_iterate_phdr. The man page for that function says: "The first object visited by callback is the main program.  For the main program, the dlpi_name field will be an empty string.".
I'm not sure about the second one.
Flags: needinfo?(mcastelluccio)
Comment on attachment 8829398 [details] [diff] [review]
Patch

Review of attachment 8829398 [details] [diff] [review]:
-----------------------------------------------------------------

I'll leave some quick high-level drive-by comments.

::: toolkit/components/telemetry/TelemetryModules.jsm
@@ +27,5 @@
> +const LOGGER_NAME = "Toolkit.Telemetry";
> +const LOGGER_PREFIX = "TelemetryModules::";
> +
> +// The default is 1 week.
> +const MODULES_PING_INTERVAL_SECONDS = 604800;

This would be much more readable as ` ... = 7 * 24 * 60 * 60;`

@@ +28,5 @@
> +const LOGGER_PREFIX = "TelemetryModules::";
> +
> +// The default is 1 week.
> +const MODULES_PING_INTERVAL_SECONDS = 604800;
> +const MODULES_PING_INTERVAL_PREFERENCE = "telemetry.modules_ping.interval";

Where is that pref documented?
It could be documented in modules-ping.rst or preferences.rst.
Note that most of our prefs are of the pattern `toolkit.telemetry.*`.

::: toolkit/components/telemetry/docs/data/modules-ping.rst
@@ +16,5 @@
> +      clientId: <UUID>,
> +      environment: { ... },
> +      payload: {
> +        version: 1,
> +        modules: [

Is there a hard-limit on the count of modules?
Size estimations for the ping should best be pessimistic (unless the behavior is well known) - so you would need to work with the upper bound of how big the data can be.

@@ +18,5 @@
> +      payload: {
> +        version: 1,
> +        modules: [
> +          {
> +            name: <string>, // Name of the module file (e.g. xul.dll)

Is there a hard-limit on the length of this and other strings?
(In reply to Georg Fritzsche [:gfritzsche] from comment #24)
> ::: toolkit/components/telemetry/docs/data/modules-ping.rst
> @@ +16,5 @@
> > +      clientId: <UUID>,
> > +      environment: { ... },
> > +      payload: {
> > +        version: 1,
> > +        modules: [
> 
> Is there a hard-limit on the count of modules?
> Size estimations for the ping should best be pessimistic (unless the
> behavior is well known) - so you would need to work with the upper bound of
> how big the data can be.

In theory, there's no limit.

> 
> @@ +18,5 @@
> > +      payload: {
> > +        version: 1,
> > +        modules: [
> > +          {
> > +            name: <string>, // Name of the module file (e.g. xul.dll)
> 
> Is there a hard-limit on the length of this and other strings?

On Windows, the limit is MAX_MODULE_NAME32, which should be 255 characters.


I've a new patch almost ready which should fix all of the review comments.
I'm still in the process of fixing one last bug with DLLs with Unicode characters on Windows.
In particular, it looks like module.szExePath is not correctly encoded. Even the LoadLibraryEx call fails here: https://dxr.mozilla.org/mozilla-central/source/tools/profiler/core/shared-libraries-win32.cc#109.
One possible solution is using EnumProcessModules instead of CreateToolhelp32Snapshot. In that case, I can use GetModuleFileNameEx which gives me a correctly encoded UTF-16 string.
I think I will just go for this solution instead of trying to figure out why module.szExePath has problems.
Attached patch Patch (obsolete) — Splinter Review
r? Dexter for the Telemetry changes
data-review? bsmedberg
r? mstange for the tools/profiler/ changes
r? jorendorff for the JSAPI usage in Telemetry.cpp

I've moved the code to get the loaded modules and related info in a runnable dispatched to a new thread, and made getLoadedModules return a Promise.
The promise can't be resolved off the main thread, so I had to create another runnable in the other thread and dispatch it to the main thread, where the promise is resolved. The JS object is created on the main thread.

As I said in comment 25, I've replaced CreateToolhelp32Snapshot with EnumProcessModules, as CreateToolhelp32Snapshot was returning module paths with incorrect encoding (even LoadLibraryEx couldn't load them).
The PDB file name is encoded as UTF-8, according to https://github.com/Microsoft/microsoft-pdb/blob/082c5290e5aff028ae84e43affa8be717aa7af73/PDB/dbi/locator.cpp#L785, I've converted it to UTF-16.

I'll file a follow-up bug to fix the usage of the SharedLibrary class in the |ProcessedStack GetStackAndModules(const std::vector<uintptr_t>& aPCs)| function (it's assuming the encoding of the string as ASCII and I haven't fixed it in my patch, as it is out of scope).

I've added a couple of DLLs for testing, one with a PDB containing a unicode character, another without a PDB. I've built them from the source file in tests/unit, but they are binary files.
Attachment #8829398 - Attachment is obsolete: true
Attachment #8831129 - Flags: review?(mstange)
Attachment #8831129 - Flags: review?(jorendorff)
Attachment #8831129 - Flags: review?(benjamin)
Attachment #8831129 - Flags: review?(alessio.placitelli)
(In reply to Marco Castelluccio [:marco] from comment #25)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #24)
> > ::: toolkit/components/telemetry/docs/data/modules-ping.rst
...
> > > +        modules: [
> > 
> > Is there a hard-limit on the count of modules?
> > Size estimations for the ping should best be pessimistic (unless the
> > behavior is well known) - so you would need to work with the upper bound of
> > how big the data can be.
> 
> In theory, there's no limit.
> 
> > 
> > @@ +18,5 @@
...
> > > +            name: <string>, // Name of the module file (e.g. xul.dll)
> > 
> > Is there a hard-limit on the length of this and other strings?
> 
> On Windows, the limit is MAX_MODULE_NAME32, which should be 255 characters.

I request to have hard-limits in place for both, probably in TelemetryModules.jsm.
Otherwise we risk running into Telemetry issues with odd populations (and we always have some very odd clients).
(In reply to Georg Fritzsche [:gfritzsche] from comment #27)
> I request to have hard-limits in place for both, probably in
> TelemetryModules.jsm.
> Otherwise we risk running into Telemetry issues with odd populations (and we
> always have some very odd clients).

It works for me. Any suggestion on what the limits should be?
Comment on attachment 8831129 [details] [diff] [review]
Patch

Review of attachment 8831129 [details] [diff] [review]:
-----------------------------------------------------------------

Just a quick pass before (I'll take a final look once the limits are in place).

::: toolkit/components/telemetry/TelemetryController.jsm
@@ +736,5 @@
>          // in the future.
>          TelemetryStorage.removeFHRDatabase();
>  
> +        // Report the modules loaded in the Firefox process.
> +        (new TelemetryModules()).start();

If you change TelemetryModules, this could be as simple as TelemetryModules.start();

::: toolkit/components/telemetry/TelemetryModules.jsm
@@ +30,5 @@
> +// The default is 1 week.
> +const MODULES_PING_INTERVAL_SECONDS = 7 * 24 * 60 * 60;
> +const MODULES_PING_INTERVAL_PREFERENCE = "toolkit.telemetry.modulesPing.interval";
> +
> +this.TelemetryModules = function() {

Since this is being only used once, would you mind refactoring it as the other Telemetry modules, getting rid of the prototype? See http://searchfox.org/mozilla-central/rev/7da3c9dcf467964f2fb82f3a4c63972ee79bf696/toolkit/components/telemetry/TelemetrySend.jsm#165

@@ +42,5 @@
> +    if (!AppConstants.MOZ_ENABLE_PROFILER_SPS) {
> +      return;
> +    }
> +
> +    // Use nsIUpdateTimerManager for a long-duration timer that survives across sections.

nit: s/sections/sessions

@@ +56,5 @@
> +   */
> +  notify() {
> +    try {
> +      Telemetry.getLoadedModules()
> +      .then(modules =>

nit: could you move |.then(modules =>| to the previous line or indent it properly?
Comment on attachment 8831129 [details] [diff] [review]
Patch

data-review=me (I didn't review the code)
Attachment #8831129 - Flags: review?(benjamin) → review+
Comment on attachment 8831129 [details] [diff] [review]
Patch

Review of attachment 8831129 [details] [diff] [review]:
-----------------------------------------------------------------

The profiler changes look good. The only thing I'm not sure about is the race condition that's mentioned in the long comment; I'm not quite sure what it means and I don't know whether it still applies after your changes.

::: tools/profiler/core/shared-libraries-win32.cc
@@ +21,5 @@
>    uint32_t signature;
>    GUID pdbSignature;
>    uint32_t pdbAge;
> +  // A UTF-8 string, according to https://github.com/Microsoft/microsoft-pdb/blob/082c5290e5aff028ae84e43affa8be717aa7af73/PDB/dbi/locator.cpp#L785
> +  char pdbFileName[1]; 

end-of-line whitespace

@@ +115,5 @@
>    SharedLibraryInfo sharedLibraryInfo;
>  
> +  HANDLE hProcess = GetCurrentProcess();
> +  mozilla::UniquePtr<HMODULE[]> hMods;
> +  unsigned int modulesNum = 0;

size_t would probably be the better type for this

@@ +121,5 @@
> +    DWORD modulesSize;
> +    EnumProcessModules(hProcess, nullptr, 0, &modulesSize);
> +    modulesNum = modulesSize / sizeof(HMODULE);
> +    hMods = mozilla::MakeUnique<HMODULE[]>(modulesNum);
> +    EnumProcessModules(hProcess, hMods.get(), modulesNum * sizeof(HMODULE), &modulesSize);

Should maybe error-check the return value of EnumProcessModules here and above, and return early.

@@ +144,5 @@
> +
> +    // Load the module again to make sure that its handle will remain remain
> +    // valid as we attempt to read the PDB information from it.  We load the
> +    // DLL as a datafile so that if the module actually gets unloaded between
> +    // the call to Module32Next and the following LoadLibraryEx, we don't end

You're no longer calling Module32Next, so this comment needs to be adjusted.

@@ +186,5 @@
> +      0, // DLLs are always mapped at offset 0 on Windows
> +      breakpadId,
> +      moduleName,
> +      pdbNameStr,
> +      GetVersion(modulePath));

It's great that we now have the full path to both the DLL and the PDB file here. This will make fixing the "lack of full path on Windows" part of bug 1329111 really easy to fix.
Attachment #8831129 - Flags: review?(mstange) → review+
Priority: -- → P3
Whiteboard: [measurement:client]
Attachment #8831129 - Flags: review?(jorendorff) → review+
Comment on attachment 8831129 [details] [diff] [review]
Patch

Clearing the review request. Please flag me again once the patch we discussed with the field limits is in place!
Attachment #8831129 - Flags: review?(alessio.placitelli)
I've looked at the modules in Socorro crash reports over the past 7 days to see which limits would make sense: https://gist.github.com/marco-c/f6aba367d4ef2838cfb5e96cc2e552dc.

98,64% of the modules had a filename length shorter than 40 characters; 99,99% shorter than 48; 100% shorter than 80.

93,5% of the reports have less than ~200 modules; 99,4% less than ~300; 99,99% less than ~400; 100% shorter than 1021.
Attached patch Patch (obsolete) — Splinter Review
I've added a limit of 1028 modules with 128 characters. We can reduce it easily if there are concerns that it is too much.
Attachment #8831129 - Attachment is obsolete: true
Attachment #8837586 - Flags: review?(alessio.placitelli)
Comment on attachment 8837586 [details] [diff] [review]
Patch

Review of attachment 8837586 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good with the following addressed.

In addition to these, could you please also add the documentation for which fields are length limited, what's the limit and the fact that you're appending a character to mark the truncation?

::: toolkit/components/telemetry/TelemetryModules.jsm
@@ +30,5 @@
> +// The default is 1 week.
> +const MODULES_PING_INTERVAL_SECONDS = 7 * 24 * 60 * 60;
> +const MODULES_PING_INTERVAL_PREFERENCE = "toolkit.telemetry.modulesPing.interval";
> +
> +const MAX_MODULES_NUM = 1024;

99,99% of the reported modules have less than 400 entries, as you stated. Is there any good reason to basically double the upper limit for the module count?

@@ +31,5 @@
> +const MODULES_PING_INTERVAL_SECONDS = 7 * 24 * 60 * 60;
> +const MODULES_PING_INTERVAL_PREFERENCE = "toolkit.telemetry.modulesPing.interval";
> +
> +const MAX_MODULES_NUM = 1024;
> +const MAX_MODULE_FILENAME_LENGTH = 128;

nit: Since this is used for both the filename and the debug name, can rename it to "MAX_STRING_LENGTH"?

@@ +64,5 @@
> +            modules = modules.slice(0, MAX_MODULES_NUM);
> +          }
> +
> +          // Cut the file names of the modules to MAX_MODULE_FILENAME_LENGTH characters.
> +          for (let module of modules) {

Please also limit the version field, as it is a string too (unless it is bounded somewhere else). You could use the same constant you're using for the other fields.

@@ +66,5 @@
> +
> +          // Cut the file names of the modules to MAX_MODULE_FILENAME_LENGTH characters.
> +          for (let module of modules) {
> +            if (module.name.length > MAX_MODULE_FILENAME_LENGTH) {
> +              module.name = module.name.substr(0, MAX_MODULE_FILENAME_LENGTH - 1) + "\u2026";

Can you move "\u2026" here and below to a constant? e.g. TRUNCATION_DELIMITER?
Attachment #8837586 - Flags: review?(alessio.placitelli) → review+
Attached patch PatchSplinter Review
Carrying r+.
Attachment #8837586 - Attachment is obsolete: true
Attachment #8838119 - Flags: review+
Pushed by mcastelluccio@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/927194e7769d
Add the new "modules" (DLLs) ping type. r=Dexter,mstange,jorendorff, data-review=bsmedberg
Attached file modules ping (Mac)
Attachment #8829842 - Attachment mime type: text/plain → application/json
Attachment #8829876 - Attachment mime type: text/plain → application/json
Pushed by mcastelluccio@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/466565fa382a
Fix comment in TelemetryModules.jsm. r=Dexter
I had to back these out because 64-bit Windows xpcshell tests have been failing like https://treeherder.mozilla.org/logviewer.html#?job_id=78036074&repo=mozilla-inbound ever since this landed.

https://hg.mozilla.org/integration/mozilla-inbound/rev/c03c6bb15620
Flags: needinfo?(mcastelluccio)
The problem is quite straightforward to fix (just need to provide a 64 bit alternative of the test DLL), I'll fix it today. I'm not sure why I haven't seen the failure on Try.
Flags: needinfo?(mcastelluccio)
(In reply to Marco Castelluccio [:marco] from comment #42)
> The problem is quite straightforward to fix (just need to provide a 64 bit
> alternative of the test DLL), I'll fix it today. I'm not sure why I haven't
> seen the failure on Try.

Instead of landing .dll(s), can't you simple check that the format of the data in the modules ping is correct? I mean, whatever the modules are, they still have to comply to a particular format (and be limited).
Flags: needinfo?(mcastelluccio)
(In reply to Alessio Placitelli [:Dexter] from comment #43)
> (In reply to Marco Castelluccio [:marco] from comment #42)
> > The problem is quite straightforward to fix (just need to provide a 64 bit
> > alternative of the test DLL), I'll fix it today. I'm not sure why I haven't
> > seen the failure on Try.
> 
> Instead of landing .dll(s), can't you simple check that the format of the
> data in the modules ping is correct? I mean, whatever the modules are, they
> still have to comply to a particular format (and be limited).

The DLLs are needed to test the cases:
1) PDB missing;
2) PDB containing unicode characters;
3) DLL name containing unicode characters.

We need to make sure these work. Merely testing the format of the data in the ping wouldn't catch problems and regressions in these cases. Indeed, the other, preexisting, function in Telemetry.cpp that is using the shared library info is not using the correct encoding when creating the JS string (I will file a bug to fix that).
Flags: needinfo?(mcastelluccio)
Pushed by mcastelluccio@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8fc28f2d52d9
Add the new "modules" (DLLs) ping type. r=Dexter,mstange,jorendorff, data-review=bsmedberg
Depends on: 1341015
https://hg.mozilla.org/mozilla-central/rev/8fc28f2d52d9
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
(In reply to Marco Castelluccio [:marco] from comment #33)
> 98,64% of the modules had a filename length shorter than 40 characters;
> 99,99% shorter than 48; 100% shorter than 80.

FYI, the Socorro stackwalker only outputs the filename portion of the module path from minidumps (for privacy reasons):
https://github.com/mozilla/socorro/blob/1385f102cebcd1c3091c8de0da7d918de819dbd7/minidump-stackwalk/stackwalker.cc#L575
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #47)
> (In reply to Marco Castelluccio [:marco] from comment #33)
> > 98,64% of the modules had a filename length shorter than 40 characters;
> > 99,99% shorter than 48; 100% shorter than 80.
> 
> FYI, the Socorro stackwalker only outputs the filename portion of the module
> path from minidumps (for privacy reasons):
> https://github.com/mozilla/socorro/blob/
> 1385f102cebcd1c3091c8de0da7d918de819dbd7/minidump-stackwalk/stackwalker.
> cc#L575

Yep, we're doing the same with the ping.
Blocks: 1341245
Depends on: 1341571
Comment on attachment 8838119 [details] [diff] [review]
Patch

Review of attachment 8838119 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/nsITelemetry.idl
@@ +192,5 @@
> +   *   },
> +   *   ...
> +   * ]
> +   *
> +   * @return An array of modules.

I see this actually returns a promise.
Please update this in a follow-up.
Flags: needinfo?(mcastelluccio)
(In reply to Georg Fritzsche [:gfritzsche] from comment #49)
> Comment on attachment 8838119 [details] [diff] [review]
> Patch
> 
> Review of attachment 8838119 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: toolkit/components/telemetry/nsITelemetry.idl
> @@ +192,5 @@
> > +   *   },
> > +   *   ...
> > +   * ]
> > +   *
> > +   * @return An array of modules.
> 
> I see this actually returns a promise.
> Please update this in a follow-up.

Good find, I forgot to update the comment when I modified the function to return a Promise.
Flags: needinfo?(mcastelluccio)
Depends on: 1342034
Depends on: 1343752
You need to log in before you can comment on or make changes to this bug.