Closed Bug 1364068 Opened 3 years ago Closed 2 years ago

Enable the 'new-profile' ping & estimate storage impact

Categories

(Toolkit :: Telemetry, defect, P1)

defect
Points:
1

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox55 --- fixed

People

(Reporter: Dexter, Assigned: Dexter)

References

Details

(Whiteboard: [measurement:client])

Attachments

(2 files, 1 obsolete file)

The 'new-profile' ping will land as disabled in bug 1120370. Once client-side validation is done and looking good, this bug will take care of flipping the pref to turn it on.

This bug will also estimate the storage impact before turning it on.
Blocks: 1351394
Points: --- → 1
Priority: -- → P2
Whiteboard: [measurement:client]
Note to self: we can estimate the storage impact by considering the distinct number of client ids having the profile creation date of day X.
I've started to put together a doc to analyse the storage impact, here: https://docs.google.com/document/d/16-OS5AbNh7Yy0zoNpHWtCq8U-xbZl0LJ5XWb5apDoSw/edit# (LDAP required)
Assignee: nobody → alessio.placitelli
Priority: P2 → P1
Comment on attachment 8870020 [details]
Bug 1364068 - Enable the 'new-profile' ping.

https://reviewboard.mozilla.org/r/141538/#review145140

::: toolkit/components/telemetry/tests/unit/head.js:310
(Diff revision 1)
>    Services.prefs.setBoolPref("toolkit.telemetry.archive.enabled", true);
>    // Telemetry xpcshell tests cannot show the infobar.
>    Services.prefs.setBoolPref("datareporting.policy.dataSubmissionPolicyBypassNotification", true);
>    // FHR uploads should be enabled.
>    Services.prefs.setBoolPref("datareporting.healthreport.uploadEnabled", true);
> -  // Many tests expect the shutdown ping to not be sent on shutdown and will fail
> +  // Many tests expect the shutdown and the 'new-profile' to not be sent on shutdown

"... and the new-profile ping"

::: toolkit/components/telemetry/tests/unit/head.js:311
(Diff revision 1)
> -  // if receive an unexpected ping. Let's globally disable the shutdown ping sender:
> -  // the relevant tests will enable this pref when needed.
> +  // and will fail if receive an unexpected ping. Let's globally disable the shutdown
> +  // ping sender: the relevant tests will enable this pref when needed.

"... disable these features"?
"these prefs"?
Attachment #8870020 - Flags: review?(gfritzsche) → review+
Pushed by alessio.placitelli@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/31a809d5ca1a
Enable the 'new-profile' ping. r=gfritzsche
Backed out for valgrind bustage like https://treeherder.mozilla.org/logviewer.html#?job_id=101165774&repo=autoland&lineNumber=24770
Flags: needinfo?(alessio.placitelli)
Backout by ihsiao@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/73035b57365c
Backed out changeset 31a809d5ca1a for valgrind bustage
Mh, that's interesting:

> [task 2017-05-23T09:35:27.756303Z] 09:35:27     INFO -   7:37.40 ==38542== 32 (16 direct, 16 indirect) bytes in 1 blocks are definitely lost in loss record 75 of 539
> [task 2017-05-23T09:35:27.756456Z] 09:35:27     INFO -   7:37.40 ==38542==    at 0x4C293D5: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
> [task 2017-05-23T09:35:27.756578Z] 09:35:27     INFO -   7:37.40 ==38542==    by 0x7DA6131: PORT_Alloc_Util (in /usr/lib64/libnssutil3.so)
> [task 2017-05-23T09:35:27.756681Z] 09:35:27     INFO -   7:37.40 ==38542==    by 0x7AA8ECD: ???
> [task 2017-05-23T09:35:27.756830Z] 09:35:27     INFO -   7:37.40 ==38542==    by 0x7AA9E4E: ???
> [task 2017-05-23T09:35:27.756945Z] 09:35:27     INFO -   7:37.40 ==38542==    by 0x7A9C86B: ???
> [task 2017-05-23T09:35:27.757103Z] 09:35:27     INFO -   7:37.40 ==38542==    by 0x7A9C8DF: ???
> [task 2017-05-23T09:35:27.757233Z] 09:35:27     INFO -   7:37.40 ==38542==    by 0x7A9C8DF: ???
> [task 2017-05-23T09:35:27.757360Z] 09:35:27     INFO -   7:37.40 ==38542==    by 0x7A6B2E3: ???
> [task 2017-05-23T09:35:27.757514Z] 09:35:27     INFO -   7:37.40 ==38542==    by 0x7A6BAB0: ???
> [task 2017-05-23T09:35:27.757647Z] 09:35:27     INFO -   7:37.40 ==38542==    by 0x61D54FA: ???
> [task 2017-05-23T09:35:27.757785Z] 09:35:27     INFO -   7:37.40 ==38542==    by 0x61D56E7: ???
> [task 2017-05-23T09:35:27.757931Z] 09:35:27     INFO -   7:37.40 ==38542==    by 0x61CD294: ???
> [task 2017-05-23T09:35:27.758035Z] 09:35:27     INFO -   7:37.40 ==38542==    by 0x61ABAAA: ???
> [task 2017-05-23T09:35:27.758174Z] 09:35:27     INFO -   7:37.40 ==38542==    by 0x61B2251: ???
> [task 2017-05-23T09:35:27.758298Z] 09:35:27     INFO -   7:37.40 ==38542==    by 0x61B88FB: ???
> [task 2017-05-23T09:35:27.758428Z] 09:35:27     INFO -   7:37.41 ==38542==    by 0x61C098F: ???
> [task 2017-05-23T09:35:27.758568Z] 09:35:27     INFO -   7:37.41 ==38542==    by 0x401939: PingSender::CurlWrapper::Post(std::string const&, std::string const&) (pingsender_unix_common.cpp:171)
> [task 2017-05-23T09:35:27.758712Z] 09:35:27     INFO -   7:37.41 ==38542==    by 0x401996: PingSender::Post(std::string const&, std::string const&) (pingsender_unix_common.cpp:193)
> [task 2017-05-23T09:35:27.758830Z] 09:35:27     INFO -   7:37.41 ==38542==    by 0x4011FE: main (pingsender.cpp:180)

Daniel, this seems to be coming from [1]. Do you have any idea about what should be causing this?

[1] - http://searchfox.org/mozilla-central/rev/6c2dbacbba1d58b8679cee700fd0a54189e0cf1b/toolkit/components/telemetry/pingsender/pingsender_unix_common.cpp#171
Flags: needinfo?(alessio.placitelli) → needinfo?(daniel)
(In reply to Alessio Placitelli [:Dexter] from comment #10)

> Daniel, this seems to be coming from [1]. Do you have any idea about what
> should be causing this?

It certainly looks like the libcurl call yes, and the leak itself is within nss and its PORT_Alloc_Util function. No symbols for libcurl makes it a bit hard to follow the flow.

My primary guess: no call to curl_global_cleanup() when you're done so there's some remaining data NSS hasn't had the chance to free.
Flags: needinfo?(daniel)
(In reply to Daniel Stenberg [:bagder] from comment #11)
> My primary guess: no call to curl_global_cleanup() when you're done so
> there's some remaining data NSS hasn't had the chance to free.

Thanks Daniel! Looks like this slightly changes the leaks to something else [1]:

> [task 2017-05-24T07:12:13.696906Z] 07:12:13     INFO -   7:31.73 ==38541== 232 (168 direct, 64 indirect) bytes in 1 blocks are definitely lost in loss record 69 of 84
> [task 2017-05-24T07:12:13.697077Z] 07:12:13     INFO -   7:31.73 ==38541==    at 0x4C27449: calloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
> [task 2017-05-24T07:12:13.697234Z] 07:12:13     INFO -   7:31.73 ==38541==    by 0x83E7DBC: ???
> [task 2017-05-24T07:12:13.697400Z] 07:12:13     INFO -   7:31.73 ==38541==    by 0x83DED54: ???
> [task 2017-05-24T07:12:13.697565Z] 07:12:13     INFO -   7:31.73 ==38541==    by 0x61D5068: ???
> [task 2017-05-24T07:12:13.697730Z] 07:12:13     INFO -   7:31.73 ==38541==    by 0x61C1838: ???
> [task 2017-05-24T07:12:13.697894Z] 07:12:13     INFO -   7:31.73 ==38541==    by 0x61C1877: ???
> [task 2017-05-24T07:12:13.698074Z] 07:12:13     INFO -   7:31.73 ==38541==    by 0x4018B1: PingSender::CurlWrapper::Init() (pingsender_unix_common.cpp:144)
> [task 2017-05-24T07:12:13.698259Z] 07:12:13     INFO -   7:31.73 ==38541==    by 0x4019DC: PingSender::Post(std::string const&, std::string const&) (pingsender_unix_common.cpp:205)
> [task 2017-05-24T07:12:13.698419Z] 07:12:13     INFO -   7:31.73 ==38541==    by 0x4011FE: main (pingsender.cpp:180)
> [task 2017-05-24T07:12:13.698574Z] 07:12:13     INFO -   7:31.73 ==38541==

The total amount of leaked memory doesn't seem to change though.

[1] - https://treeherder.mozilla.org/logviewer.html#?job_id=101456936&repo=try&lineNumber=25579-25605
Curious.

I trust it you don't see this leak when you run it stand-alone or is it reproducible any other way? Which libcurl version is this? I'm not aware of any memory leak in any recent libcurl version for such a normal operation that this seems to be.
(In reply to Daniel Stenberg [:bagder] from comment #13)
> Curious.
> 
> I trust it you don't see this leak when you run it stand-alone or is it
> reproducible any other way? Which libcurl version is this? I'm not aware of
> any memory leak in any recent libcurl version for such a normal operation
> that this seems to be.

The funny thing is that this leak started showing with this patch that doesn't change a thing in this binary, just calls it once more.
I just noticed that I'm able to see the same leak that I get after adding |curl_global_cleanup| locally running valgrind in a VM box. However, I'm not able to see the leak that I was seeing on try w\o curl_global_cleanup:

> valgrind  --vex-iropt-register-updates=allregs-at-mem-access --smc-check=all-non-file --leak-check=full --default-suppressions=yes  --suppressions=build/valgrind/x86_64-redhat-linux-gnu.sup --suppressions=build/valgrind/cross-architecture.sup --suppressions=libcurl.sup --show-leak-kinds=definite --fair-sched=yes -v --track-origins=yes --trace-children=yes --num-callers=36 --gen-suppressions=all objdir-ff-debug/toolkit/components/telemetry/pingsender/pingsender http://localhost:3785  example.txt

That's the same command used on try, AFAICT. I'm using libcurl 7.47.0 on ubuntu
(In reply to Alessio Placitelli [:Dexter] from comment #14)

> I just noticed that I'm able to see the same leak that I get after adding
> |curl_global_cleanup| locally running valgrind in a VM box. However, I'm not
> able to see the leak that I was seeing on try w\o curl_global_cleanup:

The try leak involved NSS which implies another libcurl build than the typical ubuntu libcurl (as libcurl on ubuntu is normally built with gnutls or openssl).

If you can reproduce it, can you reproduce it with a libcurl build with some debug symbols so that we might get some additional clues as to what exactly is leaking?
(In reply to Daniel Stenberg [:bagder] from comment #15)
> (In reply to Alessio Placitelli [:Dexter] from comment #14)
> 
> > I just noticed that I'm able to see the same leak that I get after adding
> > |curl_global_cleanup| locally running valgrind in a VM box. However, I'm not
> > able to see the leak that I was seeing on try w\o curl_global_cleanup:
> 
> The try leak involved NSS which implies another libcurl build than the
> typical ubuntu libcurl (as libcurl on ubuntu is normally built with gnutls
> or openssl).
> 
> If you can reproduce it, can you reproduce it with a libcurl build with some
> debug symbols so that we might get some additional clues as to what exactly
> is leaking?

Thanks Daniel! After fiddling with it the whole day, and thanks to :sewardj, I was able to get closer to some result.
To get meaningful stacks locally I had to use the most recent version of valgrind, as the one that I was using (3.11) had a bug and was not able to read the debug symbols :(

With that settled, and some hel from :sewardj, we found that:

- yes, global_cleanup needs to be called (as you rightly said :D );
- calling |exit()| instead of |return| doesn't make valgrind happy as exit() can prevent destructors from being called when getting out of scope, and thus we do not free the memory;

After doing this changes I'm left, locally, with this leak:

> ==5218== 32 bytes in 1 blocks are still reachable in loss record 1 of 1
> ==5218==    at 0x4C2EB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==5218==    by 0x505864E: _dlerror_run (/build/glibc-jxM2Ev/glibc-2.24/dlfcn/dlerror.c:141)
> ==5218==    by 0x5057F81: dlopen@@GLIBC_2.2.5 (/build/glibc-jxM2Ev/glibc-2.24/dlfcn/dlopen.c:87)
> ==5218==    by 0x109C3F: PingSender::CurlWrapper::Init() (toolkit/components/telemetry/pingsender/pingsender_unix_common.cpp:105)
> ==5218==    by 0x109F6B: PingSender::Post(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (toolkit/components/telemetry/pingsender/pingsender_unix_common.cpp:212)
> ==5218==    by 0x109580: main (toolkit/components/telemetry/pingsender/pingsender.cpp:180)

This is exactly what we're already suppressing at [1] and bug 793535 comment 2 explains why we're doing this.
I suspect Valgrind is not suppressing that automatically because, for some reason, even though treeherder contains debug symbols for libcurl, it's not reporting as useful stack (see comment 12). Adding a new suppression locally gave me a perfectly fine valgrind run, in all these cases:

- pingsender sending to unavailable server;
- pingsender sending to HTTP server;
- pingsender sending to HTTPS server;
- pingsender failing;

However, when with all my hopes I pushed this to try it failed miserably once again [2], with malloc/calloc leaks generated from [3].
Unfortunately, I'm not able to reproduce these leaks locally. I'm tempted to simply suppress them, as they seem to be coming other libs anyway.

[1] - http://searchfox.org/mozilla-central/rev/2933592c4a01b634ab53315ce2d0e43fccb82181/build/valgrind/cross-architecture.sup#61
[2] - https://treeherder.mozilla.org/#/jobs?repo=try&revision=7aaf6eb4ea1ac771f97cf602d90d128040eae063&selectedJob=101606743
[3] - http://searchfox.org/mozilla-central/rev/2933592c4a01b634ab53315ce2d0e43fccb82181/toolkit/components/telemetry/pingsender/pingsender_unix_common.cpp#171
Ah, I forgot to mention one funny thing. It could be that we're not getting useful stacks by valgrind because of a nice bug that :sewardj mentioned: if we happen to call |dlclose| on some library, then valgrind will forget about its symbols and print not super useful stacks.

Unfortunately, the naive solution of simply commenting out |dlclose| has its drawback: no leak is reported anymore and valgrind works just fine, with no suppression at all, even on Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=415a4c6f353385553f254fe7e8361b0db87e9dfb

As :sewardj explained, this is because the memory is still in use and its classified as "reachable" rather than a "leak". So the problem is still there, just not failing anymore :)
Julian, even adding the suppression from the attached review request Valgrind will still complain:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=bc16715a7040c1991bd36ced88ba36c8379fcbf9&selectedJob=101674306

This is the cryptic leak from try:

> ==38537==    by 0x83D8CD6: ???
> ==38537==    by 0x9F550B9: ???
> ==38537==    by 0x9F55120: ???
> ==38537==    by 0x9F55229: ???
> ==38537==    by 0x83DEF59: ???
> ==38537==    by 0x9F54FE1: ???
> ==38537==    by 0x9F2AAF4: ???
> ==38537==    by 0x9F55385: ???
> ==38537==    by 0x9F2862A: ???
> ==38537==    by 0xFFEFFFEF3: ???
> ==38537==    by 0x726F772F72656B71: ???
> ==38537==    by 0x622F65636170736A: ???
> ==38537==    by 0x6372732F646C6974: ???
> ==38537==    by 0x7269662D6A626F2E: ???
> ==38537==    by 0x7369642F786F6664: ???
> ==38537==    by 0x69702F6E69622F73: ???
> ==38537==    by 0x7265646E6573676D: ???
> ==38537==    by 0x2F3A7370747467FF: ???
> ==38537==    by 0x302E302E3732312E: ???
> ==38537==    by 0x35323430343A312D: ???
> ==38537==    by 0x74656D656C65742E: ???
> ==38537==    by 0x796D6D75642D7971: ???
> ==38537==    by 0x74696D6275732F2E: ???
> ==38537==    by 0x74656D656C65742E: ???
> ==38537==    by 0x66336235642F7971: ???
> ==38537==    by 0x373833362D643931: ???
> ==38537==    by 0x34612D663236342C: ???
> ==38537==    by 0x61623734372D3060: ???
> ==38537==    by 0x2F61303037663763: ???
> ==38537==    by 0x666F72702D77656D: ???
> ==38537==    by 0x657269462F656C68: ???
> ==38537==    by 0x302E35352F786F65: ???
> ==38537==    by 0x75616665642F3160: ???
> ==38537==    by 0x30373130322F746B: ???
> ==38537==    by 0x3334353731343234: ???
> ==38537==
> {
>    <insert_a_suppression_name_here>
>    Memcheck:Leak
>    match-leak-kinds: definite
>    fun:calloc
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
> }

Any suggestion?
Flags: needinfo?(jseward)
Unfortunately I can't offer you any good options here:

* suppressing the leak is not feasible, because the only name in 
  the stack is "calloc"

* changing the way valgrind deals with debuginfo after dlclose
  is a big bit of work

I think your least-worst option is to rewrite the program so as not
to use dlopen/close, which tbh seems a bit strange to me, but even then:

* it's not guaranteed to fix the leak

* (per irc chat) it's not feasible in the immediate timeframe
Flags: needinfo?(jseward)
Attachment #8870939 - Attachment is obsolete: true
See Also: → 1367713
Thanks Julian for all your time and efforts helping me on this.

(In reply to Julian Seward [:jseward] from comment #23)
> Unfortunately I can't offer you any good options here:
> 
> * suppressing the leak is not feasible, because the only name in 
>   the stack is "calloc"
> 
> * changing the way valgrind deals with debuginfo after dlclose
>   is a big bit of work
> 
> I think your least-worst option is to rewrite the program so as not
> to use dlopen/close, which tbh seems a bit strange to me, but even then:
> 
> * it's not guaranteed to fix the leak
> 
> * (per irc chat) it's not feasible in the immediate timeframe

As mentioned over IRC, the most reasonable solution seems to be disabling the feature on 'valgrind jobs' until either Valgrind starts to deal with debuginfo/dlclose or the pingsender is re-written in Rust (bug 1339035).

I've filed 1367713 to enable it back.

I wonder if updating Valgrind on try would make things smoother/change this. But I see there's no bug on updating it from 3.10 to 3.12, so this seems the fastest way to land this important feature.
Comment on attachment 8870928 [details]
Bug 1364068 - Properly cleanup libcurl in the pingsender.

https://reviewboard.mozilla.org/r/142496/#review146408

Looks good to me!
Attachment #8870928 - Flags: review?(gsvelto) → review+
Attachment #8870928 - Flags: review?(n.nethercote) → review?(jseward)
Comment on attachment 8870928 [details]
Bug 1364068 - Properly cleanup libcurl in the pingsender.

https://reviewboard.mozilla.org/r/142496/#review146766

Otherwise looks OK.

::: build/valgrind/mach_commands.py:78
(Diff revision 2)
> +            # Don't spawn the 'pingsender' process on Valgrind tests due to the
> +            # 'new-profile' ping. Valgrind marks the process as leaky (bug 1364068)
> +            # but does not provide enough information to suppress the leak. Running
> +            # locally does not reproduce the issue, so disable this until we rewrite
> +            # the pingsender in Rust (bug 1339035).
> +            # We can't do this in 'prefs_genera.js' as we want to retain test coverage

Typo (prefs_genera.hs)
Attachment #8870928 - Flags: review?(jseward) → review+
Thanks! Here's the valgrind-specific try push that covers this changes: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f4f0378212a18c4a7c02bbff2e8a9d02bfad9cd9 (all green!)
Pushed by alessio.placitelli@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/ad54e4734e9b
Enable the 'new-profile' ping. r=gfritzsche
https://hg.mozilla.org/integration/autoland/rev/026b0b69ca23
Properly cleanup libcurl in the pingsender. r=gsvelto,jseward
After
Flags: needinfo?(alessio.placitelli)
See Also: → 1368455, 1368459
After investigating the new backout, we found that the problem is the pingsender timing out due to trying to connect to a misconfigured SSL server in the test harness:

- the pingsender waits 5 minutes before timing out;
- the test framework waits for all the processes to complete before shutting down (including the pingsender);
- if the test framework gets stuck, it kills the processes and triggers an error (the ones we're seeing in comment 32).

After discussing this with Georg, since this problems are not related to the 'new-profile' ping, we're not turning it on in tests (except for the xpcshell tests).

I've filed bug 1368455 to decrease the pingsender's timeout from 5 minutes to 60 seconds and bug 1368459 to fix the SSL-server problem and turn the coverage back on in mochitests.
Pushed by alessio.placitelli@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/cc274092e378
Enable the 'new-profile' ping. r=gfritzsche
https://hg.mozilla.org/integration/autoland/rev/e61ba4616b81
Properly cleanup libcurl in the pingsender. r=gsvelto,jseward
https://hg.mozilla.org/mozilla-central/rev/cc274092e378
https://hg.mozilla.org/mozilla-central/rev/e61ba4616b81
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
You need to log in before you can comment on or make changes to this bug.