Enable the 'new-profile' ping & estimate storage impact

RESOLVED FIXED in Firefox 55

Status

()

Toolkit
Telemetry
P1
normal
RESOLVED FIXED
9 months ago
8 months ago

People

(Reporter: Dexter, Assigned: Dexter)

Tracking

Trunk
mozilla55
Points:
1

Firefox Tracking Flags

(firefox55 fixed)

Details

(Whiteboard: [measurement:client])

MozReview Requests

()

Submitter Diff Changes Open Issues Last Updated
Loading...
Error loading review requests:

Attachments

(2 attachments, 1 obsolete attachment)

(Assignee)

Description

9 months ago
The 'new-profile' ping will land as disabled in bug 1120370. Once client-side validation is done and looking good, this bug will take care of flipping the pref to turn it on.

This bug will also estimate the storage impact before turning it on.
(Assignee)

Updated

9 months ago
Blocks: 1351394
Points: --- → 1
Priority: -- → P2
Whiteboard: [measurement:client]
(Assignee)

Comment 1

9 months ago
Note to self: we can estimate the storage impact by considering the distinct number of client ids having the profile creation date of day X.
(Assignee)

Comment 2

8 months ago
I've started to put together a doc to analyse the storage impact, here: https://docs.google.com/document/d/16-OS5AbNh7Yy0zoNpHWtCq8U-xbZl0LJ5XWb5apDoSw/edit# (LDAP required)
(Assignee)

Updated

8 months ago
Assignee: nobody → alessio.placitelli
(Assignee)

Updated

8 months ago
Priority: P2 → P1
Comment hidden (mozreview-request)

Comment 4

8 months ago
mozreview-review
Comment on attachment 8870020 [details]
Bug 1364068 - Enable the 'new-profile' ping.

https://reviewboard.mozilla.org/r/141538/#review145140

::: toolkit/components/telemetry/tests/unit/head.js:310
(Diff revision 1)
>    Services.prefs.setBoolPref("toolkit.telemetry.archive.enabled", true);
>    // Telemetry xpcshell tests cannot show the infobar.
>    Services.prefs.setBoolPref("datareporting.policy.dataSubmissionPolicyBypassNotification", true);
>    // FHR uploads should be enabled.
>    Services.prefs.setBoolPref("datareporting.healthreport.uploadEnabled", true);
> -  // Many tests expect the shutdown ping to not be sent on shutdown and will fail
> +  // Many tests expect the shutdown and the 'new-profile' to not be sent on shutdown

"... and the new-profile ping"

::: toolkit/components/telemetry/tests/unit/head.js:311
(Diff revision 1)
> -  // if receive an unexpected ping. Let's globally disable the shutdown ping sender:
> -  // the relevant tests will enable this pref when needed.
> +  // and will fail if receive an unexpected ping. Let's globally disable the shutdown
> +  // ping sender: the relevant tests will enable this pref when needed.

"... disable these features"?
"these prefs"?
Attachment #8870020 - Flags: review?(gfritzsche) → review+
Comment hidden (mozreview-request)

Comment 7

8 months ago
Pushed by alessio.placitelli@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/31a809d5ca1a
Enable the 'new-profile' ping. r=gfritzsche

Comment 8

8 months ago
Backed out for valgrind bustage like https://treeherder.mozilla.org/logviewer.html#?job_id=101165774&repo=autoland&lineNumber=24770
Flags: needinfo?(alessio.placitelli)

Comment 9

8 months ago
Backout by ihsiao@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/73035b57365c
Backed out changeset 31a809d5ca1a for valgrind bustage
(Assignee)

Comment 10

8 months ago
Mh, that's interesting:

> [task 2017-05-23T09:35:27.756303Z] 09:35:27     INFO -   7:37.40 ==38542== 32 (16 direct, 16 indirect) bytes in 1 blocks are definitely lost in loss record 75 of 539
> [task 2017-05-23T09:35:27.756456Z] 09:35:27     INFO -   7:37.40 ==38542==    at 0x4C293D5: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
> [task 2017-05-23T09:35:27.756578Z] 09:35:27     INFO -   7:37.40 ==38542==    by 0x7DA6131: PORT_Alloc_Util (in /usr/lib64/libnssutil3.so)
> [task 2017-05-23T09:35:27.756681Z] 09:35:27     INFO -   7:37.40 ==38542==    by 0x7AA8ECD: ???
> [task 2017-05-23T09:35:27.756830Z] 09:35:27     INFO -   7:37.40 ==38542==    by 0x7AA9E4E: ???
> [task 2017-05-23T09:35:27.756945Z] 09:35:27     INFO -   7:37.40 ==38542==    by 0x7A9C86B: ???
> [task 2017-05-23T09:35:27.757103Z] 09:35:27     INFO -   7:37.40 ==38542==    by 0x7A9C8DF: ???
> [task 2017-05-23T09:35:27.757233Z] 09:35:27     INFO -   7:37.40 ==38542==    by 0x7A9C8DF: ???
> [task 2017-05-23T09:35:27.757360Z] 09:35:27     INFO -   7:37.40 ==38542==    by 0x7A6B2E3: ???
> [task 2017-05-23T09:35:27.757514Z] 09:35:27     INFO -   7:37.40 ==38542==    by 0x7A6BAB0: ???
> [task 2017-05-23T09:35:27.757647Z] 09:35:27     INFO -   7:37.40 ==38542==    by 0x61D54FA: ???
> [task 2017-05-23T09:35:27.757785Z] 09:35:27     INFO -   7:37.40 ==38542==    by 0x61D56E7: ???
> [task 2017-05-23T09:35:27.757931Z] 09:35:27     INFO -   7:37.40 ==38542==    by 0x61CD294: ???
> [task 2017-05-23T09:35:27.758035Z] 09:35:27     INFO -   7:37.40 ==38542==    by 0x61ABAAA: ???
> [task 2017-05-23T09:35:27.758174Z] 09:35:27     INFO -   7:37.40 ==38542==    by 0x61B2251: ???
> [task 2017-05-23T09:35:27.758298Z] 09:35:27     INFO -   7:37.40 ==38542==    by 0x61B88FB: ???
> [task 2017-05-23T09:35:27.758428Z] 09:35:27     INFO -   7:37.41 ==38542==    by 0x61C098F: ???
> [task 2017-05-23T09:35:27.758568Z] 09:35:27     INFO -   7:37.41 ==38542==    by 0x401939: PingSender::CurlWrapper::Post(std::string const&, std::string const&) (pingsender_unix_common.cpp:171)
> [task 2017-05-23T09:35:27.758712Z] 09:35:27     INFO -   7:37.41 ==38542==    by 0x401996: PingSender::Post(std::string const&, std::string const&) (pingsender_unix_common.cpp:193)
> [task 2017-05-23T09:35:27.758830Z] 09:35:27     INFO -   7:37.41 ==38542==    by 0x4011FE: main (pingsender.cpp:180)

Daniel, this seems to be coming from [1]. Do you have any idea about what should be causing this?

[1] - http://searchfox.org/mozilla-central/rev/6c2dbacbba1d58b8679cee700fd0a54189e0cf1b/toolkit/components/telemetry/pingsender/pingsender_unix_common.cpp#171
Flags: needinfo?(alessio.placitelli) → needinfo?(daniel)
(In reply to Alessio Placitelli [:Dexter] from comment #10)

> Daniel, this seems to be coming from [1]. Do you have any idea about what
> should be causing this?

It certainly looks like the libcurl call yes, and the leak itself is within nss and its PORT_Alloc_Util function. No symbols for libcurl makes it a bit hard to follow the flow.

My primary guess: no call to curl_global_cleanup() when you're done so there's some remaining data NSS hasn't had the chance to free.
Flags: needinfo?(daniel)
(Assignee)

Comment 12

8 months ago
(In reply to Daniel Stenberg [:bagder] from comment #11)
> My primary guess: no call to curl_global_cleanup() when you're done so
> there's some remaining data NSS hasn't had the chance to free.

Thanks Daniel! Looks like this slightly changes the leaks to something else [1]:

> [task 2017-05-24T07:12:13.696906Z] 07:12:13     INFO -   7:31.73 ==38541== 232 (168 direct, 64 indirect) bytes in 1 blocks are definitely lost in loss record 69 of 84
> [task 2017-05-24T07:12:13.697077Z] 07:12:13     INFO -   7:31.73 ==38541==    at 0x4C27449: calloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
> [task 2017-05-24T07:12:13.697234Z] 07:12:13     INFO -   7:31.73 ==38541==    by 0x83E7DBC: ???
> [task 2017-05-24T07:12:13.697400Z] 07:12:13     INFO -   7:31.73 ==38541==    by 0x83DED54: ???
> [task 2017-05-24T07:12:13.697565Z] 07:12:13     INFO -   7:31.73 ==38541==    by 0x61D5068: ???
> [task 2017-05-24T07:12:13.697730Z] 07:12:13     INFO -   7:31.73 ==38541==    by 0x61C1838: ???
> [task 2017-05-24T07:12:13.697894Z] 07:12:13     INFO -   7:31.73 ==38541==    by 0x61C1877: ???
> [task 2017-05-24T07:12:13.698074Z] 07:12:13     INFO -   7:31.73 ==38541==    by 0x4018B1: PingSender::CurlWrapper::Init() (pingsender_unix_common.cpp:144)
> [task 2017-05-24T07:12:13.698259Z] 07:12:13     INFO -   7:31.73 ==38541==    by 0x4019DC: PingSender::Post(std::string const&, std::string const&) (pingsender_unix_common.cpp:205)
> [task 2017-05-24T07:12:13.698419Z] 07:12:13     INFO -   7:31.73 ==38541==    by 0x4011FE: main (pingsender.cpp:180)
> [task 2017-05-24T07:12:13.698574Z] 07:12:13     INFO -   7:31.73 ==38541==

The total amount of leaked memory doesn't seem to change though.

[1] - https://treeherder.mozilla.org/logviewer.html#?job_id=101456936&repo=try&lineNumber=25579-25605
Curious.

I trust it you don't see this leak when you run it stand-alone or is it reproducible any other way? Which libcurl version is this? I'm not aware of any memory leak in any recent libcurl version for such a normal operation that this seems to be.
(Assignee)

Comment 14

8 months ago
(In reply to Daniel Stenberg [:bagder] from comment #13)
> Curious.
> 
> I trust it you don't see this leak when you run it stand-alone or is it
> reproducible any other way? Which libcurl version is this? I'm not aware of
> any memory leak in any recent libcurl version for such a normal operation
> that this seems to be.

The funny thing is that this leak started showing with this patch that doesn't change a thing in this binary, just calls it once more.
I just noticed that I'm able to see the same leak that I get after adding |curl_global_cleanup| locally running valgrind in a VM box. However, I'm not able to see the leak that I was seeing on try w\o curl_global_cleanup:

> valgrind  --vex-iropt-register-updates=allregs-at-mem-access --smc-check=all-non-file --leak-check=full --default-suppressions=yes  --suppressions=build/valgrind/x86_64-redhat-linux-gnu.sup --suppressions=build/valgrind/cross-architecture.sup --suppressions=libcurl.sup --show-leak-kinds=definite --fair-sched=yes -v --track-origins=yes --trace-children=yes --num-callers=36 --gen-suppressions=all objdir-ff-debug/toolkit/components/telemetry/pingsender/pingsender http://localhost:3785  example.txt

That's the same command used on try, AFAICT. I'm using libcurl 7.47.0 on ubuntu
(In reply to Alessio Placitelli [:Dexter] from comment #14)

> I just noticed that I'm able to see the same leak that I get after adding
> |curl_global_cleanup| locally running valgrind in a VM box. However, I'm not
> able to see the leak that I was seeing on try w\o curl_global_cleanup:

The try leak involved NSS which implies another libcurl build than the typical ubuntu libcurl (as libcurl on ubuntu is normally built with gnutls or openssl).

If you can reproduce it, can you reproduce it with a libcurl build with some debug symbols so that we might get some additional clues as to what exactly is leaking?
(Assignee)

Comment 16

8 months ago
(In reply to Daniel Stenberg [:bagder] from comment #15)
> (In reply to Alessio Placitelli [:Dexter] from comment #14)
> 
> > I just noticed that I'm able to see the same leak that I get after adding
> > |curl_global_cleanup| locally running valgrind in a VM box. However, I'm not
> > able to see the leak that I was seeing on try w\o curl_global_cleanup:
> 
> The try leak involved NSS which implies another libcurl build than the
> typical ubuntu libcurl (as libcurl on ubuntu is normally built with gnutls
> or openssl).
> 
> If you can reproduce it, can you reproduce it with a libcurl build with some
> debug symbols so that we might get some additional clues as to what exactly
> is leaking?

Thanks Daniel! After fiddling with it the whole day, and thanks to :sewardj, I was able to get closer to some result.
To get meaningful stacks locally I had to use the most recent version of valgrind, as the one that I was using (3.11) had a bug and was not able to read the debug symbols :(

With that settled, and some hel from :sewardj, we found that:

- yes, global_cleanup needs to be called (as you rightly said :D );
- calling |exit()| instead of |return| doesn't make valgrind happy as exit() can prevent destructors from being called when getting out of scope, and thus we do not free the memory;

After doing this changes I'm left, locally, with this leak:

> ==5218== 32 bytes in 1 blocks are still reachable in loss record 1 of 1
> ==5218==    at 0x4C2EB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==5218==    by 0x505864E: _dlerror_run (/build/glibc-jxM2Ev/glibc-2.24/dlfcn/dlerror.c:141)
> ==5218==    by 0x5057F81: dlopen@@GLIBC_2.2.5 (/build/glibc-jxM2Ev/glibc-2.24/dlfcn/dlopen.c:87)
> ==5218==    by 0x109C3F: PingSender::CurlWrapper::Init() (toolkit/components/telemetry/pingsender/pingsender_unix_common.cpp:105)
> ==5218==    by 0x109F6B: PingSender::Post(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (toolkit/components/telemetry/pingsender/pingsender_unix_common.cpp:212)
> ==5218==    by 0x109580: main (toolkit/components/telemetry/pingsender/pingsender.cpp:180)

This is exactly what we're already suppressing at [1] and bug 793535 comment 2 explains why we're doing this.
I suspect Valgrind is not suppressing that automatically because, for some reason, even though treeherder contains debug symbols for libcurl, it's not reporting as useful stack (see comment 12). Adding a new suppression locally gave me a perfectly fine valgrind run, in all these cases:

- pingsender sending to unavailable server;
- pingsender sending to HTTP server;
- pingsender sending to HTTPS server;
- pingsender failing;

However, when with all my hopes I pushed this to try it failed miserably once again [2], with malloc/calloc leaks generated from [3].
Unfortunately, I'm not able to reproduce these leaks locally. I'm tempted to simply suppress them, as they seem to be coming other libs anyway.

[1] - http://searchfox.org/mozilla-central/rev/2933592c4a01b634ab53315ce2d0e43fccb82181/build/valgrind/cross-architecture.sup#61
[2] - https://treeherder.mozilla.org/#/jobs?repo=try&revision=7aaf6eb4ea1ac771f97cf602d90d128040eae063&selectedJob=101606743
[3] - http://searchfox.org/mozilla-central/rev/2933592c4a01b634ab53315ce2d0e43fccb82181/toolkit/components/telemetry/pingsender/pingsender_unix_common.cpp#171
(Assignee)

Comment 17

8 months ago
Ah, I forgot to mention one funny thing. It could be that we're not getting useful stacks by valgrind because of a nice bug that :sewardj mentioned: if we happen to call |dlclose| on some library, then valgrind will forget about its symbols and print not super useful stacks.

Unfortunately, the naive solution of simply commenting out |dlclose| has its drawback: no leak is reported anymore and valgrind works just fine, with no suppression at all, even on Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=415a4c6f353385553f254fe7e8361b0db87e9dfb

As :sewardj explained, this is because the memory is still in use and its classified as "reachable" rather than a "leak". So the problem is still there, just not failing anymore :)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
(Assignee)

Comment 22

8 months ago
Julian, even adding the suppression from the attached review request Valgrind will still complain:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=bc16715a7040c1991bd36ced88ba36c8379fcbf9&selectedJob=101674306

This is the cryptic leak from try:

> ==38537==    by 0x83D8CD6: ???
> ==38537==    by 0x9F550B9: ???
> ==38537==    by 0x9F55120: ???
> ==38537==    by 0x9F55229: ???
> ==38537==    by 0x83DEF59: ???
> ==38537==    by 0x9F54FE1: ???
> ==38537==    by 0x9F2AAF4: ???
> ==38537==    by 0x9F55385: ???
> ==38537==    by 0x9F2862A: ???
> ==38537==    by 0xFFEFFFEF3: ???
> ==38537==    by 0x726F772F72656B71: ???
> ==38537==    by 0x622F65636170736A: ???
> ==38537==    by 0x6372732F646C6974: ???
> ==38537==    by 0x7269662D6A626F2E: ???
> ==38537==    by 0x7369642F786F6664: ???
> ==38537==    by 0x69702F6E69622F73: ???
> ==38537==    by 0x7265646E6573676D: ???
> ==38537==    by 0x2F3A7370747467FF: ???
> ==38537==    by 0x302E302E3732312E: ???
> ==38537==    by 0x35323430343A312D: ???
> ==38537==    by 0x74656D656C65742E: ???
> ==38537==    by 0x796D6D75642D7971: ???
> ==38537==    by 0x74696D6275732F2E: ???
> ==38537==    by 0x74656D656C65742E: ???
> ==38537==    by 0x66336235642F7971: ???
> ==38537==    by 0x373833362D643931: ???
> ==38537==    by 0x34612D663236342C: ???
> ==38537==    by 0x61623734372D3060: ???
> ==38537==    by 0x2F61303037663763: ???
> ==38537==    by 0x666F72702D77656D: ???
> ==38537==    by 0x657269462F656C68: ???
> ==38537==    by 0x302E35352F786F65: ???
> ==38537==    by 0x75616665642F3160: ???
> ==38537==    by 0x30373130322F746B: ???
> ==38537==    by 0x3334353731343234: ???
> ==38537==
> {
>    <insert_a_suppression_name_here>
>    Memcheck:Leak
>    match-leak-kinds: definite
>    fun:calloc
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:*
> }

Any suggestion?
Flags: needinfo?(jseward)
Unfortunately I can't offer you any good options here:

* suppressing the leak is not feasible, because the only name in 
  the stack is "calloc"

* changing the way valgrind deals with debuginfo after dlclose
  is a big bit of work

I think your least-worst option is to rewrite the program so as not
to use dlopen/close, which tbh seems a bit strange to me, but even then:

* it's not guaranteed to fix the leak

* (per irc chat) it's not feasible in the immediate timeframe
Flags: needinfo?(jseward)
Comment hidden (mozreview-request)
(Assignee)

Updated

8 months ago
Attachment #8870939 - Attachment is obsolete: true
(Assignee)

Updated

8 months ago
See Also: → bug 1367713
(Assignee)

Comment 25

8 months ago
Thanks Julian for all your time and efforts helping me on this.

(In reply to Julian Seward [:jseward] from comment #23)
> Unfortunately I can't offer you any good options here:
> 
> * suppressing the leak is not feasible, because the only name in 
>   the stack is "calloc"
> 
> * changing the way valgrind deals with debuginfo after dlclose
>   is a big bit of work
> 
> I think your least-worst option is to rewrite the program so as not
> to use dlopen/close, which tbh seems a bit strange to me, but even then:
> 
> * it's not guaranteed to fix the leak
> 
> * (per irc chat) it's not feasible in the immediate timeframe

As mentioned over IRC, the most reasonable solution seems to be disabling the feature on 'valgrind jobs' until either Valgrind starts to deal with debuginfo/dlclose or the pingsender is re-written in Rust (bug 1339035).

I've filed 1367713 to enable it back.

I wonder if updating Valgrind on try would make things smoother/change this. But I see there's no bug on updating it from 3.10 to 3.12, so this seems the fastest way to land this important feature.

Comment 27

8 months ago
mozreview-review
Comment on attachment 8870928 [details]
Bug 1364068 - Properly cleanup libcurl in the pingsender.

https://reviewboard.mozilla.org/r/142496/#review146408

Looks good to me!
Attachment #8870928 - Flags: review?(gsvelto) → review+
(Assignee)

Updated

8 months ago
Attachment #8870928 - Flags: review?(n.nethercote) → review?(jseward)

Comment 28

8 months ago
mozreview-review
Comment on attachment 8870928 [details]
Bug 1364068 - Properly cleanup libcurl in the pingsender.

https://reviewboard.mozilla.org/r/142496/#review146766

Otherwise looks OK.

::: build/valgrind/mach_commands.py:78
(Diff revision 2)
> +            # Don't spawn the 'pingsender' process on Valgrind tests due to the
> +            # 'new-profile' ping. Valgrind marks the process as leaky (bug 1364068)
> +            # but does not provide enough information to suppress the leak. Running
> +            # locally does not reproduce the issue, so disable this until we rewrite
> +            # the pingsender in Rust (bug 1339035).
> +            # We can't do this in 'prefs_genera.js' as we want to retain test coverage

Typo (prefs_genera.hs)
Attachment #8870928 - Flags: review?(jseward) → review+
(Assignee)

Comment 29

8 months ago
Thanks! Here's the valgrind-specific try push that covers this changes: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f4f0378212a18c4a7c02bbff2e8a9d02bfad9cd9 (all green!)
Comment hidden (mozreview-request)

Comment 31

8 months ago
Pushed by alessio.placitelli@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/ad54e4734e9b
Enable the 'new-profile' ping. r=gfritzsche
https://hg.mozilla.org/integration/autoland/rev/026b0b69ca23
Properly cleanup libcurl in the pingsender. r=gsvelto,jseward
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
(Assignee)

Comment 35

8 months ago
After
Flags: needinfo?(alessio.placitelli)
See Also: → bug 1368455, bug 1368459
(Assignee)

Comment 36

8 months ago
After investigating the new backout, we found that the problem is the pingsender timing out due to trying to connect to a misconfigured SSL server in the test harness:

- the pingsender waits 5 minutes before timing out;
- the test framework waits for all the processes to complete before shutting down (including the pingsender);
- if the test framework gets stuck, it kills the processes and triggers an error (the ones we're seeing in comment 32).

After discussing this with Georg, since this problems are not related to the 'new-profile' ping, we're not turning it on in tests (except for the xpcshell tests).

I've filed bug 1368455 to decrease the pingsender's timeout from 5 minutes to 60 seconds and bug 1368459 to fix the SSL-server problem and turn the coverage back on in mochitests.

Comment 37

8 months ago
Pushed by alessio.placitelli@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/cc274092e378
Enable the 'new-profile' ping. r=gfritzsche
https://hg.mozilla.org/integration/autoland/rev/e61ba4616b81
Properly cleanup libcurl in the pingsender. r=gsvelto,jseward

Comment 38

8 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/cc274092e378
https://hg.mozilla.org/mozilla-central/rev/e61ba4616b81
Status: NEW → RESOLVED
Last Resolved: 8 months ago
status-firefox55: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
You need to log in before you can comment on or make changes to this bug.