Closed Bug 1650798 Opened 4 years ago Closed 3 years ago

Determine what to do with Glean telemetry in CI

Categories

(Testing :: Performance, task, P3)


Tracking

(Not tracked)

RESOLVED INACTIVE

People

(Reporter: sparky, Unassigned)

References

Details

Attachments

(1 file)

This bug is for discussing what to do with Glean telemetry in CI tests.

Following some discussion with :dexter about the Glean telemetry, it sounds like :mcomella and :esmyth should be involved in this discussion as well. From bug 1648183:

(In reply to Alessio Placitelli [:Dexter] from comment #27)

(In reply to Greg Mierzwinski [:sparky] from comment #26)

With that said, of course, while I believe that disabling telemetry would not be the ideal solution for this, I don't own performance telemetry :-)

I fully agree that without telemetry we aren't testing the same thing we are releasing, but we also catch fewer regressions when it's on. The biggest issue in the variance with telemetry stems from the extra external network requests. Would it be possible to at least disable these in the tests that are used for regression detection?

Do we need to block all network requests? Even to localhost? In the future we might consider adding some way to tag uploads and make them go to localhost in automation.

Moreover, are you sure glean ping uploads add variance? All the collection is async and off the main thread, and the ping upload as well.

Regarding being able to compare telemetry between CI and in the wild, that sounds like a good idea to help us nail down our lab-testing gaps. But I wonder if we could have a telemetry-enabled variant running weekly/bi-weekly for this? We are quite limited in terms of the amount of mobile testing we can do so we need to balance realism with our ability to catch regressions.

EDIT: Let me know if you think I should direct these questions at someone else.

Having a separate variant for this was an explicit anti-goal. I'm not the best person to talk about this and to make decisions about it; I can only provide perspectives about telemetry and Glean. This is really a Fenix product question that Fenix people/Fenix perf teams should answer. Michael Comella/Eric Smyth might be better people to talk to. I'd love to be kept in the loop, though :)

I think we should at least have the ability to reroute the network requests to localhost and avoid all external requests. External requests are the real issue since they are flaky and can (and very likely will) cause intermittent failures like they do on Firefox Desktop which will cause us to lose perf coverage. These flaky requests add to the variance we see in perf numbers as well.

This is why I propose we have a variant that runs with telemetry enabled at tier 3: we don't have to deal with the intermittents, we have more precise data for regression detection (thereby reducing the chance of slow regressions forming), and we also gather data that we can use to compare runs with/without telemetry enabled to give us an idea of whether telemetry is causing performance issues.

I also see another issue here: if we don't have a way to disable glean telemetry in CI, how will users disable it? I have a strong feeling that Firefox users will complain about this when they eventually find out there are external mozilla-specific network requests that they can't disable.

The vast majority of our CI performance testing is done against mitmproxy http archives, right?

In these scenarios, is there any harm in leaving the telemetry system unchanged?

The new user pings will not be sent out (since they are directed via https proxy to mitmproxy).
And they should introduce no more noise than the dozens (sometimes hundreds) of requests that are made in pageload.

Or are we only talking about live site testing here?

(In reply to Andrew Creskey [:acreskey] [he/him] from comment #1)

The vast majority of our CI performance testing is done against mitmproxy http archives, right?

In these scenarios, is there any harm in leaving the telemetry system unchanged?

The new user pings will not be sent out (since they are directed via https proxy to mitmproxy).
And they should introduce no more noise than the dozens (sometimes hundreds) of requests that are made in pageload.

If they don't get sent out when mitmproxy is on, then that's perfect and we don't need to do anything else. I think these entries in the netlocs JSON are the telemetry pings; it looks like we had 46 pings in this run and it's consistent from one run to the next: https://firefoxci.taskcluster-artifacts.net/ep7OnP2OS3uvgGEGjSfZ6A/0/public/test_info/mitm_netlocs_mitm4-pixel2-fennec-imdb.json

    {
      "response_status": 404,
      "time": "1593973861.5751662",
      "url": "https://incoming.telemetry.mozilla.org/submit/org-mozilla-fenix-nightly/baseline/1/efbf2d87-9a35-40fe-a12c-5997910dbe73"
    },

Or are we only talking about live site testing here?

For live site testing, I don't care so much about the noise since those tests are already quite noisy and there's not much we can do about that. But disabling telemetry here would help a bit with that noise and with the intermittent failures (of which there are a lot; most of the time, from what I've seen, it's worse than Chrome).

See Also: → 1634064
Depends on: 1634064

Hi folks,

I'm flagging you all since I understood that each of you owns a different testing system that currently runs Fenix-related tests.
These tests are currently sending telemetry that's adding nasty noise to Fenix dashboards, including the ones used for Fenix tracking by Execs.

We need tests running Fenix to either tag outgoing telemetry as per bug 1634064 comment 30 or find other solutions.

Thank you for your help!

Flags: needinfo?(gmierz2)
Flags: needinfo?(dave.hunt)
Flags: needinfo?(acreskey)

:dexter, I think we should find another solution.

The additional activities are worrisome from a realistic performance testing point of view. It also seems like testing from outside our organization is not taken into consideration in this implementation. When web devs start testing on Fenix, they will contaminate your data as well, so the problem is not quite resolved. I don't think it's realistic to expect that users will set these flags either - putting ourselves in their shoes, they might question why it matters to them and, more importantly, why they can't simply disable it. We should be making it as easy as possible for users to test their websites on our product, not complicating it like the performancetest flag does (which also doesn't work in our CI environment). Furthermore, it feels like not being able to disable this telemetry somewhat goes against the work we do regarding tracking protection within our products.

We really should have prefs to disable glean and set these values rather than modifying how we start the application. With this solution, we now have different solutions for our mobile/desktop products for modifying telemetry so there's no consistency and the reasons for this lack of consistency are unclear.

That said, if we continue with this solution, we can try to get this work into the August sprint.

Flags: needinfo?(gmierz2)

I've been trying to evaluate the performance impact on the startup tests, but the G5 device pool may be having issues at the moment:
https://treeherder.mozilla.org/#/jobs?repo=try&selectedTaskRun=Ma8u7H5sTvOFN2dLTUkeKw.7&tier=1%2C2%2C3&revision=27372add3bb645f1787ded76a32b129ae260ec54

I share sparky's concerns with making it easy for users to run tests against our mobile browsers, which is something we have always tried to maintain on desktop. I'm also concerned that if this is not something we can change centrally and globally (such as with a preference), then it will be missed for new tests that we stand up even if they're using the same framework.

Flags: needinfo?(dave.hunt)

Thank you for highlighting how urgent the need is :dexter. We have been blocked on productionalizing our best measure of active users because of this issue for 2 months. These measures power all of our KPI reporting and were explicitly requested by our executive team.

A 'small' number of tests across many testing centers and developers is biasing our understanding of the business. Specifically, testing makes us look like we are less successful at achieving our goals. This is particularly impactful for new products that are just getting off the ground and have small user bases. Every false data point filtered out of our understanding helps us tell a better story that is more representative of the truth.

This document outlines the impact of testing on Fenix and the extreme bias seen on Nightly as an example. When we use biased metrics to make decisions we are hurting our ability to grow the business.

https://docs.google.com/document/d/1bUotTjis45XoS2eyMExTo2VkcK5HwajHuJGtdDSz1wY/edit#heading=h.a4bjl98yshwx

(In reply to Marissa Gorlick from comment #7)

Thank you for highlighting how urgent the need is :dexter. We have been blocked on productionalizing our best measure of active users because of this issue for 2 months. These measures power all of our KPI reporting and were explicitly requested by our executive team.

If this is such a pressing issue and so problematic, why was a complex system implemented and not something short-term and simple such as disabling it while :dexter implemented the more complex solution?

By the way, I looked at the data coming from the pings and it seems like we can tell exactly when the pings are made, so we didn't need to temporarily disable live site tests to figure out if it's the source. All you needed was a try push with a set of live-site tests to determine this but no one provided us with the information needed (which I've asked for in our discussions in Slack) for us to suggest this solution.

I've made a patch for Raptor-Browsertime to change the pings. Mozperftest doesn't run live site tests so it shouldn't need it; it also needs more work than Raptor to modify them.

Assignee: nobody → gmierz2
Status: NEW → ASSIGNED

(In reply to Greg Mierzwinski [:sparky] from comment #4)

:dexter, I think we should find another solution.

The additional activities are worrisome from a realistic performance testing point of view.

I can imagine this providing a constant overhead to the startup time.
From a "realistic" performance testing POV, a similar reasoning could be applied to the other activities or tweaks (e.g. this), which makes it "less realistic". However, it's not really up to me to say what's acceptable here and what isn't, so it's fine to push back if the startup time overhead is non-negligible, as long as we have a concrete solution and plan to fix the problem of telemetry from CI.

It also seems like testing from outside our organization is not taken into consideration in this implementation. When web devs start testing on Fenix, they will contaminate your data as well so the problem is not quite resolved.

This part is outside of the scope. This problem is specifically about data from our own CI.

Furthermore, it feels like not being able to disable this telemetry somewhat goes against the work we do regarding tracking protection within our products.

It is possible to disable telemetry from any product using the Glean SDK.

We really should have prefs to disable glean and set these values rather than modifying how we start the application.

What is "prefs" on an Android such as Fenix? If you're talking about Gecko prefs, then they are Gecko specific. Anything built on the top of Gecko doesn't know much about that, unless GeckoView provides specific APIs for it.

Gecko is a component of Fenix. My personal take is that Fenix uses Gecko, Fenix uses Glean, so Fenix gets to tell the state of things, not Gecko :)

Executing Fenix with a different set of CLI parameters to enable/disable features seems like a reasonable way to accomplish control over which features get enabled/disabled.

With this solution, we now have different solutions for our mobile/desktop products for modifying telemetry so there's no consistency and the reasons for this lack of consistency are unclear.

That's just the way it is: we have multiple products that evolved differently. We're migrating Firefox Desktop to Glean (project FOG), so things will be consistent in the future, at least from the telemetry POV. I'm afraid, however, that the way to enable/disable telemetry or, in general, product features will still be somewhat different due to the radical differences in the platforms. This is out of scope for this bug, and seems like something that should be dealt with at a higher level, possibly by the perf/CI teams.

That said, if we continue with this solution, we can try to get this work into the August sprint.

Let me be clear about this: the Glean team is not enforcing the use of this solution in any way. I personally don't mind/care whatever solution we end up using to prevent telemetry from CI from getting out. Tagging pings is one solution. Other solutions:

  • Intercept ping uploads to incoming.telemetry.mozilla.org and always return 200;
  • Provide a special APK for Fenix (build variant) that better suits your needs while also preventing telemetry from polluting the datasets;
  • Disable the tests.

(In reply to Dave Hunt [:davehunt] [he/him] ⌚BST from comment #6)

I share sparky's concerns with making it easy for users to run tests against our mobile browsers, which is something we have always tried to maintain on desktop. I'm also concerned that if this is not something we can change centrally and globally (such as with a preference), then it will be missed for new tests that we stand up even if they're using the same framework.

I'm not sure how this proposal makes the story for end users more difficult. Again, this is the Glean team speaking, not the Fenix team. If Fenix wants to provide a special mode to detect browser puppeting (e.g. Selenium?), the team can do that and deal with telemetry accordingly. This is a different problem than the one we're trying to solve.

(In reply to Greg Mierzwinski [:sparky] from comment #8)

If this is such a pressing issue and so problematic, why was a complex system implemented and not something short-term and simple such as disabling it while :dexter implemented the more complex solution?

This issue was brought up more than 2 months ago, with the more active discussions happening on Slack. Marissa made it clear in that conversation that this needed to be addressed from an analytics point of view, and it was up to us to figure out a solution.

As stated above, ping tagging is one solution to this problem. The Glean SDK needs ping tagging for other reasons anyway, so tagging could be used to solve this problem as well.

If the tests are not sheriffed and provide no insight, then disabling them is definitely a better choice, since it would also allow Mozilla to save some money. I was working under the assumption that wasn't the case ;-)

By the way, I looked at the data coming from the pings and it seems like we can tell exactly when the pings are made, so we didn't need to temporarily disable live site tests to figure out if it's the source. All you needed was a try push with a set of live-site tests to determine this but no one provided us with the information needed (which I've asked for in our discussions in Slack) for us to suggest this solution.

Is this the 100+ message thread you're talking about? - I'm not sure I understand which information we did not follow up on :) Anyway, there was a bug about disabling the telemetry to make sure this was the source (bug 1648183), that happened, and CI was indeed the source. In hindsight things could have gone differently, probably, but everybody on said bug had the chance to chime in.

We now need to act on our findings: Mozilla products should either not send telemetry from CI or properly tag it.

I've made a patch for Raptor-Browsertime to change the pings. Mozperftest doesn't run live site tests so it shouldn't need it, it also needs more work than Raptor to modify them.

Nice! That seems very self-contained!

:dexter, thank you very much for this context and addressing our comments/concerns! It's really appreciated.

(In reply to Alessio Placitelli [:Dexter] from comment #10)

(In reply to Greg Mierzwinski [:sparky] from comment #4)

:dexter, I think we should find another solution.

The additional activities are worrisome from a realistic performance testing point of view.

I can imagine this providing a constant overhead to the startup time.
From a "realistic" performance testing POV, a similar reasoning could be applied to the other activities or tweaks (e.g. this), which makes it "less realistic". However, it's not really up to me to say what's acceptable here and what isn't, so it's fine to push back if the startup time overhead is non-negligible, as long as we have a concrete solution and plan to fix the problem of telemetry from CI.

I'm working with the Fenix people at the moment to try to get rid of that performancetest intent since it does way too much at once and in its current state we can't use it in CI because of some restrictions.

This part is outside of the scope. This problem is specifically about data from our own CI.

Hmm ok. So there will still be CI-sourced contamination from other organizations (such as Wikimedia) which also do this kind of testing.

We really should have prefs to disable glean and set these values rather than modifying how we start the application.

What is "prefs" on an Android such as Fenix? If you're talking about Gecko prefs, then they are Gecko specific. Anything built on the top of Gecko doesn't know much about that, unless GeckoView provides specific APIs for it.

Gecko is a component of Fenix. My personal take is that Fenix uses Gecko, Fenix uses Glean, so Fenix gets to tell the state of things, not Gecko :)

Oh, that's unfortunate. It sounds like we're going to get into a command-line option nightmare in the future because of this.

Executing Fenix with a different set of CLI parameters to enable/disable features seems like a reasonable way to accomplish control over which features get enabled/disabled.

With this solution, we now have different solutions for our mobile/desktop products for modifying telemetry so there's no consistency and the reasons for this lack of consistency are unclear.

That's just the way it is: we have multiple products that evolved differently. We're migrating Firefox Desktop to Glean (project FOG), so things will be consistent in the future, at least from the telemetry POV. I'm afraid, however, that the way to enable/disable telemetry or, in general, product features will still be somewhat different due to the radical differences in the platforms. This is out of scope for this bug, and seems like something that should be dealt with at a higher level, possibly by the perf/CI teams.

Let me be clear about this: the Glean team is not enforcing the use of this solution in any way. I personally don't mind/care whatever solution we end up using to prevent telemetry from CI from getting out. Tagging pings is one solution. Other solutions:

  • Intercept ping uploads to incoming.telemetry.mozilla.org and always return 200;

This first option is the solution we use in Firefox Desktop at the moment. But I'm glad to hear that telemetry will be standardized to Glean in the future.

(In reply to Dave Hunt [:davehunt] [he/him] ⌚BST from comment #6)

I share sparky's concerns with making it easy for users to run tests against our mobile browsers, which is something we have always tried to maintain on desktop. I'm also concerned that if this is not something we can change centrally and globally (such as with a preference), then it will be missed for new tests that we stand up even if they're using the same framework.

I'm not sure how this proposal makes the story for end users more difficult. Again, this is the Glean team speaking, not the Fenix team. If Fenix wants to provide a special mode to detect browser puppeting (e.g. Selenium?), the team can do that and deal with telemetry accordingly. This is a different problem than the one we're trying to solve.

Good point, I agree: I don't think this is a Glean/telemetry issue at this stage but something that has more to do with Fenix itself. We need to be able to set up all of this through a geckodriver connection.

(In reply to Greg Mierzwinski [:sparky] from comment #8)

If this is such a pressing issue and so problematic, why was a complex system implemented and not something short-term and simple such as disabling it while :dexter implemented the more complex solution?

This issue was brought up more than 2 months ago, with the more active discussions happening on Slack. Marissa made it clear in that conversation that this needed to be addressed from an analytics point of view, and it was up to us to figure out a solution.

As stated above, ping tagging is one solution to this problem. The Glean SDK needs ping tagging for other reasons anyway, so tagging could be used to solve this problem as well.

If the tests are not sheriffed and provide no insight, then disabling them is definitely a better choice, since it would also allow Mozilla to save some money. I was working under the assumption that wasn't the case ;-)

Heh, I can see where you were coming from there, but these tests are a drop in the bucket compared to all the other tests we run on autoland, so we didn't save much by disabling them temporarily. They are providing insight by helping us build a new metric for live site testing so that we can do more of this type of testing in the future and have more realistic and up-to-date tests. (The metric allows us to determine whether a change in a metric was caused by a website change or a true product regression/improvement, and also lets us measure the variance of a website page load for testing suitability.)

By the way, I looked at the data coming from the pings and it seems like we can tell exactly when the pings are made, so we didn't need to temporarily disable live site tests to figure out if it's the source. All you needed was a try push with a set of live-site tests to determine this but no one provided us with the information needed (which I've asked for in our discussions in Slack) for us to suggest this solution.

Is this the 100+ message thread you're talking about? - I'm not sure I understand which information we did not follow up on :) Anyway, there was a bug about disabling the telemetry to make sure this was the source (bug 1648183), that happened, and CI was indeed the source. In hindsight things could have gone differently, probably, but everybody on said bug had the chance to chime in.

Sorry, my mistake, the messages were in this bug (regarding the temporal granularity of the data):
https://bugzilla.mozilla.org/show_bug.cgi?id=1648183#c4
https://bugzilla.mozilla.org/show_bug.cgi?id=1648183#c11

Pushed by gmierz2@outlook.com:
https://hg.mozilla.org/integration/autoland/rev/fac1d246d855
Change Fenix glean tags coming from Raptor-Browsertime. r=perftest-reviewers,Bebe,Dexter
Keywords: leave-open

(In reply to Greg Mierzwinski [:sparky] from comment #11)

:dexter, thank you very much for this context and addressing our comments/concerns! It's really appreciated.

Don't mention it!

This part is outside of the scope. This problem is specifically about data from our own CI.

Hmm ok. So there will still be CI-sourced contamination from other organizations (such as Wikimedia) which also do this kind of testing.

Yes, probably. But we will address these separately, when the time comes.

Gecko is a component of Fenix. My personal take is that Fenix uses Gecko, Fenix uses Glean, so Fenix gets to tell the state of things, not Gecko :)

Oh, that's unfortunate. It sounds like we're going to get into a command-line option nightmare in the future because of this.

The Glean SDK itself also supports environment variables on all platforms but Android. Unfortunately, it is not possible to set environment variables on Android devices :(

This first option is the solution we use in Firefox Desktop at the moment. But I'm glad to hear that telemetry will be standardized to Glean in the future.

One thing worth mentioning is that, if you'll ever want to get into the area of comparing perf telemetry collected in the wild vs telemetry collected in the lab, you can now do this through the tags.

Sorry, my mistake, the messages were in this bug (regarding the temporal granularity of the data):
https://bugzilla.mozilla.org/show_bug.cgi?id=1648183#c4
https://bugzilla.mozilla.org/show_bug.cgi?id=1648183#c11

Those seem to have been addressed in comment 17 and comment 21. If they weren't, and you're still interested in the answers, let me know and we'll address them here!

(In reply to Alessio Placitelli [:Dexter] from comment #13)

Those seem to have been addressed in comment 17 and comment 21. If they weren't, and you're still interested in the answers, let me know and we'll address them here!

They weren't addressed by those comments. Specifically, "Can you simply correlate the time of the ping spike onset with the time of when the tests were run to determine if it is CI?" and "I'm also wondering why our data can't already answer this - do we only have a daily level of granularity or can we look at hourly/minute data?" were not addressed. I figured out the answer myself with the debug view you recently gave a link to - the granularity is seconds, and we can get this information from metadata.header.date or ping_info.start_time/end_time.

(In reply to Greg Mierzwinski [:sparky] from comment #14)

(In reply to Alessio Placitelli [:Dexter] from comment #13)

Those seem to have been addressed in comment 17 and comment 21. If they weren't, and you're still interested in the answers, let me know and we'll address them here!

They weren't addressed by those comments. Specifically, "Can you simply correlate the time of the ping spike onset with the time of when the tests were run to determine if it is CI?"

I'm not a data scientist myself, so I'll leave this to Marissa for a better answer.

I do have opinions, though :-) Even though we might be able to perfectly filter this out in post-analysis (which I'm not sure we could, even if we had all the info), this would work on the assumption that all the people working on the data would know all the caveats. We know, from our Firefox experience, that this is a broken assumption: even if things are documented, the knowledge won't necessarily be shared or accessible. Even if I knew about this, somebody else in the org could just make the wrong decision looking at noisy data. And we really don't want this.

I strongly believe we should do our best to remove any known source of noise.

Analysts' time is IMHO better spent on figuring out how to make a product win, versus on how to get clean data.

Marissa, any chance you could address these questions more extensively?

"I'm also wondering why our data can't already answer this - do we only have a daily level of granularity or can we look at hourly/minute data?" were not addressed. I figured out the answer myself with the debug view you recently gave a link to - the granularity is seconds, and we can get this information from metadata.header.date or ping_info.start_time/end_time.

The resolution of start_time/end_time is minutes, not seconds. And metadata.header.date is seconds, yes, but it's the submission time, which is different from the collection time.

Flags: needinfo?(mgorlick)

(In reply to Alessio Placitelli [:Dexter] from comment #15)

(In reply to Greg Mierzwinski [:sparky] from comment #14)

(In reply to Alessio Placitelli [:Dexter] from comment #13)

Those seem to have been addressed in comment 17 and comment 21. If they weren't, and you're still interested in the answers, let me know and we'll address them here!

They weren't addressed by those comments. Specifically, "Can you simply correlate the time of the ping spike onset with the time of when the tests were run to determine if it is CI?"

I'm not a data scientist myself, so I'll leave this to Marissa for a better answer.

I do have opinions, though :-) Even though we might be able to perfectly filter this out in post-analysis (which I'm not sure we could, even if we had all the info), this would work on the assumption that all the people working on the data would know all the caveats. We know, from our Firefox experience, that this is a broken assumption: even if things are documented, the knowledge won't necessarily be shared or accessible. Even if I knew about this, somebody else in the org could just make the wrong decision looking at noisy data. And we really don't want this.

I strongly believe we should do our best to remove any known source of noise.

Analysts' time is IMHO better spent on figuring out how to make a product win, versus on how to get clean data.

Oh sorry, I was not trying to suggest that we filter our data based on the ping metadata (I fully agree that's not a good solution). I was trying to see if there's another way to determine the source - if we have graphs for the daily number of pings, the hourly/minute number of pings shouldn't be difficult to get. We could have used that information to tell if it was CI-sourced or not; furthermore, I could have started a try run for those live sites to initiate a spike at a known time that we would have been able to see through those graphs.

Priority: P2 → P1

Thank you for your feedback on tracing back the volatility in our metrics, :sparky. At this time we are focused on filtering out known sources of profiles that do not represent meaningful users. Given our analysis over the last two months, our strong hypothesis is that the activity of the CI team is the primary driver of this volatility today. Once your team's changes have landed we will follow up with an additional analysis to identify whether we are ready to integrate this data source into our insights or whether there continues to be noise that makes it impossible to integrate these changes in a meaningful way. The shape of the data at this time will inform further avenues of investigation, but I am satisfied with our current state of understanding.

Flags: needinfo?(mgorlick)

Hey Greg,

Is the cset in this bug just for Browsertime-Raptor, or is it also covering the other tests discussed in bug 1634064 comment 15?

Flags: needinfo?(gmierz2)

It only covers Raptor-Browsertime.

Flags: needinfo?(gmierz2)

(In reply to Greg Mierzwinski [:sparky] from comment #20)

It only covers Raptor-Browsertime.

I see. Is your team responsible for the other test suites as well? If not, do you know who would be a good person to talk to?

Flags: needinfo?(gmierz2)

I'm back from PTO so I'll be able to re-assess the impact of launching via the Glean activity.

:acreskey will look into it for mozperftest - depending on what he finds we might not be able to use this solution.

Flags: needinfo?(gmierz2)

:sparky and :acreskey, please let us know about the details of any blockers so we can find a good solution. Is there anyone else we need to reach out to?

I tested an implementation of launching into the Glean pass-through activity in the Fenix startup tests.

The performance impact is significant: around a 200ms delay on the Moto G5 applink tests.

Baseline (unmodified) push

3219.14
3204.36
3241.29
3205.50

Mean = 3217.57ms

Launching to the glean activity first push

3399.00
3400.36
3442.14
3417.93

Mean = 3414.85ms

Note that each result is the mean of 14 independent tests.

Because the overhead is so high, this isn't a good solution for these tests.
(We are in the process of standing up tests to compare ourselves to Chrome).

:Dexter, what about providing Java/Kotlin API so that we can tag the telemetry pings when Fenix is launched in performance mode?

Note that the mozperftest impact on telemetry is believed to be quite small, by my count 14 runs * 2 tests * 2 device types = 56 new users / day.

And using conditioned profiles, I'm wondering if this is contributing to any new users at all?

Flags: needinfo?(acreskey)

:acreskey, I was wondering the same thing about the conditioned-profiles and asked about this in the slack mega-thread: https://mozilla.slack.com/archives/CG2FGG1ST/p1593548257380200?thread_ts=1591216689.369800&cid=CG2FGG1ST

We figured out that because we re-install Fenix, it gets a new client_id each time (there are many other things that trigger this too), and so conditioned-profiles don't make a difference here. There will still be 56 * ~2 = ~112 pings per day coming from mozperftest if these tests are producing them.

(In reply to Greg Mierzwinski [:sparky] from comment #27)

:acreskey, I was wondering the same thing about the conditioned-profiles and asked about this in the slack mega-thread: https://mozilla.slack.com/archives/CG2FGG1ST/p1593548257380200?thread_ts=1591216689.369800&cid=CG2FGG1ST

We figured out that because we re-install Fenix, it gets a new client_id each time (there are many other things that trigger this too), and so conditioned-profiles don't make a difference here. There will still be 56 * ~2 = ~112 pings per day coming from mozperftest if these tests are producing them.

Unfortunately, the data tells a different story. Of particular concern is not the volume of data, so much as the client churn and how that impacts KPIs like MAU and DAU. From this, I think we should take it as given that we have to tag CI data somehow.

Stepping back to other engineering solutions: have you considered setting up networking in the test environment so that incoming.telemetry.mozilla.org goes to a dummy webserver that sends the data to /dev/null? It does mean the data won't be available for your own analysis (if you need that), but it also would meet all the requirements and constraints that Android imposes...
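
For reference, a sink like the one described above can be very small. The sketch below is only an illustration, assuming a JVM-based helper in the test environment and that incoming.telemetry.mozilla.org is redirected to it (e.g. via proxy rules or a hosts override, which is not shown here); it accepts any upload, discards the body, and returns 200 so the Glean uploader considers the ping delivered.

    import com.sun.net.httpserver.HttpServer
    import java.net.InetSocketAddress

    // Minimal sketch of a "send it to /dev/null" telemetry endpoint.
    // The redirect of incoming.telemetry.mozilla.org to this host/port is assumed
    // to be handled elsewhere in the test environment.
    fun main() {
        val server = HttpServer.create(InetSocketAddress("127.0.0.1", 8765), 0)
        server.createContext("/") { exchange ->
            exchange.requestBody.readBytes()      // read and drop the ping body
            exchange.sendResponseHeaders(200, -1) // acknowledge with 200, no response body
            exchange.close()
        }
        server.start()
        println("Dummy telemetry sink listening on http://127.0.0.1:8765")
    }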

(In reply to Michael Droettboom [:mdroettboom] from comment #28)

(In reply to Greg Mierzwinski [:sparky] from comment #27)

:acreskey, I was wondering the same thing about the conditioned-profiles and asked about this in the slack mega-thread: https://mozilla.slack.com/archives/CG2FGG1ST/p1593548257380200?thread_ts=1591216689.369800&cid=CG2FGG1ST

We figured out that because we re-install Fenix, it gets a new client_id each time (there are many other things that trigger this too), and so conditioned-profiles don't make a difference here. There will still be 56 * ~2 = ~112 pings per day coming from mozperftest if these tests are producing them.

Unfortunately, the data tells a different story. Of particular concern is not the volume of data, so much as the client churn and how that impacts KPIs like MAU and DAU. From this, I think we should take it as given that we have to tag CI data somehow.

The majority of that churn was coming from the Raptor-Browsertime live site tests and has been resolved through this patch (the graphs you point us to are outdated as well): https://bugzilla.mozilla.org/attachment.cgi?id=9167480

Mozperftest does not introduce that much churn by itself - its load is much smaller than Raptor.

:mcomella, is FNPRMS still running? I recall that you mentioned you do 100 trials/day on a couple devices which could also be contributing to these pings.

Stepping back to other engineering solutions: have you considered setting up networking in the test environment so that incoming.telemetry.mozilla.org goes to a dummy webserver that sends the data to /dev/null? It does mean the data won't be available for your own analysis (if you need that), but it also would meet all the requirements and constraints that Android imposes...

AFAIK, we have no need for these pings other than some future perf research, which I think should probably be excluded from this problem at this point. We have no resources left to implement that dummy server and priorities have shifted a lot recently - if someone wants to tackle that then we can provide some guidance there. Note that we have existing prefs for disabling telemetry set in Fenix and it follows that same sort of solution: https://searchfox.org/mozilla-central/source/testing/profiles/perf/user.js#84

Flags: needinfo?(michael.l.comella)

(In reply to Greg Mierzwinski [:sparky] from comment #29)

The majority of that churn was coming from the Raptor-Browsertime live site tests and has been resolved through this patch (the graphs you point us to are outdated as well): https://bugzilla.mozilla.org/attachment.cgi?id=9167480

Mozperftest does not introduce that much churn by itself - its load is much smaller than Raptor.

:mcomella, is FNPRMS still running? I recall that you mentioned you do 100 trials/day on a couple devices which could also be contributing to these pings.

Thanks. I was conflating the two different sets of tests (Raptor-Browsertime and mozperftest) as a single entity. That indeed does reduce the scale of the problem.

AFAIK, we have no need for these pings other than some future perf research, which I think should probably be excluded from this problem at this point. We have no resources left to implement that dummy server and priorities have shifted a lot recently - if someone wants to tackle that then we can provide some guidance there. Note that we have existing prefs for disabling telemetry set in Fenix and it follows that same sort of solution: https://searchfox.org/mozilla-central/source/testing/profiles/perf/user.js#84

Let us look at the data now that the Raptor-Browsertime change has landed and see how critical this remains.

Did you mean disabling telemetry in Firefox Desktop? I don't think Fenix has prefs in that way.

:Dexter, what about providing Java/Kotlin API so that we can tag the telemetry pings when Fenix is launched in performance mode?

Can you point me to the docs or code for how performance mode is enabled so I could test the feasibility of this? Would it be sufficient to just disable telemetry for that path? We already have an API for that. (Understanding that would have performance implications). Otherwise, as you suggest, perhaps it's possible to expose tagging as a public API.
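
To illustrate what exposing tagging as a public API might look like on the Fenix side, here is a hedged Kotlin sketch. The setSourceTags call, the "automation" tag value, and the "performancetest" boolean extra are assumptions drawn from the tagging discussed in bug 1634064 and the intent mentioned earlier in this thread, not confirmed APIs of the shipped Fenix/Glean versions.

    import android.content.Intent
    import mozilla.telemetry.glean.Glean

    // Hypothetical wiring: when the launch intent signals an automation/perf run,
    // tag outgoing pings so the data pipeline can filter them out, instead of
    // disabling upload (which would change startup behaviour).
    fun maybeTagTelemetryForAutomation(intent: Intent) {
        val isPerfRun = intent.getBooleanExtra("performancetest", false) // assumed extra name
        if (isPerfRun) {
            Glean.setSourceTags(setOf("automation")) // assumed tagging API
        }
    }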

(In reply to Michael Droettboom [:mdroettboom] from comment #30)

Did you mean disabling telemetry in Firefox Desktop? I don't think Fenix has prefs in that way.

That's right, they don't, unfortunately, but there's precedent for having a dummy server for telemetry - maybe if we had an adb start flag/intent for that it would help with this issue.

:Dexter, what about providing Java/Kotlin API so that we can tag the telemetry pings when Fenix is launched in performance mode?

Just a warning that we can't use the performancetest intent in CI in its current state given that it has certain restrictions which we can't work with (adb debugging + USB connection) - it's really only the adb debugging that we are missing, but enabling that may affect performance numbers. The USB connection might become an issue in power usage tests though.

I just realized that I miscalculated the mozperftest new user count.
It's 14 runs per test but that's just one APK install (it uses browsertime iterations to repeat the tests).

So new users are expected to be:
2 installs (one for each test) * 2 device types * ~2 runs per day = ~8 new users / day.

:mcomella, is FNPRMS still running? I recall that you mentioned you do 100 trials/day on a couple devices which could also be contributing to these pings.

Yes. I would guess it's ~11 installs on Nightly builds a night (1 install each for 3 test types in TOR = 3; 1 install each for 2 test types each for 4 devices in SF = 8) but significantly more app start/stops (100+ for each install).

We do also build release builds locally for manual performance testing (and fenix/a-c developers probably build release builds as well but perhaps less frequently than us) which I'd expect would contribute to the numbers.

There is a lot of content in this thread and I haven't been following closely – please let me know if there's some action me or my team should do, thanks!

Flags: needinfo?(michael.l.comella)

Thanks all. Following up, removing 'automation' related testing telemetry has already removed thousands of single session new profiles from our mobile data and will give us a much more robust understanding of what tactics are working for new user acquisition. As a specific example, on Sunday we filtered out 1834 unique new profiles from our Nightly Fenix data using this new tag. https://sql.telemetry.mozilla.org/queries/73816/#184572 This volume is in line with the unexpected increases seen in prior analysis. https://docs.google.com/document/d/1bUotTjis45XoS2eyMExTo2VkcK5HwajHuJGtdDSz1wY/edit#heading=h.77lwp2w0wyw5

If there are additional testing centers or developers creating profiles we should continue to include this tag to keep the data clean. Thank you!

For FE performance to follow up and address our telemetry issues, I filed https://jira.mozilla.com/browse/FXP-1251. To be explicit, I think we'll assume it's not that impactful at the current usage (see comment 33 for my guess at our usage) and keep it a low priority in the short term unless someone tells us it is a high priority.

If there are additional testing centers or developers creating profiles we should continue to include this tag to keep the data clean. Thank you!

All fenix developers may occasionally build release builds for local testing: I wonder if we should default to having local builds tagged as such and specify a build configuration for production release builds that will undo that tagging at build time. I wonder, what does desktop do in this situation? I expect the problem to be more pronounced there.

(In reply to Michael Comella (:mcomella) [needinfo or I won't see it] from comment #35)

All fenix developers may occasionally build release builds for local testing: I wonder if we should default to having local builds tagged as such and specify a build configuration for production release builds that will undo that tagging at build time.

This is a great idea IMO.

I wonder, what does desktop do in this situation? I expect the problem to be more pronounced there.

For telemetry in testing on desktop, we disable it by pointing the server pref to a dummy server: https://searchfox.org/mozilla-central/source/testing/profiles/perf/user.js#84

(In reply to Greg Mierzwinski [:sparky] from comment #36)

(In reply to Michael Comella (:mcomella) [needinfo or I won't see it] from comment #35)

All fenix developers may occasionally build release builds for local testing: I wonder if we should default to having local builds tagged as such and specify a build configuration for production release builds that will undo that tagging at build time.

This is a great idea IMO.

Yes, this is indeed a good path forward. I believe all but the "production release" variants should not send telemetry (have Glean.Initialize be called with uploadEnabled=false).
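
A minimal sketch of that idea, assuming an Application subclass and a build-time flag: this is not Fenix's actual initialization code, and the exact Glean.initialize signature varies between Glean SDK versions.

    import android.app.Application
    import mozilla.telemetry.glean.Glean
    import mozilla.telemetry.glean.config.Configuration

    class SampleTelemetryApplication : Application() {
        override fun onCreate() {
            super.onCreate()

            // In a real build this would come from the build variant (e.g. a generated
            // BuildConfig field that is true only for production release builds); every
            // other variant initializes Glean with upload disabled.
            val isProductionRelease = false

            Glean.initialize(
                applicationContext = applicationContext,
                uploadEnabled = isProductionRelease,
                configuration = Configuration()
            )
        }
    }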

I wonder, what does desktop do in this situation? I expect the problem to be more pronounced there.

For telemetry in testing on desktop, we disable it by pointing the server pref to a dummy server: https://searchfox.org/mozilla-central/source/testing/profiles/perf/user.js#84

This is the way it works for automated testing. In self-built Firefox for manual QA, telemetry is "enabled but not sending".

Assignee: gmierz2 → nobody
Status: ASSIGNED → NEW
Priority: P1 → P2

Check up in startup meeting.

Flags: needinfo?(gmierz2)

No updates, this isn't a priority at the moment.

Flags: needinfo?(gmierz2)
Priority: P2 → P3

The leave-open keyword is there and there is no activity for 6 months.
:davehunt, maybe it's time to close this bug?

Flags: needinfo?(dave.hunt)

Closing as inactive, there's nothing actionable at this time.

Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(dave.hunt)
Resolution: --- → INACTIVE
