Closed Bug 933406 Opened 11 years ago Closed 11 years ago

App install counts don't match GA numbers

Categories

(Marketplace Graveyard :: Statistics, defect, P1)

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: krupa.mozbugs, Assigned: andy+bugzilla)

steps to reproduce:
1. Note that in GA, the total app installs for Oct 28th is 7467
2. At https://marketplace.firefox.com/stats/apps-installed/?start=2013-10-01&end=2013-10-31&region=us, the total count summed across all countries is 8332
3. According to robhudson, the total installs in prod for that day is 9502

expected behavior:
App install numbers match the GA numbers

observed behavior:
App install counts don't match.

Also, this bug made me realize how important it is to get bug 932984 fixed.
This is a big deal.  Definitely need accuracy
Priority: -- → P1
If I go to the stats API and remove the region= query string arg I get: {"count":9502.0,"date":"2013-10-28"}

So at least Monolith is returning a match for what we've recorded in our mysql table in zamboni.

I'm not sure about the total across all countries; it could be off if we've recorded installs for a region that isn't represented by a flag in the UI. I'll ignore item 2 and focus on items 1 and 3 not matching.

Basta: Any ideas why install counts would be off? IIUC, it's the same code path where we ping GA and then also ping our installed API.

Other considerations:
* Timezones?
* Collapsing users installing the same app more than once -- we're not, but maybe GA is?
* Since our number is higher, could GA be losing pings?
* Since our number is higher, could we be getting hit with bogus pings?
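
To illustrate the timezone point in the list above, here is a minimal sketch. It assumes, hypothetically, that one system buckets installs by UTC day and the other by a Pacific-time day; neither GA's nor Monolith's actual reporting timezone is confirmed in this bug. The timestamps and the `daily_counts` helper are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Hypothetical install timestamps in UTC; not real Marketplace data.
INSTALLS_UTC = [
    datetime(2013, 10, 28, 1, 0, tzinfo=timezone.utc),
    datetime(2013, 10, 28, 3, 0, tzinfo=timezone.utc),
    datetime(2013, 10, 28, 15, 0, tzinfo=timezone.utc),
    datetime(2013, 10, 29, 2, 0, tzinfo=timezone.utc),
]

# Assumed reporting timezone for the second system (fixed UTC-7 offset).
PACIFIC_DAYLIGHT = timezone(timedelta(hours=-7))


def daily_counts(timestamps, tz):
    """Count events per calendar day as seen from the given timezone."""
    return Counter(ts.astimezone(tz).date().isoformat() for ts in timestamps)


utc_counts = daily_counts(INSTALLS_UTC, timezone.utc)
pacific_counts = daily_counts(INSTALLS_UTC, PACIFIC_DAYLIGHT)

# The same four events give different totals for "2013-10-28"
# depending on where the day boundary falls.
print(utc_counts["2013-10-28"], pacific_counts["2013-10-28"])
```

If the two systems do use different day boundaries, comparing a multi-day total rather than a single day should shrink this particular source of mismatch.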
Flags: needinfo?(mattbasta)
Some thoughts:

- UA is reporting 7467 for 10/28. When you say "GA" do you mean "UA"? If GA and UA match, I'd believe Google on this one.
- If you look at the "click to install" instead of "successful installs" event, you'll notice that there were 10,081 clicks, which is understandable. If a user clicks install and then one of the following happens, we would get different results:
  - The API request ping craps out. For free apps, GA would record a successful install but the API wouldn't since we don't require the API to respond.
  - The GA request to record the success fails. GA wouldn't record that ping.
  - The app is not free and the user purchases the app but then chooses not to install it. GA will not report an installation.
  - GA in the Marketplace respects the DNT setting. We don't track users in GA that tell us not to track them, though the API obviously still gets those pings.

I added a way to track failed installs just now:

https://github.com/mozilla/fireplace/commit/d65baab32c712157f2886bd05f9b233949c68d2c
Flags: needinfo?(mattbasta)
(In reply to Matt Basta [:basta] from comment #3)

Since our numbers are higher, it sounds like these are the possibilities:

>   - The GA request to record the success fails. GA wouldn't record that ping.
>   - GA in the Marketplace respects the DNT setting. We don't track users in
> GA that tell us not to track them, though the API obviously still gets those
> pings.

Of the two, the DNT explanation seems more plausible. Is it possible to track this in some way to prove this is a partial reason for the discrepancies?
Tracking do not track? :D

We could, as a simple heuristic, look at requests for the categories API endpoint (all users will fetch it exactly once per session) and see whether they carry a `DNT: 1` header. That will give us the ratio of users that are using DNT. If we scale the GA numbers by that ratio, we should see a good pattern emerge.
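
A rough sketch of that heuristic, under stated assumptions: each session is represented by a plain dict of request headers with normalized key casing, and "scaling" is read as dividing the GA count by the non-DNT share of users. The function names (`dnt_ratio`, `scale_ga_count`, `implied_missing_share`) and the sample sessions are hypothetical; only the 7467 and 9502 figures come from this bug.

```python
def dnt_ratio(sessions):
    """Fraction of sessions whose headers include `DNT: 1`."""
    if not sessions:
        raise ValueError("need at least one session")
    flagged = sum(1 for headers in sessions if headers.get("DNT") == "1")
    return flagged / len(sessions)


def scale_ga_count(ga_count, ratio):
    """Estimate total installs, assuming GA only sees non-DNT users."""
    if not 0 <= ratio < 1:
        raise ValueError("ratio must be in [0, 1)")
    return ga_count / (1 - ratio)


def implied_missing_share(ga_count, api_count):
    """Share of API-recorded installs that GA did not record."""
    return 1 - ga_count / api_count


# Hypothetical sample: one session per categories API request.
sample_sessions = [{"DNT": "1"}, {}, {}, {"DNT": "0"}, {}]

ratio = dnt_ratio(sample_sessions)        # 1 of 5 sessions has DNT: 1
estimate = scale_ga_count(7467, ratio)    # GA count for Oct 28 from this bug
gap = implied_missing_share(7467, 9502)   # GA vs. our recorded installs
```

If the measured DNT ratio came out well below the implied gap (roughly 21% for these two numbers), DNT alone would not explain the discrepancy.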

As it stands now, if DNT is set we don't even initialize the GA/UA scripts and bail out early (loading the scripts would set Google tracking cookies). When this happens, all of the tracking APIs are replaced with stubs that simply noop. The only way around this is to not respect DNT, which I think isn't the best move. There was a lot of very harsh backlash when it was found that the Mozilla homepage doesn't respect DNT; I don't even want to start that conversation in the context of the Marketplace.
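
The stub behaviour described above can be sketched as follows. Fireplace itself is JavaScript, so this Python version only illustrates the pattern; `GATracker`, `NoopTracker`, `make_tracker`, and `track_event` are hypothetical names, not the real Fireplace or GA APIs.

```python
class GATracker:
    """Stand-in for a real analytics client; it only records calls locally."""

    def __init__(self):
        self.events = []

    def track_event(self, category, action, label=None):
        self.events.append((category, action, label))


class NoopTracker:
    """Same interface as GATracker, but every call does nothing."""

    def track_event(self, category, action, label=None):
        return None


def make_tracker(dnt_enabled):
    """Pick the stub when DNT is set, so no analytics code ever runs."""
    return NoopTracker() if dnt_enabled else GATracker()


# Callers do not need to check DNT themselves; the call is simply dropped.
tracker = make_tracker(dnt_enabled=True)
tracker.track_event("install", "success", "some-app")
```

Deciding once at startup keeps the DNT check out of every call site, which is also why nothing reaches GA for those users.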
Could we instead respect DNT and not hit our install API if it's set? I think that (a) respects the user's right to privacy and their setting of DNT and (b) brings our numbers more in line with what GA is doing (hopefully).
Are you saying counting a download is tracking a user?  I don't think that is the case.  If anyone, regardless of DNT, clicks an install button we should increment the installed count.  Or am I misunderstanding what you're saying?

(Also, I can see not wanting to load google's libraries here, and my first reaction is to agree with that since we can't stop them from tracking everything.  I'm definitely curious how many people this would affect)
(In reply to Wil Clouser [:clouserw] from comment #7)
> Are you saying counting a download is tracking a user?  I don't think that
> is the case.  If anyone, regardless of DNT, clicks an install button we
> should increment the installed count.  Or am I misunderstanding what you're
> saying?

We also record the user's region, locale, and user agent string along with the app ID being clicked. If none of that is considered information that DNT is supposed to prevent then I'm mistaken.

> (Also, I can see not wanting to load google's libraries here, and my first
> reaction is to agree with that since we can't stop them from tracking
> everything.  I'm definitely curious how many people this would affect)

... a half-formed thought: compare statsd calls made when the DNT header is set vs. those made when it isn't, to get a rough idea of the ratio.
(In reply to Rob Hudson [:robhudson] from comment #6)
> Could we instead respect DNT and not hit our install API if set? I think
> that (a) respects the user's right to privacy and their setting of DNT and
> (b) brings our numbers more in line with what GA is doing (hopefully).

That's completely reasonable, but we'd still need to hit the install API to get a receipt for non-free apps. What do you think?

(In reply to Wil Clouser [:clouserw] from comment #7)
> Are you saying counting a download is tracking a user?

There's a lot more that happens. Google strings together the entire visit into kind of a timeline of what the user does. You could, if you wanted, visit GA and step through everything a single individual user did and searched for throughout their visit. It also does things like geoip, and exposes information like network and browser data. There's no way to granularize the results down to a simple numeric bump.
(In reply to Matt Basta [:basta] from comment #9)
> (In reply to Rob Hudson [:robhudson] from comment #6)
> > Could we instead respect DNT and not hit our install API if set? I think
> > that (a) respects the user's right to privacy and their setting of DNT and
> > (b) brings our numbers more in line with what GA is doing (hopefully).
> 
> That's completely reasonable, but we'd still need to hit the install API to
> get a receipt for non-free apps. What do you think?
> 
> (In reply to Wil Clouser [:clouserw] from comment #7)
> > Are you saying counting a download is tracking a user?
> There's a lot more that happens. Google strings together the entire visit
> into kind of a timeline of what the user does. You could, if you wanted,
> visit GA and step through everything a single individual user did and
> searched for throughout their visit. It also does things like geoip, and
> exposes information like network and browser data. There's no way to
> granularize the results down to a simple numeric bump.

I specifically separated GA out in my comment because of that.  I'm comfortable saying GA is not DNT friendly.  However, and I suppose this is in reply to comment 8 as well, I don't think us recording an install is a bad thing here.  DNT was designed to prevent 3rd party tracking - we are first party and we're using the statistics to provide value to users, both developers (e.g. "more people use FxOS than Android, I should test more on that") and consumers (e.g. "this app has 5MM users and that one has 14.  Maybe there is a reason for that.").  I don't think recording install clicks on our end violates the spirit of DNT.
(In reply to Wil Clouser [:clouserw] from comment #10)
> I specifically separated GA out in my comment because of that.  I'm
> comfortable saying GA is not DNT friendly.  However, and I suppose this is
> in reply to comment 8 as well, I don't think us recording an install is a
> bad thing here.  DNT was designed to prevent 3rd party tracking - we are
> first party and we're using the statistics to provide value to users, both
> developers (e.g. "more people use FxOS than Android, I should test more on
> that") and consumers (e.g. "this app has 5MM users and that one has 14. 
> Maybe there is a reason for that.").  I don't think recording install clicks
> on our end violates the spirit of DNT.

That does put it in perspective. Thanks Wil. I also don't like the idea that we'd need to track installs for paid apps but not free.

Where are we now? We need to somehow confirm that DNT is the reason for the discrepancy between our stats and Google's. If it is, is it ok that our numbers are different?
> I don't think recording install clicks on our end violates the spirit of DNT.

I wish there were a way to push information to GA piecemeal, but there isn't.

Our stats get distributed to more than just us, though, and I wouldn't be in any way surprised if we start giving limited access to the GA dashboard to operators and other partners. The data contains identifying information, and once it's in there it doesn't come back out.

And really, we don't know for sure what Google does with that data. They say they don't use it to track the user, but there are plenty of other ways that they could be taking advantage of that information for purposes not listed in the agreement(s) that we've entered into.

> we're using the statistics to provide value to users, both developers ... and consumers

Ad networks say the exact same thing.

Really, if our numbers don't include DNT users I'm not terribly upset by that so long as we're consistent. If the number is the difference between 7500 and 8000, as long as we're showing one and not both it doesn't matter at all what the actual value vs the DNT value is.
I feel like I'm repeating what I already said, but to clarify:  I'm not talking about using GA for DNT traffic at all.  I'm talking about recording our statistics in monolith.
This sounds like a dupe of bug 920313
-> Andym to investigate.  Thanks Andy.  Also note bug 920313 which is pretty much the same thing.
Assignee: nobody → amckay
If we are going to be pulling them from GA now, I don't think there's any point in worrying about this.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME