Closed Bug 468570 Opened 16 years ago Closed 15 years ago

active daily users statistics broken for all add-ons since 2008-11-19

Categories

(addons.mozilla.org Graveyard :: Statistics, defect)

defect
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: johnjbarton, Assigned: clouserw)

References

Details

Attachments

(2 files)

Just look at the graphs, something is broken, eg:
https://addons.mozilla.org/en-US/firefox/statistics/addon/1843
It's like that for me and every other extension too. It has a bad habit of registering incoming stats on the graphs before they're fully collected. It's done this for quite a while but I think it's gotten worse and longer lasting recently.

(bug 443620?)
OS: Windows XP → All
Summary: Statistics are broken on Firebug addon → Download statistics down over 80% for add-ons
Summary: Download statistics down over 80% for add-ons → active daily users statistics down over 80% for add-ons
This was a specific incident last week when versioncheck.amo was moved behind a new load balancer and the parsing script did not know about the new log directory to parse. It's being addressed in bug 467502
Depends on: 467502
To update this report, the -80% point stuck and the new point for this week is -70% of of that.
All of my addons now show 0 active daily users since midnight on 10 Dec.
Depends on: 469376
For FEBE, I have noticed that the "Active Daily Users" has dropped from 276,440
on November 25, 2008 to zero on December 10.  FEBE is currently
downloaded/installed over 56,000 times a week.  Something is certainly broke
somewhere.
Summary: active daily users statistics down over 80% for add-ons → active daily users statistics broken for all add-ons since 2008-12-3
Actually it looks like active daily users it's not the only broken thing: also the download counts look wrong starting from dec 13th.
s/dec/nov/
server bug 469376 is now supposed to be fixed, but the numbers still don't look right.
Justin - what's left for this?  Should this go back to IT?
ADU numbers now show 790k for Adblock Plus - that's 85% drop compared to 12/10 (not to mention the bogus data point at 12/11). Something is still messed up.
Looking at the Firebug statistics, they are similarly messed up - unusually low ADU numbers for 12/03, bogus data point at 12/11 and 75% drop at 12/17.
DownThemAll! first hit 2.3M then was down to 200 then up again to 1.7M and now hit the bottom with 75 again.
Currently reported is a -100% drop since 24. Dez.
Fire.fm is experiencing the same problems (https://addons.mozilla.org/en-US/statistics/addon/7684).
I think bug 469376 comment #21 is the most likely reason for this problem: that the script is counting some pings into the following day, causing the constant drops the day after.
I'm more interested in the odd download peaks near mid-December. Is there a reason behind those?
Attached image an example —
This affects everyone similarly.
Justin - is this something you can fix?  Is there something I can do to help?
Another sawtooth has been added to the graphs. Has the cause been discerned yet?
Severity: normal → major
Sympathy with Wall Street's ones, I guess. XD
(In reply to comment #19)
> Sympathy with Wall Street's ones, I guess. XD

Once upon a time it said I had 850,000 users. Now it says I have 64. Wall Street is fairing better...  ;)
Target Milestone: --- → 5.0.1
Assignee: nobody → clouserw
What can we do to get this fixed? The download statistics are a vital tool for addon developers. We need them to decide when we can make changes and what versions we need to support. We also use them to justify continued support for addon development efforts.
Equally vital for Firefox product planning (e.g. are enough of the addons actually in use compatible with our next planned release? If not which ones should we prioritize?)
After much unzipping and tailing of giant files I'm pretty sure I've figured this out.  The timeplot graph will graph any data in the database.  The data for Firebug (and many other add-ons) looks like this since December 1:

+--------+----------+---------+------------+
| id     | addon_id | count   | date       |
+--------+----------+---------+------------+
| 737524 |     1843 |  727509 | 2008-12-03 | 
| 744639 |     1843 | 1805542 | 2008-12-10 | 
| 751222 |     1843 |     184 | 2008-12-11 | 
| 752959 |     1843 |  618197 | 2008-12-17 | 
| 759962 |     1843 | 1093030 | 2008-12-24 | 
| 766445 |     1843 |      67 | 2008-12-25 | 
| 767983 |     1843 |  662874 | 2008-12-31 | 
| 774749 |     1843 |     109 | 2009-01-01 | 
| 776236 |     1843 | 1353524 | 2009-01-07 | 
| 782880 |     1843 |      63 | 2009-01-08 | 
+--------+----------+---------+------------+

The tiny numbers show up directly after update stats counting so I looked at the logs for the day before and they've got several hits for the next day before the logs rotated.  So, in the example above there were 67 updates counted for Firebug within the first couple seconds of Dec 25th and those got tacked on to the end of the Dec 24th log.

Since it was a different date it got a new row in the database and when that data got graphed it looked like it went from 1093030 hits to 67 in 1 day.
Shouldn't be too hard to remove the extra DB entries and restrict the log parser to insert data from the correct day only. Who has access to these tools?
Nice job - thanks Wil.  After you fix the cause can you file a bug to run the sql to repair it?
Hmmm. Looking at the Firebug numbers I do wonder if that's the only bug in play here.... I guess it makes sense that Dec 24th and Dec 31st have much lower numbers than Dec 10, but should Dec 12th and Dec 3 really be so much smaller than Dec 10?
Attached patch only count correct dates — — Splinter Review
Here's a patch so it only records data for the requested date.

(In reply to comment #25)
> Nice job - thanks Wil.  After you fix the cause can you file a bug to run the
> sql to repair it?

Yep

(In reply to comment #26)
> Hmmm. Looking at the Firebug numbers I do wonder if that's the only bug in play
> here.... I guess it makes sense that Dec 24th and Dec 31st have much lower
> numbers than Dec 10, but should Dec 12th and Dec 3 really be so much smaller
> than Dec 10?

Other major add-ons aren't as open with their stats but looking at their numbers I see the same patterns.  Dec 03, Dec 17 and Dec 31 are substantially less than Dec 10, Dec 24, and Jan 07, respectively.  (Where substantially is something like 50%).  It looks like most add-ons took a hit to their stats on Nov 19th and have been doing this dance ever since.  Perhaps related to bug 472538.  The numbers are far closer but still in such a recognizable pattern it's worth investigating.
Attachment #356648 - Flags: review?
Attachment #356648 - Flags: review? → review?(fwenzel)
Well, if there's some bug at play, then it will likely be reflected in all add-on stats. In order to try to see if this is an expected pattern, I think it would be necessary to investigate previous year's statistics. Don't now if those are available, though.
Adblock Plus also has unrealistically low numbers listed for 2008-12-03 and 2008-12-17 (https://addons.mozilla.org/en-US/statistics/csv/1865/updatepings). Looking at December 2007, there was nothing comparable there - quite the opposite, there was a solid increase compared to November. FireBug numbers aren't much different (https://addons.mozilla.org/en-US/statistics/csv/1843/updatepings), there is a definitive upwards trend.
still seeing the saw wave. I would say this is a pretty major if not critical bug as it's affecting our ability to track update of our latest release (firebug 1.3.0). We're showing 0 ADU since Jan 7th?

Looks like you're on the right track, Wil. Sounds like there's some kind of race condition with log rotation in play and your database parser Keep us posted.
Comment on attachment 356648 [details] [diff] [review]
only count correct dates

That looks good. However, do we know the cause of this problem? Why would wrong dates be part of the logs?
Attachment #356648 - Flags: review?(fwenzel) → review+
(In reply to comment #31)
> (From update of attachment 356648 [details] [diff] [review])
> That looks good. However, do we know the cause of this problem? Why would wrong
> dates be part of the logs?

It's only about 2 seconds worth of logs.  I assume it starts rotating them at midnight but it takes that long to synchronize everything (maybe to get the file handles on the netapp?) so it continues to write to the old log in the mean time.  oremj would have a better idea.
(In reply to comment #31)
> However, do we know the cause of this problem? Why would wrong
> dates be part of the logs?

I had the same issue back when I was using Apache - it takes a few seconds before Apache hangs up on the old files and starts writing to the new log files. nginx on the other hand hangs up immediately.
Alright, I think I know what's going on with the second wave of hills and valleys.

I tried re-crunching the log files for 12-03 in verbose mode with some extra logging because I wanted to verify why there were so few updates counted.  All went well until I got to the log file for 11am from z4 which was 256M compressed.  This inflated to 2.7G uncompressed and fopen() died with:

PHP Warning:  fopen(/tmp/addon_log_file_0.41532300_1231890120): failed to open stream: File too large

At that point the script exited.

On my dev copy it didn't flag the file as counted in the database, however, the file is counted in production right now, which makes me wonder if production is setup differently.  A crash like this on a large file would certainly account for the dips in counts we've been seeing.

As far as fixing it, it looks like compiling PHP with FILE_OFFSET_BITS=64 sort of fixes it but stuff like filesize() doesn't work according to http://blog.thinkphp.de/archives/131-Handling-large-files-without-PHP.html .

Jeremy:  Any insight on if production is setup differently?  Also, is the output of the log parser for 12-03 available?
(In reply to comment #34)
> PHP Warning:  fopen(/tmp/addon_log_file_0.41532300_1231890120): failed to open
> stream: File too large

I don't know the architecture of course - but why is the uncompressed file written on disk at all? Doesn't sound like a good idea with log sizes of several GB. All my log processing reads "gzip -cd" output directly (popen() in PHP can do that as well from what I can tell). That's the big advantage of using gzip in the first place - it is built for decompression on the fly.
(In reply to comment #35)
> (In reply to comment #34)
> > PHP Warning:  fopen(/tmp/addon_log_file_0.41532300_1231890120): failed to open
> > stream: File too large
> 
> I don't know the architecture of course - but why is the uncompressed file
> written on disk at all? Doesn't sound like a good idea with log sizes of
> several GB. All my log processing reads "gzip -cd" output directly (popen() in
> PHP can do that as well from what I can tell). That's the big advantage of
> using gzip in the first place - it is built for decompression on the fly.

I don't know why it's written out but I agree handling it all in memory is a better idea.  It looks like the machine doing this in production is 64 bit so depending on how PHP is compiled comment #34 might not be the answer.  I'll start a new run using the 64 bit machine today and see what I can come up with.

Meanwhile, bug 473595 fixes comment 23 so when the cron runs tonight we shouldn't see any more of that problem.
Thanks Wil. The graphs are beginning to look a little less sickly now, but the lower than expected tallies are still there.

My graph and Adblock+'s no longer have the dead points, but Firebug still does.
https://addons.mozilla.org/en-US/firefox/statistics/addon/5791
https://addons.mozilla.org/en-US/firefox/statistics/addon/1865
https://addons.mozilla.org/en-US/firefox/statistics/addon/1843

Are the corrections just not applied evenly yet or did something fail here?
Probably still cached in memcache since we tweaked the db directly.  Give it a bit longer to sort out. Actually, looks like it just now did.
Yeah, I'm seeing it fine now too. Cache strikes again.
I have to wonder if this problem is older than we suspect.  Looking at the Active Daily Users stats for the addons listed in Comment #37 and my own (https://addons.mozilla.org/en-US/statistics/addon/2109), it seems we all took a dive on around March 25, 2008 and again around July 7, 2008.

Does this mean that the *Average* daily users will always be understated?
(In reply to comment #40)
> I have to wonder if this problem is older than we suspect.  Looking at the
> Active Daily Users stats for the addons listed in Comment #37 and my own
> (https://addons.mozilla.org/en-US/statistics/addon/2109), it seems we all took
> a dive on around March 25, 2008 and again around July 7, 2008.

The March drop is the residual damage that wasn't fixable from bug 426038. I think the July issues are due to counting issues from the Firefox 3 release.
Alright, I just ran the Dec 03 updatecount stats again from scratch on the production box into my development database.  I got 11G worth of output and no errors.  I also got completely different numbers...

Sample numbers for Dec 03:
ID     Old total      My total
1843   727509         1432563
2109   127092         241988
5791   272456         828163
1865   2276968        4834371

As you can see my totals are a lot closer to what you would expect which makes me think the script got interrupted for some reason.  (It's weird it happens every two weeks instead of every week...).

Jeremy: is the output of the update pings scripts from 2008-12-03, 2008-12-17, or 2008-12-31 available?  Without it I'm not really sure what else I can do... the script worked fine for me.
Since I think stats already ran for tonight we missed our window there.  If we can't come up with the output from the jobs already run we can run next weeks in verbose mode to see what we can see.  I have a feeling it'll be obvious whether the verbosity is turned up or not though.
In my case (addon 5817), I am still getting data for two days which are not wednesdays:
2008-02-06
2007-11-07
(In reply to comment #44)
> In my case (addon 5817), I am still getting data for two days which are not
> wednesdays:
> 2008-02-06
> 2007-11-07

In my calendar they are Wednesdays... And looking at the data for your add-on, all data points seems to have 7 days distance to each other, nothing unusual there.
(In reply to comment #45)
> (In reply to comment #44)
> > In my case (addon 5817), I am still getting data for two days which are not
> > wednesdays:
> > 2008-02-06
> > 2007-11-07
> 
> In my calendar they are Wednesdays... And looking at the data for your add-on,
> all data points seems to have 7 days distance to each other, nothing unusual
> there.

OOPS! I have a little script that fetches the csv data and presents it application+majorversion wise etc. Those dates + 2 others (which are no more there in the csv) used to be shown for which getDay() != 3. Surely, some error in creating dates in my script. Apologies to all.
(In reply to comment #42)
> Alright, I just ran the Dec 03 updatecount stats again from scratch on the
> production box into my development database.  I got 11G worth of output and no
> errors.  I also got completely different numbers...
> 
> Sample numbers for Dec 03:
> ID     Old total      My total
> 1843   727509         1432563
> 2109   127092         241988
> 5791   272456         828163
> 1865   2276968        4834371
> 
> As you can see my totals are a lot closer to what you would expect which makes
> me think the script got interrupted for some reason.  (It's weird it happens
> every two weeks instead of every week...).

...and My total column is curious 2x the Old total column.
> (In reply to comment #47)
> ...and My total column is curious 2x the Old total column.

they're not exact, and that makes sense given that the results for any day with good logs + 1 were vanishingly close to zero.

Sounds like we're close to getting this fixed. Good work, guys!

One question that may be obvious, and was mentioned here, are we only measuring active daily users every seven days? (Wladimir's comment in #45 suggests this is normal, but I wasn't sure)
(In reply to comment #48)
> One question that may be obvious, and was mentioned here, are we only measuring
> active daily users every seven days? (Wladimir's comment in #45 suggests this
> is normal, but I wasn't sure)

Yep, every Thursday we parse Wednesday's stats.  (Looking up this cron job makes me realize that the stats run tonight, not yesterday.)
ok, good to know. I eagerly await tonight's stats-gathering cron job! :)
What about ones like 2008-11-19? It's not massively off like some of the others, but it is down a couple hundred thousand. If you look at the graphs it the first point that isn't going in the same trend as the others is the 19th, and after that none are where one would expect.

Flagfox (5791)
2008-10-22,773887
2008-10-29,788225
2008-11-05,819840
2008-11-12,854817
2008-11-19,636946
2008-11-26,819878
2008-12-03,272456
2008-12-10,862177
2008-12-17,204594
2008-12-24,649970
2008-12-31,356537
2009-01-07,603525

(In reply to comment #42)
> Jeremy: is the output of the update pings scripts from 2008-12-03, 2008-12-17,
> or 2008-12-31 available?  Without it I'm not really sure what else I can do...

Do you have the data from the other points? I think all since 2008-11-19 will need reprocessing.
Yeah, I mentioned in comment 27 that the 19th is the first day where this pattern starts occuring and it's across all add-ons that I can see.

Unfortunately, due to bug 469737, the output from this script between Nov 2nd and Jan 12th is lost so I don't know what's going on with it.  Jeremy added the verbose flag to tonight's run and is redirecting the output to a file for me.  If history is any indication this week is a low week too so hopefully there will be a grand revelation.

I've got no objection to recalculating the numbers since Nov 19th once we figure out what's going on.
-98.23%

I guess they really don't like the new version. :(
Last night I noticed this not running and was curious.  I looked at the server and it appears to be in GMT time which means these cron jobs should start at noon PST on Thursday.  

I talked to Jeremy this morning about why it didn't happen and he said there was a syntax error in the cron job which he fixed and has since started manually.

I'm guessing the script will take 8-10 hours.  In the mean time you'll see the stats slowly rise as they are counted, which means the -98.23% will get closer to reality throughout the day.
(In reply to comment #54)
> I'm guessing the script will take 8-10 hours.  In the mean time you'll see the
> stats slowly rise as they are counted, which means the -98.23% will get closer
> to reality throughout the day.

Bug 443620, I believe. (what I initially thought this might be in comment 1) I don't suppose someone could look into that when this is all said and done? Kinda hard to know when the numbers are "correct" when it doesn't actually tell you when it's done counting. It really should only be showing us data once the data is fully calculated.

Is there anything in that script that could be optimized somehow? 8-10 hours sounds awfully slow.
I'm sure there are improvements to make (not writing out to disk is a big one).  Right now I'd like to get it working first.  To be fair, it is parsing something like 75G+ of logs.
The script finished a while ago.  It's messed up.  There are 149 log files that should have been parsed but the database only recorded 74.  There is a similar trend for all the other update_ping days (there are far fewer log files recorded than exist).

I looked at the output of the script and it moves right along, on the 74th file it parses the whole thing, starts inserting the results into the database and about 150 UPDATEs in it just ends.  No errors, no warnings, no nothing.

I don't have root on these boxes so new questions for IT:

1) Any errors on the database side?  It was getting hit hard and fast when the script died.

2) Anything in the server logs?  Maybe a process limit kicked in and the kernel killed it?  Using too much memory?
What time did it stop running?
No timestamps in the output :-/
I wanted a wider audience than just this bug so I posted this to my blog but this is our current plan and timeline:  http://micropipes.com/blog/2009/01/22/add-on-statistics-status/

Short answer:  This should be fixed at the end of next week and then we'll start back filling stats.
Target Milestone: 5.0.1 → 5.0.2
Depends on: 477156
bug 477156 is step 3 from comment #60.  We're switching over to new stats tonight, expect to see some skew[1] but there shouldn't be a crazy amount and it should be reliable from here on out.  Additionally, we'll get daily numbers instead of weekly now so that's a bonus.

Thanks for your patience and let me know what you see.

[1] we're switching scripts, FF3.0.6 went out yesterday, and we're doing daily counts now
Not sure if anyone else noticed this but download stats were not calculated for yesterday (Feb 05).
The first 3 days of the new stats look deflated:

Flagfox (5791)
2009-02-01,682431
2009-02-02,772594
2009-02-03,776113
2009-02-04,926222

I suspect the last one is closest to correct, seeing as back before all this started on 2008-11-12 I was in the 850k range. Other add-ons have similar low stats for the first 3 of these new days.
Summary: active daily users statistics broken for all add-ons since 2008-12-3 → active daily users statistics broken for all add-ons since 2008-11-19
(In reply to comment #62)
> Not sure if anyone else noticed this but download stats were not calculated for
> yesterday (Feb 05).

I'll talk about that in a minute in bug 472538

(In reply to comment #63)
> The first 3 days of the new stats look deflated:
> 
> Flagfox (5791)
> 2009-02-01,682431
> 2009-02-02,772594
> 2009-02-03,776113
> 2009-02-04,926222
> 
> I suspect the last one is closest to correct, seeing as back before all this
> started on 2008-11-12 I was in the 850k range. Other add-ons have similar low
> stats for the first 3 of these new days.

Most add-ons show around a 15% spike from the 3rd -> 4th.  We thought that could be attributed to the FF3.0.6 release.  As far as deflated numbers, when we compared the output of the scripts (verifying the current one actually ran) on some logs from November there was a .01% or less difference between the counts.
Bug 477923 is tracking back filling statistics.  What do you think of the past week of stats?
Depends on: 477923
Looks good so far.

I'm pretty sure my observation in comment 63 is just the natural variation in users across different days of the week. We're going to get much more detailed info now.
Yes, the stats make sense. ADU numbers vary strongly by weekday with Wednesday being the "strongest" day, I guess that's really how it is. Download numbers are back to reasonable values (interestingly, Firebug has a strong weekday pattern here as well whereas Adblock Plus downloads seem independent of the day of week).
(In reply to comment #65)
> Bug 477923 is tracking back filling statistics.  What do you think of the past
> week of stats?

For addon 5817 (https://addons.mozilla.org/en-US/statistics/csv/5817/application) I find that no of active users using Firefox 2 is shown as 0 from Feb 5 to Feb 9.
in continuation of Comment #68:
Same holds for Firefox 1, Thunderbird 1 & 2 and Sunbird 0.*
An observation: these are the versions that the current release of my addon does not support.
These numbers do look better, I believe. Also, as Wladimir mentions in comment #67, the Firebug numbers match what we believe to be the case with more users during the week and less on weekends.

good job, Wil!
Perhaps this has already been done, but to really be confident of the numbers, create a test extension and having a fixed and known number of users (1? 5?) install it for a couple of days then remove it. Then check the numbers. I'd really like to be confident that the number we see here are not off by a factor of 2, which seems like a possibility based on what I read here.
Based on our landing page stats, which I am not 100% confident of, I feel like they could definitely still be twice what they ought to be.
Blocks: 478319
(In reply to comment #68)
> For addon 5817
> (https://addons.mozilla.org/en-US/statistics/csv/5817/application) I find that
> no of active users using Firefox 2 is shown as 0 from Feb 5 to Feb 9.

I filed bug 478441 about that.

Re: comment 71 and 72; we have some add-ons reporting 1 ADU so I don't think we're doubling.  John: what makes you think we might be off by a factor of 2?
Depends on: 478441
Target Milestone: 5.0.2 → 5.0.3
Filed bug 479818: All the stats seem to be slightly down week-by-week. Not sure what it is yet.
I take back what I said in comment 72 - we're currently seeing rough parity with AMO and our own stats, at least on daily downloads and daily active users (is certainly nice to have the daily numbers!) How long till the older stats are backfilled?
Backfilling is bug 477923.  Daniel will have to give an ETA
I'm pulling this out of our milestones for now.  It's essentially fixed except for the backfilling which it's dependent on.  Once the backfilling is done this will be resolved.
Target Milestone: 5.0.3 → ---
This should be able to be verified and marked fixed now.
I opened https://addons.mozilla.org/en-US/statistics/addon/1843
and selected 
Add-on version 
from below the "Firebug Statistics".  
 
plotSelection.summary.updatepings is undefined
https://addons.mozilla.org/js/stats/plots.js
Line 265
(In reply to comment #79)
> I opened https://addons.mozilla.org/en-US/statistics/addon/1843
> and selected 
> Add-on version 
> from below the "Firebug Statistics".  
> 
> plotSelection.summary.updatepings is undefined
> https://addons.mozilla.org/js/stats/plots.js
> Line 265

It seems to me that would be a separate bug regarding a problem with the website code rather than a problem with the underlying data. Does this sound right?
(In reply to comment #80)
> (In reply to comment #79)
> > I opened https://addons.mozilla.org/en-US/statistics/addon/1843
> > and selected 
> > Add-on version 
> > from below the "Firebug Statistics".  
> > 
> > plotSelection.summary.updatepings is undefined
> > https://addons.mozilla.org/js/stats/plots.js
> > Line 265
> 
> It seems to me that would be a separate bug regarding a problem with the
> website code rather than a problem with the underlying data. Does this sound
> right?

Yep. Should be bug 484272 which still says push-needed.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Product: addons.mozilla.org → addons.mozilla.org Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: