Closed Bug 477923 Opened 15 years ago Closed 15 years ago

Back fill logs from Nov 15 -> Feb 23 for AMO

Categories

(Mozilla Metrics :: Data/Backend Reports, defect)

Type: defect
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED
Unreviewed

People

(Reporter: clouserw, Assigned: dre)

References

Details

(Whiteboard: pending validation)

Our stats have been flaky since Nov 15th.  Now that we're using the metrics team's boxes and scripts, we need to back fill our stats from the logs.
Assignee: deinspanjer → nobody
Group: mozilla-stats
Component: Statistics → Data/Backend Reports
Product: addons.mozilla.org → Mozilla Stats
QA Contact: add-ons → justin
Group: mozilla-stats
Blocks: 468570, 472538
Can we get an ETA on this?  Add-on developers are asking.
Still looking for an ETA.  Add-on authors use our statistics in reports to their shareholders, etc., so it's important to get this fixed or at least get them a timeline.  Bumping to critical for a response.
Severity: normal → critical
Mass update to QA Contact field.  Sorry for the bugspam
QA Contact: justin → data-reports
Still don't have an ETA.  Marking this a blocker.
Severity: critical → blocker
As per bug 479818 comment 4, back fill will be needed for February too. Based on bug 478168 it should go up to Feb 23.
Summary: Back fill logs from Nov 15 -> Feb 1 for AMO → Back fill logs from Nov 15 -> Feb 23 for AMO
Will begin backfill work next week (2009-03-23).  Hopefully we should be able to get more than a month of back-processing done each week.  Status updates will be posted to this bug.
Assignee: nobody → deinspanjer
Status: NEW → ASSIGNED
hooray!
Can we get a status update or ETA?  Thanks
Blocks: 478441
Sorry for the long delay on this.  We have been doing backfill processing (November is complete), but unfortunately, we aren't able to bring the results into our data warehouse (and subsequently export them to the AMO DB) at the same time due to scheduling and resource conflicts with the current processing.  When backfill processing is complete, we will schedule a window to import and then export the data.  With Q1 behind us, this task has the highest priority outside of production maintenance right now.  I will keep you informed of further progress.
Any updates? It's been 11 days since previous, and we'd really like to have this data.
I have a set of optimizations scheduled to be pushed to production on Wednesday morning 4/22.  With this push, I'll begin exporting data from November and inserting it into the master AMO DB.
Any update on this?
I've had a couple of major setbacks in the past few days that have really messed up my processing of this data.  I'm just finishing up pushing the fix for 492910.  Let me gather my wits and take stock and I'll update this bug tomorrow. :/
The files that I've been queuing up to push to AMO had a gzip corruption issue with them.  Even though a gzip -l shows file size information, and you can zcat the head of the files, the files generate an "unexpected end of file" error.  I spent most of the day attempting to repair them using gzrecover, but the repaired files truncate around 95% of the way through, usually in the middle of a record.
I spent the evening writing a patch to prevent the corruption, and I've also gone through the ETL process trimming it down to output only the information needed by AMO instead of the full set of data that we would normally store in our data warehouse.  This will make the reprocessing run much faster and I am kicking it off tonight to run through the weekend.

I'll be in the office on Monday with a further status update.
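A minimal sketch of the kind of check that would catch the truncation described above: since `gzip -l` and reading the head of a file both succeed on a truncated archive, the only reliable test is to decompress the whole stream. The function name and chunk size are illustrative assumptions, not part of the actual ETL.

```python
import gzip
import zlib


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True only if the entire gzip stream decompresses cleanly.

    Hypothetical helper: reads the file to the end so that a missing
    end-of-stream marker ("unexpected end of file") is detected.
    """
    try:
        with gzip.open(path, "rb") as fh:
            # Discard the data; we only care whether reading reaches EOF.
            while fh.read(chunk_size):
                pass
        return True
    except (EOFError, OSError, zlib.error):
        # EOFError: stream ended early; OSError: bad header or CRC mismatch.
        return False
```

Running a check like this on each file before queuing it for export would flag a bad archive at write time rather than at import time.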
Any update on this?
Last update on this was that we'd have results by the middle of last week. Any news?
The additional hardware earmarked for reprocessing has not been handed over by IT yet.  They had a few technical problems and then work related to the building move took priority.  I received a status update from mrz today that I should hopefully receive access by tomorrow.  I'll post again then.
The new hardware is online and is being set up to process this weekend.  Will make a new status update when processing starts.
Whiteboard: amo; backprocessing
Whiteboard: amo; backprocessing → amo; backprocessing; done by 6/26
By the way, in case it wasn't clear, we should delete the existing download and update ping data from this date range before inserting/updating with the new data.

If you need us to do that, just let me know, but it should be straightforward.
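A sketch of the delete-then-insert step suggested above, done in one transaction so a failed load cannot leave the date range half empty. It uses SQLite from the Python standard library purely for illustration; the `download_counts` table name comes from this bug, but the column names and the function are assumptions, and the real AMO database is not SQLite.

```python
import sqlite3


def replace_date_range(conn, rows, start, end):
    """Delete existing rows in [start, end], then insert `rows`.

    Hypothetical schema: download_counts(addon_id, date, count), with
    dates stored as ISO strings (YYYY-MM-DD) so BETWEEN compares correctly.
    """
    with conn:  # commits on success, rolls back if anything raises
        conn.execute(
            "DELETE FROM download_counts WHERE date BETWEEN ? AND ?",
            (start, end),
        )
        conn.executemany(
            "INSERT INTO download_counts (addon_id, date, count) "
            "VALUES (?, ?, ?)",
            rows,
        )
```

Because the range is cleared first, running the same load twice leaves the same rows behind instead of doubling them.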
Just checking in on this bug... most recent ETA was last Friday.
There was supposed to be a processing run from Thursday to Sunday on this task.  Unfortunately, it had to be aborted due to the log processing issues mentioned in bug 501999.  As soon as that data is caught up, I will post more status to this bug.
Whiteboard: amo; backprocessing; done by 6/26 → amo; backprocessing; on-hold waiting for log server backup to complete
Delivered an addons downloads export file for verification to fligtar.  Once he signs off on it, the rest of the downloads metrics can be inserted directly to amo_master.
Whiteboard: amo; backprocessing; on-hold waiting for log server backup to complete → amo; backprocessing; in process
Hi Daniel, I looked at the CSV and the numbers are all extremely low and a bit suspicious. For example, Adblock Plus (#1865) got 24 downloads every day from 11/15-11/30 according to the CSV. It normally gets over 100,000 downloads per day.
Alright.  That is what I needed.  I'll go through it and see what happened.
Going back up a step in the pipeline, I see about 55k to 64k requests per day in November for download URLs that have the string adblock_plus in them.  This is after the deduping filter that discards any requests from the same IP for the same URL within the same minute.

I'll look at the last stage summarizer to see why it counted wrong, but even then, I'm concerned that I'm only seeing 55k to 64k per day and you are talking about 100k+.  Could the traffic have changed that much in the last 7 months?
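For reference, a minimal sketch of the deduping rule described above (discard repeat requests from the same IP for the same URL within the same minute). It assumes "same minute" means the same clock minute rather than a sliding 60-second window, and the function name and tuple layout are illustrative, not the actual pipeline code.

```python
from datetime import datetime


def dedupe_requests(requests):
    """Yield requests, dropping repeats of (ip, url) in the same clock minute.

    `requests` is an iterable of (timestamp, ip, url) tuples, where
    timestamp is a datetime. This layout is an assumption.
    """
    seen = set()
    for ts, ip, url in requests:
        # Truncate the timestamp to the minute to form the dedupe key.
        key = (ip, url, ts.replace(second=0, microsecond=0))
        if key in seen:
            continue
        seen.add(key)
        yield ts, ip, url
```

On real log volumes the `seen` set would need to be bounded, for example by clearing it whenever the minute changes on time-sorted input.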
(In reply to comment #29)
> I'll look at the last stage summarizer to see why it counted wrong, but even
> then, I'm concerned that I'm only seeing 55k to 64k per day and you are talking
> about 100k+.  Could the traffic have changed that much in the last 7 months?

If we're assuming that before and after this date range the stats were and are being recorded correctly, then yes, it looks like the traffic has increased quite a lot since then.
I got the 100k+ number from this week's download stats of Adblock Plus. So unless there's a problem with the stats that are currently being processed, it should be right. Of course, downloads after a major Firefox release always shoot up.

Looking at Adblock Plus downloads from October of last year, the numbers you are seeing seem to be expected.
Found the problem (I was unintentionally counting the number of hours that had pings for each addon rather than correctly summing) and re-ran the summation.

It is attached, could I have another look over please?
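To illustrate the bug class described above, here is a small sketch contrasting the two aggregations. Counting the hourly rows per add-on per day can never exceed 24, which would be consistent with the constant 24 reported earlier for Adblock Plus; summing the per-hour hits gives the real total. The row layout and function names are assumptions for illustration only.

```python
from collections import defaultdict


def count_hours(hourly_rows):
    """Buggy aggregation: counts how many hours had pings."""
    counts = defaultdict(int)
    for day, addon_id, _hits in hourly_rows:
        counts[(day, addon_id)] += 1
    return dict(counts)


def sum_daily(hourly_rows):
    """Intended aggregation: sums the hits recorded in each hour."""
    totals = defaultdict(int)
    for day, addon_id, hits in hourly_rows:
        totals[(day, addon_id)] += hits
    return dict(totals)
```

A cheap sanity check after summarizing is to flag any day where every popular add-on reports a value of 24 or less.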
(In reply to comment #32)
> Created an attachment (id=388354) [details]
> daily amo addon downloads from 2008-11-15 to 2008-11-30

That contains confidential information. :-/
Group: mozilla-stats
Group: mozilla-stats
Sorry, I was only thinking about the lack of any user identifiable information.  Is there anything we can do to kill off the attachment short of making the bug secured?
Group: mozilla-stats
Can you email it to me please?
(In reply to comment #34)
> Sorry, I was only thinking about the lack of any user identifiable information.
>  Is there anything we can do to kill off the attachment short of making the bug
> secured?

Reed already made it stats-confidential a couple minutes after I commented.  Please make this bug public again so we can continue to point people with questions to it.
I saw that, but then I saw the change reverted a couple of seconds later so I wasn't sure if we were happy with the resulting visibility.
Group: mozilla-stats
The new CSV file looks good.
CSV files for all of AMO downloads and one third of AMO updates are ready to be loaded pending a test run against the AMO dev database.
Could you please verify the download_counts data that I uploaded to c_junk on khan?  I'd like to push them to production tomorrow morning.
The data in c_junk in the import range (11/15 - 2/23) seems fine to me.
Okay.  I'm going to push that to production then.
Could you also check the data uploaded to update_counts as well?
For update_counts, I really only see consistent daily data since February 1. Before then it seems to be the Wednesday-only bad data from before.
Sorry I didn't clarify, the only day that I pushed for update_counts was on 2008-11-16.  How does that day look?
It's hard for me to tell - 11/16 is on the weekend, so it is considerably lower than the other numbers that we have in that time period, which were only on Wednesdays. So I'm not sure if the difference is because of the weekend or because of a problem.
Pushed download data for 2008-11-15 to 2009-02-23 to production.
Pushed update data for 2008-11-15 to 2008-12-18 to c_junk for verification.
Is anyone else seeing a dramatic decrease in total download count today because of this fix?  My extension total count decreased over 20% from 3.3m to 2.6m.

(See Bug 505133)
(In reply to comment #47)
> Is anyone else seeing a dramatic decrease in total download count today because
> of this fix?  My extension total count decreased over 20% from 3.3m to 2.6m.
> 
> (See Bug 505133)

See bug 472538 that caused some old bad download stats. Some of the backfilling here is to repair that.
This backprocessing is on a completely separate development branch from the production nightly update system.
That isn't to say there isn't some funky way that the two systems could have a negative interaction, but it is rather unlikely.
Oh!  Sorry, Dave's comment #48 is much more useful than mine.  I wasn't thinking about a grand total number of downloads over time.  Yes, bug 472538 involves the fact that some duplicate download requests were not filtered by the old system.  These new download counts are filtered properly, which means that the counts will drop, but the new totals will be closer to the truth.
The update_counts in c_junk seems extremely high for 11/15 through 12/17 for all add-ons.
Hrm.  Looking at a histogram of the table, it seems like the data somehow got double imported.  Let me try reimporting again.
My processing machine was taken down by IT for diagnostics, but I was able to re-import from 2008-11-15 to 2008-12-11.  Could you see if these numbers look more reasonable?
The same dates are still too high.
Okay. I found and fixed the double counting problem.  A histogram of the data now looks much more reasonable.  How about on your end?  Data from 2008-11-15 to 2008-12-17 has been updated.
Looks good to me.
Okay.  Please give me a time window when it would be okay for me to push this batch of data to production.  I don't know for sure, but it seems like this insert job might be stressing the amo dev server, so I don't want to risk impacting production at a bad time.
Up to Wil. If you think it will possibly impact production, we can have IT do a maintenance window tomorrow night and do it then.
We might be able to determine better whether it would impact production if we can get someone to take a look at any server monitoring we might have for the amo dev machine and see if it was dogged out for any periods earlier this afternoon.  If it wasn't, then there is likely to be no load impact on production either.
How long will the update take?

You can do it at night or during a maintenance window.  I doubt it'll be a problem but I'm not the guy to make the call.  CCing oremj, but I'm happy to have input from anyone on IT.
It was taking about 5 minutes per day of data to insert to the dev server.  We have just over 100 days of data to import.

I suspect that since AMO doesn't have a tremendous amount of write traffic going into the master server, I could run the whole thing straight through with no noticeable strain on the server.  But since I can't log in to the box and look at it via top or iostat or anything while doing the import, I just want to make sure that you guys are ready.

I can break the import up into small bits or delay for a while in between each day.  Just let me know what you want.
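A rough sketch of the per-day, paused import offered above. The range 2008-11-15 to 2009-02-23 is 101 days inclusive, so at roughly 5 minutes per day the inserts alone would take about 8.4 hours before any pauses are added. The function names, the `import_day` callback, and the default pause are all hypothetical.

```python
import time
from datetime import date, timedelta


def days_in_range(start, end):
    """Yield each date from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def import_in_batches(start, end, import_day, pause_seconds=60,
                      sleep=time.sleep):
    """Import one day at a time, pausing between days to limit load.

    `import_day` is a caller-supplied function that loads a single day;
    `sleep` is injectable so the pacing can be tested without waiting.
    Returns the number of days imported.
    """
    imported = 0
    for day in days_in_range(start, end):
        import_day(day)
        imported += 1
        if day != end:
            sleep(pause_seconds)
    return imported
```

Loading one day per call also means a failure partway through only requires redoing that single day.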
Wil?
Whiteboard: amo; backprocessing; in process → Completed, waiting for schedule to push to AMO production
(In reply to comment #62)
> Wil?

I've already said I don't think it will be a problem but it's not my decision.  If we can break it up and avoid the issue, do it. Let's just get this running.
I believe all the backfilling should be done.  Please let me know if you have questions or concerns.
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Looks like the case of some of the fields was lost and is causing problems in the graphs (e.g. userenabled instead of userEnabled; winnt instead of WINNT). That means that for this time range the values are considered to be different from the correctly cased versions before and after, and thus the graphs add a separate plot for them. Check the graphs: this entire range is a different color in the status and OS views.
Yes, this would be a side effect of the difference in the backprocessing script from the normal one.
Rather than reprocessing the entire data set, I'd like to just correct the case of the relevant items so they can be re-imported.  Is this feasible? Can we get a complete list of the necessary keywords to search and replace?
Whiteboard: Completed, waiting for schedule to push to AMO production → pending validation
I applied the following replacements in the ETL and am re-importing, please let me know if I missed anything:

.replace("userenabled","userEnabled")
.replace("userdisabled","userDisabled")
.replace("needsdependencies","needsDependencies")


.replace("aix","AIX")
.replace("beos","BeOS")
.replace("darwin","Darwin")
.replace("darwin,","Darwin")
.replace("dragonfly","DragonFly")
.replace("freebsd","FreeBSD")
.replace("hp-ux","HP-UX")
.replace("irix","IRIX")
.replace("irix64","IRIX64")
.replace("linux","Linux")
.replace("netbsd","NetBSD")
.replace("nto","NTO")
.replace("openbsd","OpenBSD")
.replace("os2","OS2")
.replace("osf1","OSF1")
.replace("sco_sv","SCO_SV")
.replace("sunos","SunOS")
.replace("winnt","WINNT")
.replace("Linux-gnu","linux-gnu")
.replace("Linuxappabi","linuxappABI")
.replace("WINNTappabi","winntappABI")
.replace("Darwinappabi","darwinappABI")
(In reply to comment #67)
> .replace("darwin","Darwin")
> .replace("darwin,","Darwin")

Duplicate Darwin with a comma. (probably not consequential, though)

There's also "userEnabled,incompatible" and "userDisabled,incompatible", though I would guess the replaces above already cover them.
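As an alternative to chained substring replaces, where an earlier replace can make a later one a no-op (as with the "darwin," line above), a lookup on whole tokens avoids ordering issues. This is only a sketch: it assumes each field holds one value or a comma-separated list of values, it covers only the plain keywords listed above, and it does not handle the ABI-style strings such as linuxappABI. The names `CANONICAL` and `fix_case` are made up for illustration.

```python
# Canonical spellings taken from the replacement list in this bug.
CANONICAL = {
    name.lower(): name
    for name in (
        "userEnabled", "userDisabled", "needsDependencies",
        "AIX", "BeOS", "Darwin", "DragonFly", "FreeBSD", "HP-UX",
        "IRIX", "IRIX64", "Linux", "NetBSD", "NTO", "OpenBSD",
        "OS2", "OSF1", "SCO_SV", "SunOS", "WINNT",
    )
}


def fix_case(value):
    """Restore canonical case for each comma-separated token.

    Tokens that are not in CANONICAL are returned unchanged.
    """
    return ",".join(
        CANONICAL.get(token.lower(), token) for token in value.split(",")
    )
```

Because the match is on whole tokens, a value like linux-gnu is left alone and needs no follow-up replace to undo an unwanted change.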
Thanks Daniel. Looks like the cases have all been corrected over this range. Filed bug 507831 for a recent issue that's related.