Closed Bug 1189071 Opened 9 years ago Closed 2 years ago

[Redshift] Reduce number of rows in impression_stats_daily

Categories

(Content Services Graveyard :: Tiles: Data Processing, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: oyiptong, Unassigned)

Details

We now have billions of rows in impression_stats_daily, and query performance is degrading.

One way to tackle this (as we've done before) is to throw more hardware at it.

Another way is to periodically make the data more compact.

Every time infernyx dumps user event data into Redshift, it creates at least one row in impression_stats_daily.

Much of this data can be summed according to its cardinality keys. Ideally, we would have only one row per unique combination of key values.

Hypothetically, this would involve:

1. creating a temporary table
2. dumping the 'compacted' data in the temp table
3. renaming impression_stats_daily to a throwaway name
4. renaming the temp table to impression_stats_daily
5. dropping the throwaway table
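
A rough sketch of those five steps in Redshift SQL, assuming hypothetical key columns k1..kn and value columns v1..vm (and leaving aside, per the questions below, whether all of these statements can share one transaction):

  -- 1. create a temporary table with the same schema
  create table impression_stats_daily_new (like impression_stats_daily);
  -- 2. dump the 'compacted' data into it, one row per key combination
  insert into impression_stats_daily_new
    select k1, ..., kn, sum(v1), ..., sum(vm)
      from impression_stats_daily
      group by k1, ..., kn;
  -- 3. rename impression_stats_daily to a throwaway name
  alter table impression_stats_daily rename to impression_stats_daily_old;
  -- 4. rename the temp table to impression_stats_daily
  alter table impression_stats_daily_new rename to impression_stats_daily;
  -- 5. drop the throwaway table
  drop table impression_stats_daily_old;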

How does that work with Redshift's distributed nature? (Hopefully well)
Do we need to stop infernyx jobs?
Can we do this in an automated fashion?
Maybe we could introduce a data expiry policy, because we currently keep 13 months of data. Depending on the actual analytics and business needs, we could expire data regularly.
Good idea.

Our data principles allow us to store aggregate data for up to 13 months, but we don't need to store it for that long.

We could also delete data for a smaller period.
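
As a minimal sketch of such an expiry job (assuming the table has a date column, as in the pseudoSQL further down), a scheduled delete plus a vacuum would do it:

  -- expire rows older than our 13-month upper bound
  delete from impression_stats_daily
    where date < dateadd(month, -13, current_date);
  -- reclaim the space: Redshift deletes only mark rows for removal
  vacuum impression_stats_daily;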

Hermina, can you please find out if we can store less data?
How far back do we go for reports? (:mruttley too)
Flags: needinfo?(mruttley)
Flags: needinfo?(hcondei)
The compacting method you suggest might be unnecessarily broad in scope, since it would reprocess data that has already been compacted. It might be worthwhile to find a way to compact only new data, perhaps once a day's data is complete. Regarding temp tables, I think they already exist; tspurway was looking into this a few months ago.
By storing less data, do you mean gathering less data about our users, or storing that data for less time?
If it's the latter, I suppose the 13-month period was agreed with legal and policy. When that decision was made, what did we base it on? I would need more context there. I will follow up with product on this.
Flags: needinfo?(hcondei)
(In reply to Hermina from comment #4)
> By storing less data, do you mean gathering less data about our users, or
> storing that data for less time?
> If it's the latter, I suppose the 13-month period was agreed with legal and
> policy. When that decision was made, what did we base it on? I would need
> more context there. I will follow up with product on this.

I think what :nanj means is that we would store the aggregate data for a shorter time period; he suggested that our analytical needs may not require data going that far back.

This may or may not be the solution, but I wanted this bug to be a record of our conversations.
When choosing 13 months, we thought it was a safe number that would fulfill our needs. I'm just trying to gather some facts in order to come up with a solution.
Perhaps we could create smaller historical aggregates just for campaign-level data (i.e. not broken down by locale, country, or tile id).
This would allow us brief overviews for YoY comparison analyses.
At the moment, historical data is only of passing interest; most analyses sent out to clients cover the past month or quarter at most, so older data could be compressed and archived somewhere else.
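
A sketch of what such a rollup might look like (the campaign_id, impressions, and clicks columns are assumptions about the schema, not its actual shape):

  create table campaign_stats_monthly as
    select campaign_id,
           date_trunc('month', date) as month,
           sum(impressions) as impressions,
           sum(clicks) as clicks
      from impression_stats_daily
      group by 1, 2;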
Flags: needinfo?(mruttley)
The easiest and canonical way of doing this is to run a daily cron job that compacts the previous day's data.  This would look like (in pseudoSQL):

  start transaction;
  -- aggregate the day's rows down to one row per key combination;
  -- note: date is assumed to be among the k1..kn keys, so it survives
  -- the round trip back into <table>
  insert into <temp table>
    select k1, k2, k3, ..., kn, sum(v1), sum(v2), ..., sum(vm)
      from <table>
      where date = <yesterday>
      group by k1, k2, k3, ..., kn;
  -- swap the raw rows for the compacted ones
  delete
    from <table>
    where date = <yesterday>;
  insert into <table>
    select * from <temp table>;
  commit transaction;

Note that this operation would be transactionally isolated and idempotent.  Note also that it could easily be run on arbitrary day ranges and executed either from cron or the command line.
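
For arbitrary ranges, the same statement pair just needs a wider predicate (a sketch with placeholder dates; as above, date is assumed to be among the k1..kn keys so each day stays distinct):

  insert into <temp table>
    select k1, k2, k3, ..., kn, sum(v1), sum(v2), ..., sum(vm)
      from <table>
      where date between <start date> and <end date>
      group by k1, k2, k3, ..., kn;
  delete
    from <table>
    where date between <start date> and <end date>;
  insert into <table>
    select * from <temp table>;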

This bug lies at rest in the graveyard.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → INCOMPLETE