1154400 - Fix Slow Bootup Query in Zenko

Reporter

Description

•

9 years ago

>SELECT DISTINCT
>    tiles.id,
>    countries.country_name
>FROM
>    tiles
>INNER JOIN impression_stats_daily ON
>    tiles.id = impression_stats_daily.tile_id
>INNER JOIN countries ON
>    countries.country_code = impression_stats_daily.country_code
>ORDER BY
>    countries.country_name ASC;

Copying from bug 1152996

:oyiptong says

This query makes a join on impression_stats_daily, which is a table with billions of rows

A join on a table of that size is something one should never do, as it will make the database do a lot of work and will probably timeout

:mostlygeek says

:oyiptong could you open up a bug to fix the zenko query and make it a dependency for this bug. I don't feel comfortable turning zenko back on if it's just going to degrade redshift again.

Tim Spurway [:tspurway]

Comment 1

•

9 years ago

If the problem is one of wanting to get 'valid' tile_ids, then we could simply delete those tiles that have no foreign key references from impression_stats_daily as a one-time data maintenance task.  This query look like:

delete from tiles
where id not in (select distinct tile_id from impression_stats_daily);

You could then just change your original query to:

>SELECT DISTINCT
>    tiles.id,
>    countries.country_name
>FROM
>    tiles
>INNER JOIN countries ON
>    countries.country_code = impression_stats_daily.country_code
>ORDER BY
>    countries.country_name ASC;

which would be more reasonable

Daniel Thorn [:relud]

Comment 2

•

9 years ago

for reference, this query does not appear to work:

>SELECT DISTINCT
>    tiles.id,
>    countries.country_name
>FROM
>    tiles
>INNER JOIN countries ON
>    countries.country_code = impression_stats_daily.country_code
>ORDER BY
>    countries.country_name ASC;

Matthew Ruttley [:mruttley]

Assignee

Comment 3

•

9 years ago

Basically I need a way of finding out which countries each tile runs in and we have data for (whether it was intended or not) and this query seems to be the only way. Any ideas?

Daniel Thorn [:relud]

Comment 4

•

9 years ago

after further investigation, i think this was what tim meant:

>SELECT DISTINCT
>    impression_stats_daily.tile_id,
>    countries.country_name
>FROM
>    impression_stats_daily
>INNER JOIN countries ON
>    countries.country_code = impression_stats_daily.country_code
>ORDER BY
>    countries.country_name ASC;

Also, while I agree that this is a nicer query, due to fewer joins: testing shows that both of these queries appear to take about 35-50 seconds in prod and 3-10 seconds in stage. also, the original query has the benefit of ignoring invalid tile ids. I suspect that redshift is actually being really efficient about this, because that query time doesn't sound DOS-level intense.

Benson Wong [:mostlygeek]

Comment 5

•

9 years ago

The join in that query only appears to provide country name. 
How long does that query run if we skip the join entirely and omit the country name?

Daniel Thorn [:relud]

Comment 6

•

9 years ago

19-27 seconds for this query:

> SELECT DISTINCT
>     tile_id, country_code
> FROM
>     impression_stats_daily
> ORDER BY
>     country_code ASC;

Note that this does not ignore invalid country codes or invalid tile ids.

Benson Wong [:mostlygeek]

Comment 7

•

9 years ago

Hmm... 30% to 50% speed increase without the joins. 

Seems to me there are two issues here: 

- bad data (invalid country codes, invalid tile ids) in the impression_stats_daily table
- joins in queries as a result of bad data

What is (if any) the policy on invalid data? 
Should we be actively deleting it as :tspurway suggested in Comment #1?
Should we be filtering it out in inferno before it hits redshift?

Flags: needinfo?(tspurway)

Matthew Ruttley [:mruttley]

Assignee

Comment 8

•

9 years ago

Two current strategies that I'm considering to deal with this are 

1) [Short term fix] Possibly speed things up by finding when the user's cache was last updated and limiting this query to data only after that date. Might be worth a shot. 
2) [Long Term, Big Project] Create a decent hosted version of Zenko rather than on each individual's PC. This would hold and cache useful data aggregates in a separate database. It would perform this long query every 24 hours, early in the morning so that data would be available much faster for people. This has several other benefits in accessibility, usage, fraud monitoring possibilities etc.

Tim Spurway [:tspurway]

Comment 9

•

9 years ago

The real problem here is that country_code is not part of any 'dimension' table, and is only tracked in 'fact' tables like impression_stats_daily, which is arbitrarily large, and therefore extremely expensive to query.  The solution is to start tracking the country_code in a fact table like 'tiles' or 'adgroups' (see below).

http://en.wikipedia.org/wiki/Dimension_table
http://en.wikipedia.org/wiki/Fact_table

Our Redshift database is going through some major changes right now, which will affect Zenko:

1. In the next release, we are introducing Adgroups, which have the following features:
- deprecates tiles.locale - this field should not be accessed going forward, and will be removed
- new tables: adgroups and adgroup_sites which contain targeting meta data for tiles (such as locale and frecent_sites)

2. In the following release we will be separating fact tables and dimension tables into separate databases.  This means that it will no longer be possible to perform joins across these tables without synchronizing the data beforehand.  This will make for much more efficient OLTP processing of campaign data (campaign management), but will somewhat complicate reporting.

Flags: needinfo?(tspurway)

Matthew Ruttley [:mruttley]

Assignee

Comment 10

•

9 years ago

I've fixed this issue with: https://github.com/matthewruttley/zenko/commit/346cf9ce95ddcacfbdccbb84d31dfd8a8150a0c2

The only reason the query existed was to tell which Countries we had data for, for each tile. If the user specifically needs that data they can use a country-by-country view: http://localhost:5000/country_impressions?client=Booking.com . I've decided there is no particular point in having that cached. 

Initially it was for views such as the day-by-day view e.g. http://localhost:5000/daily_impressions?tile_id=628 . Here, the Country drop down box would have only been populated by countries for which we *actually* had data for. Instead, now it will just show all countries. If there happens to be no data for that country, there will just be a blank table and graph. This seems OK functionality-wise. If they want to check which countries exist for a tile they can use the country-by-country view mentioned above. 

Zenko now loads in 2-3 seconds :)

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → FIXED

Ed Lee :Mardak

Updated

•

9 years ago

Iteration: --- → 40.3 - 11 May

Points: --- → 5

Bugzilla

Quick Search

Fix Slow Bootup Query in Zenko

Categories

(Content Services Graveyard :: Tiles: Administration Front-End, defect)

Tracking

(Not tracked)

People

(Reporter: oyiptong, Assigned: mruttley)

References

(
URL
)

Details

(Whiteboard: .?)

Crash Data

Security

(public)

User Story

Description

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Updated