>SELECT DISTINCT
>  tiles.id,
>  countries.country_name
>FROM
>  tiles
>INNER JOIN impression_stats_daily ON
>  tiles.id = impression_stats_daily.tile_id
>INNER JOIN countries ON
>  countries.country_code = impression_stats_daily.country_code
>ORDER BY
>  countries.country_name ASC;

Copying from bug 1152996:

:oyiptong says: This query makes a join on impression_stats_daily, which is a table with billions of rows. A join on a table of that size is something one should never do, as it will make the database do a lot of work and will probably time out.

:mostlygeek says: :oyiptong, could you open up a bug to fix the zenko query and make it a dependency for this bug? I don't feel comfortable turning zenko back on if it's just going to degrade redshift again.
If the problem is one of wanting to get 'valid' tile_ids, then we could simply delete those tiles that have no foreign key references from impression_stats_daily, as a one-time data maintenance task. That query would look like:

>delete from tiles where id not in (select distinct tile_id from impression_stats_daily);

You could then just change your original query to:

>SELECT DISTINCT
>  tiles.id,
>  countries.country_name
>FROM
>  tiles
>INNER JOIN countries ON
>  countries.country_code = impression_stats_daily.country_code
>ORDER BY
>  countries.country_name ASC;

which would be more reasonable.
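The one-time cleanup above can be sketched against a toy schema. This is purely illustrative: SQLite stands in for Redshift, the rows are made up, and NOT EXISTS is used in place of NOT IN since it behaves more predictably when tile_id can be NULL.

```python
import sqlite3

# Toy stand-in for the Redshift schema; SQLite here is purely illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE tiles (id INTEGER PRIMARY KEY);
    CREATE TABLE impression_stats_daily (tile_id INTEGER, country_code TEXT);
    INSERT INTO tiles (id) VALUES (1), (2), (3);
    -- Only tiles 1 and 2 ever received impressions; tile 3 is an orphan.
    INSERT INTO impression_stats_daily VALUES (1, 'US'), (2, 'DE');
""")

# One-time cleanup: drop tiles with no impression rows.
con.execute("""
    DELETE FROM tiles
    WHERE NOT EXISTS (
        SELECT 1 FROM impression_stats_daily i
        WHERE i.tile_id = tiles.id
    )
""")

remaining = [row[0] for row in con.execute("SELECT id FROM tiles ORDER BY id")]
print(remaining)  # [1, 2] -- the orphaned tile 3 is gone
```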
For reference, this query does not appear to work (it still references impression_stats_daily.country_code even though that table is no longer joined):

>SELECT DISTINCT
>  tiles.id,
>  countries.country_name
>FROM
>  tiles
>INNER JOIN countries ON
>  countries.country_code = impression_stats_daily.country_code
>ORDER BY
>  countries.country_name ASC;
Basically, I need a way of finding out which countries each tile runs in and for which we have data (whether that was intended or not), and this query seems to be the only way. Any ideas?
After further investigation, I think this is what Tim meant:

>SELECT DISTINCT
>  impression_stats_daily.tile_id,
>  countries.country_name
>FROM
>  impression_stats_daily
>INNER JOIN countries ON
>  countries.country_code = impression_stats_daily.country_code
>ORDER BY
>  countries.country_name ASC;

While I agree that this is a nicer query due to fewer joins, testing shows that both of these queries take about 35-50 seconds in prod and 3-10 seconds in stage. Also, the original query has the benefit of ignoring invalid tile ids. I suspect that Redshift is actually being really efficient about this, because that query time doesn't sound DOS-level intense.
The join in that query only appears to provide the country name. How long does the query run if we skip the join entirely and omit the country name?
19-27 seconds for this query:

>SELECT DISTINCT
>  tile_id, country_code
>FROM
>  impression_stats_daily
>ORDER BY
>  country_code ASC;

Note that this does not ignore invalid country codes or invalid tile ids.
Hmm... a 30% to 50% speed increase without the joins. Seems to me there are two issues here:

- bad data (invalid country codes, invalid tile ids) in the impression_stats_daily table
- joins in queries as a result of that bad data

What is the policy (if any) on invalid data? Should we be actively deleting it, as :tspurway suggested in Comment #1? Should we be filtering it out in inferno before it hits redshift?
Two strategies I'm currently considering to deal with this:

1) [Short term fix] Possibly speed things up by finding when the user's cache was last updated and limiting this query to data after that date only. Might be worth a shot.

2) [Long term, big project] Create a decent hosted version of Zenko, rather than running it on each individual's PC. This would hold and cache useful data aggregates in a separate database. It would perform this long query every 24 hours, early in the morning, so that data would be available much faster for people. This also has several other benefits for accessibility, usage, fraud monitoring possibilities, etc.
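Strategy (1) could be sketched as below. This is only illustrative: SQLite stands in for Redshift, the rows are made up, and the "date" column name and the cache timestamp are assumptions about the real schema.

```python
import sqlite3

# Sketch of the short-term fix: only scan rows newer than the user's last
# cache update. The "date" column name is an assumption, not the real schema.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE impression_stats_daily
        (tile_id INTEGER, country_code TEXT, date TEXT);
    INSERT INTO impression_stats_daily VALUES
        (1, 'US', '2015-04-01'),
        (1, 'GB', '2015-04-20'),
        (2, 'DE', '2015-04-25');
""")

cache_last_updated = '2015-04-15'  # hypothetical value read from the client cache
rows = con.execute("""
    SELECT DISTINCT tile_id, country_code
    FROM impression_stats_daily
    WHERE date > ?
    ORDER BY country_code ASC
""", (cache_last_updated,)).fetchall()
print(rows)  # [(2, 'DE'), (1, 'GB')] -- the 2015-04-01 row is skipped
```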
The real problem here is that country_code is not part of any 'dimension' table, and is only tracked in 'fact' tables like impression_stats_daily, which is arbitrarily large and therefore extremely expensive to query. The solution is to start tracking the country_code in a dimension table like 'tiles' or 'adgroups' (see below).

http://en.wikipedia.org/wiki/Dimension_table
http://en.wikipedia.org/wiki/Fact_table

Our Redshift database is going through some major changes right now, which will affect Zenko:

1. In the next release, we are introducing Adgroups, which have the following features:
- deprecates tiles.locale - this field should not be accessed going forward, and will be removed
- new tables: adgroups and adgroup_sites, which contain targeting metadata for tiles (such as locale and frecent_sites)

2. In the following release we will be separating fact tables and dimension tables into separate databases. This means that it will no longer be possible to perform joins across these tables without synchronizing the data beforehand. This will make for much more efficient OLTP processing of campaign data (campaign management), but will somewhat complicate reporting.
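The dimension-table idea above can be sketched like this: materialize the distinct (tile_id, country_code) pairs out of band (e.g. nightly), so reporting queries read a small table instead of scanning the fact table. SQLite stands in for Redshift, and the "tile_countries" table name is hypothetical, not part of the actual schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE impression_stats_daily (tile_id INTEGER, country_code TEXT);
    INSERT INTO impression_stats_daily VALUES (1, 'US'), (1, 'US'), (2, 'DE');

    -- Small dimension-style table, refreshed out of band (e.g. nightly):
    CREATE TABLE tile_countries AS
        SELECT DISTINCT tile_id, country_code FROM impression_stats_daily;
""")

# Reporting now reads the small table instead of the huge fact table.
pairs = con.execute(
    "SELECT tile_id, country_code FROM tile_countries ORDER BY tile_id"
).fetchall()
print(pairs)  # [(1, 'US'), (2, 'DE')] -- duplicates collapsed at refresh time
```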
I've fixed this issue with: https://github.com/matthewruttley/zenko/commit/346cf9ce95ddcacfbdccbb84d31dfd8a8150a0c2

The only reason the query existed was to tell which countries we had data for, for each tile. If a user specifically needs that data, they can use the country-by-country view: http://localhost:5000/country_impressions?client=Booking.com . I've decided there is no particular point in caching it.

Initially it was for views such as the day-by-day view, e.g. http://localhost:5000/daily_impressions?tile_id=628 . There, the country drop-down box would have been populated only with countries for which we *actually* had data. Instead, it will now just show all countries. If there happens to be no data for a country, there will just be a blank table and graph. This seems OK functionality-wise. If users want to check which countries exist for a tile, they can use the country-by-country view mentioned above.

Zenko now loads in 2-3 seconds :)