Investigate Materialized Views V2
Categories
(Data Platform and Tools :: General, task, P3)
Tracking
(Not tracked)
People
(Reporter: ascholtz, Assigned: ascholtz)
References
Details
(Whiteboard: [dataplatform])
Materialized views without aggregations and joins are now available in preview. This could be useful for creating user-facing datasets with live data; see also https://bugzilla.mozilla.org/show_bug.cgi?id=1697285. bigquery-etl already has some support for materialized views, so it should generally be possible to simply add them for monitoring use cases.
Materialized views based on event_v4
Creating an event_1pct_live materialized view by running the same query but querying data from telemetry_live.event_v4 worked without any issues:
CREATE MATERIALIZED VIEW analysis.event_1pct_live
PARTITION BY DATE(submission_timestamp)
CLUSTER BY
  normalized_channel,
  sample_id,
  subsample_id
OPTIONS (enable_refresh = true, refresh_interval_minutes = 10)
AS
SELECT
  DATE(submission_timestamp) AS submission_date,
  CAST(NULL AS INT64) AS subsample_id,
  *
FROM
  telemetry_live.event_v4
WHERE
  sample_id = 0
  AND DATE(submission_timestamp) > "2021-07-08"  # limit the data processed for this test
Selecting all data from the materialized view for a specific date ("2021-07-09") took about 40.8 seconds (12.5 GB processed), while selecting 1 percent of the data directly from telemetry_live.event_v4 took about 43.5 seconds (888 GB processed). This was tested only on a small amount of data (~1 day). More data needs to be processed when querying event_v4 directly, since that table is neither clustered nor partitioned on sample_id, while the materialized view only contains data for a single sample_id.
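For reference, the two queries being compared were roughly the following (a sketch; the exact test queries are not recorded in this bug, so the date filter and column list are assumptions):

-- Read one day from the materialized view; the scan stays small because the
-- view is partitioned on DATE(submission_timestamp) and filtered to sample_id = 0
SELECT *
FROM analysis.event_1pct_live
WHERE submission_date = "2021-07-09";

-- Equivalent read against the live table; this scans the full day of data
-- because event_v4 is neither clustered nor partitioned on sample_id
SELECT *
FROM telemetry_live.event_v4
WHERE sample_id = 0
  AND DATE(submission_timestamp) = "2021-07-09";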
When the materialized view gets refreshed, only new data that has not been written to the materialized view before gets added. Overall, the daily cost of keeping the materialized view up to date depends on the amount of data added to event_v4.
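To keep an eye on that refresh cost, one option is to aggregate the automatic refresh jobs from INFORMATION_SCHEMA; a minimal sketch, assuming the documented materialized_view_refresh_ job ID prefix and a view in the US region:

SELECT
  DATE(creation_time) AS day,
  SUM(total_bytes_processed) / POW(1024, 3) AS gb_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_id LIKE 'materialized_view_refresh_%'
GROUP BY day
ORDER BY day;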
Materialized views based on main_v4
I tried to create a materialized view for main_1pct with live data; however, this resulted in the following error: Too many columns in schema.
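The attempted statement was presumably along these lines (a reconstruction following the event_1pct_live pattern above, not the exact query from the test):

CREATE MATERIALIZED VIEW analysis.main_1pct_live
PARTITION BY DATE(submission_timestamp)
OPTIONS (enable_refresh = true, refresh_interval_minutes = 10)
AS
SELECT *
FROM telemetry_live.main_v4
WHERE sample_id = 0  # fails with: Too many columns in schema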
Materialized view for monitoring WebRender
Generally, for cases where we want to provide user-facing live datasets, e.g. for monitoring, it could make sense to set up materialized views that provide the subset of data users are interested in (as opposed to all data in main_v4). For example, for the WebRender monitoring use case we could set up a materialized view like:
CREATE MATERIALIZED VIEW analysis.webrender_live
PARTITION BY DATE(submission_timestamp)
CLUSTER BY
  normalized_channel
OPTIONS (enable_refresh = true, refresh_interval_minutes = 10)
AS
SELECT
  client_id,
  submission_timestamp,
  application.version AS version,
  environment.system.gfx.features.compositor,
  normalized_channel
FROM
  `moz-fx-data-shared-prod.telemetry_live.main_v4`
WHERE
  DATE(submission_timestamp) > '2021-06-01'
The dashboard queries could then query the materialized views directly, which improves performance and potentially reduces the amount of data processed. For example:
SELECT
  APPROX_COUNT_DISTINCT(client_id) AS count,
  TIMESTAMP_TRUNC(submission_timestamp, hour) AS time,
  SPLIT(version, ".")[OFFSET(0)] AS version
FROM analysis.webrender_live AS telem
WHERE submission_timestamp > TIMESTAMP(DATE("2021-06-01"))
  AND normalized_channel = 'release'
  AND (version LIKE "89%" OR version LIKE "88%")
GROUP BY version, time
Fetching data from the materialized view takes 5.5 seconds (367 GB processed), while running the original query on telemetry_live.main_v4 takes 5 minutes 18 seconds (399 GB processed) to finish.
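For comparison, the original query presumably looked like this (a reconstruction; the field paths are taken from the materialized view definition above):

SELECT
  APPROX_COUNT_DISTINCT(client_id) AS count,
  TIMESTAMP_TRUNC(submission_timestamp, hour) AS time,
  SPLIT(application.version, ".")[OFFSET(0)] AS version
FROM `moz-fx-data-shared-prod.telemetry_live.main_v4`
WHERE submission_timestamp > TIMESTAMP(DATE("2021-06-01"))
  AND normalized_channel = 'release'
  AND (application.version LIKE "89%" OR application.version LIKE "88%")
GROUP BY version, time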
Similarly, for a second dashboard query that computes the share of WebRender clients per hour:

SELECT * FROM (
  SELECT
    APPROX_COUNT_DISTINCT(client_id) AS count,
    TIMESTAMP_TRUNC(submission_timestamp, hour) AS time,
    SUM(CAST((compositor LIKE 'webrender%') AS INT64)) / COUNT(*) AS pct,
    "88" AS version
  FROM `analysis.webrender_live`
  WHERE submission_timestamp > TIMESTAMP(DATE("2021-06-01"))
    AND version LIKE "88.0%"
    AND normalized_channel = 'release'
  GROUP BY time, version
  ORDER BY time
)
UNION ALL
SELECT * FROM (
  SELECT
    APPROX_COUNT_DISTINCT(client_id) AS count,
    TIMESTAMP_TRUNC(submission_timestamp, hour) AS time,
    SUM(CAST((compositor LIKE 'webrender%') AS INT64)) / COUNT(*) AS pct,
    "89" AS version
  FROM `analysis.webrender_live`
  WHERE submission_timestamp > TIMESTAMP(DATE("2021-06-01"))
    AND version LIKE "89.%"
    AND normalized_channel = 'release'
  GROUP BY time, version
  ORDER BY time
)
Using the materialized view reduced the query time to 5.3 seconds (466 GB processed), compared to 11 minutes (517 GB processed) when querying telemetry_live.main_v4 directly.
Comment 1•3 years ago
> Selecting all data from the materialized view for a specific date ("2021-07-09") took about 40.8 seconds (12.5 GB processed), while selecting 1 percent of the data directly from telemetry_live.event_v4 took about 43.5 seconds (888 GB processed)
I'm surprised that there wasn't a bigger speed boost here. Was this doing a literal SELECT *? If so, perhaps the bottleneck here is writing out the result set, and an aggregate query would show a bigger gain in terms of query execution time.
> I tried to create a materialized view for main_1pct with live data
I've made JJ at Google aware of this issue.
Assignee
Comment 2•3 years ago
> Was this doing a literal SELECT *?
Yes. I did run a couple of additional tests; when running aggregate queries, querying the materialized view was indeed showing some gains. The data that I used for testing is quite limited, but running something like the following query was about 3x faster:
SELECT
  COUNT(client_id),
  e.f1_
FROM analysis.event_1pct_live
CROSS JOIN
  UNNEST(
    [
      STRUCT("content" AS event_process, payload.events.content AS events),
      ("dynamic", payload.events.dynamic),
      ("extension", payload.events.extension),
      ("gpu", payload.events.gpu),
      ("parent", payload.events.parent)
    ]
  )
CROSS JOIN
  UNNEST(events) AS e
GROUP BY 2
Comment 3•3 years ago
I did a little testing of MV2 to try to understand the behavior of query rewrites and MV refresh patterns.
I created the following in a test project:
CREATE OR REPLACE TABLE mvtest1.base AS
SELECT n, CAST(n AS STRING) AS s FROM UNNEST(GENERATE_ARRAY(1, 1000000)) AS n;

DROP MATERIALIZED VIEW IF EXISTS mvtest1.mv1;
CREATE MATERIALIZED VIEW mvtest1.mv1 AS
SELECT * FROM mvtest1.base WHERE n < 1000;

CREATE MATERIALIZED VIEW mvtest1.mv2 AS
SELECT n, s FROM mvtest1.base;

CREATE OR REPLACE VIEW mvtest2.v1 AS
SELECT * FROM mvtest1.mv1;

CREATE OR REPLACE VIEW mvtest2.v2 AS
SELECT * FROM mvtest1.base;
Query rewrites are possible even through a (virtual) view
I was pleased to see that querying SELECT n, s FROM mvtest2.v2 WHERE n < 100 successfully took advantage of the mvtest1.mv1 materialized view, even though it had to hop through a virtual view definition from a separate dataset. It was able to evaluate the WHERE clause and determine that mv1 was appropriate. It's still necessary, though, for mv1 to be defined in the same dataset as the base table.
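To make the rewrite boundary concrete, these two cases behave differently (the second query is an extrapolation from how MV rewrites work, not one of the tests above):

-- Served from mvtest1.mv1 via automatic rewrite, even through the virtual view
SELECT n, s FROM mvtest2.v2 WHERE n < 100;

-- Cannot be served from mv1, since mv1 only contains rows with n < 1000;
-- this falls back to scanning mvtest1.base
SELECT n, s FROM mvtest2.v2 WHERE n < 10000;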
Schema updates invalidate the entire view
I tried adding a new empty field to base. This caused all queries to fall back to querying base, even if you try to explicitly query the materialized view. Once I called a refresh on mv1, the new field showed up in the schema browser and queries started taking advantage of the materialized view again.
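The steps for that test were roughly as follows (the column name is made up; BQ.REFRESH_MATERIALIZED_VIEW is the documented system procedure for triggering a manual refresh):

-- Add an empty column to the base table; queries now fall back to base
ALTER TABLE mvtest1.base ADD COLUMN extra STRING;

-- Manually refresh the materialized view so query rewrites kick in again
CALL BQ.REFRESH_MATERIALIZED_VIEW('mvtest1.mv1');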
This could be problematic if we try to implement a large-scale MV for main_1pct. The main_v4 table gets new fields added every few days, and BQ would have to refresh all partitions of the MV every time a schema update happens. This could end up being expensive, and it may cause the MV to be unavailable for a significant period of time, meaning that queries on main_1pct would be hitting main_v4 and losing the performance advantage.