I'm trying to run a query to graph the distribution of Flash versions over time: https://sql.telemetry.mozilla.org/queries/4012/source This query will not run. Here are some of the error messages I have received: "Error running query: Could not communicate with the remote task. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes." "Error running query: Should not have nextUri if failed" I'm not sure whether this is a transient operational issue, or a problem with the query itself. I don't *think* that this query ought to be unusually expensive, but I've been wrong before. Text of the query, for reference: WITH clients AS ( SELECT SUBSTR(s.subsession_start_date, 1, 10) AS activity_date, s.subsession_length, p.active_plugins.version AS flash_version FROM longitudinal l CROSS JOIN UNNEST( l.subsession_start_date, l.subsession_length, l.active_plugins ) AS s( subsession_start_date, subsession_length, active_plugins ) CROSS JOIN UNNEST(s.active_plugins) AS p(active_plugins) WHERE normalized_channel = 'release' AND os = 'Windows_NT' AND p.active_plugins.name = 'Shockwave Flash' AND s.subsession_start_date > '2016-09-01' ) SELECT activity_date, CASE WHEN COALESCE(TRY_CAST(SPLIT_PART(flash_version, '.', 1) AS TINYINT), 0) BETWEEN 20 AND 63 THEN flash_version ELSE '0:Older' END AS flash_version, SUM(subsession_length) AS usage FROM clients GROUP BY activity_date, flash_version ORDER BY activity_date DESC, flash_version DESC
An update here: robotblake said he kicked the cluster on Friday, but this query continues to time out/not run properly. Could I get somebody to either look at the query and tell me if I'm doing something really dumb from a query-performance perspective, or check server logs to see what's failing?
Created attachment 8856514 [details] Query plan according to EXPLAIN Attached is the query plan for this according to EXPLAIN. I'm not experienced at reading presto query plans, but that there are two RemoteExchange nodes is a bit troubling because I'd really expect you'd only need to do a slave aggregate and a single gather/collect.
I believe this query was causing the presto cluster to fail due to resource constraints. The same query runs OK against the longitudinal table accessible via Athena per: https://sql.telemetry.mozilla.org/queries/4416/source I've filed Bug 1361803 to ensure that the longitudinal table is kept up to date via the Athena data source. Benjamin, is this OK as a workaround?
Can you describe what Athena is? As long as the data is accessible I just need guidance about the right way to access it.
Athena is an Amazon hosted version of Presto + Hive, it's got some limitations but generally is faster and more stable than the Presto data source.
Athena uses the exact same underlying data as our Presto cluster. You can use it by switching the Data Source drop-down from "Presto" to "Athena". You also need to refer to the table as "telemetry.longitudinal", but otherwise it should be the same.