Status

Data Platform and Tools
Redash (STMO)
--
critical
RESOLVED FIXED
8 months ago
7 months ago

People

(Reporter: Benjamin Smedberg, Unassigned)

Tracking

Details

(URL)

Attachments

(1 attachment)

(Reporter)

Description

8 months ago
I'm trying to run a query to graph the distribution of Flash versions over time:

https://sql.telemetry.mozilla.org/queries/4012/source

This query will not run. Here are some of the error messages I have received:
"Error running query: Could not communicate with the remote task. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes."
"Error running query: Should not have nextUri if failed"

I'm not sure whether this is a transient operational issue, or a problem with the query itself. I don't *think* that this query ought to be unusually expensive, but I've been wrong before.

Text of the query, for reference:

WITH clients AS (
    SELECT
        SUBSTR(s.subsession_start_date, 1, 10) AS activity_date,
        s.subsession_length,
        p.active_plugins.version AS flash_version
    FROM
        longitudinal l
    CROSS JOIN
        UNNEST(
            l.subsession_start_date,
            l.subsession_length,
            l.active_plugins
        ) AS s(
            subsession_start_date,
            subsession_length,
            active_plugins
        )
    CROSS JOIN
        UNNEST(s.active_plugins) AS p(active_plugins)
    WHERE
        normalized_channel = 'release'
        AND os = 'Windows_NT'
        AND p.active_plugins.name = 'Shockwave Flash'
        AND s.subsession_start_date > '2016-09-01'
)
SELECT
    activity_date,
    CASE
      WHEN COALESCE(TRY_CAST(SPLIT_PART(flash_version, '.', 1) AS TINYINT), 0) BETWEEN 20 AND 63 THEN flash_version
      ELSE '0:Older'
    END AS flash_version,
    SUM(subsession_length) AS usage
FROM clients
GROUP BY activity_date, flash_version
ORDER BY activity_date DESC, flash_version DESC
(Reporter)

Comment 1

8 months ago
An update here: robotblake said he kicked the cluster on Friday, but this query continues to time out/not run properly.

Could I get somebody to either look at the query and tell me if I'm doing something really dumb from a query-performance perspective, or check server logs to see what's failing?
Flags: needinfo?(bimsland)
(Reporter)

Comment 2

8 months ago
Created attachment 8856514 [details]
Query plan according to EXPLAIN

Attached is the query plan for this according to EXPLAIN.

I'm not experienced at reading presto query plans, but that there are two RemoteExchange nodes is a bit troubling because I'd really expect you'd only need to do a slave aggregate and a single gather/collect.

Comment 3

8 months ago
I believe this query was causing the presto cluster to fail due to resource constraints.

The same query runs OK against the longitudinal table accessible via Athena per:
https://sql.telemetry.mozilla.org/queries/4416/source

I've filed Bug 1361803 to ensure that the longitudinal table is kept up to date via the Athena data source.

Benjamin, is this OK as a workaround?
Flags: needinfo?(benjamin)
(Reporter)

Comment 4

8 months ago
Can you describe what Athena is? As long as the data is accessible I just need guidance about the right way to access it.
Flags: needinfo?(benjamin)
Athena is an Amazon hosted version of Presto + Hive, it's got some limitations but generally is faster and more stable than the Presto data source.
Flags: needinfo?(bimsland)

Comment 6

7 months ago
Athena uses the exact same underlying data as our Presto cluster. You can use it by switching the Data Source drop-down from "Presto" to "Athena".

You also need to refer to the table as "telemetry.longitudinal", but otherwise it should be the same.
Status: NEW → RESOLVED
Last Resolved: 7 months ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.