Closed Bug 1354640 Opened 7 years ago Closed 7 years ago

Unable to run query

Categories

(Data Platform and Tools :: General, defect)

defect
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: benjamin, Unassigned)

References

()

Details

Attachments

(1 file)

I'm trying to run a query to graph the distribution of Flash versions over time:

https://sql.telemetry.mozilla.org/queries/4012/source

This query will not run. Here are some of the error messages I have received:
"Error running query: Could not communicate with the remote task. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes."
"Error running query: Should not have nextUri if failed"

I'm not sure whether this is a transient operational issue, or a problem with the query itself. I don't *think* that this query ought to be unusually expensive, but I've been wrong before.

Text of the query, for reference:

WITH clients AS (
    SELECT
        SUBSTR(s.subsession_start_date, 1, 10) AS activity_date,
        s.subsession_length,
        p.active_plugins.version AS flash_version
    FROM
        longitudinal l
    CROSS JOIN
        UNNEST(
            l.subsession_start_date,
            l.subsession_length,
            l.active_plugins
        ) AS s(
            subsession_start_date,
            subsession_length,
            active_plugins
        )
    CROSS JOIN
        UNNEST(s.active_plugins) AS p(active_plugins)
    WHERE
        normalized_channel = 'release'
        AND os = 'Windows_NT'
        AND p.active_plugins.name = 'Shockwave Flash'
        AND s.subsession_start_date > '2016-09-01'
)
SELECT
    activity_date,
    CASE
      WHEN COALESCE(TRY_CAST(SPLIT_PART(flash_version, '.', 1) AS TINYINT), 0) BETWEEN 20 AND 63 THEN flash_version
      ELSE '0:Older'
    END AS flash_version,
    SUM(subsession_length) AS usage
FROM clients
GROUP BY activity_date, flash_version
ORDER BY activity_date DESC, flash_version DESC
An update here: robotblake said he kicked the cluster on Friday, but this query continues to time out/not run properly.

Could I get somebody to either look at the query and tell me if I'm doing something really dumb from a query-performance perspective, or check server logs to see what's failing?
Flags: needinfo?(bimsland)
Attached is the query plan for this according to EXPLAIN.

I'm not experienced at reading presto query plans, but that there are two RemoteExchange nodes is a bit troubling because I'd really expect you'd only need to do a slave aggregate and a single gather/collect.
I believe this query was causing the presto cluster to fail due to resource constraints.

The same query runs OK against the longitudinal table accessible via Athena per:
https://sql.telemetry.mozilla.org/queries/4416/source

I've filed Bug 1361803 to ensure that the longitudinal table is kept up to date via the Athena data source.

Benjamin, is this OK as a workaround?
Flags: needinfo?(benjamin)
Can you describe what Athena is? As long as the data is accessible I just need guidance about the right way to access it.
Flags: needinfo?(benjamin)
Athena is an Amazon hosted version of Presto + Hive, it's got some limitations but generally is faster and more stable than the Presto data source.
Flags: needinfo?(bimsland)
Athena uses the exact same underlying data as our Presto cluster. You can use it by switching the Data Source drop-down from "Presto" to "Athena".

You also need to refer to the table as "telemetry.longitudinal", but otherwise it should be the same.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Redash (STMO) → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: