Closed Bug 1344817 Opened 7 years ago Closed 7 years ago

Evaluate Superset for multidimensional retention plots

Tracking

(Not tracked)

Status:

RESOLVED INCOMPLETE

People

(Reporter: amiyaguchi, Unassigned)

References

Details

Anthony Miyaguchi [:amiyaguchi]

Reporter

Description

•

7 years ago

Superset [1] is an open-source data exploration tool by Airbnb. It is the default data exploration and visualization tool packaged with Apache Druid.

This tool may be useful for implementing multi-dimensional retention plots, as requested in bug 1337037. Superset can connect to various data backends via SQLAlchemy [2]. A Presto/Hive connector is supported, so this tool can be set up in a similar fashion to redash.

[1] https://github.com/airbnb/superset
[2] http://airbnb.io/superset/installation.html#

Anthony Miyaguchi [:amiyaguchi]

Reporter

Updated

•

7 years ago

Depends on: 1344884

Katie Parlante

Updated

•

7 years ago

Points: --- → 3

Priority: -- → P3

Anthony Miyaguchi [:amiyaguchi]

Reporter

Updated

•

7 years ago

Assignee: nobody → amiyaguchi

Anthony Miyaguchi [:amiyaguchi]

Reporter

Comment 1

•

7 years ago

The default Presto instance that comes packaged with EMR 5.4.0 fails on queries against the churn dataset [3].

SETUP ENVIRONMENT:
Instantiate a Presto cluster on EMR 5.4.0 using 3 x m3.xlarge. Create a redis instance in ElastiCache, allowing all incoming traffic between the security groups of the redis and master nodes.

Install superset on the master node using the instructions in [1]. Add the configuration[4] to the PYTHONPATH, which can also be added to the virtual environment.

> echo 'export PYTHONPATH=/home/hadoop' >> /home/hadoop/venv/bin/activate

This assumes that your working directory is /home/hadoop. The configuration file changes the Superset port to 9999 to avoid a port conflict, and adds the route to the redis instance for asynchronous queries with Presto.
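
For readers without access to the gist, a configuration of this kind might look roughly like the sketch below. It is not the contents of [4]: the redis hostname is a made-up placeholder, and the setting names (`SUPERSET_WEBSERVER_PORT`, `CELERY_CONFIG`) follow the Superset installation docs of that period, so they should be checked against the installed version.

```python
# superset_config.py: illustrative sketch only, NOT the contents of gist [4].
# Superset imports this module by name, so its directory must be on PYTHONPATH.

# Hypothetical placeholder: replace with the real ElastiCache redis endpoint.
REDIS_HOST = "my-redis.example.cache.amazonaws.com"
REDIS_URL = "redis://{}:6379/0".format(REDIS_HOST)

# Move the web server off its default port (8088) to avoid the port conflict.
SUPERSET_WEBSERVER_PORT = 9999


class CeleryConfig(object):
    # The Celery broker and result store both point at the redis instance.
    BROKER_URL = REDIS_URL
    CELERY_RESULT_BACKEND = REDIS_URL
    # Register the SQL Lab tasks so the worker can run asynchronous queries.
    CELERY_IMPORTS = ("superset.sql_lab",)


CELERY_CONFIG = CeleryConfig
```

With a file like this saved as /home/hadoop/superset_config.py, the PYTHONPATH export above is what lets Superset pick it up.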

Next, configure Presto to mirror the deployment configuration backing redash. `hive.parquet.use-column-names=true` is applied to account for schema evolution [5].

> echo 'hive.parquet.use-column-names=true' | sudo tee -a /etc/presto/conf.dist/catalog/hive.properties
> sudo restart presto-server
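
One caveat with `tee -a` is that re-running it appends the same line again. A minimal sketch of an idempotent alternative follows; `ensure_property` is a hypothetical helper written for this note, not part of Presto or EMR tooling, and it assumes a plain `key=value` properties file.

```python
import os


def ensure_property(path, key, value):
    """Set key=value in a .properties file; return True if the file changed."""
    wanted = "{}={}".format(key, value)
    lines = []
    if os.path.exists(path):
        with open(path) as handle:
            lines = handle.read().splitlines()
    if wanted in (line.strip() for line in lines):
        return False  # already present, leave the file alone
    # Drop any stale assignment of the same key before appending the new one.
    lines = [line for line in lines if line.split("=", 1)[0].strip() != key]
    lines.append(wanted)
    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    return True
```

Calling `ensure_property("/etc/presto/conf.dist/catalog/hive.properties", "hive.parquet.use-column-names", "true")` would need root privileges on the master node, and Presto still has to be restarted afterwards.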

The churn data is registered with the Hive metastore through the following call to parquet2hive (p2h).

> parquet2hive -ulv s3://telemetry-parquet/churn | bash

Run the server and the worker in separate sessions (screen is useful).

> superset runserver
> superset worker

In the sources tab, add a new source using the local presto connection string.

> presto://localhost:8889/hive/default
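
To make the pieces of that connection string explicit, here is a small sketch that splits it with the standard library. It assumes the usual `presto://host:port/catalog/schema` layout used by the SQLAlchemy Presto dialect; `describe_presto_uri` is a hypothetical helper name, not a Superset or SQLAlchemy function.

```python
from urllib.parse import urlsplit


def describe_presto_uri(uri):
    """Split a presto:// SQLAlchemy URI into host, port, catalog and schema."""
    parts = urlsplit(uri)
    # The path carries "catalog/schema"; the schema part is optional.
    catalog, _, schema = parts.path.strip("/").partition("/")
    return {
        "dialect": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "catalog": catalog,
        "schema": schema or None,
    }


info = describe_presto_uri("presto://localhost:8889/hive/default")
print(info)
```

Here `hive` is the Presto catalog configured in hive.properties and `default` is the schema that parquet2hive populated.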

Enable SQL Lab and Async.


The next step in fully evaluating Superset is to query the data. However, Presto keeps falling over on simple queries. I think this could be related to the configuration of the parquet data, but it also might be a flag that I'm missing for Presto.

[3] https://gist.github.com/acmiyaguchi/6ec7708ec0df920744baceb0015247d1
[4] https://gist.github.com/acmiyaguchi/9c8086939051b4bec8e5d413066af934
[5] https://github.com/fbertsch/schema_evolution_exploration

Anthony Miyaguchi [:amiyaguchi]

Reporter

Comment 2

•

7 years ago

After discussion with :robotblake, the one difference that might be causing issues is the version of Presto packaged with EMR 5.4.0. Redash is currently powered with Presto hosted on EMR 5.1.0.

It may not be worth spending more effort to make my Superset instance compatible with Presto because of the eventual migration to Athena, a hosted, Amazon-flavored Presto. Athena only officially exposes a JDBC interface, whereas Superset requires an ODBC interface for its SQLAlchemy backend.

I will try out another supported backend like PostgreSQL or Redshift, since the dataset in question is small (~50 MB a week for ~40 weeks).
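
As a rough check on "small", the figures quoted above work out as follows. This is back-of-the-envelope arithmetic on the approximate numbers in this comment, not a measurement of the actual bucket.

```python
MB_PER_WEEK = 50  # approximate weekly size quoted above
WEEKS = 40        # approximate number of weeks quoted above

total_mb = MB_PER_WEEK * WEEKS
total_gb = total_mb / 1000.0  # decimal gigabytes
print("{} MB, about {:.0f} GB".format(total_mb, total_gb))
```

A total in the region of 2 GB fits comfortably in a single-node relational database.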

Anthony Miyaguchi [:amiyaguchi]

Reporter

Updated

•

7 years ago

Assignee: amiyaguchi → nobody

Firefox Bug Husbandry Bot

Comment 3

•

7 years ago

Closing abandoned bugs in this product per https://bugzilla.mozilla.org/show_bug.cgi?id=1337972

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → INCOMPLETE

BMO Automation

Updated

•

6 years ago

Product: Cloud Services → Cloud Services Graveyard

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, enhancement, P3)
