Closed Bug 1344817 Opened 7 years ago Closed 7 years ago

Evaluate Superset for multidimensional retention plots

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, enhancement, P3)

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: amiyaguchi, Unassigned)

Superset [1] is an open-source data exploration tool by Airbnb. It is the default data exploration and visualization tool packaged with Apache Druid.

This tool may be useful for implementing the multi-dimensional retention plots requested in bug 1337037. Superset can connect to various data backends via SQLAlchemy [2]. A Presto/Hive connector is supported, so this tool can be set up in a similar fashion to Redash.

[1] https://github.com/airbnb/superset
[2] http://airbnb.io/superset/installation.html#
Depends on: 1344884
Points: --- → 3
Priority: -- → P3
Assignee: nobody → amiyaguchi
The default Presto instance that comes packaged with EMR 5.4.0 fails on queries against the churn dataset.[3]

SETUP ENVIRONMENT:
Instantiate a Presto cluster on EMR 5.4.0 using 3 x m3.xlarge instances. Create a Redis instance in ElastiCache, allowing all incoming traffic between the security groups of the Redis and master nodes.

Install Superset on the master node using the instructions in [1]. Add the directory containing the configuration file [4] to the PYTHONPATH. The export can be appended to the virtual environment's activate script so that it is set on every activation.

> echo 'export PYTHONPATH=/home/hadoop' >> /home/hadoop/venv/bin/activate

assuming that your working path is /home/hadoop. This configuration file changes the Superset port to 9999 to avoid a port conflict, and adds the route to the redis instance for asynchronous queries with presto.
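
As a rough illustration of what such a configuration file might contain, here is a minimal sketch of a superset_config.py. It is not the contents of the gist in [4]. The Redis hostname is a placeholder, and the setting names follow the Celery-based SQL Lab setup described in the Superset installation docs of that era, so check them against the installed version.

```python
# superset_config.py: hypothetical sketch, not the actual file from [4].
# Superset imports this module from PYTHONPATH at startup.

# Placeholder endpoint: substitute the ElastiCache Redis host for your cluster.
REDIS_URL = "redis://my-redis.example.cache.amazonaws.com:6379/0"

# Serve the web UI on 9999 instead of Superset's default port, 8088.
SUPERSET_WEBSERVER_PORT = 9999


class CeleryConfig(object):
    # Redis acts as both the task queue and the store for task state.
    BROKER_URL = REDIS_URL
    CELERY_RESULT_BACKEND = REDIS_URL
    # Module containing the SQL Lab query task picked up by `superset worker`.
    CELERY_IMPORTS = ("superset.sql_lab",)


# Superset reads the Celery settings from this name.
CELERY_CONFIG = CeleryConfig
```

With PYTHONPATH=/home/hadoop, this file would live at /home/hadoop/superset_config.py.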

Next, configure Presto to mirror the deployment configuration backing Redash. `hive.parquet.use-column-names=true` is applied to account for schema evolution.[5]

> echo 'hive.parquet.use-column-names=true' | sudo tee -a /etc/presto/conf.dist/catalog/hive.properties
> sudo restart presto-server
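
Note that `tee -a` appends unconditionally, so re-running the step leaves duplicate lines in hive.properties. A minimal sketch of an idempotent alternative follows. The helper name `ensure_property` is made up for illustration, and writing under /etc would still require root.

```python
import os


def ensure_property(path, key, value):
    """Append `key=value` to a .properties file unless `key` is already set.

    Returns True if the file was modified, False if the key was present.
    """
    content = ""
    if os.path.exists(path):
        with open(path) as handle:
            content = handle.read()

    for row in content.splitlines():
        row = row.strip()
        if row.startswith("#"):
            continue  # ignore comment lines
        if row.split("=", 1)[0].strip() == key:
            return False  # already configured, leave the file untouched

    # Start on a fresh line if the file does not end with a newline.
    separator = "" if content == "" or content.endswith("\n") else "\n"
    with open(path, "a") as handle:
        handle.write("{}{}={}\n".format(separator, key, value))
    return True


# Example call (path taken from the command above):
# ensure_property("/etc/presto/conf.dist/catalog/hive.properties",
#                 "hive.parquet.use-column-names", "true")
```

Presto still needs to be restarted afterwards for the property to take effect.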

The churn data is added through the following call to parquet2hive (p2h).

> parquet2hive -ulv s3://telemetry-parquet/churn | bash

Run the server and the worker in separate sessions (screen is useful).

> superset runserver
> superset worker

In the sources tab, add a new source using the local presto connection string.

> presto://localhost:8889/hive/default

Enable SQL Lab and Async.
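
For reference, the connection string above breaks down as follows. This is a small sketch using only the standard library, assuming the usual PyHive convention that the URL path is `catalog/schema`. The function name `describe_presto_uri` is made up for illustration.

```python
from urllib.parse import urlsplit


def describe_presto_uri(uri):
    """Split a presto:// SQLAlchemy URI into its named parts."""
    parts = urlsplit(uri)
    # Path is "/<catalog>/<schema>"; the schema segment is optional.
    catalog, _, schema = parts.path.lstrip("/").partition("/")
    return {
        "dialect": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "catalog": catalog,
        "schema": schema or None,
    }


info = describe_presto_uri("presto://localhost:8889/hive/default")
for name, value in info.items():
    print("{}: {}".format(name, value))
```

Here `hive` is the Presto catalog backed by hive.properties, `default` is the schema holding the churn table, and 8889 is the port Presto listens on in this EMR setup.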


The next step in fully evaluating Superset is to query the data. However, Presto keeps falling over on simple queries. I think this could be related to the configuration of the Parquet data, but it might also be a flag that I'm missing for Presto.

[3] https://gist.github.com/acmiyaguchi/6ec7708ec0df920744baceb0015247d1
[4] https://gist.github.com/acmiyaguchi/9c8086939051b4bec8e5d413066af934
[5] https://github.com/fbertsch/schema_evolution_exploration
After discussion with :robotblake, the one difference that might be causing issues is the version of Presto packaged with EMR 5.4.0. Redash is currently powered by Presto hosted on EMR 5.1.0.

It may not be worth spending more effort to make my Superset instance compatible with Presto because of the eventual migration to Athena, Amazon's hosted flavor of Presto. Athena only officially exposes a JDBC driver, whereas Superset requires an ODBC driver for its SQLAlchemy backend.

I will try out another supported backend like PostgreSQL or Redshift, since the dataset in question is small (~50 MB a week for ~40 weeks, roughly 2 GB in total).
Assignee: amiyaguchi → nobody
Closing abandoned bugs in this product per https://bugzilla.mozilla.org/show_bug.cgi?id=1337972
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE
Product: Cloud Services → Cloud Services Graveyard