Closed
Bug 1344817
Opened 7 years ago
Closed 7 years ago
Evaluate Superset for multidimensional retention plots
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, enhancement, P3)
Tracking
(Not tracked)
RESOLVED
INCOMPLETE
People
(Reporter: amiyaguchi, Unassigned)
Superset [1] is an open-source data exploration tool by Airbnb. It is the default data exploration and visualization tool packaged with Apache Druid. This tool may be useful for implementing multi-dimensional retention plots, as requested in bug 1337037. Superset can connect to various data backends via SQLAlchemy [2]. A Presto/Hive connector is supported, allowing this tool to be set up in a similar fashion to redash.

[1] https://github.com/airbnb/superset
[2] http://airbnb.io/superset/installation.html#
Updated • 7 years ago
Points: --- → 3
Priority: -- → P3
Updated by reporter • 7 years ago
Assignee: nobody → amiyaguchi
Comment 1 by reporter • 7 years ago
The default Presto instance that comes packaged with EMR 5.4.0 does not handle the churn dataset well [3].

SETUP ENVIRONMENT:

Instantiate a Presto cluster on EMR 5.4.0 using 3 x m3.xlarge. Create an instance of redis in ElastiCache, allowing all incoming traffic between the security groups of the redis and master nodes. Install Superset on the master node using the instructions in [1].

Add the configuration [4] to the PYTHONPATH, which can also be added to the virtual environment.

> echo 'export PYTHONPATH=/home/hadoop' >> /home/hadoop/venv/bin/activate

This assumes that your working path is /home/hadoop. The configuration file changes the Superset port to 9999 to avoid a port conflict, and adds the route to the redis instance for asynchronous queries with Presto.

Next, configure Presto to mirror the deployment configuration backing redash. `hive.parquet.use-column-names=true` is applied to account for schema evolution [5].

> echo 'hive.parquet.use-column-names=true' | sudo tee -a /etc/presto/conf.dist/catalog/hive.properties
> sudo restart presto-server

The churn data is added through the following call to parquet2hive (p2h).

> parquet2hive -ulv s3://telemetry-parquet/churn | bash

Run the server and the worker in two separate sessions (screen is useful).

> superset runserver
> superset worker

In the sources tab, add a new source using the local Presto connection string.

> presto://localhost:8889/hive/default

Enable SQL Lab and Async.

The next step in fully evaluating Superset is to query the data. However, Presto keeps falling over on simple queries. I think this could be related to the configuration of the parquet data, but it might also be a flag that I'm missing for Presto.

[3] https://gist.github.com/acmiyaguchi/6ec7708ec0df920744baceb0015247d1
[4] https://gist.github.com/acmiyaguchi/9c8086939051b4bec8e5d413066af934
[5] https://github.com/fbertsch/schema_evolution_exploration
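For readers without access to the gist, here is a rough sketch of what a configuration file with the two changes described in this comment (port 9999 and a redis route for asynchronous queries) might look like. It is not the contents of [4]. The redis hostname is a placeholder, and the setting names (`SUPERSET_WEBSERVER_PORT`, `CELERY_CONFIG`, and the `CeleryConfig` attributes) follow the Superset documentation of that period as best I recall, so they should be checked against the installed version.

```python
# superset_config.py: hypothetical sketch, NOT the contents of gist [4].
# Superset imports this module if its directory (e.g. /home/hadoop) is on PYTHONPATH.

# Placeholder ElastiCache endpoint (assumption); replace with the real redis host.
REDIS_HOST = "my-redis.example.cache.amazonaws.com"
REDIS_URL = "redis://{}:6379/0".format(REDIS_HOST)

# Serve the web UI on 9999 instead of the default port to avoid a conflict.
SUPERSET_WEBSERVER_PORT = 9999


class CeleryConfig(object):
    """Celery settings read by `superset worker` for asynchronous SQL Lab queries."""
    BROKER_URL = REDIS_URL                  # queue that carries query tasks
    CELERY_IMPORTS = ("superset.sql_lab",)  # module that defines the query task
    CELERY_RESULT_BACKEND = REDIS_URL       # where task state is stored


CELERY_CONFIG = CeleryConfig
```

The security group rule mentioned above is what lets the master node reach the redis endpoint named in `REDIS_URL`.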
Comment 2 by reporter • 7 years ago
After discussion with :robotblake, the one difference that might be causing issues is the version of Presto packaged with EMR 5.4.0. Redash is currently powered by Presto hosted on EMR 5.1.0.

It may not be worth spending more effort to make my Superset instance compatible with Presto because of the eventual migration to Athena, an Amazon-flavored hosted Presto. Athena only officially exposes a JDBC driver, whereas Superset requires an ODBC driver for its SQLAlchemy backend.

I will try out another supported backend like PostgreSQL or Redshift, since the dataset in question is small (~50 MB a week for ~40 weeks).
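As a rough check on "small", the figures quoted above work out to about 2 GB in total. The snippet below is only back-of-the-envelope arithmetic on those approximate numbers, not a measurement of the actual churn dataset.

```python
# Approximate figures quoted in the comment above.
weekly_mb = 50   # ~50 MB of churn data per week
weeks = 40       # ~40 weeks of history

total_mb = weekly_mb * weeks
total_gb = total_mb / 1000.0  # decimal gigabytes

print("approx. total: {} MB ({} GB)".format(total_mb, total_gb))
```

A dataset of this size is modest for a single relational instance.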
Updated by reporter • 7 years ago
Assignee: amiyaguchi → nobody
Comment 3 • 7 years ago
Closing abandoned bugs in this product per https://bugzilla.mozilla.org/show_bug.cgi?id=1337972
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE
Updated • 6 years ago
Product: Cloud Services → Cloud Services Graveyard