Bug 1304421
Opened 8 years ago
Closed 8 years ago
Parquet2hive fails on non-parquet files
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: frank, Assigned: frank)
    parquet2hive s3://net-mozaws-prod-us-west-2-pipeline-analysis/mobile/android_events -dv v1

fails with:

    Analyzing dataset android_events, v1
    Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/mnt1/ -Xmx15000M -Xms1000M
    Could not read footer: java.lang.RuntimeException: file:/tmp/tmp9XOS_v is not a Parquet file (too small)
    Failure to parse dataset, 'NoneType' object has no attribute 'group'

Ideally, p2h should just ignore these files and continue processing.
Updated•8 years ago
Assignee: nobody → fbertsch
Points: --- → 1
Priority: -- → P1
Comment 1 • 8 years ago (Assignee)
Implemented a fix which ignores all directories that begin with '_'. While this solves our current problem, the import then fails on a later file:

    mobile/android_events/v1/channel=aurora/submission=20160919_$folder$

I'm keeping that original change, but I'm also going to implement a fix that skips over non-parquet files. They will still need to be downloaded in order to be verified, but at least they won't break the import.
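For illustration only, here is a minimal sketch of how a downloaded file could be verified before it is handed to the Parquet reader. It is not the parquet2hive code; the function name `looks_like_parquet` and the size constant are hypothetical. It relies on the Parquet file layout, where a file starts and ends with the 4-byte magic `PAR1`, so anything shorter than 12 bytes (header magic, footer length, trailing magic) cannot be valid.

```python
import os

PARQUET_MAGIC = b"PAR1"
# 4-byte leading magic + 4-byte footer length + 4-byte trailing magic
MIN_PARQUET_SIZE = 12


def looks_like_parquet(path):
    """Return True if the local file carries the Parquet magic at both ends.

    Hypothetical helper: a cheap pre-check, not a full footer validation.
    """
    if os.path.getsize(path) < MIN_PARQUET_SIZE:
        # Covers empty marker files such as the "too small" case in the report.
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

A caller could skip any file for which this returns False and continue with the next one, rather than aborting the whole import.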
Comment 2 • 8 years ago (Assignee)
We've decided that it's better to have a blacklist of files to exclude. The blacklist will include all files and directories that end with $folder$, which should solve this problem. Parquet2hive WILL still break on other non-parquet files, in which case we'll need to edit the blacklist again in the future.
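As a rough sketch of the blacklist idea (not the code from the actual fix), the filter below rejects any S3 key with a path component that begins with '_' or ends with `$folder$`. The name `is_excluded`, the suffix tuple, and the choice to apply both rules to every path component are assumptions made for the example.

```python
# Hypothetical blacklist rules, mirroring the two cases discussed in this bug.
EXCLUDED_PREFIX = "_"
EXCLUDED_SUFFIXES = ("$folder$",)


def is_excluded(key):
    """Return True if any component of the S3 key matches the blacklist."""
    parts = [p for p in key.split("/") if p]
    return any(
        p.startswith(EXCLUDED_PREFIX) or p.endswith(EXCLUDED_SUFFIXES)
        for p in parts
    )
```

With these rules, `mobile/android_events/v1/channel=aurora/submission=20160919_$folder$` is skipped, while a key under `submission=20160919/` is kept. Supporting a new kind of non-parquet file then means adding one entry to the suffix tuple.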
Comment 3 • 8 years ago (Assignee)
PR here: https://github.com/mozilla/parquet2hive/pull/16
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: Cloud Services → Cloud Services Graveyard