Closed Bug 1304421 Opened 8 years ago Closed 8 years ago

Parquet2hive fails on non-parquet files

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: frank, Assigned: frank)

parquet2hive s3://net-mozaws-prod-us-west-2-pipeline-analysis/mobile/android_events -dv v1
fails with:

Analyzing dataset android_events, v1
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/mnt1/ -Xmx15000M -Xms1000M
Could not read footer: java.lang.RuntimeException: file:/tmp/tmp9XOS_v is not a Parquet file (too small)
Failure to parse dataset, 'NoneType' object has no attribute 'group'

Ideally, p2h should just ignore these files and continue processing.
Assignee: nobody → fbertsch
Points: --- → 1
Priority: -- → P1
Implemented a fix which ignores all directories that begin with '_'. While this solves our current problem, the import then fails on a later file: mobile/android_events/v1/channel=aurora/submission=20160919_$folder$

While I'm keeping that original change, I'm also going to implement a fix that will skip over non-parquet files. They will still need to be downloaded in order to be verified, but at least it won't break the import.
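
A minimal sketch of how a downloaded file could be checked before it is handed to the Parquet reader. This is not the code from parquet2hive; the function name looks_like_parquet and the constants are hypothetical. It relies only on the Parquet file layout, which starts and ends with the 4-byte marker "PAR1", so the smallest possible file is 12 bytes (marker, 4-byte footer length, marker). Files with encrypted footers use a different marker and would be rejected by this check.

```python
import os

# Plain (unencrypted) Parquet files begin and end with this marker.
PARQUET_MAGIC = b"PAR1"
# Leading marker + 4-byte footer length + trailing marker.
MIN_PARQUET_SIZE = 12


def looks_like_parquet(path):
    """Return True if the local file at `path` has Parquet's magic bytes.

    Hypothetical helper: it inspects only the first and last four bytes,
    so it says nothing about whether the footer itself is valid.
    """
    if os.path.getsize(path) < MIN_PARQUET_SIZE:
        # Corresponds to the "too small" case in the log above.
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

A caller could skip any file for which this returns False and continue with the next one, rather than aborting the whole import.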
We've decided that it's better to have a blacklist of files to exclude. The blacklist will cover all files and directories that end with $folder$, which should solve this problem.

Parquet2hive WILL still break on any non-parquet file that is not on the blacklist, in which case we'll need to edit the blacklist again in the future.
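
For illustration, the two exclusion rules discussed in this bug (directories beginning with '_', and files or directories ending with $folder$) could be expressed roughly as follows. This is a sketch under assumptions, not the code from the PR: the function name is_ignored and the constants are hypothetical, and the part-0.parquet key in the usage lines is made up.

```python
# Hypothetical blacklist rules, taken from the comments in this bug.
IGNORED_DIR_PREFIX = "_"
IGNORED_SUFFIX = "$folder$"


def is_ignored(key):
    """Return True if an S3 key should be skipped by the import.

    Assumes `key` is a '/'-separated S3 key; a trailing '/' marks a directory.
    """
    parts = [p for p in key.split("/") if p]
    # Every component except the last is a directory, unless the key
    # itself ends with '/', in which case all components are directories.
    dirs = parts if key.endswith("/") else parts[:-1]
    if any(d.startswith(IGNORED_DIR_PREFIX) for d in dirs):
        return True
    # The suffix rule applies to files and directories alike.
    return any(p.endswith(IGNORED_SUFFIX) for p in parts)


keys = [
    "mobile/android_events/v1/channel=aurora/submission=20160919_$folder$",
    "mobile/android_events/v1/channel=aurora/submission=20160919/part-0.parquet",
]
kept = [k for k in keys if not is_ignored(k)]
```

Keeping the rules in two named constants means a new pattern can be added in one place when another non-parquet file turns up.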
Blocks: 1305429
PR here: https://github.com/mozilla/parquet2hive/pull/16
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard