Closed Bug 1377548 Opened 7 years ago Closed 7 years ago

Main_summary queries keep failing (when called from PySpark) with no such file /directory error

Categories

(Data Platform and Tools Graveyard :: Presto, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joy, Unassigned)

Details

This is my query


    select
        client_id,
        sum(case
                when active_ticks is null then 0
                else active_ticks * 5
            end) as usg,
        last(case
                 when profile_creation_date <= '16968' then 1
                 else 30 / (16997 - profile_creation_date + 1)
             end) as opportunity
    from main_summary
    where app_name = 'Firefox'
      and submission_date >= '20160616'
      and submission_date <= '20160715'
      and profile_creation_date <= 16997
      and sample_id < '5'
    group by 1


and I get this error:

Py4JJavaError: An error occurred while calling o52.sql.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1709 in stage 0.0 failed 4 times, most recent failure: Lost task 1709.3 in stage 0.0 (TID 1766, ip-172-31-5-86.us-west-2.compute.internal): java.io.FileNotFoundException: No such file or directory 's3://telemetry-parquet/main_summary/v4/submission_date_s3=20160601/sample_id=0'
        at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:812)

(  No such file or directory 's3://telemetry-parquet/main_summary/v4/submission_date_s3=20160601/sample_id=0' )

(If I rerun this, I get the same error, but the missing file has a different sample_id.)
This appears to work fine now. Note, though, that filtering on `submission_date_s3` will perform much better than filtering on `submission_date`, since `submission_date_s3` is a partitioning field.
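
As a rough illustration of that suggestion, the sketch below rebuilds the reporter's query with the date filter moved onto `submission_date_s3`. It assumes the partition values are `yyyymmdd` strings, as the S3 path in the error message suggests; the helper name `build_usage_query` is hypothetical, and the rest of the query is copied unchanged from the report.

```python
from datetime import date


def build_usage_query(start: date, end: date) -> str:
    """Return the reporter's query, filtered on the partition column.

    Assumption: submission_date_s3 is a string partition formatted as
    yyyymmdd, as in the path .../submission_date_s3=20160601/sample_id=0.
    """
    if start > end:
        raise ValueError("start must not be after end")
    lo = start.strftime("%Y%m%d")
    hi = end.strftime("%Y%m%d")
    return f"""
select
    client_id,
    sum(case when active_ticks is null then 0 else active_ticks * 5 end) as usg,
    last(case
             when profile_creation_date <= '16968' then 1
             else 30 / (16997 - profile_creation_date + 1)
         end) as opportunity
from main_summary
where app_name = 'Firefox'
  and submission_date_s3 >= '{lo}'
  and submission_date_s3 <= '{hi}'
  and profile_creation_date <= 16997
  and sample_id < '5'
group by 1
"""


# Same date range as the original report.
query = build_usage_query(date(2016, 6, 16), date(2016, 7, 15))
# In a PySpark session this string would be handed to the SQL entry point
# (for example sqlContext.sql(query)); that call is omitted here so the
# sketch runs without a cluster.
```

Filtering on the partition column should let the engine skip directories outside the date range rather than scanning every `submission_date_s3=...` prefix, which is presumably why it performs better.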
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Product: Data Platform and Tools → Data Platform and Tools Graveyard