Closed Bug 1329023 Opened 7 years ago Closed 7 years ago

ReDash errors when parsing parquet files output by SyncView

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: tcsc, Assigned: robotblake)

References

Details

(Whiteboard: [fxsync])

Attachments

(2 files)

Sometimes the parquet data files output by SyncView.scala can't be read by ReDash.  Example error: https://irccloud.mozilla.com/pastebin/2A4o9YPP

I've been unable to replicate this outside of ReDash (e.g. using a Spark cluster), but it only happens on some queries inside ReDash (although it happens on most non-trivial ones), and I'm not 100% certain that what I tried would have reproduced the issue anyway.

This takes over from bug 1314864, which I had been keeping open to indicate that "this doesn't quite work yet".
Component: Telemetry → Metrics: Pipeline
Product: Toolkit → Cloud Services
Blocks: 1289536
Also note that this has broken our existing Sync dashboard: 
https://sql.telemetry.mozilla.org/dashboard/sync-error-dashboard?p_start_date=2016-10-01%2000:00:00.000&p_end_date=2016-10-16%2000:00:00.000&p_version=5&p_min_err=1

It also blocks an ongoing cross-functional effort to repair bookmark data, since we can't create the dashboards we need to measure our current corruption types and the impact of our upcoming repair efforts.
https://docs.google.com/document/d/1WDg71fGVTLyQ3jSqB2ZEdUepoZ36nasxvNYppxesnV0/edit
https://docs.google.com/spreadsheets/d/1iIvkEURziNFVZwzXCkRdt8oWQrDRl4sEr0YYikquMEI/edit#gid=0
Assignee: nobody → bimsland
Points: --- → 3
Priority: -- → P1
Sorry for the slow reply; I'm still looking into this issue and have been hitting a bit of a wall. Initially we thought it might be due to a schema mismatch, but that looks fine. It still appears to happen in newer versions of Presto, so reaching out to the Presto devs to see if they have any ideas might also be a possibility.

:amiyaguchi is going to jump in and help with this as well, since he has more experience with Spark than I do.
Flags: needinfo?(amiyaguchi)
I've replicated the bug using Spark [1], so we can cross Presto off as a suspect.

[1] https://gist.github.com/acmiyaguchi/4610f1a19748908a7314feb0d550fcae
Flags: needinfo?(amiyaguchi)
It looks like there are some corrupt (or invalid UTF-8) engine.name fields in a row, which is causing the failure. I'm attempting to write some code to get a dump of what is actually in those fields.
Worth noting that we only allow a small set of (entirely ASCII) names [0] through as of bug 1311557. The fix for that bug should be everywhere by now, but I'm willing to add a similar check to the SyncView Scala code if necessary.

That said, the old code was also able to handle the same pings without issue, so I *suspect* it's not a problem with the input (shouldn't the JSON parsing fail if the UTF-8 is invalid anyway?).

[0]: https://dxr.mozilla.org/mozilla-central/source/services/sync/modules/telemetry.js#57-58
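If we do end up adding that guard on the Scala side, a minimal sketch of what it could look like (the set contents and the helper name are assumptions for illustration, not the actual list from telemetry.js or the real SyncView code):

// Hypothetical guard: only keep engine names from a known, ASCII-only set.
val allowedEngineNames = Set(
  "addons", "bookmarks", "clients", "forms", "history",
  "passwords", "prefs", "tabs")

def sanitizeEngineName(name: String): Option[String] =
  Option(name).map(_.trim).filter(allowedEngineNames.contains)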
I've written a test case that demonstrates the JSON parser does fail on invalid UTF-8 [0]. In slightly more detail, I use the invalid surrogate pair "\udc00" as a test case. The default Java behavior will let "\udc00" survive parsing by using UTF-16, so I constructed a more realistic case of an invalid byte string. The latter case fails at parsing, which should be caught in the batch view logic.

The underlying bug can still be in the batch view script, but maybe the problem is more complicated than a corrupted string. We'll have to dig deeper and reverse engineer the problematic fields in the parquet binary. 

[0] https://github.com/mozilla/telemetry-batch-view/compare/mozilla:7139c4a...acmiyaguchi:34fb269
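As a standalone illustration of the byte-level case (not the actual test in the compare link above): a strict UTF-8 decode rejects an invalid byte sequence before any JSON parsing can even start, which is why a genuinely malformed byte string should never make it through ingestion the way the UTF-16 surrogate does.

import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, CodingErrorAction, StandardCharsets}

// 0xC3 opens a two-byte UTF-8 sequence but is followed by plain ASCII here,
// so this byte string is not valid UTF-8.
val badBytes = Array[Byte](0x22, 0xC3.toByte, 0x28, 0x22)

val decoder = StandardCharsets.UTF_8.newDecoder()
  .onMalformedInput(CodingErrorAction.REPORT)

try {
  decoder.decode(ByteBuffer.wrap(badBytes))
} catch {
  case e: CharacterCodingException => println(s"rejected invalid UTF-8: $e")
}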
As a small update, I have a simplified query that is broken because of this bug [0]. The case is a very small Spark query counting the number of engine names. This doesn't tell us anything new; it just makes the issue easier to replicate.

[0] https://gist.github.com/acmiyaguchi/4610f1a19748908a7314feb0d550fcae
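For anyone who wants to try it outside the notebook, the query is roughly this shape when run from a spark-shell (the dataset path is an assumption; substitute the partition you are testing):

import org.apache.spark.sql.functions.explode
import spark.implicits._

// Count engine names for one day of sync_summary; forcing a read of the
// engines.name column is what trips over a bad partition.
val df = spark.read.parquet(
  "s3://telemetry-parquet/sync_summary/v1/submission_date_s3=20161208")

df.select(explode($"engines").as("engine"))
  .groupBy($"engine.name")
  .count()
  .show()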
I am suspicious of the parquet-mr (reference implementation) library.

Blake and I have been testing the invalid partition locally using tooling provided by parquet [0]. Using the dump utility will cause parquet to blow up. More specifically, the following will throw an exception.
> java -jar target/parquet-tools-1.6.0rc3-SNAPSHOT.jar \
>     dump --columns engines.list.element.name --no-color \
>     part-r-00003-dc7b201d-2055-4b95-bcdd-17bc325fb6e7.snappy.parquet 

I also ran the file through the C++ implementation [1], using the parquet-scan and parquet_reader tools. This can be replicated with the following command, where column 14 refers to engines.name:
> ./build/latest/parquet-scan --column=14 part-r-00003-dc7b201d-2055-4b95-bcdd-17bc325fb6e7.snappy.parquet 
The C++ tool seemed to decode the parquet file without issues. This means there could potentially be a bug in the parquet-mr decoder (or encoder? or in the C++ implementation?).

The next steps will involve debugging/dumping the process where the exception occurs. I want to add this edge case to parquet's test suite (if it makes sense). I also want to regenerate the SyncView dataset for 20161218; maybe running it again will help.

[0] https://github.com/apache/parquet-mr/tree/master/parquet-tools
[1] https://github.com/apache/parquet-cpp
Rerunning the SyncView job for 20161218 seems to fix the issue for the test case. The job fails if a dataset already exists, which makes rerunning backfill from Airflow more difficult. I will manually replace problematic datasets with regenerated data to validate that this mitigates the issue.

We still don't know what the root cause of this issue is. The build of parquet-mr that generated the original corrupted data is from commit 32c46643845ea8a705c35d4ec8fc654cc8ff816d.
I wrote the date wrong in comment 8; it should be 20161208. Running the SyncView job again does not fix the issue.
We can try a binary search to pinpoint the raw ping(s) triggering this failure. One way to do that is to modify the summaries method [1] of the Dataset class to read just a segment of files from S3. Once we have found the raw file on S3 that causes the failure, it should be easy to determine which pings in that file are the problematic ones.

[1] https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/heka/Dataset.scala#L31
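The bisection itself doesn't need anything parquet-specific. A generic sketch, where readsCleanly is a hypothetical callback that runs the failing query over just the given S3 keys (e.g. through a modified summaries method):

// Assumes a single bad file; narrow the key list by halves until it is found.
def findBadFile(keys: Vector[String],
                readsCleanly: Seq[String] => Boolean): Option[String] =
  keys match {
    case Vector()     => None
    case Vector(only) => if (readsCleanly(Seq(only))) None else Some(only)
    case _ =>
      val (left, right) = keys.splitAt(keys.length / 2)
      if (!readsCleanly(left)) findBadFile(left, readsCleanly)
      else findBadFile(right, readsCleanly)
  }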
I've found that repartitioning the data before writing to parquet makes the dataset readable again. An interesting side effect of this is that there are now 0.05% more rows than there were previously.

I am writing a script to fix the invalid dates as they crop up in the original query. I will make the repair notebook available when I'm done with it.
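The repair step itself boils down to reading the broken day, repartitioning, and writing it back out. A rough sketch (the partition count and both paths are illustrative):

val day = "20161208"
val srcPath = s"s3://telemetry-parquet/sync_summary/v1/submission_date_s3=$day"
val tmpPath = s"s3://telemetry-parquet/sync_summary_repair/v1/submission_date_s3=$day"

// Read the problematic day, shuffle the rows via repartition, and write the
// result to a scratch location; the repaired output replaces the original
// afterwards (writing directly over the source would clobber it mid-read).
spark.read.parquet(srcPath)
  .repartition(100)
  .write
  .parquet(tmpPath)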
I have a notebook to identify broken partitions [0]. Once the pull request in comment 14 is merged, rerunning a troublesome day should fix the problem.

I think the underlying problem lies somewhere between the RDD -> DataFrame conversion and the collect -> parquet encoding logic. Repartitioning causes the data to be shuffled and reorganized in memory, which might fix the internal representation of the data. I have no definitive proof, only a workaround.

[0] https://gist.github.com/acmiyaguchi/e319fc2af5c61bd5229a3e13bcde67c1
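The core of that notebook's check is just forcing a full read of each day and recording the days that throw. A simplified sketch (the dates and path are illustrative):

import scala.util.Try

val days = Seq("20161201", "20161208", "20161218")

// .rdd.count() forces every row (and therefore every string column) to be
// decoded, which is what surfaces the corrupt values.
val broken = days.filterNot { day =>
  val path = s"s3://telemetry-parquet/sync_summary/v1/submission_date_s3=$day"
  Try(spark.read.parquet(path).rdd.count()).isSuccess
}

println(s"broken partitions: $broken")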
I reran the dashboard this morning and everything seems to be good.
This dashboard runs fine now: 
https://sql.telemetry.mozilla.org/dashboard/sync-error-dashboard?p_start_date=2017-01-01%2000:00:00.000&p_end_date=2017-01-30%2000:00:00.000&p_version=5&p_min_err=1

I can even tell that now that we are collecting data from release, I need to rethink how I plot my graphs. :-/
:robotblake ran the query again on Presto, and the dashboard seems to be in working order according to :adavis in comment 16. This bug was solved by repartitioning the data in TelemetryBatchView::SyncView before writing to S3. Bug 1333896 has been filed to track the underlying issue identified here.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard