Closed Bug 893868 Opened 11 years ago Closed 9 years ago

Native map/reduce job needed for handling orphaned submissions

Categories

(Firefox Health Report Graveyard :: Data Collection, defect)

Tracking

RESOLVED INVALID

People

(Reporter: dre, Assigned: scabral)

References

Details

While bug 883277 tracks several improvements that reduce the number of orphaned documents, we will continue to receive them from older clients for an indefinite period, and there is always the potential for new or undetected problems to cause orphaning in the future.

The goal is to have a regularly scheduled job that will scan the main table of submissions and filter out any documents that are determined to have a high likelihood of being orphans.

Below are the major requirements for the job:

1. Calculate sets of documents that have a similar signature and are likely duplicates (a reduce-step sketch of items 1 and 2 follows this list)

2. Select one of these documents (the most recently submitted) to be the "head record"

3. Optionally delete the non-head records from the main table

4. Optionally write the head records to a "snapshot" table

5. Count the three classifications of documents:
5a. Documents that have a unique signature
5b. Documents that were selected as head records
5c. Documents that were classified as orphans and subject to deletion

(Note: 5a + 5b should equal the total number of valid documents in the table)

6. Optionally (configured per app channel) copy up to M documents for N signatures with orphans to an analysis table

7. Log the counts and any actions taken (i.e. delete, snapshot, sampling copy) for diagnostics and trending analysis
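For illustration, here is a minimal sketch of how requirements 1 and 2 could be expressed as a reduce step. This is not the repo's code; the class, counter, and field names are hypothetical, and it assumes the map phase emits (signature, "timestamp<TAB>documentId") pairs:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical sketch, not the repo's ReadHBaseWriteHdfs code.
    // Input key: document signature; input values: "timestamp\tdocumentId".
    public class OrphanFinderReducer extends Reducer<Text, Text, Text, Text> {

        public enum SketchStats { UNIQUE_SIGNATURES, HEAD_RECORDS, ORPHAN_RECORDS };

        @Override
        protected void reduce(Text signature, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String headId = null;
            long headTimestamp = Long.MIN_VALUE;
            List<String> documentIds = new ArrayList<String>();

            for (Text value : values) {
                String[] parts = value.toString().split("\t", 2);
                long timestamp = Long.parseLong(parts[0]);
                String documentId = parts[1];
                documentIds.add(documentId);
                // Requirement 2: the most recently submitted document wins.
                if (timestamp > headTimestamp) {
                    headTimestamp = timestamp;
                    headId = documentId;
                }
            }

            if (documentIds.size() == 1) {
                // 5a: unique signature, nothing to deduplicate.
                context.getCounter(SketchStats.UNIQUE_SIGNATURES).increment(1);
            } else {
                // 5b: one head record selected among the duplicates.
                context.getCounter(SketchStats.HEAD_RECORDS).increment(1);
            }

            context.write(new Text(headId), new Text("HEAD"));
            for (String documentId : documentIds) {
                if (!documentId.equals(headId)) {
                    // 5c: orphan, subject to deletion by a later job.
                    context.getCounter(SketchStats.ORPHAN_RECORDS).increment(1);
                    context.write(new Text(documentId), new Text("ORPHAN"));
                }
            }
        }
    }

Picking the maximum submission timestamp per signature is what makes the most recently submitted document the head record; everything else under the same signature is treated as an orphan.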
Code to scan the HBase table, read the data from HBase, and write it to HDFS is committed at https://github.com/mozilla-metrics/fhr-hadoop-hbase

The above repo also includes:
1. A script to convert a file to TSV format for importing into HBase.
2. A Pig script to scan the table and dump the raw output.
3. https://github.com/mozilla-metrics/fhr-hadoop-hbase/blob/master/src/com/mozilla/main/ReadHDFSDeleteHBase.java
Reads a list of ids (CSV) from HDFS and performs bulk deletes in chunks of 1000 (a hedged sketch of this pattern is below).
Deleted 350M ids as per bcolloran's last output.
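A minimal sketch of that chunked bulk-delete pattern (hypothetical class, table, and constant names; it assumes the HBase HTable client API of that era, not necessarily the exact calls used in ReadHDFSDeleteHBase):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    // Hypothetical sketch: delete a list of row ids from the main table in
    // batches of 1000 instead of issuing one Delete per row.
    public class ChunkedDeleteSketch {
        private static final int BATCH_SIZE = 1000;

        public static void deleteIds(Iterable<String> ids, String tableName) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, tableName);
            try {
                List<Delete> batch = new ArrayList<Delete>(BATCH_SIZE);
                for (String id : ids) {
                    batch.add(new Delete(Bytes.toBytes(id)));
                    if (batch.size() >= BATCH_SIZE) {
                        table.delete(batch);  // one round trip per chunk
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) {
                    table.delete(batch);      // final partial chunk
                }
            } finally {
                table.close();
            }
        }
    }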

-anurag
Anurag, could you provide a status update with the current state of the all-in-one job that has the features described in comment #0?  It is okay if it isn't far along because of the intermediate jobs / tests, but I want to make sure we are capturing the evolution of the desired final product.
1. Calculate sets of documents that have a similar signature and are likely duplicates
DONE. 
https://github.com/mozilla-metrics/fhr-hadoop-hbase/blob/master/src/com/mozilla/main/ReadHBaseWriteHdfs.java

2. Select one of these documents (the most recently submitted) to be the "head record"
DONE.
https://github.com/mozilla-metrics/fhr-hadoop-hbase/blob/master/src/com/mozilla/main/ReadHBaseWriteHdfs.java

3. Optionally delete the non-head records from the main table
DONE.
This isn't part of the above script; it's a separate script at https://github.com/mozilla-metrics/fhr-hadoop-hbase/blob/master/src/com/mozilla/main/ReadHDFSDeleteHBase.java
The main reason for keeping it separate is that it can do batch deletes instead of individual deletes, which take much longer.

4. Optionally write the head records to a "snapshot" table
IN PROGRESS.
tmary imported the snapshot from HDFS to another HBase table. Awaiting a response from gps as to whether the column-family approach is what he wants in terms of speed and flexibility.
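While that column-family layout is still being decided, here is a hedged sketch of what writing a head record into a separate snapshot table might look like (the table name, family, and qualifier below are placeholders, not the agreed-on schema):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // Hypothetical sketch: copy one head record, keyed by document id, into a
    // snapshot table under a single placeholder column family.
    public class SnapshotWriterSketch {
        public static void writeHeadRecord(String documentId, byte[] payloadJson) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable snapshot = new HTable(conf, "fhr_snapshot");  // placeholder table name
            try {
                Put put = new Put(Bytes.toBytes(documentId));
                // "data:json" is a placeholder family/qualifier pending the schema decision.
                put.add(Bytes.toBytes("data"), Bytes.toBytes("json"), payloadJson);
                snapshot.put(put);
            } finally {
                snapshot.close();
            }
        }
    }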

5. Count the three classifications of documents:
5a. Documents that have a unique signature
5b. Documents that were selected as head records
5c. Documents that were classified as orphans and subject to deletion
AND
7. Log the counts and any actions taken (i.e. delete, snapshot, sampling copy) for diagnostics and trending analysis
DONE.

All the above scripts have counters in them:
for finding orphans + head:
        public enum ReportStats { REDUCER_LINES, DUPLICATE_RECORDS, DUPLICATE_RECORD_COUNT, HEAD_RECORDS, NO_ID_TO_JSON_MAPPING, HEAD_RECORD_COUNT_OUT_OF_DUPLICATE_RECORD};
        
for delete:
        static enum DELETE_PROGRESS { ERROR_BATCH_DELETE, IDS_TO_DELETE, BATCH_DELETE_INVOKED }; 
        
The counters track more information than what's requested in 5a, 5b, and 5c, and they also help with debugging and with understanding job progress in general.
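For reference, counters like these are bumped from inside the map/reduce tasks via the task context, along the lines of (hedged illustration, not the exact repo code):

    // Inside a reduce() method, using the ReportStats enum shown above.
    // duplicateIds is a hypothetical local list of duplicate document ids.
    context.getCounter(ReportStats.HEAD_RECORDS).increment(1);
    context.getCounter(ReportStats.DUPLICATE_RECORD_COUNT).increment(duplicateIds.size());

Hadoop aggregates these per-task increments into the job's counters, which is what makes them usable for the counts in 5 and the logging/trending in 7.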

6. Optionally (configured per app channel) copy up to M documents for N signatures with orphans to an analysis table
IN PROGRESS.

Summing up, only item 4 remains; I will commit that code to GitHub once gps/tmary confirm the new table approach.
Assignee: aphadke → scabral
Adding tmary to deal with this deorphaning pipeline bug; not sure if there's still anything left to do.
No longer relevant

--
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → INVALID
Product: Firefox Health Report → Firefox Health Report Graveyard