Closed Bug 893868 Opened 11 years ago Closed 9 years ago

Native map/reduce job needed for handling orphaned submissions

Categories

(Firefox Health Report Graveyard :: Data Collection, defect)

Tracking

RESOLVED INVALID

People

(Reporter: dre, Assigned: scabral)

References

Details

While bug 883277 tracks several improvements that reduce the number of orphaned documents, we will continue to receive them from older clients for an indefinite period, and there is always the potential for new or undetected problems to cause orphaning in the future.

The goal is to have a regularly scheduled job that will scan the main table of submissions and filter out any documents that are determined to have a high likelihood of being orphans.

Below are the major requirements for the job:

1. Calculate sets of documents that have a similar signature and are likely duplicates (a reduce-step sketch of items 1 and 2 follows this list)

2. Select one of these documents (the most recently submitted) to be the "head record"

3. Optionally delete the non-head records from the main table

4. Optionally write the head records to a "snapshot" table

5. Count the three classifications of documents:
5a. Documents that have a unique signature
5b. Documents that were selected as head records
5c. Documents that were classified as orphans and subject to deletion

(Note: 5a + 5b should equal the total number of valid documents in the table)

6. Optionally (configured per app channel) copy up to M documents for N signatures with orphans to an analysis table

7. Log the counts and any actions taken (i.e. delete, snapshot, sampling copy) for diagnostics and trending analysis
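For illustration, here is a minimal sketch of how requirements 1 and 2 could be expressed as a reduce step. This is not the repo's code; the class, counter, and field names are hypothetical, and it assumes the map phase emits (signature, "timestamp<TAB>documentId") pairs:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical sketch, not the repo's ReadHBaseWriteHdfs code.
    // Input key: document signature; input values: "timestamp\tdocumentId".
    public class OrphanFinderReducer extends Reducer<Text, Text, Text, Text> {

        public enum SketchStats { UNIQUE_SIGNATURES, HEAD_RECORDS, ORPHAN_RECORDS };

        @Override
        protected void reduce(Text signature, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String headId = null;
            long headTimestamp = Long.MIN_VALUE;
            List<String> documentIds = new ArrayList<String>();

            for (Text value : values) {
                String[] parts = value.toString().split("\t", 2);
                long timestamp = Long.parseLong(parts[0]);
                String documentId = parts[1];
                documentIds.add(documentId);
                // Requirement 2: the most recently submitted document wins.
                if (timestamp > headTimestamp) {
                    headTimestamp = timestamp;
                    headId = documentId;
                }
            }

            if (documentIds.size() == 1) {
                // 5a: unique signature, nothing to deduplicate.
                context.getCounter(SketchStats.UNIQUE_SIGNATURES).increment(1);
            } else {
                // 5b: one head record selected among the duplicates.
                context.getCounter(SketchStats.HEAD_RECORDS).increment(1);
            }

            context.write(new Text(headId), new Text("HEAD"));
            for (String documentId : documentIds) {
                if (!documentId.equals(headId)) {
                    // 5c: orphan, subject to deletion by a later job.
                    context.getCounter(SketchStats.ORPHAN_RECORDS).increment(1);
                    context.write(new Text(documentId), new Text("ORPHAN"));
                }
            }
        }
    }

Picking the maximum submission timestamp per signature is what makes the most recently submitted document the head record; everything else under the same signature is treated as an orphan.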
Code to scan the HBase table, read the data from HBase, and write it to HDFS is committed at https://github.com/mozilla-metrics/fhr-hadoop-hbase

The above repo also includes:
1. A script to convert a file to TSV format for importing into HBase.
2. A Pig script to scan the table and dump the raw output.
3. https://github.com/mozilla-metrics/fhr-hadoop-hbase/blob/master/src/com/mozilla/main/ReadHDFSDeleteHBase.java
Reads a list of ids (CSV) from HDFS and performs bulk deletes in chunks of 1000 (a hedged sketch of this pattern is below).
Deleted 350M ids as per bcolloran's last output.
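A minimal sketch of that chunked bulk-delete pattern (hypothetical class, table, and constant names; it assumes the HBase HTable client API of that era, not necessarily the exact calls used in ReadHDFSDeleteHBase):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    // Hypothetical sketch: delete a list of row ids from the main table in
    // batches of 1000 instead of issuing one Delete per row.
    public class ChunkedDeleteSketch {
        private static final int BATCH_SIZE = 1000;

        public static void deleteIds(Iterable<String> ids, String tableName) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, tableName);
            try {
                List<Delete> batch = new ArrayList<Delete>(BATCH_SIZE);
                for (String id : ids) {
                    batch.add(new Delete(Bytes.toBytes(id)));
                    if (batch.size() >= BATCH_SIZE) {
                        table.delete(batch);  // one round trip per chunk
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) {
                    table.delete(batch);      // final partial chunk
                }
            } finally {
                table.close();
            }
        }
    }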

-anurag
Anurag, could you provide a status update with the current state of the all-in-one job that has the features described in comment #0?  It is okay if it isn't far along because of the intermediate jobs / tests, but I want to make sure we are capturing the evolution of the desired final product.
1. Calculate sets of documents that have a similar signature and are likely duplicates
DONE. 
https://github.com/mozilla-metrics/fhr-hadoop-hbase/blob/master/src/com/mozilla/main/ReadHBaseWriteHdfs.java

2. Select one of these documents (the most recently submitted) to be the "head record"
DONE.
https://github.com/mozilla-metrics/fhr-hadoop-hbase/blob/master/src/com/mozilla/main/ReadHBaseWriteHdfs.java

3. Optionally delete the non-head records from the main table
DONE.
This isn't part of the above script; it's a separate script at https://github.com/mozilla-metrics/fhr-hadoop-hbase/blob/master/src/com/mozilla/main/ReadHDFSDeleteHBase.java
The main reason for keeping it separate is that it can do batch deletes instead of individual deletes, which take much longer.

4. Optionally write the head records to a "snapshot" table
IN PROGRESS.
tmary imported the snapshot from HDFS to another HBase table. Awaiting a response from gps as to whether the column-family approach is what he wants in terms of speed and flexibility.
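While that column-family layout is still being decided, here is a hedged sketch of what writing a head record into a separate snapshot table might look like (the table name, family, and qualifier below are placeholders, not the agreed-on schema):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // Hypothetical sketch: copy one head record, keyed by document id, into a
    // snapshot table under a single placeholder column family.
    public class SnapshotWriterSketch {
        public static void writeHeadRecord(String documentId, byte[] payloadJson) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable snapshot = new HTable(conf, "fhr_snapshot");  // placeholder table name
            try {
                Put put = new Put(Bytes.toBytes(documentId));
                // "data:json" is a placeholder family/qualifier pending the schema decision.
                put.add(Bytes.toBytes("data"), Bytes.toBytes("json"), payloadJson);
                snapshot.put(put);
            } finally {
                snapshot.close();
            }
        }
    }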

5. Count the three classifications of documents:
5a. Documents that have a unique signature
5b. Documents that were selected as head records
5c. Documents that were classified as orphans and subject to deletion
AND
7. Log the counts and any actions taken (i.e. delete, snapshot, sampling copy) for diagnostics and trending analysis
DONE.

All the above scripts have counters in them:
for finding orphans + head:
        public enum ReportStats { REDUCER_LINES, DUPLICATE_RECORDS, DUPLICATE_RECORD_COUNT, HEAD_RECORDS, NO_ID_TO_JSON_MAPPING, HEAD_RECORD_COUNT_OUT_OF_DUPLICATE_RECORD};
        
for delete:
        static enum DELETE_PROGRESS { ERROR_BATCH_DELETE, IDS_TO_DELETE, BATCH_DELETE_INVOKED }; 
        
The counters track more information than what's requested in 5a, 5b, and 5c, and they also help with debugging and with understanding job progress in general.
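For reference, counters like these are bumped from inside the map/reduce tasks via the task context, along the lines of (hedged illustration, not the exact repo code):

    // Inside a reduce() method, using the ReportStats enum shown above.
    // duplicateIds is a hypothetical local list of duplicate document ids.
    context.getCounter(ReportStats.HEAD_RECORDS).increment(1);
    context.getCounter(ReportStats.DUPLICATE_RECORD_COUNT).increment(duplicateIds.size());

Hadoop aggregates these per-task increments into the job's counters, which is what makes them usable for the counts in 5 and the logging/trending in 7.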

6. Optionally (configured per app channel) copy up to M documents for N signatures with orphans to an analysis table
IN PROGRESS.

Summing up, only item 4 remains; I will commit that code to GitHub once gps/tmary confirm the new table approach.
Assignee: aphadke → scabral
Adding tmary to deal with this deorphaning pipeline bug; not sure if there's still anything left to do.
No longer relevant

--
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → INVALID
Product: Firefox Health Report → Firefox Health Report Graveyard