Closed Bug 1108622 Opened 10 years ago Closed 9 years ago

[back-end] Implement Folding Tests for URL classification on Moreover corpus

Categories

(Content Services Graveyard :: Classification Engine, defect)

x86
macOS
defect
Not set
normal
Points:
13

Tracking

(Not tracked)

RESOLVED FIXED
Iteration:
38.1 - 26 Jan

People

(Reporter: mzhilyaev, Assigned: mzhilyaev)

References

Details

(Whiteboard: .?)

We need to test classification accuracy and recall on the Moreover corpus by folding the corpus into an older chunk and a recent chunk: training occurs on the older chunk, and the result is applied to the recent chunk.
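The older/recent fold described here can be sketched roughly as follows. This is only an illustration: the `date` field, its yyyy/mm/dd string format, and the helper names `parseDay` and `foldByDate` are assumptions, not the actual corpus schema or code.

```javascript
// Sketch only: a `date` field holding a yyyy/mm/dd string is an assumption,
// not the real Moreover corpus schema.
function parseDay(s) {
  const [year, month, day] = s.split('/').map(Number);
  return Date.UTC(year, month - 1, day);
}

// Documents dated before `cutoff` go to the older (training) chunk;
// documents dated on or after it go to the recent (evaluation) chunk.
function foldByDate(docs, cutoff) {
  const cutoffTime = parseDay(cutoff);
  const older = [];
  const recent = [];
  for (const doc of docs) {
    (parseDay(doc.date) < cutoffTime ? older : recent).push(doc);
  }
  return { older, recent };
}

// Made-up sample documents for illustration.
const sample = [
  { url: 'http://a.example/1', date: '2014/09/15' },
  { url: 'http://b.example/2', date: '2014/10/01' },
  { url: 'http://c.example/3', date: '2014/11/20' },
];
const folds = foldByDate(sample, '2014/10/01');
console.log(folds.older.length, folds.recent.length);
```

Splitting strictly by time, rather than at random, keeps any document from the evaluation period out of the training chunk.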

If user history is incorporated into URL classification via model fitting (bug 1104335), the testing should be extended to recency folding as well. However, that first requires developing a user model that synthesizes history from site popularity.
Points: --- → 13
Whiteboard: .?
Blocks: 1104322
We should have a general way to build a rule set from one subset of the corpus and apply it to another subset, to allow for folding tests. However, there is currently no algorithmic way to compute a rule set; all changes are manual updates to the Matthew payload. Bug 1104364 is filed for automating rule generation.
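As a rough illustration of "build on one subset, apply to another", the sketch below uses a toy stand-in for rule generation: each host is mapped to the category seen most often for it in the training subset. This is not the project's rule format or the algorithm proposed in bug 1104364. The `url` and `category` fields, the function names, and the simplified precision/recall definitions are all assumptions.

```javascript
// Toy rule builder (hypothetical): host -> most frequent category in training.
function buildRules(trainDocs) {
  const counts = new Map();
  for (const { url, category } of trainDocs) {
    const host = new URL(url).hostname;
    if (!counts.has(host)) counts.set(host, new Map());
    const byCategory = counts.get(host);
    byCategory.set(category, (byCategory.get(category) || 0) + 1);
  }
  const rules = new Map();
  for (const [host, byCategory] of counts) {
    let best = null;
    for (const [category, n] of byCategory) {
      if (best === null || n > best.n) best = { category, n };
    }
    rules.set(host, best.category);
  }
  return rules;
}

// Applies the rules to a held-out subset.
// precision = correct / classified, recall = correct / all labelled docs.
function evaluate(rules, testDocs) {
  let classified = 0;
  let correct = 0;
  for (const { url, category } of testDocs) {
    const predicted = rules.get(new URL(url).hostname);
    if (predicted === undefined) continue; // no rule matched this host
    classified += 1;
    if (predicted === category) correct += 1;
  }
  return {
    precision: classified ? correct / classified : 0,
    recall: testDocs.length ? correct / testDocs.length : 0,
  };
}

// Made-up training and test subsets for illustration.
const train = [
  { url: 'http://sports.example/a', category: 'sports' },
  { url: 'http://sports.example/b', category: 'sports' },
  { url: 'http://news.example/c', category: 'politics' },
];
const test = [
  { url: 'http://sports.example/d', category: 'sports' },
  { url: 'http://news.example/e', category: 'sports' },
  { url: 'http://other.example/f', category: 'tech' },
];
const result = evaluate(buildRules(train), test);
console.log(result);
```

Keeping rule building and evaluation as separate functions means a real rule generator could replace the toy one without changing the fold test itself.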
Depends on: 1109962
Depends on: 1109967
No longer depends on: 1109967
Summary: [back-end] Testing URL classification on Moreover corpus → [back-end] Implement Folding Tests for URL classification on Moreover corpus
Blocks: 1104329
No longer blocks: 1104322
Iteration: --- → 37.3 - 12 Jan
Iteration: 37.3 - 12 Jan → 38.1 - 26 Jan
This generic mechanism was implemented by allowing DFR generation and rule generation to be run on a specific chunk of the corpus. Both scripts accept -f date and -t date arguments to select documents by date range.
Assignee: nobody → mzhilyaev
./generateCorpusStats.js -h
USAGE: generateCorpusStats.js [OPTIONS]
Generates Corpus URL and Title stats

  -h, --help          display this help
  -v, --verbous       display debug info
  -d, --dbHost=ARG    db hosts: default=localhost
  -p, --dbPort=ARG    db port: default=27017
  -f, --fromDate=ARG  starting from date in yyyy/mm/dd format (like 2014/10/01): default none
  -t, --toDate=ARG    ending from date in yyyy/mm/dd format (like 2014/10/01): default none
  -l, --limit=ARG     docs limit: default none


./generateDFRStats.js -h
USAGE: generateDFRStats.js [OPTIONS] [DFR FILES]
Generates DFR matching stats

  -h, --help          display this help
  -v, --verbous       display debug info
  -d, --dbHost=ARG    db hosts: default=localhost
  -p, --dbPort=ARG    db port: default=27017
  -f, --fromDate=ARG  starting from date in yyyy/mm/dd format (like 2014/10/01): default none
  -t, --toDate=ARG    ending from date in yyyy/mm/dd format (like 2014/10/01): default none
  -l, --limit=ARG     docs limit: default none
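One possible way to turn a single cutoff date into the two date-range invocations is sketched below. It uses only the -f and -t options listed in the help output; the helper name `foldCommands` is hypothetical. The help text does not say whether the bounds are inclusive, so documents dated exactly on the cutoff might fall into both chunks. The sketch also assumes that "default none" means the range is unbounded on that side.

```javascript
// Builds command lines for the older and recent chunks of a fold.
// Assumption: omitting -f or -t leaves that side of the range unbounded.
function foldCommands(script, cutoff, extraArgs = []) {
  return {
    older: [script, '-t', cutoff, ...extraArgs].join(' '),
    recent: [script, '-f', cutoff, ...extraArgs].join(' '),
  };
}

const commands = foldCommands('./generateDFRStats.js', '2014/10/01');
console.log(commands.older);
console.log(commands.recent);
```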
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED