Closed Bug 663181 Opened 13 years ago Closed 13 years ago

Automatic cleanup for Sync error logs

Categories

(Firefox :: Sync, defect)

defect
Not set
minor

Tracking

()

RESOLVED FIXED
mozilla9

People

(Reporter: rnewman, Assigned: liuche)

References

Details

(Whiteboard: [good-first-bug][verified in services])

Attachments

(1 file, 3 obsolete files)

We (will) write a log file to disk per failed sync. We should have some mechanism for expiring/archiving/managing these logs, particularly as they will cause filesystem indexing and other work on behalf of the OS, and we'll be potentially syncing much more frequently.
Severity: normal → minor
Whiteboard: [good-first-bug]
Some thoughts:

* Cleanup threshold: I think a combination of number of logs and age works best. We probably want to clean up everything that's older than, say, 30 days, but only if we actually have more log files than, say, 20. Both those numbers should be prefs.

* When to clean up: Best time to check whether a clean up is necessary is probably when we've written a log. Because only if we're writing logs to disk we need to clean old ones up. We probably still want to do it as an idle observer for, say, 1 minute.

Note that IMHO we only need to clean up error logs. Success logs have to be enabled by the user explicitly.
Assignee: nobody → liuche
test_logErrorCleanup_age has a spin loop to "age" the error log, and a kind of ugly "add an Obs within an Obs." Test is time-dependent though, so I found this necessary.
Attachment #557363 - Flags: review?(philipp)
Comment on attachment 557363 [details] [diff] [review]
upon logging an error log, will delete error logs above a certain number and age threshold

Removing review flag since this patch is against an outdated revision of the tree.

>+  _cleanupLogs: function _cleanupLogs() {
>+    let errLogCount = 0;
>+    let errLogs = [];

These variables are only used within the `try` block so you can move them there. errLogCount also appears to be redundant as it seems to always be just errLogs.length.

>+    try {
>+      let dir = FileUtils.getDir("ProfD", ["weave", "logs"]);
>+      let entries = dir.directoryEntries;
>+
>+      while (entries.hasMoreElements()) {
>+        let entry = entries.getNext();
>+        let logFile = entry.QueryInterface(Ci.nsIFile);

Isn't XPCOM awesome? So much clutter! Good news is, you can combine those two lines to create slightly more concise code:

  let logFile = entries.getNext().QueryInterface(Ci.nsIFile);

>+        // Cleanup file if too old
>+        if (logFile.isFile()
>+          && (logFile.leafName.toString().search(LOG_ERROR_REGEX) > -1)

logFile.leafName is already string, so no need for toString().

When you're just searching whole substrings, regexes aren't very efficient. (But if you would want to use a regex, you probably want to precompile it.) They're also almost certainly the wrong hammer anyway. I would recommend doing

  logFile.leafName.slice(0, LOG_ERROR_REGEX.length) == LOG_ERROR_REGEX

>+          && Date.now() - logFile.lastModifiedTime > Svc.Prefs.get("log.maxErrAge")) {

You can precompute Date.now() - Svc.Prefs.get("log.maxErrAge") outside this while loop and then just need to compare it to logFile.lastModifiedTime.

>+            logFile.remove(false);

nsIFile::remove can throw on Windows for no apparent reason. We should wrap any call to it in a try/catch and trace-log the error.

Speaking of logging, it'd be good to have trace logging whenever individual files are removed.

>+        } else {
>+          errLogCount++;
>+          errLogs.push(logFile);
>+        }
>+      }
>+      let diff = errLogCount - Svc.Prefs.get("log.maxErrLogs");
>+      while (diff > 0) {
>+        let errLog = errLogs.shift();
>+        errLog.remove(false);
>+        diff--;
>+      }
>+    } catch (ex) {}

Should at least log the exception here:

  this._log._debug("Encountered error trying to clean up old log files: "
                   + Utils.exceptionStr(ex));

>   _resetFileLog: function _resetFileLog(flushToFile, filenamePrefix) {
>     let inStream = this._logAppender.getInputStream();
>     this._logAppender.reset();
>     if (flushToFile && inStream) {
>       try {
>         let filename = filenamePrefix + Date.now() + ".txt";
>         let file = FileUtils.getFile("ProfD", ["weave", "logs", filename]);
>         let outStream = FileUtils.openFileOutputStream(file);
>         NetUtil.asyncCopy(inStream, outStream, function () {
>           Svc.Obs.notify("weave:service:reset-file-log");
>         });
>+        if (filenamePrefix == LOG_PREFIX_ERROR) {
>+          Utils.nextTick(this._cleanupLogs, this);
>+        }

This might be a bit undeterministic, depending on whether the async operation fires before the next tick or not. Why not move this into the asyncCopy callback?

That would also mean we're not going to compete with the log writing for I/O access.

I wonder if my suggestion of Utils.nextTick was such a good idea. Perhaps we actually want to sleep for a few seconds before beginning that cleanup. Also, I'm a bit worried by the tight while loop(s) that do I/O operations on the main thread.

So either we might want to break this up with Utils.nextTick, too. Basically yield back to the event loop after every file removal. Alternatively we could just keep the tight loop and do it on a separate thread. We should do some research first to see if that's safe to do, though.

>diff --git a/services/sync/services-sync.js b/services/sync/services-sync.js
>--- a/services/sync/services-sync.js
>+++ b/services/sync/services-sync.js
>@@ -22,16 +22,18 @@ pref("services.sync.engine.prefs", true)
> pref("services.sync.engine.tabs", true);
> pref("services.sync.engine.tabs.filteredUrls", "^(about:.*|chrome://weave/.*|wyciwyg:.*|file:.*)$");
> 
> pref("services.sync.jpake.serverURL", "https://setup.services.mozilla.com/");
> pref("services.sync.jpake.pollInterval", 1000);
> pref("services.sync.jpake.firstMsgMaxTries", 300);
> pref("services.sync.jpake.maxTries", 10);
> 
>+pref("services.sync.log.maxErrLogs", 20);
>+pref("services.sync.log.maxErrAge", 432000); // 5 days

"Err"? Really? I think we have money for 2 more bytes to at least say "Error" here ;) Also, since this stuff is really tied to the file logger, we should use the 'services.sync.log.appender.file.' prefix for those anyway.

Also, please use seconds in prefs :)

>+// Check that only the maximum number of error logs is saved
>+add_test(function test_logErrorCleanup_max() {
>+  let maxNum = 10;
>+  let logfile1_name;
>+
>+  Svc.Prefs.set("log.appender.file.logOnError", true);
>+  Svc.Prefs.set("log.maxErrLogs", maxNum);
>+
>+  Svc.Obs.add("weave:service:reset-file-log", function onResetFileLog() {
>+    Svc.Obs.remove("weave:service:reset-file-log", onResetFileLog);
>+
>+    // Store logfile name for comparison.
>+    let entries = logsdir.directoryEntries;
>+    do_check_true(entries.hasMoreElements());
>+    let logfile = entries.getNext().QueryInterface(Ci.nsILocalFile);
>+    logfile1_name = logfile.leafName.toString();
>+
>+    // Ensure the log message was actually written to file.
>+    readFile(logfile, function (error, data) {
>+      do_check_true(Components.isSuccessCode(error));
>+      do_check_neq(data.indexOf(MESSAGE), -1);
>+    });
>+  });

You don't need to test here that the files are written to disk, we have other tests that cover that already!

>+  // Fake max number of unsuccessful syncs.
>+  for (i=0; i<maxNum; i++) {
>+    Svc.Obs.notify("weave:service:sync:error");
>+  }

You better hope that these won't have the same Date.now() for the filename! Also, these files are going to be created async.

I think it'd be easier to just create a bunch of fake files manually in the test or a test helper (sync I/O is fine here) and then have a bunch of unit tests for the cleanup function. On top of that we'd then just need one test that ensures the clean up function is actually called when an error log is written to disk.

(Also, nit: spaces around the operators in the for loop)

>+  Svc.Obs.add("weave:service:cleanup-logs", function onCleanupLogs() {
>+    Svc.Obs.remove("weave:service:cleanup-logs", onCleanupLogs);
>+
>+    let logcount = 0;
>+    let entries = logsdir.directoryEntries;
>+
>+    // Exactly max number of log files remain, the first one is deleted
>+    while (entries.hasMoreElements()) {
>+      let logfile = entries.getNext().QueryInterface(Ci.nsILocalFile);
>+      do_check_neq(logfile.leafName.toString(), logfile1_name);
>+      logcount++;
>+
>+      // Ensure the log message was actually written to file.
>+      readFile(logfile, function (error, data) {
>+        do_check_true(Components.isSuccessCode(error));
>+        do_check_neq(data.indexOf(MESSAGE), -1);
>+
>+        // Clean up.
>+        try {
>+          logfile.remove(false);
>+        } catch(ex) {
>+          dump("Couldn't delete file: " + ex + "\n");
>+          // Stupid Windows box.
>+        }
>+      });
>+    }
>+    do_check_eq(logcount, maxNum);
>+
>+    Svc.Prefs.resetBranch("");
>+    run_next_test();

You're doing that too prematurely. You can only do test clean up and run_next_test() when your async callbacks have been called! For that reason it's a good idea to limit the number of async callbacks you're expecting in a test.


So if I read the code in _cleanupLogs correctly, it will:

* delete any file that's older than the threshold
* and collect all remaining files in the order they appeared in the directory listing and then delete them until 20 are left.

There's one problem with this, specifically the latter of the two points: there's no guarantee the order the files appeared in the directory listing in is actually ordered by time (or filename, since that would also be ok). So there's the danger of leaving 20 random files, not just the 20 most recent ones.

This algorithm also doesn't actually implement what I suggested in comment 1. My idea was that we wouldn't even start bothering to look at files and their modified times and try to delete anything if you had less than 20 entries. That said, if you have less than 20 entries, deleting those few will probably not be super expensive, anyway.

So given that, the number threshold might be unnecessary altogether. What do you think?
Attachment #557363 - Flags: review?(philipp)
Went with the age-only filtering, and fixed the other little things previously pointed out. Now, after an error log is written, too-old error logs are filtered out to be deleted in non-blocking way (nextTick).
Attachment #557363 - Attachment is obsolete: true
Attachment #558701 - Flags: review?(philipp)
Comment on attachment 558701 [details] [diff] [review]
non-blocking error log cleanup by age only

Nice work! Just a bunch of nits at this point.

>  * Contributor(s):
>+ *  Chenxia Liu <liuche@mozilla.com>
>  *  Marina Samuel <msamuel@mozilla.com>
>  *  Philipp von Weitershausen <philipp@weitershausen.de>

Usually you add yourself to the bottom of this list ;)

>   initLogs: function initLogs() {
>     this._log = Log4Moz.repository.getLogger("Sync.ErrorHandler");
>     this._log.level = Log4Moz.Level[Svc.Prefs.get("log.logger.service.main")];
>+    this._cleaningUpFileLogs = false;
> 
>     let root = Log4Moz.repository.getLogger("Sync");
>     root.level = Log4Moz.Level[Svc.Prefs.get("log.rootLogger")];
> 
>     let formatter = new Log4Moz.BasicFormatter();
>     let capp = new Log4Moz.ConsoleAppender(formatter);
>     capp.level = Log4Moz.Level[Svc.Prefs.get("log.appender.console")];
>     root.addAppender(capp);
> 
>     let dapp = new Log4Moz.DumpAppender(formatter);
>     dapp.level = Log4Moz.Level[Svc.Prefs.get("log.appender.dump")];
>     root.addAppender(dapp);
> 
>     let fapp = this._logAppender = new Log4Moz.StorageStreamAppender(formatter);
>     fapp.level = Log4Moz.Level[Svc.Prefs.get("log.appender.file.level")];
>     root.addAppender(fapp);
>+
>   },

Please get rid of this unrelated whitespace fix.

>   /**
>+   * Finds all logs older than maxErrorAge and deletes them without tying up I/O
>+   *
>+   */

Nits: missing period at the end of the sentence, extra line before closing the comment.

>+  cleanupLogs: function cleanupLogs() {
>+    let direntries = FileUtils.getDir("ProfD", ["weave", "logs"]).directoryEntries;
>+    let oldLogs = [];
>+    let index = 0;
>+    let dateNow = Date.now();
>+    let maxAge = 1000*Svc.Prefs.get("log.appender.file.maxErrorAge");

Nit: space around * operator.

>+    while (direntries.hasMoreElements()) {
>+      let logFile = direntries.getNext().QueryInterface(Ci.nsIFile);
>+      if (dateNow - logFile.lastModifiedTime > maxAge) {
>+        oldLogs.push(logFile);
>+      }
>+    }
>+
>+    /* deletes a file from oldLogs each tick until there are none left */

Nit: Please either use // or JavaDoc style comment.

>+    function _deleteFile() {
>+      if (index >= oldLogs.length) {
>+        ErrorHandler._cleaningUpFileLogs = false;
>+        Svc.Obs.notify("weave:service:cleanup-logs");
>+        return;
>+      }
>+      try {
>+        oldLogs[index++].remove(false);

Please move the index increment to a separate statement for better readability. It can live outside the try/catch block.

>+      } catch (ex) {
>+        ErrorHandler._log._debug("Encountered error trying to clean up old log files: "
>+            + Utils.exceptionStr(ex));

Please align the + operator with the open quote.

>+      }
>+      Utils.nextTick(_deleteFile);
>+    }
>+
>+    if (oldLogs.length > 0) {
>+      ErrorHandler._cleaningUpFileLogs = true;
>+      Utils.nextTick(_deleteFile);
>+    }
>+  },

I'm not sure I like the underscore in _deleteFile or understand why it's there. It's a locally scoped function, so in that way it's like a local variable. I would prefer to get rid of it.

>@@ -562,17 +605,21 @@ let ErrorHandler = {
>     this._logAppender.reset();
>     if (flushToFile && inStream) {
>       try {
>         let filename = filenamePrefix + Date.now() + ".txt";
>         let file = FileUtils.getFile("ProfD", ["weave", "logs", filename]);
>         let outStream = FileUtils.openFileOutputStream(file);
>         NetUtil.asyncCopy(inStream, outStream, function () {
>           Svc.Obs.notify("weave:service:reset-file-log");
>-        });
>+          if (filenamePrefix == LOG_PREFIX_ERROR
>+            && ErrorHandler._cleaningUpFileLogs == false) {

Please align the && operator with the 'f' of filenamePrefix.

>+              Utils.nextTick(this.cleanupLogs, this);
>+          }
>+        }.bind(this));

You're mixing styles here between using 'this' (which is ErrorHandler thanks to the bind) and using 'ErrorHandler' explicitly. I'd say, don't do the binding (that'd be cheaper because it avoids cloning a function object) and just use ErrorHandler in the nextTick.

>diff --git a/services/sync/services-sync.js b/services/sync/services-sync.js
>--- a/services/sync/services-sync.js
>+++ b/services/sync/services-sync.js
>@@ -29,16 +29,18 @@ pref("services.sync.jpake.pollInterval",
> pref("services.sync.jpake.firstMsgMaxTries", 300);
> pref("services.sync.jpake.maxTries", 10);
> 
> pref("services.sync.log.appender.console", "Warn");
> pref("services.sync.log.appender.dump", "Error");
> pref("services.sync.log.appender.file.level", "Trace");
> pref("services.sync.log.appender.file.logOnError", true);
> pref("services.sync.log.appender.file.logOnSuccess", false);
>+pref("services.sync.log.appender.file.maxErrorLogs", 20);

This is obsolete now, right?

>diff --git a/services/sync/tests/unit/test_errorhandler_filelog.js b/services/sync/tests/unit/test_errorhandler_filelog.js
>--- a/services/sync/tests/unit/test_errorhandler_filelog.js
>+++ b/services/sync/tests/unit/test_errorhandler_filelog.js
>@@ -226,8 +226,65 @@ add_test(function test_login_error_logOn
>       run_next_test();
>     });
>   });
> 
>   // Fake an unsuccessful login due to prolonged failure.
>   setLastSync(PROLONGED_ERROR_DURATION);
>   Svc.Obs.notify("weave:service:login:error");
> });
>+
>+// Check that an error log file is deleted above a time threshold
>+add_test(function test_logErrorCleanup_age() {
>+  let maxAge = 1;
>+  let firstlog_name;
>+  let oldLogs = [];
>+  let numLogs = 10;
>+  let errString = "some error log\n";
>+
>+  Svc.Prefs.set("log.appender.file.logOnError", true);
>+  Svc.Prefs.set("log.appender.file.maxErrorAge", maxAge);
>+
>+  // make new file 
>+  let foStream = Cc["@mozilla.org/network/file-output-stream;1"].
>+                   createInstance(Ci.nsIFileOutputStream);
>+  let newLog = Cc["@mozilla.org/file/local;1"].
>+                    createInstance(Ci.nsILocalFile);

Please use the style present in all other Sync code and align the operators:

  let foobar = Cc[contract]
                 .createInstance(interface);

>+  for (let i = 0; i < numLogs; i++){

Nit: space between ) and {.

>+    newLog.initWithPath(logsdir.path + "/" + LOG_PREFIX_ERROR + Date.now() + i + ".txt");
>+    oldLogs.push(newLog.leafName);

I'm not sure why you can't use FileUtils here... Also where is 'logsdir' defined? I can't see it.

>+    foStream.init(newLog, 0x02 | 0x08 | 0x20, 0666, 0);

Nit: Please use the constants from FileUtils.jsm here.

>+    foStream.write(errString, errString.length);
>+    foStream.close();
>+  }
>+
>+  // Spin for time threshold
>+  let timerStart = Date.now();
>+  while (Date.now() - timerStart < maxAge*1000);

Ugh. We have better ways to pause for a second (namely Utils.namedTimer). But really, why do you have to do this? Can't you set maxAge to 0?
Attachment #558701 - Flags: review?(philipp)
Attached patch fixed nits (obsolete) — Splinter Review
Nits fixed, simplified maxAge test - originally was a test showing "deletes some files, but not all, filtered by age" but I guess that's not reallllly necessary.
Attachment #558701 - Attachment is obsolete: true
Comment on attachment 558716 [details] [diff] [review]
fixed nits

>   /**
>+   * Finds all logs older than maxErrorAge and deletes them without tying up I/O.
>+   */
>+  cleanupLogs: function cleanupLogs() {
>+    let direntries = FileUtils.getDir("ProfD", ["weave", "logs"]).directoryEntries;
>+    let oldLogs = [];
>+    let index = 0;
>+    let dateNow = Date.now();
>+    let maxAge = 1000 * Svc.Prefs.get("log.appender.file.maxErrorAge");
>+
>+    while (direntries.hasMoreElements()) {
>+      let logFile = direntries.getNext().QueryInterface(Ci.nsIFile);
>+      if (dateNow - logFile.lastModifiedTime > maxAge) {

You can precompute dateNow - maxAge and then just compare that to logFile.lastModifiedTime (cf. comment 3)

>+    // deletes a file from oldLogs each tick until there are none left 

Nit: this is a sentence, please capitalize and add full stop.

>+    function deleteFile() {
>+      if (index >= oldLogs.length) {
>+        ErrorHandler._cleaningUpFileLogs = false;
>+        Svc.Obs.notify("weave:service:cleanup-logs");
>+        return;
>+      }
>+      try {
>+        oldLogs[index].remove(false);
>+      } catch (ex) {
>+        ErrorHandler._log._debug("Encountered error trying to clean up old log files: "
>+                                 + Utils.exceptionStr(ex));

Can you log the filename here, please?

>@@ -562,16 +604,20 @@ let ErrorHandler = {
>     this._logAppender.reset();
>     if (flushToFile && inStream) {
>       try {
>         let filename = filenamePrefix + Date.now() + ".txt";
>         let file = FileUtils.getFile("ProfD", ["weave", "logs", filename]);
>         let outStream = FileUtils.openFileOutputStream(file);
>         NetUtil.asyncCopy(inStream, outStream, function () {
>           Svc.Obs.notify("weave:service:reset-file-log");
>+          if (filenamePrefix == LOG_PREFIX_ERROR
>+              && ErrorHandler._cleaningUpFileLogs == false) {

Don't compare with false, just use check for !ErrorHandler._cleaningUpFileLogs.

>+              Utils.nextTick(ErrorHandler.cleanupLogs, ErrorHandler);
>+          }

Nit: the Utils.nextTick line is indented 2 spaces too far.

>+// Check that error log files are deleted above a time threshold
>+add_test(function test_logErrorCleanup_age() {
>+  let maxAge = 0;
>+  let firstlog_name;
>+  let oldLogs = [];
>+  let numLogs = 10;
>+  let errString = "some error log\n";
>+
>+  Svc.Prefs.set("log.appender.file.logOnError", true);
>+  Svc.Prefs.set("log.appender.file.maxErrorAge", maxAge);
>+
>+  // Make some files.
>+  for (let i = 0; i < numLogs; i++) {
>+    let filename = LOG_PREFIX_ERROR + Date.now() + i + ".txt";
>+    let newLog = FileUtils.getFile("ProfD", ["weave", "logs", filename]);
>+    let foStream = FileUtils.openFileOutputStream(newLog);
>+    foStream.write(errString, errString.length);
>+    foStream.close();
>+    oldLogs.push(newLog.leafName);
>+  }

Why are you keeping the files in the oldLogs array?

>+
>+  Svc.Obs.add("weave:service:cleanup-logs", function onCleanupLogs() {
>+    Svc.Obs.remove("weave:service:cleanup-logs", onCleanupLogs);
>+
>+    // No log files remain, maxAge is 0.
>+    let entries = logsdir.directoryEntries;
>+    do_check_false(entries.hasMoreElements());

I see what you mean now in comment 6. I hadn't thought this maxAge = 0 thing through :). Alright, ok, so let's set maxAge to 100 ms and delay the Svc.Obs.notify("weave:service:sync:error") call by 150ms (100ms won't be enough on Windows because timers can fire a few ms earlier on those platforms). Then we should expect 1 remaining logfile, right? When you do this, please const the 100 and 150 and add comments about why these values and why we're doing this.
Expanded cleanup test to demonstrate cleanup only of "too-old" errors.
Attachment #558716 - Attachment is obsolete: true
Attachment #559210 - Flags: review?(philipp)
Attachment #559210 - Attachment description: nits, cleanup age test → nits, cleanup age test, v3(-ish)
Attachment #559210 - Flags: review?(philipp) → review+
STR

Threshold age for cleanup can be modified in prefs: services.sync.log.appender.file.maxErrorAge, units in seconds.

Cleanup applies to error logs, so create error logs somehow. Wait threshold age (maxErrorAge) and trigger cleanup through an erroneous sync.

Only logs older than threshold age should have been deleted from about:sync-logs, where age of file refers to last time modified.
http://hg.mozilla.org/services/services-central/rev/42cd2123ad10
Status: NEW → ASSIGNED
Summary: Follow-up: cleanup for log-on-error → Automatic cleanup for Sync error logs
Whiteboard: [good-first-bug] → [good-first-bug][fixed in services]
a little clarification of STR's for verification and test case writing.

1) in about config services.sync.log.appender.file.maxErrorAge to 120 (2 minutes). Then note clock time.
2) disconnect your network
3) Sync Now a few times to create error logs (confirm logs entries are present in about:sync-log)
4) wait 'til 2 minutes of clock time has passed
5) with network still disconnected; Sync Now once again
6) Check sync-logs again

expected results:
Only the log entry for the sync attempt in step 5 should be present. This is because of the short maxErrorAge.  (that pref defaults to 10 days)
log clean up looks good on s-c builds from this afternoon
Whiteboard: [good-first-bug][fixed in services] → [good-first-bug][verified in services]
https://hg.mozilla.org/mozilla-central/rev/42cd2123ad10
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla9
Component: Firefox Sync: Backend → Sync
Product: Cloud Services → Firefox
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: