Closed Bug 1080133 Opened 10 years ago Closed 9 years ago

Loop metrics logging can cause data loss/out of order logs

Categories

(Hello (Loop) :: Server, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whd, Assigned: rhubscher)

References

Details

(Whiteboard: [qa-])

Attachments

(1 file, 1 obsolete file)

56 bytes, text/x-github-pull-request
alexis+bugs
: review+
Details | Review
The Loop Server is using Winston for logging metrics to disk. Previously it was using the standard streams which were being logged/rotated by Circus. Winston's file transport seems to rely entirely on internal state variables to determine the log file to write to (see https://github.com/flatiron/winston/blob/master/lib/winston/transports/file.js#L82-86, specifically _created). I haven't investigated fully, but this means that if you restart the process while writing to e.g. loop_metrics4.log you will start writing to loop_metrics.log after the restart. This makes it very difficult to track what the current log file is, even with Heka's fairly powerful logstreamer input, and can potentially either cause data loss or the logs to be out of order.

The Loop Server should instead always log to loop_metrics.log as most log rotation schemes do, and rotate the log file to loop_metrics{N}.log as necessary.
Whiteboard: [qa-]
I though it was what Winston was doing (i.e always write to loop_metrics.log and rotate when it's full.)

Would it be better to use https://github.com/flatiron/winston/blob/master/lib/winston/transports/daily-rotate-file.js or to configure loop to always write to loop_metrics.log and then use the system logrotate?
Flags: needinfo?(whd)
It appears at a cursory glance that switching to the Winston daily rotate file transport suffers from the same problem with the _created internal state variable. I would prefer we switch back to writing to the standard streams and letting circus do the rotation, since this better matches the behavior of other applications and our logging standard: https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Logging+Standard. AFAIK loop doesn't presently emit anything to the standard streams. However, having Loop always write to loop_metrics.log and then using the system logrotate should indeed be sufficient to mitigate the issue.

In the long term, we'd like to use the new mozlog https://github.com/mozilla/mozlog library for all log output across Cloud Services' nodejs applications.
Flags: needinfo?(whd)
whd — Do you recommend us to start using mozilla/mozlog instead of winston for loop?
Flags: needinfo?(whd)
(In reply to Rémy Hubscher (:natim) from comment #3)
> whd — Do you recommend us to start using mozilla/mozlog instead of winston
> for loop?

Yes.
Flags: needinfo?(whd)
It looks like we lost some data due to this for 2014-12-06, and the dashboards for this week are messed up: https://metrics.services.mozilla.com/loop-server-dashboard/
By the time I noticed and checked the loop servers, one looked like:

loop_metrics15.log (old)
loop_metrics16.log (old)
loop_metrics17.log (old)
loop_metrics18.log (old)
loop_metrics19.log (old)
loop_metrics5.log (new)
loop_metrics6.log (new)
loop_metrics7.log (new)
loop_metrics8.log (new, being written to)

And the other:

loop_metrics15.log (old)
loop_metrics16.log (old)
loop_metrics17.log (old)
loop_metrics18.log (old)
loop_metrics19.log (old)
loop_metrics5.log (new)
loop_metrics6.log (new)
loop_metrics7.log (new)
loop_metrics8.log (new)
loop_metrics9.log (new)
loop_metrics.log (new, being written to)

This log rotation scheme is basically incomprehensible to me (and heka). https://github.com/mozilla-services/puppet-config/blob/master/loop/yaml/app/loop_server.prod.yaml#L36-40 indicates there should be only be five files at a time but it seems to have rotated the name at 20 and since deleted loop_metrics{1..4}.log as they are missing. I suspect the first server above will begin writing to loop_metrics.log after finishing with loop_metrics9.log, which may again break the logstreamer. I'll fix it tomorrow if it happens.

This most recent stack (2014-11-26) has been running longer than the average loop stack which is why the data loss occurred.

A way to mitigate this would be to either not rotate at all or fix the app logging settings (and/or the logger itself). With the amount of traffic loop gets and the rate at which we deploy new stacks I would be fine with turning off rotation completely until this has a proper fix.

I have attempted to recover and process as much of the existing logs as possible and will be adding anomaly detection to heka to catch this sooner in the future, but it must be addressed by the server or the logging library if we want to prevent this kind of data loss.
Thanks for looking into this Wesley!
Starting the switch to mozlog today.
Assignee: nobody → rhubscher
Actually it seems mozlog doesn't let us configure the filename where to store files. First reason why we choose to use winston. I will deactivate logrotate for winston.
Mozlog is a wrapper around Intel, which should support writing out to a file: https://github.com/seanmonstar/intel#filehandler. I would also prefer you use the standard streams for metrics anyway as outlined in https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Logging+Standard section "Format", but my primary concern is to stop losing data, which can be accomplished by disabling logrotate for winston.
Attached file Link to Github PR — #277. (obsolete) —
Attachment #8543245 - Flags: review?(alexis+bugs)
Following your advice I did this:

 - Remove everything that might write to stdout
 - Use mozlog to write to stdout with the heka logger

The format is now the one from mozlog:

For websocket logging:

 {"Timestamp":1420193467785000000,"Logger":"mozilla-loop-server",
  "Type":"loop.logger.request.summary","Severity":6,"Pid":11494,"EnvVersion":"2.0",
  "Fields":{"code":204,"path":"/rooms/LcYL3OdXgg4","method":"post","query":"{}",
            "agent":"python-requests/2.4.1 CPython/2.7.6 Linux/3.13.0-43-generic",
            "time":"2015-01-02T10:11:07Z",
            "uid":"fabaf4f9f60c6f8d97158c75f0b9b2661738130eb654eed13d5ecdc8739d0f1a",
            "token":"LcYL3OdXgg4","v":"0.14.2","name":"mozilla-loop-server",
            "hostname":"rhubscher-ThinkPad-W540","ip":"127.0.0.1","errno":0,"action":"leave"}}

For requests logging:

 {"Timestamp":1420193467781000000,"Logger":"mozilla-loop-server","Type":"loop.logger.request.summary",
  "Severity":6,"Pid":11494,"EnvVersion":"2.0",
  "Fields":{"code":200,"path":"/rooms/LcYL3OdXgg4","method":"post","query":"{}",
            "agent":"python-requests/2.4.1 CPython/2.7.6 Linux/3.13.0-43-generic",
            "time":"2015-01-02T10:11:07Z",
            "uid":"fabaf4f9f60c6f8d97158c75f0b9b2661738130eb654eed13d5ecdc8739d0f1a",
            "token":"LcYL3OdXgg4","v":"0.14.2","name":"mozilla-loop-server",
            "hostname":"rhubscher-ThinkPad-W540","ip":"127.0.0.1","errno":0,"action":"refresh"}}

The hekaMetrics configuration changed to fit the mozlog properties:

  hekaMetrics: {
    activated: {
      doc: "Defines if metrics should be directed to hekad",
      default: false,
      format: Boolean
    },
    debug: {
      doc: "Heka logger display development logs",
      format: Boolean,
      default: false
    },
    level: {
      doc: "Log level.",
      format: String,
      default: "INFO"
    },
    fmt: {
      doc: "Log level.",
      format: String,
      default: "heka"
    }
  },
I put to examples of request.summary, the one for websocket.summary is:

 {"Timestamp":1420193489623000000,"Logger":"mozilla-loop-server","Type":"loop.logger.websocket.summary",
  "Severity":6,"Pid":11494,"EnvVersion":"2.0",
  "Fields":{"callId":"bfb6e7e5ade4b4ebdcb906382f820242","auth":"d27e26844c00c6b68005a2e6c1cb5960",
            "messageType":"hello","time":"2015-01-02T10:11:29Z"}}
This looks good, thanks. If possible remove the redundant "loop.logger" prefix from the Type field, since we have the Logger field for this purpose.
I will let loop.websocket.summary and loop.request.summary. Because mozlog doesn't handle an empty namespace:

  var log = require("mozlog")();

It writes `.websocket.summary` and `.request.summary`

Do you want me to create a patch for mozlog about this?
It appears the idiomatic way of doing this is to use different logger objects for different code paths (at least this is my guess from https://github.com/mozilla/fxa-profile-server/search?utf8=%E2%9C%93&q=logger+%3D). So in the requests path you would do e.g.

  var log = require("mozlog")("request");

and log a "summary" event.
Yes this would be one way to do it. Also it would means to write:

    hekaLogger("request").info("summary", data);

And I rather prefer to write:

    hekaLogger.info("request.summary", data);

Which does exactly the same thing.

I have created https://github.com/mozilla/mozlog/pull/3 in order to handle the empty namespace.
Adding a NI on Sean so we have feedback to know if this can be merged.
Flags: needinfo?(smcarthur)
Comment on attachment 8543245 [details] [review]
Link to Github PR — #277.

I've merged the code change, it's at https://github.com/mozilla-services/loop-server/commit/82c8883d80d4e74711cf6daba8e33bd7939005ce

I'm letting the NI on sean to have feedback about the PR Rémy opened on mozlog.
Attachment #8543245 - Flags: review?(alexis+bugs) → review+
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
sorry, missed this. I have left comments on the PR.
Flags: needinfo?(smcarthur)
I took a look at the logs in stage and they still have "Type" with the "loop." prefix, probably because :natim's mozlog PR has not been merged. I need this fixed before 0.15.* goes to production.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Blocks: 1120911
Wes, you mentioned today we could proceed with 0.15.0 and that this logging change could go out in a future release.  Does it instead look like we need to halt 0.15.0 and do a re-spin with this change?
Flags: needinfo?(whd)
(In reply to Richard Pappalardo [:rpapa][:rpappalardo] from comment #21)
> Wes, you mentioned today we could proceed with 0.15.0 and that this logging
> change could go out in a future release.  Does it instead look like we need
> to halt 0.15.0 and do a re-spin with this change?

I need this fixed before 0.15.* goes to production, but I don't know if that requires halting current work on 0.15.0 in stage. It's a minor code change.
Flags: needinfo?(whd)
Remy, can you have a look at this?
Flags: needinfo?(rhubscher)
Sure what would you like me to do?
Flags: needinfo?(rhubscher)
:natim I need the "loop." prefix dropped from the Logger field. This might involve updating mozlog and the loop server or both. There's a ticket for a new stage deploy of 0.15.1 (bug 1123276), so perhaps we can roll this change into that (and call it 0.15.2 if need be).
There's a PR to mozlog that I reviewed. Needs an update and I'll merge and publish a new version to npm.
Ok Sean, Thanks I will do that then.
Thanks! I've published v1.0.2 to npm, you should be good to go.
Attachment #8553974 - Flags: review?(alexis+bugs)
Attachment #8553974 - Flags: feedback?(whd)
Attachment #8543245 - Attachment is obsolete: true
Attachment #8553974 - Flags: feedback?(whd) → feedback+
https://github.com/mozilla-services/loop-server/commit/4b3079990e8642103de88f7a5e3a908fe81d0a70
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
Attachment #8553974 - Flags: review?(alexis+bugs) → review+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: