Loop metrics logging can cause data loss/out of order logs

RESOLVED FIXED

Status

Hello (Loop)
Server
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: whd, Assigned: natim)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [qa-])

Attachments

(1 attachment, 1 obsolete attachment)

56 bytes, text/x-github-pull-request
alexis
: review+
Details | Review | Splinter Review
(Reporter)

Description

3 years ago
The Loop Server is using Winston for logging metrics to disk. Previously it was using the standard streams which were being logged/rotated by Circus. Winston's file transport seems to rely entirely on internal state variables to determine the log file to write to (see https://github.com/flatiron/winston/blob/master/lib/winston/transports/file.js#L82-86, specifically _created). I haven't investigated fully, but this means that if you restart the process while writing to e.g. loop_metrics4.log you will start writing to loop_metrics.log after the restart. This makes it very difficult to track what the current log file is, even with Heka's fairly powerful logstreamer input, and can potentially either cause data loss or the logs to be out of order.

The Loop Server should instead always log to loop_metrics.log as most log rotation schemes do, and rotate the log file to loop_metrics{N}.log as necessary.
Whiteboard: [qa-]
(Assignee)

Comment 1

3 years ago
I though it was what Winston was doing (i.e always write to loop_metrics.log and rotate when it's full.)

Would it be better to use https://github.com/flatiron/winston/blob/master/lib/winston/transports/daily-rotate-file.js or to configure loop to always write to loop_metrics.log and then use the system logrotate?
(Assignee)

Updated

3 years ago
Flags: needinfo?(whd)
(Reporter)

Comment 2

3 years ago
It appears at a cursory glance that switching to the Winston daily rotate file transport suffers from the same problem with the _created internal state variable. I would prefer we switch back to writing to the standard streams and letting circus do the rotation, since this better matches the behavior of other applications and our logging standard: https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Logging+Standard. AFAIK loop doesn't presently emit anything to the standard streams. However, having Loop always write to loop_metrics.log and then using the system logrotate should indeed be sufficient to mitigate the issue.

In the long term, we'd like to use the new mozlog https://github.com/mozilla/mozlog library for all log output across Cloud Services' nodejs applications.
Flags: needinfo?(whd)
(Assignee)

Comment 3

3 years ago
whd — Do you recommend us to start using mozilla/mozlog instead of winston for loop?
Flags: needinfo?(whd)
(Reporter)

Comment 4

3 years ago
(In reply to Rémy Hubscher (:natim) from comment #3)
> whd — Do you recommend us to start using mozilla/mozlog instead of winston
> for loop?

Yes.
Flags: needinfo?(whd)
(Reporter)

Comment 5

3 years ago
It looks like we lost some data due to this for 2014-12-06, and the dashboards for this week are messed up: https://metrics.services.mozilla.com/loop-server-dashboard/
By the time I noticed and checked the loop servers, one looked like:

loop_metrics15.log (old)
loop_metrics16.log (old)
loop_metrics17.log (old)
loop_metrics18.log (old)
loop_metrics19.log (old)
loop_metrics5.log (new)
loop_metrics6.log (new)
loop_metrics7.log (new)
loop_metrics8.log (new, being written to)

And the other:

loop_metrics15.log (old)
loop_metrics16.log (old)
loop_metrics17.log (old)
loop_metrics18.log (old)
loop_metrics19.log (old)
loop_metrics5.log (new)
loop_metrics6.log (new)
loop_metrics7.log (new)
loop_metrics8.log (new)
loop_metrics9.log (new)
loop_metrics.log (new, being written to)

This log rotation scheme is basically incomprehensible to me (and heka). https://github.com/mozilla-services/puppet-config/blob/master/loop/yaml/app/loop_server.prod.yaml#L36-40 indicates there should be only be five files at a time but it seems to have rotated the name at 20 and since deleted loop_metrics{1..4}.log as they are missing. I suspect the first server above will begin writing to loop_metrics.log after finishing with loop_metrics9.log, which may again break the logstreamer. I'll fix it tomorrow if it happens.

This most recent stack (2014-11-26) has been running longer than the average loop stack which is why the data loss occurred.

A way to mitigate this would be to either not rotate at all or fix the app logging settings (and/or the logger itself). With the amount of traffic loop gets and the rate at which we deploy new stacks I would be fine with turning off rotation completely until this has a proper fix.

I have attempted to recover and process as much of the existing logs as possible and will be adding anomaly detection to heka to catch this sooner in the future, but it must be addressed by the server or the logging library if we want to prevent this kind of data loss.
Thanks for looking into this Wesley!
(Assignee)

Comment 7

3 years ago
Starting the switch to mozlog today.
Assignee: nobody → rhubscher
(Assignee)

Comment 8

3 years ago
Actually it seems mozlog doesn't let us configure the filename where to store files. First reason why we choose to use winston. I will deactivate logrotate for winston.
(Reporter)

Comment 9

3 years ago
Mozlog is a wrapper around Intel, which should support writing out to a file: https://github.com/seanmonstar/intel#filehandler. I would also prefer you use the standard streams for metrics anyway as outlined in https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Logging+Standard section "Format", but my primary concern is to stop losing data, which can be accomplished by disabling logrotate for winston.
(Assignee)

Comment 10

3 years ago
Created attachment 8543245 [details] [review]
Link to Github PR — #277.
Attachment #8543245 - Flags: review?(alexis+bugs)
(Assignee)

Comment 11

3 years ago
Following your advice I did this:

 - Remove everything that might write to stdout
 - Use mozlog to write to stdout with the heka logger

The format is now the one from mozlog:

For websocket logging:

 {"Timestamp":1420193467785000000,"Logger":"mozilla-loop-server",
  "Type":"loop.logger.request.summary","Severity":6,"Pid":11494,"EnvVersion":"2.0",
  "Fields":{"code":204,"path":"/rooms/LcYL3OdXgg4","method":"post","query":"{}",
            "agent":"python-requests/2.4.1 CPython/2.7.6 Linux/3.13.0-43-generic",
            "time":"2015-01-02T10:11:07Z",
            "uid":"fabaf4f9f60c6f8d97158c75f0b9b2661738130eb654eed13d5ecdc8739d0f1a",
            "token":"LcYL3OdXgg4","v":"0.14.2","name":"mozilla-loop-server",
            "hostname":"rhubscher-ThinkPad-W540","ip":"127.0.0.1","errno":0,"action":"leave"}}

For requests logging:

 {"Timestamp":1420193467781000000,"Logger":"mozilla-loop-server","Type":"loop.logger.request.summary",
  "Severity":6,"Pid":11494,"EnvVersion":"2.0",
  "Fields":{"code":200,"path":"/rooms/LcYL3OdXgg4","method":"post","query":"{}",
            "agent":"python-requests/2.4.1 CPython/2.7.6 Linux/3.13.0-43-generic",
            "time":"2015-01-02T10:11:07Z",
            "uid":"fabaf4f9f60c6f8d97158c75f0b9b2661738130eb654eed13d5ecdc8739d0f1a",
            "token":"LcYL3OdXgg4","v":"0.14.2","name":"mozilla-loop-server",
            "hostname":"rhubscher-ThinkPad-W540","ip":"127.0.0.1","errno":0,"action":"refresh"}}

The hekaMetrics configuration changed to fit the mozlog properties:

  hekaMetrics: {
    activated: {
      doc: "Defines if metrics should be directed to hekad",
      default: false,
      format: Boolean
    },
    debug: {
      doc: "Heka logger display development logs",
      format: Boolean,
      default: false
    },
    level: {
      doc: "Log level.",
      format: String,
      default: "INFO"
    },
    fmt: {
      doc: "Log level.",
      format: String,
      default: "heka"
    }
  },
(Assignee)

Comment 12

3 years ago
I put to examples of request.summary, the one for websocket.summary is:

 {"Timestamp":1420193489623000000,"Logger":"mozilla-loop-server","Type":"loop.logger.websocket.summary",
  "Severity":6,"Pid":11494,"EnvVersion":"2.0",
  "Fields":{"callId":"bfb6e7e5ade4b4ebdcb906382f820242","auth":"d27e26844c00c6b68005a2e6c1cb5960",
            "messageType":"hello","time":"2015-01-02T10:11:29Z"}}
(Reporter)

Comment 13

3 years ago
This looks good, thanks. If possible remove the redundant "loop.logger" prefix from the Type field, since we have the Logger field for this purpose.
(Assignee)

Comment 14

3 years ago
I will let loop.websocket.summary and loop.request.summary. Because mozlog doesn't handle an empty namespace:

  var log = require("mozlog")();

It writes `.websocket.summary` and `.request.summary`

Do you want me to create a patch for mozlog about this?
(Reporter)

Comment 15

3 years ago
It appears the idiomatic way of doing this is to use different logger objects for different code paths (at least this is my guess from https://github.com/mozilla/fxa-profile-server/search?utf8=%E2%9C%93&q=logger+%3D). So in the requests path you would do e.g.

  var log = require("mozlog")("request");

and log a "summary" event.
(Assignee)

Comment 16

3 years ago
Yes this would be one way to do it. Also it would means to write:

    hekaLogger("request").info("summary", data);

And I rather prefer to write:

    hekaLogger.info("request.summary", data);

Which does exactly the same thing.

I have created https://github.com/mozilla/mozlog/pull/3 in order to handle the empty namespace.
Adding a NI on Sean so we have feedback to know if this can be merged.
Flags: needinfo?(smcarthur)
Comment on attachment 8543245 [details] [review]
Link to Github PR — #277.

I've merged the code change, it's at https://github.com/mozilla-services/loop-server/commit/82c8883d80d4e74711cf6daba8e33bd7939005ce

I'm letting the NI on sean to have feedback about the PR Rémy opened on mozlog.
Attachment #8543245 - Flags: review?(alexis+bugs) → review+
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
sorry, missed this. I have left comments on the PR.
Flags: needinfo?(smcarthur)
(Reporter)

Comment 20

3 years ago
I took a look at the logs in stage and they still have "Type" with the "loop." prefix, probably because :natim's mozlog PR has not been merged. I need this fixed before 0.15.* goes to production.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Reporter)

Updated

3 years ago
Blocks: 1120911
Wes, you mentioned today we could proceed with 0.15.0 and that this logging change could go out in a future release.  Does it instead look like we need to halt 0.15.0 and do a re-spin with this change?
Flags: needinfo?(whd)
(Reporter)

Comment 22

3 years ago
(In reply to Richard Pappalardo [:rpapa][:rpappalardo] from comment #21)
> Wes, you mentioned today we could proceed with 0.15.0 and that this logging
> change could go out in a future release.  Does it instead look like we need
> to halt 0.15.0 and do a re-spin with this change?

I need this fixed before 0.15.* goes to production, but I don't know if that requires halting current work on 0.15.0 in stage. It's a minor code change.
Flags: needinfo?(whd)
Remy, can you have a look at this?
Flags: needinfo?(rhubscher)
(Assignee)

Comment 24

3 years ago
Sure what would you like me to do?
Flags: needinfo?(rhubscher)
(Reporter)

Comment 25

3 years ago
:natim I need the "loop." prefix dropped from the Logger field. This might involve updating mozlog and the loop server or both. There's a ticket for a new stage deploy of 0.15.1 (bug 1123276), so perhaps we can roll this change into that (and call it 0.15.2 if need be).
There's a PR to mozlog that I reviewed. Needs an update and I'll merge and publish a new version to npm.
(Assignee)

Comment 27

3 years ago
Ok Sean, Thanks I will do that then.
Thanks! I've published v1.0.2 to npm, you should be good to go.
(Assignee)

Comment 29

3 years ago
Created attachment 8553974 [details] [review]
Link to Github PR — #289.
Attachment #8553974 - Flags: review?(alexis+bugs)
Attachment #8553974 - Flags: feedback?(whd)
(Assignee)

Updated

3 years ago
Attachment #8543245 - Attachment is obsolete: true
(Reporter)

Updated

3 years ago
Attachment #8553974 - Flags: feedback?(whd) → feedback+
https://github.com/mozilla-services/loop-server/commit/4b3079990e8642103de88f7a5e3a908fe81d0a70
Status: REOPENED → RESOLVED
Last Resolved: 3 years ago3 years ago
Resolution: --- → FIXED
Attachment #8553974 - Flags: review?(alexis+bugs) → review+
You need to log in before you can comment on or make changes to this bug.