Closed Bug 914904 Opened 11 years ago Closed 11 years ago

Investigate webmaker.org PROD crash in Events

Categories

(Webmaker Graveyard :: DevOps, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: humph, Assigned: cade)

References

Details

Attachments

(5 files)

My inbox is filling up with webmaker.org crashes:

ALERT TIME: Tue Sep 10 23:29:13 UTC 2013
HOSTGROUP: MoFo-Public
HOSTNAME: Webmaker-Production Webmaker
SERVICE: healthcheck
STATUS: CRITICAL - Socket timeout after 30 seconds 
HOST URL: http://opsview.mofoprod.net:3000/status/service?host=Webmaker-Production

I wish we had butler on production so we had stacks, but until then, we should figure this out and stop it.
jp, cade, and I spent a bunch of time on this and figured out the problem.  It's related to CSRF handling and the new Likes feature.

The crash goes like this:

* https://github.com/senchalabs/connect/blob/2.7.6/lib/utils.js#L60 throws a Forbidden error in the node app, which is really a 403 due to this:

* https://github.com/senchalabs/connect/blob/2.7.6/lib/middleware/csrf.js#L51

It is trying to get the _csrf token off the request's body, query, or headers here: https://github.com/senchalabs/connect/blob/2.7.6/lib/middleware/csrf.js#L69-72

We stash the CSRF token via nunjucks in the details page in a meta element, and then when we POST back to the server, we attempt to extract it and put it in the body of the request (in https://github.com/mozilla/webmaker.org/blob/master/public/js/pages/details.js):

      $.post(method, {
        makeID: makeID,
        _csrf: $("meta[name='X-CSRF-Token']").attr("content")

The problem is that not all browsers support querySelector (see http://caniuse.com/#search=queryselector), and in particular, any IE before 8.0.  We're using the 2.x version of jQuery, which removed hacks for old IE browsers, so this query for the meta element with the X-CSRF-Token will fail.

As such, we have two things to consider:

1) What to do with browsers that don't support the syntax above.  One option is to do: _csrf: $("meta[name='X-CSRF-Token']").attr("content") || "unsupported-browser"

2) Guard against our app crashing when we try to access body._csrf and it isn't there.  Some people have monkey-patched their middleware pipeline by adding a dummy token, see: https://github.com/pilwon/ultimate-seed/commit/77f79316cd566d052266475fd5fd76aec1fd87fb

Assigning this to cade, as he's deeper into the code/solution than I am.  I just wanted to document our findings.
Assignee: david.humphrey → cade
Further info regarding 2) above.  Jon points out that this should never cause our node app to crash, and the fact that it is points to something further going wrong after the error occurs.  The most likely explanation is that the error reporter is dying, and causing a top-level error to make it up to the domain:

https://github.com/mozilla/webmaker.org/blob/5716b24567b01d395826d45da8871a3d9c022cdb/app.js#L64-78

We should wrap that in a try/catch so that failures in reportError don't cascade up to the domain, but get eaten.
Comment on attachment 803087 [details] [review]
https://github.com/mozilla/webmaker.org/pull/362

Two fixes.  First one is a nit, see PR.

Second one is a correction for an error Jon introduced in another patch.  Please change these three lines:

https://github.com/cadecairos/webmaker.org/blob/b840914f99355faa6e6032a3d252342473ae4a2b/app.js#L60-62

to this:

  GLOBAL.graylogHost = env.get( "GRAYLOG_HOST" );
  GLOBAL.graylogFacility = env.get( "GRAYLOG_FACILITY" );
  require( 'graylog' );

r=me with those changes.
Attachment #803087 - Flags: review?(david.humphrey) → review-
Comment on attachment 803087 [details] [review]
https://github.com/mozilla/webmaker.org/pull/362

updated with the fixes you pointed out
Attachment #803087 - Flags: review- → review?(david.humphrey)
Comment on attachment 803087 [details] [review]
https://github.com/mozilla/webmaker.org/pull/362

Great, let's get this on staging/prod ASAP.
Attachment #803087 - Flags: review?(david.humphrey) → review+
Commit pushed to master at https://github.com/mozilla/webmaker.org

https://github.com/mozilla/webmaker.org/commit/b3d334c452cf69f4e068d22886868ac43fda1b9a
Bug 914904 - Guard against errors thrown in reportError function r=humph
The root cause of this appears to be an insert into the events database. I've noticed Logs for the insert between 0 and 1 seconds before the crash.

https://loggins.mofoprod.net/messages/430e4940-1b20-11e3-8f5a-22000aa45f57
https://loggins.mofoprod.net/messages/703e9d00-1b1d-11e3-8f5a-22000aa45f57
https://loggins.mofoprod.net/messages/30d4a5c0-1b21-11e3-8f5a-22000aa45f57
More often than not, I'm seeing this error:

https://loggins.mofoprod.net/messages/a0632bb0-1b34-11e3-8f5a-22000aa45f57

Which is:

Uncaught, unspecified "error" event.

Assuming this is related to the other sequelize bits in the stack, and cade's idea about SQL inserts always being present, I think it might be related to us not having an 'error' event handler for the sql work, and it blowing up when it doesn't get handled.  Lots of people talk about it on the web, e.g. https://gist.github.com/deepwell/3501648 and most in the context of having bad auth info for the db.  That doesn't make sense in our case, I don't think, but not having an on('error', ...) handler does.

Thoughts?
So it looks like all the create, update and delete routes do have `.error()` calls.

Now, I'm thinking outside of the box here, but, In my testing, what immediately follows an insert into our events DB is a request to get all events: https://gist.github.com/cadecairos/0edbadc2fa23b7b53794#file-logs-L5

This findAll call has no error handling: https://github.com/mozilla/webmaker.org/blob/178afb744edbe7bb771819fbe8600bfe1d89b15b/lib/events/controllers/events.js#L383-L408

Reading though the docs for Sequelize ( http://sequelizejs.com/documentation#misc-asynchronicity ) I found this snippet below the block of code I linked to: 

"Please notice: Since v1.5.0 the 'error' event is used to notify about errors. If such events aren't caught however, Node.JS will throw an error. So you would probably like to catch them :D"

This error event caused by the select might be our problem, since it's nowhere to be found in the graylog logs, and should happen right after the insert (but in our case, we see the error!)
This adds in some error handling for findAll and all calls used by the event tool.
Attachment #803737 - Flags: review?(schranz.m)
Attachment #803737 - Flags: review- → review?(schranz.m)
Attachment #803737 - Flags: review?(schranz.m) → review+
Commit pushed to master at https://github.com/mozilla/webmaker.org

https://github.com/mozilla/webmaker.org/commit/07ec50862bbfa28d8674c4981dbda1ffe5053e30
Bug 914904 - Protect against SQL errors when executing findAll on the event data model
Unfortunately, Adding an event on production continues to cause that error to be thrown.
So what I've narrowed this down to so far is that we try here to send a response, but somewhere, it has already been sent (not sure where, but my one suspect is events, triggered by some sequelize error)

https://github.com/mozilla/webmaker.org/blob/29ee19f969677d06a2d5d48cc9ebd2b4153f3c01/app.js#L105-L106

So in loggins we see the original error( in the case I was looking at, something originating from this node cluster function:

https://github.com/joyent/node/blob/master/lib/cluster.js#L453-L479

and we also see the log from our try catch on this line: 

https://github.com/mozilla/webmaker.org/blob/29ee19f969677d06a2d5d48cc9ebd2b4153f3c01/app.js#L111

ultimately, this causes the process to restart, but because we never made it to ( https://github.com/mozilla/webmaker.org/blob/29ee19f969677d06a2d5d48cc9ebd2b4153f3c01/app.js#L108-L109 ) it takes 30 seconds.

At the least, we can probably solve the duplicate response issue.
Trying a few things here:

* Change domain kill timeout to 15s
* Add secondary try/catch around the pretty-500 when the domain fails (so we don't block on that)
* Change the order of the 404 and 500 handlers, as per the Express example.
Attachment #804109 - Flags: review?(cade)
Are we dealing with NewRelic crashing here?

2013-09-13 00:08:41.0 	i-e2d8e799 	Info 	local2 	i-e2d8e799 app=webmakerorg-production: at Object.monitored (/var/www/webmakerorg/node_modules/newrelic/lib/transaction/tracer.js:159:17)
2013-09-13 00:08:41.0 	i-e2d8e799 	Info 	local2 	i-e2d8e799 app=webmakerorg-production: at /var/www/webmakerorg/node_modules/newrelic/lib/transaction/tracer.js:180:22
2013-09-13 00:08:41.0 	i-e2d8e799 	Info 	local2 	i-e2d8e799 app=webmakerorg-production: at ErrorTracer.monitor (/var/www/webmakerorg/node_modules/newrelic/lib/error.js:192:12)
2013-09-13 00:08:41.0 	i-e2d8e799 	Info 	local2 	i-e2d8e799 app=webmakerorg-production: at Object.cookieSession (/var/www/webmakerorg/node_modules/express/node_modules/connect/lib/middleware/cooki ...
2013-09-13 00:08:41.0 	i-e2d8e799 	Info 	local2 	i-e2d8e799 app=webmakerorg-production: at /var/www/webmakerorg/node_modules/newrelic/lib/transaction/tracer.js:180:22
2013-09-13 00:08:41.0 	i-e2d8e799 	Info 	local2 	i-e2d8e799 app=webmakerorg-production: at Tracer.monitor (/var/www/webmakerorg/node_modules/newrelic/lib/transaction/tracer.js:178:28)
2013-09-13 00:08:41.0 	i-e2d8e799 	Info 	local2 	i-e2d8e799 app=webmakerorg-production: 109.70.6.77 - - [Fri, 13 Sep 2013 00:08:41 GMT] "POST /events HTTP/1.1" 403 8912 "https://webmaker.org/events/" ...
2013-09-13 00:08:41.0 	i-e2d8e799 	Info 	local2 	i-e2d8e799 app=webmakerorg-production: Server worker started.
2013-09-13 00:08:41.0 	i-e2d8e799 	Info 	local2 	i-e2d8e799 app=webmakerorg-production: Error: Forbidden
2013-09-13 00:08:41.0 	i-e2d8e799 	Info 	local2 	i-e2d8e799 app=webmakerorg-production: at Object.<anonymous> (/var/www/webmakerorg/node_modules/express/node_modules/connect/lib/middleware/csrf.js ...
2013-09-13 00:08:41.0 	i-e2d8e799 	Info 	local2 	i-e2d8e799 app=webmakerorg-production: at next (/var/www/webmakerorg/node_modules/express/node_modules/connect/lib/proto.js:190:15)
2013-09-13 00:08:41.0 	i-e2d8e799 	Info 	local2 	i-e2d8e799 app=webmakerorg-production: at Object.exports.error (/var/www/webmakerorg/node_modules/express/node_modules/connect/lib/utils.js:60:13)
2013-09-13 00:08:41.0 	i-e2d8e799 	Info 	local2 	i-e2d8e799 app=webmakerorg-production: at Object.wrappedConnectHandle [as handle] (/var/www/webmakerorg/node_modules/newrelic/lib/instrumentation/c ...

Maybe we should move the NewRelic module out of the forked child-process, and just run it once on the cluster master process.
Another thing I'm wondering is whether the domain stuff is actually working for the lib/events stuff at all, the way it's setup.  We should consider normalizing this mess into a single Express instance.
I had hoped the newrelic upgrade to 0.10.3 that went into webmakerorg 0.0.148 would fix the newrelic part of this.  Unforunately, I see a ton of these errors in loggins from today.

https://loggins.mofoprod.net/messages/universalsearch?timespan=172800&query=%22%2Fvar%2Fwww%2Fwebmakerorg%2Fnode_modules%2Fnewrelic%2Flib%2Ferror.js%22

I wonder why this is happening only on webmakerorg.
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production: Executing: SET FOREIGN_KEY_CHECKS = 1;
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production: Executing: UPDATE `Events` SET `title`='Intro to Web',`description`='Students AIUB will participate here. Some of Student Ambassador may join with us.',`address`='AIUB, Dhaka, Dhaka Division, Bangladesh',`latitude`=23.7943,`longitude`=90.4055,`beginDate`='2013-09-26 00:00:00',`beginTime`='1899-12-31 14:00:00',`endDate`='2013-09-26 00:00:00',`endTime`='1899-12-31 17:00:00',`attendees`=35,`registerLink`='https://www.facebook.com/events/218948418264084',`updatedAt`='2013-09-13 05:52:08' WHERE `id`=1533
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production: Executing: SET FOREIGN_KEY_CHECKS = 1;
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production: Executing: UPDATE `Events` SET `title`='Intro to Web',`description`='Students AIUB will participate here. Some of Student Ambassador may join with us.',`address`='AIUB, Dhaka, Dhaka Division, Bangladesh',`latitude`=23.7943,`longitude`=90.4055,`city`='Dhaka',`country`='BD',`attendees`=35,`beginDate`='2013-09-26 00:00:00',`endDate`='2013-09-26 00:00:00',`beginTime`='1899-12-31 14:00:00',`endTime`='1899-12-31 17:00:00',`registerLink`='https://www.facebook.com/events/218948418264084',`picture`=NULL,`organizer`='ashickur.noor@gmail.com',`organizerId`='ashickur-noor',`featured`=false,`id`=1533,`createdAt`='2013-09-06 08:30:21',`updatedAt`='2013-09-13 05:52:08' WHERE `id`=1533
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production: Error: TypeError: Uncaught, unspecified "error" event.
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at TypeError (<anonymous>)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at EventEmitter.emit (events.js:74:15)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at null.<anonymous> (/var/www/webmakerorg/node_modules/sequelize/lib/emitters/custom-event-emitter.js:67:17)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at EventEmitter.emit (events.js:95:17)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at null.<anonymous> (/var/www/webmakerorg/node_modules/sequelize/lib/query-interface.js:448:17)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at b (domain.js:183:18)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at EventEmitter.emit (events.js:98:17)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at module.exports.finish (/var/www/webmakerorg/node_modules/sequelize/lib/query-chainer.js:138:30)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at exec (/var/www/webmakerorg/node_modules/sequelize/lib/query-chainer.js:92:16)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at onError (/var/www/webmakerorg/node_modules/sequelize/lib/query-chainer.js:72:11)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production: TypeError: Uncaught, unspecified "error" event.
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at TypeError (<anonymous>)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at EventEmitter.emit (events.js:74:15)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at null.<anonymous> (/var/www/webmakerorg/node_modules/sequelize/lib/emitters/custom-event-emitter.js:67:17)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at EventEmitter.emit (events.js:95:17)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at null.<anonymous> (/var/www/webmakerorg/node_modules/sequelize/lib/query-interface.js:448:17)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at b (domain.js:183:18)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at EventEmitter.emit (events.js:98:17)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at module.exports.finish (/var/www/webmakerorg/node_modules/sequelize/lib/query-chainer.js:138:30)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at exec (/var/www/webmakerorg/node_modules/sequelize/lib/query-chainer.js:92:16)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at onError (/var/www/webmakerorg/node_modules/sequelize/lib/query-chainer.js:72:11)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production: Error: unable to send 500 Error: Can't set headers after they are sent.
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at ServerResponse.OutgoingMessage.setHeader (http.js:707:11)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at ServerResponse.res.setHeader (/var/www/webmakerorg/node_modules/express/node_modules/connect/lib/patch.js:59:22)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at ServerResponse.res.set.res.header (/var/www/webmakerorg/node_modules/express/lib/response.js:518:10)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at ServerResponse.res.send (/var/www/webmakerorg/node_modules/express/lib/response.js:124:10)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at fn (/var/www/webmakerorg/node_modules/express/lib/response.js:751:10)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at Function.app.render (/var/www/webmakerorg/node_modules/nunjucks/src/environment.js:139:17)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at ServerResponse.res.render (/var/www/webmakerorg/node_modules/express/lib/response.js:755:7)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at ServerResponse.render (/var/www/webmakerorg/node_modules/newrelic/lib/instrumentation/express.js:40:23)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at Domain.<anonymous> (/var/www/webmakerorg/app.js:106:11)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at Domain.EventEmitter.emit (events.js:95:17)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production: 
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production: Error: channel closed
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at process.target.send (child_process.js:402:26)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at Worker.send (cluster.js:401:21)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at sendInternalMessage (cluster.js:394:10)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at Cluster.cluster._getServer (cluster.js:586:3)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at dgram.js:193:15
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at /var/www/webmakerorg/node_modules/newrelic/node_modules/continuation-local-storage-glue/bootstrap.js:32:23
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at dns.js:72:18
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at /var/www/webmakerorg/node_modules/newrelic/node_modules/continuation-local-storage-glue/bootstrap.js:32:23
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production:     at process._tickDomainCallback [as _tickCallback] (node.js:459:13)
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production: Server worker 14 exited.
Sep 13 05:52:08 i-e2d8e799 app=webmakerorg-production: Starting server worker...
Sep 13 05:52:09 i-e2d8e799 app=webmakerorg-production: {"name":"newrelic","hostname":"i-e2d8e799","pid":22353,"level":30,"msg":"Starting New Relic Node.js instrumentation.","time":"2013-09-13T05:52:09.560Z","v":0}

That is the full stack error

This is a slightly different one, adding it to this bug for reference.
Attachment #804109 - Flags: review?(cade) → review+
Our crash looks like this:

Sep 13 17:17:58 i-18f9927b app=webmakerorg-production: Executing: INSERT INTO `Events` (`title`,`description`,`address`,`latitude`,`longitude`,`beginDate`,`beginTime`,`endDate`,`endTime`,`attendees`,`registerLink`,`organizerId`,`organizer`,`updatedAt`,`createdAt`) VALUES ('FakeEvent','Not a real Event','Yolo, CA','38.7646021','-121.90179539999997','2013-09-18 00:00:00','1899-12-31 10:00:00','2013-09-18 00:00:00','1899-12-31 10:00:00',6,'https://www.google.ca/signupforthisfakevent/wat/wat.html?aslkdfjasldkfjasldkfjasldkfjasdfasd=sdflksjdflskdfjslkdfjsldkfjslkdjfsldf','cade','chris@chrisdecairos.ca','2013-09-13 17:17:58','2013-09-13 17:17:58');
Sep 13 17:17:59 i-18f9927b app=webmakerorg-production: Executing: SET FOREIGN_KEY_CHECKS = 1;
Sep 13 17:17:59 i-18f9927b app=webmakerorg-production: Executing: UPDATE `Events` SET `title`='FakeEvent',`description`='Not a real Event',`address`='Yolo, CA',`latitude`='38.7646021',`longitude`='-121.90179539999997',`city`='Zamora',`country`='US',`attendees`=6,`beginDate`='2013-09-18 00:00:00',`endDate`='2013-09-18 00:00:00',`beginTime`='1899-12-31 10:00:00',`endTime`='1899-12-31 10:00:00',`registerLink`='https://www.google.ca/signupforthisfakevent/wat/wat.html?aslkdfjasldkfjasldkfjasldkfjasdfasd=sdflksjdflskdfjslkdfjsldkfjslkdjfsldf',`picture`=NULL,`organizer`='chris@chrisdecairos.ca',`organizerId`='cade',`featured`=false,`id`=2256,`createdAt`='2013-09-13 17:17:58',`updatedAt`='2013-09-13 17:17:59' WHERE `id`=2256
Sep 13 17:17:59 i-18f9927b app=webmakerorg-production: CRASH: TypeError: Uncaught, unspecified "error" event.
Sep 13 17:17:59 i-18f9927b app=webmakerorg-production:     at TypeError (<anonymous>)
Sep 13 17:17:59 i-18f9927b app=webmakerorg-production:     at EventEmitter.emit (events.js:74:15)
Sep 13 17:17:59 i-18f9927b app=webmakerorg-production:     at null.<anonymous> (/var/www/webmakerorg/node_modules/sequelize/lib/emitters/custom-event-emitter.js:67:17)
Sep 13 17:17:59 i-18f9927b app=webmakerorg-production:     at EventEmitter.emit (events.js:95:17)
Sep 13 17:17:59 i-18f9927b app=webmakerorg-production:     at null.<anonymous> (/var/www/webmakerorg/node_modules/sequelize/lib/query-interface.js:448:17)
Sep 13 17:17:59 i-18f9927b app=webmakerorg-production:     at b (domain.js:183:18)
Sep 13 17:17:59 i-18f9927b app=webmakerorg-production:     at EventEmitter.emit (events.js:98:17)
Sep 13 17:17:59 i-18f9927b app=webmakerorg-production:     at module.exports.finish (/var/www/webmakerorg/node_modules/sequelize/lib/query-chainer.js:138:30)
Sep 13 17:17:59 i-18f9927b app=webmakerorg-production:     at exec (/var/www/webmakerorg/node_modules/sequelize/lib/query-chainer.js:92:16)
Sep 13 17:17:59 i-18f9927b app=webmakerorg-production:     at onError (/var/www/webmakerorg/node_modules/sequelize/lib/query-chainer.js:72:11)
Sep 13 17:17:59 i-18f9927b app=webmakerorg-production: Internal error isolating shutdown sequence: Error: Can't set headers after they are sent.
Sep 13 17:17:59 i-18f9927b app=webmakerorg-production: Server worker 2 exited.

We've now got the cluster/domain stuff right.  We just need to figure this out so we don't crash at all.
Let's see if this is failing.
Attachment #804571 - Flags: review?(cade)
So we've improved things a bit, and landed a bunch of fixes.  Right now we've done the following:

* Turned off newrelic on webmaker.org prod (we were seeing it in stacks)
* Turned off graylog UDP logging from webmaker.org -> loggins (to see if this was affecting loggins, which was crashing non-stop all day)
* Pushed a bunch of small fixes to the domain and error handling code
* Pushed a fix to events to add error handling for reverse geocode lookups (that doesn't seem to be the problem, although it's a good fix nevertheless).  We were wondering if the fact that we don't have a MapQuest key for their API was causing us to hit the daily API limit on prod.  Doesn't look like that's happening (although we probably do need a key here).

We're no longer seeing nagios reports from webmaker.org on prod every few mins, so things are more stable.  However, we're still crashing all the time, and the node cluster is just restarting forks.

The STR to trigger the crash according to cade are:

1) Create an event (doesn't matter what data you enter)

This crash is *not* reproducible on staging, only on prod.

The three of us are burnt out on this problem, so we're going to circle back next week, and get Jon to help us look.  Other people might want to dig in to.

Somehow the events stuff is hosed, and missing an error handler that should be handling some bad case async, and it isn't.
Summary: Investigate webmaker.org crashes → Investigate webmaker.org PROD crash in Events
naive question: can we remove events from the home page for now?
(In reply to David Ascher (:davida) from comment #25)
> naive question: can we remove events from the home page for now?

No
Unfortunately, we are still seeing webmaker.org alerts.

Time (Any)	Host (Any)	Service (Any)	State (Any)	Output
22:01:422013-09-13	MenuWebmaker-Production	Menuhealthcheck	 OK	HTTP OK: HTTP/1.1 200 OK - 52373 bytes in 0.636 second response time
22:00:102013-09-13	MenuWebmaker-Production	Menuhealthcheck	 CRITICAL	CRITICAL - Socket timeout after 30 seconds
21:55:412013-09-13	MenuWebmaker-Production	Menuhealthcheck	 OK	HTTP OK: HTTP/1.1 200 OK - 52371 bytes in 0.645 second response time
21:55:102013-09-13	MenuWebmaker-Production	Menuhealthcheck	 CRITICAL	CRITICAL - Socket timeout after 30 seconds
21:27:392013-09-13	MenuWebmaker-Production	Menuhealthcheck	 OK	HTTP OK: HTTP/1.1 200 OK - 52371 bytes in 0.777 second response time
Comment on attachment 804571 [details] [review]
https://github.com/mozilla/webmaker.org/pull/370

This was an IRC r+
Attachment #804571 - Flags: review?(cade) → review+
Depends on: 916839
Root cause of this crash was a missing `city` and `country` field in the database. Since launch.

*whoops*

I also fixed up the code to catch that top-level error event, so that's good too.
Hooray!
Attachment #806688 - Flags: review?(david.humphrey)
Comment on attachment 806688 [details] [review]
https://github.com/mozilla/webmaker.org/pull/376

Great work digging into this, nice fix.
Attachment #806688 - Flags: review?(david.humphrey) → review+
Hooray!
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Thanks for figuring this one out jbuck! It had been haunting my dreams for days.
Attachment mime type: text/plain → text/x-github-pull-request
Attachment mime type: text/plain text/plain text/plain text/plain → text/x-github-pull-request text/x-github-pull-request text/x-github-pull-request text/x-github-pull-request
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: