Communication disruption, and message loss, over poor TCP connection

RESOLVED FIXED

Status

Webtools
Pulse
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: ekyle, Assigned: ekyle)

Tracking

Trunk

Firefox Tracking Flags

(firefox41 affected)

Details

Attachments

(1 attachment, 6 obsolete attachments)

(Assignee)

Description

3 years ago
The GenericConsumer client will eventually loose contact with the server and will not notice this fact until the queue is deleted.  I suggest adding a `timeout` to the constructor so that this problem can be suppressed.

> #ORIGINAL 
> with consumer:
>     while True:
>         self.connection.drain_events()

> #NEW
> with consumer:
>     from socket import timeout as socket_timeout
>     while True:
>         try:
>             self.connection.drain_events(timeout=self.timeout)
>         except socket_timeout, _:
>             pass
(Assignee)

Comment 1

3 years ago
Created attachment 8616031 [details] [diff] [review]
proposed diff

add optional timeout
Attachment #8616031 - Flags: review?(mcote)
(Assignee)

Comment 2

3 years ago
Comment on attachment 8616031 [details] [diff] [review]
proposed diff

After a long run, it appears network failures are still silent.  This did not fix the problem.
Attachment #8616031 - Flags: review?(mcote) → review-
(Assignee)

Comment 3

3 years ago
Created attachment 8617322 [details] [diff] [review]
Can handle short term disruption

It does not recover from prolonged network outage, but it can recover from the TCP resets that plague my network.
Attachment #8617322 - Flags: review?(mcote)

Comment 4

3 years ago
Comment on attachment 8617322 [details] [diff] [review]
Can handle short term disruption

Review of attachment 8617322 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozillapulse/consumers.py
@@ +1,4 @@
>  # This Source Code Form is subject to the terms of the Mozilla Public
>  # License, v. 2.0. If a copy of the MPL was not distributed with this
>  # file, You can obtain one at http://mozilla.org/MPL/2.0/.
> +from sys import stdout

Please put this under the 'from socket' import, and add a blank line after, to separate it from the third-party imports.

@@ +13,3 @@
>  
>  # Exceptions we can raise
> +from mozillapulse.publishers import InvalidExchange

Put this under the 'from config' (and actually, while you're here, please change that to 'from mozillapulse.config').

@@ +13,5 @@
>  
>  # Exceptions we can raise
> +from mozillapulse.publishers import InvalidExchange
> +try:
> +    from pyLibrary.debugs.logs import Log2

Er, this is one of your custom modules, isn't it?  If you want to include this patch in the standard mozillapulse library (which it probably should be), please just use regular logging, rather than trying to import a module that will only be used in your project.

@@ +192,5 @@
> +
> +    def _drain_events_loop(self):
> +        while True:
> +            try:
> +                Log.warning("test", {})

Please remove this.

@@ +194,5 @@
> +        while True:
> +            try:
> +                Log.warning("test", {})
> +                self.connection.drain_events(timeout=self.timeout)
> +            except socket_timeout, _:

You can just do 'except socket_timeout' here, since you aren't using the exception instance at all.
Attachment #8617322 - Flags: review?(mcote) → review-
(Assignee)

Comment 5

3 years ago
Created attachment 8622427 [details] [diff] [review]
DIFF.txt

...with changes requested
Attachment #8616031 - Attachment is obsolete: true
Attachment #8617322 - Attachment is obsolete: true
(Assignee)

Updated

3 years ago
Attachment #8622427 - Flags: review?(mcote)

Comment 6

3 years ago
Comment on attachment 8622427 [details] [diff] [review]
DIFF.txt

Review of attachment 8622427 [details] [diff] [review]:
-----------------------------------------------------------------

Better but still one problem and one nit.

::: mozillapulse/consumers.py
@@ +185,5 @@
> +        while True:
> +            try:
> +                self.connection.drain_events(timeout=self.timeout)
> +            except socket_timeout:
> +                logging.warning("timeout, full restart required")

This warning is slightly misleading--it sounds like you require the client to do the restart.  Perhaps something like "Socket timed out; reconnecting."

@@ +189,5 @@
> +                logging.warning("timeout, full restart required")
> +                try:
> +                    self.disconnect()
> +                except Exception, e:
> +                    logging.warning("Problem with disconnect()", e)

This doesn't work; I get "TypeError: not all arguments converted during string formatting".  There's a built-in way to log exceptions anyway:

logging.exception("Problem with disconnect().")

That will automatically log the stack trace (so you can do a blanket "except:" clause).
Attachment #8622427 - Flags: review?(mcote) → review-
(Assignee)

Comment 7

3 years ago
Created attachment 8637928 [details] [diff] [review]
DIFF.txt

Solved the last issue: Python logging does not accept just anything on the second parameter.

Sorry for the delay.
Attachment #8622427 - Attachment is obsolete: true
Attachment #8637928 - Flags: review?(mcote)
(Assignee)

Comment 8

3 years ago
Created attachment 8637929 [details] [diff] [review]
DIFF.txt

Forgot to fix the misleading statement.
Attachment #8637928 - Attachment is obsolete: true
Attachment #8637928 - Flags: review?(mcote)
Attachment #8637929 - Flags: review?(mcote)

Comment 9

3 years ago
Comment on attachment 8637929 [details] [diff] [review]
DIFF.txt

Review of attachment 8637929 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozillapulse/consumers.py
@@ +185,5 @@
> +        while True:
> +            try:
> +                self.connection.drain_events(timeout=self.timeout)
> +            except socket_timeout, e:
> +                log_warning("timeout! Restarting pulse consumer.", cause=e)

I'm confused--why can't you use logging.exception(), the standard way to log exceptions?
(Assignee)

Comment 10

3 years ago
`logging.exception()` logs an error, which is overstating the severity of the problem.

In general, I am trying to follow the exception chaining pattern in PEP 3134 [1].  Without a `raise x from y` idiom in 2.7, I personally use a method call: `Log.error(x, cause=y)`.  I do the same for warnings, (`Log.warning(x, cause=y)`), which has become a habit.  The benefit to this pattern is the root cause of the exception is not lost because the exceptions are chained.  

Specifically, I would like this code to keep the exception chain.  Python 2.7 `logging.exception()` does not reveal the exception chain, even when using `future.utils.raise_from()` [2]  Therefore, I want to provide a handle or method, `log_warning(message, cause)`, that some future logging module can intercept to provide this feature.  

Granted, it seems that `socket.timeout` is a root cause itself.  My main argument about keeping exception chains does not apply because there is only one link; hardly a chain.  Yet, looking into the kombu/ampq code deeper, we see that `socket.timeout` is not actually the root cause, it originally came from [3].  The exception was subsequently put into a queue, and then re-raised, loosing the original trace.  Should the maintainers decide to fix this, our code will correctly report the root cause.

I hope that you find the exception chaining pattern appealing (eg `raise ex from cause`, or `Log.error(ex, cause)` or `raise_from(ex, cause)`), despite it not needed in this specific case. 


[1] http://legacy.python.org/dev/peps/pep-3134/
[2] https://github.com/klahnakoski/pyLibrary/blob/800588b3b28ae9c3e7a8a1f91fa5aeb2028344d4/tests/test_logs.py#L184
[3] https://github.com/celery/kombu/blob/6449c1af53307798f83d8cfc2b8636b4d13fd8fa/kombu/transport/amqplib.py#L123
[4] Marionette was designed to track cause, too bad it's not used much:  https://dxr.mozilla.org/mozilla-central/source/testing/marionette/driver/marionette_driver/errors.py#20
(Sorry for the delay in replying)

Okay, that's interesting, and I see what you're saying.  But two things: (a) is there not anything standardized to do this already?  And more importantly, (b), the YAGNI principle.  You've specifically put in a comment that this is not used right now.  Why put it in now, then?  It would be simple to put it in later, and we should do it all over the code, not just in one place.

So for now, I think the presence of that code raises more questions than it answers, so I'd prefer if you left it out.  I will think more about this idea, though.
Comment on attachment 8637929 [details] [diff] [review]
DIFF.txt

Review of attachment 8637929 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with the proviso that the logging stuff be left standard for now.
Attachment #8637929 - Flags: review?(mcote) → review+
(Assignee)

Comment 13

3 years ago
Created attachment 8641230 [details] [diff] [review]
DIFF.txt

a) I have been unable to find anything in python 2.7 that will properly raise a warning, along with its cause.  
b) On the subject of YAGNI:  I guess tracing causes is not a big deal.
Attachment #8637929 - Attachment is obsolete: true
Attachment #8641230 - Flags: review?(mcote)
Comment on attachment 8641230 [details] [diff] [review]
DIFF.txt

Review of attachment 8641230 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!

Not sure if you have permission to push to the repo.  If not, I can do it.  No need to submit a follow-up patch in any case; either fix the minor things below on push, or I'll do it when I push.

::: mozillapulse/consumers.py
@@ +33,3 @@
>      somewhere, e.g. the constructor.
>      """
> +    def __init__(self, config, exchange=None, connect=True, heartbeat=False, timeout=None,

Line is too long.

@@ +136,5 @@
>          For info on one script listening to multiple channels, see
>          http://ask.github.com/carrot/changelog.html#id1.
>          """
> +        while True:
> +            consumer = self._build_consumer(callback=callback, on_connect_callback=on_connect_callback)

Line is too long.

@@ +188,5 @@
> +                logging.warning("timeout! Restarting pulse consumer.")
> +                try:
> +                    self.disconnect()
> +                except Exception:
> +                    logging.warning("Problem with disconnect()")

We should still log the exception here, no?
Attachment #8641230 - Flags: review?(mcote) → review+
(Assignee)

Comment 15

3 years ago
Created attachment 8642445 [details] [diff] [review]
DIFF.txt

I have no permissions to the repo, so I can not push.

I fixed the line lengths.

> @@ +188,5 @@
> > +                logging.warning("timeout! Restarting pulse consumer.")
> > +                try:
> > +                    self.disconnect()
> > +                except Exception:
> > +                    logging.warning("Problem with disconnect()")
> 
> We should still log the exception here, no?

mozillapulse was elegant because it had no logging, so I only added the minimum (which will log to stderr). 

Some options I considered:

* Using Python logger instances: Since we are on a path of structured logging, adding another text log seemed to be the wrong direction.  
* Using pyLibrary:  It comes with structured logging and exception chaining, but it comes with too much other stuff.
* Using mozlog:  It has structured logging, but is too test-specific.
* Just propagating the timeout error:  I worried it may break the assumptions of the existing callers.

In conclusion, I think it's best that the logging philosophy be decided and implement in a separate patch.
Attachment #8641230 - Attachment is obsolete: true
Attachment #8642445 - Flags: review?(mcote)
"Elegant" and "logging" should not be mutually exclusive, I hope!  The standard Python logging library has simple ways to turn down/disable logging for specific modules should you not care about their output.  That said, I am not opposed to logging being implemented in a separate bug.
Comment on attachment 8642445 [details] [diff] [review]
DIFF.txt

Review of attachment 8642445 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozillapulse/consumers.py
@@ +40,5 @@
> +        connect=True,
> +        heartbeat=False,
> +        timeout=None,
> +        **kwargs
> +    ):

This is weird, but I'll fix it on check-in.
Attachment #8642445 - Flags: review?(mcote) → review+

Updated

3 years ago
Assignee: nobody → klahnakoski
Thanks for the clean-up as well!

https://hg.mozilla.org/automation/mozillapulse/rev/cafbb0f97f10

(Btw I added mozillapulse to MozReview if you want to use that for any future patches. :)
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.