Closed Bug 360118 Opened 18 years ago Closed 18 years ago

Thunderbird SMTP client does not honor SMTP negative responses codes and state machine.

Categories

(Thunderbird :: General, defect)

x86
All
defect
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sant9442, Assigned: ch.ey)

Details

(Keywords: fixed1.8.1.1)

Attachments

(1 file)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7

This TBIRD mail client problem begins with is hardcoded usage of domain literals for issuing the EHLO and HELO commands.  See bug 244030.  Since the IP addresses do not match, a SMTP server can potentially reject this transaction as shown to the case below.   As you can see, TBIRD ignored the 501 error and continued with the fallback to the HELO command (should only be done on 500 response). It received another 501 response code, which TBIRD ignored and continues with the MAIL FROM command.    

This behavior can put users at risk with erroneous red flags marking them as SMTP spam abusers who don't follow the SMTP protocol state machine correctly.


Real example session showing problem:

Wildcat! SMTP Server v6.1.451.9
SMTP log started at Mon, 06 Nov 2006  19:14:50
Connection Time: 20061106 19:14:50  cid: 00007126
SSL Enabled: NO
Client IP: 72.144.114.198 (unknown)
19:14:50 S: 220 winserver.com Wildcat! ESMTP Server v6.1.451.9 ready
19:14:50 C: EHLO [192.168.1.103]
19:14:50 S: 501 Invalid EHLO client address.
19:14:50 C: HELO [192.168.1.103]
19:14:50 S: 501 Invalid HELO client address.
19:14:50 C: MAIL FROM:<xxxxxxxx@xxxxxxxx.com>
19:14:50 S: 503 Send HELO first.
19:14:52 ** Completed


Reproducible: Always

Steps to Reproduce:
1. Install TBIRD on a workstation behind a NAT where the IP is 192.168.1.x.

2. Try to send mail to a SMTP server who is checking valid EHLO/HELP Domain.  The Santronics Wildcat! SMTP server is such a server.  Try sending mail to mail.winserver.com to see the behavior.

3.  Capture the log session is possible with TBIRD to see this transaction

Actual Results:  
Simply send to a SMTP server that is performing EHLO/HELO domain literal violations.

Expected Results:  
Mail rejection.

Stop the transaction at the first 501.

However, related to bug 244030, TBIRD should offer a way to not use the DOMAIN LITERAL which is the initial reason for the rejection.  Other MAIL CLIENTS offer a way to issue the local computer name for example.
This was my first bugzilla bug report so I wasn't sure how the final description would appear.  It seems ok to me and hopefully knowledgeable SMTP people will easily follow the issue at hand.

The solution to the problem comes in two forms:

1) TBIRD needs an option to not use the domain literal. This is discussed in bug 244030. 

2) It needs to basically honor ALL response codes properly.

- 500 for the EHLO should be used as the fallback to HELO.

- 50x should not allow TBIRD to continue. It should issue QUIT and drop the connection.



Status: UNCONFIRMED → NEW
Ever confirmed: true
You're right on that the address visible for the client often isn't the one the outside world sees. I can't think of an automatic solution and so manual configuration seems ok. Adding such an option has been discussed in bug 68877 some time ago, please also have a look there. Anyways, that's another bug.

> As you can see, TBIRD ignored the 501 error and continued with the fallback to
> the HELO command (should only be done on 500 response).

I can see that the response code from our "HELO" isn't checked. I agree that's wrong.
And after reading the relevant RFC chapters several times I think a 501 can be interpreted that we shouldn't even fall back.

Tomorrow I'll try to write a patch that honors all EHLO/HELO response codes and issues meaningful messages.
Hi Chris,

No doubt, following the proper SMTP response codes are paramount.  Not only because it is a SMTP requirement and compliance is become more important these days, more importantly, we don't want TBIRD users to be erroneously flagged for improper SMTP client behavior.  So resolving this now will eliminate all wrong client behavior flagging potential.

Regarding the domain literal.   Overall, I think a new UI option at the Outbound SMTP Server dialog will resolve this.  If we want an UI option to govern this, it could be:

  [x] Perform PTR lookup for EHLO/HELO  (New Default)

But can we do this automatically without a UI change or a default UI condition?

I believe the code in nsSmtpProtocol::GetUserDomainName() could be changed to offer an automated solution.  I don't know the source code completely, but something like this outline:

  rv = socketTransport->GetSelfAddr(&iaddr);
  if (NS_SUCCEEDED(rv))
  {
    // If IP is 127.0.0.1, use "LocalHost"
    // Otherwise perform PTR lookup VIA DNS (don't use GetHostByName(NULL))
    // If NULL response, use the Local Computer Name
  }

I think the above will suffice.  Just don't use GetHostByName(NULL) because under this will return the local computer name or localhost or what's in the HOSTS file first. You need to do a real DNS client PTR lookup to bypass the HOSTS file.

PS:  You can use the smtp server MAIL.WINSERVER.COM to test this if you wish. 

--
HLS
Attached patch proposed patchSplinter Review
This one changes the fallback condition to only 500 and 502 responses. And it also adds a check for succeeded HELO. All errors that don't trigger the fallback will produce an alert that includes the server response for troubleshooting.
Anything that I missed?
Assignee: mscott → ch.ey
Status: NEW → ASSIGNED
(In reply to comment #4)
> Created an attachment (id=245220) [edit]
> proposed patch
> 
> This one changes the fallback condition to only 500 and 502 responses. 

The fallback for 500 and 502 are correct.  As a side note, looking in our
SMTP server code, I see we use to have a 502 which was deprecated to 500. See 4.2.4.  We decided it was best this way since it was possible an client will not look for a 502 because either EHLO is supported or it is not.  

In any case, from TBIRD's standpoint, this would be correct to look for both 500 and 502.  Any other negative response code should trigger an session abort.

> And it also adds a check for succeeded HELO. All errors that don't 
> trigger the fallback will produce an alert that includes the 
> server response for troubleshooting.

I am not sure what this means. I can't make heads and tails with the diff you provided.  If you want to send me the actual changed *.CPP file, I will be glad to look at it.

Overall, If anything else but a 250 is recieved, the client must not move to the next state.  It can't get pass EHLO/HELO, then it must issue a QUIT, technically wait for a response and then close the connection.  That is the graceful client end session transaction and the SMTP server world will be graceful for it. :-)

> Anything that I missed?

Is the Domain Literal Issue (BUG 244030) going to be addresed as well in this change?  That issue is what propagated this errant state machine discovery in this first place.  No doubt, this change can run on its own, but we might as well kill two birds with one stone. :-)

I appreciate your time and effort in improving TBIRD.

Thanks

Hector Santos, CTO
http://www.santronics.com
> Overall, If anything else but a 250 is recieved, the client must not move to
> the next state.

That's exactly what that patch does. Up to now no check for HELO response is present at all.

> It can't get pass EHLO/HELO, then it must issue a QUIT

Well, not issuing QUIT in error cases is a long standing bug in Mozilla code.
Have to see if there already is a bug on that issue.

> Is the Domain Literal Issue (BUG 244030) going to be addresed as well in this
> change?

No, different issue -> different patch in different bug.
>> It can't get pass EHLO/HELO, then it must issue a QUIT
> 
> Well, not issuing QUIT in error cases is a long standing bug in Mozilla code.
> Have to see if there already is a bug on that issue.

argh.  Why can't a QUIT be issued doing the class instance destructor.  QUIT MUST be issued in all cases, good or bad.  So a straight forward QUIT command in the destructor or some cleanup member function is sufficient.

The reason is straight forward.  While SMTP servers are required (a MUST) to have a IDLE timeout (a minimum of 5 minutes) at each STATE, the concept of having a DROP detector is an IMPLEMENTATION design issue, not a PROTOCOL requirement.

Hence, when clients DROP a connection, it is quite possible for servers to have hanging sessions that will not time out until 5 minutes later.  This can have an effect on scalability and load balancing issues at servers.

I think it is a good idea to create another bug report to resolve this one.

Hector Santos, CTO
http://www.santronics.com
> Why can't a QUIT be issued doing the class instance destructor.

Maybe (maybe! - I didn't have a look) the connection is already closed then.

> I think it is a good idea to create another bug report to resolve this one.

Just did a search and bug 62836 looks like the place to chat about that issue.
Attachment #245220 - Flags: review?(bienvenu)
Comment on attachment 245220 [details] [diff] [review]
proposed patch

+        /* potential security flaw when using DIGEST_MD5 (and maybe GSSAPI)
+         * where not only the client is authenticated by the server
+         * but also vice versa. Faked server could just not advertise
+         * any mechanism to bypass authentication process.
+         */


thx for the patch, Christian!

Is the problem the above comment describes something we're introducing with this patch, or an existing problem we're documenting?
It's an existing design flaw I want to document. I wasn't and am not sure if this is the right way and point in time to write that down, but think the earlier the better.
(In reply to comment #10)
> It's an existing design flaw I want to document. I wasn't and am not sure if
> this is the right way and point in time to write that down, but think the
> earlier the better.
> 

In all honesty, my quick review is not making me feel right that corrections are right.  Seems awfully kludgy and complex for something that is rather simple to address.   But then again, I am basing this on a visual diff and the current code.

Here's the thing. I'm here to help pinpoint some problems our customers and we were able to repeat with TB.   The corrections are pretty straight forward.   What I find disheartening is that the bugs actually PREVENTS people from using TB and there seems to be a lack of understanding that delays in solving this does not help TB promotions nor marketing.

I even considered using it myself and I can't.  The same problem many of my customers have reported were realized immediately.  Its simple.  Many will not be able to use TB due to these very fundamental SMTP based flaws that really has nothing to do with anything else but not following the proper SMTP protocol.

I don't know the Mozilla code or framework, but looking at the source, it seems pretty obvious that these issues can be resolved.  So I have a little bit of confusing what the basic issue with resolving these problems.  Is it a matter of  assigning the work?  A volunteer to fix these?  To run the test and get a patch out?

We are a windows shop and I have installed the source code so I can look at the TB source.  But it seems complex to actually get a compiling environment going and I prefer not to go thru the headaches to do this.  However, it was simple to  get a compiling Windows environment going, rest assured, these TB SMTP issues would of been solved by now (or atleast I would of provided the proper corrections).

Thanks

Hector Santos, CTO
http://www.santronics.com






Sorry Hector, I don't understand what you're talking about. You reported a problem, I agreed the existance, provided a patch - so the (I think) proper
corrections are provided.

Currently the patch needs review and super-review, but I there's nothing wrong and nothing you can do about. For how the workflow is, have a look at http://www.mozilla.org/hacking/life-cycle.html.
(In reply to comment #12)
> Sorry Hector, I don't understand what you're talking about. You reported a
> problem, I agreed the existance, provided a patch - so the (I think) proper
> corrections are provided.

Chris,

I am just not familar with this Open Source method where there seems to be little managerial control from a "systems" point of view.   Someone has to be the cog who puts this all together.    

TB has three problems:

   - This bug where it doesn't handle negative response correctly,
   - It uses a FIXED domain literal concept, and as we learned, 
   - it doesn't issue QUIT correctly on failed states, leaving 
     potential servers hanging in an IDLE state for 5 minutes.

We are not SENDMAIL or other popular mail server vendors, but we been in the business for over 20 years and our server is well branded in the Windows world.  I am actively involved in the IETF process and our server is very modern.  I'm here, which is a short period, to help TB.  I'm not here to go against any process you may have.

So if there is a systems engineer that can solve these problems, that would be wonderful not only for TB, but for all SMTP vendors just as ourselves who have to deal with the support process with TB users fail to work.  Better yet, TB will instantly get new marketing from all system operators who had problems with it and can not promote it from their users.  Now they would be able to.  Consider the potential with this thorn on the side removed.

> Currently the patch needs review and super-review, 

If you can apply the patch to produce the full nsSMTPProtocol.cpp file, post it, I will gladly take a look at it.

Thanks

Hector Santos, CTO
http://www.santronics.com
Thx again, Christian. I missed your comment because I forgot to cc myself on the bug. I'll run with the patch and note my review.
Attachment #245220 - Flags: review?(bienvenu) → review+
Marking fixed since checked in trunk and 1.8 branch.
Status: ASSIGNED → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Should be fixed1.8.1.1 then, no?
Keywords: fixed1.8.1.1
Thanks, something like that. Or fixed1.8.1 that David used in the other bugs from the same checkin? Sorry, I'm confused about all that sub-sub-versions.
my bad - Scott told me yesterday we should use fixed1.8.1.1 now instead of fixed1.8.1 (though I don't know how much it matters since we'll have to query for both)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: