Open Bug 1423437 Opened 7 years ago Updated 1 year ago

Mailsploit: Do not unescape email addresses (forbidden by RFC 2047)

Categories

(MailNews Core :: MIME, enhancement)

enhancement

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: BenB, Assigned: KaiE, NeedInfo)

References

()

Details

(Keywords: sec-moderate)

Attachments

(3 files, 1 obsolete file)

This is part of bug 1423430. See there for background.

We should not try to decipher everything that the email standard allows in freaky email addresses. We tried really hard to properly decode things according to the internet standards, but I think that's a bad idea from a security perspective. The standards are decades old and tried to support systems that no longer exist. We should deliberately *not* support escape sequences in email addresses, and strip anything that's unusual.

We should only allow today's normal email addresses, that is alphanumeric characters, and a few common special characters like ".", "-", "_", "+".

This has 3 parts:
1. Remove quote strings in email addresses. That is, if the email address is <bla"foo@whitehouse.gov"baz@example.com>, do *not* treat the part between "" as special, as the standard once defined, but simply strip the ". Similarly, strip the @. This would leave <blafoowhitehouse.govbaz@example.com>.
2. Strip the @ sign in the user part.
3. Remove slash-escaping \03 of characters in email addresses. Instead, strip the character. That is, if the email address is <foo\03bla@example.com>, you leave <foobla@example.com>.
Summary: Mailsploit: Do not unescape email addresses, and strip highly unusual characters → Mailsploit: Do not unescape email addresses
Blocks: 1423439
It would be almost enough to treat as your title suggests: Do not unescape email addresses. This should include not to interpret any character combination like \n, \t, \0, ... they should be treated literally as a two character string. If I understand the RFC right, it doesn't ask to interpret \n and the other control sequences.

The question is what to do with illegal addresses: You suggest to strip characters, it also could be displayed as is in italic and be colored in red with a tool tip "invalid address". Please feel free to change that suggestion if you have a better idea.

The main point I try to address is: Stay conform to the RFC and do not restrict allowed characters, instead stop interpreting control sequences.
(In reply to Adrian Zaugg from comment #1)

I was slightly misguided with my response above, I am sorry for that: The problem is not related to backslashes rather than in quoted-printable or base64 encoded characters like newline or the null byte char. Decoding the From address should sanitize the From address and mark it as illegal, when they contain such characters.
Group: mail-core-security
Group: mail-core-security
Yeah, I like that idea of simply replacing such a freaky local part *entirely* with "[invalid]@" and simply not support them. That's the safest. It avoids that things go wrong in other ways.
Per RFC 2047,
> + An 'encoded-word' MUST NOT appear in any portion of an 'addr-spec'.

If we really tried hard to follow the internet standards, why does this bug matter at all?
Many other sibling bugs are WONTFIXed because we should be RFC compliant. Let's do it.
Severity: normal → S3

(In reply to Masatoshi Kimura [:emk] from comment #4)

Per RFC 2047,

  • An 'encoded-word' MUST NOT appear in any portion of an 'addr-spec'.

If we really tried hard to follow the internet standards, why does this bug
matter at all?
Many other sibling bugs are WONTFIXed because we should be RFC compliant.
Let's do it.

Should this bug be assigned a security rating?

Flags: needinfo?(kaie)

(In reply to Masatoshi Kimura [:emk] from comment #4)

Per RFC 2047,

  • An 'encoded-word' MUST NOT appear in any portion of an 'addr-spec'.

Thank you for pointing out this part of the RFC.

I don't understand your explanation text, it's not clear what you suggest to do.

In my opinion, if this kind of escaped data is completely forbidden by RFC, we should completely ignore it, not display it.

(In reply to Wayne Mery (:wsmwk) from comment #5)

Should this bug be assigned a security rating?

Giving a rating requires understanding the issue.
It wasn't clear to me what's pending, I had to spend a few hours reading and researching.

The described approach can still be used to produce a confusing from address. With some tricks, it's possible to show a full (fake) email address, and have the domain shown separately on screen, which might be sufficient to confuse/trick users.

So, given a from address that was encoded according to RFC 1342, for example with a decoded address like "fake@example.com"@top-domain.com
I think we should completely hide the part "fake@example.com".

Ben suggested to show the text "invalid" instead. IMO that isn't perfect, because of localization.
I think we should show something that:

  • can be the same for all locales
  • should use symbols that aren't a valid email address

I think that a question mark "?" is a good international symbol to show something that isn't clear.

I suggest to show [???]@example.com

Attached image fakeperson.png

So, using the described tricks in email headers, I can produce a message that is shown in Thunderbird as in the attached picture.

I think this can be sufficiently convincing for someone. (The amount of ----- characters and spacing can be chosen by the attacker, I thought it's nice to have some separation from the trailing shown domain, here @i.i)

I'd call this a moderate security issue. But all these details have been public for years, so I see no reason to hide the discussion.

Flags: needinfo?(kaie)
Keywords: sec-moderate
Attached image better.png (obsolete) —

I have a patch that produces this display instead.

It replaces the forbidden encoded text in the email address with the string [???]

Can we just make it invalid@? I think that's clearer than [???] or "[???]" and has less chances of going wrong elsewhere.

My earlier patch didn't pass tests.
I had to to a much deeper analysis and write a completely different patch.

Now I have a patch that passes tests.
But it's limited to fixing the address in the message header pane.

I haven't yet been able to fix the addres that is shown in the list view.
I couldn't yet find the place that produces it.

Also I spent more time on this than I had intended...
But given it's a security issue, I think the effort is justified.

(In reply to Ben Bucksch (:BenB) from comment #11)

Can we just make it invalid@? I think that's clearer than [???] or "[???]" and has less chances of going wrong elsewhere.

I don't want to use "invalid@" with a trailing @.

I think we should completely replace the address. We should NOT show a domain.

(Also, in some deployments, "invalid@" might be a valid mailbox.)

I'm OK to show something like [invalid], where the word "invalid" must be localized.

Comment on attachment 9319668 [details]
better.png

Marking suggested screenshot as obsolete.

Attachment #9319668 - Attachment is obsolete: true
Assignee: nobody → kaie
Status: NEW → ASSIGNED

We should NOT show a domain.

Why not? It would be the real domain of the faker. This is no worse than if the faker used president@whitehousegovernment.org from the start, which he can always do. invalid@whitehousegovernment.org is even clearer. It makes it clearer to the user that something is wrong. And there won't be problems in validation later on. Either the response will bounce, or it will be delivered, or go to /dev/null at the fake server. In no case this is worse than if the faker used president@whitehousegovernment.org .

Email addresses like donotreply@ or invalid@ to signify that responses will not be read have (unfortunately) become common recently, so this is in line with current practice. Users are used to that. We even have code to alert end users when they reply to such email addresses, alerting that the response might not be read. We should use that here.

Also, in the odd chance that the email was actually legit and it was just a very strange system, the user has at least some clue where the email came from. If you remove the domain entirely, there's all information missing that would be needed to investigate.

(In reply to Ben Bucksch (:BenB) from comment #16)

We should NOT show a domain.

Why not?

Because the RFC 2047 says that the escaping isn't allowed in any part of the email address. If there's escaping in the local part, the domain part can be considered invalid, too.

In theory, escaping could be used for the domain part, too. And detecting which part to show requires a more complicated logic.

I'd prefer to keep it simple, as soon as we detect that any part of the address contains a forbidden encoding, we say that there isn't any valid email address found.

What would the new behaviour be for addresses like From: "Bugzilla@Mozilla" <bugzilla-daemon@mozilla.org> previously used at BMO? See bug 1423440. "xx @ yy" <person@example.com> is frequent used, Google groups even send From: "sunshinefox33@gmail.com" <sunshinefox33@gmail.com>. This is also common for badly maintained mailouts where the e-mail address is used as the display name.

Is "2. Strip the @ sign in the user part." (comment #0) the same as bug 1423440?

(In reply to Ben Bucksch (:BenB) from comment #16)

Email addresses like donotreply@ or invalid@ to signify that responses will not be read have (unfortunately) become common recently, so this is in line with current practice. Users are used to that. We even have code to alert end users when they reply to such email addresses, alerting that the response might not be read. We should use that here.

I think it is misleading to transform something completely different into any english-word like mailbox name. Even if some organizations use a mailbox (or alias) like invalid@, it's still a guess that it doesn't exist.

I'd prefer to avoid guessing that can go wrong.

(In reply to PS from comment #20)

What would the new behaviour be for addresses like From: "Bugzilla@Mozilla" <bugzilla-daemon@mozilla.org> previously used at BMO? See bug 1423440. "xx @ yy" <person@example.com> is frequent used, Google groups even send From: "sunshinefox33@gmail.com" <sunshinefox33@gmail.com>. This is also common for badly maintained mailouts where the e-mail address is used as the display name.

Is "2. Strip the @ sign in the user part." (comment #0) the same as bug 1423440?

I'm not suggesting any change for that in this ticket.
This ticket here is purely about emails that use an RFC 2047 encoding in the From header.

Summary: Mailsploit: Do not unescape email addresses → Mailsploit: Do not unescape email addresses (forbidden by RFC 2047)

(In reply to Ben Bucksch (:BenB) from comment #18)

Also, in the odd chance that the email was actually legit and it was just a very strange system, the user has at least some clue where the email came from. If you remove the domain entirely, there's all information missing that would be needed to investigate.

If debugging is really necessary, the email source code is available.

Ben, maybe I made a mistake in trying to reuse the bug that you had filed.

You have made some arbitrary suggestions in the initial comment of this bug, but I would like to do something else.

I would like to fix something that is obviously wrong and forbidden: I want to disallow emails, that use a RFC 2047 encoding for the email address part. I'm not talking about the name or the comment in the from field, only about the email address field. That's forbidden. And that's what I would like to disable and convert to an invalid display.

the email source code is available.

Normal end users cannot read RFC822. But they still will want to know who sent this email.

escaping could be used for the domain part, too

I see. I wasn't aware of that. I thought that escaping is valid only for the local part, but maybe I misremember (it's a long time ago).
If we're not certain about the domain part, I agree we shouldn't show it.

I'm not sure we're speaking about the same thing, though. You're talking about RFC 2047. My understanding is that this is just a way to encode header values. The decoding happens before we consider it an email address. But you may be right to consider such encoded email address as dangerous (iff assuming that there normally is no valid reason to encode it this way) and forbid it.

I was talking about RFC 822 Sections 3.3 and 3.4.x "quoted-string", "quoted-pair", "comment", and similar tokens in the email address itself. While parsing the email address, you should be able to easily distinguish whether it's the local part or domain part. I was referring to the more common case that only the local part is invalid. If the domain is invalid, then the email was most likely already rejected by the email server and has never reached the end user.

arbitrary suggestions

Please keep the discussion civil. Your [???] suggestion is no less arbitrary. We're trying to find out what is best for the user. I have given some reasons for my suggestions.

It would be nice to address the concerns raised:

  • Make it clear to the end user that it's invalid (whereas I do not consider [???] as clear, but rather raising more questions for the end user - even literally raising questions, by using ?)
  • Do not negatively affect other subsystems of TB and third party systems
  • If the user hits reply, we should trigger the already existing dialog which says "Your reply might not be read"
  • If we are certain about the domain, I would recommend to show it to the end user, so that the user assess the origin. It may or may not be legit. (Showing the correct origin domain it is no worse than a valid fake address.) I understand that you might reasonably disagree about this last point, and I respect that.

All that said, I appreciate that you try to fix an old outstanding bug.

Alex, can you chime in here, from the end user perspective?

We're discussing about the case where the local part (foo in foo@bar.com) of the email address itself uses some compatibility encoding from the early days of the internet, which is not in use anymore. Pretty much any email address that uses it would be invalid (there is a very small chance that the email is legit) and might be used for phishing by faking other email addresses. Even the phishing case is very rare (I've rarely if ever seen it in my inbox). The question is what we should use as replacement.

I am fine with whatever you decide, as long as we're not creating more problems accidentally.

Flags: needinfo?(alessandro)

(In reply to Ben Bucksch (:BenB) from comment #25)

Normal end users cannot read RFC822. But they still will want to know who sent this email.

If the "From" contains an invalid encoding, nobody knows who sent it - it's very likely a crafted encoding created by an attacker with the attempt to trick the user.

escaping could be used for the domain part, too

I see. I wasn't aware of that. I thought that escaping is valid only for the local part, but maybe I misremember (it's a long time ago).
If we're not certain about the domain part, I agree we shouldn't show it.

I'm not sure we're speaking about the same thing, though. You're talking about RFC 2047. My understanding is that this is just a way to encode entire header values. The decoding happens before we consider it an email address. But you may be right to consider such encoded email address as dangerous (iff assuming that there normally is no valid reason to encode it this way) and forbid it.

RFC 2047 allows you to mix multiple sequences in a single header. You may have something like
=?encoding?= second name <test@example.com>

Have a look at the examples here:
https://searchfox.org/comm-central/rev/5d15e1b92e4ffa09bddb979f94405ca9bad619c2/mailnews/mime/jsmime/test/unit/test_header.js#744

arbitrary suggestions

Please keep the discussion civil. Your [???] suggestion is no less arbitrary. We're trying to find out what is best for the user. I have given some reasons for my suggestions.

I'm sorry if you are offended by the term "arbitrary", no offense was intended. I should have used a better term.
My point was, you invented ideas how to solve the, like dropping certain characters in certain scenarios.
By my intended original use of the term "arbitrary", I agree that my suggestion to use "[???]" equally deservers the term "arbitrary suggestion".

I see that the intentions of your suggestions were also positive, so I'm not criticizing that at all.

You had ideas which types of email address should be allowed and which shouldn't be allowed.
You suggested to drop certain characters.
I think that suggestion isn't backed by RFCs, and I'm not yet sure if we should do that. If it's allowed to have an escaped @ character inside the local part of an email, then removing it has the risk to break someone.

On the other hand, my suggestion to reject encodings that are forbidden by RFC 2047 is probably immediately agreeable. The only thing we can discuss is how those addresses should be shown. We have different opinions currently. If we have both suggestions that the other doesn't like, it may mean we haven't found the best solution yet. Maybe we need to show an explanation text that clearly explains that the email used invalid data as the From email address.

(In reply to Kai Engert (:KaiE:) from comment #12)

I haven't yet been able to fix the addres that is shown in the list view.
I couldn't yet find the place that produces it.

Found it: nsMsgDBView::FetchAuthor()

(In reply to PS from comment #20)

What would the new behaviour be for addresses like From: "Bugzilla@Mozilla" <bugzilla-daemon@mozilla.org> previously used at BMO? See bug 1423440. "xx @ yy" <person@example.com> is frequent used, Google groups even send From: "sunshinefox33@gmail.com" <sunshinefox33@gmail.com>. This is also common for badly maintained mailouts where the e-mail address is used as the display name.

Parts 1 & 2 refer to quotes inside the address, not the display name, so this affects From: Support <it"support@mozilla.org"em@evil.com> but neither From: "support@mozilla.org" <item@evil.com> nor any of the examples in comment #20.

See addr-spec in RFC 5322 §3.4. §3.4.1 permits a quoted-pair, which may contain an @. This bug simply suggests some (RFC-breaking) normalization that removes the quotes and any @ within them (since it's invalid without the quotes).

That brings up an extra point for this request: Since comments and/or folding white space (CFWS & FWS) are removed from within a quoted-string (see RFC 5322 §3.2.4), Thunderbird will also have to strip those when it strips the quotes.

(In reply to Kai Engert (:KaiE:) from comment #19)

Because the RFC 2047 says that the escaping isn't allowed in any part of the email address. If there's escaping in the local part, the domain part can be considered invalid, too.

I don't think it is a correct interpretation. RFC 2047 does not make a valid RFC 822 message invalid because not all implementations support MIME (RFC 2045 - 2049). If an addr-spec contains a =?...?= sequence, it should be treated as-is instead of being unescaped IMO.

(In reply to Masatoshi Kimura [:emk] from comment #30)

If an addr-spec contains a =?...?= sequence, it should be treated as-is instead of being unescaped IMO.

Agreed.

The original request in this bug is wontfix per the robustness principle.

Magnus wrote in Phab D171099:
[show =?UTF-8?Q?Pr=C3=B6sident?= to end user]

We might as well show kjdfgjdjfg aksjgd hghdj.

kaie wrote:

we can discuss is how those addresses should be shown

Yes. I agree with you in principle, in the way we should fix this. If we cannot be certain about the domain part, I'd also be OK with showing (invalid), as you suggested, or something else that is not a valid RFC822 address.

However, I would not translate it, so that we can special-case it later, in a similar way how we currently treat donotreply@ email addresses with a warning dialog during reply. (Have you seen that?) It would be nice to ensure that we trigger that dialog during reply.

That said, I am happy to cede to Alex' decision on any of these points. This is basically UX, and that's his call.

(In reply to Ben Bucksch (:BenB) from comment #33)
It's not a matter of making the output pretty in any way. We want the output to be ugly/strange looking so nobody is fooled into thinking it's something official, in the very rare cases such messages would make it to a users inbox.

(In reply to Masatoshi Kimura [:emk] from comment #30)

IIf an addr-spec contains a =?...?= sequence, it should be treated as-is instead of being unescaped IMO.

I don't disagree in general, but this is much harder to implement.

It's especially hard, if you want partial unescaping in allowed regions (in the name), and want to skip unescaping in the forbidden region (in addr-spec).

The remainder of this comment will be a brain dump that explains the compexity of the related code. Because of this complexity, we'll either have to spend more time working out a clever implementation, or the alternative is to use a simplification. My suggestion to show a placeholder like [???] or [invalid] was an attempt to minimize the amount of work.

The parsing of the addressing header is rather complex, because it can be a comma separated list of multiple name<email> items, and may involve something like "colon separated groups".

Today, the parser will unescape all RFC 2047, and then attempt to parse the string.

When I first attempted to implement a fix for this bug, I tried to use a rather simple stratey. I wanted to start by parsing, and only do unescaping of the parts that were allowed. I failed with that strategy, because of the complex parser, and because a RFC 2047 encoding may span across several tokens.

I don't want to rewrite the parser completely. With that, it seems necessary to continue to start by doing the initial RFC 2047 unescaping. Then, when parsing, I need to know if a token was originally contained with RFC 2047 or wasn't.

In other words, I have a sequence of strings, some were escaped, and some were not escaped.

When parsing the strings, each string produces several tokens.

I'll give you an example:
John {Löwe} <{jl}@example.com>

In this example, I have used {} to show which part was escaped. The parts Löwe and jl were escaped.

When processing this data, the code would conclude that the unescaped "Löwe" is in an allowed position, and we are allowed to use the unescaped Löwe. Then we'd detect that parts of the email address was escaped. That isn't allowed.

It could be worse, we could have
Hans {Däne <jl}@example.com>
with the encoding spanning part of the name and part of the address.

We'd have to decide how we want to handle such input.

I think the perfect implementation would show Löwe (decoded/unescaped), but would show jl unescaped.

In the second example, we cannot show Däne unescaped, because it spans into the email address. we'd have to show
Hans =?......?=@example.com

The implementation would have to be appropriately enhanced. Currently it simply iterates along the characters, and adds up the name and email address strings.

With that example, when parsing, we'd start by adding Däne to the name, and then we'd see that jl is in a forbidden location. We'd then have to roll back, and someone remove Däne from the name. I don't want to work on that.

If you want to keep the unescaped original data, then I'd prefer a simpler solution: If we find that an escaped string touches an addr-spec anywhere, then leave the complete addressing header unescaped - even if the header spans multiple names/addresses. Because one RFC 2047 sequencing (inside the header) could span multiple names/addresses.

However, if we agree to NOT show the unescaped data (=?....?=), but accept that we show a replacement for invalid parts (e.g. [invalid]), then our display can be smarter. We might be able to show the name/address pairs that are valid, and only show [invalid] for the bad ones.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: