Open Bug 1463289 Opened 6 years ago Updated 2 years ago

Convert libmime from C to C++

Categories

(MailNews Core :: MIME, enhancement)

enhancement

Tracking

(Not tracked)

ASSIGNED
Thunderbird 62.0

People

(Reporter: BenB, Assigned: BenB)

Details

Attachments

(1 file)

libmime is written in object orientated C, using C syntax to define and call classes, but it's compiled as C++. We should change the class definitions to proper C++.

This project should be kept minimal in scope, to reduce the risk. It merely changes the syntax and not the implementation, so the risk for regressions should be minimal. The code changes are straight-forward and mechanical. That's what makes it feasible. More ambitious changes can be done later and are out of scope.

---

Background

libmime is our core backend library that processes all incoming email and converts it for display and other purposes. It lets us understand emails, and display them. It know all our details about decoding email formats, and implements all the details about the message body display. So, it's a core library.

It's been designed and written by jwz, who founded mozilla.org, together with others, and who was the first technical lead of mozilla.org, before Brendan Eich. libmime was actually created long before mozilla.org, but somewhere in the Netscape 2.0 or 3.0 time frames.

Back then, he decided to write this backend library in C, not C++. Back then, C++ was a language used for GUI implementation, go figure. However, he saw the value of object orientation and a class hierarchy. So, he designed and implemented his own object system, in pure C. He's not the only one to do that, GTK+ also chose to use C, but with OO. However, GTK+ is passing around a lof ot null pointers and then casting them. Not so libmime. It has to use casts, due to the C language but it casts from higher level object types, so there are no (or hardly any) null pointers in the system, so it's actually type safe. It's the only OO system in C that I've ever seen that manages to avoid null pointers. It also implements all 3 object orientation properties, of encapsulation, inheritance, and polymorphs. All in C. That's kind-of ingenious, really. More details are in mimei.h <https://dxr.mozilla.org/comm-central/source/mailnews/mime/src/mimei.h>. I highly recommend reading the large comment at the top, it's very educating and enlightening.

Also, libmime has a very nicely designed class hierarchy. It's in fact the clearest class hierarchy in all of Thunderbird, and the best documented one, too. Again, top of mimei.h and mimei.cpp.

mimei stands for libmime interface and contains the "switchboard" that decides which libmime implementation class to use for parsing which MIME part.

---

This project

But it's called mimei.cpp, not mimei.c, so isn't it already C++? Yes and no. It's compiled as C++ code, and we do already use C++ Mozilla classes like nsString in libmime. But the libmime classes themselves are still defined in C. That is a left-over and no longer necessary. A lot of the prejudice against libmime comes from the fact that people don't understand that this is actually C code.

I think it would make sense to convert libmime classes to real, modern C++ classes. It would make libmime a lot more readable to today's developers. At the same time, it's a very feasible project, because libmime is already completely object orientated, and it's already compiled as C++, so the transition is a pure syntactical one. At the same time, we could clean up the code style to be in line with Mozilla Style Guide for new code.

It would be a very straight-forward, simple transition, a simple mapping of syntax. None of the logic changes, at all. We don't even have to change the string and buffer use at all, and we will not, to avoid breaking existing code. That is the only thing that makes this project viable. It reduces the risk of regressions to almost zero, because it only changes language syntax and not any implementation. Otherwise, I would not attempt this.

----

FAQ

1. Why not change it to JavaScript directly?

* libmime uses a lot of low-level XPCOM features, like byte sequences, C strings, and Unicode strings, and converts between them. It uses low-level XPCOM charset conversion functions extensively to do its work - after all, that's part of its primary purpose. Changing that to JS Unicode strings and octet streams and the like would change the string handling massively, would completely change the logic, and risk many regressions.

* In this project here, we don't even have to touch the function implementations, only change the syntax of function calls and function signatures. Many of the function implementation lines won't even change.

* It would not only be a syntax change, but a re-implementation. It's a much larger change, and not one I'm ready or able to make.

* In other words, changing to JS would probably be 10 or 100 times more work than what I propose here.


2. Why not use JSMime instead?

* This has been attempted a few years ago, and they found that JSMime cannot yet replace libmime. I don't know JSMime, nor am I familiar with the reasons in detail, but they were severe enough that this project was abandonned and JSMime parsing used only for header parsing, not content parsing. JSMime was written by jcranmer, and he's a highly competent engineer, and I respect him. If he didn't manage to get it to production quality for 25 million users, I won't dare to attempt that. The risk is huge.

* libmime and its support libraries have features that JSMime does not have. For example, the plaintext conversion code received a lot of love and is very finely tuned. That includes quote recognition, struct and smiley recognition, and URL recognition, which is the most difficult. All this would have to be re-implemented and ported, to match the existing code, to avoid regressions. This part alone is a year-long project to fine-tune, and that's just one part of libmime, but there are many others.

* In other words, changing to JSMime would be between 10 and 100 times more work than what I propose here.


3. What is the goal?

* Convert libmime class definitions from C syntax to C++ syntax. Same for function definitions.

* Adapt the function calls, remove the casts and change |myClass| et al to |this|.

* Use a C++ name space and remove the pseudo name space in the class names.

* Change the code style to the Mozilla Style Guide <https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Coding_Style>


4. What is out of scope?

* Any functional changes. The logic should not change in any way. The output should not change in any way. If we find bugs during the course of this project, they will be filed as separate bug reports and fixed after the conversion is over.

Atomic commits <https://en.wikipedia.org/wiki/Atomic_commit#Atomic_commit_convention>: "Greater understandability comes from the small size and focused nature of the commit. It is much easier to understand what is changed and reasoning behind the changes if you are only looking for one kind of change. This becomes especially important when making format changes to the source code. If format and functional changes are combined it becomes very difficult to identify useful changes"

* Exceptions. Error handling stays with integer result codes for now. We will even stay with libmime's own result code, not nsresult. This limits the scope and risk of this change here. This could be changed later in a followup commit to either nsresult or exceptions, if desired. Changing error handling changes mostly the function implementations, while this project changes mostly the class definitions. It's reasonable to leave this out of scope here, because this change would be mostly orthogonal. It can be left out here, so it should, to limit scope and risk and regressions.
Before commenting, please make sure to read all of the above, because I tried to anticipate certain reactions.
Style changes:
I'll unfortunately have to postpone that. As much as I'd like to, and as much as some of the current style burns into my eyes, it's too much to change at the same time with the class conversion. I've started doing that for 2-3 classes, and I think it's needed, but I noticed that it's *way* too much work for me to chew all at once. So:

Out of scope:
* Change the code to the Mozilla Style Guide <https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Coding_Style>. To make this class conversion feasible, I'd prefer to first only convert to C++ classes, and then as a second step, we can together adapt the code style. This should also keep most of the controversial opinionated discussion out of this purely technical conversion. That said, I do think this code should be converted to Mozilla Code Style. Just not here in this bug in this first step.
* Dear reviewer, please do *not* ask me to change the style of anything, even for lines that I touched. This changes a huge amount of code, and if tinker with style *at the same time*, this will never finish. I'm all for changing the style here. But let's please leave this for after this project. One step at a time. I cannot do everything at once, otherwise this becomes impossible. Again, this specifically also includes lines that I change in this diff. There are too many lines changed.

In scope:
* I did try to change the public class method names to the Mozilla Code Style. Only the method names, not the parameters,
  to limit the changes.
https://bugzilla.mozilla.org/show_bug.cgi?id=1463289

Before commenting, please read this post in full. I know what some of you will respond, and I tried to anticipate the responses.

libmime is written in object orientated C, using C syntax to define and call classes, but it's compiled as C++. We should change the class definitions to proper C++. This project should be kept minimal in scope, to reduce the risk. It merely changes the syntax and not the implementation, so the risk for regressions should be minimal. The code changes are straight-forward and mechanical. That's what makes it feasible. More ambitious changes can be done later and are out of scope.

Background

libmime is our core backend library that processes all incoming email and converts it for display and other purposes. It lets us understand emails, and display them. It know all our details about decoding email formats, and implements all the details about the message body display. So, it's a core library.


It's been designed and written by jwz, who founded mozilla.org, together with others, and who was the first technical lead of mozilla.org, before Brendan Eich. libmime was actually created long before mozilla.org, but somewhere in the Netscape 2.0 or 3.0 time frames.

Back then, he decided to write this backend library in C, not C++. Back then, C++ was a language used for GUI implementation, go figure. However, he saw the value of object orientation and a class hierarchy. So, he designed and implemented his own object system, in pure C. He's not the only one to do that, GTK+ also chose to use C, but with OO. However, GTK+ is passing around a lof ot null pointers and then casting them. Not so libmime. It has to use casts, due to the C language but it casts from higher level object types, so there are no (or hardly any) null pointers in the system, so it's actually type safe. It's the only OO system in C that I've ever seen that manages to avoid null pointers. It also implements all 3 object orientation properties, of encapsulation, inheritance, and polymorphs. All in C. That's kind-of ingenious, really. More details are in mimei.h <https://dxr.mozilla.org/comm-central/source/mailnews/mime/src/mimei.h>. I highly recommend reading the large comment at the top, it's very educating and enlightening.

Also, libmime has a very nicely designed class hierarchy. It's in fact the clearest class hierarchy in all of Thunderbird, and the best documented one, too. Again, top of mimei.h and mimei.cpp.

mimei stands for libmime interface and contains the "switchboard" that decides which libmime implementation class to use for parsing which MIME part.

This project

But it's called mimei.cpp, not mimei.c, so isn't it already C++? Yes and no. It's compiled as C++ code, and we do already use C++ Mozilla classes like nsString in libmime. But the libmime classes themselves are still defined in C. That is a left-over and no longer necessary. A lot of the prejudice against libmime comes from the fact that people don't understand that this is actually C code.

I think it would make sense to convert libmime classes to real, modern C++ classes. It would make libmime a lot more readable to today's developers. At the same time, it's a very feasible project, because libmime is already completely object orientated, and it's already compiled as C++, so the transition is a pure syntactical one. At the same time, we could clean up the code style to be in line with Mozilla Style Guide for new code.

It would be a very straight-forward, simple transition, a simple mapping of syntax. None of the logic changes, at all. We don't even have to change the string and buffer use at all, and we will not, to avoid breaking existing code. That is the only thing that makes this project viable. It reduces the risk of regressions to almost zero, because it only changes language syntax and not any implementation. Otherwise, I would not attempt this.

FAQ

1. Why not change it to JavaScript directly?

* libmime uses a lot of low-level XPCOM features, like byte sequences, C strings, and Unicode strings, and converts between them. It uses low-level XPCOM charset conversion functions extensively to do its work - after all, that's part of its primary purpose. Changing that to JS Unicode strings and octet streams and the like would change the string handling massively, would completely change the logic, and risk many regressions.

* In this project here, we don't even have to touch the function implementations, only change the syntax of function calls and function signatures. Many of the function implementation lines won't even change.

* It would not only be a syntax change, but a re-implementation. It's a much larger change, and not one I'm ready or able to make.

* In other words, changing to JS would probably be 10 or 100 times more work than what I propose here.


2. Why not use JSMime instead?

* This has been attempted a few years ago, and they found that JSMime cannot yet replace libmime. I don't know JSMime, nor am I familiar with the reasons in detail, but they were severe enough that this project was abandonned and JSMime parsing used only for header parsing, not content parsing. JSMime was written by jcranmer, and he's a highly competent engineer, and I respect him. If he didn't manage to get it to production quality for 25 million users, I won't dare to attempt that. The risk is huge.

* libmime and its support libraries have features that JSMime does not have. For example, the plaintext conversion code received a lot of love and is very finely tuned. That includes quote recognition, struct and smiley recognition, and URL recognition, which is the most difficult. All this would have to be re-implemented and ported, to match the existing code, to avoid regressions. This part alone is a year-long project to fine-tune, and that's just one part of libmime, but there are many others.

* In other words, changing to JSMime would be between 10 and 100 times more work than what I propose here.

Scope

What is the goal?

* Convert libmime class definitions from C syntax to C++ syntax. Same for function definitions.

* Adapt the function calls, remove the casts and change |myClass| et al to |this|.

* Use a C++ name space and remove the pseudo name space in the class names.

* Change the class method names to the Mozilla Style Guide <https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Coding_Style>. Only the method names, not the parameters.


What is out of scope?

* Any functional changes. The logic should not change in any way. The output should not change in any way. If we find bugs during the course of this project, they will be filed as separate bug reports and fixed after the conversion is over.

Atomic commits <https://en.wikipedia.org/wiki/Atomic_commit#Atomic_commit_convention>: "Greater understandability comes from the small size and focused nature of the commit. It is much easier to understand what is changed and reasoning behind the changes if you are only looking for one kind of change. This becomes especially important when making format changes to the source code. If format and functional changes are combined it becomes very difficult to identify useful changes"

* Exceptions. Error handling stays with integer result codes for now. We will even stay with libmime's own result code, not nsresult. This limits the scope and risk of this change here. This could be changed later in a followup commit to either nsresult or exceptions, if desired. Changing error handling changes mostly the function implementations, while this project changes mostly the class definitions. It's reasonable to leave this out of scope here, because this change would be mostly orthogonal. It can be left out here, so it should, to limit scope and risk and regressions.

* Change the code to the Mozilla Style Guide <https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Coding_Style>. I've started doing that for 2-3 classes, and I think it's needed, but I noticed that it's too much work for me to chew all at once. So, to make this class conversion feasible, I'd prefer to first only convert to C++ classes, and then as a second step, we can together adapt the code style. This should also keep most of the controversial opinionated discussion out of this purely technical conversion. That said, I do think this code should be converted to Mozilla Code Style. Just not here in this bug in this first step.
Attachment #8979431 - Attachment description: Convert libmime to C++ → Convert libmime from C to C++
Interesting. Wouldn't it make sense to clean up the code where you touch it?

So instead of changing to:
int
MimeHeaders_ParseLine (const char *buffer, int32_t size, MimeHeaders *hdrs)

you change to
int
MimeHeaders_ParseLine(const char *buffer, int32_t size, MimeHeaders *hdrs)

Otherwise I'd have to visit it all again later with another review potentially. Also, style changes always mess up blame.

This "Longer lines will be wrapped, and indented at least 4 spaces." is not true, it's two spaces.

Can you please attach a valid HG patch. |hg import https://phabricator.services.mozilla.com/D1344?download=true| fails and downloading the patch and adding it to the queue fails as well.

With at bit of wrestling I got it to apply, but it doesn't apply cleanly, there are five rejects. So please rebase to current trunk.
(In reply to Jorg K (GMT+2) (bustage-fix only, NI for urgent reviews) from comment #5)
> With at bit of wrestling I got it to apply, but it doesn't apply cleanly,
> there are five rejects. So please rebase to current trunk.
Sorry, my mistake, I messed up the patch when manually editing it. After correcting that, it now applies. However, heaps of compile errors starting with:
 0:15.27 c:\mozilla-source\comm-central\obj-x86_64-pc-mingw32\dist\include\mimei.h(104): error C2429: language feature 'nested-namespace-definition' requires compiler flag '/std:c++17'

So perhaps this is a little ahead of its time ;-)
> Interesting. Wouldn't it make sense to clean up the code where you touch it?

I sometimes do it. Where it's simple. But I can't do everything at once. If I need to take care for style, this project becomes impossible and I need to abandon it right now.
If you look at the result, it should already be a *lot* better than before, even for style. :)
Some of the changes are simple search/replace, e.g. when I remove the pseudo-prefix in favor of a proper C++ namespace.
Yes, it's very far from compiling. I just wanted to get some early feedback whether I should continue with this at all.
I've worked on it for the last 14  hours non-stop, I think I'll take a break :-)
I like the refactoring.
Simplifying things like 'clazz == (MimeObjectClass *)&mimeMultipartMixedClass' to 'clazz == MultipartMixedClass' seems worthwhile and less error prone.
Yes, jcranmer sees not much benefit and it would be so, if you just proposed it as a future work.
But if you intend to finish the patch so it compiles and works, I'm sure we can find a reviewer :)

Why couldn't calls like 'mime_typep(child, (PartClass*) &mimeEncryptedClass)' be simplified too? Is that because you said to not touch the function arguments yet?
old libmime:

typedef struct MimeInlineTextClass MimeInlineTextClass;
typedef struct MimeInlineText      MimeInlineText;

struct MimeInlineTextClass {
  MimeLeafClass   leaf;
  int (*rot13_line) (MimeObject *obj, char *line, int32_t length);
  int (*convert_line_charset) (MimeObject *obj, char *line, int32_t length);
  int (*initialize_charset) (MimeObject *obj);
};

extern MimeInlineTextClass mimeInlineTextClass;

struct MimeInlineText {
  MimeLeaf leaf;      /* superclass variables */
  char *charset;      /* The charset from the content-type of this
                         object, or the caller-specified overrides
                         or defaults. */

-----------------------------------------
becomes now:

abstract class Text : public Leaf {
public:
  Text();
  virtual ~Text();

  override int ParseDecodedBuffer(const char* buf, int32_t size);
  override int ParseEOF(bool abort_p);

  int ROT13Line(char* line, int32_t length);
  int ConvertLineCharset(char* line, int32_t length);
  int InitializeCharset();

  /**
   * The charset from the content-type of this object,
   * or the caller-specified overrides or defaults.
   */
  char *charset;
  ...
-----------------------------------------

ParseDecodedBuffer() and ParseEOF were overridden in old libmime, too, but only in the implementation. Now it's explicit.

Note that the split between Text and TextClass is gone.

I think that will help new developers a lot.
> Why couldn't calls like 'mime_typep(child, (PartClass*) &mimeEncryptedClass)' be simplified too?

mime_typep() is the old signature. That means this file has not been converted yet, only search/replace.
The new call would be: child->IsType(EncryptedClass)
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: