Open Bug 1316754 (ripdl) Opened 8 years ago Updated 8 months ago

Implement IPDL parser in Rust

Categories

(Core :: IPC, defect, P3)

defect

Tracking

()

People

(Reporter: mccr8, Unassigned)

References

(Blocks 1 open bug, )

Details

Attachments

(1 obsolete file)

Bill said it might be useful to reimplement the IPDL parser in Rust, for use in SearchFox, or possibly eventually in tree itself. I've started working on this.
Alias: ripdl
I don't want to be a downer, but what's the benefit here?
- Most obvious potential benefit is speed, but is the IPDL parser speed an issue?
- Downside is there's extra compilation required at build-time.
- In terms of niceness, Rust is perhaps a little better than Python, though not necessarily.

I feel like there are places where a Rust rewrite would be a much better thing, like old C/C++ code modules with a history of security flaws.
(In reply to Nicholas Nethercote [:njn] from comment #1)
> I don't want to be a downer, but what's the benefit here?
> - Most obvious potential benefit is speed, but is the IPDL parser speed an
> issue?

The IPDL compiler is surprisingly slow. I'm told that it adds about 1 minute to Windows builds (when it runs). And it needs to run before almost anything is compiled since tons of stuff depends on IPDL-generated headers.

> - Downside is there's extra compilation required at build-time.

Yeah, we're hoping the compile-time won't be too bad. I'm a little worried about that though.

> - In terms of niceness, Rust is perhaps a little better than Python, though
> not necessarily.

Perhaps you've never looked at lower.py? It's probably some of the worst code we have. 5000 lines of dynamically typed crap in a single file. It's really hard to change because it involves constructing an AST (for C++ code) that's never fully defined anywhere. You kind of figure it out as you go each time you hit an exception.

My hope is that Rust will make this a lot better. It has a lot of similarities to ML, so I would expect it to work pretty well for writing a compiler. If nothing else, having a type definition for the AST (both for IPDL and for the C++ we generate) will be great.

> I feel like there are places where a Rust rewrite would be a much better
> thing, like old C/C++ code modules with a history of security flaws.

I think there's no question that we need to rewrite this code. Anyone who has ever looked at it thinks it's horrible. So the only question is whether we'll rewrite it in Python, Rust, or something else. Rust seems like the best choice to me.
(In reply to Nicholas Nethercote [:njn] from comment #1)
> I don't want to be a downer, but what's the benefit here?

The immediate benefit is just making things easier for IPDL parsing in SearchFox.

> - Most obvious potential benefit is speed, but is the IPDL parser speed an
> issue?
> - Downside is there's extra compilation required at build-time.

Yeah, I'm not sure what the speed tradeoff is. Same issues as a JIT.

> I feel like there are places where a Rust rewrite would be a much better
> thing, like old C/C++ code modules with a history of security flaws.

Do you have any suggestions? I haven't come up with any myself, though I've only idly thought about it. The XML parser is an obvious possibility, but it is over 10,000 lines of code, and compatibility is a much bigger problem. A nice property of IPDL is that every file we care about handling is checked into the tree. The IPDL parser is only about a thousand lines of code, so it isn't a huge time investment (of course, the code gen is maybe 5000 lines or so of code).
Fair enough. I have looked at and made minor modifications to this code but nothing major, so I will take your word for the problems with it.

BTW, erahm and I have discussed the possibility of him rewriting the XML parser. He's fixed a number of security bugs for it in the past. We decided to hold off for a bit while oxidation is still underway.
(In reply to Bill McCloskey (:billm) from comment #2)
> (In reply to Nicholas Nethercote [:njn] from comment #1)
> > - In terms of niceness, Rust is perhaps a little better than Python, though
> > not necessarily.
> 
> Perhaps you've never looked at lower.py? It's probably some of the worst
> code we have. 5000 lines of dynamically typed crap in a single file. It's
> really hard to change because it involves constructing an AST (for C++ code)
> that's never fully defined anywhere. You kind of figure it out as you go
> each time you hit an exception.

As somebody who has done significant modifications on lower.py, I cannot agree with this statement enough.
(In reply to Bill McCloskey (:billm) from comment #2)
> Yeah, we're hoping the compile-time won't be too bad. I'm a little worried
> about that though.

For what it is worth, on my machine, doing a --release build takes about four and a half minutes. That's mostly dependencies, and it seemed like most of the time was lalrpop-snap.
(In reply to Andrew McCreight [:mccr8] from comment #6)
> (In reply to Bill McCloskey (:billm) from comment #2)
> > Yeah, we're hoping the compile-time won't be too bad. I'm a little worried
> > about that though.
> 
> For what it is worth, on my machine, doing a --release build takes about
> four and a half minutes. That's mostly dependencies, and it seemed like most
> of the time was lalrpop-snap.

That's...pretty bad. :(
(In reply to Andrew McCreight [:mccr8] from comment #6)
> (In reply to Bill McCloskey (:billm) from comment #2)
> > Yeah, we're hoping the compile-time won't be too bad. I'm a little worried
> > about that though.
> 
> For what it is worth, on my machine, doing a --release build takes about
> four and a half minutes. That's mostly dependencies, and it seemed like most
> of the time was lalrpop-snap.

I wonder if we could check in the generated code? Maybe we wouldn't even have to install or build lalrpop unless you've changed something in the codegen?
I haven't tried lalrpop, but I do have some advice on parser combinator libraries:

1. Avoid the ones with lots of deeply nested types; the one time I tried to use one of those, on a much smaller grammar than IPDL, the result compiled so slowly that it became part of the rustc benchmark suite.

2. nom seemed pretty good for compile time, but you'll want to read the entire file into memory (in general you can't rely on getting `Incomplete` instead of `Error` for incomplete input) and I'm still not sure what the right way is to get line numbers for errors.  Using something other than a slice as the input type might fix some of that, but then you'd need to write your own leaf-level parsers; I haven't tried that.

For reference, the thing I wrote with nom is https://github.com/jld/nss-certdata-parser/blob/master/src/syntax.rs which, like the IPDL parser, is (mainly) meant to be used as a build tool.  It's parsing a simpler language than IPDL, but a release build of that crate (including all dependencies) runs in ~4 seconds for me.
(In reply to Jed Davis [:jld] {⏰UTC-7} from comment #9)
> I haven't tried lalrpop, but I do have some advice on parser combinator
> libraries:
> 
> 1. Avoid the ones with lots of deeply nested types; the one time I tried to
> use one of those, on a much smaller grammar than IPDL, the result compiled
> so slowly that it became part of the rustc benchmark suite.

Indeed, it's called jld-day15-parser and I'm quite familiar with it. The good news is that the compile time for debug builds of that program have dropped by 3x since the start of September, and by ~75x since the start of June. (But only by ~15x since the start of the year; there were some significant regressions along the way.) Some of the relevant speedups haven't yet made it into rustc release yet.

But optimized builds can be a lot slower than release builds. A multi-minute build step that blocks all C++ compilation is not going to be acceptable :(
IPDL and WebIDL parsing/source generation are both long poles in the "export" tier of the build system. Essentially, this tier prepares everything for compiling C++. We currently wait on all source generation to complete before doing *any* compilation.

We could likely initiate some compilation before all source generation is complete. But we're pretty much limited to 3rd party sources because it isn't known which Gecko C++ files include files generated from IPDL, WebIDL, XPIDL, etc processing. That might be enough though, as projects like ICU take way longer to compile than *DL processing. But time is still time: IDPL and WebIDL both take several seconds to process and any reduction can be felt.
Just for reference, running IPDL on my machine takes 17 seconds. This is broken down as:

parsing: 5.8s
typechecking: 0.2s
C++ codegen: 11.0s
My github repo at least parses some simple files like dom/ipc/PCycleCollectWithLogs.ipdl now. PContent.ipdl doesn't work because Bill told me not to bother with supporting features mentioned in the blocking bug. The final piece was handling comments. lalrpop doesn't have a way to generate lexers. I made a couple of attempts at making one to integrate into lalrpop, which did not go well. At Bill's suggestion, I just implemented a pre-pass that replaces comments with spaces.

I haven't looked if this is right, but the output looks like: ([Include(Protocol, "PContent")], [[(Namespace { name: "PCycleCollectWithLogs", namespaces: ["mozilla", "dom"] }, Protocol(Protocol { send_semantics: Async, nesting: None, managers: [["PContent"]], manages: [], messages: [MessageDecl { name: "CloseGCLog", send_semantics: Async, nesting: None, prio: Normal, direction: Some(In), in_params: [], out_params: [], compress: None, verify: false }, MessageDecl { name: "CloseCCLog", send_semantics: Async, nesting: None, prio: Normal, direction: Some(In), in_params: [], out_params: [], compress: None, verify: false }, MessageDecl { name: "__delete__", send_semantics: Async, nesting: None, prio: Normal, direction: Some(In), in_params: [], out_params: [], compress: None, verify: false }] }))]])
I looked a little more at build times. As I said in comment 6, building the dependencies for the full lalrpop takes over four minutes, mostly spent in lalrpop-snap, which is the front end. Touching the .lalrpop file and rebuilding takes 5 seconds, which is a fair amount (9 or 10 seconds with --release). The .rs file that results from the lalrpop process is 1.48mb, which is rather large.

I created a new repo [1] that does not depend on lalrpop's frontend code, just the lalrpop-util library. A --release build of everything takes about 30 seconds. I'd estimate about 14 seconds of that is building the regex library, and my usage of it is trivial, so that could easily be eliminated. About 9 seconds of that is the ipdl.rs file, so skipping the front end doesn't really save any compilation time there. (It is slightly faster without the .lalrpop step in a non --release build.)

I also wrote a crude file shrinker for the output of the lalrpop process, shrinker.py. This removes single line comments, leading whitespace, and turns the giant parsing table into a single line that looks like ",0,0,0,0,-35,-35,0,0,0,-35,0,0,0,0,0,-35,0,0,0,0,0" etc. This reduces the size of the output file from 1.48mb to 0.32mb, which is still bad, but is less bad. It did not have any real effect on compilation time. I'll file an issue with lalrpop to make this the default, as there's really no need for all the debug comments about states in the output.

[1] https://github.com/amccreight/gen_ipdl_parser
> I'll file an issue with lalrpop to make this the default, as there's really no need for all the debug comments about states in the output.
I filed https://github.com/nikomatsakis/lalrpop/issues/164 for that.
The Rust IPDL parser now parses all 160 IPDL files I see in a regular build in 0.3 seconds (that's with the various constructs in the bugs blocking this one deleted). A lot of that could parallelized (the first 100 or so files could all be done in parallel), though that's fast enough it probably isn't worth the effort.
Depends on: 1335886
Comment on attachment 8812383 [details] [diff] [review]
Delete various IPDL constructs the Rust IPDL parser does not understand for testing purposes.

This is not needed any more, because I have removed the unsupported constructs from Python IPDL.

I landed a type checker for the Rust IPDL compiler on my github repo. It parses and type checks all of the files in about 0.2 seconds. (When I run it through Instruments, it says 58ms, but I'm not sure what the discrepancy is from.)

That's 66% of the time to just do parsing I mentioned previously, because I did some profiling and fixed a few egregious perf issues.
Attachment #8812383 - Attachment is obsolete: true
OS: Unspecified → All
Priority: -- → P2
Hardware: Unspecified → All
Version: unspecified → Trunk
I got the parser and type checker working, but this stalled out because the code generator is really big and I never got to the point of wanting to dive into that.
See Also: → 1379739
Assignee: continuation → nobody
Priority: P2 → P3
Tristan Bourvon has worked on this, including typechecking and some of the C++ code generation. He's hoping to be able to finish it off at some point, perhaps by the end of this year.
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: