Open Bug 1420263 Opened 5 years ago Updated 4 years ago
Find ways to reduce the size of compressed Binary AST files
Yoric indicated that on some JS source files, the compressed BinAST file can be about 20% larger than the corresponding compressed minified JS file. He suggested investigating opportunities for domain-specific compression approaches that improve upon our current approach (and can be layered with our existing techniques). I did some digging and filed this bug to post the results and related code.
This is just a tarball of the node project for this. The code is not particularly clean, but it does the job.
We've done quite a bit of investigation into this and I want to summarize my findings so far. Basically, while our tree-aware compression methods looked promising initially, they generally fare poorly against plain text once both are passed through a sequential compressor. There has been some limited success with dictionary-based compression of the trees (the TreeRePair algorithm, and online tree-prefix calculations), but once again the gains diminish after a pass through a sequential compressor.

My conclusion so far is that there is simply no straightforward way to express tree-based probability information as a sequence of bytes that captures the same information. In particular, if we feed the encoder/decoder direct knowledge of the tree being encoded (and of where we are in it), we can take advantage of data dependencies that a sequential compressor would be completely unable to predict. Consider the expression:

```
BinaryExpression(op="+", right => IdentifierExpression("b"), left => IdentifierExpression("c"))
```

The encoding of this, with symbolic byte constants, would look something like:

```
<Node:BinExpr> <StrIdx:"+"> <Node:IdExpr> <StrIdx:"b"> <Node:IdExpr> <StrIdx:"c">
```

We note several things:

1. The byte for <Node:BinExpr> guarantees that the following byte selects one of the small set of operator strings. The frequency distribution for these strings is governed not by general string usage, but by the frequency of operators.

2. The <Node:BinExpr> byte also directly influences the probabilities of the <Node:IdExpr> bytes. In this case a sequential encoder might be able to infer probabilities for the second <Node:IdExpr> from the <Node:BinExpr> byte four positions earlier, but not with the fidelity we get when we know for a fact that those bytes are related.
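To make point 2 concrete, here is a tiny sketch (with hypothetical counts, not measured from the corpus) comparing the ideal arithmetic-coding cost of a symbol under a context-conditioned frequency table versus a single global table:

```javascript
// Hypothetical frequency table available only when the encoder knows,
// from the tree, that the next byte is a BinaryExpression operator.
const contextual = {
  "BinaryExpression.operator": { "+": 900, "-": 50, "===": 50 },
};

// Hypothetical global table: a sequential encoder with no tree context
// must mix operator strings in with all the other strings it has seen.
const global_ = { "+": 900, "-": 50, "===": 50, "b": 4000, "c": 4000 };

// Ideal arithmetic-coding cost of `symbol` under `table`, in bits.
function bits(table, symbol) {
  const total = Object.values(table).reduce((a, n) => a + n, 0);
  return -Math.log2(table[symbol] / total);
}

console.log(bits(contextual["BinaryExpression.operator"], "+").toFixed(2)); // "0.15"
console.log(bits(global_, "+").toFixed(2)); // "3.32"
```

With the context narrowed to operator strings, the common "+" costs a fraction of a bit; mixed into a global table alongside identifier strings, the same symbol costs several bits.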
To test the general clustering and redundancy characteristics of our data, I built a small frequency counter that builds frequency tables contextualized on a particular (parent node type, child/field name) pair. The full dump is large (because I build tables over all field values as well as types), but here are some interesting excerpts:

```
Key Script=>statements
  ExpressionStatement => 949
  IfStatement         => 1
```

Here, we count the types showing up under Script.statements (an array child). We note that ExpressionStatement is more than 99% likely to show up as a child, at least on this corpus. It should take a small fraction of a bit to encode the "ExpressionStatement" node type with knowledge of the "Script=>statements" context. Next:

```
Key IfStatement=>test
  BinaryExpression         => 1285
  StaticMemberExpression   => 512
  UnaryExpression          => 458
  IdentifierExpression     => 329
  CallExpression           => 286
  ComputedMemberExpression => 24
```

Here, IfStatement.test is about 44% likely to be a BinaryExpression (1285 of 2894). I expect greater predictive capability if we go one level deeper (e.g. build BinaryExpression.op's probability distribution from (IfStatement=>test, BinaryExpression=>op) as the key). But even here, we should need only slightly more than 1 bit to encode the type of the child node at IfStatement.test. Moreover:

```
Key Block=>statements
  ExpressionStatement        => 1540
  VariableDeclaration        => 437
  IfStatement                => 380
  ReturnStatement            => 308
  ForStatement               => 27
  BreakStatement             => 18
  ContinueStatement          => 6
  ForInStatement             => 4
  TryCatchStatement          => 2
  ThrowStatement             => 2
  SwitchStatementWithDefault => 2
  SwitchStatement            => 1
  WhileStatement             => 1
```

Here, Block.statements children are dominated by ExpressionStatement. Block.statements is a very heterogeneous array, but the probability distribution of the children within it is heavily biased. Furthermore, we expect a Block's children to be encoded far away in the stream from the Block node itself (e.g. statements at the end of function bodies).
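The counter is along these lines (a minimal sketch over a simplified, hypothetical Shift-style JSON AST, not the actual binjs-ts code; the `type` discriminant and field names like `statements` are illustrative):

```javascript
// Count node types keyed by their (parent node type => field name) context,
// walking a hypothetical JSON AST where each node has a `type` property.
function countContexts(node, counts = new Map(), context = "root") {
  if (node === null || typeof node !== "object") return counts;
  if (Array.isArray(node)) {
    // Array children share their parent's (type => field) context.
    for (const child of node) countContexts(child, counts, context);
    return counts;
  }
  if (node.type) {
    const table = counts.get(context) || new Map();
    table.set(node.type, (table.get(node.type) || 0) + 1);
    counts.set(context, table);
  }
  for (const [field, value] of Object.entries(node)) {
    if (field === "type") continue;
    countContexts(value, counts, `${node.type}=>${field}`);
  }
  return counts;
}

// Tiny hypothetical tree: a Script whose statements are two ExpressionStatements.
const script = {
  type: "Script",
  statements: [
    { type: "ExpressionStatement",
      expression: { type: "IdentifierExpression", name: "a" } },
    { type: "ExpressionStatement",
      expression: { type: "IdentifierExpression", name: "b" } },
  ],
};

const counts = countContexts(script);
// counts.get("Script=>statements") holds ExpressionStatement => 2
```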
The sequential predictor has no way of associating the following two bytes:

```
<NodeType:Block> ... statements ... <NodeType:ExpressionStatement> ... <NodeType:ExpressionStatement> ...
```

The probability of each ExpressionStatement byte is influenced directly by the Block byte, and we can draw our probability tables from that knowledge, but the sequential encoder has to predict each byte with only a vague sequential model of what might come next.

I'll post the code for this on the binjs-ts repo soon. I also want to look at what the probability tables for two-level tree inference look like. I suspect there is a lot more specificity at that layer than is being exposed at a broad (parent-type, field-name) context.
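The two-level inference could be sketched by keying the tables on the last two (parent type => field) steps of the path rather than one. Everything below (AST shape, key format, field names) is an assumption for illustration, not the binjs-ts implementation:

```javascript
// Count node types keyed by the last TWO (parent type => field) path steps,
// e.g. "IfStatement=>test / BinaryExpression=>left", over a hypothetical AST.
function countDeep(node, counts = new Map(), path = []) {
  if (node === null || typeof node !== "object") return counts;
  if (Array.isArray(node)) {
    for (const child of node) countDeep(child, counts, path);
    return counts;
  }
  if (node.type) {
    const key = path.slice(-2).join(" / ") || "root"; // two-level context key
    const table = counts.get(key) || new Map();
    table.set(node.type, (table.get(node.type) || 0) + 1);
    counts.set(key, table);
  }
  for (const [field, value] of Object.entries(node)) {
    if (field === "type") continue;
    countDeep(value, counts, path.concat(`${node.type}=>${field}`));
  }
  return counts;
}

// Hypothetical IfStatement whose test is a BinaryExpression: its children are
// counted under keys like "IfStatement=>test / BinaryExpression=>left".
const ifStmt = {
  type: "IfStatement",
  test: {
    type: "BinaryExpression",
    operator: "<",
    left: { type: "IdentifierExpression", name: "x" },
    right: { type: "LiteralNumericExpression", value: 10 },
  },
  consequent: { type: "Block", statements: [] },
};

const deep = countDeep(ifStmt);
```

A two-level key like this should separate, say, BinaryExpressions under IfStatement.test from those under VariableDeclarator.init, which is exactly the extra specificity the comment above is after.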