Compilation
graph LR; A{{ Source }} --> B[Lexer] --> C[Parser] --> D[Compiler] --> E{{ Executable }} click B "/vm/source-to-success-compilation.html#lex" click C "/vm/source-to-success-compilation.html#parse" click D "/vm/source-to-success-compilation.html#compile"
We’ll look at each step in more detail now.
Lex
In this step, the Hark source code is transformed into a stream of “tokens” representing valid pieces of syntax (e.g. numbers, identifiers, keywords).
graph LR; A{{"Source code (.hk file)"}} --> B{{Token stream}}
Relevant functions & classes:
compile_file
-- hark_lang/load.pytl_parse
-- hark_lang/hark_parser/parser.pyHarkLexer
-- hark_lang/hark_parser/parser.py
To see what the token stream looks like, load the file with the DEBUG_LEX
environment variable set:
$ DEBUG_LEX=1 hark service.hk
ID : import
( :
ID : foo
, :
ID : pysrc
, :
NUMBER : 1
) :
TERM : ;
FN : fn
ID : bar
( :
ID : x
... etc
Note:
- each token has a “type” (e.g.
ID
-- Identifier, orFN
-- the “define function” keyword) - some tokens have a “value” (e.g.
foo
) - punctuation is syntax (e.g.
(
) - some literals have their own tokens (e.g.
"true"
->TRUE
)
The lexing is done by Sly.
Parse
Here, the token stream is converted into a tree-like data-structure representing valid Hark program structure.
graph LR; A{{Token stream}} --> C --> B{{"Abstract syntax tree (AST)"}} C("(nested) expressions")
Relevant functions & classes:
tl_parse
-- hark_lang/hark_parser/parser.pyHarkParser
-- hark_lang/hark_parser/parser.pyNode
-- hark_lang/hark_parser/nodes.py
We don’t have a good visualisation of the Hark AST yet :(. Make one and submit a PR for Issue #14!
What does valid program structure mean? Hark programs are sequences of “expressions”. Everything is an expression, and expressions may contain other expressions.
The AST is made up of Node
instances, which represent expressions.
For example, the token TRUE
is parsed as an expression (expr
) into a Node
subclass called N_Literal
, which holds a literal value (Python True
in this
case).
@_("TRUE")
def expr(self, p):
return N(self, p, n.N_Literal, True)
Note: N
is a factory function to construct a Node with debug symbols taken
from the parser.
Another example. The sequence of tokens representing an if-expression (yup, if
is also an expr
) are parsed into a Node to represent it.
@_("IF expr block_expr TERM rest_if")
def expr(self, p):
return N(self, p, n.N_If, p.expr, p.block_expr, p.rest_if)
Here, block_expr
and rest_if
are parser items with rules expressed
elsewhere.
This is how a function call, e.g. foo(x)
, is parsed:
@_("ID '(' arglist ')'")
def expr(self, p):
identifier = N(self, p, n.N_Id, p.ID)
return N(self, p, n.N_Call, identifier, p.arglist)
ID
:foo
'('
: a literal opening bracketarglist
: a parser rule to parse a list of arguments (defined elsewhere)')'
: a literal closing bracket
Check the Sly docs for more details.
Error handling: The parser is currently fragile - any error causes immediate failure. A more conventional approach would be to attempt to continue and present all errors at the end.
Compile
The AST is now compiled into an Executable -- a multi-step process.
graph TB; A{{"AST containing Nodes"}} F{{Executable}} A --> B[Register imports] --> B2[Wrap Python functions] --> F A --> 1A 1C --> F subgraph Compile Function 1A[Optimise AST] --> 1B[Compile expressions] --> 1C[Change symbolic Gotos to Jumps] end
Relevant functions & classes:
Node
-- hark_lang/hark_parser/nodes.pyCompileToplevel
-- hark_lang/hark_compiler/compiler.pyoptimise_block
-- hark_lang/hark_compiler/compiler.pyreplace_gotos
-- hark_lang/hark_compiler/compiler.pyExecutable
-- hark_lang/machine/executable.pyInstruction
-- hark_lang/machine/instructionset.py and hark_lang/machine/instruction.pyTlType
-- hark_lang/machine/types.py
Currently, each Hark function is compiled individually. So there is no opportunity for cross-function optimisation.
First, let’s briefly look at the (simplified) Executable class to know where we’re going.
@dataclass
class Executable:
code: List[Instruction]
bindings: Dict[str, TlType]
locations: Dict[str, int]
attributes: dict
code: All of the executable machine instructions.
bindings: A map of string names (identifiers) to either hark functions or imported Python functions.
locations: Lookup-table of Hark function locations in code.
attributes: (not used yet) Attributes of Hark functions for compiler/runtime behaviour configuration.
Here’s the result of compiling service.hk
.
$ hark asm service.hk
BYTECODE:
/
| ;; #F:foo:
| 0 | PUSHB foo
| 1 | CALL 1
| 2 | RETURN
| ;; #1:bar:
| 3 | BIND x
| 4 | POP
| 5 | PUSHV 1
| 6 | PUSHB x
| 7 | PUSHB +
| 8 | CALL 2
| 9 | RETURN
| ;; #2:compute:
| 10 | BIND x
| 11 | POP
| 12 | PUSHB x
| 13 | PUSHB foo
| 14 | ACALL 1
| 15 | BIND a
| 16 | POP
| 17 | PUSHB x
| 18 | PUSHB bar
| 19 | CALL 1
| 20 | BIND b
| 21 | POP
| 22 | PUSHB b
| 23 | PUSHB a
| 24 | WAIT 0
| 25 | PUSHB +
| 26 | CALL 2
| 27 | RETURN
| ;; #3:main:
| 28 | PUSHV 1
| 29 | PUSHB compute
| 30 | CALL 1
| 31 | RETURN
\
BINDINGS:
NAME VALUE
foo ....... <TlForeignPtr pysrc.foo>
bar ....... <TlFunctionPtr #1:bar>
compute ... <TlFunctionPtr #2:compute>
main ...... <TlFunctionPtr #3:main>
This shows how each Hark function has an associated block of bytecode, and the
one imported Python function has been wrapped (using a #F:
prefix to indicate
that it is different from the other functions that are just #n:
).
Register imports
Any import
expressions at file top-level are saved as named bindings in the
final executable. They’re also wrapped in Hark functions so that they can be
called just like Hark functions (#F:foo
above). Currently this wrapped version
is only needed when calling a Python function asychronously -- usually, the
Python function is called directly.
Compile Functions
Each top-level definition is compiled into a “body” of instructions and a named
binding. The final executable code is created by simply concatenating all of the
bodies together and saving the locations by name (e.g. bar -> #1:bar -> code[3]
above).
Optimise AST
Currently tail-recursion optimisation is implemented, which makes recursive functions significantly faster because no activation record (stack frame) is created in the recursive call.
There’s lots of scope for develoment here! e.g. Issue #15
Compile Expressions
Each node (representing an expression) in the tree must be converted into a sequence of VM instructions (bytecode).
Examples of compiling expressions:
- convert literal Python types to Hark types
- convert
N_If
(the if-expression Node) into linear instructions and conditional Jumps
For example, a literal value is simply pushed onto the data stack:
@compile_expr.register
def _(self, n: nodes.N_Literal):
val = mt.to_hark_type(n.value)
return [mi.PushV.from_node(n, val)]
mi
is the machine instruction module, and mt
is machine types.
The from_node
constructor copies node debug data into the instruction so that
tracebacks can be created.
A function call is more interesting as the arguments have to be evaluated first.
def _compile_call(self, n: nodes.N_Call, is_async: bool) -> List[Instruction]:
arg_code = flatten(self.compile_expr(arg) for arg in n.args)
instr = mi.ACall if is_async else mi.Call
return (
arg_code
+ self.compile_expr(n.fn)
+ [instr.from_node(n, mt.TlInt(len(n.args)))]
)
arg_code
contains a list of instructions to evaluate each of the function
arguments, which is simply prefixed to the function call instruction.
Note that n.fn
, the function identifier, is also compiled -- the Call (or
ACall) instruction requires the identifier to be on the stack.