From 7cdd4ee759c5f5071483b11caeaa7d13eb2618bd Mon Sep 17 00:00:00 2001
From: Emin Arslan
Date: Wed, 4 Feb 2026 22:54:53 +0300
Subject: [PATCH] updated the design document

---
 doc/env.md | 176 +++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 158 insertions(+), 18 deletions(-)

diff --git a/doc/env.md b/doc/env.md
index 6df89f7..824cf09 100644
--- a/doc/env.md
+++ b/doc/env.md
@@ -1,41 +1,181 @@
 This document holds my design notes for lexical and global environments
 for this compiler. I have not yet named the language.
 
-# Environments
+# Closures
 
-An environment is an integral part of the runtime of the language.
-There is a global environment that holds the values of all global
-symbols.
-
-Lexical environments generally don't exist in practice, instead we use
-flat closures. When a closure is created at runtime, all free variables
+The environment system implements flat closures.
+When a closure is created at runtime, all free variables
 it uses are packaged as part of the function object, then the function
 body uses a GetFree instruction to get those free variables by an index.
 
-Free variables are propagated from inner closures outwards. This is necessary,
-as this also handles multiple-argument functions gracefully.
-
+(Free variables are propagated from inner closures outwards. This is necessary,
+as this also handles multiple-argument functions gracefully.)
 ```scheme
 (let ((a 10))
   (print (+ a 5)))
 ```
+This code will be compiled as a lambda that takes a single parameter and executes
+the body `(print (+ a 5))`, which is called immediately with the value 10.
+
+The compiler tries to perform symbol resolution on expressions in the body of the
+let as well, however it sees no other expressions creating further scopes.
+
+Since there are two free symbols in this code (`+` and `print`), and the surrounding
+environment does not have these two symbols defined locally, both of these symbols
+will be resolved to their global definitions directly.
+
+Now let's examine a classic example of closures:
+
+```scheme
+(define (adder x)
+  (lambda (y) (+ x y)))
+```
+
+The adder function takes an argument x, and creates and returns a function that adds x
+to its argument.
+
+This is implemented by a compiler pass that resolves symbols. Starting from top-level
+expressions, it scans downwards, noting every free symbol. A free symbol is one
+that is used in an expression, yet has no value defined locally in that expression.
+In other words, its value must come from the surrounding scope.
+
+In this example, the adder function has a symbol x that is a part of its function definition.
+This is clearly not a free variable. However, examining the inner lambda expression,
+we can see that it uses y (which is not free) and x. The value of x is not defined
+as part of the lambda expression, so it must be free.
+
+The compiler, seeing this, notes that the inner lambda has a free variable `x`, and a parameter
+`y`. Thus, the lambda has 1 free variable and 1 parameter. This means the closure object will have
+a code pointer along with an array of length 1 forming the storage for the free variable(s).
+The compiler compiles the body of the lambda such that every occurrence of `x` is replaced
+with code to get free variable #0 from the current closure. (`y` is, naturally, parameter #0).
+Otherwise, no special handling is necessary.
+
+The inner lambda has no other expressions creating further scopes, so the compiler
+knows it has hit the deepest scope in the expression, and starts scanning outwards once again.
+
+Scanning outwards, the compiler sees that there is a defined symbol x, and in the scope
+of this definition, a lambda expression that uses a free symbol named x is used.
+The compiler matches these, and compiles the lambda expression (as in, the value that the lambda
+expression will evaluate to) such that it creates a closure object: a pair of a code pointer
+pointing to the already compiled body, and an array of length 1 containing the current
+value of x.
+
+This newly created value represents the closure. As you might notice, the current value
+of x has been copied into the closure object. The closure is now returned, and the
+scope of `adder` is destroyed. The closure object survives.
+
+Note: in actuality, the outer `adder` function itself is also a closure. The inner
+lambda actually has *two* free variables: `+` is also a symbol, and its value is not
+defined in the body of the lambda. Since `adder` also doesn't define it, the free symbol
+is propagated outwards, and adder also accesses it as a free variable. The compiler
+(when propagating free symbols) eventually reaches the global environment, and
+resolves these free symbols to their global definitions.
+
+This behaviour is necessary (for some definition of "necessary") to ensure correct runtime
+behaviour. This is because all symbols are `set!`able. Thus, the adder function can be
+defined while `+` is bound to its builtin value, and `+` can later be set to a different value.
+The following is valid:
+
+```
+(define (adder x)
+  (lambda (y) (+ x y)))
+(set! '+ 5)
+; + now equals 5, but adder still works.
+```
+
+This behaviour may seem ridiculous (why on earth would anyone define `+` to be `5`?),
+and it may be tempting to prevent using `set!` on standard library symbols, but this is perfectly
+valid for global symbols defined by the user.
+
+## Note on currying
+
+Because this language is actually a curried variant of lisp/scheme, the
+above function could also be written like this:
+
+```scheme
+(define (adder x y) (+ x y))
+```
+
+or, even like this:
+
+```scheme
+(define adder +)
+```
+
+... since the built-in `+` function is also already curried.
+In fact, the entire language is curried. All function calls are (or behave as if
+they were) unary. The function call syntax `(f x y)` is actually treated as
+`((f x) y)` by the compiler.
+
+## Note on syntax
+
+I am using more or less regular Scheme syntax in this document. However, this is
+potentially subject to change. I have not decided on what the official syntax
+should be like. I am using Scheme syntax simply because I think it is fairly clean,
+but some changes might make sense in the future as the semantics of this language
+deviate greatly from Scheme's.
+
+## Note on performance
+
+This design document may raise concerns about performance. If everything above is
+truly set in stone, then it seems obvious that there will be a performance
+penalty.
+
+As written, this design requires a basic addition like `(+ 1 2)` to allocate a
+closure object after all. No matter how fast OCaml's minor heap may be
+(and it is plenty fast, to be fair), that is not going to go well in a tight loop.
+
+These are valid concerns, and I am currently leaving these problems to my future
+self.
+
+Optimizing multiple-argument functions is actually fairly straightforward (or
+it looks easy, at least), however I want to first make sure the language
+has consistent semantics. A slow language is better than no language, after all.
+So I intend to add the facilities necessary for these optimizations into the
+compiler at a later point.
 
 ## Global Definitions
 
-Any symbol defined through a top-level `define` form is made globally available
-after the definition form.
+Global definitions get a separate section because they're mostly straightforward.
 
-This is the most common use for define.
+Any symbol defined through a top-level `define` form is made globally available
+after the definition form. More accurately, the symbol is present in the program
+before the define is reached, however it will be bound to a dummy value until
+the define is executed.
+
+This behaviour is proposed for the purpose of allowing mutually
+recursive definitions without issue, however please note that this is not yet certain,
+because this design comes with the tradeoff that errors involving symbols accessed
+before the point they are supposed to be defined can only be detected at runtime.
+
+To illustrate the problems this could cause:
+
+```
+(define b (+ a 10))
+(define a 5)
+```
+
+This is pretty clearly an error - yet the compiler cannot, as proposed, determine
+this. In the future, further passes over the source code could be added to scan
+for such issues, or a distinction between top-level function and variable
+definitions could be introduced to prevent this.
+
+Notably, this problem does not occur for function definitions. In fact, the following
+is perfectly fine despite looking a bit similar:
+
+```
+(define (b) (+ a 10))
+(define a 5)
+```
 
 Generally any symbol appearing in the body of a function, will only be compiled to
 access that symbol. The symbol is only accessed once the function is
 called. Thus, you can create mutually recursive functions at the top level with
 no issue.
 
-## Local Definitions
-
-It is valid to use `define` forms in body sections. Informally, a body section
-is the body of most built-in forms, including `lambda`, `let`, and `letrec`.
-
+The body of the definition is only executed once the `define` form is reached.
+Thus, definitions with side effects will execute exactly in the order they
+appear in the source.