Perfect statement inference: a grammar-driven approach
In programming languages like C, C++, Rust, Zig, C# and Java, statements are terminated by semicolons. For example, in C:
int square(const int t_num) { return t_num * t_num; }
In programming languages like Python, JavaScript, Lua, Go, Scala, Swift and Kotlin, a semicolon is not required to indicate to the compiler when a statement has ended. How these languages infer when a statement has ended differs. I recommend reading this article, which details the differences between the strategies different programming languages use to tell when a statement has ended.
Programming languages that do not require semicolons usually follow one of two strategies for telling when a statement has been terminated.
- Semicolon insertion strategy
- Grammar driven strategy
Strategy 1
Programming languages that follow strategy 1 are:
- JavaScript
- Go
- Scala
For these languages, semicolons are inserted by the lexer according to a custom ruleset (in JavaScript's case, the parser inserts the semicolons). This works most of the time, but there are cases where a semicolon is inserted where it should not be. For example, Go's lexer ruleset will insert a semicolon if a line ends in an identifier or literal (among a few other tokens). So this code will be correctly terminated:
num := 10 // Semicolon inserted
sum := num + num // Semicolon inserted
sum = num +
    num // Semicolon inserted
But this code will not work:
num := 10 // Semicolon inserted
sum := num // Semicolon inserted
+ num // Error: +num (value of type int) is not used
The issue with this strategy is that the lexer acts on individual tokens. Since the lexer has no understanding of the program's grammar, it fails to look at the bigger picture.
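To make the token-level nature of this strategy concrete, here is a minimal Python sketch of a lexer that appends a semicolon based only on the last token of each line. It is a simplified approximation of Go's rule, not Go's actual implementation: the names `TOKEN`, `CLOSERS`, `ends_statement` and `insert_semicolons` are made up for this illustration, and keywords are not distinguished from identifiers.

```python
import re

# Simplified token pattern: identifiers, integer literals and a few operators.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|:=|\+\+|--|[-+*/=(){}\[\]]")

# Non-alphanumeric tokens after which this toy lexer appends a semicolon.
# Assumption: a simplified subset of Go's real rule.
CLOSERS = {")", "]", "}", "++", "--"}


def ends_statement(token):
    """Return True if a line ending in `token` gets a semicolon."""
    return token[0].isalnum() or token[0] == "_" or token in CLOSERS


def insert_semicolons(source):
    """Tokenize line by line and append ';' based only on the last token."""
    out = []
    for line in source.splitlines():
        code = line.split("//")[0]  # drop line comments
        tokens = TOKEN.findall(code)
        if not tokens:
            continue
        if ends_statement(tokens[-1]):
            tokens.append(";")
        out.append(" ".join(tokens))
    return out


# Trailing operator: the first line stays open, the statement survives.
print(insert_semicolons("sum := num +\n    num"))
# ['sum := num +', 'num ;']

# Leading operator: the first line is closed too early.
print(insert_semicolons("sum := num\n    + num"))
# ['sum := num ;', '+ num ;']
```

The function never looks at the next line, which is exactly why the leading `+` ends up as a separate statement.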
Strategy 2
I think the second strategy is a lot more promising than the first. The following programming languages follow a grammar-driven approach to statement inference:
- Lua
- Kotlin
- Swift
Lua
Lua's approach is very eager. Lua statements often do not require termination. This works because Lua's grammar is limited in some cases. For example Lua does not allow expressions as statements (more about this later):
2 + 2 -- Not allowed
-- For all the following statements a == 20
a = 10 + 10
a = 10 +
    10
a = 10
    + 10
Lua completely discards newlines; its grammar does not require them. This can lead to some funky looking but valid code.
a=10 b=20 c=30 print (c) -- This will print 30
In Lua it is recommended to put a semicolon before a statement that starts with a parenthesis. This prevents Lua from being too greedy when parsing.
a = num + num; -- Prevents Lua from resolving to a num() call
(print)(a)
Kotlin
Kotlin deals with newlines explicitly in its grammar. Newlines are treated like semicolons when a statement can be terminated.
fun main() {
    var apples = 10 + // Statement can't be terminated, continue parsing
        10
    println(apples) // Prints 20
}
This way of parsing is not perfect. Whilst it prevents some of Lua's overly greedy parsing cases, you must indicate to the parser that the rest of the expression is on the next line if you want your statements to be parsed as intended. For example, if an infix operator starts on a new line, the expression is silently split into two statements:
fun main() {
    var apples = 10 // Statement can be terminated
    + 10 // Statement is treated as unary plus
    println(apples) // Prints 10, which is probably not intended
}
If only we had some way to resolve the ambiguity in these kinds of cases (more about this later).
Swift
Swift is interesting because its approach is mostly like Kotlin's: a newline is treated as a terminator when possible. In addition, Swift uses a spacing rule to distinguish between unary operators and infix operators. A prefix operator may not be followed by whitespace, and a postfix operator may not be preceded by whitespace. For example:
// Correct:
+20 // Unary plus
num++ // Postfix increment (the ++ operator was removed in Swift 3)
// Wrong:
+ 20 // This is seen as an infix operator
num ++ // This is an error
This way there is no ambiguity between unary operators and infix operators, so Swift avoids the pitfalls around operator parsing and statement inference described above.
var num = 10
    + 10 // Addition
print(num) // Prints 20
num = 10 // num == 10
+10 // Unary plus
print(num) // Prints 10
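The spacing rule can be sketched as a small classifier. This is a simplified Python illustration of the idea, not Swift's real lexer: the function name `classify_operator` is hypothetical, line boundaries are assumed to count as whitespace, and the special cases the real language has are ignored.

```python
def classify_operator(source, op):
    """Classify the first occurrence of `op` in `source` by its surrounding whitespace.

    Assumption: the start and end of the source count as whitespace.
    """
    start = source.index(op)
    end = start + len(op)
    space_before = start == 0 or source[start - 1].isspace()
    space_after = end == len(source) or source[end].isspace()
    if space_before == space_after:
        return "infix"    # whitespace on both sides, or on neither
    if space_before:
        return "prefix"   # whitespace only on the left, e.g. +10
    return "postfix"      # whitespace only on the right, e.g. num++


print(classify_operator("var num = 10\n    + 10", "+"))  # infix
print(classify_operator("num = 10\n+10", "+"))           # prefix
print(classify_operator("num++", "++"))                  # postfix
```

Because the decision depends only on the characters next to the operator, a newline before `+ 10` does not end the statement, while a newline before `+10` does.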
Expressions as statements
Whilst reading up on different strategies for inferring statements, the following struck me: why allow expressions as statements at all?
Consider the following case:
void func(){ 2 + 2; }
This addition does nothing at all!
There are only a couple of programming languages that I know of that actually have a use for expressions as statements.
For example, Rust and a few functional programming languages allow you to omit the return keyword and instead return an expression:
fn func() -> i32 {
    2 + 2 // The function now returns 4
}
I see no reason to allow expressions as statements. Go actually errors if you do not use the result of an expression.
If you disallow expressions as statements (excluding function calls), you can clear up the ambiguity between infix and unary operators without requiring the spacing rule.
This would allow you to write code like this in Kotlin:
fun main() {
    // We check if there is an infix operator after the newline
    // Expressions can't be statements so it must be part of the `var apples` statement
    var apples = 10
        + 10
    println(apples) // Prints 20
    // We can check for the . operator after newlines
    val obj = Object() // Hypothetical class with method1() and method2()
    obj
        .method1()
        .method2()
}
This would lead to perfect statement inference. By making newlines part of the grammar, you can also control where newlines are valid. This would prevent the programmer from writing code like this:
func example() {
    someFunc // Error: expected a '('
    ()
    // I would prefer return statements
    return 2 + 2
}
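As a rough illustration of the idea, the following Python sketch groups physical lines into statements under the assumption that a bare expression can never be a statement. It is only a line-based approximation, not a real parser and not Crow's implementation: the names `INFIX` and `split_statements` are invented for this example, and string literals and trailing comments are not handled.

```python
# Tokens that can only continue an expression when they start or end a line,
# given the assumption that expressions are not allowed as statements.
INFIX = ("+", "-", "*", "/", ".")


def split_statements(source):
    """Group physical lines into statements.

    A line that begins with an infix operator or '.' continues the previous
    statement, and so does any line that follows a trailing infix operator.
    """
    statements = []
    for raw in source.splitlines():
        line = raw.strip()
        if not line or line.startswith("//"):
            continue  # skip blank lines and whole-line comments
        continues_previous = line.startswith(INFIX)
        previous_open = bool(statements) and statements[-1].endswith(INFIX)
        if statements and (continues_previous or previous_open):
            statements[-1] += " " + line
        else:
            statements.append(line)
    return statements


source = """
var apples = 10
    + 10
println(apples)
obj
    .method1()
    .method2()
"""
print(split_statements(source))
# ['var apples = 10 + 10', 'println(apples)', 'obj .method1() .method2()']
```

The leading `+` is no longer ambiguous here because a unary-plus expression on its own line would not be a valid statement in such a grammar.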
I plan on using this strategy for the grammar of my programming language Crow.