Introducing hare-lex

I have been pretty much focused on hare-template these days. My goal was to improve its syntax parser, and specifically the embedded Hare parsing. Previously, a very lazy job has been done to pass through the Hare syntax, and to let the compiler to detects the syntax issues. I tried really hard to embed the standard library Hare lex and parser module, and to use it directly from the hare-template lexer itself. This was a dead-end because we adapt the token list, add new tokens, or change how we detect some patterns. At some point I would have to re-implement the lexer and parser in a more flexible manner.

If I am going the hard way, I might as well implement a generic purpose Lexing library. This library would embed a pre-configured Hare lexer ruleset, that hare-template could use directly. I seeked at existing approachs, and found the work of Tim Henderson. That suited my use-cases really well, and am sure would cover yours. hare-lex use the same ideas, adapted to Hare.

Hare-lex is based on graph science. Currently, the default backend is a non-deterministic finite automata (NDFA), but I am already working on a determistic (DFA) that could theorically be faster. That being said, the NDFA is already fast enough to replace the hare-templater lexer already.

Let’s present how to use hare-lex:

The user prepare actions callbacks to compile a backend, and to init the lexer. The longest pattern matched prefix wins. In case of ties, the pattern with the highest precedence wins.

let actions: []lex::action = [];
defer free(actions);

append(actions, lex::action {
	expr = `"([^\\"]|(\\.))*"`,
	cb = &literal,
	name = "LIT_STR",
	...
})!;

// use default backend (DFA without environment variable)
const backend = lex::def_backend()!(actions)!;
defer lex::destroy(backend);

// a buffered stream designed for the machinery efficiency
const rbuf: [os::BUFSZ]u8 = [0...];
const lexbuf = buffer::init(in, rbuf)!;
defer io::close(&lexbuf)!;

const lexer = lex::init(backend, &lexbuf);
defer lex::finish(&lexer);

An action callback is associated with a regular expression to match the tokens. The action callbacks are free to initialize tokens as they please, but the [[scanner]] object provide some convenient functions.

fn literal(
	scan: *lex::scanner,
	lexeme: []u8,
	user: nullable *opaque,
) (size | *lex::token | lex::error) = {
	return lex::scan_token(scan, void, len(lexeme));
};

This action callback would return a token of the added action type (ex: “LIT_STR”), with a void value, and lexing the complete lexeme pattern that has been matched (ex: ‘“foo”’).

When the callback simply returns a size, it represents the lexeme size to swallow. This can be used to ignore some patterns, as white-spaces or line returns.

append(actions, lex::action {
	expr = "( |\t|\n|\r)+",
	cb = &skip,
	...
})!;

fn skip(
	scan: *lex::scanner,
	lexeme: []u8,
	user: nullable *opaque,
) (size | *lex::token | lex::error) = {
	return len(lexeme);
};

Action callbacks can be used to match hatch symbols, and then to lex the scanned input manually.

append(actions, lex::action {
	expr = `\<`,
	cb = &html,
	name = "ID"
	...
})!;

fn html(
	scan: *lex::scanner,
	lexeme: []u8,
	user: nullable *opaque,
) (size | *lex::token | lex::error) = {
	const start = scan.start;
	let brk = 0z;
	for (let i = 0z; true; i += 1) {
		const r = match (buffer::read_rune(scan)?) {
		case io::EOF =>
			break;
		case let this: rune =>
			yield this;
		};
		if (r == '<') {
			brk += 1;
		} else if (r == '>') {
			brk -= 1;
		};
		if (brk == 0) {
			return lex::scan_token(scan, void, i + 1);
		};
	};

	return lex::syntaxf(start, "unclosed HTML literal");
};

The very last subtlety is that you can differenciate the lexeme and morphene. This can be used to separate the full bytes to its meaning part. Consider you are lexing this Hare syntax:

const foo = "bar"; // some comment

In that situation, the very last semicolon ; is attached to the next comment // some comment\n. When using the [[flag::COMMENT]], the hare-lex Hare lexer would concatenate this comment pattern to most of the expressions, to tie this as a whole:

fn name(
	scan: *lex::scanner,
	lexeme: []u8,
	user: nullable *opaque,
) (size | *lex::token | lex::error) = {
	const lexer = lex::scan_lexer(scan): *lexer;
	const morphene = slicecomment(lexer, lexeme)!; // rtrim the comment
	return lex::scan_token(scan, void, morphene, len(lexeme));
};

Another use case of this is to handle the hare-template “{{{” escaped brackets. Its morphene is “{{”, but the lexeme is “{{{”. So we “swallow” the three brackets, but we then use the token that represent two of them.

With that new library in hand, I was able to convert the modules from the standard library that lex and parse the Hare syntax. There is also a generic parse modules to easily parse syntax for user barebones lexers, while bubbleing errors on encouter. This comes with the full test-suite that make sure everything works well.

Now hare-template syntax would error with a clear message when errors in the middle of Hare expressions are detected. Consider this example, with a missing semicolon after 0z:

#[template::gen(foo: size)]
export def subject1 = "{{ for (let i = 0z i < foo; i += 1 }}foo {{ i }}{{ end}}";

Would produce this error message while using the codegen tool:

Template error: subject1:1:20: syntax error: Unexpected 'HARE_NAME', was expecting 'HARE_SEMICOLON'