Parsing fun with anonymous methods
Dont underestimate anonymous methods
Delphi has supported anonymous methods for many years now, but like other languages it can take a while before a feature impacts a community’s codebase at large. Delphi is like a whale, it can take a while to turn, but when it turns it come about full force. Thankfully we have now reached the stage where generics and anonymous procedures is almost everywhere. There is still tons of classical object pascal (Delphi 7 style) online. Probably hundreds of thousands of examples and snippets people can look at. But libraries, components and new code people post now make good use of the excellent features Delphi have to offer.
Take parsing. A common method that you find in almost all parsers is ReadTo(). It’s a function that keeps on reading, char by char, until some condition is met. Usually this condition is finding a specific character, or that the parser hits a character that it doesn’t expect.
For example like this:
TCharSet= ('A'..'Z','a'..'z','_'); function ReadTo(breakers: TCharSet; var text: string): boolean;
This is more than enough to read method-names, variable names and ordinary plain-text elements. The TCharset set acts as a filter, so when Read() hits a character not in that set – the procedure exits from the read loop, returning false. There is naturally more to it than this, but it’s just to give you an idea.
With anonymous methods, we can write more flexible parsing code. While sets work just fine, it is sometimes easier to do ad-hoc checking. In a complex language or file-format the amount of set-types can grow quite large in the old model. So in-place testing is very welcome, and this is where anonymous methods are super handy.
Consider this parsing function:
function TParserBuffer.ReadTo(const Validator: TParserValidator): boolean; begin SetLength(Text, 0); if not Empty then begin if assigned(Validator) then begin while not EOF do begin if not Validator(Current) then break else Text := text + Current; if not next() then break; end; result := Text.Length > 0; end else result := false; end else result := false; end;
With that in place we can now do ad-hoc reading and testing like this:
type TParserValidator = reference to function (Value: char): boolean; LParser.ReadTo( function (value:char): boolean begin result := CharInSet(Value,['A'..'Z','a'..'z','_','0'..'9']); end, LText);
Advanced parsing always results in pretty much the same techniques being evolved. You can take a programmer in the US, another programmer in Germany – isolate them with no internet access and ask them to write a language parser; perhaps with a deadline like 6 months. When you audit the code you will discover that both programmers have “discovered” and “invented” more or less the same techniques. There are many ways to do parsing, but common to the topic in general are a few fundamental principles.
One of those principles is called “parser context”. The idea is that you isolate information about a parsing session in a separate object (to avoid one massive, complex and nearly undecipherable class). So the current parser position (offset into the text buffer), the column and row info – but more importantly: the “data model” that your parser have built up, which is refered to as an AST (abstract symbol tree) and many other names – is instantly available via the context.
Let’s dig into it
Im not going to write to much on this (since its a huge topic), but ok let’s dig into benefits of the context object.
An important part of the context object is a stack mechanism. As the parser moves through the source-code, it creates objects that is stored in the data-model. Objects that have clear relationships. For example, a local variable will naturally be connected to a procedure or function. A method will likewise be connected to a class.
The context-stack helps you deal with recursive parsing. You will have a class parser, who in turn will invoke a method parser when it finds that, or a property parser when that is found, or a field parser (and so on).
To simplify recursive parsing, what people do is to push the current model-object on the context stack, then it calls the parser-method in question. That parser will in turn do the same if it has sub-elements that should be parsed into itself as the parent. Using a stack saves a lot of time and allows you to write very clean, effective parsing methods.
Take a Delphi class, a simple class with some fields, a couple of properties and a few methods. The process for parsing that would look something like this:
- Parse ancestor
- Parse fields
- Parse field declaration
- Parse datatype
- Parse method
- Parse parameters
- Parse datatype
- Parse result datatype (function)
- Parse parameters
- Parse properties
- Parse getter
- Parse setter
Instead of having to write code to look-up the parent for each “parse job”, or even worse – having to pass along the parent via parameters or setting properties (remember, we use a context to de-couple data from process), the stack allows us to define the target or parent for each child job:
// pass the "unit" on the stack Context.Stack.Push(self.unitmodel); try ParseClass(); finally context.stack.pop(); end;
And the same style of code just continues in all the sub-elements:
// get the unit model-object from the stack LUnit:= TUnitNModel(Context.Stack.peek); // Add a class object to the model LClass:= LUnit.AddClassModel(); // push that onto the stack for children Context.Stack.Push(LClassModel ); try //Naturally more work here, but through the //context object the parser now always knows //the parent where new data should be stored ParseFields</span>(); ParseMethods</span>(); ParseProperties() finally context.stack.pop(); end;
Anonymous procedures and functions now simplifies this even further. Being able to pass methods as parameters means that recursive and complex jobs can be written more effectively. Where I previously had to implement say, a “4 step” parsing job (where each parser calls the other for sub-elements) as classes, i can now do pretty much the same in 4 calls (!)
Context.Buffer.push(); try Context.Buffer.ReadTo( function(value:char):boolean begin context.buffer.push(); try Context.Buffer.ReadTo( function(value:char):boolean begin if bla bla bla finally Context.Buffer.Pop(); end; end); finally Context.Buffer.pop(); end;
Notice the context.buffer.push()? Well, that a feature for pushing the current position and offset of the parser on a stack. So now I can easily proof-read ahead quite deep structures, and return to the very position the declaration started at. This is very helpful when dealing with distinctions. For instance, when telling the difference between a method call locally or in another object:
In the above source, we would hit “.”, then do an anonymous reading session to pickup “someproc;”, return back without losing the current position – and eval what laws should be applied to the text ahead. Dealing with parameter brackets, be they empty of with data, also becomes a recursive and simple task.
Now I have just started to play with anonymous methods in my latest parser, so there are going to be a lot of exciting techniques to explore.