Home > Delphi, Object Pascal > Parsing fun with anonymous methods

Parsing fun with anonymous methods

January 11, 2017 Leave a comment Go to comments

Anonymous methods can save you days, weeks and months of coding. Bug hunting in a smaller codebase is always more effective

Anonymous methods can save you days, weeks and months of coding. Bug hunting in a smaller codebase is always more effective

I must have written 100 parsers in my day. Making programming languages, script engines or just text-based file-formats, is something I have enjoyed doing since the Amiga days (80’s and 90’s). For each parser I make, I discover new and interesting aspects. That’s the beauty about programming; any programmer that propose that he is fully taught, that he is the best or that there is nothing he cannot do, is a liar. Programming is an art that is never exhausted, and there will always be new techniques to learn and different aspects to explore.

Dont underestimate anonymous methods

Delphi has supported anonymous methods for many years now, but like other languages it can take a while before a feature impacts a community’s codebase at large. Delphi is like a whale, it can take a while to turn, but when it turns it come about full force. Thankfully we have now reached the stage where generics and anonymous procedures is almost everywhere. There is still tons of classical object pascal (Delphi 7 style) online. Probably hundreds of thousands of examples and snippets people can look at. But libraries, components and new code people post now make good use of the excellent features Delphi have to offer.

Take parsing. A common method that you find in almost all parsers is ReadTo(). It’s a function that keeps on reading, char by char, until some condition is met. Usually this condition is finding a specific character, or that the parser hits a character that it doesn’t expect.

For example like this:

TCharSet= ('A'..'Z','a'..'z','_');
function ReadTo(breakers: TCharSet; var text: string): boolean;

This is more than enough to read method-names, variable names and ordinary plain-text elements. The TCharset set acts as a filter, so when Read() hits a character not in that set – the procedure exits from the read loop, returning false. There is naturally more to it than this, but it’s just to give you an idea.

With anonymous methods, we can write more flexible parsing code. While sets work just fine, it is sometimes easier to do ad-hoc checking. In a complex language or file-format the amount of set-types can grow quite large in the old model. So in-place testing is very welcome, and this is where anonymous methods are super handy.

Consider this parsing function:

function TParserBuffer.ReadTo(const Validator: TParserValidator): boolean;
begin
  SetLength(Text, 0);
  if not Empty then
  begin
    if assigned(Validator) then
    begin
      while not EOF do
      begin
        if not Validator(Current) then
        break else
        Text := text + Current;

        if not next() then
        break;
      end;

      result := Text.Length > 0;
    end else
    result := false;
  end else
  result := false;
end;

With that in place we can now do ad-hoc reading and testing like this:

type
TParserValidator = reference to function (Value: char): boolean;

LParser.ReadTo( function (value:char): boolean
  begin
    result := CharInSet(Value,['A'..'Z','a'..'z','_','0'..'9']);
  end, LText);

Advanced parsing always results in pretty much the same techniques being evolved. You can take a programmer in the US, another programmer in Germany – isolate them with no internet access and ask them to write a language parser; perhaps with a deadline like 6 months. When you audit the code you will discover that both programmers have “discovered” and “invented” more or less the same techniques. There are many ways to do parsing, but common to the topic in general are a few fundamental principles.

One of those principles is called “parser context”. The idea is that you isolate information about a parsing session in a separate object (to avoid one massive, complex and nearly undecipherable class). So the current parser position (offset into the text buffer), the column and row info – but more importantly: the “data model” that your parser have built up, which is refered to as an AST (abstract symbol tree) and many other names – is instantly available via the context.

Let’s dig into it

Im not going to write to much on this (since its a huge topic), but ok let’s dig into benefits of the context object.

An important part of the context object is a stack mechanism. As the parser moves through the source-code, it creates objects that is stored in the data-model. Objects that have clear relationships. For example, a local variable will naturally be connected to a procedure or function. A method will likewise be connected to a class.

The context-stack helps you deal with recursive parsing. You will have a class parser, who in turn will invoke a method parser when it finds that, or a property parser when that is found, or a field parser (and so on).

To simplify recursive parsing, what people do is to push the current model-object on the context stack, then it calls the parser-method in question. That parser will in turn do the same if it has sub-elements that should be parsed into itself as the parent. Using a stack saves a lot of time and allows you to write very clean, effective parsing methods.

Take a Delphi class, a simple class with some fields, a couple of properties and a few methods. The process for parsing that would look something like this:

  • ParseClass
    • Parse ancestor
  • Parse fields
    • Parse field declaration
    • Parse datatype
  • Parse method
    • Parse parameters
      • Parse datatype
      • Parse result datatype (function)
  • Parse properties
    • Parse getter
    • Parse setter

Instead of having to write code to look-up the parent for each “parse job”, or even worse – having to pass along the parent via parameters or setting properties (remember, we use a context to de-couple data from process), the stack allows us to define the target or parent for each child job:

// pass the "unit" on the stack
Context.Stack.Push(self.unitmodel);
try
  ParseClass();
finally
  context.stack.pop();
end;

And the same style of code just continues in all the sub-elements:

// get the unit model-object from the stack
LUnit:= TUnitNModel(Context.Stack.peek);

// Add a class object to the model
LClass:= LUnit.AddClassModel();

// push that onto the stack for children
Context.Stack.Push(LClassModel );
try
  //Naturally more work here, but through the
  //context object the parser now always knows
  //the parent where new data should be stored
  ParseFields</span>();
  ParseMethods</span>();
  ParseProperties()
finally
  context.stack.pop();
end;

Anonymous procedures and functions now simplifies this even further. Being able to pass methods as parameters means that recursive and complex jobs can be written more effectively. Where I previously had to implement say, a “4 step” parsing job (where each parser calls the other for sub-elements) as classes, i can now do pretty much the same in 4 calls (!)

Context.Buffer.push();
try
  Context.Buffer.ReadTo( function(value:char):boolean
    begin
      context.buffer.push();
      try
        Context.Buffer.ReadTo( function(value:char):boolean
          begin
            if bla bla bla
      finally
        Context.Buffer.Pop();
      end;
    end);
finally
  Context.Buffer.pop();
end;

Notice the context.buffer.push()? Well, that a feature for pushing the current position and offset of the parser on a stack. So now I can easily proof-read ahead quite deep structures, and return to the very position the declaration started at. This is very helpful when dealing with distinctions. For instance, when telling the difference between a method call locally or in another object:

myproc;
FObject.someproc;

In the above source, we would hit “.”, then do an anonymous reading session to pickup “someproc;”, return back without losing the current position – and eval what laws should be applied to the text ahead. Dealing with parameter brackets, be they empty of with data, also becomes a recursive and simple task.

Now I have just started to play with anonymous methods in my latest parser, so there are going to be a lot of exciting techniques to explore.

Cheers guys!

Advertisements
  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: