Home > Delphi, Object Pascal > Castalia parser, how to use

Castalia parser, how to use

September 13, 2014 Leave a comment Go to comments

Everyone who has ever wanted to create their own scripting engine, or indeed – their own compiler, have checked out the free and open-source Delphi parser called Castalia. Written by Jacob Thurman, the Castalia parser is also a vital part of the Castalia Delphi IDE extension. If you havent had a look at Castalia before then head over to JT’s website and have a look.

To sum up the situation: Castalia is a commercial product, but the Castalia Parser is open-source and can be used, improved and downloaded by anyone. In fact, the Castalia parser is hosted on GitHub here: https://github.com/jacobthurman/Castalia-Delphi-Parser

So why isn’t it used in more projects? Well, because it’s hardly documented. And examples are thin on the ground to say the least. But perhaps more than anything else — parsing programming source-code is not within the range of “normal” programming. There are a few good books on compiler and parsing technology out there, but in general this is something you either learn through advanced computing classes at some university (or just figuring it out by spending time on the subject). So it takes a bit of thinking before you jump in.

So how do we use it?

Simple, but before we dig into the how you have to understand why. Without understanding why a parser is designed the way it is – you wont really be able to use it properly. So here are some simple factoids to teach you just that:

  • A parser reads a file character by character, not line by line like humans do. To a parser, CR(carriage return, #13) and LF (linefeed, #10) has no meaning at all, they are just characters like any other. It’s up to the parser to treat these are line-end identifiers.
  • A human readable word, like “begin” or “procedure” is simply “an array of chars” with no meaning. The meaning is once again provided by our code.
  • Connected to the parser is something called a “tokenizer”, which is a class meant to translate and recognize words and characters, turning them into tokens – which we can work with in our programming.
    So characters like “(” becomes “OpenRound”, and “;” becomes “semicolon”. A tokenizer is the computer version of what human users recognize as the language being parsed.

A tokenizer is what makes it possible for us to write criteria and expectation code. Expectations are the criteria the parser holds for a piece of code, these are called “future” elements, while tokens who are already read are called past criteria.

Take something simple, like parsing the following:

Procedure TMyClass.Myproc(aValue:Integer);
begin
end;

The code for dealing with the keyword “procedure” will have expectations to what comes next. Namely that it’s a valid procedure name. It can also have criteria to the past, for instance – is it defined inside a class? If that is the case, then the class-name should be omitted and no begin/end section is allowed (that should be below the implementation marker).

Moral? Parsing is just as nitty-gritty as you imagine it to be!

With that in mind, have look at the Castalia parser source-code. You will notice that each language word or feature is mapped directly to a class-member. What you are supposed to do here is to override those that you need (a full engine would override them all) and use the information to build a program-model in memory (a program would be a list of units, who contains a list of classes and functions, constants and variables etc).

What you must take height for, is that elements are not created complete, but one piece at a time. A class declaration for instance, should be registered in your model when the class method triggers (you have to check that it’s not just a forward declaration ofcourse), while the name of the class is delivered afterwards as the parser get’s to it. So you need to write you parser like this:

type

TMyParser = Class(TSimpleParParse)
private
  FCurrentClass: TMyClassDef;
  FCurrentMethod: TMyClassMethod;
end;

So whenever the class is invoked and it should be registered, you register the object without a name and assign it to FCurrentClass. Whenever a class method is triggered, then it should apply to the current class and also be assigned to FCurrentmethod so setting the name is easier. And so fourth. This applies more or less to all the constructs of the language which Castalia covers and deals with nicely!

You could, perhaps, build a simplification object on top of castalia, which triggers events like “OnRegisterClass(Sender:TObject; Ancestor:String; aName:String)” or something like it, but Castalia really is extremely well made and simple to work with once you understand how it works.

Language rules

But now you probably wonder — what about rules? If the parser just blindly recognize words and triggers that procedure in the Castalia parser class — what will protect us from translating gibberish? I mean, would “begin procedure function” actually compile?

No. The rules of “object pascal” is maintained by two things:

  • Expectations to future words (read: symbols)
  • Expectations on past words

For instance, a method dealing with “procedure-names” would expect the past word to be “procedure”, and it would also have expectations on future symbols to be either OpenRound (followed by a list of parameters) or just semicolon.

  procedure SOMENAME( params );
  procedure SOMENAME;
Compile sequence representation

Compile sequence representation

If the expectations on future symbols (being either “(” or “;”) is not met, then you know something is wrong and it’s a syntax error. And it’s the same with expectations on past symbols. If the past word is not “procedure” or “function”, then once again we know that something is wrong and throw a syntax-error exception. If the past symbol is “function”, we would also expect “:” and “datatype” at the end of the function declaration. So parsing a procedure is a bit different from procedures. It also defines if the built-in variable “result” is allowed in the content code.

Either way — When you combine the complete set of expectations for each keyword — the end result is what we call object pascal, and it defines what you are allowed to do in every aspect of the language. You dont need a huge kit of code to test for everything under the sun first, you test the validity of the program one symbol and token at a time.

To read “future” tokens you use the following:

Lexer.InitAhead;
FNextToken:=Lexer.AheadTokenID; //What is the future token?

The extended ID of the “present” position can be read using the TokenID and ExID properties. There really is no need to document these features here, since they are used everywhere throughout the source code. You will also find how to use them in the actual methods you want to override. And last but not least, expectations are handled by “Expected(ptSemiColon);” on entry (replace token by what you expect).

What typically confuses people when playing with Castalia, is the utter lack of an object model. Which is not the job of a parser/lexer/tokenizer at all.

Object model, the missing piece

What Castalia is, is a very effective hand-written parser for Delphi. The author(s) really should be praised for this – and it’s very sad that it havent been deployed in more projects over the past 5-6 years. It’s like a gem that has been lying there without being noticed like forever, waiting for someone to pick it up and make use of it. But it’s missing the piece which naturally is what people want, and that is an object-model for a unit of program.

Such an object model is typically called AST, which is short for “abstract symbol tree”. How you implement and organize this piece of code is different from language to language and compiler to compiler, but a few common denominators are to be found across the board.

The AST is basically what the .NET “intermediate language” compiles from. So the first pass of the .net compiler breaks the source-code into an AST, the second pass takes the AST and compiles it into CIL code (virtual assembly language) – which at it’s final stage can be translated into real machine code.

If you think this is easy work – think again, because finding a model that is flexible enough to work AND be maintained over time, is extremely hard. In fact, it doesn’t get much harder in life than writing compilers (with the exception of coding a kernel or OS perhaps).

Take something simple, like optimizing the generated code, let’s say we have a procedure which looks like this:

procedure TMyObject.SomeProc;
var
  x: Integer;
Begin
  for x:=0 to 255 do
  inc(y);
end;

A silly example I know, but could this be optimized? Yes indeed. By doing loop expansion we could handle 8 calls at once or 16, that is a typical optimization to better performance. It has the benefit of being simple to automate and make work without side effects (if the conditions are right):

procedure TMyObject.SomeProc;
var
  x,y: Integer;
Begin
  for x:=0 to 31 do
  Begin
    inc(y);
    inc(y);
    inc(y);
    inc(y);
    inc(y);
    inc(y);
    inc(y);
    inc(y);
  end;
end;

But in order to do this automatically, the AST have to well enough built. Remember, what we are dealing with is an abstract symbol tree. So in reality it looks like this:

  • [TQTXClassMethod]
    • [TQTXClassMethodLocalVariableList]
      • Items
        • X
          • Datatype: TQTXDatatype.dsInteger32
          • Name: “X”
          • InitialValue: [QTXConstantValue] 0
    • FOR
      • Target: [QTXClassMemberLocalVar]
      • Start: [QTXConstantValue] 0
      • End: [QTXConstantValue] 255
    • INC
      • Target: [QTXClassMemberLocalVar]
      • Value: [QTXConstantValue] 1

In this case, the inner expression-block consists of just one “INC” call, so we could perform loop expansion by simply duplicating the expression-block X number of times (8 or 16 being traditional numbers) and get away with it. But once the internal expression-block (the code within the for/next loop) is to complex, that wont work. In fact, it may screw up the entire program.

A second optimization would be to remove the whole for/next block and replace it with a single:

  • INC
    • Target: [QTXClassMemberLocalVar]
    • Value: [QTXConstantValue] 255

That would be the next best optimization. But since we are using inc() on a local variable only to exit the procedure, the procedure should be ultimately eliminated from the compiled code altogether, because it doesn’t alter existing data or produce any output. This is an area of compilation where DWScript is really, really good. It’s a perfect example of a seasoned, evolved and well designed compiler and code-generator. Eric Grange really did a fantastic job on that one.

Why all the references?

If you look at the pseudo-tree above and think “whats with all the references? Why not simply have X where it says X?”, then you probably dont get the big picture yet. There are many different value types a parser/compiler has to deal with. For instance, there are local variables for a procedure, but there are also class-global fields. Some fields are visible only to the member of that class (public or protected) while others are out of sight for decendants. Then there are unit variables which can be accessed only by unit procedures (both classes and non objective procedures) – depending on if they are defined above or below the implementation marker. Properties on the other hand cannot be handled in the same way, since they involve a getter/setter mechanism.

The only way to deal with all these different types from a unified codebase, is to derive them all from a common value class, which implements standard methods for getting and setting values. Hence the number “0” or “255” is not explicitly defined, instead these constant values (constant because they dont change) must be registered and evaluated as they are used.

The original values would be replaced when a code generator uses them, resolved back to their initial representation. A C# codegen would end up with something like this:

for (x=0;x<256;;) {
  x++;
}

Writing a new compiler

Parsers and compiler AST models are extremely complex and time-consuming to build, but they are also the most rewarding intellectual endeavor you can pursue in the world of computing. At least I think so. Making a game or a utility program is ofcourse fun, but nothing beats making a real, life compiler which takes code and “makers something” from it.

I really hope this little article inspire more use of Castalia, both the commercial product – as well as it’s free Delphi parser. There is ample room in the world of Delphi for new and interesting compilers, so give it a whirl!

Advertisements
  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: