Home > N++, Object Pascal, OP4JS, Smart Mobile Studio > Building a single context parser

Building a single context parser

December 22, 2014 Leave a comment Go to comments

Right, in the last post I introduced a simple text-buffer class which implemented the most basic functions you need when parsing a source-file.

In this post we will look closer at the actual parser itself, in our case a recursive-decent parser. But before we dive into the code, which is very easy – let’s spend a moment talking about the parser and how it works.

Recursive? Decent? Woot?

Ye old compiler bible

Ye old compiler bible

Right. Without going to far into the boring stuff, in short: there exists a lot of different parsing methods. Parsing source-code and building compilers is actually a science course that you can take in college and university – so a lot of very old, very dull people have spent decades arguing over what methods is the best.

The type of parser you create depends greatly on the language you want to parse. If you want to parse a basic source-code then naturally you dont need a very complex parser. Chances are that the most advanced piece of source you will encounter is a recursive if/then/else section (that according to the old rules from the 70’s, should not exceed a depth of 8 if/then sections).

So for basic you get away with a very humble parser that doesnt need that much in terms of infrastructure.

The recursive part of the name points to the fact that you want to re-cycle the parser rules and code you write to take care of different commands and structures. The decent means that your parser enters the code from the top, and keeps parsing into the various declarations, only to exit backwards again as the structures are closed.

For instance, in the code below every time the parser encounters an IF section, it will create a context object and push whatever context is presently used on the stack. The new “IF” context informs the parser about where it is, and what the rules are for the code ahead:

 if a>0 then
   if b>0 then
     if C>0 then
     doSomething;
   end if
  end if;

The moment the parser hits “end if” it will pop backwards, re-fetching the context for the previous entry. Each section of the program has it’s own context, which describes where you are and what object you are doing work for (in context to).
Typically you start with a program context, only to parse into a class context, then into a class-field context, then back out again and into a class member context (and so on).

The context changes as your parser moves through the source-file — so that at all times the parser knows what rules to follow, what to expect and (perhaps more importantly) what constitutes an error.

Recursive decent also has a huge impact on AST storage, meaning: how the binary representation of your program code in-memory (for further processing by the codegen) is represented. Below is a typical abstract symbol tree, where everything is parsed into objects and stored in nodes – a binary “tree” in memory.

Recursive Decent Schema

Recursive Decent Schema

Single context parser

Context parsers are great, orderly and very efficient (!) but also more tricky to program. There are other solutions rather than contexts of course, and that is to move rule-checking and possible-combinations onto the actual parser instead. So instead of storing rules and information in a separate context object, using a stack to navigate — you create a parser for each identifier and symbol which are allowed — for each parser.

Take this Delphi code:

If a>0 then
Begin
  Case A of
  10: Begin
        B:=12;
        a:=0;
      End;
  end;
End;

The parsing process actually looks more like this:

  • IF Parser
    • Condition Parser
      • Begin Parser
        • Case Parser
          • CaseItem Parser
            • Begin Parser
              • Assignment Parser
              • Assignment Parser
            • Exit Begin Parser
          • Exit CaseItem Parser
        • Exit Case Parser
      • Exit Begin Parser
  • Exit IF parser

So whenever the parser finds a keyword or identifier, a parser is created for that identifier (if it’s allowed by the current parser),  who in turn would have all it’s allowed parsers registered (and so on, until the whole language and possible combinations are mapped).

Confused yet? Good. It’s actually very simple once you get the basics into your fingers.

The code

Now for large and complex languages I would go for a context-parser, but for our little experiments we can stick with a single context parser. This means that there will be more code associated with the various parsers, and also that there will be many different parser classes – each who contain an internal-registry (list) of allowed sub parsers.

unit qtxparser;

interface

uses
  SmartCL.System;

const
  TOK_ROUNDOPEN   = 1;  //  "("
  TOK_ROUNDCLOSE  = 2;  //  ")"
  TOK_CUROPEN     = 3;  //  "{"
  TOK_CURCLOSE    = 4;  //  "}"
  TOK_SPACE       = 5;  //  " "

  TOK_ADD         = 6;  //  "+"
  TOK_SUBTRACT    = 7;  //  "-"
  TOK_DIVIDE      = 8;  //  "/"
  TOK_POWER       = 9;  //  "^"

  TOK_COMMA       =10;  //  ","
  TOK_COLON       =11;  //  ";"

  Numeric_Operators: Array [0..3] of Integer =
    (TOK_ADD,TOK_SUBTRACT,TOK_DIVIDE,TOK_POWER);

type

  ENPPException = Class(EW3Exception);

  TQTXBuffer = Class(TObject)
  private
    FData:    String;
    FIndex:   Integer;
    FLineNr:  Integer;
  protected
    function  getCurrent:String;virtual;
  public

    Property  LineNr:Integer read FLineNr;

    Property  Current:String read getCurrent;
    Function  Back:Boolean;
    function  Next:Boolean;
    function  BOF:Boolean;
    function  EOF:Boolean;

    procedure SkipJunk;

    function  ReadTo(Const aChars:array of string;
              var text:String):Boolean;overload;
    function  ReadTo(aValue:String):Boolean;overload;
    function  ReadTo(aValue:String;var inner:String):Boolean;overload;

    function  ReadWord(var Text:String):Boolean;

    function  ReadToEOL(var text:String):Boolean;

    function  PeekAhead(aCount:Integer;var Text:String):Boolean;
    function  Compare(aText:String):Boolean;

    procedure First;

    procedure LoadFromString(Text:String);

    procedure Clear;
  End;

  TQTXToken = Record
    qToken: Integer;
    qText:  String;
    qline:  Integer;
    qCol:   Integer;
  end;

  TQTXTokenBuffer = Array of TQTXToken;

  TQTXContext = Class(TObject)
  Private
    FBuffer:    TQTXBuffer;
    FTokens:    TQTXTokenBuffer;
  public
    Property    Buffer:TQTXBuffer read FBuffer;
    Property    Tokens:TQTXTokenBuffer read FTokens;
    Constructor Create;virtual;
    Destructor  Destroy;Override;
  end;

  TQTXCustomParser      = Class;
  TQTXCustomParserClass = Class of TQTXCustomParser;

  TQTXParserInfo = Record
    piInstance:     TQTXCustomParser;
    piKeyword:      String;
  end;

  TQTXCustomParser = Class(TObject)
  private
    FContext:   TQTXContext;
    FParsers:   Array of TQTXParserInfo;
  protected
    function    getParserIndexFor(keyword:String;
                var aIndex:Integer):Boolean;
    function    getParserInstanceFor(keyword:String;
                var aParser:TQTXCustomParser):Boolean;
    procedure   RegisterParser(keyword:String;
                const aParser:TQTXCustomParser);
  public
    Property    Context:TQTXContext read FContext;
    procedure   Parse;virtual;
    Constructor Create(aContext:TQTXContext);virtual;
    Destructor  Destroy;Override;
  end;

  (* Parser for the keyword "program" *)
  TQTXProgramParser = Class(TQTXCustomParser)
  protected
    procedure   Parse;override;
  public
    Constructor Create(aContext:TQTXContext);override;
  end;

  TQTXLPPParser = Class(TQTXCustomParser)
  protected
    procedure   Parse;override;
  public
    Constructor Create(aContext:TQTXContext);override;
  end;

implementation

var
  TOK_SYM: Array[TOK_ROUNDOPEN..TOK_COLON] of string =
  ('(',')','{','}',' ','+','-','/','^',',',';');

//###########################################################################
// TQTXLPPParser
//###########################################################################

Constructor TQTXLPPParser.Create(aContext:TQTXContext);
begin
  inherited Create(aContext);
  self.RegisterParser('program',TQTXProgramParser.Create(context));
end;

procedure TQTXLPPParser.Parse;
var
  mCache: String;
  mSubParser: TQTXCustomParser;
Begin

  repeat
    case Context.Buffer.Current of
    ' ':  self.Context.Buffer.SkipJunk;
    '(':  Begin
            if Context.Buffer.Compare('(*') then
            Context.buffer.SkipJunk else
            Begin

              showmessage('Token =' + mCache);

              if getParserInstanceFor(mCache,mSubParser) then
              begin
                (* Invoke sub parser *)
                mSubParser.Parse;

                (* SKip trailing junk if any *)
                Context.Buffer.SkipJunk;

                (* clear cache *)
                mCache:='';

              end else
              raise ENPPException.CreateFmt
              ('Syntax error [%s]',[mCache]);

            end;
          end;
    else
      Begin
        mCache += Context.Buffer.Current;
      end;
    end;

    if not Context.buffer.next then
    break;
  until Context.buffer.EOF;
end;

//###########################################################################
// TQTXProgramParser
//###########################################################################

Constructor TQTXProgramParser.Create(aContext:TQTXContext);
begin
  inherited Create(aContext);
end;

procedure TQTXProgramParser.Parse;
var
  mName:  String;
Begin
  context.buffer.SkipJunk;

  showmessage('here!');

  (* Check if program name is defined *)
  if context.Buffer.Current = TOK_SYM[TOK_ROUNDOPEN] then
  Begin
    if Context.buffer.ReadTo(')',mName) then
    begin
      showmessage('name was:' + mName);
    end;
  end;
end;

//###########################################################################
// TQTXCustomParser
//###########################################################################

Constructor TQTXCustomParser.Create(aContext:TQTXContext);
Begin
  inherited Create;
  FContext:=aContext;
end;

Destructor TQTXCustomParser.destroy;
begin
  (* Release sub-parser instances if any *)
  while FParsers.Count>0 do
  Begin
    if FParsers[0].piInstance<>NIL then
    Begin
      FParsers[0].piInstance.free;
      FParsers[0].piInstance:=NIL;
    end;
    FParsers.Delete(0,1);
  end;
  inherited;
end;

function TQTXCustomParser.getParserIndexFor(keyword:String;
         var aIndex:Integer):Boolean;
var
  x:  Integer;
begin
  result:=False;
  aIndex:=-1;
  keyword:=lowercase(trim(keyword));
  if keyword.length>0 then
  Begin
    if FParsers.Count>0 then
    begin
      for x:=0 to FParsers.count-1 do
      Begin
        if sameText(FParsers[x].piKeyword,keyword) then
        begin
          result:=true;
          aIndex:=x;
          break;
        end;
      end;
    end;
  end;
end;

(*
function TQTXCustomParser.getParserClassFor(keyword:String;
         var aClass:TQTXCustomParserClass):Boolean;
var
  mIndex: Integer;
Begin
  aClass:=NIL;
  result:=getParserIndexFor(keyword,mIndex);
  if result then
  aClass:=FParsers[mIndex].piParserClass;
end;  *)

function TQTXCustomParser.getParserInstanceFor(keyword:String;
         var aParser:TQTXCustomParser):Boolean;
var
  x:  Integer;
begin
  result:=False;
  aParser:=NIL;
  keyword:=lowercase(trim(keyword));
  if keyword.length>0 then
  Begin
    if FParsers.Count>0 then
    begin
      for x:=0 to FParsers.count-1 do
      Begin
        if sameText(FParsers[x].piKeyword,keyword) then
        begin
          result:=true;
          aParser:=FParsers[x].piInstance;
          break;
        end;
      end;
    end;
  end;
end;

procedure TQTXCustomParser.RegisterParser(keyword:String;
          const aParser:TQTXCustomParser);
var
  mIndex: Integer;
  mInfo:  TQTXParserInfo;
Begin
  keyword:=lowercase(trim(keyword));
  if keyword.length>0 then
  Begin
    if aParser<>NIL then
    begin
      if not getParserIndexFor(keyword,mIndex) then
      Begin
        (* register sub parser *)
        mInfo.piInstance:=aParser;
        mInfo.piKeyword:=keyword;
        FParsers.Add(mInfo);
      end else
      raise ENPPException.createFmt
      ('Parser for keyword [%s] already registered',[keyword]);
    end else
    raise ENPPException.CreateFmt
    ('Parser class for keyword [%s] was NIL error',[keyword]);
  end;
end;

procedure TQTXCustomParser.Parse;
begin
  raise ENPPException.CreateFmt
  ('No parser implemented for class %s',[classname]);
end;

//###########################################################################
// TQTXContext
//###########################################################################

Constructor TQTXContext.Create;
begin
  inherited Create;
  FBuffer:=TQTXBuffer.create;
end;

Destructor TQTXContext.destroy;
Begin
  FTokens.clear;
  FBuffer.free;
  inherited;
end;

//###########################################################################
// TQTXBuffer
//###########################################################################

procedure TQTXBuffer.First;
Begin
  If FData.length>0 then
  begin
    FLineNr:=1;
    FIndex:=0;
  end;
end;

procedure TQTXBuffer.SkipJunk;
var
  mTemp:  String;
begin
  repeat
    case Current of
    ' ':  Begin
            if Next then
            continue else
            break;
          end;
    '/':  Begin
            (* Skip C style remark *)
            if Compare('/*') then
            begin
              if readTo('*/') then
              Begin
                Next;
                Next;
              end else
              raise ENPPException.Create('Comment not closed error');
            end else
            Begin
              (* Skip Pascal style remark *)
              if Compare('//') then
              begin
                if ReadToEOL(mTemp) then
                next else
                raise ENPPException.Create('Expected end of line error');
              end;
            end;
          end;
    '(':  Begin
            (* Skip pascal style remark *)
            if compare('(*') then
            Begin
              if readTo('*)') then
              begin
                Next;
                Next;
              end else
              raise ENPPException.Create('Comment not closed error');
            end;
          end;
    #13:
        begin
          //
        end;
    else
      begin
        Break;
      end;
    end;
    next;
  until EOF;
end;

Procedure TQTXBuffer.Clear;
begin
  FData:='';
  FLineNr:=1;
  FIndex:=0;
end;

procedure TQTXBuffer.LoadFromString(Text:String);
Begin
  FLineNr:=1;
  FData:=trim(text);
  FData:=StrReplace(FData,#10,#13);
  FData:=StrReplace(FData,#13#13,#13);
  FIndex:=1;
  if length(FData)<1 then
  FIndex:=-1;
end;

function TQTXBuffer.getCurrent:String;
Begin
  result:=FData[FIndex];
end;

function TQTXBuffer.ReadWord(var Text:String):Boolean;
begin
  result:=False;
  Text:='';
  if not EOF then
  begin
    repeat
      if (current in ['A'..'Z','a'..'z','0'..'9']) then
      Text += current else
      break;
    until not next;
    result:=length(Text)>0;
  end;
end;

function TQTXBuffer.Compare(aText:String):Boolean;
var
  mData:  String;
Begin
  result:=PeekAhead(length(aText),mData)
  and SameText(lowercase(mData),lowercase(aText));
end;

function TQTXBuffer.PeekAhead(aCount:Integer;var Text:String):Boolean;
Begin
  if not EOF then
  Begin
    while aCount>0 do
    begin
      Text+=Current;
      if not Next then
      break;
    end;
    //text:=copy(FData,FIndex,aCount);
    //inc(FIndex,length(text));
    result:=length(text)>0;
  end;
end;

function TQTXBuffer.ReadToEOL(var text:String):Boolean;
Begin
  result:=ReadTo([#13,#10],text);
end;

function TQTXBuffer.ReadTo(aValue:String;var inner:String):Boolean;
var
  mText:  String;
begin
  inner:='';
  result:=False;

  aValue:=lowercase(aValue);
  if aValue.length>0 then
  begin
    repeat
      if PeekAhead(aValue.length,mText)
      and SameText(lowercase(mtext),aValue) then
      Begin
        result:=true;
        break;
      end else
      Begin
        inner += Current;
        if not next then
        break;
      end;
    until EOF;
  end;
end;

function TQTXBuffer.readTo(aValue:String):Boolean;
var
  mText:  String;
begin
  aValue:=lowercase(aValue);
  if aValue.length>0 then
  begin
    repeat
      if PeekAhead(aValue.length,mText)
      and SameText(lowercase(mtext),aValue) then
      Begin
        result:=true;
        break;
      end else
      if not next then
      break;
    until EOF;
  end else
  result:=False;
end;

function TQTXBuffer.ReadTo(Const aChars:Array of string;
         var text:String):Boolean;
var
  x:  Integer;
Begin
  result:=False;
  text:='';
  if aChars.Length>0 then
  begin
    for x:=FIndex to FData.length do
    Begin
      //FIndex:=x;
      //if FData[x] in aChars then
      if (Current in aChars) then
      Begin
        result:=true;
        break;
      end else
      text+=text[x];

      if not Next then
      break;
    end;
  end;
end;

Function TQTXBuffer.Back:Boolean;
begin
  result:=FIndex>1;
  if result then
  dec(FIndex);
end;

function TQTXBuffer.Next:Boolean;
begin
  Result:=FIndex<FData.Length;
  if result then
  Begin
    inc(FIndex);
    if (Current in [#13,#10]) then
    inc(FLineNr);
  end;
end;

function TQTXBuffer.BOF:Boolean;
begin
  result:=FIndex=1;
end;

function TQTXBuffer.EOF:Boolean;
begin
  result:=FIndex>=FData.Length;
end;

end.

Using the code

The above code is not finished, it has just the bare-bone stuff needed. Also, only the first parser is implemented (the “program” parser) which will trigger whenever the “program” keyword is found. Here is the code to test the parser:

procedure TForm1.W3Button1Click(Sender: TObject);
var
  mContext: TQTXContext;
  mParser:  TQTXLPPParser;
begin
  mContext:=TQTXContext.Create;
  try
    mParser:=TQTXLPPParser.Create(mContext);
    try
      mContext.Buffer.LoadFromString(
        #'program("test") {
          execute (*) {
            /* Code here */
          }
        }');

      mParser.Parse;

    finally
      mParser.free;
    end;
  finally
    mContext.free;
  end;
end;

Next time

In the next article we will add more meat to the parser. We will finish the program-keyword and more into the execute method – as well as start building that mysterious abstract symbol tree, which in essence is just a in-memory representation of your code.

It’s actually a very worth-while experiment, because I have no idea of the browser will handle it, or how complex or long a source code is before the browser says “thats to power hungry for me” and drops the thread. There is a time factor involved in javascript where a single procedure entry cannot last to long before it’s dropped.

But being able to compile or otherwise make use of this on a website or in the cloud is quite fun! Who knows, maybe we’ll move Smart Mobile Studio completely to the cloud in the future – which makes sense. But for now, let’s focus on the experiments and see where we end up.

After all, N++ have some CSS to automate 🙂

Advertisements

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: