
Building a Delphi Database engine, part two

In the first episode of this tutorial we looked at some of the fundamental ideas behind database storage. We solved the problem of storing arbitrary length data by dividing the database file into conceptual parts; we discovered how we could daisy-chain these parts together to form a sequence; and we looked at how this elegantly solves reclaiming lost space and recycling that for new records. Last but not least we had a peek at the bit-buffer that helps us keep track of what blocks are taken, so we can easily grow and shrink the database on demand.

In this article we will get our hands dirty and put our theories into practice. The goal today is to examine the class that deals with these blocks, or parts. While we could perhaps squeeze out better performance by putting everything into a monolithic class, the topics are best kept separate while learning the ropes. So let's favour OOP and class encapsulation for this one.

The DbLib framework

Prior to writing this tutorial I had to write the code you would need. It would be a terrible mistake to run a tutorial with just theories to show for it. Thankfully I have been coding for a number of years now, so I had most of the code in my archives. To make life easier for you I have unified the code into a little framework.

This doesn’t mean that you have to use the framework. The units I provide are there to give you something tangible to play with. I have deliberately left ample room for optimization and for things that can be done differently.

I have set up a bitbucket git repository for this tutorial, so your first order of business is to download or fork the repository:

https://bitbucket.org/cipher_diaz/dbproject/src/master/

The database file class

The first installment of this tutorial ended with a few words on the file header. This is the only static or fixed data segment in our file. And it must remain fixed because we need a safe space where we can store offsets to our most important sequences, like the root sequence.

The root sequence is, simply put, the data that describes the database, also known as metadata. All those items I listed at the start of the previous article (table definitions, index definitions, the actual binary table data and so on), well, we have to keep track of these somewhere, right?

Well, that’s where the file header comes in. The header is the keeper of all secrets and is essential to the whole engine.

The record list

Up until this point we have covered blocks, sequences, the file header, the bit-buffer that keeps track of available and reserved blocks — but what about the actual records?

When someone performs a high-level insert operation, the binary data that makes up the record is written as a sequence; that should be crystal clear by now. But having a ton of sequences stored in a file is pretty useless without a directory or list that remembers them. If we have 10,000 records (sequences) in a file, then we must also keep track of 10,000 offsets, right? Otherwise, how can we know where record number 10, 1500 or 9000 starts?

Conceptually, metadata is not actual data. Metadata is a description of data, like a table definition or index definition. The list that holds all the record offsets is real data; as such I don’t want to store it together with the metadata, but keep it separate. The bit-buffer that keeps track of block availability in the file is likewise "real" data, so I would like to keep that in a separate sequence too.

When we sit down and define our file-header record, which is a structure that is always at the beginning of the file (or stream), we end up with something like this:

  • Unique file signature: longword
  • Version minor, major, revision: longword (byte, byte, word)
  • Database name: 256 bytes [holds utf8 encoded text]
  • Encryption cipher: integer
  • Compression id: integer
  • root-sequence: longword
  • record-list-sequence: longword
  • bit-buffer-sequence: longword
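
Translated into Object Pascal, the layout above could be expressed roughly like the sketch below. The record and field names here are illustrative only; the actual declaration in the repository may differ:

  type
    TDbVersion = packed record
      bvMajor:    byte;
      bvMinor:    byte;
      bvRevision: word;
    end;

    TDbFileHeader = packed record
      fhSignature:   longword;               // unique file signature
      fhVersion:     TDbVersion;             // engine version info
      fhName:        array [0..255] of byte; // utf8 encoded database name
      fhCipher:      integer;                // encryption cipher id
      fhCompression: integer;                // compression id
      fhRootSeq:     longword;               // first block of the root sequence
      fhRecordSeq:   longword;               // first block of the record-list sequence
      fhBitBufSeq:   longword;               // first block of the bit-buffer sequence
    end;

Since the record is packed, reading or writing the header is a single ReadBuffer/WriteBuffer call of SizeOf(TDbFileHeader) bytes.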

If you are wondering about the encryption and compression fields, don’t overthink it. It’s just a place to store something that identifies whatever encryption or compression we have used. If time allows we will have a look at zlib and RC4, but even if we don’t it’s good to define these fields for future expansion.

The version longword is actually more important than you think. If the design of your database and header changes dramatically between versions, you want to check the version number to make absolutely sure you can even handle the file. I have placed this as the second field in the record, 4 bytes into the header, so that it can be read early. The moment you have more than one version of your engine, you might want to write a routine that just reads the first 8 bytes of the file and checks compatibility.
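
Such a probe only needs the first eight bytes. Here is a minimal sketch, assuming the $CAFEBABE signature and the TDbVersion layout used in the test code further down; the function name is made up for illustration:

  function ProbeDatabaseFile(const Filename: string): boolean;
  var
    LFile:      TFileStream;
    LSignature: longword;
    LVersion:   TDbVersion;
  begin
    LFile := TFileStream.Create(Filename, fmOpenRead or fmShareDenyWrite);
    try
      // Read the first 8 bytes: signature (4 bytes) + version (4 bytes)
      LFile.ReadBuffer(LSignature, SizeOf(LSignature));
      LFile.ReadBuffer(LVersion, SizeOf(LVersion));
      // Accept only files we recognize and know how to handle
      Result := (LSignature = $CAFEBABE) and (LVersion.bvMajor <= 1);
    finally
      LFile.Free;
    end;
  end;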

What are those buffers?

The buffer classes are a Delphi implementation of Node.js buffers, including insert and remove functionality

Having forked the framework, you suddenly have quite a few interesting units. But you may also feel a bit lost if you don’t know what the buffer classes do, so I want to start with those first.

The buffer classes are alternatives to streams. Streams are excellent, but they can be quite slow if you are doing intense read-write operations. More importantly, streams lack two fundamental features for DB work, namely insert and remove. For example, let’s say you have a 100 megabyte file and you want to remove 1 megabyte from the middle of it. It’s not a complex operation, but you still need to copy the trailing data backwards as quickly as possible before shrinking the stream. The same is true if you want to inject data into a large file. It’s not a huge operation, but it has to be 100% accurate and move data as fast as possible.
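
To illustrate what such a "remove" boils down to, here is what cutting a span out of a plain in-memory byte array looks like. This is not the buffer classes themselves, just the kind of Move-and-shrink work they wrap for you:

  procedure RemoveSpan(var Data: TBytes; const Offset, Count: integer);
  var
    LTail: integer;
  begin
    // Number of bytes that sit after the span we are cutting out
    LTail := Length(Data) - (Offset + Count);
    if LTail > 0 then
      // Slide the trailing data backwards over the removed span
      Move(Data[Offset + Count], Data[Offset], LTail);
    // Shrink the array to its new size
    SetLength(Data, Length(Data) - Count);
  end;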

I could have just inherited from TStream, but I wanted to write classes that were faster, that had more options and that were easier to expand in the future. The result of those experiments was the buffer classes.

So mentally, just look at TDbLibBuffer, TDbLibBufferMemory and TDbLibBufferFile as streams on steroids. If you need to move data between a stream and a buffer, just create a TDbLibStreamAdapter instance and you can access the buffer like a normal TStream descendant.
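
As a quick usage sketch, copying the contents of an ordinary stream into a buffer could look something like this. Note that the exact constructor signature of TDbLibStreamAdapter, and the source file used here, are assumptions made for illustration:

  var
    LBuffer:  TDbLibBufferMemory;
    LAdapter: TDbLibStreamAdapter;
    LSource:  TFileStream;
  begin
    LBuffer := TDbLibBufferMemory.Create(nil);
    LSource := TFileStream.Create('data.bin', fmOpenRead);
    try
      // Assumed: the adapter wraps a buffer and exposes it as a TStream
      LAdapter := TDbLibStreamAdapter.Create(LBuffer);
      try
        LAdapter.CopyFrom(LSource, LSource.Size); // standard TStream API
      finally
        LAdapter.Free;
      end;
    finally
      LSource.Free;
      LBuffer.Free;
    end;
  end;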

Making a file of blocks

With enough theory behind us, let’s dig into the codebase and look at the class that deals with a file as blocks, or parts. Open up the unit dblib.partaccess.pas and you will find the following class:

  TDbLibPartAccess = class(TObject)
  private
    FBuffer:    TDbLibBuffer;
    FHeadSize:  integer;
    FPartSize:  integer;
  protected
    function GetPartCount: integer; inline;
  public
    property Buffer: TDbLibBuffer read FBuffer;
    property ReservedHeaderSize: integer read FHeadSize;
    property PartSize: integer read FPartSize;
    property PartCount: integer read GetPartCount;

    procedure ReadPart(const PartIndex: Integer; var aData); overload;
    procedure ReadPart(const PartIndex: Integer; const Data: TDbLibBuffer); overload;
    procedure WritePart(const PartIndex: Integer; const Data; const DataLength: Integer); overload;
    procedure WritePart(const PartIndex: Integer; const Data: TDbLibBuffer); overload;

    procedure AppendPart(const Data; DataLength: Integer); overload;
    procedure AppendPart(const Data: TDbLibBuffer); overload;

    function CalcPartsForData(const DataSize: Int64): integer; inline;
    function CalcOffsetForPart(const PartIndex: Integer): Int64; inline;

    constructor Create(const DataBuffer: TDbLibBuffer;
      const ReservedHeaderSize: Integer; const PartSize: Integer); virtual;
  end;

As you can see, this class is pretty straightforward. You pass a buffer (either memory or file) via the constructor, together with the size of the file header and the size of each part. The header size helps the class avoid writing to the first section of the file by mistake: whenever the method CalcOffsetForPart() is called, it adds the size of the header to the result, shielding the header from being overwritten.

The other methods should be self-explanatory: you have various overloads for writing a sequence part (block), appending parts to the database file and reading them back. All of these methods are offset based, meaning you give them a part number and they calculate where that part is physically located inside the file.

One important method is the CalcPartsForData() function. This is used before splitting a piece of data into a sequence. Let’s say you have 1 megabyte of data you want to store inside the database file; you first call this function and it calculates how many blocks you need.
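
The arithmetic behind these two helpers is simple. As a sketch (the actual implementation in the repository may differ in details), they boil down to this:

  function TDbLibPartAccess.CalcPartsForData(const DataSize: Int64): integer;
  begin
    // Round up: any remainder needs a block of its own
    Result := integer(DataSize div FPartSize);
    if (DataSize mod FPartSize) > 0 then
      inc(Result);
  end;

  function TDbLibPartAccess.CalcOffsetForPart(const PartIndex: Integer): Int64;
  begin
    // Skip the reserved file header, then jump to the requested block
    Result := FHeadSize + (Int64(PartIndex) * FPartSize);
  end;

For example, 1 megabyte (1,048,576 bytes) of data split into 10,240-byte parts rounds up to 103 parts.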

Once you know how many blocks you need, the next step is to check the bit-buffer (which we introduced last time) to see whether the file has that many free blocks. If the file is full, you either have to grow the file to make room for the new data, or issue an error message.
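
Put together, the grow-or-fail decision looks roughly like the sketch below. The bit-buffer type and its FreeCount property are hypothetical names used purely for illustration; only TDbLibPartAccess comes from the unit we just looked at:

  // TDbLibBitBuffer and FreeCount are assumed names for the bit-buffer class.
  // Also assumed: the buffer's Size property is writable, as in the test code below.
  procedure EnsureRoomFor(const Access: TDbLibPartAccess;
    const Bits: TDbLibBitBuffer; const DataSize: Int64);
  var
    LPartsNeeded: integer;
  begin
    // How many blocks will this payload occupy?
    LPartsNeeded := Access.CalcPartsForData(DataSize);

    // Enough recycled blocks already available? Then we are done.
    if Bits.FreeCount >= LPartsNeeded then
      exit;

    // Grow the underlying buffer by the number of missing blocks
    Access.Buffer.Size := Access.Buffer.Size +
      (Int64(LPartsNeeded - Bits.FreeCount) * Access.PartSize);

    // Remember to mark the newly added blocks as available in the bit-buffer
  end;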

See? It’s not that complex once you have something to build on!

A proof-reading test: making sure what we write is what we read

With the scaffolding in place, let’s write a small test to make absolutely sure that the buffer and class logistics check out ok. We are just going to do this on a normal form (this is the main project in the bitbucket project folder), so you don’t have to type this code. Just fork the code from the URL mentioned at the top of this article and run it.

Our test is simple:

  • Define our header and part records; they don’t have to be accurate at this point
  • Create a database file buffer (in memory) with room for the header + 100 parts
  • Create a TDbLibPartAccess instance, feeding in the sizes mentioned above
  • Create a write buffer the same size as the part/block record
  • Fill that buffer with some content we can easily check
  • Write the write-buffer content to all the parts in the file
  • Create a read buffer
  • Read back each part and compare its content with the write buffer

If any data is written to the wrong place or overlaps, what we read back will not match our write buffer. This is a very simple test, just to make sure that we have IO fidelity.

OK, let’s write some code!

unit mainform;

interface

uses
  Winapi.Windows, Winapi.Messages, System.SysUtils,
  System.Variants, System.Classes, Vcl.Graphics,
  Vcl.Controls, Vcl.Forms, Vcl.Dialogs, Vcl.StdCtrls,
  dblib.common,
  dblib.buffer,
  dblib.partaccess,
  dblib.buffer.memory,
  dblib.buffer.disk,
  dblib.encoder,
  dblib.bitbuffer;

const
  CNT_PAGESIZE = (1024 * 10);

type

  TDbVersion = packed record
    bvMajor:  byte;
    bvMinor:  byte;
    bvRevision: word;
  end;

  TDbHeader = packed record
    dhSignature:  longword;     // Signature: $CAFEBABE
    dhVersion:    TDbVersion;   // Engine version info
    dhName:       shortstring;  // Name of database
    dhMetadata:   longword;     // Part# for metadata
  end;

  TDbPartData = packed record
    ddSignature:  longword;
    ddRoot:       longword;
    ddNext:       longword;
    ddBytes:      integer;
    ddData: packed array [0..CNT_PAGESIZE-1] of byte;
  end;

  TfrmMain = class(TForm)
    btnWriteReadTest: TButton;
    memoOut: TMemo;
    procedure btnWriteReadTestClick(Sender: TObject);
  private
    { Private declarations }
    FDbFile:    TDbLibBufferMemory;
    FDbAccess: TDbLibPartAccess;
  public
    { Public declarations }
    constructor Create(AOwner: TComponent); override;
    destructor  Destroy; override;
  end;

var
  frmMain: TfrmMain;

implementation

{$R *.dfm}

{ TfrmMain }

constructor TfrmMain.Create(AOwner: TComponent);
begin
  inherited;
  // Create our database file, in memory
  FDbFile := TDbLibBufferMemory.Create(nil);

  // Reserve size for our header and 100 free blocks
  FDBFile.Size := SizeOf(TDbHeader) + ( SizeOf(TDbPartData) * 100 );

  // Create our file-part access class, which access the file
  // as a "block" file. We pass in the size of the header + part
  FDbAccess := TDbLibPartAccess.Create(FDbFile, SizeOf(TDbHeader), SizeOf(TDbPartData));
end;

destructor TfrmMain.Destroy;
begin
  FDbAccess.Free;
  FDbFile.Free;
  inherited;
end;

procedure TfrmMain.btnWriteReadTestClick(Sender: TObject);
var
  LWriteBuffer:  TDbLibBufferMemory;
  LReadBuffer: TDbLibBufferMemory;
  LMask: ansistring;
  x:  integer;
begin
  memoOut.Lines.Clear();

  LMask := 'YES!';

  // create a temporary buffer
  LWriteBuffer := TDbLibBufferMemory.Create(nil);
  try
    // make it the same size as our file-part
    LWriteBuffer.Size := SizeOf(TDbPartData);

    // fill the buffer with our test-pattern
    LWriteBuffer.Fill(0, SizeOf(TDbPartData), LMask[1], length(LMask));

    // Fill the dbfile by writing each part, using our
    // temporary buffer. This fills the file with our
    // little mask above
    for x := 0 to FDbAccess.PartCount-1 do
    begin
      FDbAccess.WritePart(x, LWriteBuffer);
    end;

    LReadBuffer := TDbLibBufferMemory.Create(nil);
    try
      for x := 0 to FDBAccess.PartCount-1 do
      begin
        FDbAccess.ReadPart(x, LReadBuffer);

        if LReadBuffer.ToString <> LWriteBuffer.ToString then
          memoOut.Lines.Add('Proof read part #' + x.ToString() + ' = failed')
        else
          memoOut.Lines.Add('Proof read part #' + x.ToString() + ' = success');
      end;
    finally
      LReadBuffer.Free;
    end;

  finally
    LWriteBuffer.Free;
  end;
end;

end.

The form has a button and a memo on it; when you click the button we get the following result:

(Screenshot: the memo reports a successful proof-read for every part.)

Voila, we have IO fidelity!

Finally things are starting to become more interesting! We still have a way to go before we can start pumping records into this thing, but at least we have tangible code to play with.

In our next installment we will implement the sequence class, which takes the TDbLibPartAccess class and augments it with functionality to read and write sequences. We will also include the bit-buffer from our first article and watch as the silhouette of our database engine comes into view.

Again, this is not built for speed but for education.

Until next time.
