MyOwnDB – internal structure and layout

Note: This post is a follow-up to My own database engine. Make sure you read that article first for context.

Also make sure you download and play with ByteRage, which is available on Google Code.

Basic database theory

Clientdataset? Who's that?

Stop and think about how a database works, not on the level of SQL or anything like that, but in low-level terms dealing with the actual database disk file and storage. If you ponder how a record is stored inside the database file, you will immediately reach a point where you realize that the position of a record of variable size cannot really be pre-calculated. If your records are fixed, so that you know that each record will take up exactly 2K of data, then writing the IO routines for a database file is quite simple. But the moment you introduce BLOB, CLOB and MEMO fields, being able to pre-calculate the position of a record inside a file goes out the window.

I mean, if you know that each record is 1024 bytes in size, then finding the position of record #25 inside the file is just a matter of:

mFileOffset := SizeOf(TPageData) * 25; // fixed record size * record number (zero-based)
mFileStream.Position := mFileOffset;

But when each record can have a unique length, there is no way of pinpointing the location of a record using math alone. You need to keep a list of where each record starts, and you also need some mechanism for re-using record space.

Break it down

The solution to “variable length records” is that you have to break the record data into smaller pieces, called pages (or “blocks”) of a fixed size. If you can store a record inside one block, then that is great – but when the data exceeds the size of a single page, then you have to figure out a clever way of chaining these blocks together inside the file. Such a chain of blocks is simply called a “sequence”.

But once you have divided a file into pages, you also introduce another problem: namely, how do you keep track of free blocks in the middle of a database file versus those containing record data? If you delete a record that is stored in the middle of a database file, you really don't want to rewrite the entire file just to keep things linear (if your database file is 10 gigabytes in size, that can take forever, even with memory mapping).

First, let's look at what a page inside the database file looks like:

  PSRLPageData = ^TSRLPageData;
  TSRLPageData = Packed Record
    pdUsedBytes:  Integer;  // number of bytes of pdData actually in use
    pdPrevPage:   Longword; // previous page in the chain
    pdNextPage:   Longword; // next page in the chain
    pdData:       Array[1..SRL_PageFile_PageSize-12] of Byte; // 12 = size of the three header fields above
  End;

As you can see, it has an array to contain data (which can be the whole record, or a part of a record). It also has two fields that contain references to the next and the previous block in the chain. This solves how you chain blocks of data together.

The problem now is keeping track of “free” pages, so we can recycle the space occupied by deleted records for new ones.

The “free block/page” problem

File structure

When you delete a sequence of pages (a record broken into X pieces), how will you keep track of the free pages? How will you re-cycle those pages for new records? Since the pages can be spread throughout the entire file (just like files on a disk can be fragmented), this can be tricky. You can't use an array of bytes or integers, since that much bookkeeping data is hard to justify. You need something smaller, something small enough to be saved inside the table file without being unreasonable – yet effective and fast enough for practical use.

The answer is to represent each page in the file as a single BIT. If the bit is set, then you know that the page is in use. If the bit is zero, then you know it can be re-cycled. A map covering the maximum number of pages a TFileStream can handle can thus be allocated straight away (!) And it's reasonable in size even for a database file that spans gigabytes.

TotalBits  := MaxSize div PageSize; // one bit per page
TotalBytes := TotalBits div 8;      // eight pages tracked per byte

Voila! We can now pack a lot of information into a very small storage medium (doesn’t get any smaller than bits!). This “bit map” of the file is isolated in the TSRLBitBuffer class.
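
To make this concrete, here is a minimal sketch of how such a bit-buffer can set and query a page bit. The method names and the FBuffer field are illustrative only; the real TSRLBitBuffer may look different:

procedure TSRLBitBuffer.SetBit(aPage: Longword; aValue: Boolean);
var
  mByte: PByte;
begin
  mByte := FBuffer;
  inc(mByte, aPage div 8); // the byte holding this page's bit
  if aValue then
    mByte^ := mByte^ or (1 shl (aPage mod 8))       // mark page as in use
  else
    mByte^ := mByte^ and not (1 shl (aPage mod 8)); // mark page as free
end;

function TSRLBitBuffer.GetBit(aPage: Longword): Boolean;
var
  mByte: PByte;
begin
  mByte := FBuffer;
  inc(mByte, aPage div 8);
  Result := (mByte^ and (1 shl (aPage mod 8))) <> 0; // set = page in use
end;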

In short: a single byte can now keep track of 8 pages of data. Meaning that if the page size is set to 2048 bytes and your database file is 10,000 pages in size, the full "map" of the file will only require 10,000 / 8 = 1250 bytes. When we add ZLib compression to the bit-buffer on saving, the number is even smaller. That is extremely cost-effective and very, very fast for looking up the state of a database page.

I mean, if you quickly want to find a "free bit", you simply walk the map with a plain PByte and check for a value less than 255. If the value is less, you know that there is a free bit in that byte. This can be made very fast using loop unrolling (reading 8 or 16 bytes in one go until you find a byte that is <255 in value).

NOTE: As pointed out by a reader (see the comments below), it is also common to keep track of free/used pages with doubly linked lists of pages, which is the approach InnoDB takes.

mAddr := PByte(FBuffer);
mOffset := 0;
Repeat
  if not (mAddr^ = 255) then
  Begin
    //We have a byte with a bit not set!
    //Now figure out which bit, and calculate the page number from that
    Break;
  end;
  inc(mAddr);
  inc(mOffset);
until mOffset = FBytesInBuffer;
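
Once the scanner stops at a byte below 255, finding the actual bit, and from it the page number, is just a matter of testing the 8 bits. A sketch of that last step, assuming bit 0 of byte 0 maps to page 0:

for mBit := 0 to 7 do
  if (mAddr^ and (1 shl mBit)) = 0 then
  Begin
    mPageNum := (mOffset * 8) + mBit; // page number of the free page
    Break;
  end;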

Now, when I create new technology, I don't give a flying spaghetti monster about speed. The prototype is always about solving the problem. Once the prototype has been tested and verified, then I start refactoring the classes and tuning the code for better speed. The bitbuffer class would gain a few clock cycles if it was a part of TSRLPageFile, and even more if I re-wrote it in assembler.

Having said that, posting 10,000 records to the table takes less than a second – so I am actually quite pleased with the speed. Using TBRPersistent for storage is complete overkill, so it will be even faster once I have some time to work on it. And naturally it will be wrapped in a TDataset descendant.

Working with a stream as an array of pages

The base class for the database engine simply implements the basics: namely, reading and writing the table stream in pages (or blocks). The class TSRLPageFile contains the code for creating the stream and for accessing it in fixed-size segments rather than at arbitrary byte offsets.
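
In essence, page-level IO boils down to simple offset arithmetic, something along these lines (a sketch only; FStream and the method signatures are assumptions, not necessarily the engine's actual code):

procedure TSRLPageFile.ReadPage(aPageNum: Longword; var aData: TSRLPageData);
begin
  FStream.Position := Int64(aPageNum) * SizeOf(TSRLPageData); // page number -> byte offset
  FStream.ReadBuffer(aData, SizeOf(TSRLPageData));
end;

procedure TSRLPageFile.WritePage(aPageNum: Longword; const aData: TSRLPageData);
begin
  FStream.Position := Int64(aPageNum) * SizeOf(TSRLPageData);
  FStream.WriteBuffer(aData, SizeOf(TSRLPageData));
end;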

Reading and writing sequences of pages

Since a database record can span several pages, we need to be able to read pages in sequence, glue them together using page-numbers, delete a complete sequence – and of course, re-cycle unused or deleted pages. The mechanisms for this are isolated in TSRLSequenceFile, which inherits from TSRLPageFile.
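
Reading a whole record then becomes a matter of walking the pdNextPage chain until it ends. A sketch, assuming a ReadPage method in the page-file layer and 0 as the end-of-chain marker (both assumptions made for illustration):

procedure TSRLSequenceFile.ReadSequence(aFirstPage: Longword; aOutput: TStream);
var
  mPage:    TSRLPageData;
  mCurrent: Longword;
begin
  mCurrent := aFirstPage;
  repeat
    ReadPage(mCurrent, mPage);                            // from the page-file layer
    aOutput.WriteBuffer(mPage.pdData, mPage.pdUsedBytes); // only the used bytes
    mCurrent := mPage.pdNextPage;                         // follow the chain
  until mCurrent = 0;                                     // 0 = end of sequence
end;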

Working with a stream as an array of records

The actual table class, which adds things like storing the bit-map and the list of record offsets to the file, is TSRLTable. This class also adds a rudimentary "cursor", BOF/EOF/Next/Prev functions and so on. It is just a raw shell over the table system.
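
Hypothetical usage of such a cursor, in classic dataset style (the exact method names, including First, are assumed from the description above):

mTable.First;
while not mTable.EOF do
begin
  // read or process the current record here
  mTable.Next;
end;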

Indexing, the mother of all.. well, quite cool actually

As you probably have guessed, myOwnDB is just the "raw" storage engine. It takes care of breaking record data into pieces, storing them in sequence, keeping track of free pages inside the database file, and things like compacting the file and updating internal list information. In some ways you can say that it provides half a database, since there is no code for things like indexing or SQL.

I did implement a very crude, brute-force search function. It allows you to search the database for a simple string value – but it is no more effective than filters in TDataset (if not worse). It was just a bit of fun. It gives the same results the HTML5 object database would give, where you insert an object with a key (GUID) and quickly get it back again.

Proper database indexing requires a much more detailed scheme. You would first need a clearly defined field model, not just stuffing the equivalent of a Microsoft property-bag and calling it a record. String indexes usually have an array of "key phrases" that are either included or excluded – and from this you build a BTree node/leaf structure which is updated when inserting and removing data.
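
For illustration, a node in such a BTree might be declared along these lines. This is purely a sketch; the order, key type and layout are assumptions:

Const
  SRL_Index_Order = 32; // max keys per node, an arbitrary example value

Type
  TSRLIndexNode = Packed Record
    inKeyCount: Integer;                                // keys currently stored in this node
    inIsLeaf:   Boolean;                                // leaf nodes point at records, not children
    inKeys:     Array[1..SRL_Index_Order] of Longword;  // key values (e.g. string hashes)
    inRecords:  Array[1..SRL_Index_Order] of Longword;  // first page of each matching record
    inChildren: Array[0..SRL_Index_Order] of Longword;  // page numbers of child nodes
  End;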

Indexing is very expensive in terms of storage space. If you index your database wrong, your index file can actually exceed your database file (!) So this is not something you slap together over the weekend.

Having said that, my database engine makes it very easy to store and handle indexing within a single file. It would be stored inside the database as a sequence of data, just like the sequence-list, the bitmap and everything else. It would be loaded into memory when you open the table-file, and stored when you close it (if changes have been made and write access to the file is allowed).

With indexing in place, writing a good SQL parser is all that is left (which is not a walk in the park, but more time-consuming than technically challenging).

Imagine the speed of this thing if we added memory mapping to the equation 🙂 But better to keep it native Delphi so it’s 100% portable.

  1. woutervannifterick
    March 8, 2014 at 1:35 pm

    In InnoDB they solve it in a different way. Instead of marking a page as deleted, they maintain two doubly linked lists of pages: one for deleted pages, and one for used pages. You can do that with the same set of prev/next pointers, because a page is either on one list or the other.

    When you delete a page, you simply add the page to the other list instead.
    That way you don’t need to have an extra field, and to find an available page, you don’t have to scan all pages.

    • Jon Lennart Aasenden
      March 8, 2014 at 1:49 pm

      Yup. But a list is more expensive than an array of bytes. Part of this experiment was to make the smallest structure I could think of. But you are right, there are many ways of solving this 🙂

    • Jon Lennart Aasenden
      March 8, 2014 at 2:07 pm

      Also, the byte scanner doesn't start at byte 0. Whenever it finds a byte with free blocks, it will start there the next time. Caching known bytes with free bits is also much more cost-effective than popping from a list. But I do agree that a stack would be much faster in raw speed, while requiring much more memory and disk space.
      It's a trade-off I guess. The fun part is that this part of the system can be easily replaced with something else.
