System Codecs

Unit paths

  • System
    • System.Codec
      • System.Codec.Base64
      • System.Codec.Url

Introduction

Codec is short for encoder-decoder. The term describes a module that can encode a piece of data from one format into another, decode it, or both. Most people have heard of movie codecs or music codecs (MP3 encoding, for instance). Microsoft Windows has an effective codec architecture that makes it easy for developers to add support for new formats, be it movies, music, compression or encryption.

But codecs are not just about movies and music. A codec first and foremost represents a standard way of processing data, regardless of type. As already mentioned, there are also encryption codecs (also called “ciphers”), character encoding codecs and all manner of compression codecs. A codec is ultimately a neat way of making the same code work with many different algorithms.

A codec simply defines a way to feed data into a process, and a way of getting data out. Under Microsoft Windows most codecs operate on streams. You attach the stream you want processed to the input, and the stream that should receive the result to the output. When you start the codec, the data is read from the source, processed, and written to the target.

Tip: The TCompressionStream class in Delphi and Lazarus is a good analogy for what a codec does. There you connect a source and a target, and simply copy data over to perform the compression. The class acts as a conduit and “middleware” between the zlib library and the stream classes that are unique to Object Pascal.
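To make the analogy concrete, here is a minimal sketch using Free Pascal's zstream unit (Delphi's ZLib unit is similar, but its constructor signatures differ). It is not Smart Pascal code; the sample data and variable names are purely illustrative. The compression stream sits between a source and a target, and copying data through it performs the transformation.

```pascal
program CompressSketch;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils, zstream;

var
  Source, Compressed, Restored: TMemoryStream;
  Compressor: TCompressionStream;
  Decompressor: TDecompressionStream;
  Buffer: array[0..1023] of Byte;
  Count: LongInt;
  Sample: AnsiString;
begin
  Sample := StringOfChar('a', 2000);  // illustrative, highly compressible data
  Source := TMemoryStream.Create;
  Compressed := TMemoryStream.Create;
  Restored := TMemoryStream.Create;
  try
    Source.WriteBuffer(Sample[1], Length(Sample));
    Source.Position := 0;

    // Attach the target, then copy the source through the "codec"
    Compressor := TCompressionStream.Create(clDefault, Compressed);
    try
      Compressor.CopyFrom(Source, Source.Size);
    finally
      Compressor.Free;  // freeing flushes the remaining compressed bytes
    end;

    // Reverse direction: read decompressed bytes back out
    Compressed.Position := 0;
    Decompressor := TDecompressionStream.Create(Compressed);
    try
      repeat
        Count := Decompressor.Read(Buffer, SizeOf(Buffer));
        if Count > 0 then
          Restored.WriteBuffer(Buffer, Count);
      until Count < SizeOf(Buffer);  // a short read marks the end of the data
    finally
      Decompressor.Free;
    end;

    WriteLn('original: ', Source.Size, ' bytes, compressed: ',
      Compressed.Size, ' bytes');

    // Self-check: the round trip must reproduce the input exactly
    if (Restored.Size <> Source.Size) or
       not CompareMem(Source.Memory, Restored.Memory, PtrUInt(Source.Size)) then
      Halt(1);
  finally
    Restored.Free;
    Compressed.Free;
    Source.Free;
  end;
end.
```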

Codecs under Smart Pascal

The Smart RTL contains a small but highly effective codec architecture. At the time of writing only two codecs are implemented, and they are implemented for a reason:

  • Base64 encode / decode
  • URL encode / decode

You may wonder why these have been implemented from scratch, considering that both are exposed by the browser as functions. The reason is simple: JavaScript has no such thing as streams, and for many JavaScript developers the idea of working with low-level bytes and untyped buffers is outside their ordinary field of expertise.

Object Pascal developers, however, are used to working primarily with streams, and we are likewise used to things like pointers, allocating memory and interacting with raw data. In other words: being able to Base64 encode a string via the browser is useless when you have a binary stream you want to process.

The Codec Interface

The codec interface is very simple but also very powerful. Smart has a special interface that is common to all storage mediums, be it buffers, allocated memory segments or streams: IBinaryTransport. IBinaryTransport defines simple read, write and seek operations. By implementing that interface, any class can be used together with TReader and TWriter.

The codec allows you to attach a source and a target of type IBinaryTransport. Here is the codec class interface as defined in System.Codec.pas:

  TCustomCodec = class(TObject, ICodecBinding, ICodecProcess )
  private
    FBindings:  TCodecBindingList;
    FCodecInfo: TCodecInfo;
  protected
    /* IMPLEMENTS :: ICodecBinding */
    procedure RegisterBinding(const Binding: TCodecBinding);
    procedure UnRegisterBinding(const Binding: TCodecBinding);
  protected
    /* IMPLEMENTS :: ICodecProcess */
    procedure Encode(const Source: IBinaryTransport;
        const Target: IBinaryTransport); virtual; abstract;

    procedure Decode(const Source: IBinaryTransport;
        const Target: IBinaryTransport); virtual; abstract;
  protected
    function  MakeCodecInfo: TCodecInfo; virtual;
  public
    property Info: TCodecInfo read FCodecInfo;
    constructor Create; virtual;
    destructor Destroy; override;
  end;

If you wish to create your own codecs, perhaps to implement your own encryption schemes or compression, inherit from TCustomCodec and override Encode, Decode and MakeCodecInfo. The latter returns information about the codec and what it does. This information is kept in the codec registry.
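The inheritance pattern described above can be sketched in plain Free Pascal. This is only an illustration under stated assumptions: TSketchCodec is a hypothetical stand-in for TCustomCodec, it uses TStream instead of IBinaryTransport, and it leaves out MakeCodecInfo and registration because their exact shape is not shown here. THexCodec is a toy codec invented for the example.

```pascal
program CustomCodecSketch;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

type
  // Hypothetical stand-in for TCustomCodec (not the Smart RTL class)
  TSketchCodec = class
  public
    procedure Encode(Source, Target: TStream); virtual; abstract;
    procedure Decode(Source, Target: TStream); virtual; abstract;
  end;

  // Toy codec: writes every byte as two hexadecimal characters
  THexCodec = class(TSketchCodec)
  public
    procedure Encode(Source, Target: TStream); override;
    procedure Decode(Source, Target: TStream); override;
  end;

procedure THexCodec.Encode(Source, Target: TStream);
var
  Value: Byte;
  Pair: AnsiString;
begin
  // Read one byte at a time until the source is exhausted
  while Source.Read(Value, 1) = 1 do
  begin
    Pair := IntToHex(Integer(Value), 2);
    Target.WriteBuffer(Pair[1], 2);
  end;
end;

procedure THexCodec.Decode(Source, Target: TStream);
var
  Pair: array[0..1] of AnsiChar;
  Value: Byte;
begin
  // Read two hex characters at a time and turn them back into one byte
  while Source.Read(Pair, 2) = 2 do
  begin
    Value := Byte(StrToInt('$' + Pair[0] + Pair[1]));
    Target.WriteBuffer(Value, 1);
  end;
end;

var
  Codec: TSketchCodec;
  Plain, Encoded, Decoded: TStringStream;
begin
  Codec := THexCodec.Create;
  Plain := TStringStream.Create('Hi!');
  Encoded := TStringStream.Create('');
  Decoded := TStringStream.Create('');
  try
    Plain.Position := 0;
    Codec.Encode(Plain, Encoded);
    WriteLn(Encoded.DataString);  // prints 486921

    Encoded.Position := 0;
    Codec.Decode(Encoded, Decoded);
    WriteLn(Decoded.DataString);  // prints Hi!
  finally
    Decoded.Free;
    Encoded.Free;
    Plain.Free;
    Codec.Free;
  end;
end.
```

Because the calling code only knows the abstract base class, swapping THexCodec for another descendant changes the algorithm without touching the caller.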

The codec registry can be accessed via the manager, exposed by this function:

function CodecManager: TCodecManager;

Using a codec

Using a codec is fairly simple. In many ways it is identical to how you would use TCompressionStream in Delphi or Free Pascal. But we wanted to make it even easier, so instead of you having to create a new instance every single time, the codec is created once and managed internally.

Before using, you must first request the codec by its name or direct identifier. The manager will search its registry and return a set of suitable candidates:

procedure TForm1.EncodeData(const DataToEncode: TStream);
var
  LList: TCodecList;
begin
  if CodecManager.QueryByName('Base64Codec', LList) then
  begin
    for var LCodec in LList do
    begin
      // The codec must support both directions
      if  (cdRead  in LCodec.Info.DataFlow)
      and (cdWrite in LCodec.Info.DataFlow) then
      begin
        // use codec [LCodec] here
        break;
      end;
    end;
  end;
end;

You have to perform a search because more than one codec can deal with the same data, while their capabilities may differ. You could for instance have one codec that encodes only, and another that decodes only, both sharing the same identifier (which they should).

In the example above we check each candidate until we find one that has both encoding and decoding qualities. The DataFlow property determines if the codec can encode data, decode data, or both, which is what we want in this case.
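The capability test is ordinary set logic. The sketch below is plain Free Pascal with hypothetical stand-in types (TSketchDataFlow and TSketchDataFlowSet are not the RTL declarations). It shows that the "in" operator tests a single element, while the subset operator "<=" tests several flags at once. Whether Smart Pascal accepts "<=" on sets is an assumption not verified here; two separate "in" tests are the safe form.

```pascal
program DataFlowSketch;
{$mode objfpc}{$H+}

type
  // Hypothetical stand-ins that mirror the cdRead / cdWrite flags in the text
  TSketchDataFlow = (cdRead, cdWrite);
  TSketchDataFlowSet = set of TSketchDataFlow;

function SupportsBoth(const Flow: TSketchDataFlowSet): Boolean;
begin
  // "<=" is the subset operator: true only when both flags are present
  Result := [cdRead, cdWrite] <= Flow;
end;

function SupportsBothVerbose(const Flow: TSketchDataFlowSet): Boolean;
begin
  // Equivalent test using "in", which accepts one element on its left side
  Result := (cdRead in Flow) and (cdWrite in Flow);
end;

begin
  WriteLn(SupportsBoth([cdRead, cdWrite]));         // prints TRUE
  WriteLn(SupportsBoth([cdRead]));                  // prints FALSE
  WriteLn(SupportsBothVerbose([cdRead, cdWrite]));  // prints TRUE
  WriteLn(SupportsBothVerbose([cdWrite]));          // prints FALSE
end.
```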

With the codec identified and a reference to the instance obtained, we can access it. This is done through a codec binding. As the name suggests, this is an object that acts as a proxy, binding itself to the codec instance for as long as you need it.

This ensures that only one instance of the codec is ever in memory, while multiple routines can access it at the same time.

procedure TForm1.EncodeData(const DataToEncode: TStream);
var
  LList: TCodecList;
  LBinding: TCodecBinding;
begin
  if CodecManager.QueryByName('Base64Codec', LList) then
  begin
    for var LCodec in LList do
    begin
      if  (cdRead   in LCodec.Info.DataFlow)
      and (cdWrite  in LCodec.Info.DataFlow) then
      begin
        // Found codec, create binding and break search
        LBinding := TCodecBinding.Create(LCodec);
        break;
      end;
    end;
  end;

  if LBinding <> nil then
  begin
    try
      // Connect input and output
      // We create a temporary output stream
      LBinding.Input := DataToEncode;
      LBinding.Output := TMemoryStream.Create;

      // execute transformation
      try
        LBinding.Encode();
      except
        on e: Exception do
        begin
          // Handle exception
        end;
      end;

      // LBinding.Output now contains the encoded data
      // You would move that data elsewhere here

    finally
      // release temporary output stream
      if LBinding.Output <> nil then
        LBinding.Output.Free;

      LBinding.Free;
    end;
  end;
end;

Daisy chaining codecs

One of the reasons we take interfaces rather than actual instances as input / output parameters is that this makes it easier to daisy-chain codecs together.

For example, let’s say you want to Base64 encode some text, but you also want to encode that text as UTF-8 first. In that case you would look up both a UTF-8 codec and a Base64 codec, connect the output of the UTF-8 codec to the input of the Base64 codec, and, when it’s all connected, call Encode(). The data will then travel through the codecs, being processed along the way, and the final result comes out into the target stream.
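The same two-stage idea can be sketched with what Free Pascal ships today: UTF8Encode from the System unit and TBase64EncodingStream from the base64 unit. This is not the Smart RTL codec API; the procedure names EncodeUtf8 and EncodeBase64 and the intermediate stream are assumptions made for illustration. The point is only that the output of the first stage becomes the input of the second.

```pascal
program ChainSketch;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils, base64;

// Stage 1 (hypothetical helper): UTF-16 text in, UTF-8 bytes out
procedure EncodeUtf8(const Value: UnicodeString; Target: TStream);
var
  Bytes: UTF8String;
begin
  Bytes := UTF8Encode(Value);
  if Length(Bytes) > 0 then
    Target.WriteBuffer(Bytes[1], Length(Bytes));
end;

// Stage 2 (hypothetical helper): raw bytes in, Base64 text out
procedure EncodeBase64(Source, Target: TStream);
var
  Encoder: TBase64EncodingStream;
begin
  Encoder := TBase64EncodingStream.Create(Target);
  try
    Encoder.CopyFrom(Source, Source.Size - Source.Position);
  finally
    Encoder.Free;  // freeing flushes the last Base64 group and any padding
  end;
end;

var
  Sample: UnicodeString;
  Middle: TMemoryStream;
  Encoded: TStringStream;
begin
  // Build "h", e-acute (U+00E9), "llo" without relying on source file encoding
  Sample := 'h';
  Sample := Sample + WideChar($00E9);
  Sample := Sample + 'llo';

  Middle := TMemoryStream.Create;
  Encoded := TStringStream.Create('');
  try
    EncodeUtf8(Sample, Middle);     // text -> UTF-8 bytes
    Middle.Position := 0;           // rewind so stage 2 reads from the start
    EncodeBase64(Middle, Encoded);  // UTF-8 bytes -> Base64 text
    WriteLn(Encoded.DataString);    // prints aMOpbGxv
  finally
    Encoded.Free;
    Middle.Free;
  end;
end.
```

Here the intermediate TMemoryStream plays the role of the connection between the two codecs. With interface-typed inputs and outputs, as the codec bindings use, that connection can be any storage class that implements the transport interface.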

Which is pretty cool!

Types of codecs

Codecs are present in the RTL but not yet widely used by it. The normal (read: browser-based) encoding and decoding functions are still in place, but these will be replaced by methods that use codecs.

I mentioned above that you should not create codecs directly, but if you know exactly what you need then there is nothing stopping you from doing that. Just remember that the API has been defined for a reason, and that is to make sure you as a user are shielded from internal changes as much as possible.

There will also be many other types of codecs. As more and more complex code is ported and implemented in JavaScript – the amount of code that belongs in codec form grows. Candidates are:

  • RC4 cipher codec
  • Blowfish cipher codec
  • MP4 movie codec
  • LXZ compression codec
  • BinHex text codec
  • UTF-8 text codec
  • UTF-16 text codec

So while the RTL at the time of writing has only two relatively small codec classes, that list is expected to grow considerably in future updates.