Category Archives: CBF-8

CBF-8 – Pt 6 Bit Alignment Policies

For loss-less data compression, only redundant or insignificant data may be discarded. For example, the numbers 45 and 000000045 have the same value but in the second case there are seven redundant zeros before the significant data. A policy that aligns data on the least-significant bit (LSB) allows redundant data on the most-significant part to be discovered and discarded. This is the policy best suited to whole numbers.

Fractions can also have redundant data. Consider 0.125 versus 0.1250000000. In this example the seven trailing zeros are insignificant and may be discarded. A policy that aligns data on the most-significant bit (MSB) allows redundant data on the least-significant part to be discovered and discarded. This is the policy best suited to fractions.

When a number contains both a whole number and a fraction; the bit on which the two numbers align is a fixed number of bits from the most significant bit and is therefore msb-aligned. The alignment policies divide as follows:

MSB Alignment:
64 # IEEE-754 real number policy
65 ^ Angle policy
66 : Date and Timestamp policy
67 ~ Logical set (bit-vector) policy
LSB Alignment:
68 ? Boolean policy
69 - Twos-complement sign-extended integral value policy
70 + Unsigned integral value policy
71 @ Indexed element policy
72 * Array dimension policy

prev continue

CBF-8 – Pt 7 Sign-Extension Policies

For MSB aligned data the position of the sign is fixed. However, for LSB aligned data the position of the sign, if any, depends on the sign extension policy. CBF-8 has two LSB policy alternatives: unsigned and twos-complement sign extended. With the unsigned policy no bit is negative. The first digit in a stream has a value between 0 and 63. All zeros preceding a value between 0 and 63 are therefore redundant and may be discarded. The following LSB aligned policies are unsigned.

Unsigned Policies:
68 ? Boolean policy
70 + Unsigned integral value policy
71 @ Indexed element policy
72 * Array dimension policy

The policy for signed integral data is the twos-complement sign extension policy.

Twos-Complement Sign-Extended Policy:
69 - Twos-complement sign-extended integral value policy

With this policy the most significant bit of the first digit has a negative weight of -32. Thus, first digit values between 0 and 31 are non-negative whereas the value 64 is subtracted from values in the range 32 to 63 to yield a value between -32 and -1. This means the digit z is the value -1. If the leading digit is a z and the successor digit is a digit between W and z, then the leading digit is redundant and may be discarded. Similarly if the leading digit is 0 and the successor digit is a digit between 0 and V, then the leading 0 is redundant.

By discarding redundant digits, CBF-8 adapts the length of the byte stream to the size of the number. There is no byte, short, int, long; there is only a stream of bytes where the stream length is the fewest number of base-64 digits that can hold the integral value.

prev continue

CBF-8 – Pt 4 Numeric (Seven-Bit) Policies

In the previous post we defined a number as a sequence of base-64 digits. In this post I will name the numeric policy indicators and illustrate the use of the dimension policy with raw, eight-bit byte arrays.

The policy indicator marks the start of a field of data. Since the start of the successor field marks the end of the current field, policy indicators are also field separators. More importantly policy indicators dictate the form into which the data will morph when decoded. Following are the numeric policy indicators (ASCII special characters) ordered by their septet values.

NumericPolicyIndicator:
64 # IEEE-754 real number policy
65 ^ Angle policy
66 : Date and Timestamp policy
67 ~ Logical set (bit-vector) policy
68 ? Boolean policy
69 - Twos-complement sign-extended integral value policy
70 + Unsigned integral value policy
71 @ Indexed element policy
72 * Array dimension policy

A numeric field can have a plurality of numeric elements provided each element has the same policy. In that event the prefix for subsequent elements is septet 75, the intra-field separator, comma.

NumericField:
NumericPolicyIndicator Numberopt
NumericField , Numberopt

For example, consider a two-dimensional array with dimensions 63×15.
Septet 74 space is called the raw data separator because it terminates the inner most dimension and marks the transition to an eight-bit raw data policy. Thus, the first 15-byte array begins

*z,F 15 bytes

The eight-bit raw data policy is limited to the 15 bytes needed to populate the 15 byte array. Once populated, the decoder reverts to the seven-bit dimension policy and expects the second of 63 arrays. Since the array lengths of the 63 arrays may change, the length of subsequent byte arrays must be specified.

The default length is 15 bytes. We can accept the default value by omitting the length (i.e. the number between the comma and the space) as in the following example

, 15 bytes

or override the default value with a new value (e.g. 11 bytes)

,B 11 bytes

An omission does not always signal the use of a default value. It can also mean that the data are unknown. For example, to specify an unknown number of arrays the first of which is 15 bytes long we omit the number between the * and the ,

*,F 15 bytes

Alternatively, the policy indicator can be a placeholder for the field’s data and the fact that the number is omitted may throw an exception that prevents validation.

In my next post, we will look at how CBF-8 differentiates text from numbers.
prev continue

CBF-8 – pt 5 Literals

CBF-8 literals are based on standards compliant eight-bit Unicode Character Set (UCS) Transformation Format (UTF-8). Each member of the character set is called a code point. Let a literal be defined as a sequence of Unicode code points.

Literal:
CodePoint
Literal CodePoint

To symbolize each code point we need to bit-bust a sequence of bytes. Let each byte be symbolized using eight characters where 0 is a zero; 1 is a one; z is a zero or one and y is a zero or one where at least one y in a series of y is a one. We can now define a code point as follows:

CodePoint:
0zzzzzzz
110yyyyz 10zzzzzz
11100000 101zzzzz 10zzzzzz
1110yyyy 10zzzzzz 10zzzzzz
except surrogate
11110000 10yyzzzz 10zzzzzz 10zzzzzz
11110yyy 10zzzzzz 10zzzzzz 10zzzzzz

The Unicode standard reserves two ten-bit ranges as surrogate pairs.
In standards compliant UTF-8 these ranges are not used. In modified UTF-8 (Oracle) surrogate pairs are referred to as code units. When combined, surrogate pairs yield a 20-bit value which when added to 0x10000 yields code points in the range 0x10000 to 0x10FFFF. These are also known as supplementary code points because they supplement the basic multilingual plane with sixteen additional planes.

Surrogate:
11101101 1010zzzz 10zzzzzz leading surrogate
11101101 1011zzzz 10zzzzzz trailing surrogate

A simplified understanding of UTF-8 is that bytes beginning 10zzzzzz are extender bytes and that the number of ones preceding a zero in the most significant part of the lead byte specifies the total number of bytes comprising the code point. Code points can be up to four bytes long. So, when there are five, six or seven leading ones, UTF-8 regards these as illegal. CBF-8 regards these as literal terminators.

UTF-8 Terminator:
111110zz illegal five-byte pattern
1111110z illegal six-byte pattern
11111110 illegal seven-byte pattern
11111111 standard terminator

There are two advantages to using these values as terminators. First, code points do not have to be interpreted. There is no need to search for single or double quotes or trailing NUL bytes. Thus,

CBF-8 decodes code points but NEVER evaluates their content

Second, illegal five, six or seven byte patterns cannot be exploited as malware. When a terminator is encountered CBF-8 reverts to seven-bit policy and any extender byte beginning with a one immediately throws an exception.

We are now in a position to define a field of literals.

LiteralField:
" Literalopt Terminator
LiteralField , Literalopt Terminator

Note how septet 73 the double quote character ” causes a transition to an eight-bit Unicode policy and how the terminator reverts back to the seven-bit policy. The literal is encapsulated by the seven-bit policy so that every valid code point can be used without restriction. This is how CBF-8 separates text from everything else. In my next posts I will describe how individual numeric policies govern the morphology of numbers.
prev continue

CBF-8 – Pt 3 Seven vs Eight-Bit Policies

UTF-8 uses all eight bits of a byte. Each byte of a byte array must also be able to assume any of the 256 possible values. Both are examples of eight-bit policies. CBF-8 also has a seven-bit policy that restricts byte usage to a subset of the 128 values comprising the ASCII character set. The default policy is the seven-bit policy.

When the seven-bit policy is applied to an input stream, the value of each byte is converted from an ASCII character to a seven-bit septet. Negative bytes (sign-extended values between -128 and -1 or unsigned values between 128 and 255) are illegal. The septets 0 to 63 are reserved for base-64 digits. Septets between 64 and 94 are reserved for use as policy indicators. The septets 95 to 127 are the ASCII control characters and are not used. All numbers are encoded as base-64 digits.

number:
base64Digit
number base64Digit
base64Digit: in order
0 1 2 3 4 5 6 7
8 9 A B C D E F
G H I J K L M N
O P Q R S T U V
W X Y Z $ & a b
c d e f g h i j
k l m n o p q r
s t u v w x y z

prev continue

CBF-8 – Pt 1 Introduction

In my next posts I will be describing a novel data transport concept for the efficient transfer of both text and binary formatted data called Compressed Binary Format – eight bit (CBF-8). CBF-8 applies polymorphism to a stream of bytes to yield different types of data. With the CBF-8 encoder one can safely merge UTF-8 text with raw dimensioned byte arrays and binary formatted numbers such as real numbers, sign-extended integral values, unsigned integral values, logical values, Boolean values angle values, date values, timestamp values and a hierarchy of indexed objects in a common byte stream. The CBF-8 decoder decomposes the byte stream to yield the individual elements of the composition. The purpose of CBF-8 is to reduce the quantity of bytes transferred in database queries and their responses, thereby reducing required bandwidth and increasing responsiveness.
continue

Timestamps pt 4 – Adaptive Time Base

When dates and times are manually entered we may need only some of the following: year, month, day, time zone, hour, minute, second. Although the timestamp may have millisecond resolution, high resolution data may be inconsequential. For example a list of birthdays requires only year, month and day. Time zone, hour, minute, second and milliseconds are no factor and are implicitly set to zero. When exchanging data with a server or another application it would be desirable to send only the data we need and no more. This can be done with an adaptive time base.

Color My Data proposes an Eight-Bit Compressed Binary Format (CBF-8) coder / decoder which generates a policy indicator (DATETIME) followed by zero or more base-64 digits as follows:

  • digit 1 – years range -32 to +31 step 64 years bias 2000
  • digit 2 – years range 0 to 63 step 1 year
  • digit 3 – months range 1 to 12 step 1 month
  • digit 4 – days range 1 to 31 step 1 day
  • digit 5 – time zone range -24 to +24 step 30 minutes
  • digit 6 – MSB time zone range 0 to 1 step 15 minutes
  • digit 6 – five bits hour range 0 to 23 step 1 hour
  • digit 7 – minutes range 0 to 59 step 1 minute
  • digit 8 – seconds range 0 to 59 step 1 second
  • digit 9 – subseconds range 0 to 62 step 16 milliseconds
  • digit 10 – subseconds range 0 to 63 step 250 microseconds
  • digits 11 to n – subseconds range 0 to 63 step 250 x 6410-n microseconds.

The DATETIME policy causes CBF-8 to use MSB alignment which ignores trailing zeros. As trailing zeros increase, both the number of encoded digits and the effective clock frequency decrease. Returning to the birthday example, a four-byte (digit) stream may generate any birthday between Jan 1 28 BC and Dec 31 4047 AD with a resolution of one day. A seven-byte (digit) increases the clock rate from one cycle per day to one cycle per minute by adding time zone, hours and minutes (and nothing more). On the other extreme increasing the clock rate to 2GHz (500 picosecond period) would typically take 15 digits.

The input arguments for encoding timestamps are the minute of epoch 2000-01-01T00:00Z mm2K() and sec() from the previous post and the time-zone, a value between -48 and +48 that scales to -12:00 to +12:00 in 15 minute increments.

CBF8#append(OutputStream sink, int mm2K, double sec, byte timeZone);

The CBF8Decoder supports verification of the DATETIME policy, access to the array of base-64 digits and the number of encoded digits. This greatly simplifies the conversion to human readable formats. The timestamp class uses these data to decode mm2K and sec.

int mm2K(CBF8Decoder);
double sec(CBF8Decoder);

Advantages:

  • encoded data are portable across all implementations
  • timestamp splits into year, month, day, hour, minute, second and subsecond
  • split data is easy to format for human readability
  • encoded data require fewer bytes than formatted text
  • only significant digits are encoded
  • number of digits adapts to clock rate

Disadvantages:

  • Computationally intensive relative to addition / subtraction due to numerous integer remainder and integer divide operations.
  • Base-64 digits above F (15) not easy to relate to decimal digits
  • .

Suggested Use Cases:

Peer to peer and client-server data exchanges. Splitting data for formatted data input / output.