Category Archives: Bit Busting

Composition and decomposition of binary values.

CBF-8 – Pt 6 Bit Alignment Policies

For loss-less data compression, only redundant or insignificant data may be discarded. For example, the numbers 45 and 000000045 have the same value but in the second case there are seven redundant zeros before the significant data. A policy that aligns data on the least-significant bit (LSB) allows redundant data on the most-significant part to be discovered and discarded. This is the policy best suited to whole numbers.

Fractions can also have redundant data. Consider 0.125 versus 0.1250000000. In this example the seven trailing zeros are insignificant and may be discarded. A policy that aligns data on the most-significant bit (MSB) allows redundant data on the least-significant part to be discovered and discarded. This is the policy best suited to fractions.

When a number contains both a whole number and a fraction; the bit on which the two numbers align is a fixed number of bits from the most significant bit and is therefore msb-aligned. The alignment policies divide as follows:

MSB Alignment:
64 # IEEE-754 real number policy
65 ^ Angle policy
66 : Date and Timestamp policy
67 ~ Logical set (bit-vector) policy
LSB Alignment:
68 ? Boolean policy
69 - Twos-complement sign-extended integral value policy
70 + Unsigned integral value policy
71 @ Indexed element policy
72 * Array dimension policy

prev continue

CBF-8 – Pt 7 Sign-Extension Policies

For MSB aligned data the position of the sign is fixed. However, for LSB aligned data the position of the sign, if any, depends on the sign extension policy. CBF-8 has two LSB policy alternatives: unsigned and twos-complement sign extended. With the unsigned policy no bit is negative. The first digit in a stream has a value between 0 and 63. All zeros preceding a value between 0 and 63 are therefore redundant and may be discarded. The following LSB aligned policies are unsigned.

Unsigned Policies:
68 ? Boolean policy
70 + Unsigned integral value policy
71 @ Indexed element policy
72 * Array dimension policy

The policy for signed integral data is the twos-complement sign extension policy.

Twos-Complement Sign-Extended Policy:
69 - Twos-complement sign-extended integral value policy

With this policy the most significant bit of the first digit has a negative weight of -32. Thus, first digit values between 0 and 31 are non-negative whereas the value 64 is subtracted from values in the range 32 to 63 to yield a value between -32 and -1. This means the digit z is the value -1. If the leading digit is a z and the successor digit is a digit between W and z, then the leading digit is redundant and may be discarded. Similarly if the leading digit is 0 and the successor digit is a digit between 0 and V, then the leading 0 is redundant.

By discarding redundant digits, CBF-8 adapts the length of the byte stream to the size of the number. There is no byte, short, int, long; there is only a stream of bytes where the stream length is the fewest number of base-64 digits that can hold the integral value.

prev continue

CBF-8 – Pt 4 Numeric (Seven-Bit) Policies

In the previous post we defined a number as a sequence of base-64 digits. In this post I will name the numeric policy indicators and illustrate the use of the dimension policy with raw, eight-bit byte arrays.

The policy indicator marks the start of a field of data. Since the start of the successor field marks the end of the current field, policy indicators are also field separators. More importantly policy indicators dictate the form into which the data will morph when decoded. Following are the numeric policy indicators (ASCII special characters) ordered by their septet values.

NumericPolicyIndicator:
64 # IEEE-754 real number policy
65 ^ Angle policy
66 : Date and Timestamp policy
67 ~ Logical set (bit-vector) policy
68 ? Boolean policy
69 - Twos-complement sign-extended integral value policy
70 + Unsigned integral value policy
71 @ Indexed element policy
72 * Array dimension policy

A numeric field can have a plurality of numeric elements provided each element has the same policy. In that event the prefix for subsequent elements is septet 75, the intra-field separator, comma.

NumericField:
NumericPolicyIndicator Numberopt
NumericField , Numberopt

For example, consider a two-dimensional array with dimensions 63×15.
Septet 74 space is called the raw data separator because it terminates the inner most dimension and marks the transition to an eight-bit raw data policy. Thus, the first 15-byte array begins

*z,F 15 bytes

The eight-bit raw data policy is limited to the 15 bytes needed to populate the 15 byte array. Once populated, the decoder reverts to the seven-bit dimension policy and expects the second of 63 arrays. Since the array lengths of the 63 arrays may change, the length of subsequent byte arrays must be specified.

The default length is 15 bytes. We can accept the default value by omitting the length (i.e. the number between the comma and the space) as in the following example

, 15 bytes

or override the default value with a new value (e.g. 11 bytes)

,B 11 bytes

An omission does not always signal the use of a default value. It can also mean that the data are unknown. For example, to specify an unknown number of arrays the first of which is 15 bytes long we omit the number between the * and the ,

*,F 15 bytes

Alternatively, the policy indicator can be a placeholder for the field’s data and the fact that the number is omitted may throw an exception that prevents validation.

In my next post, we will look at how CBF-8 differentiates text from numbers.
prev continue

CBF-8 – pt 5 Literals

CBF-8 literals are based on standards compliant eight-bit Unicode Character Set (UCS) Transformation Format (UTF-8). Each member of the character set is called a code point. Let a literal be defined as a sequence of Unicode code points.

Literal:
CodePoint
Literal CodePoint

To symbolize each code point we need to bit-bust a sequence of bytes. Let each byte be symbolized using eight characters where 0 is a zero; 1 is a one; z is a zero or one and y is a zero or one where at least one y in a series of y is a one. We can now define a code point as follows:

CodePoint:
0zzzzzzz
110yyyyz 10zzzzzz
11100000 101zzzzz 10zzzzzz
1110yyyy 10zzzzzz 10zzzzzz
except surrogate
11110000 10yyzzzz 10zzzzzz 10zzzzzz
11110yyy 10zzzzzz 10zzzzzz 10zzzzzz

The Unicode standard reserves two ten-bit ranges as surrogate pairs.
In standards compliant UTF-8 these ranges are not used. In modified UTF-8 (Oracle) surrogate pairs are referred to as code units. When combined, surrogate pairs yield a 20-bit value which when added to 0x10000 yields code points in the range 0x10000 to 0x10FFFF. These are also known as supplementary code points because they supplement the basic multilingual plane with sixteen additional planes.

Surrogate:
11101101 1010zzzz 10zzzzzz leading surrogate
11101101 1011zzzz 10zzzzzz trailing surrogate

A simplified understanding of UTF-8 is that bytes beginning 10zzzzzz are extender bytes and that the number of ones preceding a zero in the most significant part of the lead byte specifies the total number of bytes comprising the code point. Code points can be up to four bytes long. So, when there are five, six or seven leading ones, UTF-8 regards these as illegal. CBF-8 regards these as literal terminators.

UTF-8 Terminator:
111110zz illegal five-byte pattern
1111110z illegal six-byte pattern
11111110 illegal seven-byte pattern
11111111 standard terminator

There are two advantages to using these values as terminators. First, code points do not have to be interpreted. There is no need to search for single or double quotes or trailing NUL bytes. Thus,

CBF-8 decodes code points but NEVER evaluates their content

Second, illegal five, six or seven byte patterns cannot be exploited as malware. When a terminator is encountered CBF-8 reverts to seven-bit policy and any extender byte beginning with a one immediately throws an exception.

We are now in a position to define a field of literals.

LiteralField:
" Literalopt Terminator
LiteralField , Literalopt Terminator

Note how septet 73 the double quote character ” causes a transition to an eight-bit Unicode policy and how the terminator reverts back to the seven-bit policy. The literal is encapsulated by the seven-bit policy so that every valid code point can be used without restriction. This is how CBF-8 separates text from everything else. In my next posts I will describe how individual numeric policies govern the morphology of numbers.
prev continue

CBF-8 – Pt 3 Seven vs Eight-Bit Policies

UTF-8 uses all eight bits of a byte. Each byte of a byte array must also be able to assume any of the 256 possible values. Both are examples of eight-bit policies. CBF-8 also has a seven-bit policy that restricts byte usage to a subset of the 128 values comprising the ASCII character set. The default policy is the seven-bit policy.

When the seven-bit policy is applied to an input stream, the value of each byte is converted from an ASCII character to a seven-bit septet. Negative bytes (sign-extended values between -128 and -1 or unsigned values between 128 and 255) are illegal. The septets 0 to 63 are reserved for base-64 digits. Septets between 64 and 94 are reserved for use as policy indicators. The septets 95 to 127 are the ASCII control characters and are not used. All numbers are encoded as base-64 digits.

number:
base64Digit
number base64Digit
base64Digit: in order
0 1 2 3 4 5 6 7
8 9 A B C D E F
G H I J K L M N
O P Q R S T U V
W X Y Z $ & a b
c d e f g h i j
k l m n o p q r
s t u v w x y z

prev continue

CBF-8 – Pt 1 Introduction

In my next posts I will be describing a novel data transport concept for the efficient transfer of both text and binary formatted data called Compressed Binary Format – eight bit (CBF-8). CBF-8 applies polymorphism to a stream of bytes to yield different types of data. With the CBF-8 encoder one can safely merge UTF-8 text with raw dimensioned byte arrays and binary formatted numbers such as real numbers, sign-extended integral values, unsigned integral values, logical values, Boolean values angle values, date values, timestamp values and a hierarchy of indexed objects in a common byte stream. The CBF-8 decoder decomposes the byte stream to yield the individual elements of the composition. The purpose of CBF-8 is to reduce the quantity of bytes transferred in database queries and their responses, thereby reducing required bandwidth and increasing responsiveness.
continue

Timestamps pt 4 – Adaptive Time Base

When dates and times are manually entered we may need only some of the following: year, month, day, time zone, hour, minute, second. Although the timestamp may have millisecond resolution, high resolution data may be inconsequential. For example a list of birthdays requires only year, month and day. Time zone, hour, minute, second and milliseconds are no factor and are implicitly set to zero. When exchanging data with a server or another application it would be desirable to send only the data we need and no more. This can be done with an adaptive time base.

Color My Data proposes an Eight-Bit Compressed Binary Format (CBF-8) coder / decoder which generates a policy indicator (DATETIME) followed by zero or more base-64 digits as follows:

  • digit 1 – years range -32 to +31 step 64 years bias 2000
  • digit 2 – years range 0 to 63 step 1 year
  • digit 3 – months range 1 to 12 step 1 month
  • digit 4 – days range 1 to 31 step 1 day
  • digit 5 – time zone range -24 to +24 step 30 minutes
  • digit 6 – MSB time zone range 0 to 1 step 15 minutes
  • digit 6 – five bits hour range 0 to 23 step 1 hour
  • digit 7 – minutes range 0 to 59 step 1 minute
  • digit 8 – seconds range 0 to 59 step 1 second
  • digit 9 – subseconds range 0 to 62 step 16 milliseconds
  • digit 10 – subseconds range 0 to 63 step 250 microseconds
  • digits 11 to n – subseconds range 0 to 63 step 250 x 6410-n microseconds.

The DATETIME policy causes CBF-8 to use MSB alignment which ignores trailing zeros. As trailing zeros increase, both the number of encoded digits and the effective clock frequency decrease. Returning to the birthday example, a four-byte (digit) stream may generate any birthday between Jan 1 28 BC and Dec 31 4047 AD with a resolution of one day. A seven-byte (digit) increases the clock rate from one cycle per day to one cycle per minute by adding time zone, hours and minutes (and nothing more). On the other extreme increasing the clock rate to 2GHz (500 picosecond period) would typically take 15 digits.

The input arguments for encoding timestamps are the minute of epoch 2000-01-01T00:00Z mm2K() and sec() from the previous post and the time-zone, a value between -48 and +48 that scales to -12:00 to +12:00 in 15 minute increments.

CBF8#append(OutputStream sink, int mm2K, double sec, byte timeZone);

The CBF8Decoder supports verification of the DATETIME policy, access to the array of base-64 digits and the number of encoded digits. This greatly simplifies the conversion to human readable formats. The timestamp class uses these data to decode mm2K and sec.

int mm2K(CBF8Decoder);
double sec(CBF8Decoder);

Advantages:

  • encoded data are portable across all implementations
  • timestamp splits into year, month, day, hour, minute, second and subsecond
  • split data is easy to format for human readability
  • encoded data require fewer bytes than formatted text
  • only significant digits are encoded
  • number of digits adapts to clock rate

Disadvantages:

  • Computationally intensive relative to addition / subtraction due to numerous integer remainder and integer divide operations.
  • Base-64 digits above F (15) not easy to relate to decimal digits
  • .

Suggested Use Cases:

Peer to peer and client-server data exchanges. Splitting data for formatted data input / output.

Timestamps Pt 3 – Configurable Time Base

The speed of light is approximately 186 miles per millisecond. In a distance measurement application (e.g. radar) the round trip distance is 93 miles per millisecond. If each clock cycle is one millisecond, the distance can be no more precise than 93 miles. This realization argues for higher clock rates. A one megahertz clock improves range resolution to about a tenth of a mile; a one gigahertz clock improves range resolution to about six inches (e.g. ground penetrating radar).

By introducing a configurable clock, portability is broken even more than when we introduced T0 in the previous post. To address the portability issue consider breaking the time into two parts: minutes and seconds. Let the minutes be referenced to a start minute M0 at the same time as the clock counter is referenced to the start time T0. When constructing the timestamp object let us specify M0, T0 the clock rate in Hertz and an implementation of the interface now(). This interface returns the raw number of clock cycles since some initial time as in the following example.

long now() {

return System.currentTimeMillis();

}

The product of the clock rate and 60 is the number of clock cycles in one minute. Let us call that value ONE_MINUTE.

Let us calculate the unsigned remainder (modulus) on dividing now() by ONE_MINUTE. Dividing the modulus by ONE_MINUTE yields a fraction of one minute between 0 up to but not including 1.0. The product of 60 and this value yields the number of seconds after the current minute, a portable value that is independent of the clock rate.

Subtracting the modulus from the dividend yields a value that ONE_MINUTE can evenly divide. The resulting quotient when added to M0 is the minute of epoch. So long as everyone agrees on the epoch beginning, this value is also portable and independent of all configurable parameters.

The year 2000 is a leap year exception to the 100 year leap year exception to the every fourth year leap year. For that reason, I chose 2000-01-01T00:00Z as the epoch reference; named the minute of epoch mm2K and implemented the following functions.

Timestamp set(int mm2K, double sec);
int mm2K();
double sec();

Advantages:

  • Configurable time bases work with any clock rate
  • Timestamps with different clock rates can be compared to one another

Disadvantages:

  • Integer division is computationally intensive relative to addition or subtraction
  • Data exchange increases from 64 to 96 bits
  • Still not human readable

Suggested Use Case:

Creating timelines from different systems may require comparing timestamps with different clock rates. Since mm2K and sec are portable they are used to implement the interface java.io.Comparable<Timestamp>

Timestamps Pt 2 – Reducing Storage Size

Automated data-loggers may need to generate thousands if not millions of timestamps per day. If we could reduce the size of the timestamp 50% or more, the memory savings become significant. In the previous post we showed that nearly 15 bits (about 16000 days) can be discarded if we simply measure time relative to a configurable start time T0. More redundant bits can be eliminated if we shorten the amount of time before the clock rolls over. However, this requires that T0 be reset periodically. Here is how storage size affects the maximum time between T0 resets.

  • 2 bytes about 1.1 minutes
  • 3 bytes about 4.66 hours
  • 4 bytes about 49.7 days

Advantages

  • Significant Reduction in Storage Size
  • Well suited for high frequency data logging
  • Subtraction of T0 is not computationally intensive

Disadvantages

  • Without T0; the time is no longer portable
  • Adding T0 doubles the amount of data to be ported.
  • Doesn’t address light speed distance measurement
  • Still requires processing for human readability

Suggested Use Case

Confine usage to data storage and retrieval as in the following example.

store(System.currentTimeMillis()-this.T0, rowData, row);
long then = load(rowData, row) + this.T0;

Timestamps pt 1 – One Size Does Not Fit All

For Java programmers the function

long t = System.currentTimeMillis();

assigns t the number of milliseconds that have elapsed since midnight Jan 01 1970. It takes 10 bits to represent a second. Add 17 bits for 86400 seconds in a day and 9 bits for 365 days in a year and it leaves 28 bits for about 268 million years worth of time measurement. Over that period of time the earths rotation rate will have slowed and how we think of time will have changed radically. Let us discuss this representation’s advantages and disadvantages.

Advantages:

  • Minimizes amount of time needed to record the time
  • Suitable for many applications including data logging
  • Portable across numerous implementations

Disadvantages:

  • It is a thousand to a million times too coarse for lightspeed distance measurement. (e.g. radar)
  • More than 16000 days have elapsed since Jan 1 1970 necessitating nearly 15 bits of redundant data with present day data-loggers.
  • It is difficult to relate to year, month, day, hour, minute, second without additional processing.

Suggested Use Case:

Getting or setting the time as in the following example:

Timestamp set(long t); //sets timestamp
long longValue(); //gets value of t