CBF-8 – Pt 3 Seven vs Eight-Bit Policies

UTF-8 uses all eight bits of a byte. Each byte of a byte array must also be able to assume any of the 256 possible values. Both are examples of eight-bit policies. CBF-8 also has a seven-bit policy that restricts byte usage to a subset of the 128 values comprising the ASCII character set. The default policy is the seven-bit policy.

When the seven-bit policy is applied to an input stream, the value of each byte is converted from an ASCII character to a seven-bit septet. Negative bytes (sign-extended values between -128 and -1 or unsigned values between 128 and 255) are illegal. The septets 0 to 63 are reserved for base-64 digits. Septets between 64 and 94 are reserved for use as policy indicators. The septets 95 to 127 are the ASCII control characters and are not used. All numbers are encoded as base-64 digits.

number:
base64Digit
number base64Digit
base64Digit: in order
0 1 2 3 4 5 6 7
8 9 A B C D E F
G H I J K L M N
O P Q R S T U V
W X Y Z $ & a b
c d e f g h i j
k l m n o p q r
s t u v w x y z

prev continue

CBF-8 – Pt 2 Polymorphism

Morphology is the study of form. Polymorphism is about objects that can have a plurality of forms. To illustrate polymorphism consider a double precision floating point value. It can be viewed as a 64-bit value or as a composition of a sign, a biased exponent and a significand. If the biased exponent is a maximum and the significand is zero, the value is infinite. On the other hand if the significand is non-zero, the significand morphs into a logical value where each bit signals a different not-a-number condition. When the biased exponent is between 1 and max-1, the significand morphs into the ratio of a numerator to an implied denominator. The point I am making is that the meaning (i.e. semantics) changes as a function of the content.

Now let us apply polymorphism to a stream of bytes. The content which changes the meaning of the bytes is called a policy indicator. Policy indicators perform two important functions: they separate elements within a byte stream and they specify the form into which the element will morph. (i.e. will it be UTF-8 text? a real-number? an angle? an unsigned integral value? a logical value? a Boolean value? a byte-array etc.)

prev continue

CBF-8 – Pt 1 Introduction

In my next posts I will be describing a novel data transport concept for the efficient transfer of both text and binary formatted data called Compressed Binary Format – eight bit (CBF-8). CBF-8 applies polymorphism to a stream of bytes to yield different types of data. With the CBF-8 encoder one can safely merge UTF-8 text with raw dimensioned byte arrays and binary formatted numbers such as real numbers, sign-extended integral values, unsigned integral values, logical values, Boolean values angle values, date values, timestamp values and a hierarchy of indexed objects in a common byte stream. The CBF-8 decoder decomposes the byte stream to yield the individual elements of the composition. The purpose of CBF-8 is to reduce the quantity of bytes transferred in database queries and their responses, thereby reducing required bandwidth and increasing responsiveness.
continue

Timestamps pt 4 – Adaptive Time Base

When dates and times are manually entered we may need only some of the following: year, month, day, time zone, hour, minute, second. Although the timestamp may have millisecond resolution, high resolution data may be inconsequential. For example a list of birthdays requires only year, month and day. Time zone, hour, minute, second and milliseconds are no factor and are implicitly set to zero. When exchanging data with a server or another application it would be desirable to send only the data we need and no more. This can be done with an adaptive time base.

Color My Data proposes an Eight-Bit Compressed Binary Format (CBF-8) coder / decoder which generates a policy indicator (DATETIME) followed by zero or more base-64 digits as follows:

  • digit 1 – years range -32 to +31 step 64 years bias 2000
  • digit 2 – years range 0 to 63 step 1 year
  • digit 3 – months range 1 to 12 step 1 month
  • digit 4 – days range 1 to 31 step 1 day
  • digit 5 – time zone range -24 to +24 step 30 minutes
  • digit 6 – MSB time zone range 0 to 1 step 15 minutes
  • digit 6 – five bits hour range 0 to 23 step 1 hour
  • digit 7 – minutes range 0 to 59 step 1 minute
  • digit 8 – seconds range 0 to 59 step 1 second
  • digit 9 – subseconds range 0 to 62 step 16 milliseconds
  • digit 10 – subseconds range 0 to 63 step 250 microseconds
  • digits 11 to n – subseconds range 0 to 63 step 250 x 6410-n microseconds.

The DATETIME policy causes CBF-8 to use MSB alignment which ignores trailing zeros. As trailing zeros increase, both the number of encoded digits and the effective clock frequency decrease. Returning to the birthday example, a four-byte (digit) stream may generate any birthday between Jan 1 28 BC and Dec 31 4047 AD with a resolution of one day. A seven-byte (digit) increases the clock rate from one cycle per day to one cycle per minute by adding time zone, hours and minutes (and nothing more). On the other extreme increasing the clock rate to 2GHz (500 picosecond period) would typically take 15 digits.

The input arguments for encoding timestamps are the minute of epoch 2000-01-01T00:00Z mm2K() and sec() from the previous post and the time-zone, a value between -48 and +48 that scales to -12:00 to +12:00 in 15 minute increments.

CBF8#append(OutputStream sink, int mm2K, double sec, byte timeZone);

The CBF8Decoder supports verification of the DATETIME policy, access to the array of base-64 digits and the number of encoded digits. This greatly simplifies the conversion to human readable formats. The timestamp class uses these data to decode mm2K and sec.

int mm2K(CBF8Decoder);
double sec(CBF8Decoder);

Advantages:

  • encoded data are portable across all implementations
  • timestamp splits into year, month, day, hour, minute, second and subsecond
  • split data is easy to format for human readability
  • encoded data require fewer bytes than formatted text
  • only significant digits are encoded
  • number of digits adapts to clock rate

Disadvantages:

  • Computationally intensive relative to addition / subtraction due to numerous integer remainder and integer divide operations.
  • Base-64 digits above F (15) not easy to relate to decimal digits
  • .

Suggested Use Cases:

Peer to peer and client-server data exchanges. Splitting data for formatted data input / output.

Timestamps Pt 3 – Configurable Time Base

The speed of light is approximately 186 miles per millisecond. In a distance measurement application (e.g. radar) the round trip distance is 93 miles per millisecond. If each clock cycle is one millisecond, the distance can be no more precise than 93 miles. This realization argues for higher clock rates. A one megahertz clock improves range resolution to about a tenth of a mile; a one gigahertz clock improves range resolution to about six inches (e.g. ground penetrating radar).

By introducing a configurable clock, portability is broken even more than when we introduced T0 in the previous post. To address the portability issue consider breaking the time into two parts: minutes and seconds. Let the minutes be referenced to a start minute M0 at the same time as the clock counter is referenced to the start time T0. When constructing the timestamp object let us specify M0, T0 the clock rate in Hertz and an implementation of the interface now(). This interface returns the raw number of clock cycles since some initial time as in the following example.

long now() {

return System.currentTimeMillis();

}

The product of the clock rate and 60 is the number of clock cycles in one minute. Let us call that value ONE_MINUTE.

Let us calculate the unsigned remainder (modulus) on dividing now() by ONE_MINUTE. Dividing the modulus by ONE_MINUTE yields a fraction of one minute between 0 up to but not including 1.0. The product of 60 and this value yields the number of seconds after the current minute, a portable value that is independent of the clock rate.

Subtracting the modulus from the dividend yields a value that ONE_MINUTE can evenly divide. The resulting quotient when added to M0 is the minute of epoch. So long as everyone agrees on the epoch beginning, this value is also portable and independent of all configurable parameters.

The year 2000 is a leap year exception to the 100 year leap year exception to the every fourth year leap year. For that reason, I chose 2000-01-01T00:00Z as the epoch reference; named the minute of epoch mm2K and implemented the following functions.

Timestamp set(int mm2K, double sec);
int mm2K();
double sec();

Advantages:

  • Configurable time bases work with any clock rate
  • Timestamps with different clock rates can be compared to one another

Disadvantages:

  • Integer division is computationally intensive relative to addition or subtraction
  • Data exchange increases from 64 to 96 bits
  • Still not human readable

Suggested Use Case:

Creating timelines from different systems may require comparing timestamps with different clock rates. Since mm2K and sec are portable they are used to implement the interface java.io.Comparable<Timestamp>

Timestamps Pt 2 – Reducing Storage Size

Automated data-loggers may need to generate thousands if not millions of timestamps per day. If we could reduce the size of the timestamp 50% or more, the memory savings become significant. In the previous post we showed that nearly 15 bits (about 16000 days) can be discarded if we simply measure time relative to a configurable start time T0. More redundant bits can be eliminated if we shorten the amount of time before the clock rolls over. However, this requires that T0 be reset periodically. Here is how storage size affects the maximum time between T0 resets.

  • 2 bytes about 1.1 minutes
  • 3 bytes about 4.66 hours
  • 4 bytes about 49.7 days

Advantages

  • Significant Reduction in Storage Size
  • Well suited for high frequency data logging
  • Subtraction of T0 is not computationally intensive

Disadvantages

  • Without T0; the time is no longer portable
  • Adding T0 doubles the amount of data to be ported.
  • Doesn’t address light speed distance measurement
  • Still requires processing for human readability

Suggested Use Case

Confine usage to data storage and retrieval as in the following example.

store(System.currentTimeMillis()-this.T0, rowData, row);
long then = load(rowData, row) + this.T0;

Timestamps pt 1 – One Size Does Not Fit All

For Java programmers the function

long t = System.currentTimeMillis();

assigns t the number of milliseconds that have elapsed since midnight Jan 01 1970. It takes 10 bits to represent a second. Add 17 bits for 86400 seconds in a day and 9 bits for 365 days in a year and it leaves 28 bits for about 268 million years worth of time measurement. Over that period of time the earths rotation rate will have slowed and how we think of time will have changed radically. Let us discuss this representation’s advantages and disadvantages.

Advantages:

  • Minimizes amount of time needed to record the time
  • Suitable for many applications including data logging
  • Portable across numerous implementations

Disadvantages:

  • It is a thousand to a million times too coarse for lightspeed distance measurement. (e.g. radar)
  • More than 16000 days have elapsed since Jan 1 1970 necessitating nearly 15 bits of redundant data with present day data-loggers.
  • It is difficult to relate to year, month, day, hour, minute, second without additional processing.

Suggested Use Case:

Getting or setting the time as in the following example:

Timestamp set(long t); //sets timestamp
long longValue(); //gets value of t

 

Timestamps – Overview

In my next blogs I will share work in progress on a timestamp class that conserves memory and works as well with manual data entry as it does with automated data-logging. The blog will feature the following topics:

  1. One size does not fit all
  2. Reducing the size of stored data
  3. Configurable time base
  4. Adaptive time base
  5. Formatting time for human readability

IEEE-754 pt 3 – Increased granularity

In part 2 we showed that the position of the implicit bit i was also the width of the biased exponent. That makes the bias i-1 bits wide. Since all bits of the bias are 1, the value of the bias is 2i-1-1. With one bit for the sign, that leaves n-i-1 bits for the width of the significand. By specifying i as a function of n the storage format becomes fully defined. This works not only for eight-bit boundaries but for the six-bit boundaries of Color My Data eight-bit compressed binary format (CBF8). CBF8 is designed for loss-less data exchange using binary formats. One base-64 digit could represent plus or minus sixteenths between 0 and 0.9375; eighths between 1 and 1.875; fourths between 2 and 3.75, infinity, quiet NaN or NaN with up to two signals. That represents a potential compression of 87.5% over double precision WITH NO LOSS OF DATA.
The graph below shows how i varies with n in comparison to the current standard.ieee-754 graph

The graph was generated from the following tables. Here are the column definitions.
Bytes (Digits):

Data storage size in bytes or base-64 digits
Bits:
Data storage size in bits (8 bits per byte; 6 bits per base-64 digit)
Sign:
The sign is a one-bit value
BiasExp:
The width, i of the biased exponent in bits
Significand:
The width of the significand in bits
Precision:
The number of decimal digits of precision (Ada digits)
DynRange:
Twice the bias converted to decimal digits
bytes bits sign biasExp significand precision bias dynRange
1 8 1 3 4 1.20 3 1.81
2 16 1 5 10 3.01 15 9.03
3 24 1 7 16 4.82 63 37.93
4 32 1 8 23 6.92 127 76.46
5 40 1 9 30 9.03 255 153.53
6 48 1 10 37 11.14 511 307.65
7 56 1 10 45 13.55 511 307.65
8 64 1 11 52 15.65 1023 615.91
9 72 1 11 60 18.06 1023 615.91
10 80 1 12 67 20.17 2047 1232.42
11 88 1 12 75 22.58 2047 1232.42
12 96 1 13 82 24.68 4095 2465.44
13 104 1 13 90 27.09 4095 2465.44
14 112 1 14 97 29.20 8191 4931.47
15 120 1 14 105 31.61 8191 4931.47
16 128 1 15 112 33.72 16383 9863.55
digits bits sign biasExp significand precision bias dynRange
1 6 1 2 3 0.90 1 0.60
2 12 1 4 7 2.11 7 4.21
3 18 1 5 12 3.61 15 9.03
4 24 1 7 16 4.82 63 37.93
5 30 1 8 21 6.32 127 76.46
6 36 1 8 27 8.13 127 76.46
7 42 1 9 32 9.63 255 153.53
8 48 1 10 37 11.14 511 07.65
9 54 1 10 43 12.94 511 307.65
10 60 1 11 48 14.45 1023 615.91
11 66 1 11 54 16.26 1023 615.91
12 72 1 11 60 18.06 1023 615.91
13 78 1 12 65 19.57 2047 1232.42
14 84 1 12 71 21.37 2047 1232.42
15 90 1 13 76 22.88 4095 2465.44
16 96 1 13 82 24.68 4095 2465.44
17 102 1 13 88 26.49 4095 2465.44
18 108 1 14 93 28.00 8191 4931.47
19 114 1 14 99 29.80 8191 4931.47
20 120 1 14 105 31.61 8191 4931.47
21 126 1 15 110 33.11 16383 9863.55
22 132 1 15 116 34.92 16383 9863.55

IEEE-754 pt 2 – Composition

There are five important components to an IEEE-754 floating point number

  1. sign – a one-bit value indicating whether the value is negative
  2. biased exponent – determines the magnitude of the floating-point value
  3. significand – the precision of the floating point value derives from its width
  4. bias – the dynamic range of the floating point value derives from its width
  5. implicit bit – has an implied weight of 1.

In the following illustrations let us break down the composition of a floating point value n bits wide. Constants are shown as 0 and 1. The variable x represents any 0 or 1; the variable y represents at least one 1 and the variable z represents at least one 0.  Note that the position i of the implicit bit (blue) is also the width of the biased exponent (orange) and is the least significant bit (LSB) of both the bias (green) and the biased exponent. The sign (red) is always the most significant bit (MSB) and the significand (yellow) is always aligned on the implicit bit.  The weight of each bit of the significand is one half the weight of the predecessor bit. Thus, bit i+1 is one half the weight of the implicit bit or one half. This means the significand represents a value between 1.0 up to but not including 2.0. The product of this value and 2biasedExponent-bias is the magnitude. Applying the sign to the magnitude yields what is called the normalized value.

ieee-754 comp

The minimum value for a biased exponent is zero. The IEEE-754 standard reserves this value for zero and subnormal values.  The maximum value for the biased exponent is the sum of twice the bias and one (i.e. the exponent with bias removed is greater than the bias). The IEEE-754 standard reserves this value for infinity and Not-a-Number (NaN) When the significand is zero, the value is an infinity; but, when the significand has at least one 1 the result is a NaN. When the 1 is the most significant bit of the significand and the remaining bits are zero, the NaN is called a quiet NaN. When any other bit of the significand is set, the NaN is called a signaling NaN and the set bit is the signal.

ieee-754-cases