ADS-B Lite Illustration

In the following illustration we see two unmanned aerial systems (UAS) on a collision course. Neither aircraft sees the other because each is outside the other’s field of view.

At 22 seconds into the video, we reenact the encounter with both aircraft presumed to have ADS-B Lite. With only a one-watt transmitter, detection may occur up to 12 miles away. For the sake of brevity, detection is shown 12 seconds before arrival at the intersection.

On large aircraft the transmit and receive antennas can be spaced widely apart. This is called antenna diversity. Small UAS may have to use a single antenna, which means neither aircraft can receive if both are transmitting at the same time. ADS-B Lite solves this problem by synchronizing the measurements to GPS and then delaying data transmission to a selected broadcast period. Latency is removed by discarding the subseconds after the second (i.e. the broadcast delay) from the receive time. You see this in the video as alternating blue and yellow broadcasts.

Unlike standards-compliant ADS-B, which transmits location and velocity data twice per second, ADS-B Lite transmits only once per second and groups all data in a single broadcast period instead of broadcasting data piecemeal over multiple broadcast periods. This gives more time to listen without mutual interference. This scheme, known as time division multiple access (TDMA), means up to 1024 emitters can be tracked concurrently. Together with low power, ADS-B Lite’s approach to TDMA solves a major problem with standards-compliant ADS-B: frequency saturation.

With latency removed, a single broadcast is sufficient to accurately predict the time, place and altitude of an encounter. When the first blue wavefront from the fixed-wing aircraft crosses the quad-copter, the quad-copter is alerted to the presence of the fixed-wing aircraft. Within a thousandth of a second the quad-copter displays conflict awareness by changing its color to red, by placing a red triangle at the intersection and by issuing the directive to climb along with a counter indicating 12 seconds remain until minimum separation. When the first yellow wavefront from the quad-copter crosses the fixed-wing aircraft, the fixed-wing aircraft is alerted to the presence of the quad-copter and it too changes color, places a red triangle at the intersection, issues a directive to descend and indicates 11 seconds remain until minimum separation.

Part of the economy of scale for UAS beyond visual line of sight (BVLoS) operations is that a single pilot commands multiple aircraft concurrently, relying upon automation to handle simple tasks like maintaining attitude, course, airspeed and altitude. In the simulation neither pilot has responded by eight seconds to go, so an audio alarm is sounded to attract their attention. At four seconds to go no action has been taken, so the automatic systems override the pilots by commanding the fixed-wing aircraft to descend and commanding the quad-copter to climb. Each aircraft’s maneuver is determined from an encounter model so that the actions are coordinated. The change of altitude displays at the left and right edges of the simulation and in the views from the display. With 75 feet of separation, the quad-copter passes over the fixed-wing aircraft and the collision is avoided. After the all clear signals that the encounter is over, the alarms are turned off and the aircraft return to their original altitudes.

If you would like to be notified when ADS-B Lite is available, please send your contact information to plc@colormydata.com and put PROTOTYPE TESTING, PRE-PRODUCTION or PRODUCTION in the subject line.

ADS-B Lite Differences

As the name suggests, ADS-B Lite is something less than the standard ADS-B 1090 MHz ES with transponder. It is also quite a bit more. In this blog we will explore how ADS-B Lite does more with less.

Less Functionality
If ADS-B is truly an alternative to radar surveillance, why not go all in and eliminate the transponder functionality? No IFF receiver, no Mode S; just the Extended Squitter broadcast.
Less Power
The rationale for eliminating the transponder is power. The DO-260B Minimum Operational Performance Standards (MOPS) specify the power requirements for a transponder. At high altitudes, detection beyond 200 miles is common. If cell phone range is all that is needed for slow, low-altitude operations, we can cut the power requirement by a factor of over 10 (e.g. six watts and not 70 watts). Moreover, since we only need six watts every thousandth of a second, we can significantly reduce power consumption, an important factor for power-limited, battery-operated drones or sailplanes (gliders) and an affordable alternative for light sport aircraft.
Less Cost
The cost driver for a standard ADS-B system is the RF power amplifier. Quotes range from the thousands of dollars to the tens of thousands of dollars. Many manufacturers reply with no bid. Cell phone manufacturers have much more affordable power amplifiers because of market size (millions vs thousands) and a lower power requirement. Will a ten-fold reduction in power translate to a ten-fold reduction in price? The jury is out.
Less Weight
For electrically powered UAVs, reducing the power requirement means a lighter power supply, less shielding of the RF section and a smaller RF power amplifier. Every ounce saved in avionics is an ounce that can be applied to a payload.
Less Range
In a previous post, you saw what FRUIT looks like on a radar screen. Now imagine hundreds of UAVs adding to that mix in a congested area such as the Los Angeles basin. Overlapping messages could result in garbled messages causing even more FRUIT and dropped updates. Reducing power to cell phone range limits the amount of bandwidth overload. Coordinating the inputs from multiple ADS-B “in” receivers in a mesh network would allow air traffic controllers to see all traffic in controlled airspace without overloading their radar bandwidth.

Now that you have seen what gets taken away with ADS-B Lite, let us talk about what gets added.

Option to Comply with Established Standards
For aircraft with reciprocating or gas turbine engines, avionics power consumption is not a factor. By simply enabling the IFF receiver and a Mode S software plug-in and by connecting the six-watt output to an RF power amplifier and its power supply, ADS-B Lite becomes a DO-260B MOPS compliant Mode S transponder with ADS-B 1090 MHz ES.
Option for Inertial Navigation System (INS) Integration
Using the Application Programming Interface (API) of an INS such as the Lord MicroStrain 3DM-GX4-45, ADS-B Lite will provide plug and play functionality over a USB connection. An INS provides an alternative source of location data in the event of loss of GPS. It also provides attitude and heading data that may be used for antenna steering, sensor pointing, stabilization and alignment.

Waypoint Navigation
The intersection of a sphere and a plane passing through the center of that sphere is a great circle. The shortest distance between two points on the sphere is an arc of a great circle. A great circle is fully defined by the latitude L, longitude λ and track TR being flown.

Great Circle (Latitude, Longitude, Track)


Let A be the present position and let B be a waypoint. Given the latitude L and longitude λ of points A and B, the following equations yield range and bearing from A to B. A two-argument arctangent of the second and third elements yields the true bearing from A to B, and the arc cosine of the first element yields the great circle arc between A and B. Great circle arc in radians converts to nautical miles or kilometers using π radians = 10800 nautical miles = 20000 km. The API implements these equations for waypoint navigation and traffic conflict assessment.

Equation 1
Note that in spherical geometry the reciprocal bearing from B to A is generally not 180 degrees from the bearing from A to B. This can be readily seen by interchanging the roles of A and B.
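
The range and bearing calculation can be sketched in Java as follows. This is a minimal sketch using the standard spherical law of cosines and the usual two-argument arctangent bearing formula; it may differ in form from the matrix equations referenced above. The class and method names are hypothetical and all angles are assumed to be in radians.

```java
/** Range and bearing on a sphere; all angles are in radians. */
final class GreatCircle {
    /** Conversion from the text: pi radians = 10800 nautical miles. */
    static final double NM_PER_RADIAN = 10800.0 / Math.PI;

    /** Great circle arc from A to B in radians (spherical law of cosines). */
    static double arc(double latA, double lonA, double latB, double lonB) {
        double c = Math.sin(latA) * Math.sin(latB)
                 + Math.cos(latA) * Math.cos(latB) * Math.cos(lonB - lonA);
        // Clamp to guard against rounding slightly outside [-1, 1].
        return Math.acos(Math.max(-1.0, Math.min(1.0, c)));
    }

    /** True bearing from A to B in radians, normalized to [0, 2*pi). */
    static double bearing(double latA, double lonA, double latB, double lonB) {
        double east = Math.sin(lonB - lonA) * Math.cos(latB);
        double north = Math.cos(latA) * Math.sin(latB)
                     - Math.sin(latA) * Math.cos(latB) * Math.cos(lonB - lonA);
        double b = Math.atan2(east, north);
        return b < 0 ? b + 2 * Math.PI : b;
    }

    public static void main(String[] args) {
        double latA = Math.toRadians(45), lonA = 0;
        double latB = Math.toRadians(45), lonB = Math.toRadians(90);
        System.out.printf("range  %.1f nm%n", arc(latA, lonA, latB, lonB) * NM_PER_RADIAN);
        System.out.printf("A to B %.1f deg%n", Math.toDegrees(bearing(latA, lonA, latB, lonB)));
        System.out.printf("B to A %.1f deg%n", Math.toDegrees(bearing(latB, lonB, latA, lonA)));
    }
}
```

Running main with the roles of A and B interchanged shows that the two bearings are not 180 degrees apart, which is the point made in the note above.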

Dead Reckoning
Dead reckoning predicts the latitude and longitude at a future time as a function of present latitude L, present longitude λ, present time t0, future time t, ground speed GS and track TR. The product of GS and t-t0 is a distance. On converting the distance in nautical miles or kilometers to a great circle arc γ in radians the sine and cosine of γ and the track TR are inserted into the dead reckoning equations below to solve for latitude and longitude. The two argument arctangent of the first two elements yields the longitude of point B and the arc sine of the third element yields the latitude of point B. Note that the product of the three by three matrix and its transpose is the identity matrix.
DR equations
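
A minimal Java sketch of dead reckoning follows. It uses the standard spherical "destination point" formulas (arc sine for latitude, two-argument arctangent for the change in longitude) instead of the three-by-three matrix form described above, so treat it as an equivalent illustration and not as the API's implementation. The names are hypothetical; ground speed is assumed to be in knots, times in seconds and angles in radians.

```java
/** Dead reckoning on a sphere; angles in radians, speed in knots, time in seconds. */
final class DeadReckoning {
    /** Conversion from the text: pi radians = 10800 nautical miles. */
    static final double RADIANS_PER_NM = Math.PI / 10800.0;

    /** Returns {latitude, longitude} at time t; longitude is not wrapped to (-pi, pi]. */
    static double[] predict(double lat, double lon, double track,
                            double groundSpeedKnots, double t0, double t) {
        double distanceNm = groundSpeedKnots * (t - t0) / 3600.0;
        double gamma = distanceNm * RADIANS_PER_NM;   // great circle arc flown
        double sinLatB = Math.sin(lat) * Math.cos(gamma)
                       + Math.cos(lat) * Math.sin(gamma) * Math.cos(track);
        double latB = Math.asin(Math.max(-1.0, Math.min(1.0, sinLatB)));
        double lonB = lon + Math.atan2(
                Math.sin(track) * Math.sin(gamma) * Math.cos(lat),
                Math.cos(gamma) - Math.sin(lat) * sinLatB);
        return new double[] { latB, lonB };
    }

    public static void main(String[] args) {
        // Fly due east along the equator at 600 knots for one hour.
        double[] p = predict(0, 0, Math.PI / 2, 600, 0, 3600);
        System.out.printf("lat %.4f deg, lon %.4f deg%n",
                Math.toDegrees(p[0]), Math.toDegrees(p[1]));
    }
}
```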
Traffic Conflict Assessment
On each ADS-B “in” message containing GPS data, ADS-B Lite calculates the intersection where one’s own trajectory crosses the trajectory of the reporting aircraft. From this it derives the point of minimum separation and estimates the time of arrival and the amount of separation, both horizontally and vertically. If separation is below minimums, a maneuver is proposed to maximize separation. If coupled to the flight controller, the evasive maneuver is performed without human intervention; alternatively, a real-time process monitoring ADS-B Lite data could display traffic conflicts on a horizontal situation indicator and/or collision avoidance system.
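
To illustrate the idea of a point of minimum separation, here is a simplified Java sketch. It assumes a local flat-earth approximation with straight-line, constant-velocity tracks, which is only reasonable over short ranges; it is not the great-circle intersection method described above. The class name, method name and units (nautical miles, knots, hours) are assumptions made for the example.

```java
/** Horizontal closest point of approach in a local flat-earth frame. */
final class ClosestApproach {
    /**
     * Inputs are the intruder's position and velocity relative to one's own aircraft.
     * Returns {time to minimum separation in hours, horizontal separation in nm}.
     */
    static double[] closestApproach(double relEastNm, double relNorthNm,
                                    double relVelEastKt, double relVelNorthKt) {
        double speedSq = relVelEastKt * relVelEastKt + relVelNorthKt * relVelNorthKt;
        double t = 0.0;
        if (speedSq > 0.0) {
            // Minimize |r + v t| over t: t = -(r . v) / (v . v)
            t = -(relEastNm * relVelEastKt + relNorthNm * relVelNorthKt) / speedSq;
            if (t < 0.0) t = 0.0;   // already diverging: separation is smallest now
        }
        double east = relEastNm + relVelEastKt * t;
        double north = relNorthNm + relVelNorthKt * t;
        return new double[] { t, Math.hypot(east, north) };
    }

    public static void main(String[] args) {
        // Own aircraft flies east at 60 kt; intruder starts 1 nm east and 1 nm south, flying north at 60 kt.
        double[] r = closestApproach(1, -1, 0 - 60, 60 - 0);
        System.out.printf("time %.1f s, separation %.3f nm%n", r[0] * 3600, r[1]);
    }
}
```

Vertical separation can be estimated the same way from the altitude difference and the difference in climb rates at the returned time.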
Geo-Fencing
The boundary of a restricted area is treated the same as an aircraft trajectory except that the trajectory of the boundary is timeless and may span a large range of altitudes. Obstacles such as bridges, towers or power lines may be handled with geo-fences. Calculations are performed on each GPS update using an on-board geo-fence database amended as necessary by notices to airmen (NOTAMS).
Terrain Avoidance
Imagine an elevation contour on a map as a special case of a geo-fence. To avoid terrain, the aircraft must have the capability to overfly the bounded area. If it does not have this capability, the route must be replanned to circumnavigate the bounded area. Elevation contours would be extracted from an on-board database and a profile of minimum elevation versus range along the current trajectory would be updated at regular intervals where the interval length may vary depending on terrain steepness and aircraft maneuvers. In principle an aircraft could fly nap of the earth using GPS, ADS-B Lite and the on-board databases.
Time Division Multiple Access (TDMA)
TDMA forces aircraft to take turns broadcasting their ADS-B data. It is important that the timestamp applied to a position report be accurate; otherwise, the calculation of the point of intersection by the traffic conflict assessment process will be erroneous. Current practice is to minimize the latency between a fix and the corresponding position report. An alternative is to ignore latency and use a deterministic process to calculate the fix time. For example, let the position report begin at a fixed delay after the GPS pulse per second (PPS) signal. Subtracting the delay and rounding to the nearest second yields the exact time of the GPS fix. The advantage of this alternative is that each aircraft can be assigned a specific time slot in which it can broadcast its position report. If each time slot is different, messages do not overlap one another and garbled messages become infrequent. By contrast, when synchronized with GPS, the DO-260B MOPS require all broadcasts be 200 milliseconds after the GPS PPS signal. This approach will cause messages to be garbled at the worst possible moment, when two aircraft are about to collide.
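
The fixed-delay scheme can be sketched in Java as follows. This is a minimal illustration, assuming 1024 equal slots per second (a figure borrowed from the "up to 1024 emitters" claim, not a confirmed slot layout) and assuming the receiver already knows the sender's slot. The class and method names are hypothetical.

```java
/** Slot timing for a fixed-delay TDMA broadcast scheme (illustrative only). */
final class TdmaSlots {
    static final int SLOTS_PER_SECOND = 1024;            // assumption, see lead-in
    static final long NANOS_PER_SECOND = 1_000_000_000L;

    /** Delay of a slot after the GPS PPS edge, in nanoseconds. */
    static long slotDelayNanos(int slot) {
        return slot * NANOS_PER_SECOND / SLOTS_PER_SECOND;
    }

    /** Time at which a report for the fix taken at fixSecond starts transmitting. */
    static long broadcastTimeNanos(long fixSecond, int slot) {
        return fixSecond * NANOS_PER_SECOND + slotDelayNanos(slot);
    }

    /** Recovers the fix time: subtract the slot delay, then round to the nearest second. */
    static long fixSecond(long receiveTimeNanos, int slot) {
        long t = receiveTimeNanos - slotDelayNanos(slot);
        return Math.floorDiv(t + NANOS_PER_SECOND / 2, NANOS_PER_SECOND);
    }

    public static void main(String[] args) {
        int slot = 512;
        long sent = broadcastTimeNanos(100, slot);
        long received = sent + 40_000;                    // 40 microseconds of propagation and processing
        System.out.println("recovered fix second: " + fixSecond(received, slot));
    }
}
```

Because the rounding absorbs any sub-second jitter smaller than half a second, the receiver does not need to measure latency at all.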
Pulse on Pulse Logic
When messages are garbled, it may still be possible to decouple overlapping signals using pulse-on-pulse logic. Pulse-on-pulse logic will be tested on the breadboard ADS-B Lite system under development.

About ADS-B Lite

Automatic Dependent Surveillance – Broadcast (ADS-B) was conceived as an alternative to radar for tracking the location and movement of air traffic. Near airports, Airport Surveillance Radars (ASRs) scan the skies for aircraft. Identification Friend or Foe (IFF) interrogates the aircraft, and a beacon on the aircraft called a transponder encodes a reply identifying itself to the radar operator.

ASR with IFF


Mode C transponders encode altitude and Mode S transponders reply only when called. This helps in rejecting false replies unsynchronized in time (FRUIT). This is what a radar scope looks like before FRUIT has been removed.

ADS-B broadcasts GPS location data twice per second on the radar’s 1090 MHz frequency in a reserved part of a transponder broadcast called Extended Squitter (ADS-B 1090 MHz ES). Since this location data is more accurate and more frequent than radar, the FAA has mandated that all aircraft operating within controlled airspace (altitudes above 18000 feet and close proximity to airports with control towers) have ADS-B by the year 2020. The FAA has also been directed to share the national airspace with UAVs. Their response to date has been to propose very restrictive rules that would make many commercial uses of drones unfeasible.

What if air traffic controllers could use voice commands to direct UAV use of controlled airspace and monitor compliance with ADS-B 1090 MHz, knowing that the UAV would stay away from restricted airspace (geo-fencing), avoid collisions with structures and terrain and, most importantly, automatically avoid collisions with other ADS-B equipped manned or unmanned aircraft anywhere in the national airspace? Would that open the skies to commercially viable uses of UAVs? That is my vision for the ADS-B Lite project. It also overlaps NASA’s vision of an Unmanned Aircraft System (UAS) Traffic Management (UTM) system.

ADM – Pt 3 Large Sets of Objects

Application Data Manager (ADM) is an open source solution for processing large amounts of real-time data. In this segment I describe the ADM process for incrementally allocating memory to and sequencing large sets of objects.

Unconstrained Array of Objects

In the previous segment I introduced the concept of an unconstrained array of numeric aliases called a Sequence. In this segment I extend the unconstrained array concept to a set of objects of generic class T.

Unconstrained Array Declarations

A set is a number of things of the same kind that belong or are used together. In Java we can make things be of the same kind by requiring they be objects of the same class. Since we have no a priori knowledge of what that class is, we use the generic class T (type) and enclose T in angle brackets. Making this class be an extension of Sequence provides the means to add and recycle elements; moreover, implementing Iterable<T> means we can iterate over the order of appearance.


class UnconstrainedArray<T> extends Sequence implements Iterable<T> {


private T[] smallE;
private T[][] mediumE;
private T[][][] largeE;

}

Case Small – The first 256 elements

Since the sequence and the unconstrained array must be in one-to-one correspondence, every allocation of 256 elements to the sequence requires a corresponding allocation of 256 instances of T. However, all 256 instances of T are initially null.


this.smallE = (T[])new Object[256];

An append method obtains an alias (i.e. index) from the sequence and then assigns a new instance of T at that index. The allocated memory is only as large as it needs to be and no larger.

Case Medium – The next increment of 256 elements

For objects the transition from small to medium is a bit simpler than for sequences, taking only three lines of code:


this.mediumE = (T[][]) new Object[256][];
this.mediumE[0] = this.smallE; //note the continued use of smallE !!!
this.mediumE[1] = (T[]) new Object[256];

Case Large – The 256th increment of 256 elements

When the capacity of mediumE is reached, largeE is allocated with the following code:


this.largeE = (T[][][]) new Object[256][256][];
this.largeE[0] = this.mediumE; //note the continued use of mediumE !!!
this.largeE[1] = (T[][]) new Object[256][];
this.largeE[1][0] = (T[]) new Object[256];
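
One way to read the three cases above is that an alias below 16,777,216 splits into three byte-sized indices, and a leaf of 256 elements is allocated only when an alias first lands in it. The following Java sketch shows that idea in isolation. It is a simplification under stated assumptions: it skips the small and medium stages, it does not extend Sequence, and the class name LazyObjectTable and its methods are hypothetical and not part of ADM.

```java
/** Three-level table indexed by a 24-bit alias; leaves of 256 elements are allocated on demand. */
final class LazyObjectTable<T> {
    private static final int MAX_ALIASES = 1 << 24;     // 16,777,216
    private final Object[][][] pages = new Object[256][][];

    void set(int alias, T value) {
        check(alias);
        int hi = alias >>> 16, mid = (alias >>> 8) & 0xFF, lo = alias & 0xFF;
        if (pages[hi] == null) pages[hi] = new Object[256][];
        if (pages[hi][mid] == null) pages[hi][mid] = new Object[256];
        pages[hi][mid][lo] = value;
    }

    @SuppressWarnings("unchecked")
    T get(int alias) {
        check(alias);
        Object[][] page = pages[alias >>> 16];
        if (page == null) return null;
        Object[] leaf = page[(alias >>> 8) & 0xFF];
        return leaf == null ? null : (T) leaf[alias & 0xFF];
    }

    /** Number of 256-element leaves allocated so far. */
    int allocatedLeaves() {
        int n = 0;
        for (Object[][] page : pages) {
            if (page == null) continue;
            for (Object[] leaf : page) if (leaf != null) n++;
        }
        return n;
    }

    private static void check(int alias) {
        if (alias < 0 || alias >= MAX_ALIASES)
            throw new IllegalArgumentException("alias out of range: " + alias);
    }

    public static void main(String[] args) {
        LazyObjectTable<String> table = new LazyObjectTable<>();
        table.set(65536, "first element of the large case");
        System.out.println(table.get(65536) + " / leaves allocated: " + table.allocatedLeaves());
    }
}
```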

Summary

  • The unconstrained array of objects T is a set.
  • The capacity of the set grows in increments of 256 elements to a maximum capacity of 16,777,216 elements.
  • Addition and recycling are managed using the underlying Sequence.
  • Each element of the underlying Sequence (an alias / index) is in one-to-one correspondence with an element of set T.
  • The underlying Sequence enumerates the set of T.
  • Every element of set T is accessible via its alias.
  • Iteration over the order of appearance is supported.

In the next segment I will introduce unconstrained arrays of numeric data.
Copyright © 2014 Color My Data, All Rights Reserved


ADM – Pt 2 Unconstrained Arrays

Application Data Manager (ADM) is an open source solution for processing large amounts of real-time data. In this segment I describe the ADM process for memory allocation.

Unconstrained Arrays

In the segment on Sequences the length of the sequence had reached maximum capacity and we needed to append a new element. One solution would be to allocate more memory and then copy the old sequence to a subset of the new sequence. In a real time system this can have significant performance consequences and may lead to memory leaks. Another strategy would be to create a linked list. This minimizes memory allocation but can be much less efficient than direct access.
The solution chosen for ADM is an unconstrained array. As defined here, an unconstrained array combines the benefits of a linked list with direct access as it increases capacity in regular increments to meet the demands of real-time data acquisition.

Unconstrained Array Declarations:


private byte[] small;
private byte[][] medium;
private byte[][][] large;

Unconstrained Array Initial State – Case Small

A byte can enumerate up to 256 elements. Initially medium and large are both null. The array small is dimensioned 256 x 1 byte (blue) and assigned the values 0 to 255 (note: sign extension can be overridden by a policy that widens bytes unsigned to int). Next, let us illustrate what happens when we exceed the capacity of small.

Unconstrained-2d

First Increment of 256 elements – Case Medium

On adding the 257th element the 256 x 1 byte array small needs to be widened to the 256 x 2 byte array medium[0] (green). Another 256 x 2 byte array medium[1] is allocated to increase the capacity by another 256 elements and initialized with the values 256 to 511. The arrays medium[2] to medium[255] can remain null until more increments of 256 elements are needed. When medium[255] is allocated, the unconstrained array has a maximum capacity of 65536 elements.
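
The small-to-medium transition can be sketched in Java as follows. This is a rough illustration under assumptions: each medium block is taken to be a flat array of 512 bytes holding 256 two-byte aliases, and the high byte is stored first. Neither detail is confirmed by the text, and the class and method names are hypothetical.

```java
/** Illustrates unsigned byte widening and the small-to-medium transition of a byte sequence. */
final class ByteSequenceSketch {
    /** The initial 256 x 1 byte array holding aliases 0 to 255. */
    static byte[] newSmall() {
        byte[] small = new byte[256];
        for (int i = 0; i < 256; i++) small[i] = (byte) i;
        return small;
    }

    /** Widens the stored byte without sign extension, so 0xC8 reads as 200 and not -56. */
    static int readSmall(byte[] small, int position) {
        return small[position] & 0xFF;
    }

    /** Widens small into a 256 x 2 byte block (assumed layout: high byte, then low byte). */
    static byte[] widen(byte[] small) {
        byte[] wide = new byte[512];
        for (int i = 0; i < 256; i++) wide[2 * i + 1] = small[i];   // high bytes stay zero
        return wide;
    }

    /** Allocates block number `block`, initialized with aliases block*256 to block*256+255. */
    static byte[] newMediumBlock(int block) {
        byte[] wide = new byte[512];
        for (int i = 0; i < 256; i++) {
            int alias = block * 256 + i;
            wide[2 * i] = (byte) (alias >>> 8);
            wide[2 * i + 1] = (byte) alias;
        }
        return wide;
    }

    static int readMedium(byte[][] medium, int position) {
        byte[] block = medium[position >>> 8];
        int i = position & 0xFF;
        return ((block[2 * i] & 0xFF) << 8) | (block[2 * i + 1] & 0xFF);
    }

    public static void main(String[] args) {
        byte[] small = newSmall();
        byte[][] medium = new byte[256][];       // medium[2] to medium[255] stay null until needed
        medium[0] = widen(small);
        medium[1] = newMediumBlock(1);
        System.out.println(readMedium(medium, 200) + " " + readMedium(medium, 511));
    }
}
```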

Increment 256 of 256 elements – Case Large

Up until this time the array large has been null. On adding element 65537 the 256 arrays of 256 x 2 byte elements of medium (green) must be widened to 256 arrays of 256 x 3 byte elements (yellow) as shown in the following illustration.
Unconstrained-3d
The array large[0] allocates 256 arrays large[0][0] to large[0][255] and populates these with the widened data from medium.  The array large[1]  allocates 256 arrays large[1][0] to large[1][255] but only allocates memory to the 256 x 3 byte array large[1][0], initializing it with the 256 values 65536 to 65791.

As required capacity increases,  additional arrays of 256×3 elements are allocated to large[m][n]. When both m and n are 255, the maximum capacity of 16777216 elements has been reached. 

Disjoint Subsets

If additional capacity is required, the problem must be restructured using disjoint subsets so that each disjoint subset has no more than 16777216 elements. Disjoint subsets are also recommended for capacities much lower than the 16777216 maximum.

Unconstrained Arrays of Objects

Until now we have only looked at sequences. Unconstrained arrays can also be used for numeric row data and sets of objects. These will be the topic for future segments in this blog.


ADM – Pt 1 Sequences

Application Data Manager (ADM) is an open source solution for processing large amounts of data in real time. In this segment, I describe the ADM process for automated data recycling, Sequences.

Set Theory

ADM is rooted in set theory.

A set is a number of things of the same kind that belong or are used together.

  • The rows of a table or view are a set.
  • The columns of a table or view are a set.
  • The tables, views, reports and scripts of a database are sets.
  • The databases of the application data model (ADM) are a set.

Enumerating Members of a Set

To enumerate a set is to specify one element after another.
One way to do this is to rearrange the order of the elements of the set.
To minimize overhead we will NOT use this approach.

Instead, let us create a set S of non-negative integer aliases for each element of the generic set of many elements M and require that the elements of M and S be in one-to-one correspondence with one another as defined below.

Definitions

Enumerand:
an enumerated element of set M
Alias:
a non-negative integer member of set S in an immutable, one-to-one correspondence with an enumerand
Sequence:
a one-to-one mapping of S onto itself in some order.

To change enumerand access order we alter the sequence S and do nothing to the set M itself.

For every set M there is at least one sequence called the order of appearance.
The primary usage for the order of appearance is element recycling.

Recycling Process

The following table illustrates the recycling process.

The column step enumerates the set of recycling process actions.

The column sequence illustrates a newly allocated sequence for a set of up to eight elements. Note that the numeric aliases have been initialized with the values 0 to 7. They are shown in green because they have never been used.

The column capacity is the maximum number of elements that can be accessed without additional memory allocation.

The column hi-water is the maximum number of elements ever activated.

The column unused is the difference between hi-water and capacity. It represents the number of elements that have never been used and is shown in green.

The column length is the number of active elements and is shown in yellow.

The column recycled is the difference between hi-water and length and is shown in red.

The column available is the number of elements available for set expansion and is the difference between capacity and length.

Sequence

Step 2: to append an element, simply increment the length and use the alias. Yellow signals that the alias is active.

Step 3: after appending five more elements, six elements are in use (yellow) and there is availability for two more elements (green).

Step 4: element 3 is no longer needed and has been marked for recycling.

Step 5: a left cyclic permutation (x<-3<-4<-5<-x) and decrement in length inactivates and recycles element 3 (red). Note that the hi-water mark does not change as elements are recycled.

Step 6: element 1 is no longer needed and has been marked for recycling.

Step 7: a left cyclic permutation (x<-1<-2<-4<-5<-3<-x) and decrement in length inactivates and recycles element 1 (red).

Step 8: a new element is appended reactivating element 3. In set M the object aliased by element 3 is removed and replaced.

Step 9: a new element is appended reactivating element 1. In set M the object aliased by element 1 is removed and replaced.

Step 10: two new elements are appended and the sequence capacity is reached.
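
The steps above can be replayed with a small Java sketch. It is a minimal illustration with a fixed capacity, assuming a linear search to find the alias being recycled; the real Sequence may well locate aliases differently. The class name SequenceSketch and its methods are hypothetical.

```java
/** Fixed-capacity sequence of aliases with append and recycle, as in the walkthrough. */
final class SequenceSketch {
    private final int[] sequence;
    private int length;    // number of active elements
    private int hiWater;   // maximum number of elements ever activated

    SequenceSketch(int capacity) {
        sequence = new int[capacity];
        for (int i = 0; i < capacity; i++) sequence[i] = i;
    }

    /** Activates the next alias: a recycled one if available, otherwise a never-used one. */
    int append() {
        if (length == sequence.length) throw new IllegalStateException("capacity reached");
        int alias = sequence[length++];
        if (length > hiWater) hiWater = length;
        return alias;
    }

    /** Left cyclic permutation from the alias's position up to the hi-water mark. */
    void recycle(int alias) {
        int p = 0;
        while (p < length && sequence[p] != alias) p++;
        if (p == length) throw new IllegalArgumentException("alias not active: " + alias);
        System.arraycopy(sequence, p + 1, sequence, p, hiWater - 1 - p);
        sequence[hiWater - 1] = alias;   // the recycled alias now sits just below hi-water
        length--;
    }

    int length() { return length; }
    int hiWater() { return hiWater; }
    int[] snapshot() { return sequence.clone(); }

    public static void main(String[] args) {
        SequenceSketch s = new SequenceSketch(8);
        for (int i = 0; i < 6; i++) s.append();   // steps 2 and 3
        s.recycle(3);                             // steps 4 and 5
        s.recycle(1);                             // steps 6 and 7
        System.out.println(java.util.Arrays.toString(s.snapshot()));
        System.out.println("reactivated: " + s.append() + ", " + s.append());   // steps 8 and 9
    }
}
```

Note that the most recently recycled alias is placed just below the hi-water mark, so aliases are reactivated in the order in which they were recycled.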

In my next blog I will describe the unconstrained arrays that allow sequences to grow incrementally up to a maximum capacity of 16,777,216 elements.


CBF-8 – Pt 6 Bit Alignment Policies

For lossless data compression, only redundant or insignificant data may be discarded. For example, the numbers 45 and 000000045 have the same value, but in the second case there are seven redundant zeros before the significant data. A policy that aligns data on the least-significant bit (LSB) allows redundant data on the most-significant part to be discovered and discarded. This is the policy best suited to whole numbers.

Fractions can also have redundant data. Consider 0.125 versus 0.1250000000. In this example the seven trailing zeros are insignificant and may be discarded. A policy that aligns data on the most-significant bit (MSB) allows redundant data on the least-significant part to be discovered and discarded. This is the policy best suited to fractions.

When a number contains both a whole number and a fraction, the bit on which the two parts align is a fixed number of bits from the most significant bit and is therefore MSB-aligned. The alignment policies divide as follows:

MSB Alignment:
64 # IEEE-754 real number policy
65 ^ Angle policy
66 : Date and Timestamp policy
67 ~ Logical set (bit-vector) policy
LSB Alignment:
68 ? Boolean policy
69 - Twos-complement sign-extended integral value policy
70 + Unsigned integral value policy
71 @ Indexed element policy
72 * Array dimension policy


CBF-8 – Pt 7 Sign-Extension Policies

For MSB aligned data the position of the sign is fixed. However, for LSB aligned data the position of the sign, if any, depends on the sign extension policy. CBF-8 has two LSB policy alternatives: unsigned and twos-complement sign extended. With the unsigned policy no bit is negative. The first digit in a stream has a value between 0 and 63. All zeros preceding a value between 0 and 63 are therefore redundant and may be discarded. The following LSB aligned policies are unsigned.

Unsigned Policies:
68 ? Boolean policy
70 + Unsigned integral value policy
71 @ Indexed element policy
72 * Array dimension policy

The policy for signed integral data is the twos-complement sign extension policy.

Twos-Complement Sign-Extended Policy:
69 - Twos-complement sign-extended integral value policy

With this policy the most significant bit of the first digit has a weight of -32. Thus, first digit values between 0 and 31 are non-negative, whereas 64 is subtracted from values in the range 32 to 63 to yield a value between -32 and -1. This means the digit z is the value -1. If the leading digit is z and the successor digit is between W and z, then the leading digit is redundant and may be discarded. Similarly, if the leading digit is 0 and the successor digit is between 0 and V, then the leading 0 is redundant.

By discarding redundant digits, CBF-8 adapts the length of the byte stream to the size of the number. There is no byte, short, int, long; there is only a stream of bytes where the stream length is the fewest number of base-64 digits that can hold the integral value.
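
The redundancy rule can be sketched in Java as follows. This is a minimal illustration that works on digit values 0 to 63 and not on the printable digit characters, because the full digit alphabet is not reproduced in this post. The class and method names are hypothetical and this is not the CBF-8 codec itself.

```java
import java.util.Arrays;

/** Twos-complement sign-extended integers as the shortest run of base-64 digit values. */
final class SignExtendedDigits {
    /** Encodes a long as digit values 0..63, most significant digit first. */
    static int[] encode(long value) {
        int[] all = new int[11];                 // 11 six-bit digits cover all 64 bits
        long v = value;
        for (int i = 10; i >= 0; i--) {
            all[i] = (int) (v & 63);
            v >>= 6;                             // arithmetic shift preserves the sign
        }
        int start = 0;
        while (start < 10 && redundant(all[start], all[start + 1])) start++;
        return Arrays.copyOfRange(all, start, 11);
    }

    /** A leading 0 before a digit below 32, or a leading 63 before a digit of 32 or more. */
    static boolean redundant(int lead, int next) {
        return (lead == 0 && next < 32) || (lead == 63 && next >= 32);
    }

    /** Decodes digit values, giving the top bit of the first digit a weight of -32. */
    static long decode(int[] digits) {
        long v = digits[0] >= 32 ? digits[0] - 64 : digits[0];
        for (int i = 1; i < digits.length; i++) v = (v << 6) | digits[i];
        return v;
    }

    public static void main(String[] args) {
        for (long n : new long[] { 0, 31, 32, -1, -32, -33, Long.MIN_VALUE }) {
            int[] d = encode(n);
            System.out.println(n + " -> " + Arrays.toString(d) + " -> " + decode(d));
        }
    }
}
```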


CBF-8 – Pt 4 Numeric (Seven-Bit) Policies

In the previous post we defined a number as a sequence of base-64 digits. In this post I will name the numeric policy indicators and illustrate the use of the dimension policy with raw, eight-bit byte arrays.

The policy indicator marks the start of a field of data. Since the start of the successor field marks the end of the current field, policy indicators are also field separators. More importantly policy indicators dictate the form into which the data will morph when decoded. Following are the numeric policy indicators (ASCII special characters) ordered by their septet values.

NumericPolicyIndicator:
64 # IEEE-754 real number policy
65 ^ Angle policy
66 : Date and Timestamp policy
67 ~ Logical set (bit-vector) policy
68 ? Boolean policy
69 - Twos-complement sign-extended integral value policy
70 + Unsigned integral value policy
71 @ Indexed element policy
72 * Array dimension policy

A numeric field can have a plurality of numeric elements provided each element has the same policy. In that event the prefix for subsequent elements is septet 75, the intra-field separator, comma.

NumericField:
NumericPolicyIndicator Number_opt
NumericField , Number_opt

For example, consider a two-dimensional array with dimensions 63×15.
Septet 74 space is called the raw data separator because it terminates the inner most dimension and marks the transition to an eight-bit raw data policy. Thus, the first 15-byte array begins

*z,F 15 bytes

The eight-bit raw data policy is limited to the 15 bytes needed to populate the 15 byte array. Once populated, the decoder reverts to the seven-bit dimension policy and expects the second of 63 arrays. Since the array lengths of the 63 arrays may change, the length of subsequent byte arrays must be specified.

The default length is 15 bytes. We can accept the default value by omitting the length (i.e. the number between the comma and the space) as in the following example

, 15 bytes

or override the default value with a new value (e.g. 11 bytes)

,B 11 bytes

An omission does not always signal the use of a default value. It can also mean that the data are unknown. For example, to specify an unknown number of arrays the first of which is 15 bytes long we omit the number between the * and the ,

*,F 15 bytes

Alternatively, the policy indicator can be a placeholder for the field’s data and the fact that the number is omitted may throw an exception that prevents validation.

In my next post, we will look at how CBF-8 differentiates text from numbers.

CBF-8 – Pt 5 Literals

CBF-8 literals are based on the standards-compliant eight-bit Universal Character Set (UCS) Transformation Format (UTF-8). Each member of the character set is called a code point. Let a literal be defined as a sequence of Unicode code points.

Literal:
CodePoint
Literal CodePoint

To symbolize each code point we need to bit-bust a sequence of bytes. Let each byte be symbolized using eight characters where 0 is a zero; 1 is a one; z is a zero or one and y is a zero or one where at least one y in a series of y is a one. We can now define a code point as follows:

CodePoint:
0zzzzzzz
110yyyyz 10zzzzzz
11100000 101zzzzz 10zzzzzz
1110yyyy 10zzzzzz 10zzzzzz
except surrogate
11110000 10yyzzzz 10zzzzzz 10zzzzzz
11110yyy 10zzzzzz 10zzzzzz 10zzzzzz

The Unicode standard reserves two ten-bit ranges as surrogate pairs.
In standards compliant UTF-8 these ranges are not used. In modified UTF-8 (Oracle) surrogate pairs are referred to as code units. When combined, surrogate pairs yield a 20-bit value which when added to 0x10000 yields code points in the range 0x10000 to 0x10FFFF. These are also known as supplementary code points because they supplement the basic multilingual plane with sixteen additional planes.

Surrogate:
11101101 1010zzzz 10zzzzzz leading surrogate
11101101 1011zzzz 10zzzzzz trailing surrogate

A simplified understanding of UTF-8 is that bytes beginning 10zzzzzz are extender bytes and that the number of ones preceding a zero in the most significant part of the lead byte specifies the total number of bytes comprising the code point. Code points can be up to four bytes long. So, when there are five, six or seven leading ones, UTF-8 regards these as illegal. CBF-8 regards these as literal terminators.

UTF-8 Terminator:
111110zz illegal five-byte pattern
1111110z illegal six-byte pattern
11111110 illegal seven-byte pattern
11111111 standard terminator
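
The simplified reading of a byte by its count of leading ones can be sketched in Java as follows. This is a minimal illustration: it classifies a single byte only and does not enforce the finer restrictions in the CodePoint grammar above (overlong forms, surrogates, or the upper limit of 0x10FFFF). The class, enum and method names are hypothetical.

```java
/** Classifies a byte by the number of one bits that precede its first zero bit. */
final class Utf8ByteClass {
    enum Kind { SINGLE, EXTENDER, LEAD_2, LEAD_3, LEAD_4, TERMINATOR }

    /** Counts leading one bits in the byte, from 0 to 8. */
    static int leadingOnes(byte b) {
        int inverted = ~b & 0xFF;   // flip the eight bits so leading ones become leading zeros
        return inverted == 0 ? 8 : Integer.numberOfLeadingZeros(inverted) - 24;
    }

    static Kind classify(byte b) {
        switch (leadingOnes(b)) {
            case 0:  return Kind.SINGLE;      // 0zzzzzzz
            case 1:  return Kind.EXTENDER;    // 10zzzzzz
            case 2:  return Kind.LEAD_2;      // 110zzzzz
            case 3:  return Kind.LEAD_3;      // 1110zzzz
            case 4:  return Kind.LEAD_4;      // 11110zzz
            default: return Kind.TERMINATOR;  // five or more leading ones
        }
    }

    public static void main(String[] args) {
        int[] samples = { 0x41, 0x80, 0xC3, 0xE2, 0xF0, 0xF8, 0xFC, 0xFE, 0xFF };
        for (int s : samples) {
            System.out.printf("0x%02X %s%n", s, classify((byte) s));
        }
    }
}
```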

There are two advantages to using these values as terminators. First, code points do not have to be interpreted. There is no need to search for single or double quotes or trailing NUL bytes. Thus,

CBF-8 decodes code points but NEVER evaluates their content

Second, illegal five, six or seven byte patterns cannot be exploited as malware. When a terminator is encountered CBF-8 reverts to seven-bit policy and any extender byte beginning with a one immediately throws an exception.

We are now in a position to define a field of literals.

LiteralField:
" Literal_opt Terminator
LiteralField , Literal_opt Terminator

Note how septet 73, the double quote character ", causes a transition to an eight-bit Unicode policy and how the terminator reverts to the seven-bit policy. The literal is encapsulated by the seven-bit policy so that every valid code point can be used without restriction. This is how CBF-8 separates text from everything else. In my next posts I will describe how individual numeric policies govern the morphology of numbers.