CBF-8 literals are based on standards compliant eight-bit Unicode Character Set (UCS) Transformation Format (UTF-8). Each member of the character set is called a code point. Let a literal be defined as a sequence of Unicode code points.
To symbolize each code point we need to bit-bust a sequence of bytes. Let each byte be symbolized using eight characters where 0 is a zero; 1 is a one; z is a zero or one and y is a zero or one where at least one y in a series of y is a one. We can now define a code point as follows:
11100000 101zzzzz 10zzzzzz
1110yyyy 10zzzzzz 10zzzzzz
11110000 10yyzzzz 10zzzzzz 10zzzzzz
11110yyy 10zzzzzz 10zzzzzz 10zzzzzz
The Unicode standard reserves two ten-bit ranges as surrogate pairs.
In standards compliant UTF-8 these ranges are not used. In modified UTF-8 (Oracle) surrogate pairs are referred to as code units. When combined, surrogate pairs yield a 20-bit value which when added to 0x10000 yields code points in the range 0x10000 to 0x10FFFF. These are also known as supplementary code points because they supplement the basic multilingual plane with sixteen additional planes.
11101101 1010zzzz 10zzzzzzleading surrogate
11101101 1011zzzz 10zzzzzztrailing surrogate
A simplified understanding of UTF-8 is that bytes beginning 10zzzzzz are extender bytes and that the number of ones preceding a zero in the most significant part of the lead byte specifies the total number of bytes comprising the code point. Code points can be up to four bytes long. So, when there are five, six or seven leading ones, UTF-8 regards these as illegal. CBF-8 regards these as literal terminators.
- UTF-8 Terminator:
111110zzillegal five-byte pattern
1111110zillegal six-byte pattern
11111110illegal seven-byte pattern
There are two advantages to using these values as terminators. First, code points do not have to be interpreted. There is no need to search for single or double quotes or trailing NUL bytes. Thus,
CBF-8 decodes code points but NEVER evaluates their content
Second, illegal five, six or seven byte patterns cannot be exploited as malware. When a terminator is encountered CBF-8 reverts to seven-bit policy and any extender byte beginning with a one immediately throws an exception.
We are now in a position to define a field of literals.
" Literalopt Terminator
LiteralField , Literalopt Terminator
Note how septet 73 the double quote character ” causes a transition to an eight-bit Unicode policy and how the terminator reverts back to the seven-bit policy. The literal is encapsulated by the seven-bit policy so that every valid code point can be used without restriction. This is how CBF-8 separates text from everything else. In my next posts I will describe how individual numeric policies govern the morphology of numbers.