1
0
Fork 0

Fix byte sequence difference algorithm

master
Michael Raitza 2021-01-31 22:56:09 +01:00
parent 1f66561cdc
commit c26dfad78d
5 changed files with 56 additions and 66 deletions

View File

@ -2,6 +2,22 @@ USING: arrays assocs byte-arrays help.markup help.syntax io.encodings.utf8
kernel math serialize trees.cb.private ; kernel math serialize trees.cb.private ;
IN: trees.cb IN: trees.cb
ARTICLE: "trees.cb" "Binary crit-bit trees"
"The " { $vocab-link "trees.cb" } " vocabulary is a library for binary critical bit trees, a variant of PATRICIA tries. A crit-bit tree stores each element of a non-empty set of keys " { $snippet "K" } " in a leaf node. Each leaf node is attached to the tree of internal split nodes for bit strings " { $snippet "x" } " such that " { $snippet "x0" } " and " { $snippet "x1" } " are prefixes of (serialized byte arrays of) elements in " { $snippet "K" } " and ancestors of other bit strings higher up in the tree. Split nodes store the prefix compressed as two values, the byte number and bit position, in the subset of " { $snippet "K" } " at which the prefixes of all ancestors to the left differ from all ancestors to the right."
$nl
"Serialization of keys is implemented using " { $link key>bytes } ". Crit-bit trees can store arbitrary keys and values, even mixed (but see implementation notes to " { $link key>bytes* } "). Due to the nature of crit-bit trees, for any given input key set that shares a common prefix, the tree compresses the common prefix into the split node at the joint extending the lookup by one node for arbitrarily long prefixes."
$nl
"Keys are serialized once for every lookup and insertion not adding a new leaf node. Two keys are serialized for every insertion adding a new leaf node to the tree."
$nl
"Due to ordering ancestors at split nodes into crit-bit '0' (left) and crit-bit '1' (right), the order of the elements in a crit-bit tree is total allowing efficient suffix searches and minimum searches."
$nl
"Crit-bit trees consume 2 * " { $emphasis "n" } " - 1 nodes in total for storing " { $emphasis "n" } " elements; each internal split node consumes two pointers and a fixnum and an integer; each leaf node two pointers to the key and value. Their shape is unique for any given set of keys, which also means lookup times are deterministic for a known set of keys regardless of insertion order or the tree having been cloned."
$nl
"Compared to hash tables, crit-bit trees provide fast access without being prone to malicious input (but see limitations of the standard implementation of " { $link key>bytes* } ") and also provide ordered operations (e.g. finding minimums). Compared to heaps, they support exact searches and suffix searches in addition. Compared to other ordered trees (AVL, B-), they support the same set of operations while keeping a simpler inner structure."
$nl
"Crit-bit trees conform to the assoc protocol."
;
HELP: CB{ HELP: CB{
{ $syntax "CB{ { key value }... }" } { $syntax "CB{ { key value }... }" }
{ $values { "key" "a key" } { "value" "a value" } } { $values { "key" "a key" } { "value" "a value" } }
@ -22,20 +38,4 @@ HELP: key>bytes*
{ $values { "key" object } { "bytes" byte-array } } { $values { "key" object } { "bytes" byte-array } }
{ $description "Converts a key, which can be any " { $link object } ", into a " { $link byte-array } ". Standard methods convert strings into its " { $link utf8 } " byte sequences and " { $link float } " values into byte arrays representing machine-specific doubles. Integrals are converted into a byte sequence of at least machine word size in little endian byte order." { $description "Converts a key, which can be any " { $link object } ", into a " { $link byte-array } ". Standard methods convert strings into its " { $link utf8 } " byte sequences and " { $link float } " values into byte arrays representing machine-specific doubles. Integrals are converted into a byte sequence of at least machine word size in little endian byte order."
$nl $nl
"All other objects are serialized using " { $link object>bytes } ". In the standard implementation, this maps " { $link f } " to the byte array " { $snippet "B{ 110 }" } " and " { $link t } " to " { $snippet "B{ 116 }" } ", which is identical to the respective integers." } ; "All other objects are serialized using " { $link object>bytes } ". In the standard implementation, this maps " { $link f } " to the byte array " { $snippet "B{ 110 }" } " and " { $link t } " to " { $snippet "B{ 116 }" } ", which is identical to using the respective literal byte arrays as inputs." } ;
ARTICLE: "trees.cb" "Binary crit-bit trees"
"The " { $vocab-link "trees.cb" } " vocabulary is a library for binary critical bit trees, a variant of PATRICIA tries. A crit-bit tree stores each element of a non-empty set of keys " { $snippet "K" } " in a leaf node. Each leaf node is attached to the tree of internal split nodes for bit strings " { $snippet "x" } " such that " { $snippet "x0" } " and " { $snippet "x1" } " are prefixes of (serialized byte arrays of) elements in " { $snippet "K" } " and ancestors of other bit strings higher up in the tree. Split nodes store the prefix compressed as two values, the byte number and bit position, in the subset of " { $snippet "K" } " at which the prefixes of all ancestors to the left differ from all ancestors to the right."
$nl
"Serialization of keys is implemented using " { $link key>bytes } ". Crit-bit trees can store arbitrary keys and values, even mixed (but see implementation notes to " { $link key>bytes* } "). Due to the nature of crit-bit trees, for any given input key set that shares a common prefix, the tree compresses the common prefix into the split node at the root extending the lookup by one for arbitrary long prefixes."
$nl
"Keys are serialized once for every lookup and insertion not adding a new leaf node. Two keys are serialized for every insertion adding a new leaf node to the tree."
$nl
"Due to ordering ancestors at split nodes into crit-bit '0' (left) and crit-bit '1' (right), the order of the elements in a crit-bit tree is total allowing efficient suffix searches and minimum searches."
$nl
"Crit-bit trees consume 2 * " { $emphasis "n" } " - 1 nodes in total for storing " { $emphasis "n" } " elements; each internal split node consumes two pointers and two fixnums; each leaf node two pointers to the key and value. Their shape is unique for any given set of keys, which also means lookup times are deterministic for a known set of keys regardless of insertion order or the tree having been cloned."
$nl
"Compared to hash tables, crit-bit trees provide fast access without being prone to malicious input (but see limitations of the standard implementation of " { $link key>bytes* } ") and also provide ordered operations (e.g. finding minimums). Compared to heaps, they support exact searches and suffix searches in addition. Compared to other ordered trees (AVL, B-), they support the same set of operations while keeping a simpler inner structure."
$nl
"Crit-bit trees conform to the assoc protocol."
;

View File

@ -1,24 +0,0 @@
USING: assocs kernel tools.test trees trees.cb trees.private ;
IN: trees.cb.tests
! Insertion into empty tree
{ T{ cb { root T{ node { key 0 } { value 0 } } } { count 1 } } } [
0 0 <cb> [ set-at ] keep
] unit-test
! Insertion into a leaf-node resulting in splitting
{
T{ cb
{ root
T{ cb-node
{ bits 247 }
{ left T{ node { key 1 } { value 1 } } }
{ right T{ node { key 0 } { value 0 } } }
}
}
{ count 2 }
}
} [
0 0 <cb> [ set-at ] keep
1 1 rot [ set-at ] keep
] unit-test

17
trees/cb/cb-tests.factor Normal file
View File

@ -0,0 +1,17 @@
USING: assocs kernel tools.test trees trees.cb trees.cb.private trees.private ;
IN: trees.cb.tests
CONSTANT: 4tree CB{ { 0 0 } { 1 1 } { 2 2 } { 3 3 } }
! Insertion into an empty tree
{ CB{ { 0 0 } } } [
0 0 <cb> [ set-at ] keep
] unit-test
! Insertion into a leaf-node resulting in splitting
{
CB{ { 0 0 } { 1 1 } }
} [
0 0 <cb> [ set-at ] keep
1 1 rot [ set-at ] keep
] unit-test

View File

@ -16,9 +16,9 @@
USING: accessors alien arrays assocs byte-arrays combinators USING: accessors alien arrays assocs byte-arrays combinators
combinators.short-circuit fry io.binary io.encodings.binary io.encodings.private combinators.short-circuit fry io.binary io.encodings.binary io.encodings.private
io.encodings.string io.encodings.utf8 kernel layouts locals make math io.encodings.string io.encodings.utf8 kernel layouts locals make math math.order
math.private namespaces parser prettyprint.custom sequences serialize strings math.private namespaces parser prettyprint.custom sequences sequences.private
trees trees.private vectors ; serialize strings trees trees.private vectors ;
IN: trees.cb IN: trees.cb
TUPLE: cb < tree ; TUPLE: cb < tree ;
@ -27,7 +27,7 @@ TUPLE: cb < tree ;
<PRIVATE <PRIVATE
TUPLE: cb-node { byte# integer } { bits integer } left right ; TUPLE: cb-node { byte# integer } { bits fixnum } left right ;
: new-node ( byte# bits class -- node ) : new-node ( byte# bits class -- node )
new new
@ -54,31 +54,27 @@ TUPLE: cb-node { byte# integer } { bits integer } left right ;
bitxor msb0 bitxor msb0
[ key-side ] keep ; [ key-side ] keep ;
: elt-from-long-seq ( seq1 seq2 -- elt i/f ) : nth0 ( n seq -- elt/0 )
2dup [ length ] bi@ { ?nth [ 0 ] unless* ;
{ [ 2dup > ] [ 2nip [ swap nth ] keep ] }
{ [ 2dup < ] [ drop [ drop ] 2dip [ swap nth ] keep ] }
[ 4drop 0 f ]
} cond ;
: order-by-length ( seq1 seq2 -- seq-short seq-long ) : 2nth0 ( n seq1 seq2 -- elt1/0 elt2/0 )
2dup [ length ] bi@ > [ swap ] when ; [ nth0 ] bi-curry@ bi ;
! For two byte strings, calculate the critical bit, byte and direction of ! For two byte strings, calculate the critical bit, byte and direction of
! difference. ! difference. For meaningful results ensure that newbytes ≠ oldbytes
: (bytes-diff) ( newbytes oldbytes -- side bits byte# ) : bytes-diff ( newbytes oldbytes -- side bits byte# )
2dup mismatch 2dup mismatch
[ [
[ '[ _ swap nth ] bi@ byte-diff ] keep [ -rot 2nth-unsafe byte-diff ] keep
] [ ] [
! Equal prefix over full (shorter) byte sequence. ! [ [ length ] bi@ = ] 2keep rot
elt-from-long-seq [ [ 0 ] dip ] [ ] if* ; ! [ 2drop 0 0 f ]
[ 1 255 ] 2dip shorter length 1 - ! [
[ min-length dup ] 2keep
2nth0 byte-diff rot
! ] if
] if* ; ] if* ;
: bytes-diff ( newbytes oldbytes -- side bits byte#/f )
bytes-diff ;
PRIVATE> PRIVATE>
GENERIC: key>bytes* ( key -- bytes ) GENERIC: key>bytes* ( key -- bytes )
@ -116,7 +112,7 @@ SYMBOL: new-side
! Extract the critical byte ! Extract the critical byte
: byte-at ( byte# -- byte/0 ) : byte-at ( byte# -- byte/0 )
key-bytes get ?nth [ 0 ] unless* ; key-bytes get nth0 ;
! For the current key and cb-node determin which side to go next ! For the current key and cb-node determin which side to go next
: select-side ( node -- node side ) : select-side ( node -- node side )
@ -159,7 +155,7 @@ M: f cb-update
! or create a new split node and attach a fresh leaf node with the new key and ! or create a new split node and attach a fresh leaf node with the new key and
! value. ! value.
M: node cb-update M: node cb-update
dup key>> current-key get = [ dup key>> key>bytes key-bytes get = [
current-key get >>key current-key get >>key
swap >>value f swap >>value f
] [ ] [
@ -303,7 +299,7 @@ SYNTAX: CB{
M: cb assoc-like drop dup cb? [ >cb ] unless ; M: cb assoc-like drop dup cb? [ >cb ] unless ;
M: cb pprint-delims drop \ CB{ \ } ; M: cb pprint-delims drop \ CB{ \ } ;
M: cb >pprint-sequence >alist ; M: cb >pprint-sequence >cb-alist ;
M: cb pprint-narrow? drop t ; M: cb pprint-narrow? drop t ;
PRIVATE> PRIVATE>

View File

@ -1 +1,2 @@
Critical bit trees as described in http://cr.yp.to/critbit.html Critical bit trees as described in http://cr.yp.to/critbit.html.
They are implemented as subclasses of trees.