Saturday, June 18, 2016

String operations and bioinformatics

Strings make it possible to generalize the concept of sets. In BigZ a set is a nested structure of sets and lists, like

{ 123 ( 234 { 345 456 { 567 678 } } ) } cr zet. 
{123,(234,{345,456,{567,678}})} ok

and the only lack of generality concerns the atomic elements, which must be non-negative single numbers. But virtually anything can be denoted as a string, which can in turn be interpreted as a list of characters:

s" {Hello world!,How are you?}" >str stringset>zet cr zet.
{(72,101,108,108,111,32,119,111,114,108,100,33),(72,111,119,32,97,114,101,32,121,111,117,63)} ok
In this way, sets of big integers, Gaussian integers, etc. can also be elements of sets.
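
The encoding shown above can be modelled outside Forth. The following is only a Python sketch, not BigZ code: the function name stringset_to_codes is hypothetical, it handles flat sets only (no nesting), and it keeps the items in input order, whereas BigZ may store the elements of a set in a different order.

```python
def stringset_to_codes(text):
    """Model a flat string set "{a,b,...}" as a list of tuples of character codes.

    Hypothetical helper for illustration; BigZ's stringset>zet is the real word.
    """
    inner = text.strip()
    if not (inner.startswith("{") and inner.endswith("}")):
        raise ValueError("expected a string of the form {item,item,...}")
    body = inner[1:-1]
    # An empty body means the empty set; otherwise items are comma separated.
    items = body.split(",") if body else []
    # Each item becomes a list of character codes, like (72,101,...) above.
    return [tuple(ord(ch) for ch in item) for item in items]


codes = stringset_to_codes("{Hello world!,How are you?}")
print(codes[0])  # (72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33)
print(codes[1])  # (72, 111, 119, 32, 97, 114, 101, 32, 121, 111, 117, 63)
```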

A nice way to handle strings in Forth is to use a string stack, which in this implementation consists of two stacks: one for the arrays of ASCII characters and one for the addresses of those arrays.
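
The two-stack idea can be illustrated with a small Python model. This is a sketch under assumptions, not the BigZ implementation: the class StringStack and its method names are hypothetical, and it assumes that the characters of all strings are stored back to back while a second stack records where each string starts.

```python
class StringStack:
    """Toy model of a string stack built from two stacks (assumed layout)."""

    def __init__(self):
        self.chars = bytearray()  # stack of characters, strings stored back to back
        self.starts = []          # stack of start offsets, one per string

    def push(self, s):
        """Push a string: remember where it starts, then append its characters."""
        self.starts.append(len(self.chars))
        self.chars += s.encode("ascii")

    def pop(self):
        """Pop the top string and release its characters."""
        start = self.starts.pop()
        s = self.chars[start:].decode("ascii")
        del self.chars[start:]
        return s

    def concat(self):
        """Concatenate the two top strings.

        Because they are adjacent in the character stack, it is enough to
        forget the boundary between them.
        """
        if len(self.starts) < 2:
            raise IndexError("need two strings on the stack")
        self.starts.pop()


st = StringStack()
st.push("Hello ")
st.push("world")
st.concat()
print(st.pop())  # Hello world
```

With this layout, concatenation costs nothing but dropping one offset, which is one reason a design of this kind is convenient.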

>str \ ad n -- string    Push a string on the stack
str> \ string -- ad n    Pop a string from the stack
str@ \ string -- string | -- ad n

sempty \ string -- string | -- flag
.str \ --  Prints the stack without changing it
str. \ str --  Print and drop the topmost element

sdup, sdrop, sover, snip, sswap, srot, stuck and spick perform the usual stack operations on the string stack.
soover \ str1 str2 str3 -- str1 str2 str3 str1
A shorter way to enter strings from the command line is
s Hello world"
However, in definitions one must use 
s" Hello world" >str
Some words for string manipulations:
s& \ s1 s2 -- s1&s2   Concatenation
sleft \ s1 -- s2 | n --  Skip all but the n leftmost characters
sright \ s1 -- s2 | n -- The same for the n rightmost characters
ssplit \ s -- s' s" | n --  Split string after the nth character
sanalyze \ s1 s2 -- s1 s3 s1 s4 / s2 | -- flag 
Splits s2 if s1 is a part of s2; if the flag is true, then s2=s3&s1&s4.
substring \ s1 s2 -- s1 s2 | -- flag
sreplace \ s1 s2 s3 -- s4    Replace s2 with s1 in s3
scomp \ s1 s2 -- | -- n    -1:s1>s2, +1:s1<s2, 0:s1=s2
snull \ -- emptystring
schr& \ s -- s' | ch --   Concatenate ch to top string
slen= \ s1 s2 -- | -- flag   Test if same length
strail \ s -- s'  Remove trailing spaces
>capital \ ch -- ch'  Change lower-case (common) letter to capital
>common \ ch -- ch'  The opposite
capital \ ch -- flag  Test if capital letter
common \ ch -- flag  Test if lower-case (common) letter
slower \ s -- s'  Change string to lower case
supper \ s -- s'  Change string to upper case
str>ud \ s -- s' | -- ud flag   Unsigned double from string
str>d \ s -- s' | -- d flag     Double from string
snobl \ s -- s'      Remove all blanks
sjustabc \ s -- s'   Remove all characters except English letters
alphabet \ s -- s'   Gives the alphabet of string
zet>stringset \ set -- string
stringset>zet \ string -- set
sunion \ str1 str2 -- str3
sintersection \ str1 str2 -- str3
sdiff \ str1 str2 -- str3
s {brown,red,orange,yellow,green}"  ok
s {blue,violet,brown,black}"  ok
sunion str. {black,brown,violet,blue,green,yellow,orange,red} ok
hamming \ s1 s2 -- s1 s2 | -- n   The Hamming distance
editdistance \ s1 s2 -- s1 s2 | -- n   The Levenshtein distance
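
For readers unfamiliar with the two distances, here is a Python sketch of what they measure. It is not the BigZ code and is not included in BigZ: the function names are chosen for illustration, the Levenshtein distance is computed with the standard row-by-row dynamic programming method, and raising an error for strings of different length in hamming is an assumption, since the behaviour of the BigZ word in that case is not stated here.

```python
def hamming(s1, s2):
    """Number of positions where two equal-length strings differ."""
    if len(s1) != len(s2):
        # Assumption: the Hamming distance is only defined for equal lengths.
        raise ValueError("strings must have the same length")
    return sum(a != b for a, b in zip(s1, s2))


def edit_distance(s1, s2):
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    # prev[j] holds the distance between the prefix of s1 seen so far and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, b in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from s1
                curr[j - 1] + 1,         # insert b into s1
                prev[j - 1] + (a != b),  # substitute a by b (free if equal)
            ))
        prev = curr
    return prev[-1]


print(hamming("GATTACA", "GACTATA"))       # 2
print(edit_distance("kitten", "sitting"))  # 3
```

In bioinformatics the Hamming distance counts point mutations between aligned sequences of equal length, while the Levenshtein distance also allows for insertions and deletions.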

The string words described above are now included in the BigZ code.