Repeat-Finding
A. First Edition
This is a small practice of "radix sort". Actually it is what I read from "Battle of Wits" which gave a brief
introduction of the application of "automaton" which was invented by IBM in "code-breaking" area.
If you have taken "data structure" course, I guess you know "radix sort". But I am not sure if you
understand it or know how to use it. For example, I personally don't know what I can use for "radix
sort" at all before I read my favourite book---"Battle of Wits".
In World War II, American secret service needed to break the code of Japanese. One way to do it is
to find out so-called "double hits", which means the repeated string in scrambled code. So, here is
the question: How to find out a certain length of repeated string in a text file?
It seems easy at first glance. However, what is your way if you take looking at following text:
THEALPHATESTDISCARDSFRAGMENTSDEPENDINGONTHEOUTCOMEOFACOMPARISONBETWEENTHEINCOMINGFRAGMENTSALPHAVA
LUESANDACONSTANTREFERENCEVALUETHEGLALPHAFUNCFUNCTIONSPECIFIESTHEREFERENCEANDCOMPARISONFUNCTIONTHE
COMPARISONISPERFORMEDONLYIFALPHATESTINGISENABLEDFORMOREINFORMATIONONGLALPHATESTSEEGLENABLETHEFUNC
ANDREFPARAMETERSSPECIFYTHECONDITIONSUNDERWHICHTHEPIXELISDRAWNTHEINCOMINGALPHAVALUEISCOMPAREDTOREF
USINGTHEFUNCTIONSPECIFIEDBYFUNCIFTHECOMPARISONPASSESTHEINCOMINGFRAGMENTISDRAWNCONDITIONALONSUBSEQ
UENTSTENCILANDDEPTHBUFFERTESTSIFTHECOMPARISONFAILSNOCHANGEISMADETOTHEFRAMEBUFFERATTHATPIXELLOCATI
ONTHEGLALPHAFUNCFUNCTIONOPERATESONALLPIXELWRITESINCLUDINGTHOSERESULTINGFROMTHESCANCONVERSIONOFPOI
NTSLINESPOLYGONSANDBITMAPSANDFROMPIXELDRAWANDCOPYOPERATIONSTHEGLALPHAFUNCFUNCTIONDOESNOTAFFECTSCR
EENCLEAROPERATIONSALPHATESTINGISDONEONLYINRGBAMODETHEFOLLOWINGFUNCTIONSRETRIEVEINFORMATIONRELATED
TOTHEGLALPHAFUNCFUNCTION
In order to make the text look like real enciphered, I removed all punctuations and white spaces.
The IBM machine can sort punch cards automatically which was used to find out repeated strings. The sorting
method is exactly "radix sort" which clarify each string by its indexed characters. (I don't know what I am
talking about at all.)
First, you sort all string by its first character. Second, you sort them by its second character...Finally, you
just look those strings who are neighbours. You compare them to see if they are truly repeating each other.
(After I finished my little practice, I called Mr. Zhu and tried to impress him by asking the question. Quite
to my surprise, he quickly gave me the most obvious solution which should be used by almost 95% of c/c++
programmer: You start from the first character in the stream, then your second pointer move from first to end
to try to find the repeated one. That's it. Even though it is square(n), it works!!)
E.Further improvement
It is a small game and no one would bother to treat them seriously, I presume.
F.File listing
1. repeat.cpp(main)
file name: repeat.cpp
#include <stdio.h> #include <stdlib.h> const int MaxRepeatNumber=300; const int MaxStringLen=20; const int MaxFileLength=2048; const int MaxBinNumber=26; class Repeat { private: char buffer[MaxFileLength]; int current; int source[MaxBinNumber][MaxRepeatNumber]; int target[MaxBinNumber][MaxRepeatNumber]; int sourceCounter[MaxBinNumber]; int targetCounter[MaxBinNumber]; public: void readFile(char* fileName); void findRepeat(); void report(); }; int main() { Repeat R; R.readFile("nick.txt"); R.findRepeat(); R.report(); return 0; } void Repeat::findRepeat() { int pos; for (pos=0; pos<current; pos++) { source[buffer[pos]-'A'][sourceCounter[buffer[pos]-'A']++]=pos; } for (int i=1; i<MaxStringLen; i++) { //for all character for (int bin=0; bin<MaxBinNumber; bin++) { //for each character which has similar character for (int count=0; count<sourceCounter[bin]; count++) { pos=source[bin][count]; if (pos+i<current) { target[buffer[pos+i]-'A'][targetCounter[buffer[pos+i]-'A']++]=pos; } } } //now copy target back to source for (bin=0; bin<MaxBinNumber; bin++) { for (int count=0; count<targetCounter[bin]; count++) { source[bin][count]=target[bin][count]; } sourceCounter[bin]=targetCounter[bin]; targetCounter[bin]=0; } } } void Repeat::report() { int first, second; bool found; printf("first let me show you the strings\n%s\n\n", buffer); for (int i=0; i<MaxBinNumber; i++) { first=0; found=true; for (second=1; second<sourceCounter[i]; second++) { found=true; for (int index=0; index<MaxStringLen; index++) { if (source[i][second]+index>=current) { found=false; break; } if (buffer[source[i][first]+index]!=buffer[source[i][second]+index]) { found=false; break; } } if (found&&first!=second) { printf("found repeat string of length %d at first=%d, second=%d\n", MaxStringLen, source[i][first], source[i][second]); printf("the string is:"); for (int j=0; j<MaxStringLen; j++) { printf("%c", buffer[source[i][first]+j]); } printf("\n"); } first=second; } } } void Repeat::readFile(char* fileName) { FILE* stream; char ch; current=0; stream=fopen(fileName, "r"); ch=fgetc(stream); if (ch>='A'&&ch<='Z'||ch>='a'&&ch<='z') { buffer[current]=toupper(ch); } while (!feof(stream)) { ch=fgetc(stream); if (ch>='A'&&ch<='Z'||ch>='a'&&ch<='z') { current++; buffer[current]=toupper(ch); } } current++; buffer[current]='\0'; for (int i=0; i<MaxBinNumber; i++) { sourceCounter[i]=targetCounter[i]=0; } }
The result is like following:
first let me show you the strings THEALPHATESTDISCARDSFRAGMENTSDEPENDINGONTHEOUTCOMEOFACOMPARISONBETWEENTHEINCOMINGFRAGMENTSALPHAVALUESANDACONSTANTREFERENCE
VALUETHEGLALPHAFUNCFUNCTIONSPECIFIESTHEREFERENCEANDCOMPARISONFUNCTIONTHECOMPARISONISPERFORMEDONLYIFALPHATESTINGISENABLEDFO
RMOREINFORMATIONONGLALPHATESTSEEGLENABLETHEFUNCANDREFPARAMETERSSPECIFYTHECONDITIONSUNDERWHICHTHEPIXELISDRAWNTHEINCOMINGALP
HAVALUEISCOMPAREDTOREFUSINGTHEFUNCTIONSPECIFIEDBYFUNCIFTHECOMPARISONPASSESTHEINCOMINGFRAGMENTISDRAWNCONDITIONALONSUBSEQUEN
TSTENCILANDDEPTHBUFFERTESTSIFTHECOMPARISONFAILSNOCHANGEISMADETOTHEFRAMEBUFFERATTHATPIXELLOCATIONTHEGLALPHAFUNCFUNCTIONOPER
ATESONALLPIXELWRITESINCLUDINGTHOSERESULTINGFROMTHESCANCONVERSIONOFPOINTSLINESPOLYGONSANDBITMAPSANDFROMPIXELDRAWANDCOPYOPER
ATIONSTHEGLALPHAFUNCFUNCTIONDOESNOTAFFECTSCREENCLEAROPERATIONSALPHATESTINGISDONEONLYINRGBAMODETHEFOLLOWINGFUNCTIONSRETRIEV
EINFORMATIONRELATEDTOTHEGLALPHAFUNCFUNCTION found repeat string of length 20 at first=127, second=584 the string is:THEGLALPHAFUNCFUNCTI found repeat string of length 20 at first=584, second=738 the string is:THEGLALPHAFUNCFUNCTI found repeat string of length 20 at first=738, second=875 the string is:THEGLALPHAFUNCFUNCTI found repeat string of length 20 at first=129, second=586 the string is:EGLALPHAFUNCFUNCTION found repeat string of length 20 at first=586, second=740 the string is:EGLALPHAFUNCFUNCTION found repeat string of length 20 at first=740, second=877 the string is:EGLALPHAFUNCFUNCTION found repeat string of length 20 at first=128, second=585 the string is:HEGLALPHAFUNCFUNCTIO found repeat string of length 20 at first=585, second=739 the string is:HEGLALPHAFUNCFUNCTIO found repeat string of length 20 at first=739, second=876 the string is:HEGLALPHAFUNCFUNCTIO