Inexact Match
A. First Edition
This is a small test program as suggested by Mr. Li. It is a kind of simplified inexact match. The idea is a kind
of straight forward and it is like a kind of scanner.
B.The problem
In China, many people use their mobile phone as their address book. They may store contacts with UNICODE or ASCII
code. So, the searching problem is a bit complicated.
We assume following conditions:
1. The index for ASCII code is using its own ASCII string.
2. The UNICODE string is using their "pinyin" string and especially in embedded system the storage resource is
very tight and the developer may not want to store index as "pinyin" along with UNICODE string. Since there are
only limited "pinyin" combinations, they are willing to consider that conversion from UNICODE to "pinyin" is fast.
3. The order between a ASCII code string and a UNICODE string with same letter is that ASCII goes before UNICODE.
i.e. The "pinyin" for my name of UNICODE is "huang qing zhe". If I also input an ASCII code string "huang qing zhe",
I will expect to find ASCII code before UNICODE. And the UNICODE is before ASCII string of "goo" since ASCII has
a letter "g" which is after "pinyin" of "h".
4. An inexact match is performed by accepting user's input string and should return all matched string.
i.e. UNICODE string with "pinyin" of "huang qing zhe" will be considered to be match with "hqz", "huqizh",
"huangqz", or "hqzhe". And the following is NOT matched: "haqz", "hugqingzhe" etc.
5. However, this kind of grammar has one ambiguity: i.e. Suppose we have an ASCII string "shang hai shi". And the
searching string is "shs". There can be two explanation: one is that it should be matched since "shs" are all
starting letter for all words. The other is that it is not matched since "sh" is matched with first word "shang".
And letter "s" will not be matched in second word "hai" any more.
This kind of ambiguity is usually solved by pre-defined priority of searching pattern.
C.The
idea of
programOriginally Mr. Li suggested me to draw the possible DFA. But I feel it is a kind of more difficult than coding.
The dictionary is input by a file named "nick.txt" and it should be the same directory along with executable file.
The input file format is like following:
[number of words in name] [flag for UNICODE, 1==UNICODE] [string1, string2 ...]
D.The major functions
The findFirst function is returning the insertion point for new-added name.
The function "unico2pinyin" is only for simplification and you should take advantage of some library in your
platform.
E.Further improvement
The searching for matching can be improved if you make use of the property that the address book is already sorted.
กก
F.File listing
1. addressbook.cpp
file name: addressbook.h
#include <iostream> #include <fstream> using namespace std; const int LongestWordLength=10; const int MaxWordNumber=10; const int MaxNameNumber=100; char* fileName="nick.txt"; char* unico2pinyin(char* str); struct MyWord { bool isUNICODE; char myword[LongestWordLength]; bool isSmaller(MyWord& other); void display(); char* toString(); }; struct MyName { int wordCount; MyWord mywords[MaxWordNumber]; void display(); bool match(char* str); }; class MyBook { private: MyName mynames[MaxNameNumber]; int nameCount; public: void initialize(); MyBook(); //void addNewName(MyName& newName); void addNewName(bool isUnico, int wordCount, char newWords[][LongestWordLength]); int findFirst(bool isUnico, char* str); void display(); void findMatch(char* str); }; int main() { char buf[LongestWordLength]; MyBook book; book.initialize(); book.display(); printf("now let's find the match\n"); //book.findMatch("mozd"); do { printf("input your searching string>"); scanf("%s", buf); if (strcmp(buf, "exit")==0) { break; } book.findMatch(buf); } while (true); return 0; } void MyBook::findMatch(char* str) { char* ptr; for (int i=0; i<nameCount; i++) { ptr=mynames[i].mywords[0].toString(); if (mynames[i].match(str)) { printf("find a match:"); mynames[i].display(); } } } // bool MyName::match(char* str) { bool canGoOn; int source=0, target=0, index=0; char* ptr; if (mywords[0].myword[0]!=str[0]) { return false; } //end condition is exhaust source string while (str[source]!='\0') { canGoOn=false; target=0; //ending condition is "not cangoon" ptr=mywords[index].toString(); while (ptr[target]!='\0') { //if match if (ptr[target]==str[source]) { canGoOn=true; source++; target++; if (str[source]=='\0') { return true; } } else { break; } } if (!canGoOn) { return false; } index++; if (index==wordCount) { return false; } } return true; } MyBook::MyBook() { nameCount=0; } void MyBook::initialize() { ifstream in; int count, unicoFlag; char names[MaxWordNumber][LongestWordLength]; in.open(fileName); do { in>>count>>unicoFlag; for (int i=0; i<count; i++) { in>>names[i]; } addNewName(unicoFlag==1, count, names); } while (!in.eof()); } void MyName::display() { if (mywords[0].isUNICODE) { printf("unicode and corresponding pinyin is:"); } for (int i=0; i<wordCount; i++) { mywords[i].display(); printf(","); } printf("\n"); } void MyWord::display() { if (isUNICODE) { printf("%s", unico2pinyin(myword)); } else { printf("%s", myword); } } void MyBook::display() { for (int i=0; i<nameCount; i++) { mynames[i].display(); } } void MyBook::addNewName(bool isUnico, int wordCount, char newWords[][LongestWordLength]) { char* ptr; int index; if (isUnico) { ptr=unico2pinyin(newWords[0]); } else { ptr=newWords[0]; } index=findFirst(isUnico, ptr); //shift array to insert for (int i=nameCount; i>index; i--) { mynames[i]=mynames[i-1]; } mynames[i].wordCount=wordCount; for(i=0; i<wordCount; i++) { mynames[index].mywords[i].isUNICODE=isUnico; strcpy(mynames[index].mywords[i].myword, newWords[i]); } nameCount++; } //this is the simplest conversion function, I just want to simplify my //operation. char* unico2pinyin(char* str) { return str; } char* MyWord::toString() { if (isUNICODE) { return unico2pinyin(myword); } else { return myword; } } bool MyWord::isSmaller(MyWord& other) { char* myself, *theOther; myself=toString(); theOther=other.toString(); //same type, simple comparison if (isUNICODE&&other.isUNICODE||!isUNICODE&&!other.isUNICODE) { return strcmp(myself, theOther)<=0; } else { //here is different type //different type and different "class" if (myself[0]!=theOther[0]) { return strcmp(myself, theOther)<0; } else { //same "class", the unicode is bigger in index return !isUNICODE; } } } int MyBook::findFirst(bool isUnico, char* str) { //you can use binary search too //char* ptr; //bool afterUnico=false; MyWord temp; temp.isUNICODE=isUnico; strcpy(temp.myword, str); for (int i=0; i<nameCount; i++) { if (temp.isSmaller(mynames[i].mywords[0])) { return i; } /* //just skip until first non-unico or end of address book if (afterUnico) { if (!mynames[i].mywords[0].isUNICODE) { return i; } else { ptr=unico2pinyin(mynames[i].mywords[0]); if (strcmp(str, ptr)>0) { return i; } } } else { //in case we have only unico in address book if (mynames[i].mywords[0].isUNICODE) { ptr=unico2pinyin(mynames[i].mywords[0].myword); if (ptr[0]==ch) { afterUnico=true; continue; } } else { if (mynames[i].mywords[0].myword[0]==ch) { return i; } } } */ } //this is the case of end of address book return i; }
กก
running result: (This is the running results of 10 seconds. The position of each quantum is
just approximation since I am trying to use integer to map double.)
The input file is named "nick.txt" and it is like this:
3 1 huang qing zhe
2 0 xiang dong
3 1 zheng shu zhao
2 1 yong jun
4 0 xiao xiao yu er
5 1 this is a unicode test
1 0 ascii
2 1 many test
3 1 yang zi wen
3 0 hong yang yang
2 0 zhang qing
3 1 li yi hong
3 1 xiao yong jun
3 0 xiao jian dong
3 1 li qing qing
2 0 xu jin
2 1 huang yong
3 1 huang chang zheng
3 1 zhu chun ming
3 0 zhu chun min
3 0 zhu jian min
3 1 ni xiang li
3 0 wu hui min
3 1 li teng long
3 0 chen zhi xiong
3 1 hong xiang yang
3 0 su cong wei
3 1 su dong po
2 1 yang yang
2 1 yang tao
3 1 yang jian zhong
3 1 huang hong dao
3 1 wu qi wei
3 1 jiang jie shi
3 0 mao ze dong
2 1 zhu de
3 1 ren bi shi
3 0 du yu min
2 0 xiao sun
2 0 zhang yan
3 0 li ke xin
3 0 wo shi shui
3 0 huang shi ren
3 0 xu jia hui
3 0 bei jin shi
3 0 shang hai shi
3 1 yang zhen ning
3 1 li zheng dao
3 1 wu de w
4 0 a pei a wang
2 0 cheng long
3 1 ma jun ren
2 0 liang zhan
3 1 li deng hui
3 0 su jin qiang
3 1 huang xi jin
3 0 huang tian ci
2 0 tian ci
3 0 zhang yun lei
3 0 ma jia jun
กก
The following is the running result:
a,pei,a,wang,
ascii,
bei,jin,shi,
chen,zhi,xiong,
cheng,long,
du,yu,min,
hong,yang,yang,
huang,tian,ci,
huang,shi,ren,
unicode and corresponding pinyin is:hong,xiang,yang,
unicode and corresponding pinyin is:huang,xi,jin,
unicode and corresponding pinyin is:huang,hong,dao,
unicode and corresponding pinyin is:huang,chang,zheng,
unicode and corresponding pinyin is:huang,yong,
unicode and corresponding pinyin is:huang,qing,zhe,
unicode and corresponding pinyin is:jiang,jie,shi,
li,ke,xin,
liang,zhan,
unicode and corresponding pinyin is:li,deng,hui,
unicode and corresponding pinyin is:li,zheng,dao,
unicode and corresponding pinyin is:li,teng,long,
unicode and corresponding pinyin is:li,qing,qing,
unicode and corresponding pinyin is:li,yi,hong,
ma,jia,jun,
mao,ze,dong,
unicode and corresponding pinyin is:ma,jun,ren,
unicode and corresponding pinyin is:many,test,
unicode and corresponding pinyin is:ni,xiang,li,
unicode and corresponding pinyin is:ren,bi,shi,
shang,hai,shi,
su,jin,qiang,
su,cong,wei,
unicode and corresponding pinyin is:su,dong,po,
tian,ci,
unicode and corresponding pinyin is:this,is,a,unicode,test,
wo,shi,shui,
wu,hui,min,
unicode and corresponding pinyin is:wu,de,w,
unicode and corresponding pinyin is:wu,qi,wei,
xiang,dong,
xiao,sun,
xiao,jian,dong,
xiao,xiao,yu,er,
xu,jia,hui,
xu,jin,
unicode and corresponding pinyin is:xiao,yong,jun,
unicode and corresponding pinyin is:yang,zhen,ning,
unicode and corresponding pinyin is:yang,jian,zhong,
unicode and corresponding pinyin is:yang,tao,
unicode and corresponding pinyin is:yang,yang,
unicode and corresponding pinyin is:yang,zi,wen,
unicode and corresponding pinyin is:yong,jun,
zhang,yun,lei,
zhang,yan,
zhang,qing,
zhu,jian,min,
zhu,chun,min,
unicode and corresponding pinyin is:zheng,shu,zhao,
unicode and corresponding pinyin is:zhu,de,
unicode and corresponding pinyin is:zhu,chun,ming,
now let's find the match
input your searching string>yy
find a match:unicode and corresponding pinyin is:yang,yang,
input your searching string>yzn
find a match:unicode and corresponding pinyin is:yang,zhen,ning,
input your searching string>yanzh
find a match:unicode and corresponding pinyin is:yang,zhen,ning,
input your searching string>yanzhn
find a match:unicode and corresponding pinyin is:yang,zhen,ning,
input your searching string>yatao
find a match:unicode and corresponding pinyin is:yang,tao,
input your searching string>yo
find a match:unicode and corresponding pinyin is:yong,jun,
input your searching string>t
find a match:tian,ci,
find a match:unicode and corresponding pinyin is:this,is,a,unicode,test,
input your searching string>tianc
find a match:tian,ci,
input your searching string>zcm
find a match:zhu,chun,min,
find a match:unicode and corresponding pinyin is:zhu,chun,ming,
input your searching string>zcming
find a match:unicode and corresponding pinyin is:zhu,chun,ming,
input your searching string>apei
find a match:a,pei,a,wang,
input your searching string>
กก