
String Parsing Optimization


2003-09-30 07:45:37 AM
cppbuilder114
Thanks, Gambit, for helping out with the last problem with memory issues. I've
run across another little stumbling block. It works great when I only have
a couple of delimiters, but when I start using a lot of delimiters, say 30, it
really starts to slow down the processing. Basically, I loop through my
delimiters and determine which one is closest to the front of the
string. The closest one I use as the delimiter to extract my next
token from the string. Does anyone know of a quicker way of determining this, or
how I could optimize my loop?
The code between the comments that look like this is what needs to
be optimized:
//=======
//*********
code
//*********
//=======
Just including these so you know the types of everything:
...
private:
AnsiString FBaseString;
TStringList* FDelimiters;
TStringList* FTokens;
TStringList* FAlphabetizedTokens;
bool FNeedsAlphabetized;
bool FNeedsTokenCount;
int* FTokenCount;
int FTotalTokensCount;
int FUniqueTokensCount;
...
void CTokenizer::Parse()
{
    AnsiString strTempBase;
    int delimiterCount = FDelimiters->Count;
    // Append the first delimiter so the final token is always terminated.
    if(delimiterCount)
        strTempBase = FBaseString + FDelimiters->Strings[0];
    int maxPos = strTempBase.Length();
    FTokens->Clear();
    int pos = 1; // AnsiString indices are 1-based
    AnsiString strDelimiter;
    AnsiString strToken;
    FTotalTokensCount = 0;
    char* ptrStart = strTempBase.c_str();
    char* ptrEnd;
    char* ptrEnd2;
    int whichDelimiter = -1;
    while(pos < maxPos)
    {
        if(delimiterCount)
        {
            strDelimiter = FDelimiters->Strings[0];
            whichDelimiter = 0;
        }
        ptrEnd = AnsiStrPos(ptrStart, strDelimiter.c_str());
        //=======================================
        //******************************************************
        // This is the section of code that needs to be optimized.
        // Basically, I'm looping through my TStringList of delimiters and
        // determining which one is the closest delimiter.
        if(delimiterCount > 1)
        {
            for(int i = 1; i < delimiterCount; i++)
            {
                ptrEnd2 = AnsiStrPos(ptrStart, FDelimiters->Strings[i].c_str());
                // Take this delimiter if it is the first match found so far,
                // or if it occurs earlier in the string than the best match.
                if(ptrEnd2 && (*ptrEnd2 != '\0') && (!ptrEnd || ptrEnd2 < ptrEnd))
                {
                    ptrEnd = ptrEnd2;
                    whichDelimiter = i;
                }
            }
        }
        //******************************************************
        //=======================================
        if(ptrEnd)
        {
            strToken = strTempBase.SubString(pos, ptrEnd - ptrStart);
            if(strToken != "")
            {
                FTokens->Add(strToken);
                FTotalTokensCount++;
            }
            // Advance past the matched delimiter.
            ptrEnd += FDelimiters->Strings[whichDelimiter].Length();
            pos += ptrEnd - ptrStart;
            ptrStart = ptrEnd;
        }
        else
            break; // no delimiter found; avoid an infinite loop
    }
    FNeedsAlphabetized = true;
}
If you would like me to post the entire class, I will.
If interested, these are the delimiters I was using:
CTokenizer tokenizer;
tokenizer.Delimiters->Add(" ");
tokenizer.Delimiters->Add("<br>");
tokenizer.Delimiters->Add("");
tokenizer.Delimiters->Add(" ");
tokenizer.Delimiters->Add("\r");
tokenizer.Delimiters->Add("\n");
tokenizer.Delimiters->Add(";");
tokenizer.Delimiters->Add(".");
tokenizer.Delimiters->Add("<");
tokenizer.Delimiters->Add(">");
tokenizer.Delimiters->Add("/");
tokenizer.Delimiters->Add("!");
tokenizer.Delimiters->Add("*");
tokenizer.Delimiters->Add("-");
tokenizer.Delimiters->Add("'");
tokenizer.Delimiters->Add("\"");
tokenizer.Delimiters->Add("[");
tokenizer.Delimiters->Add("]");
tokenizer.Delimiters->Add("|");
tokenizer.Delimiters->Add("(");
tokenizer.Delimiters->Add(")");
tokenizer.Delimiters->Add(",");
tokenizer.Delimiters->Add("?");
tokenizer.Delimiters->Add("{");
tokenizer.Delimiters->Add("}");
tokenizer.Delimiters->Add("+");
tokenizer.Delimiters->Add("=");
tokenizer.Delimiters->Add("&");
tokenizer.Delimiters->Add(":");
 
 

Re:String Parsing Optimization

I had a thought which I will try out and let you know the results of. My
idea was something like this:

for(int i = 0; i < delimiterCount - 1; i++)
    strTempBase = StringReplace(strTempBase, FDelimiters->Strings[i],
        FDelimiters->Strings[delimiterCount-1], TReplaceFlags() << rfReplaceAll);

And then simply using FDelimiters->Strings[delimiterCount-1] as my sole
delimiter.
I fear that when strTempBase is of significant size, I will run into the
same memory fragmentation errors I was getting before when parsing
the tokens. Oh well, worth a try I suppose. Is there a way to reduce the
memory fragmentation of the above loop?
 

Re:String Parsing Optimization

"Junk Mail" < XXXX@XXXXX.COM >wrote in message
Quote
I had a thought which I will try out and let you know the results of. My
idea was something like this:

for(int i = 0; i < delimiterCount - 1; i++)
    strTempBase = StringReplace(strTempBase, FDelimiters->Strings[i],
        FDelimiters->Strings[delimiterCount-1], TReplaceFlags() << rfReplaceAll);

And then simply using FDelimiters->Strings[delimiterCount-1] as my sole
delimiter.

I fear that when strTempBase is of significant size, I will run into the
same memory fragmentation errors I was getting before when parsing
the tokens. Oh well, worth a try I suppose. Is there a way to reduce the
memory fragmentation of the above loop?

Here's what I ended up using:

AnsiString strDelimiter = FDelimiters->Strings[delimiterCount-1];
for(int i = 0; i < delimiterCount-1; i++)
    strTempBase = StringReplace(strTempBase, FDelimiters->Strings[i],
        strDelimiter, TReplaceFlags() << rfReplaceAll);

It doesn't really result in any memory fragmentation that I can see, so
that's not that big of a deal. But now, instead of taking a long time in my
loop, it's just taking a long time to do the StringReplace's.
When strTempBase is 500k and delimiterCount is 30, it sure does take an
awfully long time to finish.
Any ideas?
 


Re:String Parsing Optimization

"Junk Mail" < XXXX@XXXXX.COM >wrote in message
Quote
I fear that when strTempBase is of significant size, I will run
into the same memory fragmentation errors I was getting
before when parsing the tokens.
Yes, using StringReplace() in that fashion is no better than what you were
doing originally. Each call scans the whole string and builds a brand-new
copy of it, so you still pay for one full-string allocation per delimiter.
Gambit
 

Re:String Parsing Optimization

"Junk Mail" < XXXX@XXXXX.COM >wrote in message
Quote
Thanks, Gambit, for helping out with the last problem with memory issues.
I've run across another little stumbling block. It works great when I
only have a couple of delimiters, but when I start using a lot of
delimiters, say 30, it really starts to slow down the processing.
Of course it will. The more delimiters you have, the longer it is going to
take to search for them all.
Quote
Basically, I loop through my delimiters and determine which one is
closest to the front of the string. The closest one I use as the
delimiter to extract my next token from the string.
Does anyone know of a quicker way of determining this, or how I
could optimize my loop?
I don't think you can use the native VCL's capabilities for what you're
asking. They weren't designed to be optimized beyond what they already are.
You'll probably just have to search the string data manually to do what
you're asking for.
One of the problems with your current code is that you are searching the
entire string text each time you look for delimiters, and then comparing the
results one at a time. One optimization I can think of off the top of my
head would be to write your own AnsiStrPos()-style function that takes a
third parameter specifying the position at which to stop searching. That
way, when you search the string once, you get a position for a delimiter;
then you can pass that position as the extra parameter on the next call, so
that you only scan up to that point and no further. If you are looking for
the first available delimiter, there is no need to scan past a delimiter you
have already found. As you progress through your delimiter list for a
particular string, the searches get smaller and faster as the range is
shortened by each match. Sorry, I don't have tested code for that, though; a
rough sketch follows.
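A minimal, untested sketch of such a bounded search (StrPosBounded is a
made-up name; this assumes plain NUL-terminated char data):

#include <string.h>

// Hypothetical bounded search: like AnsiStrPos(), but never considers
// matches starting at or past 'limit'. Returns NULL if nothing is found.
static char* StrPosBounded(char* str, const char* subStr, const char* limit)
{
    size_t subLen = strlen(subStr);
    if(subLen == 0)
        return NULL;
    for(char* p = str; p < limit; ++p)
    {
        if(strncmp(p, subStr, subLen) == 0)
            return p;
    }
    return NULL;
}

The first delimiter would be searched with 'limit' set to the end of the
string; each later delimiter would be searched with 'limit' set to the best
match found so far, so every successive search scans a shrinking prefix
instead of the whole remainder.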
The more I think about it, I'm starting to think that maybe some kind of
lookup-table mechanism might work better instead. You can start at the
beginning of the string, take the first character, and see if any delimiters
start with that character; if one does, check whether the next character
matches, and so on, until the full delimiter matches or not. If it matches,
then you have your token. If not, go back to the second character in the
string and start over, looking for delimiters that start with that
character, and so on. That should at least help limit how much looping is
actually done for strings/delimiters that don't even match on the first
character; the bulk of the searching would be in potential matches involving
multiple characters.
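A rough sketch of that lookup-table idea (BuildDelimiterTable and
FindDelimiterAt are made-up names; this assumes the VCL's
AnsiString/TStringList and 8-bit characters):

#include <vcl.h>
#include <vector>
#include <string.h>

// For each possible byte value, remember which delimiters begin with it.
// Built once, reused for every position in the string being parsed.
static std::vector<int> table[256];

void BuildDelimiterTable(TStringList* delims)
{
    for(int c = 0; c < 256; ++c)
        table[c].clear();
    for(int i = 0; i < delims->Count; ++i)
    {
        const AnsiString d = delims->Strings[i];
        if(!d.IsEmpty())
            table[(unsigned char)d[1]].push_back(i); // AnsiString is 1-based
    }
}

// Test whether any delimiter starts at position p; only delimiters whose
// first character matches *p are ever compared in full.
int FindDelimiterAt(const char* p, TStringList* delims)
{
    const std::vector<int> &candidates = table[(unsigned char)*p];
    for(unsigned n = 0; n < candidates.size(); ++n)
    {
        const AnsiString d = delims->Strings[candidates[n]];
        if(strncmp(p, d.c_str(), d.Length()) == 0)
            return candidates[n]; // index of the delimiter that matched here
    }
    return -1; // no delimiter starts at this position
}

Parse() would then walk the string once, calling FindDelimiterAt() at each
character, instead of running one full-string search per delimiter.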
Just out of curiosity, why do you need to search for so many delimiters at
a time anyway?
Quote
If interested, these are the delimiters I was using:
If it weren't for the first two multi-character delimiters that you have,
you could just use the STL's std::string class for everything. It has
find_first_of() and find_first_not_of() methods that are perfectly suited
for delimiter searching. But they only work with single-character
delimiters.
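For reference, that approach looks something like this (a minimal sketch,
assuming every delimiter is a single character):

#include <string>
#include <vector>

// Split 'text' on any of the characters in 'delims', skipping empty tokens.
std::vector<std::string> Tokenize(const std::string &text,
                                  const std::string &delims)
{
    std::vector<std::string> tokens;
    std::string::size_type start = text.find_first_not_of(delims);
    while(start != std::string::npos)
    {
        // the next delimiter at or after 'start' ends the current token
        std::string::size_type end = text.find_first_of(delims, start);
        tokens.push_back(text.substr(start, end - start));
        start = text.find_first_not_of(delims, end);
    }
    return tokens;
}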
Gambit
 

Re:String Parsing Optimization

Quote
The more I think about it, I'm starting to think that maybe some kind of
lookup-table mechanism might work better instead. You can start at the
beginning of the string, take the first character, and see if any delimiters
start with that character; if one does, check whether the next character
matches, and so on, until the full delimiter matches or not. If it matches,
then you have your token. If not, go back to the second character in the
string and start over, looking for delimiters that start with that
character, and so on. That should at least help limit how much looping is
actually done for strings/delimiters that don't even match on the first
character; the bulk of the searching would be in potential matches involving
multiple characters.
You might be on to something there. I came up with a semi-solution: I only
do StringReplace on delimiters with more than one character. Then I loop
through the base string and look at all the single-character delimiters, so
I'm only looping through the base string once for them. That sped things up
a lot. The idea you suggested would be better once I start adding more
delimiters with more than one character. A side effect of this is that it
would require the user of my class to add the delimiters in order of
decreasing length, unless I wanted to manage that for them, I guess.
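Roughly, what I'm doing now looks like this (a sketch; ParseHybrid is not
the real method name, and the single-character set is hard-coded here):

#include <vcl.h>
#include <string.h>

void CTokenizer::ParseHybrid()
{
    AnsiString base = FBaseString;
    // Step 1: StringReplace only the multi-character delimiters
    // (e.g. "<br>"), collapsing each one to a plain space.
    for(int i = 0; i < FDelimiters->Count; ++i)
        if(FDelimiters->Strings[i].Length() > 1)
            base = StringReplace(base, FDelimiters->Strings[i], " ",
                                 TReplaceFlags() << rfReplaceAll);
    // Step 2: one pass over the text; strchr() tests each character
    // against the set of single-character delimiters.
    const char* singles = " \r\n;.<>/!*-'\"[]|(),?{}+=&:";
    FTokens->Clear();
    const char* p = base.c_str();
    const char* tokenStart = p;
    for(;; ++p)
    {
        if(*p == '\0' || strchr(singles, *p))
        {
            if(p > tokenStart) // skip empty tokens between delimiters
                FTokens->Add(AnsiString(tokenStart, p - tokenStart));
            if(*p == '\0')
                break;
            tokenStart = p + 1;
        }
    }
}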
Quote
Just out of curiosity, why do you need to search for so many
delimiters at a time anyway?
I get bored sometimes :) I recently downloaded a program called PopFile.
It's an email classification tool that uses Bayes' theorem to determine the
probability of which bucket an email should go into (personal, work, spam,
etc.). It actually comes with source code, but I haven't looked at it; I
prefer to work my way through problems, and I tend to learn more that way
than by simply copying code. At any rate, I became interested in this
partially because I used a derivation of Bayes' theorem to make a blackjack
learner for my AI project a few years ago. Bayes' theorem is basically the
probability of A given B, that is, P(A|B) = P(B|A) * P(A) / P(B), but it can
be expanded to use more variables. I like this program that I downloaded,
but unfortunately it's not that user friendly. You have to go to a web page
interface to reclassify any mistakes the program made. I want to make a
program that is integrated into Outlook, where I can simply drag and drop my
emails to a different folder "bucket" to correct any mistakes the Bayesian
algorithm has made.
I know I can go out and buy one for $30-$100, but that would be too easy...
or too much for a student income :)
My first step is to create a tokenizer that simply gets all the useful
words out of an email, or out of an HTML file sent as an email. So far, what
I am doing works great, although it's not the quickest thing in the world.
It's really fine for normal emails; it can do a 500k email in about a minute
or so on my 900 MHz, 256 MB machine. I can live with that, since most emails
are 1 to 2k, especially if they don't have attachments, and those take split
seconds. Maybe I'm going about this the wrong way. If you have a better idea
of how to get useful tokens out of an email or HTML, I would appreciate some
enlightenment.
The next step is simply to create a bucket class that gives the probability
that a given word indicates the email belongs to that bucket. Then you take
the 10-20 words from each bucket that steer the furthest away from 0.5, and
use those probabilities to determine which bucket the email has the highest
probability of going into. Then integrate it into Outlook... hmm, I am still
a ways off... oh well, I have time to spare :)
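As a sketch of that combining step (this is the standard naive-Bayes
combination used by Graham-style mail filters, not code from PopFile; the
per-word probabilities are assumed to come from the bucket class):

#include <vector>

// Combine per-word probabilities P(bucket | word) into one overall
// probability for the bucket, treating the words as independent
// (the "naive" assumption). Words near 0.5 contribute little evidence,
// which is why only the 10-20 most extreme ones are worth keeping.
double CombineProbabilities(const std::vector<double> &wordProbs)
{
    double prod = 1.0, invProd = 1.0;
    for(unsigned i = 0; i < wordProbs.size(); ++i)
    {
        prod    *= wordProbs[i];        // P(w1) * P(w2) * ...
        invProd *= 1.0 - wordProbs[i];  // (1 - P(w1)) * (1 - P(w2)) * ...
    }
    return prod / (prod + invProd);
}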
Anyway, I got a little long-winded there; hope that satisfies your curiosity :)