Re: HTML parsing...


2007-02-10 07:12:58 AM
cppbuilder100
Eduardo Jauch wrote:
Quote
Hello!

I need to parse a HTML web page as fast as I can get...

What I want to parse is on this format:

<select name="mesDisponivel" class="combo-box" onChange="javascript:
popularListaDiasDisponiveis();">
<option value="">Selecione</option>
<option value="04/2007">Abril 2007
</option>
</select>

I know that the '<select ' that I wanna is the FIRST that appears on the
page. I need ONLY to get the 04/2007 value (and the possible others, if
they exist).
I would recommend trying the libxml2 library, since it is quite fast and
supports parsing of HTML, and what is really nice is that you can even use
XPath to do queries.
Darko
 
 

Re:Re: HTML parsing...

Hello!
I need to parse a HTML web page as fast as I can get...
What I want to parse is on this format:
<select name="mesDisponivel" class="combo-box" onChange="javascript:
popularListaDiasDisponiveis();">
<option value="">Selecione</option>
<option value="04/2007">Abril 2007
</option>
</select>
I know that the '<select ' I want is the FIRST one that appears on the
page. I ONLY need to get the 04/2007 value (and any others, if
they exist).
I wrote this code:
struct TSAEVMesDisponivel
{
int mes, ano;
AnsiString mes_str;
};
DynamicArray<TSAEVMesDisponivel> lstMeses; // the code below refers to this array as lstMeses
AnsiString stream; //<<<===== It's PUBLIC. contains the HTML page
AnsiString tag;
int p;
int streamLength = stream.Length();
int nMeses = 0;
bool finalizar = false;
int opcao = 0;
int i;
for(p = 1; p <= streamLength; p++)
{
if(stream[p] == '<')
{
if(++p>streamLength || stream[p] != 's')
continue;
if(++p>streamLength || stream[p] != 'e')
continue;
if(++p>streamLength || stream[p] != 'l')
continue;
if(++p>streamLength || stream[p] != 'e')
continue;
if(++p>streamLength || stream[p] != 'c')
continue;
if(++p>streamLength || stream[p] != 't')
continue;
if(++p>streamLength || stream[p] != ' ')
continue;
// If we got here, this is the months select (assuming the page
// is the right one ;) hehehe)
for(; p < streamLength; ++p)
{
if(stream[p] == '<')
{
if(++p < streamLength && stream[p] == 'o')
{
if(++p>streamLength || stream[p] != 'p')
continue;
if(++p>streamLength || stream[p] != 't')
continue;
if(++p>streamLength || stream[p] != 'i')
continue;
if(++p>streamLength || stream[p] != 'o')
continue;
if(++p>streamLength || stream[p] != 'n')
continue;
if(++p>streamLength || stream[p] != ' ')
continue;
if(opcao == 0)
{
opcao++;
continue;
}
// This is ==>NOT<== the "SELECIONE" option ;)
for(; p < streamLength; ++p)
{
if(stream[p] == '\"' && (p + 8) < streamLength)
{
lstMeses.Length++;
lstMeses[lstMeses.High].mes_str.SetLength(7);
for(i = 1; i < 8; i++)
{
lstMeses[lstMeses.High].mes_str[i] = stream[++p];
}
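// AnsiString is 1-based: in "04/2007" the month is chars 1-2, the year chars 4-7
lstMeses[lstMeses.High].mes =
lstMeses[lstMeses.High].mes_str.SubString(1, 2).ToInt();
lstMeses[lstMeses.High].ano =
lstMeses[lstMeses.High].mes_str.SubString(4, 4).ToInt();
nMeses++;
break; // value read; resume the outer scan for the next '<option ' or '</select'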
}
}
}
else if(p <= streamLength && stream[p] == '/')
{
if(++p>streamLength || stream[p] != 's')
continue;
if(++p>streamLength || stream[p] != 'e')
continue;
if(++p>streamLength || stream[p] != 'l')
continue;
if(++p>streamLength || stream[p] != 'e')
continue;
if(++p>streamLength || stream[p] != 'c')
continue;
if(++p>streamLength || stream[p] != 't')
continue;
finalizar = true;
break;
}
}
}
if(finalizar)
break;
}
}
Do you have any suggestion to IMPROVE the code?
Thanks :)
 

Re:Re: HTML parsing...

Eduardo Jauch < XXXX@XXXXX.COM >writes:
Quote
I need to parse a HTML web page as fast as I can get...
Well, if you get it online then the network will be the bottleneck
unless you are really doing "interesting" things in your program.
Quote
What I want to parse is on this format:

<select name="mesDisponivel" class="combo-box" onChange="javascript:
popularListaDiasDisponiveis();">
<option value="">Selecione</option>
<option value="04/2007">Abril 2007
</option>
</select>

I know that the '<select ' that I wanna is the FIRST that appears on
the page. I need ONLY to get the 04/2007 value (and the possible
others, if they exist).

I create this code:
[snip]
Quote
Do you have any suggestion to IMPROVE the code?
This is *much* too complicated for me to believe it works (BTW: have
you considered that HTML tags are case-insensitive?).
What about using a regular expression library such as the one from
Boost (www.boost.org/libs/regex/doc/index.html)?
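
As a rough sketch of the regular-expression idea: the snippet below uses std::regex from the C++11 standard library (a substitution for Boost.Regex, made only so the example has no external dependency) and std::string instead of AnsiString. The function name extractOptionValues and both patterns are assumptions for illustration, not code from this thread. It also assumes that value is the first attribute of each <option> tag, as in the sample page. The icase flag covers the case-insensitivity concern.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the non-empty value="..." attributes of the
// <option> tags found inside the first <select> ... </select> block of `html`.
std::vector<std::string> extractOptionValues(const std::string& html)
{
    std::vector<std::string> values;

    // Tag names are matched case-insensitively.
    static const std::regex selectOpen("<select\\b", std::regex::icase);
    static const std::regex selectClose("</select", std::regex::icase);
    static const std::regex optionRe("<option\\s+value\\s*=\\s*\"([^\"]*)\"",
                                     std::regex::icase);

    // Find the first <select on the page.
    std::smatch open;
    if (!std::regex_search(html, open, selectOpen))
        return values;
    const std::string::const_iterator first = open[0].second;

    // Limit the search to the text before the next </select (or end of page).
    std::string::const_iterator last = html.cend();
    std::smatch close;
    if (std::regex_search(first, html.cend(), close, selectClose))
        last = close[0].first;

    // Collect every option value in that range.
    for (std::sregex_iterator it(first, last, optionRe), end; it != end; ++it)
    {
        const std::string value = (*it)[1].str();
        if (!value.empty())   // skips the placeholder option with value=""
            values.push_back(value);
    }
    return values;
}
```

Calling extractOptionValues on a std::string holding the page would give the list of MM/YYYY strings, which can then be split into month and year.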
 

Re:Re: HTML parsing...

Darko Miletic escreveu:
Quote
Eduardo Jauch wrote:
>Hello!
>
>I need to parse a HTML web page as fast as I can get...
>
>What I want to parse is on this format:
>
><select name="mesDisponivel" class="combo-box" onChange="javascript:
>popularListaDiasDisponiveis();">
><option value="">Selecione</option>
><option value="04/2007">Abril 2007
></option>
></select>
>
>I know that the '<select ' that I wanna is the FIRST that appears on
>the page. I need ONLY to get the 04/2007 value (and the possible
>others, if they exist).

I would recommend trying the libxml2 library, since it is quite fast and
supports parsing of HTML, and what is really nice is that you can even use
XPath to do queries.

Darko
Sounds interesting...
It's good to know :)
But this'll parse and hold the entire document no?
If yes, this will be more "time" and "resources" consuming than I want...
I'll take a look at it :)
Thanks anyway :)
 

Re:Re: HTML parsing...

Thomas Maeder [TeamB] escreveu:
Quote
Eduardo Jauch < XXXX@XXXXX.COM >writes:

>I need to parse a HTML web page as fast as I can get...

Well, if you get it online then the network will be the bottleneck
unless you are really doing "interesting" things in your program.

Hmm... I don't get it... I get the page from the web, and really,
it sometimes takes a while to see the page... The page is always about
8-10k... But most of the time, I can reload the entire page many times
per second.
A simple 'for(i = 0; i < 1000; i++);' slows down the reload performance.
So the parsing code must be as fast as I can get it...
What do you mean by "interesting"? :)
Quote

>What I want to parse is on this format:
>
><select name="mesDisponivel" class="combo-box" onChange="javascript:
>popularListaDiasDisponiveis();">
><option value="">Selecione</option>
><option value="04/2007">Abril 2007
></option>
></select>
>
>I know that the '<select ' that I wanna is the FIRST that appears on
>the page. I need ONLY to get the 04/2007 value (and the possible
>others, if they exist).
>
>I create this code:

[snip]

>Do you have any suggestion to IMPROVE the code?

This is *much* too complicated for me to believe it works (BTW: have
you considered that HTML tags are case-insensitive?).

What about using a regular expression library such as the one from
Boost (www.boost.org/libs/regex/doc/index.html)?
Well, it's really not so complicated.
I first find the first select in the page. Anything that starts with '<'
is checked, but if a single character is not part of the word 'select ',
the search goes on.
When I find it, I search (using the same principle) for the 'option '
tag. I discard the first one, because its value is always "".
I then load the value. If I find a </select> along the way, the
search ends.
The code works very well :)
As for case-insensitivity, the page is generated automatically and never
changes. Only the values and text do.
I'll take a look at the lib that you mentioned :)
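
The search described above can be sketched roughly as follows. This is a portable illustration using std::string and std::string::find, not the AnsiString code posted earlier; the struct AvailableMonth and the function scanMonths are hypothetical names, and the sketch assumes the page always uses lowercase tags and MM/YYYY values, as stated.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical std::string counterpart of TSAEVMesDisponivel.
struct AvailableMonth
{
    int month;
    int year;
    std::string text;   // e.g. "04/2007"
};

std::vector<AvailableMonth> scanMonths(const std::string& html)
{
    std::vector<AvailableMonth> months;

    // 1. Find the first "<select " on the page.
    std::string::size_type pos = html.find("<select ");
    if (pos == std::string::npos)
        return months;

    // 2. The search ends at the next "</select" (npos means end of page).
    const std::string::size_type end = html.find("</select", pos);

    bool firstOption = true;
    while ((pos = html.find("<option ", pos)) != std::string::npos && pos < end)
    {
        pos += 8;   // length of "<option "

        // 3. Discard the first option, whose value is always "".
        if (firstOption)
        {
            firstOption = false;
            continue;
        }

        // 4. Read the text between the next pair of double quotes.
        const std::string::size_type open = html.find('"', pos);
        if (open == std::string::npos || open >= end)
            break;
        const std::string::size_type close = html.find('"', open + 1);
        if (close == std::string::npos || close >= end)
            break;
        const std::string value = html.substr(open + 1, close - open - 1);

        // Only accept values shaped like MM/YYYY.
        if (value.size() == 7 && value[2] == '/')
        {
            AvailableMonth m;
            m.month = std::atoi(value.substr(0, 2).c_str());
            m.year  = std::atoi(value.substr(3, 4).c_str());
            m.text  = value;
            months.push_back(m);
        }
        pos = close + 1;
    }
    return months;
}
```

Letting find do the character comparisons keeps the control flow to one loop, which makes it easier to see where the scan stops.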
 

Re:Re: HTML parsing...

Eduardo Jauch wrote:
Quote
But this'll parse and hold the entire document no?
If yes, this will be more "time" and "resources" consuming than I want...
If that is your concern, then you should use SAX from libxml, as it is
designed for a small memory footprint.
Take a look at examples for xmlReader here:
www.xmlsoft.org/examples/index.html
 

Re:Re: HTML parsing...

Darko Miletic escreveu:
Quote
Eduardo Jauch wrote:
>But this'll parse and hold the entire document no?
>If yes, this will be more "time" and "resources" consuming than I want...

If that is your concern, then you should use SAX from libxml, as it is
designed for a small memory footprint.
Take a look at examples for xmlReader here:
www.xmlsoft.org/examples/index.html

I'll try it as soon as possible...
Anyway, I'll surely try it on another project... From what I read, it seems to
fit exactly what I need for the other project...
Thanks again! :)
 

Re:Re: HTML parsing...

Darko Miletic < XXXX@XXXXX.COM >writes:
Quote
I would recommend trying the libxml2 library, since it is quite fast and
supports parsing of HTML, and what is really nice is that you can even use
XPath to do queries.
Does it parse HTML which isn't well-formed according to the stricter
rules of XML?
 

Re:Re: HTML parsing...

Eduardo,
This in my opinion is easier than checking each character...
// This gets you to the select you want (27 is the length of the search text).
stream = stream.SubString(stream.Pos("<select name=\"mesDisponivel") + 27,
                          stream.Length());
// This makes sure you have only this select.
// (AnsiString indices are 1-based, so substrings start at 1.)
stream = stream.SubString(1, stream.Pos("</select>") - 1);
// This will discard the first <option> that you don't need.
stream = stream.SubString(stream.Pos("</option>") + 9, stream.Length());
// Now we loop through the rest of the code extracting the month and year
// that you need.
int Pos = 0;
while ((Pos = stream.Pos("<option value=\"")) != 0) // explicit comparison avoids the warning
{
    stream = stream.SubString(Pos + 15, stream.Length());
    lstMeses.Length++;
    lstMeses[lstMeses.High].mes_str = stream.SubString(1, stream.Pos("\"") - 1);
    // For "04/2007": characters 1-2 are the month, characters 4-7 the year.
    lstMeses[lstMeses.High].mes =
        lstMeses[lstMeses.High].mes_str.SubString(1, 2).ToInt();
    lstMeses[lstMeses.High].ano =
        lstMeses[lstMeses.High].mes_str.SubString(4, 4).ToInt();
    // Why is this needed? Can't you just check lstMeses.Length?
    nMeses++;
}
There may be an even more elegant way to do this, but it's the way I use
when I'm parsing sites. =)
-Tom
 

Re:Re: HTML parsing...

Eduardo Jauch wrote:
Quote
<select name="mesDisponivel" class="combo-box" onChange="javascript:
popularListaDiasDisponiveis();">
<option value="">Selecione</option>
<option value="04/2007">Abril 2007
</option>
</select>

I know that the '<select ' that I wanna is the FIRST that appears on the
page. I need ONLY to get the 04/2007 value (and the possible others, if
they exist).
Is the page contained in an AnsiString? If so you can do this:
AnsiString HtmlText = ....
int pos1 = HtmlText.Pos("<select" );
int pos2 = HtmlText.Pos("</select" );
if ( ! pos1 || ! pos2 || pos1>pos2 )
return;
AnsiString SelectText = HtmlText.SubString ( pos1, pos2 - pos1);
pos1 = SelectText.Pos("<option" );
pos2 = SelectText.Pos("</option" );
if ( ! pos1 || ! pos2 || pos1>pos2 )
return;
AnsiString OptionText = SelectText.SubString ( pos1, pos2 - pos1);
ShowMessage ( OptionText ); // will show '<option value="">Selecione'
This is a start.
Are you sure it is always the second <option>..</option>?
Hans.
(PS. cancelled my first reply because it contained errors.)
 

Re:Re: HTML parsing...

Hans Galema escreveu:
Quote

Is the page contained in an AnsiString? If so you can do this:

Yes :)
Quote
AnsiString HtmlText = ....

int pos1 = HtmlText.Pos("<select" );
int pos2 = HtmlText.Pos("</select" );

if ( ! pos1 || ! pos2 || pos1>pos2 )
return;

AnsiString SelectText = HtmlText.SubString ( pos1, pos2 - pos1);

pos1 = SelectText.Pos("<option" );
pos2 = SelectText.Pos("</option" );

if ( ! pos1 || ! pos2 || pos1>pos2 )
return;

AnsiString OptionText = SelectText.SubString ( pos1, pos2 - pos1);

ShowMessage ( OptionText ); // will show '<option value="">Selecione'

This is a start.

Are you sure it is always the second <option>..</option>?
Yes ;)
Quote

Hans.

(PS. cancelled my first reply because it contained errors.)
No problem :)
I'll take a look to see the performance :)
Do you know how to measure "milliseconds"?
I tried using TDateTime but only got seconds...
Thanks for the tips :)
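
One possible way to get millisecond timings, sketched with std::chrono::steady_clock from standard C++11 rather than TDateTime. This assumes a C++11-capable compiler, which may not match the C++Builder version used in this thread, and the helper name elapsedMilliseconds is hypothetical.

```cpp
#include <chrono>

// Hypothetical helper: calls `work` `repetitions` times and returns the total
// elapsed time in milliseconds, including the fractional part.
template <typename Work>
double elapsedMilliseconds(Work work, int repetitions = 1)
{
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repetitions; ++i)
        work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Parsing an 8-10k page once will probably take far less than a millisecond, so it may be more informative to time a few thousand repetitions and divide the total by the repetition count.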
 

Re:Re: HTML parsing...

Thomas Maeder [TeamB] wrote:
Quote
Does it parse HTML which isn't well-formed according to the stricter
rules of XML?
No, it parses html with rules for html.
If you try parsing HTML with the default XML parser you will probably get
undesired results ;)
But you can even do that if you want. There is nothing that prevents you
from going in that direction.
Take a look here:
www.xmlsoft.org/html/libxml-HTMLparser.html
I used this extensively in (shhh don't tell anybody) php 5 since it
implements all xml/html functions with libxml2 and it works like a charm.
Darko
 

Re:Re: HTML parsing...

Darko Miletic < XXXX@XXXXX.COM >writes:
Quote
Thomas Maeder [TeamB] wrote:
>Does it parse HTML which isn't well-formed according to the
>stricter rules of XML?

No, it parses html with rules for html.
Oh, good.
Quote
If you try parsing HTML with the default XML parser you will probably
get undesired results ;)
Exactly. That's why I asked.
Quote
But you can even do that if you want. There is nothing that prevents
you from going in that direction.

Take a look here:
www.xmlsoft.org/html/libxml-HTMLparser.html
Thanks!