Board index » delphi » Query Match Percentage

Query Match Percentage

How to get the match percentage for a multiple keywords query like
E{*word*277} search engine of web pages?  One way is to get recordset of those
records with at least one keyword matched.  And then iterate through all
the records and calculate the percentage and finally resort by match
percentage.  It seems not to be the best way if initial record set
contains too many records.  Any guru out there can point to a better on
how to do a multiple keyword query with result ordered by percent
match?  Thanks.

--
Wanli Ma
--
Internet Programmer
Legend Microsystems, Inc.
http://www.LegendNet.com
mailto:W...@LegendNet.com

 

Re:Query Match Percentage


How to get the match percentage for a multiple keywords query and order
the result by percent match like
E{*word*277} search engine of web pages?  One way is to get recordset of those
records with at least one keyword matched.  And then iterate through all
the records and calculate the percentage and finally resort by match
percentage.  It seems not to be the best way if initial record set
contains too many records.  Any guru out there can point to a better way
on
how to do a multiple keyword query with result ordered by percent
match?  Thanks.

--
Wanli Ma
--
Internet Programmer
Legend Microsystems, Inc.
http://www.LegendNet.com
mailto:W...@LegendNet.com

Re:Query Match Percentage


Hi Wanli !

Quote
> how to do a multiple keyword query with result ordered by percent
> match?  Thanks.

I can't give to You ideal solution. But in Oracle, for example,
in that case I will write a query, like:

select id, count(*)
from pages
where text like '%KW1% or ... or text like '%KWN%
group by id
order by 2

This query have one issue: LIKE do not a keyword search.
But You can change LIKE to UDF, that will do keyword search.

Other solution in Oracle server - to use Oracle Context
cartridge.

Regards,
Dmitry

--
------------------------------------------------------------------------
Dmitry L. Arefiev, senior programmer, Nica-Com ltd.,  Russia
Author of NCOCI8 - Freeware Delphi to Oracle8i direct access
EMail: cut_this_prefix.di...@ncom.ru           ICQ: 50741007

Re:Query Match Percentage


Thanks.  If I understand correctly the way you are suggesting will work only
if each word of a page has been indexed, how about if a page is just saved
in one text field?   In this way, the Count(*) will always return 1 for

select id, count(*)
from pages
where text like '%KW1% or ... or text like '%KWN%

no matter one keyword matched or ... three keywords matched.

Thanks again.

Quote
Dmitry Arefiev wrote:
> Hi Wanli !

> > how to do a multiple keyword query with result ordered by percent
> > match?  Thanks.

> I can't give to You ideal solution. But in Oracle, for example,
> in that case I will write a query, like:

> select id, count(*)
> from pages
> where text like '%KW1% or ... or text like '%KWN%
> group by id
> order by 2

> This query have one issue: LIKE do not a keyword search.
> But You can change LIKE to UDF, that will do keyword search.

> Other solution in Oracle server - to use Oracle Context
> cartridge.

> Regards,
> Dmitry

> --
> ------------------------------------------------------------------------
> Dmitry L. Arefiev, senior programmer, Nica-Com ltd.,  Russia
> Author of NCOCI8 - Freeware Delphi to Oracle8i direct access
> EMail: cut_this_prefix.di...@ncom.ru           ICQ: 50741007

Re:Query Match Percentage


Quote
Wanli Ma wrote:

> Thanks.  If I understand correctly the way you are suggesting will work only
> if each word of a page has been indexed, how about if a page is just saved
> in one text field?   In this way, the Count(*) will always return 1 for

> select id, count(*)
> from pages
> where text like '%KW1% or ... or text like '%KWN%

> no matter one keyword matched or ... three keywords matched.

> Thanks again.

I mean, that "TEXT" field completely contain a pages text. So each
record of
"PAGES" table contains all page info. Also, in Oracle for text fields
You can
have max length 4000. If You want more, You must to use other types:
CLOB, for
example, and use other methods for searching keywords.

Also, You can use indexed search. In that case, You must have two
tables:

PAGES   WORDS
-----   ------
ID      PAGE_ID
TEXT    WORD_TEXT

and rewrite a query:

select p.id, count(*)
from pages p, words w
where p.id = w.page_id and (w.word_text = 'KW1' or ... or w.word_text =
'KWN')
group by p.id
order by 2

Regards,
Dmitry

--
------------------------------------------------------------------------
Dmitry L. Arefiev, senior programmer, Nica-Com ltd.,  Russia
Author of NCOCI8 - Freeware Delphi to Oracle8i direct access
EMail: cut_this_prefix.di...@ncom.ru           ICQ: 50741007

Re:Query Match Percentage


Thanks.  But I did a short test on an Access table.

select idxGUID, COUNT(*) from table1 where strFld1 Like '%a%' or strFld1 Like
'%b%'
Group By idxGUID Order By 2

All COUNTs are 1.  I think the problem is at "text like '%KW1% or ... or text like
'%KWN%", as long as one keyword matches, the record will return true.  So it will
not influence COUNT.  Of course, I did not test it on Oracle though.  But I
suspect the result might be the same.  If each page is parsed into singles words
and each word is indexed, the COUNT might give what I need.  But it is really
something system resource intensive when index each word.

Quote
Dmitry Arefiev wrote:
> Wanli Ma wrote:

> > Thanks.  If I understand correctly the way you are suggesting will work only
> > if each word of a page has been indexed, how about if a page is just saved
> > in one text field?   In this way, the Count(*) will always return 1 for

> > select id, count(*)
> > from pages
> > where text like '%KW1% or ... or text like '%KWN%

> > no matter one keyword matched or ... three keywords matched.

> > Thanks again.

> I mean, that "TEXT" field completely contain a pages text. So each
> record of
> "PAGES" table contains all page info. Also, in Oracle for text fields
> You can
> have max length 4000. If You want more, You must to use other types:
> CLOB, for
> example, and use other methods for searching keywords.

> Also, You can use indexed search. In that case, You must have two
> tables:

> PAGES   WORDS
> -----   ------
> ID      PAGE_ID
> TEXT    WORD_TEXT

> and rewrite a query:

> select p.id, count(*)
> from pages p, words w
> where p.id = w.page_id and (w.word_text = 'KW1' or ... or w.word_text =
> 'KWN')
> group by p.id
> order by 2

> Regards,
> Dmitry

> --
> ------------------------------------------------------------------------
> Dmitry L. Arefiev, senior programmer, Nica-Com ltd.,  Russia
> Author of NCOCI8 - Freeware Delphi to Oracle8i direct access
> EMail: cut_this_prefix.di...@ncom.ru           ICQ: 50741007

Re:Query Match Percentage


Wanli, I am sorry ... I make a mistake. Valid querys must be:

Quote
> > select id, count(*)
> > from pages
> > where text like '%KW1% or ... or text like '%KWN%

select id, count(*)
from (
        select id from pages where text like '%KW1%
        union all
        .........
        union all
        select id from pages where text like '%KWN%
)
group by id
order by 2

Quote
> select p.id, count(*)
> from pages p, words w
> where p.id = w.page_id and (w.word_text = 'KW1' or ... or w.word_text =
> 'KWN')
> group by p.id
> order by 2

select id, count(*)
from (
        select page_id as id from words where word_text = 'KW1'
        union all
        .........
        union all
        select page_id as id from words where word_text = 'KWN'
)
group by id
order by 2

But this syntax may works only in Oracle ... In any case, there
must be variable number of subquerys. One for each keyword.

Regards,
Dmitry

--
------------------------------------------------------------------------
Dmitry L. Arefiev, senior programmer, Nica-Com ltd.,  Russia
Author of NCOCI8 - Freeware Delphi to Oracle8i direct access
EMail: cut_this_prefix.di...@ncom.ru           ICQ: 50741007

Re:Query Match Percentage


Even I dont have Oracle to test this query,  I must admit this is quite
creative syntax.  Thanks.
Quote
Dmitry Arefiev wrote:
> Wanli, I am sorry ... I make a mistake. Valid querys must be:

> > > select id, count(*)
> > > from pages
> > > where text like '%KW1% or ... or text like '%KWN%

> select id, count(*)
> from (
>         select id from pages where text like '%KW1%
>         union all
>         .........
>         union all
>         select id from pages where text like '%KWN%
> )
> group by id
> order by 2

> > select p.id, count(*)
> > from pages p, words w
> > where p.id = w.page_id and (w.word_text = 'KW1' or ... or w.word_text =
> > 'KWN')
> > group by p.id
> > order by 2

> select id, count(*)
> from (
>         select page_id as id from words where word_text = 'KW1'
>         union all
>         .........
>         union all
>         select page_id as id from words where word_text = 'KWN'
> )
> group by id
> order by 2

> But this syntax may works only in Oracle ... In any case, there
> must be variable number of subquerys. One for each keyword.

> Regards,
> Dmitry

> --
> ------------------------------------------------------------------------
> Dmitry L. Arefiev, senior programmer, Nica-Com ltd.,  Russia
> Author of NCOCI8 - Freeware Delphi to Oracle8i direct access
> EMail: cut_this_prefix.di...@ncom.ru           ICQ: 50741007

Re:Query Match Percentage


Even I dont have Oracle to test this query,  I must admit this is quite
creative syntax.  This should save a lot of system resource by not
indexing each word.  But of course indexing each word still has the
advantage of knowing the frequency of each word in a text.  I can not
see a way that a query can do that.  Thanks a lot.
Quote
Dmitry Arefiev wrote:
> Wanli, I am sorry ... I make a mistake. Valid querys must be:

> > > select id, count(*)
> > > from pages
> > > where text like '%KW1% or ... or text like '%KWN%

> select id, count(*)
> from (
>         select id from pages where text like '%KW1%
>         union all
>         .........
>         union all
>         select id from pages where text like '%KWN%
> )
> group by id
> order by 2

> > select p.id, count(*)
> > from pages p, words w
> > where p.id = w.page_id and (w.word_text = 'KW1' or ... or w.word_text =
> > 'KWN')
> > group by p.id
> > order by 2

> select id, count(*)
> from (
>         select page_id as id from words where word_text = 'KW1'
>         union all
>         .........
>         union all
>         select page_id as id from words where word_text = 'KWN'
> )
> group by id
> order by 2

> But this syntax may works only in Oracle ... In any case, there
> must be variable number of subquerys. One for each keyword.

> Regards,
> Dmitry

> --
> ------------------------------------------------------------------------
> Dmitry L. Arefiev, senior programmer, Nica-Com ltd.,  Russia
> Author of NCOCI8 - Freeware Delphi to Oracle8i direct access
> EMail: cut_this_prefix.di...@ncom.ru           ICQ: 50741007

Re:Query Match Percentage


Well, you should have a separate keyword table and update it whenever the
related record changes.
PAGES table: PAGE_ID integer, contents long varchar;
KWD table : PAGE_ID integer references PAGES, AKeyword char(32);

1. advantage: It is true that this way you waste some space, but if pages
table grows large, sequential search on varchar field will take time, while
KWD table is indexed.
2. you normally search on more than one keyword, so this way you can easier
weight results:
- simple: select PAGE_ID, count(*) where Akeyword = something and/or
AKeyword like somethingElse
- weighted:
Expr: Keyword like '%a' and (keyword like '%zz' or keyword like 'qwe') or
Keyword='123'
A*(B+C)+D=A*B+A*C+D; where A is first part (Keyword like '%a'), B
second,.....
now weight this:
select 3 as weight, page_id  from KWD where A*B
union all
select 3 as weight, page_id  from KWD where A*C
union all
select 1 as weight, page_id  from KWD where D

Depending on database server you can put this select into temp table or do a
direct summary select from this select:
Select sum(weight), page_id from (<the first select>)

Naturally, you'd have to weight the operators expression: "equal" should be
heavier than "like",....
I suggest you to read some books about this and design solution depending on
needed accuracy, speed, volume of data,.....

--
----------------------
Regards
Robert Cerny
Remove both qwe when replying
email: robert.qwe.ce...@neosys.xrs.qwe.si

No questions via email, unless explicitly invited.

Quote
Wanli Ma wrote in message <38B572BD.CCB32...@LegendNet.com>...
>How to get the match percentage for a multiple keywords query like
>E{*word*277} search engine of web pages?  One way is to get recordset of those
>records with at least one keyword matched.  And then iterate through all
>the records and calculate the percentage and finally resort by match
>percentage.  It seems not to be the best way if initial record set
>contains too many records.  Any guru out there can point to a better on
>how to do a multiple keyword query with result ordered by percent
>match?  Thanks.

>--
>Wanli Ma
>--
>Internet Programmer
>Legend Microsystems, Inc.
>http://www.LegendNet.com
>mailto:W...@LegendNet.com

Re:Query Match Percentage


Thanks for the suggestion.  This should make the search much faster when the
volume is large.  But there are some drawbacks here compared with what Dmitry
suggested.  It is resource consuming to index each non-repeated word in a text.
And also when the text changes, to update keywords also takes time.  There would
be no this problem if not indexing the keywords.  Indexing keywords does offer
another advantage if having the keyword table structured following way:

PAGE_ID, AKeyword, Frequency.

The appearance frequency could also be taken into account in percentage
calculation.

Guess it all depends on the volume of the data.  I will check some more
references.

By the way, if indexing keywords, do you still suggest keeping a separate Pages
table?

Thanks for all.

Quote
Robert Cerny wrote:
> Well, you should have a separate keyword table and update it whenever the
> related record changes.
> PAGES table: PAGE_ID integer, contents long varchar;
> KWD table : PAGE_ID integer references PAGES, AKeyword char(32);

> 1. advantage: It is true that this way you waste some space, but if pages
> table grows large, sequential search on varchar field will take time, while
> KWD table is indexed.
> 2. you normally search on more than one keyword, so this way you can easier
> weight results:
> - simple: select PAGE_ID, count(*) where Akeyword = something and/or
> AKeyword like somethingElse
> - weighted:
> Expr: Keyword like '%a' and (keyword like '%zz' or keyword like 'qwe') or
> Keyword='123'
> A*(B+C)+D=A*B+A*C+D; where A is first part (Keyword like '%a'), B
> second,.....
> now weight this:
> select 3 as weight, page_id  from KWD where A*B
> union all
> select 3 as weight, page_id  from KWD where A*C
> union all
> select 1 as weight, page_id  from KWD where D

> Depending on database server you can put this select into temp table or do a
> direct summary select from this select:
> Select sum(weight), page_id from (<the first select>)

> Naturally, you'd have to weight the operators expression: "equal" should be
> heavier than "like",....
> I suggest you to read some books about this and design solution depending on
> needed accuracy, speed, volume of data,.....

> --
> ----------------------
> Regards
> Robert Cerny
> Remove both qwe when replying
> email: robert.qwe.ce...@neosys.xrs.qwe.si

> No questions via email, unless explicitly invited.
> Wanli Ma wrote in message <38B572BD.CCB32...@LegendNet.com>...
> >How to get the match percentage for a multiple keywords query like
> >E{*word*277} search engine of web pages?  One way is to get recordset of those
> >records with at least one keyword matched.  And then iterate through all
> >the records and calculate the percentage and finally resort by match
> >percentage.  It seems not to be the best way if initial record set
> >contains too many records.  Any guru out there can point to a better on
> >how to do a multiple keyword query with result ordered by percent
> >match?  Thanks.

> >--
> >Wanli Ma
> >--
> >Internet Programmer
> >Legend Microsystems, Inc.
> >http://www.LegendNet.com
> >mailto:W...@LegendNet.com

Re:Query Match Percentage


Quote
Wanli Ma wrote in message <38B6D8A6.5DCB6...@LegendNet.com>...
>Thanks for the suggestion.  This should make the search much faster when
the
>volume is large.  But there are some drawbacks here compared with what
Dmitry
>suggested.  It is resource consuming to index each non-repeated word in a
text.
>And also when the text changes, to update keywords also takes time.  There
would
>be no this problem if not indexing the keywords.  Indexing keywords does
offer
>another advantage if having the keyword table structured following way:

>PAGE_ID, AKeyword, Frequency.

>The appearance frequency could also be taken into account in percentage
>calculation.

Depends. Generally you're interested in a keyword and combination on
keywords, less likely you're interested in how many times it appears on the
same page.

It is true that update keywords takes some time and space, but speeds up
searching 100, 1000 times or more.

Also, think if you need to index each word, or to define a set of keywords
and index on it.
If you need to index each word, you should consider bitmap indexes, because
they speed up like search too: for expression "like '%qwe%'", the dbengine
gains nothing by scanning standard index.

Quote
>Guess it all depends on the volume of the data.  I will check some more
>references.

>By the way, if indexing keywords, do you still suggest keeping a separate
Pages
>table?

Naturally. It's a master/detail relationship. 0, 1 or many keywords may
appear on the same page.

Quote

>Thanks for all.

--
----------------------
Regards
Robert Cerny
Remove both qwe when replying
email: robert.qwe.ce...@neosys.xrs.qwe.si

No questions via email, unless explicitly invited.

Re:Query Match Percentage


Thanks.  By the way, I like the way you modified your email address.

As for indexing, if the volume is not very hight, i.e. <100,000 records.
Searching Pages table does offer another advantage.  That is, you are able to
search Exact Phrase as search engine used in MS site.  I have no experience
with record counts > 100,000 and how the performance goes down.  When the
volume really goes up, I might have to build master-detail relationship by
further indexing keywords.  Any url you can recommend on "bitmap index"?

Well, here is a little further question.  Two keywords Kw1 and Kw2 have
different weight.  Say Kw1 for 3 and Kw2 for 1.  Is there any way I can
directly get the weighted count within query?  Or I have to mathematically deal
with weighted percentage after obtaining the resultset through "Like Kw1 or
Kw2"?  I thought I might be able to do it as

Select Kw1 From ...
Union All
Select Kw1 From ...
Union All
Select Kw1 From ...(Three Times)
Union All
Select Kw2 From ...(One Time)

But it is a little ugly.

All your help is greatly appreciated.

Wanli Ma

Quote
Robert Cerny wrote:
> Wanli Ma wrote in message <38B6D8A6.5DCB6...@LegendNet.com>...
> >Thanks for the suggestion.  This should make the search much faster when
> the
> >volume is large.  But there are some drawbacks here compared with what
> Dmitry
> >suggested.  It is resource consuming to index each non-repeated word in a
> text.
> >And also when the text changes, to update keywords also takes time.  There
> would
> >be no this problem if not indexing the keywords.  Indexing keywords does
> offer
> >another advantage if having the keyword table structured following way:

> >PAGE_ID, AKeyword, Frequency.

> >The appearance frequency could also be taken into account in percentage
> >calculation.

> Depends. Generally you're interested in a keyword and combination on
> keywords, less likely you're interested in how many times it appears on the
> same page.

> It is true that update keywords takes some time and space, but speeds up
> searching 100, 1000 times or more.

> Also, think if you need to index each word, or to define a set of keywords
> and index on it.
> If you need to index each word, you should consider bitmap indexes, because
> they speed up like search too: for expression "like '%qwe%'", the dbengine
> gains nothing by scanning standard index.

> >Guess it all depends on the volume of the data.  I will check some more
> >references.

> >By the way, if indexing keywords, do you still suggest keeping a separate
> Pages
> >table?

> Naturally. It's a master/detail relationship. 0, 1 or many keywords may
> appear on the same page.

> >Thanks for all.

> --
> ----------------------
> Regards
> Robert Cerny
> Remove both qwe when replying
> email: robert.qwe.ce...@neosys.xrs.qwe.si

> No questions via email, unless explicitly invited.

Other Threads