Board index » delphi » Removing duplicate rows in a SQL data base.

Removing duplicate rows in a SQL data base.

I have a SQL data base that only contains one table.
Each row contains 10 fields. 1 string field (50 chars), 1 datetime field,
and eight integers.
The table currently contains more than 7Million rows.
Several hundred thousand rows are added each month. (Current loads are
generally weekly)

Some of those rows are exact duplicates.
I want to keep only 1 copy of each unique row in the data base.
When additional data is loaded, I would like any additional entries that are
duplicates of existing rows to be ignored.

(Duplicates occur because:
    1. the user reloads the same source data file by mistake
    2. the load process terminates abnormally for some reason (some one
trips over a power cord, the main breaker trips, the UPS drowns [this is
Houston] etc), so the user restarts the load)

Q1. Is there a cheap way to remove the undesired copies?
Q2. Is there a server side "switch" that will cause an exact copy of an
existing record to be ignored on load?

I'm using D5 Enterprise and MS SQL Server 8.00.194 on Win2000.

Generic SQL or MS-SQL specific suggestions appreciated.

Thanks.

Keith.

 

Re:Removing duplicate rows in a SQL data base.


Does your table have a primary key ? That's how duplicates are normally
avoided. Or do you mean that your duplicates are different rows with the
same values for all columns except the primary key ? Creating an index with
the IGNORE_DUP_KEY option is a possibility, but it all depends what you mean
by "duplicate". The following message might help:
http://groups.google.com/groups?hl=en&selm=OnWX%23OGa%24GA.1372%40cpp...
6

Cheers,
Andy Mackie.

Quote
Keith Brown <kbr...@petropages.com> wrote in message

news:3baff829$1_2@dnews...
Quote
> Some of those rows are exact duplicates.
> I want to keep only 1 copy of each unique row in the data base.
> When additional data is loaded, I would like any additional entries that
are
> duplicates of existing rows to be ignored.

Re:Removing duplicate rows in a SQL data base.


In article <3baff829$1_2@dnews>, kbr...@petropages.com says...

Quote
> Generic SQL or MS-SQL specific suggestions appreciated.

Check the following
http://support.microsoft.com/directory/article.asp?ID=KB;EN-US;Q139444
http://support.microsoft.com/support/kb/articles/Q268/5/05.ASP

-------------------------------
Paul Lambadaris
Delta Singular S.A.
mailto : p...@singular.gr
www    : http://www.singular.gr

Re:Removing duplicate rows in a SQL data base.


The best way to prevent the duplicates would be to have the inserts occur
using a stored procedure, and the stored procedure checks for a duplicate
entry before it does the insert.

Alternatively, you could just load into a temporary table, a group the
entries into the real table:
insert into realtable
(select * from temptable
 group by strfield, datefield, int1, int2, int3, int4, int5, int6, int7,
int8);

HTH,
Christopher Latta

Quote
Keith Brown <kbr...@petropages.com> wrote in message

news:3baff829$1_2@dnews...
Quote
> I have a SQL data base that only contains one table.
> Each row contains 10 fields. 1 string field (50 chars), 1 datetime field,
> and eight integers.
> The table currently contains more than 7Million rows.
> Several hundred thousand rows are added each month. (Current loads are
> generally weekly)

> Some of those rows are exact duplicates.
> I want to keep only 1 copy of each unique row in the data base.
> When additional data is loaded, I would like any additional entries that
are
> duplicates of existing rows to be ignored.

> (Duplicates occur because:
>     1. the user reloads the same source data file by mistake
>     2. the load process terminates abnormally for some reason (some one
> trips over a power cord, the main breaker trips, the UPS drowns [this is
> Houston] etc), so the user restarts the load)

> Q1. Is there a cheap way to remove the undesired copies?
> Q2. Is there a server side "switch" that will cause an exact copy of an
> existing record to be ignored on load?

> I'm using D5 Enterprise and MS SQL Server 8.00.194 on Win2000.

> Generic SQL or MS-SQL specific suggestions appreciated.

Re:Removing duplicate rows in a SQL data base.


[scrubbing data]
1) Add another column called CRC to the table that is a 32 bit number.
2) Run an application to update the new column with the CRC of all the data
in the row (effectively a hash).
3) SELECT CRC FROM (SELECT CRC, COUNT(0) AS REC_COUNT FROM MYTABLE) WHERE
REC_COUNT > 1.
4) Use the above results to delete all but one row.
[enforcing uniqueness]
5) Create a unique index on the CRC and make it a required field.
6) Create a trigger that automatically calculates the CRC and inserts it
into the new row.
7) All new data put into the table will be calculated (CRC) and will fail to
insert if the unique CRC is violated.

Make sense?  Lather, rinse, and repeat.

Eric

Re:Removing duplicate rows in a SQL data base.


Quote
"Christopher Latta" <nob...@nowhere.not> wrote:
>insert into realtable
>(select * from temptable
> group by strfield, datefield, int1, int2, int3, int4, int5, int6, int7,
>int8);

   Combining this with a Unique Index that ignores duplicates
would be a simple solution to the current problem. Would
avoid need for trigger checks or SP Inserts.

create unique index IndexName on Tablename
(column1,column2, ...) with ignore_dup_key

   Instead of working on the whole large table at a time, a primary autogenerated key could be added and data in ranges could be inserted.

Sunil Furtado
New Mangalore Port Trust

Re:Removing duplicate rows in a SQL data base.


Hi Keith!

On Mon, 24 Sep 2001 22:25:42 -0500, "Keith Brown"

Quote
<kbr...@petropages.com> wrote:
>I have a SQL data base that only contains one table.
>Each row contains 10 fields. 1 string field (50 chars), 1 datetime field,
>and eight integers.
>The table currently contains more than 7Million rows.
>Several hundred thousand rows are added each month. (Current loads are
>generally weekly)

>Some of those rows are exact duplicates.
>I want to keep only 1 copy of each unique row in the data base.
>When additional data is loaded, I would like any additional entries that are
>duplicates of existing rows to be ignored.

>(Duplicates occur because:
>    1. the user reloads the same source data file by mistake
>    2. the load process terminates abnormally for some reason (some one
>trips over a power cord, the main breaker trips, the UPS drowns [this is
>Houston] etc), so the user restarts the load)

>Q1. Is there a cheap way to remove the undesired copies?
>Q2. Is there a server side "switch" that will cause an exact copy of an
>existing record to be ignored on load?

I would advice to use one another table to load records into and then
when all the data is loaded move it to the main table in one select
statement (and one transaction of course). In this way you would have
consistency on the level of one load.

As for the multiple loading of the same data, you have to make sure
that this doesn't happen. The implementation of this is dependent of
your data. If you are sure that there can not exist the same data row
more then once than you could make primary CLUSTERED key on all the
fields and in this way you will ensure that there is no duplication
entered. Clastered key is required so that index data doesn't take
extra space, table itself would be ordered.

Also, for loading purposes consider BCP (bulk copy) instead of just
using some SQL code for insertion either via INSERT or EXEC stored
procedure cause with BCP you can load 1000 records per second what is
absolutely the fastest way for data load. This approach would also
require separate loading table and once when data is in the database
then you process it and move to the main table.

tomi.

Other Threads