Board index » delphi » Finding Duplicates in a CSV file

Finding Duplicates in a CSV file

I receive daily a CSV file of about 5000 lines.  I need to check for
duplicate names in a particular column (column 7).  One way I'm thingking of
doing this could be to iterate through the file adding the column value to a
TStringList if it is not already in the list.  If there's already a value in
the TStringList, add the value to a TStringList that contains duplicates.
With the above method I would only iterate through the file once.

The difficulty with my concept is that I don't know how to find the seventh
column.  To make it even more trickier is the seventh column that contains
the names has a comma such as : "Nielsen, William".

Can anyone share some code or tips on how to find the seventh column or
perhaps some insights on how to find duplicates in a CSV file

Thanks,
Bill N

 

Re:Finding Duplicates in a CSV file


Quote
>The difficulty with my concept is that I don't know how to find the seventh
>column.  To make it even more trickier is the seventh column that contains
>the names has a comma such as : "Nielsen, William".

>Can anyone share some code or tips on how to find the seventh column or
>perhaps some insights on how to find duplicates in a CSV file

Are you using text driver to read CSV files?
If so, why don't you reference to the column by it's name (FieldByName
property)?
To list all duplicate values you may use a query:

SELECT name, count(name)
FROM YourTable
GROUP BY name
HAVING count(*)>1

and then use table's Locate method or Filter property to locate those
records.

HTH, Bojan

Re:Finding Duplicates in a CSV file


Try something like the following.

var
  DataFile:           TextFile;
  Rec:                  TStringList;
  S:                      String;
  Name:               String;
begin
AssignFile(DataFile, 'c:\foo\mydata.txt');
Reset(DataFile);
Rec := TStringList.Create;
while not DataFile.EOF do
begin
  Rec.Clear;
  Readln(DataFile, S);
  Rec.CommaText := S;
  Name := Rec[6];
end;

Assigning a record from a comma delimited file to the CommaText property of
a StringList causes the StringList to parse the record and put each field in
a separate string in the StringList. Since the StringList elements are
numbered starting with zero the 6th element contains the 7th field.

--
Bill
(TeamB cannot answer questions received via email)

Re:Finding Duplicates in a CSV file


"Bill and Karen Nielsen" <a_fam...@telusplanet.net> wrote in message
news:3bb68ddc$1_2@dnews...

Quote

> The difficulty with my concept is that I don't know how to find the
seventh
> column.  To make it even more trickier is the seventh column that contains
> the names has a comma such as : "Nielsen, William".

> Can anyone share some code or tips on how to find the seventh column or
> perhaps some insights on how to find duplicates in a CSV file

Here is a routine that I use. It takes into account the double quotes around
names and such so that the comma isn't counted. Pass it the string and the
number of the "field" you want and it will return a string in that position.

HTH
Woody

function GetToken(TokenStr: string; TokenNum: integer): string;
 // Separate is an internal procedure to strip the first token
  // from the string and return it
const
  TokenSeparator: char = ',';

 function Separate(var TokenStr: string): string;
  var
   InQuote: boolean;
    x: integer;

  begin
    InQuote := false;
    Result := '';
    if length(TokenStr) > 0 then
    begin
     for x := 1 to length(TokenStr) do
       if InQuote then
        begin
          if (TokenStr[x] = '"') then
          begin
           if (x = length(TokenStr)) then
             TokenStr := '';
           InQuote := false;
          end
          else
          Result := Result + TokenStr[x];
        end
        else
        begin
         if (TokenStr[x] = TokenSeparator) then
          begin
           TokenStr := copy(TokenStr, x+1, length(TokenStr));
            break;
          end
          else if (TokenStr[x] = '"') then
          begin
            InQuote := true;
          end
          else
          begin
           Result := Result + TokenStr[x];
            if (x = length(TokenStr)) then
             TokenStr := '';
          end
        end;
    end;
    if (result = TokenStr) then
     TokenStr := '';
  end;

begin
 if (TokenNum = 1) then
   result := Separate(TokenStr)
  else
  begin
   result := Separate(TokenStr);
    if (length(TokenStr) = 0) and (TokenNum > 2) then
     result := '{empty}'
    else
    result := GetToken(TokenStr,TokenNum - 1);
  end;
end;

Re:Finding Duplicates in a CSV file


Have you tried using the ASCII driver to open the file like a table?  This
would allow you to import the table and manipulate it as you wish.  You
cannot modify a file viewed ny the ACSII driver that is why you have to
import it into another table and the modify.

I can provide examples, codes and some programs if you would like.

Chris

Quote
"Woody" <woody....@ih2000.net> wrote in message news:3bb8bb03_1@dnews...
> "Bill and Karen Nielsen" <a_fam...@telusplanet.net> wrote in message
> news:3bb68ddc$1_2@dnews...

> > The difficulty with my concept is that I don't know how to find the
> seventh
> > column.  To make it even more trickier is the seventh column that
contains
> > the names has a comma such as : "Nielsen, William".

> > Can anyone share some code or tips on how to find the seventh column or
> > perhaps some insights on how to find duplicates in a CSV file

> Here is a routine that I use. It takes into account the double quotes
around
> names and such so that the comma isn't counted. Pass it the string and the
> number of the "field" you want and it will return a string in that
position.

> HTH
> Woody

> function GetToken(TokenStr: string; TokenNum: integer): string;
>  // Separate is an internal procedure to strip the first token
>   // from the string and return it
> const
>   TokenSeparator: char = ',';

>  function Separate(var TokenStr: string): string;
>   var
>    InQuote: boolean;
>     x: integer;

>   begin
>     InQuote := false;
>     Result := '';
>     if length(TokenStr) > 0 then
>     begin
>      for x := 1 to length(TokenStr) do
>        if InQuote then
>         begin
>           if (TokenStr[x] = '"') then
>           begin
>            if (x = length(TokenStr)) then
>              TokenStr := '';
>            InQuote := false;
>           end
>           else
>           Result := Result + TokenStr[x];
>         end
>         else
>         begin
>          if (TokenStr[x] = TokenSeparator) then
>           begin
>            TokenStr := copy(TokenStr, x+1, length(TokenStr));
>             break;
>           end
>           else if (TokenStr[x] = '"') then
>           begin
>             InQuote := true;
>           end
>           else
>           begin
>            Result := Result + TokenStr[x];
>             if (x = length(TokenStr)) then
>              TokenStr := '';
>           end
>         end;
>     end;
>     if (result = TokenStr) then
>      TokenStr := '';
>   end;

> begin
>  if (TokenNum = 1) then
>    result := Separate(TokenStr)
>   else
>   begin
>    result := Separate(TokenStr);
>     if (length(TokenStr) = 0) and (TokenNum > 2) then
>      result := '{empty}'
>     else
>     result := GetToken(TokenStr,TokenNum - 1);
>   end;
> end;

Other Threads