
Looking for super fast CSV parser


2007-09-07 11:28:20 PM
delphi200
I have a comma delimited file with 20 million rows of text data, around
10 fields per line. My slowest routine is the parsing out of these field
values and converting them to double.
Are there any super fast CSV parser routines out there that are
reliable? The file I am parsing doesn't have any quotes so that should
make things simpler. TIA
Sam
 
 

Re:Looking for super fast CSV parser

Sam Larson writes:
Quote
I have a comma delimited file with 20 million rows of text data,
around 10 fields per line. My slowest routine is the parsing out of
these field values and converting them to double.
Are you doing this to load the values into a database? If so, which
database? Many RDBMS have tools for bulk-loading data such as this
very quickly.
--
Kevin Powick
 

Re:Looking for super fast CSV parser

Kevin Powick writes:
Quote
Sam Larson writes:


>I have a comma delimited file with 20 million rows of text data,
>around 10 fields per line. My slowest routine is the parsing out of
>these field values and converting them to double.


Are you doing this to load the values into a database? If so, which
database? Many RDBMS have tools for bulk-loading data such as this
very quickly.

Kevin,
I have to do some preprocessing on this data before it gets added
to the database. Adding rows to a database is extremely slow because of
the disk I/O and index building. Even memory tables are slow compared to
using a TList or TStringList as temporary storage (I load 10k rows at a
time). I then compare the TList data to the db table and only update
rows that have changed. This is an order of magnitude faster than if I
reloaded the data to the database.
After doing a benchmark I discovered the slowest routine I have is
the parser, probably because it gets executed around a hundred million
times. Shaving a few ms off it will make a big difference, which is
why I am looking for a fast parser.
Sam
 

Re:Looking for super fast CSV parser

Hyperstring.
I love it, been using it for years.
It's very easy; about three functions get you the delimited info.
"Sam Larson" <XXXX@XXXXX.COM>writes
Quote
Kevin Powick writes:

>Sam Larson writes:
>
>
>>I have a comma delimited file with 20 million rows of text data,
>>around 10 fields per line. My slowest routine is the parsing out of
>>these field values and converting them to double.
>
>
>Are you doing this to load the values into a database? If so, which
>database? Many RDBMS have tools for bulk-loading data such as this
>very quickly.
>

Kevin,
I have to do some preprocessing on this data before it gets added
to the database. Adding rows to a database is extremely slow because of
the disk I/O and index building. Even memory tables are slow compared to
using a TList or TStringList as temporary storage (I load 10k rows at a
time). I then compare the TList data to the db table and only update
rows that have changed. This is an order of magnitude faster than if I
reloaded the data to the database.

After doing a benchmark I discovered the slowest routine I have is
the parser, probably because it gets executed around a hundred million
times. Shaving a few ms off of it will make a big difference which is
why I am looking for a fast parser.

Sam
 

Re:Looking for super fast CSV parser

Quote
>>I have a comma delimited file with 20 million rows of text data,
>>around 10 fields per line. My slowest routine is the parsing out of
>>these field values and converting them to double.
>
>
>Are you doing this to load the values into a database? If so, which
>database? Many RDBMS have tools for bulk-loading data such as this
>very quickly.
I have to do some preprocessing on this data before it gets added
to the database. Adding rows to a database is extremely slow because of
the disk I/O and index building. Even memory tables are slow compared to
using a TList or TStringList as temporary storage (I load 10k rows at a
time). I then compare the TList data to the db table and only update
rows that have changed. This is an order of magnitude faster than if I
reloaded the data to the database.

After doing a benchmark I discovered the slowest routine I have is
the parser, probably because it gets executed around a hundred million
times.
If there are 20M rows, why would the parser need to execute
100M times? I'm pretty good at writing parsers; what does your
current implementation look like? Is the I/O buffered?
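For what it's worth, the default Readln text buffer is only 128 bytes, so handing it a bigger one with SetTextBuf is a cheap first step. A minimal sketch (the file name is just a placeholder):

var
  F: TextFile;
  Buf: array[0..65535] of Byte; // 64 KB buffer instead of the 128-byte default
  Line: string;
begin
  AssignFile(F, 'data.csv');        // placeholder file name
  SetTextBuf(F, Buf, SizeOf(Buf));  // assign the buffer before Reset
  Reset(F);
  while not Eof(F) do
    ReadLn(F, Line);                // parse Line here
  CloseFile(F);
end;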
 

Re:Looking for super fast CSV parser

John McTaggart writes:
Quote
>>>I have a comma delimited file with 20 million rows of text data,
>>>around 10 fields per line. My slowest routine is the parsing out of
>>>these field values and converting them to double.
>>
>>
>>Are you doing this to load the values into a database? If so, which
>>database? Many RDBMS have tools for bulk-loading data such as this
>>very quickly.


>I have to do some preprocessing on this data before it gets added
>to the database. Adding rows to a database is extremely slow because of
>the disk I/O and index building. Even memory tables are slow compared to
>using a TList or TStringList as temporary storage (I load 10k rows at a
>time). I then compare the TList data to the db table and only update
>rows that have changed. This is an order of magnitude faster than if I
>reloaded the data to the database.
>
>After doing a benchmark I discovered the slowest routine I have is
>the parser, probably because it gets executed around a hundred million
>times.


If there are 20M rows, why would the parser need to execute
100M times?
It parses each field, and there are at least 5 fields per row (closer to
9 fields), so 20M x 5 = 100M fields need to be parsed.
I'm currently using HyperStr's Parse routine, which is reasonably fast,
but it is not written in assembler.
Quote
I'm pretty good at writing parsers; what does your
current implementation look like? Is the I/O buffered?
I'm using Readln and it is quite fast. I have done some profiling and it's
the field parsing that is taking about 7.5x longer than the Readln.
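In case it is useful, here is a minimal hand-rolled sketch of the usual trick: scan each line once with a PChar and convert with Val instead of StrToFloat. This is not HyperStr's Parse; ParseLine is just an illustrative name:

// Scans one comma-separated line left to right and converts each field to Double.
// No quoting is handled, which matches the file described above.
procedure ParseLine(const Line: string; var Values: array of Double);
var
  P, FieldStart: PChar;
  Field: string;
  Code, Idx: Integer;
begin
  Idx := 0;
  P := PChar(Line);
  while (P^ <> #0) and (Idx <= High(Values)) do
  begin
    FieldStart := P;
    while (P^ <> ',') and (P^ <> #0) do
      Inc(P);
    SetString(Field, FieldStart, P - FieldStart); // copy just this field
    Val(Field, Values[Idx], Code);                // Code <> 0 means the field was not numeric
    Inc(Idx);
    if P^ = ',' then
      Inc(P);
  end;
end;

Error handling (Code <> 0) and sizing Values to the row width are left out to keep the sketch short.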
Sam
 

Re:Looking for super fast CSV parser

Craig writes:
Quote
Hyperstring.

I love it, been using it for years.
It's very easy; about three functions get you the delimited info.


I have it too. I just use the Parse() function to extract the delimited
fields. Is there a faster Hyperstr function? TIA
Sam
 

Re:Looking for super fast CSV parser

Hello Sam
You can give AnyDAC a try. Here is how it can help you:
1) It has TADDataMove, which can load data from text
files (multiple formats are supported). The parser is very
well optimized.
2) It has TADClientDataSet, so you can load data into
it, mark the required records as changed, and then use
AnyDAC functionality to post the changes to the database.
TADClientDataSet is very well optimized. Also, you can
eliminate the TDataSet overhead and work with the internal
data storage through a very efficient and convenient API.
AnyDAC is free and available from www.da-soft.com.
Regards,
Dmitry
--
Dmitry Arefiev - www.da-soft.com
AnyDAC - Oracle, MySQL, MS SQL, MSAccess, IBM DB2, Sybase
ASA, DbExpress, ODBC freeware data access engine
ThinDAC - multitier data access engine
 

Re:Looking for super fast CSV parser

After careful consideration, Craig wrote:
Quote
Hyperstring.

I love it, been using it for years.
It's very easy; about three functions get you the delimited info.




"Sam Larson" <XXXX@XXXXX.COM>writes
news:46e1be69$XXXX@XXXXX.COM...
>Kevin Powick writes:
>
>>Sam Larson writes:
>>
>>
>>>I have a comma delimited file with 20 million rows of text data,
>>>around 10 fields per line. My slowest routine is the parsing out of
>>>these field values and converting them to double.
>>
>>
>>Are you doing this to load the values into a database? If so, which
>>database? Many RDBMS have tools for bulk-loading data such as this
>>very quickly.
>>
>
>Kevin,
>I have to do some preprocessing on this data before it gets added
>to the database. Adding rows to a database is extremely slow because of
>the disk I/O and index building. Even memory tables are slow compared to
>using a TList or TStringList as temporary storage (I load 10k rows at a
>time). I then compare the TList data to the db table and only update
>rows that have changed. This is an order of magnitude faster than if I
>reloaded the data to the database.
>
>After doing a benchmark I discovered the slowest routine I have is
>the parser, probably because it gets executed around a hundred million
>times. Shaving a few ms off of it will make a big difference which is
>why I am looking for a fast parser.
>
>Sam
Hi,
I tried to find Hyperstring but couldn't find it on the Internet; it seems that
the original location is down.
Could you please tell me where I can find it (for Delphi 2007)?
Yours sincerely,
A.
--
Alain
 

Re:Looking for super fast CSV parser

"Sam Larson" <XXXX@XXXXX.COM>writes
Quote

I'm using Readln and it is quite fast. I have done some profiling and it's the
field parsing that is taking about 7.5x longer than the Readln.

I have some projects with CSV, but Readln was the first thing to go :-)
I normally use two TStringLists: one for reading the file with LoadFromFile
and the other to parse each line via CommaText.
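A minimal sketch of that pattern, assuming the whole file fits in memory; the file name and procedure name are placeholders, and StrictDelimiter (Delphi 2006 and later) keeps spaces from acting as extra delimiters:

// Needs Classes and SysUtils in the uses clause.
procedure LoadAndConvert;
var
  Lines, Fields: TStringList;
  i, j: Integer;
  Value: Double;
begin
  Lines := TStringList.Create;
  Fields := TStringList.Create;
  try
    Lines.LoadFromFile('data.csv');   // whole file in memory at once
    Fields.StrictDelimiter := True;   // commas only, not spaces
    for i := 0 to Lines.Count - 1 do
    begin
      Fields.CommaText := Lines[i];   // split one line into fields
      for j := 0 to Fields.Count - 1 do
        Value := StrToFloat(Fields[j]);
    end;
  finally
    Fields.Free;
    Lines.Free;
  end;
end;

With 20 million rows, LoadFromFile will need well over a gigabyte of RAM, so in practice you would read and parse in chunks.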
 

Re:Looking for super fast CSV parser

Sam Larson <XXXX@XXXXX.COM>writes:
Quote
I have a comma delimited file with 20 million rows of text data, around
10 fields per line. My slowest routine is the parsing out of these field
values and converting them to double.

Are there any super fast CSV parser routines out there that are
reliable? The file I am parsing doesn't have any quotes so that should
make things simpler. TIA
The TDICsvParser component of DIUnicode (www.yunqa.de/delphi/) is super
fast because it
* reads the file in small chunks
* uses an optimized buffer mechanism
* never backtracks, never reads data twice
* reports a single cell at a time only
Parsing is RFC compliant. Please run the demo for testing - but be aware that
the TStringGrid used to display the results is ill suited to 20 million rows.
The underlying parsing engine, however, handles them fine.
Ralf
---
The Delphi Inspiration
www.yunqa.de/delphi/
 

Re:Looking for super fast CSV parser

Quote
Are there any super fast CSV parser routines out there that are
reliable? The file I am parsing doesn't have any quotes so that should
make things simpler. TIA
www.xilytix.com/FieldedTextComponent.html
Note that this is written in Delphi.NET.
Regards
Paul Klink
 

Re:Looking for super fast CSV parser

"Sam Larson" <XXXX@XXXXX.COM>wrote
Quote
>>I have a comma delimited file with 20 million rows
of text data, around 10 fields per line. My slowest
routine is the parsing out of these field values
and converting them to double.
A CSV file of 20 million rows of 10 fields per row and
an average of 7 characters per number would require
about 1.62 GB.
200,000,000 doubles would require about 1.60 GB.
Quote
>>I have to do some preprocessing on this data before
it gets added to the database. Adding rows to a
database is extremely slow because of the disk I/O
and index building. Even memory tables are slow
compared to using a TList or TStringList as temporary
storage (I load 10k rows at a time). I then compare
the TList data to the db table and only update rows
that have changed. This is an order of magnitude
faster than if I reloaded the data to the database.
Parsing a CSV string to double might take 500 to
1500 CPU cycles. How long is it taking your code?
How much data (in megabytes) are you keeping in memory
while doing your processing?
Rgds, JohnH
Ref:
Function to return clock cycle count.
function GetCpuClockCycleCount: Int64;
asm
  // RDTSC ($0F $31) reads the CPU time-stamp counter into EDX:EAX,
  // which is exactly where a 32-bit Delphi function returns an Int64.
  dw $310F // opcode for RDTSC
end;
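For example, to get a rough cycles-per-conversion figure with it (single RDTSC readings are noisy, so average over many calls; Val here just stands in for whatever conversion you use):

var
  Start, CyclesPerCall: Int64;
  V: Double;
  Code, i: Integer;
begin
  Start := GetCpuClockCycleCount;
  for i := 1 to 100000 do
    Val('12345.678', V, Code);       // the conversion being measured
  CyclesPerCall := (GetCpuClockCycleCount - Start) div 100000;
  Writeln('Approx. cycles per Val: ', CyclesPerCall);
end;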
 

Re:Looking for super fast CSV parser

Quote

Hi,

I tried to find Hyperstring but couldn't find it on the Internet; it seems that
the original location is down.

Could you please tell me where I can find it (for Delphi 2007)?
Yours sincerely,

A.

I'm afraid Hyperstr is no longer being sold.
Sam
 

Re:Looking for super fast CSV parser

Sam Larson writes:
Quote
>
>Hi,
>
>I tried to find Hyperstring but couldn't find it on the Internet; it seems
>that the original location is down.
>
>Could you please tell me where I can find it (for Delphi
>2007)? Yours sincerely,
>
>A.
>

I'm afraid Hyperstr is no longer being sold.

Sam
Try this link -
https://www.regsoft.net/regsoft/vieworderpage.php3
HTH,
Glynn
--