
Sorting - Big files

In reply to a private email from a person who knows he can interrupt me (but
should have added to the thread...)

Hi M,

Quote

> I saw your post in clpb on sorting:

> > Of course you've got a problem if your file contains more than N
> > records or if you've got limited memory. (With both the original
> > and the upcased name the record will have a size of 44 bytes,
> > limiting N to, depending on other global variables, somewhere
> > around the 1450)

> > Any clues? Give it a try and come back if you're really stuck.

> How do you sort arrays larger than available memory?

You cannot have an array that is larger than available memory :-)

Quote
> I know it can be done, since I downloaded several sort programs from
> Simtel that do so. But if you can explain, then I (and probably a
> bunch of people in the newsgroup) would very much like to know how!

If your file exceeds available direct access memory, you will have to
settle for a multipass merge sort, in _very_ simplistic terms (a rough
Pascal sketch follows the list):

- read in as many records as allowed by available RAM
- sort them and write them out to file 1
- read in a next bunch of records
- sort them and write them out to file 2
- repeat above until entire original file has been processed.
- open files created
- read one record from each of them
- write out the smallest of these and read the next record from the
  file that delivered it
- repeat previous step until all files have reached eof
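
To make this concrete, here is a rough Turbo Pascal sketch of the scheme -
not RPSORT, just an illustration. The record type, the key, the file names
and the MaxInMem/MaxRuns limits are all invented for the example, there is
no error handling, and the insertion sort merely stands in for whatever
in-memory sort you would really use:

program ExtSort;

const
  MaxInMem = 1000;         { records that fit in RAM at once (made up) }
  MaxRuns  = 16;           { run files merged in one go (handle limit) }

type
  TRec  = record
    Key: LongInt;
    { ...other fields... }
  end;
  TFile = file of TRec;

var
  Buf: array[1..MaxInMem] of TRec;

function Less(const a, b: TRec): Boolean;
begin
  Less := a.Key < b.Key;
end;

procedure SortBuf(n: Integer);             { any in-memory sort will do }
var i, j: Integer; t: TRec;
begin
  for i := 2 to n do
  begin
    t := Buf[i]; j := i - 1;
    while (j >= 1) and Less(t, Buf[j]) do
    begin Buf[j + 1] := Buf[j]; Dec(j); end;
    Buf[j + 1] := t;
  end;
end;

function RunName(i: Integer): string;
var s: string;
begin
  Str(i, s);
  RunName := 'run' + s + '.tmp';
end;

{ Phase 1: split the input into sorted run files, return their number. }
function MakeRuns(const InName: string): Integer;
var
  src, dst: TFile;
  n, runs, i: Integer;
begin
  Assign(src, InName); Reset(src);
  runs := 0;
  while not Eof(src) do
  begin
    n := 0;
    while (n < MaxInMem) and not Eof(src) do
    begin Inc(n); Read(src, Buf[n]); end;
    SortBuf(n);
    Inc(runs);
    Assign(dst, RunName(runs)); Rewrite(dst);
    for i := 1 to n do Write(dst, Buf[i]);
    Close(dst);
  end;
  Close(src);
  MakeRuns := runs;
end;

{ Phase 2: k-way merge - keep one record per run file in memory, write
  the smallest, then refill from the file that delivered it.  Assumes
  the number of runs does not exceed MaxRuns. }
procedure MergeRuns(runs: Integer; const OutName: string);
var
  f:   array[1..MaxRuns] of TFile;
  cur: array[1..MaxRuns] of TRec;
  ok:  array[1..MaxRuns] of Boolean;
  dst: TFile;
  i, best: Integer;
begin
  for i := 1 to runs do
  begin
    Assign(f[i], RunName(i)); Reset(f[i]);
    ok[i] := not Eof(f[i]);
    if ok[i] then Read(f[i], cur[i]);
  end;
  Assign(dst, OutName); Rewrite(dst);
  repeat
    best := 0;
    for i := 1 to runs do
      if ok[i] and ((best = 0) or Less(cur[i], cur[best])) then best := i;
    if best > 0 then
    begin
      Write(dst, cur[best]);
      ok[best] := not Eof(f[best]);
      if ok[best] then Read(f[best], cur[best]);
    end;
  until best = 0;
  Close(dst);
  for i := 1 to runs do Close(f[i]);
end;

begin
  MergeRuns(MakeRuns('data.dat'), 'sorted.dat');
end.

If the data produces more runs than you can have files open at once, you are
in the "too many open files" situation described below: merge the runs in
groups of MaxRuns into bigger runs and repeat.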

Of course you may encounter other problems in this scenario - too many
open files, for instance, in which case you would merge only the first
so many files into one before repeating the exercise with the next
bunch. And yes, you can already do the first merge pass with the data
from file 1 and the sorted data from the next bunch...

Of course it's far easier to download RPSRT102.ZIP and SPWNO413.ZIP and
use the latter's EXEC(...) to run RPSORT and do your sorting... <G>

And if RPSORT isn't fast enough, it contains the full source, so you
can tweak it for even higher performance.

BTW, I once tested it with a 250 MB file filled with garbage: RPSORT (multiple
keys!) took just over 10 minutes in an NT DOS box, while I zapped NT's SORT
(single key) after more than 30 minutes... Both were running as the only real
process.

Robert
--
Robert AH Prins
prin...@williscorroon.com


Re:Sorting - Big files


In article <7f4grb$7i...@nnrp1.dejanews.com>,
  Robert AH Prins <prin...@williscorroon.com> wrote:

Quote
> In reply to a private email from a person who knows he can interrupt me (but
> should have added to the thread...)

> Hi M,

> > I saw your post in clpb on sorting:

> > > Of course you've got a problem if your file contains more than N
> > > records or if you've got limited memory. (With both the original
> > > and the upcased name the record will have a size of 44 bytes,
> > > limiting N to, depending on other global variables, somewhere
> > > around the 1450)

> > > Any clues? Give it a try and come back if you're really stuck.

> > How do you sort arrays larger than available memory?

> You cannot have an array that is larger than available memory :-)

> > I know it can be done, since I downloaded several sort programs from
> > Simtel that do so. But if you can explain, then I (and probably a
> > bunch of people in the newsgroup) would very much like to know how!

> If your file exceeds available direct access memory, you will have to
> settle for a multipass merge sort, in _very_ simplistic terms:

> - read in as many records as allowed by available RAM
> - sort them and write them out to file 1
> - read in a next bunch of records
> - sort them and write them out to file 2
> - repeat above until entire original file has been processed.
> - open files created
> - read one record from each of them
> - write out the smallest of these and read the next record from the
>   file that delivered it
> - repeat previous step until all files have reached eof

> Of course you may encounter other problems in this scenario - too many
> open files, for instance, in which case you would merge only the first
> so many files into one before repeating the exercise with the next
> bunch. And yes, you can already do the first merge pass with the data
> from file 1 and the sorted data from the next bunch...

> Of course it's far easier to download RPSRT102.ZIP and SPWNO413.ZIP and
> use the latter's EXEC(...) to run RPSORT and do your sorting... <G>

Another option would be to write all the records to an indexed file and then
read them back out in sorted order according to the desired key(s).  If
you're looking for blazing speed this is not your best choice.   The
advantage of such an approach is that it's easy to write such a program (with
the proper tools) and you can define your own sorting rules.
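
For illustration only, a loose sketch of the read-back-in-key-order idea. A
real indexed file (the Database Toolbox, for instance) keeps the index itself
on disk as a B-tree, so memory is not the limit; this toy version cheats by
assuming an index of (key, record number) pairs fits in RAM, which is fine
for modest files but obviously not for the monsters discussed above. Names,
the record layout and the limits are invented for the example, and there is
no error handling:

program IndexRead;

const
  MaxIdx = 6000;                      { index entries that fit in RAM }

type
  TRec = record
    Key: LongInt;
    { ...other fields... }
  end;
  TIdx = record
    Key:   LongInt;
    RecNo: LongInt;                   { record number in the data file }
  end;

var
  f: file of TRec;
  idx: array[1..MaxIdx] of TIdx;
  r: TRec;
  t: TIdx;
  n, i, j: Integer;
begin
  Assign(f, 'data.dat'); Reset(f);
  n := 0;
  { pass 1: collect keys and positions (this sketch just stops at MaxIdx) }
  while not Eof(f) and (n < MaxIdx) do
  begin
    Inc(n);
    idx[n].RecNo := FilePos(f);
    Read(f, r);
    idx[n].Key := r.Key;
  end;
  for i := 2 to n do                  { sort the index (any sort will do) }
  begin
    t := idx[i]; j := i - 1;
    while (j >= 1) and (idx[j].Key > t.Key) do
    begin idx[j + 1] := idx[j]; Dec(j); end;
    idx[j + 1] := t;
  end;
  for i := 1 to n do                  { pass 2: read back in key order }
  begin
    Seek(f, idx[i].RecNo);
    Read(f, r);
    { ...write r to wherever the sorted output should go... }
  end;
  Close(f);
end.

A real B-tree index does the "sorting" incrementally as the records are
inserted, which is what makes the approach so convenient, if not
particularly fast.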

Years ago I wrote such a utility using Borland's Database Toolbox.  I still
use it today.

Quote

> And if RPSORT isn't fast enough, it contains the full source, so you
> can tweak it for even higher performance.

> BTW, I once tested it with a 250 MB file filled with garbage: RPSORT (multiple
> keys!) took just over 10 minutes in an NT DOS box, while I zapped NT's SORT
> (single key) after more than 30 minutes... Both were running as the only real
> process.

..


Re:Sorting - Big files


In article <7f4grb$7i...@nnrp1.dejanews.com>,
  Robert AH Prins <prin...@williscorroon.com> wrote:

Quote
> In reply to a private email from a person who knows he can interrupt me (but
> should have added to the thread...)

And in response to this post I got another private email. I don't mind the
occasional private email, and getting two is not exactly a deluge, but I would
nevertheless appreciate it if you could ask your questions in the newsgroups.

To reply to your first question: My name is Robert. (And I really dislike the
ugly American habit of shortening names)

Quote
> Hi M,

> > I saw your post in clpb on sorting:

> > > Of course you've got a problem if your file contains more than N
> > > records or if you've got limited memory. (With both the original
> > > and the upcased name the record will have a size of 44 bytes,
> > > limiting N to, depending on other global variables, somewhere
> > > around the 1450)

> > > Any clues? Give it a try and come back if you're really stuck.

> > How do you sort arrays larger than available memory?

> You cannot have an array that is larger than available memory :-)

> > I know it can be done, since I downloaded several sort programs from
> > Simtel that do so. But if you can explain, then I (and probably a
> > bunch of people in the newsgroup) would very much like to know how!

> If your file exceeds available direct access memory, you will have to
> settle for a multipass merge sort, in _very_ simplistic terms:

> - read in as many records as allowed by available RAM
> - sort them and write them out to file 1
> - read in a next bunch of records
> - sort them and write them out to file 2
> - repeat above until entire original file has been processed.
> - open files created
> - read one record from each of them
> - write out the smallest of these and read the next record from the
>   file that delivered it
> - repeat previous step until all files have reached eof

> Of course you may encounter other problems in this scenario - too many
> open files, for instance, in which case you would merge only the first
> so many files into one before repeating the exercise with the next
> bunch. And yes, you can already do the first merge pass with the data
> from file 1 and the sorted data from the next bunch...

> Of course it's far easier to download RPSRT102.ZIP and SPWNO413.ZIP and
> use the latter's EXEC(...) to run RPSORT and do your sorting... <G>

> And if RPSORT isn't fast enough, it contains the full source, so you
> can tweak it for even higher performance.

> BTW, I once tested it with a 250 MB file filled with garbage: RPSORT (multiple
> keys!) took just over 10 minutes in an NT DOS box, while I zapped NT's SORT
> (single key) after more than 30 minutes... Both were running as the only real
> process.

URLs for RPSRT102.ZIP & SPWNO413.ZIP:

If I need files, I use http://ftpsearch.lycos.com/?form=medium

"Search type" - case insensitive substring match

"Max hits" - I usually set this to 50 or 100, to get a few more closer to
home sites and a decent amount of file dates, so that I can be sure I get the
latest version of something.

Don't enter an extension, as .ZIP is not universal for archives.

Robert
--
Robert AH Prins
prin...@williscorroon.com


Re:Sorting - Big files


Quote
netn...@altavista.net wrote:

> In article <7f4grb$7i...@nnrp1.dejanews.com>,
>   Robert AH Prins <prin...@williscorroon.com> wrote:
> > In reply to a private email from a person who knows he can interrupt me (but
> > should have added to the thread...)

  'Twas me - hit the wrong button in Netscape. Sorry. By the time I
  noticed, it was too late to do anything about it.

  [...]

  >> If your  file  exceeds available direct access  memory,  you will
  >> have to  settle  for a multipass merge sort,  in  very simplistic
  >> terms:

  >> - read in as many records as allowed by available RAM
  >> - sort them and write them out to file 1
  >> - read in a next bunch of records
  >> - sort them and write them out to file 2
  >> - repeat above until entire original file has been processed.
  >> - open files created
  >> - read one record from each of them
  >> - write out the smallest of these and read the next record from the
  >>   file that delivered it
  >> - repeat previous step until all files have reached eof

  [...]

  Very good explanation. Thanks!

  The key is to "write out the smallest of these and read the next
  record *from the file that delivered it*". It didn't make any sense
  until that sank in.

  The next problem is the number of file handles. This might be a good
  argument for not using the heap, which limits you to 64k chunks.

  Instead, allocate DOS memory with INT 21h, function 48h. This would
  allow sorting in ~500K chunks. If you had 30 file handles available,
  this would allow sorting files up to 15 megabytes before running out
  of file handles.
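
  For what it's worth, a hedged sketch of that allocation in real-mode
  Turbo Pascal (sizes and names invented for the example). The $M
  directive keeps the program's own stack small and its heap at zero,
  so DOS still has a large block left to hand out via INT 21h:

program DosAlloc;
{$M 16384, 0, 0}          { small stack, no Pascal heap: leave RAM to DOS }

uses Dos;

var
  r: Registers;
  seg, paras: Word;
begin
  paras := $8000;                     { $8000 paragraphs = 512 KB requested }
  r.AH := $48;                        { DOS: allocate memory block }
  r.BX := paras;
  Intr($21, r);
  if (r.Flags and FCarry) <> 0 then
  begin
    WriteLn('Allocation failed; largest free block: ', r.BX, ' paragraphs');
    Halt(1);
  end;
  seg := r.AX;                        { segment address of the new block }
  WriteLn('Got ', LongInt(paras) * 16, ' bytes at segment ', seg);
  { ...fill the block, e.g. via Ptr(seg, 0) and Move, sort it, write it... }

  r.AH := $49;                        { DOS: free the block again }
  r.ES := seg;
  Intr($21, r);
end.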

  Thanks to WIN9X, most machines come with 64 or 128 megs of RAM. It
  seems you could use HIMEM to move data from DOS to XMS, and sort
  larger files by using XMS instead of writing to the hard disk.

  Of course, the ideal would be to use flat real mode and do the sort
  directly in RAM! Default to using HIMEM only when in V86 mode.

Best Regards,

Michael R. Monett,
Automated Production Test
mailto:a...@csolve.net

Re:Sorting - Big files


In article <3718310F.1...@csolve.net>,
  Mike Monett <a...@csolve.net> wrote:

Quote
> netn...@altavista.net wrote:

> > In article <7f4grb$7i...@nnrp1.dejanews.com>,
> >   Robert AH Prins <prin...@williscorroon.com> wrote:
> > > In reply to a private email from a person who knows he can interrupt me (but
> > > should have added to the thread...)

>   'Twas me  - hit the wrong button in Netscape. Sorry. By  the  time I
>   noticed, it was too late to do anything about it.

>   [...]

Don't worry, you're on the list of people who are allowed to send me private
emails. It's just that I'm currently at war with my accountant and this is
putting me under some stress. [For those without access to UK newspapers,
accountants in the UK are the world's best paid, and mine is trying to charge
my company an amount equivalent to about GBP 3300 per hour...]

Quote
>   >> If your  file  exceeds available direct access  memory,  you will
>   >> have to  settle  for a multipass merge sort,  in  very simplistic
>   >> terms:

>   >> - read in as many records as allowed by available RAM
>   >> - sort them and write them out to file 1
>   >> - read in a next bunch of records
>   >> - sort them and write them out to file 2
>   >> - repeat above until entire original file has been processed.
>   >> - open files created
>   >> - read one record from each of them
>   >> - write out the smallest of these and read the next record from the
>   >>   file that delivered it
>   >> - repeat previous step until all files have reached eof

>   [...]

>   Very good explanation. Thanks!

>   The key  is  to "write out the smallest of these and  read  the next
>   record *from the file that delivered it*". It didn't make  any sense
>   until that sunk in.

>   The next problem is the number of file handles. This might be a good
>   argument for not using the heap, which limits you to 64k chunks.

>   Instead, allocate  DOS  memory with int21, function  48.  This would
>   allow sorting in ~500k chunks. If you had 30 file handles available,
>   this would allow sorting files up to 15 megabytes before running out
>   of file handles.

Not true; even with just one(!) file handle you can sort a file of any size:

- open file 1
- read input until RAM full
- close file 1
- sort RAM
- open file 2
- write out sorted data
- close file 2

- open file 1 again, read until RAM (minus one record) is full & close file 1
- sort RAM
- open file 2 and read its first record
- close file 2
- open file 3
- write out the smaller of RAM/file 2
- if the record came from RAM
  repeat above,
- else
  close file 3
  open file 2, read the next record & close it
  open file 3, seek to eof
  write out the smaller of the record just read/RAM

Etc, I suppose you get the picture (a rough sketch of the merge step follows).
The process will be slooooooooooooooooow... and I expect it to fail somewhere
with record files > 2 GB.
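
A rough sketch of that merge step, reusing TRec, TFile, Buf and Less from the
sketch earlier in the thread (names invented, no error handling). The freshly
sorted chunk sits in Buf[1..n], the previously sorted data in OldName, and the
merged result goes to NewName. Only one file is ever open; positions are
remembered and re-sought for every single record, which is exactly why it is
so slow:

procedure MergeOneHandle(const OldName, NewName: string; n: Integer);
var
  f: TFile;
  oldPos, oldSize, newPos: LongInt;
  i: Integer;
  oldRec: TRec;
  haveOld: Boolean;

  procedure NextOld;                  { fetch the next record from OldName }
  begin
    Assign(f, OldName); Reset(f);
    oldSize := FileSize(f);
    haveOld := oldPos < oldSize;
    if haveOld then
    begin Seek(f, oldPos); Read(f, oldRec); Inc(oldPos); end;
    Close(f);
  end;

  procedure Append(const rec: TRec);  { append one record to NewName }
  begin
    Assign(f, NewName);
    if newPos = 0 then Rewrite(f)
    else begin Reset(f); Seek(f, newPos); end;
    Write(f, rec); Inc(newPos);
    Close(f);
  end;

begin
  oldPos := 0; newPos := 0; i := 1;
  NextOld;
  while haveOld or (i <= n) do
    if haveOld and ((i > n) or Less(oldRec, Buf[i])) then
    begin Append(oldRec); NextOld; end
    else
    begin Append(Buf[i]); Inc(i); end;
end;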

Quote
>   Thanks to  WIN9X, most machines come with 64 or 128 megs of  ram. It
>   seems you  could  use Himem to move data from DOS to  XMS,  and sort
>   larger files by using XMS instead of writing to the hard disk.

Yes, for pure DOS, but under Winduffs (or OS/2), which allow the use of XMS in
their DOS boxes, it would result in paging, eliminating the advantage.

Quote
>   Of course, the ideal would be to use flat real mode and do  the sort
>   directly in ram! Default to using Himem only when in V86 mode.

Robert
--
Robert AH Prins
prin...@williscorroon.com


Re:Sorting - Big files


In article <3718310F.1...@csolve.net>, Mike Monett <a...@csolve.net>
writes

Quote
>  The next problem is the number of file handles. This might be a good
>  argument for not using the heap, which limits you to 64k chunks.

Michael - you can do it with four file handles. There is a method called
the "Merging Sort" which uses only four files. This method used to be
used for sorting records held on tape (for which you needed four tape
units) but you can do it with four serial files on one disk. Neither
file size nor memory size is a problem - never more than three records
in memory at any one time.

The method is described in "Elements of Computer Science", 2nd Ed., Glyn
Emery, Pitman, ISBN 0-273-01332-7 (ch. 8, p. 136 in my copy).

If you can't find the reference, get back to me and I'll try to post you
the method.
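
In the meantime, a minimal sketch of the general idea - not the code from the
book. Runs of length 1, 2, 4, ... are merged from one pair of scratch files
onto the other pair until only one run is left, so at most two records are in
memory at any moment. The record type, key and file names are invented for
the example, and there is no error handling:

program Merge4;

type
  TRec  = record
    Key: LongInt;
    { ...other fields... }
  end;
  TFile = file of TRec;

var
  f: array[1..4] of TFile;

function Less(const a, b: TRec): Boolean;
begin
  Less := a.Key < b.Key;
end;

{ Merge one run of at most RunLen records from src1 with one from src2. }
procedure MergeRun(var src1, src2, dst: TFile; RunLen: LongInt);
var
  n1, n2: LongInt;
  r1, r2: TRec;
  have1, have2: Boolean;

  procedure Fetch(var s: TFile; var r: TRec; var have: Boolean;
                  var n: LongInt);
  begin
    have := (n > 0) and not Eof(s);
    if have then begin Read(s, r); Dec(n); end;
  end;

begin
  n1 := RunLen; n2 := RunLen;
  Fetch(src1, r1, have1, n1);
  Fetch(src2, r2, have2, n2);
  while have1 or have2 do
    if have1 and (not have2 or Less(r1, r2)) then
    begin Write(dst, r1); Fetch(src1, r1, have1, n1); end
    else
    begin Write(dst, r2); Fetch(src2, r2, have2, n2); end;
end;

{ Copy the input one record at a time, alternating between f[1] and f[2]. }
procedure Distribute(const Name: string);
var
  src: TFile; r: TRec; d: Integer;
begin
  Assign(src, Name); Reset(src);
  Rewrite(f[1]); Rewrite(f[2]);
  d := 1;
  while not Eof(src) do
  begin
    Read(src, r); Write(f[d], r); d := 3 - d;
  end;
  Close(src); Close(f[1]); Close(f[2]);
end;

{ One pass: merge runs of RunLen records from f[a1], f[a2] alternately
  onto f[b1], f[b2]; the result is the number of runs written. }
function MergePass(a1, a2, b1, b2: Integer; RunLen: LongInt): LongInt;
var
  runs: LongInt; d: Integer;
begin
  Reset(f[a1]); Reset(f[a2]); Rewrite(f[b1]); Rewrite(f[b2]);
  runs := 0; d := b1;
  while not (Eof(f[a1]) and Eof(f[a2])) do
  begin
    MergeRun(f[a1], f[a2], f[d], RunLen);
    Inc(runs);
    if d = b1 then d := b2 else d := b1;
  end;
  Close(f[a1]); Close(f[a2]); Close(f[b1]); Close(f[b2]);
  MergePass := runs;
end;

var
  RunLen, Runs: LongInt;
  src: Integer;
begin
  Assign(f[1], 'work1.tmp'); Assign(f[2], 'work2.tmp');
  Assign(f[3], 'work3.tmp'); Assign(f[4], 'work4.tmp');
  Distribute('data.dat');
  RunLen := 1; src := 1;
  repeat
    if src = 1 then Runs := MergePass(1, 2, 3, 4, RunLen)
    else Runs := MergePass(3, 4, 1, 2, RunLen);
    src := 4 - src;                   { the data now lives on the other pair }
    RunLen := RunLen * 2;
  until Runs <= 1;
  WriteLn('Sorted data is now in work', src, '.tmp');
end.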

--
Marcus Morris - South Croydon, LONDON, UK (Mar...@ntos.demon.co.uk)

Re:Sorting - Big files


Quote
Marcus Morris wrote:
> In article <3718310F.1...@csolve.net>, Mike Monett <a...@csolve.net>
> writes
> >  The next problem is the number of file handles. This might be a good
> >  argument for not using the heap, which limits you to 64k chunks.
> Michael - you can do it with four file handles. There is a method called
> the "Merging Sort" which uses only four files. This method used to be
> used for sorting records held on tape (for which you needed four tape
> units) but you can do it with four serial files on one disk. Neither
> file size nor memory size is a problem - never more than three records
> in memory at any one time.
> The method is described in "Elements of Computer Science", 2nd Ed., Glyn
> Emery, Pitman, 0-273-01332-7 (ch8 p136 in my copy).
> If you can't find the reference, get back to me and I'll try to post you
> the method.
> Marcus Morris - South Croydon, LONDON, UK (Mar...@ntos.demon.co.uk)

Marcus,

Yes, please post. It sounds interesting.

Best Regards,

Michael R. Monett
mailto:a...@csolve.net

Re:Sorting - Big files


Robert AH Prins <prin...@williscorroon.com> wrote:

Quote
> You cannot have an array that is larger than available memory :-)

Physical or virtual memory?  ;-)
k

Re:Sorting - Big files


In article <7gcvbu$sp...@schbbs.mot.com>,

Quote
  m...@here.com wrote:
> Robert AH Prins <prin...@williscorroon.com> wrote:

> > You cannot have an array that is larger than available memory :-)

> Physical or virtual memory?  ;-)

Memory is memory, and virtual memory has its limits just as physical memory
does, and no array can exceed those limits!

Robert
--
Robert AH Prins
prin...@williscorroon.com
