
Large dynamic multi-dimensional array taking up too much memory?


2005-12-22 03:02:34 AM
cppbuilder72
First off this might be difficult to describe.
I've got 250,000 documents that I want to index for faster searching.
Now the indexing I have down, I'm working on the search part.
It works just fine, except for the fact it's taking 1.3GB worth of memory.
I've got a dictionary of words totaling 300,000 based upon the documents.
I've got 9 sources and 23 categories.
I then have 207 files; these files contain lists of word keys and document
keys.
The idea is that I find the word(s) searched for, then based upon source and
category I can find all document IDs associated with those word(s).
So the struct looks like:
struct A_TABLE
{
    A_TABLE *Next;   // next record in this word/source/category list
    long id;         // document id
};
So for the array it would look something like table[300000][9][23]. But
that's only if I know the exact number of words, sources, and categories up
front. All of them can vary, so I have to create the arrays dynamically.
// Allocate the word -> source -> category table of list heads, all NULL to start.
A_TABLE ****MyTable = new A_TABLE***[TotalWords];
for(long x = 0; x < TotalWords; x++)
{
    MyTable[x] = new A_TABLE**[NumSources];
    for(int y = 0; y < NumSources; y++)
    {
        MyTable[x][y] = new A_TABLE*[NumCats];
        for(int z = 0; z < NumCats; z++)
            MyTable[x][y][z] = NULL;
    }
}
So far so good, I think. I have a total of 3 tables of different sizes
that take up 430MBish of RAM. Multiplying the numbers out,
TableSize * NumSources * NumCats * sizeof(A_TABLE*) seems about right.
Now for the part that I think is the problem. Loading the datafiles
into the array.
for(int x = 0; x < NumSources; x++)
{
    for(int y = 0; y < NumCats; y++)
    {
        int iFile;
        AnsiString Filename;
        Filename.sprintf("%s\\Index\\0\\%d\\%d\\0.dat", InstallDir.c_str(), x, y);
        if((iFile = FileOpen(Filename, fmOpenRead)) != -1)
        {
            long id, word_key;
            // Each record on disk is a pair of longs: document id, then word key.
            while(FileRead(iFile, &id, sizeof(id)))
            {
                FileRead(iFile, &word_key, sizeof(word_key));
                // Push a new node onto the front of this word/source/category list.
                A_TABLE *New = new A_TABLE;
                New->id = id;
                New->Next = MyTable[word_key][x][y];
                MyTable[word_key][x][y] = New;
            }
            FileClose(iFile);
        }
    }
}
The datafiles are 582MB in size, but they are composed of sets of 2 longs,
so what I'm loading into memory should only be 291MB. So I think I should
be using around 700MB of RAM total and not 1300MB.
Any ideas?
TIA,
Tom
 
 

Re:Large dynamic multi-dimensional array taking up too much memory?

"Tom" < XXXX@XXXXX.COM >wrote in message
Quote
First off this might be difficult to describe.

<snip>
Quote
The datafiles are 582MB in size. But they are composed of
sets of 2 longs, so what I'm loading in to memory should be
only be 291MB. So, I think I should be using around 700MB
of RAM total and not 1300MB.

I don't follow this reasoning. But just a general comment: each memory
allocation has some overhead (the heap manager has to tag each
allocated block for its own bookkeeping purposes). This can add up to
a lot if you do many small allocations. You can alleviate this by
using a suballocation scheme (allocate a big chunk of memory and
parcel it out to your routines as needed).
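For example, something along these lines (just a rough sketch using the
A_TABLE struct from the original post; the class name and block size are
made up, and it assumes nodes are only ever freed all at once):

#include <vector>

// Fixed-size suballocator: hands out A_TABLE nodes from large blocks
// instead of calling new for every record.  Everything is released
// together when the pool is destroyed.
class NodePool
{
public:
    explicit NodePool(unsigned nodesPerBlock = 65536)
        : BlockSize(nodesPerBlock), Used(nodesPerBlock) {}

    ~NodePool()
    {
        for(unsigned i = 0; i < Blocks.size(); i++)
            delete[] Blocks[i];
    }

    A_TABLE* Allocate()
    {
        if(Used == BlockSize)           // current block exhausted: grab another
        {
            Blocks.push_back(new A_TABLE[BlockSize]);
            Used = 0;
        }
        return &Blocks.back()[Used++];
    }

private:
    std::vector<A_TABLE*> Blocks;       // one entry per big block
    unsigned BlockSize;
    unsigned Used;                      // nodes handed out from the current block
};

In the loading loop you would then replace "new A_TABLE" with something like
Pool.Allocate() on a pool that lives at least as long as the tables do.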
--
Bruce
 

Re:Large dynamic multi-dimensional array taking up too much memory?

Quote
I don't follow this reasoning.
Because I'm only storing 1 long worth of data in memory, but I didn't take
the pointer into account at the time. It's still actually 8 bytes worth of
information, so the memory taken up should still be 582MB worth.
Quote
But just a general comment: each memory allocation has some overhead (the
heap manager has to tag each allocated block for its own bookkeeping
purposes). This can add up to a lot if you do many small allocations
This is where the extra memory usage must be coming from. I'll have to test
it, but with the title indexes it amounts to about 5MB of overhead. With the
body indexes it could be where the additional 200-300MB is.
In either case, I dug out my calculator this morning and started breaking it
down, and if I estimate things right, without the extra overhead I should be
using right around 1GB of memory anyway, which is way too much. I might have
the memory but most people don't.
Here's the broken down figures:
Title Dictionary: 31,220 words
Author Dictionary: 17,214 words
Body Dictionary: 376,370 words
I'm indexing all words from 1 char to 20 chars. It's important that they be
able to
search on smaller words.
Title Index Records: 1,588,232
Author Index Records: 574,024
Body Index Records: 74,107,289
So now doing the math (for just the title):
31,220 words * 9 sources * 23 categories * 4 bytes per pointer = ~24.65MB
Looking at the memory consumption in the Process Explorer it's actually
using 30MB worth of data. I've already taken into account the memory
used for the program by itself. So that's the 5MB worth of overhead.
That's just for the title indexes; the body index pointers alone amount to
297MB of memory.
Altogether, before loading any data the memory usage of the indexes is 420MB
in the process explorer.
So, if I've done things right, each record that gets loaded should take 8
bytes of memory: 4 bytes for the Next pointer and 4 bytes for the long. So
all in all, the memory usage for all loaded records should be about the same
as their file size on disk; together with the pointer tables, that's about
1GB of RAM. If there's also more overhead for each record because it's being
allocated singly, then the additional 300MB could come from there.
All in all, after crunching numbers, 1.3GB worth of memory looks to be
accurate. Very disappointing, because I was searching 250,000 documents
in 3 seconds but the memory usage is just too much.
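For what it's worth, plugging the record counts above into a quick throwaway
check gives numbers in the same ballpark (the 4 bytes of heap overhead per
allocation is only an assumption, not something I've measured):

#include <stdio.h>

int main()
{
    const double MB = 1024.0 * 1024.0;

    // Record counts from the index files, per the figures above.
    long totalRecs = 1588232L + 574024L + 74107289L;

    double tablesMB   = 420.0;                  // pointer tables, per Process Explorer
    double nodesMB    = totalRecs * 8.0 / MB;   // 4-byte Next + 4-byte id per record
    double overheadMB = totalRecs * 4.0 / MB;   // assumed heap overhead per allocation

    printf("nodes %.0fMB + tables %.0fMB + overhead %.0fMB = %.0fMB\n",
           nodesMB, tablesMB, overheadMB, nodesMB + tablesMB + overheadMB);
    return 0;
}

That comes out to roughly 582MB + 420MB + 291MB, which is about the 1.3GB
I'm seeing.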
I'm now using an alternate method that basically scans the index files as
needed; searching the body of the documents that way yields roughly 1m 45s
per search word. This is still a huge improvement over the original method,
which didn't require the extra space for the index but was quite slow: an
exact search for something like "dog" would take 20min.
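For reference, the on-demand scan is essentially the same read loop as the
loading code, just collecting the document ids for one word key instead of
building the whole table in memory. Something like this (a rough sketch; the
function name and the std::vector result are only illustrative):

#include <vector>

// Collect every document id recorded against one word key in one index file.
// Same (id, word_key) pair layout as the loading loop above.
void ScanIndexFile(const AnsiString &Filename, long WantedWordKey,
                   std::vector<long> &DocIds)
{
    int iFile = FileOpen(Filename, fmOpenRead);
    if(iFile == -1)
        return;

    long id, word_key;
    while(FileRead(iFile, &id, sizeof(id)))
    {
        FileRead(iFile, &word_key, sizeof(word_key));
        if(word_key == WantedWordKey)
            DocIds.push_back(id);       // matching document id
    }
    FileClose(iFile);
}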
-Tom
 


Re:Large dynamic multi-dimensional array taking up too much memory?

"Tom" < XXXX@XXXXX.COM >writes:
Quote
Looking at the memory consumption in the Process Explorer it's actually
using 30MB worth of data. I've already taken into account the memory
used for the program by itself. So that's the 5MB worth of overhead.
Looking at the memory usage according to the OS doesn't give a good
picture of what your application is doing, because your application is
going through a memory manager that talks to the OS. If your
application asks for X bytes from the memory manager, it may in turn
ask for X+N bytes from the OS.
It is not uncommon to "over-allocate" memory to help reduce the overhead,
but this may make your application appear to the OS to be using more than
it actually is (from your code's perspective).
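For example, a rough way to see what your own code is actually requesting
is to replace the global operator new and count the bytes (sketch only: it
is not thread-safe, and it still ignores the heap's own per-block overhead):

#include <new>
#include <cstdio>
#include <cstdlib>

static unsigned long BytesRequested = 0;

void *operator new(std::size_t size) throw(std::bad_alloc)
{
    BytesRequested += size;             // what the code asked for, not what the heap uses
    void *p = std::malloc(size ? size : 1);
    if(!p)
        throw std::bad_alloc();
    return p;
}

void operator delete(void *p) throw()
{
    std::free(p);
}

// Compare this against what Process Explorer reports for the process:
//   std::printf("requested so far: %lu bytes\n", BytesRequested);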
--
Chris (TeamB);