Recipe 4.4 Implementing a Sparse Array
4.4.1 Problem
An array with large, unoccupied expanses
between occupied elements wastes memory. How do you reduce that
overhead?
4.4.2 Solution
Use a hash instead of an array.
4.4.3 Discussion
If you assign to the millionth element of an array, Perl allocates a
million and one slots to store scalars. Only the last element
contains interesting data, leaving earlier ones each set to
undef at a cost of four (or more) bytes per
unoccupied slot.
In recent versions of Perl, if you grow an array by assigning either
past the end or directly to $#ARRAY, you can
distinguish these implicit undefs from those that
would result from assigning undef there by using
exists instead of defined, just
as you would with a hash.
$#foo = 5;
@bar = ( (undef) x 5 ) ;
printf "foo element 3 is%s defined\n",
defined $foo[3] ? "" : "n't";
printf "foo element 3 does%s exist\n",
exists $foo[3] ? "" : "n't";
printf "bar element 3 is%s defined\n",
defined $bar[3] ? "" : "n't";
printf "bar element 3 does%s exist\n",
exists $bar[3] ? "" : "n't";
foo element 3 isn't defined
foo element 3 doesn't exist
bar element 3 isn't defined
bar element 3 does exist
However, you still waste a lot of space. That's because Perl's array
implementation reserves a contiguous vector, one for each element up
to the highest occupied position.
$real_array[ 1_000_000 ] = 1; # costs 4+ megabytes
A hash works differently: you pay only for what you really use, not
for unoccupied positions. Although a hash element costs somewhat more
than an array element because you need to store both the value and
its key, with sparse arrays, the savings can be astonishing.
$fake_array{ 1_000_000 } = 1; # costs 28 bytes
What's the trade-off? Because a hash's keys aren't ordered, a little
more work is needed to sort the numeric keys so you can handle their
values in the same order as you would if they were stored as a real
array. With an array, you'd just do this to process elements in index
order:
foreach $element ( @real_array ) {
# do something with $element
}
or this to process indices in ascending order:
foreach $idx ( 0 .. $#real_array ) {
# do something with $real_array[$idx]
}
Using a hash representation, you should instead do either this to
process elements in index order:
foreach $element ( @fake_array{ sort {$a <=> $b} keys %fake_array } ) {
# do something with $element
}
or this to process indices in ascending order:
foreach $idx ( sort {$a <=> $b} keys %fake_array ) {
# do something with $fake_array{$idx}
}
If you don't care about handling elements in a particular order,
however, you don't need to go through all that. Just process the
values according to their internal order, either like this:
foreach $element ( values %fake_array ) {
# do something with $element
}
or like this:
# process indices in internal hash order
foreach $idx ( keys %fake_array ) {
# do something with $fake_array{$idx}
}
If you're determined to use an array, two fairly specialized cases
occasionally arise in which you can save substantial amounts of
memory by using an alternate storage scheme. Both cases also apply to
arrays that are densely populated, not just those that are mostly
empty.
The first case shows up when you grow an array by repeatedly
appending new elements until its subscripts become large. Because of
how Perl reallocates memory for growing arrays, this can use up to
four times the memory you really need. If you happen to know how big
the array will (or might) eventually become, you can avoid this
reallocation overhead either by storing the large subscripts first
instead of the small ones:
for ($i = 10_000; $i >= 0; $i--) { $real_array[$i] = 1 }
or by presizing the array by assigning to the special
$#ARRAY notation:
$#real_array = 10_000;
The second special case comes up when each array element holds
nothing but a single one-bit value—essentially either a true or
a false. For example, suppose you are keeping track of numbered
USENET news articles, and you only need to know whether a given
article number has been read. For situations like this, use a bit
vector instead of a real array:
my $have_read = '';
for ($i = 10_000; $i >= 0; $i--) { vec($have_read, $i, 1) = 1 }
Then you can check to see whether a given article has been read this
way:
if (vec($have_read, $artno, 1)) { .... }
4.4.4 See Also
The vec function in
perlfunc(1) and in Chapter 29 of
Programming Perl
|