index - create index using a fast access method
SYNOPSIS
index -m[sr] [-fc] [-Dn] table
-m[bi] [-fc] [-Dn] [-x] [-n] table keycolumn ...
-mh [-fc] [-Dn] [-i] [[+-]x] table keycolumn ...
DESCRIPTION
index
sets up a way for the
search
program to find a row in a table or list file
using one of the fast access methods.
Exactly what
index
does depends upon which fast access method is employed.
For some methods
index
builds a secondary
index table.
It is possible to
index
a single file for all five methods,
and some of the methods can index all of the columns.
These fast access methods are
static.
If the table or list is changed it may have to be re-indexed.
Application programs should make sure
that secondary indexes are updated to reflect changes made to
their associated table or list files.
Indexing all of the columns separately
in a table can be done
by linking the table to as many different names
as there are columns to index.
Then each column can be indexed on one of the links.
OPTIONS
- -fc
- c
(not TAB) is the input column separation character.
- -Dn
- turn on debugging at level n, or 1 if n
is not specified.
- -ms
- sequential method.
index
does nothing in this case because
search
simply scans through the whole file,
first row to last.
- -mr
- record method.
index
builds a secondary file named
tableorlist.r,
which contains a fixed length column of row offsets.
- -mb
- binary method.
index
sorts
tableorlist
by the specified
keycolumns.
By default,
keycolumns
indexed with the binary method
are sorted without regard to leading spaces
(-b
sort
option).
Other options applicable to both binary and inverted
methods of indexing and searching modify the way in
which the
tableorlist
is sorted, and are as follows:
- -x
- this option allows for searching
whole or partial keys without regard to case,
by applying the
-f
option to the sort command.
- -n
- this option sorts the
keycolumns
numerically instead of alphabetically,
for subsequent numeric searches: for example,
001
is the same as
1
is the same as
1.0.
When one of these options is used on the
index
command line, it must also be used on the
search
command line.
- -mi
- inverted method.
This method is identical to
binary
except that
index
builds a secondary file of keys and offsets
and sorts the keys in the secondary file rather than
the table itself.
The secondary file is named
tableorlist.i.
By default,
index
-mi
strips leading, trailing and multiple blanks from the keys
before adding them to the secondary table so that
`Mr. Goober' is sorted in the same way as ` Mr. Goober '.
The inverted method can get away with altering the
keycolumns
to provide a greater degree of conformity for search purposes
because it's not actually touching the
tableorlist
itself, as in the binary method,
which must rely solely on the final
sort
command to manipulate the
keycolumns.
See
-mb
above for an explanation of the other
index
options available with
-mi.
- -mh
- hash method.
index
builds a secondary file, named
tableorlist.h,
and applies a simple algorithm to the keys in each row of
tableorlist
to produce a number which is within the bounds of the number of
rows in the secondary file (called
hashing).
index
writes the offset of the
tableorlist
row at the row number in the secondary file.
If there's already an offset at that position,
index
probes each subsequent row in the index until
it finds an empty slot at which to dump its offset.
Because duplicate keys hash to the same row number,
using the hash index and search method for
tables which contain many identical or similar
keys (lists of numbers, for example)
can result in a slow sequential search rather than a
fast hash search.
The efficiency of hashing relies on the
degree of uniqueness of keys
and thus is very good for indexing data which varies widely.
One way to determine how well-suited the hash method is
for a particular table or key is to examine the resulting
tablorlist.h
file.
An excellent result is one in which very few non-zero
offsets appear in sequential rows; a not-so-good result is
one in which big clumps of non-zero offsets appear
in sequential rows.
Because the row number is based on the content of the key
column, the way a key will be searched
defines the way it should be indexed.
This also means that the
(-x, +x, -i)
switches used on the
index
command line
must also be used on the
search
command line, so that the key being searched can be
processed, hashed and compared in the same way as
the key it is seeking.
By default,
index
-mh
strips keys of leading, trailing and multiple internal blanks
prior to hashing.
This allows one to find keys which are word-for-word identical
even though they may not be byte-for-byte matches.
Also, blank key values are all assigned a null value in
the hash table so that they can be found regardless of their
(invisible) content. Options which alter the way hash tables
are created are:
- +x
- this option prohibits changing the key as it appears
in the table prior to hashing, forcing an exact match requirement
of subsequent searches, extraneous blanks and all.
- -x
- this option hashes an all lower case version of the key
in addition to trimming it as described above. When keys are
searched also using
-x,
the result is case insensitivity: in other words, ` Mr. gOOber '
matches `Mr. Goober.'
+x
and
-x
are mutually exclusive options: whichever one appears first
on the command line is the one which determines the indexing
behavior.
- -i
- this option prohibits
index
from adding offsets of rows which contain blank keys to the
index table.
This is a good idea if there are many blank keys in the table
because they will all hash to the same value, creating
a long list of sequential offsets in the hash table.
The side effect of
-i
is that search will not be able to find the blank keys.
If this is a problem, a different method should be employed
to index the table or to locate blank keys.
- -n
- this option hashes the
string representation of
keycolumns
after applying the
atof()
function to them.
Thus,
keycolumns
which contain any
non-numeric notation will evaluate to zero,
or be considered blank.
Care should be taken when using this option,
especially if the
-i
option is also being used.
In general, the binary and inverted methods
are better choices for numeric keys.
EXAMPLES
$ cat chart
Account Name
------- ----
100 Assets
101 Cash
111 Accounts Receivable
115 Notes Receivable
120 Deposits
Access to the data in
chart
can be sped up by indexing it and using the
search
command to find the row you want.
To index
chart
for the
inverted
search method:
$ index -mi chart Account
$ cat chart.i
Offset Account
--------- -------
26 100
37 101
46 111
70 115
91 120
SEE ALSO