Last time we described the basics of the file format, and wrote some sample code to generate the file. This time we are going to read the file. The process consists of skipping to the end to read the number of entries, then jumping back and reading the entire index section into memory. Once that is done, we are ready to perform lookups, by binary searching the index section for the desired key, then performing one seek and read to load the corresponding data record. The interesting parts follow:
void load_index(std::vector<index_record>& index, std::istream& stream)
{
unsigned int size;
stream.seekg(0 - sizeof(unsigned int), std::ios_base::end);
stream.read(reinterpret_cast<char*>(&size), sizeof(size));
index.resize(size);
stream.seekg(0 - sizeof(unsigned int) - sizeof(index_record) * size, std::ios_base::end);
stream.read(reinterpret_cast<char*>(&index[0]), sizeof(index_record) * size);
if (!stream.good())
throw format_error("stream is bad");
}
bool lookup(unsigned int key, const std::vector<index_record>& index, std::istream& stream, data_record* pRecord)
{
index_record rec;
rec.key = key;
std::vector<index_record>::const_iterator pos = std::lower_bound(index.begin(), index.end(), rec, index_record_order);
if (pos == index.end() || (*pos).key != key)
return false;
stream.seekg((*pos).offset);
stream.read(reinterpret_cast<char*>(pRecord), (*pos).length);
return true;
}
Usage is very straightforward:
std::ifstream file("file.dat", std::ios_base::in | std::ios_base::binary);
std::vector<index_record> index;
load_index(index, file);
data_record record;
if (lookup(key, index, file, &record))
{
// process the record with 'key'...
}
else
{
// no such record...
}
The STL function lower_bound implements the binary search algorithm, and is used by lookup() to search the index. Don't be fooled by the simplicity of binary search - it is extremely efficient. 4 billion records can be searched with 32 comparisons.
Next time we will look a some possible improvements, including refactoring the sample functions into a nice design, compression, caching, and a few other ideas as well.
-randy
Comments
You can follow this conversation by subscribing to the comment feed for this post.