Selecting a good file format can make a huge difference in a system. Certainly file formats focused on sharing and standards, such as XML and JSON, get lots of attention, but there is equally important work at the other end of the spectrum - private data files. The format outlined here is designed for internal use, and is particularly well suited to sequential builds with random-access reads. It also scales well to large data sets, limited only by the ability to keep the record index in memory.
The basic idea is as follows:
Header (version, etc.)
Data record 1
Data record 2
...
Data record n
Index record 1
Index record 2
...
Index record n
Index count
The data records can be any format you like, fixed or variable length - it really doesn't matter. The index records need to be of fixed size, and should be padded to match the native alignment of your chosen CPU architecture. Each index record needs to include the offset and size of the corresponding data record, and the key. Something like this:
struct IndexVariableRecord
{
    UInt32 key;
    UInt32 dataLength;
    UInt64 dataOffset;
};
The types you choose should obviously be selected to match the type of data and the overall size of the collection. Here I am assuming variable-length records, a 32-bit unsigned integer key, a possibly huge file (hence the 64-bit offsets), and a maximum individual record size of 4GB. In most cases one probably needs more bits for the key, and fewer for capacity. A sample index structure for fixed-size data records could be:
struct IndexFixedRecord
{
    UInt32 key;
    UInt32 dataIndex;
};
Here we are using the same 32-bit key, but are now storing an index into the record portion of the file rather than an offset and length. In this case the offset can be computed as "sizeof(FileHeader) + sizeof(DataRecord) * indexRecord.dataIndex", and the length is simply sizeof(DataRecord).
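As a rough sketch of that computation (FileHeader and DataRecord here are placeholders for whatever header and fixed-size record structures your file actually uses):

// illustrative only: compute where a fixed-size record lives in the file
unsigned long long fixed_record_offset(const IndexFixedRecord& indexRecord)
{
    // widen before multiplying so a large index cannot overflow 32 bits
    return sizeof(FileHeader) +
        sizeof(DataRecord) * (unsigned long long)indexRecord.dataIndex;
}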
In order to allow efficient access, the index records should be sorted by key. To keep the build process simple, it is not necessary for the index and data records to be in the same order - this allows one to easily generate very large data files quickly.
Generation of such a file is a fairly simple process. It involves building the index section in memory, while writing the data section to disk. When finished, the entire index section can be sorted, and then written to disk. Something like this in C++:
#include &lt;algorithm&gt;
#include &lt;fstream&gt;
#include &lt;limits&gt;
#include &lt;ostream&gt;
#include &lt;stdexcept&gt;
#include &lt;vector&gt;

// maximum valid size for the variable portion of a data_record
const unsigned int max_data_record_variable_size = 64 * 1024;

#pragma pack(push, 1)
struct data_record
{
    unsigned int fixed_data;
    // more fixed fields...
    unsigned char variable_data[max_data_record_variable_size];
    // this portion can be any length
};

struct index_record
{
    unsigned int key;
    unsigned int length;
    unsigned long long offset;
};
#pragma pack(pop)

// write a data record in binary format
// (in this sample, fixed_data also holds the number of valid bytes in variable_data)
void write_data_record(std::ostream& out, const data_record& record)
{
    out.write(reinterpret_cast<const char*>(&record.fixed_data), sizeof(record.fixed_data));
    out.write(reinterpret_cast<const char*>(&record.variable_data[0]), record.fixed_data);
}

// extract the key used to index a data_record
unsigned int get_key(const data_record& record)
{
    return record.fixed_data;
}

// sort predicate: order index records by ascending key
bool index_record_order(const index_record& a, const index_record& b)
{
    return a.key < b.key;
}

// write a simple binary file header, consisting of a signature and version
void write_header(std::ostream& stream)
{
    static const unsigned int signature = 0x52544b46;
    static const unsigned int version = 0x00000009;
    stream.write(reinterpret_cast<const char*>(&signature), sizeof(signature));
    stream.write(reinterpret_cast<const char*>(&version), sizeof(version));
}

template<class Iterator>
void write_data_file(const char* filename, Iterator begin, Iterator end)
{
    std::ofstream datafile(filename, std::ios_base::out | std::ios_base::binary);
    std::vector<index_record> index;

    write_header(datafile);

    for (; begin != end; ++begin)
    {
        const data_record& record = *begin;

        std::ofstream::pos_type beginPos = datafile.tellp();
        write_data_record(datafile, record);
        std::ofstream::pos_type endPos = datafile.tellp();

        if ((unsigned long long)std::numeric_limits<unsigned int>::max() < endPos - beginPos)
            throw std::overflow_error("single record was too long");

        index_record indexRecord;
        indexRecord.key = get_key(record);
        indexRecord.offset = beginPos;
        indexRecord.length = static_cast<unsigned int>(endPos - beginPos);
        index.push_back(indexRecord);
    }

    std::sort(index.begin(), index.end(), index_record_order);

    unsigned int size = static_cast<unsigned int>(index.size());
    if (size != 0)
        datafile.write(reinterpret_cast<const char*>(&index[0]), sizeof(index_record) * size);
    datafile.write(reinterpret_cast<const char*>(&size), sizeof(size));
}
Calling this would look something like:
std::vector<data_record> dataRecords;
// get records...
write_data_file("file.dat", dataRecords.begin(), dataRecords.end());
Any input iterator will work; it doesn't have to be a vector. This means you can stream records from another file directly into the builder, keeping only the index in memory:
std::ifstream input("records.txt");
std::istream_iterator<data_record> itor(input);
std::istream_iterator<data_record> end;
write_data_file("file.dat", itor, end);
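Note that std::istream_iterator (from &lt;iterator&gt;) relies on an operator>> being defined for data_record, which the listing above doesn't include. A minimal sketch, assuming each line of records.txt is simply a record's text payload (so fixed_data, which doubles as the key in this sample, becomes the payload length), and assuming &lt;string&gt; is also included:

// illustrative only: read one record per line of text
// fixed_data is set to the payload length, matching how
// write_data_record and get_key use it in the sample above
std::istream& operator>>(std::istream& in, data_record& record)
{
    std::string line;
    if (std::getline(in, line))
    {
        if (line.size() > max_data_record_variable_size)
            line.resize(max_data_record_variable_size);
        record.fixed_data = static_cast<unsigned int>(line.size());
        std::copy(line.begin(), line.end(), record.variable_data);
    }
    return in;
}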
Although there is quite a bit of code here, if we break it down it is pretty easy to follow:
- Structure definitions for the data record and index record. Note the use of #pragma pack to ensure the in-memory layout of the structures is exactly as desired.
- write_data_record: a very simple function to serialize a data_record to an std::ostream in a binary format.
- get_key: used to extract the key from a data_record when building the index.
- index_record_order: used to sort the index records, so we can use binary search later (see the short lookup sketch after this list).
- write_header: by including a signature and a version number, we can more easily tell during read if a file is valid or not.
- write_data_file: the heart of the sample; this takes the filename for the data file and a pair of iterators over data_record objects:
- Create the output file using std::ofstream.
- Declare a std::vector to build the index in memory.
- Write the file header.
- Enumerate each data record:
- Store the current file position.
- Write the data record, using write_data_record.
- Compare the new file position with the stored position, and ensure that the difference is small enough to fit in the length member of index_record.
- Construct a new index_record, using the key extractor (get_key), the offset, and computed record length.
- Add the index_record to the in-memory collection.
- Once all of the data records have been processed, sort the index collection using the predicate index_record_order, which orders by ascending key.
- Write the index section to disk.
- Write the count of index records to disk.
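The point of sorting the index is that a reader can then binary search it. The loader class is the subject of the next post, but as a minimal sketch (find_record is a hypothetical helper, assuming the index section has already been read back into a std::vector of index_record):

// illustrative only: locate the index entry for a key in the sorted, in-memory index
// returns a pointer to the matching entry, or 0 if the key is not present
const index_record* find_record(const std::vector<index_record>& index, unsigned int key)
{
    index_record probe = { key, 0, 0 };
    std::vector<index_record>::const_iterator it =
        std::lower_bound(index.begin(), index.end(), probe, index_record_order);
    if (it != index.end() && it->key == key)
        return &*it;
    return 0;
}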
Next time we will implement a simple class to load and access this data file.
-randy