Using boost::iostreams::mapped_file_source with std::multimap
Multi maps aren't laid out sequentially in memory. (They're node-based containers, but I digress). In fact, even if they were, chances would be slim that the layout would match that of the text input.
There's basically two ways you can make this work:
- Keep using the
multimap
but use a custom allocator (so that all allocations are done in the mapped memory region). This is the "nicest" from a high-level C++ viewpoint, /but/ you will need to change to a binary format of your file.
If you can, this is what I'd suggest. Boost Container + Boost Interprocess have everything you need to make this relatively painless.
You write a custom container "abstraction" that works directly on the mapped data. You could either
- recognize a "xxxx yyyy" pair from anywhere (line ends?) or
- build an index of all line starts in the file.
Using these you can devise an interator (Boost Iterator iterator_facade
) that you can use to implement higher level operations (lower_bound
, upper_bound
and equal_range
).
Once you have these, you're basically all set to query this memory map as a readonly key-value database.
Sadly, this kind of memory representation would be extremely bad for performance if you also want to support mutating operations (insert
, remove
).
If you have an actual sample of the file, I could do a demonstration of either of the approaches described.
Update
Quick Samples:
With boost::interprocess you can (very) simply define the multimap you desire:
namespace shared {
namespace bc = boost::container;
template <typename T> using allocator = bip::allocator<T, bip::managed_mapped_file::segment_manager>;
template <typename K, typename V>
using multi_map = bc::flat_multimap<
K, V, std::less<K>,
allocator<typename bc::flat_multimap<K, V>::value_type> >;
}Notes:
I chose
flatmap
(flat_multimap
, actually) because it is likely more
storage efficient, and is much more comparable to the second approach
(given below);Note that this choice affects iterator/reference stability and will
favours read-only operations pretty heavily. If you need iterator
stability and/or many mutating operations, use a regularmap
(or for
very high volumes ahash_map
) instead of the flat variations.I chose a
managed_mapped_file
segment for this demonstration (so you get persistence). The demo shows how 10G is sparsely pre-allocated, but only the space actually allocated is used on disk. You could equally well use amanaged_shared_memory
.If you have binary persistence, you might discard the text datafile altogether.
I parse the text data into a
shared::multi_map<double, unsigned>
from amapped_file_source
using Boost Spirit. The implementation is fully generic.There is no need to write
iterator
classes,start_of_line()
,end_of_line()
,lower_bound()
,upper_bound()
,equal_range()
or any of those, since they're already standard in themulti_map
interface, so all we need to is writemain
:
Live On Coliru
#define NDEBUG
#undef DEBUG
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/fusion/adapted/std_pair.hpp>
#include <boost/container/flat_map.hpp>
#include <boost/interprocess/managed_mapped_file.hpp>
#include <boost/spirit/include/qi.hpp>
#include <iomanip>
namespace bip = boost::interprocess;
namespace qi = boost::spirit::qi;
namespace shared {
namespace bc = boost::container;
template <typename T> using allocator = bip::allocator<T, bip::managed_mapped_file::segment_manager>;
template <typename K, typename V>
using multi_map = bc::flat_multimap<
K, V, std::less<K>,
allocator<typename bc::flat_multimap<K, V>::value_type> >;
}
#include <iostream>
bip::managed_mapped_file msm(bip::open_or_create, "lookup.bin", 10ul<<30);
template <typename K, typename V>
shared::multi_map<K,V>& get_or_load(const char* fname) {
using Map = shared::multi_map<K, V>;
Map* lookup = msm.find_or_construct<Map>("lookup")(msm.get_segment_manager());
if (lookup->empty()) {
// only read input file if not already loaded
boost::iostreams::mapped_file_source input(fname);
auto f(input.data()), l(f + input.size());
bool ok = qi::phrase_parse(f, l,
(qi::auto_ >> qi::auto_) % qi::eol >> *qi::eol,
qi::blank, *lookup);
if (!ok || (f!=l))
throw std::runtime_error("Error during parsing at position #" + std::to_string(f - input.data()));
}
return *lookup;
}
int main() {
// parse text file into shared memory binary representation
auto const& lookup = get_or_load<double, unsigned int>("input.txt");
auto const e = lookup.end();
for(auto&& line : lookup)
{
std::cout << line.first << "\t" << line.second << "\n";
auto er = lookup.equal_range(line.first);
if (er.first != e) std::cout << " lower: " << er.first->first << "\t" << er.first->second << "\n";
if (er.second != e) std::cout << " upper: " << er.second->first << "\t" << er.second->second << "\n";
}
}I implemented it exactly as I described:
simple container over the raw
const char*
region mapped;using
boost::iterator_facade
to make an iterator that parses the text on dereference;for printing the input lines I use
boost::string_ref
- which avoids dynamic allocations for copying strings.parsing is done with Spirit Qi:
if (!qi::phrase_parse(
b, _data.end,
qi::auto_ >> qi::auto_ >> qi::eoi,
qi::space,
_data.key, _data.value))Qi was chosen for speed and genericity: you can choose the
Key
andValue
types at instantiation time:text_multi_lookup<double, unsigned int> tml(map.data(), map.data() + map.size());
I've implemented
lower_bound
,upper_bound
andequal_range
member functions that take advantage of underlying contiguous storage. Even though the "line"iterator
is not random-access but bidirectional, we can still jump to themid_point
of such an iterator range because we can get thestart_of_line
from anyconst char*
into the underlying mapped region. This make binary searching efficient.
Note that this solution parses lines on dereference of the
iterator
. This might not be efficient if the same lines are dereferenced a lot of times.But, for infrequent lookups, or lookups that are not typical in the same region of the input data, this is about as efficient as it can possibly get (doing only minimum required parsing and
O(log n)
binary searching), all the while completely bypassing the initial load time by mapping the file instead (no access means nothing needs to be loaded).Live On Coliru (including test data)
#define NDEBUG
#undef DEBUG
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/utility/string_ref.hpp>
#include <boost/optional.hpp>
#include <boost/spirit/include/qi.hpp>
#include <thread>
#include <iomanip>
namespace io = boost::iostreams;
namespace qi = boost::spirit::qi;
template <typename Key, typename Value>
struct text_multi_lookup {
text_multi_lookup(char const* begin, char const* end)
: _map_begin(begin),
_map_end(end)
{
}
private:
friend struct iterator;
enum : char { nl = '\n' };
using rawit = char const*;
rawit _map_begin, _map_end;
rawit start_of_line(rawit it) const {
while (it > _map_begin) if (*--it == nl) return it+1;
assert(it == _map_begin);
return it;
}
rawit end_of_line(rawit it) const {
while (it < _map_end) if (*it++ == nl) return it;
assert(it == _map_end);
return it;
}
public:
struct value_type final {
rawit beg, end;
Key key;
Value value;
boost::string_ref str() const { return { beg, size_t(end-beg) }; }
};
struct iterator : boost::iterator_facade<iterator, boost::string_ref, boost::bidirectional_traversal_tag, value_type> {
iterator(text_multi_lookup const& d, rawit it) : _region(&d), _data { it, nullptr, Key{}, Value{} } {
assert(_data.beg == _region->start_of_line(_data.beg));
}
private:
friend text_multi_lookup;
text_multi_lookup const* _region;
value_type mutable _data;
void ensure_parsed() const {
if (!_data.end)
{
assert(_data.beg == _region->start_of_line(_data.beg));
auto b = _data.beg;
_data.end = _region->end_of_line(_data.beg);
if (!qi::phrase_parse(
b, _data.end,
qi::auto_ >> qi::auto_ >> qi::eoi,
qi::space,
_data.key, _data.value))
{
std::cerr << "Problem in: " << std::string(_data.beg, _data.end)
<< "at: " << std::setw(_data.end-_data.beg) << std::right << std::string(_data.beg,_data.end);
assert(false);
}
}
}
static iterator mid_point(iterator const& a, iterator const& b) {
assert(a._region == b._region);
return { *a._region, a._region->start_of_line(a._data.beg + (b._data.beg -a._data.beg)/2) };
}
public:
value_type const& dereference() const {
ensure_parsed();
return _data;
}
bool equal(iterator const& o) const {
return (_region == o._region) && (_data.beg == o._data.beg);
}
void increment() {
_data = { _region->end_of_line(_data.beg), nullptr, Key{}, Value{} };
assert(_data.beg == _region->start_of_line(_data.beg));
}
};
using const_iterator = iterator;
const_iterator begin() const { return { *this, _map_begin }; }
const_iterator end() const { return { *this, _map_end }; }
const_iterator cbegin() const { return { *this, _map_begin }; }
const_iterator cend() const { return { *this, _map_end }; }
template <typename CompatibleKey>
const_iterator lower_bound(CompatibleKey const& key) const {
auto f(begin()), l(end());
while (f!=l) {
auto m = iterator::mid_point(f,l);
if (m->key < key) {
f = m;
++f;
}
else {
l = m;
}
}
return f;
}
template <typename CompatibleKey>
const_iterator upper_bound(CompatibleKey const& key) const {
return upper_bound(key, begin());
}
private:
template <typename CompatibleKey>
const_iterator upper_bound(CompatibleKey const& key, const_iterator f) const {
auto l(end());
while (f!=l) {
auto m = iterator::mid_point(f,l);
if (key < m->key) {
l = m;
}
else {
f = m;
++f;
}
}
return f;
}
public:
template <typename CompatibleKey>
std::pair<const_iterator, const_iterator> equal_range(CompatibleKey const& key) const {
auto lb = lower_bound(key);
return { lb, upper_bound(key, lb) };
}
};
#include <iostream>
int main() {
io::mapped_file_source map("input.txt");
text_multi_lookup<double, unsigned int> tml(map.data(), map.data() + map.size());
auto const e = tml.end();
for(auto&& line : tml)
{
std::cout << line.str();
auto er = tml.equal_range(line.key);
if (er.first != e) std::cout << " lower: " << er.first->str();
if (er.second != e) std::cout << " upper: " << er.second->str();
}
}
For the curious: here's the disassembly. Note how all the algorithmic stuff is inlined right into main
: http://paste.ubuntu.com/9946135/
Implicitly-declared boost::iostreams::mapped_file_source is deprecated
As you can se here: https://en.cppreference.com/w/cpp/language/copy_assignment
The generation of the implicitly-defined copy assignment operator is
deprecated(since C++11) if T has a user-declared destructor or
user-declared copy constructor.
Which is the case in boost 1.72, as you can see:
// Copy Constructor
mapped_file_source(const mapped_file_source& other);
There is this copy constructor on line 187 of boost\iostreams\device\mapped_file.hpp
Efficiently read a portion of text in C++
You don't even need to map a subsection of the file, because mmap just virtually maps memory blocks. Actual pages are only loaded on demand, so you could map the full 12GiB of a file even if you have only, say, 4GiB of physical RAM (not even requiring swap).
If your file is text-bases, you will want to find the start-of-line from a random location in the file.
An example of something similar is in the second approach here: Using boost::iostreams::mapped_file_source with std::multimap
Most efficient way to parse every fourth line from a very large file
This is the most efficient solution that I could come up with that is platform independent. I thought about all of the users, and for now am running under the assumption that everyone has a 64bit machine if they are using 4+ GiB file sizes. If this changes, I will have to modify the class to handle "blocks" of data into separate mmap
regions.
#include <string>
#include <boost/iostreams/device/mapped_file.hpp>
//////////////////////////////////////////////////////////////////////////////////////////
/// @class LineParser
///
/// @brief Class to efficiently parse a file and take the second line out of every 4 lines
///
/// This class uses memory-mapped io to efficiently extract and return sequences from a
/// file
//////////////////////////////////////////////////////////////////////////////////////////
class LineParser
{
private:
boost::iostreams::mapped_file mmap; ///< Object for memory mapped file
const char* curr; ///< Current position of the file
const char* end; ///< End position of the file
public:
//////////////////////////////////////////////////////////////////////////////////////
/// @fn valid
///
/// Indicates whether the parser is in a valid state or not
///
/// @return Boolean indicating if the parser is open and in a valid state
///
/// @note Declared inline as it is acceptable in my situation because of being called
/// many times in only a few spots
//////////////////////////////////////////////////////////////////////////////////////
inline bool valid(void)
{
return (curr && end && (curr < end) && (mmap.is_open()));
}
//////////////////////////////////////////////////////////////////////////////////////
/// @fn peek
///
/// Obtains the next sequence string - if it exists - but maintains parsers state
///
/// @return Next sequence available in the file. Emptry string returned if none
/// exist
///
/// @note Declared inline as it is acceptable in my situation because of being called
/// many times in only a few spots
//////////////////////////////////////////////////////////////////////////////////////
inline std::string peek(void)
{
const char* save = curr;
std::string ret;
getRead(ret);
curr = save;
return ret;
}
//////////////////////////////////////////////////////////////////////////////////////
/// @fn getRead
///
/// Sets container to the current read being processed
///
/// @param container String container to place current sequence into
///
/// @return Boolean indicating if getting a new read was successful
///
/// @note Declared inline as it is acceptable in my situation because of being called
/// many times in only a few spots
//////////////////////////////////////////////////////////////////////////////////////
inline bool getRead(std::string& container)
{
if (valid() == false)
{
return false;
}
curr = static_cast<const char*>(memchr(curr, '\n', end-curr)) + 1;
const char* index = static_cast<const char*>(memchr(curr, '\n', end-curr));
container = std::string(curr, index - curr);
curr = index + 1;
curr = static_cast<const char*>(memchr(curr, '\n', end-curr)) + 1;
curr = static_cast<const char*>(memchr(curr, '\n', end-curr)) + 1;
return true;
}
//////////////////////////////////////////////////////////////////////////////////////
/// @fn LineParser
///
/// Constructor to initialize memory mapped file and set index values
//////////////////////////////////////////////////////////////////////////////////////
LineParser(const std::string& filepath)
: mmap(filepath, boost::iostreams::mapped_file::readonly)
{
if (mmap.is_open())
{
curr = mmap.const_data();
end = curr + mmap.size();
}
}
//////////////////////////////////////////////////////////////////////////////////////
/// @fn ~LineParser
///
/// Default destructor
//////////////////////////////////////////////////////////////////////////////////////
~LineParser(void) = default;
};
Please note that this class is not perfect and might not handle deviations in file format very well, but under the assumption of perfect file format it works nicely.
Related Topics
How to Select a Random Element in Std::Set
Std::Vector of Std::Vectors Contiguity
Setjmp/Longjmp: Why Is This Throwing a Segfault
Maximum Stack Size for C/C+ Program
C++ - <Unresolved Overloaded Function Type>
Assembly Adc (Add with Carry) to C++
Is There C/C++ Equivalent of Eval("Function(Arg1, Arg2)")
Is It Bad Practice to Allocate Memory in a Dll and Give a Pointer to It to a Client App
How to Declare Variables of Different Types in the Initialization of a for Loop
Write and Read String to Binary File C++
Differencebetween C-Like Casting and Functional Casting
Boost.Python Not Supporting Parallelism
Capturing a Time in Milliseconds
How to Use Boost Preprocessor to Generate Accessors
Structured Binding with [[Maybe_Unused]]
Should I Inherit from Std::Exception
What Is the Most Efficient Way to Display Decoded Video Frames in Qt