Boost::Multi_Array Performance Question

Are you building release or debug?

If you are running a debug build, the Boost array can be really slow, because its template layers aren't inlined and every element access pays function-call overhead. I'm not sure exactly how multi_array is implemented, though, so this might be totally off :)

There may also be a storage-order mismatch: your image could be stored column by column while you write it row by row. That gives poor cache behavior and can slow things down considerably.

Try switching the order of the X and Y loop and see if you gain anything.
There is some info on the storage ordering here:
http://www.boost.org/doc/libs/1_37_0/libs/multi_array/doc/user.html

EDIT:
Since you seem to be using the two-dimensional array for image processing, you might be interested in checking out Boost's image-processing library, GIL.

It might have arrays with lower overhead that work perfectly for your situation.

Which is the fastest? A boost::multi_array or a std::vector?

The best advice is to benchmark it yourself.

In any case, since your sizes seem to be constant, there are other options:

  • plain C arrays (e.g. int data[X][Y][Z])
  • a plain one-dimensional C array in which you compute the indices yourself, e.g. (x*Y + y)*Z + z for element (x, y, z) of an X*Y*Z block; this can be handy in some situations
  • std::array, which is basically a C array with some syntactic sugar borrowed from the STL collections
  • std::vector, which is probably the first solution to try
  • boost::multi_array, which is meant to support N-dimensional arrays, so it may be overkill for your purpose, but probably has better locality of data than a vector of vectors

Boost::multi_array -- referencing too slow

Where you zero-initialize your multi_arrays, you can try using std::memset instead. For example:

std::memset(b.data(), 0, size_of_b_in_bytes);

There are a few places in your code where you index the same multi_array element more than once. For example, instead of

h[i][j] = h[i][j] + pp[k]*v[k][i]

try

h[i][j] += pp[k]*v[k][i]

Normally, the optimizer would automatically make such substitutions for you, but maybe it can't with multi_array.

I also spotted two loops over i that can be fused into one to avoid indexing the same multi_array element multiple times:

/*
for(i=0; i<=j; i++)
{
    for(k=0; k<n; k++)
    {
        h[i][j] = h[i][j] + pp[k]*v[k][i];
    }
}

for(i=0; i<=j; i++)
{
    for(k=0; k<n; k++)
    {
        ppp[k] = ppp[k] + h[i][j]*v[k][i];
    }
}
*/

for(i=0; i<=j; i++)
{
    long double& h_elem = h[i][j];
    for(k=0; k<n; k++)
    {
        h_elem += pp[k]*v[k][i];
    }
    // ppp must see the fully accumulated h[i][j], so this stays a second k loop
    for(k=0; k<n; k++)
    {
        ppp[k] += h_elem*v[k][i];
    }
}

There might be more like these. Note the use of a reference to "remember" an element and avoid recomputing its position in the multi_array on every iteration. (Fully interleaving the two k loops would change the results, because ppp[k] may only use h[i][j] after its sum over k is complete.)

In the last for loop from your code, you can avoid lots of recomputing of multi_array indices by using temporary variables and references:

/*
for(i=0;i<j+1;i++){
    c = hess[i][i]/sqrt(pow(hess[i][i],2)+pow(hess[i+1][i],2));
    s = hess[i+1][i]/sqrt(pow(hess[i][i],2)+pow(hess[i+1][i],2));
    for (k=0;k<=j;k++){
        inner1=c*hess[i][k]+s*hess[i+1][k];
        inner2=(-s)*hess[i][k]+c*hess[i+1][k];
        hess[i][k] = inner1;
        hess[i+1][k] = inner2;
    }
    b[i+1] = -s*b[i];
    b[i] = c*b[i];
}
*/

for(i=0;i<j+1;i++){
    long double hess_i_i = hess[i][i];
    long double hess_ip1_i = hess[i+1][i];
    long double temp = sqrt(pow(hess_i_i,2)+pow(hess_ip1_i,2));
    c = hess_i_i/temp;
    s = hess_ip1_i/temp;
    for (k=0;k<=j;k++){
        long double& hess_i_k = hess[i][k];
        long double& hess_ip1_k = hess[i+1][k];
        inner1=c*hess_i_k+s*hess_ip1_k;
        inner2=(-s)*hess_i_k+c*hess_ip1_k;
        hess_i_k = inner1;
        hess_ip1_k = inner2;
    }
    long double& b_i = b[i];
    b[i+1] = -s*b_i;
    b_i = c*b_i;
}

Double-check my work -- it's quite possible I've made a mistake somewhere. Note that I've stored sqrt(pow(hess_i_i,2)+pow(hess_ip1_i,2)) in a variable so that it isn't needlessly computed twice.

I doubt these minor tweaks will bring the runtime down to 5 seconds. The problem with multi_array is that the array dimensions are only known at runtime. Support for row-major/column-major ordering probably also induces some overhead.

With C-style multi-dimensional arrays, dimensions are known at compile time, so the compiler can produce "tighter" code.

By using Boost multi_arrays you're basically trading speed for flexibility and convenience.

Fastest method of accessing elements in Boost MultiArray

The fastest way to access every element of a boost::multi_array is via data() and num_elements().

With data() you gain access to the underlying raw storage (one contiguous block holding the array's elements), so no per-access index calculation is needed (consider also that multi_array can index arrays from bases other than 0, which is a further complication).

A simple test gives:

g++ -O3 -fomit-frame-pointer -march=native   (GCC v4.8.2)
Writing (index): 9.70651
Writing (data): 2.22353
Reading (index): 4.5973 (found 1)
Reading (data): 3.53811 (found 1)

clang++ -O3 -fomit-frame-pointer -march=native (CLANG v3.3)
Writing (index): 5.49858
Writing (data): 2.13678
Reading (index): 5.07324 (found 1)
Reading (data): 2.55109 (found 1)

By default boost access methods perform range checking. If a supplied index is out of the range defined for an array, an assertion will abort the program. To disable range checking, you can define the BOOST_DISABLE_ASSERTS preprocessor macro prior to including multi_array.hpp in your application.
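The define has to appear before the header is included, for example:

```cpp
// Must come before the include so multi_array's range checks compile away.
#define BOOST_DISABLE_ASSERTS
#include <boost/multi_array.hpp>
```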

This greatly reduces the performance difference:

g++ -O3 -fomit-frame-pointer -march=native   (GCC v4.8.2)
Writing (index): 3.15244
Writing (data): 2.23002
Reading (index): 1.89553 (found 1)
Reading (data): 1.54427 (found 1)

clang++ -O3 -fomit-frame-pointer -march=native (CLANG v3.3)
Writing (index): 2.24831
Writing (data): 2.12853
Reading (index): 2.59164 (found 1)
Reading (data): 2.52141 (found 1)

The performance difference (i.e. data() being faster) increases:

  • with a higher number of dimensions;
  • with fewer elements (for a large number of elements, the per-access index arithmetic matters less than the cache pressure of loading those elements into the CPU cache: the prefetcher sits there loading elements, and that takes a large proportion of the time).

Anyway, this kind of optimization is unlikely to make a measurable difference in a real program. You shouldn't worry about it unless you've conclusively determined, through profiling, that it is the source of a bottleneck.

Source:

#include <chrono>
#include <iostream>

// #define BOOST_DISABLE_ASSERTS
#include <boost/multi_array.hpp>

int main()
{
    using array3 = boost::multi_array<unsigned, 3>;
    using index = array3::index;

    using clock = std::chrono::high_resolution_clock;
    using duration = std::chrono::duration<double>;

    constexpr unsigned d1(300), d2(400), d3(200), sup(100);

    array3 A(boost::extents[d1][d2][d3]);

    // Writing via index
    const auto t_begin1(clock::now());
    unsigned values1(0);
    for (unsigned n(0); n < sup; ++n)
        for (index i(0); i != d1; ++i)
            for (index j(0); j != d2; ++j)
                for (index k(0); k != d3; ++k)
                    A[i][j][k] = ++values1;
    const auto t_end1(clock::now());

    // Writing directly
    const auto t_begin2(clock::now());
    unsigned values2(0);
    for (unsigned n(0); n < sup; ++n)
    {
        const auto sup(A.data() + A.num_elements());

        for (auto i(A.data()); i != sup; ++i)
            *i = ++values2;
    }
    const auto t_end2(clock::now());

    // Reading via index
    const auto t_begin3(clock::now());
    bool found1(false);
    for (unsigned n(0); n < sup; ++n)
        for (index i(0); i != d1; ++i)
            for (index j(0); j != d2; ++j)
                for (index k(0); k != d3; ++k)
                    if (A[i][j][k] == values1)
                        found1 = true;
    const auto t_end3(clock::now());

    // Reading directly
    const auto t_begin4(clock::now());
    bool found2(false);
    for (unsigned n(0); n < sup; ++n)
    {
        const auto sup(A.data() + A.num_elements());

        for (auto i(A.data()); i != sup; ++i)
            if (*i == values2)
                found2 = true;
    }
    const auto t_end4(clock::now());

    std::cout << "Writing (index): "
              << std::chrono::duration_cast<duration>(t_end1 - t_begin1).count()
              << std::endl
              << "Writing (data): "
              << std::chrono::duration_cast<duration>(t_end2 - t_begin2).count()
              << std::endl
              << "Reading (index): "
              << std::chrono::duration_cast<duration>(t_end3 - t_begin3).count()
              << " (found " << found1 << ")" << std::endl
              << "Reading (data): "
              << std::chrono::duration_cast<duration>(t_end4 - t_begin4).count()
              << " (found " << found2 << ")" << std::endl;

    return 0;
}

Iterate over all but d-th dimension of any boost::multi_array

Ok, after spending a significant amount of time studying the implementation of boost::multi_array, I am now ready to answer my own question: No, there are no provisions anywhere in boost::multi_array that would allow one to iterate along any but the first dimension. The best I could come up with is to construct an iterator that manually manages the N-1 indices that are being iterated over. Here is slice_iterator.hpp:

#include "boost/multi_array.hpp"

template <template <typename, std::size_t, typename...> class Array,
          typename T, std::size_t NDims>
struct SliceIterator {
    typedef Array<T, NDims> array_type;
    typedef typename array_type::size_type size_type;
    typedef boost::multi_array_types::index_range range;
    typedef boost::detail::multi_array::multi_array_view<T, 1> slice_type;
    typedef boost::detail::multi_array::index_gen<NDims, 1> index_gen;

    array_type& A;
    const size_type* shape;
    const long d;
    index_gen indices;
    bool is_end = false;

    SliceIterator(array_type& A, long d) : A(A), shape(A.shape()), d(d) {
        int i = 0;
        for (; i != d; ++i) indices.ranges_[i] = range(0);
        indices.ranges_[i++] = range();
        for (; i != NDims; ++i) indices.ranges_[i] = range(0);
    }

    SliceIterator& operator++() {
        // addition with carry, excluding dimension d
        int i = NDims - 1;
        while (1) {
            if (i == d) --i;
            if (i < 0) {
                is_end = true;
                return *this;
            }
            ++indices.ranges_[i].start_;
            ++indices.ranges_[i].finish_;
            if (indices.ranges_[i].start_ < shape[i]) {
                break;
            } else {
                indices.ranges_[i].start_ = 0;
                indices.ranges_[i].finish_ = 1;
                --i;
            }
        }
        return *this;
    }

    slice_type operator*() {
        return A[indices];
    }

    // fakes for iterator protocol (actual implementations would be expensive)
    bool operator!=(const SliceIterator& r) {
        return !is_end;
    }

    SliceIterator begin() { return *this; }
    SliceIterator end() { return *this; }
};

template <template <typename, std::size_t, typename...> class Array,
          typename T, std::size_t NDims>
SliceIterator<Array, T, NDims> make_slice_iterator(Array<T, NDims>& A, long d) {
    return SliceIterator<Array, T, NDims>(A, d);
}

// overload for rvalue references
template <template <typename, std::size_t, typename...> class Array,
          typename T, std::size_t NDims>
SliceIterator<Array, T, NDims> make_slice_iterator(Array<T, NDims>&& A, long d) {
    return SliceIterator<Array, T, NDims>(A, d);
}

It can be used as

for (auto S : make_slice_iterator(A, d)) {
    f(S);
}

and works for all examples in my question.

I must say that boost::multi_array's implementation was quite disappointing to me: over 3700 lines of code for what should be little more than a bit of index housekeeping. In particular, the iterators (which are only provided for the first dimension) are nowhere near a performant implementation: up to 3*N + 5 comparisons are carried out at each step just to decide whether the iterator has reached the end (my implementation above sidesteps this by faking operator!=()). That makes them unsuitable for anything but arrays with a dominant last dimension, which is handled more efficiently. Moreover, the implementation doesn't take advantage of dimensions that are contiguous in memory: it always proceeds dimension by dimension for operations such as array assignment, wasting significant optimization opportunities.

In summary, I find numpy's implementation of an N-dimensional array much more compelling than this one. Three more observations tell me "hands off" boost::multi_array:

  • I couldn't find any serious use cases for boost::multi_array anywhere on the web
  • Development appears to have essentially stopped in 2002
  • This (and similar) questions on StackOverflow have hardly generated any interest ;-)

How to specify degenerate dimension of boost multi_array at runtime?

Please try this. The code has one disadvantage: it refers to the ranges_ member declared in the boost::detail::multi_array namespace.

#include <boost/multi_array.hpp>

typedef boost::multi_array<double, 3> array_type;
typedef boost::multi_array_types::index_gen::gen_type<2,3>::type index_gen_type;
typedef boost::multi_array_types::index_range range;

index_gen_type
func(int degenerate_dimension, int slice_index)
{
    index_gen_type slicer;
    for (int i = 0; i < 3; ++i) {
        if (degenerate_dimension == i)
            slicer.ranges_[i] = range(slice_index);
        else
            slicer.ranges_[i] = range();
    }
    return slicer;
}

int main(int argc, char **argv)
{
    array_type myarray(boost::extents[3][3][3]);
    array_type::array_view<2>::type myview = myarray[ func(2, 1) ];
    return 0;
}

