Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Sort Sparse Matrix Rows In-place (reopen #12151) #12153

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

loumalouomega
Copy link
Member

Reopen #12151

We need to study the segmentation fault.

Sort values and column indices in the rows of a sparse matrix in-place. For each row of the input matrix, this saves

  • one allocation
  • four row-sized copies
  • one deallocation

FYI @loumalouomega

@loumalouomega loumalouomega requested a review from a team as a code owner March 6, 2024 11:22
@loumalouomega
Copy link
Member Author

Yes, a 10% time reduction.

 |  /           |                  
 ' /   __| _` | __|  _ \   __|    
 . \  |   (   | |   (   |\__ \  
_|\_\_|  \__,_|\__|\___/ ____/
           Multi-Physics 9.4."3"-core/improve-f91598bfeb-Release-x86_64
           Compiled for GNU/Linux and Python3.10 with GCC-11.4
Compiled with threading and MPI support.
Maximum number of threads: 16.
Running without MPI.
Matrix reading execution time:  65.3339204788208  seconds
Matrix multiplication execution time:  4.3560473918914795  seconds
 *  Las tareas reutilizarán el terminal, presione cualquier tecla para cerrarlo. 

Proposal:

 |  /           |                  
 ' /   __| _` | __|  _ \   __|    
 . \  |   (   | |   (   |\__ \  
_|\_\_|  \__,_|\__|\___/ ____/
           Multi-Physics 9.4."3"-core/matrix-product-performance-40bb818051-Release-x86_64
           Compiled for GNU/Linux and Python3.10 with GCC-11.4
Compiled with threading and MPI support.
Maximum number of threads: 16.
Running without MPI.
Matrix reading execution time:  65.36043405532837  seconds
Matrix multiplication execution time:  3.974817991256714  seconds
 *  Las tareas reutilizarán el terminal, presione cualquier tecla para cerrarlo. 

@loumalouomega
Copy link
Member Author

I couldn't solve the memory error.

@loumalouomega loumalouomega marked this pull request as draft March 6, 2024 13:24
@loumalouomega
Copy link
Member Author

I couldn't solve the memory error.

Making draft to avoid merge

@RiccardoRossi
Copy link
Member

@loumalouomega
Copy link
Member Author

@matekelemen would you mind to merge #12143 first?

@matekelemen
Copy link
Contributor

Give me until Monday to debug the problem.

@loumalouomega
Copy link
Member Author

Give me until Monday to debug the problem.

On the other side I will try proposal of @RiccardoRossi

@loumalouomega
Copy link
Member Author

Give me until Monday to debug the problem.

On the other side I will try proposal of @RiccardoRossi

Implementing this as:

template <typename TSize, typename Col, typename TIndexType, typename ValueType>
    static inline void SortRows(
        const TIndexType* CPtr,
        const TSize NRows,
        const TSize NCols,
        Col* Columns,
        ValueType* Values
        )
    {
        IndexPartition<TSize>(NRows).for_each([&](TSize i_row) {
            const TIndexType row_begin = CPtr[i_row];
            const TIndexType row_end = CPtr[i_row + 1];
            TSize length = row_end - row_begin;

            // Create and sort a vector of indices based on column indices
            std::vector<IndexType> indices(length);
            std::iota(indices.begin(), indices.end(), row_begin); // Starting from row_begin
            std::sort(indices.begin(), indices.end(), [&](IndexType i, IndexType j) {
                return Columns[i] < Columns[j];
            });

            // Apply the sorted indices to permute Columns and Values in place
            // Note: This requires a careful approach to avoid overwriting
            std::vector<bool> visited(length, false);
            for (TSize i = 0; i < length; ++i) {
                if (visited[i] || indices[i] == row_begin + i)
                    continue; // Already in place or visited

                // Cycle detection in permutation
                TSize cycle_start = i;
                IndexType prev_index = i;
                Col temp_col = Columns[row_begin + i];
                ValueType temp_val = Values[row_begin + i];

                do {
                    IndexType next_index = indices[prev_index] - row_begin;
                    std::swap(Columns[row_begin + prev_index], temp_col);
                    std::swap(Values[row_begin + prev_index], temp_val);

                    visited[prev_index] = true;
                    prev_index = next_index;
                } while (prev_index != cycle_start);
            }
        });
    }

It is slower than this proposal (but faster than #12143):

 |  /           |                  
 ' /   __| _` | __|  _ \   __|    
 . \  |   (   | |   (   |\__ \  
_|\_\_|  \__,_|\__|\___/ ____/
           Multi-Physics 9.4."3"-core/improve-9a1ba2bc3b-Release-x86_64
           Compiled for GNU/Linux and Python3.10 with GCC-11.4
Compiled with threading and MPI support.
Maximum number of threads: 16.
Running without MPI.
Matrix reading execution time:  65.73923110961914  seconds
Matrix multiplication execution time:  4.149796724319458  seconds
 *  Las tareas reutilizarán el terminal, presione cualquier tecla para cerrarlo. 

It addition it also crashes, so I think the problem is on the std::swap, maybe some values are not defined and induces a memory error.

@matekelemen
Copy link
Contributor

matekelemen commented Mar 8, 2024

Ok I found the problem: std::sort doesn't always use std::swap. Specifically on small ranges, it will use the move (or copy) constructor and move (or copy) assignment operator, which is a problem in our case because the value type of our iterator is just a pair of pointers pointing inside the range, and thus they cannot moved (or copied) outside it.

I added some logic to move these values in and out the array, at some extra copying expense. We'll see how it affects performance (I'll run some tests and provide the results here).

P.S.: I didn't see a measurable improvement

@loumalouomega loumalouomega marked this pull request as ready for review March 11, 2024 09:31
@loumalouomega
Copy link
Member Author

With teh suggested change:

 |  /           |                  
 ' /   __| _` | __|  _ \   __|    
 . \  |   (   | |   (   |\__ \  
_|\_\_|  \__,_|\__|\___/ ____/
           Multi-Physics 9.4."3"-core/avoid-copy-before-sort-matrix-37ffadf0e5-Release-x86_64
           Compiled for GNU/Linux and Python3.10 with GCC-11.4
Compiled with threading and MPI support.
Maximum number of threads: 16.
Running without MPI.
Matrix reading execution time:  67.21308064460754  seconds
Matrix multiplication execution time:  4.146542310714722  seconds
 *  Las tareas reutilizarán el terminal, presione cualquier tecla para cerrarlo. 

Similar time to the @RiccardoRossi propossal. Faster than #12143, but much more code required. What option doi you prefer? (I prefer @RiccardoRossi option as it is cleaner and simpler)

@loumalouomega
Copy link
Member Author

With teh suggested change:

 |  /           |                  
 ' /   __| _` | __|  _ \   __|    
 . \  |   (   | |   (   |\__ \  
_|\_\_|  \__,_|\__|\___/ ____/
           Multi-Physics 9.4."3"-core/avoid-copy-before-sort-matrix-37ffadf0e5-Release-x86_64
           Compiled for GNU/Linux and Python3.10 with GCC-11.4
Compiled with threading and MPI support.
Maximum number of threads: 16.
Running without MPI.
Matrix reading execution time:  67.21308064460754  seconds
Matrix multiplication execution time:  4.146542310714722  seconds
 *  Las tareas reutilizarán el terminal, presione cualquier tecla para cerrarlo. 

Similar time to the @RiccardoRossi propossal. Faster than #12143, but much more code required. What option doi you prefer? (I prefer @RiccardoRossi option as it is cleaner and simpler)

Myabe we should test more matrices to actually perform a benchmark

@matekelemen
Copy link
Contributor

matekelemen commented Mar 11, 2024

Similar time to the @RiccardoRossi propossal. Faster than #12143, but much more code required. What option doi you prefer? (I prefer @RiccardoRossi option as it is cleaner and simpler)

I think the original implementation in #12143 (copy both arrays to a temporary array of pairs and sort that, then copy them back) is the most clean, and we should go with it for now. I ran my test case 5 times for both implementations and couldn't measure a difference in performance.

My specs for running the benchmarks:

  • 11th Gen Intel(R) Core(TM) i9-11900 @ 2.50GHz
  • 384K L1, 4M L2, 16M L3
  • OMP_NUM_THREADS=8 (only physical cores)

Myabe we should test more matrices to actually perform a benchmark

That's a good idea, if you're willing to generate some test operands that lead to a variety of row densities in the output matrix.

@loumalouomega
Copy link
Member Author

Similar time to the @RiccardoRossi propossal. Faster than #12143, but much more code required. What option doi you prefer? (I prefer @RiccardoRossi option as it is cleaner and simpler)

I think the original implementation in #12143 (copy both arrays to a temporary array of pairs and sort that, then copy them back) is the most clean, and we should go with it for now. I ran my test case 5 times for both implementations and couldn't measure a difference in performance.

My specs for running the benchmarks:

* 11th Gen Intel(R) Core(TM) i9-11900 @ 2.50GHz

* 384K L1, 4M L2, 16M L3

* `OMP_NUM_THREADS=8` (only physical cores)

Myabe we should test more matrices to actually perform a benchmark

That's a good idea, if you're willing to generate some test operands that lead to a variety of row densities in the output matrix.

In that case we can merge #12143 and then point this branch to master for merge, and expereiment changes and benchmark there.

@loumalouomega
Copy link
Member Author

FYI @RiccardoRossi

@loumalouomega
Copy link
Member Author

Copying message from @RiccardoRossi:

this code, taken from:
https://codereview.stackexchange.com/questions/231352/c17-zip-iterator-compatible-with-stdsort

would most probably do what you need without copies. Having said this, admittedly i am not sure wheter it is faster or not, but i would expect that allocation of a lot of pairs is not for free. 
(the main is at the end of the post ... everything at the beginning are the helper functions)

~~~cpp
#include <cstdint>
#include <algorithm>
#include <functional>
#include <tuple>
#include <utility>

template <typename ...T>
class ZipRef {
    std::tuple<T*...> ptr;
public:
    ZipRef() = delete;
    ZipRef(const ZipRef& z) = default;
    ZipRef(ZipRef&& z) = default;
    ZipRef(T* const... p): ptr(p...) {}

    ZipRef& operator=(const ZipRef& z)             { return copy_assign(z); }
    ZipRef& operator=(const std::tuple<T...>& val) { return val_assign(val); }

    template <std::size_t I = 0>
    ZipRef& copy_assign(const ZipRef& z) {
        *(std::get<I>(ptr)) = *(std::get<I>(z.ptr));
        if constexpr(I+1 < sizeof...(T)) return copy_assign<I+1>(z);
        return *this;
    }
    template <std::size_t I = 0>
    ZipRef& val_assign(const std::tuple<T...>& t) {
        *(std::get<I>(ptr)) = std::get<I>(t);
        if constexpr(I+1 < sizeof...(T)) return val_assign<I+1>(t);
        return *this;
    }

    std::tuple<T...> val() const {return std::apply([](auto&&...args) { return std::tuple((*args)...); }, ptr);}
    operator std::tuple<T...>() const { return val(); }

    void swap(const ZipRef& o) const {
        swap_impl<sizeof...(T)-1>(o);
    }

private:
    template <std::size_t I>
    void swap_impl(const ZipRef& o) const {
        using std::swap;
        swap(*std::get<I>(ptr), *std::get<I>(o.ptr));
        if constexpr(I) swap_impl<I-1>(o);
    }

public:
#define OPERATOR(OP)                                                    \
    bool operator OP(const ZipRef & o) const { return val() OP o.val(); } \
    inline friend bool operator OP(const ZipRef& r, const std::tuple<T...>& t) { return r.val() OP t; } \
    inline friend bool operator OP(const std::tuple<T...>& t, const ZipRef& r) { return t OP r.val(); }

    OPERATOR(==) OPERATOR(<=) OPERATOR(>=)
    OPERATOR(!=) OPERATOR(<)  OPERATOR(>)
#undef OPERATOR

};

template<typename ...IT>
class ZipIter {
    std::tuple<IT...> it;

    template<int N, typename... T> using NthTypeOf =
        typename std::tuple_element<N, std::tuple<T...>>::type;
    template<typename... T> using FirstTypeOf = NthTypeOf<0, T...>;

public:
    using iterator_category = typename std::iterator_traits<FirstTypeOf<IT...>>::iterator_category;
    using difference_type   = typename std::iterator_traits<FirstTypeOf<IT...>>::difference_type;
    using value_type        = std::tuple<typename std::iterator_traits<IT>::value_type ...>;
    using pointer           = std::tuple<typename std::iterator_traits<IT>::pointer ...>;
    using reference         = ZipRef<typename std::iterator_traits<IT>::value_type ...>;

    ZipIter() = default;
    ZipIter(const ZipIter &rhs) = default;
    ZipIter(ZipIter&& rhs) = default;
    ZipIter(IT... rhs): it(std::move(rhs)...) {}

    ZipIter& operator=(const ZipIter& rhs) = default;
    ZipIter& operator=(ZipIter&& rhs) = default;

    ZipIter& operator+=(const difference_type d) {
        std::apply([&d](auto&&...args) {((std::advance(args,d)),...);}, it); return *this;
    }
    ZipIter& operator-=(const difference_type d) { return operator+=(-d); }

    reference operator* () const {return std::apply([](auto&&...args) {return reference(&(*(args))...);}, it);}
    pointer   operator->() const {return std::apply([](auto&&...args) {return pointer(&(*(args))...);}, it);}
    reference operator[](difference_type rhs) const {return *(operator+(rhs));}

    ZipIter& operator++() { return operator+=(1); }
    ZipIter& operator--() { return operator+=(-1); }
    ZipIter operator++(int) {ZipIter tmp(*this); operator++(); return tmp;}
    ZipIter operator--(int) {ZipIter tmp(*this); operator--(); return tmp;}

    difference_type operator-(const ZipIter& rhs) const {return std::get<0>(it)-std::get<0>(rhs.it);}
    ZipIter operator+(const difference_type d) const {ZipIter tmp(*this); tmp += d; return tmp;}
    ZipIter operator-(const difference_type d) const {ZipIter tmp(*this); tmp -= d; return tmp;}
    inline friend ZipIter operator+(const difference_type d, const ZipIter& z) {return z+d;}
    inline friend ZipIter operator-(const difference_type d, const ZipIter& z) {return z-d;}

#define OPERATOR(OP)                                                    \
    bool operator OP(const ZipIter& rhs) const {return it OP rhs.it;}
    OPERATOR(==) OPERATOR(<=) OPERATOR(>=)
    OPERATOR(!=) OPERATOR(<)  OPERATOR(>)
#undef OPERATOR
};

template<typename ...Container>
class Zip {
    std::tuple<Container&...> zip;

public:
    Zip() = delete;
    Zip(const Zip& z) = default;
    Zip(Zip&& z) = default;
    Zip(Container&... z)
        : zip {z...}
    {
        auto const len = {(z.size())...,};
        // if (std::adjacent_find(len.begin(), len.end(), std::not_equal_to()) != len.end())
        //     throw std::error("array lengths differ");
    }

#define HELPER(OP)                                                      \
    auto OP() {return std::apply([](auto&&... args) { return ZipIter((args.OP())...);}, zip);}
    HELPER(begin) HELPER(end)
    HELPER(rbegin) HELPER(rend)
#undef HELPER

#define HELPER(OP)                                                      \
    auto OP() const {return std::apply([](auto&&... args) { return ZipIter((args.OP())...);}, zip);}
    HELPER(begin) HELPER(end)
    HELPER(rbegin) HELPER(rend)
    HELPER(cbegin) HELPER(cend)
    HELPER(crbegin) HELPER(crend)
#undef HELPER
};

#include <utility>
using std::swap;
template<typename ...T> void swap(const ZipRef<T...>& a, const ZipRef<T...>& b) { a.swap(b); }

#include <sstream>
template< class Ch, class Tr, class...IT, typename std::enable_if<(sizeof...(IT)>0), int>::type = 0>
auto& operator<<(std::basic_ostream<Ch, Tr>& os, const ZipRef<IT...>& t) {
    std::basic_stringstream<Ch, Tr> ss;
    ss << "[ ";
    std::apply([&ss](auto&&... args) {((ss << args << ", "), ...);}, t.val());
    ss.seekp(-2, ss.cur);
    ss << " ]";
    return os << ss.str();
}


#include <vector>
#include <string>
#include <algorithm>
#include <iostream>

int main() {
    std::vector<int> a {3,1,4,2};
    //std::vector<std::string> b {"Alice","Bob","Charles","David"};
    std::vector<double> b {3000,1000,4000,2000};

    auto const zip = Zip(a,b);

    for (const auto & z: zip) std::cout << z << std::endl;

    std::cout << std::endl;
    std::sort(zip.begin(), zip.end());
    for (const auto & z: zip) std::cout << z << std::endl;
}

alternatively also boost has a zip_iterator which probably would allow us doing the same

Base automatically changed from core/improve to master March 26, 2024 12:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants