
Refactoring rulinalg in terms of a lower level BLAS-like interface #155

Open · Andlon opened this issue Feb 19, 2017 · 18 comments

@Andlon
Collaborator

Andlon commented Feb 19, 2017

As I've been working on rewriting some of our decompositions, it has become increasingly clear that we could massively benefit from a rulinalg-specific BLAS-like API. More specifically, I propose the creation of a new blas module which has BLAS-like functions covering different functionality from the three levels of BLAS. For example:

Example of a BLAS-1 operation:
   y <- β y + α x            (y and x vectors; α, β constants)

Example of a BLAS-2 operation:
   y <- β y + α A x          (y and x vectors; A a matrix; α, β constants)

Example of a BLAS-3 operation:
   C <- β C + α A B          (A, B, C matrices; α, β constants)

I propose rewriting all matrix/vector arithmetic (i.e. Add, Sub, Mul implementations) in terms of such functions. This has some significant benefits:

  • Implementing opt-in BLAS-support is much easier: just replace the implementations of the BLAS-like functions.
  • Many high-performance algorithms for linear algebra are described in terms of BLAS operations which constitute the vast majority of work involved. For example, LU decomposition can be implemented extremely efficiently by reformulating it in terms of certain BLAS-3 operations (which generally have the best cache access patterns).
  • It also allows users of the library to have a finer degree of control, rather than relying on implementations of Mul/Add/etc. to do the right thing, which they won't when computing something like C = 2 * C + 3 * A * B: the operator chain incurs unnecessary memory allocations (see the sketch after this list).
  • Presumably most algorithms exposed by rulinalg would be written in terms of these functions, which means that the majority of the optimization work would be contained within this module.
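
To make the allocation point concrete, here is a rough sketch contrasting the two styles (the gemm signature is the hypothetical one proposed below, and the operator impls are assumed):

// With operators: every binary op allocates a fresh matrix, so
// C = 2 * C + 3 * A * B produces several temporaries along the way.
let c = &c * 2.0 + (&a * &b) * 3.0;

// With a BLAS-like routine: the same update, C <- 2 C + 3 A B, runs in
// one pass over C's existing storage, allocating nothing.
gemm(2.0, 3.0, &a, &b, &mut c);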

I haven't decided on the exact naming of the functions. I think we should be somewhat BLAS-compatible in order to easily facilitate the aforementioned opt-in BLAS support, but I'm more of a fan of the BLIS set of routines. See also this paper. I became aware of this framework through @bluss's blog post on matrixmultiply.

We need these APIs to be as generic as possible. I think they need to follow these rules:

  • Functions should never allocate any heap memory. Any necessary storage should be passed in as mutable references.
  • Functions should operate on MatrixSlice(Mut) for matrices and &[T] or &mut [T] for vector-like types. The reason for the regular slice instead of e.g. Vector is that some algorithms require you to treat e.g. partial rows of matrices as if they were vectors. This was the case in my implementation of the "GAXPY-rich Cholesky" in Cholesky decomposition rewrite: initial work #150.

An example:

/// y <- β y + α A x
/// General matrix-vector multiply.
// Note: I haven't thought about the order of arguments here
fn gemv<M, T>(beta: T, alpha: T, y: &mut [T], a: &M, x: &[T])
    where M: BaseMatrix<T>
{
    // Highly optimized implementation goes here
}

We could use this to e.g. implement matrix-vector products, and this particular pattern comes up a lot in linear algebra algorithms.
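
For instance, the allocating Mul impl could become a thin wrapper over the non-allocating routine (a sketch assuming the gemv signature above, with T bounded by num's Zero/One):

// Matrix * Vector in terms of gemv: the only allocation happens here,
// at the operator level; gemv itself never allocates.
let mut y = vec![T::zero(); a.rows()];
gemv(T::zero(), T::one(), &mut y, &a, x.data()); // y <- 1 * A x + 0 * y
let y = Vector::new(y);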

If people agree that this could be a good idea, I suggest we start by making this blas (or whatever name it should have) module entirely a private module for now. Then we can refactor in multiple stages and make decisions about the API. Hopefully some time in the future we'll be confident enough in our choices to make it a public module.

Thoughts/questions/ideas/criticism to this idea are very welcome!

@AtheMathmo
Owner

AtheMathmo commented Feb 19, 2017

I think this is a solid idea and a good way to stretch into supporting opt-in blas (and others) in the future. A few comments:

Functions should never allocate any heap memory. Any necessary storage should be passed in as mutable references.

I know that this is necessary but I think we should take some care in doing this in the safest way. For example consider: C <- β C + α A B from above. Let's say that really the user just wants to compute the matrix product AB. Then ideally they would like to reserve space for C without initializing it, something like:

let mut data = Vec::with_capacity(n*n);
unsafe { data.set_len(n*n); }

// C hasn't been initialized but can be efficiently filled
let mut c = Matrix::new(n, n, data);
dgemm(1.0, 0.0, &a, &b, &mut c);

This is just to illustrate that to be optimal here we need to do some unsafe things. Note that matrixmultiply supports this uninitialized memory by not accessing C when β = 0. I think a good way to address this is to do as you say - keep the blas module private for now and provide safe wrappers for these optimal routines.

I think naming is a tricky issue as well. The BLAS names are (perhaps a little too) compact and easy to remember/understand once you are familiar with them. But I think that for new users it can be pretty difficult to figure out what function you need. I'd like to try and work around this where possible.

There is also the f32 vs f64 issue. Each type requires separate handling with BLAS to ensure optimality. This is tricky to do in Rust without lots of code bloat - for matrixmultiply we do a runtime type check before calling the gemm routines. If our blas module has versions of each function for both sizes then we will probably have similar runtime type checks to make the API nicer for users (or specialization hopefully).
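
For reference, such a runtime check can look roughly like this (a minimal sketch for a BLAS-1 style axpy; axpy_f64 stands in for whatever specialized kernel we use, BLAS-backed or not):

use std::any::TypeId;
use std::mem::transmute_copy;
use std::ops::{AddAssign, Mul};

// Specialized kernel: stands in for a BLAS call or a hand-tuned loop.
fn axpy_f64(alpha: f64, x: &[f64], y: &mut [f64]) {
    for (yi, xi) in y.iter_mut().zip(x) {
        *yi += alpha * *xi;
    }
}

// Generic front end: checks the type at runtime and forwards to the
// specialized kernel when T is f64, otherwise runs a generic fallback.
fn axpy<T>(alpha: T, x: &[T], y: &mut [T])
    where T: 'static + Copy + AddAssign + Mul<Output = T>
{
    if TypeId::of::<T>() == TypeId::of::<f64>() {
        // Reinterpreting is sound only because we just proved T == f64.
        let alpha = unsafe { transmute_copy::<T, f64>(&alpha) };
        let x = unsafe { std::slice::from_raw_parts(x.as_ptr() as *const f64, x.len()) };
        let y = unsafe { std::slice::from_raw_parts_mut(y.as_mut_ptr() as *mut f64, y.len()) };
        axpy_f64(alpha, x, y);
    } else {
        for (yi, xi) in y.iter_mut().zip(x) {
            *yi += alpha * *xi;
        }
    }
}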

One other concern is that this could be a really huge sub-project. Because of this we need to have a good battle-plan going in. Maybe we can break down the work into small but useful chunks so that if things get too much we can take a break without stalling development elsewhere.


On one slightly unrelated note. I have been thinking a little more about the arithmetic traits lately. In particular, the fact that we have ops::Mul represent matrix multiplication and provide dedicated functions for element-wise multiplication/division. I'm starting to convince myself that we might want to have these the other way round. Of course this is a big breaking change but I thought it might be sensible to include it as part of the API changes this work would involve. What do you think?

@Andlon
Collaborator Author

Andlon commented Feb 20, 2017

You bring up some very good points!

The problem of uninitialized input is difficult to resolve within the confines of a safe API. As you say, we can write an internal unsafe one and expose a safe abstraction on top, but this still makes us have to deal with unsafe pointer-based APIs spread across the code base, which is also not entirely ideal.

I'm having some vague ideas about a MatrixBuffer reference wrapper that encapsulates the possibility of the buffer not being initialized, as well as providing Into<MatrixBuffer> (ish) implementations for &mut MatrixSliceMut. Not sure if this will lead to anything.

I think naming is a tricky issue as well. The BLAS names are (perhaps a little too) compact and easy to remember/understand once you are familiar with them. But I think that for new users it can be pretty difficult to figure out what function you need. I'd like to try and work around this where possible.

I think this can be solved by good documentation. Basically a table of the mathematical meaning, and link to corresponding function. I'd like to read a bit more about BLIS - it seems very well structured.

There is also the f32 vs f64 issue. Each type requires separate handling with BLAS to ensure optimality. This is tricky to do in Rust without lots of code bloat - for matrixmultiply we do a runtime type check before calling the gemm routines. If our blas module has versions of each function for both sizes then we will probably have similar runtime type checks to make the API nicer for users (or specialization hopefully).

I think that until we get specialization, we will probably have to resort to runtime checks. However, I think it will take us a while before we get to the point of specializing individual types for most of the operations in question. I think we can get pretty far with some generic manual loop unrolling and auto-vectorization.
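
To illustrate, here is a sketch of generic manual unrolling for an axpy-style kernel (the unroll factor of four is arbitrary and would need benchmarking):

use std::ops::{AddAssign, Mul};

// y <- y + alpha * x, unrolled by 4 to give the auto-vectorizer an easy target.
fn axpy_unrolled<T>(alpha: T, x: &[T], y: &mut [T])
    where T: Copy + AddAssign + Mul<Output = T>
{
    assert_eq!(x.len(), y.len());
    let mut yc = y.chunks_exact_mut(4);
    let mut xc = x.chunks_exact(4);
    for (cy, cx) in (&mut yc).zip(&mut xc) {
        cy[0] += alpha * cx[0];
        cy[1] += alpha * cx[1];
        cy[2] += alpha * cx[2];
        cy[3] += alpha * cx[3];
    }
    // Handle the tail that didn't fit a full chunk of four.
    for (yi, xi) in yc.into_remainder().iter_mut().zip(xc.remainder()) {
        *yi += alpha * *xi;
    }
}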

One other concern is that this could be a really huge sub-project. Because of this we need to have a good battle-plan going in. Maybe we can break down the work into small but useful chunks so that if things get too much we can take a break without stalling development elsewhere.

Agree completely. I think as long as we keep the blas module private, we can do it in many small steps, and I think it only really makes sense to do so on-demand. For example, if we need a certain BLAS-like feature to write a more efficient decomposition, it would make sense to do the work of implementing this particular BLAS-like function in the blas module. Over time we should cover a larger and larger number, until we can maybe make an effort to port the remaining arithmetic code to the blas module too.

On one slightly unrelated note. I have been thinking a little more about the arithmetic traits lately. In particular, the fact that we have ops::Mul represent matrix multiplication and provide dedicated functions for element-wise multiplication/division. I'm starting to convince myself that we might want to have these the other way round. Of course this is a big breaking change but I thought it might be sensible to include it as part of the API changes this work would involve. What do you think?

Can I ask you why you think it should be the other way around?

This is actually an area on which I have a very strong opinion. Basically, it boils down to the following: arrays are the wrong abstraction for linear algebra. Conversely, matrices are the right abstraction for linear algebra. For arrays, it may make sense for multiplication to mean elementwise multiplication. However, for matrices, matrix multiplication is a very well-defined operation, and you'd be hard pressed to find any mathematician or engineer to expect A * B for matrices A and B to mean anything but matrix multiplication of A and B.

The fact that one often works with matrices as if they were arrays in e.g. numpy is a huge disadvantage of its API. I cannot tell you how often I drop into a Python terminal and try to multiply two matrices by typing a * b. The scary thing here is that it does not fail, because for square matrices the elementwise product is well-defined, and so it can often take me some time of pondering before I recall that I must write a.dot(b), which is stupendously unintuitive. To be fair, numpy does have a matrix type, but it seems to be generally discouraged by the community.

As you can probably tell by now, I definitely don't think you should change it ;-) I'm curious to hear why you were considering it, though!

@AtheMathmo
Owner

AtheMathmo commented Feb 20, 2017

but this still makes us have to deal with unsafe pointer-based APIs spread across the code base, which is also not entirely ideal.

I agree that this wouldn't be the best. I think, however, that we might be able to avoid falling back to raw pointers:

extern crate blas_sys;

#[cfg(not(feature = "blas"))]
unsafe fn gemm<T: Float>(alpha: T, beta: T, a: &MatrixSlice<T>, b: &MatrixSlice<T>, c: &mut MatrixSliceMut<T>) {
    // Implement C = β C + α A B
}

#[cfg(feature = "blas")]
unsafe fn gemm<T: Float>(alpha: T, beta: T, a: &MatrixSlice<T>, b: &MatrixSlice<T>, c: &mut MatrixSliceMut<T>) {
    // Check the type of T and call either `sgemm` or `dgemm`
    blas_sys::cblas_dgemm(/* ... pull all these arguments from the input objects ... */)
}

The above function is unsafe because of the initialization issue in the not(blas) case, and because of the FFI call (with its attendant requirements) in the blas case. I've used a generic type here too, but this isn't really safe as written: we'd probably have to hide f32 and f64 behind our own trait to ensure that users can't (implicitly) call the blas_sys functions with Float implementors of different sizes. We will have to ensure that these functions are used safely within rulinalg.

arrays are the wrong abstraction for linear algebra. Conversely, matrices are the right abstraction for linear algebra.

I agree with you here, and actually it is because of this that I designed things as they are now. I also dislike how this is not the case in every other library I use. But I am not too sure whether fighting this convention is such a good idea and this is basically why I had been considering changing things. For now though, let's keep putting up the good fight and keep things how they are! :)

Edit: My above functions should probably also include some arguments to determine whether the matrices should be transposed. (So far as I'm aware computing AB^T is much more cache-efficient.)

@Andlon
Collaborator Author

Andlon commented Feb 20, 2017

If I understand you correctly, you want to construct a matrix from uninitialized data and then pass it into the function. I assume that you intend it to be called like:

let mut data = Vec::with_capacity(n * n);
unsafe { data.set_len(n * n); }
let mut c = Matrix::new(n, n, data);
gemm(1.0, 0.0, &a, &b, &mut c);

(I also want to note that this is only well defined in the first place if T: Copy)

I'm a little worried that this is a dangerous pattern to start using. It requires gemm to be very careful not to touch the uninitialized data before it has been initialized, and I can imagine it's easy to accidentally pass an uninitialized matrix to a function that isn't explicitly written to support uninitialized data. Or, going the other way: if one is currently passing uninitialized data to a function that supports it, and the function is later rewritten without taking that requirement into account, undefined behavior is invoked very easily by accident.

Here I've used a generic type too but this isn't really safe as I've written it. We'd probably have to hide f32 and f64 behind our own trait to ensure that users can't call the blas_sys functions (implicitly) with Float implementors of different sizes. We will have to ensure that these functions are used safely within rulinalg.

I'm confused. T always refers to a single type, so it would either be f32 or f64 in that case?

But I am not too sure whether fighting this convention is such a good idea and this is basically why I had been considering changing things.

MATLAB and most libraries I've used use * for matrix multiplication. Numpy is the only library I've used that doesn't (well, it does, but not really since people mostly use arrays rather than matrices). From my perspective at least, the 'convention' is to use '*'!

@AtheMathmo
Owner

AtheMathmo commented Feb 20, 2017

I'm a little worried that this is a dangerous pattern to start using.

I understand and agree with the concern. However - I'm not sure that there is a good way around this. At some layer in our abstraction we will want to have the choice to use an uninitialized Vec - as close to the FFI boundary as possible would be best. Otherwise we waste resources filling the Vec before passing it to BLAS/matrixmultiply/etc. Do you have any thoughts on how we can prevent this? (And yes we would need T: Copy which is a requirement of Float).

I'm confused. T always refers to a single type, so it would either be f32 or f64 in that case?

I meant specifically that the Float trait bound was dangerous. Really we only want these functions to work with f32 and f64 (or maybe complex types in the future). But the Float trait could be implemented for a custom type in a user's library, for example:

pub struct Float128(f64, f64);

impl Float for Float128 { /* ... */ }

But this type has the wrong size and we invoke undefined behaviour when we pass its raw pointer. We would have to either panic at runtime when we see a bad TypeId or we can create an InternalFloat trait which we implement for types that are accepted to gemm. This way the user can only call this function when they are using f32 or f64. Does that make sense? Maybe there is a better way around this that I'm not seeing.
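
For what it's worth, the InternalFloat idea can be made airtight with the sealed-trait pattern, so downstream crates cannot add implementors (a sketch):

mod private {
    // Not exported, so no downstream crate can implement `Sealed`.
    pub trait Sealed {}
    impl Sealed for f32 {}
    impl Sealed for f64 {}
}

// Only f32 and f64 can ever satisfy this bound, so an FFI-backed gemm
// bounded on `InternalFloat` can never be called with a wrongly-sized type.
pub trait InternalFloat: private::Sealed + Copy {}
impl InternalFloat for f32 {}
impl InternalFloat for f64 {}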

the 'convention' is to use '*'!

Even more reason to keep things as they are :). I had gotten MATLAB mixed up in my head; I thought it also followed numpy's pattern. I've been using arrayfire recently, which also has a dedicated matmul function. I have been convinced though, and am very happy to stick with the current approach!

@Andlon
Collaborator Author

Andlon commented Feb 21, 2017

(This is an answer only to the question of uninitialized data)

I think it should be possible to encapsulate the possibility of uninitialized data in a slightly more safe manner. Ultimately unsafe will be needed to actually do something with the data, but I'd like to encapsulate this as well as possible. I've been toying with an idea which I will sketch out below. I'm not sure if it works out, but it might be a start.

trait MatrixBuffer<T> {
    fn is_initialized(&self) -> bool;
    fn rows(&self) -> usize;
    fn cols(&self) -> usize;

    /// Returns a matrix slice whose data is guaranteed to be initialized.
    fn as_mut_slice<'a>(&'a mut self) -> MatrixSliceMut<'a, T>;

    /// Returns a pointer to the mutable data. The data may be uninitialized.
    unsafe fn as_mut_ptr(&mut self) -> *mut T;
}

impl<'a, T> MatrixBuffer<T> for MatrixSliceMut<'a, T> {
    fn is_initialized(&self) -> bool { true }
    unsafe fn as_mut_ptr(&mut self) -> *mut T { self.as_mut_ptr() }

    // Rest of the impl...
}

impl<T> MatrixBuffer<T> for Matrix<T> {
    // Same as MatrixSliceMut
}

struct UninitializedBuffer<T: Copy> {
    data: Vec<T>
}

impl<T: Copy> UninitializedBuffer<T> {
    fn new(rows: usize, cols: usize) -> Self { ... }
    fn into_matrix(self) -> Matrix<T> { ... }
}

impl<T: Copy> MatrixBuffer<T> for UninitializedBuffer<T> {
    // ...
}

fn gemm<T: Float, M: MatrixBuffer<T>>(
    alpha: T,
    beta: T,
    a: &MatrixSlice<T>,
    b: &MatrixSlice<T>,
    c: &mut M)
{
    // Implement C = β C + α A B
}

fn main() {
    let (a, b) = ...
    
    {
        // Call gemm with uninitialized buffer
        let mut c = UninitializedBuffer::new(100, 100);
        gemm(1.0, 0.0, &a, &b, &mut c);
        let c = c.into_matrix();
    }
    
    {
        // Call gemm with existing/initialized matrix/buffer
        let mut c = Matrix::zeros(100, 100);
        gemm(1.0, 1.0, &a, &b, &mut c);
    }
}

I stress that the above is a very rough sketch, and I don't even know if it would work out in terms of lifetimes etc. The idea here is to prevent any consumers of these APIs from having to deal with unsafe. Moreover, dealing with uninitialized data in each BLAS-like function is entirely optional: one can just call as_mut_slice() to retrieve a slice that's guaranteed to be initialized.

Do you think something like this could maybe work?

@Andlon
Collaborator Author

Andlon commented Feb 21, 2017

I meant specifically that the Float trait bound was dangerous. Really we only want these functions to work with f32 and f64 (or maybe complex types in the future). But the Float trait could be implemented for a custom type in a user's library, for example:

I think the runtime type check works quite well for this issue. We check the type and specialize if possible, otherwise we fall back to some fairly efficient implementation of our own (that presumably performs loop unrolling in the kernel at a minimum). This way we can also cover integers for the cases in which this makes sense, which BLAS does not handle at all. Hopefully we can move to compile-time specialization in the future.

That said, this approach risks leading us into a situation where we have to provide unnecessary "fallback" implementations.

I think actually it's completely OK to restrict the types supported by rulinalg to a specific set. For example, as you say we could provide a trait Scalar which we implement for f32 and f64. However, we also want to be able to work with integers in some cases. How do we accommodate them? Perhaps Scalar could be a very general trait, whereas a trait named Real: Scalar would represent real numbers, implemented by f32 and f64. An interesting idea would also be to allow e.g. Rational types, so that one can perform exact rational arithmetic. However, this may be somewhat out of scope and not really so important.
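
A rough sketch of what that hierarchy could look like (the exact bounds here are just placeholders):

use std::ops::{Add, Mul, Sub};

/// Anything rulinalg can store and do arithmetic on, integers included.
pub trait Scalar:
    Copy + Add<Output = Self> + Sub<Output = Self> + Mul<Output = Self>
{
}

/// Real numbers: the subset that decompositions and BLAS paths require.
pub trait Real: Scalar {
    fn sqrt(self) -> Self;
}

impl Scalar for i32 {}
impl Scalar for f32 {}
impl Scalar for f64 {}

impl Real for f32 {
    fn sqrt(self) -> Self { f32::sqrt(self) }
}
impl Real for f64 {
    fn sqrt(self) -> Self { f64::sqrt(self) }
}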

@AtheMathmo
Owner

Do you think something like this could maybe work?

It looks quite appealing in the way you've sketched it out. However, I'm not convinced we can allow the user to safely construct an UninitializedBuffer. After all - this constructor is really just an abstraction for unsafe { vec.set_len(n*m) }. The way I see it we have to be very careful to ensure that MatrixBuffers are only used with this fact in mind. And moreover - we hide how unsafe the action actually is. If a user ever interacts with an UninitializedBuffer they could very easily do something unsafe without realizing.

All that said - I do see the benefit of an UninitializedBuffer like this. It would help us handle dimensionality a little more safely, and we could even do a runtime check to make sure that the MatrixBuffer is initialized whenever β != 0. I just think that the UninitializedBuffer should probably still be unsafe.

However, we also want to be able to work with integers in some cases. How do we accommodate them?

I think what you laid out here is a good approach. And fortunately feature flags work in our favour here. A user will be able to use the BLAS-like functions on integers etc. if they haven't opted in to BLAS. But when they activate the flag the compiler will only accept T: Real rather than T: Num.

However, we would probably want to stick to poorly optimized functions for anything but floats initially. And Rational types seem like something we'd prefer to tackle in a separate crate, where the BLAS-like functions can handle them as long as they implement Num or something similar.

@Andlon
Collaborator Author

Andlon commented Feb 25, 2017

However, I'm not convinced we can allow the user to safely construct an UninitializedBuffer. After all - this constructor is really just an abstraction for unsafe { vec.set_len(n*m) }. The way I see it we have to be very careful to ensure that MatrixBuffers are only used with this fact in mind. And moreover - we hide how unsafe the action actually is. If a user ever interacts with an UninitializedBuffer they could very easily do something unsafe without realizing.

Actually, the way I had it in mind, it should be completely safe. At least, to my understanding, unsafe { vec.set_len(n*m); } by itself is fine as long as one does not attempt to read from the data. My idea for UninitializedBuffer is to not expose its data in any way whatsoever besides the accessors in the MatrixBuffer trait. If one calls as_mut_slice(), the data will be zero-initialized first, so that accessing the MatrixSliceMut is completely safe. Otherwise, the unsafe fn as_mut_ptr() function gives access to uninitialized data, but this requires explicit unsafe code.

The conclusion is that accessing any safe methods on UninitializedBuffer is completely safe, and the unsafe behavior is correctly hidden behind an unsafe function, in which case the fact that the data is uninitialized should be stressed in the documentation.
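
Concretely, the zero-on-first-safe-access step could look something like this (a sketch; the initialized flag is an assumed private field on UninitializedBuffer, and Zero is num's trait):

use num::Zero;

impl<T: Copy + Zero> UninitializedBuffer<T> {
    // Called by as_mut_slice() before handing out any safe view.
    fn ensure_initialized(&mut self) {
        if !self.initialized {
            // Writing to the uninitialized elements is fine for Copy types;
            // the invariant we rely on is that nothing *reads* them before here.
            for x in self.data.iter_mut() {
                *x = T::zero();
            }
            self.initialized = true;
        }
    }
}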

I think what you laid out here is a good approach. And fortunately feature flags work in our favour here. A user will be able to use the BLAS-like functions on integers etc. if they haven't opted in to BLAS. But when they activate the flag the compiler will only accept T: Real rather than T: Num.

Actually, I think we should do BLAS integration in a different way. Specifically, opting in to BLAS should not change the API in any way. Just imagine if you're using both integers and floats in your application, and you enable BLAS and suddenly the code involving integers simply breaks because BLAS does not support it. Instead I think we should make the BLAS integration transparent, in that the BLAS implementations are called for floating point types "under the hood". For now we'd have to rely on runtime checks, but eventually we can hopefully replace it with generic specialization.

Also, a significant barrier to BLAS support is that, as far as I know, BLAS only supports column-major matrices...?

EDIT: It seems that CBLAS and similar also support row-major matrices.

@AtheMathmo
Owner

My idea for UninitializedBuffer is to not expose its data in any way whatsoever besides the accessors in the MatrixBuffer trait.

I think that this could work actually. At the very least we should try this idea out and see if it gives us what we want with the additional safety.

Actually, I think we should do BLAS integration in a different way. Specifically, opting in to BLAS should not change the API in any way. Just imagine if you're using both integers and floats in your application, and you enable BLAS and suddenly the code involving integers simply breaks because BLAS does not support it.

I can see your argument here and I think the implementation isn't much different. Instead of replacing the whole function with a typed blas abstraction we would keep the same gemm function and inside it call sgemm and dgemm which are overwritten when we use the blas option. The integer branch would remain unchanged regardless of the feature setting.
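
In other words, something like this (a sketch with simplified slice arguments; only the typed kernel is feature-gated, while the generic wrapper and its runtime type check stay identical):

// Pure-Rust kernel, used when the `blas` feature is off.
#[cfg(not(feature = "blas"))]
fn dgemm(alpha: f64, beta: f64, a: &[f64], b: &[f64], c: &mut [f64]) {
    // our own implementation (or a call into matrixmultiply)
}

// Same name and signature, forwarding to BLAS when the feature is on.
#[cfg(feature = "blas")]
fn dgemm(alpha: f64, beta: f64, a: &[f64], b: &[f64], c: &mut [f64]) {
    // unsafe { blas_sys::cblas_dgemm(/* dims and strides from the inputs */) }
}

// The generic entry point is identical either way, so enabling `blas`
// cannot change or break any caller, integer code included.
fn gemm<T: 'static + Copy>(alpha: T, beta: T, a: &[T], b: &[T], c: &mut [T]) {
    // runtime type check dispatches to dgemm/sgemm for floats,
    // and to the generic fallback for everything else
}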

Also, a significant barrier to BLAS support is that, as far as I know, BLAS only supports column-major matrices...?

I know that blas-sys allows row major inputs but I'm not sure if there is any performance hit for using this. I think that row and column strides let us encode both column major and row major indexing and as a result it is possible to optimize a blas library to be efficient for both.
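
Both layouts reduce to parameter choices once strides are explicit. A small illustration:

// Element (i, j) of a matrix lives at data[i * rs + j * cs], whatever
// the layout; row-major and column-major are just stride choices.
fn offset(i: usize, j: usize, rs: usize, cs: usize) -> usize {
    i * rs + j * cs
}

fn main() {
    let (n, m) = (3, 4); // n rows, m cols
    assert_eq!(offset(1, 2, m, 1), 6); // row-major: rows contiguous
    assert_eq!(offset(1, 2, 1, n), 7); // column-major: columns contiguous
}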

@Andlon
Collaborator Author

Andlon commented Feb 25, 2017

I think that this could work actually. At the very least we should try this idea out and see if it gives us what we want with the additional safety.

Yeah, I agree it's worth trying out. We could attempt to rewrite one of the BLAS-1 routines as an experiment.

I'm still partial to a BLIS-like interface instead of BLAS, however, because it seems to require a lot less work to implement from scratch as opposed to BLAS. I will try to read more about BLIS when I have the time. As far as I understand, a BLAS API can be written on top of BLIS, so it might make sense in either case.

I can see your argument here and I think the implementation isn't much different. Instead of replacing the whole function with a typed blas abstraction we would keep the same gemm function and inside it call sgemm and dgemm which are overwritten when we use the blas option. The integer branch would remain unchanged regardless of the feature setting.

Yes, exactly!

I just had an idea. Not sure if it works out. How about we implement each BLAS-like function as a trait, and implement it for each datatype? You'd have roughly:

impl Gemm<...> for f64 { 
    fn gemm(...) {}
}

// Call gemm for f64
f64::gemm(...)

However, could this make writing generic code harder? Or could we define the Real trait along the lines of

pub trait Real: Gemm + <other BLAS functions>

This way we could still write generic functions in terms of the Real type (I think?).

In the end I think this gets unnecessarily complicated compared to runtime type checking, but it's perhaps still an interesting idea.

@bluss

bluss commented Feb 25, 2017

BLIS is exciting. When I evaluated it (probably a whole year ago) I concluded it's not a drop-in solution with multiple great implementations like BLAS. At that time, BLIS had very high overhead for small problems. I can't really quantify that, but think of it this way: OpenBLAS also makes sure that even small matrix operations are quick.

Python's numpy would also benefit from BLIS but doesn't use it, from what I understand.

@AtheMathmo
Owner

@bluss @Andlon - I have to admit that I don't know too much about BLIS. I also looked into it around a year ago when you first published your matrixmultiply crate, but my understanding of even BLAS was a little lackluster then. I would probably appreciate it more if I took another look now. I'm happy to look into it - hopefully it is suited to our needs. (The issues with small matrices seem like they would lead to code bloat if we have to special-case each operation ourselves, but we can cross that bridge if we get there.)

How about we implement each BLAS-like function as a trait, and implement it for each datatype?

I worry that we lose some functionality this way. For example, what if a user has their own custom type which implements Float? We would like our implementations to handle this (though of course we cannot use blas bindings here). I think this is an issue if we use traits because we can't do, e.g.

impl<T: Float> Gemm<T> for T { }

@Andlon
Collaborator Author

Andlon commented Feb 25, 2017

Thanks for your input, @bluss! Did you have the impression that overhead for small matrices is a fundamental restriction of the architecture of BLIS, or that it is/was an implementation detail?

The reason I'm curious to see if we can use something inspired by BLIS for rulinalg is that it seems to provide a shorter path to some reasonably performant system. Whereas implementing something BLAS-like seems to cause you to have to optimize code in many, many places, the promise of BLIS is that you can implement all operations in terms of a comparatively small number of kernels, and that optimizing these kernels automatically leads to optimized code for the rest of the routines.
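
For context, the kernel at the heart of BLIS computes one small register-blocked tile of C from pre-packed panels of A and B; everything else is packing and loops around it. A naive sketch of that contract (matrixmultiply's kernel follows the same shape; the names and the 4x4 tile size are illustrative):

// C_tile <- alpha * A_panel * B_panel + beta * C_tile for one MR x NR tile.
const MR: usize = 4;
const NR: usize = 4;

unsafe fn microkernel(
    k: usize,            // shared (inner) dimension
    alpha: f64,
    a_panel: *const f64, // packed: column p of the A-tile at a_panel[p * MR..]
    b_panel: *const f64, // packed: row p of the B-tile at b_panel[p * NR..]
    beta: f64,
    c: *mut f64,
    rsc: isize,          // row stride of C
    csc: isize,          // column stride of C
) {
    let mut acc = [[0.0f64; NR]; MR];
    for p in 0..k {
        for i in 0..MR {
            for j in 0..NR {
                acc[i][j] += *a_panel.add(p * MR + i) * *b_panel.add(p * NR + j);
            }
        }
    }
    for i in 0..MR {
        for j in 0..NR {
            let cij = c.offset(i as isize * rsc + j as isize * csc);
            // A real kernel would skip reading *cij when beta == 0,
            // per the uninitialized-C discussion above.
            *cij = alpha * acc[i][j] + beta * *cij;
        }
    }
}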

My point is that BLIS seems to offer a greater performance/effort ratio, which in the case of rulinalg is highly preferable since we're all doing this in our spare time. Moreover, we don't need a complete BLIS implementation: we can assemble something similar over time depending on our needs. And it seems to be possible to implement a BLAS-compatible API on top of BLIS (someone please correct me if I'm wrong), so perhaps we could do something like this:

  • Implement (internal?) custom BLIS-like API. We might also provide opt-in bindings for BLIS here later.
  • Implement BLAS-like API on top of the BLIS-like API, optionally accelerated by external BLAS libraries.
  • Implement binary operators and similar in terms of the BLAS API.

It might seem unnecessary to implement a BLIS-like API just to build a BLAS-like API on top, but if implementing a relatively high-performance BLIS-like API from scratch is significantly simpler than implementing a relatively high-performance BLAS-like API from scratch, it would probably still be worthwhile.

Also note that I still haven't had the time to actually read any of the BLIS papers yet. Hope to get around to it soon, but I'm pretty busy these days!

@Andlon
Collaborator Author

Andlon commented Feb 25, 2017

I worry that we lose some functionality this way. For example what if a user has their own custom type which implements Float. We would like our implementations to handle this (though of course we cannot use blas bindings here). I think this is an issue if we use traits because we can't do, e.g.

Yeah, this is a trade-off. Though keep in mind that in almost all cases, users will use either f32 or f64, and occasionally complex types (which we don't really support at the moment?).

It is also possible to provide default implementations of the traits (assuming T is appropriately bounded), so that one could just do impl Gemm for MyType. But to be clear, I think this is all a bit roundabout, and I'm still in favor of runtime type specialization for now, as it simplifies the user-facing API.

@AtheMathmo
Owner

I think the BLIS/BLAS composition you described sounds like a reasonable approach. I'm not sure exactly how it will work (due to a lack of familiarity with BLIS) but it seems like a good way to kick things off. I think one of the more important aspects is that rulinalg works with the higher-level BLAS API, so that we can easily swap in BLAS bindings if we need to (having the option to do this at the BLIS level would be great too!).

occasionally complex types (which we don't really support at the moment?).

All of the arithmetic operations would work for complex types with the appropriate trait bounds. But most of our decompositions require floats explicitly, as complex types change some of these algorithms (in mostly minor ways, I think).

I think this is all a bit round-about and I think I'm still in favor of the runtime type specialization for now

Yes me too. Maybe we can explore doing this through a trait interface later, once things are working.

@bluss

bluss commented Apr 17, 2017

Here's a talk about something BLIS-like. Maybe too multidimensional for y'all, but there are always ideas worth taking. https://www.youtube.com/watch?v=FbsJ6urR4x0

@Andlon
Collaborator Author

Andlon commented Apr 18, 2017

That was a very interesting talk, @bluss! I've never really had the need to work with tensors, so it's not my field of expertise, but I quite enjoyed seeing how the matrix packing that BLIS employs can be used to efficiently compute tensor contractions. Very cool!
