Sparse Binary Matrix

Issue #171 new
Christian Henning created an issue

Is there a way to define a sparse binary matrix? I know there is CompressedMatrix<T, rowMajor>, but in this case we don't need T. All we need to store are the indices of the "ON" bits.

If it's not available, how hard would it be to define such a matrix?

Thanks, Christian

Comments (11)

  1. Klaus Iglberger

    Hi Christian!

    At this point in time there is unfortunately no sparse binary matrix available. As you correctly mentioned, the only option that you have today is CompressedMatrix. The memory overhead that you would have to pay if you used a CompressedMatrix is approximately a factor of two (since you have to store the values in addition to the indices, and since there are some alignment requirements). Performance-wise, the overhead is directly related to the memory overhead, since you have to explicitly load all values, too. However, our estimate is that with the generic algorithms as we use them today within Blaze, the performance overhead should be in the single-digit percentage range.

    Within Blaze 3.3, adding a new data structure (vector or matrix) is a significant endeavor. Whereas writing a BinaryMatrix class yourself should not be too complicated (especially if you use CompressedMatrix as a blueprint), integrating it into the library is more involved. Unfortunately, there is no guide yet on what is required beyond writing the matrix class itself. However, we would be willing to help you with the implementation. Part of this help is the set of modifications that we are currently implementing for Blaze 3.4, which will make the task of adding new data structures to Blaze significantly easier. Additionally, we feel that the idea might be useful for other users of Blaze and will therefore consider adding such a data structure in a future release.

    Thanks for creating this issue,

    Best regards,

    Klaus!

  2. Christian Henning reporter

    Hi Klaus,

    thanks for your reply. A few questions:

    1. Based on your reply, should I wait for 3.4 before attempting an implementation?

    2. Could I just approximate a binary sparse matrix via CompressedMatrix<bool, rowMajor>? Or is that inefficient?

    If I were to try an implementation with 3.3, do the following ideas make sense? Please note that I'm mostly shooting from the hip, so some questions might be premature.

    • Declare a class like this:
    template< bool SO = defaultStorageOrder >  // Storage order
    class CompressedBinaryMatrix
       : public SparseMatrix< CompressedBinaryMatrix<SO>, SO >
    {
       // ...
    };
    
    • Replace the CompressedMatrix::Element class with bool. Or maybe get rid of it altogether? After all, we only need to store indices.

    • Rework the insertion interface to remove the Type parameter. Similarly for retrieving an element: just return a bool indicating whether a bit is on or off.

    • The underlying memory is just a vector<pair<uint32,uint32>> indicating the indices of the ON bits (see the rough sketch after this list).

    • There are many optimizations imaginable. For instance, an index into a 256-bit vector is just a uint8. At a sparsity of 2% that's roughly five uint8s, so the whole vector could be packed into a single 64-bit word.
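
    Very roughly, the storage idea could be sketched like this (a naive mock-up with made-up names, without any attempt at efficiency or Blaze integration yet):

    #include <algorithm>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // Naive sketch: store only the (row, col) coordinates of the ON bits.
    class SketchBinaryMatrix
    {
     public:
        // Mark the bit at (row, col) as ON.
        void set( std::uint32_t row, std::uint32_t col )
        {
            bits_.emplace_back( row, col );
        }

        // Check whether the bit at (row, col) is ON (linear search for simplicity).
        bool test( std::uint32_t row, std::uint32_t col ) const
        {
            return std::find( bits_.begin(), bits_.end(),
                              std::make_pair( row, col ) ) != bits_.end();
        }

     private:
        std::vector< std::pair<std::uint32_t,std::uint32_t> > bits_;  //!< Indices of the ON bits.
    };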

    If you're interested where a sparse binary vector/matrix is useful have a look here:

    http://www.cortical.io/science-sparse-distributed-representations.html

    https://numenta.com/assets/pdf/biological-and-machine-intelligence/BaMI-SDR.pdf

    Regards, Christian

  3. Klaus Iglberger

    Hi Christian!

    1. Based on your reply, should I wait for 3.4 before attempting an implementation?

    The hard part is not the implementation of the CompressedBinaryMatrix class, but the integration into the library such that you can perform arithmetic operations and combine your class with the existing data structures. For instance, in Blaze 3.3 you have to manually define the resulting data types for each possible arithmetic operation. This is necessary since Blaze might have to create temporaries within the evaluation of complex expressions. Without this you will unfortunately experience very complex compilation errors. This is one of the details that we are currently improving. It will take us a few weeks to finish this refactoring, but then (hopefully) your task should be significantly simplified (you might even start before 3.4 is officially released). However, you can of course start to implement the matrix class now and see how far you get. I will try to help as much as possible.

    2. Could I just approximate a binary sparse matrix via CompressedMatrix<bool, rowMajor>? Or is that inefficient?

    CompressedMatrix does not use the classical CRS or CCS storage format, but stores its elements as index/value pairs. Although this wastes some space, it proves to be pretty efficient and is very helpful for writing to a CompressedMatrix in parallel. However, due to this setup, a CompressedMatrix<bool> would require 8 bytes to store a single bool because of alignment restrictions. Therefore, from a storage point of view, it would be very inefficient and nowhere close to an implementation that only stores indices.
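
    To illustrate the alignment effect with a rough mock-up (this is not Blaze's actual element type, just a sketch of an index/value pair):

    #include <cstddef>

    // Mock-up of a bool/index pair element: the single payload byte of the bool
    // is padded up to the alignment of the 8-byte index.
    struct BoolIndexPair
    {
        bool        value_;  // 1 byte of payload, plus 7 bytes of padding
        std::size_t index_;  // 8 bytes
    };

    // On a typical 64-bit platform, sizeof(BoolIndexPair) is 16 bytes, although
    // only the index carries any information for a binary matrix.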

    If I were to try an implementation with 3.3, do the following ideas make sense? Please note that I'm mostly shooting from the hip, so some questions might be premature.

    Your design ideas for a CompressedBinaryMatrix sound reasonable. However, as stated before, this is the simple part. The data structure needs to be able to work in combination with the existing data structures (at least that is my expectation). I really hope that in 3.4 you will be able to just write your CompressedBinaryMatrix class and immediately use it productively, even in combination with the existing algorithms and data structures.

    I should point out that the existing algorithms (foremost the matrix multiplication) will not be able to exploit the fact that the matrix is guaranteed to only store ones and zeros. Therefore any potential performance gain will be entirely due to the storage savings.

    Best regards,

    Klaus!

  4. Christian Henning reporter

    Great Klaus!

    I'll be waiting for 3.4. Let me know when there is something to toy with. In the meantime I'll just stick with CompressedMatrix<bool> since it works nicely with all the other matrix types.

    Good luck refactoring!

    Christian

  5. Klaus Iglberger

    Hi Christian!

    We have released Blaze 3.4 and have integrated the promised changes. You should now be able to add a binary matrix based on the implementation of CompressedMatrix and use it in all available operations. If you encounter any problems, please tell us so that we can resolve them.

    Best regards,

    Klaus!

  6. Christian Henning reporter

    Thanks Klaus,

    this is great news! I'll give it a shot over the weekend.

    Christian

  7. Christian Henning reporter

    Hi Klaus,

    I have been looking at the code for CompressedMatrix. The CompressedBinaryMatrix doesn't need to store any element values, only the indices of the non-zero elements. So when I define the class, I don't think I need the private Element class, and the Iterator type would just be size_t*.

    Here is what I'm thinking in code:

    class CompressedBinaryMatrix
    {
    public:
        //**Type definitions****************************************************************************
        using This           = CompressedBinaryMatrix;   //!< Type of this CompressedBinaryMatrix instance.
        using ResultType     = This;                      //!< Result type for expression template evaluations.
        using OppositeType   = CompressedBinaryMatrix;    //!< Result type with opposite storage order for expression template evaluations.
        using TransposeType  = CompressedBinaryMatrix;    //!< Transpose type for expression template evaluations.
        using ElementType    = size_t;                    //!< Type of the compressed matrix elements.
        using ReturnType     = const size_t&;             //!< Return type for expression template evaluations.
        using CompositeType  = const This&;               //!< Data type for composite expression templates.
        using Reference      = MatrixAccessProxy<This>;   //!< Reference to a compressed matrix value.
        using ConstReference = const size_t&;             //!< Reference to a constant compressed matrix value.
        using Iterator       = size_t*;                   //!< Iterator over non-constant elements.
        using ConstIterator  = const size_t*;             //!< Iterator over constant elements.
        //**********************************************************************************************

        //**Member variables****************************************************************************
        /*!\name Member variables */
        //@{
        size_t m_;         //!< The current number of rows of the compressed matrix.
        size_t n_;         //!< The current number of columns of the compressed matrix.
        size_t capacity_;  //!< The current capacity of the pointer array.
        Iterator* begin_;  //!< Pointers to the first non-zero element of each row.
        Iterator* end_;    //!< Pointers one past the last non-zero element of each row.
        //@}
        //**********************************************************************************************
    };
    

    Does that match your way of thinking? And also, do you think I still need the Storage Order template argument?

    Thanks, Christian

  8. Klaus Iglberger

    Hi Christian!

    Unfortunately it is not enough to have size_t* as the iterator type. The iterators must provide access to the indices and values via the index() and value() functions (see the wiki). In the case of the CompressedMatrix class, this interface is provided by the private Element class.
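
    For illustration, a minimal element type for a binary matrix that still satisfies this interface could look roughly like this (placeholder names, not Blaze's actual Element class):

    #include <cstddef>

    // Hypothetical element of a compressed binary matrix: only the index is
    // stored, while value() always reports an ON bit.
    class BinaryElement
    {
     public:
        explicit BinaryElement( std::size_t index ) : index_( index ) {}

        std::size_t index() const { return index_; }  // Index of the ON bit.
        bool        value() const { return true;   }  // An ON bit is always 1.

     private:
        std::size_t index_;
    };

    // The iterator types could then be BinaryElement* and const BinaryElement*.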

    If you are only interested in one specific form of compressed binary matrix (i.e. either row-major or column-major matrices), then you can omit the storage order template parameter.

    In which operations do you plan to use the CompressedBinaryMatrix? Do you need it for any arithmetic operations (e.g. matrix-vector multiplications, matrix addition, matrix multiplication, ...)? Do you want to use it in combination with other matrix types?

    Best regards,

    Klaus!

  9. Christian Henning reporter

    Hi Klaus,

    if I understand you correctly, I need to provide an Element type which supports the same interface. The only difference would be that my Element won't have a Type value_ member. Instead it would just return 1, indicating an ON bit.

    I'd like to use this matrix type for the following operations:

    • convert to/from dense matrix types
    • convert to/from compressed matrix types
    • encoding information into a unified representation (see SDR - Sparse Distributed Representation).
    • bit operations, AND (Overlap), OR (Union), XOR, NOT
    • Subsampling

    In general, random access must be fast.

    If you are interested, here are some videos illustrating the use case:

    https://www.youtube.com/playlist?list=PL3yXMgtrZmDqhsFQzwUC9V8MeeVOQ7eZ9

    Regards, Christian

  10. Klaus Iglberger

    Hi Christian!

    if I understand you correctly, I need to provide an Element type which supports the same interface. The only difference would be that my Element won't have a Type value_ member. Instead it would just return 1, indicating an ON bit.

    Correct. It's just about the interface, which is required by all sparse matrix algorithms. Even though you don't plan on using them and will implement most of the operations you need yourself (see below), it would be better to adhere to this common interface.

    I'd like to use this matrix type for the following operations:

    • convert to/from dense matrix types
    • convert to/from compressed matrix types

    Conversion operations would have to be defined in the binary matrix class itself.

    • encoding information into a unified representation (see SDR - Sparse Distributed Representation).
    • bit operations, AND (Overlap), OR (Union), XOR, NOT

    All of these are operations that we currently don't provide. You would have to implement them yourself (see the sketch below for one possible approach).

    • Subsampling

    I'm not sure whether this merely means creating submatrices (which is already supported) or something else (which might not yet be supported).
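
    Regarding the bit operations: if the ON-bit indices of each row are kept in sorted order, they can be expressed with the standard set algorithms. A sketch under that assumption (plain std::vector of indices, nothing that Blaze provides):

    #include <algorithm>
    #include <cstddef>
    #include <iterator>
    #include <vector>

    using IndexRow = std::vector<std::size_t>;  // Sorted column indices of the ON bits of one row.

    // Overlap (AND): indices that are ON in both rows.
    IndexRow rowAnd( const IndexRow& a, const IndexRow& b )
    {
        IndexRow r;
        std::set_intersection( a.begin(), a.end(), b.begin(), b.end(), std::back_inserter( r ) );
        return r;
    }

    // Union (OR): indices that are ON in either row.
    IndexRow rowOr( const IndexRow& a, const IndexRow& b )
    {
        IndexRow r;
        std::set_union( a.begin(), a.end(), b.begin(), b.end(), std::back_inserter( r ) );
        return r;
    }

    // XOR: indices that are ON in exactly one of the two rows.
    IndexRow rowXor( const IndexRow& a, const IndexRow& b )
    {
        IndexRow r;
        std::set_symmetric_difference( a.begin(), a.end(), b.begin(), b.end(), std::back_inserter( r ) );
        return r;
    }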

    Best regards,

    Klaus!
