Matrix access and multiplication optimization on CPU

I made some intrinsics-optimized matrix wrappers in Java (with the help of JNI). Needing some affirmation on this, can you give some hints about matrix optimizations? What I want to implement is:

A matrix can be represented as four sets of buffers/arrays: one for horizontal access, one for vertical access, one for diagonal access, and a command buffer that computes matrix elements only when they are needed. Here is an example.

    Matrix signature:
    0 1 2 3
    4 5 6 7
    8 9 1 3
    3 5 2 9

    First (horizontal) set:
    horSet[0] = {0,1,2,3}
    horSet[1] = {4,5,6,7}
    horSet[2] = {8,9,1,3}
    horSet[3] = {3,5,2,9}

    Second (vertical) set:
    verSet[0] = {0,4,8,3}
    verSet[1] = {1,5,9,5}
    verSet[2] = {2,6,1,2}
    verSet[3] = {3,7,3,9}

    Third (optional) diagonal set:
    diagS = {0,5,1,9}          // just in case some calculation needs this

    Fourth set (calculation list, in a "one operation, one operand" fashion):
    calc = {0,2, 1,3, 2,5}
    // 0 means "multiply by the next element"
    // 1 means "add the next element"
    // 2 means "divide by the next element"
    // so this list means ((a[i]*2) + 3) / 5, applied only when a[i] is needed

    Example for the fourth set:
    A.mult(2),  A.sum(3),   A.div(5),   A.mult(B)
    (to list)   (to list)   (to list)   (compute *, +, / just in time when A is needed)

    so there is only one memory access for four operations:

    loop start
        a[i] = b[i] * (((a[i]*2) + 3) / 5)     // only one loop, for A.mult(B)
    loop end
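To make the fourth set concrete, here is a minimal C++ sketch (not your wrapper's actual code) of how such a calculation list could be consumed lazily and fused into the next element-wise multiply. The op codes follow the example above; the type and member names are hypothetical:

    #include <vector>

    // Sketch: deferred scalar ops (0 = mul, 1 = add, 2 = div) are queued in
    // `calc` and only applied when the data is actually touched, here fused
    // into an element-wise multiply by another buffer `b`.
    struct LazyRow {
        std::vector<float> a;      // one horizontal set, e.g. horSet[i]
        std::vector<float> calc;   // op/operand pairs: {0,2, 1,3, 2,5} == ((x*2)+3)/5

        void multInPlace(const std::vector<float> &b) {
            for (std::size_t i = 0; i < a.size(); ++i) {
                float x = a[i];
                for (std::size_t c = 0; c + 1 < calc.size(); c += 2) {
                    switch (static_cast<int>(calc[c])) {
                        case 0: x *= calc[c + 1]; break;  // multiply by next element
                        case 1: x += calc[c + 1]; break;  // add next element
                        case 2: x /= calc[c + 1]; break;  // divide by next element
                    }
                }
                a[i] = b[i] * x;   // a[i] = b[i] * (((a[i]*2)+3)/5), one pass over memory
            }
            calc.clear();          // the queue has been consumed
        }
    };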

As shown above, the second set provides contiguous access when column elements are needed, with no strided jumps. The first set does the same for horizontal access.

This should make some things easier and some things harder:

Easier:

** Matrix transposing: just swapping the horSet[x] and verSet[x] pointers is enough.

** Matrix * Matrix multiplication: one matrix supplies one of its horizontal sets and the other matrix supplies a vertical buffer. The dot product of these should be highly parallelizable for intrinsics/multithreading (see the sketch after this list). If the multiplication order is reversed, the horizontal and vertical sets are simply switched.

** Matrix * vector multiplication: same as above, and a vector can freely be treated as either horizontal or vertical.

Harder:

** Doubling the memory requirement is bad for many cases.

** Initializing a matrix takes longer.

** When a matrix is multiplied from the left, its vertical sets need to be propagated to the horizontal sets if it is going to be multiplied from the right afterwards (and the same for the opposite case). (If a transposition happens in between, this update is not needed.)

Neutral:

** The same matrix can be multiplied with two other matrices to get two different results, e.g. A=A*B (saved in the horizontal sets) and A=C*A (saved in the vertical sets); then A=A*A gives A*B*C*A (in the horizontal sets) and C*A*A*B (in the vertical sets) without copying A.

** If a matrix is always multiplied from the left, or always from the right, no access or multiplication ever needs an update and everything stays contiguous in RAM.

** Using only the horizontal sets before a transposition and only the vertical sets after it should not break any rules.
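A rough C++ sketch of the "Easier" cases above, assuming a hypothetical DualMatrix layout rather than your JNI wrapper's API: the inner loop of the product walks two contiguous buffers, and a transpose reduces to swapping the two sets.

    #include <utility>
    #include <vector>

    // Hypothetical layout: horSet[r] is row r, verSet[c] is column c, both contiguous.
    struct DualMatrix {
        int n = 0;
        std::vector<std::vector<float>> horSet;  // horizontal (row) buffers
        std::vector<std::vector<float>> verSet;  // vertical (column) buffers

        // C = A * B: every inner loop reads two contiguous buffers,
        // which is what makes it friendly to AVX and to one thread per row.
        static std::vector<std::vector<float>> mul(const DualMatrix &A, const DualMatrix &B) {
            std::vector<std::vector<float>> C(A.n, std::vector<float>(A.n, 0.0f));
            for (int r = 0; r < A.n; ++r)
                for (int c = 0; c < A.n; ++c) {
                    float dot = 0.0f;
                    for (int k = 0; k < A.n; ++k)
                        dot += A.horSet[r][k] * B.verSet[c][k];   // row of A · column of B
                    C[r][c] = dot;
                }
            return C;
        }

        // Transpose: just exchange the roles of the two sets, no element is moved.
        void transpose() { std::swap(horSet, verSet); }
    };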

The main purpose is to have matrices whose dimensions are multiples of 8 and to apply AVX intrinsics with multiple threads (each thread working on one set at the same time).

So far I have only implemented the vector * vector dot product. If this programming approach is sound, I will move further in this direction.

The dot product I wrote (with intrinsics) is 6x faster than a loop-unrolled version (which itself is twice as fast as the plain multiplication loop), and when multithreading is enabled in the wrapper it also hits the memory-bandwidth ceiling (8x → nearly 20 GB/s, close to the limit of my DDR3). I have already tried OpenCL; it is somewhat slow for the CPU but very good for the GPU.
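For reference, a C++/AVX dot-product kernel along these lines might look roughly like the sketch below (assuming float data, a length that is a multiple of 8, and an AVX-enabled build; this is an illustration, not your JNI code):

    #include <immintrin.h>

    // Sketch of an 8-wide AVX dot product; n is assumed to be a multiple of 8.
    float dotAVX(const float *a, const float *b, int n) {
        __m256 acc = _mm256_setzero_ps();
        for (int i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));  // acc += va * vb
        }
        // Horizontal reduction of the 8 partial sums.
        float tmp[8];
        _mm256_storeu_ps(tmp, acc);
        float sum = 0.0f;
        for (int i = 0; i < 8; ++i) sum += tmp[i];
        return sum;
    }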

Thanks.

Edit: How would a "block matrix" buffer perform? When multiplying large matrices, small patches are multiplied in a special order so that the cache can be used to reduce main-memory accesses. But this would require even more updates between the vertical/horizontal/diagonal sets and those blocks during matrix multiplication.
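For comparison, a cache-blocked multiply on a plain row-major layout usually looks like the following C++ sketch (the block size 64 is only a placeholder to tune, and all names here are made up):

    #include <algorithm>
    #include <vector>

    // Sketch of cache-blocked C += A * B on row-major n x n matrices.
    // BS is a tuning knob; tiles of A, B and C should fit in cache together.
    void mulBlocked(const std::vector<float> &A, const std::vector<float> &B,
                    std::vector<float> &C, int n, int BS = 64) {
        for (int i0 = 0; i0 < n; i0 += BS)
            for (int k0 = 0; k0 < n; k0 += BS)
                for (int j0 = 0; j0 < n; j0 += BS)
                    // One tile: reuses the A and B patches while they are hot in cache.
                    for (int i = i0; i < std::min(i0 + BS, n); ++i)
                        for (int k = k0; k < std::min(k0 + BS, n); ++k) {
                            float aik = A[i * n + k];
                            for (int j = j0; j < std::min(j0 + BS, n); ++j)
                                C[i * n + j] += aik * B[k * n + j];
                        }
    }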

Some libraries use expression templates to enable very specific, optimized functions for a chain of matrix operations.

The C++ Programming Language (4th edition) also has a short section on "fused operations" (29.5.4).

This makes it possible to chain statements such as:

    M = A*B.transp(); // where M, A, B are matrices

For this case you would need three classes:

    class Matrix;

    // Lightweight proxy that tags a matrix as "use transposed".
    class Transposed {
    public:
        explicit Transposed(Matrix &matrix) : m_matrix(matrix) {}
        Matrix &obj() const { return m_matrix; }
    private:
        Matrix &m_matrix;
    };

    // Proxy representing the not-yet-evaluated product "Matrix * Transposed".
    class MatrixMatrixMulTransPosed {
    public:
        MatrixMatrixMulTransPosed(Matrix &matrix, const Transposed &trans)
            : m_matrix(matrix), m_transposed(trans.obj()) {}
        Matrix &matrix() const { return m_matrix; }
        Matrix &transposed() const { return m_transposed; }
    private:
        Matrix &m_matrix;
        Matrix &m_transposed;
    };

    class Matrix {
    public:
        Transposed transp() { return Transposed(*this); }

        MatrixMatrixMulTransPosed operator*(const Transposed &rhs) {
            return MatrixMatrixMulTransPosed(*this, rhs);
        }

        Matrix &operator=(const MatrixMatrixMulTransPosed &mmtrans) {
            // The actual fused computation goes here and is stored in *this,
            // using mmtrans.matrix() and mmtrans.transposed().
            return *this;
        }
    };

You can push this concept further so that there is a dedicated, optimized function for every computation chain that matters.

This essentially amounts to caching the transpose. It sounds like you intend to do it eagerly; I would compute the transposition only when it is needed and memoize it in case it is needed again. That way, if you never need it, it is never computed.
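A minimal C++ sketch of that lazy, memoized approach (all names here are made up for illustration): the column copy is built on the first column access and invalidated on every write.

    #include <vector>

    // Sketch: row-major storage is authoritative; the column copy is built on demand.
    class LazyTransposeMatrix {
    public:
        explicit LazyTransposeMatrix(int n) : n_(n), rows_(n * n, 0.0f) {}

        void set(int r, int c, float v) {
            rows_[r * n_ + c] = v;
            colsValid_ = false;                 // any write invalidates the cached columns
        }

        // First column access pays for the transpose; later accesses reuse it.
        const float *column(int c) {
            if (!colsValid_) {
                cols_.assign(n_ * n_, 0.0f);
                for (int r = 0; r < n_; ++r)
                    for (int k = 0; k < n_; ++k)
                        cols_[k * n_ + r] = rows_[r * n_ + k];
                colsValid_ = true;
            }
            return cols_.data() + c * n_;
        }

    private:
        int n_;
        std::vector<float> rows_, cols_;
        bool colsValid_ = false;
    };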