Add a new GEMV kernel to BRGEMM and enable it in MatMul #5077
densamoilov wants to merge 12 commits into main
Conversation
make test
const bool is_tail_acc
        = gemv_is_tail_acc(bd, bd_block, is_bdb_tail);
// TODO: adjust A_offset(bd, 0)
const auto a_addr = ptr[reg_aux_A + bd * 8 * sizeof(float)];
Suggestions:
1. Remove the TODO unless there is something you still need to do here.
2. Replace
const auto a_addr = ptr[reg_aux_A + bd * 8 * sizeof(float)];
with
const auto a_addr = ptr[reg_aux_A + bd * vreg_traits_t<Vmm>::vlen];
This removes the hardcoded 8 and makes it clear that the bd-accumulator spans a full register width.
I did a little checking to make sure the 8 is always valid. It looks like this code path always takes a Ymm register, so the 8 is valid. However, removing the hardcoded 8 would improve maintainability.
Oh, I left this TODO as part of the development process and simply overlooked it. It has now been completed. Thanks for pointing it out.
return bd_block2 * brg.bd_block * brg.LDD;
}

// TODO: check that these offsets are correct or 4 different kinds of offsets are needed
As best I can tell the offsets are correct. Does this TODO still need to be here? What remains to be checked?
This one had been completed. I just forgot to remove the comment. Removed it now, thanks.
// transA GEMV requires contiguous output (INCY == 1) for tensors with
// batch dimensions. Some batched layouts, e.g. `bac`, place the patch
// dimension between consecutive GEMV output elements and make INCY > 1.
possible typo in comment: patch dimensions -> batch dimensions?
// Blocking parameters for the transposed case.
brg->ld_block = 1;
brg->ldb = brg->load_dim / brg->ld_block;
brg->ldb_tail = brg->load_dim % brg->ld_block;
Low: Should this have an assert(brg->ldb_tail == 0) like line 790 of the non-transA path above?
Both calculations are the same; why does one have an assert and not the other?
Yeah, let's add it.
make test
This kernel will enable matmul for the following cases:
- A is a matrix, B is a vector, and A is transposed
- A is a vector, B is a matrix, and B is not transposed
Redirect GEMV cases to the GEMV code path when fpmath mode is not default, because it is expected to be faster than the GEMM path.
make test
brgemm_matmul now has broad support for GEMV cases. The only exception is cases with unusual input/output layouts. However, the GEMV code path in auto-generated GEMM is not expected to support them either. Therefore, for those exceptions the decision is to always use brgemm_matmul, whether via the GEMV path or the regular GEMM path, and avoid falling back to auto-generated GEMM.
This PR adds a new GEMV kernel to BRGEMM to support the remaining cases and complete GEMV coverage.
The existing and the new GEMV kernels enable all four GEMV cases required for full support across layout and parameter combinations.
GEMV coverage in MatMul
| transA parameter | treat_y_as_row parameter | GEMV form |
| --- | --- | --- |
| false | false | y = A * x |
| false | true | yᵀ = xᵀ * Aᵀ |
| true | false | y = Aᵀ * x |
| true | true | yᵀ = xᵀ * A |

Note:
- transA: selects whether the BRGEMV uses A or Aᵀ
- treat_y_as_row: for M = 1, interprets y as a row vector

At the matmul level, these GEMV configurations are represented via gemv_strategy_t. At the BRGEMM level, they are implemented using transA and treat_y_as_row.

Performance


Performance was evaluated on ADL and SRF, showing parity with auto-generated GEMM kernels.
As a result, GEMM implementations are no longer used in performance validation and have been fully replaced by BRGEMM matmul.