
Add a new GEMV kernel to BRGEMM and enable it in MatMul #5077

Open
densamoilov wants to merge 12 commits into main from dsamoylo/main/gemv

Conversation

@densamoilov
Contributor

@densamoilov densamoilov commented Apr 24, 2026

This PR adds a new GEMV kernel to BRGEMM to support the remaining cases and complete GEMV coverage.

The existing and the new GEMV kernels enable all four GEMV cases required for full support across layout and parameter combinations.

GEMV coverage in MatMul

| Vector dimension | A layout | B layout | Corresponding BRGEMV operation | BRGEMV `transA` parameter | BRGEMV `treat_y_as_row` parameter |
|---|---|---|---|---|---|
| N = 1 | ab | ab, ba | y = A * x | false | n/a |
| M = 1 | ab, ba | ba | yᵀ = xᵀ * Aᵀ | false | if true, output is yᵀ |
| N = 1 | ba | ab, ba | y = Aᵀ * x | true | n/a |
| M = 1 | ab, ba | ab | yᵀ = xᵀ * A | true | if true, output is yᵀ |

Note

  • transA: selects whether the BRGEMV uses A or Aᵀ
  • treat_y_as_row: for M=1, interprets y as a row vector
  • Batch dimensions are supported
  • Bias, post-ops and scales are supported

At the matmul level, these GEMV configurations are represented via gemv_strategy_t.
At the BRGEMM level, they are implemented using transA and treat_y_as_row.
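As a plain-C++ illustration of the table above (a sketch only; the real kernels are JIT-generated, and `gemv_ref` is a hypothetical name, not oneDNN API), the two operations selected by `transA` reduce to:

```cpp
#include <cassert>
#include <vector>

// Scalar reference for the two BRGEMV operations selected by transA.
// A is M x N, row-major ("ab" layout). With transA == false the kernel
// computes y = A * x (y has M elements); with transA == true it computes
// y = A^T * x (y has N elements). For the M = 1 rows of the table, the
// same routines cover y^T = x^T * A^T (== A * x) and y^T = x^T * A
// (== A^T * x); treat_y_as_row only changes how y is interpreted.
std::vector<float> gemv_ref(bool transA, int M, int N,
        const std::vector<float> &A, const std::vector<float> &x) {
    assert((int)A.size() == M * N);
    std::vector<float> y(transA ? N : M, 0.f);
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            if (transA)
                y[n] += A[m * N + n] * x[m]; // y = A^T * x
            else
                y[m] += A[m * N + n] * x[n]; // y = A * x
        }
    return y;
}
```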

Performance
Performance was evaluated on ADL and SRF and showed parity with the auto-generated GEMM kernels.
As a result, the GEMM implementations are no longer used in performance validation and have been fully replaced by BRGEMM matmul.

@densamoilov densamoilov requested a review from a team as a code owner April 24, 2026 20:30
@github-actions github-actions Bot added the platform:cpu-x64 Intel64/AMD64 processors. Codeowner: @oneapi-src/onednn-cpu-x64 label Apr 24, 2026
@densamoilov
Contributor Author

make test

const bool is_tail_acc
= gemv_is_tail_acc(bd, bd_block, is_bdb_tail);
// TODO: adjust A_offset(bd, 0)
const auto a_addr = ptr[reg_aux_A + bd * 8 * sizeof(float)];
Contributor

Suggestions:

  1. Remove the TODO unless there is something you still need to do here.
  2. Replace

const auto a_addr = ptr[reg_aux_A + bd * 8 * sizeof(float)];

with

const auto a_addr = ptr[reg_aux_A + bd * vreg_traits_t<Vmm>::vlen];

This removes the hardcoded 8 and makes it clear that the bd-accumulator is a full register width.

I did a little checking to make sure the 8 is always valid. It looks like this always takes a Ymm register, so the 8 is valid. However, removing the hardcoded 8 would improve maintainability.

Contributor Author

Oh, I left this TODO as part of the development process and simply overlooked it. It has now been completed. Thanks for pointing it out.

Comment thread src/cpu/x64/brgemm/brgemm_types.hpp
return bd_block2 * brg.bd_block * brg.LDD;
}

// TODO: check that these offsets are correct or 4 different kinds of offsets are needed
Contributor

As best I can tell the offsets are correct. Does this TODO still need to be here? What remains to be checked?

Contributor Author

This one has been completed. I just forgot to remove the comment. Removed it now, thanks.


// transA GEMV requires contiguous output (INCY == 1) for tensors with
// batch dimensions. Some batched layouts, e.g. `bac`, place the patch
// dimension between consecutive GEMV output elements and make INCY > 1.
Contributor

possible typo in comment: patch dimensions -> batch dimensions?
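For context, the INCY constraint described in that comment can be sketched in plain C++ (illustrative only; `stride_of` is a hypothetical helper, not oneDNN code): for a 3D output with dims `a` (batch), `b` (M), and `c` (N = 1), the `bac` layout places the batch dimension inside `b`, so the stride between consecutive GEMV output elements along `b` exceeds 1.

```cpp
#include <array>
#include <cassert>
#include <string>

// Hypothetical helper: given the dims of a 3D tensor ('a' = dim 0,
// 'b' = dim 1, 'c' = dim 2) and a layout string ordered from outermost
// to innermost, return the element stride of dimension `dim` in a dense
// tensor of that layout. With layout "bac" the batch dim 'a' sits inside
// 'b', so the stride of 'b' (INCY for an M-vector output) is larger
// than 1, which is what the transA GEMV path rejects.
long stride_of(const std::string &layout, const std::array<long, 3> &dims,
        char dim) {
    long stride = 1;
    for (int i = (int)layout.size() - 1; i >= 0; --i) {
        if (layout[i] == dim) return stride;
        stride *= dims[layout[i] - 'a'];
    }
    return -1; // dim not found in layout
}
```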

// Blocking parameters for the transposed case.
brg->ld_block = 1;
brg->ldb = brg->load_dim / brg->ld_block;
brg->ldb_tail = brg->load_dim % brg->ld_block;
Contributor

Low: Should this have an assert(brg->ldb_tail == 0) like line 790 of the non-transA path above?

Both calculations are the same; why does one have an assert and not the other?

Contributor Author

Yeah, let's add it.

@densamoilov
Contributor Author

make test

This kernel will enable matmul for the following cases:
- A is a matrix, B is a vector, and A is transposed
- A is a vector, B is a matrix, and B is not transposed
Redirect GEMV cases to GEMV code path when fpmath is not default
because it's expected to be faster than the GEMM path.
@densamoilov
Contributor Author

make test

brgemm_matmul now has broad support for GEMV cases. The only exceptions are
cases with unusual input/output layouts. However, the GEMV code path in
auto-generated GEMM is not expected to support them either. Therefore, the
decision is to always use brgemm_matmul for those exceptions, whether through
the GEMV path or the regular GEMM path, and avoid falling back to
auto-generated GEMM.
