
Add a new GEMV kernel to BRGEMM and enable it in MatMul #5077

Open
densamoilov wants to merge 12 commits into main from dsamoylo/main/gemv

Conversation

@densamoilov
Contributor

@densamoilov densamoilov commented Apr 24, 2026

This PR adds a new GEMV kernel to BRGEMM to support the remaining cases and complete GEMV coverage.

The existing and the new GEMV kernels enable all four GEMV cases required for full support across layout and parameter combinations.

GEMV coverage in MatMul

| Vector dimension | A layout | B layout | Corresponding BRGEMV operation | BRGEMV `transA` parameter | BRGEMV `treat_y_as_row` parameter |
|---|---|---|---|---|---|
| N = 1 | ab | ab, ba | y = A * x | false | n/a |
| M = 1 | ab, ba | ba | yᵀ = xᵀ * Aᵀ | false | if true, output is yᵀ |
| N = 1 | ba | ab, ba | y = Aᵀ * x | true | n/a |
| M = 1 | ab, ba | ab | yᵀ = xᵀ * A | true | if true, output is yᵀ |

Note

  • transA: selects whether the BRGEMV uses A or Aᵀ
  • treat_y_as_row: for M=1, interprets y as a row vector
  • Batch dimensions are supported
  • Bias, post-ops and scales are supported

At the matmul level, these GEMV configurations are represented via gemv_strategy_t.
At the BRGEMM level, they are implemented using transA and treat_y_as_row.
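As a plain-C++ illustration of the table above (a sketch only; the real kernels are JIT-generated, and `gemv_ref` is a hypothetical name, not oneDNN API), the two operations selected by `transA` reduce to:

```cpp
#include <cassert>
#include <vector>

// Scalar reference for the two BRGEMV operations selected by transA.
// A is M x N, row-major ("ab" layout). With transA == false the kernel
// computes y = A * x (y has M elements); with transA == true it computes
// y = A^T * x (y has N elements). For the M = 1 rows of the table, the
// same routines cover y^T = x^T * A^T (== A * x) and y^T = x^T * A
// (== A^T * x); treat_y_as_row only changes how y is interpreted.
std::vector<float> gemv_ref(bool transA, int M, int N,
        const std::vector<float> &A, const std::vector<float> &x) {
    assert((int)A.size() == M * N);
    std::vector<float> y(transA ? N : M, 0.f);
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            if (transA)
                y[n] += A[m * N + n] * x[m]; // y = A^T * x
            else
                y[m] += A[m * N + n] * x[n]; // y = A * x
        }
    return y;
}
```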

Performance
Performance was evaluated on ADL and SRF and showed parity with the auto-generated GEMM kernels.
As a result, the GEMM implementations are no longer used in performance validation and have been fully replaced by BRGEMM matmul.

@densamoilov densamoilov requested a review from a team as a code owner April 24, 2026 20:30
@github-actions github-actions Bot added the platform:cpu-x64 Intel64/AMD64 processors. Codeowner: @oneapi-src/onednn-cpu-x64 label Apr 24, 2026
@densamoilov
Contributor Author

make test

const bool is_tail_acc
= gemv_is_tail_acc(bd, bd_block, is_bdb_tail);
// TODO: adjust A_offset(bd, 0)
const auto a_addr = ptr[reg_aux_A + bd * 8 * sizeof(float)];
Contributor

Suggestions:

  1. Remove the TODO unless there is something you still need to do here.
  2. Replace

const auto a_addr = ptr[reg_aux_A + bd * 8 * sizeof(float)];

with

const auto a_addr = ptr[reg_aux_A + bd * vreg_traits_t<Vmm>::vlen];

This removes the hardcoded 8 and makes it clear that the bd-accumulator is a full register width.

I did a little checking to make sure the 8 is always valid. It looks like this always takes a Ymm register, so the 8 is valid. However, removing the hardcoded 8 would improve maintainability.

Contributor Author

Oh, I left this TODO as part of the development process and simply overlooked it. It has now been completed. Thanks for pointing it out.

Comment thread src/cpu/x64/brgemm/brgemm_types.hpp
return bd_block2 * brg.bd_block * brg.LDD;
}

// TODO: check that these offsets are correct or 4 different kinds of offsets are needed
Contributor

As best I can tell the offsets are correct. Does this TODO still need to be here? What remains to be checked?

Contributor Author

This one has been completed. I just forgot to remove the comment. Removed it now, thanks.


// transA GEMV requires contiguous output (INCY == 1) for tensors with
// batch dimensions. Some batched layouts, e.g. `bac`, place the patch
// dimension between consecutive GEMV output elements and make INCY > 1.
Contributor

possible typo in comment: patch dimensions -> batch dimensions?
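For context, the INCY constraint described in that comment can be sketched in plain C++ (illustrative only; `stride_of` is a hypothetical helper, not oneDNN code): for a 3D output with dims `a` (batch), `b` (M), and `c` (N = 1), the `bac` layout places the batch dimension inside `b`, so the stride between consecutive GEMV output elements along `b` exceeds 1.

```cpp
#include <array>
#include <cassert>
#include <string>

// Hypothetical helper: given the dims of a 3D tensor ('a' = dim 0,
// 'b' = dim 1, 'c' = dim 2) and a layout string ordered from outermost
// to innermost, return the element stride of dimension `dim` in a dense
// tensor of that layout. With layout "bac" the batch dim 'a' sits inside
// 'b', so the stride of 'b' (INCY for an M-vector output) is larger
// than 1, which is what the transA GEMV path rejects.
long stride_of(const std::string &layout, const std::array<long, 3> &dims,
        char dim) {
    long stride = 1;
    for (int i = (int)layout.size() - 1; i >= 0; --i) {
        if (layout[i] == dim) return stride;
        stride *= dims[layout[i] - 'a'];
    }
    return -1; // dim not found in layout
}
```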

// Blocking parameters for the transposed case.
brg->ld_block = 1;
brg->ldb = brg->load_dim / brg->ld_block;
brg->ldb_tail = brg->load_dim % brg->ld_block;
Contributor

Low: Should this have an assert(brg->ldb_tail == 0) like line 790 of the non-transA path above?

Both calculations are the same; why does one have an assert and not the other?

Contributor Author

Yeah, let's add it.

@densamoilov
Contributor Author

make test

This kernel will enable matmul for the following cases:
- A is a matrix, B is a vector, and A is transposed
- A is a vector, B is a matrix, and B is not transposed
Redirect GEMV cases to GEMV code path when fpmath is not default
because it's expected to be faster than the GEMM path.
@densamoilov
Contributor Author

make test

brgemm_matmul now has broad support for GEMV cases. The only exceptions are
cases with unusual input/output layouts. However, the GEMV code path in
auto-generated GEMM is not expected to support them either. Therefore, the
decision is to always use brgemm_matmul for those exceptions, whether through
the GEMV path or the regular GEMM path, and avoid falling back to
auto-generated GEMM.
