
cpu: x64: matmul: enable treat_as_plain for weights format#5038

Open
xuxinzen wants to merge 1 commit into main from xzeng/fixup_cache_key_collision

Conversation

@xuxinzen
Contributor

Fixes MFDNN-14900

Use the plain format when N == 1 for the transposed format, which matches the behavior before the guilty commit. The TF team confirms that the patch works fine on their end.

CI tests

@xuxinzen xuxinzen requested a review from a team as a code owner April 16, 2026 18:55
@xuxinzen xuxinzen added bug A confirmed library bug platform:cpu-x64 Intel64/AMD64 processors. Codeowner: @oneapi-src/onednn-cpu-x64 labels Apr 16, 2026
&& memory_desc_matches_tag(
        B_md, transposed_tensor_layout_tag);
const bool treat_as_plain = plain_transposed_matched && bgmmc.N == 1
        && !bgmmc.is_int4_weights;
Contributor


I'd expect to see a comment explaining why this variable is necessary and what its conditions mean. It took several people several days to acknowledge the issue and narrow it down to this matmul-specific behavior.

Additionally, I wonder why the recently introduced canonical call is not used.

Thirdly, it's unclear why parallel creation leads to the crash, which shouldn't happen by primitive cache design; insights are desired.

Contributor


Also, please convert the reproducer from the tracker into a test_concurrency gtest extension as a regression test.
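A sketch of what such a concurrency regression test could look like. The oneDNN-specific call is replaced by a hypothetical create_primitive() stand-in, and run_concurrent_creation() is an illustrative helper name; the real test would build the matmul primitive from the tracker's reproducer (transposed weights with N == 1) in every thread.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical stand-in for building the matmul primitive from the
// tracker's reproducer (transposed weights with N == 1); the real test
// would call the oneDNN API here and return its status.
static int create_primitive() { return 0; }

// Spawn nthreads threads that each create the same primitive
// concurrently and report how many of them failed. Before the fix,
// concurrent creation could crash instead of failing cleanly.
static int run_concurrent_creation(int nthreads) {
    std::atomic<int> failures{0};
    std::vector<std::thread> workers;
    for (int i = 0; i < nthreads; ++i)
        workers.emplace_back([&] {
            if (create_primitive() != 0) failures.fetch_add(1);
        });
    for (auto &t : workers) t.join();
    return failures.load();
}
```

In the real gtest, the body of run_concurrent_creation() would sit inside a TEST() and the final count would be checked with EXPECT_EQ.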

Contributor


I'm also wondering whether the issue has been there for a long time or whether we introduced it recently.

Contributor Author


I'll add some comments; this patch is a temporary fixup. I need more time to investigate the cause and find a real solution.

As for why is_canonical() is not used: I tried it before, and the issue was resolved with the C++ reproducer, but the TF team reported that the segfault was still there on their end.

Contributor Author


I can reproduce this issue with rls-v3.7 when using the correct weights format for transposed weights. brgemm matmul used the plain format for the transposed format when N == 1 before the guilty commit.

Contributor

@vpirogov vpirogov Apr 16, 2026


I'm also wondering whether the issue has been there for a long time or whether we introduced it recently.

It was introduced a year ago by PR #3027. Specifically, this stride check creates undefined behavior when the corresponding dimension is trivial.
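To illustrate why a stride check is ill-defined for a trivial dimension: when N == 1, the stride of the N axis never contributes to any element's address, so plain and transposed descriptors can address byte-identical memory while their strides still compare as different. The desc2d struct below is a hypothetical stand-in, not the real oneDNN memory_desc_t.

```cpp
#include <cstdint>

// Hypothetical 2D descriptor; not the real oneDNN memory_desc_t.
struct desc2d {
    int64_t dims[2];
    int64_t strides[2];
};

// Address of element (i, j), in units of elements.
static int64_t offset(const desc2d &d, int64_t i, int64_t j) {
    return i * d.strides[0] + j * d.strides[1];
}

// True when both descriptors map every valid index to the same offset,
// i.e. they describe the same memory regardless of their stride values.
static bool same_addressing(const desc2d &a, const desc2d &b) {
    if (a.dims[0] != b.dims[0] || a.dims[1] != b.dims[1]) return false;
    for (int64_t i = 0; i < a.dims[0]; ++i)
        for (int64_t j = 0; j < a.dims[1]; ++j)
            if (offset(a, i, j) != offset(b, i, j)) return false;
    return true;
}
```

For K = 8, N = 1, a plain descriptor with strides {1, 1} and a transposed one with strides {1, 8} address identical memory, yet a check on strides[1] distinguishes them; a kernel picked by that check can then be cached under a key that also matches the other, equivalent descriptor.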

Contributor Author


If the stride check is the cause, then the reason is_canonical() did not help, as mentioned above, can be explained: I still keep the code path that reaches the stride check. I'll continue investigating. Thanks.

Contributor


I don't fully understand this fix since we don't know the root cause yet. You said you need more time to find it and come up with a real solution, so it's not clear whether this patch actually fixes the issue or just happens to make it disappear. I don't think we should promote it as a fix until we understand that.
