cpu: x64: matmul: enable treat_as_plain for weights format#5038
cpu: x64: matmul: enable treat_as_plain for weights format#5038
Conversation
| && memory_desc_matches_tag( | ||
| B_md, transposed_tensor_layout_tag); | ||
| const bool treat_as_plain = plain_transposed_matched && bgmmc.N == 1 | ||
| && !bgmmc.is_int4_weights; |
There was a problem hiding this comment.
I anticipate to see a comment explain the necessity of this variable and its conditions. It took several people for several days to acknowledge the issue and narrow down it to matmul specific setting behavior.
Additionally, I wonder why no recently introduced canonical call is not used.
Thirdly, it's unclear why parallel creation lead to the crash, which shouldn't happen by primitive cache design, insights are desired.
There was a problem hiding this comment.
And please convert the reproducer from the tracker into a test_concurrency gtest expansion as a regression test.
There was a problem hiding this comment.
I'm also wondering whether the issue has been there for long time or did we recently introduce it?
There was a problem hiding this comment.
I'll add some comments and this patch is a temporary fixup. I need more time to investigate the cause and get a real solution
As for why not using is_canonical(), I tried to use it before and the issue was resolve with c++ reproducer. But the results from TF team said the segfault issue was still there.
There was a problem hiding this comment.
I can reproduce this issue with rls-v3.7 when using correct weights format for transposed. brgemm matmul uses plain for transpose format when N == 1 before the guilty commit.
There was a problem hiding this comment.
I'm also wondering whether the issue has been there for long time or did we recently introduce it?
It was introduced a year ago by PR #3027. Specifically this stride check creates undefined behavior when corresponding dimension is trivial.
There was a problem hiding this comment.
If the stride_check is case, then the reason of not using is_canonial() that mentioned above can be explained as I still keep the code to reach the stride check. I'll continue investigating that. Thanks
There was a problem hiding this comment.
I don’t fully understand this fix since we don’t know the root cause yet. You said you needed more time to find it and come up with a real solution so it’s not clear whether this is actually fixing the issue or just happens to make it disappear. I don’t think we should promote it as a fix until we understand that
Fixes MFDNN-14900
Using plain format when N == 1 for transpose format, which works the same before guilty commit. And the TF confirms that the patch works fine on their end.
CI tests