[WIP] ggml-hexagon: Q4_0 mm opt #17907
base: master
## Conversation
```c
HVX_Vector_x4 r_dd =
    hvx_vec_load_and_mul_d_r2x2(r0_x_d + i * x_dblk_size, r1_x_d + i * x_dblk_size, y_d + i * y_dblk_size);
```
Optimized the scale multiplication step. The previous implementation only processed 32xf16 elements (half the vector width). This change enables 64xf16 multiplication to fully utilize the HVX vector capacity.
I'm getting garbled output for all models.
Also, ultimately we end up with the INT32 accumulator for each block (32 elements).
In order to multiply it with the FP16 scale we need to convert both (accumulator and scale) into FP32 (QF32). This means that we still need to do the same number of multiplies and use the same number of HVX registers either way.
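For context, here is a scalar sketch of the per-block computation being discussed (illustrative only, assuming the usual Q4_0/Q8 block layout; this is not the HVX implementation and the names are made up):

```c
#include <stdint.h>

// One 32-element block: integer dot product accumulated in int32,
// then scaled by the product of the two per-block scales in fp32.
static float block_dot_q4_q8(const int8_t w[32],   // Q4_0 weights, already sign-extended to -8..7
                             const int8_t a[32],   // quantized activations (int8)
                             float d_w, float d_a) // per-block scales (fp16 in the real format)
{
    int32_t acc = 0;
    for (int i = 0; i < 32; i++) {
        acc += (int32_t)w[i] * (int32_t)a[i];      // 4-bit x 8-bit products, int32 accumulator
    }
    return (float)acc * d_w * d_a;                 // accumulator and scales promoted to fp32
}
```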
> Also, ultimately we end up with the INT32 accumulator for each block (32 elements).
> In order to multiply it with the FP16 scale we need to convert both (accumulator and scale) into FP32 (QF32).

- Regarding the scales utilization: The original source uses 2 `Q6_Wqf32_vmpy_VhfVhf` instructions for 2 rows but ignores the upper half. This PR aims to fully utilize the results of both multiplications.
- As for the accumulator width: For `Q4_0`, an INT32 accumulator is likely excessive. Since `src0` (4-bit) * `src1` (8-bit) fits in 12 bits, accumulating 32 elements only requires 17 bits total. A 32-bit accumulator is far larger than what is strictly required (see the sketch after this list).
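For reference, the arithmetic behind the 17-bit figure (a standalone sketch, not code from this PR):

```c
#include <assert.h>

// Q4_0 weight magnitude <= 8, int8 activation magnitude <= 128.
static_assert(8 * 128 == 1 << 10, "per-element product needs 11 magnitude bits + sign = 12 bits");
static_assert(32 * (8 * 128) == 1 << 15, "sum of 32 products needs 16 magnitude bits + sign = 17 bits");
```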
> Also, ultimately we end up with the INT32 accumulator for each block (32 elements).
> In order to multiply it with the FP16 scale we need to convert both (accumulator and scale) into FP32 (QF32).
>
> - Regarding the scales utilization: The original source uses 2 `Q6_Wqf32_vmpy_VhfVhf` instructions for 2 rows but ignores the upper half. This PR aims to fully utilize the results of both multiplications.

Ah. Cool. I missed that. That part should help. Reading again.... :)

> - As for the accumulator width: For `Q4_0`, an INT32 accumulator is likely excessive. Since `src0` (4-bit) * `src1` (8-bit) fits in 12 bits, accumulating 32 elements only requires 17 bits total. A 32-bit accumulator is far larger than what is strictly required.

That'd be relevant only if we had native INT17 data type and instructions to use it efficiently.
> That'd be relevant only if we had native INT17 data type and instructions to use it efficiently.

That's true regarding INT17.
But I'm now thinking we could potentially reduce the precision of `src1` to INT7. If that works, we could keep the rest of the calculation within the f16 space. I'm not certain it will be viable yet, and it would be a significant refactor, so I'll need some time to investigate the details.
Quick update from my testing. I think most of the gain you're seeing from this PR comes from slightly wider processing of the blocks and not from the scales.
There is a slightly better (simpler) way to multiply scales from both rows.
I dug up my older code that does this:
```c
HVX_Vector     vyy_d  = Q6_Vh_vshuff_Vh(Q6_V_valign_VVR(vy_d, Q6_V_vror_VR(vy_d, 64), 64));
HVX_Vector     r01_d  = Q6_Vh_vshuff_Vh(Q6_V_valign_VVR(r1_d, Q6_V_vror_VR(r0_d, 64), 64));
HVX_VectorPair r01_dd = Q6_Wqf32_vmpy_VhfVhf(r01_d, vyy_d);
HVX_Vector     r0_dd  = Q6_Vsf_equals_Vqf32(Q6_V_lo_W(r01_dd));
HVX_Vector     r1_dd  = Q6_Vsf_equals_Vqf32(Q6_V_hi_W(r01_dd));
```
Both this and the vmux-based version you have in the PR do not, by themselves, improve things.
Looks like we either run out of registers or the extra instructions for mixing the scales eat away the gains.
Now, the overall gains from this PR are not very consistent, with some regressions:
## Before:

Gen3
```
common_perf_print: prompt eval time = 1923.41 ms / 205 tokens ( 9.38 ms per token, 106.58 tokens per second)
common_perf_print: eval time = 1429.96 ms / 63 runs (22.70 ms per token, 44.06 tokens per second)
```

Gen4
```
common_perf_print: prompt eval time = 1235.03 ms / 205 tokens ( 6.02 ms per token, 165.99 tokens per second)
common_perf_print: eval time = 1073.28 ms / 63 runs (17.04 ms per token, 58.70 tokens per second)
```

Gen5
```
common_perf_print: prompt eval time = 864.09 ms / 205 tokens ( 4.22 ms per token, 237.24 tokens per second)
common_perf_print: eval time = 1089.50 ms / 63 runs (17.29 ms per token, 57.82 tokens per second)
```

## After:

Gen3
```
common_perf_print: prompt eval time = 1773.68 ms / 205 tokens ( 8.65 ms per token, 115.58 tokens per second)
common_perf_print: eval time = 1373.27 ms / 63 runs (21.80 ms per token, 45.88 tokens per second)
```

Gen4
```
common_perf_print: prompt eval time = 1273.77 ms / 205 tokens ( 6.21 ms per token, 160.94 tokens per second)
common_perf_print: eval time = 1097.05 ms / 63 runs (17.41 ms per token, 57.43 tokens per second)
```

Gen5
```
common_perf_print: prompt eval time = 845.55 ms / 205 tokens ( 4.12 ms per token, 242.44 tokens per second)
common_perf_print: eval time = 1133.26 ms / 63 runs (17.99 ms per token, 55.59 tokens per second)
```
So, a mix of gains and regressions.
I'm thinking we'll be better off doing a new version of the multiply-reduce that avoids the reductions (i.e. shuffles and adds).
This can be done by interleaving groups of 4 ints across 8 blocks in the repack and dyn quant. With that we can also use the `Vw_vrmpyacc_VwVbVb` version, which is a fused multiply-accumulate.
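For reference, a scalar model of what one 32-bit lane of that `Vw_vrmpyacc_VwVbVb` style instruction computes (a sketch of the standard vrmpy-accumulate semantics, not project code):

```c
#include <stdint.h>

// Each word lane accumulates a 4-way dot product of signed bytes, which is why
// grouping 4 ints per block in the repack lines up with this instruction.
static int32_t vrmpyacc_lane(int32_t acc, const int8_t u[4], const int8_t v[4]) {
    for (int i = 0; i < 4; i++) {
        acc += (int32_t)u[i] * (int32_t)v[i];
    }
    return acc;
}
```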
I mentioned that I've been working on that version, so maybe give me a few more days to clean it up and then we can play with it. It should provide bigger and more consistent gains.
BTW, it might be good to do a clean rebase of this PR and squash/remove the commits we no longer need (i.e. the ROPE fixes, etc.).
@max-krasnyansky, I'd like to open a discussion here. Since the DMA engine can run in parallel with the HVX SIMD unit, I propose implementing a VTCM double-buffering strategy. This would allow us to overlap the DMA loads with the `vec_dot` computation.
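For concreteness, a rough sketch of the kind of VTCM double-buffering loop being proposed; `dma_start`, `dma_wait`, `vec_dot_chunk`, and the two buffers are hypothetical placeholders, not the actual ggml-hexagon API:

```c
// Hypothetical helpers (not the real backend API): start an async DMA transfer
// of one chunk into a VTCM buffer, wait for a previously issued transfer, and
// run the vec_dot kernel on a chunk that is already resident in VTCM.
extern void dma_start(void *vtcm_dst, int chunk);
extern void dma_wait(int chunk);
extern void vec_dot_chunk(const void *vtcm_src, int chunk);

static void matvec_double_buffered(void *vtcm_buf[2], int n_chunks) {
    if (n_chunks <= 0) return;
    dma_start(vtcm_buf[0], 0);                        // prefetch the first chunk
    for (int i = 0; i < n_chunks; i++) {
        if (i + 1 < n_chunks)
            dma_start(vtcm_buf[(i + 1) & 1], i + 1);  // issue the next transfer early
        dma_wait(i);                                  // usually a no-op once the pipeline is primed
        vec_dot_chunk(vtcm_buf[i & 1], i);            // compute overlaps the in-flight DMA
    }
}
```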
Actually the DMA is fully asynchronous and it already overlaps with vec_dot.
You get the idea. It's fully pipelined. Typically all the waits are no-ops except for the first one. The prompt, on the other hand, is compute-bound, and I'm working on redoing the matvec to optimize out the number of reductions that are needed (i.e. those rmpy_x8 functions can be improved, but that needs data layout/repack changes).
Thanks. I was referring to swapping the order so we issue the DMA request (step 4) before …
Yep, I understood the suggestion, and I would recommend re-reading my description again :)
BTW, I experimented with 32- and 64-row scratchpads and it doesn't really help with the current HVX implementation.
Yeah, you're right.
## Changes

- Added `hvx_vec_load_and_mul_d_rx2` and `hvx_vec_load_and_mul_d_r2x2` helper functions to streamline vector loading and multiplication.
- Reworked `vec_dot_q4x4x2_q8x4x2_rx2` and `vec_dot_q8x4x2_q8x4x2_rx2` to improve instruction pipelining and reduce overhead in the main loops.

## Performance
The following performance comparison shows significant improvements for `MUL_MAT(type_a=q4_0, type_b=f32)` across various batch sizes (n), with ~30% speedup observed for n >= 2.

Device: 8Gen3
Baseline: `4d3726278`
Current: `00d5fb31b`

Benchmarked `q4_0` at n=2, n=3, n=4, n=5, n=8.