powerpc32: optimise csum_partial() loop
On the 8xx, load latency is 2 cycles and taking branches also takes
2 cycles. So let's unroll the loop.

This patch improves csum_partial() speed by around 10% on both:
* 8xx (single issue processor with parallel execution)
* 83xx (superscalar 6xx processor with dual instruction fetch
and parallel execution)

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
Christophe Leroy authored and Scott Wood committed Mar 5, 2016
1 parent 48821a3 commit f867d55
Showing 1 changed file with 15 additions and 1 deletion.
16 changes: 15 additions & 1 deletion arch/powerpc/lib/checksum_32.S
@@ -38,10 +38,24 @@ _GLOBAL(csum_partial)
 	srwi.	r6,r4,2		/* # words to do */
 	adde	r5,r5,r0
 	beq	3f
-1:	mtctr	r6
+1:	andi.	r6,r6,3		/* Prepare to handle words 4 by 4 */
+	beq	21f
+	mtctr	r6
 2:	lwzu	r0,4(r3)
 	adde	r5,r5,r0
 	bdnz	2b
+21:	srwi.	r6,r4,4		/* # blocks of 4 words to do */
+	beq	3f
+	mtctr	r6
+22:	lwz	r0,4(r3)
+	lwz	r6,8(r3)
+	lwz	r7,12(r3)
+	lwzu	r8,16(r3)
+	adde	r5,r5,r0
+	adde	r5,r5,r6
+	adde	r5,r5,r7
+	adde	r5,r5,r8
+	bdnz	22b
 3:	andi.	r0,r4,2
 	beq+	4f
 	lhz	r0,4(r3)
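For readers less familiar with PowerPC assembly, the shape of the change can be modelled in C. This is only a rough sketch, not the kernel's code: the function names `sum_words_simple`, `sum_words_unrolled` and `fold64` are hypothetical, and the 64-bit accumulator with a final fold is an assumption standing in for the carry chain that `adde` maintains in the assembly. It shows the same split as the patch: the leftover `nwords & 3` words are summed one at a time first, then the remaining words are summed in blocks of four, with the four loads grouped ahead of the four additions.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 64-bit accumulator to 32 bits with end-around carry.
 * Assumption: this stands in for the carry flag used by adde. */
static uint32_t fold64(uint64_t acc)
{
    acc = (acc & 0xffffffffu) + (acc >> 32);
    acc = (acc & 0xffffffffu) + (acc >> 32);
    return (uint32_t)acc;
}

/* Hypothetical model of the old loop: one word per iteration. */
static uint32_t sum_words_simple(const uint32_t *p, size_t nwords, uint32_t sum)
{
    uint64_t acc = sum;

    for (size_t i = 0; i < nwords; i++)
        acc += p[i];
    return fold64(acc);
}

/* Hypothetical model of the patched loop. */
static uint32_t sum_words_unrolled(const uint32_t *p, size_t nwords, uint32_t sum)
{
    uint64_t acc = sum;
    size_t head = nwords & 3;    /* leftover words, handled first */
    size_t blocks = nwords >> 2; /* blocks of 4 words */

    while (head--)
        acc += *p++;

    while (blocks--) {
        /* Issue the four loads before the additions, so that no
         * addition waits directly on the load just before it. */
        uint32_t a = p[0], b = p[1], c = p[2], d = p[3];

        acc += a;
        acc += b;
        acc += c;
        acc += d;
        p += 4;
    }
    return fold64(acc);
}
```

Both functions should return the same value for any word count; the unrolled form simply takes one loop branch per four words instead of one per word, which is where the commit message locates the gain.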