caching - VLD3.u16 takes a lot longer than VLD3.u8
I am benchmarking vld3.u8 and vld3.u16 and noticed an interesting result. The input is an array of 1200*1048*3 bytes for vld3.u8 and 1200*1048*3*2 bytes for vld3.u16, and each run is iterated 1000 times.
vld3.u8 : 2.70 sec
vld3.u16 : 16.68 sec
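Sample code:

    pld     [r1, #64]
    pld     [r1, #64*2]
    pld     [r1, #64*3]
    .loop:
        pld     [r1, #64*3]              @ prefetch 192 bytes ahead of the read pointer
        vld3.u8 {d0, d1, d2}, [r1]!      @ load 24 bytes, de-interleaved into d0-d2
        vst3.u8 {d0, d1, d2}, [r0]!      @ store them back, re-interleaved
        subs    r2, r2, #1               @ decrement the loop counter
        bne     .loop
        pop     {r4-r5, pc}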
Looking at the cycle counter (http://pulsar.webshaker.net/ccc/result.php?lng=us), each operation (vld3.u8 and vld3.u16) takes 4 cycles.
My expectation was that vld3.u16 would take twice as long to compute, since the vld3.u16 input is double the size of the vld3.u8 input.
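To put numbers on that expectation, here is a rough sketch in Python that uses only the sizes and timings quoted above. It is not part of the benchmark; the helper name `throughput_mb_per_s` is made up for illustration, and it counts loaded bytes only (the stores move the same amount again).

```python
# Figures quoted in the post.
U8_BYTES = 1200 * 1048 * 3      # input size for the vld3.u8 run
U16_BYTES = U8_BYTES * 2        # input size for the vld3.u16 run
ITERATIONS = 1000
U8_SECONDS = 2.70
U16_SECONDS = 16.68


def throughput_mb_per_s(bytes_per_pass, iterations, seconds):
    """Bytes loaded per second, in MB/s (1 MB = 10**6 bytes)."""
    return bytes_per_pass * iterations / seconds / 1e6


u8_rate = throughput_mb_per_s(U8_BYTES, ITERATIONS, U8_SECONDS)
u16_rate = throughput_mb_per_s(U16_BYTES, ITERATIONS, U16_SECONDS)

# If run time scaled only with input size, u16 would take 2x the u8 time.
expected_u16_seconds = 2 * U8_SECONDS
slowdown = U16_SECONDS / U8_SECONDS

print(f"u8 : {U8_BYTES} bytes/pass, {u8_rate:.1f} MB/s")
print(f"u16: {U16_BYTES} bytes/pass, {u16_rate:.1f} MB/s")
print(f"expected u16 time: {expected_u16_seconds:.2f} s, measured: {U16_SECONDS:.2f} s")
print(f"measured slowdown: {slowdown:.2f}x (expected about 2x)")
```

So the measured slowdown is roughly 6.2x rather than 2x, and the effective load rate falls from about 1400 MB/s to about 450 MB/s.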
Is a cache miss issue happening with vld3.u16, causing it to take a long time to load the values from ARM registers into NEON registers?
Can anyone please help me? :)