caching - VLD3.u16 takes a lot longer than VLD3.u8


I was benchmarking vld3.u8 and vld3.u16 and noticed an interesting result. I used an input array of 1200*1048*3 for vld3.u8 and 1200*1048*3*2 for vld3.u16, and iterated 1000 times.

vld3.u8 : 2.70 sec

vld3.u16 : 16.68 sec

Sample code:

        pld          [r1, #64]
        pld          [r1, #64*2]
        pld          [r1, #64*3]
    .loop:
        pld          [r1, #64*3]
        vld3.u8      {d0, d1, d2}, [r1]!
        vst3.u8      {d0, d1, d2}, [r0]!
        subs         r2, r2, #1
        bne          .loop

        pop          { r4-r5, pc }
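To put the loop in perspective, here is a rough Python sketch of how much data one pass moves. It is not part of the benchmark. It assumes the array sizes are byte counts, that each vld3 with three D registers transfers 24 bytes whatever the element size, and that cache lines are 64 bytes, as the pld offsets suggest. The helper name `pass_stats` is made up for illustration.

```python
# Rough bookkeeping for one pass over each input.
# Assumption: the sizes quoted in the question are byte counts.
BYTES_PER_VLD3 = 3 * 8   # three 64-bit D registers per vld3 {d0, d1, d2}
CACHE_LINE = 64          # assumed from the pld offsets (#64, #64*2, #64*3)


def pass_stats(total_bytes):
    """Hypothetical helper: loop iterations and cache lines for one pass."""
    iterations = total_bytes // BYTES_PER_VLD3
    cache_lines = total_bytes // CACHE_LINE
    return iterations, cache_lines


u8_bytes = 1200 * 1048 * 3
u16_bytes = 1200 * 1048 * 3 * 2

print("u8 :", pass_stats(u8_bytes))
print("u16:", pass_stats(u16_bytes))
```

Under these assumptions the u16 pass runs twice as many loop iterations and touches twice as many cache lines as the u8 pass, and the per-iteration work is the same.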

Looking at the cycle counter (http://pulsar.webshaker.net/ccc/result.php?lng=us), each operation (vld3.u8 and vld3.u16) takes 4 cycles.

My expectation was that vld3.u16 would take twice as long, since the input for vld3.u16 is double the size of the input for vld3.u8.
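As a quick sanity check on that expectation, the measured times can be compared against a simple linear estimate. This is only a sketch using the two timings quoted above, and it assumes run time should scale in proportion to input size. The variable names are illustrative.

```python
# Measured times from the benchmark, in seconds for 1000 iterations.
t_u8 = 2.70
t_u16 = 16.68

# Assumption: time scales linearly with input size, so double the data
# should cost roughly double the time.
expected_u16 = 2 * t_u8

slowdown = t_u16 / t_u8          # observed ratio between the two runs
excess = t_u16 / expected_u16    # how far u16 is above the linear estimate

print(f"expected about {expected_u16:.2f} s, observed {t_u16:.2f} s")
print(f"observed slowdown: {slowdown:.2f}x")
print(f"ratio to linear estimate: {excess:.2f}x")
```

The u16 run is therefore about six times slower than the u8 run, roughly three times worse than doubling alone would explain.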

Is a cache miss issue happening with vld3.u16, causing it to take a long time to load the values from ARM registers to NEON registers?

Can someone please help me? :)

