If It takes half the time it probably does half the work
Hello,
I recently revisited my sqlite-implementation, and wanted to know how the LLVM-IR looked.
So I read through it, and discovered to my horror that it seemed to truncate my values after doing a ByteSwap on it.
Just when I was about to investigate a bug in the llvm optimizer it occurred to me to check my implementation.
The cannonical implementation to swap an 8Byte value looks like this.
return (((val & 0x00000000000000ffUL) << 56UL) | ((val & 0x000000000000ff00UL) << 40UL) | ((val & 0x0000000000ff0000UL) << 24UL) | ((val & 0x00000000ff000000UL) << 8UL) | ((val & 0x000000ff00000000UL) >> 8UL) | ((val & 0x0000ff0000000000UL) >> 24UL) | ((val & 0x00ff000000000000UL) >> 40UL) | ((val & 0xff00000000000000UL) >> 56UL) );I discovered by reading the IR, that the old version ldc, I happend to be using, did not use intrinsics to swap the values.
I decided to implement it in a form that the optimizer would recognize.
return (swap4Byte(cast(uint)(val & 0xffffffff00000000UL >> 32))| swap4Byte(cast(uint)(val & 0x00000000ffffffffUL)) );
This implementation passed my tests.
And it was twice as fast the the old one.
Now where is the bug ?
The statement above does not shift the value to the left, instead it shifts the bitmask constant.
Which leads to the truncation of the last 32 bit.
It is oblivious to me now.
But this bug lurked in my codebase for over 3 months.
It could have been found if I had tested swapping 64bit values
.
There are two take-aways:
First, get full test-coverage! Especially in hand-optimized code.
Second, if your optimized code much much faster then the baseline you are probably not doing the work you should be doing.
Keine Kommentare:
Kommentar veröffentlichen