Friday, May 20, 2016

A look into a CTFE transcompiler (part 1)

Hello,

Today I am going to write about a toy compiler of mine.
It compiles an unremarkable language called PL/0 into D code.
This is a relatively simple process, since all of PL/0's constructs have a D equivalent.
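
To make that concrete, here is a rough sketch of how a small PL/0 fragment might map onto D; both the PL/0 snippet and the emitted D are invented for illustration, not actual output of the compiler.

// PL/0 input (hypothetical example):
//
//   VAR x;
//   BEGIN
//     x := 10;
//     WHILE x > 0 DO x := x - 1
//   END.
//
// D the transcompiler might emit; each PL/0 construct has a direct D counterpart.
void main()
{
    long x;
    x = 10;
    while (x > 0)
        x = x - 1;
}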

However, despite its simplicity, even PL/0 needs to be properly parsed and semantically checked.
Doing that in a CTFE-compatible way may be non-trivial for a novice.
Fortunately, I discovered a few tricks which make these kinds of things easier.

The most important thing to understand is the notion of functional purity, because that is basically the only restriction you have to obey in CTFE.
This notion goes as follows:

"A function is pure if : it does not accesses mutable state aside from the parameters passed to it."
In D a pure function is allowed to accesses to mutable state that is _considered local_ to it.
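
A minimal sketch of that relaxation (the function and parameter names are made up): a pure function may mutate anything it can reach through its parameters.

void fill(ref int[4] buf, int value) pure
{
    foreach (ref b; buf)
        b = value; // mutating the caller-supplied array is still pure ("weak" purity)
}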

This relaxation makes it easy to write a pure function,
since it turns out that a member function considers all non-static data members of the enclosing aggregate local.

I feel this is worth pointing out:

As long as the state is wrapped in a struct, class, or union, you may modify it at your leisure,
and you will still be considered pure.
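
Here is a minimal sketch of that (names invented for illustration): a struct whose member function mutates its own fields, yet can still be evaluated at compile time.

struct Counter
{
    int count;

    void increment() pure
    {
        ++count; // mutating the enclosing aggregate's member is allowed
    }
}

int countTo(int n) pure
{
    Counter c;
    foreach (i; 0 .. n)
        c.increment();
    return c.count;
}

// static assert forces the call to be evaluated at compile time (CTFE).
static assert(countTo(3) == 3);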

I will give more concrete examples in my next post.

Tuesday, May 17, 2016

Another blunder.


If it takes half the time, it probably does half the work.


Hello,
I recently revisited my SQLite implementation and wanted to know what the LLVM IR looked like.
So I read through it, and discovered to my horror that it seemed to truncate my values after doing a ByteSwap on them.

Just when I was about to investigate a bug in the LLVM optimizer, it occurred to me to check my implementation.

The canonical implementation to swap an 8-byte value looks like this:

return (((val & 0x00000000000000ffUL) << 56UL) |
        ((val & 0x000000000000ff00UL) << 40UL) |
        ((val & 0x0000000000ff0000UL) << 24UL) |
        ((val & 0x00000000ff000000UL) <<  8UL) |
        ((val & 0x000000ff00000000UL) >>  8UL) |
        ((val & 0x0000ff0000000000UL) >> 24UL) |
        ((val & 0x00ff000000000000UL) >> 40UL) |
        ((val & 0xff00000000000000UL) >> 56UL));
By reading the IR, I discovered that the old version of LDC I happened to be using did not use intrinsics to swap the values.

I decided to implement it in a form that the optimizer would recognize.

return (swap4Byte(cast(uint)(val & 0xffffffff00000000UL >> 32)) |
        swap4Byte(cast(uint)(val & 0x00000000ffffffffUL)));

This implementation passed my tests,
and it was twice as fast as the old one.

Now, where is the bug?

The statement above does not shift the masked value; instead it shifts the bitmask constant, because >> binds more tightly than &.
The mask therefore becomes 0x00000000ffffffffUL, and the upper 32 bits of the value are truncated away.
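
A small sketch of the pitfall (the constant value is just an illustration), checkable at compile time:

enum ulong val = 0x1122334455667788UL;

// What was written: >> binds tighter than &, so the constant is shifted
// and the expression keeps the low half of val.
static assert((val & 0xffffffff00000000UL >> 32) == 0x55667788UL);

// What was intended: mask the high half first, then shift the value down.
static assert(((val & 0xffffffff00000000UL) >> 32) == 0x11223344UL);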

It is obvious to me now,
but this bug lurked in my codebase for over 3 months.

It could have been found if I had tested swapping 64-bit values.
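
A minimal sketch of the kind of test that would have caught it; swap8Byte is a hypothetical stand-in for the hand-written routine discussed above.

// Hypothetical stand-in for the routine under test.
ulong swap8Byte(ulong val)
{
    import core.bitop : bswap;
    return bswap(cast(uint)(val >> 32)) | (cast(ulong)bswap(cast(uint)val) << 32);
}

unittest
{
    enum ulong pattern = 0x0123456789abcdefUL;

    // A known pattern must come back in mirrored byte order...
    assert(swap8Byte(pattern) == 0xefcdab8967452301UL);

    // ...and swapping twice must give back the original.
    assert(swap8Byte(swap8Byte(pattern)) == pattern);
}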
There are two takeaways:

First, get full test coverage! Especially in hand-optimized code.
Second, if your optimized code is much, much faster than the baseline, you are probably not doing the work you should be doing.