Sunday, 17 July 2016

CTFE - My first design sketches

Hi,

I just found my very first design sketches for the CTFE IR

I will post them here verbatim,

for no particular reason other than amusement :)



---

//int bug6498(int x) { int n = 0; while (n < x) ++n; return n; }
// static assert(bug6498(100_000_000) == 100_000_000);

i32 bug6498 (i32 $a1) {
//LOCALS
i32 $l1;
START :
str 0, $l1;

lbl while_cond :
cmp $l1 $a1;
jlt after_while; // conditional jump: a < b if less than in previous cmp;
inc $l1;
jmp while_cond;
lbl after_while :
ret $l1;
}

>> resolving labels

1 str 0, $l1;
2 cmp $l1 $a1;
4 clt 2; // conditional block less-than: the next 2 instructions are executed if cmp was true, otherwise execution continues after them;
3 inc $l1;
5 jmp 2; // jumps use absolute addresses;
lbl after_while :
6 ret { $l1 } ;


struct Adder {
    uint[4] a;
    uint[4] opBinary(string op : "+")(uint c) {
        foreach (i; 0 .. a.length) {
            a[i] += c;
        }
        return a;
    }
}
uint[4] fn(uint s) {
    Adder adr = Adder([2, 3, 4, 5]);
    return adr + s;
}

///
typedef Adder = [i32, i32, i32, i32] ;
//stack is a hardcoded property of the function;
all arguments are written into the reserved bytes before the first instruction;
the caller passes its call location to the callee;
the callee saves the results under the call;
the callee jumps behind its results
[i32, i32, i32, i32] fn1 (i32 $a1) {
// push constant and push variable are different
sts fn2;// set stack
psc 2;
psc 3;
psc 4;
psc 5;
psv $a1;
jmp fn2;
result[1]
result[2]
result[3]
result[4]
ret ;
}
// again caller tells the callee where to write the result;
// one register is needed to hold the current return-address
// and one ip of course;


[i32, i32, i32, i32] fn2 ([i32, i32, i32, i32]$a[1-4], i32 $a5) {
add $a5, $a1;
add $a5, $a2;
add $a5, $a3;
add $a5, $a4;
ret ; //maybe ret is going to be optional
}

// values before the ret are the return value...

--

Thursday, 7 July 2016

Designing the new CTFE engine - A preview

Hello,

I have been working on a rewrite of the CTFE interpreter in D.

As D is a complicated language with many constructs and corner cases that have to be handled correctly, I did not want to rip out the old interpreter entirely.

Rather, I am hooking the new engine in at the point where the lowered function is evaluated, and checking whether my engine can evaluate it.
If it can, it will do so and return the result.
If it cannot, it will give back a null value, which signals the old interpreter to take over instead.

This way I can incrementally improve CTFE and still provide a fully functioning compiler.
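The fallback itself is just a null check. Here is a minimal sketch of the idea, with hypothetical names (Expression, tryNewEngine, oldInterpret) standing in for the actual dmd entry points:

class Expression {}

// hypothetical stand-in for the new byte-code engine;
// returning null means "cannot handle this function yet"
Expression tryNewEngine() { return null; }

// hypothetical stand-in for the existing interpreter, which handles everything
Expression oldInterpret() { return new Expression; }

Expression evaluateForCTFE()
{
    Expression result = tryNewEngine(); // let the new engine try first
    if (result !is null)
        return result;                  // it succeeded, use its result
    return oldInterpret();              // null signals: let the old interpreter take over
}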

The AST is already processed into a canonical form when it arrives at the point where I hooked the evaluator in.

This means it is in a more backend-friendly form.
And much of the syntactic sugar is already removed.

I am translating this canonicalized AST into linear byte-code and handing it off to a byte-code interpreter.

The byte-code itself is almost completely decoupled from the part that processes the AST, thus making it possible to provide different interpretation backends, and to change the byte-code format at any point without having to rewrite the AST processing.

Thus I keep the byte-code API fairly clean and minimal.

Currently there are 15 calls the generator/evaluator backend has to support:

Lt3
Gt3
Eq3


Add3
Sub3
Mul3
Div3

And3
Or3
Lsh3
Rsh3

JmpZ
JmpNZ
JmpTrue
JmpFalse


These are enough to represent a large body of code.
And easy implementations should be possible for most of the targets out there.
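To give an impression of the shape of this API, here is a minimal sketch of what such a backend interface could look like in D; the type and method names (BCValue, BCLabel, genLt3, ...) are hypothetical and simply mirror the list above:

struct BCValue { uint slot; }  // hypothetical: a temporary / stack slot
struct BCLabel { uint addr; }  // hypothetical: a byte-code address

interface BCBackend
{
    // three-address comparisons: result, lhs, rhs
    void genLt3(BCValue result, BCValue lhs, BCValue rhs);
    void genGt3(BCValue result, BCValue lhs, BCValue rhs);
    void genEq3(BCValue result, BCValue lhs, BCValue rhs);

    // three-address arithmetic
    void genAdd3(BCValue result, BCValue lhs, BCValue rhs);
    void genSub3(BCValue result, BCValue lhs, BCValue rhs);
    void genMul3(BCValue result, BCValue lhs, BCValue rhs);
    void genDiv3(BCValue result, BCValue lhs, BCValue rhs);

    // three-address bitwise and shift operations
    void genAnd3(BCValue result, BCValue lhs, BCValue rhs);
    void genOr3(BCValue result, BCValue lhs, BCValue rhs);
    void genLsh3(BCValue result, BCValue lhs, BCValue rhs);
    void genRsh3(BCValue result, BCValue lhs, BCValue rhs);

    // jumps on a value being zero / non-zero, or on the last comparison result
    void genJmpZ(BCValue cond, BCLabel target);
    void genJmpNZ(BCValue cond, BCLabel target);
    void genJmpTrue(BCLabel target);
    void genJmpFalse(BCLabel target);
}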

However, I suspect that I will need to add at least

Call
and
Return

In order to support function calls.

I will also need to find and document all assumptions I make about the capabilities of a backend.

For example, it has to support 32-bit integer operations. It also has to be able to work with 32-bit aligned data.


Friday, 20 May 2016

A look into a CTFE transcompiler (part 1)

Hello,

Today I am going to write about a toy compiler of mine.
It compiles an unremarkable language called PL/0 into D code.
This is a relatively simple process since all of PL/0's constructs have a D equivalent.

However, despite its simplicity, even PL/0 needs to be properly parsed and semantically checked.
Doing that in a CTFE-compatible way may be non-trivial for a novice.
Fortunately, I discovered a few tricks which make these kinds of things easier.

The most important thing to understand is the notion of functional purity.
Because that is basically the only restriction you have to obey during CTFE.
The notion goes as follows:

"A function is pure if it does not access mutable state aside from the parameters passed to it."
In D a pure function is allowed to access mutable state that is _considered local_ to it.

This relaxation makes it easy to write a pure function,
since it turns out that a member function considers all non-static data members of the enclosing aggregate local.

I feel this is worth pointing out:

As long as the state is wrapped in a struct, class or union, you may modify it at your leisure,
and you will still be considered pure.
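To make this concrete, here is a minimal toy example of my own (nothing from the compiler sources); the member function mutates a data member and is still accepted as pure:

struct Counter
{
    int count;

    // mutates a non-static data member, yet is accepted as pure,
    // because the enclosing struct's data is considered local to it
    void bump(int by) pure
    {
        count += by;
    }
}

int countTo(int n) pure
{
    Counter c;               // state wrapped in a struct, local to this function
    foreach (_; 0 .. n)
        c.bump(1);
    return c.count;
}

static assert(countTo(5) == 5); // evaluates at compile time via CTFE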

I will give more concrete examples in my next post.

Tuesday, 17 May 2016

Another blunder.


If it takes half the time, it probably does half the work.


Hello, 
I recently revisited my sqlite implementation and wanted to know how the LLVM IR looked.
So I read through it, and discovered to my horror that it seemed to truncate my values after doing a byte swap on them.

Just when I was about to investigate a bug in the LLVM optimizer, it occurred to me to check my own implementation.

The canonical implementation to swap an 8-byte value looks like this:

return (((val & 0x00000000000000ffUL) << 56UL) | 
 ((val  & 0x000000000000ff00UL) << 40UL) | 
 ((val  & 0x0000000000ff0000UL) << 24UL) | 
 ((val  & 0x00000000ff000000UL) <<  8UL) | 
 ((val  & 0x000000ff00000000UL) >>  8UL) | 
 ((val  & 0x0000ff0000000000UL) >> 24UL) | 
 ((val  & 0x00ff000000000000UL) >> 40UL) | 
 ((val  & 0xff00000000000000UL) >> 56UL)
);
I discovered by reading the IR that the old version of LDC I happened to be using did not use intrinsics to swap the values.

I decided to implement it in a form that the optimizer would recognize.

return (swap4Byte(cast(uint)(val & 0xffffffff00000000UL >> 32))|
 swap4Byte(cast(uint)(val & 0x00000000ffffffffUL))
);

This implementation passed my tests.
And it was twice as fast as the old one.

Now where is the bug?

The statement above does not shift the masked value to the right; due to operator precedence it shifts the bitmask constant instead.
This leads to the truncation of the upper 32 bits.
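For comparison, here is a corrected sketch of what I meant to write; the name swap8Byte is just for this sketch, and a plausible swap4Byte is included only so the snippet stands on its own:

uint swap4Byte(uint v)
{
    return ((v & 0x000000ffu) << 24)
         | ((v & 0x0000ff00u) <<  8)
         | ((v & 0x00ff0000u) >>  8)
         | ((v & 0xff000000u) >> 24);
}

ulong swap8Byte(ulong val)
{
    // the parentheses make sure the masked value is shifted, not the bitmask constant
    return (cast(ulong) swap4Byte(cast(uint)(val & 0x00000000ffffffffUL)) << 32)
         |  cast(ulong) swap4Byte(cast(uint)((val & 0xffffffff00000000UL) >> 32));
}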

It is obvious to me now.
But this bug lurked in my codebase for over 3 months. 

It could have been found if I had tested swapping 64-bit values.
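A single test of that kind against the swap8Byte sketch above would have done it:

unittest
{
    // a known 64-bit pattern and its byte-swapped counterpart
    assert(swap8Byte(0x0102030405060708UL) == 0x0807060504030201UL);
}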
There are two take-aways:

First, get full test-coverage! Especially in hand-optimized code.
Second, if your optimized code is much, much faster than the baseline, you are probably not doing all the work you should be doing.