ArrayFire JIT Code Generation {#jit}
The ArrayFire library offers JIT (Just-In-Time) compilation for element-wise
operations. This includes arithmetic operations, trigonometric functions, and
comparisons.
At runtime, ArrayFire aggregates these function calls using an Abstract Syntax
Tree (AST) data structure such that whenever a JIT-supported function is
called, it is added to the AST of the given variable instance. The AST of the
variable is computed if one of the following conditions is met:

* an explicit evaluation is requested by the programmer using the
[eval](\ref af::eval) function, or
* the variable is required to compute a different variable via an operation
that is not JIT-supported.
When either of the above occurs and the variable needs to be evaluated, the
functions and variables in the AST data structure are fused into a single
customized kernel that is generated on the fly and then executed.
This JIT compilation technique has multiple benefits:
* A reduced number of kernel calls – a kernel call can be a significant
overhead for small data sets.
* Better cache performance – there are many instances in which the memory
required by a single element in the array can be reused multiple times, or
the temporary value of a computation can be stored in the cache and reused
by future computations.
* Reduced temporary memory allocation and write-back – when multiple
expressions are evaluated and stored into temporary arrays, these arrays
need to be allocated and the results written back to main memory.
* Avoided computation of unused elements – there are cases in which the
AST is created for a variable, but the expression is never used later in
the computation, so its evaluation can be skipped entirely.
* Better overall performance – all of the above help reduce the total
execution time.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
// As JIT is automatically enabled in ArrayFire, pi_no_jit forces each
// intermediate expression to be evaluated. If the eval() calls are removed,
// its execution becomes equivalent to pi_jit.
static double pi_no_jit(array x, array y, array temp, int samples) {
    temp = x * x;             temp.eval();
    temp = temp + y * y;      temp.eval();
    temp = sqrt(temp);        temp.eval();
    temp = temp < 1;          temp.eval();
    return 4.0 * sum<float>(temp) / samples;
}

static double pi_jit(array x, array y, array temp, int samples) {
    temp = sqrt(x * x + y * y) < 1;
    return 4.0 * sum<float>(temp) / samples;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The above code computes the value of π using a Monte-Carlo simulation in which
points are randomly generated within the unit square. Each point is tested to
see whether it lies within the unit circle. The ratio of points within the
circle to the total number of points approximates π/4, hence the factor of 4
in the code. The accuracy of the estimate improves as the number of samples
increases.
There are two implementations above:
1. an implementation that does not benefit from the JIT (pi\_no\_jit), and
2. an implementation that takes advantage of the JIT feature (pi\_jit).
Because JIT is an integral feature of the ArrayFire library, it cannot simply
be turned on and off. The only way for a programmer to sidestep the JIT is to
manually force the evaluation of each expression, as is done in pi\_no\_jit.
Timing these two implementations yields the following performance comparison:

<img src="jit_cuda1.webp" alt="Performance of JIT and Non-JIT implementations">
The above figure depicts the execution time (ordinate) as a function of the
number of samples (abscissa) for the two implementations discussed above.
When the number of samples is small, the execution time of pi\_no\_jit is
dominated by the launch of multiple kernels, while the execution time of
pi\_jit is dominated by on-the-fly compilation of the JIT code required to
launch a single kernel. Even with this JIT compilation overhead, pi\_jit
outperforms pi\_no\_jit by 1.4-2.0X for smaller sample sizes.
When the number of samples is large, neither the kernel launch overhead nor
the JIT code creation is the limiting factor – the kernel's computational
load dominates the execution time. Here, pi\_jit outperforms pi\_no\_jit by
2.0-2.7X.
A significant number of applications benefit from JIT code generation,
although the actual performance gains are application-dependent.