1 GFOR: Parallel For-Loops {#page_gfor}
2 ========================
6 Run many independent loops simultaneously on the GPU or device.
8 Introduction {#gfor_intro}
11 The gfor-loop construct may be used to simultaneously launch all of the
12 iterations of a for-loop on the GPU or device, as long as the iterations are
13 independent. While the standard for-loop performs each iteration sequentially,
14 ArrayFire's gfor-loop performs each iteration at the same time (in
15 parallel). ArrayFire does this by tiling out the values of all loop iterations
16 and then performing computation on those tiles in one pass.
18 You can think of `gfor` as performing auto-vectorization of your code,
19 e.g. you write a gfor-loop that increments every element of a vector but
20 behind the scenes ArrayFire rewrites it to operate on the entire vector in
23 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
24 for (int i = 0; i < n; ++i)
29 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
31 Behind the scenes, ArrayFire rewrites your code into this equivalent and
34 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
36 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
38 It is best to vectorize computation as much as possible to avoid the overhead
39 in both for-loops and gfor-loops.
41 To see another example, you could run an FFT on every 2D slice of a volume in
42 a for-loop, or you could "vectorize" and simply do it all in one gfor-loop
45 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
46 for (int i = 0; i < N; ++i)
47 A(span,span,i) = fft2(A(span,span,i)); // runs each FFT in sequence
50 A(span,span,i) = fft2(A(span,span,i)); // runs N FFTs in parallel
51 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
53 There are three formats for instantiating gfor-loops.
54 -# gfor(var,n) Creates a sequence _{0, 1, ..., n-1}_
55 -# gfor(var,first,last) Creates a sequence _{first, first+1, ..., last}_
56 -# gfor(var,first,last,incr) Creates a sequence _{first, first+inc, first+2*inc, ..., last}_
58 So all of the following represent the equivalent sequence: _0,1,2,3,4_
60 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
64 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
68 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
69 array A = constant(1, n, n);
70 array B = constant(1, 1, n);
71 gfor (seq k, 0, n-1) {
72 B(span, k) = sum(A(span, k) * A(span,k)); // inner product
74 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
76 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
78 array B = constant(0,n,m);
79 gfor (seq k, 0, m-1) {
80 B(span,k) = fft(A(span,k));
82 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
87 User Functions called within GFOR {#gfor_user_functions}
88 ---------------------------------
90 If you have defined a function that you want to call within a GFOR loop, then
91 that function has to meet all the conditions described in this page in order
92 to be able to work as expected.
94 Consider the (trivial) example below. The function compute() has to satisfy
95 all requirements for GFOR Usage, so you cannot use if-else conditions inside
98 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
99 array compute(array A, array B, float ep)
102 if (ep > 0) H = (A * B) / ep; // BAD
108 array A = randu(m, n);
109 array B = randu(m, n);
111 array H = constant(0,m,n);
113 H(span,ii) = compute(A(span,ii), B(span,ii), ep);
114 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
116 The Iterator {#gfor_iterator}
119 The iterator can be involved in expressions.
121 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
122 A = constant(1,n,n,m);
125 A(span,span,k) = (k+1)*B + sin(k+1); // expressions
126 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
128 Iterator definitions can include arithmetic in expressions.
130 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
131 A = constant(1,n,n,m);
133 gfor (seq k, m/4, m-m/4)
134 A(span,span,k) = k*B + sin(k+1); // expressions
135 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
137 Subscripting {#gfor_subscripting}
140 More complicated subscripting is supported.
142 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
143 A = constant(1,n,n,m);
144 B = constant(1,n,10);
146 A(span,seq(10),k) = k*B; // subscripting, seq(10) generates index [0,9]
147 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
149 Iterators can be combined with arithmetic in subscripts.
151 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
152 array A = randu(n,m);
153 array B = constant(1,n,m);
155 B(span,k) = A(span,k-1);
160 B(span,k) = A(span,2*(k+1)-1);
165 B(span,k) = A(span,floor(k+.2));
166 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
168 In-Loop Reuse {#gfor_in_loop}
171 Within the loop, you can use a result you just computed.
173 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
175 A(span,k) = 4 * B(span,k);
176 C(span,k) = 4 * A(span,k); // use it again
178 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
180 Although it is more efficient to store the value in a temporary variable:
182 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
188 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
190 In-Place Computation {#gfor_in_place_computation}
193 In some cases, GFOR behaves differently than the typical sequential
194 FOR-loop. For example, you can read and modify a result in place as long as
195 the accesses are independent.
197 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
200 A(span,k) = sin(k) + A(span,k);
201 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
203 Subscripting behaviors `arrays` also work with GFOR.
205 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
206 A = constant(1,n,n,m,k);
207 m = m * k; // precompute since cannot have expressions in iterator
209 A(span,span,k) = 4 * A(span,span,k); // collapse last dimension
210 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
212 Random Data Generation {#gfor_random}
213 ----------------------
215 Random data should always be generated outside the GFOR loop. This is due to
216 the fact that GFOR only passes over the body of the loop once. Therefore,
217 any calls to randu() inside the body of the loop will result in the same
218 random matrix being assigned to every iteration of the loop.
220 For example, in the following trivial code, all columns of `B` are identical
221 because `A` is only evaluated once:
223 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
225 array A = randu(3,1);
229 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
236 This can be rectified by bringing the random number generation outside the
239 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
240 array A = randu(3,n);
242 B(span,ii) = A(span,ii);
244 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
251 This is a trivial example, but demonstrates the principle that random numbers
252 should be pre-allocated outside the loop in most cases.
254 Restrictions {#gfor_restrictions}
257 This preliminary implementation of GFOR has the following restrictions.
259 Iteration independence {#gfor_iteration_independence}
260 ----------------------
262 The most important property of the loop body is that each iteration must be
263 independent of the other iterations. Note that accessing the result of a
264 separate iteration produces undefined behavior.
266 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
270 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
272 No conditional statements {#gfor_no_cond}
273 -------------------------
275 No conditional statements in the body of the loop, (i.e. no
276 branching). However, you can often find ways to overcome this
277 restriction. Consider the following two examples:
281 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
284 if (k > 10) A(span,k) = k + 1; // bad
286 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
288 However, you can do a few tricks to overcome this restriction by expressing
289 the conditional statement as a multiplication by logical values. For instance,
290 the block of code above can be converted to run as follows, without error:
294 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
296 array condition = (k > 1); // good
297 A(span,k) = (!condition).as(f32) * A(span,k) + condition.as(f32) * (k + 1);
299 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
301 Another example of overcoming the conditional statement restriction in GFOR is
306 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
307 array A = constant(1,n,n,m);
308 array B = randu(n,n);
311 A(span,span,k) = B + k;
313 A(span,span,k) = B * k;
315 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
317 Instead, you can make two passes over the same data, each pass performing one
322 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
323 A = constant(1,n,n,m);
325 gfor (seq k, 0, 2, 3)
326 A(span,span,k) = B + k;
327 gfor (seq k, 1, 2, 3)
328 A(span,span,k) = B * k;
329 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
331 Nested loop restrictions {#gfor_nested_loop}
332 ------------------------
334 Nesting GFOR-loops within GFOR-loops is unsupported. You may interleave
335 FOR-loops as long as they are completely independent of the GFOR-loop
338 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
340 gfor (seq j, m) { // bad
344 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
346 Nesting FOR-loops within GFOR-loops is supported, as long as the GFOR iterator
347 is not used in the FOR loop iterator, as follows:
349 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
351 for (int j = 0; j < (m+k); j++) { // bad
357 for (int j = 0; j < m; j++) { // good
361 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
363 Nesting GFOR-loops inside of FOR-loops is fully supported.
365 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
367 gfor (int j = 0; j < m; j++) { // good
371 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
373 No logical indexing {#gfor_no_logical}
376 Logical indexing like the following is not supported:
378 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
381 array tmp = B(B > .5); // bad
384 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
386 The problem is that every GFOR tile has a different number of elements,
387 something which GFOR cannot yet handle.
389 Similar to the workaround for conditional statements, it might work to use
392 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
396 D(i) = sum(mask .* B);
398 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
400 Sub-assignment with scalars and logical masks is supported:
402 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
408 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
410 Memory considerations {#gfor_memory}
411 =====================
413 Since each computation is done in parallel for all iterator values, you need
414 to have enough card memory available to do all iterations simultaneously. If
415 the problem exceeds memory, it will trigger "out of memory" errors.
417 You can work around the memory limitations of your GPU or device by breaking
418 the GFOR loop up into segments; however, you might want to consider using a
419 larger memory GPU or device.
421 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
425 C(span,span,k) = matmulNT(B * B); // outer product expansion runs out of memory
429 for (int kk = 0; kk < 400; kk += 100) {
430 gfor (seq k, kk, kk+99) { // four batches of 100
432 C(span,span,k) = matmulNT(B, B); // now several smaller problems fit in card memory
435 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~