$darkmode
A high-performance general-purpose compute library
/home/abuild/rpmbuild/BUILD/arrayfire-full-v3.9.0/docs/pages/gfor.md
Go to the documentation of this file.
1 GFOR: Parallel For-Loops {#page_gfor}
2 ========================
3 
4 [TOC]
5 
6 Run many independent loops simultaneously on the GPU or device.
7 
8 Introduction {#gfor_intro}
9 ============
10 
11 The gfor-loop construct may be used to simultaneously launch all of the
12 iterations of a for-loop on the GPU or device, as long as the iterations are
13 independent. While the standard for-loop performs each iteration sequentially,
14 ArrayFire's gfor-loop performs each iteration at the same time (in
15 parallel). ArrayFire does this by tiling out the values of all loop iterations
16 and then performing computation on those tiles in one pass.
17 
18 You can think of `gfor` as performing auto-vectorization of your code,
19 e.g. you write a gfor-loop that increments every element of a vector but
20 behind the scenes ArrayFire rewrites it to operate on the entire vector in
21 parallel.
22 
23 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
24 for (int i = 0; i < n; ++i)
25  A(i) = A(i) + 1;
26 
27 gfor (seq i, n)
28  A(i) = A(i) + 1;
29 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
30 
31 Behind the scenes, ArrayFire rewrites your code into this equivalent and
32 faster version:
33 
34 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
35 A = A + 1;
36 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
37 
38 It is best to vectorize computation as much as possible to avoid the overhead
39 in both for-loops and gfor-loops.
40 
41 To see another example, you could run an FFT on every 2D slice of a volume in
42 a for-loop, or you could "vectorize" and simply do it all in one gfor-loop
43 operation:
44 
45 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
46 for (int i = 0; i < N; ++i)
47  A(span,span,i) = fft2(A(span,span,i)); // runs each FFT in sequence
48 
49 gfor (seq i, N)
50  A(span,span,i) = fft2(A(span,span,i)); // runs N FFTs in parallel
51 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
52 
53 There are three formats for instantiating gfor-loops.
54 -# gfor(var,n) Creates a sequence _{0, 1, ..., n-1}_
55 -# gfor(var,first,last) Creates a sequence _{first, first+1, ..., last}_
56 -# gfor(var,first,last,incr) Creates a sequence _{first, first+inc, first+2*inc, ..., last}_
57 
58 So all of the following represent the equivalent sequence: _0,1,2,3,4_
59 
60 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
61 gfor (seq i, 5)
62 gfor (seq i, 0, 4)
63 gfor (seq i, 0, 1, 4)
64 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
65 
66 More examples:
67 
68 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
69 array A = constant(1, n, n);
70 array B = constant(1, 1, n);
71 gfor (seq k, 0, n-1) {
72  B(span, k) = sum(A(span, k) * A(span,k)); // inner product
73 }
74 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
75 
76 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
77 array A = randu(n,m);
78 array B = constant(0,n,m);
79 gfor (seq k, 0, m-1) {
80  B(span,k) = fft(A(span,k));
81 }
82 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
83 
84 Usage {#gfor}
85 =====
86 
87 User Functions called within GFOR {#gfor_user_functions}
88 ---------------------------------
89 
90 If you have defined a function that you want to call within a GFOR loop, then
91 that function has to meet all the conditions described in this page in order
92 to be able to work as expected.
93 
94 Consider the (trivial) example below. The function compute() has to satisfy
95 all requirements for GFOR Usage, so you cannot use if-else conditions inside
96 it.
97 
98 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
99 array compute(array A, array B, float ep)
100 {
101  array H;
102  if (ep > 0) H = (A * B) / ep; // BAD
103  else H = A * 0;
104  return H;
105 }
106 
107 int m = 2, n = 3;
108 array A = randu(m, n);
109 array B = randu(m, n);
110 float ep = 2.35;
111 array H = constant(0,m,n);
112 gfor (seq ii, n)
113  H(span,ii) = compute(A(span,ii), B(span,ii), ep);
114 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
115 
116 The Iterator {#gfor_iterator}
117 ------------
118 
119 The iterator can be involved in expressions.
120 
121 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
122 A = constant(1,n,n,m);
123 B = constant(1,n,n);
124 gfor (seq k, m)
125  A(span,span,k) = (k+1)*B + sin(k+1); // expressions
126 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
127 
128 Iterator definitions can include arithmetic in expressions.
129 
130 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
131 A = constant(1,n,n,m);
132 B = constant(1,n,n);
133 gfor (seq k, m/4, m-m/4)
134  A(span,span,k) = k*B + sin(k+1); // expressions
135 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
136 
137 Subscripting {#gfor_subscripting}
138 ------------
139 
140 More complicated subscripting is supported.
141 
142 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
143 A = constant(1,n,n,m);
144 B = constant(1,n,10);
145 gfor (seq k, m)
146  A(span,seq(10),k) = k*B; // subscripting, seq(10) generates index [0,9]
147 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
148 
149 Iterators can be combined with arithmetic in subscripts.
150 
151 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
152 array A = randu(n,m);
153 array B = constant(1,n,m);
154 gfor (seq k, 1, m-1)
155  B(span,k) = A(span,k-1);
156 
157 A = randu(n,2*m);
158 B = constant(1,n,m);
159 gfor (seq k, m)
160  B(span,k) = A(span,2*(k+1)-1);
161 
162 A = randu(n,2*m);
163 B = constant(1,n,m);
164 gfor (seq k, m)
165  B(span,k) = A(span,floor(k+.2));
166 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
167 
168 In-Loop Reuse {#gfor_in_loop}
169 -------------
170 
171 Within the loop, you can use a result you just computed.
172 
173 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
174 gfor (seq k, n) {
175  A(span,k) = 4 * B(span,k);
176  C(span,k) = 4 * A(span,k); // use it again
177 }
178 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
179 
180 Although it is more efficient to store the value in a temporary variable:
181 
182 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
183 gfor (seq k, n) {
184  a = 4 * B(span,k);
185  A(span,k) = a;
186  C(span,k) = 4 * a;
187 }
188 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
189 
190 In-Place Computation {#gfor_in_place_computation}
191 --------------------
192 
193 In some cases, GFOR behaves differently than the typical sequential
194 FOR-loop. For example, you can read and modify a result in place as long as
195 the accesses are independent.
196 
197 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
198 A = constant(1,n,n);
199 gfor (seq k, n)
200  A(span,k) = sin(k) + A(span,k);
201 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
202 
203 Subscripting behaviors `arrays` also work with GFOR.
204 
205 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
206 A = constant(1,n,n,m,k);
207 m = m * k; // precompute since cannot have expressions in iterator
208 gfor (seq k, m)
209  A(span,span,k) = 4 * A(span,span,k); // collapse last dimension
210 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
211 
212 Random Data Generation {#gfor_random}
213 ----------------------
214 
215 Random data should always be generated outside the GFOR loop. This is due to
216 the fact that GFOR only passes over the body of the loop once. Therefore,
217 any calls to randu() inside the body of the loop will result in the same
218 random matrix being assigned to every iteration of the loop.
219 
220 For example, in the following trivial code, all columns of `B` are identical
221 because `A` is only evaluated once:
222 
223 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
224 gfor (seq ii, n) {
225  array A = randu(3,1);
226  B(span,ii) = A;
227 }
228 print(B);
229 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
230 
231  B =
232  0.1209 0.1209 0.1209
233  0.6432 0.6432 0.6432
234  0.8746 0.8746 0.8746
235 
236 This can be rectified by bringing the random number generation outside the
237 loop, as follows:
238 
239 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
240 array A = randu(3,n);
241 gfor (seq ii, n)
242  B(span,ii) = A(span,ii);
243 print(B);
244 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
245 
246  B =
247  0.0892 0.1655 0.7807
248  0.5626 0.5173 0.2932
249  0.5664 0.5898 0.1391
250 
251 This is a trivial example, but demonstrates the principle that random numbers
252 should be pre-allocated outside the loop in most cases.
253 
254 Restrictions {#gfor_restrictions}
255 ============
256 
257 This preliminary implementation of GFOR has the following restrictions.
258 
259 Iteration independence {#gfor_iteration_independence}
260 ----------------------
261 
262 The most important property of the loop body is that each iteration must be
263 independent of the other iterations. Note that accessing the result of a
264 separate iteration produces undefined behavior.
265 
266 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
267 array B = randu(3);
268 gfor (seq k, n)
269  B = B + k; // bad
270 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
271 
272 No conditional statements {#gfor_no_cond}
273 -------------------------
274 
275 No conditional statements in the body of the loop, (i.e. no
276 branching). However, you can often find ways to overcome this
277 restriction. Consider the following two examples:
278 
279 Example 1: Problem
280 
281 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
282 A = constant(1,n,m);
283 gfor (seq k, n) {
284  if (k > 10) A(span,k) = k + 1; // bad
285 }
286 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
287 
288 However, you can do a few tricks to overcome this restriction by expressing
289 the conditional statement as a multiplication by logical values. For instance,
290 the block of code above can be converted to run as follows, without error:
291 
292 Example 1: Solution
293 
294 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
295 gfor (seq k, m) {
296  array condition = (k > 1); // good
297  A(span,k) = (!condition).as(f32) * A(span,k) + condition.as(f32) * (k + 1);
298 }
299 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
300 
301 Another example of overcoming the conditional statement restriction in GFOR is
302 as follows:
303 
304 Example 2: Problem
305 
306 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
307 array A = constant(1,n,n,m);
308 array B = randu(n,n);
309 gfor (seq k, 4) {
310  if ((k % 2) != 0)
311  A(span,span,k) = B + k;
312  else
313  A(span,span,k) = B * k;
314 }
315 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
316 
317 Instead, you can make two passes over the same data, each pass performing one
318 branch.
319 
320 Example 2: Solution
321 
322 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
323 A = constant(1,n,n,m);
324 B = randu(n);
325 gfor (seq k, 0, 2, 3)
326  A(span,span,k) = B + k;
327 gfor (seq k, 1, 2, 3)
328  A(span,span,k) = B * k;
329 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
330 
331 Nested loop restrictions {#gfor_nested_loop}
332 ------------------------
333 
334 Nesting GFOR-loops within GFOR-loops is unsupported. You may interleave
335 FOR-loops as long as they are completely independent of the GFOR-loop
336 iterator.
337 
338 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
339 gfor (seq k, n) {
340  gfor (seq j, m) { // bad
341  // ...
342  }
343 }
344 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
345 
346 Nesting FOR-loops within GFOR-loops is supported, as long as the GFOR iterator
347 is not used in the FOR loop iterator, as follows:
348 
349 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
350 gfor (seq k, n) {
351  for (int j = 0; j < (m+k); j++) { // bad
352  // ...
353  }
354 }
355 
356 gfor (seq k, n) {
357  for (int j = 0; j < m; j++) { // good
358  //...
359  }
360 }
361 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
362 
363 Nesting GFOR-loops inside of FOR-loops is fully supported.
364 
365 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
366 for (seq k, n) {
367  gfor (int j = 0; j < m; j++) { // good
368  // ...
369  }
370 }
371 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
372 
373 No logical indexing {#gfor_no_logical}
374 -------------------
375 
376 Logical indexing like the following is not supported:
377 
378 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
379 gfor (seq i, n) {
380  array B = A(span,i);
381  array tmp = B(B > .5); // bad
382  D(i) = sum(tmp);
383 }
384 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
385 
386 The problem is that every GFOR tile has a different number of elements,
387 something which GFOR cannot yet handle.
388 
389 Similar to the workaround for conditional statements, it might work to use
390 masked arithmetic:
391 
392 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
393 gfor (seq i, n) {
394  array B = A(span,i);
395  array mask = B > .5;
396  D(i) = sum(mask .* B);
397 }
398 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
399 
400 Sub-assignment with scalars and logical masks is supported:
401 
402 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
403 gfor (seq i, n) {
404  a = A(span,i);
405  a(isnan(a)) = 0;
406  A(span,i) = a;
407 }
408 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
409 
410 Memory considerations {#gfor_memory}
411 =====================
412 
413 Since each computation is done in parallel for all iterator values, you need
414 to have enough card memory available to do all iterations simultaneously. If
415 the problem exceeds memory, it will trigger "out of memory" errors.
416 
417 You can work around the memory limitations of your GPU or device by breaking
418 the GFOR loop up into segments; however, you might want to consider using a
419 larger memory GPU or device.
420 
421 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
422 // BEFORE
423 gfor (seq k, 400) {
424  array B = A(span,k);
425  C(span,span,k) = matmulNT(B * B); // outer product expansion runs out of memory
426 }
427 
428 // AFTER
429 for (int kk = 0; kk < 400; kk += 100) {
430  gfor (seq k, kk, kk+99) { // four batches of 100
431  array B = A(span,k);
432  C(span,span,k) = matmulNT(B, B); // now several smaller problems fit in card memory
433  }
434 }
435 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~