ICS 632, Fall 2008: Homework #1
Send a .tar.gz file to henric@hawaii.edu that contains all
your source code and your report. The top-level directory should be
called ics632_<lastname>_hw1. In that directory there
should be sub-directories called exercise_1, exercise_2, etc. Each such directory should itself have sub-directories called question_1, question_2, etc. These directories contain your code with Makefiles and reports whenever applicable.
Question #1: Naive
Implementation
Write the "naive" sequential implementation,
instrumented so that the wall-clock time is measured. This implementation
consists in generating two random NxN double precision matrices, and
computing a third one. The program takes only one command-line argument:
the dimension of the matrices, N. Experiment with switching loops
for better cache utilization and plot the execution time versus N
for all possible loop permutations. Also plot the MFlop/sec rate versus N
for all possible loop permutation rates. Do the results make
sense in light of what we said in class? Be careful to do all your
experiments/comparisons using the -O0 compiler optimization flags (i.e., no
optimization).
Question #2: Blocked Implementation
Write a "blocked" implementation (as described in class), assuming only
square blocks. Your algorithm now takes an extra command-line argument, the
block size. You can assume that the matrix dimension is divisible by the
block size (so as not to deal with annoying partial block computations and
index arithmetic). Plot the execution time versus the block size for a "large"
N. Determine the optimal block
size. Of
course, the loop ordering is still important and you should pick the best
possible one (either by reasoning based on the results from Question #1, or
via experimentation). Still do all your compilation with -O0.
Question #3: "Optimized" Implementation
Modify your implementation from Question #2 to make it faster using (a reasonable number of)
the techniques we have seen in class (e.g., removing array references, loop
unrolling). For each additional modification, compute the speedup achieved
with respect to the implementation in Question #2 (e.g., "by removing array references
the speedup was 1.1; by unrolling loops the speedup went up to 1.15; etc.). Extra credit will be assigned based on performance achieved and compared to other students in the class.
Question #4: "Compiler Optimized" Implementation
In a table report on the following, for a "large" N:
- Execution time of the naive i-j-k implementation compiled with -O0
- Execution time of the naive i-j-k implementation compiled with -O3
- Execution time of the naive implementation with the best loop ordering compiled with -O0
- Execution time of the naive implementation with the best loop ordering compiled with -O3
- Execution time of the implementation in Question #3 compiled with -O0
- Execution time of the implementation in Question #3 compiled with -O3
Discuss the results. (The results here could be depressing because
Matrix-multiplication is such a simple program.)
Question #5: OpenMP Implementation
Use OpenMP to parallelize the best sequential implementation you have so
far, so that it can run with an arbitrary number of threads (i.e., not a
hard-coded number... obviously using 10000 threads to compute a 3x3 matrix
multiplication is not too useful). Pick a "large" N for your experiments.
Plot the execution time and the speedup versus the number of threads (for,
say, up to 16 threads). Discuss the results.
A common task in science and engineering is the minimization of functions
of many variables that have many local minima. To complicate the matter
further, function values are often not calculated from a formula, but
instead come from a complex process that is mostly opaque to the optimizer.
In other terms, one is given a function (in the sense of a C function) that
takes input and that generates output. The goal is to find the set of
input values that minimize the output. One approach could be to try one random
input and hope to be lucky. Another approach, called steepest descent,
consists in approximating the gradient of the function at that random
input (which can be compute intensive) to figure out which way is "down"
towards a minimum and thus improve on the input, possibly multiple times.
For both of these one can repeat the process for many random input and in
the end select the best result. The topic of function minimization is
fascinating and there is a wealth of search techniques, going from random
search to genetic algorithms, via simulated annealing. In this assignment
you'll implement a very simple random global/local search.
Question #1: Global/Local Random
Search
Write a sequential program called search
that takes two command line arguments: a number of trials (integer
>0) and a step size (floating point >0). The program
optimizes a real function of 800 variables.
The random search proceeds via a number of trials. For each trial pick
800 random variable values, all in [0.0,50.0]. Compute the corresponding
function value. Then try to increase the first variable
by the step-size (we only go towards larger values, for simplicity). If
the function value improves then increase the variable again, otherwise
stop and go back to the previous value. Repeat while the function value
keeps improving. Do the same thing for the remaining 799 variables. This
is a (admittedly very simple) local search in which starting from some
random input one tries to find a minimum in a neighborhood of that random
input. The search is also global because it uses several trials. The
smaller the step size the more intensive the local search component, the
bigger the number of trials the more intensive the global search component.
The function to minimize is of 800 variables and is compiled in a .o file stored on the cluster at /home/casanova/public/mystery_function_800.o. The function prototype is:
double mystery_function_800(double *x);
and takes as input an array of 800 double elements. When you link your code,
remember to link to the math library (with a -lm flag) because the mystery
function uses that library.
Instrument your code so that the time taken by each trial and the total
wall-clock time is printed (e.g., on stdout). We will run this code
in two modes:
- Mostly Local: 10 trials, step size = 0.1. This mode may be more appropriate
if one feels that there are many "good" local minima all over.
- Mostly Global: 100 trials, step size = 1. This mode may be more appropriate
if one feels that there are not many "good" local minima, or perhaps only
one global minimum.
IMPORTANT: To make two runs comparable, remember to seed your random
number generator with the same seed.
Question #2: First Parallel Version
Using OpenMP parallelize your sequential program using two threads, so that
each thread performs exactly half of the trials. Instrument your code to
report on load imbalance as well (e.g., if thread #1 finishes in 10
seconds, and thread #2 finishes in 12 seconds, the load imbalance is
(12-10)/10). Give a table that shows speedup and load imbalance for the
Mostly Local and the Mostly Global Execution. Explain the numbers in the
table.
Question #3: Second Parallel Version
Using OpenMP parallelize your sequential program using two threads, so that
threads perform trials "on-demand", that is, a thread performs a trial only
if it's idle. Give a table that shows speedup and load imbalance for the
Mostly Local and the Mostly Global Execution. Explain the numbers in the
table.
A fundamental type of scientific computations can be expressed as so-called
"iterative stencil applications": at each iteration, values of a matrix are
updated using values of neighboring elements in the previous or the current
iterations. One well-known such application is "heat transfer", which
consists in solving the Laplace equation

over a 2-D domain with boundary conditions. Two common (and not very
efficient) iterative methods to solve the equation are Jacobi and
Gauss-Seidel. A better method is called Successive Over Relaxation (SOR), but
from a parallel computing perswpective it is equivalent to Gauss-Seidel.
See this report, which describes the
three methods fairly well.
Question #1: Sequential Implementations
Write a sequential jacobi program and a sequentical
gseidel program. These programs both take two command-line
arguments: N, the dimension of the (square) 2-D domain (i.e., a double
precision 2-D array); and the number of iterations to be performed. The
programs should create 2-D arrays of dimension N+2 so that the borders of
the domain need not be updated and are used only for initial border
conditions. Like a simple array initialization, the computation can be
performed column-by-column, or row-by-row. Pick the one that uses the cache
better. For the purpose of this exercise, assume that the left border of
the domain has constant temperature 1.0, while all other borders have
temperature 0.0. The center of the domain is initialized with zeros.
Question #2: Parallel
Implementations
Using OpenMP directives write a parallel
version of the jacobi program and of the gseidel. Discuss
the differences between the two implementations and explain clearly how you
extracted parallelism in the Gauss-Seidel implementation in spite of
data dependencies.
henric@hawaii.edu