ICS 632, Fall 2008: Homework #1

 What to turn in

Send a .tar.gz file to henric@hawaii.edu that contains all your source code and your report. The top-level directory should be called ics632_<lastname>_hw1. In that directory there should be sub-directories called exercise_1, exercise_2, etc. Each such directory should itself have sub-directories called question_1, question_2, etc. These directories contain your code with Makefiles and reports whenever applicable.

 Exercise #1: Matrix Multiplication

Question #1: Naive Implementation
Write the "naive" sequential implementation, instrumented so that the wall-clock time is measured. This implementation consists in generating two random NxN double precision matrices, and computing a third one. The program takes only one command-line argument: the dimension of the matrices, N. Experiment with switching loops for better cache utilization and plot the execution time versus N for all possible loop permutations. Also plot the MFlop/sec rate versus N for all possible loop permutation rates. Do the results make sense in light of what we said in class? Be careful to do all your experiments/comparisons using the -O0 compiler optimization flags (i.e., no optimization).

Question #2: Blocked Implementation
Write a "blocked" implementation (as described in class), assuming only square blocks. Your algorithm now takes an extra command-line argument, the block size. You can assume that the matrix dimension is divisible by the block size (so as not to deal with annoying partial block computations and index arithmetic). Plot the execution time versus the block size for a "large" N. Determine the optimal block size. Of course, the loop ordering is still important and you should pick the best possible one (either by reasoning based on the results from Question #1, or via experimentation). Still do all your compilation with -O0.

Question #3: "Optimized" Implementation
Modify your implementation from Question #2 to make it faster using (a reasonable number of) the techniques we have seen in class (e.g., removing array references, loop unrolling). For each additional modification, compute the speedup achieved with respect to the implementation in Question #2 (e.g., "by removing array references the speedup was 1.1; by unrolling loops the speedup went up to 1.15; etc.). Extra credit will be assigned based on performance achieved and compared to other students in the class.

Question #4: "Compiler Optimized" Implementation
In a table report on the following, for a "large" N: Discuss the results. (The results here could be depressing because Matrix-multiplication is such a simple program.)

Question #5: OpenMP Implementation
Use OpenMP to parallelize the best sequential implementation you have so far, so that it can run with an arbitrary number of threads (i.e., not a hard-coded number... obviously using 10000 threads to compute a 3x3 matrix multiplication is not too useful). Pick a "large" N for your experiments. Plot the execution time and the speedup versus the number of threads (for, say, up to 16 threads). Discuss the results.

 Exercise #2: Parallel Function Minimization

A common task in science and engineering is the minimization of functions of many variables that have many local minima. To complicate the matter further, function values are often not calculated from a formula, but instead come from a complex process that is mostly opaque to the optimizer. In other terms, one is given a function (in the sense of a C function) that takes input and that generates output. The goal is to find the set of input values that minimize the output. One approach could be to try one random input and hope to be lucky. Another approach, called steepest descent, consists in approximating the gradient of the function at that random input (which can be compute intensive) to figure out which way is "down" towards a minimum and thus improve on the input, possibly multiple times. For both of these one can repeat the process for many random input and in the end select the best result. The topic of function minimization is fascinating and there is a wealth of search techniques, going from random search to genetic algorithms, via simulated annealing. In this assignment you'll implement a very simple random global/local search.

Question #1: Global/Local Random Search
Write a sequential program called search that takes two command line arguments: a number of trials (integer >0) and a step size (floating point >0). The program optimizes a real function of 800 variables.

The random search proceeds via a number of trials. For each trial pick 800 random variable values, all in [0.0,50.0]. Compute the corresponding function value. Then try to increase the first variable by the step-size (we only go towards larger values, for simplicity). If the function value improves then increase the variable again, otherwise stop and go back to the previous value. Repeat while the function value keeps improving. Do the same thing for the remaining 799 variables. This is a (admittedly very simple) local search in which starting from some random input one tries to find a minimum in a neighborhood of that random input. The search is also global because it uses several trials. The smaller the step size the more intensive the local search component, the bigger the number of trials the more intensive the global search component.

The function to minimize is of 800 variables and is compiled in a .o file stored on the cluster at /home/casanova/public/mystery_function_800.o. The function prototype is:

double mystery_function_800(double *x);

and takes as input an array of 800 double elements. When you link your code, remember to link to the math library (with a -lm flag) because the mystery function uses that library. Instrument your code so that the time taken by each trial and the total wall-clock time is printed (e.g., on stdout). We will run this code in two modes: IMPORTANT: To make two runs comparable, remember to seed your random number generator with the same seed.

Question #2: First Parallel Version
Using OpenMP parallelize your sequential program using two threads, so that each thread performs exactly half of the trials. Instrument your code to report on load imbalance as well (e.g., if thread #1 finishes in 10 seconds, and thread #2 finishes in 12 seconds, the load imbalance is (12-10)/10). Give a table that shows speedup and load imbalance for the Mostly Local and the Mostly Global Execution. Explain the numbers in the table.

Question #3: Second Parallel Version
Using OpenMP parallelize your sequential program using two threads, so that threads perform trials "on-demand", that is, a thread performs a trial only if it's idle. Give a table that shows speedup and load imbalance for the Mostly Local and the Mostly Global Execution. Explain the numbers in the table.

 Exercise #3: Iterative Methods

A fundamental type of scientific computations can be expressed as so-called "iterative stencil applications": at each iteration, values of a matrix are updated using values of neighboring elements in the previous or the current iterations. One well-known such application is "heat transfer", which consists in solving the Laplace equation

over a 2-D domain with boundary conditions. Two common (and not very efficient) iterative methods to solve the equation are Jacobi and Gauss-Seidel. A better method is called Successive Over Relaxation (SOR), but from a parallel computing perswpective it is equivalent to Gauss-Seidel. See this report, which describes the three methods fairly well.

Question #1: Sequential Implementations
Write a sequential jacobi program and a sequentical gseidel program. These programs both take two command-line arguments: N, the dimension of the (square) 2-D domain (i.e., a double precision 2-D array); and the number of iterations to be performed. The programs should create 2-D arrays of dimension N+2 so that the borders of the domain need not be updated and are used only for initial border conditions. Like a simple array initialization, the computation can be performed column-by-column, or row-by-row. Pick the one that uses the cache better. For the purpose of this exercise, assume that the left border of the domain has constant temperature 1.0, while all other borders have temperature 0.0. The center of the domain is initialized with zeros.

Question #2: Parallel Implementations
Using OpenMP directives write a parallel version of the jacobi program and of the gseidel. Discuss the differences between the two implementations and explain clearly how you extracted parallelism in the Gauss-Seidel implementation in spite of data dependencies.


henric@hawaii.edu