Overview
In this exercise you'll implement the 1-D Matrix Multiplication algorithm
we saw in class and develop an empirical performance model for it.
You should write your code using blocking MPI communication only. Your code should take a single command-line argument: the dimension of the matrices, N. You can assume that the number of nodes divides N.
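As a minimal sketch of the argument handling described above, the helpers below parse N and compute how many columns (or rows) each process owns in a 1-D distribution. The function names parse_dimension and block_width are hypothetical, chosen only for illustration; p stands for the number of MPI processes (for example, the value returned by MPI_Comm_size).

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse the matrix dimension N from argv[1].
 * Returns N on success, or -1 if the string is not a positive integer. */
long parse_dimension(const char *arg)
{
    char *end = NULL;
    errno = 0;
    long n = strtol(arg, &end, 10);
    if (errno != 0 || end == arg || *end != '\0' || n <= 0)
        return -1;
    return n;
}

/* Hypothetical helper: width of the slice owned by each of the p processes.
 * Returns -1 if p does not divide n, which the assignment lets you rule out. */
long block_width(long n, int p)
{
    if (p <= 0 || n % p != 0)
        return -1;
    return n / p;
}
```

For N = 1440 on 6 nodes, block_width(1440, 6) gives 240 columns per process.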
Note that you may have to set the P4_GLOBMEMSIZE environment variable to a higher value than the default to send large messages (as instructed by an error message you may see when you run your code).
Note also that to test that the result is correct you cannot compare floating-point numbers with "==". Instead, compare the absolute value of their difference to some epsilon. In our case the epsilon can be large (e.g., 0.5) because we know that all matrix elements are integers.
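The epsilon test described above could look roughly like the following. This is only a sketch: the name matrices_match is hypothetical, and it assumes the computed matrix and a reference matrix (for example, one computed sequentially) are both stored as flat arrays of doubles.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if every element of c is within eps of
 * the matching element of ref, and 0 otherwise. */
int matrices_match(const double *c, const double *ref, size_t count, double eps)
{
    for (size_t i = 0; i < count; i++) {
        double diff = c[i] - ref[i];
        if (diff < 0.0)
            diff = -diff;          /* |c[i] - ref[i]| */
        if (!(diff <= eps))        /* written this way so that NaN also fails */
            return 0;
    }
    return 1;
}
```

With integer-valued elements, a call such as matrices_match(C, C_ref, n * n, 0.5) accepts rounding noise but rejects any element that is off by one or more.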
Question #2: The Performance Model
In class we've seen a simple performance model for the above algorithm.
That model involved several constants that depend on the platform
(L, b, and w). Design, describe, and implement a sound
procedure to measure these three constants on our cluster. Use these
constants to instantiate the performance model.
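One possible procedure for L and b, sketched here under stated assumptions: time a ping-pong between two nodes for several message sizes (for instance with MPI_Send, MPI_Recv, and MPI_Wtime, halving the round-trip time), then fit a straight line through the measurements. The sketch assumes the message time is modeled as L + m * b, where m is the message size in elements and b is the time per element; if the model from class defines b as a bandwidth instead, use the reciprocal. The function name fit_latency_bandwidth is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: least-squares fit of t = L + m * b.
 * sizes[i] is a message size in elements, times[i] the measured one-way
 * time in seconds. Returns 0 on success, -1 if the fit is undetermined. */
int fit_latency_bandwidth(const double *sizes, const double *times, size_t k,
                          double *L, double *b)
{
    if (k < 2)
        return -1;
    double sm = 0.0, st = 0.0, smm = 0.0, smt = 0.0;
    for (size_t i = 0; i < k; i++) {
        sm  += sizes[i];
        st  += times[i];
        smm += sizes[i] * sizes[i];
        smt += sizes[i] * times[i];
    }
    double n = (double)k;
    double denom = n * smm - sm * sm;
    if (denom == 0.0)              /* all message sizes identical */
        return -1;
    *b = (n * smt - sm * st) / denom;   /* slope: time per element */
    *L = (st - *b * sm) / n;            /* intercept: latency */
    return 0;
}
```

A similar approach may work for w: time a local block multiplication with no communication and divide by the number of multiply-add operations it performs. Repeating each measurement several times and discarding warm-up runs should make the constants more stable.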
Run your program and measure its execution time for a matrix size of 1440 on 2, 4, and 6 nodes. On a single graph, plot the measured execution time of your code and the execution time predicted by the performance model. Note that the performance model should account for the fact that your code uses blocking communication as opposed to non-blocking communication.
Discuss the discrepancies between the model and the measurements, and suggest explanations. Are the trends at least the same? Repeat the same experiment with a "small" matrix size and see whether your model is more or less accurate. Explain why.
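For illustration, here is how a blocking-communication prediction might be computed once the constants are known. The exact equation is the one from class; this sketch assumes a common form of the 1-D ring model in which the algorithm runs p steps, each step performs n * r * r multiply-adds (r = n / p) at cost w each, and each step transfers n * r elements at cost L + n * r * b. With blocking communication the two phases cannot overlap, so their costs add. The name predict_blocking is hypothetical.

```c
/* Hypothetical helper: predicted execution time (seconds) when the
 * computation and communication of each step happen one after the other.
 * Assumed model: T = p * (n * r^2 * w + L + n * r * b), with r = n / p. */
double predict_blocking(long n, int p, double L, double b, double w)
{
    double r = (double)n / (double)p;
    double compute = (double)n * r * r * w;   /* per-step computation */
    double comm = L + (double)n * r * b;      /* per-step communication */
    return (double)p * (compute + comm);
}
```

Evaluating this function for p = 2, 4, and 6 gives the predicted curve to plot next to the measured times.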
Question #3: Non-Blocking Communication
Modify your code so that it uses non-blocking communication and overlaps
communication and computation. Modify your performance model accordingly
(not the constants, just the equation). On a graph show the execution time
of your code and that predicted by the performance model for a fixed
(reasonably large) matrix size on 2, 3, 4, 5, and 6 nodes. Was performance
improved by the use of non-blocking communication?
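A sketch of how the equation might change when communication and computation overlap: under the assumed ring model (p steps, n * r * r multiply-adds and n * r transferred elements per step, with r = n / p), each step then costs the larger of the two phases rather than their sum. This form is an assumption and should be checked against the model from class; the name predict_overlapped is hypothetical.

```c
/* Hypothetical helper: predicted execution time (seconds) when each step's
 * communication is fully overlapped with its computation.
 * Assumed model: T = p * max(n * r^2 * w, L + n * r * b), with r = n / p. */
double predict_overlapped(long n, int p, double L, double b, double w)
{
    double r = (double)n / (double)p;
    double compute = (double)n * r * r * w;   /* per-step computation */
    double comm = L + (double)n * r * b;      /* per-step communication */
    double step = (compute > comm) ? compute : comm;
    return (double)p * step;
}
```

Under this assumption the best possible gain over the blocking version is a factor of two, reached when the computation and communication times of a step are equal.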
Question #4: OpenMP
Multi-thread your code with OpenMP so that both processors on each
node are used by each MPI process. What is the speedup compared
to the code in Question #3 for 6 nodes?
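One way the local computation might be multi-threaded is sketched below. It assumes row-major storage and a local update of the form C += A * B; the function name local_multiply and its argument layout are hypothetical. The outer loop is split across threads, and since each thread writes a distinct set of rows of C, no synchronization is needed inside the loop.

```c
/* Hypothetical helper: C += A * B for row-major blocks, where A is m x k,
 * B is k x n, and C is m x n. Rows of C are divided among OpenMP threads. */
void local_multiply(const double *A, const double *B, double *C,
                    int m, int k, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < m; i++) {
        for (int l = 0; l < k; l++) {
            double a = A[i * k + l];
            for (int j = 0; j < n; j++)
                C[i * n + j] += a * B[l * n + j];
        }
    }
}
```

Setting OMP_NUM_THREADS=2 would match the two processors per node. If only the main thread makes MPI calls, initializing with MPI_Init_thread and MPI_THREAD_FUNNELED is the usual choice. The speedup asked for is the Question #3 execution time divided by the OpenMP execution time, both measured on 6 nodes.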