ECE 757 Fall 2009 Homework 1 // Due in Lecture Wed Feb 18, 2009

The purpose of this assigment is to:

This assignment is to be completed individually.
Programming environment: You will do this assignment on niagara.ece.wisc.edu, which is a Sun Fire T2000 Server containing a 32-thread Sun UltraSparc-T1 processor. To login to this machine you can use your ECE login. If you do not have an ECE login, email the instructors and we will arrange access for you.

Note: Since we are all sharing access to a single machine, please try to start the assignment early. If multiple users are doing intensive runs at the same time, they can greatly impact the performance of each other’s programs.

Program Description:

In this homework, you will be using OpenMP and Pthreads to parallelize a Finnish dice game known as Kymppitonni ("Ten-thousand"). In this game, the objective is to reach a total of 10,000 points in as few turns as possible. Each turn a player can roll dice until they have rolled all dice or reached their point cutoff. The point cutoff is strategically selected by the player. A high cutoff could net more points, but also risks losing all accumulated points if the player rolls a "dud". The provided serial application plays multiple games of kymppitonni (per cutoff) varying the cutoff value from a minimum cutoff to a maximum cutoff in increments of 50. Since all games are independent from each other, this problem should be very easy to parallelize.
The serial C code code is here.
Example call:
./kymppitonni <num games per cutoff> <minimum cutoff> <maximum cutoff>

The provided program tries to collect some statistics about the played games. Specifically 1) The best realized cutoff point 2) The worst realized cutoff point 3) The average point benefit obtained by playing with x dice still in play. 4) The probability of rolling a "dud" with x dice still in play. All of these statistics represent shared state across games which your parallel code will have to respect.

Homework Requirements

This homework involves the completion of three primary tasks.
  1. Profiling and analysis of existing sequential code
  2. Parallelization of program with Pthreads
  3. Parallelization of program with OpenMP

Part 1 - Profiling and analysis

Before attempting to parallelize the program we will first attempt to estimate the ideal scalability of the application using Amdhal’s law. The existing application measures the total execution time in nanoseconds.

Using your best estimate, modify the program to measure the time of the expected parallel segment and serial segment. Note: You should be using the gethrtime() function which returns a value in terms of nanoseconds. Report the ratio of parallel to total program execution time.

Using the expected parallel/serial ratio from your measurement, create a graph showing the speedup expected by scaling from 1-32 processors. You only need points for 1, 2, 4, 8, 16, and 32.

Part 2 - Pthreads implementation

In this section you will modify the existing program to properly support threaded execution with pthreads. An excellent reference for pthreads can be found here.

You should take into consideration how to partition work into threads. Specifically, you should be careful in assigning which threads execute certain games as not all games require equal time to execute (those with very high cutoffs often take much more time). You are completely free to restructure the code, define new data structures, or interchange looping order - as long as individual games are still simulated properly and global statistics are properly maintained. Also, make sure the make the number of threads to execute a command line parameter to your executable.

Note: The existing code makes a call to the library call random. This function is not thread safe. You should replace this with a call to rand_r. Look up the man page documentation for usage.

Once you have completed parallelizing your code, record the speedup using total execution time ranging from 2-32 processors for the parallel code over the serial version.

Part 3 - OpenMP implementation

Now parallelize your code using OpenMP rather than Pthreads. An excellent reference for OpenMP is available here.

Note: You will also have to use rand_r instead for this code section.

Once you have completed parallelizing your code, record the speedup total execution time ranging from 2-32 processors for the parallel code over the serial version.

Deliverables:

Note: For each graph, you should have 2 data segments. Run the program once with (games=10, minimum cutoff=50, maximum cutoff=7000) and again with (games=1000, minimum cutoff=50, maximum cutoff=7000).
  1. Estimated parallel and sequential fractions from Part 1
  2. A graph showing expected speedup derived from the parallel and sequential fractions. Graph should range from 1-32 processors.
  3. A graph showing your achieved speedup from parallelization with Pthreads.
  4. A graph showing your achieved speedup from parallelization with OpenMP.
  5. A 1-2 paragraph discussion comparing your Pthread results with the ideal speedups expected from part 1. Make sure to discuss any observable differences.
  6. A 1-2 paragraph discussion comparing your OpenMP results with the ideal speedups expected from part 1. Make sure to discuss any observable differences.
  7. Finally, directly compare the observed speedups of Pthreads and OpenMP. You should write 1-2 paragraphs discussion which performed better and a possible explantion for any differences
  8. Submit your source code using the learn@uw dropbox.

For further details on the Niagara platform (which may help you explain your results), refer to these 752 lecture notes (starting at slide 50), or this 752 reading.