ECE 757 Spring 2016 Homework 1 // Due Monday Feb 20, 2017

The purpose of this assigment is to:

This assignment is to be completed individually.

NOTE: this assignment writeup is a work in progress. If you discover any problems or issues, please let me know as soon as possible.


Programming environment: You may do this assignment using best-tux.cae.wisc.edu (one of the CAE linux machines) or any other linux machine you have access to. The tux machines typically contain a quad-core CPU, so the amount of parallelism you can effectively explore will be limited.

Program Description:

In this homework, you will be using OpenMP and Pthreads to parallelize a Finnish dice game known as Kymppitonni ("Ten-thousand"). In this game, the objective is to reach a total of 10,000 points in as few turns as possible. Each turn a player can roll dice until they have rolled all dice or reached their point cutoff. The point cutoff is strategically selected by the player. A high cutoff could net more points, but also risks losing all accumulated points if the player rolls a "dud". The rules of the game are not material for you to complete this assignment.

The provided serial application plays multiple games of kymppitonni (per cutoff) varying the cutoff value from a minimum cutoff to a maximum cutoff in increments of 50. Since all games are independent from each other, this problem should be very easy to parallelize. The point of the program is to develop intuition/strategy as to what is a good cutoff value to utilize when actually playing the game.
The serial C code code is here.
Example call:
./kymppitonni <num games per cutoff> <minimum cutoff> <maximum cutoff>

The provided program tries to collect some statistics about the played games. Specifically 1) The best realized cutoff point 2) The worst realized cutoff point 3) The average point benefit obtained by playing with x dice still in play. 4) The probability of rolling a "dud" with x dice still in play. All of these statistics represent shared state across games which your parallel code will have to respect.

Homework Requirements

This homework involves the completion of four primary tasks.
  1. Profiling and analysis of existing sequential code
  2. Parallelization of program with Pthreads
  3. Parallelization of program with OpenMP
  4. Running the pthreads parallel version with gem5

Part 1 - Profiling and analysis

Before attempting to parallelize the program we will first attempt to estimate the ideal scalability of the application using Amdahl's law. The existing application measures the total execution time in processor clock cycles.

Using your best estimate, modify the program to measure the time of the expected parallel segment and serial segment. Note: You should be using the getCC() function provided by rdtsc.h which returns a value in terms of clock cycles. Report the ratio of parallel to total program execution time.

Using the expected parallel/serial ratio from your measurement, create a graph showing the speedup expected by scaling from 1-16 processors. You only need points for 1, 2, 4, 8, and 16.

Part 2 - Pthreads implementation

In this section you will modify the existing program to properly support threaded execution with pthreads. An excellent reference for pthreads can be found here.

You should take into consideration how to partition work into threads. Specifically, you should be careful in assigning which threads execute certain games as not all games require equal time to execute (those with very high cutoffs often take much more time). You are completely free to restructure the code, define new data structures, or interchange looping order - as long as individual games are still simulated properly and global statistics are properly maintained. Also, make sure the make the number of threads to execute a command line parameter to your executable.

Note: The existing code makes a call to the library call rand. This function is not thread safe. You should replace this with a call to rand_r. Look up the man page documentation for usage.

Once you have completed parallelizing your code, record the speedup using total execution time ranging from 2, 3, and 4 processors for the parallel code over the serial version.

Part 3 - OpenMP implementation

Now parallelize your code using OpenMP rather than Pthreads. An excellent reference for OpenMP is available here.

Note: You will also have to use rand_r instead for this code section.

Once you have completed parallelizing your code, record the speedup total execution time ranging from 2, 3, and 4 processors for the parallel code over the serial version.

Part 4 - Running your pthreads implementation on gem5 in SE mode

You will need to recompile your pthreads implementation to use m5threads instead of pthreads, since a real pthreads library is not available in SE mode.

First, you need to follow the steps in the gem5 tutorial to clone a copy of gem5 to a linux machine, configure it, and compile it.

If you have a personal linux machine, you will have to make sure the "scons" and "swig" packages are installed.

If you have a reasonably fast PC with a decent amount of memory, but it runs windows, I highly recommend running linux in a virtual machine using the Virtual Box platform. It is quite straightforward to create a VM and install the latest stable version of Ubuntu on it.

If you are using the best-tux.cae.wisc.edu machines, "scons" is globally installed but you will need to issue the following command to pick up a local installed version of swig (you can also add this to the .bashrc file in your home directory, if you wish; that way it is automatically performed whenever you log in):
$ source /home/vhosts/ece757.ece.wisc.edu/.bashrc

The above script will also set GEM5=/home/vhosts/ece.757.wisc.edu/src/gem5 as a shortcut to find the gem5 install on the best-tux machines. You should create your own installation and set this variable to point to that location.

Next, you need to set up an instance of the m5threads library. On the tux machines, you can use the one in /home/vhosts/ece757.ece.wisc.edu/src/m5threads, or you can download your own:
cd towhereyouwantm5threadinstalled
hg clone http://repo.gem5.org/m5threads
cd m5threads
make

Now, to compile your pthreads version of kymppitonni for execution in gem5 SE mode, you will need to compile it using m5threads. There are also some handy hooks that you can use to instruct gem5 to reset and collect stats around your work units. You will need to add an '#include "m5ops.h" to your C file to compile with these hooks. Refer to eg_pthread.c for an example.

Once you have modified your pthreads file appropriately, compile it statically, Here is an example command line (substitute paths to your own installation of gem and and m5threads if you are not running on the best-tux machines):
gcc kymppitonni_pthread.c -I$GEM5/util/m5/ $GEM5/util/m5/m5op_x86.S -static $GEM5/../m5threads/libpthread.a -o kymppitonni_pthread

Now you can run your program. You will need to specify the number of cores to gem5, and that it is one more than the number of threads. -c specifies the command to run, and -o specifies any arguments to send to the program. For example (this assumes you added a parameter to specify the number of threads to the kyppitonni program as argv[1]):


$GEM5/build/X86/gem5.opt $GEM5/configs/example/se.py --ruby -n 3 -c kymppitonni_pthreads -o "2 10 2500 3500"

Validate your Amdahl's Law projections from Part I by running with 1, 2, 4, 8, and 16 threads and measuring the execution time by examining the gem5 output statistics.

Deliverables:

Note: For graphs #2, #3, and #4 below, you should have 2 data points. Run the program once with (games=10, minimum cutoff=50, maximum cutoff=7000) and again with (games=1000, minimum cutoff=50, maximum cutoff=7000). For the simulated results (#5 below) run with games=10 only.
  1. Estimated parallel and sequential fractions from Part 1
  2. A graph showing expected speedup derived from the parallel and sequential fractions. Graph should range from 1-16 processors.
  3. A graph showing your achieved speedup from parallelization with Pthreads for 1-4 threads.
  4. A graph showing your achieved speedup from parallelization with OpenMP for 1-4 threads.
  5. A graph showing your achieved speedup from parallelization with pthreads running on gem5 SE mode for 1-16 threads.
  6. A 1-2 paragraph discussion comparing your Pthread results (real and simulated) with the ideal speedups expected from part 1. Make sure to discuss any observable differences.
  7. A 1-2 paragraph discussion comparing your OpenMP results with the ideal speedups expected from part 1. Make sure to discuss any observable differences.
  8. Finally, directly compare the observed speedups of Pthreads and OpenMP. You should write 1-2 paragraphs discussion which performed better and a possible explantion for any differences
  9. Submit your source code, graphs, and writeups using the learn@uw dropbox.