ECE 757 Spring 2016 Homework 1 // Due Monday Feb 20, 2017
The purpose of this assigment is to:
- give you experience writing simple shared memory programs using OpenMP and
Pthreads
- provide a gentle introduction to parallel programming
- provide a foundation for writing and/or running much more complex
programs on various parallel programming environments
- familiarize you with running parallel programs in SE mode on gem5.
This assignment is to be completed individually.
NOTE: this assignment writeup is a work in progress. If you discover any problems or issues, please let me know as soon as possible.
Programming environment:
You may do this assignment using best-tux.cae.wisc.edu (one of the CAE linux machines) or any other linux machine you have access to.
The tux machines typically contain a quad-core CPU, so the amount of parallelism you can effectively explore will be limited.
Program Description:
In this homework, you will be using OpenMP and Pthreads to parallelize a Finnish dice game
known as Kymppitonni ("Ten-thousand"). In this game, the objective is to reach
a total of 10,000 points in as few turns as possible. Each turn a player can
roll dice until they have rolled all dice or reached their point cutoff. The
point cutoff is strategically selected by the player. A high cutoff could net
more points, but also risks losing all accumulated points if the player rolls a
"dud". The rules of the game are not material for you to complete this assignment.
The provided serial application plays multiple games of kymppitonni (per cutoff) varying the cutoff
value from a minimum cutoff to a maximum cutoff in increments of 50. Since all
games are independent from each other, this problem should be very
easy to parallelize. The point of the program is to develop intuition/strategy as to what is a good cutoff value to utilize when actually playing the game.
The serial C code code is here.
Example call:
./kymppitonni <num games per cutoff> <minimum cutoff> <maximum cutoff>
The provided program tries to collect some statistics about the played games.
Specifically 1) The best realized cutoff point 2) The worst realized cutoff
point 3) The average point benefit obtained by playing with x dice still in
play. 4) The probability of rolling a "dud" with x dice still in play. All of
these statistics represent shared state across games which your parallel code
will have to respect.
Homework Requirements
This homework involves the completion of four primary tasks.
- Profiling and analysis of existing sequential code
- Parallelization of program with Pthreads
- Parallelization of program with OpenMP
- Running the pthreads parallel version with gem5
Part 1 - Profiling and analysis
Before attempting to parallelize the program we will first attempt to estimate
the ideal scalability of the application using Amdahl's law. The
existing application measures the total execution time in processor clock cycles.
Using your best estimate, modify the program to measure the time of the expected
parallel segment and serial segment. Note: You should be using the getCC()
function provided by rdtsc.h which returns a value in terms of clock cycles.
Report the ratio of parallel to total program execution time.
Using the expected parallel/serial ratio from your measurement, create a graph
showing the speedup expected by scaling from 1-16 processors. You only need
points for 1, 2, 4, 8, and 16.
Part 2 - Pthreads implementation
In this section you will modify the existing program to properly support
threaded execution with pthreads. An excellent reference for pthreads can be
found here.
You should take
into consideration how to partition work into threads. Specifically, you
should be careful in assigning which threads execute certain games as not all
games require equal time to execute (those with very high cutoffs often take
much more time). You are completely free to restructure the code, define new
data structures, or interchange looping order - as long as individual games are
still simulated properly and global statistics are properly maintained. Also,
make sure the make the number of threads to execute a command line parameter to
your executable.
Note: The existing code makes a call to the library call rand.
This function is not thread safe. You should replace this with a call to
rand_r. Look up the man page documentation for usage.
Once you have completed parallelizing your code, record the speedup using total execution
time ranging from 2, 3, and 4 processors for the parallel code over the serial
version.
Part 3 - OpenMP implementation
Now parallelize your code using OpenMP rather than Pthreads. An excellent
reference for OpenMP is available here.
Note: You will also have to use rand_r instead for this code section.
Once you have completed parallelizing your code, record the speedup total execution
time ranging from 2, 3, and 4 processors for the parallel code over the serial
version.
Part 4 - Running your pthreads implementation on gem5 in SE mode
You will need to recompile your pthreads implementation to use m5threads instead of pthreads, since a real pthreads library is not available in SE mode.
First, you need to follow the steps in the gem5 tutorial to clone a copy of gem5 to a linux machine, configure it, and compile it.
If you have a personal linux machine, you will have to make sure the "scons" and "swig" packages are installed.
If you have a reasonably fast PC with a decent amount of memory, but it runs windows, I highly recommend running linux in a virtual machine using the Virtual Box platform. It is quite straightforward to create a VM and install the latest stable version of Ubuntu on it.
If you are using the best-tux.cae.wisc.edu machines, "scons" is globally installed but you will need to issue the following command to pick up a local installed version of swig (you can also add this to the .bashrc file in your home directory, if you wish; that way it is automatically performed whenever you log in):
$ source /home/vhosts/ece757.ece.wisc.edu/.bashrc
The above script will also set GEM5=/home/vhosts/ece.757.wisc.edu/src/gem5 as a shortcut to find the gem5 install on the best-tux machines. You should create your own installation and set this variable to point to that location.
Next, you need to set up an instance of the m5threads library. On the tux machines, you can use the one in /home/vhosts/ece757.ece.wisc.edu/src/m5threads, or you can download your own:
cd towhereyouwantm5threadinstalled
hg clone http://repo.gem5.org/m5threads
cd m5threads
make
Now, to compile your pthreads version of kymppitonni for execution in gem5 SE mode, you will need to compile it using m5threads. There are also some handy hooks that you can use to instruct gem5 to reset and collect stats around your work units. You will need to add an '#include "m5ops.h" to your C file to compile with these hooks.
Refer to eg_pthread.c for an example.
Once you have modified your pthreads file appropriately, compile it statically, Here is an example command line (substitute paths to your own installation of gem and and m5threads if you are not running on the best-tux machines):
gcc kymppitonni_pthread.c -I$GEM5/util/m5/ $GEM5/util/m5/m5op_x86.S -static $GEM5/../m5threads/libpthread.a -o kymppitonni_pthread
Now you can run your program. You will need to specify the number of cores to gem5, and that it is one more than the number of threads. -c specifies the command to run, and -o specifies any arguments to send to the program. For example (this assumes you added a parameter to specify the number of threads to the kyppitonni program as argv[1]):
$GEM5/build/X86/gem5.opt $GEM5/configs/example/se.py --ruby -n 3 -c kymppitonni_pthreads -o "2 10 2500 3500"
Validate your Amdahl's Law projections from Part I by running with 1, 2, 4, 8, and 16 threads and measuring the execution time by examining the gem5 output statistics.
Deliverables:
Note: For graphs #2, #3, and #4 below, you should have 2 data points. Run the program
once with (games=10, minimum cutoff=50, maximum cutoff=7000) and again with
(games=1000, minimum cutoff=50, maximum cutoff=7000).
For the simulated results (#5 below) run with games=10 only.
- Estimated parallel and sequential fractions from Part 1
- A graph showing expected speedup derived from the parallel and sequential
fractions. Graph should range from 1-16 processors.
- A graph showing your achieved speedup from parallelization with
Pthreads for 1-4 threads.
- A graph showing your achieved speedup from parallelization with
OpenMP for 1-4 threads.
- A graph showing your achieved speedup from parallelization with
pthreads running on gem5 SE mode for 1-16 threads.
- A 1-2 paragraph discussion comparing your Pthread results (real and simulated) with the ideal
speedups expected from part 1. Make sure to discuss any observable
differences.
- A 1-2 paragraph discussion comparing your OpenMP results with the ideal
speedups expected from part 1. Make sure to discuss any observable
differences.
- Finally, directly compare the observed speedups of Pthreads and OpenMP.
You should write 1-2 paragraphs discussion which performed better and a
possible explantion for any differences
- Submit your source code, graphs, and writeups using the learn@uw dropbox.