Homework 5

Sorting algorithm evaluation

hw5 due date is now 27 march tuesday

In this exercise we are going to do some algorithms engineering. You will have to evaluate the performance of the sorting algorithms we have discussed on various inputs.Though you will have to do some programming, the main focus here is not on the programming part, but on the analysis of your experimental results.

We are trying to figure out which algorithms run faster than others by performing experiments. To be fairly confident of the results we must collect a large amount of data, for a variety of inputs as large as possible. Your eventual goal is to produce a spreadsheet that looks like this:

Input size	Insertion sort	Selection sort	Quicksort
1000	23	52	28
2000	43	72	38
3000	63	82	48
4000	83	92	58
...	..	..	..

What this means is that for random arrays of size 1000 you ran insertion sort a number of times and on average it took 23 seconds for it to sort the input on your computer. The numbers here are completely made up of course, while your numbers should come from actual experiments. Some important details:

Your analysis should deal with all the sorting algorithms we have seen so far: bubblesort, insertion sort, selection sort, mergesort and quicksort. After you automate things this should not be a big problem.
For each input size you should run the experiment a sufficient number of times so that the result is reliable. In particular, statistical noise may be caused both by the fact that we will be using randomly generated inputs and external factors (e.g. other programs running on your computer at the same time).
You must gather a large sample of results, with input sizes ranging at least in the tens of thousands and possibly much higher (as much as your hardware can cope with).
It should be clear by now that I want you to run hundreds or thousands of experiments. This will be hopeless unless you automate the process with appropriate programming! Time invested here will pay off handsomely in the end...

There are two important sub-topics that we need to expand on: what data will you use and how you will measure performance

Inputs: I would like you to consider one kind of input: completely random arrays. You should generate an array of n elements simply by producing n random numbers in the range 1..n. Use the rand() function for this (see here for an example).

Measuring time: The most straightforward way to measure time in C++ is using the time() function. It is contained in the header file <time.h>. time() returns the number of seconds elapsed since 1/1/1970. To measure the time it takes for a function to execute you call time() before the function, then call the function and then call time() again. Subtracting the two values gives us the desired time. See an example here. An important point to keep in mind is that since we are measuring time in seconds data points of just a few seconds are likely to be affected by rounding errors. In other words, if something takes 2.5 seconds to execute the above technique will estimate the answer as 2 or 3 seconds, both of which are 20% off. If it takes 100.5 seconds though, the rounding error is at most 1%.

Goal: After you have run all the experiments you should gather the data in one spreadsheet. Then try to summarize the results graphically using charts (Excel skills will come in handy here, but if you don't like Excel I can suggest other freely available tools for the job). Finally, provide a brief evaluation of the results. Do your experiments agree with the theory we discussed in class? Why? Your answer must be supported by your data!

What to submit: Hand in a technical report including a brief description of your methodology, your charts and your conclusions. As a supplemental technical appendix, also submit the one spreadsheet and any programs you wrote. The report and other materials should be handed both electronically (in pdf format only through blackboard) and in person in class.