HW 5 : Evolutionary Algorithms (Introduction to Classes and Dynamic Memory Allocation)
DUE: SUNDAY, May 6th 2007, 11:59PM
This homework will give you a chance to work with classes and dynamically allocating memory etc. The program you will write will be a algorithm that simulates an evolutionary process. Although this is an advanced programming topic; the entire subject matter/program will be explained and laid out for you.
Here is the base template file you will be working with: hw5_template.cpp
Submission: Please submit your completed hw5.cpp code as one (1) file using your UNIX account and the hw5-submit command:
Points: the point value for this assignment is 6.
As always, if hw5-submit fails, you can e-mail me your hw5.cpp code : chipp@sci.brooklyn.cuny.edu
Background
Evolutionary Algorithms takes mechanisms that are inspired by biology and evolution: reproduction, mutation, recombination, and natural selection to evolve and simulate an artificial population. Evolutionary Algorithms are used in Genetic Programming and Algorithms to optimize solutions to problems that are otherwise too complex for explicit solutions. Traditional examples include finding the optimal set of values for circuit components in large non-linear systems. More recent examples of genetic programming is used in evolutionary filter design, video games designs, and in games of life and large multi-agent systems.
Implementation of an Evolutionary Algorithm
Population of Genes
An evolutionary algorithm will operate on a population of objects which we will call Genes. When our simulation starts, a large pool of Genes (i.e. 5000) will be created. Each Gene will have a genotype, and in our case the genotype will be a string of characters (which characters will be defined at run time according to a Policy).
Example:
Here is a diagram of our Gene object. Within the Gene we will have a string of characters that define the "type" of Gene it is according to the genotype.
Gene's genotype | GCGGGCGTTT |
Fitness of a Generation
We will group our population of Genes into another object called a Generation. This way a popultion of Genes can be associated with each other as belonging to the same Generation. Within a Generation, we will assign a fitness (i.e. a number describing how fit an individual Gene is, which describes how likely the Gene will survive to the next Generation). The fitness will be defined by a Policy that will be defined at run time.
Example of a Policy:
Trait (i.e. a character in the genotype) | Fitness Score |
A | -0.5 |
T | 1.0 |
G | 0.5 |
C | -1.0 |
So the fitness of our previous Gene (GCGGGCGTTT) is :
Total Fitness Score (GCGGGCGTTT) = 0.5 - 1.0 + 0.5 + 0.5 + 0.5 - 1.0 + 0.5 + 1.0 + 1.0 + 1.0 = 3.5
Policy File
In order to change the Policy at runtime we will encode the Policy in a simple text file with number of traits followed by the trait-fitness score pairs in a format similar to the attribute pairs in XML. Here is the above policy encoded for the Policy file:
4 A="-0.5" T="1.0" G="0.5" C="-1.0"
Here is an example Policy file.
Our Evolutionary program will at runtime open the Policy and read the traits and score pairs into a Policy Object. Here the class we will use to store the Policy Object:
class Policy { private: double * scores; char * traits; int size; public: Policy(char * policy_file); ~Policy(); double getScore(char trait) const; char * getTraits() const { return traits; } int getSize() const { return size; } };
At runtime both the traits and the scores need to by dynamically created to the size (i.e. the member attribute size will store the size of the Policy, the number of traits and values) specified in the Policy File (i.e. the first number, 4 in the case of our example). The constructor (i.e. Policy(char * policy_file) ) will open the file specified by it's string arguement and read the trait and score pairs into their respective arrays.
This table demonstrates how the traits and the scores are stored. They are associated by the same index into both arrays:
0 | 1 | 2 | 3 | |
traits | A | T | G | C |
scores | -0.5 | 1.0 | 0.5 | -1.0 |
The score that is associated to a given trait is accessed through the member function getScore(char trait).
Anatomy of a Gene
We will store the information about one Gene in the Gene class:
class Gene { private: static const int SIZE = 10; string genotype; double fitness; int generation; bool mutant; public: Gene(Policy & policy, int gen_id); Gene(Gene * parent1, Gene * parent2, int gen_id); double calcFitness(Policy & policy); double getFitness() const { return fitness; } void mutate(Policy & policy); };
Members Attributes:
SIZE - In our simulation we will assume that all of our Gene's have a genotype of a constant SIZE (given by this static const int)
genotype - This string object contains the genotype comprised of various traits in the policy.
fitness - This is the calculated fitness given above in the example. It is calculated in the calcFitness() member function.
generation - This identifies which generation this Gene is from. Once a Gene is created and the generation is set, it is not changed.
mutant - Our simulation will support mutation (described below in the evolutionary process).
Constructors:
There are two constructors given for the Gene class.
Gene(Policy & policy, int gen_id);
Constructs a Gene randomly according to the Policy object (passed in by reference to avoid any nasty copying) and sets the generation according to the gen_id. (The first generation will be Generation 0, so all of the Gene's generation attributes will be 0). The Gene's genotype is generated out of a random selection of the legal traits (i.e. characters) provided in the Policy object.
Gene(Gene * parent1, Gene * parent2, int gen_id);
This is a reproductive constructor. This Gene object is created from two parent Gene's (passed by pointer, since Gene's will be dynamically created objects). The reproduction process is best described in the follow diagram:
Parent1 | A | T | T | G | C | A | T | G | A | A |
Parent2 | T | A | T | T | A | G | G | T | A | G |
Child Gene | A | T | T | G | C | A | T | T | A | G |
A split-point is randomly chosen with in the Gene. In this case, the split was between the 7th and the 8th trait in the Gene (starting to count with 1). From Parent1 the first half of the split is chosen (i.e. the blue traits in the Parent1 Gene) and is appended to the second half of the split from the Parent2 Gene (i.e. the blue traits in the Parent2 Gene) to result in the Child Gene's genotype (i.e. the two blue traits are joined).
Building a Generation
Populations of Gene's are stored in a Generation class. Which is given below:
class Generation { private: int id; static const int GEN_SIZE = 100; Gene * population[GEN_SIZE]; string gen_prefix; public: Generation(Policy & policy, char * prefix); Generation(Policy & policy, Generation * prev); ~Generation(); void sort(); void calcFitness(Policy & policy); void print() const; };
Member Attributes:
id - This stores the current Generation "version" since the initial base Generation. The first Generation (i.e. the randomly generated Generation) is Generation 0. Subsequent Generations are numbered in order, Generation 1 are evolved from Generation 0.
GEN_SIZE - This constant is the number of Genes in this population. It is set at compile time.
population - This array contains pointers to the dynamically allocated Genes. When the Generation is either initially generated or is evolved from a previous Generation, the Genes are referenced from this population array.
gen_prefix - This string will precede the names of the generation output files generated by the print() function.
Member Functions
sort() - This function sorts the Genes in the Generation's population in order from maximum fitness to minimum fitness.
calcFitness() - This function calculates the Fitness of all of the Gene's in the population (i.e. calls the Gene's calcFitness function in the population).
print() - After a Generation is sorted, it's Gene's are printed out to a Generation file. The name of the file will be:
< gen_prefix > < generation number > ".txt"
So if gen_prefix = "gen", and we are printing Generation 1, the generation file name will be "gen1.txt".
The Generation file contains the details about every Gene, printed out in order maximum fitness to minimum fitness. Every line is a Gene, and has the following format:
<rank> <gen_id> <genotype> <fitness> <* for mutant>
Here is the start of an example Generation file:
1 1 GTTTTTTGGT 8.5 2 1 TGTGGTTTTG 8 3 1 GTTGTGGGTT 7.5 4 1 TGTGGTGTTG 7.5 5 1 TTATTTTGGT 7.5 6 1 GTTGTGGGTT 7.5 7 1 GTTAGTTTTG 7 8 1 TTTCGTTTTG 7 9 1 TTGCGTTTTT 7 10 1 GTATTTTGGT 7
And an extra that has been mutated (a mutant) will be shown as:
64 1 TTGCGTTTCG 4.5 *
These lines are from this example output file.
Constructors
Like the Gene class, there are two constructors for the Generation:
Generation(Policy & policy, char * prefix);
This is the constructor for the base Generation (Generation 0), it dynamically generates GEN_SIZE number of Genes. The policy is used in generating all of the Genes (i.e. randomly according to the traits in the policy), and the prefix is stored for use in generating the output Generation files.
Generation(Policy & policy, Generation * prev);
This constructor generates a Generation from a previous Generation (accessed through the Generation pointer prev) according to the evolutionary process. Here are the steps to generate a new Generation from the previous one:
1. Selection
Assuming the popultion of the Generation is sorted according to fitness (from maximum to minimum). A top percentage (shown in the code as a double constant FITNESS_PERCENT) of the Gene population is copied over to the next Generation's population, this is the fittest group of the Generation.
2. Reproduction
The remainder of the next Generation's population is generated by combining two members of the fittest group as parents to generate a new Gene (i.e. using the Gene(Gene * parent1, Gene * parent2, int gen_id); reproductive constructor).
3. Mutation
Finally, a small percentage of the population (shown in the code as the double constant MUTATION_PERCENT) is mutated. To mutated a Gene, one trait (i.e. letter) is randomly selected in the genotype. This trait is then replaced with a randomly generated trait from the policy. Finally, a mutated gene is marked by setting the mutant flag to true.
Format of Program Command Line Arguments
Your program (shown below as the binary hw5) will have the following command line format:
hw5 <policy file> <gen file prefix> <num generations>
<policy file> is the name of the policy file that is used to generate the Policy object.
<gen file prefix> is a string that is used in the file name of every Generation file that is generated.
<num generations> is how many Generations beyond the first Generation (Generation 0) that are evolved.
This format is given in the main() function of the code.
Expected Output
After a new Generation is evolved from a previous Generation, the fitness of all of the Genes in the new Generation is calculated. Then the population is sorted according to the fitness of its Generation from highest to lowest. Once sorted, the Generation is printed to the Generation file numbered appropriately for it's Generation.
Here is a series of 10 Generation's output files so you can get an idea of the format and the progression of evolution.
gen0.txt
gen1.txt
gen2.txt
gen3.txt
gen4.txt
gen5.txt
gen6.txt
gen7.txt
gen8.txt
gen9.txt
gen10.txt
Extras
1. Symbolize the Generation (1 point)
Devise a Graphical Representation of a Gene based on the example policy file given above. Some traits may be represented straightlines etc... Use the Postscript format that you know to generate a valid Postscript file showing a Generation of Genes (i.e. instead of a generation file). Include the fitness value underneath or next to every Gene (i.e. in Postscript it is not hard to print text, you will have to do a little research to figure out how).
2. Dynamically Sized Population (1 point)
Create a version of the program that supports a dynamically sized population for every Generation. Enhance the Policy file format to support parameters that will drive the changing size of population.
Last Updated 4/29/2007