Traveling Salesman Attack! (Hadoop and Genetic Algorithms)

For one of my “coffee projects” (things I work on for about 20-30 minutes each morning to warm my brain up and stay amused), I wrote a genetic algorithm attack on the Traveling Salesman problem. Because I’m a Big Data geek, I parallelized it with Hadoop.

(Note: I actually gave a talk on this topic some time back, having implemented this same concept in Ruby using Hadoop Streaming. That implementation contained a number of flaws, some of which I’m attempting to rectify. Also, it’s not as though I’m the first guy to work this angle. I haven’t perused the latest Hadoop/GA work in the field yet, because I wanted to maximize my first-hand learning at this stage – but I’ll be looking to other sources for inspiration as I develop and generalize this code.)

This was just done for my own edification; I haven’t put enough effort into tuning this software or studying its properties for this to be considered a serious work of research (yet). I’m not a specialist or any kind of expert with genetic algorithms, beyond my recreational reading. That said, I had some success at building a gene representation and evolutionary mechanisms that drove a population of solutions for an NP-hard problem toward (mostly) monotonic improvement.

What follows is a brief description of my approach to the problem, with code attached. If you are an expert with GAs, I’d love to hear your feedback on what I could do better.

I haven’t spent any effort in this post describing the Traveling Salesman problem, genetic algorithms, or Hadoop/MapReduce. (It was plenty long already.)  If you’re interested in any of those topics but need to brush up on the basics, check out the preceding links.

More geekery below the fold…

The Map

A “city” in my representation is an (x, y) coordinate pair, with each coordinate having a floating-point value in [0.0, 1.0). The software can take any arrangement of cities, but for testing I used an arrangement with a set of trivial solutions:

(0.0, 0.0) (0.0, 0.2) (0.0, 0.4) (0.0, 0.6) (0.0, 0.8)
(0.0, 1.0) (0.2, 1.0) (0.4, 1.0) (0.6, 1.0) (0.8, 1.0)
(1.0, 1.0) (1.0, 0.8) (1.0, 0.6) (1.0, 0.4) (1.0, 0.2)
(1.0, 0.0) (0.8, 0.0) (0.6, 0.0) (0.4, 0.0) (0.2, 0.0)

Gene Representation

When choosing a non-repeating path among N distinct cities, there are N possible choices for the first city; N-1 choices for the second city, and so on; and only one choice for the final city.

In my representation, a chromosome has N – 1 integer genes. For a gene Gn, valid values are [0, N - n) – thus, the first gene has N valid values, the second gene N – 1, and so on. Each gene corresponds to the 0-based index of a city in an ordered collection of the cities not yet accounted for in the path. For example, if we start with an ordered collection of cities {C0, C1, C2}, and the chromosome {0, 0}:

  • The first city in this candidate solution path is at index 0 in the collection:  C0
  • The second city in the path is at index 0 in the collection of remaining cities, {C1, C2}: C1
  • The final city is trivially C2.

Likewise, the chromosome {1, 1} encodes the path C1 -> C2 -> C0.

Fitness Scoring

Here I’ve implemented the symmetric Traveling Salesman problem (the distance/cost from C1 to C2 is the same as from C2 to C1), and the distance between cities is just the Cartesian distance between their coordinates. The maximum possible distance between any two cities in the landscape I’ve defined is 2, so the maximum/worst-case distance possible for an N-city map is (N – 1) * √2.

A chromosome’s score is this worst-case distance minus the actual distance of the path it encodes; thus, the greater the improvement over the theoretically pessimal case, the higher the chromomsome’s score.

public class ChromosomeScorer {
    protected ArrayList cities;
    protected ArrayList citiesUsed;
    final static double SQRT2 = Math.sqrt(2.0);
    protected double maxDistance;
 
    public ChromosomeScorer(String cityString) {
        cities = getCitiesFromString(cityString);
        citiesUsed = new ArrayList(cities.size());
        maxDistance = SQRT2 * (cities.size() - 1);
        for (int i = 0; i < cities.size(); ++i) {
            citiesUsed.add(Boolean.FALSE);
        }
    }
 
    protected double score(String chromosomeString) {
        ArrayList route = getRouteFromChromosome(chromosomeString);
        double distance = 0.0;
        for (int i = 1; i < cities.size(); ++i) {
            int city1Index = route.get(i);
            int city2Index = route.get(i - 1);
            double[] city1 = cities.get(city1Index);
            double[] city2 = cities.get(city2Index);
            double city1x = city1[0]; double city1y = city1[1];
            double city2x = city2[0]; double city2y = city2[1];
 
            distance += Math.sqrt(((city1x - city2x) * (city1x - city2x)) + ((city1y - city2y) * (city1y - city2y)));
        }
 
        return maxDistance - distance;
    }
 
    // city string parsing code elided
}

The ArrayList is a list of flags indicating whether a city has been added to the route yet; it’s for implementing the chromosome parsing I described above. In my example, I described the process in terms of pulling elements from an ordered collection, and performing further operations on the reduced collection. Regenerating the ordered list of cities and manipulating it in this way every time I wanted to score a chromosome sounded computationally expensive; flipping booleans sounded cheaper. (I could probably get cheaper still using actual bit flags, but I was trying to keep the code both readable and writable at this stage. Optimizations will come later.)

Initial Population

The following code creates an initial population of chromosomes (encoded as Strings) and dumps them to a requested file:

protected static void createInitialPopulation(FSDataOutputStream populationOutfile, final int populationSize, final int numCities) throws IOException {
    for (int i = 0; i < populationSize; ++i) {
        for (int j = 0; j < (numCities - 1); ++j) {
             if (j > 0) {
                populationOutfile.writeBytes(" ");
            }
            populationOutfile.writeBytes(String.format("%d", random.nextInt(numCities - j)));
        }
        populationOutfile.writeBytes("\n");
    }
    populationOutfile.close();
}

At a later time, I’ll experiment with using Hadoop’s ArrayWritable or other serialization schemes for more efficient representation of the chromosomes; at this stage, it was important to me to keep the data human-readable.

There’s a mapper dedicated to scoring generation 0:

public class ScoringMapper extends Mapper {
    private Text outKey = new Text();
    private DoubleWritable outValue = new DoubleWritable();
 
    protected ChromosomeScorer scorer = null;
 
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String incomingValue[] = value.toString().split("\t");
        String chromosome = incomingValue[0];
        double score = 0.0;
        if (incomingValue.length > 1) {
            String scoreString = incomingValue[1];
            try {
                score = Double.parseDouble(scoreString);
            } catch(NumberFormatException nfe) {
                // weird - but let's try re-generating the score
                score = scorer.score(value.toString());
            }
        } else { // we only go through the scoring process if we don't have a map
            score = scorer.score(value.toString());
        }
 
        outValue.set(score);
	outKey.set(chromosome);
	context.write(outKey, outValue);
    }
 
    // setup code elided
}

Subsequent Generations

I managed to cram the rest of the operations for each generation into one mapper and one reducer – really, into one reducer, since the map function just randomly assigns chromosomes to groups:

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        outKey.set(random.nextInt(numBins));
        context.write(outKey, value); // shuffle
    }

So why am I shuffling the chromosomes thus? Most methods of population selection involve having some view over the data; the naïve method of truncation selection is an extreme example, where you need to sort the entire population, and keep some fraction of the highest-scoring individuals. This is hard to parallelize, and not especially scalable for large populations.

Other selection methods such as fitness proportionate selection, however, look like they’d retain their properties well if the total population was divided into large, random subpopulations. (I have yet to do a detailed mathematical treatment of this, but some combination of intuition and cocktail napkin calculation has convinced me of this well enough to forge ahead.) To this end, I’ve set up the system to randomly distribute the full population into a user-settable number of subpopulations; the size of these subpopulations can be tuned to fit within the memory allocated to a reducer instance. (Naturally, the ideal would be to have a subpopulation just large enough to be handled by one reducer instance.) Assuming that the subpopulations are sufficiently large, the random distribution should retain the aggregate characteristics (distribution of fitness scores, &c) of the total population, and give similar results under fitness proportionate selection.

After this subdivision of the population, a reducer instance receives a subpopulation, and is then responsible for selecting survivors from that population, breeding replacement members of the population from the survivors, and scoring the new population members.

    @Override
    protected void reduce(VIntWritable key, Iterable values, Context context) throws IOException, InterruptedException {
        TreeSet sortedChromosomes = getSortedChromosomeSet(values);
        normalizeScores(sortedChromosomes);
 
        int survivorsWanted = (int) ((double) sortedChromosomes.size() * survivorProportion);
        Set survivors = new HashSet(survivorsWanted);
 
        while (survivors.size() < survivorsWanted) {
            survivors.add(selectSurvivor(sortedChromosomes));
        }
 
        ArrayList parentPool = new ArrayList(survivors);
        while (survivors.size() < desiredPopulationSize) {
            survivors.add(makeOffspring(parentPool));
        }
 
        Iterator iter = survivors.iterator();
        while (iter.hasNext()) {
            ScoredChromosome sc = iter.next();
            outKey.set(sc.chromosome);
            outValue.set(sc.score);
            context.write(outKey, outValue);
        }
    }

Selection: The first step in fitness proportionate selection is to sort the new members according to their score, ascending. After that, survivors are selected from the population randomly, weighted by their scores (the higher a chromosome’s score, the more likely it is to survive).

Repopulation: Surviving chromosomes are used to breed back up to our original population numbers. I used single crossover and random mutations (maximum of one mutation per chromosome per generation):

    protected ScoredChromosome makeOffspring(ArrayList parentPool) throws InterruptedException {
        int parent1Index = random.nextInt(parentPool.size());
        int parent2Index = parent1Index;
        while (parent2Index == parent1Index) {
            parent2Index = random.nextInt(parentPool.size());
        }
 
        ScoredChromosome parent1 = parentPool.get(parent1Index);
        ScoredChromosome parent2 = parentPool.get(parent2Index);
 
        ScoredChromosome offspring = crossover(parent1, parent2);
        if (random.nextDouble() < mutationChance) {
            mutate(offspring);
        }
 
        offspring.score = scorer.score(offspring.chromosome);
 
        return offspring;
    }

Scoring: Finally, all unscored population members are scored in preparation for the next round.

This process is repeated for as many generations as is specified at the beginning of the run.  (Later, I plan to allow the number of generations to be unspecified, and have the algorithm terminate when the maximum score starts to settle/converge.)

Parameters

  • Number of cities
  • Population size
  • Proportion of survivors in each generation
  • Subpopulation size (see above for rationale for dividing up the population, and thoughts on tuning)
  • Chance of random mutation
  • Number of generations

Results

I’ve thrown the data for min/max/mean scores for each generation of each run into a Google spreadsheet, and made some simple graphs.

The first run was done with 10,000-member population, 30% survival chance, 500 generations. There’s an unmistakable fitness increase over time, and it gets pretty close to the theoretical maximum for my test city (~23.066).

The second run was done with a 100,000-member population and 50% survival chance, 500 generations. It gets off to a slower start, but there’s a burst of improvement late in the 500-generation run and it gets pretty close to the optimum.

The difference in early-stage performance of the second run had me kicking myself for changing two inputs. I switched back to a 30% survival chance but kept the larger population. This run landed in the same neighborhood as the others, and looked like it still had room to settle.

Interesting to note on all 3 runs is that all start with a few generations of quick improvement, then flatten out, then have an inflection point after which there’s a burst of quick improvement. (Punctuated equilibrium?) All runs seem to still be on the climb; I suspect that a moving average over a few generations and/or some statistical tests for convergence would give a clearer picture of what’s happening with the max trend; the average was still definitely on the rise at the end of all three runs.

Notes

Test coverage could be improved, especially with the use of mocks. I may move to CDH3 and try MRUnit to ease test design. As it stands I’m taking most of the logic out of the mappers and reducers and just using the overridden map() and reduce() functions to coordinate the calls that do the actual work, largely to make it easier to test the logic in isolation.

For the sake of speed, I could probably take out some of the square roots and do my scoring using the squares of the distances.

On my 13″ MacBook Pro with Core Duo processor running 2 mappers and 2 reducers, a population of 10,000 took 45-60 seconds to cull, breed, and score. The time per generation didn’t change appreciably when I went from 10,000 chromomsomes per generation to a population of 100,000 – which means most of my time is still spent in Hadoop overhead, which means that even on my little laptop, I could put up much bigger populations (or much bigger maps) and still run 500 generations in about 6 hours.

I haven’t bothered taking the simulation parameters in via the command line yet; this would be an obvious improvement, and probably the next thing I’ll do. I was a lot more focused on getting the GA code right and checking that it behaved as expected than I was with making a usable tool. I’ve gotten the itch to generalize and grow this thing, though, so I’ll need to start making it usable & configurable & scriptable.

Clearly, this is not a general toolkit for GA, or even GA applied to NP-hard problems. (I don’t like to generalize without at least two cases.) I do see plenty of places where it is generalizable, though, and will starting doing so as soon as I pick a second problem to attack. A long-term goal is to factor out a library/API for doing GA on Hadoop.

Genetic algorithms are not one single thing – there are multiple strategies for culling populations, doing crossover/breeding, etc. A longer-term design goal is to make each of those variable pieces of the algorithm into pluggable strategies selectable from the command line.

There’s optimization to do all over the place, e.g. with the multiple copies of data kept in the reducer. I chose the data structures I used for ease of implementing the algorithm – e.g., picking a Set to hold survivors so as to avoid picking up duplicate survivors – with the intent to get working software quickly, and optimize last.

The Code

The code, such as it is, is in a public repository on GitHub. More to come.

Share
This entry was posted in hadoop, Java, Just for Kicks and tagged , , , . Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre lang="" line="" escaped="" highlight="">