Getting Bigger: How to Scale A Small Application

A few months ago, I was required to list the anagrams in a file. The file consists of words in each line. The output should be the anagram words. A line will present the anagrams, without a particular order. Hence, there will be as many lines as anagram sets.

Simple Design

The very step I took was to think small and make a simple design. The main idea of my code is to transform every word to a canonical form, such that, it is unique for every anagram of that word. So, I can use this canonical form as my key of the key-value map. This map stores the canonical form as key and the list of the words that have the same canonical form as value. Actually, the value is no more than the list of anagrams. After finding all the anagrams, I iterate over the keys of the map and output the values which have more than one word.

The canonical form is achieved by sorting the characters of the word. In my implementation, the upper case is a different character than its lowercase correspondent. If it is not the case, then, the canonical form extracting function can be modified by first converting the input into lower cases. If the locale of the words is different than the default of the Java Virtual Machine (VM), or we would like to be VM independent for that matter, then toLowerCase(Locale locale) should be used.


	private static String getCanonicalForm(String word) {
		char[] content = word.toCharArray();
		Arrays.sort(content);
		return new String(content);
	}

To sort the characters, I preferred the Java implementation of Arrays class. This uses a quicksort variant for the char basic type. It is an in-place and very fast (most probably with an O(nlogn) time complexity) sorting mechanism. But because of it’s recursive nature, it (again most probably) requires O(logn) memory space, since for each recursive call, it needs to allocate another call stack. My sorting choice can be replaced under certain restrictions on the input. For instance, if we are assured that the characters of the words will be strictly from standard ASCII set, or extended ASCII set, or only letters, or only alphanumeric letters, etc. we can use Radix (Bucket) sort. That way, it is possible to sort the characters in O(n) time, at the expense of O(Character Set Size) memory space.

The advantage of this approach is, it is easy to understand and divide into smaller functions, which lead to a lower maintenance cost. Performance wise, we consider each word only once. Assuming that the number of words dwarfs the number of characters in a word, the running time of this iteration will be dominant factor. The time required to sort the characters will only be important when the input words are only a few. Henceforth, I can state that the time complexity of my algorithm is O(n). That is the best possible time complexity that can be achieved for this problem, since we must always investigate each word. The whole code is here:


import java.io.*;
import java.util.*;

public class AnagramPrinter {

	public static void main(String[] args) {
		if (args.length != 1) {
			System.out.println("Usage : java AnagramPrinter <file_name>");
			System.exit(1);
		}
		String fileName = args[0].trim();

		Map<String, List<String>> wordMap = new HashMap<String, List<String>>();
		try (BufferedReader br = new BufferedReader(new FileReader(fileName))) {
			String line;
			while ((line = br.readLine()) != null) {
				String canonicalForm = getCanonicalForm(line);
				List<String> anagramList = wordMap.get(canonicalForm);
				if (anagramList == null) {
					anagramList = new LinkedList<String>();
					wordMap.put(canonicalForm, anagramList);
				}
				anagramList.add(line);
			}
		} catch (FileNotFoundException e) {
			StringBuilder sb = new StringBuilder("File ");
			sb.append(fileName);
			sb.append(" not found!");
			System.out.println(sb.toString());
			System.exit(2);
		} catch (IOException e) {
			throw new RuntimeException(e);
		}

		printAnagrams(wordMap);
	}

	private static String getCanonicalForm(String word) {
		char[] content = word.toCharArray();
		Arrays.sort(content);
		return new String(content);
	}

	private static void printAnagrams(Map<String, List<String>> wordMap) {
		for (String canonicalForm : wordMap.keySet()) {
			List<String> words = wordMap.get(canonicalForm);
			if (words.size() > 1) {
				StringBuilder sb = new StringBuilder();
				for (String w : words) {
					sb.append(w);
					sb.append(" ");
				}
				System.out.println(sb.toString().trim());
			}
		}
	}

}

The memory requirements of my approach is, in fact, the weakest link. Using a map considerably speeds up the process, but that requires a possible memory location for each word. Moreover, my algorithm cannot start to output the anagrams up until all words are exhausted. These two factors should be reconsidered for scalability purposes. Without that, applying this algorithm directly is, at the very least, unfeasible to millions of words and nearly impossible for the case of many billions. Here is why.

Suppose that the average number of characters in a word is 5. Let’s also assume that each character occupies 2 bytes. In the worst case, 10 million words will need approximately 100MB of memory for the keys of the map and another 100MB for the values. So how can I modify my approach?

Scalable Design

I thought this in the context of Map-Reduce and implemented a similar algorithm using Hadoop. The idea in my HadoopAnagramPrinter.java code is more or less the same. Again, we generate the canonical form with the same rules. Since Hadoop first splits the whole input file into independent set of blocks and feeds the data nodes with those, each node running the mapper code outputs the key-value pair, which is simply the canonical form of the word and the word itself.

  public static class CanonicalMapper
      extends Mapper<Object, Text, Text, Text>{

      public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
        String canonicalForm = getCanonicalForm(value.toString());
        context.write(new Text(canonicalForm), value);
      }

      private static String getCanonicalForm(String word) {
        char[] content = word.toCharArray();
        Arrays.sort(content);
        return new String(content);
      }
  }

There can be many mapper nodes running in parallel, because the output of each of them is independent from the others. That is the essence of Map-Reduce which makes the whole process scalable. After all of them finish their jobs, before running the reduce job, Hadoop generates a list of values for each key created in the mapping phase and sorts these lists by their keys. The key is our canonical form and the list contains the words, which are anagrams! Therefore, the duty of Reducer is very simple: if the size of the list is greater than 1, append each word in the list and print them in one line.

  public static class AnagramIterableReducer
      extends Reducer<Text,Text,NullWritable,Text> {

      public void reduce(Text key, Iterable<Text> values, Context context) 
        throws IOException, InterruptedException {
        Iterator<Text> iterator = values.iterator();
        StringBuilder sb = new StringBuilder();
        if (iterator.hasNext()) {
          String word = iterator.next().toString();
          sb.append(word);
          if (iterator.hasNext()) {
            do {
              word = iterator.next().toString();
              sb.append(" ");
              sb.append(word);
            } while (iterator.hasNext());
            context.write(NullWritable.get(), new Text(sb.toString().trim()));
          }
        }
      }
  }

If we put the two codes side by side, we see that the map generation and word investigation part of the basic algorithm is directly turned into the Mapper class activity and the key-value list creation of Hadoop. The anagram printing code in the basic approach is nearly the same as the Reducer class. Upon a closer inspection, we see that the Reducer code can run on different nodes and produce the same result. That is because each list is independent from other lists. A Partitioner, which generates a number of segments of the data, can be implemented. For example, assuming that we have 3 CPUs dedicated to run as reducers, we must have three partitioner tasks. These tasks should divide the “canonical form (key)” – “list of anagrams (value)” input pairs into three non-overlapping segments. The criteria can be either their length of the key, or the alphanumeric order of keys or the number of words in the anagram lists, etc. By this design, not only can we run the Mapper nodes in parallel, but also we will have the same opportunity for the Reducer nodes. Therefore, all our system can be scalable.

This entry was posted on 03/05/2016 at 04:38 and is filed under Uncategorized. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

Merter Sualp's Weblog