May | 2016 | Merter Sualp's Weblog

Archive for May, 2016

bit.ly URL Shortener Workflow

30/05/2016

It is here but I would like to share its story.

Ten years ago, I used to play Magic the Gathering… A lot and with excitement… I created decks writing the card information on plain papers. If the deck is fun to play, I bought them from eBay auctions (another excitement) and play with my friends day and night. What I still like about Magic is not the sheer power of individual cards. Some cards are powerful or insanely powerful by themselves. But what I really like is the interaction between the cards and their effect to the overall game. The combos are much more bigger then their card components. I feel that this is exactly what Workflow does between iOS apps.

I am trying to do my real work as much as possible on iPad. I follow especially Federico Viticci in that regard. Blogging a few times a month is one of my self-fulfilling tasks. I used WordPress and Blogo Mac apps, iOS app and web editor. None of them suits me. So I decided to prepare my simple blogging steps on iPad. Again, Mr. Viticci showed me the way.

I started with using Markdown. I am using 1Writer and Drafts. However, in this post, I will talk about a simple workflow I use to shorten URLs and how it came to life.

Every post I write contains a few links to other internet sites. This is the norm for most of the bloggers. I use bit.ly for this purpose. In fact, I have found a way to do that with Workflow.

wrong way

As we can see, we need our own OAuth access token to make bit.ly shorten the URL for us. To have that, we login (or create) our bit.ly account and get our access token from here. We will use it in the workflow.

I got mine but no matter what, it failed with the proposed solution above. The problem was that, “Get Contents of Web Page” action creates a web archive output which is a rich text but does not have the required information in an accessible form. So I tried to modify it while learning the Workflow actions. It is advised to use “Get Contents of URL” action. I replaced it with that and bingo!

right way

I would like to learn what is going on behind the scenes. So let’s go over it together. As the first step, we make the save_link API call and get the contents. It is a text in JSON format. In our example, I wanted to convert the long URL http://www.grandprix.com/race/r941racereport.html. The response we have contains the data that we will use, status code and text, as explained in the API document.

{
    "status_code": 200,
    "data": {
        "link_save": {
            "link": "http://bit.ly/1qW7Coh",
            "aggregate_link": "http://bit.ly/1RDu8rY",
            "long_url": "http://www.grandprix.com/race/r941racereport.html",
            "new_link": 1
        }
    },
    "status_txt": "OK";
}

To access the “link” information, this text output should be converted to a dictionary first. Since it’s a JSON, that is ok to use “Get Dictionary from Input” action. When it is done, we can get the data part by using “Get Value for Key” action by setting the key value to “data”. This logic is important, since we will use it extensively whenever we try to extract the information from JSON or XML files. Now we have this:

{
    "link_save": {
        "link": "http://bit.ly/1qW7Coh",
        "aggregate_link": "http://bit.ly/1RDu8rY",
        "long_url": "http://www.grandprix.com/race/r941racereport.html",
        "new_link": 1
    }
}

We are getting closer to the “link”. The same mechanism can be applied but this time, our key will be “link_save”.

{
    "link": "http://bit.ly/1qW7Coh",
    "aggregate_link": "http://bit.ly/1RDu8rY",
    "long_url": "http://www.grandprix.com/race/r941racereport.html",
    "new_link": 1
}

Applying the “Get Value for Key” one last time, we get the bit.ly URL created for us. It can be send to the other applications by “Copy to Clipboard” action. This workflow can be found in here.

While digging into the API Documentation, I found that there are two main methods to make a URL short. The one that I explained above is the complex one. The other is so simple. In that method, we use the “shorten” API call. I read the details and saw that there is no need to generate a JSON and try to get the short URL searching key by key by key… All we need to do is tell in the shorten API call that the output should be in txt format! The content of the output contains only the short URL. I also uploaded that workflow to the gallery.

simple way

Hope it makes your lives simpler than before. I am planning to post about another workflow which combines 1Writer javascripts to make a successful post to WordPress.

Posted in Uncategorized | Leave a Comment »

Getting Smaller: Splitting Evernote Notes

13/05/2016

There is an interesting feature (I hope!) that Evernote has: It searches a keyword blazingly fast within many notes but it it is painfully slow to search the same keyword in a single huge note. This makes me write many but small notes.

However, I have a one big log file. A few days ago, I decided to split it into monthly journals, as in Drafts 4. My original log file has the following format:

dd.MM.yyyy - hhmm
* Bla bla bla activity
* More activity...

dd.MM.yyyy - hhmm
* Another activity on the same day but in a different time frame

dd.MM.yyyy - hhmm
* The activity started the day after

The preferred format I used in Drafts 4 was something like this:

## yyyy-MM-dd
@hhmm
Bla bla bla activity
More activity...

@hhmm
Another activity on the same day but in a different time frame

## yyyy-MM-dd
@hhmm
The activity started the day after

Now, the important thing is that, the full log file should be separated into non-overlapping parts. Each part contains the logs of a single month. And what should I do with the monthly logs? I have three options:

Outputting them to different text files
Sending each of them as mails to my Evernote account
Using Evernote API to inject them as notes

For the time being, I have chosen the second one. I am planning to play with Evernote API in the near future.

No matter what our approach is, we need to use jar packages designed for specific purposes. Sending mails requires javax.mail.jar or Evernote API is presented within evernote-api.jar. Dependency management is a big deal so I wanted to keep it as simple as possible with a proven tool: Maven. Actually, Maven is more than a dependency manager but for my purposes, it will fill that spot.

In this application, I will use Eclipse as my IDE. Creating a new project which will use Maven is simple. Just create a Maven Project. It will generate the directory structure and the basic pom.xml file, which is crucial in Maven operations. It is a three step process. You can find mine below:

Maven-01

Maven-02

Maven-03

At the end of it, we will see the project in our Eclipse workspace. There are a few things to mention. First, the directory structure is designed as in Maven’s preferred style. You can change it but do not forget to make necessary adjustments to Build Path. The source codes are in src/main/java folder. src/test/java contains the unit tests. src/main/resources and src/test/resources hold the setting and other files necessary for running the application or tests. Another point of interest is the pom.xml file. Here, we will add our dependencies. Moreover, this file tells Maven about the java requirements. As you can see, the default JRE version is 1.5.

Maven-04

So, our project is ready but empty. We can either create new packages and class in it, or copy a pre-written source code. I had mine before the project creation. Therefore, I will put it inside the source code folders.

Maven-05

Maven-07

You can link files if you want. Then, the workspace will refer to the original location of the file. I generally copy, so the workspace is coherent within itself.

Ups… We have a problem.

Maven-08

With Java 7, a convenient way to write try-with-resources is introduced. This way, a resource implementing java.lang.AutoClosable can be used just after try, and this will ensure that the resource will be properly closed no matter the try failed or not. There is no need to explicitly close the resource. For more information, I advise you to visit this Oracle Java Tutorial.

Eclipse says it can solve it. Great! But DO NOT DO THAT! If you only use that Eclipse workspace, or you somehow share your project properties with other Eclipse workspaces, that may be fine. But generally, that is not the case. Maven has its own way to solve it and it is a better one. Remember the pom.xml file? We can also define properties there. In our case, the source and target should be at least 1.7. My preference is 1.8. Problem solved.

Maven-09

To add a dependency, we need to modify the pom.xml file and Update the project. For example, to integrate JUnit as our unit test framework, we should add it’s packages. Also, to run the code on a simple input, I made a small test file and put it under src/main/resources.

Maven-11

To update the project, we can use the following path:

Maven-16

After the update, the content of our project will be like this. Note the jar files under Maven Dependencies.

Maven-12

Since my code expects an input file, to it with the test file, we will add a run/debug configuration in Eclipse.

Maven-10

The output of the code is directly printed to console. That may be good for early testing phase, but it is undesirable for a production environment, even in my simple project. What I want is to send this output as a set of e-mails to my Evernote account. For that purpose, we will visit the central Maven repository to learn the necessary packages and put them inside our project.

Maven-13

The most widely used mail API is javax.mail.

Maven-14

We get the Maven dependency information and directly paste it inside the pom.xml file.

Maven-15

We have access to the methods for sending mail. I will send from my gmail account to my Evernote account. I know both mail addresses. That is also fine. So, how can I send them? The mkyong tutorial in here provides a great example. I have used the SSL version.

If you are also a gmail user, you will most probably encounter a problem as I did. As a security measure, gmail may prohibit sending emails without logging through a client application. I solved it by changing my default settings, as explained in here.

That’s all. You can get my code from github any time you like. Hope it will also save your time as mine.

Posted in Uncategorized | Leave a Comment »

Getting Bigger: How to Scale A Small Application

03/05/2016

A few months ago, I was required to list the anagrams in a file. The file consists of words in each line. The output should be the anagram words. A line will present the anagrams, without a particular order. Hence, there will be as many lines as anagram sets.

Simple Design

The very step I took was to think small and make a simple design. The main idea of my code is to transform every word to a canonical form, such that, it is unique for every anagram of that word. So, I can use this canonical form as my key of the key-value map. This map stores the canonical form as key and the list of the words that have the same canonical form as value. Actually, the value is no more than the list of anagrams. After finding all the anagrams, I iterate over the keys of the map and output the values which have more than one word.

The canonical form is achieved by sorting the characters of the word. In my implementation, the upper case is a different character than its lowercase correspondent. If it is not the case, then, the canonical form extracting function can be modified by first converting the input into lower cases. If the locale of the words is different than the default of the Java Virtual Machine (VM), or we would like to be VM independent for that matter, then toLowerCase(Locale locale) should be used.


	private static String getCanonicalForm(String word) {
		char[] content = word.toCharArray();
		Arrays.sort(content);
		return new String(content);
	}

To sort the characters, I preferred the Java implementation of Arrays class. This uses a quicksort variant for the char basic type. It is an in-place and very fast (most probably with an O(nlogn) time complexity) sorting mechanism. But because of it’s recursive nature, it (again most probably) requires O(logn) memory space, since for each recursive call, it needs to allocate another call stack. My sorting choice can be replaced under certain restrictions on the input. For instance, if we are assured that the characters of the words will be strictly from standard ASCII set, or extended ASCII set, or only letters, or only alphanumeric letters, etc. we can use Radix (Bucket) sort. That way, it is possible to sort the characters in O(n) time, at the expense of O(Character Set Size) memory space.

The advantage of this approach is, it is easy to understand and divide into smaller functions, which lead to a lower maintenance cost. Performance wise, we consider each word only once. Assuming that the number of words dwarfs the number of characters in a word, the running time of this iteration will be dominant factor. The time required to sort the characters will only be important when the input words are only a few. Henceforth, I can state that the time complexity of my algorithm is O(n). That is the best possible time complexity that can be achieved for this problem, since we must always investigate each word. The whole code is here:


import java.io.*;
import java.util.*;

public class AnagramPrinter {

	public static void main(String[] args) {
		if (args.length != 1) {
			System.out.println("Usage : java AnagramPrinter <file_name>");
			System.exit(1);
		}
		String fileName = args[0].trim();

		Map<String, List<String>> wordMap = new HashMap<String, List<String>>();
		try (BufferedReader br = new BufferedReader(new FileReader(fileName))) {
			String line;
			while ((line = br.readLine()) != null) {
				String canonicalForm = getCanonicalForm(line);
				List<String> anagramList = wordMap.get(canonicalForm);
				if (anagramList == null) {
					anagramList = new LinkedList<String>();
					wordMap.put(canonicalForm, anagramList);
				}
				anagramList.add(line);
			}
		} catch (FileNotFoundException e) {
			StringBuilder sb = new StringBuilder("File ");
			sb.append(fileName);
			sb.append(" not found!");
			System.out.println(sb.toString());
			System.exit(2);
		} catch (IOException e) {
			throw new RuntimeException(e);
		}

		printAnagrams(wordMap);
	}

	private static String getCanonicalForm(String word) {
		char[] content = word.toCharArray();
		Arrays.sort(content);
		return new String(content);
	}

	private static void printAnagrams(Map<String, List<String>> wordMap) {
		for (String canonicalForm : wordMap.keySet()) {
			List<String> words = wordMap.get(canonicalForm);
			if (words.size() > 1) {
				StringBuilder sb = new StringBuilder();
				for (String w : words) {
					sb.append(w);
					sb.append(" ");
				}
				System.out.println(sb.toString().trim());
			}
		}
	}

}

The memory requirements of my approach is, in fact, the weakest link. Using a map considerably speeds up the process, but that requires a possible memory location for each word. Moreover, my algorithm cannot start to output the anagrams up until all words are exhausted. These two factors should be reconsidered for scalability purposes. Without that, applying this algorithm directly is, at the very least, unfeasible to millions of words and nearly impossible for the case of many billions. Here is why.

Suppose that the average number of characters in a word is 5. Let’s also assume that each character occupies 2 bytes. In the worst case, 10 million words will need approximately 100MB of memory for the keys of the map and another 100MB for the values. So how can I modify my approach?

Scalable Design

I thought this in the context of Map-Reduce and implemented a similar algorithm using Hadoop. The idea in my HadoopAnagramPrinter.java code is more or less the same. Again, we generate the canonical form with the same rules. Since Hadoop first splits the whole input file into independent set of blocks and feeds the data nodes with those, each node running the mapper code outputs the key-value pair, which is simply the canonical form of the word and the word itself.

  public static class CanonicalMapper
      extends Mapper<Object, Text, Text, Text>{

      public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
        String canonicalForm = getCanonicalForm(value.toString());
        context.write(new Text(canonicalForm), value);
      }

      private static String getCanonicalForm(String word) {
        char[] content = word.toCharArray();
        Arrays.sort(content);
        return new String(content);
      }
  }

There can be many mapper nodes running in parallel, because the output of each of them is independent from the others. That is the essence of Map-Reduce which makes the whole process scalable. After all of them finish their jobs, before running the reduce job, Hadoop generates a list of values for each key created in the mapping phase and sorts these lists by their keys. The key is our canonical form and the list contains the words, which are anagrams! Therefore, the duty of Reducer is very simple: if the size of the list is greater than 1, append each word in the list and print them in one line.

  public static class AnagramIterableReducer
      extends Reducer<Text,Text,NullWritable,Text> {

      public void reduce(Text key, Iterable<Text> values, Context context) 
        throws IOException, InterruptedException {
        Iterator<Text> iterator = values.iterator();
        StringBuilder sb = new StringBuilder();
        if (iterator.hasNext()) {
          String word = iterator.next().toString();
          sb.append(word);
          if (iterator.hasNext()) {
            do {
              word = iterator.next().toString();
              sb.append(" ");
              sb.append(word);
            } while (iterator.hasNext());
            context.write(NullWritable.get(), new Text(sb.toString().trim()));
          }
        }
      }
  }

If we put the two codes side by side, we see that the map generation and word investigation part of the basic algorithm is directly turned into the Mapper class activity and the key-value list creation of Hadoop. The anagram printing code in the basic approach is nearly the same as the Reducer class. Upon a closer inspection, we see that the Reducer code can run on different nodes and produce the same result. That is because each list is independent from other lists. A Partitioner, which generates a number of segments of the data, can be implemented. For example, assuming that we have 3 CPUs dedicated to run as reducers, we must have three partitioner tasks. These tasks should divide the “canonical form (key)” – “list of anagrams (value)” input pairs into three non-overlapping segments. The criteria can be either their length of the key, or the alphanumeric order of keys or the number of words in the anagram lists, etc. By this design, not only can we run the Mapper nodes in parallel, but also we will have the same opportunity for the Reducer nodes. Therefore, all our system can be scalable.

Posted in Uncategorized | Leave a Comment »

Merter Sualp's Weblog

Archive for May, 2016

bit.ly URL Shortener Workflow

Getting Smaller: Splitting Evernote Notes

Getting Bigger: How to Scale A Small Application

Simple Design

Scalable Design

Pages

Archives

Categories