Merter Sualp's Weblog

Engineer’s Dilemma

04/07/2016

One of my recent assignments involved uploading some data to the web. The data consist of repo auctions carried out by the Central Bank of the Republic of Turkey. They reside inside a big HTML table. That table is inside an XML file, which is served by IBM Web Content Manager.

A Brief History

The current insertion procedure is not user-friendly. The users get the new repo auction results from the database by means of an Excel VBA application. Each value in a record is placed in successive column cells. Later, they copy those rows and paste them into a draft. As the last step, they approve it, which publishes the data.

Our users wanted this process to be simplified. The web content team offered the possibility of replacing the XML file with the Excel file itself. That way, the users could upload the same file where the VBA application runs and which carries the auction data. All I needed to do was create a copy of the Excel file while stripping off the VBA modules. Simple, right? Yet our users objected to that idea. Their reasoning was that, since the Excel file is completely open to editing, human errors may creep into the data. The file can be edited either intentionally or unintentionally. Either way, this is unacceptable.

The web content team also had some concerns. They argued that this way of maintaining the big HTML table would become unsustainable once it hit a certain number of characters. Also, it would most probably become impossible to upload the whole HTML table each time.

With all those in mind, I thought that there must be a way to have the best of both worlds. To achieve that, I need to somehow update the HTML file by appending the new auctions at the very end of the HTML table. This should satisfy the users, since I will never use the Excel file again. Also, I won’t change the values of the previous auctions, which is a strict requirement of the users. Moreover, I will have the option to trim the very old auctions by deleting the first rows of the HTML table. That is welcomed by the web content team. So how shall I start?

The Common Ground

Let’s think about the steps of the new process:
1. Connect to the server
2. Request1: Create a draft with the current web page
3. Update the html table in the web page by appending the new auctions
4. Request2: Put the modified web page back to server
5. Request3: Send the draft to approval

After we settled on these steps, the web content team sent me example code showing how to connect to our IBM WCM server, make requests and get the responses. I also found the necessary documentation and examples detailing these steps.
Since I would be parsing both XML and HTML, I chose jsoup, which can parse both, as my primary library. Now I was ready to start coding a small Java client application. At least I thought so.

The first thing to know about IBM WCM is that every document has a UUID. The web content team provided me the UUID of the document I was dealing with. I thought that I would use this for all three of my requests. However, the mechanism does not work like this. In create draft (Request1), we need to use the UUID of the document; that’s OK. But the draft creation request generates a response which contains another id representing that draft. This is because there may be more than one draft for the same document at the same time. We extract that id by parsing the XML response and use it in the following put (Request2) and send to approval (Request3) requests. This is critical.
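To make the split between the two ids concrete, here is a minimal sketch of how the three request URLs could be put together. The path templates are the ones used in the code of this post; the host name and both ids are made-up placeholders, and the base URL is assumed to end with a slash so that the templates join cleanly.

```java
// Illustrative only: the host and the ids used in main() are placeholders.
final class WcmUrlSketch {

    // Assumed base URL; note the trailing slash.
    static final String BASE = "http://wcm.example.org/wps/mycontenthandler/wcmrest/";

    // Request1 is addressed with the UUID of the published document.
    static String createDraftUrl(String documentUuid) {
        return String.format("%sitem/%s/create-draft", BASE, documentUuid);
    }

    // Request2 is addressed with the id of the draft, not of the document.
    static String putContentUrl(String draftId) {
        return String.format("%sContent/%s", BASE, draftId);
    }

    // Request3 also uses the draft id.
    static String nextStageUrl(String draftId) {
        return String.format("%sitem/%s/next-stage", BASE, draftId);
    }

    public static void main(String[] args) {
        String documentUuid = "doc-uuid";  // known up front
        String draftId = "draft-id";       // would be parsed from the create-draft response
        System.out.println("POST " + createDraftUrl(documentUuid));
        System.out.println("PUT  " + putContentUrl(draftId));
        System.out.println("POST " + nextStageUrl(draftId));
    }
}
```

Only the first URL can be built before talking to the server; the other two depend on what the first response returns.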

package tr.gov.tcmb.pgm.api.network;

import java.io.IOException;

import org.apache.http.HttpHost;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.AuthCache;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.CredentialsProvider;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpEntityEnclosingRequestBase;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.methods.HttpPut;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.impl.auth.BasicScheme;
import org.apache.http.impl.client.BasicAuthCache;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.impl.client.LaxRedirectStrategy;
import org.apache.log4j.Logger;

import tr.gov.tcmb.pgm.api.exceptions.FailedRequestException;
import tr.gov.tcmb.pgm.api.model.DraftWrapper;

/**
 *
 * @author Merter Sualp
 * This class is responsible for making the network operations
 */
public class NetworkOperations {

  private static final String HOST = "idmvwaut1.tcmb.gov.tr";
  private static final int PORT = 80;             // can be replaced by 443
  private static final String PROTOCOL = "http";  // can be replaced by "https"
  private static final String USERNAME = "user_name";
  private static final String PASSWORD = "password";
  private static final String TEXT_CONTENT = "text/plain";
  private static final String XML_CONTENT = "application/atom+xml";

  public static final String BASE_URI = PROTOCOL + "://" + HOST;
  // The trailing slash is needed because the templates below start with "%sitem" / "%sContent"
  public static final String BASE_IYS_URL = BASE_URI + "/wps/mycontenthandler/wcmrest/";
  public static final String CREATE_DRAFT_TEMPLATE = "%sitem/%s/create-draft";
  public static final String CONTENT_TEMPLATE = "%sContent/%s";
  public static final String APPROVE_TEMPLATE = "%sitem/%s/next-stage";

  private static final Logger logger = Logger.getLogger(NetworkOperations.class);

  private enum WcmConnection {
    SINGLETON_CONNECTION();

    private final CloseableHttpClient httpClient;
    private final HttpClientContext context;
    private final HttpHost targetHost;

    private WcmConnection() {
      httpClient = HttpClientBuilder.create().setRedirectStrategy(new LaxRedirectStrategy()).build();
      targetHost = new HttpHost(HOST, PORT, PROTOCOL);

      CredentialsProvider credsProvider = new BasicCredentialsProvider();
      credsProvider.setCredentials(new AuthScope(targetHost.getHostName(), targetHost.getPort()),
          new UsernamePasswordCredentials(USERNAME, PASSWORD));

      AuthCache authCache = new BasicAuthCache();
      BasicScheme basicAuth = new BasicScheme();
      authCache.put(targetHost, basicAuth);

      context = HttpClientContext.create();
      context.setCredentialsProvider(credsProvider);
      context.setAuthCache(authCache);
    }

    private CloseableHttpResponse execute(HttpEntityEnclosingRequestBase request)
        throws ClientProtocolException, IOException {
      CloseableHttpResponse response = httpClient.execute(targetHost, request, context);
      logger.trace(response.getStatusLine());
      return response;
    }
  }

  /**
   * This method prepares the request for creating a draft document.
   * The given UUID is used as the template document for this draft.
   *
   * @param docUuid
   * @return CloseableHttpResponse
   * @throws IOException
   */
  public static CloseableHttpResponse createDraft(String docUuid) throws IOException, FailedRequestException {
    HttpPost draftCreateRequest = new HttpPost(String.format(CREATE_DRAFT_TEMPLATE, BASE_IYS_URL, docUuid));
    setHeaderInRequest(draftCreateRequest, TEXT_CONTENT);
    return getResponseForRequest(draftCreateRequest);
  }

  private static void setHeaderInRequest(HttpEntityEnclosingRequestBase request, String content) {
    request.setHeader("Content-Type", content);
  }

  private static CloseableHttpResponse getResponseForRequest(HttpEntityEnclosingRequestBase request)
      throws ClientProtocolException, IOException, FailedRequestException {
	  CloseableHttpResponse response = WcmConnection.SINGLETON_CONNECTION
		  .execute(request);
	  checkResponse(response);
	  return response;
  }

  private static void checkResponse(CloseableHttpResponse response)
	    throws FailedRequestException {
	  if (response.getStatusLine().getStatusCode() == 200)
	    return;
	  if (response.getStatusLine().getStatusCode() == 201)
	    return;
	  throw new FailedRequestException();
  }
}

To create a draft for a given UUID, calling the createDraft(String docUuid) method above is enough. It forms the URL string to be posted and sets the header information accordingly. After that, it sends the request and gets the response. For this, I preferred to implement an enumeration to make the singleton connection simpler. The very first call to the execute method of the WcmConnection enumeration creates the single object. The following requests will not do that again; they use the same connection object which was initialized for the draft creation.
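The enum-as-singleton idea can be tried out in isolation. The sketch below is not the WcmConnection class itself; it is a stripped-down stand-in with a counter (a name invented for this illustration) that records how many times the constructor runs.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class EnumSingletonSketch {

    // Counts constructor runs; exists only to make the behaviour observable.
    static final AtomicInteger CONSTRUCTIONS = new AtomicInteger();

    enum Connection {
        INSTANCE;

        // Runs once, when the enum class is initialized on first use.
        Connection() {
            CONSTRUCTIONS.incrementAndGet();
        }

        String execute(String request) {
            return "executed " + request;
        }
    }

    public static void main(String[] args) {
        Connection.INSTANCE.execute("create-draft");
        Connection.INSTANCE.execute("put");
        Connection.INSTANCE.execute("next-stage");
        System.out.println("constructor runs: " + CONSTRUCTIONS.get());
    }
}
```

However many requests go through INSTANCE, the JVM initializes the enum class only once, so the expensive setup in the constructor is not repeated and no explicit locking is required.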

	private static Document getXmlDocument(CloseableHttpResponse response)
			throws IOException, SAXException, ParserConfigurationException {
		HttpEntity entity = response.getEntity();
		DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
		DocumentBuilder builder = factory.newDocumentBuilder();
		Document xmlDoc = builder.parse(entity.getContent());
		xmlDoc.getDocumentElement().normalize();
		return xmlDoc;
	}

My initial plan was to employ jsoup for parsing both the wrapper XML and its HTML child, which contains the table and is plain character data. Below is an example of it.

<data type="text/HTML"><![CDATA[
... HTML tags...
]]></data>

The problem we encountered was that the XML parser of jsoup turns all tags into their lowercase counterparts. That is a serious issue: we constantly got 400 as the response status code. I immediately changed my approach and employed a standard DOM parser for the XML. jsoup was only used for the HTML character data.
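For comparison, the DOM parser that ships with the JDK keeps element names exactly as they are written. The small check below uses a made-up mixed-case tag, "workflowStage", purely for illustration; it is not taken from a real WCM response.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class CasePreservingParseSketch {

    // Returns the name of the root's first child exactly as the DOM parser reports it.
    static String firstChildName(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
            return doc.getDocumentElement().getFirstChild().getNodeName();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        // Prints the tag name with its original capitalisation.
        System.out.println(firstChildName("<entry><workflowStage>draft</workflowStage></entry>"));
    }
}
```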

private static String extractDraftIdFrom(Document xmlDoc)
	    throws TagNonExistentException {
	Node idElement = extractTagStartingFrom(DRAFT_ID_TAG,
		xmlDoc.getDocumentElement());
	if (idElement == null)
	    throw new TagNonExistentException();
	return idElement.getTextContent().substring(PRE_ID_KEY.length());
}

private static Node extractTagStartingFrom(String tag, Node parent)
	    throws TagNonExistentException {
	Node requiredElement = null;
	NodeList children = parent.getChildNodes();
	for (int i=0; i<children.getLength(); i++) {
	    requiredElement = children.item(i);
	    if (tag.equals(requiredElement.getNodeName())) {
		    break;
	    }
	    requiredElement = null;
	}
	if (requiredElement == null)
	    throw new TagNonExistentException();
	  return requiredElement;
}

The next step after parsing the XML file is extracting the id assigned to the draft. The “id” child of the root contains this information, prefixed with the “wcmrest:” string. We take the remainder of the text value as the id. The extractTagStartingFrom(String tag, Node parent) method may seem overkill here. As you will see, we will use this method to find other nodes inside the XML document.

private static Node extractTableXmlElement(Document xmlDoc)
	    throws TagNonExistentException {
	Node contentNode = extractTagStartingFrom("content",
		xmlDoc.getDocumentElement());
	Node wcmContentNode = extractTagStartingFrom("wcm:content",
		contentNode);
	Node elementsNode = extractTagStartingFrom("elements", wcmContentNode);
	Node bodyElementNode = extractTagHavingAttributeValueStartingFrom(
		"element", "name", "Body", elementsNode);
	Node dataNode = extractTagStartingFrom("data", bodyElementNode);
	return dataNode.getFirstChild();
}

private static Node extractTagHavingAttributeValueStartingFrom(String tag,
	    String attr, String value, Node parent)
	    throws TagNonExistentException {
	Node requiredElement;
	NodeList children = parent.getChildNodes();
	for (int i=0; i<children.getLength(); i++) {
	    requiredElement = children.item(i);
	    if (tag.equals(requiredElement.getNodeName())
		    && doesNodeContainAttributeHavingValue(attr, value,
			    requiredElement))
		return requiredElement;
	}
	throw new TagNonExistentException();
}

private static boolean doesNodeContainAttributeHavingValue(String attr,
	    String value, Node requiredElement) {
	NamedNodeMap attributes = requiredElement.getAttributes();
	for (int a=0; a<attributes.getLength(); a++) {
	    Node theAttribute = attributes.item(a);
	    if (theAttribute.getNodeName().equals(attr)
		    && theAttribute.getNodeValue().equals(value))
		return true;
	}
	return false;
}

After getting the draft id, the next thing to do is to find the XML part which holds the HTML table. The path to it starts with the “content” child of the root. That node has a “wcm:content” child, which in turn has an “elements” child. One of the children of “elements” has a “name” attribute with the value “Body”. At that point, we get its child node, “data”, which is the parent of the HTML table stored as character data. Here you can see that we repeatedly call the extractTagStartingFrom(String tag, Node parent) method to reach the destination. Moreover, at one place, we call extractTagHavingAttributeValueStartingFrom(String tag, String attr, String value, Node parent) to find the node that we are interested in.
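The walk described above is easier to follow against a concrete document. The sample below is a hypothetical, heavily trimmed stand-in for the server response: only the nesting follows the description in this post, while the namespace URI, the ids and the extra "Summary" element are invented. The helper is a condensed variant of the two lookup methods, not a replacement for them.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class BodyLookupSketch {

    // Hypothetical sample; a real response carries many more elements.
    static final String SAMPLE =
            "<entry xmlns:wcm=\"urn:example:wcm\">"
          + "<id>wcmrest:draft-42</id>"
          + "<content><wcm:content><elements>"
          + "<element name=\"Summary\"><data>ignored</data></element>"
          + "<element name=\"Body\"><data type=\"text/html\">"
          + "<![CDATA[<table><tbody></tbody></table>]]>"
          + "</data></element>"
          + "</elements></wcm:content></content>"
          + "</entry>";

    static Node root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // First direct child element with the given tag name; when attr is
    // non-null the child must also carry attr="value".
    static Node child(Node parent, String tag, String attr, String value) {
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node candidate = children.item(i);
            if (!(candidate instanceof Element) || !tag.equals(candidate.getNodeName())) {
                continue;
            }
            if (attr == null || value.equals(((Element) candidate).getAttribute(attr))) {
                return candidate;
            }
        }
        throw new IllegalStateException("No <" + tag + "> child found");
    }

    // Strips the "wcmrest:" prefix from the text of the root's <id> child.
    static String draftId(String xml) {
        String text = child(root(xml), "id", null, null).getTextContent();
        return text.substring("wcmrest:".length());
    }

    // Follows content -> wcm:content -> elements -> element[name=Body] -> data.
    static String bodyHtml(String xml) {
        Node content = child(root(xml), "content", null, null);
        Node wcmContent = child(content, "wcm:content", null, null);
        Node elements = child(wcmContent, "elements", null, null);
        Node body = child(elements, "element", "name", "Body");
        return child(body, "data", null, null).getTextContent();
    }

    public static void main(String[] args) {
        System.out.println(draftId(SAMPLE));
        System.out.println(bodyHtml(SAMPLE));
    }
}
```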

private static String addAuction(Node oldTableXmlElement) {
	org.jsoup.nodes.Document htmlDoc = extractHtmlTableFrom(
		oldTableXmlElement);
	org.jsoup.nodes.Element allAuctions = htmlDoc.getElementsByTag("tbody")
		.first();
	org.jsoup.nodes.Element newAuction = new org.jsoup.nodes.Element(
		Tag.valueOf("tr"), NetworkOperations.BASE_URI);
	List<String> data = new LinkedList<String>();
	data.add("28.06.2016");
	data.add("MİKTAR");
	data.add("28.06.2016");
	data.add("12.07.2016");
	data.add("14");
	data.add("21,765,467.46");
	data.add("10,999,999.97");
	data.add("7.50");
	data.add("7.50");
	data.add("7.50");
	data.add("7.78");
	data.add("7.78");
	data.add("7.78");
	for (String datum : data) {
	    org.jsoup.nodes.Element tableElement = new org.jsoup.nodes.Element(
		    Tag.valueOf("td"), NetworkOperations.BASE_URI);
	    tableElement.appendText(datum);
	    newAuction.appendChild(tableElement);
	}
	allAuctions.appendChild(newAuction);
	return htmlDoc.body().html();
}

private static org.jsoup.nodes.Document extractHtmlTableFrom(
	    Node oldTableXmlElement) {
	String htmlContent = oldTableXmlElement.getTextContent();
	return Jsoup.parse(htmlContent);
}

Up until now, our only aim was to reach the HTML table. So, we did not deal with jsoup anywhere. That is intentional. If, for any reason, we would like to change the XML format, the XML parsing library, the HTML format or the HTML parsing library, these abstraction layers should help our cause. The other parts of the code will be left untouched. For example, the methods above are the only ones where jsoup calls are made, and nothing from the earlier steps is involved. We parse the HTML, create and append a new table row to the existing table, and return it as a String. Simple.

private static String modifyXmlDocAsString(String newHtmlDocAsString,
	    Node oldTableXmlElement, Document xmlDoc)
	    throws XmlToStringConversionException {
	CDATASection cdata = xmlDoc.createCDATASection(newHtmlDocAsString);
	oldTableXmlElement.getParentNode().replaceChild(cdata,
		oldTableXmlElement);
	return xmlToString(xmlDoc);
}

private static String xmlToString(Document doc)
	    throws XmlToStringConversionException {
	try {
	    StringWriter sw = new StringWriter();
	    TransformerFactory tf = TransformerFactory.newInstance();
	    Transformer transformer = tf.newTransformer();
	    transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION,
		    "no");
	    transformer.setOutputProperty(OutputKeys.METHOD, "xml");
	    transformer.setOutputProperty(OutputKeys.INDENT, "yes");
	    transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");

	    transformer.transform(new DOMSource(doc), new StreamResult(sw));
	    return sw.toString();
	} catch (Exception ex) {
	    logger.error(ex.getMessage());
	    throw new XmlToStringConversionException(
		    "Error converting to String: " + ex.getMessage());
	}
}

So far, we were able to build the new HTML table. Now, we will take it, create a character data node and put it back into the main XML document. I used the CDATASection interface for that; it is created by means of the XML document. I also replaced the old table with this one. After that, I wanted the full text form of the new XML. Interestingly, as far as I know, there is no single method to accomplish that. Hence, I used this elegant Java code to do it.
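The replace-and-serialize step can be exercised on its own with nothing but the JDK. In the sketch below the document is reduced to a single "data" element, which is a simplification made for this illustration, and the output properties are left at their defaults.

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class CdataSwapSketch {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Swaps the first child of the root element for a CDATA section holding
    // newHtml, then serializes the whole document to a String.
    static String swapBody(String xml, String newHtml) {
        try {
            Document doc = parse(xml);
            Element data = doc.getDocumentElement();
            data.replaceChild(doc.createCDATASection(newHtml), data.getFirstChild());
            StringWriter out = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not rewrite XML", e);
        }
    }

    public static void main(String[] args) {
        String original = "<data><![CDATA[<table></table>]]></data>";
        System.out.println(swapBody(original, "<table><tr><td>7.50</td></tr></table>"));
    }
}
```

Parsing the returned string again and reading the text content of "data" gives back the new HTML unchanged, which is a convenient sanity check before sending anything to the server.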

private static DraftWrapper addMissingAuctions() throws IOException,
	    SAXException, ParserConfigurationException, FailedRequestException,
	    TagNonExistentException, XmlToStringConversionException {
	CloseableHttpResponse response = null;
	try {
	    response = NetworkOperations.createDraft(DraftWrapper.CONTENT_UUID);
	    Document xmlDoc = getXmlDocument(response);
	    String draftId = extractDraftIdFrom(xmlDoc);
	    Node oldTableXmlElement = extractTableXmlElement(xmlDoc);
	    String newHtmlDocAsString = addAuction(oldTableXmlElement);
	    String xmlDocumentAsText = modifyXmlDocAsString(newHtmlDocAsString,
		    oldTableXmlElement, xmlDoc);
	    logger.trace(xmlDocumentAsText);
	    return DraftWrapper.valueOf(
		    new ByteArrayEntity(xmlDocumentAsText.getBytes("UTF-8")),
		    draftId);
	} catch (IOException ex) {
	    logger.error(ex.getMessage());
	    throw ex;
	} catch (UnsupportedOperationException e) {
	    logger.error(e.getMessage());
	    throw e;
	} catch (FailedRequestException e) {
	    logger.error(e.getMessage());
	    throw e;
	} catch (TagNonExistentException e) {
	    logger.error(e.getMessage());
	    throw e;
	} catch (XmlToStringConversionException e) {
	    logger.error(e.getMessage());
	    throw e;
	} finally {
	    if (response != null)
		response.close();
	}
}

The last step is to convert the text form of the XML file into an HTTP entity. A ByteArrayEntity is just fine. Above is the whole code for appending new auctions to the existing ones. Note that it returns a DraftWrapper object.

package tr.gov.tcmb.pgm.api.model;

import org.apache.http.HttpEntity;

/**
 * This class contains the id and http entity data which are used in IBM wcm
 * requests
 *
 * @author asgmsta
 *
 */
public class DraftWrapper {
  public static final String CONTENT_UUID = "D0CUM3N7-C0N73NT-1D";

  private final HttpEntity body;
  private final String draftId;

  private DraftWrapper(HttpEntity body, String draftId) {
	  this.body = body;
	  this.draftId = draftId;
  }

  public HttpEntity getBody() {
	  return body;
  }

  public String getDraftId() {
	  return draftId;
  }

  /**
    * The static factory method creating a DraftWrapper
    *
    * @param body
    * @param draftId
    * @return DraftWrapper
    */
  public static DraftWrapper valueOf(HttpEntity body, String draftId) {
	  return new DraftWrapper(body, draftId);
  }
}

After modifying the draft, our next steps require the draft id and the entity. I did not want to make the operation classes stateful, therefore I did not add these as private fields of any of them. The result is the DraftWrapper class. It is the only class that holds state; the WCM and network operation classes are all stateless.

private static void updateServerDocumentWith(DraftWrapper draftWrapper)
	  throws IOException, FailedRequestException {
	CloseableHttpResponse response = null;
	try {
	  response = NetworkOperations.putModifiedDraft(draftWrapper);
	} catch (IOException ex) {
	  logger.error(ex.getMessage());
	  throw ex;
	} finally {
	  if (response != null)
		  response.close();
	}
}

/**
  * This method prepares the request for putting the modified draft back to the
  * server. The given DraftWrapper contains the id of the draft and the bytes
  * of the draft itself to be put.
  *
  * @param draftWrapper
  * @return CloseableHttpResponse
  * @throws IOException
  */
public static CloseableHttpResponse putModifiedDraft(
	    DraftWrapper draftWrapper)
	    throws IOException, FailedRequestException {
	HttpPut putRequest = new HttpPut(String.format(CONTENT_TEMPLATE,
		BASE_IYS_URL, draftWrapper.getDraftId()));
	putRequest.setEntity(draftWrapper.getBody());
	setHeaderInRequest(putRequest, XML_CONTENT);
	return getResponseForRequest(putRequest);
}

We are getting really close to the finish. Our draft is ready and now it is time to put it back on the WCM server. We initialize a put request and set its body to the entity we previously put inside the DraftWrapper. The important distinction here is this entity addition and setting the content type to “application/atom+xml”. All our other requests are posts and their content type is “text/plain”.

private static void sendToApproval(String draftId)
	  throws IOException, FailedRequestException {
	CloseableHttpResponse response = null;
	try {
	  response = NetworkOperations.sendToApproval(draftId);
	} catch (IOException ex) {
	  logger.error(ex.getMessage());
	  throw ex;
	} finally {
	  if (response != null)
		  response.close();
	}
}

/**
  * This method changes the state of the created and modified draft into
  * ready for approval.
  *
  * @param docUuid
  * @return CloseableHttpResponse
  * @throws IOException
  */
public static CloseableHttpResponse sendToApproval(String docUuid)
	  throws IOException, FailedRequestException {
	HttpPost approvalRequest = new HttpPost(
		String.format(APPROVE_TEMPLATE, BASE_IYS_URL, docUuid));
	setHeaderInRequest(approvalRequest, TEXT_CONTENT);
	return getResponseForRequest(approvalRequest);
}

The last step is sending the newly modified draft, which is on the server now, to approval. It only needs the draft id. Nothing more. The whole WcmOperation.java code is below for the sake of completeness.

package tr.gov.tcmb.pgm.api.wcm;

import java.io.IOException;
import java.io.StringWriter;
import java.util.LinkedList;
import java.util.List;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.entity.ByteArrayEntity;
import org.apache.log4j.Logger;
import org.jsoup.Jsoup;
import org.jsoup.parser.Tag;
import org.w3c.dom.CDATASection;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

import tr.gov.tcmb.pgm.api.exceptions.FailedRequestException;
import tr.gov.tcmb.pgm.api.exceptions.TagNonExistentException;
import tr.gov.tcmb.pgm.api.exceptions.XmlToStringConversionException;
import tr.gov.tcmb.pgm.api.model.DraftWrapper;
import tr.gov.tcmb.pgm.api.network.NetworkOperations;

public class WcmOperation {

  private static final Logger logger = Logger.getLogger(WcmOperation.class);
  private static final String PRE_ID_KEY = "wcmrest:";
  private static final String DRAFT_ID_TAG = "id";

  private WcmOperation() {
  }

  public static void main(String[] args) {
    try {
      DraftWrapper draftWrapper = addMissingAuctions();
      updateServerDocumentWith(draftWrapper);
      sendToApproval(draftWrapper.getDraftId());
    } catch (IOException ex) {
      StringBuilder stb = new StringBuilder();
      stb.append("There is an error because of this: ");
      stb.append(ex.getCause());
      stb.append(ex.getMessage());
      logger.error(stb.toString());
    } catch (SAXException e) {
      logger.error(e.getMessage());
    } catch (ParserConfigurationException e) {
      logger.error(e.getMessage());
    } catch (TagNonExistentException e) {
      logger.error(e.getMessage());
    } catch (FailedRequestException e) {
      logger.error(e.getMessage());
    } catch (XmlToStringConversionException e) {
      logger.error(e.getMessage());
    }
  }

  private static DraftWrapper addMissingAuctions() throws IOException,
      SAXException, ParserConfigurationException, FailedRequestException,
      TagNonExistentException, XmlToStringConversionException {
    CloseableHttpResponse response = null;
    try {
	    response = NetworkOperations.createDraft(DraftWrapper.CONTENT_UUID);
	    Document xmlDoc = getXmlDocument(response);
	    String draftId = extractDraftIdFrom(xmlDoc);
	    Node oldTableXmlElement = extractTableXmlElement(xmlDoc);
	    String newHtmlDocAsString = addAuction(oldTableXmlElement);
	    String xmlDocumentAsText = modifyXmlDocAsString(newHtmlDocAsString,
		    oldTableXmlElement, xmlDoc);
	    logger.trace(xmlDocumentAsText);
	    return DraftWrapper.valueOf(
		    new ByteArrayEntity(xmlDocumentAsText.getBytes("UTF-8")),
		    draftId);
  	} catch (IOException ex) {
	    logger.error(ex.getMessage());
	    throw ex;
	  } catch (UnsupportedOperationException e) {
	    logger.error(e.getMessage());
	    throw e;
	  } catch (FailedRequestException e) {
	    logger.error(e.getMessage());
	    throw e;
	  } catch (TagNonExistentException e) {
	    logger.error(e.getMessage());
	    throw e;
	  } catch (XmlToStringConversionException e) {
	    logger.error(e.getMessage());
	    throw e;
	  } finally {
	    if (response != null)
		    response.close();
	  }
  }

  private static Document getXmlDocument(CloseableHttpResponse response)
	    throws IOException, SAXException, ParserConfigurationException {
    HttpEntity entity = response.getEntity();
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document xmlDoc = builder.parse(entity.getContent());
    xmlDoc.getDocumentElement().normalize();
    return xmlDoc;
  }

  private static String extractDraftIdFrom(Document xmlDoc)
	    throws TagNonExistentException {
	  Node idElement = extractTagStartingFrom(DRAFT_ID_TAG,
		  xmlDoc.getDocumentElement());
	  if (idElement == null)
	    throw new TagNonExistentException();
	  return idElement.getTextContent().substring(PRE_ID_KEY.length());
  }

  private static Node extractTagStartingFrom(String tag, Node parent)
	    throws TagNonExistentException {
    Node requiredElement = null;
    NodeList children = parent.getChildNodes();
    for (int i=0; i<children.getLength(); i++) {
	    requiredElement = children.item(i);
	    if (tag.equals(requiredElement.getNodeName())) {
		    break;
	    }
	    requiredElement = null;
    }
    if (requiredElement == null)
	    throw new TagNonExistentException();
	  return requiredElement;
  }

  private static Node extractTableXmlElement(Document xmlDoc)
	    throws TagNonExistentException {
    Node contentNode = extractTagStartingFrom("content",
      xmlDoc.getDocumentElement());
    Node wcmContentNode = extractTagStartingFrom("wcm:content",
      contentNode);
    Node elementsNode = extractTagStartingFrom("elements", wcmContentNode);
    Node bodyElementNode = extractTagHavingAttributeValueStartingFrom(
      "element", "name", "Body", elementsNode);
    Node dataNode = extractTagStartingFrom("data", bodyElementNode);
    return dataNode.getFirstChild();
  }

  private static Node extractTagHavingAttributeValueStartingFrom(String tag,
	    String attr, String value, Node parent)
	    throws TagNonExistentException {
    Node requiredElement;
    NodeList children = parent.getChildNodes();
    for (int i=0; i<children.getLength(); i++) {
	    requiredElement = children.item(i);
	    if (tag.equals(requiredElement.getNodeName())
		      && doesNodeContainAttributeHavingValue(attr, value,
			    requiredElement))
    		return requiredElement;
	  }
	  throw new TagNonExistentException();
  }

  private static boolean doesNodeContainAttributeHavingValue(String attr,
	    String value, Node requiredElement) {
	  NamedNodeMap attributes = requiredElement.getAttributes();
	  for (int a=0; a<attributes.getLength(); a++) {
	    Node theAttribute = attributes.item(a);
	    if (theAttribute.getNodeName().equals(attr)
		      && theAttribute.getNodeValue().equals(value))
		    return true;
	  }
	  return false;
  }

  private static String addAuction(Node oldTableXmlElement) {
	  org.jsoup.nodes.Document htmlDoc = extractHtmlTableFrom(
		  oldTableXmlElement);
	  org.jsoup.nodes.Element allAuctions = htmlDoc.getElementsByTag("tbody")
		  .first();
	  org.jsoup.nodes.Element newAuction = new org.jsoup.nodes.Element(
		  Tag.valueOf("tr"), NetworkOperations.BASE_URI);
    List<String> data = new LinkedList<String>();
    data.add("28.06.2016");
    data.add("MİKTAR");
    data.add("28.06.2016");
    data.add("12.07.2016");
    data.add("14");
    data.add("21,765,467.46");
    data.add("10,999,999.97");
    data.add("7.50");
    data.add("7.50");
    data.add("7.50");
    data.add("7.78");
    data.add("7.78");
    data.add("7.78");
    for (String datum : data) {
	    org.jsoup.nodes.Element tableElement = new org.jsoup.nodes.Element(
		    Tag.valueOf("td"), NetworkOperations.BASE_URI);
	    tableElement.appendText(datum);
	    newAuction.appendChild(tableElement);
	  }
	  allAuctions.appendChild(newAuction);
	  return htmlDoc.body().html();
  }

  private static org.jsoup.nodes.Document extractHtmlTableFrom(
	    Node oldTableXmlElement) {
	  String htmlContent = oldTableXmlElement.getTextContent();
	  return Jsoup.parse(htmlContent);
  }

  private static String modifyXmlDocAsString(String newHtmlDocAsString,
	    Node oldTableXmlElement, Document xmlDoc)
	    throws XmlToStringConversionException {
	  CDATASection cdata = xmlDoc.createCDATASection(newHtmlDocAsString);
	  oldTableXmlElement.getParentNode().replaceChild(cdata,
		  oldTableXmlElement);
	  return xmlToString(xmlDoc);
  }

  private static String xmlToString(Document doc)
	    throws XmlToStringConversionException {
	  try {
	    StringWriter sw = new StringWriter();
	    TransformerFactory tf = TransformerFactory.newInstance();
	    Transformer transformer = tf.newTransformer();
	    transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION,
		    "no");
	    transformer.setOutputProperty(OutputKeys.METHOD, "xml");
	    transformer.setOutputProperty(OutputKeys.INDENT, "yes");
	    transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");

	    transformer.transform(new DOMSource(doc), new StreamResult(sw));
	    return sw.toString();
	  } catch (Exception ex) {
	    logger.error(ex.getMessage());
	    throw new XmlToStringConversionException(
		    "Error converting to String: " + ex.getMessage());
	  }
  }

  private static void updateServerDocumentWith(DraftWrapper draftWrapper)
	    throws IOException, FailedRequestException {
	  CloseableHttpResponse response = null;
	  try {
	    response = NetworkOperations.putModifiedDraft(draftWrapper);
	  } catch (IOException ex) {
	    logger.error(ex.getMessage());
	    throw ex;
	  } finally {
	    if (response != null)
		    response.close();
	  }
  }

  private static void sendToApproval(String draftId)
	    throws IOException, FailedRequestException {
	  CloseableHttpResponse response = null;
	  try {
	    response = NetworkOperations.sendToApproval(draftId);
	  } catch (IOException ex) {
	    logger.error(ex.getMessage());
	    throw ex;
	  } finally {
	    if (response != null)
		    response.close();
	  }
  }
}


Beauty of Maven

19/06/2016

I definitely remember the chaos and problems that I encountered while preparing this very small application. At that time, I thought that something might have helped me through my struggle. Actually “something” is not that far away. The techniques that I employed and explained in Splitting Evernote Notes post helped me tremendously. In this post, I will talk about how I converted the manual process inside Eclipse and how Maven made it too easy to be true.

First, I created a Maven project as I explained in my previous post. Since I need to run it on machines having Java 6, I set the source and target parameters inside the pom.xml file accordingly. I had 3 jars of DB2, so I also defined those dependencies in the same file. Since I had the code ready, I just created the package and put the source file inside it. The whole process took 5 to 10 minutes. Here is the situation:

Dependencies

As we can see, the Java version of JRE System Library is 1.6. The Maven Dependencies contain all the jars that I need. The package is created and the necessary code exists where it should. By the way, there is a problem with M2Eclipse (http://www.eclipse.org/m2e/) on Eclipse Mars and Neon. When you try to update the project to get the necessary jars, some of them are not downloaded, even if you adjust the proxy information in settings.xml. To bypass this, I downloaded Apache Maven and ran the following command from the command line, inside the folder where the pom.xml file resides.

mvn clean install

That downloads all my packages correctly.

To run the code on different machines, an executable jar file must be generated. In the previous version, I did all of this manually. Moreover, the jars that my code depends on had to be present in specific folders on those machines. This time, I decided to put all those dependencies into a single jar file. That way, a simple copy from one machine to another is all we need to do in order to successfully execute this application.

There are many plugins that can be used with Maven. The one that fits my case perfectly is the Maven Assembly Plugin. It generates the final jar file, which can be used either by other dependent applications or by an end user who simply executes it. It also has the ability to add the jars which are necessary for the application to accomplish its task. So all jars are in the single, final, executable jar.

RunnableJar

The “archive” tag inside “configuration” makes the jar executable. The class that contains the main() function should be written to the manifest file. This addition to the pom.xml file defines which class should be run. When we run the following command, the JVM finds the main() function in the class which is defined in the “manifest” tag.

java -jar teklif.jar
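To make the role of the manifest a bit more concrete, here is a small sketch that is not part of my project. It builds a manifest in memory with the standard java.util.jar classes and reads back the Main-Class entry, which is the entry that java -jar looks up. The class name com.example.TeklifMain is a made-up placeholder, not the real main class of teklif.jar.

```java
import java.util.jar.Attributes;
import java.util.jar.Manifest;

class ManifestSketch {

    // Builds an in-memory manifest that names the class holding main().
    static Manifest manifestFor(String mainClass) {
        Manifest manifest = new Manifest();
        Attributes attributes = manifest.getMainAttributes();
        // The version attribute is required for the manifest to be written out properly.
        attributes.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        attributes.put(Attributes.Name.MAIN_CLASS, mainClass);
        return manifest;
    }

    // Reads back the Main-Class entry, the one "java -jar" relies on.
    static String mainClassOf(Manifest manifest) {
        return manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
    }

    public static void main(String[] args) {
        // Hypothetical class name, used only for illustration.
        Manifest manifest = manifestFor("com.example.TeklifMain");
        System.out.println(mainClassOf(manifest));
    }
}
```

In the real build, Maven writes this entry for us, so there is no need to touch the manifest by hand.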

So, after all these adjustments, we can build the project and generate the jar file under the /target folder. How will we do that? Actually, the Eclipse versions that I have access to (Mars & Neon) do not have a predefined Maven run configuration for packaging. But we can make one, thanks to the help of a stackoverflow user. The aim is to create a Maven run configuration which reads the pom.xml file of the project and runs the following command:

mvn clean package

First, we select the project, then Run As, and then Run Configurations…

Maven Run Configurations

Here, we will add the above command as a Maven Run Configuration. I gave it the name “maven-package” and made sure that the Base directory is ${selected_resource_loc}. That is where Maven looks for the pom.xml file. So each time we generate the jar, we must select the main project in the explorer view. The goals are clean and package, in that order. That’s it. Save it and run.

The complexity of the two methods is not even comparable in my view. This is a relatively small application. If it gets bigger and we try to manage it without Maven or some other kind of dependency manager, we will definitely be lost. But not with Maven.


Posting to WordPress With Workflow

15/06/2016

I am trying to move my blogging activity, which takes place entirely on my MacBook, to my iPad Pro. To be honest, image capturing and loading are a real pain for me right now. The first problem I have is that the links of Dropbox images are not directly usable in HTML documents. This may be an important design choice for Dropbox but it creates an unnecessary problem for bloggers. To make it available inside an HTML document, Canton Becker pointed out a solution. This is a classic case for a Workflow.

Parameter Replacement

The solution he proposed, which is validated by the Dropbox documentation, is to replace the “dl=0” parameter with “raw=1”. My approach involves using regular expressions. In fact, this can be accomplished without using them but there is a catch. If “dl=0” occurs in more than one place inside the URL, then all of them are replaced with “raw=1”, which is not our purpose.

The regular expression (.*)\?dl=0$ matches the parameter dl only when it is at the very end of the URL. The other characters in the URL are grouped within the parentheses. We can access that part by $1. The replacement string, $1\?raw=1, puts those characters first and then appends the raw parameter set to 1.
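The same replacement can be tried outside Workflow. Below is a minimal Java sketch of it, using the regular expression above. The Dropbox links are made-up examples, and the second one deliberately contains “dl=0” in its path to show that only the trailing parameter is touched.

```java
class DropboxLinks {

    // Replaces only a trailing "?dl=0" with "?raw=1".
    // Everything before it is captured as group 1 and copied back via $1.
    static String toRawLink(String sharedLink) {
        return sharedLink.replaceAll("(.*)\\?dl=0$", "$1?raw=1");
    }

    public static void main(String[] args) {
        // Hypothetical links, used only for illustration.
        System.out.println(toRawLink("https://www.dropbox.com/s/abc123/image.png?dl=0"));
        System.out.println(toRawLink("https://www.dropbox.com/s/abc123/dl=0-notes.png?dl=0"));
    }
}
```

A link that does not end with “?dl=0” does not match the pattern, so it comes back unchanged.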

Matryoshka Workflows

Now that we have the image link, what will we do with it? Generally, I shorten URLs whenever I place them in my post. So, we can apply my previous workflow. When I first did that, the thing that came to my mind was: Can’t I call another workflow within a workflow?

The simple answer is yes, but not directly. The long answer is below:

Run Workflow Run

As far as I know, we cannot directly call another workflow. However, we can open URLs. A workflow can be accessed via the workflow:// URL scheme, right? Moreover, that URL scheme enables us to run any workflow. All we need to provide is the name of the workflow and the input to it. So why not try it? In my case, the name of the workflow to be run is “bit.ly Shorten”. I encode it as a URL and persist it inside the variable bitUrl. The modified Dropbox link is my input. I could also store it as a variable and feed it as text to the run-workflow scheme, but I preferred to use the clipboard. The Open URLs step completes my action set. You can get the workflow from here.
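For readers who want to see what such a URL looks like, here is a small Java sketch that assembles one. The run-workflow form with the name and input parameters is the one used in the editor scripts of this post. The helper names and the sample link are my own assumptions. Note that java.net.URLEncoder writes a space as “+”, so the sketch converts it to “%20”.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class WorkflowUrl {

    // Percent-encodes a value so that it can be used as a URL parameter.
    static String encode(String value) {
        try {
            // URLEncoder targets HTML forms and writes spaces as "+".
            // A literal "+" is already encoded as %2B, so the replace only affects spaces.
            return URLEncoder.encode(value, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    // Builds the URL that asks Workflow to run the named workflow with the given input.
    static String runWorkflowUrl(String workflowName, String input) {
        return "workflow://run-workflow?name=" + encode(workflowName)
                + "&input=" + encode(input);
    }

    public static void main(String[] args) {
        // The Dropbox link is a hypothetical example.
        System.out.println(runWorkflowUrl("bit.ly Shorten",
                "https://www.dropbox.com/s/abc123/image.png?raw=1"));
    }
}
```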

Editor Automation

1Writer and Drafts4 really improved my blogging efficiency. That triggered my intention to write and share more. With that in mind, I wondered whether there is a method to publish to WordPress directly from 1Writer and Drafts4. Within both apps, there is none. But, with Workflow, this becomes a reality. What to do with Workflow other than loving it?

A quick googling took me to one of my favorite bloggers, which was no surprise. In his article, Federico Viticci talks about the very same subject. The main point is that actions can be defined in 1Writer and Drafts4 which get the necessary information from the document and use it to call a Workflow script. Since the Workflow call with parameters inside the URL is independent of the action side, we can use the same Workflow from either 1Writer or Drafts4.

I use atx-style headers, so my text in the editor starts with the header, preceded by a # character. This line is followed by two new line characters and the rest is the body of the post. Below is the very start of this post.

# Posting to WordPress With Workflow

I am trying to move my blogging activity, which takes place entirely on my MacBook, to my iPad Pro. To be honest, image capturing and loading are a real pain for me right now. The first problem I have is that the links of Dropbox images are not directly usable in HTML documents. This may be an important [design choice](http://bit.ly/1UtRR2h) for Dropbox but it creates an unnecessary problem for bloggers. To make it available inside an HTML document, Canton Becker [pointed out a solution](http://bit.ly/1UtRzZv). This is a classic case for a Workflow.

1Writer Way

1Writer Action Creation

Let’s talk about the actions. Both 1Writer and Drafts4 permit us to create action scripts using the JavaScript programming language. The JavaScript action for 1Writer is:

// build the parameters and hand them to the Publish workflow through its URL scheme
var parameters = prepareBlogPostParameters();
var workflowName = 'Publish';
app.openURL('workflow://run-workflow?name=' + encodeURIComponent(workflowName) + '&input=' + encodeURIComponent(parameters));

// you can add as many parameters as you like
function prepareBlogPostParameters() {
  var text = editor.getText();
  // the first line is the title; substring(1, ...) skips the leading # character
  var endOfTitle = text.indexOf("\n");
  var title = text.substring(1, endOfTitle).trim();
  // everything after the first line is the body of the post
  text = text.substring(endOfTitle+1).trim();
  var parametersAsJson = {
    "title" : title,
    // "newParameter" : value,
    "text" : text
  };
  return JSON.stringify(parametersAsJson);
}

I definitely advise you to go through the 1Writer JavaScript Documentation because there is a lot of important information about the capabilities of 1Writer’s automation. For our example, I will emphasize two basic objects. We see that the properties of the document can be accessed via the “editor” object. The method we used is getText(), which supplies the whole text in the editor. We extract the title out of it, as the first line ending with a new line character.

We have the title and the text. We must send them to our publish workflow. All workflows accept one input parameter. I decided to pass these two variables by encapsulating them as a single JSON object. By virtue of its key-value pairs, our workflow can get the necessary information by using the “Get Value for Key” actions.

After all these preparations, we call the workflow to do its job. In 1Writer, there is no explicit Workflow integration. Rather, we use the generic openURL() method of the “app” object. I preferred the simple run-workflow approach. X-Callback-URL can also be used. I shared this if you would like to give it a try.

There are lots of actions that can be integrated into 1Writer. You can check them out and modify them according to your needs.

Drafts4 Way

Drafts4 Action Creation

Initially, my thinking was that I could use the very same JavaScript in Drafts4 with a few modifications. Since the 1Writer and Drafts4 objects would be different, those modifications were necessary. But through the conversion process, the situation turned out to be more complex than that, as is almost always the case. In 1Writer, there is only one action to be performed. Drafts4 divides the whole automation into action steps. For the time being, there are 26 different action steps. What we will use is one Script step and one Run Workflow step.

Drafts4 Action

Action Steps

// Script steps run short Javascripts
// For documentation and examples, visit:
// http://help.agiletortoise.com
var parameters = prepareBlogPostParameters();
var workflowName = 'Publish';
// tags keep these values available for the following action step
draft.defineTag('parameters', parameters);
draft.defineTag('workflowName', workflowName);

// you can add as many parameters as you like
function prepareBlogPostParameters() {
  var text = draft.content;
  // the first line is the title; substring(1, ...) skips the leading # character
  var endOfTitle = text.indexOf("\n");
  var title = text.substring(1, endOfTitle).trim();
  // everything after the first line is the body of the post
  text = text.substring(endOfTitle+1).trim();
  var parametersAsJson = {
    "title": title,
    // "newParameter": value,
    "text" : text
  };
  return JSON.stringify(parametersAsJson);
}

Drafts4 also has detailed documentation about its features. Moreover, you can find many actions in its own directory.

Our first step will again generate the JSON object. This is more or less the same as in 1Writer. To get the editor text, we use the “draft” object and its “content” property. When the parameters are set, we again generate the JSON object. Unlike in 1Writer, we need to preserve this JSON object and the name of the workflow. That is mandatory if we want to use them in the following action steps. This is accomplished by defining tags on the draft object. I defined one tag for each of them. We remove the URL call that runs the workflow, since it will be done in the next step.

Workflow from Drafts4

The next action step gets the parameters and the workflow name and then runs that workflow. The tags that we defined in the previous action step can be accessed by writing their names inside the [[]] operator. With this, we complete the whole action in two steps that follow each other. I also put this in the Action Directory. You can try it.

The Publish Workflow

Publish Part 1
Publish Part 2

The Publish workflow gets the title and the body text. As I said before, my aim was to create a contract or interface between the editors and the workflow. With that, the editors only concentrate on generating the input JSON parameter, which contains the title and the text body. Nothing else. They do not know anything about the later stages. The Publish workflow knows nothing about the editors either. The only thing that matters for it is the JSON parameter that contains the title and text body.

As the first step, the JSON parameter is converted into a dictionary by the “Get Dictionary from Input” action of the workflow. This enables us to get the title and body text with their respective keys. The “Post to WordPress” action gets the text as its input to post and the title information to set the title. For me, Post as Type and Draft as Status are enough. You can change them according to your own needs.

You can get this workflow from here. Enjoy!


bit.ly URL Shortener Workflow

30/05/2016

It is here but I would like to share its story.

Ten years ago, I used to play Magic the Gathering… A lot and with excitement… I created decks by writing the card information on plain paper. If a deck was fun to play, I bought the cards from eBay auctions (another excitement) and played with my friends day and night. What I still like about Magic is not the sheer power of individual cards. Some cards are powerful or insanely powerful by themselves. But what I really like is the interaction between the cards and their effect on the overall game. The combos are much bigger than their card components. I feel that this is exactly what Workflow does between iOS apps.

I am trying to do my real work as much as possible on iPad. I especially follow Federico Viticci in that regard. Blogging a few times a month is one of my self-fulfilling tasks. I tried the WordPress and Blogo Mac apps, the iOS app and the web editor. None of them suits me. So I decided to prepare my simple blogging steps on iPad. Again, Mr. Viticci showed me the way.

I started with using Markdown. I am using 1Writer and Drafts. However, in this post, I will talk about a simple workflow I use to shorten URLs and how it came to life.

Every post I write contains a few links to other internet sites. This is the norm for most bloggers. I use bit.ly to shorten those links. In fact, I have found a way to do that with Workflow.

wrong way

As we can see, we need our own OAuth access token to make bit.ly shorten the URL for us. To have that, we log in to (or create) our bit.ly account and get our access token from here. We will use it in the workflow.

I got mine but no matter what, it failed with the proposed solution above. The problem was that the “Get Contents of Web Page” action creates a web archive output, which is rich text but does not have the required information in an accessible form. So I tried to modify it while learning the Workflow actions. It is advised to use the “Get Contents of URL” action. I replaced it with that and bingo!

right way

I wanted to learn what is going on behind the scenes. So let’s go over it together. As the first step, we make the link_save API call and get the contents. It is a text in JSON format. In our example, I wanted to convert the long URL http://www.grandprix.com/race/r941racereport.html. The response we get contains the data that we will use, plus the status code and status text, as explained in the API document.

{
    "status_code": 200,
    "data": {
        "link_save": {
            "link": "http://bit.ly/1qW7Coh",
            "aggregate_link": "http://bit.ly/1RDu8rY",
            "long_url": "http://www.grandprix.com/race/r941racereport.html",
            "new_link": 1
        }
    },
    "status_txt": "OK"
}

To access the “link” information, this text output should be converted to a dictionary first. Since it is JSON, it is fine to use the “Get Dictionary from Input” action. When that is done, we can get the data part by using the “Get Value for Key” action with the key set to “data”. This logic is important, since we will use it extensively whenever we try to extract information from JSON or XML files. Now we have this:

{
    "link_save": {
        "link": "http://bit.ly/1qW7Coh",
        "aggregate_link": "http://bit.ly/1RDu8rY",
        "long_url": "http://www.grandprix.com/race/r941racereport.html",
        "new_link": 1
    }
}

We are getting closer to the “link”. The same mechanism can be applied but this time, our key will be “link_save”.

{
    "link": "http://bit.ly/1qW7Coh",
    "aggregate_link": "http://bit.ly/1RDu8rY",
    "long_url": "http://www.grandprix.com/race/r941racereport.html",
    "new_link": 1
}

Applying the “Get Value for Key” action one last time, we get the bit.ly URL created for us. It can be sent to other applications by the “Copy to Clipboard” action. This workflow can be found here.
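The three successive lookups can also be written down in code. The sketch below is plain Java with nested maps standing in for Workflow dictionaries. It is only an illustration of the key-by-key descent, not a real JSON parser and not a call to bit.ly. The sample values are copied from the response above and the helper names are mine.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class NestedLookup {

    // One "Get Value for Key" step where the value is itself a dictionary.
    @SuppressWarnings("unchecked")
    static Map<String, Object> dictionaryAt(Map<String, Object> dictionary, String key) {
        return (Map<String, Object>) dictionary.get(key);
    }

    // Walks response -> "data" -> "link_save" -> "link".
    static String shortLink(Map<String, Object> response) {
        Map<String, Object> data = dictionaryAt(response, "data");
        Map<String, Object> linkSave = dictionaryAt(data, "link_save");
        return (String) linkSave.get("link");
    }

    // Hand-built copy of the sample response, standing in for the parsed JSON.
    static Map<String, Object> sampleResponse() {
        Map<String, Object> linkSave = new LinkedHashMap<String, Object>();
        linkSave.put("link", "http://bit.ly/1qW7Coh");
        linkSave.put("aggregate_link", "http://bit.ly/1RDu8rY");
        linkSave.put("long_url", "http://www.grandprix.com/race/r941racereport.html");
        linkSave.put("new_link", 1);

        Map<String, Object> data = new LinkedHashMap<String, Object>();
        data.put("link_save", linkSave);

        Map<String, Object> response = new LinkedHashMap<String, Object>();
        response.put("status_code", 200);
        response.put("data", data);
        response.put("status_txt", "OK");
        return response;
    }

    public static void main(String[] args) {
        System.out.println(shortLink(sampleResponse()));
    }
}
```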

While digging into the API Documentation, I found that there are two main methods to make a URL short. The one that I explained above is the complex one. The other is much simpler. In that method, we use the “shorten” API call. I read the details and saw that there is no need to get a JSON response and search for the short URL key by key by key… All we need to do is tell the shorten API call that the output should be in txt format! The output then contains only the short URL. I also uploaded that workflow to the gallery.

simple way

I hope it makes your lives simpler than before. I am planning to post about another workflow, which works together with 1Writer JavaScript actions to publish a post to WordPress.


Getting Smaller: Splitting Evernote Notes

13/05/2016

There is an interesting feature (I hope!) that Evernote has: It searches for a keyword blazingly fast within many notes but it is painfully slow to search for the same keyword in a single huge note. This makes me write many but small notes.

However, I have one big log file. A few days ago, I decided to split it into monthly journals, as in Drafts 4. My original log file has the following format:

dd.MM.yyyy - hhmm
* Bla bla bla activity
* More activity...

dd.MM.yyyy - hhmm
* Another activity on the same day but in a different time frame

dd.MM.yyyy - hhmm
* The activity started the day after

The preferred format I used in Drafts 4 was something like this:

## yyyy-MM-dd
@hhmm
Bla bla bla activity
More activity...

@hhmm
Another activity on the same day but in a different time frame

## yyyy-MM-dd
@hhmm
The activity started the day after

Now, the important thing is that the full log file should be separated into non-overlapping parts. Each part contains the logs of a single month. And what should I do with the monthly logs? I have three options:

Outputting them to different text files
Sending each of them as mails to my Evernote account
Using Evernote API to inject them as notes

For the time being, I have chosen the second one. I am planning to play with Evernote API in the near future.
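The conversion between the two formats and the monthly split can be sketched as below. This is a simplified illustration and not the code in my repository. It assumes that every entry starts with a “dd.MM.yyyy - hhmm” line and that activity lines start with “* ”. The class and method names are made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogSplitter {

    // Matches an entry header such as "01.05.2016 - 0900".
    private static final Pattern HEADER =
            Pattern.compile("(\\d{2})\\.(\\d{2})\\.(\\d{4}) - (\\d{4})");

    // Returns one journal text per month, keyed by "yyyy-MM", in order of appearance.
    static Map<String, String> splitByMonth(String log) {
        Map<String, StringBuilder> months = new LinkedHashMap<String, StringBuilder>();
        StringBuilder current = null;
        String lastDay = null;
        for (String line : log.split("\n")) {
            Matcher header = HEADER.matcher(line.trim());
            if (header.matches()) {
                String day = header.group(3) + "-" + header.group(2) + "-" + header.group(1);
                String month = day.substring(0, 7);
                current = months.get(month);
                if (current == null) {
                    current = new StringBuilder();
                    months.put(month, current);
                }
                // The day heading is written once, even if the day has several entries.
                if (!day.equals(lastDay)) {
                    current.append("## ").append(day).append("\n");
                    lastDay = day;
                }
                current.append("@").append(header.group(4)).append("\n");
            } else if (current != null) {
                // Activity lines lose their "* " bullet; blank lines are kept as separators.
                current.append(line.startsWith("* ") ? line.substring(2) : line).append("\n");
            }
        }
        Map<String, String> result = new LinkedHashMap<String, String>();
        for (Map.Entry<String, StringBuilder> entry : months.entrySet()) {
            result.put(entry.getKey(), entry.getValue().toString().trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String log = "01.05.2016 - 0900\n* Bla bla bla activity\n\n"
                + "02.06.2016 - 0830\n* The activity started the day after";
        for (Map.Entry<String, String> month : splitByMonth(log).entrySet()) {
            System.out.println("--- " + month.getKey() + " ---");
            System.out.println(month.getValue());
        }
    }
}
```

Each value of the returned map is the body of one monthly note, so any of the three options above can take it from there.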

No matter what our approach is, we need to use jar packages designed for specific purposes. Sending mails requires javax.mail.jar, and the Evernote API is provided within evernote-api.jar. Dependency management is a big deal, so I wanted to keep it as simple as possible with a proven tool: Maven. Actually, Maven is more than a dependency manager but for my purposes, it will fill that spot.

In this application, I will use Eclipse as my IDE. Creating a new project which will use Maven is simple. Just create a Maven Project. It will generate the directory structure and the basic pom.xml file, which is crucial in Maven operations. It is a three-step process. You can find mine below:

Maven-01

Maven-02

Maven-03

At the end of it, we will see the project in our Eclipse workspace. There are a few things to mention. First, the directory structure follows Maven’s preferred style. You can change it but do not forget to make the necessary adjustments to the Build Path. The source code is in the src/main/java folder. src/test/java contains the unit tests. src/main/resources and src/test/resources hold the settings and other files necessary for running the application or tests. Another point of interest is the pom.xml file. Here, we will add our dependencies. Moreover, this file tells Maven about the Java requirements. As you can see, the default JRE version is 1.5.

Maven-04

So, our project is ready but empty. We can either create new packages and classes in it, or copy pre-written source code. I had mine before the project creation. Therefore, I will put it inside the source code folders.

Maven-05

Maven-07

You can link files if you want. Then, the workspace will refer to the original location of the file. I generally copy, so the workspace is coherent within itself.

Oops… We have a problem.

Maven-08

With Java 7, the convenient try-with-resources statement was introduced. With it, a resource implementing java.lang.AutoCloseable can be declared right after try, and this ensures that the resource will be properly closed whether the try block fails or not. There is no need to explicitly close the resource. For more information, I advise you to visit this Oracle Java Tutorial.
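To see this behaviour in action, here is a small self-contained sketch. TrackedResource is a made-up class that only records what happens to it, so the order of events becomes visible. It needs Java 7 or later, which is exactly why my project complains.

```java
import java.util.ArrayList;
import java.util.List;

class TryWithResourcesSketch {

    // A made-up resource that records when it is opened, used and closed.
    static class TrackedResource implements AutoCloseable {
        private final List<String> events;

        TrackedResource(List<String> events) {
            this.events = events;
            events.add("open");
        }

        void use(boolean fail) {
            events.add("use");
            if (fail) {
                throw new IllegalStateException("simulated failure");
            }
        }

        @Override
        public void close() {
            events.add("close");
        }
    }

    // Runs one try-with-resources block and returns the recorded events.
    static List<String> run(boolean fail) {
        List<String> events = new ArrayList<String>();
        try (TrackedResource resource = new TrackedResource(events)) {
            resource.use(fail);
        } catch (IllegalStateException e) {
            // The resource has already been closed when we arrive here.
            events.add("caught");
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(run(false));
        System.out.println(run(true));
    }
}
```

In both runs, “close” is recorded without any explicit call to close(), and in the failing run it is recorded before the catch block executes.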

Eclipse says it can solve it. Great! But DO NOT DO THAT! If you only use that Eclipse workspace, or you somehow share your project properties with other Eclipse workspaces, that may be fine. But generally, that is not the case. Maven has its own way to solve it and it is a better one. Remember the pom.xml file? We can also define properties there. In our case, the source and target should be at least 1.7. My preference is 1.8. Problem solved.

Maven-09

To add a dependency, we need to modify the pom.xml file and update the project. For example, to integrate JUnit as our unit test framework, we should add its packages. Also, to run the code on a simple input, I made a small test file and put it under src/main/resources.

Maven-11

To update the project, we can use the following path:

Maven-16

After the update, the content of our project will be like this. Note the jar files under Maven Dependencies.

Maven-12

Since my code expects an input file, we will add a run/debug configuration in Eclipse to run it with the test file.

Maven-10

The output of the code is directly printed to the console. That may be good for the early testing phase, but it is undesirable for a production environment, even in my simple project. What I want is to send this output as a set of e-mails to my Evernote account. For that purpose, we will visit the central Maven repository to find the necessary packages and put them inside our project.

Maven-13

The most widely used mail API is javax.mail.

Maven-14

We get the Maven dependency information and directly paste it inside the pom.xml file.

Maven-15

We now have access to the methods for sending mail. I will send from my gmail account to my Evernote account. I know both mail addresses, so that part is fine. So, how can I send them? The mkyong tutorial here provides a great example. I have used the SSL version.

If you are also a gmail user, you will most probably encounter a problem as I did. As a security measure, gmail may prohibit sending emails without logging in through a client application. I solved it by changing my default settings, as explained here.

That’s all. You can get my code from github any time you like. I hope it will save your time as it saved mine.


Getting Bigger: How to Scale A Small Application

03/05/2016

A few months ago, I was required to list the anagrams in a file. The file contains one word per line. The output should be the anagram words. Each line of the output presents one set of anagrams, in no particular order. Hence, there will be as many lines as anagram sets.

Simple Design

The very first step I took was to think small and make a simple design. The main idea of my code is to transform every word to a canonical form, such that it is the same for every anagram of that word. So, I can use this canonical form as the key of a key-value map. This map stores the canonical form as key and the list of the words that have the same canonical form as value. Actually, the value is no more than the list of anagrams. After finding all the anagrams, I iterate over the keys of the map and output the values which have more than one word.

The canonical form is achieved by sorting the characters of the word. In my implementation, an upper case letter is a different character than its lower case counterpart. If that is not desired, the function extracting the canonical form can be modified to first convert the input into lower case. If the locale of the words is different than the default of the Java Virtual Machine (VM), or we would like to be VM independent for that matter, then toLowerCase(Locale locale) should be used.


	private static String getCanonicalForm(String word) {
		char[] content = word.toCharArray();
		Arrays.sort(content);
		return new String(content);
	}
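A case-insensitive variant might look like the sketch below. It is not what my code does; it only shows the toLowerCase(Locale) idea mentioned above. The class name is made up, and the Turkish example is there because the letter I is lowered differently in that locale.

```java
import java.util.Arrays;
import java.util.Locale;

class CaseInsensitiveCanonicalForm {

    // Lowers the case with an explicit locale, then sorts the characters.
    static String getCanonicalForm(String word, Locale locale) {
        char[] content = word.toLowerCase(locale).toCharArray();
        Arrays.sort(content);
        return new String(content);
    }

    public static void main(String[] args) {
        // Locale.ROOT gives the same result on every VM.
        System.out.println(getCanonicalForm("Listen", Locale.ROOT));
        System.out.println(getCanonicalForm("Silent", Locale.ROOT));
        // In Turkish, an upper case I becomes a dotless lower case letter.
        System.out.println(getCanonicalForm("I", Locale.forLanguageTag("tr")));
    }
}
```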

To sort the characters, I preferred the Java implementation in the Arrays class. This uses a quicksort variant for the char primitive type. It is an in-place and very fast sorting mechanism (most probably with O(n log n) time complexity). But because of its recursive nature, it (again most probably) requires O(log n) memory space, since each recursive call adds another frame to the call stack. My sorting choice can be replaced under certain restrictions on the input. For instance, if we are assured that the characters of the words will be strictly from the standard ASCII set, or the extended ASCII set, or only letters, or only alphanumeric characters, etc., we can use Radix (Bucket) sort. That way, it is possible to sort the characters in O(n) time, at the expense of O(Character Set Size) memory space.
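As an illustration of the linear-time alternative, the sketch below uses counting sort, which is the simplest member of the bucket sort family, with one bucket per character. It assumes that the input is restricted to the 128 standard ASCII characters and rejects anything else. It is not part of my implementation.

```java
class CountingSortCanonicalForm {

    // One bucket per standard ASCII character.
    private static final int ASCII_SIZE = 128;

    static String getCanonicalForm(String word) {
        int[] counts = new int[ASCII_SIZE];
        // First pass: count how many times each character occurs.
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (c >= ASCII_SIZE) {
                throw new IllegalArgumentException("Not a standard ASCII character: " + c);
            }
            counts[c]++;
        }
        // Second pass: write the characters back in increasing order.
        StringBuilder sorted = new StringBuilder(word.length());
        for (int c = 0; c < ASCII_SIZE; c++) {
            for (int k = 0; k < counts[c]; k++) {
                sorted.append((char) c);
            }
        }
        return sorted.toString();
    }

    public static void main(String[] args) {
        System.out.println(getCanonicalForm("listen"));
        System.out.println(getCanonicalForm("silent"));
    }
}
```

The two passes touch each character a constant number of times, and the counts array is the O(Character Set Size) memory mentioned above.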

The advantage of this approach is that it is easy to understand and to divide into smaller functions, which leads to a lower maintenance cost. Performance-wise, we consider each word only once. Assuming that the number of words dwarfs the number of characters in a word, the running time of this iteration will be the dominant factor. The time required to sort the characters will only be important when there are only a few input words. Hence, I can state that the time complexity of my algorithm is O(n), where n is the number of words. That is the best possible time complexity that can be achieved for this problem, since we must always investigate each word. The whole code is here:


import java.io.*;
import java.util.*;

public class AnagramPrinter {

	public static void main(String[] args) {
		if (args.length != 1) {
			System.out.println("Usage : java AnagramPrinter <file_name>");
			System.exit(1);
		}
		String fileName = args[0].trim();

		Map<String, List<String>> wordMap = new HashMap<String, List<String>>();
		try (BufferedReader br = new BufferedReader(new FileReader(fileName))) {
			String line;
			while ((line = br.readLine()) != null) {
				String canonicalForm = getCanonicalForm(line);
				List<String> anagramList = wordMap.get(canonicalForm);
				if (anagramList == null) {
					anagramList = new LinkedList<String>();
					wordMap.put(canonicalForm, anagramList);
				}
				anagramList.add(line);
			}
		} catch (FileNotFoundException e) {
			StringBuilder sb = new StringBuilder("File ");
			sb.append(fileName);
			sb.append(" not found!");
			System.out.println(sb.toString());
			System.exit(2);
		} catch (IOException e) {
			throw new RuntimeException(e);
		}

		printAnagrams(wordMap);
	}

	private static String getCanonicalForm(String word) {
		char[] content = word.toCharArray();
		Arrays.sort(content);
		return new String(content);
	}

	private static void printAnagrams(Map<String, List<String>> wordMap) {
		for (String canonicalForm : wordMap.keySet()) {
			List<String> words = wordMap.get(canonicalForm);
			if (words.size() > 1) {
				StringBuilder sb = new StringBuilder();
				for (String w : words) {
					sb.append(w);
					sb.append(" ");
				}
				System.out.println(sb.toString().trim());
			}
		}
	}

}

The memory requirements of my approach are, in fact, the weakest link. Using a map considerably speeds up the process, but that potentially requires a memory location for each word. Moreover, my algorithm cannot start to output the anagrams until all words are exhausted. These two factors should be reconsidered for scalability purposes. Without that, applying this algorithm directly is, at the very least, infeasible for millions of words and nearly impossible for the case of many billions. Here is why.

Suppose that the average number of characters in a word is 5. Let’s also assume that each character occupies 2 bytes. In the worst case, where no two words are anagrams of each other, 10 million words will need approximately 100MB of memory for the keys of the map and another 100MB for the values. So how can I modify my approach?
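The arithmetic behind these numbers is simple enough to write down. The sketch below only multiplies the assumptions of this paragraph. It counts raw character data and ignores the per-object overhead of the VM, so the real footprint would be higher.

```java
class MemoryEstimate {

    // Raw character data only: words x average characters x bytes per character.
    static long bytesForCharacters(long words, int averageCharacters, int bytesPerCharacter) {
        return words * averageCharacters * bytesPerCharacter;
    }

    public static void main(String[] args) {
        long keys = bytesForCharacters(10000000L, 5, 2);
        long values = bytesForCharacters(10000000L, 5, 2);
        // 1 MB is taken as 1,000,000 bytes here.
        System.out.println("keys:   " + keys / 1000000 + " MB");
        System.out.println("values: " + values / 1000000 + " MB");
    }
}
```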

Scalable Design

I thought about this in the context of Map-Reduce and implemented a similar algorithm using Hadoop. The idea in my HadoopAnagramPrinter.java code is more or less the same. Again, we generate the canonical form with the same rules. Hadoop first splits the whole input file into independent sets of blocks and feeds the data nodes with those. Each node running the mapper code outputs a key-value pair, which is simply the canonical form of the word and the word itself.

  public static class CanonicalMapper
      extends Mapper<Object, Text, Text, Text>{

      public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
        String canonicalForm = getCanonicalForm(value.toString());
        context.write(new Text(canonicalForm), value);
      }

      private static String getCanonicalForm(String word) {
        char[] content = word.toCharArray();
        Arrays.sort(content);
        return new String(content);
      }
  }

There can be many mapper nodes running in parallel, because the output of each of them is independent from the others. That is the essence of Map-Reduce which makes the whole process scalable. After all of them finish their jobs, before running the reduce job, Hadoop generates a list of values for each key created in the mapping phase and sorts these lists by their keys. The key is our canonical form and the list contains the words, which are anagrams! Therefore, the duty of Reducer is very simple: if the size of the list is greater than 1, append each word in the list and print them in one line.

  public static class AnagramIterableReducer
      extends Reducer<Text,Text,NullWritable,Text> {

      public void reduce(Text key, Iterable<Text> values, Context context) 
        throws IOException, InterruptedException {
        Iterator<Text> iterator = values.iterator();
        StringBuilder sb = new StringBuilder();
        if (iterator.hasNext()) {
          String word = iterator.next().toString();
          sb.append(word);
          if (iterator.hasNext()) {
            do {
              word = iterator.next().toString();
              sb.append(" ");
              sb.append(word);
            } while (iterator.hasNext());
            context.write(NullWritable.get(), new Text(sb.toString().trim()));
          }
        }
      }
  }
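For readers without a Hadoop installation, the data flow can be imitated in plain Java. The sketch below is only a single-machine simulation of the three stages (map, group and sort by key, reduce). It does not use any Hadoop class and the names in it are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

class ShuffleSketch {

    static String getCanonicalForm(String word) {
        char[] content = word.toCharArray();
        Arrays.sort(content);
        return new String(content);
    }

    // Map phase plus grouping: emit (canonical form, word) and collect the
    // values of each key in a list. The TreeMap keeps the keys sorted.
    static SortedMap<String, List<String>> mapAndGroup(List<String> words) {
        SortedMap<String, List<String>> grouped = new TreeMap<String, List<String>>();
        for (String word : words) {
            String key = getCanonicalForm(word);
            List<String> values = grouped.get(key);
            if (values == null) {
                values = new ArrayList<String>();
                grouped.put(key, values);
            }
            values.add(word);
        }
        return grouped;
    }

    // Reduce phase: one output line per key that has more than one word.
    static List<String> reduce(SortedMap<String, List<String>> grouped) {
        List<String> lines = new ArrayList<String>();
        for (List<String> values : grouped.values()) {
            if (values.size() > 1) {
                StringBuilder sb = new StringBuilder();
                for (String word : values) {
                    sb.append(word).append(" ");
                }
                lines.add(sb.toString().trim());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("stop", "listen", "pots", "silent", "java");
        for (String line : reduce(mapAndGroup(words))) {
            System.out.println(line);
        }
    }
}
```

In the real job, the grouping in the middle is done by Hadoop itself, which is why the Mapper and the Reducer stay so short.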

If we put the two codes side by side, we see that the map generation and word investigation part of the basic algorithm is directly turned into the Mapper class activity and the key-value list creation of Hadoop. The anagram printing code in the basic approach is nearly the same as the Reducer class. Upon a closer inspection, we see that the Reducer code can run on different nodes and produce the same result. That is because each list is independent from the other lists. A Partitioner, which divides the data into a number of segments, can be implemented. For example, assuming that we have 3 CPUs dedicated to run as reducers, we must have three partitions. The partitioner should divide the “canonical form (key)” – “list of anagrams (value)” input pairs into three non-overlapping segments. The criterion can be the length of the key, the alphanumeric order of the keys, the number of words in the anagram lists, etc. By this design, not only can we run the Mapper nodes in parallel, but we will also have the same opportunity for the Reducer nodes. Therefore, our whole system can be scalable.
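One simple criterion is the hash of the key. The sketch below shows that idea in plain Java, without any Hadoop class. It mirrors what, to my knowledge, Hadoop's default hash partitioner does with the hash code of the key, but here it works on String keys and the names are made up. The important property is that the same key always lands in the same segment, so the segments never overlap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PartitionSketch {

    // Chooses the segment (reducer index) for a canonical form.
    static int partitionFor(String canonicalForm, int numberOfReducers) {
        // Masking the sign bit keeps the result non-negative.
        return (canonicalForm.hashCode() & Integer.MAX_VALUE) % numberOfReducers;
    }

    // Distributes the keys into numberOfReducers non-overlapping segments.
    static List<List<String>> split(List<String> keys, int numberOfReducers) {
        List<List<String>> segments = new ArrayList<List<String>>();
        for (int i = 0; i < numberOfReducers; i++) {
            segments.add(new ArrayList<String>());
        }
        for (String key : keys) {
            segments.get(partitionFor(key, numberOfReducers)).add(key);
        }
        return segments;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("aajv", "eilnst", "opst", "act", "dgo");
        List<List<String>> segments = split(keys, 3);
        for (int i = 0; i < segments.size(); i++) {
            System.out.println("reducer " + i + ": " + segments.get(i));
        }
    }
}
```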


More Hadoop on My Jam

02/04/2016

After coding my first Hadoop application, I decided to add new requirements to enrich it. For example, I wanted to sort the jams in decreasing order. In this post, I will try to answer this question with another Hadoop application.

Ordering Mapper Output

The mapper class is the same as in the first Hadoop application. It simply writes 1 (one) to the context for each jam so that they can be counted later in the reduce step. So nothing changes for the mapper. The reducer should still count the number of likes for each jam. Moreover, it must output them in decreasing order.

There are a few ways to accomplish this. The one I implemented is to create a TreeSet and put outputs of each reducer to that TreeSet. Since the TreeSet is a sorted data structure, the output of the reducer will be a sorted set.

This solution introduces two new problems:

To use the TreeSet, we need to define a private class that either has a natural ordering, or can be externally sorted by a Comparator
Since the reducer code will run on different threads for different keys, the TreeSet must be synchronized

The answer to the first problem is to create a class which implements the Comparable interface and holds the jam name and the jam count. Its natural ordering is the ordering of the jam counts. Hence, the private class I used is below:

  private class JamCountPair implements Comparable<JamCountPair> {
    private String jam;
    private int count;

    public JamCountPair(String jam, int count) {
      this.jam = jam;
      this.count = count;
    }

    // Higher counts come first, so iterating a TreeSet yields decreasing order.
    // Ties are broken by jam name so that distinct jams are never merged.
    @Override
    public int compareTo(JamCountPair other) {
      if (this.count > other.count) return -1;
      if (this.count < other.count) return 1;
      return this.jam.compareTo(other.jam);
    }

    @Override
    public boolean equals(Object other) {
      if (this == other) return true;
      if (!(other instanceof JamCountPair)) return false;
      JamCountPair otherPair = (JamCountPair) other;
      // Null-safe comparison of the jam names.
      boolean sameJam = (jam == null) ? otherPair.jam == null : jam.equals(otherPair.jam);
      return sameJam && count == otherPair.count;
    }

    // Kept consistent with equals().
    @Override
    public int hashCode() {
      return 31 * (jam == null ? 0 : jam.hashCode()) + count;
    }
  }
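The other option mentioned above, ordering by an external Comparator, could be sketched as follows. This is a standalone illustration, not the code used in the job: the JamCount class, the ComparatorDemo holder and the sample jam names are all hypothetical.

```java
import java.util.Comparator;
import java.util.TreeSet;

// Hypothetical value class; it has no natural ordering of its own.
class JamCount {
  final String jam;
  final int count;

  JamCount(String jam, int count) {
    this.jam = jam;
    this.count = count;
  }
}

class ComparatorDemo {
  // Orders by count, highest first. Ties are broken by jam name so that
  // two different jams with the same count are both kept in the set.
  static final Comparator<JamCount> BY_COUNT_DESC = (a, b) -> {
    int byCount = Integer.compare(b.count, a.count);
    return byCount != 0 ? byCount : a.jam.compareTo(b.jam);
  };

  // Builds a TreeSet that is sorted by the external comparator.
  static TreeSet<JamCount> sortedSetOf(JamCount... pairs) {
    TreeSet<JamCount> set = new TreeSet<JamCount>(BY_COUNT_DESC);
    for (JamCount pair : pairs) {
      set.add(pair);
    }
    return set;
  }

  public static void main(String[] args) {
    TreeSet<JamCount> set = sortedSetOf(
        new JamCount("jamA", 3), new JamCount("jamB", 7), new JamCount("jamC", 3));
    for (JamCount pair : set) {
      System.out.println(pair.jam + "\t" + pair.count);
    }
  }
}
```

With a Comparator, the ordering lives outside the value class, so the same pairs could be sorted differently in another place without touching the class itself.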

The mapper class is the same as before. For the sake of completeness, I include the code here again:

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable>{

      private final static IntWritable one = new IntWritable(1);

      @Override
      public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        // Each input line is split on the tab character; the second field is the jam.
        String[] userAndJam = value.toString().split("\t");
        context.write(new Text(userAndJam[1]), one);
      }
  }

The task of the reducer is now twofold. It should still count the jams as before. Moreover, it has to sort the “jam – like count” pairs. The JamCountPair class serves that purpose. After counting the likes of a jam, the reducer stores the pair object in a SortedSet. The critical part is that this sorted set must be thread-safe, since there may be different reducers in different threads trying to store their own pair objects. That is the reason for making the sorted set synchronized. Here is my solution:

  public static class IntSumReducer
      extends Reducer<Text,IntWritable,Text,IntWritable> {
      private SortedSet<JamCountPair> sortedSet = Collections.synchronizedSortedSet(new TreeSet<JamCountPair>());
      private OrderedLikeCount olc = new OrderedLikeCount(); 

      @Override
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        sortedSet.add(olc.new JamCountPair(key.toString(), sum));
      }
  }

When the reducer has processed all input from the mappers, the sorted set is also ready. So, how will we print it? There must be a point where all reduce calls have finished and the sorted set is fully available. Welcome to the clean-up step of the reducer. Hadoop calls it once, after the last reduce call, so the sorted set is complete by then and we can iterate over it to write all pairs to the context. Note that each reduce task keeps its own set, so the output is globally sorted only when the job runs with a single reduce task. The reducer class becomes this:

  public static class IntSumReducer
      extends Reducer<Text,IntWritable,Text,IntWritable> {
      private SortedSet<JamCountPair> sortedSet = Collections.synchronizedSortedSet(new TreeSet<JamCountPair>());
      private OrderedLikeCount olc = new OrderedLikeCount(); 

      @Override
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        sortedSet.add(olc.new JamCountPair(key.toString(), sum));
      }

      // Called once, after the last reduce() call of this task.
      @Override
      public void cleanup(Context context) throws IOException, InterruptedException {
        // Iterating a synchronized collection requires holding its lock manually.
        synchronized (sortedSet) {
          for (JamCountPair jcp : sortedSet) {
            context.write(new Text(jcp.jam), new IntWritable(jcp.count));
          }
        }
      }
  }

Another approach is to sort the pairs inside the cleanup method. Each reducer again puts its output in a synchronized collection, but this time the collection does not need to be a sorted one. The time and space complexity do not change. For easier maintenance, this latter approach may be more suitable.
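Here is a minimal sketch of that second approach, written without Hadoop so that it runs on its own. The class and method names are hypothetical; reduce and cleanup below only imitate the lifecycle of the real Reducer methods and return the output lines instead of writing to a context.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the reducer: collect first, sort once at the end.
class SortInCleanupSketch {
  // Unsorted but synchronized list of (jam, count) entries.
  private final List<Map.Entry<String, Integer>> pairs =
      Collections.synchronizedList(new ArrayList<Map.Entry<String, Integer>>());

  // Imitates reduce(): sums the likes of one jam and stores the pair.
  void reduce(String jam, List<Integer> likes) {
    int sum = 0;
    for (int like : likes) {
      sum += like;
    }
    pairs.add(new SimpleEntry<String, Integer>(jam, sum));
  }

  // Imitates cleanup(): sorts by decreasing count and returns the output lines.
  List<String> cleanup() {
    List<Map.Entry<String, Integer>> sorted;
    synchronized (pairs) {
      sorted = new ArrayList<Map.Entry<String, Integer>>(pairs);
    }
    sorted.sort((a, b) -> {
      int byCount = Integer.compare(b.getValue(), a.getValue());
      return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
    });
    List<String> lines = new ArrayList<String>();
    for (Map.Entry<String, Integer> entry : sorted) {
      lines.add(entry.getKey() + "\t" + entry.getValue());
    }
    return lines;
  }

  public static void main(String[] args) {
    SortInCleanupSketch reducer = new SortInCleanupSketch();
    reducer.reduce("jamA", Arrays.asList(1, 1, 1));
    reducer.reduce("jamB", Arrays.asList(1));
    reducer.reduce("jamC", Arrays.asList(1, 1, 1, 1, 1));
    for (String line : reducer.cleanup()) {
      System.out.println(line);
    }
  }
}
```

The ordering rule is stated in exactly one place, the comparator inside cleanup, which is the maintenance benefit mentioned above.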


Software Development with iPad

13/03/2016

I clearly remember the first day I bought my iPad 2. On April 29th, 2011, I got the iPad and from then on, every single day, I have thought about how I could develop software with it. Though the iPad is a very capable machine and gets stronger with each generation, it is not possible to use it as a full-featured development machine. In this post, I would like to share my experience of developing simple Java code with an iPad.

The code editing is done on the iPad. The Java compilation happens on a DigitalOcean VPS. To me, it is much simpler and more pleasant to use text editing iOS apps than vim or similar tools on the VPS. I also add a GitHub flavor for preserving, sharing and version controlling the source code. Either a git client can run on the iPad, or the command line interface can be used through the VPS. I explain both of them.

The iOS Apps

I preferred to use Textastic for editing the source code. It is one of the best productivity tools developed for the iPad. The additional key row is priceless, especially for coders. It can also make SFTP connections. The combination of these two features made my choice really straightforward. You can see the simple Test.java file below. The selection disk in the upper right corner simplifies multiple line selection tremendously.

We need to connect to our DigitalOcean server over ssh. For that purpose, I chose Prompt 2. Its clean interface drew my attention. Also, typing on the iPad keyboard is very responsive. There are only 4 active extra keys above the keyboard. You can replace them with other predefined sets, but these are the ones I use most. The Tab key is vital and saves much time.

These two apps are our main tools. Optionally, as a git client, we can use the Working Copy Enterprise app. It has a free version, but that cannot push your commits, so it is pretty much useless for our purposes. I recommend spending the extra money and getting the full features. I will cover the use case for it later in this post.

Now, we have two paths to walk. The first one is editing the source code, sending it to the VPS, compiling on the VPS and committing / pushing the modifications to GitHub by means of the command line. The second path starts like the first one, but after a successful compile, we do not use the command line. Rather, we depend on Working Copy as our git client.

First, let’s create a git repository on GitHub.

From Safari on iPad, I logged into my account and created a FreeTesting repository.

The Direct Way to GitHub

We need to set up Prompt 2 for logging into our VPS. No matter which path we take, the VPS is our main compiler platform.

After logging in, we are welcomed by the command line. We clone our brand new repository to a VPS folder using the command:


git clone https://Merter@github.com/Merter/FreeTesting.git

So, the DigitalOcean VPS knows our GitHub repository, but the editor does not. We will create our first Java source code file and make the necessary connection between the Textastic editor and the VPS.

We create a new local file in Textastic.

We need to connect this file to our remote server. The File Transfer section shows the local files and remote connections side by side.

We are adding an (S)FTP Connection. This is where our VPS is.

Just fill in the necessary information to log in to the remote server successfully.

When we upload the file, it becomes connected to the respective remote location. Here, we see the basic file attributes and text information.

After the connection is complete, we can either upload our version edited in Textastic or update the file and overwrite our changes by downloading the latest version of the file.

We now have the file on the VPS. We can compile and run it to see what the results are. If the code has compile time errors or produces unexpected results, we can edit the code in the same way and update the file on the VPS.

We will send our modifications to GitHub with the command line from the VPS. First, we add the modified files. Then, we commit them and eventually push the changes to GitHub.

git status
git add Test.java
git commit -m "Hello World"
git push origin master

The Working Copy Way

The other way to achieve the same result is to skip the command line after the successful compilation when storing our files in GitHub. The Working Copy app will be our git client, so we will do the save (add), commit and push operations with it. The beauty of that app is that we can see the branches, the commits and their respective time and message information in a nice graphical representation. Moreover, with an iPad, we should take advantage of touch input as much as possible.

The first thing we will do with Working Copy is to clone the repository we have created.

After getting the repository, we turn back to Textastic again, since our source code is still there. We save that file to the respective repository in Working Copy.

We only saved the file but we can also commit it with a message at the same time.

We have the option to commit the files later. There are two ways to do that. Either we commit all the saved changes in the repository…

… or we can commit the individual file.

Either way, we can also push them while committing.

I preferred to push later. We can only push the branch of a repository.

We can see the overall situation and history of the repository with the following graph.

It was a pleasant and flawless flow to work on an iPad. I did not expect to do all these on an iPad mini but it was very nice to accomplish this feat.

I hope you enjoy it as much as I did. Happy coding…


Hadoop on DigitalOcean

04/03/2016

Raspberry Pi is good for toy clustering but it is dead slow for real tasks. The third version won’t change that fact. I was planning to add more RPis to my cluster but it seems to be a wasted effort. After I met DigitalOcean (I mean VPS in general), my whole perspective changed. Instead of running on RPis, the droplets became much more suitable targets for me. I decided to give them a try.

An RPi with a decent SD card is about $45. That is four and a half months of the $10 per month plan from DigitalOcean. I had 4 RPis, which translates into one and a half years of VPS payments. The snapshot / backup (with extra money) features are a bonus. Uptime is a bonus. Performance is a huge bonus. Anytime-connection is another bonus. Mobility is the ultimate bonus. It is not easy to move the RPis around. All in all, the VPS investment seems to justify itself, right?

There is one catch. The calculation above takes only one droplet into account. What about the other droplets (nodes) of a cluster? We can have many more than one. That’s fine but it will quickly add up to a dramatic increase in spending. So what if we have one droplet always fully operational and create other droplets from time to time, when we really need them?

I mentioned that in my previous post. Destroying/recreating can be a pain. To be honest, I did not do any experiments before writing that. It was unfair, so this time I toyed with the idea and set up a small cluster with two droplets. I also destroyed and recreated a droplet and saw that it is not as painful as I had expected. So without further ado, let me tell my story in detail.

At the beginning, I have one droplet and not much else. The first step is to create a snapshot of it. We will use that snapshot to create our second droplet in the cluster. To take a snapshot, the droplet must be powered off.

sudo shutdown -h now

In the web interface of DigitalOcean, we can take a snapshot of the droplet. After the snapshot was taken, the droplet started again by itself. That’s ok. Now we will create our second droplet.

[Screenshot: DigitalOcean, Create Droplet, step 1]

We select one of our snapshots so that we do not have to repeat the same Hadoop installation and arrangements. The important thing here is that the new droplet must be in the same datacenter region as the first one.

[Screenshot: DigitalOcean, Create Droplet, step 2]

The hostname and the number of droplets using the same configuration can also be set here.

[Screenshot: DigitalOcean, Create Droplet, step 3]

After hitting Create, the creation starts and finishes in no more than a few minutes.

[Screenshot: DigitalOcean, Create Droplet, step 4]

It is that simple to add more than one droplet to the cluster. I only followed my second RPi installation post. I modified the /etc/hosts files of both droplets. I simply put their names and IPs there. In droplet 02, I generated a new ssh key and appended it to the .ssh/authorized_keys file.

[Screenshot: hduser@02 ssh]

I changed the replication count to 2 in the hdfs-site.xml file. Moreover, I edited the slaves file. I logged in to both droplets from droplet 01. Then, I ran the following command in both droplets to clean the hdfs filesystem.

rm -rf /hdfs/tmp/*

I formatted the hdfs in droplet 01. Then it was time to run the daemons. I verified that everything was ok and ran the wordcount example. It works! This shows us that we are on the right track. Applying the same principles, we can add many more droplets to the same cluster.

We want to have a cluster with many droplets but we do not want to pay lots of money. So, what we should do is save the droplet configuration and run it only when we want to. DigitalOcean provides a facility to accomplish that, but with a few quirks. Now, I will show you what they are and how to work around them.

Whenever we want to preserve the state of a droplet, we can take a snapshot of it. Actually, we did this in our previous step, when creating the second droplet from the first one. Now that we have a real second droplet to preserve, we can take a snapshot of it.

[Screenshot: DigitalOcean, Take Snapshot]

[Screenshot: DigitalOcean, snapshot of droplet 02]

All of our snapshots and backups can be found under the Images link.

[Screenshot: Images page]

We can directly create a droplet from any of our snapshots. We can also rename or delete them. After the snapshot is created, we will destroy our droplet. The important point here is that while the destroy operation removes the server and its backups, it WILL NOT remove the snapshots.

[Screenshot: DigitalOcean, destroying droplet 02]

So we are free to recreate the 02 droplet whenever it suits.

[Screenshot: DigitalOcean, recreating droplet 02]

It seems really easy. Power off your droplet, take a snapshot of it, destroy it, recreate it and proceed from where you left off. In practice, it is a little more complicated than that. When we create droplet 02 from its latest snapshot, it gets a different IP. That’s problematic for our case since that IP is used in the /etc/hosts file and implicitly in the ssh login. This shows itself when we ssh to the second droplet from the first one.

[Screenshot: host key failure for droplet 02]

We modify the /etc/hosts file in both droplets and remove the host key of droplet 02 from droplet 01 by the following command:

ssh-keygen -f "/home/hduser/.ssh/known_hosts" -R ubuntu-1gb-ams3-02

To be on the safe side, I also recommend cleaning the /hdfs/tmp/ folder completely and reformatting the hdfs system.

That’s all we need to do. Our two-droplet cluster is safe and sound again.


A Digital Ocean Adventure

24/02/2016

Whenever I use a computer having an operating system other than a variant of Linux, I almost always install a virtual Ubuntu system on it. My single preference in the world of Windows is VirtualBox. In early days of my Mac history, I paid for Parallels and VMware Fusion but at the end settled with VirtualBox.

Virtual systems have many beauties. You can share your resources with your host operating system without crippling or accidentally damaging it. You have the freedom to play with different configurations, tools and deployments. You can take snapshots and return to them if something goes wrong. The whole virtual disk is a folder that you can easily transfer from machine to machine.

On the downside, they are limited to your host resources. RAM is the most critical one. Although Linux is stingy with it, you have to spare a few GBs of your RAM. For me, by today’s standards, the least amount of RAM needed to run two different operating systems at the same time is 8GB.

While experimenting with my RPi cluster and Hadoop, I developed the code on a virtual Ubuntu 14.04 LTS. But for a long time, I had been thinking about setting up a dedicated system to connect to remotely by ssh and transferring my development duties there. This way, I would have the option of using even an iPhone or an iPad, since there are many ssh apps on the store.

When I read Marco Arment’s post about it and listened to two episodes of his Under the Radar podcast, I decided to give it a try and dive into DigitalOcean. I created my very first Droplet.

[Screenshot: DigitalOcean, Droplets list]

We can control most of the droplets’ settings with an iOS app. I use DigitalOcean Manager by Philip Heinser.

I opened up my Hadoop installation post and installed Java 7. Later, I installed Hadoop 2.7.2. Nothing needs to be changed from the installation of version 2.7.1, other than the name of the machine. node1 is replaced with ubuntu-1gb-ams3-01, as can be seen above.

After the installation, I ran the WordCount test for smallfile.txt and mediumfile.txt. They took 10 and 30 seconds respectively. That is better than my virtual Ubuntu! I am really impressed.

As far as I know, we pay for the droplet even if it is powered off. That is because DigitalOcean keeps our data, CPU and IP reserved for us. There is an option to avoid that payment, but I do not recommend it. Here is how it works and why I do not recommend it.

Either with the iOS app or on the DigitalOcean web site, we first power our droplet off. That is mandatory for the second step. Then, we take a snapshot of the droplet. After that, we destroy the droplet. This does NOT destroy the snapshot. That’s the critical part. We can create a droplet from that snapshot whenever we want. Until that moment, DigitalOcean does not charge for that droplet anymore.

So why do I not recommend that way? I tried it and found that the root password changes. That’s not a problem for my Hadoop experiments. But the IP also changes. That is a problem. We cannot start hdfs without making the necessary changes. The commands that I executed are as follows:

rm ~/.ssh/authorized_keys
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh-keygen -f "/home/hduser/.ssh/known_hosts" -R localhost
ssh-keygen -f "/home/hduser/.ssh/known_hosts" -R 0.0.0.0

This seems to be a small price to pay, but think of it when there are many more data nodes in the cluster. Each time we create a droplet, we need to modify the IP->name mapping files.

Other than that, the only drawback I encountered while using a droplet is not being able to open large files. This seems to be related to the RAM size. Let’s face it, 1GB is not enough to open large files with vim.

I am planning to create another droplet and run Hadoop experiments on both of them.
