Download a non-jar dependency in gradle

We are using gradle as our build tool. Recently, we had to download a third-party war, a maven artifact, and bundle it into an RPM, which was our distributable. So in some sense the war was a dependency of our project.

As an initial solution, we downloaded the war manually from the maven repository and checked it into our source. When building the RPM, the gradle script would then take the war from the source tree and package it into the RPM. Obviously, we faced trouble when we had to increment the version of the war: we did not want to manually download the war again and replace the one in the source tree.

As a second increment, we downloaded the war from gradle using wget at build time and then packaged it into the RPM. This was better than the first solution because we did not have to download the war manually every time we updated the version. But it suffered from two problems: one, it expects wget to be installed on your system, and two, we lose out on the dependency handling capabilities of gradle. For example, you could specify a + in the version and gradle would automatically fetch the latest version of the war, without you having to manually change the version in the build.gradle file.

So, with a bit of help, we came up with a third approach. The solution is to add the war as a dependency under a custom configuration and then use the resolve method to download it.

configurations {
    mrsWar
}

dependencies {
    mrsWar "org.openmrs.web:openmrs-webapp:${openmrsVersion}@war"
}

task downloadMRSWar << {
    new File("${buildDir}/resources/main").mkdirs()
    configurations.mrsWar.resolve().each { file ->
        // Copy the file to the desired location
    }
}
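For completeness, here is one way the copy step inside that loop could look. This is a sketch rather than our exact build code; it assumes the war should land under build/resources/main (the directory the task above creates) and uses gradle's copy method:

task downloadMRSWar << {
    def targetDir = file("${buildDir}/resources/main")
    targetDir.mkdirs()
    configurations.mrsWar.resolve().each { warFile ->
        // copy each resolved artifact (the war) into the target directory
        copy {
            from warFile
            into targetDir
        }
    }
}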

This goes to prove the point that custom dependency configurations are one of the primary reasons to switch to gradle if you are using maven.

Twelve factor: Port Binding

I came across 12 factor through my colleagues and was intrigued by their opinion on port binding. 12 factor suggests that each web application be packaged as an executable war file and that the war file should have a web server embedded within it that binds the HTTP service to a port. Other people have gone a bit further and have packaged the application as a Linux service that runs the war file. This is completely different from the traditional way of deploying web applications to an application container. This post draws a comparison between the two approaches.

Installation

Installing an application with an embedded server (henceforth called the 12 factor app) does not require you to place a war file inside the server’s installation directory. With the traditional approach, the installation directory could differ from host to host, which makes the installation process more complicated. You could work around this complexity by using provisioning tools like puppet, but the fact remains that the root cause of the problem is coupling the installation of your application with the installation of the application server.

Use system package managers

The 12 factor app can be installed through system packages like RPM or Deb since it does not make any assumptions about installation directories. It is theoretically possible to install the traditional application through system packages, but since they would depend upon specific directory structures, it is not a recommended approach as the packages would not be portable.

Fine grained control over resource usage

Say that you have two web applications. Installing them inside the same container means that they are going to compete for resources like heap and perm space, thread pools, etc. Let’s say that one of those applications needs to have high throughput; you would want to allocate more resources to it. Since the resources provided by the container are shared between the two applications, there is no guarantee that increasing the resources allocated to the container would automatically translate into more resources for the high-throughput application. The solution is to split the applications into two separate processes.

Horizontal scaling

Imagine again that there are two web applications. This time, one web application consumes services provided by the other. If they are 12 factor, you could spin up multiple processes for the service provider and put them behind a load balancer at will. This allows you to scale your applications independently of each other.

System resources

Obviously, 12 factor apps, since they are separate processes, require more system resources than traditional apps.

Independent start and stop

12 factor apps, since they do not live inside the same process, can be started and stopped independently of each other.

Ordered starts

Consider again that there are two web applications, one consuming the service of the other. The service provider should be started before the service consumer. This is not possible if they are inside the same container, as there is no specification that guarantees the order in which applications are started inside a container like Tomcat. 12 factor apps, though, since they run as system services, can take advantage of the dependency mechanisms provided by service managers like chkconfig on CentOS boxes to implement ordering.
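For example, with SysV init scripts on CentOS, chkconfig encodes a start priority in the script header, and scripts with a lower start priority are started earlier within a runlevel. A sketch with hypothetical service names:

# /etc/init.d/service-provider
# chkconfig: 2345 20 80
# description: service provider, started first (start priority 20)

# /etc/init.d/service-consumer
# chkconfig: 2345 30 70
# description: service consumer, started after the provider (start priority 30)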

OS Specific

12 factor apps, since they use OS packages, are specific to the OS. You need to build separate packages for each OS you are going to deploy to.

Similar runtime

12 factor apps are not affected by hard-to-trace bugs that occur because a jar in the lib folder of Tomcat interferes with the one used in your application. This makes the runtime of your application the same in all the environments to which it is deployed.

Properties

Since the 12 factor app is an executable war which is not exploded, the configurable properties of the app should be sourced through environment variables and not property files.
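As a minimal illustration (the variable name DATABASE_URL is an assumption, not something prescribed by 12 factor), the application would read its configuration along these lines:

class AppConfig {
    // Read configuration from the environment and fail fast if it is missing.
    static String databaseUrl() {
        String url = System.getenv("DATABASE_URL");
        if (url == null) {
            throw new IllegalStateException("DATABASE_URL environment variable is not set");
        }
        return url;
    }
}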

Control server based provisioning: Credentials

Provisioning tools like puppet and ansible can operate in two modes

  1. Master slave [control server mode]
  2. Standalone

In the control server mode, there is a process running on the master (control server) which could be used to either manually or automatically provision a new machine. In standalone mode, the puppet or ansible scripts are downloaded to the machine that you want to provision and executed from there.

One problem with standalone mode is credential information. Regardless of whether your provisioning code is open source (e.g. on github) or not, it is a bad idea to check credential information into the source code. The provisioning script would thus have to securely download the credential information from a central, controlled location.

Though storing credential information on a central server solves the problem, setting up, running and maintaining such a server process is an additional cost. For small-scale automation, the most cost-effective strategy might be to enter the credential information manually when you provision a machine. This is tricky if you are provisioning in standalone mode, but natural when you are using a control server to configure your machines.

Ansible solves this problem beautifully by allowing you to specify prompts that will get this information at run time from the systems administrator. Given that we are provisioning at most eight machines, we find this strategy more effective than the standalone setup.

Lessons learnt when specifying configuration as code

So often you hear the phrase “it works on my machine” when you face errors upon running a piece of software. The phrase is a symptom of a deeper problem whose root cause is the nature of software systems. Software systems are inherently dependent upon the environment in which they are executed, called the runtime. The runtime represents the set of resources that are available to the software system. A resource could be anything from main memory to a file in a specific location.

Problems arise when software systems make assumptions about the runtime environment. For example

file('/home/jiraaya/a.txt')

assumes that the file a.txt (which is a part of the runtime) is present in the location /home/jiraaya. Now when L runs the program on his machine, the file might instead be present in /home/L, which would cause the program to fail.

The traditional solution to this problem is to not assume the location of the file, and instead let L specify the location of the file as a parameter. So the code would look like this:

file("${path_to_file}")

where path_to_file is a variable, and L would set the path to the file to /home/L/a.txt.
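In a gradle build, for instance, one way to supply such a parameter is as a project property on the command line. A sketch, reusing the placeholder name from above:

// build.gradle: resolve the path from a project property instead of hard-coding it
def inputFile = file(project.property('path_to_file'))

// invoked as: gradle build -Ppath_to_file=/home/L/a.txt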

In very large programs, the number of such parameters could easily exceed the capacity of a single programmer to keep track of. There are multiple ways to solve this problem. A relatively recent approach tries to sidestep it in two steps.

  1. Use virtualization to setup similar machines and networks for everyone. [Standardization]
  2. Use tools like puppet or ansible to install packages that you need on top of the virtual machines. [Provisioning]

Now you might ask: why don’t you create a new virtual machine image with all the software that you want pre-installed? You could do that, but you would end up creating multiple images, each with a slightly different configuration. For example, on your developer machines the name of the user that runs your program might be dev, but on your QA machines it might be qa. Virtual machine images consume a lot of disk space, and it is a waste to create one for each variation.

Next you might ask: why don’t you use the shell, which has been there all along, to do the same things that puppet and ansible do? You can, but puppet and ansible have already solved a lot of problems that you would have to write code for yourself if you were using the shell. Which brings me to the set of lessons I learned when doing configuration as code myself.

  1. Use declarative programming to reduce redundant checks.
  2. Make your scripts idempotent. That is, they can be safely rerun multiple times without side effects (see the sketch after this list). Some guidelines to achieve this are
    1. Don’t try to install packages that are already installed
    2. Don’t try to add content to a config file like iptables that already has equivalent content
    3. The file system is a shared resource, use it carefully. Create a unique folder under /tmp for each run of the installation script and place all the temporary files that it creates inside that location.
  3. Don’t hide ugliness through automation. If the installation of your program requires ugly steps like copying files from one location to another, formatting them etc., try to think about solving that problem first.
  4. Use system packages like rpm and deb whenever possible. You get a lot of things like dependency handling, an update protocol etc. for free.
  5. Don’t compile from source using tools like puppet; it is better to create a package and distribute it through local repositories.
  6. Test your scripts using tools like vagrant and not on the actual machine.
  7. Do not install packages from untrusted sources; if you are using a mirror, verify checksums.
  8. Keep module configuration separate from node configuration.
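As a small illustration of the first two points, here is a sketch of declarative, idempotent steps in Ansible; the package name and the iptables rule are hypothetical examples, not taken from our actual scripts:

- name: Ensure nginx is installed (does nothing if it already is)
  yum:
    name: nginx
    state: present

- name: Ensure the firewall rule is present exactly once
  lineinfile:
    dest: /etc/sysconfig/iptables
    line: "-A INPUT -p tcp --dport 80 -j ACCEPT"
    state: present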

In short, standardization of the runtime and automated provisioning will save you a lot of time chasing down hard-to-debug errors, and as such have a direct impact on the success of the development effort.

Linux status tools

The following is a small set of tools that could be used to query the status of Linux systems.

  1. top          => process statistics
  2. sar          => system statistics
  3. iostat -mx   => IO statistics
  4. free -m      => RAM and swap usage statistics
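Typical invocations might look like the following (the sampling intervals are arbitrary):

top -b -n 1       # one batch-mode snapshot of process statistics
sar -u 1 3        # CPU utilization, three samples one second apart
iostat -mx 1 3    # extended IO statistics, reported in megabytes
free -m           # RAM and swap usage in megabytes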

Gradle build optimizations

It is good to have fast builds. They provide immediate feedback and save a lot of waiting. Towards this end, every build tool does some form of optimization. The most common optimization is not running tasks that are deemed to be up-to-date.

As an example, consider a compile task that takes in java files and generates class files. It is deemed to be not up-to-date when

  1. There are no class files present in the working set
  2. The input files were modified
  3. The output files generated by the compile are different from those generated by the previous run

The last point is somewhat non-obvious. There could be many reasons why the output of a compile differs for the same input, and one of them is the way we link libraries. Regardless, it might be difficult to see how tracking output changes could possibly be an optimization. The reason is that if a build tool, after executing a task, finds that its outputs have not changed, it need not run the tasks that “depend upon” the current task. For example, we need not run the package task if there was no change to the binaries.

Historically, tools have used timestamps to determine whether the sources are up-to-date. This strategy is error-prone, and it cannot easily be extended to the output files as they are regenerated every time you run the task. Hence gradle uses hashes of these files to determine whether they are up-to-date, similar to how git determines whether content has changed.

There is one small problem with this strategy: gradle never considers a task that does not generate any output to be up to date. To overcome this, gradle lets you define your own strategy for determining whether the task is up to date, using the upToDateWhen() method of TaskOutputs. This has proven to be very effective and significantly reduces build times while avoiding hard-to-debug errors.
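As a hypothetical example (the task and the condition are made up for illustration), a task that produces no registered outputs could be given an up-to-date check like this:

task fetchSchema << {
    // download a schema file into ${buildDir}/schema (illustrative only)
}

fetchSchema.outputs.upToDateWhen {
    // treat the task as up to date as long as the previously fetched file is present
    file("${buildDir}/schema/schema.json").exists()
}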

Reference

1. http://www.gradle.org/docs/current/userguide/more_about_tasks.html

Ansible: copy files from a Windows share

We are using Ansible to automate the installation of software that our system depends upon in deployment environments. One of the tasks required us to copy a set of files from a Windows share. The target operating system was CentOS. We used smbclient to copy the files from the Windows share to the CentOS box.

First, we need to prompt the admin to enter the username and password to access the Windows share:

---
- hosts: all

  vars_prompt:
    - name: smb_username
      prompt: "Enter samba share username"
    - name: smb_password
      prompt: "Enter samba share password"
      private: yes

Then, using Ansible, we would have to install smbclient and the libraries that it requires, namely:

- samba
- samba-client
- samba-common
- cifs-utils
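The corresponding installation step could be a simple yum task (a sketch; the task name is ours):

- name: Install smbclient and the libraries it requires
  yum: name={{ item }} state=present
  with_items:
    - samba
    - samba-client
    - samba-common
    - cifs-utils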

After this, you are good to use the smbclient command to copy the files from the Windows share like this:

- name: Copy archive from samba_share
  command: >
    smbclient //hostname/samba_share/ {{ smb_password }} -U {{ smb_username }} -W "WORKGROUP"
    -c "recurse;lcd /local/path;get archive.zip"
    creates=/local/path/archive.zip

We faced one small hiccup though: smbclient was prompting the user to confirm each copy. As we were copying a large set of files, confirming each and every copy manually was tedious. Hence we added the prompt command to the smbclient command string so that it does not ask for confirmation before copying.

smbclient //hostname/samba_share/ {{ smb_password }} -U {{ smb_username }} -W "WORKGROUP" -c "recurse;lcd /local/path;prompt;get archive.zip"

Reference

1. https://servercheck.in/blog/getting-file-samba-server-ansible-playbook