Tuesday, 19 June 2018

Testing Hive expressions locally on OS X with brew

If you want to test a Hive expression (like a regex for an RLIKE clause), you can do so locally on OS X with the following steps:

1) Run brew install hive
2) Copy the following into /usr/local/opt/hive/libexec/conf/hive-site.xml:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:/usr/local/var/hive/metastore_db;create=true</value>
  </property>
  <property>
    <name>hive.warehouse.subdir.inherit.perms</name>
    <value>false</value>
  </property>
</configuration>

3) Run schematool -initSchema -dbType derby
4) Run hive
5) At the hive> prompt, test your expression - e.g. select 'G:AUS:ENG:@:PT:X:' RLIKE '(G:USA:ENG:[$]:DL:X:)|(G:AUS:ENG:@:PT:X:)';

You’ll get true or false back, showing whether your expression matched.

Steps 2 and 3 only need to be done once, on install.

Sunday, 18 February 2018

Error handling when enforcing invariants in the construction of statically checked types

tl;dr - error handling would be a lot less painful if the compiler could coerce an Either[A, B] to an A or a B at compile time if the program would not type check if it was an Either[A, B], would type check if it was an A or a B, and whether it was in fact a Left[A, B] or a Right[A, B] was decidable at compile time.

Here's a problem that's been bothering me for a long time - how to handle errors when enforcing invariants in the construction of statically checked types. It's the nature of types to constrain the set of values they can represent. Consequently you have to construct an instance of a type from one or more other, often more general, types, and so it will frequently be possible to call that constructor with values that are not in the set of valid values for the type - at which point the attempt to construct the instance should be rejected, to prevent invalid program state. For instance, Java has a URI class with a constructor that accepts a String. Since the String cannot be guaranteed to be a valid URI, the constructor throws URISyntaxException if it is not a valid URI.

This introduces a problem, and one I've written about before. Ideally in a statically type checked language the type checker will tell you about the different values a function can return. And the URI constructor function does just that, via the much hated checked exception; the constructor will "return" either a valid URI or the exception, and the type checker will force you to do something about it.

In more functional languages such as Scala or Haskell we might represent that via a genuine return type - an Either[URISyntaxException, URI], or a Try[URI], or perhaps Option[URI] if we're not interested in giving much feedback on what went wrong. Which is all fine and dandy if the input is untrusted. I like the type checker to remind me that if I take some input from a user and try and turn it into a URI I need to handle the case when the user has failed to oblige.
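Converting the Java constructor's checked exception into such a return type is mechanical. A minimal sketch - parseUri is my own name, not a library function:

```scala
import java.net.{URI, URISyntaxException}

// Wrap the exception-throwing constructor in a total function:
// the failure mode becomes part of the return type.
def parseUri(s: String): Either[URISyntaxException, URI] =
  try Right(new URI(s))
  catch { case e: URISyntaxException => Left(e) }
```

Callers now cannot forget the failure case: the type checker insists they handle the Left before they can touch the URI.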

However, it becomes a lot more irritating in the case where we know the input is valid. Whether it's because we're using the language to configure something, or writing a test, or just jamming a little bit of code together to do something right now that doesn't need to take inputs and validate them, there are times when you just want to hardcode that value and so there's no error handling to be done.

Let's say we define a type to represent the set of natural numbers:

object Natural {

  def apply(natural: Int): Either[IllegalArgumentException, Natural] =
    if (natural > 0) Right(new Natural(natural))
    else Left(new IllegalArgumentException(s"$natural is not a Natural number"))
}

case class Natural private (natural: Int) extends AnyVal // NB in Scala 2 the generated copy method still bypasses the check

Cool - there's only one way to construct it, which enforces the invariant we want, and the type checker tells us it may not succeed. And let's define a function that will use it:

def sumOfFirst(n: Natural): Natural = ???

OK, let's test it:

test("sum of first 4 natural numbers") {
  sumOfFirst(Natural(4)) mustBe Natural(10)
}

And... compiler says no. I'm trying to pass an Either[IllegalArgumentException, Natural] to a function that expects a plain Natural, and I'm trying to compare the resulting plain Natural with an Either[IllegalArgumentException, Natural]. Not going to work. But making the compiler happy is going to make a bit of a mess of what was a fairly legible test:

test("sum of first 4 natural numbers") {
  Natural(4).map(sumOfFirst) mustBe Natural(10)
}

So now you've got the additional cognitive load of mapping over an Either to call the function you wanted to test, plus the fact that the types you are actually comparing in the assertion are both Either[IllegalArgumentException, Natural] rather than the simple Natural you were interested in. Not nice.
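And it compounds: as soon as two hardcoded values are involved, you need a for-comprehension just to call the function at all. A self-contained sketch - Nat, mkNat and add are my stand-in names, restating the smart-constructor pattern above:

```scala
final case class Nat(value: Int)

// Smart constructor enforcing the invariant, as with Natural above.
def mkNat(i: Int): Either[IllegalArgumentException, Nat] =
  if (i > 0) Right(Nat(i))
  else Left(new IllegalArgumentException(s"$i is not natural"))

def add(a: Nat, b: Nat): Nat = Nat(a.value + b.value)

// What we'd like to write: add(mkNat(2), mkNat(3)) == mkNat(5)
// What the types force on us:
val sum: Either[IllegalArgumentException, Nat] =
  for {
    a <- mkNat(2)
    b <- mkNat(3)
  } yield add(a, b)
```

Every extra hardcoded value adds another generator to the for-comprehension, and the assertion still compares Eithers, not the Nats we actually care about.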

In the URI case mentioned earlier, Java provides a get-out-of-jail option - public static URI create(String str) does not throw a checked exception; it will only fail at runtime if the string is invalid. We could do likewise for Natural. But that's a bit unsatisfying, because in the process we've opened up the possibility that the programmer abuses that facility, and the type checker has thereby lost some of its ability to prove our programs correct.
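For Natural the equivalent escape hatch might look like this - unsafeNat is my own name, and nothing like it exists unless you write it:

```scala
final case class Nat(value: Int)

def mkNat(i: Int): Either[IllegalArgumentException, Nat] =
  if (i > 0) Right(Nat(i))
  else Left(new IllegalArgumentException(s"$i is not natural"))

// URI.create-style escape hatch: collapse the Left back into a throw,
// trading the compile-time obligation for a runtime exception.
def unsafeNat(i: Int): Nat = mkNat(i).fold(e => throw e, identity)
```

Convenient for hardcoded values, but the type signature now lies by omission: nothing stops unsafeNat(-1) from compiling.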

Scala offers some interesting string parsing options where the compiler can start statically checking the contents of a String to create an instance of a constrained type and report any failure at compile time; Jon Pretty has made this significantly easier with http://co.ntextu.al/.

However, it occurs to me that there might be a more general way of doing this. What if, in cases where a pure function returns a result of type Either[A, B], but is being used in a way that means the compiler expects its result to be a plain A or a plain B, and it is called in such a way that its arguments are fully known at compile time, the compiler could actually call the function at compile time and check whether it returns a Left(a: A) or a Right(b: B)? And if it returns the correct one, permit its use directly as an A or B? And if not, report the other as the text of a compile error?

This would then work entirely seamlessly, with no library code, and would not require any String parsing for cases such as our Natural above that are not naturally stringy, so long as your constructor function enforces its invariants and reports the result as an Either. The first version of the test case above would happily pass, as both the constructions of Natural could be proved at compile time to return a Right(n: Natural) and hence be coerced to a Natural since that would type check and the Either[IllegalArgumentException, Natural] type would not.

The compiler would of course be clever enough to still treat the output as an Either[A, B] if it were being used in a context that required it to be an Either[A, B] - indeed, that would be the default. Only if the required type of the output was not satisfied by the type Either[A, B] but was satisfied by either the type A or the type B would it bother to start trying to work out whether it was decidable at compile time, and if so which it was.
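Scala 3's inline and scala.compiletime machinery gets part of the way there today for the constant-argument case - not the general Either coercion proposed above, just the compile-time-failure half. A sketch, assuming Scala 3:

```scala
import scala.compiletime.error

case class Natural(value: Int)

// When the argument is a compile-time constant, the inline if is
// evaluated during compilation: an invalid value becomes a compile
// error rather than a runtime Left.
inline def natural(inline n: Int): Natural =
  inline if n > 0 then Natural(n)
  else error("not a Natural number")
```

With this, natural(4) inlines to Natural(4) and natural(-1) fails to compile - though unlike the proposal above, it only works through a dedicated inline definition, not through the ordinary Either-returning constructor.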

Monday, 10 October 2016

Git repo that tracks only the head commit of a branch

Quick note to self on how to have a git repo that only contains a single commit, the head of a branch.

Initialise it with:

git clone <url> --branch <branch_name> --depth=1

Update it with:

git pull && git pull --depth=1 && git reflog expire --expire-unreachable=now --all

Monday, 3 October 2016

Gradle Multi-Project Build with local SNAPSHOT dependency resolution

Note - 2016-10-22 - I missed Gradle Composite Builds, which do something very similar

I often find myself on a project with multiple applications depending on common libraries, so I tend to end up with a super project looking like this:

|
|-app1
|-app2
|-lib

All three projects are separate git projects, separately built and deployed; the apps are on a continuous deployment pipeline, the lib requires a decision to cut a release to move it from a SNAPSHOT version to a fixed version. The top level project is just a convenience to allow checking them all out in one shot and building them all with one command.

During development of a feature that requires a change to the lib, I would update the dependency in the app that needs the feature to X.X.X-SNAPSHOT and work on them both at the same time.

In Maven this worked OK for development - both Maven and most IDEs would successfully resolve any SNAPSHOT dependencies locally if possible. Then after cutting a release of the app you only had to delete the -SNAPSHOT bit from the dependency version and job done.

However, Gradle does not do this by default; you have to specify the dependency as being part of your multi-project build like so:

dependencies {
  ...
  compile project(':lib')
  ...
}


This is much more invasive - changing to a release version of the lib now requires replacing that with:

dependencies {
  ...
  compile 'mygroup:lib:1.2.3'
  ...
}

So you have to add the group of the lib, and know which precise version to specify, rather than just deleting '-SNAPSHOT' from the version. This makes it harder to automate changing the dependency - ideally, I would like to release the lib automatically as part of the continuous deployment process of the app after pushing a commit of the app which references a SNAPSHOT version of the lib.

I'm experimenting with a way around this by manipulating the top-level Gradle build like so:

subprojects.each { it.evaluate() }

def allDependencies = subprojects.collect { it.configurations.collect { it.dependencies }.flatten() }.flatten()
def localDependencies = allDependencies.findAll { dependency ->
    subprojects.any { it.name == dependency.name && it.version == dependency.version && it.group == dependency.group }
}

subprojects {
    configurations.all {
        resolutionStrategy.dependencySubstitution {
            localDependencies.each {
                substitute module("${it.group}:${it.name}:${it.version}") with project(":${it.name}")
            }
        }
    }
}

This effectively gives the Maven behaviour - and IntelliJ at least respects it correctly and resolves the dependencies to the same workspace.

You can play with an example here:
https://github.com/lidalia-example-project/parent

Tuesday, 24 February 2015

DevOps

Just a quick note on what DevOps means to me. At heart I think it's two things:

  1. Developers need to take responsibility for what happens in production. This goes across definition of done (devs need to make sure the appropriate automated checks are in place so that the team will know both when it's not working and, as far as possible, why it's not working) and also across support; developers should be on support, feeling the pain of poor operational performance and monitoring.
  2. Operations work needs to be automated. Ideally nothing should ever be changed manually in production; everything should be done by an automated process that runs against multiple environments with an automated build, check & deploy process fast enough to use to deploy a fix when production's on fire.
    Automation is a form of development, and consequently requires the same disciplines and skills as any other development; automation code needs to be as well factored and well tested as any other form of code.
In other words, a lot of ops work is development and developers need to be doing ops work. Which does not mean there is no room for specialisation; but like a US undergraduate degree your ops major should have a minor in dev and your dev major should have a minor in ops. In addition they should be on the same team, working together (hopefully pairing) to bring both their specialities to bear on the problem of making the product work seamlessly in production.

Wednesday, 17 September 2014

Homebrew & Finder-launched Applications

Recently had an issue on Snow Leopard where scripts launched from IntelliJ did not have my Homebrew-installed executables on their PATH. Fixed it with the following:

sudo sh -c 'echo "setenv PATH /usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin" >> /etc/launchd.conf'

and restarting. No guarantees for any other machine / OS! YMMV.

Tuesday, 31 December 2013

Running a service on a restricted port using IP Tables

Common problem - you need to run up a service (e.g. an HTTP server) on a port below 1024 (e.g. port 80). You don't want to run it as root, because you're not that stupid. You don't want to run some quite complicated other thing you might misconfigure and whose features you don't actually need (I'm looking at you, Apache HTTPD) as a proxy just to achieve this end. What to do?

Well, you can run up your service on an unrestricted port like 8080 as a user with restricted privileges, and then do NAT via IP Tables to redirect TCP traffic from a restricted port (e.g. 80) to that unrestricted one:

iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-ports 8080

However, this isn't quite complete - the PREROUTING rule does not apply to traffic originating from the host itself, so locally you still can't reach the service on the restricted port. To work around this you need to add an OUTPUT rule. As it's an OUTPUT rule it *must* be restricted to the IP address of the local box - otherwise you'll find requests apparently to other servers are being re-routed to localhost on the unrestricted port. For the loopback adapter this looks like this:

iptables -t nat -A OUTPUT -p tcp -d 127.0.0.1 --dport 80 -j REDIRECT --to-ports 8080

If you want a comprehensive solution, you'll have to add the same rule for the IP address of every network adapter on the host. This can be done in Puppet like so:

define localiptablesredirect($to_port) {
  $local_ip_and_from_port = split($name,'-')
  $local_ip = $local_ip_and_from_port[0]
  $from_port = $local_ip_and_from_port[1]

  exec { "iptables-redirect-localport-${local_ip}-${from_port}":
    command => "/sbin/iptables -t nat -A OUTPUT -p tcp -d ${local_ip} --dport ${from_port} -j REDIRECT --to-ports ${to_port}; service iptables save",
    user    => 'root',
    group   => 'root',
    unless  => "/sbin/iptables -S -t nat | grep -q 'OUTPUT -d ${local_ip}/32 -p tcp -m tcp --dport ${from_port} -j REDIRECT --to-ports ${to_port}' 2>/dev/null"
  }
}

define iptablesredirect($to_port) {
  $from_port = $name
  if ($from_port != $to_port) {
    exec { "iptables-redirect-port-${from_port}":
      command => "/sbin/iptables -t nat -A PREROUTING -p tcp --dport ${from_port} -j REDIRECT --to-ports ${to_port}; service iptables save",
      user    => 'root',
      group   => 'root',
      unless  => "/sbin/iptables -S -t nat | grep -q 'PREROUTING -p tcp -m tcp --dport ${from_port} -j REDIRECT --to-ports ${to_port}' 2>/dev/null"
    }

    $interface_names = split($::interfaces, ',')
    $interface_addresses_and_incoming_port = inline_template('<%= @interface_names.map{ |interface_name| scope.lookupvar("ipaddress_#{interface_name}") }.reject{ |ipaddress| ipaddress == :undefined }.uniq.map{ |ipaddress| "#{ipaddress}-#{@from_port}" }.join(" ") %>')
    $interface_addr_and_incoming_port_array = split($interface_addresses_and_incoming_port, ' ')

    localiptablesredirect { $interface_addr_and_incoming_port_array:
      to_port    => $to_port
    }
  }
}

iptablesredirect { '80':
  to_port    => 8080
}