Tuesday, 31 December 2013

Running a service on a restricted port using IP Tables

Common problem - you need to run up a service (e.g. an HTTP server) on a restricted port below 1024 (e.g. port 80). You don't want to run it as root, because you're not that stupid. You don't want to run some quite complicated other thing you might misconfigure and whose features you don't actually need (I'm looking at you, Apache HTTPD) as a proxy just to achieve this end. What to do?

Well, you can run up your service on an unrestricted port like 8080 as a user with restricted privileges, and then do NAT via IP Tables to redirect TCP traffic from a restricted port (e.g. 80) to that unrestricted one:

iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-ports 8080

However, this isn't quite complete - this rule will not apply if you are on the host itself, so you still can't reach the service on the restricted port. To work around this I've found you need to add an OUTPUT rule. As it's an OUTPUT rule it *must* be restricted to the IP addresses of the local box - otherwise you'll find requests apparently to other servers are being re-routed to localhost on the unrestricted port. For the loopback adapter this looks like this:

iptables -t nat -A OUTPUT -p tcp -d 127.0.0.1 --dport 80 -j REDIRECT --to-ports 8080

If you want a comprehensive solution, you'll have to add the same rule over and over for the IP addresses of all network adapters on the host. This can be done in Puppet as so:

define localiptablesredirect($to_port) {
  $local_ip_and_from_port = split($name,'-')
  $local_ip = $local_ip_and_from_port[0]
  $from_port = $local_ip_and_from_port[1]

  exec { "iptables-redirect-localport-${local_ip}-${from_port}":
    command => "/sbin/iptables -t nat -A OUTPUT -p tcp -d ${local_ip} --dport ${from_port} -j REDIRECT --to-ports ${to_port}; service iptables save",
    user    => 'root',
    group   => 'root',
    unless  => "/sbin/iptables -S -t nat | grep -q 'OUTPUT -d ${local_ip}/32 -p tcp -m tcp --dport ${from_port} -j REDIRECT --to-ports ${to_port}' 2>/dev/null"
  }
}

define iptablesredirect($to_port) {
  $from_port = $name
  if ($from_port != $to_port) {
    exec { "iptables-redirect-port-${from_port}":
      command => "/sbin/iptables -t nat -A PREROUTING -p tcp --dport ${from_port} -j REDIRECT --to-ports ${to_port}; service iptables save",
      user    => 'root',
      group   => 'root',
      unless  => "/sbin/iptables -S -t nat | grep -q 'PREROUTING -p tcp -m tcp --dport ${from_port} -j REDIRECT --to-ports ${to_port}' 2>/dev/null"
    }

    $interface_names = split($::interfaces, ',')
    $interface_addresses_and_from_port = inline_template('<%= @interface_names.map{ |interface_name| scope.lookupvar("ipaddress_#{interface_name}") }.reject{ |ipaddress| ipaddress == :undefined }.uniq.map{ |ipaddress| "#{ipaddress}-#{@from_port}" }.join(" ") %>')
    $interface_addr_and_from_port_array = split($interface_addresses_and_from_port, ' ')

    localiptablesredirect { $interface_addr_and_from_port_array:
      to_port => $to_port
    }
  }
}

iptablesredirect { '80':
  to_port    => 8080
}
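Outside Puppet, the same per-address loop can be sketched in plain shell. This is a dry run that just prints the commands rather than executing them; the hard-coded address list stands in for whatever your host's interfaces actually report (e.g. via hostname -I):

```shell
# Dry run: print (rather than execute) one OUTPUT rule per local address.
# ADDRESSES is a stand-in for the host's real interface addresses.
ADDRESSES="127.0.0.1 192.168.0.10"
FROM_PORT=80
TO_PORT=8080

for addr in $ADDRESSES; do
  echo "iptables -t nat -A OUTPUT -p tcp -d ${addr} --dport ${FROM_PORT} -j REDIRECT --to-ports ${TO_PORT}"
done
```

Dropping the echo (and running as root) would apply the rules for real.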

Monday, 30 December 2013

Fixing Duplicate Resource Definitions for Defaulted Parameterised Defines in Puppet

Recently I have been working on a Puppet module which defines a new resource which in turn requires a certain directory to exist, as so:

define mything ($log_dir='/var/log/mythings') {

  notify { "${name} installed!": }

  file { $log_dir:
    ensure => directory
  }

  file { "${log_dir}/${name}":
    ensure => directory,
    require => File[$log_dir]
  }
}

As you can see the log directory is parameterised with a default, combining flexibility with ease of use.

As it happens there's no reason why multiple of these mythings shouldn't be installed on the same host, as so:

mything { "thing1": }
mything { "thing2": }

But of course that causes puppet to bomb out:
Duplicate definition: File[/var/log/mythings] is already defined

The solution I've found is to realise a virtual resource defined in an unparameterised class, as so:

define mything ($log_dir='/var/log/mythings') {

  notify { "${name} installed!": }

  include mything::defaultlogging

  File <| title == $log_dir |>

  file { "${log_dir}/${name}":
    ensure => directory,
    require => File[$log_dir]
  }
}

class mything::defaultlogging {
  @file { '/var/log/mythings':
    ensure => directory
  }
}

Now the following works:
mything { "thing1": }
mything { "thing2": }

If we want to override and use a different log directory as follows:
mything { "thing3":
  log_dir => '/var/log/otherthing'
}
we get this error:
Could not find dependency File[/var/log/otherthing] for File[/var/log/otherthing/thing3] at /etc/puppet/modules/mything/manifests/init.pp:12

This just means we need to define the new log directory as so:
$other_log_dir='/var/log/otherthing'
@file { $other_log_dir:
  ensure => directory
}
mything { "thing3":
  log_dir => $other_log_dir
}
and all is good. Importantly, applying this manifest will not create the default /var/log/mythings directory.

Tuesday, 10 December 2013

H2 & HSQLDB for Simulating Oracle

H2 & HSQLDB are two Java in-memory databases. They both offer a degree of support for simulating an Oracle database in your tests. This post describes the pros and cons of each.

H2

How to setup:


import org.h2.Driver;
import javax.sql.DataSource;
import org.springframework.jdbc.datasource.DriverManagerDataSource;

Driver.load();
DataSource dataSource = new DriverManagerDataSource(
    "jdbc:h2:mem:MYDBNAME;MVCC=true;DB_CLOSE_DELAY=-1;MODE=Oracle",
    "sa",
    "");

DB_CLOSE_DELAY is vital here or the database is deleted whenever the number of connections drops to zero - a highly unintuitive situation.

Pros:

I've found I had to make fewer compromises on my SQL syntax, and my DDL syntax in particular, using H2's Oracle compatibility mode. For instance it supports sequences, and making the default value of a column a select from a sequence, which HSQLDB does not.

Cons:

The transaction capabilities are not as good as HSQLDB. Specifically, if you use MVCC=true in the connection string then H2 does not support a transaction isolation of serializable, only read committed. If you do not set MVCC=true then a transaction isolation of serializable does work but only by doing a full table lock, which is not at all how Oracle does it.

HSQLDB

How to setup:

import org.hsqldb.jdbc.JDBCDriver;
import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate; 
import org.springframework.jdbc.datasource.DriverManagerDataSource;

Class.forName(JDBCDriver.class.getName()); // initialises the class so the driver registers itself
DataSource dataSource = new DriverManagerDataSource(
    "jdbc:hsqldb:mem:MYDBNAME",
    "sa",
    "");
JdbcTemplate jdbcTemplate = new JdbcTemplate(dataSource);
jdbcTemplate.execute("set database sql syntax ORA TRUE;");
jdbcTemplate.execute("set database transaction control MVCC;");

Pros:

MVCC with a transaction isolation of serializable works as expected - other transactions can continue to write whilst a transaction sees only the state of the DB when it started.

Cons:

Support for Oracle syntax, particularly in DDL, is patchy - I was unable to run the following, which works fine in Oracle:
CREATE SEQUENCE SQ_TABLE_A;
CREATE TABLE TABLE_A (
  ID NUMBER(22,0) NOT NULL DEFAULT (SELECT SQ_TABLE_A.NEXTVAL from DUAL),
  SOME_DATA NUMBER(22,0) NOT NULL,
  TSTAMP TIMESTAMP(3) DEFAULT CURRENT_TIMESTAMP);

Friday, 25 October 2013

CAP Theorem 2 - The Basic Tradeoff

WARNING - on further reading I'm not at all sure the below is accurate. Take it with a large pinch of salt as part of my learning experience...

tl;dr:

You can't sacrifice Availability, so you have to choose between being Consistent and being Partition Tolerant. But only in the event of a network partition! You can be Partition Tolerant and still be Consistent when no partition is occurring.

Following up from my previous post on CAP Theorem, I'm going to discuss what in practical terms the CAP trade-off means.

A is non-negotiable - a truly CP data store is a broken idea

Remember, "Available" doesn't mean "working", "Available" means "doesn't hang indefinitely". In the event of a network partition a truly CP data store will simply hang on a request until it has heard from all its replicas. A system that hangs indefinitely is a chocolate teapot. Poorly written clients will also hang indefinitely, ending up with users sitting staring at some equivalent of the Microsoft sand timer. In the end someone (a well written client, or just the poor schmuck staring at his non-responsive computer) will decide to time the operation out and give up, leaving them in the same not-working state as a CA system but with the additional worry that they've no idea what happened to the request they sent.

Hang on, there are CP data stores out there aren't there?

No, not really - not as I understand CAP theorem, anyway. See below!

The choice is between CA and AP

In fact it can be reduced to a very, very simple trade-off - in the event of a network partition, do I want the data store to continue to work or do I want the data store to remain consistent?

CA means a single point of failure

CA is the simplest model. It's what we get when we run up a single node ACID data store - it's either there, working and consistent or it isn't. There are ways to add a measure of redundancy to it in the form of read-only slaves with a distributed lock, but fundamentally if a network partition occurs between them and the master then the master has to stop accepting writes if it is to remain consistent with the slave.

It's a model that means outages are essentially guaranteed. If that's acceptable then it's nice and easy for developers to work with; but it's rarely acceptable.

Which leaves AP

Nearly all data stores are used in scenarios where there is a desire to avoid outages entirely in so far as is possible (human error notwithstanding). That means having multiple copies of state on machines connected by the network, which means network partitions can and will happen. Which means needing to be available and tolerant of those partitions.

Oh noes! No consistency! Sounds dreadful...

The important point to remember here is that the loss of Consistency implied by Partition Tolerance (i.e. Continuing to Work) only has to be accepted in the event of a partition. This is what lots of so-called "CP" systems are trying to do - remain consistent whilst the network is healthy, and only become inconsistent in the event of a partition.

Wednesday, 23 October 2013

CAP Theorem

WARNING - on further reading I'm not at all sure the below is accurate. Take it with a large pinch of salt as part of my learning experience...

I've been a bit confused over the meaning of the C, A and P of CAP theorem. I think I've got it sussed now, so this post is my attempt to encapsulate that knowledge and get it out there for someone to correct if I'm still wrong!

C - Consistent

This is the easy one - I think I've always understood this, though I'm sure there are nuances to it. If you write some data then, so long as no one gets an error and no one else has independently updated it in the meantime, anyone anywhere who reads it will see the same data you wrote.

A - Available

Took me a while to get this one; I was thinking of it in terms of whether a system is up or not. That's not what Available means in this context. All it means is "able to return a response in a timely manner". That response could be as simple as a refusal to allow a new TCP connection - that's a response, and a timely one. An HTTP system returning 500 errors is available. If you're not timing out trying to communicate with the system, it's available, no matter how unhelpful the responses you are getting back are.

In contrast, a system is unavailable when a client gets nothing back at all and is left waiting until it times out (you've got a timeout set up, right? Right?). Stick a Thread.sleep(Long.MAX_VALUE) in your HTTP handling code and your system is unavailable. Put a firewall in the way that quietly drops all response packets and you're unavailable.

P - Partition Tolerant

There are two aspects to this one. The first is the obvious one - a network partition occurs so that two nodes in a cluster are unable to communicate, without their getting a chance to sign off from each other beforehand. What was less obvious to me at first is that a node that crashes is an example of a partition - not, as I naively thought, an example of being unavailable. The other nodes in the cluster cannot distinguish between "crashed" and "network issue somewhere between us".

A system is Partition Tolerant if it a) has more than one node and b) it can handle transactions without returning an error in the event that those nodes cannot communicate. 

Consequences

It should be obvious that network partitions can always happen wherever a cluster exists with multiple nodes that hold their own copy of state and that need to communicate over a network in order to maintain consistent state. CAP theorem says that when that partition happens, one of C, A and P has to be sacrificed. And now it should be fairly clear why. When a client attempts to write to a cluster which is partitioned, that write will arrive at one side or the other of the partition, and the system will have to do one of three things:

Wait for the Partition to Heal (CP)

The simple solution to maintain consistency is for the node getting the write to wait, and not return a response until it knows that write has been committed on all nodes. Obviously this sacrifices availability - the partition may never heal, or not for a prohibitively long time. However, data will be consistent and no errors are returned, so we have a rather useless Partition Tolerance.

Discard the Write and return an Error (CA)

Option two is to return an error. Consistency is maintained by the simple expedient of not changing state at all. The system is available - it's returning errors in a timely manner. However it's not partition tolerant - indeed it's questionable whether there's any benefit over a single node data store. By having more than one node and a network connection the chances of failure are simply increased. A single node data store is CA - it's either there or not.

Accept the Write (AP)

The system is available and partition tolerant - no hanging, no error returned. The cost is that it is not consistent - the state either side of the partition is different, and someone reading from the other side of the partition will not see it. A dynamo style store with a read/write quorum lower than half the nodes has sacrificed C in return for A and P.

It's Not That Simple

Of course it isn't - the C, A and P qualities are not binary, they are a continuum, and data stores can make trade-offs between them. A dynamo style store can choose to sacrifice some tolerance to a partition in return for more consistency by setting quorums at a level of n/2 + 1. A system could tolerate mild unavailability in the hope of the partition healing quickly. A store can vote up masters so that consistency is only sacrificed between partitioned halves, not between all nodes. You get the idea.
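The n/2 + 1 figure comes straight from the overlap condition: with n replicas, a read quorum R and a write quorum W are guaranteed to share at least one node (so a read sees the latest write) only when R + W > n. A sketch of that check, with a hypothetical helper name:

```shell
# For n replicas, read/write quorums r and w guarantee that every read
# quorum intersects every write quorum only when r + w > n.
quorums_consistent() {
  local n=$1 r=$2 w=$3
  [ $((r + w)) -gt "$n" ] && echo yes || echo no
}

quorums_consistent 5 3 3   # n/2 + 1 on both sides: overlap guaranteed
quorums_consistent 5 2 2   # smaller quorums: a read can miss the latest write
```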


Tuesday, 30 July 2013

If-Modified-Since and If-None-Match woes

I'm confused about the correct behaviour of If-Modified-Since and If-None-Match in certain scenarios.

The scenarios I have in mind are:

  1. The client claims to have a later representation of a resource than an intermediate cache, but the intermediate cache's representation is still fresh.

    Imagine a cache has a fresh representation of /resource as so:

    ETag: some_etag
    Last-Modified: Fri, 26 Jul 2013 14:00:00 GMT

    It receives a request as so:

    GET /resource HTTP/1.1
    If-None-Match: other_etag
    If-Modified-Since: Fri, 26 Jul 2013 14:00:01 GMT

    The nature of ETags is that they have no notion of sequence - it's either the same or it isn't, so the cache cannot know whether the ETags differ because one represents a later or earlier version of the resource. However, from my reading of the draft update to the HTTP/1.1 spec HTTP Bis Section 5 the If-None-Match header takes precedence over If-Modified-Since, and so the correct behaviour here is for the cache to return a 200 with its old but fresh representation of the resource. That means that any client needs to be aware that it may quite legitimately receive an earlier version of an entity than the one it already has when making a conditional request to a cache. Which... disturbs me; my whole reason for making a conditional request is to find out if there's a fresher version, I don't want an older one!

    I suppose the theory is that the local client cache and the intermediate cache are obeying the same max-age / expires headers so by the time the client is needing to send conditional requests because its representation is stale the representation in the upstream cache should also be stale. However, this is not guaranteed - a request with a Cache-Control header forcing revalidation that goes via a different intermediate cache (due to load balancing) could easily legitimately leave a downstream client with a later representation than an upstream cache it later consults.
  2. The client claims to have a later representation of a resource than an origin server.

    Imagine the entity is stored in a database.

    Origin node 1 has temporarily cached an old version of the entity with an ETag of some_etag and a Last-Modified of Fri, 26 Jul 2013 14:00:00 GMT.

    Origin node 2 meanwhile returns a later version of the entity to the client.

    The client then makes a conditional request which happens to get mapped to node 1:

    GET /resource HTTP/1.1
    If-None-Match: other_etag
    If-Modified-Since: Fri, 26 Jul 2013 14:00:01 GMT

    What should node 1 return? It could:
    1. Treat this as a trigger to ensure that it updates its own cache.
    2. Assume that since If-Modified-Since is after its notion of Last-Modified, 304 Not Modified is safe to return
    3. Assume the client is talking nonsense and simply return a 200 with its older version of the entity
    Again, HTTP Bis Section 5 suggests that option 3 is the most "correct", though it feels very wrong to me. I favour option 1 followed by 3...

(One answer might be "don't use ETags if you care about this possibility". But the spec is quite clear that origin servers SHOULD return an ETag if they can...)
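To make scenario 1 concrete, here is a sketch (all names hypothetical) of the precedence rule as I read HTTP Bis: when If-None-Match is present the recipient evaluates only it, ignoring If-Modified-Since entirely. The date comparison is reduced to a pre-computed flag for brevity:

```shell
# stored_etag:   the ETag of the representation the cache/server holds
# if_none_match: the ETag the client sent (empty string if absent)
# ims_is_later:  "true" if the client's If-Modified-Since date is at or
#                after the stored Last-Modified date
evaluate_conditional() {
  local stored_etag=$1 if_none_match=$2 ims_is_later=$3
  if [ -n "$if_none_match" ]; then
    # If-None-Match takes precedence; If-Modified-Since is ignored
    [ "$if_none_match" = "$stored_etag" ] && echo 304 || echo 200
  elif [ "$ims_is_later" = "true" ]; then
    echo 304
  else
    echo 200
  fi
}

evaluate_conditional some_etag other_etag true   # scenario 1: 200 with the older representation
```

The first call is scenario 1 exactly: the ETags differ, so the later If-Modified-Since date never gets considered and the cache serves its old but fresh representation.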

Wednesday, 24 July 2013

Stupid Stuff I did Implementing an Atom Event Feed Service & Client

Some quick notes on the issues we've had implementing an Atom Feed as a means of exposing events to clients over HTTP as suggested by REST in Practice. You can read up on the full details in the book here.

Decisions I'm happy with:

  1. We implemented the feeds using ISO 8601 formatted query params: http://hostname/eventfeed/period?from=2013-07-24T09:30Z&to=2013-07-24T09:40Z. This worked well, as it meant we could as humans quickly find specific events we were interested in - in dev environments we have far fewer events, so we can simply ask for all those in a day rather than searching for a ten minute period that has one.
  2. We made the "from" inclusive and the "to" exclusive. This ensures that an entry only ever appears in one feed. It also means that if we set the Last-Modified header from the latest entry on the working feed, but from the "to" date on an archive feed, then a feed's Last-Modified header once it is archived is guaranteed to be later than its Last-Modified header while it was the working feed - even if the last event occurred a millisecond before the feed was archived.
  3. The persistent atom feed client we've built has proven very stable - all of the bugs have been on the server side (mostly around cache headers), proving that so long as the atom and http specs are followed rigorously on the server side writing a client is trivial.
  4. We've used the client in a replayable way - processing an entry simply updates our local cache of the entity to the current state of the entity by following a link. Whilst events should not arrive out of order or ever be replayed, it doesn't (functionally!) matter if they are.
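The inclusive-from/exclusive-to rule in decision 2 is easy to get wrong, so here's a minimal sketch of the membership test (a hypothetical helper, with epoch values standing in for the real ISO 8601 timestamps):

```shell
# An entry belongs to a feed iff from <= ts < to, so an entry falling
# exactly on a boundary lands in exactly one feed.
in_feed() {
  local ts=$1 from=$2 to=$3
  [ "$ts" -ge "$from" ] && [ "$ts" -lt "$to" ] && echo yes || echo no
}

in_feed 1000 1000 2000   # on the "from" boundary: in this feed
in_feed 2000 1000 2000   # on the "to" boundary: in the next feed, not this one
```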

Mistakes I made:

  1. Calculating whether a feed is a working feed or not multiple times when responding to the same request.
    This one is embarrassing, as it's a real schoolboy error, but I made it so may as well 'fess up. The same URI will at first point to a working feed and later point to an archive feed based on whether its period includes the current time or is entirely in the past. The working feed and the archive feed differ in their cache characteristics (the archived feed does not change, and so can be cached for a very long time) and also in whether or not they have a "next" link - an archive feed will, a working feed will not. Calculating whether or not a request was for a working feed twice in the same request can result in building the entity as a working feed, but setting the cache headers for an archive feed - which is disastrous, as it may lack events and will lack a next link.
  2. Using the same ETag & Last-Modified values for working and archived feeds
    We went for the ETag of a feed being the entry ID of the latest entry in the feed, but this created an issue - the ETag for the working feed must be different to the ETag for an archived feed, otherwise a conditional (If-None-Match) request for a feed that was the working feed but has since become an archive feed will get a 304 Not Modified response, with an eternal cache header since archived feeds should never change. Since the working feed will not have a "next" link, this means that the feed client gets stuck with a cached archive feed with no "next" link, so as it walks forward through the feed it always stops on that archive under the impression that there is no later feed whose events it can consume.

    Solving this was relatively trivial - we just appended "-archive" or "-working" to the ETag depending on which the feed was.

    In the same way the Last-Modified header should be the "updated" date of the latest entry on a working feed, but should be the "to" date on an archived feed to ensure that when a feed moves from being a working feed to being an archived feed a conditional If-Modified-Since request from the client will not result in the server returning a 304 Not Modified response to the client.
  3. Caching the latest feed independent of the working feed.
    The client starts at the latest feed, which is a static URI which we allow to be cached for 1 second. It works backwards following the previous links in the feeds until it finds the feed containing the latest entry it has processed. It then works forward processing each entry and following the next links in the feed to find the next feed. This ultimately leaves it at the working feed (not the latest feed - the feed prior to the working feed has been archived, so in order for it to be cached eternally it must link to the working feed so the link is still correct when the working feed is itself archived). The working feed is also cached for 1 second.

    The danger here is that because the latest and working feeds are cached separately, they may have different values in the cache - on run one, the latest feed is cached without event x, but by the time the working feed is requested it contains event x. This is then stored as the latest handled event. On run two, the cached latest feed may be returned - which does not contain event x, so the client follows the previous link chain to the earliest archived feed and in the process replays all of the events that it has already handled.

    Due to our decision to make the events replayable this is not functionally catastrophic, but obviously it creates performance and latency issues.

    There are three solutions we have thought of to this:
    1. Stop allowing the responses on the latest and working feeds to be cached at all. This is undesirable, as it removes the scalability characteristics we get from using even very limited (max-age=1) generic HTTP caching.
    2. Add a Content-Location header to the latest feed, pointing at the working feed, and to the working feed, pointing at the latest feed. As I understand it this is simply a hint to any cache that they are the same thing at the moment the response is received (crucially not thereafter), and therefore that the cache may update its representations under other URIs appropriately. A quick test suggested this worked with the Apache HTTP Client cache implementation, but the HTTP 1.1 spec itself does not go so far as to promise all caches will honour it, and it seems a relatively little talked about part of the spec. I'm not entirely sure that my reading of the spec is correct, and clearly there is a lot of potential room for bugs in caches which would affect us in a case such as this where the relationship between the URI of the latest feed and the URI of the working feed is only ever a temporary one.
    3. Switch the latest response from a 200 containing the feed to a 307 or 302 redirect to the working feed. This response can in turn be cached using an Expires header for the remainder of the working feed period. This seems relatively simple, so we're going to trial this approach.
  4. Using ROME/JDom with Xerces, and XML as the entries' content type
    We used ROME for both creating and consuming our Atom feed, but if the content type of your entry's content element is XML then ROME runs up a new SAXBuilder in JDom to parse it and prove it is valid. This in turn does a lot of reflection-based looking up of factories, and because we're using Apache TomEE it ends up with Xerces 2.9. This process happens for every single entry in every single feed, and for us proved catastrophically expensive - with just two clients loading feeds concurrently the service essentially became unavailable.
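The fix for mistake 2 can be sketched as a tiny helper (names hypothetical): derive the ETag from both the latest entry ID and the feed's archived/working state, so that archiving a feed always changes its ETag even when its latest entry doesn't:

```shell
# ETag = latest entry ID plus the feed state, so the transition from
# working to archived always produces a new ETag.
etag_for() {
  local latest_entry_id=$1 archived=$2
  if [ "$archived" = "true" ]; then
    echo "\"${latest_entry_id}-archive\""
  else
    echo "\"${latest_entry_id}-working\""
  fi
}

etag_for 42 false   # "42-working"
etag_for 42 true    # "42-archive" - differs even though the latest entry is unchanged
```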

Things that worry me:

At present we are storing the events in an Oracle database using a column defined as TIMESTAMP(3) WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP. I'm a little concerned that if the time on the database server were ever set backwards (perhaps due to an NTP correction) it is possible a new event would be inserted prior to events that clients had already consumed, and so those clients would never see it. It's possible that we could do the insert rather more intelligently - something along the lines of CURRENT_TIMESTAMP or latest event + 1 millisecond, whichever is the later, to ensure that a new event always happens after the latest event; though I need to play with some SQL to have a clear idea of whether this is possible.

Another option might be to attempt to maintain a sequence and return events sequentially rather than by time; however, this then presents us with the issue of ensuring the sequence is guaranteed to be sequential, which might not be the case with Oracle RAC and would be very hard should we have to distribute the database in a more fundamental way. I would be interested to hear of other people's solutions to this issue.

Still To Do:

I'd really like to open source both the client and server code for this, as the whole point is that it's a generic mechanism that can be used domain independently. The server side code, tied as it is to containers such as Spring and the Servlet API, is perhaps less widely useful; but the client side code is a pure library.

I'd also like to explore atom-archive-traverser, which I've just found and looks very similar to our client though built on different libraries.

Tuesday, 26 February 2013

Script to Install Public Key on Multiple Hosts

Here's a little script that can upload a public key onto a server - you could run it for multiple servers at the same time. Requires sshpass to be installed.

stty -echo
read -p "Password: " passw; echo
stty echo

public_key=`cat ~/.ssh/id_rsa.pub`

command="
mkdir -p ~/.ssh;
chmod og=,u=rwx ~/.ssh;
if [ ! -f ~/.ssh/authorized_keys ];
  then touch ~/.ssh/authorized_keys;
fi;
if ! grep -Fxq \"$public_key\" ~/.ssh/authorized_keys;
  then echo \"$public_key\" >> ~/.ssh/authorized_keys;
fi
"

function updateKey {
  sshpass -p "$passw" ssh -oLogLevel=Error -oStrictHostKeyChecking=no "$1" "$command"
  if [ $? -eq 0 ]
    then echo "Updated key on $1"
  else
    echo "Failed to update key on $1"
  fi
}

updateKey myhostname

Tuesday, 19 February 2013

Specifications, Tests & Code

This is a quick reaction to various things I've read recently, most immediately this tweet:


I think the observations in this article by Bertrand Meyer about the limits of testing are entirely correct. In any even vaguely complex system you cannot begin to test all the combinations of inputs and outputs. That's why we focus on testing what we think are the important cases and what we think are the boundary conditions. I agree with him that as such the tests are not the specification and cannot be. So I don't think we can just replace "Test" with "Spec" and solve the problem.

(I should stick in a caveat here - I read a tweet by Ben Goldacre recently saying that people who rebut tweets in blog posts (or newspaper articles) are being prats, because a tweet by its nature is going to lack subtlety and depth of argument. I dare say that Kevlin Henney would mount a staunch defence of what he actually meant, perhaps along the lines of Martin Fowler's "Specification by Example" essay which acknowledges that a specification by example will be necessarily incomplete with the rest of the specification to be inferred from it.)

I think there's a useful analogy with real science. A specification is the equivalent of a theory; F=ma, for instance, or E=mc². A test is the equivalent of an experiment; for a given set of controlled inputs, it measures the actual output against that predicted by the theory. And the running system is the equivalent of the real world. Just as in science, the tests (experiments) cannot prove the specification (theory) holds in the runtime system (real world), they can only disprove it. A black swan event can still occur (and anyone who has ever written software will have encountered bugs in well tested software arising from inputs the tester had not anticipated and so had not tested for).

The analogy breaks down in two respects; firstly, a correct but failing experiment in science means that it's time to re-evaluate the theory, because reality isn't subject to error, whereas often in programming it means that the running system is not behaving as actually desired.

Secondly, in science the theory (specification) is something a human being writes and understands and is obviously distinctly separate from the real world (runtime system); it may or may not accurately represent it. This leads me on to the second article that prompted me to write this post; Leslie Lamport arguing that we need formal specifications in addition to code. To me a specification is a formal, logically precise, human readable statement of precisely how a system is expected to operate under all conditions. So far so in agreement. However, once you've got such a thing, I think it should be possible to compile it into a form a computer can execute, and the name for human readable text that can be compiled into a form that a computer can execute is "source code".

I do not accept at all the notion that the specification states "what and why" and the code states "how". Code is written at multiple levels of abstraction, typically represented by functions. I would argue that the why, what and how are encoded in these abstraction layers. For any given function, the function name states "what", the context of the parent function in which it is called states "why" and the body of the function states "how". As you move up and down the call stack, these roles change.

Which I think raises the question - if the runtime system (real world) is actually compiled from the specification (the theory) and the tests (experiments) are written to validate the specification (theory) is correct, haven't we got a circular argument? How can the tests ever fail? And why do we even need them?

I think the answer is that most of the time in programming we have two levels of specification. One exists in our heads or in a requirement document or a user story; it's informal, it doesn't cover all the cases, it may even be self contradictory or downright impossible at times, but it's essentially "correct" in the sense that it captures what we actually want this system to do. That's the one we use to write our tests with. Then we have to create the formal specification of what it should actually do under all circumstances, by writing the code. Our tests are about validating that the formal specification actually specifies what we were hoping it would specify.

Monday, 4 February 2013

Maven Logging Config for Libraries & Applications

A quick dump of my standard Maven poms for both libraries & applications.

Basic theory - pipe everything to SLF4J & use Logback as the SLF4J implementation.

A library should ONLY have a dependency on slf4j-api - it should not use classes in any logging implementation.

Libraries:


<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.acme</groupId>
  <artifactId>my_library</artifactId>
  <version>1.2.3</version>

  <properties>
    <slf4j.version>1.7.1</slf4j.version>
  </properties>

  <repositories>
    <repository>
      <id>version99</id>
      <url>http://version99.qos.ch/</url>
    </repository>
  </repositories>

  <dependencies>
    <!-- The only logging dependency a library should export -->
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-api</artifactId>
      <version>${slf4j.version}</version>
      <scope>compile</scope>
    </dependency>

    <!-- Implementation and bridges, needed only when running tests -->
    <dependency>
      <groupId>ch.qos.logback</groupId>
      <artifactId>logback-classic</artifactId>
      <version>1.0.7</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>jcl-over-slf4j</artifactId>
      <version>${slf4j.version}</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>jul-to-slf4j</artifactId>
      <version>${slf4j.version}</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>uk.org.lidalia</groupId>
      <artifactId>jul-to-slf4j-config</artifactId>
      <version>1.0.0</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>log4j-over-slf4j</artifactId>
      <version>${slf4j.version}</version>
      <scope>test</scope>
    </dependency>
    <!-- Empty version99 artifacts displace the real commons-logging & log4j -->
    <dependency>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
      <version>99-empty</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>log4j</groupId>
      <artifactId>log4j</artifactId>
      <version>99-empty</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>

Application:

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.acme</groupId>
  <artifactId>my_application</artifactId>
  <version>1.2.3</version>

  <properties>
    <slf4j.version>1.7.1</slf4j.version>
  </properties>

  <repositories>
    <repository>
      <id>version99</id>
      <url>http://version99.qos.ch/</url>
    </repository>
  </repositories>

  <dependencies>
    <!-- The API the application's own code logs against -->
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-api</artifactId>
      <version>${slf4j.version}</version>
      <scope>compile</scope>
    </dependency>

    <!-- Implementation and bridges, present at runtime -->
    <dependency>
      <groupId>ch.qos.logback</groupId>
      <artifactId>logback-classic</artifactId>
      <version>1.0.7</version>
      <scope>runtime</scope>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>jcl-over-slf4j</artifactId>
      <version>${slf4j.version}</version>
      <scope>runtime</scope>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>jul-to-slf4j</artifactId>
      <version>${slf4j.version}</version>
      <scope>runtime</scope>
    </dependency>
    <dependency>
      <groupId>uk.org.lidalia</groupId>
      <artifactId>jul-to-slf4j-config</artifactId>
      <version>1.0.0</version>
      <scope>runtime</scope>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>log4j-over-slf4j</artifactId>
      <version>${slf4j.version}</version>
      <scope>runtime</scope>
    </dependency>
    <!-- Empty version99 artifacts displace the real commons-logging & log4j -->
    <dependency>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
      <version>99-empty</version>
      <scope>runtime</scope>
    </dependency>
    <dependency>
      <groupId>log4j</groupId>
      <artifactId>log4j</artifactId>
      <version>99-empty</version>
      <scope>runtime</scope>
    </dependency>
  </dependencies>
</project>

And a logback.xml on the runtime classpath along these lines:

<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d [%thread] %-5level %logger{36} CLIENTID=%X{CLIENTID} SESSIONID=%X{SESSIONID} USERID=%X{USERID} TRANSACTIONID=%X{TRANSACTIONID} - %msg%n</pattern>
    </encoder>
  </appender>
  <root level="INFO">
    <appender-ref ref="STDOUT" />
  </root>
</configuration>