Additive Noise Models (ANMs) provide a theoretically sound approach for inferring the most likely causal direction between a pair of random variables given only a sample from their joint distribution. The key assumption is that the effect is a function of the cause plus additive noise that is independent of the cause. In many cases ANMs are identifiable. Their performance, however, hinges on the chosen dependence measure, on the assumptions we make about the true distribution, and on the sample size.
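Concretely, in the standard bivariate formulation implied above, an ANM from cause $X$ to effect $Y$ posits
\[
Y = f(X) + N \quad \text{with } N \perp X ,
\]
where $f$ is a (possibly nonlinear) function and the noise term $N$ is statistically independent of $X$.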
In this paper we propose to use Shannon entropy to measure the dependence within an ANM, which gives us a general approach that requires neither assuming a particular true distribution nor performing explicit significance tests during optimization. Moreover, through the Minimum Description Length principle, we further show a direct connection between this ANM formulation and the more general Algorithmic Markov Condition (AMC). While practical instantiations of the AMC have so far not been known to be identifiable, we show that, with certain adjustments, ANMs make this possible. Our information-theoretic formulation gives us a general, efficient, identifiable, and, as the experiments show, highly accurate method for causal inference on pairs of discrete variables, achieving (near) 100% accuracy on both synthetic and real-world data.
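As an illustration of how such an entropy-based ANM comparison can be operationalized on discrete data, the following is a minimal sketch, not the algorithm proposed in the paper: it assumes a simple conditional-mode fit as the discrete regression, plug-in entropy estimates, and a hypothetical decision rule that prefers the direction with the lower total entropy of cause plus residual noise.

```python
# Minimal sketch (illustrative only, not the paper's algorithm): decide the
# causal direction for a pair of discrete variables by fitting a simple ANM
# in both directions and comparing H(cause) + H(residual noise).
from collections import Counter, defaultdict
from math import log2


def entropy(values):
    """Plug-in Shannon entropy (in bits) of an empirical sample."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def anm_residuals(cause, effect):
    """Fit f(cause) as the conditional mode of the effect; return effect - f(cause)."""
    groups = defaultdict(list)
    for x, y in zip(cause, effect):
        groups[x].append(y)
    f = {x: Counter(ys).most_common(1)[0][0] for x, ys in groups.items()}
    return [y - f[x] for x, y in zip(cause, effect)]


def infer_direction(x, y):
    """Return 'X->Y', 'Y->X', or 'undecided' by comparing total entropy scores."""
    score_xy = entropy(x) + entropy(anm_residuals(x, y))  # X modeled as cause
    score_yx = entropy(y) + entropy(anm_residuals(y, x))  # Y modeled as cause
    if score_xy < score_yx:
        return "X->Y"
    if score_yx < score_xy:
        return "Y->X"
    return "undecided"


if __name__ == "__main__":
    import random
    random.seed(0)
    # Synthetic data generated as an ANM: y = f(x) + noise, noise independent of x.
    xs = [random.randint(0, 4) for _ in range(2000)]
    ys = [(3 * x + 1) % 7 + random.choice([0, 0, 1]) for x in xs]
    print(infer_direction(xs, ys))  # expected to print X->Y
```

In this sketch the lower total entropy plays the role of the shorter two-part description, which is where the MDL connection mentioned above enters; the full method differs in how the ANM is fit and scored.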