# Causal Inference by Stochastic Complexity

The algorithmic Markov condition states that the most likely causal direction between two random variables `X` and `Y` is the direction with the lowest Kolmogorov complexity. Due to the halting problem, however, Kolmogorov complexity is not computable.

We therefore propose to do causal inference by stochastic complexity. That is, we approximate Kolmogorov complexity via the Minimum Description Length (MDL) principle, using a score that is minimax optimal with regard to the model class under consideration. This means that even in an adversarial setting, such as when the true distribution is not in this class, we still obtain the optimal encoding for the data relative to the class.

We instantiate this framework, which we call CISC, for pairs of univariate discrete variables, using the class of multinomial distributions. Experiments show that CISC is highly accurate on synthetic, benchmark, and real-world data, outperforming the state of the art by a clear margin, and that it scales extremely well with regard to sample and domain sizes.
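
To make the scoring concrete, below is a minimal Python sketch of this kind of two-direction comparison. It is an illustration under our own simplifications, not the repository's implementation: the multinomial stochastic complexity is computed with the linear-time recurrence of Kontkanen and Myllymäki, the conditional term simply encodes `Y` separately within each group of identical `X` values, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def _binom_term(n, h):
    # binom(n, h) * (h/n)^h * ((n-h)/n)^(n-h), computed in log-space for stability
    if h == 0 or h == n:
        return 1.0
    log_term = (math.lgamma(n + 1) - math.lgamma(h + 1) - math.lgamma(n - h + 1)
                + h * math.log(h / n) + (n - h) * math.log((n - h) / n))
    return math.exp(log_term)

def multinomial_complexity(L, n):
    """Parametric complexity C(L, n) of the multinomial model over L categories,
    via the linear-time recurrence of Kontkanen & Myllymaki (2007):
      C(1, n) = 1
      C(2, n) = sum_h binom(n, h) (h/n)^h ((n-h)/n)^(n-h)
      C(L, n) = C(L-1, n) + n / (L-2) * C(L-2, n)   for L >= 3
    """
    if n == 0 or L <= 1:
        return 1.0
    c_km2 = 1.0                                            # C(1, n)
    c_km1 = sum(_binom_term(n, h) for h in range(n + 1))   # C(2, n)
    if L == 2:
        return c_km1
    for k in range(3, L + 1):
        c_km2, c_km1 = c_km1, c_km1 + n * c_km2 / (k - 2)
    return c_km1

def stochastic_complexity(xs, domain_size=None):
    """SC(x) = -log2 P(x | ML parameters) + log2 C(L, n)  (NML code length)."""
    n = len(xs)
    counts = Counter(xs)
    L = domain_size if domain_size is not None else len(counts)
    neg_loglik = -sum(c * math.log2(c / n) for c in counts.values())
    return neg_loglik + math.log2(multinomial_complexity(L, n))

def conditional_sc(ys, xs, y_domain_size):
    """SC(Y | X): encode Y separately within each group of identical X values."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    return sum(stochastic_complexity(g, y_domain_size) for g in groups.values())

def cisc_direction(xs, ys):
    """Prefer the factorization (cause, then effect given cause) with the
    shorter total code length; ties are left undecided."""
    dx, dy = len(set(xs)), len(set(ys))
    x_to_y = stochastic_complexity(xs, dx) + conditional_sc(ys, xs, dy)
    y_to_x = stochastic_complexity(ys, dy) + conditional_sc(xs, ys, dx)
    if x_to_y < y_to_x:
        return "X -> Y"
    if y_to_x < x_to_y:
        return "Y -> X"
    return "undecided"
```

The decision rule is purely a comparison of code lengths: whichever factorization, cause first and then effect given the cause, compresses the sample better is taken as the more plausible causal direction.
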
# Accurate Causal Inference on Discrete Data

Additive Noise Models (ANMs) provide a theoretically sound approach for inferring the most likely causal direction between pairs of random variables given only a sample from their joint distribution. The key assumption is that the effect is a function of the cause, with additive noise that is independent of the cause. In many cases ANMs are identifiable. Their performance, however, hinges on the chosen dependence measure, the assumptions we make about the true distribution, and the sample size.
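
Spelled out for the direction `X → Y`, the assumption reads as follows (standard ANM notation, not taken from this repository):

$$
Y = f(X) + N, \qquad N \perp\!\!\!\perp X .
$$

Identifiability then rests on the fact that, for most choices of `f` and noise distribution, the reverse direction admits no fit `X = g(Y) + Ñ` with `Ñ` independent of `Y`.
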
In this paper we propose to use Shannon entropy to measure the dependence within an ANM, which gives us a general approach by which we neither have to assume a true distribution nor have to perform explicit significance tests during optimization. Moreover, through the Minimum Description Length principle, we further show the direct connection between this ANM formulation and the more general Algorithmic Markov Condition (AMC). While practical instantiations of the AMC have so far not been known to be identifiable, we show that with certain adjustments using ANMs this is possible. Our information-theoretic formulation gives us a general, efficient, identifiable, and, as the experiments show, highly accurate method for causal inference on pairs of discrete variables, achieving (near) 100% accuracy on both synthetic and real-world data.
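
As a rough sketch of how such an entropy-based ANM score can be put to work (a simplification under our own assumptions, not the paper's reference implementation), the following fits a discrete function by greedily minimising the entropy of the residuals and then compares the totals `H(X) + H(Y - f(X))` and `H(Y) + H(X - g(Y))`:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of a sequence."""
    n = len(values)
    counts = Counter(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def fit_discrete_anm(xs, ys):
    """Greedily fit f in Y = f(X) + N by minimising the residual entropy
    H(Y - f(X)); a simplified coordinate-descent search that returns the
    residual entropy reached."""
    x_vals = sorted(set(xs))
    y_vals = sorted(set(ys))
    # start from the conditional mode of Y for each value of X
    f = {x: Counter(y for xi, y in zip(xs, ys) if xi == x).most_common(1)[0][0]
         for x in x_vals}

    def residual_entropy(f):
        return entropy([y - f[x] for x, y in zip(xs, ys)])

    best = residual_entropy(f)
    improved = True
    while improved:
        improved = False
        for x in x_vals:
            for cand in y_vals:
                old, f[x] = f[x], cand
                h = residual_entropy(f)
                if h < best - 1e-12:
                    best, improved = h, True
                else:
                    f[x] = old
    return best

def anm_direction(xs, ys):
    """Score both directions by H(cause) + H(residual) and pick the cheaper one."""
    x_to_y = entropy(xs) + fit_discrete_anm(xs, ys)
    y_to_x = entropy(ys) + fit_discrete_anm(ys, xs)
    if x_to_y < y_to_x:
        return "X -> Y"
    if y_to_x < x_to_y:
        return "Y -> X"
    return "undecided"
```

Under the MDL view taken in the paper, these entropies play the role of code lengths, so the comparison is between the total description lengths of the two factorizations of the joint distribution.
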