ClassesInPython.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Statistical Learning 002: Classes in Python\n",
    "## concise code, increased functionality\n",
    " by Lukas Anneser - 16th July 2018\n",
    "\n",
    "If you want to read up on classes, you can use the following resources:\n",
    "\n",
    "1: official python documentation: https://docs.python.org/3/tutorial/classes.html \n",
    "\n",
    "2: an easy-to-read tutorial by Jeff Knupp: https://jeffknupp.com/blog/2014/06/18/improve-your-python-python-classes-and-object-oriented-programming/\n",
    "\n",
    "To get started with classes, it helps to understand that literally everything in python is an instance of a class: Integers, strings, floats - all of these are classes and have a certain set of rules that their class restricts them to.\n",
    "\n",
    "As an example, let's consider the '+' computation for strings and integers:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Import modules to use\n",
    "\n",
    "import os, sys\n",
    "import csv\n",
    "import matplotlib\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'int'>\n",
      "9\n",
      "<class 'str'>\n",
      "example\n"
     ]
    }
   ],
   "source": [
    "a = 5\n",
    "b = 4\n",
    "c = \"exa\"\n",
    "d = \"mple\"\n",
    "print(type(a))\n",
    "print(a + b)\n",
    "print(type(c))\n",
    "print(c + d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you see, we can use the same command on different classes - with rather different outcomes. What happens if we try to use the '+' operator to combine strings and integers?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "ename": "TypeError",
     "evalue": "unsupported operand type(s) for +: 'int' and 'str'",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mTypeError\u001b[0m                                 Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-25-2616b5956feb>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mprint\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0ma\u001b[0m \u001b[1;33m+\u001b[0m \u001b[0mc\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[1;31mTypeError\u001b[0m: unsupported operand type(s) for +: 'int' and 'str'"
     ]
    }
   ],
   "source": [
    "print(a + c)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In a case like this, python throws a so-called TypeError: although both integers and strings can use the '+' command, it cannot be used on different classes. The reason for this is that the '+' means something very different for integers as compared to strings. Let's have a look at the implementation of the addition operation for strings. The way python interprets a statement like a + b is the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "9"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a.__add__(b)"
   ]
  },
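  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The lookup behind `+` can be made visible with a small sketch. It assumes only built-in types and the standard `operator` module; the variable names `same_result`, `int_answer`, `str_has_radd`, and `outcome` are illustrative and not part of the original notebook. It suggests why the mixed expression `a + c` fails: `int.__add__` hands back `NotImplemented` for a string, and `str` offers no reflected `__radd__` to fall back on.\n",
    "\n",
    "```python\n",
    "import operator\n",
    "\n",
    "a, b, c = 5, 4, 'exa'\n",
    "\n",
    "# For two integers, the operator and the special method agree.\n",
    "same_result = operator.add(a, b) == a.__add__(b)\n",
    "\n",
    "# int.__add__ does not know how to handle a str. Instead of raising,\n",
    "# it returns the sentinel NotImplemented so Python can try the right operand.\n",
    "int_answer = a.__add__(c)\n",
    "\n",
    "# str defines no reflected method __radd__, so there is no fallback\n",
    "# and the expression a + c ends in a TypeError.\n",
    "str_has_radd = hasattr(c, '__radd__')\n",
    "\n",
    "try:\n",
    "    a + c\n",
    "    outcome = 'no error'\n",
    "except TypeError as err:\n",
    "    outcome = 'TypeError: ' + str(err)\n",
    "\n",
    "print(same_result, int_answer, str_has_radd)\n",
    "print(outcome)\n",
    "```\n",
    "\n",
    "The first printed line should read `True NotImplemented False`, and the second should repeat the TypeError message from the failing cell."
   ]
  },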
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the above statement, there are two rather important concepts to be found:\n",
    "\n",
    "1) 'a' is the instantiation of the class 'integer' and a point after the 'a' indicates that we will now use a function or attribute that is specific to this object.\n",
    "\n",
    "2) The two underscores before and after the 'add' command indicate that this is actually not a command we are supposed to use. It is a way to break into the class architecture and use internal commands. What happens if we try and use the command without these underscores?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "ename": "AttributeError",
     "evalue": "'int' object has no attribute 'add'",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-27-73f363cda532>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0ma\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0madd\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mb\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[1;31mAttributeError\u001b[0m: 'int' object has no attribute 'add'"
     ]
    }
   ],
   "source": [
    "a.add(b)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case, Python tells us that this attribute doesn't exist - that's the way it's supposed to be, we should not try to evoke this function like this.\n",
    "\n",
    "Hint: In a shared development project, you can indicate that you would prefer other developers to not use a function outside a certain environment with a single underscore.\n",
    "\n",
    "So, by now we know that basically everything in Python is an instance of a class and that these classes have certain attributes that can be used. How can we define classes ourselves and what is the benefit as opposed to regular script-coding?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "This is a student.\n"
     ]
    }
   ],
   "source": [
    "class Student():\n",
    "    print(\"This is a student.\")\n",
    "\n",
    "Hans = Student()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is the basic formula for defining classes. In this case, we defined a class 'student' and everytime a student is instantiated, we receive the feedback that the variable we created is a student. Not very useful, but it conveys the idea. What else could we do with this?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Hans\n",
      "neuroscience\n",
      "1\n"
     ]
    }
   ],
   "source": [
    "class Student():\n",
    "    def __init__(self, name, subject):\n",
    "        self.name = name\n",
    "        self.subject = subject\n",
    "        self.term = 1\n",
    "        \n",
    "Hans = Student('Hans', 'neuroscience')\n",
    "\n",
    "print(Hans.name)\n",
    "print(Hans.subject)\n",
    "print(Hans.term)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the __init__ function we can store certain attributes in our object. For example, we can now store the name of our student and their subject. We can also define that a new instance of a class automatically receives a certain attribute, in this case the information that our student is in his first term. \n",
    "\n",
    "Now imagine that we want to change the term of our student (at some point, a term has to finish). We could do this by manually changing the term:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Hans.term = 5\n",
    "Hans.term"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This works, but it makes no sense that Hans, a new student, is now already in his fifth term. Let's change our class definition to prevent mistakes like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Hans\n",
      "neuroscience\n",
      "1\n",
      "1\n",
      "2\n"
     ]
    }
   ],
   "source": [
    "class Student():    \n",
    "    def __init__(self, name, subject):\n",
    "        self.name = name\n",
    "        self.subject = subject\n",
    "        self.__term__ = 1        \n",
    "            \n",
    "    def whichTerm(self):\n",
    "        return self.__term__\n",
    "    \n",
    "    def increaseTerm(self):\n",
    "        self.__term__ += 1\n",
    "        return self.__term__\n",
    "        \n",
    "Hans = Student('Hans', 'neuroscience')\n",
    "\n",
    "print(Hans.name)\n",
    "print(Hans.subject)\n",
    "print(Hans.whichTerm())\n",
    "\n",
    "Hans.term = 5\n",
    "\n",
    "print(Hans.whichTerm())\n",
    "\n",
    "Hans.increaseTerm()\n",
    "\n",
    "print(Hans.whichTerm())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Three things happened:\n",
    "\n",
    "1) We double-underscored the term-attribute of the class student. Now, you cannot simply change it from outside.\n",
    "\n",
    "2) We designed a function called __whichTerm__ which we can use to see the term.\n",
    "\n",
    "3) We inserted a new function called __increaseTerm__ - invoking it increases the term by one.\n",
    "\n",
    "Our code is now safer, because we prevented the term variable to be set to weird values and restricted the access to our precious data.\n",
    "\n",
    "The next concept we have to cover here is __inheritance__. This is truly important because it saves you a lot of work. Let's assume you want to create a new class of students: PhD students. These students are supposed to have all the functionality of a regular student and additional ones. We can do this rather easily like this"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\n",
      "2\n",
      "Max works 60 hours per week.\n",
      "His intake of coffee is high.\n",
      "True\n",
      "True\n"
     ]
    }
   ],
   "source": [
    "class Phdstudent(Student):\n",
    "    workingHours = 60\n",
    "    coffeeConsumption = \"high\"\n",
    "            \n",
    "Max = Phdstudent(\"Max\", \"neuroscience\")\n",
    "\n",
    "print(Max.whichTerm())\n",
    "Max.increaseTerm()\n",
    "print(Max.whichTerm())\n",
    "print(\"Max works \" + str(Max.workingHours) + \" hours per week.\")\n",
    "print(\"His intake of coffee is \" + Max.coffeeConsumption + \".\")\n",
    "\n",
    "print(isinstance(Max, Student))\n",
    "print(isinstance(Max, Phdstudent))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you see, we can use both the __whichTerm__ and the __increaseTerm__ function, although we did not explicitly define it for the phdstudent class. Additionally, the PhD student has the attributes of 60 working hours per week and by default a high coffee consumption. The power of inheritance shines if we want to use particular classes and add functionality. Let's say we want to use a particular way to read in and store a dataset. Functions that we define(d) in order to manipulate or analyse the dataset can be added to the class and easily accessed from this point on. Additionally, we can change functions in dependence of the class we are looking at. Consider the following example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Max studies sociology\n",
      "Kathy studies mathematics and the amount of consumed coffee is high\n",
      "Flora studies history\n"
     ]
    }
   ],
   "source": [
    "class Student():    \n",
    "    def __init__(self, name, subject):\n",
    "        self.name = name\n",
    "        self.subject = subject\n",
    "        self.__term__ = 1        \n",
    "            \n",
    "    def whichTerm(self):\n",
    "        return self.__term__\n",
    "    \n",
    "    def increaseTerm(self):\n",
    "        self.__term__ += 1\n",
    "        return self.__term__\n",
    "    \n",
    "    def getInfo(self):\n",
    "        print(self.name + \" studies \" + self.subject)\n",
    "            \n",
    "class Phdstudent(student):\n",
    "    workingHours = 60\n",
    "    coffeeConsumption = \"high\"\n",
    "    \n",
    "    def getInfo(self):\n",
    "        print(self.name + \" studies \" + self.subject + \" and the amount of consumed coffee is \" + self.coffeeConsumption)\n",
    "        \n",
    "Max = Student(\"Max\", \"sociology\")\n",
    "Kathy = Phdstudent(\"Kathy\", \"mathematics\")\n",
    "Flora = Student(\"Flora\", \"history\")\n",
    "\n",
    "student_list = [Max, Kathy, Flora]\n",
    "\n",
    "for student in student_list:\n",
    "    student.getInfo()"
   ]
  },
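  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Overriding a method does not have to discard the parent's behaviour. As a hedged sketch (the snake_case names `get_info`, `PhdStudent`, and `people` are illustrative assumptions, and the methods return strings instead of printing), `super()` lets the subclass extend the inherited method:\n",
    "\n",
    "```python\n",
    "class Student:\n",
    "    def __init__(self, name, subject):\n",
    "        self.name = name\n",
    "        self.subject = subject\n",
    "\n",
    "    def get_info(self):\n",
    "        return self.name + ' studies ' + self.subject\n",
    "\n",
    "\n",
    "class PhdStudent(Student):\n",
    "    coffee_consumption = 'high'\n",
    "\n",
    "    def get_info(self):\n",
    "        # Reuse the parent's text and append the extra detail.\n",
    "        base = super().get_info()\n",
    "        return base + ' and the amount of consumed coffee is ' + self.coffee_consumption\n",
    "\n",
    "\n",
    "people = [Student('Max', 'sociology'), PhdStudent('Kathy', 'mathematics')]\n",
    "for person in people:\n",
    "    print(person.get_info())\n",
    "```\n",
    "\n",
    "With this variant, a change to the wording in `Student.get_info` carries over to the subclass automatically."
   ]
  },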
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you see, we can use the __getInfo__ function on each student in the list and dependent on whether the student is just a student or a PhD student, the output will be different. For our specific context, it would be nice to have a class which bundles all functions necessary for regression. This way, we can perform lots of regressions by a minimum amount of lines and loops. Let's get it started:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "class MyLinearRegression:\n",
    "    \n",
    "    def __init__(self, fit_intercept=True):\n",
    "        self.coef_ = None\n",
    "        self.intercept_ = None\n",
    "        self._fit_intercept = fit_intercept\n",
    "        \n",
    "    def fit(self, X, y):\n",
    "        \"\"\"\n",
    "        fit model coefficients.\n",
    "        \n",
    "        Arguments:\n",
    "        X: 1D or 2D numpy array\n",
    "        y: 1D numpy array\n",
    "        \"\"\"\n",
    "        \n",
    "        # check if X is 1D or 2D array\n",
    "        if len(X.shape) == 1:\n",
    "            X = X.reshape(-1,1)\n",
    "        \n",
    "        # add bias if fit_intercept is True\n",
    "        if self._fit_intercept:\n",
    "            X = np.c_[np.ones(X.shape[0]), X]\n",
    "            \n",
    "        # closed form solution\n",
    "        xTx = np.dot(X.T, X)\n",
    "        inverse_xTx = np.linalg.inv(xTx)\n",
    "        xTy = np.dot(X.T, y)\n",
    "        coef = np.dot(inverse_xTx, xTy)\n",
    "        \n",
    "        # set attributes\n",
    "        if self._fit_intercept:\n",
    "            self.intercept_ = coef[0]\n",
    "            self.coef_ = coef[1:]\n",
    "        else:\n",
    "            self.intercept_ = 0\n",
    "            self.coef_ = coef\n",
    "            \n",
    "    def predict(self, X):\n",
    "        \"\"\"\n",
    "        Output model prediction.\n",
    "            \n",
    "        Arguments: \n",
    "        X: 1D or 2D numpy array\n",
    "        \"\"\"\n",
    "            \n",
    "        # check if X is 1D or 2D array\n",
    "        if len(X.shape) == 1:\n",
    "            X = X.reshape(-1,1)\n",
    "        return self.intercept_ + np.dot(X, self.coef_)\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now use this class to implement the __fit__ and potentially also the __predict__ method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "20.485981308411226\n",
      "14.439252336448597\n"
     ]
    }
   ],
   "source": [
    "mlr = MyLinearRegression()\n",
    "X_data = np.array([1,4,6,9,20])\n",
    "y_data = np.array([20,80,110,170, 300])\n",
    "mlr.fit(X_data, y_data)\n",
    "\n",
    "print(mlr.intercept_)\n",
    "print(mlr.coef_[0])"
   ]
  },
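  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a cross-check on these numbers, the same fit can be sketched with NumPy's `np.linalg.lstsq`, which solves the least-squares problem without explicitly inverting `X.T @ X`. The variable names `A`, `solution`, `intercept`, and `slope` are illustrative; the data are the five points used above:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "X_data = np.array([1, 4, 6, 9, 20])\n",
    "y_data = np.array([20, 80, 110, 170, 300])\n",
    "\n",
    "# Design matrix with a leading column of ones for the intercept.\n",
    "A = np.c_[np.ones(X_data.shape[0]), X_data]\n",
    "\n",
    "# Least-squares solution without forming the inverse of A.T @ A explicitly.\n",
    "solution, residuals, rank, singular_values = np.linalg.lstsq(A, y_data, rcond=None)\n",
    "intercept, slope = solution\n",
    "\n",
    "print(intercept)\n",
    "print(slope)\n",
    "```\n",
    "\n",
    "Both values should agree with the output of `MyLinearRegression` up to floating-point rounding."
   ]
  },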
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, everything (data as well as methods) are organized in a single class object. This is already a rather good way to keep your data clean and tidy. However, the strength of classes (let's call it by its real name: __object oriented programming__ or __OOP__) is just starting to shine. Let's define another class that captures several metrics associated with regression."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Metrics:\n",
    "    \n",
    "    def __init__(self, X, y, model):\n",
    "        self.data = X\n",
    "        self.target = y\n",
    "        self.model = model\n",
    "        # degrees of freedom population dependent variable variance\n",
    "        self._dft = X.shape[0] - 1\n",
    "        \n",
    "    def sse(self):\n",
    "        # returns sum of squared errors (model vs actual)\n",
    "        squared_errors = (self.target - self.model.predict(self.data)) ** 2\n",
    "        self.sq_error_ = np.sum(squared_errors)\n",
    "        return self.sq_error_\n",
    "    \n",
    "    def sst(self):\n",
    "        '''returns total sum of squared errors (actual vs avg(actual))'''\n",
    "        avg_y = np.mean(self.target)\n",
    "        squared_errors = (self.target - avg_y) ** 2\n",
    "        self.sst_ = np.sum(squared_errors)\n",
    "        return self.sst_\n",
    "    \n",
    "    def r_squared(self):\n",
    "        '''returns calculated value of r^2'''\n",
    "        self.r_sq_ = 1 - self.sse()/self.sst()\n",
    "        return self.r_sq_    \n",
    "    \n",
    "    def mse(self):\n",
    "        '''returns calculated value of mse'''\n",
    "        self.mse_ = np.mean( (self.model.predict(self.data) - self.target) ** 2 )\n",
    "        return self.mse_\n",
    "    \n",
    "    def pretty_print_stats(self):\n",
    "        '''returns report of statistics for a given model object'''\n",
    "        items = ( ('sse:', self.sse()), ('sst:', self.sst()), \n",
    "                 ('mse:', self.mse()), ('r^2:', self.r_squared()), \n",
    "                  )\n",
    "        for item in items:\n",
    "            print('{0:8} {1:.4f}'.format(item[0], item[1]))\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Metrics class requires X, y, and a model object to calculate the key metrics. It’s certainly not a bad solution. However, we can do better. With a little tweaking, we can give MyLinearRegression access to Metrics in a simple yet intuitive way. Let me show you how:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [],
   "source": [
    "class ModifiedMetrics:\n",
    "    \n",
    "    def sse(self):\n",
    "        '''returns sum of squared errors (model vs actual)'''\n",
    "        squared_errors = (self.target - self.predict(self.data)) ** 2\n",
    "        self.sq_error_ = np.sum(squared_errors)\n",
    "        return self.sq_error_\n",
    "        \n",
    "    def sst(self):\n",
    "        '''returns total sum of squared errors (actual vs avg(actual))'''\n",
    "        avg_y = np.mean(self.target)\n",
    "        squared_errors = (self.target - avg_y) ** 2\n",
    "        self.sst_ = np.sum(squared_errors)\n",
    "        return self.sst_\n",
    "    \n",
    "    def r_squared(self):\n",
    "        '''returns calculated value of r^2'''\n",
    "        self.r_sq_ = 1 - self.sse()/self.sst()\n",
    "        return self.r_sq_\n",
    "    \n",
    "    def mse(self):\n",
    "        '''returns calculated value of mse'''\n",
    "        self.mse_ = np.mean( (self.predict(self.data) - self.target) ** 2 )\n",
    "        return self.mse_\n",
    "    \n",
    "    def pretty_print_stats(self):\n",
    "        '''returns report of statistics for a given model object'''\n",
    "        items = ( ('sse:', self.sse()), ('sst:', self.sst()), \n",
    "                 ('mse:', self.mse()), ('r^2:', self.r_squared()), \n",
    "                  )\n",
    "        \n",
    "        for item in items:\n",
    "            print('{0:8} {1:.4f}'.format(item[0], item[1]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "ModifiedMetrics no longer has the __init__ method. Using the powerful tool of __inheritance__, we can now modify __LinearRegression__ and allow it to access the functionality of ModifiedMetrics."
   ]
  },
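  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This arrangement is often called a mixin. In miniature, and with hypothetical names (`SummaryMixin`, `Measurements`, `standalone_ok`) that do not appear elsewhere in this notebook, the idea can be sketched like this: the mixin defines methods only, and the inheriting class supplies the attributes they need.\n",
    "\n",
    "```python\n",
    "class SummaryMixin:\n",
    "    # No __init__: the mixin relies on attributes set by the inheriting class.\n",
    "    def total(self):\n",
    "        return sum(self.values)\n",
    "\n",
    "    def mean(self):\n",
    "        return self.total() / len(self.values)\n",
    "\n",
    "\n",
    "class Measurements(SummaryMixin):\n",
    "    def __init__(self, values):\n",
    "        self.values = list(values)\n",
    "\n",
    "\n",
    "m = Measurements([20, 80, 110, 170, 300])\n",
    "print(m.total(), m.mean())\n",
    "\n",
    "# Used on its own, the mixin fails because nothing ever set self.values.\n",
    "try:\n",
    "    SummaryMixin().total()\n",
    "    standalone_ok = True\n",
    "except AttributeError:\n",
    "    standalone_ok = False\n",
    "print(standalone_ok)\n",
    "```\n",
    "\n",
    "The same caveat applies to the regression class: the metric methods only work once `fit` has stored the data and the target."
   ]
  },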
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [],
   "source": [
    "class MyLinearRegressionWithInheritance(ModifiedMetrics):\n",
    "    \n",
    "    \n",
    "    def __init__(self, fit_intercept=True):\n",
    "        self.coef_ = None\n",
    "        self.intercept_ = None\n",
    "        self._fit_intercept = fit_intercept\n",
    "          \n",
    "        \n",
    "    def fit(self, X, y):\n",
    "        \"\"\"\n",
    "        Fit model coefficients.\n",
    "\n",
    "        Arguments:\n",
    "        X: 1D or 2D numpy array \n",
    "        y: 1D numpy array\n",
    "        \"\"\"\n",
    "        \n",
    "        # training data & ground truth data\n",
    "        self.data = X\n",
    "        self.target = y\n",
    "        \n",
    "        # degrees of freedom population dep. variable variance \n",
    "        self._dft = X.shape[0] - 1  \n",
    "        \n",
    "        # check if X is 1D or 2D array\n",
    "        if len(X.shape) == 1:\n",
    "            X = X.reshape(-1,1)\n",
    "            \n",
    "        # add bias if fit_intercept\n",
    "        if self._fit_intercept:\n",
    "            X = np.c_[np.ones(X.shape[0]), X]\n",
    "        \n",
    "        # closed form solution\n",
    "        xTx = np.dot(X.T, X)\n",
    "        inverse_xTx = np.linalg.inv(xTx)\n",
    "        xTy = np.dot(X.T, y)\n",
    "        coef = np.dot(inverse_xTx, xTy)\n",
    "        \n",
    "        # set attributes\n",
    "        if self._fit_intercept:\n",
    "            self.intercept_ = coef[0]\n",
    "            self.coef_ = coef[1:]\n",
    "        else:\n",
    "            self.intercept_ = 0\n",
    "            self.coef_ = coef\n",
    "            \n",
    "    def predict(self, X):\n",
    "        \"\"\"Output model prediction.\n",
    "\n",
    "        Arguments:\n",
    "        X: 1D or 2D numpy array \n",
    "        \"\"\"\n",
    "        # check if X is 1D or 2D array\n",
    "        if len(X.shape) == 1:\n",
    "            X = X.reshape(-1,1) \n",
    "        return self.intercept_ + np.dot(X, self.coef_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The computed intercept is: 20.485981308411226\n",
      "The computed coefficient is: 14.439252336448597\n",
      "sse:     702.7103\n",
      "sst:     45320.0000\n",
      "mse:     140.5421\n",
      "r^2:     0.9845\n"
     ]
    }
   ],
   "source": [
    "mlr = MyLinearRegressionWithInheritance()\n",
    "mlr.fit(X_data, y_data)\n",
    "print(\"The computed intercept is: \" + str(mlr.intercept_))\n",
    "print(\"The computed coefficient is: \" + str(mlr.coef_[0]))\n",
    "mlr.pretty_print_stats()"
   ]
  },
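  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The reported statistics can be cross-checked without any of the classes above. The sketch below is an independent recomputation, not the implementation used in this notebook; it assumes `np.polyfit` for the line fit and reuses the same five data points:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "X_data = np.array([1, 4, 6, 9, 20])\n",
    "y_data = np.array([20, 80, 110, 170, 300])\n",
    "\n",
    "# Straight-line fit; np.polyfit returns the highest-degree coefficient first.\n",
    "slope, intercept = np.polyfit(X_data, y_data, 1)\n",
    "predictions = intercept + slope * X_data\n",
    "\n",
    "sse = np.sum((y_data - predictions) ** 2)\n",
    "sst = np.sum((y_data - np.mean(y_data)) ** 2)\n",
    "mse = sse / y_data.shape[0]\n",
    "r_squared = 1 - sse / sst\n",
    "\n",
    "for label, value in (('sse:', sse), ('sst:', sst), ('mse:', mse), ('r^2:', r_squared)):\n",
    "    print('{0:8} {1:.4f}'.format(label, value))\n",
    "```\n",
    "\n",
    "The four values should match the report printed by `pretty_print_stats` to the displayed precision."
   ]
  },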
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Statistical Learning 002: Classes in Python\n",
	"## concise code, increased functionality\n",
	" by Lukas Anneser - 16th July 2018\n",
	"\n",
	"If you want to read up on classes, you can use the following resources:\n",
	"\n",
	"1: official python documentation: https://docs.python.org/3/tutorial/classes.html \n",
	"\n",
	"2: an easy-to-read tutorial by Jeff Knupp: https://jeffknupp.com/blog/2014/06/18/improve-your-python-python-classes-and-object-oriented-programming/\n",
	"\n",
	"To get started with classes, it helps to understand that literally everything in python is an instance of a class: Integers, strings, floats - all of these are classes and have a certain set of rules that their class restricts them to.\n",
	"\n",
	"As an example, let's consider the '+' computation for strings and integers:\n",
	"\n"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 23,
	"metadata": {},
	"outputs": [],
	"source": [
	"#Import modules to use\n",
	"\n",
	"import os, sys\n",
	"import csv\n",
	"import matplotlib\n",
	"import numpy as np\n",
	"import matplotlib.pyplot as plt\n",
	"import pandas as pd"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 24,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"<class 'int'>\n",
	"9\n",
	"<class 'str'>\n",
	"example\n"
	]
	}
	],
	"source": [
	"a = 5\n",
	"b = 4\n",
	"c = \"exa\"\n",
	"d = \"mple\"\n",
	"print(type(a))\n",
	"print(a + b)\n",
	"print(type(c))\n",
	"print(c + d)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"As you see, we can use the same command on different classes - with rather different outcomes. What happens if we try to use the '+' operator to combine strings and integers?"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 25,
	"metadata": {},
	"outputs": [
	{
	"ename": "TypeError",
	"evalue": "unsupported operand type(s) for +: 'int' and 'str'",
	"output_type": "error",
	"traceback": [
	"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
	"\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)",
	"\u001b[1;32m<ipython-input-25-2616b5956feb>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mprint\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0ma\u001b[0m \u001b[1;33m+\u001b[0m \u001b[0mc\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
	"\u001b[1;31mTypeError\u001b[0m: unsupported operand type(s) for +: 'int' and 'str'"
	]
	}
	],
	"source": [
	"print(a + c)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"In a case like this, python throws a so-called TypeError: although both integers and strings can use the '+' command, it cannot be used on different classes. The reason for this is that the '+' means something very different for integers as compared to strings. Let's have a look at the implementation of the addition operation for strings. The way python interprets a statement like a + b is the following:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 26,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"9"
	]
	},
	"execution_count": 26,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"a.__add__(b)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"In the above statement, there are two rather important concepts to be found:\n",
	"\n",
	"1) 'a' is the instantiation of the class 'integer' and a point after the 'a' indicates that we will now use a function or attribute that is specific to this object.\n",
	"\n",
	"2) The two underscores before and after the 'add' command indicate that this is actually not a command we are supposed to use. It is a way to break into the class architecture and use internal commands. What happens if we try and use the command without these underscores?"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 27,
	"metadata": {},
	"outputs": [
	{
	"ename": "AttributeError",
	"evalue": "'int' object has no attribute 'add'",
	"output_type": "error",
	"traceback": [
	"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
	"\u001b[1;31mAttributeError\u001b[0m Traceback (most recent call last)",
	"\u001b[1;32m<ipython-input-27-73f363cda532>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0ma\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0madd\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mb\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
	"\u001b[1;31mAttributeError\u001b[0m: 'int' object has no attribute 'add'"
	]
	}
	],
	"source": [
	"a.add(b)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"In this case, Python tells us that this attribute doesn't exist - that's the way it's supposed to be, we should not try to evoke this function like this.\n",
	"\n",
	"Hint: In a shared development project, you can indicate that you would prefer other developers to not use a function outside a certain environment with a single underscore.\n",
	"\n",
	"So, by now we know that basically everything in Python is an instance of a class and that these classes have certain attributes that can be used. How can we define classes ourselves and what is the benefit as opposed to regular script-coding?"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 28,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"This is a student.\n"
	]
	}
	],
	"source": [
	"class Student():\n",
	"    print(\"This is a student.\")\n",
	"\n",
	"Hans = Student()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"This is the basic formula for defining classes. In this case, we defined a class 'student' and everytime a student is instantiated, we receive the feedback that the variable we created is a student. Not very useful, but it conveys the idea. What else could we do with this?"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 31,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"Hans\n",
	"neuroscience\n",
	"1\n"
	]
	}
	],
	"source": [
	"class Student():\n",
	" def __init__(self, name, subject):\n",
	" self.name = name\n",
	" self.subject = subject\n",
	" self.term = 1\n",
	" \n",
	"Hans = Student('Hans', 'neuroscience')\n",
	"\n",
	"print(Hans.name)\n",
	"print(Hans.subject)\n",
	"print(Hans.term)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"With the __init__ function we can store certain attributes in our object. For example, we can now store the name of our student and their subject. We can also define that a new instance of a class automatically receives a certain attribute, in this case the information that our student is in his first term. \n",
	"\n",
	"Now imagine that we want to change the term of our student (at some point, a term has to finish). We could do this by manually changing the term:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 33,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"5"
	]
	},
	"execution_count": 33,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"Hans.term = 5\n",
	"Hans.term"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"This works, but it makes no sense that Hans, a new student, is now already in his fifth term. Let's change our class definition to prevent mistakes like this:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 54,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"Hans\n",
	"neuroscience\n",
	"1\n",
	"1\n",
	"2\n"
	]
	}
	],
	"source": [
	"class Student():\n",
	"    def __init__(self, name, subject):\n",
	"        self.name = name\n",
	"        self.subject = subject\n",
	"        self.__term__ = 1\n",
	"\n",
	"    def whichTerm(self):\n",
	"        return self.__term__\n",
	"\n",
	"    def increaseTerm(self):\n",
	"        self.__term__ += 1\n",
	"        return self.__term__\n",
	"\n",
	"Hans = Student('Hans', 'neuroscience')\n",
	"\n",
	"print(Hans.name)\n",
	"print(Hans.subject)\n",
	"print(Hans.whichTerm())\n",
	"\n",
	"# this creates a new attribute 'term'; __term__ is left untouched\n",
	"Hans.term = 5\n",
	"\n",
	"print(Hans.whichTerm())\n",
	"\n",
	"Hans.increaseTerm()\n",
	"\n",
	"print(Hans.whichTerm())"
	]
	},
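	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"A short sketch of why an assignment like `Hans.term = 5` has no effect on `whichTerm()`, and of one possible way to make the term read-only. The names `Plain`, `GuardedStudent`, `_term` and `increase_term` are hypothetical and not part of the class above; the read-only behaviour relies on Python's built-in `property`, used here without a setter. This is only one of several ways to guard an attribute."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"class Plain:\n",
	"    def __init__(self):\n",
	"        self.__term__ = 1\n",
	"\n",
	"p = Plain()\n",
	"p.term = 5  # adds a second, unrelated attribute\n",
	"print(p.__dict__)  # both attributes now exist side by side\n",
	"\n",
	"\n",
	"class GuardedStudent:\n",
	"    def __init__(self, name):\n",
	"        self.name = name\n",
	"        self._term = 1  # single underscore: internal by convention\n",
	"\n",
	"    @property\n",
	"    def term(self):\n",
	"        # read-only view of the internal attribute\n",
	"        return self._term\n",
	"\n",
	"    def increase_term(self):\n",
	"        self._term += 1\n",
	"        return self._term\n",
	"\n",
	"lena = GuardedStudent('Lena')\n",
	"try:\n",
	"    lena.term = 5  # no setter is defined, so this raises AttributeError\n",
	"except AttributeError as err:\n",
	"    print('cannot set term:', err)\n",
	"\n",
	"lena.increase_term()\n",
	"print(lena.term)"
	]
	},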
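	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Three things happened:\n",
	"\n",
	"1) We renamed the term attribute of the class `Student` to `__term__`. The assignment `Hans.term = 5` no longer touches it: it merely creates a new, unrelated attribute `term` on Hans, which is why `whichTerm()` still returns 1. Note that Python does not enforce privacy, so `Hans.__term__ = 5` would still work; the underscores only signal that the attribute is not meant to be changed directly. Names with two leading and two trailing underscores are conventionally reserved for Python's own special methods, so `_term` (or `__term`, which triggers name mangling) is the more common spelling.\n",
	"\n",
	"2) We added a method called __whichTerm__ which we can use to see the term.\n",
	"\n",
	"3) We added a method called __increaseTerm__; invoking it increases the term by one.\n",
	"\n",
	"Our code is now safer, because the term is changed only through methods we control, which makes it harder to set it to weird values by accident.\n",
	"\n",
	"The next concept we have to cover here is __inheritance__. This is truly important because it saves you a lot of work. Let's assume you want to create a new class of students: PhD students. These students are supposed to have all the functionality of a regular student plus some additional attributes. We can do this rather easily like this:"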
	]
	},
	{
	"cell_type": "code",
	"execution_count": 67,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"1\n",
	"2\n",
	"Max works 60 hours per week.\n",
	"His intake of coffee is high.\n",
	"True\n",
	"True\n"
	]
	}
	],
	"source": [
	"class Phdstudent(Student):\n",
	"    workingHours = 60\n",
	"    coffeeConsumption = \"high\"\n",
	"\n",
	"Max = Phdstudent(\"Max\", \"neuroscience\")\n",
	"\n",
	"print(Max.whichTerm())\n",
	"Max.increaseTerm()\n",
	"print(Max.whichTerm())\n",
	"print(\"Max works \" + str(Max.workingHours) + \" hours per week.\")\n",
	"print(\"His intake of coffee is \" + Max.coffeeConsumption + \".\")\n",
	"\n",
	"print(isinstance(Max, Student))\n",
	"print(isinstance(Max, Phdstudent))"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"As you see, we can use both the __whichTerm__ and the __increaseTerm__ method, although we did not explicitly define them for the `Phdstudent` class. Additionally, every PhD student has the class attributes of 60 working hours per week and a high coffee consumption by default. The power of inheritance shines when we want to reuse an existing class and add functionality. Let's say we want to use a particular way to read in and store a dataset. Functions that we define in order to manipulate or analyse the dataset can be added to the class and easily accessed from this point on. Additionally, we can change the behaviour of a method depending on the class we are looking at. Consider the following example:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 13,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"Max studies sociology\n",
	"Kathy studies mathematics and the amount of consumed coffee is high\n",
	"Flora studies history\n"
	]
	}
	],
	"source": [
	"class Student():\n",
	"    def __init__(self, name, subject):\n",
	"        self.name = name\n",
	"        self.subject = subject\n",
	"        self.__term__ = 1\n",
	"\n",
	"    def whichTerm(self):\n",
	"        return self.__term__\n",
	"\n",
	"    def increaseTerm(self):\n",
	"        self.__term__ += 1\n",
	"        return self.__term__\n",
	"\n",
	"    def getInfo(self):\n",
	"        print(self.name + \" studies \" + self.subject)\n",
	"\n",
	"# inherit from the Student class defined above (class names are case-sensitive)\n",
	"class Phdstudent(Student):\n",
	"    workingHours = 60\n",
	"    coffeeConsumption = \"high\"\n",
	"\n",
	"    # overrides Student.getInfo for PhD students\n",
	"    def getInfo(self):\n",
	"        print(self.name + \" studies \" + self.subject + \" and the amount of consumed coffee is \" + self.coffeeConsumption)\n",
	"\n",
	"Max = Student(\"Max\", \"sociology\")\n",
	"Kathy = Phdstudent(\"Kathy\", \"mathematics\")\n",
	"Flora = Student(\"Flora\", \"history\")\n",
	"\n",
	"student_list = [Max, Kathy, Flora]\n",
	"\n",
	"for student in student_list:\n",
	"    student.getInfo()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"As you see, we can call the __getInfo__ method on each student in the list and, depending on whether the student is a regular student or a PhD student, the output will be different. For our specific context, it would be nice to have a class which bundles all functions necessary for regression. This way, we can perform lots of regressions with a minimal number of lines and loops. Let's get started:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 29,
	"metadata": {},
	"outputs": [],
	"source": [
	"class MyLinearRegression:\n",
	"\n",
	"    def __init__(self, fit_intercept=True):\n",
	"        self.coef_ = None\n",
	"        self.intercept_ = None\n",
	"        self._fit_intercept = fit_intercept\n",
	"\n",
	"    def fit(self, X, y):\n",
	"        \"\"\"\n",
	"        Fit model coefficients.\n",
	"\n",
	"        Arguments:\n",
	"        X: 1D or 2D numpy array\n",
	"        y: 1D numpy array\n",
	"        \"\"\"\n",
	"\n",
	"        # check if X is 1D or 2D array\n",
	"        if len(X.shape) == 1:\n",
	"            X = X.reshape(-1,1)\n",
	"\n",
	"        # add bias if fit_intercept is True\n",
	"        if self._fit_intercept:\n",
	"            X = np.c_[np.ones(X.shape[0]), X]\n",
	"\n",
	"        # closed form solution\n",
	"        xTx = np.dot(X.T, X)\n",
	"        inverse_xTx = np.linalg.inv(xTx)\n",
	"        xTy = np.dot(X.T, y)\n",
	"        coef = np.dot(inverse_xTx, xTy)\n",
	"\n",
	"        # set attributes\n",
	"        if self._fit_intercept:\n",
	"            self.intercept_ = coef[0]\n",
	"            self.coef_ = coef[1:]\n",
	"        else:\n",
	"            self.intercept_ = 0\n",
	"            self.coef_ = coef\n",
	"\n",
	"    def predict(self, X):\n",
	"        \"\"\"\n",
	"        Output model prediction.\n",
	"\n",
	"        Arguments:\n",
	"        X: 1D or 2D numpy array\n",
	"        \"\"\"\n",
	"\n",
	"        # check if X is 1D or 2D array\n",
	"        if len(X.shape) == 1:\n",
	"            X = X.reshape(-1,1)\n",
	"        return self.intercept_ + np.dot(X, self.coef_)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"We can now create an instance of this class and call its __fit__ method (and, if needed, its __predict__ method)."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 30,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"20.485981308411226\n",
	"14.439252336448597\n"
	]
	}
	],
	"source": [
	"mlr = MyLinearRegression()\n",
	"X_data = np.array([1,4,6,9,20])\n",
	"y_data = np.array([20,80,110,170, 300])\n",
	"mlr.fit(X_data, y_data)\n",
	"\n",
	"print(mlr.intercept_)\n",
	"print(mlr.coef_[0])"
	]
	},
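	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"As a sanity check, the closed-form solution can be reproduced outside the class. The sketch below rebuilds the design matrix for the same five data points, solves the normal equations with `np.linalg.solve` (which avoids forming the inverse explicitly) and compares the result with NumPy's least-squares solver `np.linalg.lstsq`. The variable names are chosen for this example only."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"import numpy as np\n",
	"\n",
	"x_check = np.array([1, 4, 6, 9, 20])\n",
	"y_check = np.array([20, 80, 110, 170, 300])\n",
	"\n",
	"# design matrix: a column of ones (intercept) next to the feature column\n",
	"A = np.c_[np.ones(x_check.shape[0]), x_check]\n",
	"\n",
	"# normal equations (A^T A) beta = A^T y, solved as a linear system\n",
	"beta_solve = np.linalg.solve(A.T @ A, A.T @ y_check)\n",
	"\n",
	"# independent cross-check with the least-squares solver\n",
	"beta_lstsq, *_ = np.linalg.lstsq(A, y_check, rcond=None)\n",
	"\n",
	"print(beta_solve)  # [intercept, slope]\n",
	"print(np.allclose(beta_solve, beta_lstsq))"
	]
	},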
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Now, everything (the fitted parameters as well as the methods) is organized in a single object. This is already a rather good way to keep your code clean and tidy. However, the strength of classes (let's call it by its real name: __object-oriented programming__ or __OOP__) is just starting to shine. Let's define another class that captures several metrics associated with regression."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 31,
	"metadata": {},
	"outputs": [],
	"source": [
	"class Metrics:\n",
	"\n",
	"    def __init__(self, X, y, model):\n",
	"        self.data = X\n",
	"        self.target = y\n",
	"        self.model = model\n",
	"        # degrees of freedom population dependent variable variance\n",
	"        self._dft = X.shape[0] - 1\n",
	"\n",
	"    def sse(self):\n",
	"        # returns sum of squared errors (model vs actual)\n",
	"        squared_errors = (self.target - self.model.predict(self.data)) ** 2\n",
	"        self.sq_error_ = np.sum(squared_errors)\n",
	"        return self.sq_error_\n",
	"\n",
	"    def sst(self):\n",
	"        '''returns total sum of squared errors (actual vs avg(actual))'''\n",
	"        avg_y = np.mean(self.target)\n",
	"        squared_errors = (self.target - avg_y) ** 2\n",
	"        self.sst_ = np.sum(squared_errors)\n",
	"        return self.sst_\n",
	"\n",
	"    def r_squared(self):\n",
	"        '''returns calculated value of r^2'''\n",
	"        self.r_sq_ = 1 - self.sse()/self.sst()\n",
	"        return self.r_sq_\n",
	"\n",
	"    def mse(self):\n",
	"        '''returns calculated value of mse'''\n",
	"        self.mse_ = np.mean( (self.model.predict(self.data) - self.target) ** 2 )\n",
	"        return self.mse_\n",
	"\n",
	"    def pretty_print_stats(self):\n",
	"        '''returns report of statistics for a given model object'''\n",
	"        items = ( ('sse:', self.sse()), ('sst:', self.sst()),\n",
	"                  ('mse:', self.mse()), ('r^2:', self.r_squared()),\n",
	"                )\n",
	"        for item in items:\n",
	"            print('{0:8} {1:.4f}'.format(item[0], item[1]))"
	]
	},
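	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"The quantities computed by the methods above can also be written out directly. The following sketch uses `np.polyfit` as a stand-in for the fitted model (an assumption made to keep the cell self-contained) and evaluates SSE, SST, MSE and R^2 for the example data used earlier in this notebook. All variable names are chosen for this example only."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"import numpy as np\n",
	"\n",
	"x_demo = np.array([1, 4, 6, 9, 20])\n",
	"y_demo = np.array([20, 80, 110, 170, 300])\n",
	"\n",
	"# straight-line fit; polyfit returns the highest-degree coefficient first\n",
	"slope, intercept = np.polyfit(x_demo, y_demo, 1)\n",
	"y_hat = intercept + slope * x_demo\n",
	"\n",
	"sse_demo = np.sum((y_demo - y_hat) ** 2)           # model vs actual\n",
	"sst_demo = np.sum((y_demo - np.mean(y_demo)) ** 2)  # actual vs mean(actual)\n",
	"mse_demo = sse_demo / y_demo.shape[0]              # mean squared error\n",
	"r2_demo = 1 - sse_demo / sst_demo                  # share of variance explained\n",
	"\n",
	"for label, value in (('sse:', sse_demo), ('sst:', sst_demo),\n",
	"                     ('mse:', mse_demo), ('r^2:', r2_demo)):\n",
	"    print('{0:8} {1:.4f}'.format(label, value))"
	]
	},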
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"The Metrics class requires X, y, and a model object to calculate the key metrics. It’s certainly not a bad solution. However, we can do better. With a little tweaking, we can give MyLinearRegression access to Metrics in a simple yet intuitive way. Let me show you how:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 53,
	"metadata": {},
	"outputs": [],
	"source": [
	"class ModifiedMetrics:\n",
	"\n",
	"    def sse(self):\n",
	"        '''returns sum of squared errors (model vs actual)'''\n",
	"        squared_errors = (self.target - self.predict(self.data)) ** 2\n",
	"        self.sq_error_ = np.sum(squared_errors)\n",
	"        return self.sq_error_\n",
	"\n",
	"    def sst(self):\n",
	"        '''returns total sum of squared errors (actual vs avg(actual))'''\n",
	"        avg_y = np.mean(self.target)\n",
	"        squared_errors = (self.target - avg_y) ** 2\n",
	"        self.sst_ = np.sum(squared_errors)\n",
	"        return self.sst_\n",
	"\n",
	"    def r_squared(self):\n",
	"        '''returns calculated value of r^2'''\n",
	"        self.r_sq_ = 1 - self.sse()/self.sst()\n",
	"        return self.r_sq_\n",
	"\n",
	"    def mse(self):\n",
	"        '''returns calculated value of mse'''\n",
	"        self.mse_ = np.mean( (self.predict(self.data) - self.target) ** 2 )\n",
	"        return self.mse_\n",
	"\n",
	"    def pretty_print_stats(self):\n",
	"        '''returns report of statistics for a given model object'''\n",
	"        items = ( ('sse:', self.sse()), ('sst:', self.sst()),\n",
	"                  ('mse:', self.mse()), ('r^2:', self.r_squared()),\n",
	"                )\n",
	"\n",
	"        for item in items:\n",
	"            print('{0:8} {1:.4f}'.format(item[0], item[1]))"
	]
	},
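	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"ModifiedMetrics no longer has an `__init__` method. Instead, its methods expect `self.data`, `self.target` and `self.predict` to be provided by whichever class inherits from it. Using the powerful tool of __inheritance__, we can now modify __MyLinearRegression__ and allow it to access the functionality of ModifiedMetrics."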
	]
	},
	{
	"cell_type": "code",
	"execution_count": 54,
	"metadata": {},
	"outputs": [],
	"source": [
	"class MyLinearRegressionWithInheritance(ModifiedMetrics):\n",
	"\n",
	"\n",
	"    def __init__(self, fit_intercept=True):\n",
	"        self.coef_ = None\n",
	"        self.intercept_ = None\n",
	"        self._fit_intercept = fit_intercept\n",
	"\n",
	"\n",
	"    def fit(self, X, y):\n",
	"        \"\"\"\n",
	"        Fit model coefficients.\n",
	"\n",
	"        Arguments:\n",
	"        X: 1D or 2D numpy array\n",
	"        y: 1D numpy array\n",
	"        \"\"\"\n",
	"\n",
	"        # training data & ground truth data\n",
	"        self.data = X\n",
	"        self.target = y\n",
	"\n",
	"        # degrees of freedom population dep. variable variance\n",
	"        self._dft = X.shape[0] - 1\n",
	"\n",
	"        # check if X is 1D or 2D array\n",
	"        if len(X.shape) == 1:\n",
	"            X = X.reshape(-1,1)\n",
	"\n",
	"        # add bias if fit_intercept\n",
	"        if self._fit_intercept:\n",
	"            X = np.c_[np.ones(X.shape[0]), X]\n",
	"\n",
	"        # closed form solution\n",
	"        xTx = np.dot(X.T, X)\n",
	"        inverse_xTx = np.linalg.inv(xTx)\n",
	"        xTy = np.dot(X.T, y)\n",
	"        coef = np.dot(inverse_xTx, xTy)\n",
	"\n",
	"        # set attributes\n",
	"        if self._fit_intercept:\n",
	"            self.intercept_ = coef[0]\n",
	"            self.coef_ = coef[1:]\n",
	"        else:\n",
	"            self.intercept_ = 0\n",
	"            self.coef_ = coef\n",
	"\n",
	"    def predict(self, X):\n",
	"        \"\"\"Output model prediction.\n",
	"\n",
	"        Arguments:\n",
	"        X: 1D or 2D numpy array\n",
	"        \"\"\"\n",
	"        # check if X is 1D or 2D array\n",
	"        if len(X.shape) == 1:\n",
	"            X = X.reshape(-1,1)\n",
	"        return self.intercept_ + np.dot(X, self.coef_)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 68,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"The computed intercept is: 20.485981308411226\n",
	"The computed coefficient is: 14.439252336448597\n",
	"sse: 702.7103\n",
	"sst: 45320.0000\n",
	"mse: 140.5421\n",
	"r^2: 0.9845\n"
	]
	}
	],
	"source": [
	"mlr = MyLinearRegressionWithInheritance()\n",
	"mlr.fit(X_data, y_data)\n",
	"print(\"The computed intercept is: \" + str(mlr.intercept_))\n",
	"print(\"The computed coefficient is: \" + str(mlr.coef_[0]))\n",
	"mlr.pretty_print_stats()"
	]
	},
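	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"A class without its own `__init__` whose methods rely on attributes supplied by the inheriting class is often called a mixin. The following stripped-down sketch isolates that pattern; `ErrorMetricsMixin` and `MeanModel` are hypothetical names used only here. It also shows a consequence of the design: the metric methods only work after `fit` has stored the data on the object."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"import numpy as np\n",
	"\n",
	"class ErrorMetricsMixin:\n",
	"    '''Expects self.data, self.target and self.predict() from the subclass.'''\n",
	"\n",
	"    def mse(self):\n",
	"        return np.mean((self.predict(self.data) - self.target) ** 2)\n",
	"\n",
	"class MeanModel(ErrorMetricsMixin):\n",
	"    '''Toy model that always predicts the mean of the training targets.'''\n",
	"\n",
	"    def fit(self, X, y):\n",
	"        self.data = X\n",
	"        self.target = y\n",
	"        self.mean_ = np.mean(y)\n",
	"\n",
	"    def predict(self, X):\n",
	"        return np.full(len(X), self.mean_)\n",
	"\n",
	"model = MeanModel()\n",
	"try:\n",
	"    model.mse()  # fit() has not stored data and target yet\n",
	"except AttributeError as err:\n",
	"    print('metrics need a fitted model:', err)\n",
	"\n",
	"model.fit(np.array([1, 2, 3]), np.array([2.0, 4.0, 6.0]))\n",
	"print(model.mse())"
	]
	},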
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": []
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.6.5"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 2
	}