5.9. learning theory#

references: (1) Machine Learning - Tom Mitchell, (2) An Introduction to Computational Learning Theory - Kearns & Vazirani

5.9.1. evolution#

  • performance is the correlation between hypothesis \(h\) and target concept \(c\) under distribution \(D\): \(Perf_D(h,c) = \sum_{x \in X} h(x) \cdot c(x) \cdot P(x)\)

    • want \(P(Perf_D(h,c) < Perf_D(c,c)-\epsilon) < \delta\)
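a minimal sketch of this performance measure on a toy finite domain; the domain, distribution, and the particular \(h\) and \(c\) below are illustrative assumptions, with \(h, c: X \to \{-1, +1\}\) (so \(Perf_D(c,c) = 1\)):

```python
# toy finite domain: 3-bit instances under the uniform distribution
X = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
P = [1 / len(X)] * len(X)

def perf(h, c, X, P):
    # Perf_D(h, c) = sum over x of h(x) * c(x) * P(x), with h, c: X -> {-1, +1}
    return sum(h(x) * c(x) * p for x, p in zip(X, P))

target = lambda x: 1 if x[0] == 1 else -1             # concept: first bit set
hyp = lambda x: 1 if x[0] == 1 and x[1] == 1 else -1  # a stricter hypothesis

print(perf(hyp, target, X, P))     # 0.5 = (6 agreements - 2 disagreements) / 8
print(perf(target, target, X, P))  # 1.0: Perf_D(c, c) is the maximum
```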

5.9.2. sample problems#

  • ex: N marbles in a bag (the coupon collector problem). How many draws with replacement are needed before we draw all N marbles?

    • write \(P_i = \frac{N-(i-1)}{N}\), the probability that a draw yields a new marble once \(i-1\) distinct marbles have been seen

      • the number of draws to go from \(i-1\) to \(i\) distinct marbles is geometrically distributed with success probability \(P_i\)

      • mean number of draws is the sum of the means of the geometrics: \(E[\text{draws}] = \sum_{i=1}^N \frac{1}{P_i} = N \sum_{k=1}^N \frac{1}{k} \approx N \ln N\)

    • to get probabilities of seeing all the marbles instead of just \(E[\#\text{draws}]\), use Markov's inequality: \(P(X \geq a) \leq E[X]/a\)

  • box full of 1e6 marbles

    • if we have 10 evenly distributed classes of marbles, what is the probability we identify all 10 classes after 100 draws? (by inclusion-exclusion; see the sketch below)
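a minimal sketch of both computations, assuming uniform draws with replacement (function names are illustrative):

```python
import math

# coupon collector: expected number of draws (with replacement) to see all N marbles
def expected_draws(N):
    # E[draws] = sum_{i=1..N} 1/P_i = sum_{i=1..N} N / (N - i + 1) = N * H_N
    return sum(N / (N - i + 1) for i in range(1, N + 1))

# Markov bound: P(draws >= a) <= E[draws] / a
def markov_bound(N, a):
    return expected_draws(N) / a

# probability of seeing all k equally likely classes in m draws (inclusion-exclusion)
def p_all_classes(k, m):
    return sum((-1) ** j * math.comb(k, j) * (1 - j / k) ** m for j in range(k + 1))

print(expected_draws(10))      # ~29.29 draws on average to see all 10 marbles
print(p_all_classes(10, 100))  # ~0.9997: 10 classes, 100 draws
```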

5.9.3. computational learning theory#

  • frameworks

    1. PAC

    2. mistake bound - count misclassifications before convergence (analyses often split into b processes which each fail with probability at most \(\delta / b\), a union bound)

  • questions

    1. sample complexity - how many training examples needed to converge

    2. computational complexity - how much computational effort needed to converge

    3. mistake bound - how many training examples will learner misclassify before converging

  • must define convergence probabilistically (hence "probably" approximately correct)

5.9.3.1. PAC - probably learning an approximately correct hypothesis - Mitchell#

  • want to learn C

    • data \(X\) is sampled according to distribution \(D\)

    • learner L considers set H of possible hypotheses

  • true error \(err_D(h)\) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D.

    • \(err_D(h) = \underset{x \sim D}{Pr}[c(x) \neq h(x)]\)

  • requiring \(err_D(h)=0\) is infeasible (can't be guaranteed without seeing every instance), so PAC only asks for low error with high probability

  • PAC learnable - consider concept class C defined over set of instances X of length n and a learner L using hypothesis space H

    • C is PAC-learnable by L using H if for all \(c \in C\), distributions D over X, \(\epsilon\) s.t. \(0 < \epsilon < 1/2\), and \(\delta\) s.t. \(0<\delta<1/2\), learner L will with probability at least \((1-\delta)\) output a hypothesis \(h \in H\) s.t. \(err_D(h) \leq \epsilon\)

  • efficiently PAC learnable - runs in time polynomial in \(1/\epsilon, 1/\delta, n, size(c)\)

    • probably - probability of failure bounded by some constant \(\delta\)

    • approximately correct - err bounded by some constant \(\epsilon\)

    • assumes H contains a hypothesis with arbitrarily small error for every target concept in C

5.9.3.2. sample complexity for finite hypothesis space - Mitchell#

  • sample complexity - growth in the number of training examples required with problem size

  • consistent learner - outputs hypotheses that perfectly fit training data whenever possible

    • outputs a hypothesis belonging to the version space

  • consider hypothesis space H, target concept c, instance distribution \(\mathcal{D}\), and training examples D of c. The version space \(VS_{H,D}\) is \(\epsilon\)-exhausted with respect to c and \(\mathcal{D}\) if every hypothesis h in \(VS_{H,D}\) has error less than \(\epsilon\) with respect to c and \(\mathcal{D}\): \((\forall h \in VS_{H,D})\ err_\mathcal{D} (h) < \epsilon\)
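the payoff of \(\epsilon\)-exhaustion (Mitchell ch. 7): after \(m\) i.i.d. examples, the probability that the version space is not \(\epsilon\)-exhausted is at most \(|H| e^{-\epsilon m}\), so \(m \geq \frac{1}{\epsilon}(\ln|H| + \ln(1/\delta))\) suffices for any consistent learner. a minimal sketch (the \(|H| = 3^n\) count for conjunctions of boolean literals follows Mitchell):

```python
import math

# m >= (1/eps) * (ln|H| + ln(1/delta)) training examples suffice for any
# consistent learner over a finite hypothesis space H
def sample_complexity(H_size, eps, delta):
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

# conjunctions of up to n boolean literals: each variable appears positive,
# negated, or not at all, so |H| = 3^n
n = 10
print(sample_complexity(3 ** n, eps=0.1, delta=0.05))  # 140 examples
```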

5.9.3.3. rectangle learning game - Kearns#

  • data \(X\) is sampled according to distribution \(D\)

  • simple soln: the tightest-fit rectangle containing all positive examples

  • define four strips \(T_1, \dots, T_4\) along the inside edges of the target rectangle, each with probability weight \(\epsilon/4\), so the prob a single draw misses a given strip is \(1-\epsilon /4\)

    • then, all m draws miss a given strip with probability \((1-\epsilon /4)^m\)

      • union bound over the 4 strips: choose m to satisfy \(4(1-\epsilon/4)^m \leq \delta\); since \(1-x \leq e^{-x}\), \(m \geq \frac{4}{\epsilon} \ln \frac{4}{\delta}\) suffices (see the sketch below)
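a minimal simulation of the tightest-fit learner; the specific target rectangle and the uniform sampling distribution are illustrative assumptions:

```python
import math
import random

def c(p):  # target concept: is p inside R = [0.2, 0.8] x [0.3, 0.7]?
    return 0.2 <= p[0] <= 0.8 and 0.3 <= p[1] <= 0.7

eps, delta = 0.1, 0.05
m = math.ceil((4 / eps) * math.log(4 / delta))  # 176 draws per the bound above

draws = [(random.random(), random.random()) for _ in range(m)]
pos = [p for p in draws if c(p)]

if pos:  # tightest fit: smallest axis-aligned rectangle containing all positives
    x0, x1 = min(p[0] for p in pos), max(p[0] for p in pos)
    y0, y1 = min(p[1] for p in pos), max(p[1] for p in pos)
    h = lambda p: x0 <= p[0] <= x1 and y0 <= p[1] <= y1
else:
    h = lambda p: False  # no positives seen: predict negative everywhere

# Monte Carlo estimate of err_D(h); h only errs inside R minus the fitted rectangle
test = [(random.random(), random.random()) for _ in range(100_000)]
err = sum(c(p) != h(p) for p in test) / len(test)
print(f"m = {m}, estimated err = {err:.4f}")  # w.p. >= 1 - delta, err <= eps
```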

5.9.3.4. VC dimension#

  • VC dimension measures the capacity of a space of functions that can be learned by a statistical classification algorithm

    • let H be set of sets and C be a set

    • \(H \cap C := \{ h \cap C \mid h \in H \}\)

    • a set C is shattered by H if \(H \cap C\) contains all subsets of C

    • The VC dimension of \(H\) is the largest integer \(D\) such that there exists a set \(C\) with cardinality \(D\) that is shattered by \(H\)

  • VC (Vapnik-Chervonenkis) dimension intuition: if data is mapped into a sufficiently high dimension, samples become linearly separable (linear separators can shatter N points in general position in N-1 dimensions)

  • VC dimension 0 -> H cannot shatter even a single point: every hypothesis in H labels each instance the same way (e.g., H contains only an always-false or only an always-true hypothesis)

  • Sauer’s lemma - let \(d \geq 0, m \geq 1\), \(H\) a hypothesis space with VC-dim(H) = d. Then the growth function satisfies \(\Pi_H(m) \leq \phi (d,m) = \sum_{i=0}^{d} \binom{m}{i}\)

  • fundamental theorem of learning theory provides a bound on m that guarantees PAC learning: \(m \geq \frac{4}{\epsilon} \cdot (d \cdot \ln(\frac{12}{\epsilon}) + \ln(\frac{2}{\delta}))\)
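a brute-force check makes shattering concrete; the 1-D threshold and interval classes below are illustrative assumptions, and checking a finite pool of points only certifies a lower bound on the VC dimension:

```python
from itertools import combinations

def shatters(hypotheses, points):
    # H shatters C if the hypotheses realize all 2^|C| labelings of C
    labelings = {tuple(h(x) for x in points) for h in hypotheses}
    return len(labelings) == 2 ** len(points)

def vc_dim_lower_bound(hypotheses, pool):
    # largest d such that some d-subset of the pool is shattered
    d = 0
    for k in range(1, len(pool) + 1):
        if any(shatters(hypotheses, C) for C in combinations(pool, k)):
            d = k
    return d

pool = [0.5, 1.5, 2.5, 3.5]
thresholds = [lambda x, t=t: x >= t for t in range(5)]  # 1-D thresholds
intervals = [lambda x, a=a, b=b: a <= x <= b
             for a in range(5) for b in range(a, 5)]    # 1-D intervals

print(vc_dim_lower_bound(thresholds, pool))  # 1: thresholds shatter 1 point, not 2
print(vc_dim_lower_bound(intervals, pool))   # 2: intervals shatter 2 points, not 3
```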

5.9.4. concept learning and the general-to-specific ordering#

  • definitions

    • concept learning - acquiring the definition of a general category given a sample of positive and negative training examples of the category

      • a concept is a boolean function that returns true exactly on instances belonging to the category

      • can represent a hypothesis as a vector of attribute constraints: an acceptable value, ? (any value acceptable), or ∅ (no value acceptable); if any constraint is ∅, the hypothesis classifies every instance as negative

    • general hypothesis - \(h_j\) is more general than \(h_k\) if every instance satisfying \(h_k\) also satisfies \(h_j\)

      • more-general-than defines a partial ordering over H

    • a hypothesis is consistent with the training examples if it correctly classifies them

    • an example x satisfies a hypothesis h if h(x) = 1

  • find-S - finding a maximally specific hypothesis (see the sketch at the end of this section)

    • start with the most specific hypothesis possible (all ∅)

    • generalize each time it fails to cover an observed positive training example

    • flaws

      • ignores negative examples

    • find-S converges to the correct answer when the training data is perfect:

      1. no errors in the training data

      2. there exists a hypothesis in H that describes target concept c

  • version space - set of all hypotheses consistent with the training examples

    • list-then-eliminate - list all hypotheses and eliminate any that are inconsistent (slow)

    • candidate-elimination - represent the most general (G) and most specific (S) members of the version space (see the sketch at the end of this section)

      • version space representation theorem - version space can be found from most general / specific version space members

      • for positive examples

        • make S more general

        • fix G

      • for negative examples

        • fix S

        • make G more specific

      • in general, optimal query strategy is to generate instances that satisfy exactly half the hypotheses in the current version space

    • classifying new instances with the version space

      • classify as positive if the instance satisfies every hypothesis in S

      • classify as negative if the instance satisfies no hypothesis in G

  • inductive bias of candidate-elimination - target concept c is contained in H
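a minimal sketch of find-S for conjunctive hypotheses (the weather-style attributes and examples are illustrative; '?' is the any-value constraint and None stands in for the all-∅ null hypothesis):

```python
def find_s(examples):
    h = None  # most specific hypothesis: all-null, matches nothing
    for x, label in examples:
        if not label:
            continue  # find-S ignores negative examples
        if h is None:
            h = list(x)  # first positive: generalize null to the example itself
        else:
            # minimally generalize: keep matching attributes, wildcard the rest
            h = [hc if hc == xv else '?' for hc, xv in zip(h, x)]
    return h

examples = [
    (('sunny', 'warm', 'normal', 'strong'), True),
    (('sunny', 'warm', 'high',   'strong'), True),
    (('rainy', 'cold', 'high',   'strong'), False),
]
print(find_s(examples))  # ['sunny', 'warm', '?', 'strong']
```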
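and a matching candidate-elimination sketch over the same hypothesis space; for pure conjunctions S stays a singleton, and a fuller implementation would also prune non-maximal members of G:

```python
def matches(h, x):
    # h is a tuple of constraints (a value or '?'); None is the null hypothesis
    return h is not None and all(c == '?' or c == xv for c, xv in zip(h, x))

def candidate_elimination(examples, n):
    s = None                # most specific boundary (a singleton for conjunctions)
    G = [tuple(['?'] * n)]  # most general boundary
    for x, label in examples:
        if label:
            # positive example: drop inconsistent members of G, generalize s
            G = [g for g in G if matches(g, x)]
            if s is None:
                s = tuple(x)
            else:
                s = tuple(c if c == xv else '?' for c, xv in zip(s, x))
        else:
            # negative example: s must already reject x; minimally specialize
            # each g in G that wrongly accepts x
            new_G = []
            for g in G:
                if not matches(g, x):
                    new_G.append(g)
                    continue
                for i in range(n):
                    # specialize position i toward s, if that excludes x
                    if g[i] == '?' and s is not None and s[i] not in ('?', x[i]):
                        new_G.append(g[:i] + (s[i],) + g[i + 1:])
            G = new_G
    return s, G

examples = [
    (('sunny', 'warm', 'normal', 'strong'), True),
    (('sunny', 'warm', 'high',   'strong'), True),
    (('rainy', 'cold', 'high',   'strong'), False),
]
print(candidate_elimination(examples, 4))
# -> (('sunny', 'warm', '?', 'strong'),
#     [('sunny', '?', '?', '?'), ('?', 'warm', '?', '?')])
```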