5.9. learning theory#
references: (1) Machine Learning - Tom Mitchell, (2) An Introduction to Computational Learning Theory - Kearns & Vazirani
5.9.1. evolution#
performance is the correlation between h and c under D: \(Perf_D (h,c) = \sum_x h(x) \cdot c(x) \cdot P(x)\)
want \(P(Perf_D(h,c) < Perf_D(c,c)-\epsilon) < \delta\)
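a minimal sketch of the performance correlation, assuming \(\pm 1\)-valued h and c and an explicit finite distribution P; the toy instance space, target, and hypothesis below are illustrative assumptions:

```python
from itertools import product

def perf(h, c, P):
    """Perf_D(h, c) = sum_x h(x) * c(x) * P(x): +1 means h agrees with c on
    every x with nonzero probability, -1 means it disagrees everywhere."""
    return sum(h(x) * c(x) * p for x, p in P.items())

# toy instance space: 3-bit tuples under the uniform distribution
xs = list(product([0, 1], repeat=3))
P = {x: 1 / len(xs) for x in xs}

c = lambda x: 1 if (x[0] and x[1]) else -1   # target: x0 AND x1
h = lambda x: 1 if x[0] else -1              # hypothesis: just x0

print(perf(c, c, P))  # 1.0
print(perf(h, c, P))  # 0.5 (agrees on 6 of 8 points, disagrees on 2)
```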
5.9.2. sample problems#
ex: N marbles in a bag. How many draws with replacement needed before we draw all N marbles?
write \(P_i = \frac{N-(i-1)}{N}\), the probability of drawing a new marble when \(i-1\) distinct marbles have already been seen
the number of draws needed to go from \(i-1\) to \(i\) distinct marbles is geometrically distributed with success probability \(P_i\)
the mean total number of draws is the sum of the geometric means: \(\sum_{i=1}^{N} 1/P_i = N \cdot H_N \approx N \ln N\)
to bound the probability of having seen all the marbles (rather than just \(E[\#\text{draws}]\)), use Markov's inequality (see the sketch below)
ex: a box of 1e6 marbles split evenly into 10 classes; what is the probability we see all 10 classes after 100 draws?
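a quick sketch covering both examples: the expected number of draws (sum of the geometric means, \(N \cdot H_N\)) and a Monte Carlo estimate for the 10-class question (with 100 draws out of 1e6 marbles, each draw is effectively a uniform choice among the 10 classes); function names are illustrative:

```python
import random

def expected_draws(N):
    """E[# draws to see all N marbles] = sum_{i=1}^{N} 1/P_i = N * H_N."""
    return sum(N / (N - (i - 1)) for i in range(1, N + 1))

def prob_all_classes(n_classes=10, n_draws=100, trials=50_000, seed=0):
    """Monte Carlo estimate of P(all classes seen within n_draws uniform draws)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        seen = {rng.randrange(n_classes) for _ in range(n_draws)}
        if len(seen) == n_classes:
            hits += 1
    return hits / trials

print(expected_draws(10))   # ~29.3 draws on average to see all 10 classes
print(prob_all_classes())   # ~0.9997; a union bound gives >= 1 - 10 * 0.9**100
```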
5.9.3. computational learning theory#
frameworks
PAC
mistake-bound - count how many mistakes the learner makes before converging; a common analysis trick is to split the algorithm into b processes which each fail with probability at most \(\delta / b\) (union bound)
questions
sample complexity - how many training examples needed to converge
computational complexity - how much computational effort needed to converge
mistake bound - how many training examples will learner misclassify before converging
must define convergence based on some probability
5.9.3.1. PAC - probably learning an approximately correct hypothesis - Mitchell#
want to learn C
data X is sampled according to distribution D
learner L considers set H of possible hypotheses
true error \(err_D (h)\) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D.
\(err_D(h) = \underset{x \sim D}{Pr}[c(x) \neq h(x)]\)
getting \(err_D(h)=0\) is infeasible, since the learner only sees a finite sample drawn from D; err can be estimated empirically (see the sketch at the end of this subsection)
PAC learnable - consider a concept class C defined over a set of instances X, each of length n, and a learner L using hypothesis space H
C is PAC-learnable by L using H if for all \(c \in C\), all distributions D over X, all \(\epsilon\) with \(0 < \epsilon < 1/2\), and all \(\delta\) with \(0<\delta<1/2\), learner L will with probability at least \((1-\delta)\) output a hypothesis \(h \in H\) s.t. \(err_D(h) \leq \epsilon\)
efficiently PAC learnable - runs in time polynomial in \(1/\epsilon, 1/\delta, n, size(c)\)
probably - probability of failure bounded by some constant \(\delta\)
approximately correct - err bounded by some constant \(\epsilon\)
assumes H contains a hypothesis with arbitrarily small error for every target concept in C
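\(err_D(h)\) can be estimated by drawing instances from D and counting disagreements with c; a minimal sketch, where the threshold concept, hypothesis, and sampler are illustrative assumptions:

```python
import random

def estimate_err(h, c, sample_from_D, m=10_000, seed=0):
    """Estimate err_D(h) = Pr_{x~D}[c(x) != h(x)] from m i.i.d. samples."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(m):
        x = sample_from_D(rng)
        errors += (c(x) != h(x))
    return errors / m

# toy setup: D is uniform over [0, 1), target is a threshold concept
c = lambda x: x >= 0.30            # true concept
h = lambda x: x >= 0.35            # learned hypothesis
sample = lambda rng: rng.random()  # x ~ D

print(estimate_err(h, c, sample))  # ~0.05: h and c disagree on [0.30, 0.35)
```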
5.9.3.2. sample complexity for finite hypothesis space - Mitchell#
sample complexity - how the number of training examples required for convergence grows with the problem parameters
consistent learner - outputs hypotheses that perfectly fit training data whenever possible
outputs a hypothesis belonging to the version space
consider hypothesis space H, target concept c, instance distribution \(\mathcal{D}\), training examples D of c. The version space \(VS_{H,D}\) is \(\epsilon\)-exhausted with respect to c and \(\mathcal{D}\) if every hypothesis h in \(VS_{H,D}\) has error less than \(\epsilon\) with respect to c and \(\mathcal{D}\): \((\forall h \in VS_{H,D}) err_\mathcal{D} (h) < \epsilon\)
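for a consistent learner over a finite H, the version space fails to be \(\epsilon\)-exhausted after m i.i.d. examples with probability at most \(|H|e^{-\epsilon m}\), so \(m \geq \frac{1}{\epsilon}(\ln|H| + \ln\frac{1}{\delta})\) examples suffice (from Mitchell's treatment of \(\epsilon\)-exhausting the version space). A small sketch; the choice \(|H| = 3^n\) (conjunctions of up to n boolean literals) is just an illustrative example:

```python
from math import ceil, log

def sample_complexity(H_size, eps, delta):
    """m >= (1/eps) * (ln|H| + ln(1/delta)) guarantees that, with probability
    >= 1 - delta, every hypothesis consistent with m examples has error <= eps."""
    return ceil((log(H_size) + log(1 / delta)) / eps)

# e.g. conjunctions of up to n boolean literals: |H| = 3**n
n = 10
print(sample_complexity(3**n, eps=0.1, delta=0.05))  # ~140 examples
```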
5.9.3.3. rectangle learning game - Kearns#
data X is sampled according to distribution D
simple soln: tightest-fit rectangle
define four strips \(T_1, \dots, T_4\) along the inside edges of the target rectangle, each with probability weight \(\epsilon/4\) under D; if the sample hits every strip, the tightest-fit rectangle's error region is contained in their union, so its error is at most \(\epsilon\)
a single draw misses a given strip with probability \(1-\epsilon/4\), so all m draws miss it with probability \((1-\epsilon/4)^m\)
union bound over the 4 strips: choose m to satisfy \(4(1-\epsilon/4)^m \leq \delta\)
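using \(1-x \leq e^{-x}\), the constraint \(4(1-\epsilon/4)^m \leq \delta\) is satisfied once \(m \geq \frac{4}{\epsilon}\ln\frac{4}{\delta}\); below is a small simulation of the tightest-fit learner under an assumed uniform D on the unit square and an assumed target rectangle:

```python
import random
from math import ceil, log

def m_bound(eps, delta):
    """Draws sufficient for the tightest-fit rectangle: m >= (4/eps) * ln(4/delta)."""
    return ceil(4 / eps * log(4 / delta))

def tightest_fit_error(eps=0.1, delta=0.05, seed=0):
    """Learn the tightest-fit rectangle from m labeled draws and estimate its
    true error under a uniform D on the unit square."""
    rng = random.Random(seed)
    target = (0.2, 0.7, 0.3, 0.9)                      # (x_lo, x_hi, y_lo, y_hi), assumed
    inside = lambda p, r: r[0] <= p[0] <= r[1] and r[2] <= p[1] <= r[3]

    m = m_bound(eps, delta)
    sample = [(rng.random(), rng.random()) for _ in range(m)]
    pos = [p for p in sample if inside(p, target)]
    if not pos:                                        # degenerate case: predict all-negative
        learned = (1, 0, 1, 0)
    else:
        learned = (min(x for x, _ in pos), max(x for x, _ in pos),
                   min(y for _, y in pos), max(y for _, y in pos))

    test = [(rng.random(), rng.random()) for _ in range(100_000)]
    return sum(inside(p, target) != inside(p, learned) for p in test) / len(test)

print(m_bound(0.1, 0.05))     # 176 draws suffice for eps=0.1, delta=0.05
print(tightest_fit_error())   # typically well below eps = 0.1
```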
5.9.3.4. VC dimension#
VC dimension measures the capacity of a space of functions that can be learned by a statistical classification algorithm
let H be set of sets and C be a set
\(H \cap C := \{ h \cap C \mid h \in H \}\)
a set C is shattered by H if \(H \cap C\) contains all subsets of C
The VC dimension of \(H\) is the largest integer \(D\) such that there exists a set \(C\) with cardinality \(D\) that is shattered by \(H\)
VC (Vapnik-Chervonenkis) dimension - related: if data is mapped into a sufficiently high dimension, samples become linearly separable (N points in general position can be shattered by hyperplanes in N-1 dimensions)
VC dimension 0 -> no single point can be shattered, i.e. every hypothesis in H assigns the same label to each instance (H is effectively a single function)
Sauer's lemma - let \(d \geq 0\), \(m \geq 1\), and let \(H\) be a hypothesis space with VC-dim(H) = d. Then the growth function \(\Pi_H(m)\) (the maximum number of distinct labelings H induces on any m points) satisfies \(\Pi_H(m) \leq \Phi(d,m) = \sum_{i=0}^{d} \binom{m}{i}\)
the fundamental theorem of learning theory gives a bound on m that guarantees PAC learning: \(m \geq \frac{4}{\epsilon} \left(d \cdot \ln\frac{12}{\epsilon} + \ln\frac{2}{\delta}\right)\)
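a brute-force shattering check over a small instance pool (feasible only for tiny hypothesis spaces) together with Sauer's \(\Phi(d,m) = \sum_{i=0}^{d}\binom{m}{i}\); the threshold hypotheses below are an illustrative choice with VC dimension 1:

```python
from itertools import combinations
from math import comb

def shatters(H, C):
    """True if hypothesis set H (functions x -> bool) realizes all 2^|C| labelings of C."""
    labelings = {tuple(h(x) for x in C) for h in H}
    return len(labelings) == 2 ** len(C)

def vc_dimension(H, pool):
    """Largest d such that some size-d subset of `pool` is shattered by H (brute force)."""
    d = 0
    for k in range(1, len(pool) + 1):
        if any(shatters(H, C) for C in combinations(pool, k)):
            d = k
    return d

def phi(d, m):
    """Sauer's lemma bound on the growth function: Phi(d, m) = sum_{i<=d} C(m, i)."""
    return sum(comb(m, i) for i in range(d + 1))

# thresholds on the real line, h_t(x) = (x >= t): their VC dimension is 1
H = [lambda x, t=t: x >= t for t in [0.1 * i for i in range(11)]]
print(vc_dimension(H, pool=[0.05, 0.25, 0.55, 0.85]))  # 1
print(phi(1, 10))                                      # at most 11 labelings on 10 points
```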
5.9.4. concept learning and the general-to-specific ordering#
definitions
concept learning - acquiring the definition of a general category given a sample of positive and negative training examples of the category
concept - a boolean-valued function over instances that is true exactly for the members of the category
can represent a hypothesis as a vector of attribute constraints: a specific value, ? (any value acceptable), or ∅ (no value acceptable); if any constraint is ∅, the hypothesis classifies every instance as negative
\(h_j\) is more general than or equal to \(h_k\) if every instance that satisfies \(h_k\) also satisfies \(h_j\)
more-general-than defines a partial ordering over the hypothesis space
a hypothesis is consistent with the training examples if it correctly classifies them
an example x satisfies a hypothesis h if h(x) = 1
find-S - finds a maximally specific hypothesis (sketch after this list)
start with most specific possible
generalize each time it fails to cover an observed positive training example
flaws
ignores negative examples
converges to a correct hypothesis only if the training data is "perfect":
no errors in the training examples
there exists a hypothesis in H that describes the target concept c
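a sketch of find-S for the attribute-vector representation above (specific value / ? / null); the training data is made up for illustration:

```python
ANY = "?"      # any value acceptable
NULL = None    # no value acceptable (the most specific constraint)

def find_s(examples, n_attrs):
    """find-S: start with the most specific hypothesis and minimally
    generalize it on each positive example; negatives are ignored."""
    h = [NULL] * n_attrs
    for x, label in examples:
        if not label:
            continue                  # find-S ignores negative examples
        for i, value in enumerate(x):
            if h[i] is NULL:
                h[i] = value          # first positive example: copy its values
            elif h[i] != value:
                h[i] = ANY            # conflicting values -> generalize to '?'
    return h

# Mitchell-style toy data (attribute values are made up for illustration)
examples = [
    (("sunny", "warm", "normal", "strong"), True),
    (("sunny", "warm", "high",   "strong"), True),
    (("rainy", "cold", "high",   "strong"), False),
]
print(find_s(examples, n_attrs=4))  # ['sunny', 'warm', '?', 'strong']
```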
version space - set of all hypotheses consistent with the training examples
list-then-eliminate - list all hypotheses and eliminate any that are inconsistent (slow)
candidate-elimination - represent most general (G) and specific (S) members of version space
version space representation theorem - every member of the version space lies between some member of S and some member of G in the general-to-specific ordering, so the version space is fully determined by its boundary sets G and S
for positive examples
make S more general (minimally generalize members of S that don't cover the example)
remove from G any hypothesis inconsistent with the example
for negative examples
remove from S any hypothesis inconsistent with the example
make G more specific (minimally specialize members of G that cover the example)
in general, optimal query strategy is to generate instances that satisfy exactly half the hypotheses in the current version space
classifying new instances with a partially learned concept (sketch at the end of this section)
classify as positive if it satisfies every member of S
classify as negative if it satisfies no member of G
inductive bias of candidate-elimination - target concept c is contained in H
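a sketch of list-then-eliminate plus the unanimous-vote classification rule above, over a tiny space of conjunctive hypotheses on binary attributes (the null constraint is dropped for simplicity; the data is illustrative):

```python
from itertools import product

ANY = "?"

def matches(h, x):
    """Conjunctive hypothesis h satisfies x if every non-'?' constraint equals x's value."""
    return all(c == ANY or c == v for c, v in zip(h, x))

def list_then_eliminate(examples, n_attrs):
    """Enumerate all conjunctive hypotheses over binary attributes and keep
    those consistent with every training example (the version space)."""
    all_h = product([0, 1, ANY], repeat=n_attrs)
    return [h for h in all_h
            if all(matches(h, x) == label for x, label in examples)]

def classify(version_space, x):
    """Unanimous vote: positive if every member satisfies x, negative if none does,
    otherwise the partially learned concept cannot decide."""
    votes = [matches(h, x) for h in version_space]
    if all(votes):
        return True
    if not any(votes):
        return False
    return None

examples = [((1, 0, 1), True), ((1, 1, 1), True), ((0, 0, 0), False)]
vs = list_then_eliminate(examples, n_attrs=3)
print(vs)                       # [(1, '?', 1), (1, '?', '?'), ('?', '?', 1)]
print(classify(vs, (1, 0, 1)))  # True  (satisfies every member)
print(classify(vs, (0, 1, 0)))  # False (satisfies no member)
print(classify(vs, (1, 1, 0)))  # None  (members disagree; cannot classify yet)
```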