Tr{Ceng Lou}

Study Notes

Briefing: PAC Learnability

The following notes are summarized from Prof. I-Hsiang Wang’s course Mathematical Principles of Machine Learning, covering lectures 1 and 2.

Discriminative Learning, Consistency, and No Free Lunch

In the setting of discriminative learning, the excessive risk $R_{\rm excessive}$ is decomposed into the estimation error (which comes from learning with a finite random sample) and the approximation error (which is associated with the size of the function class $\mathcal{F}$):

$R_{\rm excessive} = L(f_n, P) - L(f^* , P) = \underbrace{\Big( L(f_n, P) - \mathop{\inf}_{f \in \mathcal{F}} L(f, P) \Big)}_{\rm estimation \; error} + \underbrace{\Big( \mathop{\inf}_{f \in \mathcal{F}} L(f, P) - L(f^* , P) \Big)}_{\rm approximation \; error}$
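As a sanity check (my own illustration, not taken from the lecture), the following minimal Python sketch estimates the two terms on a toy problem: $X \sim {\rm Unif}[0,1]$, $Y = 1$ iff $X > 0.37$ with the label flipped with probability $\eta = 0.1$, and $\mathcal{F}$ consisting of threshold classifiers restricted to a coarse grid. The noise level, the threshold, the grid, and the helper names true_risk and erm_threshold are all assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

eta, t_star = 0.1, 0.37                        # assumed noise level and Bayes threshold
grid = np.array([0.0, 0.25, 0.5, 0.75, 1.0])   # hypothesis class: f_t(x) = 1{x > t}, t on this grid

def true_risk(t):
    # closed form of P(f_t(X) != Y) for this toy distribution
    return eta + (1 - 2 * eta) * abs(t - t_star)

def erm_threshold(x, y):
    # ERM over the grid with respect to the empirical 0-1 risk
    emp = [np.mean((x > t).astype(int) != y) for t in grid]
    return grid[int(np.argmin(emp))]

bayes_risk = true_risk(t_star)                    # L(f*, P)
best_in_class = min(true_risk(t) for t in grid)   # inf_{f in F} L(f, P)

for n in [20, 200, 2000]:
    risks = []
    for _ in range(200):                          # average over random training sets
        x = rng.uniform(size=n)
        y = (x > t_star).astype(int) ^ (rng.uniform(size=n) < eta)
        risks.append(true_risk(erm_threshold(x, y)))
    print(n,
          "estimation error ~", round(np.mean(risks) - best_in_class, 3),
          "approximation error =", round(best_in_class - bayes_risk, 3))
```

As $n$ grows, the averaged estimation error shrinks toward zero, while the approximation error stays fixed at $0.096$, determined solely by the mismatch between the grid and the Bayes threshold.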

where $f^{\ast}$ denotes the Bayes optimal rule. Consistency of a sequence of learning rules $f_n$ with respect to a distribution $P$ is defined as: \begin{eqnarray} && L(f_n , P) - L(f^* , P) \mathop{\longrightarrow}^{L_1} 0 \; {\rm as} \; n \rightarrow \infty \end{eqnarray}

And with the loss bounded from above, by the dominated convergence theorem the consistency criterion is equivalent to convergence in probability: \begin{eqnarray} \forall \, \epsilon > 0, \quad \mathop{\lim}_{n \rightarrow \infty} P^{\otimes n}\big[\, |L(f_n , P) - L(f^* , P)| < \epsilon \,\big] = 1 \end{eqnarray}
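To spell out the equivalence (a short check, not explicitly in the notes; $B$ denotes an upper bound on the excess risk and is a symbol introduced only here): writing $\Delta_n = L(f_n , P) - L(f^* , P) \geq 0$, Markov’s inequality gives one direction and boundedness the other, \begin{eqnarray} P^{\otimes n}[\Delta_n \geq \epsilon] \leq \frac{\mathbb{E}[\Delta_n]}{\epsilon}, \qquad \mathbb{E}[\Delta_n] \leq \epsilon + B \, P^{\otimes n}[\Delta_n \geq \epsilon], \end{eqnarray} so $L_1$ convergence implies convergence in probability and vice versa.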

The impossibility result holds for general learning paradigms as well as the discriminative paradigm. The no free lunch theorem states the existence of a bad distribution such that a small excessive risk can’t be achieved with finitely many samples. Therefore the discriminative approach confines the function class to circumvent such bad convergence behaviour. This leads to the concept of PAC learnability: some function classes are learnable, and some are not.

Note that the lower bound for the impossibility result is established via a key observation: if a distribution $P$ is concentrated on a finite set of data $X$ with $P(Y|X)$ deterministic, then the algorithm output can be highly biased. This is because the training data may not cover every element of $X$, which forces the algorithm to make a random guess on the unseen data (i.e. $f(X_1, Y_1, X_2, Y_2, \dots, X_n, Y_n) \perp Y_{\rm test}$).
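A minimal simulation of this construction (my own illustration; the support size $2n$, the memorizing learner, and the helper name memorizer_risk are assumptions made for the sketch): with $P_X$ uniform on $2n$ points and deterministic labels, even a learner that is perfect on every training point must guess on the part of the support it never saw, so its risk does not vanish as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def memorizer_risk(n, trials=500):
    # X uniform on a support of size 2n, labels deterministic given X.
    # The learner memorizes training pairs (zero error on seen points) and
    # guesses uniformly at random on unseen points (error 1/2 there), so its
    # true risk equals 0.5 * (probability mass of the unseen part of the support).
    risks = []
    for _ in range(trials):
        train_x = rng.integers(0, 2 * n, size=n)   # n i.i.d. training inputs
        seen = np.zeros(2 * n, dtype=bool)
        seen[train_x] = True
        risks.append(0.5 * (~seen).mean())
    return float(np.mean(risks))

for n in [10, 100, 1000]:
    print(n, round(memorizer_risk(n), 3))   # stays around 0.30 for every n
```

The printed risk stays near $0.3$ regardless of $n$, matching the claim that no finite sample size suffices uniformly over all distributions.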

PAC learnability, realizability

In the discriminative paradigm we aim to minimize the estimation error $\Xi_{\mathcal{F}}(f_n, P) = L(f_n, P) - \mathop{\inf}_{f \in \mathcal{F}} L(f, P)$, and we’re interested in the number of samples required to achieve a given $(\epsilon, \delta)$ generalization guarantee. Define the sample complexity of a hypothesis class $\mathcal{F}$ as

\begin{eqnarray} && n_{\mathcal{F}}(\epsilon, \delta) \mathop{=}^{\Delta} \min \big\{ n: \, \exists f_n \; {\rm s.t.} \; \forall P, \; P^{\otimes n}\big[\Xi_{\mathcal{F}}(f_n, P) \leq \epsilon\big] \geq 1-\delta \big\} \end{eqnarray}

So if there is some learning rule $f_n$ (whose output lies in $\mathcal{F}$) that, for every distribution $P$, guarantees an estimation error of at most $\epsilon$ with confidence at least $1-\delta$ from finitely many samples, then such a function class is PAC learnable.
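As a concrete instance (a standard result included here for orientation, not necessarily in the form given in the lecture): for a loss bounded in $[0,1]$, any finite class $\mathcal{F}$ is PAC learnable via ERM, with \begin{eqnarray} && n_{\mathcal{F}}(\epsilon, \delta) \leq \left\lceil \frac{2 \ln (2|\mathcal{F}|/\delta)}{\epsilon^2} \right\rceil \end{eqnarray} obtained by combining Hoeffding’s inequality with a union bound over $\mathcal{F}$ and the ERM bound below.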

The ERM rule provides a sufficient condition for PAC learnability, via \begin{eqnarray} && \Xi_{\mathcal{F}}(f_{\rm ERM},P) \leq 2 \mathop{\sup}_{f\in \mathcal{F}}|\Delta (f,P)| \end{eqnarray}

where $\Delta (f, P) = \hat{L}_n(f) - L(f, P)$ is the deviation between the empirical risk $\hat{L}_n(f)$ and the true risk $L(f, P)$.
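The bound follows from a standard three-term split (spelled out here for completeness): for any $f \in \mathcal{F}$, \begin{eqnarray} L(f_{\rm ERM}, P) - L(f, P) &=& \big(L(f_{\rm ERM}, P) - \hat{L}_n(f_{\rm ERM})\big) + \big(\hat{L}_n(f_{\rm ERM}) - \hat{L}_n(f)\big) + \big(\hat{L}_n(f) - L(f, P)\big) \nonumber \\ &\leq& |\Delta(f_{\rm ERM}, P)| + 0 + |\Delta(f, P)| \; \leq \; 2 \mathop{\sup}_{g \in \mathcal{F}} |\Delta(g, P)| \end{eqnarray} since $\hat{L}_n(f_{\rm ERM}) \leq \hat{L}_n(f)$ by definition of ERM; taking the infimum over $f \in \mathcal{F}$ gives the stated bound on $\Xi_{\mathcal{F}}(f_{\rm ERM}, P)$.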

Realizability indicates the existence of some $f \in \mathcal{F}$ such that $L(f, P) = 0$, which is equivalent to $\ell (f(X), Y) = 0$ with probability one. In this case:

  • ERM always achieves zero empirical risk.

  • $f^{*}$ is a deterministic function.

Thus any bad function $f_{\rm bad} \in \mathcal{F}$ is less likely to be picked by the ERM rule. This indeed yields a sample complexity with a smaller scaling factor, $\mathcal{O}(\epsilon^{-1})$, compared with $\mathcal{O}(\epsilon^{-2})$ in the agnostic case.
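For a finite class this scaling can be made explicit (again a standard bound, not necessarily the form used in the lecture): under realizability with a $[0,1]$-bounded loss, the probability that some $f \in \mathcal{F}$ with $L(f, P) > \epsilon$ attains zero empirical risk is at most $|\mathcal{F}|(1-\epsilon)^n \leq |\mathcal{F}| e^{-n\epsilon}$, so \begin{eqnarray} && n_{\mathcal{F}}(\epsilon, \delta) \leq \left\lceil \frac{\ln (|\mathcal{F}|/\delta)}{\epsilon} \right\rceil \end{eqnarray} compared with the $\epsilon^{-2}$ dependence of the agnostic bound above.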