Data science notes

I/ Definitions

1) Towards the estimator


Sample

A set of individuals representative of a population.

Statistic

The result of an operation applied to a sample.

Random variable

A function defined on the sample space (the set of possible outcomes of a random experiment).

Probability distribution

A law describing the behavior of a random variable.


Estimator

A statistic used to evaluate an unknown parameter of a probability distribution (e.g. expectation or variance).

Quality measures:

| Measure of $\hat\theta$ | Value |
|---|---|
| Bias | $Bias(\hat \theta)=E[\hat \theta]-\theta$ |
| Mean squared error | $MSE(\hat \theta)=E[(\hat \theta-\theta)^2]$ |
| Convergence (consistency) | $\lim \limits_{n\rightarrow \infty} \mathbb{P}(\mid\hat \theta_n -\theta\mid>\epsilon)=0, \space \forall\epsilon>0$ |
| Strong convergence (strong consistency) | $\mathbb{P}(\lim \limits_{n\rightarrow \infty} \hat \theta_n =\theta)=1$ |
| Efficiency | $Var(\hat{\theta})=E[\hat{\theta}^2]-E[\hat{\theta}]^2$ |
| Robustness | Sensitivity to outliers |
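To make the bias, mean squared error and variance measures concrete, here is a minimal simulation sketch. It assumes a normal population with variance 4 and compares two estimators of that variance (dividing by n or by n-1); the population, sample size and seed are illustrative choices, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible
true_var = 4.0                  # theta: variance of the sampled N(0, 2^2) law
n, n_trials = 10, 100_000       # sample size and number of simulated samples

samples = rng.normal(0.0, 2.0, size=(n_trials, n))
estimates = {
    "divide by n": samples.var(axis=1, ddof=0),
    "divide by n-1": samples.var(axis=1, ddof=1),
}

results = {}
for name, theta_hat in estimates.items():
    bias = theta_hat.mean() - true_var           # E[theta_hat] - theta
    mse = ((theta_hat - true_var) ** 2).mean()   # E[(theta_hat - theta)^2]
    var = theta_hat.var()                        # E[theta_hat^2] - E[theta_hat]^2
    results[name] = (bias, mse, var)
    print(f"{name}: bias={bias:.3f}, MSE={mse:.3f}, variance={var:.3f}")
```

On such a run the "divide by n" estimator should show a bias close to $-\sigma^2/n$ while the "divide by n-1" one should be close to unbiased, and in both cases the MSE equals the squared bias plus the variance.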

2) Towards the $\chi^2$ test

Normal distribution

$density(x)=\frac{1}{\sigma \sqrt{2\pi }}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
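As a quick sanity check of the density formula, the sketch below evaluates it directly and verifies numerically that it integrates to about 1. The function name `normal_density` and the parameter values are illustrative assumptions.

```python
import math

def normal_density(x, mu=0.0, sigma=1.0):
    """Density of the normal law N(mu, sigma^2) evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Riemann sum over [mu - 8 sigma, mu + 8 sigma]: should be close to 1
mu, sigma, step = 1.0, 2.0, 0.001
n_steps = int(16 * sigma / step)
total = sum(normal_density(mu - 8 * sigma + k * step, mu, sigma) for k in range(n_steps)) * step
print(total)
```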

Central limit theorem


$\chi^2$ distribution

Null hypothesis

Postulates equality between statistics of two different samples, assumed to be drawn from equivalent populations.


Statistical test

Can the null hypothesis be rejected? Such tests return a p-value or a critical value, which is compared with conventional thresholds to conclude.

Steps of a hypothesis test:

  1. Make an initial assumption
  2. Collect data
  3. Gather evidence to reject or not the assumption
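The three steps above can be sketched with a one-sample, two-sided z-test. This is only an illustration under stated assumptions: the measurements, the hypothesized mean of 100 and the known population standard deviation of 5 are all made up, and `z_test` is a hypothetical helper name.

```python
import math

def z_test(sample, mu0, sigma):
    """Two-sided one-sample z-test; returns (z, p_value)."""
    n = len(sample)
    mean = sum(sample) / n
    z = (mean - mu0) / (sigma / math.sqrt(n))
    # standard normal CDF written with the error function
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)

# 1. initial assumption (null hypothesis): the population mean is 100
# 2. collected data (hypothetical measurements)
sample = [104, 98, 110, 107, 101, 109, 103, 106]
# 3. evidence: compare the p-value to a conventional threshold
z, p_value = z_test(sample, mu0=100, sigma=5)
alpha = 0.05
print(f"z={z:.2f}, p-value={p_value:.4f}, reject H0: {p_value < alpha}")
```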

Z-score



Approximation of the z-test when only a sample of the population is known

$\chi^2$ test

3) Loss functions


\[loss_{0-1}(\hat f, X, y) = \sum\limits_{i=1}^{|X|}I_{y^{(i)} \ne \hat f(x^{(i)})}\]

\[loss_{L_1}(\hat f, X, y) = \sum\limits_{i=1}^{|X|}|y^{(i)} - \hat f(x^{(i)})|\]

\[loss_{L_2}(\hat f, X, y) = \sum\limits_{i=1}^{|X|}(y^{(i)} - \hat f(x^{(i)}))^2\]
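The three losses translate directly into NumPy. The sketch below assumes `f_hat` is a callable applied to the whole array of samples at once; the toy data and the thresholding model are hypothetical.

```python
import numpy as np

def loss_01(f_hat, X, y):
    """Number of samples whose prediction differs from the label."""
    return int(np.sum(f_hat(X) != y))

def loss_l1(f_hat, X, y):
    """Sum of absolute differences between labels and predictions."""
    return float(np.sum(np.abs(y - f_hat(X))))

def loss_l2(f_hat, X, y):
    """Sum of squared differences between labels and predictions."""
    return float(np.sum((y - f_hat(X)) ** 2))

# hypothetical toy data: one feature per sample, f_hat thresholds it at 0
X = np.array([-2.0, -0.5, 0.3, 1.5])
y = np.array([0, 1, 1, 1])
f_hat = lambda X: (X > 0).astype(int)
print(loss_01(f_hat, X, y), loss_l1(f_hat, X, y), loss_l2(f_hat, X, y))
```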


Use neither $loss_{L_2}$ nor $loss_{L_1}$ in classification if the activation function output can be greater than $1$: correctly classified samples whose activation output is $>1$ would still contribute to the weight updates, which might slow the learning.

II/ Bias-variance tradeoff

Training set $x_1,…,x_n$ with $y_i$ associated with each sample.

ML models as estimators

Estimator: a statistic used to evaluate an unknown parameter of a probability distribution

| The square of the bias of the learning method | The variance of the learning method | The irreducible error |
|---|---|---|
| $(E[\hat{f}(x)]-f(x))^2$ | $E[\hat{f}(x)^2]-E[\hat{f}(x)]^2$ | $\sigma^2$ |
| The error caused by the simplifying assumptions built into the method | How much the learning method $\hat{f}(x)$ will move around its mean | Since all three terms are non-negative, this forms a lower bound on the expected error on unseen samples |
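The three terms can be estimated by simulation. The sketch below is an illustration under assumptions that are not in the notes: the true function is $\sin$, the noise is Gaussian with $\sigma = 0.3$, and the learning method is a least-squares polynomial fit of degree 1. It checks that the expected squared error on unseen targets at a point `x0` is close to the sum of the three terms.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                      # hypothetical true function
sigma = 0.3                     # std of the irreducible noise
x0 = 1.0                        # point where the error is decomposed
n, n_trials, degree = 30, 5000, 1

preds = np.empty(n_trials)
for t in range(n_trials):
    x = rng.uniform(0, 3, n)                    # a fresh training set
    y = f(x) + rng.normal(0, sigma, n)
    coeffs = np.polyfit(x, y, degree)           # learning method: least-squares polynomial
    preds[t] = np.polyval(coeffs, x0)           # f_hat(x0) for this training set

bias2 = (preds.mean() - f(x0)) ** 2             # (E[f_hat(x0)] - f(x0))^2
variance = preds.var()                          # E[f_hat(x0)^2] - E[f_hat(x0)]^2
y0 = f(x0) + rng.normal(0, sigma, n_trials)     # unseen noisy targets at x0
expected_error = ((y0 - preds) ** 2).mean()
print(bias2, variance, sigma ** 2, expected_error)
```

Raising `degree` should lower the squared bias and raise the variance, which is the tradeoff this section describes.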

The best model has a complexity $c_0$ such that: \(\frac{d(Bias^2)}{d(complexity)}(c_0)=-\frac{d(Variance)}{d(complexity)}(c_0)\)

import numpy as np
import matplotlib.pyplot as plt

# Illustrative curves only: Bias² falls and Variance grows with model complexity.
# Drawn with matplotlib over the x range [0.3, 2] and y range [-1, 5] of the plot borders.
x = np.linspace(0.3, 2, 200)
error = 1/x**2 + x**2 + 0.5

plt.title("Bias-variance tradeoff")
plt.plot(x, 1/x**2, "-", label="Bias²")
plt.plot(x, x**2, "-", label="Variance")
plt.plot(x, np.full_like(x, 0.5), "-", label="NoiseError")
plt.plot(x, np.zeros_like(x), "+", markersize=1)
plt.plot(x, error, "-", label="Error")
plt.plot(x, np.gradient(error, x), "-", label="d(Error)")  # numerical derivative of the error
plt.axis([0.3, 2, -1, 5])
plt.legend()
plt.show()

Vapnik–Chervonenkis Capacity

Evaluation, Model selection


More measures are summarized on the confusion matrix Wikipedia page.

ROC: receiver operating characteristic

|   | really P | really N |
|---|---|---|
| predicted as P | TP | FP |
| predicted as N | FN | TN |
|   | P = TP + FN | N = FP + TN |

True positive rate (share of actual positives correctly classified): $TPR=\frac{TP}{TP+FN}=\frac{TP}{P}$

False positive rate (share of actual negatives misclassified): $FPR=\frac{FP}{FP+TN}=\frac{FP}{N}=1 - TNR$

We suppose that our classifier predicts $P$ if its output score $\geq\theta$.

When moving $\theta$ in its interval $[a,b]$, here are special cases:


Area Under Curve (AUC) $\in [0.5, 1]$ for a classifier at least as good as random guessing (in general AUC $\in [0, 1]$).
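As a rough sketch of how the ROC curve and its area come out of sweeping $\theta$: for each threshold, predict $P$ when the score is $\geq\theta$, compute $(FPR, TPR)$, then integrate with the trapezoidal rule. The scores and labels are hypothetical, and `roc_points` is an illustrative helper, not a library function.

```python
import numpy as np

def roc_points(scores, labels):
    """(FPR, TPR) arrays obtained by predicting P when score >= theta."""
    P, N = np.sum(labels == 1), np.sum(labels == 0)
    # thresholds from above the highest score (nothing predicted P) down to the lowest score
    thetas = np.concatenate(([np.inf], np.sort(np.unique(scores))[::-1]))
    fpr, tpr = [], []
    for theta in thetas:
        predicted_p = scores >= theta
        tpr.append(np.sum(predicted_p & (labels == 1)) / P)
        fpr.append(np.sum(predicted_p & (labels == 0)) / N)
    return np.array(fpr), np.array(tpr)

# hypothetical scores and true labels
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1])
labels = np.array([1, 1, 0, 1, 0, 1, 0, 0])
fpr, tpr = roc_points(scores, labels)
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))  # trapezoidal rule
print(auc)  # 0.8125
```

The value 0.8125 is also the fraction of (positive, negative) pairs in which the positive sample gets the higher score: 13 of the 16 pairs here.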


$Accuracy=\frac{TP+TN}{P+N}=\frac{\vert right \space guesses\vert}{\vert all\space records\vert}$


$Precision=\frac{TP}{TP+FP}=\frac{\vert records\space rightly\space predicted\space as\space P\vert }{\vert records\space predicted\space as\space P\vert }=positive\space predictive\space value\space (PPV)$


$Recall=\frac{TP}{TP+FN}=\frac{TP}{P}=\frac{\vert records\space rightly\space predicted\space as\space P\vert }{\vert actual \space positive \space targets\vert }=TPR=hit\space rate$
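The three ratios above (commonly called accuracy, precision and recall) can be computed from the four confusion-matrix counts. A minimal sketch with hypothetical counts; `classification_metrics` is an illustrative name.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from the four confusion-matrix counts."""
    p, n = tp + fn, fp + tn
    return {
        "accuracy": (tp + tn) / (p + n),
        "precision": tp / (tp + fp),
        "recall": tp / p,
    }

# hypothetical counts
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(metrics)
```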


Hyperparameter tuning


Bayesian approach

Time series split

To ensure an unbiased split:

|   | individuals A | individuals B |
|---|---|---|
| Oldest events | Used for train |   |
| Most recent events |   | Used for test |

The train-validation split inside the outer train set must follow the same logic.
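One possible way to sketch this split in code, assuming each record carries an individual identifier and a timestamp: train keeps only the older events of one group of individuals, test keeps only the recent events of a disjoint group. The record layout, the cutoff and the function name are assumptions made for the example.

```python
def time_series_split(records, train_individuals, test_individuals, cutoff):
    """records: list of (individual_id, timestamp, payload) tuples."""
    train = [r for r in records if r[0] in train_individuals and r[1] < cutoff]
    test = [r for r in records if r[0] in test_individuals and r[1] >= cutoff]
    return train, test

# hypothetical events: (individual, day, value)
records = [("a1", 1, 0.2), ("a1", 8, 0.4), ("a2", 3, 0.1),
           ("b1", 2, 0.9), ("b1", 9, 0.7), ("b2", 10, 0.5)]
train, test = time_series_split(records, {"a1", "a2"}, {"b1", "b2"}, cutoff=5)
print(train)
print(test)
```

Applying the same function again to `train`, with a split of its individuals and an earlier cutoff, gives a train-validation split that follows the same logic.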


Kernel trick

Tour of algorithms

Naive Bayes Classifier

Classification And Regression Tree



Random forest


Regularized Regressions




Support Vector Machine

Logistic Regression

MLE oriented Logistic Regression

Maximum Likelihood Estimation: A method for fitting a distribution model follows the MLE principle iff it seeks the parameters under which the observed data are as probable as possible according to the model.

Logistic Regression: Consider a binary classification use case, given a matrix $X\in R^{n_x\times m}$ of samples (as columns) and binary labels $y \in \{0, 1\}^m$. Logistic regression learning uses gradient descent to fit parameters $w\in R^{n_x}$ and $b\in R$. With $ŷ^{(i)} = \sigma(w^T X^{(i)} + b)$ the model prediction for the $i^{th}$ sample ($\sigma: z \mapsto \frac{1}{1+e^{-z}}$ being the sigmoid), the probability the model assigns to the label $y^{(i)}$ is:

\[p(y^{(i)}\vert X^{(i)}) = {ŷ^{(i)}}^{y^{(i)}}.(1 - ŷ^{(i)})^{1-y^{(i)}}\]


MLE compliant Logistic Regression: The MLE principle wants the algorithm to maximize the $likelihood = \prod_{i=1}^{m} p(y^{(i)}\vert X^{(i)})$, which is how probable our training samples are from the perspective of our model (under the assumption that our samples are independent and identically distributed, or IID). Maximizing the likelihood is the same task as maximizing its log, i.e.:

\[log(likelihood) = \sum_{i=1}^m log(p(y^{(i)}\vert X^{(i)}))\]

\[= \sum_{i=1}^m log({ŷ^{(i)}}^{y^{(i)}}.(1 - ŷ^{(i)})^{1-y^{(i)}})\]

\[= \sum_{i=1}^m [ y^{(i)}.log(ŷ^{(i)}) + (1-y^{(i)}).log(1 - ŷ^{(i)})]\]

So to comply with the MLE principle, we just have to pass to our classic logistic regression algorithm the cost function (with a little rescaling):

\[C: (y,ŷ) \mapsto \frac{1}{m} \sum_{i=1}^m L(y^{(i)}, ŷ^{(i)})\]

$L$ being the loss function applied to a single sample:

\[L: (y, ŷ) \mapsto -(y.log(ŷ) + (1-y).log(1 - ŷ))\]

Note the minus sign that is here to transform a quantity that we want to maximize (the likelihood) into a quantity that we want to minimize (the cost function).
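A small numerical check may help here: the cost is the negative log-likelihood divided by $m$. The sketch below uses hypothetical labels and predictions, and the function names `loss` and `cost` are illustrative.

```python
import numpy as np

def loss(y, y_hat):
    """Per-sample loss L: minus the log-probability the model gives to the true label."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(y, y_hat):
    """Cost C: average of the per-sample losses."""
    return float(np.mean(loss(y, y_hat)))

# hypothetical labels and predictions
y = np.array([1, 0, 1, 1])
y_hat = np.array([0.9, 0.2, 0.6, 0.5])
likelihood = np.prod(np.where(y == 1, y_hat, 1 - y_hat))
print(cost(y, y_hat), -np.log(likelihood) / len(y))
```

Both printed values should agree, so lowering the cost and raising the likelihood are the same objective.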

\[\frac{\partial L}{\partial ŷ}: (y, ŷ) \mapsto -\frac{y}{ŷ} + \frac{1-y}{1-ŷ}\]
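Chaining this derivative with the sigmoid gives the gradient that gradient descent needs. As a short derivation, assuming the notation $z = w^T x + b$ and $ŷ = \sigma(z) = \frac{1}{1+e^{-z}}$ (symbols introduced here for the derivation), so that $\frac{\partial ŷ}{\partial z} = ŷ(1-ŷ)$:

\[\frac{\partial L}{\partial z} = \frac{\partial L}{\partial ŷ}.\frac{\partial ŷ}{\partial z} = \left(-\frac{y}{ŷ} + \frac{1-y}{1-ŷ}\right).ŷ(1-ŷ) = -y(1-ŷ) + (1-y)ŷ = ŷ - y\]

\[\frac{\partial C}{\partial w} = \frac{1}{m}\sum_{i=1}^m (ŷ^{(i)} - y^{(i)}).X^{(i)} \qquad \frac{\partial C}{\partial b} = \frac{1}{m}\sum_{i=1}^m (ŷ^{(i)} - y^{(i)})\]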

Minimalist vectorized Python implementation (inspired by Andrew Ng's course)

# A vectorized logistic regression implementation, with MLE-compliant cost function
import numpy as np

# dims
nx = 2  # number of features
m = 20  # number of samples

# training data: linearly separable labels built from a known direction
X = np.random.randint(-99, 100, (nx, m))  # nx x m
y = (np.dot(X.T, np.array([5, -1])) > 0).astype(np.int16)  # shape (m,), broadcasts against 1 x m

# sigmoid (logistic) function, applied element-wise to arrays
def sigmoid(z):
    return 1/(1 + np.exp(-z))

def learn(X, y, alpha, n_iter):
    nx, m = X.shape
    w = np.random.uniform(-1, 1, (nx, 1))
    b = np.random.uniform(-1, 1)
    for _ in range(n_iter):
        # forward propagation a = y_hat
        a = sigmoid(np.dot(w.T, X) + b)  # dim 1 x m
        # backward propagation: gradients of the cost, averaged over the m samples
        w -= alpha * np.dot(a - y, X.T).T/m
        b -= alpha * np.average(a - y)  # np.average already divides by m
    return (w, b)

w, b = learn(X, y, 0.01, 10)
print(f"learned parameters:\nw={w}\nb={b}")

# a sample is well classified when its prediction is within 0.5 of its label
print(f"accuracy={np.sum(np.abs(sigmoid(np.dot(w.T, X) + b) - y) < 0.5)/m}")

Neural Networks

Matrix representations

A bit of notation

\[W_{unit\space id}^{[layer\space id\space k]} \in R^{\# units\space in\space layer\space k-1}\]

\[A_{unit\space id}^{[layer\space id\space k](sample\space id\space i)} \in R\]

Vectorizable forward propagation using matrix representations

The activation values $A^{[k]} \in \mathbb{R}^{n^{[k]} \times m}$ of all the $n^{[k]}$ units of the $k^{th}$ layer are computed on the entire training set at once, from the previous layer's activations $A^{[k-1]} \in \mathbb{R}^{n^{[k-1]} \times m}$ (with $A^{[0]} = X$) and the layer's parameters stacked in $W^{[k]} \in \mathbb{R}^{n^{[k]} \times n^{[k-1]}}$ and $b^{[k]} \in \mathbb{R}^{n^{[k]} \times 1}$:

\[A^{[k]} = g^{[k]}(W^{[k]}.A^{[k-1]} + b^{[k]})\]
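The formula above can be applied layer by layer with NumPy. This is a minimal sketch: the network shape (3 inputs, 4 hidden units, 1 output), the random weights and the choice of ReLU and sigmoid activations are assumptions made for the example.

```python
import numpy as np

def forward(X, weights, biases, activations):
    """Apply A[k] = g[k](W[k] . A[k-1] + b[k]) layer by layer, starting from A[0] = X."""
    A = X
    for W, b, g in zip(weights, biases, activations):
        A = g(np.dot(W, A) + b)   # b (n[k] x 1) broadcasts over the m columns
    return A

sigmoid = lambda z: 1 / (1 + np.exp(-z))
relu = lambda z: np.maximum(z, 0)

# hypothetical network: 3 inputs -> 4 hidden units (ReLU) -> 1 output unit (sigmoid)
rng = np.random.default_rng(0)
m = 5
X = rng.normal(size=(3, m))
weights = [rng.normal(size=(4, 3)), rng.normal(size=(1, 4))]
biases = [np.zeros((4, 1)), np.zeros((1, 1))]
A_out = forward(X, weights, biases, [relu, sigmoid])
print(A_out.shape)  # (1, 5)
```

No loop over samples is needed: each matrix product handles all $m$ columns at once, which is what makes the representation vectorizable.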

