---
layout: post
title: Frequently Asked Questions
permalink: /faq/
---

## Maths/Stats

Answers are in [Algebra for ML Engineers]({{ site.baseurl }}{% post_url 2018-04-20-Algebra-for-ML %}) and [Statistics for ML Engineers]({{ site.baseurl }}{% post_url 2018-06-15-Statistics-for-ML %}).

  • What is a random variable?
  • What’s a linear combination of vectors?
  • How to check if two vectors are independent?
    • If the dot product is zero, they're orthogonal, and nonzero orthogonal vectors are always independent; in general, two vectors are dependent iff one is a scalar multiple of the other.
  • How to check if vectors are orthogonal or parallel?
  • What are orthogonal vectors, normal vectors, orthonormal vectors?
  • What's an orthogonal matrix?
  • What's the cross product of two perpendicular vectors and the cross product of two parallel vectors?
    • The cross product $$a \times b$$ is always orthogonal to both vectors, and has magnitude zero when the vectors are parallel and maximum magnitude $$\|a\|\|b\|$$ when they are orthogonal.
  • What's the dot product of two perpendicular vectors and the dot product of two parallel vectors?
    • The dot product of orthogonal vectors is $$0$$, as $$\cos\theta = 0$$. The dot product of parallel vectors $$a$$ and $$b$$ is $$\pm\|a\|\|b\|$$, i.e. the product of their magnitudes, with the sign given by whether they point in the same or opposite directions.
  • How to check if multiple vectors are independent?
    • Bring the matrix with all vectors as columns into row echelon form; the vectors are independent iff every column has a pivot.
  • What is the generating set of a vector space? And a basis, and the dimension?
  • What are the rank and the trace of a matrix?
    • Trace = sum of the diagonal entries; rank = number of linearly independent rows/columns.
  • What does full rank mean for a matrix?
    • Full rank means the rank is the maximum possible for the matrix's dimensions; for square matrices, full rank means invertibility.
  • What is an affine transformation? What does it guarantee?
    • A linear map plus a translation, $$x \mapsto Ax + b$$; it guarantees that parallelism is preserved.
  • What is a projection?
    • An idempotent linear map: applying it twice gives the same result as applying it once, $$P^2 = P$$.
  • What is (reduced) row echelon form?
  • How many matrix multiplication operations are there?
    • Element-wise (Hadamard) multiplication, inner/dot product, outer product, cross product, and standard matrix multiplication.
  • How do you calculate the angle between two vectors?
  • What is the complexity of a dot product?
  • What is a (strictly) convex function?
  • Why do we need convex functions?
    • Every local minimum is a global minimum; a strictly convex function has a single (unique) global minimum.
  • What is gradient descent?
    • A first-order optimisation algorithm: repeatedly step in the direction of the negative gradient, $$x_{t+1} = x_t - \eta \nabla f(x_t)$$ (see the sketch after this list).
  • How do you compute a Jacobian and a Hessian? What is their dimensionality? Are they square and/or symmetric?
    • Dimensions: the Jacobian is $$f \times d$$ and the Hessian is $$d \times d$$. The Jacobian is in general neither square nor symmetric; the Hessian is square and symmetric (for twice continuously differentiable functions).
  • What are second-order optimisation algorithms? What is the Adam optimizer? Does it use the second derivative? (Why not?)
    • Second-order algorithms use the Hessian (or an approximation of it). Adam is a first-order optimizer that uses momentum and the second moment of the gradient, not the second derivative (Hessian), because computing the Hessian would be too slow.
    • In Adam, "second" refers to the second moment (squared gradient), not the second derivative!
  • How can the 2nd derivative be used in an optimisation algorithm?
    • Via Newton's method: the curvature rescales the gradient step, $$x_{n+1} = x_n - H^{-1}\nabla f(x_n)$$ (see Newton's method below).
  • How do you compute an expectation?
  • is $$\mathbb{Cov}[X,Y] = \mathbb{Cov}[Y,X]$$?
    • yes for univariate random vars, no for multivariate: $$\mathbb{Cov}[X,Y] = \mathbb{Cov}[Y,X]^{\intercal}$$
  • What is the variance of the sum of random vars, i.e. $$\mathbb{Var}[A+B]$$?
  • What is the expectation of the sum of random vars, i.e. $$\mathbb{E}[A+B]$$?
  • What is the sum of two random variables? (convolution)
  • What is the linear transformation of a gaussian random variable?
  • Describe Newton's method. Where does it come from? How can it be adapted to find the minimum of a function?
    • Newton's method finds a root of $$f$$ via $$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$. The minimum of a function can be found by computing the root of its derivative, i.e. $$x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}$$. In the multivariate case division is not defined between matrices and is replaced by multiplication with the inverse Hessian (see the sketch after this list).
  • What is the formula for the Taylor series? Why does it make sense? Why is there a $$1/n!$$ in the formula?
    • $$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}(x-a)^n$$. It makes sense because both sides have the same value and the same derivatives of every order at $$x=a$$; the $$1/n!$$ cancels the $$n!$$ produced by differentiating $$(x-a)^n$$ $$n$$ times.
  • What is an invertible matrix? What are the criteria that apply? What are the procedures to invert a matrix? What is the complexity?
    • The inverse $$A^{-1}$$ of a matrix $$A$$ is the matrix satisfying $$AA^{-1} = I = A^{-1}A$$.
    • If the inverse exists, $$A$$ is called regular/invertible/non-singular.
    • If the determinant is not zero, the matrix is invertible.
    • To invert, perform Gaussian elimination on the augmented matrix $$[A \mid I]$$ to achieve $$[I \mid A^{-1}]$$.
    • It has the same complexity as Gaussian elimination: cubic.
  • How do you compute the determinant of a triangular matrix?
    • The determinant of a triangular matrix is the product of its diagonal entries!
  • What is linear programming? Quadratic programming? Dynamic programming?
  • How to perform an inversion or Gaussian elimination of very large matrices?
    • Because Gaussian elimination has cubic complexity, very large matrices need an iterative algorithm instead, e.g. the Jacobi method.
  • What's the inverse of a symmetric matrix?
    • Not its transpose: for a symmetric matrix $$A^{\intercal} = A$$, and its inverse, when it exists, is itself symmetric. ($$A^{\intercal} = A^{-1}$$ holds for orthogonal matrices.)
  • If $$A$$ is invertible, is $$A^{\intercal}$$ invertible? Why?
    • Yes: $$(A^{\intercal})^{-1} = (A^{-1})^{\intercal}$$, since $$A^{\intercal}(A^{-1})^{\intercal} = (A^{-1}A)^{\intercal} = I$$.
  • How to check if a set of vectors is independent?
  • If vector $$A$$ is independent of $$B$$ and $$B$$ independent of $$C$$, is $$A$$ independent of $$C$$? Example? No! E.g. $$A=(1,0)$$, $$B=(0,1)$$, $$C=(2,0)$$: each adjacent pair is independent, but $$C = 2A$$.
  • If random var $$A$$ is independent of $$B$$ and $$B$$ independent of $$C$$, is $$A$$ independent of $$C$$? Example? No! E.g. take $$C = A$$ with $$B$$ independent of both.
  • What is the solution to least squares, or to a system of linear equations $$Ax=b$$?
    • If $$A$$ is square and invertible: $$x = A^{-1}b$$.
    • Otherwise (if $$A$$ has full column rank): $$x = (A^TA)^{-1}A^Tb$$, the least-squares solution.
  • What is the difference between Covariance and correlation?
  • What is the difference between dependence and correlation?
    • Correlation measures only linear dependence; variables can be dependent yet uncorrelated, e.g. $$Y = X^2$$ for $$X$$ symmetric around $$0$$.
  • What is a positive (semi) definite matrix? How to check if a matrix is psd? Why are covariance matrices psd?
    • A symmetric matrix $$A$$ with $$x^{\intercal}Ax \ge 0$$ for all $$x$$; equivalently, all eigenvalues are non-negative. (The diagonal being non-negative is necessary but not sufficient.) Covariance matrices are psd because $$x^{\intercal}\Sigma x = \mathbb{Var}[x^{\intercal}X] \ge 0$$.
  • What is likelihood? What does the likelihood result mean i.e. how to interpret it? Give an example. (??)
  • What’s the difference between probability and likelihood?
  • What are the product and sum rules of probability? What is marginalization and marginal likelihood?
  • What is the sum of two gaussian random variables?
  • What is Bayes' Formula? Can you give an example?
    • Bayes' formula describes the relationship between some prior knowledge $$p(x)$$ about an unobserved random variable $$x$$ and some relationship $$p(y \mid x)$$ between $$x$$ and a second variable $$y$$.
    • Example, learning the distribution of model parameters: $$p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)}$$
  • What is the problem with the Bayesian setup?
    • The posterior is often intractable or computationally prohibitive. Therefore we use, for instance, variational inference, MCMC, Taylor-series (Laplace) approximations, ...
  • What does Markov Chain Monte Carlo mean?
    • Random sampling (Monte Carlo) where each next step depends only on the current state (Markov chain).
  • What is the exponential family?
  • What properties do distributions of the exponential family have?
    • Conjugate priors, standard MLE/MAP solutions, and sufficient statistics $$\phi(x)$$.
  • What are generalized linear models?
  • What is a conjugate prior?
  • What is Jensen's inequality? When do you use it?
  • What is KL divergence? Is it a metric?
  • What is the central limit theorem? Give an example.
  • What is the law of large numbers? Give an example.
  • What are eigenvalues and eigenvectors? What do they mean?
    • $$Av = \lambda v$$: an eigenvector $$v$$ is a direction that $$A$$ only scales, and the eigenvalue $$\lambda$$ is the scaling factor.
    • An eigenbasis is a basis of $$\mathbb{R}^n$$ consisting of eigenvectors of $$A$$. Eigenvectors with different eigenvalues are automatically linearly independent.
  • What is eigendecomposition? Give the example of an algorithm.
  • How to compute eigenvalues and eigenvectors?
  • How do you perform a Principal Component Analysis?
  • How to recover original dataset from PCA?
    • $$X' = VX \Leftrightarrow V^{-1}X' = X$$, but because the eigenvectors are orthonormal, $$V^{\intercal}X' = X$$.
  • How do you perform a Singular Value Decomposition? when would you use this instead of PCA?
  • What is Cholesky's factorization (of a positive semidefinite matrix)?
  • What is a Toeplitz matrix? When to use it in ML?
    • A matrix where each descending diagonal (top-left to bottom-right) is constant, e.g. $$[ [a, b, c, d], [e, a, b, c], [d, e, a, b], [c, d, e, a] ]$$.
    • Used to compute a linear convolution as a multiplication by the Toeplitz matrix instead of sliding a kernel. Useful e.g. on GPUs (see the sketch after this list).
  • What’s a Fourier transform?
  • What are strided convolutions and what are they used for?
    • Convolutions with stride $$> 1$$; used for a larger receptive field in CNNs without increasing kernel size (and computation and memory), while also downsampling the feature map.
  • How are neural nets related to Fourier transforms? What are Fourier transforms, for that matter?
    • We can consider the discrete Fourier transform (DFT) to be an artificial neural network: it is a single layer network, with no bias, no activation function, and particular values for the weights. The number of output nodes is equal to the number of frequencies we evaluate.
  • What is a Lagrange multiplier? When to use it?
  • What are sufficient statistics? What theorem describes them? (Fisher–Neyman factorization)
  • How to sample from a Gaussian distribution?
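
For the gradient-descent question above, a minimal sketch in Python, assuming a toy quadratic objective (the function, learning rate, and step count are illustrative choices, not from the original):

```python
import numpy as np

def gradient_descent(grad_f, x0, lr=0.1, steps=100):
    """First-order optimisation: repeatedly step against the gradient."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad_f(x)  # x_{t+1} = x_t - eta * grad f(x_t)
    return x

# Toy objective f(x) = ||x - 3||^2, whose gradient is 2(x - 3).
print(gradient_descent(lambda x: 2 * (x - 3.0), x0=np.zeros(2)))  # -> ~[3. 3.]
```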
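
For Newton's method above, a 1-D minimisation sketch (the example function is an illustrative assumption): we apply the root-finding update to $$f'$$.

```python
def newton_minimize(df, d2f, x0, steps=20):
    """Minimise f by finding a root of f': x_{n+1} = x_n - f'(x_n)/f''(x_n)."""
    x = x0
    for _ in range(steps):
        x = x - df(x) / d2f(x)
    return x

# Toy example: f(x) = x^4 - 3x^2, so f'(x) = 4x^3 - 6x and f''(x) = 12x^2 - 6.
print(newton_minimize(lambda x: 4 * x**3 - 6 * x,
                      lambda x: 12 * x**2 - 6,
                      x0=2.0))  # -> ~1.2247 = sqrt(3/2), a local minimum
```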
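
For the Toeplitz question above, a sketch showing that a 1-D linear convolution equals a multiplication by a Toeplitz matrix (signal and kernel values are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])  # signal
k = np.array([1.0, -1.0, 2.0])      # kernel

# Build the Toeplitz matrix T with T[i, j] = k[i - j], so that T @ x
# equals the full linear convolution of k and x.
n_out = len(x) + len(k) - 1
T = np.zeros((n_out, len(x)))
for j in range(len(x)):
    T[j:j + len(k), j] = k          # column j = kernel shifted down by j

print(T @ x)              # one matmul, e.g. GPU-friendly
print(np.convolve(k, x))  # identical result
```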

## Machine Learning

  • What is a regularizer? What is L1/L2 regularization and when are they used?

    • To discuss: to regularize means to make things regular or acceptable; regularization aims at reducing overfitting. L1 = penalty on the absolute values of the weights = drives weights to zero (sparsity); L2 = penalty on the squared weights = shrinks weights smoothly.
  • What is the difference between Type I and Type II error?

    • A type I error (false-positive) occurs if an investigator rejects a null hypothesis that is actually true in the population; a type II error (false-negative) occurs if the investigator fails to reject a null hypothesis that is actually false in the population.
  • How do batch-/layer normalization work as a regularizer?

    • It allows much higher learning rates and less careful initialization. Paper abstract: “Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift”.
  • What is the bias/variance tradeoff? How do they look in a good model?

    • A good model has both low bias and low variance; the tradeoff is that reducing one tends to increase the other.
  • What is overfitting/underfitting, how do we know if we are over-/underfitting, and how to overcome it?

    • Overfitting: low train error, high validation error (low bias, high variance).
    • Underfitting: high error on both train and validation (high bias).
    • To overcome overfitting:
      • Reduce the network’s capacity by removing layers or reducing the number of elements in the hidden layers: the higher the capacity, the easier it is for the model to memorize the training data.
      • Apply regularization, which comes down to adding a cost to the loss function for large weights.
      • Dropout.
  • What is dropout and how does it work? Why does it work?

    • During training each unit is zeroed with probability $$p$$; at validation/test time dropout is disabled and activations are rescaled accordingly. It reduces overfitting: the induced subnetworks learn alternative logics (while the full DNN is more prone to overfitting). Training becomes 2-3 times slower. It acts as a regularizer, much like L2 regularization (see the sketch after this list).
  • What are ensembles? Give few examples. What is bagging and boosting?

  • How to model uncertainty or error of a model?

    • Use an ensemble: take the mean of the ensemble's outputs as the prediction, and use the spread of the members around that mean as the uncertainty/error estimate.
  • How do ensemble methods work? What ensemble methods do you know?

    • An ensemble is a technique that combines multiple models into a stronger predictor.
    • Possible options: each model is trained on different data (maybe with some overlap), each model is trained with a different seed, or different model types are combined.
  • How to model graphs of molecules represented as strings? (Transformers)

  • What is the memory/computation bottleneck of a Transformer?

  • What are latent variables? What is a latent variable model?

  • What is a (direct) graphical model?

  • What is a graph neural net? What is the message passing and what's its formulation?

  • What are BERT's / GPT's pre-training and post-training tasks?

  • Why use Transformers instead of RNNs for sequences? What are the computational advantages/disadvantages?

    • RNNs tend to perform worse on tokens at the beginning of the sequence, and better on the final tokens.
    • Transformers allow every token to attend to every other token, with weights given by the attention matrix.
  • What is the point of the mask in the attention head of the Transformer?

  • What are GANs and what are the challenges in training them? What's the loss function?

    • Mode collapse; the discriminator learning too fast, so the gradients flowing back to the generator become too small.
  • Describe the loss function of a GAN.

  • What's the goal of Variational Inference?

  • What are VAEs and what are the challenges in training them? What's the loss function?

    • Challenges: the sampling step requires the reparametrization trick to backpropagate through it. The loss combines a reconstruction term with a KL term.
  • What are kernels and how do they work? What is a Gram matrix? How are they useful? Limitations? Complexity?

    • We project the input into a higher-dimensional space where we can do (linear) regression/separation; the kernel computes inner products there without materializing the projection.
  • How do SVMs work? What's the complexity?

    • Maximize the margin (the most robust separating hyperplane).
    • Lagrange multipliers.
    • Support vectors, dual optimization problem.
    • $$K(x,y)= \phi(x)^T \phi(y)$$; evaluating the kernel over all pairs of samples makes it quadratic.
    • We can then perform linear regression/separation on the higher-dimensional data.
    • SVM training complexity is therefore quadratic to cubic in the number of samples, so SVMs do not scale well with data.
  • What's least squares? Derive its analytical solution.

    • Computed by minimizing $$\|y - Xw\|^2 = (y-Xw)^T(y-Xw)$$.
    • Differentiating with respect to $$w$$ and setting the gradient to zero leads to $$w = (X^TX)^{-1} X^Ty$$.
  • What's ridge regression? Derive its analytical solution.

    • Computed by minimizing $$\|y - Xw\|^2 + \lambda w^Tw = (y - Xw)^T(y-Xw) + \lambda w^Tw$$.
    • Differentiating with respect to $$w$$ and setting the gradient to zero leads to $$w = (X^TX + \lambda I)^{-1} X^Ty$$ (see the sketch after this list).
  • What are entropy and mutual information?

  • What loss function to use for categorical classification, binary classification, regression, etc?

  • What’s the difference between a generative and discriminative model?

  • How is KNN different than k-means?

  • How do ROC curves work?

  • What is an autoencoder?

  • What is a random walk?

    • the movements of an object or changes in a variable that follow no discernible pattern or trend.
  • What does stochastic mean?

    • “Stochastic” is a description that refers to outcomes based upon random probability. Or having a random probability distribution or pattern that may be analysed statistically but may not be predicted precisely.
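
For the dropout question above, a minimal sketch of inverted dropout (the common convention of rescaling at train time; names and shapes are illustrative assumptions):

```python
import numpy as np

def dropout(h, p=0.5, training=True, rng=np.random.default_rng(0)):
    """Zero each unit with probability p during training; rescale the rest."""
    if not training:
        return h                       # dropout is disabled at eval time
    mask = rng.random(h.shape) >= p    # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)        # rescale so the expected output is h

h = np.ones((2, 4))
print(dropout(h))  # roughly half the entries zeroed, the rest equal to 2.0
```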
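
For the least-squares and ridge derivations above, a sketch of both closed-form solutions on synthetic data (the data generation is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Least squares: w = (X^T X)^{-1} X^T y  (via solve, not an explicit inverse)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: w = (X^T X + lambda I)^{-1} X^T y
lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print(w_ols)    # close to the true weights
print(w_ridge)  # shrunk slightly towards zero
```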

## ML software engineering

  • Train, validation, test datasets? k-cross validation?
  • What is the back-propagation algorithm?
  • What is the Q-function, policy function, state and action in Reinforcement Learning?
    • Build a DNN that takes the current state as input and outputs the Q-value for each possible action.
    • Q-function: given a state, the value (return) of every action that we can take, i.e. $$Q(state, action)$$.
    • The policy function follows from the Q-function: given the state, pick the best action over all Q-values, e.g. $$\pi(state) = \arg\max_a Q(state, action)$$.
    • The Q-loss is then backpropagated through the network.
  • What's the setup for DNN for Reinforcement learning e.g. ATARI player?
    • What's the loss (Q-loss)?
  • What's the downside of using Q-values? What's the alternative?
    • Complexity: Q-values can only model scenarios where the action space is discrete and small; they can't handle continuous action spaces.
    • To address this, we use policy gradient methods.
  • What are the different deep reinforcement learning algorithms?
    • Q-learning: find $$Q(s,a)$$ then $$\pi(s) = argmax_a Q(s,a)$$.
    • Policy learning: find $$\pi(s)$$ then sample $$a \sim \pi(s)$$
  • How does policy gradient work? What's the loss function? How to pre-train?
    • The DNN learns parametric policies, e.g. $$P(a_1 \mid s)$$ for each action, where $$P(a \mid s) = \mathcal{N}(\mu, \sigma^2)$$; then $$\pi(s) \sim P(a \mid s)$$.
    • Car racing: pre-train in a simulated environment.
    • Board games: train two networks against each other (self-play).
  • What are the different types of scaling and parallelism?
    • microbatching (gradient accumulation), data parallelism, model parallelism across layers, pipelining, sharding, etc
  • What is automatic differentiation and how does it work?
  • What are the different ways of reducing model size?
    • Model quantization, mixed precision, pruning, or model distillation (into a smaller model).
  • What is the receptive field of a CNN?
  • What is pooling in a CNN and why do we need it? And the kernel, and why do we need it?
  • Why and when to use ReLU, Leaky ReLU, or sigmoid?
    • With a Leaky ReLU (LReLU), you won’t face the “dead ReLU” (or “dying ReLU”) problem, which happens when a ReLU’s input is always below 0: this completely blocks learning in the ReLU because the gradient is 0 in the negative part.
  • A method/recipe to create a large neural network, choose the hyper-parameters (layer count, neuron count), avoid overfitting, etc.?
  • How to compute variance of a set of $$n$$ values, in one single loop?
    • As we loop, accumulate the sum of $$x_i$$ and the sum of $$x_i^2$$. At the end, divide both sums by $$n$$ and subtract the squared mean from the mean of the squares (see the sketch at the end of this list).
    • I.e. we compute the variance as $$\mathbb{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$$.
  • What is a Turing Machine?
  • What are the performance metrics used for binary classification?
    • accuracy, precision, recall, F1-score. ROC curve?
  • Does low loss mean higher accuracy? Give an example. Why to use accuracy instead of loss to measure performance?
  • What is a decision tree, and how is a decision tree pruned?
  • What is the memory bottleneck in DNNs and CNNs?
    • DNNs: optimizer state and model weights. CNNs: activations (layer outputs).
  • How to compare two Bayesian models (priors)?
    • Compute the ratio of posteriors; both models share the same evidence, and assuming the same prior, it all comes down to the ratio of likelihoods.
  • Why does the cross-entropy formula make sense?
    • Cross-entropy is the negative log-likelihood of the target under the model's output distribution (the softmax of the outputs, if they are not already a distribution).
  • Why do we use cross-entropy on logits instead of a separate softmax plus negative log-likelihood? (Computational stability: the log-softmax avoids overflow; see the sketch at the end of this list.)
  • How to build a CNN for image segmentation?
  • How to find edges without DNN (ie using traditional methods)?
    • Shift (jitter) the image by one pixel and compare it with the original; locations with the largest differences are edges (this approximates the image gradient, as in Sobel-style filters).
  • What's the dimensionality of a kernel in a CNN?
  • Explain how to collect embeddings.
    • VAE middle layer, U-net middle layer / downsampling, PCA.
  • What is an autoencoder?
    • A DNN that learns to compress an input into a low-dimensional latent code and to reconstruct the input from that code.
  • What is the Difference between Bagging and Boosting?
  • What is a Random Forest? What is the computational complexity in training?
  • Given two sets of samples, how to check if they come from the same distribution?
    • If shape of distribution is known, e.g. gaussian, fit to gaussian and compare mean and variance values.
    • Otherwise a first quick comparison is to compute the expected value and the variance of the samples.
    • Otherwise, build a histogram for each set of samples, then compare the overlap of the histograms. Or, equivalently, compare the absolute difference of the empirical CDFs, which is the two-sample Kolmogorov–Smirnov statistic (see the sketch at the end of this list).
      • If the overlap is above a given threshold (equivalently, the CDF difference is below it), assume they're drawn from the same distribution.
    • Note: a perfect match/comparison is not possible, because the histogram/shape of two sets of samples will likely not match exactly, even if drawn from the exact same distribution.
  • What are exploding and vanishing gradients?
  • Why do we need skip connections?
    • To train deep networks: they mitigate vanishing gradients.
  • What are the components of a CNN? what is the stride? How does kernel size influence training?
  • What's the complexity of a matrix multiplication? Write the code.
  • What's the complexity of a Gaussian Elimination ?
  • Take the operation $$X^T A^T A X$$, with a matrix $$A$$ ($$M \times N$$) and a vector $$X$$ (length $$N$$). Is the output a vector or a scalar? Write the pseudocode to do this in one loop. What's the code to compute its derivative?
    • It is a scalar: $$X^T A^T A X = (AX)^T(AX) = \|AX\|^2$$.
    • Code to run in one loop (quadratic total work):

```python
import numpy as np

# A: M x N matrix, X: vector of length N
res = 0.0
for m in range(A.shape[0]):   # one explicit loop over the rows of A
    row_dot = A[m, :] @ X     # (AX)_m, the m-th entry of AX
    res += row_dot ** 2       # accumulate (AX)_m^2, so res = ||AX||^2
```

    • Derivative: $$\nabla_X \, X^T A^T A X = 2 A^T A X$$ (a vector), i.e. `grad = 2 * A.T @ (A @ X)`.
  • How do the attention matrix and the positional encoding of a Transformer work?
  • How to perform object classification in images?
    • A downsampling CNN (e.g. the encoder half of a U-Net) with a cross-entropy loss on the output vector of classes.
  • How to perform object detection in images?
    • Downsampling until the network outputs the 4 bounding-box coordinates and the classification vector (with an extra “no object” class).
  • How to do image segmentation?
    • A U-Net that outputs one channel per class, with the object’s pixels set to 1 and everything else to 0.
  • How to perform image denoising?
    • An autoencoder from the noisy to the clean image, trained with an MSE loss.
  • How to do Speech recognition, i.e. audio to text?
    • Input as a spectrogram, a network with 1D convolutions followed by a Transformer, and output as BPE tokens.
  • Text-image representations: how to have an image and a text of the image refer to the same feature vector?
    • Contrastive Language-Image Pre-training (CLIP), combining an image encoder (vision transformer) and a text encoder (GPT).
  • How to do text generation?
    • GPT
  • How to do image generation?
    • One method for image synthesis relies on inverting a diffusion process (Ho et al 2020). The principle consists of defining analytically a process that gradually degrades any sample, and consequently transforms the complex and unknown density of the data into a simple and well-known density such as a normal, and training a deep architecture to invert this degradation process.
  • What is Self-supervised learning?
    • Methods that take advantage of unlabeled datasets (e.g. GPT, Contrastive Predictive Coding (CPC), Simple Framework for Contrastive Learning of Representations (SimCLR)). The key principle of these methods is to define a task that does not require labels but necessitates feature representations which are useful for the real task of interest, for which a small labeled data set exists.
    • see A Cookbook of Self-Supervised Learning for details.
  • How to collect embeddings of sequences, text, words (Word2vec, skipgram, CBOW), point clusters or arrays, graphs, images?
  • In data parallelism, why do we synchronize (reduce) gradients between accelerators, and not the weights?
    • (??) Because the total gradient is the average of the per-sample gradients (differentiation is linear), gradients can be reduced (averaged) across accelerators exactly; weights are not such an average, so we can't reduce weights (??).
  • In transformers, why do we need the feed forward network?
    • (??) A simple feed-forward neural network is applied to every attention vector to transform the attention vectors into a form that is acceptable to the next encoder or decoder layer. (??)
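
For the single-loop variance question above, a minimal sketch (Welford's algorithm would be a numerically more stable alternative):

```python
def variance(xs):
    """Population variance via Var[X] = E[X^2] - E[X]^2, in a single pass."""
    s, sq, n = 0.0, 0.0, 0
    for x in xs:
        s += x          # running sum of x_i
        sq += x * x     # running sum of x_i^2
        n += 1
    mean = s / n
    return sq / n - mean * mean

print(variance([1.0, 2.0, 3.0, 4.0]))  # 1.25
```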
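
For the cross-entropy stability question above, a sketch of a numerically stable log-softmax cross-entropy on raw logits (shapes and values are illustrative):

```python
import numpy as np

def cross_entropy(logits, target):
    """Negative log-likelihood of the target class via a stable log-softmax."""
    z = logits - logits.max()                 # shift avoids overflow in exp
    log_softmax = z - np.log(np.exp(z).sum())
    return -log_softmax[target]               # -log p(target)

logits = np.array([1000.0, 0.0, -1000.0])     # a naive softmax would overflow
print(cross_entropy(logits, target=0))        # ~0.0, finite and correct
```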
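
For the two-sample question above, a sketch using the two-sample Kolmogorov–Smirnov test, which compares the empirical CDFs as described; `scipy` is assumed to be available, and the sample data is illustrative:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=1000)   # samples from N(0, 1)
b = rng.normal(0.1, 1.0, size=1000)   # samples from a slightly shifted N(0.1, 1)

stat, p_value = ks_2samp(a, b)        # stat = max |CDF_a(t) - CDF_b(t)|
print(stat, p_value)                  # a small p-value suggests different distributions
```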