Self Test Questions Data Science I

Answers to self test questions for the lecture “Data Science I” at KIT. If you spot any errors, write me an e-mail or Discord message.

Lecture 1: Introduction

  • Give examples of applications of clustering.
    • Customer groups clustered based on bought products
    • Unsupervised malware family identification
    • Outlier Detection
  • Describe a scenario from natural sciences, in which classification is useful: What are the attributes/class? How would you try to solve it?

    Flower family classification:

    • Attributes (features)
      • Color of different parts
      • Shape of different parts
      • Size of different parts
    • Solve it by training a multi-class NN with enough high-quality training data
  • Explain the principle of the One Rule classifier.

    For each attribute and each attribute value, select the most frequent class and predict it for this value. Select the attribute (and its rules) with the lowest error.
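
    A minimal Python sketch of One Rule (assuming a list-of-dicts dataset; all names are illustrative):

    from collections import Counter, defaultdict

    def one_rule(data, attributes, label_key):
        """Pick the attribute whose value -> majority-class rules err least."""
        best_attr, best_rules, best_errors = None, None, float("inf")
        for attr in attributes:
            counts = defaultdict(Counter)              # value -> class counts
            for row in data:
                counts[row[attr]][row[label_key]] += 1
            # One rule per value: predict the most frequent class for that value.
            rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
            errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                         for c in counts.values())
            if errors < best_errors:
                best_attr, best_rules, best_errors = attr, rules, errors
        return best_attr, best_rules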

  • What is the difference between supervised/unsupervised learning?

    Supervised learning has labels during training, unsupervised learning does not.

  • What is overfitting?

    Prioritizing accuracy at the cost of an overly complex decision function. This decision function does not generalize well to unseen data. Overfitting happens when we minimize the training error to the point where the validation error starts to rise.

Lecture 2: Fundamentals

  • How can we categorize data by type?
    • Numerical
      • Continuous
      • Discrete
    • Categorical
      • Ordinal
      • Nominal
  • Consider aggregation function X: is it distributive/algebraic/holistic/self-maintainable?

    Examples for each are:

    • Distributive: min, sum, count
    • Algebraic: mean, midrange
    • Holistic: median
    • Self-maintainable (w.r.t. delete, insert): sum, count, mean
    • Not self-maintainable (w.r.t. delete, insert): min, median
  • Do you see a relationship between distributive/algebraic/holistic and self-maintainable? (there is one)

    Self-maintainable β‡’ Algebraic

  • How high is the entropy when all the classes are equally likely? Write down a general formula for this case.

    H(S) = -\sum_{i=1}^N p(x_i)\log(p(x_i)) = -\sum_{i=1}^N \frac{1}{N}\log\left(\frac{1}{N}\right) = -N\cdot\frac{1}{N}\log\left(\frac{1}{N}\right) = \log(N)

  • What tools do you know to quantify the strength of the relationship between two random variables?
    • Covariance: \mathrm{Cov}(X,Y) = \mathbb{E}((X-\mathbb{E}(X))(Y-\mathbb{E}(Y)))
    • Pearson’s correlation coefficient: \rho(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathbb{V}(X)\mathbb{V}(Y)}}
  • Which statistical tests do you know? What can they do? (Differences between each of them)
    • Pearson’s Chi-Squared: for categorical data; tests whether two random variables are not independent, with significance \alpha
    • Kolmogorov-Smirnov (continuous, or discrete with many values)
      • Tests if two distributions are different, with significance \alpha
      • Tests if two samples are drawn from the same distribution, with significance \alpha
    • Mann-Whitney U test: a test for centrality; it yields a likelihood that two samples (numerical or ordinal data) come from distributions with the same central tendency.
  • What is dimensionality reduction? Explain how PCA works!

    Dimensionality reduction is the task of finding a lower dimensional feature space for the input data without losing too much information.

    PCA:

    1. Normalize the input data: \tilde{A}_{ij} = \frac{x_{ij}-\mu}{\sigma}
    2. Do an orthogonal eigendecomposition of the covariance matrix \frac{1}{N}\tilde{A}^T \tilde{A} = \mathrm{Cov} = V D V^T (always exists, because the covariance matrix is real and symmetric)
    3. Sort the eigenvectors by the size of their eigenvalues
    4. Use the first t eigenvectors for a change of basis into a lower dimensional space (see the numpy sketch below)
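
    A minimal numpy sketch of these four steps (function and variable names are illustrative):

    import numpy as np

    def pca(X, t):
        # 1. Standardize each feature.
        A = (X - X.mean(axis=0)) / X.std(axis=0)
        # 2. Eigendecomposition of the covariance matrix (real symmetric -> eigh).
        eigvals, eigvecs = np.linalg.eigh(A.T @ A / len(A))
        # 3. Sort eigenvectors by decreasing eigenvalue.
        order = np.argsort(eigvals)[::-1]
        # 4. Project onto the first t eigenvectors (change of basis).
        return A @ eigvecs[:, order[:t]]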
  • How can we discretize data? How can we find the best merge/split?

    We use binning with n bins. Our goal is to minimize the information/variance inside a bin. The two presented methods are entropy-based discretization (top-down, with splits) and ChiMerge (bottom-up, with merges).

Lecture 3: Indexes

  • What is an index? Why can't we answer spatial queries by simply constructing one B tree per dimension?

    An index is a redundant representation of the data in a data structure that allows certain operations to be performed efficiently, especially without scanning the whole database.

    B trees only index a single scalar value, spatial queries require indexing on multiple dimensions.

  • Explain what is the R tree and its elements?

    An R tree is a balanced tree containing possibly overlapping rectangles. Each node has a lower and an upper bound on the number of entries (data points or child rectangles).

  • How do we search for the nearest neighbor with the R tree? Which rectangles/nodes are inspected?

    For a query point, we can efficiently retrieve the containing rectangle (if any) as well as the adjacent rectangles and then do standard NN search with a priority queue.

  • How can we deal with volumetric objects?

    R-trees can handle volumetric objects by placing bounding boxes around objects. We need to be careful though, because an overlap of the bounding box does not imply an overlap of the objects.

  • Is there any problem with the overlaps in R trees?

    For data points (not volumetric objects) we want to reduce the amount of overlap, because it causes a linear component in the runtime instead of a logarithmic one.

  • What are the differences between B trees, k-d trees and R trees?
    |                | R tree | k-d tree | B tree |
    |----------------|--------|----------|--------|
    | balanced       | true   | false    | true   |
    | dimensionality | β‰₯ 2    | β‰₯ 2      | 1      |
    | full coverage  | false  | true     | true   |
    | overlap        | true   | false    | false  |
  • How can we insert a new object in the R tree? (procedure)

    We traverse the tree along its index to find the rectangle closest to the data point, and insert into the rectangle that requires the least enlargement. If a rectangle reaches maximum capacity, we split it.

  • What queries can be handled by spatial indexes?

    k-NN, range, match, ranking

  • What do we understand under β€œLazy Learning”?

    Non-parametric methods that predict based on the stored dataset without a classical training step.

  • What are pros and cons of the k-NN classifier?
    • No training required (pro)
    • Slow queries (con)
    • Suffers from the curse of dimensionality (con)
    • Sensitive to noise dimensions (con)
    • Interpretable (pro)
    • Only one hyperparameter (pro)
    • Can be easily used as an active/online learner (pro)

Lecture 4: Classification

  • Why do most classifiers have a training and a prediction phase?

    Parametric classifiers need to calculate likely parameters from the training data before using them to make new predictions.

  • Compare classifier X with classifier Y.

    Comparison can be made along the following criteria:

    • Performance (Accuracy and similar)
    • Interpretability
    • Robustness (noise, outliers, concept drift, AEs, ...)
    • Efficiency (time & memory complexity/energy)
    • Number of hyperparameters
    • Applicability to high dimensional data, data streams, temporal data, ...
  • How can we learn the weights in a linear model?

    If this question refers to linear regression, we can compute the closed-form solution

    w_1 = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)} \quad w_0 = \bar{y} - w_1 \cdot \bar{x}

    This can be derived by minimizing the summed squared error (equivalently, by maximum likelihood under Gaussian noise). In the multivariate case (normal equations w = (X^T X)^{-1} X^T y), the order of the matrix products matters.

    If the question refers to logistic regression (linear classification) we use gradient descent in training.
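
    A quick numpy check of the univariate closed form on made-up toy data:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.1, 3.9, 6.2, 8.1])

    w1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)  # Cov(x, y) / Var(x)
    w0 = y.mean() - w1 * x.mean()
    print(w0, w1)  # intercept and slope of the least-squares line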

  • Explain the k-NN classifier. Can we also use it for regression?

    For an unseen data point, we compute the k-neighbourhood (the k nearest points) and predict the most frequent class in it. For regression, we can instead predict the (possibly distance-weighted) mean of the neighbours’ target values.
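
    A minimal numpy sketch covering both variants (names are illustrative):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x, k=5, regression=False):
        # Distances from the query point to every training point.
        dists = np.linalg.norm(X_train - x, axis=1)
        neighbors = y_train[np.argsort(dists)[:k]]
        if regression:
            return neighbors.mean()          # mean target value of the k neighbors
        return Counter(neighbors.tolist()).most_common(1)[0][0]  # majority class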

  • How do we build a decision tree? (elements, procedure)

    We use the ID3 algorithm, which works top-down and recursively, roughly following this (Python-flavored) pseudocode:

    from collections import Counter

    def id3(S, target, attributes):
        labels = [e[target] for e in S]
        if len(set(labels)) == 1:                  # all examples share one label
            return Leaf(labels[0], S)
        if not attributes:                         # no attribute left: majority vote
            return Leaf(Counter(labels).most_common(1)[0][0], S)
        best = max(attributes, key=lambda a: splitting_criterion(S, a, target))
        node = Node(best)
        for value in domain(best):                 # every possible value of best
            Sv = [e for e in S if e[best] == value]
            if not Sv:                             # empty split: majority leaf
                node.add(value, Leaf(Counter(labels).most_common(1)[0][0], S))
            else:                                  # recurse without the used attribute
                node.add(value, id3(Sv, target, [a for a in attributes if a != best]))
        return node
        
  • How can we decide which attributes to split and where to split?

    We use the split with the best value of a splitting criterion. Possible splitting criteria are:

    Maximum Information Gain:

    IG(T) = H(T) - \sum_{i=1}^K \frac{|T_i|}{|T|} H(T_i)

    Minimum Gini Index:

    \operatorname{Gini}_A(T) = \sum_{i=1}^K \frac{|T_i|}{|T|}\left(1 - \sum_{c=1}^C p_c(T_i)^2\right)

    Minimum Misclassification Error:

    \sum_{i=1}^K \frac{|T_i|}{|T|}\left(1 - \max_{c \in [1,C]} p_c(T_i)\right)
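
    Hedged numpy sketches of these criteria for integer-encoded labels (helper names are mine):

    import numpy as np

    def entropy(y):
        p = np.bincount(y) / len(y)
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    def gini(y):
        p = np.bincount(y) / len(y)
        return 1 - (p ** 2).sum()

    def information_gain(y, splits):
        """y: labels of T; splits: list of label arrays, one per child T_i."""
        return entropy(y) - sum(len(s) / len(y) * entropy(s) for s in splits)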

  • Explain different ways to avoid overfitting when building a tree.
    • Prepruning
      • Minimum confidence: Stop when the points in a leaf are similar enough (misclassification error)
      • Minimum support: Stop when minimum number of objects in leaf are reached.
    • Postpruning
      • Reduced-Error Pruning: repeatedly remove the subtree whose removal most reduces the error on a hold-out set, until no removal improves it.
      • Minimal Cost Complexity Pruning: trade off classification error against tree complexity on the training set: \operatorname{Cost-Complexity}(E, \alpha) = \alpha \cdot \mathrm{Error}(E) + (1-\alpha)\cdot|E|
  • How can we build or use a tree while considering that different mistakes may have different costs?

    With the conditional risk R(c_i \mid x) = \sum_j p(c_j \mid x) \cdot L(c_i \mid c_j), we predict the class with minimal risk.

  • Explain the difference between a Bayes Classifier and a Naive Bayes Classifier? What changes in the decision rule?

    Given an object o, we want to predict the most probable label c \in C:

    K(o) = \argmax_{c_j\in C}\, p(c_j \mid o)

    We can use Bayes’ rule and notice that the result is not changed by the division by p(o), to obtain the decision rule of the Bayes classifier:

    K(o) = \argmax_{c_j\in C} \frac{p(o \mid c_j)\cdot p(c_j)}{p(o)} = \argmax_{c_j\in C}\, p(o \mid c_j)\cdot p(c_j)

    The hard part is computing p(o \mid c_j).

    The Naive Bayes classifier additionally assumes that o is a vector whose dimensions are conditionally independent given the class, for all c_j \in C. This lets us factorize

    p(\{o_1, \ldots, o_d\} \mid c_j) = \prod_{i=1}^{d} p(o_i \mid c_j)

    which we can plug into the decision rule, estimating each factor, e.g., with a univariate Gaussian (as in the Gaussian classifier).
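
    A small sketch of a Gaussian Naive Bayes classifier built on this factorization (assuming numeric features; not necessarily the lecture’s exact formulation):

    import numpy as np

    class GaussianNB:
        def fit(self, X, y):
            self.classes = np.unique(y)
            self.prior = np.array([np.mean(y == c) for c in self.classes])
            self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
            self.var = np.array([X[y == c].var(axis=0) for c in self.classes])
            return self

        def predict(self, X):
            preds = []
            for x in X:
                # log p(c_j) + sum_i log N(x_i | mu, var): naive independence
                ll = np.log(self.prior) - 0.5 * np.sum(
                    np.log(2 * np.pi * self.var) + (x - self.mu) ** 2 / self.var,
                    axis=1)
                preds.append(self.classes[np.argmax(ll)])
            return np.array(preds)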

  • What do we understand under the β€œNo Free Lunch” theorem?

    Without knowledge of the problem, there are no context-independent or usage-independent reasons to favor one learning method over another, i.e., there are always situations where one classifier is better than some others.

  • Can we combine different classifiers? How?

    Yes, we can form ensembles of classifiers by:

    • Bagging: voting of equally weighted classifiers, each trained on a bootstrap sample of the data
    • Boosting: train each new classifier with emphasis on the examples the previous classifiers misclassified; the classifiers are weighted in the final decision
    • Stacking: train a meta-classifier that learns how to combine the outputs of the base classifiers

Lecture 5: Evaluation

  • What are different meanings for the β€œquality” of a learning algorithm?
    • Performance (Accuracy and similar)
    • Compactness of the model
      • Decision tree size; number of decision rules; number of parameters
    • Interpretability of the model
      • Insights and understanding of the model – In general, hard to quantify
    • Efficiency
      • Time to train and to apply the model
    • Scalability for large databases
      • Memory and runtime (big O notation)
      • Considering disk-access, out-of-core computation
    • Robustness
      • How results are affected by noise, outliers, missing values
  • What is 10-fold cross validation? When do we need it?

    With the standard hold-out method, the accuracy can vary significantly across different splits. We can average out that variance with k-fold cross validation. For 10-fold cross validation, we partition the data into 10 partitions of (almost) equal size. For each partition, we use the other partitions as training data and that partition as test data. We then average the performance.

  • How do we define the overall accuracy of a classifier?

    G_{Te}(K) = \frac{|\{o \in Te \mid K(o) = C(o)\}|}{|Te|}

  • What are the different error types of a classifier?

    False Positives: classified as positive, although the object belongs to the negative class

    False Negatives: classified as negative, although the object belongs to the positive class

  • Explain Cohen’s Kappa coefficient.

    Cohen’s Kappa measures how much better a classifier is than random guessing.

    Let p_0 = G_{Te}(K) be the accuracy of the classifier K, and let p_e = \sum_{c_j\in C} \frac{|\{o\in Te \mid K(o)=c_j\}|}{N} \cdot \frac{|\{o\in Te \mid C(o)=c_j\}|}{N} be the agreement between K and the real class frequencies expected from guessing.

    Cohen’s Kappa is the difference, normalized by the distance of chance agreement to perfect accuracy:

    \kappa = \frac{p_0 - p_e}{1 - p_e}

    A perfect classifier has \kappa = 1, random guessing has \kappa = 0, and a classifier that performs worse than chance has \kappa < 0.
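
    The formula translated into a few lines of numpy (an illustrative helper; works for binary or multi-class labels):

    import numpy as np

    def cohens_kappa(y_true, y_pred):
        classes = np.unique(np.concatenate([y_true, y_pred]))
        p0 = np.mean(y_true == y_pred)                        # observed accuracy
        pe = sum(np.mean(y_pred == c) * np.mean(y_true == c)  # chance agreement
                 for c in classes)
        return (p0 - pe) / (1 - pe)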

  • Explain the F-score.

    Accuracy can be misleading for test sets with imbalanced class sizes. Most commonly, we use the F_1 score, the harmonic mean of precision and recall:

    F_1 = 2\cdot\frac{Pr \cdot Re}{Pr + Re}

    If we want to weight one of the two more or less, we can use the F_\beta score. For example, if we want to be sure that all true positives are found and don’t mind having a few false positives, we can overweight recall (\beta > 1):

    F_\beta = (1+\beta^2)\frac{Pr \cdot Re}{\beta^2 \cdot Pr + Re}

  • Explain the bias-variance tradeoff (in your own words).

    The error of a classifier can be decomposed into variance (the error we make because we only have a limited data set), squared bias (the error we make because the function we are approximating is not contained in the class of functions we are fitting), and inherent noise.

  • What advantage do you see in using the informational loss instead of the quadratic loss?

    It penalizes confidently wrong predictions more strongly: the informational (log) loss grows without bound as the predicted probability of the true class approaches zero.

  • What ways do you know to measure the quality of a classifier? Which one would you prefer in which situation?
    • Is the data set highly imbalanced?
      • Then standard β€œaccuracy” is probably not meaningful
    • Do you care about one class (positives) in particular?
      • Then F-scores are useful
      • By setting Ξ², you can control the trade-off between recall/precision
    • Do you need to penalize β€œgross” errors more heavily?
      • Then maybe informational loss
    • Some measures are controversial
      • But still useful, e.g., Cohen’s Kappa
      • MCC tends to be preferred
  • What is a lift chart? How different is it from the ROC curve?

    A lift chart is a 2D plot with the sample size, ordered by decreasing predicted probability (threshold), on the x-axis and the number of true positives (gain/loss) on the y-axis. The ROC curve instead plots the true-positive rate against the false-positive rate.

  • How can we evaluate the result of a regression model?

    With the R^2 measure, which is the mean squared error normalized by the inherent variance of the data. It measures the fraction of the variance in the data that is explained by the model (1 minus the unexplained fraction).

    R^2 = 1-\frac{\sum_i (f(x_i) - y_i)^2}{\sum_i (y_i - \bar{y})^2} \in (-\infty, 1]

Lecture 6: Association Rules

  • What are association rules? How do we find them?

    An association rule A \Rightarrow_p B means that if A occurs, B also occurs with probability p. We find them by mining frequent item sets (e.g., with Apriori or FP-growth) and deriving rules that meet a minimum support and confidence.

  • How can we describe association rules?

    A \Rightarrow B\ [\text{Support}, \text{Confidence}]

  • What are maximal/closed itemsets?

    A maximal item set is an item set that is frequent and not a proper subset of another frequent item set.

    A closed item set is an item set that is not a proper subset of another item set with the same support.

  • Give an exemplary association rule with high/low support/confidence.

    Let’s stick with the supermarket example (all pairs are illustrative).

    Low support, low confidence: {Kids Toys} β‡’ {Vodka}: practically never bought together.

    Low support, high confidence: {Apples, Steak} β‡’ {Eggs}: few people buy apples and steak together in the first place, but those who do usually also buy eggs.

    High support, low confidence: {Bread} β‡’ {Milk}: bread appears in most baskets, so the combination is still frequent, but well under half of the bread buyers also buy milk.

    High support, high confidence: {Bread} β‡’ {Butter}: frequently bought together, and bread buyers usually buy butter too.

  • Why can support/confidence for association rules be misleading?

    Support and confidence are not the same as correlation. Something can be negatively correlated and still have high support and confidence. Example from the lecture:

    Consider that among 5000 students, 3000 play basketball (60%), 3750 eat cereal (75%), and 2000 both play basketball and eat cereal (40%).

    Play basketball β‡’ eat cereal [40%, 66.7%] vs. Play basketball β‡’ not eat cereal [20%, 33.3%]

  • What are multi-dimensional association rules?

    Multi-dimensional association rules are association rules applied to categorical relational data. Instead of looking at transactions of a single column, we now consider multiple columns. We can easily translate that into a standard association rule problem by creating, for each row i, a transaction that looks like this: {col1Name/val[i, 1], ..., colNName/val[i, N]}

  • What are multi-level association rules and how to find them?

    Normal association rules have a fixed level of abstraction, e.g., one category for milk, although a supermarket carries multiple subtypes. Multi-level association rules allow capturing effects across levels. The multi-level case can be transformed into a standard association rule problem by looking at the directed tree of types and introducing a category for every path that starts at the root and ends at some node of the tree. But we need to be aware of the exponential increase in categories with each full new layer of the tree.

  • Do you see a relation between association rules and decision trees?

    For each consequent of an association rule, we can build a decision tree that predicts how likely the consequent is. That gives us a conditional decision instead of just co-occurrence in a set. But we need to build one decision tree per product instead of mining all rules from the transactions at once.

  • In which situations is Apriori expensive? What can we do about it?

    The join step is expensive for large datasets and gets more and more expensive if |L_i| does not fall quickly, which is the case when the minimum support is low. We should therefore choose a larger minimum support for large datasets.

    Also, support counting requires a full database scan. We can use a hash tree to speed up the lookups for counting, as our transactions are sorted.

    Sampling with Toivonen’s algorithm improves both steps by approximating the negative border.

  • Explain every step of Apriori. What is the negative border?

    Initialize L_1 with the frequent one-item sets.

    1. Given the sorted frequent item sets L_k, join them to get the candidates C_{k+1}. Efficient because the item sets are sorted.
    2. Get C_{k+1}^{\text{pruned}} by going through C_{k+1} and only keeping candidates where every k-subset is in L_k. Efficient because the item sets are sorted.
    3. Get L_{k+1} by counting the support of each remaining candidate and deleting the ones with less than minimum support.

    Stop when C_k is empty.

    The negative border is the set of item sets that are not frequent themselves, but whose proper subsets are all frequent. Apriori inspects exactly this negative border (see the sketch below).
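
    A compact (and deliberately naive: it rescans the database for every support count) Python sketch of the join/prune/count loop:

    from itertools import combinations

    def apriori(transactions, min_support):
        transactions = [frozenset(t) for t in transactions]
        support = lambda s: sum(s <= t for t in transactions) / len(transactions)
        items = {i for t in transactions for i in t}
        L = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]
        k = 1
        while L[-1]:
            # Join: unite frequent k-item sets that differ in exactly one item.
            C = {a | b for a in L[-1] for b in L[-1] if len(a | b) == k + 1}
            # Prune: keep candidates whose k-subsets are all frequent.
            C = {c for c in C
                 if all(frozenset(s) in L[-1] for s in combinations(c, k))}
            # Count: keep candidates with at least minimum support.
            L.append({c for c in C if support(c) >= min_support})
            k += 1
        return [s for level in L for s in level]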

  • What do we understand under the concept of anti-monotonicity?

    Any superset of a non-frequent item set is non-frequent, too.

  • What are FP-trees, and how do we construct them?

    Frequent pattern trees contain all frequent patterns with their support and are not larger than the original database.

    1. We construct them by sorting every transaction (by descending item frequency).
    2. We insert every sorted transaction as a path starting from the root; shared prefixes are merged, and each node counts how many transactions pass through it. For efficiency, we keep a header table that, for each item, points to its occurrences in the FP-tree.
    3. From the prefixes reachable via the header table, we construct the conditional pattern base for every frequent item. For each pattern base, we can repeat the process.
  • Why is FP-growth so fast (compared to Apriori)?

    • No candidate generation
    • Compact data structure
    • Only 2 database scans

Lecture 7: Clustering

  • Which (types of) clustering algorithms do you know?
    • Partitioning Clustering
      • k-Means and variants
    • Probabilistic Clustering
      • Expectation Maximization
    • Density based Clustering
      • DBSCAN
    • Hierarchical Clustering
      • Agglomerative/Divisive
      • OPTICS (density-based and hierarchical)
  • Explain clustering algorithm k-Means!
    1. Pick k random centers
    2. Assign each point to the closest center
    3. Compute new centers as the mean of all points in the current cluster
    4. GOTO 2 until nothing changes anymore (a numpy sketch follows below)
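
    A minimal numpy sketch of exactly these steps:

    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]   # 1. random centers
        for _ in range(iters):
            # 2. assign each point to the closest center
            labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2),
                               axis=1)
            # 3. recompute each center as the mean of its cluster
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
            if np.allclose(new, centers):                        # 4. converged
                break
            centers = new
        return labels, centers
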
  • Explain clustering algorithm k-Medoid (PAM Algorithm)!

    We want to minimize the total distance TD = \sum_{i=1}^{k} \sum_{\vec{x}_j \in C_i} \operatorname{dist}(\vec{x}_j, m_i)

    1. Pick k random objects (medoids)
    2. Assign each object to the closest medoid
    3. For each cluster, compute TD. For each pair of medoid and non-medoid, compute how TD would change if we swapped the two. Perform the swap that minimizes TD. Repeat this step until nothing changes.
  • Explain clustering algorithm k-Mode!

    Like k-Means, but for categorical data: use the mode of a cluster instead of the mean, and the Hamming distance to compare points.

  • Explain clustering algorithm DBSCAN!

    Choose hyperparameters \varepsilon and MinPts.

    1. Calculate the core objects: an object o is core if and only if |N_\varepsilon(o)| \geq \mathrm{MinPts}
    2. Density-connected objects form the density-based clusters (a minimal sketch follows below)
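
    A minimal (O(n^2), index-free) Python sketch of DBSCAN under these definitions:

    import numpy as np

    def dbscan(X, eps, min_pts):
        n = len(X)
        dist = np.linalg.norm(X[:, None] - X[None], axis=2)
        neighborhood = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
        core = [len(nb) >= min_pts for nb in neighborhood]
        labels = np.full(n, -1)                    # -1 = noise
        cluster = 0
        for i in range(n):
            if labels[i] != -1 or not core[i]:
                continue
            labels[i] = cluster                    # start a new cluster at core i
            stack = list(neighborhood[i])
            while stack:                           # expand density-connectivity
                j = stack.pop()
                if labels[j] == -1:
                    labels[j] = cluster
                    if core[j]:                    # only core objects expand further
                        stack.extend(neighborhood[j])
            cluster += 1
        return labels
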
  • Explain clustering algorithm OPTICS!
    1. Visit each point; remember its core distance and reachability distance
    2. Move from one point to another by the β€œshortest jump”
    3. Maintain a priority list (sorted by increasing reachability distance to already processed points)

    Output: an ordered list of points with the core distance and reachability distance of each point.

    This can be interpreted as a dendrogram

  • What are advantages of partitioning clustering compared to density-based clustering?

    Partitioning clustering typically has fewer hyperparameters: k-Means needs only k, whereas DBSCAN needs both \varepsilon and MinPts.

  • What is the complexity of k-Medoid? How can we improve?

    \mathcal{O}(t \cdot k \cdot (n-k)^2), where t is the number of iterations. We can improve by evaluating only a random sample of the possible swaps, as in CLARANS.

  • Explain the Silhouette coefficient: Write down the formula!

    The silhouette coefficient captures the goals of clustering: Similarity within the clusters, dissimilarity between clusters, in a distance based way.

    a(o) = \frac{1}{|C(o)|-1} \sum_{p \in C(o), p \neq o} \operatorname{dist}(o, p)

    b(o) = \min_{C_i \neq C(o)} \frac{1}{|C_i|} \sum_{p \in C_i} \operatorname{dist}(o, p)

    s(o) = \begin{cases} 0 & \text{if } |C(o)| = 1 \\ \frac{b(o)-a(o)}{\max\{a(o), b(o)\}} & \text{otherwise} \end{cases}

    \operatorname{silh}(\mathcal{C}) = \frac{1}{n} \sum_{C_i \in \mathcal{C}} \sum_{o \in C_i} s(o) \in [-1, 1]
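
    These formulas, translated directly into numpy (assumes at least two clusters):

    import numpy as np

    def silhouette(X, labels):
        s = np.zeros(len(X))
        for i, x in enumerate(X):
            own = labels == labels[i]
            if own.sum() == 1:                     # singleton cluster: s(o) = 0
                continue
            d = np.linalg.norm(X - x, axis=1)
            a = d[own].sum() / (own.sum() - 1)     # mean distance inside own cluster
            b = min(d[labels == c].mean()          # closest foreign cluster
                    for c in set(labels.tolist()) - {labels[i]})
            s[i] = (b - a) / max(a, b)
        return s.mean()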

  • What can we do if we do not know parameter k upfront?

    We can look at a dendrogram, e.g., generated with OPTICS.

    Or we can try different values of k and plot a quality measure like the silhouette coefficient. Typically, we see a drop somewhere in the plot; this point is called the elbow, and the k at the elbow is preferred.

  • Explain the Expectation-Maximization procedure!

    Do some reasonable initialization of distribution parameters, e.g., with k-means

    Expectation: compute the cluster probabilities, aka responsibilities, for each sample (Bayes’ rule):

    q_{ik} = p(k_i = k \mid x_i) = \frac{p(x_i \mid k)\, p(k)}{\sum_j p(x_i \mid j)\, p(j)} = \frac{\pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}

    Maximization: compute the (weighted) maximum likelihood estimates:

    \pi_k = \frac{\sum_i q_{ik}}{N} \quad \mu_k = \frac{\sum_i q_{ik}\, x_i}{\sum_i q_{ik}} \quad \Sigma_k = \frac{\sum_i q_{ik} (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_i q_{ik}}
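
    A short EM sketch for a Gaussian mixture along these formulas (using scipy’s density and a crude random initialization instead of k-means):

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, k, iters=50, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        mu = X[rng.choice(n, size=k, replace=False)]   # crude initialization
        sigma = np.array([np.cov(X.T) for _ in range(k)])
        pi = np.full(k, 1 / k)
        for _ in range(iters):
            # E step: responsibilities q_ik via Bayes' rule
            dens = np.column_stack([pi[j] * multivariate_normal.pdf(X, mu[j], sigma[j])
                                    for j in range(k)])
            q = dens / dens.sum(axis=1, keepdims=True)
            # M step: weighted maximum likelihood estimates
            Nk = q.sum(axis=0)
            pi = Nk / n
            mu = (q.T @ X) / Nk[:, None]
            for j in range(k):
                diff = X - mu[j]
                sigma[j] = (q[:, j, None] * diff).T @ diff / Nk[j]
        return pi, mu, sigma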

  • What is the difference between a core/border/noise object?

    Core objects: objects with |N_\varepsilon(o)| \geq \mathrm{MinPts}

    Border objects: Objects that are density-connected to a core object, but not core itself

    Noise: Other points

  • How can we work around the sensitivity of DBSCAN’s parameters?

    We can fix MinPts, typically 2 \cdot d, and try different values of \varepsilon, plotting a quality measure like the silhouette coefficient. Typically, we see a drop somewhere in the plot; this point is called the elbow, and the \varepsilon at the elbow is preferred.

  • What is a dendrogram?

    A tree where the levels correspond to increasing values of a hyperparameter of a clustering algorithm, the root represents the whole data set, and each leaf represents a single object in the data set. Each node is the union of its children.

  • How to measure the distance between two clusters or sets of points?

    Minimum distance (single link): \operatorname{dist}_{sl}(X, Y) = \min_{x \in X, y \in Y} \operatorname{dist}(x, y)

    Maximum distance (complete link): \operatorname{dist}_{cl}(X, Y) = \max_{x \in X, y \in Y} \operatorname{dist}(x, y)

    Average distance (average link): \operatorname{dist}_{al}(X, Y) = \frac{1}{|X| \cdot |Y|} \sum_{x \in X, y \in Y} \operatorname{dist}(x, y)

  • What are the advantages of hierarchical clustering?

    We can capture nested clusters (clusters within clusters).

Further Questions

  • What are the different clustering paradigms, and which algorithms have we learned for each of them?
    • Partitioning
      • K-{Means, Mode, Median, Medoid}
    • Probabilistic
      • EM
    • Density based
      • DBSCAN, OPTICS
    • Hierarchical
      • OPTICS, BIRCH, DIANA

Lecture 8: Outlier Detection

  • What is an Outlier according to Hawkins?
    An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism. ~ Hawkins, 1980
  • Why should Outlier Detection be considered unsupervised?

    Because we do not have ground truth for outliers.

  • Do you see a relationship between Clustering and Outlier Detection?

    For outlier detection, we effectively want an unsupervised partition into normal data and outliers.

    We may also find outliers by clustering and looking for small clusters or noise

  • Describe an exemplary Outlier Detection task (not from the lecture).

    An antivirus company that wants to discover new malware families.

  • Describe the limitations of statistical methods.
    • Multivariate density functions are difficult to estimate
      • Statistical methods are themselves sensitive to outliers
    • Most tests are for single attributes only
    • In many cases, data distribution may not be known
  • Is it preferable to have an outlier ranking, or rather a yes/no decision?

    An outlier ranking is preferable, because it carries more information and can be converted into a binary decision with just a threshold value.

  • Explain LOF! (Top students can write down the formulas)

    The LOF of o is the average ratio between the local reachability density of o’s k nearest neighbors and that of o itself:

    LOF_k(o) = \frac{\sum_{p \in N_k(o)} \frac{\operatorname{lrd}_k(p)}{\operatorname{lrd}_k(o)}}{|N_k(o)|} = \frac{\sum_{p \in N_k(o)} \operatorname{lrd}_k(p)}{|N_k(o)| \cdot \operatorname{lrd}_k(o)}
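
    A dense numpy sketch of this (assuming each point has exactly k neighbors, i.e., no distance ties):

    import numpy as np

    def lof(X, k):
        n = len(X)
        dist = np.linalg.norm(X[:, None] - X[None], axis=2)
        np.fill_diagonal(dist, np.inf)
        knn = np.argsort(dist, axis=1)[:, :k]            # k nearest neighbors
        kdist = dist[np.arange(n), knn[:, -1]]           # k-distance of each point
        # reach-dist_k(o, p) = max(k-distance(p), dist(o, p))
        reach = np.maximum(kdist[knn], dist[np.arange(n)[:, None], knn])
        lrd = k / reach.sum(axis=1)                      # local reachability density
        return lrd[knn].mean(axis=1) / lrd               # LOF_k per point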

  • What types of outliers do you know?
    • Global outliers
      • Deviate from the majority of data
    • Local/Contextual outliers
      • β€œThe temperature is 28Β°C today, is that exceptional?”
    • Collective outliers
      • Abnormal β€œburst” of observations
  • How can one use Neural Networks to detect outliers?

    We can use:

    • Autoencoders
    • Restricted Boltzmann Machines
    • Self-Organizing Maps
  • Why is it harder to find outliers in high-dimensional spaces?

    Because distances lose their meaning in high-dimensional spaces (curse of dimensionality): the distances to the nearest and the farthest neighbor become almost indistinguishable.

Lecture 9: Data Science in Practice

  • What do you consider big data? What are the "5V"s?

    The internationally recognized definition of big data is: a dataset so large that it can’t be dealt with in Excel!

    • Volume: amount of data
    • Variety: type and nature of data
    • Velocity: speed of new data arriving
    • Veracity: truthfulness
    • Value: worth in information
  • Explain the main differences in data science from theory to practice.

    In Theory:

    • Data attributes are immutable
    • All data collected from a single source
    • Classification: Labels are given
    • Streaming: Time is equidistant
    • Outlier Detection: Outliers are rare and very recognizable

    In Practice

    • Data attributes can change rapidly due to new business requirements
    • Data is distributed among many systems
    • Classification: Labels are subject to interpretation
    • Streaming: Everything is event based
    • Outlier Detection: Outliers blend easily into the mass
  • What are the responsibilities of "Data Engineers", "Data Analysts", and "Data Scientists", respectively?

    Data Engineers:

    • Data Collection
    • Data Cleaning

    Data Analyst (additionally):

    • Coming up with a problem statement
    • Explorative Analysis

    Data Scientists (additionally):

    • Model Building
    • Model Deployment
  • Which frameworks exist to standardize model development in businesses? What are their peculiarities?

    This question likely refers to CRISP-DM, a cyclic process model with the phases business understanding, data understanding, data preparation, modeling, evaluation, and deployment.

  • What do you consider the benefit of Agile Software Development?

    It is fast in responding to changing demands from the customer.

  • Which model would you try first when working on a new project? Which one would you try second? Why?
    1. Decision Tree
    2. Random Forest

    But remember: there is no free lunch.