Data Science | Blog | Harshit Kumar

False positive paradox

The false positive paradox: why a test with a low false positive rate can still produce more false positives than true positives for rare conditions.

Oct 12, 2018 · 2 min read

Data Science

Loss functions

A survey of common loss functions (MSE, cross-entropy, hinge loss), with background on entropy, KL divergence, and the MLE connection.

Aug 24, 2018 · 4 min read

Data Science

Optimizers

An overview of neural network optimizers: SGD, momentum, RMSProp, and Adam, and how they improve on basic gradient descent.

Aug 17, 2018 · 5 min read

Data Science

Methods of hyperparameter optimization

Comparing hyperparameter optimization strategies like grid search, random search, and Bayesian optimization with scikit-learn examples.

Aug 03, 2018 · 2 min read

Data Science

The Bayesian Thinking - III

Probabilistic programming with PyMC3, applying Bayesian linear regression using the Bayesian view of statistics.

Jun 22, 2018 · 4 min read

Data Science

The Bayesian Thinking - II

Comparing classical, frequentist, and Bayesian probability frameworks, and how Bayesian thinking updates beliefs with new evidence.

Jun 15, 2018 · 3 min read

Data Science

The Bayesian Thinking - I

An introduction to Bayes' theorem and conditional probability through a disease-testing example that challenges intuitive reasoning.

Jun 08, 2018 · 3 min read

Data Science

Dropout: Prevent overfitting

How dropout regularization prevents overfitting by randomly deactivating neurons during training, effectively ensembling many sub-networks.

May 04, 2018 · 1 min read

Data Science

How deep should neural nets be?

Practical guidance on choosing neural network depth and the sizes of the input, hidden, and output layers for different problem types.

Apr 27, 2018 · 2 min read

Data Science

Don't use sigmoid: Neural Nets

Why sigmoid activation functions should be avoided in deep neural networks, and what alternatives like ReLU offer instead.

Apr 20, 2018 · 2 min read

Data Science

Scaling vs Normalization

The difference between feature scaling (min-max) and normalization (standardization), and when to apply each in machine learning pipelines.

Mar 23, 2018 · 5 min read

Data Science

Ensembling is the key

An overview of ensemble learning methods: bagging, random forest, boosting, and stacking, and why combining models often outperforms any single algorithm.

Mar 16, 2018 · 2 min read

Data Science

Computational graphs: Backpropagation

Backpropagation explained via computational graphs: a local, chain-rule-based method for computing gradients efficiently in neural networks.

Mar 09, 2018 · 4 min read

Data Science

Gradient descent: The core of neural networks

How gradient descent optimizes neural network weights by following the direction of steepest descent on the loss function.

Mar 02, 2018 · 4 min read

Data Science

Gradient boosted trees: Better than random forest?

Comparing gradient boosted trees and random forests, their differences in training strategy, tuning requirements, and when to prefer each.

Feb 23, 2018 · 1 min read

Data Science

Linear algebra: The essence behind deep learning

How linear algebra underpins deep learning, from score functions and weight matrices to image classification with neural networks.

Feb 16, 2018 · 2 min read

Data Science

Data Mining: Knowledge discovery in databases

An overview of the KDD (Knowledge Discovery in Databases) process and how data mining, machine learning, and data science relate to each other.

Feb 09, 2018 · 1 min read

Data Science

Anscombe's Quartet

Anscombe's quartet illustrates why visualizing data matters: four datasets with nearly identical summary statistics but completely different distributions.

Feb 02, 2018 · 2 min read

Data Science

The Curse of Dimensionality

Why increasing the number of features degrades kNN performance: the curse of dimensionality explained intuitively and mathematically.

Jan 26, 2018 · 3 min read

Data Science

Dealing with categorical data

Techniques for encoding categorical variables in machine learning models, including dummy variables and one-hot encoding.

Jan 19, 2018 · 3 min read

Data Science

Regularization

How regularization techniques, L1 (Lasso) and L2 (Ridge), add penalty terms to the loss function to combat overfitting in linear models.

Jan 12, 2018 · 4 min read

Data Science

Evaluation metrics for classification and false positives

Guide to classification evaluation metrics: confusion matrix, precision, recall, specificity, F1, balanced accuracy, ROC-AUC, PR curves, handling imbalanced datasets, and when to choose each metric.

Dec 29, 2017 · 6 min read

Data Science

Simplicity doesn't imply accuracy

Examining Occam's razor in machine learning: why simpler models aren't always more accurate and how complexity relates to overfitting.

Dec 22, 2017 · 2 min read

Data Science

p-Value

Understanding p-values and statistical significance in the context of simple linear regression and hypothesis testing.

Dec 15, 2017 · 5 min read

Data Science

Out-liars

How to detect and handle outliers in data using the Interquartile Range (IQR) method and box plots.

Oct 27, 2017 · 1 min read

Data Science

Correlation is not causation

Why correlation between two variables does not imply causation, illustrated with classic examples of spurious correlations.

Oct 20, 2017 · 1 min read

Data Science

Overfitting and Underfitting

Explaining overfitting and underfitting in machine learning, and how the bias-variance tradeoff helps build better-generalizing models.

Oct 13, 2017 · 2 min read

Data Science

Data leakage: A big problem

Understanding data leakage: when training data inadvertently contains information about the target, causing unrealistically good but unreliable model performance.

Oct 06, 2017 · 2 min read

Data Science

Simpson's paradox

Simpson's paradox explained through UC Berkeley's 1973 admissions data: a trend that reverses when data is aggregated across groups.

Sep 01, 2017 · 6 min read

Data Science

Email spam filtering: Text analysis in R

Building and evaluating an email spam filter using text analytics and machine learning in R.

Aug 25, 2017 · 63 min read

Data Science

Friendship paradox: Facebook

Exploring the friendship paradox, the phenomenon where most people have fewer friends than their friends have on average, using Facebook data and Python.

Aug 18, 2017 · 4 min read

Data Science

Moneyball: Why no prediction can be made for the baseball champion

Using logistic regression in R to explore why ML cannot reliably predict the baseball World Series champion.

Aug 04, 2017 · 27 min read

Data Science

Moneyball: How linear regression changed baseball

How the Oakland A's used linear regression in R to identify undervalued players and compete despite a limited budget.

Jul 28, 2017 · 17 min read