Data Science
33 posts
-
False positive paradox
A false positive is an error in which a test result incorrectly indicates the presence of a condition that does not exist. False positives often play an important role in hypothesis...
-
Loss functions
In machine learning, the difference between the predicted output and the actual output is used to tune the parameters of the algorithm. This error in prediction, so-called...
-
Optimizers
When building neural networks, we compute gradients with the backpropagation algorithm. These gradients are used to perform the parameter updates. The default way of doing this is to update...
-
Methods of Hyperparameter optimization
The parameters that determine the performance of a machine learning algorithm (model), called hyperparameters, depend on the problem we are trying to solve. Thus, they need to be...
-
The Bayesian Thinking - III
Read the first part The Bayesian Thinking - I and the second part The Bayesian Thinking - II.
-
The Bayesian Thinking - II
Statistics is the study of uncertainty. One way to deal with uncertainty is through probabilities.
-
The Bayesian Thinking - I
A disease has affected 0.1% of the world’s population. The test for the disease correctly identifies 99% of people who have the disease and only incorrectly identifies 1%...
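The teaser above is cut off, so the following is only a minimal sketch of the arithmetic it points toward. It assumes the truncated "1%" is the false positive rate among people without the disease; the function name `posterior` and its parameter names are made up for illustration and do not come from the post.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Bayes' rule: probability of having the disease given a positive test."""
    true_pos = prior * sensitivity                   # P(disease and positive)
    false_pos = (1 - prior) * false_positive_rate    # P(no disease and positive)
    return true_pos / (true_pos + false_pos)

# 0.1% prevalence and 99% sensitivity are from the teaser.
# The 1% false positive rate is an ASSUMPTION about the truncated text.
p = posterior(prior=0.001, sensitivity=0.99, false_positive_rate=0.01)
print(f"P(disease | positive test) = {p:.3f}")
```

Under these assumptions the posterior comes out at roughly 9%, far lower than the 99% figure might suggest, because people without the disease vastly outnumber those with it.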
-
Dropout: Prevent overfitting
Dropout is a regularization technique that prevents neural networks from overfitting. Regularization methods like L2 and L1 reduce overfitting by modifying the cost function. Dropout, on the other...
-
How deep should neural nets be?
How do we decide on what architecture to use while solving a problem using neural networks? Should we use no hidden layers? One hidden layer? Two hidden layers?...
-
Don't use sigmoid: Neural Nets
In neural networks, activation functions are used to introduce non-linearity in the model. There are several activation functions to choose from. Traditionally, people have been using sigmoid as...
-
Scaling vs Normalization
Feature scaling (also known as data normalization) is the method used to standardize the range of features of data. Since the range of values of data may vary...
-
Ensembling is the key
Most of us have our favourite machine learning algorithms. For some, it may be state-of-the-art algorithms like Support Vector Machines, while for others it may be something simple...
-
Computational graphs: Backpropagation
Backpropagation is an efficient method of computing gradients in directed graphs of computations, such as neural networks. This is not a learning method, but rather a nice computational...
-
Gradient descent: The core of neural networks
As discussed in the post on linear algebra and deep learning, optimization is the third and last step in solving an image classification problem in deep learning. It helps...
-
Gradient boosted trees: Better than random forest?
Do gradient boosted trees generally perform better than random forests? Let’s find out. But first, what are these methods? Random forest and boosting are ensemble methods, proved to...
-
Linear algebra: The essence behind deep learning
Mathematics lies behind every algorithm; if not mathematics, then mathematical thinking. In the case of deep learning algorithms, linear algebra is the driving force.
-
Data Mining: Knowledge discovery in databases
Knowledge discovery in databases (KDD) is a 7-step process to search for hidden knowledge in data. Data mining refers to the analysis step in the KDD process....
-
Anscombe's Quartet
We often look for summary statistics during EDA (Exploratory Data Analysis). But sometimes these statistics may give us a wrong interpretation of the data. In 1973, the statistician Francis...
-
The Curse of Dimensionality
When applying the k-nearest neighbors (kNN) approach to a problem, we sometimes notice a deterioration in kNN performance when the number of predictors,...
-
Dealing with categorical data
We often encounter qualitative predictors in our machine learning models.
-
Regularization
Our machine learning models often encounter the problem of overfitting. Regularization is one of the techniques used to solve this problem.
-
Evaluation metrics for classification and False positives
A false positive error or false positive (false alarm) is a result that indicates a given condition exists when it doesn’t.
-
Simplicity doesn't imply accuracy
Often, people say things like "beauty lies in simplicity", "simplicity is the glory of expression", and "complexity is the enemy of execution". But to what extent are these statements...
-
p-Value
In data science, the p-value is used to determine the statistical significance of a result. It gives the probability, under a given statistical model, that when the null hypothesis is true,...
-
Out-liars
An outlier in our data can sometimes adversely affect our machine learning model. An outlier is any value that is distant from other observations in our data.
-
Correlation is not causation
We often calculate correlation during EDA (Exploratory data analysis) to check how strongly two variables are correlated to one another. It’s tempting to assume that one variable causes...
-
Overfitting and Underfitting
In machine learning, sometimes the prediction of our model may not be satisfactory. Although there may be many reasons for that, often it is due to either overfitting...
-
Data leakage: A big problem
Let’s say your machine learning model performs better than you expect on the test set. You are happy. Well, you should be. Now, you release...
-
Simpson's paradox
In 1973, the University of California, Berkeley was sued for gender bias against women who had applied to graduate schools. The data for fall 1973 showed that men...
-
Email spam filtering: Text analysis in R
Email spam, also known as junk email, is a type of electronic spam where unsolicited messages are sent by email.
-
Friendship paradox: Facebook
-
Moneyball: Why no prediction can be made for the baseball champion
-
Moneyball: How linear regression changed baseball
It’s unbelievable how much you don’t know about the game you’ve been playing all your life. — Mickey Mantle