Analysis of 2012 Presidential Election Polls

Here, my goal is to predict the 2012 US Presidential election results based on multiple polls. I used online data for the polls and for the Electoral College votes, so as long as those links do not change, this code should run on any machine. The project is based on an assignment from Harvard's data science course, although I wrote my own functions and made my own interpretations most of the time.

Posted in Data Science, Programming

Predict 2012 Presidential Elections Based on Occupation and Employer

My code and description can be found here:

Posted in Data Science, Programming

Data Loading, Storage and file formats using Pandas

Pandas is a very useful library in Python for data analysis. Here is my IPython Notebook script for this chapter, following Wes McKinney's book "Python for Data Analysis".

Chapter 6 – Data Loading, Storage and file formats

Posted in Data Science, Programming

Getting Started with pandas

Pandas is a very useful library in Python for data analysis. Here is my IPython Notebook script to get started with pandas, following Wes McKinney's book "Python for Data Analysis".

Chapter 5 – Getting started with pandas


Posted in Data Science, Programming

Evaluation metrics for binary classification

Say you perform binary classification using different models and you want to find out which model yields the best results. The evaluation depends on the application, but there are some common frameworks for this task. Accuracy is probably the simplest evaluation criterion, defined as the ratio of the number of correct decisions made to the total number of decisions made. This is equivalent to 1 − error rate. This metric is commonly used in data mining, as it reduces a classifier's performance to a single number. Unfortunately, it is too simplistic and has some well-known issues.

Let's first introduce the notion of a confusion matrix. This is a 2×2 matrix whose columns are labeled with the actual classes and whose rows are labeled with the predicted classes; it generalizes to an n×n matrix for problems involving n classes. In this way, we can analyze the different sorts of errors. The diagonal entries represent correctly predicted classes, and the off-diagonal entries represent either false positive errors (negative instances classified as positive) or false negative errors (positive instances classified as negative).
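As a minimal sketch, a confusion matrix can be built directly from lists of actual and predicted labels (the labels and counts below are made up for illustration):

```python
# Build a 2x2 confusion matrix from actual and predicted labels.
# The "pos"/"neg" labels are illustrative; any class labels would work.

def confusion_matrix(actual, predicted, labels=("pos", "neg")):
    """Rows are predicted classes, columns are actual classes."""
    matrix = {p: {a: 0 for a in labels} for p in labels}
    for a, p in zip(actual, predicted):
        matrix[p][a] += 1
    return matrix

actual    = ["pos", "pos", "neg", "neg", "neg", "pos"]
predicted = ["pos", "neg", "neg", "pos", "neg", "pos"]

cm = confusion_matrix(actual, predicted)
# cm["pos"]["pos"] -> true positives,  cm["pos"]["neg"] -> false positives,
# cm["neg"]["pos"] -> false negatives, cm["neg"]["neg"] -> true negatives
```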

However, such a metric can be problematic for unbalanced classes. It is very common that the interesting instances are rare. For example, the number of customers who buy your insurance policy is much smaller than the number who don't (you would be rich otherwise and would not be reading this blog). Under such circumstances, you can get a very high accuracy by simply classifying every instance as negative. Obviously, this is not what you want.

Ideally, you want to estimate the cost or benefit of each decision a classifier can make. Say you gain $100 for each customer who buys your insurance policy and spend $1 per customer on advertising. Your profit is then $99 for every customer who buys the policy, and your loss is $1 for every customer who doesn't. Your benefit will be greater than zero even if only one customer among 99 buys the policy. If we can estimate a customer's likelihood of buying or not buying, mined from historical data, we can estimate the expected benefit, and then target a customer as long as the probability of buying is above the threshold at which the expected profit becomes positive. Hence, instead of computing accuracies for the competing models, we can compute expected values.
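This expected-value reasoning can be sketched in a few lines, using the hypothetical $99 gain and $1 loss figures from the example:

```python
# Expected profit of targeting one customer, using the hypothetical
# $99 gain / $1 loss figures from the example above.

def expected_profit(p_buy, gain=99.0, loss=1.0):
    """Expected value of targeting a customer with buy probability p_buy."""
    return p_buy * gain - (1 - p_buy) * loss

# Target a customer only when the expected profit is positive,
# i.e. when p_buy > loss / (gain + loss) = 1/100 here.
threshold = 1.0 / (99.0 + 1.0)

print(expected_profit(0.02))   # above the threshold: positive expected profit
print(expected_profit(0.005))  # below the threshold: negative expected profit
```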

There are several other evaluation metrics derived from confusion matrices. Let's abbreviate the numbers of true positives, false positives, true negatives, and false negatives by TP, FP, TN, and FN. We can describe various evaluation metrics using these counts. The true positive rate is defined as TP/(TP + FN) and the false negative rate as FN/(TP + FN).

The metrics precision and recall are often used together. Recall is the same as the true positive rate, while precision is TP/(TP + FP), the accuracy over the cases predicted to be positive. The F-measure is the harmonic mean of precision and recall at a given point: F-measure = 2 × (precision × recall) / (precision + recall).
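These definitions translate directly into code; the TP/FP/FN counts below are illustrative:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)       # same as the true positive rate
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f = precision_recall_f1(tp=8, fp=2, fn=4)
```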

An alternative to the expected value framework is to rank a set of cases by their scores. A ROC graph is a two-dimensional plot of a classifier with the false positive rate on the x-axis against the true positive rate on the y-axis. A ROC graph thus depicts the relative trade-offs a classifier makes between benefits and costs. The advantage is that ROC graphs are independent of the class proportions as well as of the costs and benefits. AUC (area under the curve) is an important summary statistic of the ROC, useful when a single number is needed to summarize performance.
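A ROC curve can be traced by sweeping a threshold down the ranked scores; here is a rough sketch, assuming binary 0/1 labels, no tied scores, and at least one instance of each class:

```python
def roc_points(scores, labels):
    """One (FPR, TPR) point per case, sweeping the threshold from high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Sort cases by score, highest first; each case lowers the threshold.
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

pts = roc_points(scores=[0.9, 0.8, 0.6, 0.3], labels=[1, 1, 0, 0])
# A perfect ranking puts all positives above all negatives: AUC = 1.0
```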

Posted in Data Science

Covariate shift

Covariate shift (often miswritten as "covariance shift") is an important problem in data science: the distribution of the inputs differs between the training and test data. Here, I illustrate how weighted maximum likelihood improves the prediction results.
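To illustrate the weighting idea, here is a sketch of importance-weighted least squares for a one-dimensional linear model, where each training case is weighted by a density ratio p_test(x)/p_train(x). The weights below are hypothetical stand-ins (in practice the densities must be estimated), and this is not the exact code from the linked post:

```python
# Importance-weighted least squares for y = a*x + b: cases that the test
# distribution visits more often get larger weights and dominate the fit.

def weighted_linear_fit(xs, ys, weights):
    """Closed-form weighted least squares for a 1-D linear model."""
    sw = sum(weights)
    mx = sum(w * x for w, x in zip(weights, xs)) / sw
    my = sum(w * y for w, y in zip(weights, ys)) / sw
    cov = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, xs, ys))
    var = sum(w * (x - mx) ** 2 for w, x in zip(weights, xs))
    a = cov / var
    return a, my - a * mx

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 1.1, 1.9, 3.2]
weights = [0.5, 0.8, 1.5, 2.0]   # hypothetical p_test(x)/p_train(x) ratios
slope, intercept = weighted_linear_fit(xs, ys, weights)
```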

Posted in Data Science

How to install IPython Notebook on Mac OS X?

IPython Notebook seems to be a new way of scientific computing with fantastic features. It is very easy to install on Linux but can be a headache on Mac OS X (at least it was for me). After searching and failing many times, I found this tutorial, which finally helped me install the package successfully. Thank you, Titipat Achakulvisut, for making my life easier. In case the author deletes the file, I am attaching the .pdf here as well.
Posted in Programming

Overfitting and Its Avoidance

Assume you work at a company and your boss asks you to build a model predicting customers' tendency to accept a special offer. You build a model and come up with a result that is almost perfect (around 99% accuracy). Your boss might think you are a genius until he realizes that your model does not work well on new data. This phenomenon is called overfitting in data science. We want to avoid developing a model that yields perfect results on the existing data but does not generalize to new data sets. As the Nobel laureate Coase said, "If you torture the data long enough, it will confess."

The general accuracy of the model should be tested on another data set which was not used to build the model but for which the target variable is known. This is called holdout data; when holdout data is used in this manner, it is called test data. Using a single holdout set might still be misleading, as it might be a matter of chance that you are getting a good or a bad performance. A more sophisticated holdout and testing procedure is cross-validation, which uses the existing data in a smart way. Instead of splitting the data into one training and one test set, cross-validation estimates the performance over all the data by performing multiple splits and systematically swapping out the samples held for testing. This allows you to compute statistics on the estimated performance, such as the mean and the variance, so that we can understand how the performance of the model varies across data sets. Cross-validation begins by splitting a labeled data set into k partitions called folds; each fold in turn is held out as the test data while the remaining folds are used for training.

The chance of overfitting increases as the model becomes more complex. For instance, if we increase the number of leaves in tree induction, our model becomes more prone to overfitting. Therefore, tree induction commonly limits the number of leaves to avoid overfitting.
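The fold-splitting step of cross-validation can be sketched as follows (a simple interleaved split; real implementations usually shuffle the data first):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds; each fold serves once as test data."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((sorted(train), test))
    return splits

splits = k_fold_indices(n=10, k=5)
# Every sample appears in exactly one test fold across the 5 splits,
# so the performance estimate uses all the labeled data.
```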
In linear models such as logistic regression, complexity can be controlled by choosing the right set of attributes. The general strategy is to find models that not only fit the data but are also simpler. This methodology is called regularization: we penalize the model as the weights grow. The most commonly used penalty is the sum of the squares of the weights, sometimes called the "L2 norm" of the weights. This gives a large penalty when the weights have large absolute values. Ridge regression is the name of the procedure when the L2-norm penalty is used in standard least-squares linear regression. If instead we use the sum of the absolute values of the weights, known as the L1 norm, we get a procedure known as the lasso (Hastie et al., 2009). More generally, this is called L1 regularization, which ends up zeroing out many coefficients. Happy holidays and avoid overfitting!

Reference: "Data Science for Business" by Foster Provost and Tom Fawcett.
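To see the effect of the L2 penalty concretely, here is a minimal ridge sketch for a single-feature model with no intercept, where minimizing the penalized squared error gives the closed form w = Σxy / (Σx² + λ):

```python
# Minimal L2-regularization (ridge) sketch for y = w*x with no intercept:
# the penalty lam shrinks the fitted weight toward zero.

def ridge_weight(xs, ys, lam):
    """Minimize sum((y - w*x)^2) + lam * w^2; closed form for one weight."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]          # exactly y = 2x

print(ridge_weight(xs, ys, lam=0.0))   # unregularized fit recovers w = 2.0
print(ridge_weight(xs, ys, lam=14.0))  # the penalty shrinks the weight
```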
Posted in Data Science

Predictive Modeling and Supervised Segmentation

Predictive modeling is very useful for understanding or predicting a target quantity. In business, this quantity might be something we want to avoid: for example, you may want to predict whether a customer will leave the company when her/his contract expires. Or it might be something we would be happy to have happen, such as which customers are most likely to respond to a special offer. The key step in predictive modeling is the identification of the attributes/features. A supervised learning model describes a relationship between these attributes and the target variable. In business, you may want to model the probability that a customer leaves the company as a function of customer account attributes such as age, income, data usage, etc.

The obvious question is: "Which of these attributes would best segment these people into groups, in a way that distinguishes those who leave the company from those who don't?" A purity measure can quantify how well an attribute segments the data according to a given target. For instance, if we segment the data into two categories, we want each subset to contain examples belonging to only one target class; such subsets are called pure. One of the most common purity measures is information gain, which is based on entropy. I know you have proved a lot of fancy theorems in your information theory class, but it is not as complicated in the business world. Entropy is defined as:

entropy = -p_1 log(p_1) - p_2 log(p_2) - ...

Each p_i is the probability (relative proportion) of class i within the set. High entropy indicates impurity. We can then define information gain (IG) to measure how much an attribute improves (decreases) entropy over the whole segmentation it creates:

IG(parent, children) = entropy(parent) - [p(c_1) entropy(c_1) + p(c_2) entropy(c_2) + ...]

Here p(c_i) is the proportion of instances falling into child segment c_i. So far we have only discussed purity measures for a categorical prediction.
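The entropy and information gain formulas translate directly into code (log base 2, as is conventional; the segmentation below is a made-up example where a 50/50 parent splits into two pure children):

```python
from math import log2

def entropy(probs):
    """entropy = -sum(p * log2(p)); zero-probability terms contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

def information_gain(parent_probs, children):
    """children: list of (proportion of parent, class probabilities) pairs."""
    return entropy(parent_probs) - sum(w * entropy(p) for w, p in children)

# A 50/50 parent split into two pure children: the maximum possible gain.
ig = information_gain([0.5, 0.5], [(0.5, [1.0]), (0.5, [1.0])])
```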
A natural measure of impurity for a numerical target is variance, and a notion similar to IG can be formalized by looking at the reduction in variance between parent and children. Tree induction is one of the most commonly used techniques in data mining, as it is easy to comprehend: it starts with the whole data set and creates the "purest" subgroups it can using the attributes. In many prediction problems, we may want a more informative prediction than just a classification. For example, we can use the probability of leaving instead of a direct class assignment. A frequency-based estimate can be used to assign a class probability: if a leaf contains n positive instances and m negative instances, the probability of any new instance being positive may be estimated as n/(n + m). This estimate may be overly optimistic if there are not enough instances in a leaf, which is a form of overfitting. The Laplace correction can be used to moderate the influence of leaves with only a few instances:

p(c) = (n + 1)/(n + m + 2)
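A tiny sketch comparing the raw frequency estimate with the Laplace correction:

```python
def frequency_estimate(n, m):
    """Raw class-probability estimate from n positive, m negative instances."""
    return n / (n + m)

def laplace_estimate(n, m):
    """Laplace-corrected estimate: p(c) = (n + 1) / (n + m + 2)."""
    return (n + 1) / (n + m + 2)

# A leaf with 2 positives and 0 negatives: the raw estimate claims
# certainty, while the Laplace correction tempers it.
print(frequency_estimate(2, 0))  # 1.0
print(laplace_estimate(2, 0))    # 0.75
```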
Posted in Data Science

Data Science terms in Business

Recently, I came across several engineering students applying for data scientist positions. They all have great analytical and programming skills, but they mentioned that it was hard to understand the jargon used in the business world for the same problems we often solve in engineering. Therefore, I decided to post some of the terms that I learned from the wonderful book "Data Science for Business" by Foster Provost and Tom Fawcett. This is a great source if you want to get accustomed to business-world problems.

1. Classification and class probability estimation. In the business world, this seeks to answer problems such as "Among all the customers of Company X, which are most likely to respond to a given offer?" Here, the two classes are "will respond" and "will not respond". This is a basic data mining problem. We can also estimate a class probability for a given customer; this task is also called "scoring" in business. So, next time a recruiter asks you about scoring, you should be able to explain it right away.

2. Regression. You might be used to using regression to find the best line explaining your data, but in the business world you may need the same skills to answer a question like "How much will a given customer use the service?" You estimate the variable "service usage" using similar individuals and their historical usage. I know classification and regression are mathematically close relatives, but don't tell this to your recruiter; they might think you don't know anything. Instead, you should say: "Classification predicts whether something will happen, whereas regression predicts how much something will happen."

3. Similarity matching. Do not freak out when you hear this term in your interview. (In fact, you should not freak out at any time.) Similarity matching is just an attempt to identify similar individuals based on data known about them. For example, Company X is interested in finding companies similar to its best business customers.

4. Clustering. This attempts to group individuals in a population together by their similarity, but not driven by a specific purpose. Yes, you can use the word "unsupervised" here. An example question in the business world might be "Do our customers form natural groups or segments?"

5. Co-occurrence grouping. When you purchase a product on Amazon, it usually offers you more items to purchase based on data from customers similar to you. This is called co-occurrence grouping: it attempts to find associations between entities based on the transactions involving them. The question we want to answer is "What items are commonly purchased together?"

6. Profiling. I did not have a car during my first three months back in Los Angeles, so my bank did not expect me to use my ATM card at a gas station. When I tried to use my ATM card to fill the tank of my new car, the card was blocked and I had to call the bank to unblock it. I was mad at my bank at the time, but then I realized that this was actually done to protect me. They had a profile of me based on my previous purchases, and their system thought my card might have been stolen. As you can guess, profiling is frequently used for fraud detection.

7. Link prediction. If you are interviewed by a social network company like Facebook or Twitter, you had better know what this means. Link prediction attempts to predict connections between data items, usually by suggesting that a link should exist, and possibly also by estimating the strength of the link. You might try to answer the question "Since you and Sergul (myself) share 10 friends on Facebook, maybe you'd like to be my friend?" Graph theory people may enjoy these kinds of problems.

8. Data reduction. If you know principal component analysis (PCA), this should be a joke for you. Basically, we want to replace a massive amount of data with a smaller data set that contains much of the important information in the larger one.

9. Causal modeling. This is very relevant to the autoregressive (AR) models used in signal processing and economics. It helps us understand what events or actions influence each other. Say we observe that targeted customers purchase at a higher rate subsequent to having been targeted. We may want to know: "Was this because the advertisements influenced the customers to purchase? Or did the predictive model simply do a good job of identifying those customers who would have purchased anyway?"
Posted in Data Science