Data Science terms in Business
Recently, I came across several engineering students applying for data scientist positions. They all have great analytical and programming skills, but they mentioned that it was hard to understand the jargon the business world uses for the same problems we often solve in engineering. So I decided to post some of the terms I learned from the wonderful book “Data Science for Business” by Foster Provost and Tom Fawcett. It is a great source if you want to get accustomed to business-world problems.

1. Classification and class probability estimation
In the business world, this seeks to answer questions such as “Among all the customers of Company X, which are more likely to respond to a given offer?” Here, the two classes are “will respond” and “will not respond”. This is a basic data mining problem. We can also estimate, for a given customer, the probability of belonging to each class. This task is also called “scoring” in business. So the next time a recruiter asks you about scoring, you should be able to explain it right away.

2. Regression
You might be used to using regression to find the line that best explains your data, but in the business world you may need the same skills to answer a question like “How much will a given customer use the service?” You will estimate the variable “service usage” by looking at similar individuals and their historical usage. I know, mathematically, classification and regression are closely related, but don’t tell this to your recruiter. They might think you don’t know anything. Instead, you should say: “Classification predicts whether something will happen, whereas regression predicts how much something will happen.”

3. Similarity matching
Do not freak out when you hear this term in your interview. In fact, you should not freak out at any time. Similarity matching is just an attempt to identify similar individuals based on the data known about them. For example, Company X is interested in finding companies similar to its best business customers.

4. Clustering
This attempts to group individuals in a population together by their similarity, without being driven by a specific purpose. Yes, you can use the word “unsupervised” here. An example question in the business world might be “Do our customers form natural groups or segments?”

5. Co-occurrence grouping
When you purchase a product on Amazon, it usually suggests more items to purchase, based on what it knows about customers similar to you. This is called “co-occurrence grouping”. It attempts to find associations between entities based on the transactions involving them. The question we want to answer is “What items are commonly purchased together?”

6. Profiling
Profiling attempts to characterize the typical behavior of an individual, group, or population. I did not have a car during my first three months back in Los Angeles, so my bank did not expect me to use my ATM card at a gas station. When I tried to use the card to fill the tank of my new car, it was blocked and I had to call the bank to unblock it. I was mad at my bank at the time, but then I realized that this was actually done to protect me. They had a profile of me based on my previous purchases, and their system thought my card might have been stolen. As you can guess, profiling is frequently used for fraud detection.

7. Link prediction
If you are interviewed by a social network company like Facebook or Twitter, you had better know what this means. Link prediction attempts to predict connections between data items, usually by suggesting that a link should exist, and possibly also by estimating the strength of the link. You might try to answer a question like “Since you and Sergul (myself) share 10 friends on Facebook, maybe you’d like to be my friend?” Graph theory people may enjoy these kinds of problems.

8. Data reduction
If you know principal component analysis (PCA), this should be a piece of cake for you. Basically, we want to replace a massive data set with a smaller one that retains much of the important information in the larger set.

9. Causal modeling
This has some connection to the autoregressive (AR) models used in signal processing and economics. Causal modeling helps us understand which events or actions actually influence others. Say we observe that the targeted customers purchase at a higher rate after having been targeted. We may want to know “Was this because the advertisements influenced the customers to purchase? Or did the predictive model simply do a good job of identifying those customers who would have purchased anyway?”
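
To make the difference between “scoring” and classification (term 1) a bit more concrete, here is a minimal sketch in Python. The customer names, the probabilities, and the 0.5 threshold are all made up for illustration; in practice the probabilities would come from whatever model you trained.

```python
# Hypothetical estimated probabilities that each customer responds to an offer.
# These numbers are invented for illustration, not produced by a real model.
scores = {"alice": 0.82, "bob": 0.35, "carol": 0.57}

# Assumed decision threshold; a real campaign would pick it from costs and benefits.
THRESHOLD = 0.5

# Scoring: rank customers from most to least likely to respond.
ranked_customers = sorted(scores, key=scores.get, reverse=True)

# Classification: turn each score into one of the two classes.
labels = {
    name: "will respond" if p >= THRESHOLD else "will not respond"
    for name, p in scores.items()
}

print(ranked_customers)
print(labels)
```

The ranking is what a marketing team would typically use to decide whom to contact first, while the labels are the plain yes/no answer.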
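
Similarity matching (term 3) can be sketched as a nearest-neighbor lookup. The example below is only one possible approach, assuming each company is described by three hypothetical features that have already been scaled to the range 0 to 1; the company names and numbers are invented.

```python
import math

# Hypothetical feature vectors, assumed to be pre-scaled to [0, 1]:
# (annual spend, order frequency, years as a customer)
best_customer = (0.9, 0.8, 0.7)

candidates = {
    "Company A": (0.85, 0.75, 0.8),
    "Company B": (0.2, 0.9, 0.1),
    "Company C": (0.5, 0.4, 0.6),
}

def distance_to_best(name):
    """Euclidean distance between a candidate and the best customer."""
    return math.dist(best_customer, candidates[name])

# Smaller distance means more similar, so sort in ascending order.
most_similar_first = sorted(candidates, key=distance_to_best)

print(most_similar_first)
```

Euclidean distance is just one choice of similarity measure here; it only makes sense when the features are on comparable scales, which is why the sketch assumes they were scaled beforehand.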
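
The question behind co-occurrence grouping (term 5), “What items are commonly purchased together?”, can be approximated by simply counting how often each pair of items shows up in the same transaction. This is a rough sketch with made-up baskets, not how any particular retailer does it.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions, invented for illustration only.
transactions = [
    {"laptop", "mouse", "laptop bag"},
    {"laptop", "mouse"},
    {"mouse", "mouse pad"},
    {"laptop", "laptop bag"},
]

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives every pair a canonical order, so (a, b) and (b, a) match.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(transactions)

# The pairs that appear together most often.
print(counts.most_common(2))
```

Real systems usually go beyond raw counts and normalize by how popular each item is on its own, but the counting step above is the core idea.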