Feature engineering focuses on using the variables already present in your dataset to create additional features that are (hopefully) better at representing the underlying structure of your data.
For example, your model performance may benefit from binning numerical features. This essentially means dividing continuous or other numerical features into distinct groups. By applying domain knowledge, you may be able to engineer categories and features that better emphasize important trends in your data.
In this post, we’ll walk through three different methods for binning numerical features with specific examples using NumPy and Pandas. We’ll engineer features from a dataset with information…
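As a rough illustration of binning a numerical feature, here is a minimal sketch using `pd.cut`. The `age` column, the bin edges, and the group labels are all hypothetical and chosen only for demonstration; they are not taken from the dataset used in the post.

```python
import numpy as np
import pandas as pd

# Hypothetical data: ages in years (an assumption for illustration only)
df = pd.DataFrame({"age": [3, 17, 25, 42, 67, 80]})

# Illustrative bin edges and labels; in practice these come from domain knowledge
bins = [0, 18, 65, np.inf]
labels = ["child", "adult", "senior"]

# right=False makes each bin include its left edge: [0, 18), [18, 65), [65, inf)
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)

print(df)
```

Setting `right=False` is a design choice here: it puts someone who is exactly 18 in the "adult" group rather than the "child" group.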
Working with categorical data for machine learning (ML) can be tricky. Ultimately, these features need to be numerically encoded in some way so that an ML algorithm can actually work with them.
You’ll also want to consider additional methods for getting your categorical features ready for modeling. For example, your model performance may benefit from binning categorical features. This essentially means lumping multiple categories together into a single category. By applying domain knowledge, you may be able to engineer new categories and features that better represent the structure of your data.
In this post, we’ll briefly cover…
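To make the idea of lumping categories together more concrete, here is a minimal sketch that maps several detailed categories onto a smaller set of groups. The `color` column and the mapping are hypothetical and only meant for illustration; this is one possible approach, not necessarily the one used in the post.

```python
import pandas as pd

# Hypothetical data (an assumption for illustration only)
df = pd.DataFrame({"color": ["red", "crimson", "navy", "blue", "teal", "red"]})

# Illustrative mapping from detailed categories to broader groups
color_groups = {
    "red": "red",
    "crimson": "red",
    "navy": "blue",
    "blue": "blue",
}

# Categories missing from the mapping become NaN, then fall into "other"
df["color_group"] = df["color"].map(color_groups).fillna("other")

print(df)
```

Sending unmapped values to a catch-all "other" group keeps rare or unexpected categories from each becoming their own feature once the column is encoded.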
In a previous post, I laid out how you can use the google-play-scraper to scrape both app details (description, price, current version, etc.) and app reviews. This post will focus on using Python code to do the same thing, but for the App Store.
Whereas the google-play-scraper provides functions for scraping app info and reviews in one convenient package, you’ll need to use two separate libraries to accomplish this for the App Store.
The itunes-app-scraper provides a couple of methods for obtaining app IDs, plus additional methods for actually scraping data about those apps. …
So much of how we interact with the world and with each other happens through apps. Social media, shopping, music, news, dating… You name it, there’s probably more than one app for it. And some apps are better than others.
By analyzing the text of user reviews, we can gain insight into what people like and don’t like about an app. Various fields of Natural Language Processing (NLP) such as Sentiment Analysis and Topic Modeling can help with this, but not if we don’t have any reviews to analyze!
In this post I will walk through two functions: one for plotting SHAP force plots for binary classification problems, and the other for multi-class classification problems.
At this point you may be thinking “Alright, but there’s already a shap.force_plot() function, so what are we even doing here?” And yes, technically you are correct. BUT pretty much all the examples of SHAP force plots I have seen are for continuous or binary targets. You actually can produce force plots for multi-class targets; it just takes a little extra digging. My goal is to help you do that digging so you get…
It can be difficult to find any sort of consensus on what “feature engineering” specifically refers to. My goal for this post is to give new and aspiring data scientists an introduction to this very broad, yet fundamental, aspect of building successful machine learning (ML) models. We’ll cover the difference between a variable and a feature, why feature engineering is important, and when you might want to engineer features. In future posts, I will walk through some basic examples of how to use Python, Pandas, and NumPy to engineer features.
Some people consider feature engineering to include the data…
Data scientist with a background in biological research and education