Literally just 5 things I wish I had learned about sooner

I love working with the Pandas library. And the more I’ve worked with it, the more I’ve come to appreciate it.

In this post, I’m going to cover 5 somewhat random things that make life a little easier when viewing and organizing your Pandas DataFrames. Nothing truly groundbreaking, but 5 simple things I appreciated when I learned they existed.

Photo by Ying Wu on Unsplash

I recently made use of all 5 of these when concatenating data from 3 different files, so I’ll provide specific code examples from that general process.

1. The parameter

Sometimes you’ll have a data file that has more columns than you’re interested in using…
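The teaser cuts off before naming the parameter, so the following is only a guess at the general idea. One way to load just the columns you care about is the usecols argument of pandas.read_csv. The CSV contents and column names below are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for a real data file.
csv_text = """id,name,price,rating,notes
1,alpha,0.99,4.5,ok
2,beta,1.99,3.8,meh
"""

# Assumption: only these three columns are of interest.
wanted_columns = ["id", "price", "rating"]

# usecols tells read_csv to parse only the listed columns.
df = pd.read_csv(io.StringIO(csv_text), usecols=wanted_columns)

print(df.columns.tolist())
```

With a real file you would pass its path in place of the io.StringIO object.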


How to use NumPy or Pandas to quickly bin numerical features

Feature engineering focuses on using the variables already present in your dataset to create additional features that are (hopefully) better at representing the underlying structure of your data.

For example, your model performance may benefit from binning numerical features. This essentially means dividing continuous or other numerical features into distinct groups. By applying domain knowledge, you may be able to engineer categories and features that better emphasize important trends in your data.
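As a rough illustration of binning a numerical feature, here is a minimal sketch using pandas.cut. The ages, bin edges, and labels are invented for the example, and this is not necessarily one of the methods the full post walks through.

```python
import pandas as pd

# Hypothetical numerical feature.
ages = pd.Series([3, 17, 25, 42, 68, 90], name="age")

# Assumed bin edges and labels; in practice these come from domain knowledge.
bin_edges = [0, 18, 40, 65, 120]
bin_labels = ["child", "young_adult", "middle_aged", "senior"]

# right=False makes each bin include its left edge and exclude its right edge,
# so an age of exactly 18 falls into "young_adult" rather than "child".
age_group = pd.cut(ages, bins=bin_edges, labels=bin_labels, right=False)

print(age_group.tolist())
```

The result is a categorical Series that can be added to the DataFrame as a new feature.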

Photo by Paper Beard on Unsplash

In this post, we’ll walk through three different methods for binning numerical features with specific examples using NumPy and Pandas. We’ll engineer features from a dataset with information…


How to use NumPy or Pandas to quickly bin categorical features

Working with categorical data for machine learning (ML) purposes can sometimes present tricky issues. Ultimately these features need to be numerically encoded in some way so that an ML algorithm can actually work with them.
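One common way to numerically encode a categorical feature is one-hot encoding, sketched here with pandas.get_dummies. The column name and values are hypothetical, and other encodings may suit a given model better.

```python
import pandas as pd

# Hypothetical categorical feature.
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Create one 0/1 indicator column per category.
encoded = pd.get_dummies(df, columns=["size"], dtype=int)

print(encoded)
```

Each row ends up with a 1 in exactly one of the new indicator columns.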

You’ll also want to consider additional methods for getting your categorical features ready for modeling. For example, your model performance may benefit from binning categorical features. This essentially means lumping multiple categories together into a single category. By applying domain knowledge, you may be able to engineer new categories and features that better represent the structure of your data.
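Lumping categories together can be as simple as keeping a few categories and relabeling everything else. The sketch below uses Series.where for this; the colors and the choice of which categories to keep are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical categorical feature with a few rare categories.
colors = pd.Series(
    ["red", "blue", "red", "teal", "blue", "mauve", "red"], name="color"
)

# Assumption: domain knowledge says only these categories matter on their own.
categories_to_keep = ["red", "blue"]

# Keep values in the list and replace every other value with "other".
color_binned = colors.where(colors.isin(categories_to_keep), other="other")

print(color_binned.tolist())
```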

In this post, we’ll briefly cover…


How to use the itunes-app-scraper and app-store-scraper to build datasets of app information and reviews

In a previous post, I laid out how you can use the google-play-scraper to scrape both app details (description, price, current version, etc.) and app reviews. This post will focus on using Python code to do the same thing, but for the App Store.

Photo by William Hook on Unsplash

Whereas the google-play-scraper provides functions for scraping app info and reviews in one convenient package, you’ll need to use two separate libraries to accomplish this for the App Store.

The itunes-app-scraper provides a couple of methods that can be used to obtain app IDs, and additional methods to actually scrape data about those apps. …


How to use the google-play-scraper and PyMongo to quickly build a dataset of app reviews

So much of how we interact with the world and with each other happens through apps. Social media, shopping, music, news, dating… You name it, there’s probably more than one app for it. And some apps are better than others.

By analyzing the text of user reviews, we can gain insight into what people like and don’t like about an app. Various fields of Natural Language Processing (NLP) such as Sentiment Analysis and Topic Modeling can help with this, but not if we don’t have any reviews to analyze!

Before we get ahead of ourselves, we need to scrape and…


How to functionize SHAP force plots for binary and multi-class classification

In this post I will walk through two functions: one for plotting SHAP force plots for binary classification problems, and the other for multi-class classification problems.

At this point you may be thinking “Alright, but there’s already a function, so what are we even doing here?” And yes, technically you are correct. BUT pretty much all the examples of SHAP force plots I have seen are for continuous or binary targets. You actually can produce force plots for multi-class targets; it just takes a little extra digging. My goal is to help you do that digging so you get…


A brief introduction to a very broad machine learning concept

It can be difficult to find any sort of consensus on what “feature engineering” specifically refers to. My goal for this post is to give new and aspiring data scientists an introduction to this very broad, yet fundamental, aspect of building successful machine learning (ML) models. We’ll cover the difference between a variable and a feature, why feature engineering is important, and when you might want to engineer features. In future posts, I will walk through some basic examples of how to use Python, Pandas, and NumPy to engineer features.

Image by Kevin Ku via Unsplash

So What is Feature Engineering?

Some people consider feature engineering to include the data…

Max Steele (they/them)

Data scientist with a background in biological research and education
