5 Simple Tips for Viewing and Organizing Pandas DataFrames

Literally just 5 things I wish I learned about sooner

Max Steele (they/them)
Towards Data Science

--

I love working with the Pandas library. And the more I’ve worked with it, the more I’ve come to appreciate it.

In this post, I’m going to cover 5 somewhat random things that make life a little easier when viewing and organizing your Pandas DataFrames. Nothing truly groundbreaking, but 5 simple things I appreciated when I learned they existed.

Photo by Ying Wu on Unsplash

I recently made use of all 5 of these when concatenating data from 3 different files, so I’ll provide specific code examples from that general process.

1. The usecols parameter

Sometimes you’ll have a data file that has more columns than you’re interested in using for any given task. You could read in the entire file:

Screenshot of DataFrame by author

Or you could simply read in the columns you’re actually interested in by making use of the usecols parameter:

Screenshot of DataFrame by author

Why load things into memory if you know what you want and you’re not going to use the rest of it?

2. Quickly rename columns

I don’t really like the last 2 column names in the DataFrame we read in above. I don’t want to type those or keep track of upper versus lower case. I’d rather have shorter column names in all lower case. And since they’re both an identifier for an app, I’d rather have them in similar formats. Something like as_id for the App Store ID and gp_id for the Google Play ID.

I could use the rename() method, feeding in a dictionary mapping the original column names to my desired column names.

But depending on the specific context, I often prefer to do the following:

Screenshot of DataFrame by author

Both methods will produce the above DataFrame with the desired column names. But there’s just something satisfying about setting the column names directly equal to a list and not messing with a mapper.

However, when setting the column names directly to a list of names, you need to keep 2 things in mind:

  1. You must supply a name for every single column.
  2. Order matters.

You could easily mess things up here if you’re not careful. You don’t have to worry as much with the rename() method.

However, the df.columns = approach can be especially useful in certain cases. Say you read your data file into a Pandas DataFrame and your column names look like this:

Screenshot of DataFrame by author

Those spaces would be annoying to work around. But you can quickly fix the issue with the following code, which replaces all white spaces in all column names with underscores:

app_df.columns = [col.replace(' ', '_') for col in app_df.columns]

Or say you read in your file and the column names are in all caps (or otherwise weirdly and inconsistently capitalized) and difficult to read. You can change all column names to be all in lower case with the following:

app_df.columns = [col.lower() for col in app_df.columns]

With list comprehensions such as these, you don’t have to worry about explicitly providing a name for each column or making sure the column names are in the correct order. This happens automatically as you iterate through the original column names with the columns attribute.

3. Change max number of columns to be displayed

I am an extremely visual thinker and I like to be able to see what I’m working with. I continuously inspect DataFrames as I work, and I often like to be able to see all the columns I have available at once. By default, Pandas only displays 20 columns.

Below we see the first 10 columns of a DataFrame that has a total of 50 columns. The middle 30 columns are hidden (indicated by the ellipses highlighted in red), and if we scrolled to the right we’d see the last 10 columns.

Screenshot of DataFrame by author

If I wanted to know the name of all the columns, I could simply print out df.columns, but then I wouldn’t also be able to see the type of data in each column. So instead, I’ll use pd.options.display.max_columns = 50 to display all 50 columns at once.

Below we see the first chunk of the 30 columns which were originally hidden by default:

Screenshot of DataFrame by author

You can also adjust the maximum number of rows displayed for a DataFrame by adjusting pd.options.display.max_rows.

4. Slicing and reordering columns with double brackets

Do you ever get to a point when you’re working with a DataFrame and you know going forward you’re only going to work with a subset of the columns? And maybe you’d rather have that subset of columns in a specific order that makes more sense to you?

Say you’ve got the following DataFrame and you only want the app_id, review, rating, and date columns in that order.

Screenshot of DataFrame by author

You can use double brackets with a list of your desired columns to create a slice from the original DataFrame:

Screenshot of DataFrame by author

5. The insert() method

Say we want to tack on a column that identifies the reviews in the slice we made above as having come from the App Store. This would be useful when we concatenate it with reviews from Google Play, because we’ll know where each review came from. Doing so for our App Store slice is really simple:

Screenshot of DataFrame by author

We created a column filled with 1’s to indicate that yes, these reviews come from the App Store and not Google Play. Creating new columns in this way always tacks them on to the end of the DataFrame. That’s all well and good for this indicator variable we just created, but I like keeping related things near each other. And next I’d like to create a variable that gives me an idea of how long each review string is. I’d really like this new column to sit between the review and rating column.

This can be achieved with the insert() method, which takes three arguments:

  1. The desired index for the new column
  2. The name of the new column
  3. The value with which to fill the new column

For our example, the index needs to be 2 since I want the new column to appear immediately after the 2nd column (index starts at 0 like normal). We’ll name the new column rvw_len. Finally, we’ll map a lambda function to the 3rd parameter. Here we specify that we want to fill the new column with the length of the corresponding review string for each row.

Screenshot of DataFrame by author

Success! Our new column rvw_len has been inserted in the middle of everything else in a place that actually makes sense to me.

To Recap

  1. If you only want to read in a specific set of columns from a file, use the usecols parameter with a list of the column names you actually want.
  2. If you want to rename all your columns, set df.columns = [list of desired column names].
  3. Want to see more columns (or rows) than Pandas displays by default? Adjust the display options with pd.options.display.max_columns = (or .max_rows).
  4. Want a slice of specific columns and to get those columns in a specific order? Use double brackets on your DataFrame with your list of desired columns inside.
  5. If you want to insert a new column at a specific index, rather than having it tacked on as the very last column, use the insert() method.

I hope you found this informative and are able to apply something you learned to your own work. Thanks for reading!

--

--