Scraping App Store Reviews with Python

How to use the itunes-app-scraper and app-store-scraper to build datasets of app information and reviews

Max Steele (they/them)
Python in Plain English

In a previous post, I laid out how you can use the google-play-scraper to scrape both app details (description, price, current version, etc.) and app reviews. This post will focus on using Python code to do the same thing, but for the App Store.

Photo by William Hook on Unsplash

Whereas the google-play-scraper provides functions for scraping app info and reviews in one convenient package, you’ll need to use two separate libraries to accomplish this for the App Store.

The itunes-app-scraper provides a couple of methods that can be used to obtain app IDs, and additional methods to actually scrape data about those apps. With this scraper you can obtain several app details like the app description, price, genre, and current version.

The app-store-scraper provides a method for scraping user reviews of apps in the App Store.

I’ll cover how I prefer to use each to make sure I’m getting the data I want for the apps I’m interested in.

Getting Started Scraping the App Store

Step 1: Obtain App Names and IDs

There is one piece of information that is required to scrape app info or reviews and that’s the app name. There is a second piece of information that I suggest you treat as required because sometimes things go a little wonky when trusting the scrapers to retrieve it for you automatically: the app ID.

Both pieces of information can be found in the url of the app’s page in the App Store. As shown in the image below, the app name can be found between “app/” and “/id”.

App Store url with app name highlighted

The app ID immediately follows “/id” and ends the url.

App Store url with app id highlighted
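If you have a lot of urls to go through, pulling the two pieces out by hand gets tedious. Here is a minimal sketch that assumes the url layout described above ("app/", then the name, then "/id" and the digits). The function name and the example url are made up for illustration.

```python
import re

# Matches App Store urls of the form .../app/<app-name>/id<digits>
APP_URL_PATTERN = re.compile(r"/app/(?P<name>[^/]+)/id(?P<id>\d+)")


def parse_app_store_url(url):
    """Return (app_name, app_id) pulled from an App Store page url."""
    match = APP_URL_PATTERN.search(url)
    if match is None:
        raise ValueError(f"Not an App Store app url: {url!r}")
    return match.group("name"), int(match.group("id"))


# Hypothetical url, made up for illustration only.
name, app_id = parse_app_store_url(
    "https://apps.apple.com/us/app/example-app/id123456789"
)
print(name, app_id)
```

Because the pattern only reads the digits after "/id", a url that carries a trailing query string should still parse.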

My newest project is focused on mental health, mindfulness, and self-care apps. As I was researching apps, I kept track of various details in a spreadsheet, which was a natural place to store the name and ID for each app. With a spreadsheet like this one, we can easily read the file into a Pandas DataFrame to get lists of app names and IDs to iterate over.

If you’re scraping reviews for multiple apps, I would also suggest keeping track of a rough estimate of the number of ratings for each app. It takes a while to scrape reviews, especially compared to the google-play-scraper. By keeping track of the number of ratings an app has, you can decide how to chunk your list of apps for scraping. If you know you’re going to need to pause at some point soon, you’ll know not to start scraping an app with millions of reviews just yet.

It should also be noted that the rough number of ratings for each app will definitely exceed the number of reviews you get from scraping all the reviews. Not everyone who rates an app takes the time to leave a review.
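One way to act on those rough rating counts is to group apps into scraping sessions, smallest first. This is only a sketch of the idea, not something either library provides: the function name, the budget value, and the app names and counts are all hypothetical.

```python
def chunk_by_rating_budget(apps, max_ratings_per_chunk):
    """Group (app_name, rough_rating_count) pairs into chunks, smallest apps first.

    Each chunk's rating counts sum to at most max_ratings_per_chunk, except that
    an app larger than the whole budget ends up in a chunk of its own.
    """
    chunks, current, current_total = [], [], 0
    for name, count in sorted(apps, key=lambda pair: pair[1]):
        # Close the current chunk when adding this app would exceed the budget.
        if current and current_total + count > max_ratings_per_chunk:
            chunks.append(current)
            current, current_total = [], 0
        current.append(name)
        current_total += count
    if current:
        chunks.append(current)
    return chunks


# Made-up apps and rating counts, for illustration only.
apps = [("app-a", 500), ("app-b", 2_000_000), ("app-c", 1_500), ("app-d", 40_000)]
for session in chunk_by_rating_budget(apps, max_ratings_per_chunk=50_000):
    print(session)
```

The huge app lands in its own session, so you can save it for a time when you can leave the scraper running.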

Step 2: Installs and Imports

Here I’ll import everything we’ll need. If you’d like to see an example of storing app data in a MongoDB collection using PyMongo, refer to my earlier post about using the google-play-scraper. For this post, we’ll simply write each batch to a csv file.

You should pip install as necessary to be able to import pandas, the itunes-app-scraper, and the app-store-scraper.
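A quick way to see what still needs installing is to check for each package before importing it. This is a sketch under assumptions: the import names and pip names in the dictionary are my best understanding of how these libraries are published, so double-check them on PyPI. The helper name is hypothetical.

```python
import importlib.util

# Import name -> pip package name. The pip names are assumptions worth
# double-checking on PyPI before installing.
REQUIRED = {
    "pandas": "pandas",
    "itunes_app_scraper": "itunes-app-scraper-dmi",
    "app_store_scraper": "app-store-scraper",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of required packages that cannot be imported yet."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    print("Still to install: pip install " + " ".join(missing))
else:
    print("All set.")

# Once everything is installed, the imports used in the rest of this post are:
#   import pandas as pd
#   from itunes_app_scraper.scraper import AppStoreScraper
#   from app_store_scraper import AppStore
```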

Scraping App Info

The stage is mostly set for us to start scraping and storing. We just need our list of app IDs. I downloaded a version of my spreadsheet as a csv file, so I’ll read that in as a Pandas DataFrame.

Screenshot of app name and ID DataFrame

And now we can easily get lists of app names and IDs to loop through while scraping:
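A minimal sketch of that step, assuming the csv has one column of app names and one of app IDs. The column names, app names, and IDs below are hypothetical stand-ins, and an in-memory string takes the place of the downloaded file so the example runs on its own.

```python
import io

import pandas as pd

# Stand-in for the exported spreadsheet; all values here are made up.
csv_text = """app_name,app_id,approx_ratings
example-meditation-app,111111111,250000
example-journal-app,222222222,1200
"""

# With a real file this would be something like pd.read_csv("apps.csv").
app_df = pd.read_csv(io.StringIO(csv_text))

# Plain Python lists are easy to loop over while scraping.
app_names = app_df["app_name"].tolist()
app_ids = app_df["app_id"].tolist()

print(app_names)
print(app_ids)
```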

For now, to get the app info using the itunes-app-scraper, we will only be using the app IDs. The library provides a method for retrieving app IDs based on the app name from the url (get_app_ids_for_query), but I’ve found that it doesn’t reliably return what I’m asking for or it returns extra IDs. So rather than bother with that, we’ll feed our list of app IDs directly into the get_multiple_app_details method after instantiating the AppStoreScraper.
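Here is a sketch of what that can look like. It assumes get_multiple_app_details yields one dictionary per app and accepts a country keyword, which matches my understanding of the library but is worth confirming against its documentation. So the example runs offline, the helper takes the scraper as an argument and is exercised with a made-up FakeScraper; the real calls are shown in the comments.

```python
from pprint import pprint


def fetch_app_details(scraper, app_ids, country="us"):
    """Collect one details dictionary per app ID into a list.

    `scraper` is expected to be an AppStoreScraper instance. The method is
    assumed to yield results lazily, so list() is what triggers the requests.
    """
    return list(scraper.get_multiple_app_details(app_ids, country=country))


class FakeScraper:
    """Offline stand-in with made-up data, so the sketch runs without network."""

    def get_multiple_app_details(self, app_ids, country="us"):
        for app_id in app_ids:
            yield {"trackId": app_id, "trackName": f"Example App {app_id}", "price": 0.0}


# Real usage (needs the library and network access):
#   from itunes_app_scraper.scraper import AppStoreScraper
#   app_info = fetch_app_details(AppStoreScraper(), app_ids)
app_info = fetch_app_details(FakeScraper(), [111111111, 222222222])

# Print the dictionary for the first app.
pprint(app_info[0])
```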

The last line prints a dictionary containing various information about our first app. Even though it’s pretty-printed, it’s still not very nice to look at:

Screenshot of pprint output

So let’s make our list of dictionaries into a Pandas DataFrame and write that to a csv file using the following code:
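A minimal sketch of that conversion. The list of dictionaries here is a made-up stand-in for the scraper output, and the file is written to a temporary folder only so the example is safe to run anywhere.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the list of dictionaries returned by the scraper.
app_info = [
    {"trackId": 111111111, "trackName": "Example App A", "price": 0.0},
    {"trackId": 222222222, "trackName": "Example App B", "price": 4.99},
]

# Each dictionary becomes a row and each key becomes a column.
app_info_df = pd.DataFrame(app_info)

# Writing to a temporary directory here; in practice pick your own path.
out_path = os.path.join(tempfile.mkdtemp(), "app_info.csv")
app_info_df.to_csv(out_path, index=False)

print(app_info_df.shape, "written to", out_path)
```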

Screenshot of the last several columns of app info DataFrame

Scraping App Reviews

Now we’ll be using the app-store-scraper from which we imported the AppStore class to scrape reviews. Once instantiated, the AppStore class has a review method that enables us to scrape reviews. To instantiate, you need to provide a country code, the app name, and the app ID. I definitely recommend supplying the ID directly to the class, otherwise you might not get exactly what you’re expecting.

The review method has 3 parameters. The first, how_many, is simply how many reviews you want to scrape in total. If no argument is provided, all reviews will be scraped. The review method scrapes batches of 20 reviews at a time. This can’t be changed.

The second parameter, after, allows you to filter out older reviews by providing a datetime object so that you only get reviews written after that date. This will not actually limit the number of calls you make, because App Store reviews can’t be sorted by date. The review method still requests every batch of reviews; it just doesn’t store a review that fails the date check.

The last parameter, sleep, is optional, but I highly suggest using it to build in sleep time between calls, especially if you plan to scrape a lot of reviews. Just slow it down.

Juvenile box turtle — Photo by author

After the review method has completed its job, we can access all the reviews through the reviews attribute and find out how many reviews were scraped for that particular app via the reviews_count attribute.
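Putting those pieces together for a single app might look like the sketch below. It assumes the AppStore constructor takes country, app_name, and app_id keywords and that the count lives in an attribute named reviews_count, which is how I understand the library; check its README if anything fails. The helper takes the class as an argument so the example can run offline against a made-up FakeAppStore, with the real call shown in a comment.

```python
import datetime as dt


def scrape_one_app(store_cls, app_name, app_id, how_many=40, after=None,
                   sleep=None, country="us"):
    """Scrape reviews for a single app and return (reviews, reviews_count)."""
    store = store_cls(country=country, app_name=app_name, app_id=app_id)
    store.review(how_many=how_many, after=after, sleep=sleep)
    return store.reviews, store.reviews_count


class FakeAppStore:
    """Offline stand-in for AppStore; the reviews below are made up."""

    _POOL = [
        {"date": dt.datetime(2020, 1, 15), "rating": 5, "review": "Great"},
        {"date": dt.datetime(2020, 3, 2), "rating": 2, "review": "Crashes"},
        {"date": dt.datetime(2020, 4, 9), "rating": 4, "review": "Helpful"},
    ]

    def __init__(self, country, app_name, app_id=None):
        self.reviews = []
        self.reviews_count = 0

    def review(self, how_many=20, after=None, sleep=None):
        for item in self._POOL:
            if after is not None and item["date"] < after:
                continue  # looked at, but not stored
            if self.reviews_count >= how_many:
                break
            self.reviews.append(dict(item))
            self.reviews_count += 1


# Real usage (needs the library and network access):
#   from app_store_scraper import AppStore
#   reviews, count = scrape_one_app(AppStore, "example-app", 123456789, sleep=20)
reviews, count = scrape_one_app(
    FakeAppStore, "example-app", 123456789,
    how_many=40, after=dt.datetime(2020, 2, 28),
)
print(count, "reviews kept")
```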

Assuming you want reviews for multiple apps, you have to instantiate AppStore for each app. So we’ll need to iterate through both our lists of app names and IDs to accomplish this.

The following block of code loops through all the apps in our lists. For each app, it instantiates the AppStore class and calls review to scrape reviews. We’ve limited the number of reviews collected to 10,000 and constrained collection to those reviews written after February 28, 2020. We’ve also built in a sleep interval between each call lasting 20 to 25 seconds.

After scraping the reviews, we also append 2 keys to each review dictionary, one to include the app name and the other to include the app ID. This way, once we concatenate all our separate csv files, we have an easy way to identify which app the review belongs to. Finally, we convert the list of dictionaries to a Pandas DataFrame and write that to a csv file that includes the app name in the file name.
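A sketch of a loop along those lines follows. As before, the class is passed in so the example runs offline against a made-up FakeAppStore; with the real library you would pass AppStore instead. The function name, the file naming pattern, and the added key names app_name and app_id are illustrative choices, and the constructor keywords and reviews_count attribute are assumptions about the library.

```python
import datetime as dt
import os
import random
import tempfile

import pandas as pd


def scrape_reviews_to_csv(store_cls, app_names, app_ids, out_dir, how_many=10000,
                          after=dt.datetime(2020, 2, 28), country="us"):
    """Scrape each app in turn and write one csv of reviews per app."""
    written = []
    for app_name, app_id in zip(app_names, app_ids):
        store = store_cls(country=country, app_name=app_name, app_id=app_id)
        # One pause length between 20 and 25 seconds is drawn per app.
        store.review(how_many=how_many, after=after, sleep=random.randint(20, 25))

        # Tag every review so rows stay identifiable after concatenating files.
        reviews = store.reviews
        for review in reviews:
            review["app_name"] = app_name
            review["app_id"] = app_id

        path = os.path.join(out_dir, f"{app_name}-reviews.csv")
        pd.DataFrame(reviews).to_csv(path, index=False)
        print(f"Done with {app_name}: {store.reviews_count} reviews -> {path}")
        written.append(path)
    return written


class FakeAppStore:
    """Offline stand-in for AppStore; it ignores sleep and uses made-up reviews."""

    _POOL = [
        {"date": dt.datetime(2020, 1, 15), "rating": 5, "review": "Great"},
        {"date": dt.datetime(2020, 3, 2), "rating": 2, "review": "Crashes"},
        {"date": dt.datetime(2020, 4, 9), "rating": 4, "review": "Helpful"},
    ]

    def __init__(self, country, app_name, app_id=None):
        self.reviews = []
        self.reviews_count = 0

    def review(self, how_many=20, after=None, sleep=None):
        for item in self._POOL:
            if after is not None and item["date"] < after:
                continue
            if self.reviews_count >= how_many:
                break
            self.reviews.append(dict(item))
            self.reviews_count += 1


# Real usage would pass AppStore (from app_store_scraper) instead of FakeAppStore.
paths = scrape_reviews_to_csv(
    FakeAppStore, ["example-app-a", "example-app-b"], [111111111, 222222222],
    out_dir=tempfile.mkdtemp(),
)
```

Writing one file per app means an interrupted run only costs you the app that was in progress.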

Executing the above block of code also prints output that keeps you updated on how scraping is progressing (the pink lines are logged automatically by the review method):

Screenshot of output from scraping reviews for one app

Summary

We covered:

  • How to get set up to scrape information and reviews from the App Store, including how to find the necessary app names and IDs
  • How to use the itunes-app-scraper library to get app info data
  • How to use the app-store-scraper library to scrape app reviews
  • How to convert scraped app data into Pandas DataFrames and write to csv files

I hope you found this informative and are able to apply something you learned to your own work. Thanks for reading!
