Scraping & Storing Google Play Reviews with Python

How to use the google-play-scraper and PyMongo to quickly build a dataset of app reviews

Max Steele (they/them)
Python in Plain English


So much of how we interact with the world and with each other happens through apps. Social media, shopping, music, news, dating… You name it, there’s probably more than one app for it. And some apps are better than others.

By analyzing the text of user reviews, we can gain insight into what people like and don’t like about an app. Various fields of Natural Language Processing (NLP) such as Sentiment Analysis and Topic Modeling can help with this, but not if we don’t have any reviews to analyze!

Before we get ahead of ourselves, we need to scrape and store some reviews. This post focuses on how to do exactly that in Python, using the google-play-scraper and PyMongo. There are numerous ways to store or save scraped reviews, but I like the flexibility of essentially dumping them into a MongoDB collection as I go.


The google-play-scraper provides APIs to crawl the Google Play Store. You can use it to obtain:

  • App info — things like the title and description of the app, price, genre, current version, etc.
  • App reviews

To get app info, you use the app function; to get reviews, you use the reviews or reviews_all function. I’ll cover how to use app briefly, then focus on getting the most out of reviews. I’ve found that, while reviews_all is nice for some use cases, I prefer working with reviews. I’ll explain why and how, with plenty of code, once we get there.

Getting Started with Google-Play-Scraper

Step 1: Obtain App IDs

There’s one piece of information you’ll need in order to scrape each app: its ID code. This can be found in the URL of the app’s page in the Google Play Store. As shown in the image below, the part you’ll need immediately follows “id=”.

Google Play Store url with app ID highlighted

In some cases, the app ID is the final part of the URL. In others, like this example, you only want the part between “id=” and “&”.

My newest project is focused on mental health, mindfulness, and self-care apps. As I researched apps, I kept track of various details in a spreadsheet. That made it a natural place to store the ID for each app.

Screenshot of example spreadsheet for keeping track of app IDs

With a spreadsheet like the one above, we can easily read the file into a Pandas DataFrame to get a list of app IDs to iterate over.

Step 2: Installs and Imports

Here I’ll import everything I used, including PyMongo. If you’re not interested in that part, skip those imports. Otherwise, you’ll need to install MongoDB. Instructions for installing the Community Edition can be found here.

You should pip install as necessary to be able to import each of the following:
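Here’s a minimal sketch of what those imports might look like (datetime, pprint, and time are in the standard library; the rest are pip-installable):

```python
from datetime import datetime
from pprint import pprint
from time import sleep

import pandas as pd
from google_play_scraper import Sort, app, reviews
from pymongo import MongoClient
```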

We’ll also set up our Mongo Client, create a new database for our project, and set up new collections (essentially the MongoDB equivalent to the tables of relational databases). We’ll store app info in one collection and app reviews in another.

MongoDB creates databases and collections lazily. This means these things won’t actually exist until we start inserting documents (essentially the MongoDB equivalent to the rows of relational database tables) into our collections.
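A sketch of that setup is below. The database and collection names (app_proj_db, info_collection, review_collection) are placeholders; use whatever suits your project.

```python
# Connect to a locally running MongoDB instance (default host and port)
client = MongoClient(host='localhost', port=27017)

# Placeholder names; the database and collections are created lazily
app_proj_db = client['app_proj_db']
info_collection = app_proj_db['info_collection']      # one document per app
review_collection = app_proj_db['review_collection']  # one document per review
```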

Scraping App Info

The stage is mostly set for us to start scraping and storing. We just need our list of app IDs. I downloaded a version of my spreadsheet as a csv file, so I’ll just read that in as a Pandas DataFrame.
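Something along these lines, with a hypothetical filename:

```python
# Filename is hypothetical; point this at your own CSV export
df = pd.read_csv('app_ids.csv')
df.head()
```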

Screenshot of app ID DataFrame

And now we can easily get lists of app names and IDs to loop through while scraping:
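The column names below are assumptions; match them to whatever your own spreadsheet uses:

```python
# Column names are assumptions; adjust to match your spreadsheet
app_names = list(df['app_name'])
app_ids = list(df['android_app_id'])
```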

We’ll make use of app_names when we scrape reviews. To scrape general app info with the app function, all we need is app_ids. The following code loops through each app, scrapes its info from the Google Play Store, and stores that info in a list.
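A sketch of that loop, using the scraper’s default language and country settings:

```python
app_info = []

for i in app_ids:
    # app() returns a dictionary of details for a single app
    info = app(i, lang='en', country='us')
    app_info.append(info)

# Pretty-print the details of our first app
pprint(app_info[0])
```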

That last line prints a dictionary containing various information about our first app. I’ve included a truncated version of that output below:

{'adSupported': None,
'androidVersion': '4.1',
'androidVersionText': '4.1 and up',
'appId': 'com.aurahealth',
'containsAds': False,
'contentRating': 'Everyone',
'contentRatingDescription': None,
'currency': 'USD',
'description': '<b>Find peace everyday with Aura</b> - discover thousands of ' ... (truncated),
'descriptionHTML': '<b>Find peace everyday with Aura</b> - discover ' ... (truncated),
'developer': 'Aura Health - Mindfulness, Sleep, Meditations',
'developerAddress': '2 Embarcadero Center, Fl 8\nSan Francisco, CA 94111',
'developerEmail': 'hello@aurahealth.io',
'developerId': 'Aura+Health+-+Mindfulness,+Sleep,+Meditations',
'developerInternalID': '8194778368040078712',
'developerWebsite': 'http://www.aurahealth.io',
'editorsChoice': False,
'free': True,
'genre': 'Health & Fitness',
...
}

Let’s safely store those app details into our info_collection using PyMongo’s insert_many method. insert_many expects a list of dictionaries, which is exactly what we just made.
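For example:

```python
# insert_many stores each dictionary in the list as its own document
info_collection.insert_many(app_info)
```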

Whenever you want to start working with that dataset, you can query it straight to a DataFrame with a single line of code!
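That single line, give or take the query you pass to find:

```python
# Pull every document in the collection into a DataFrame
info_df = pd.DataFrame(list(info_collection.find({})))
```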

First few rows of Pandas DataFrame containing app info

Scraping App Reviews

Earlier I said I prefer working with the reviews function as opposed to reviews_all. Here’s why:

  1. You can still get all the reviews if that’s really what you want.
  2. You can chunk the process for each app rather than having to do everything for a single app in one go. This is advantageous because it gives you options. You can:
  • Get periodic updates on how many reviews you’ve scraped
  • Save scraped data as you go rather than waiting until the end

Anatomy of the `reviews` Function

The reviews function returns two things. The first is the review data we’re after. The second is a continuation token holding the information we need if we want to keep scraping beyond the first count reviews.

The first argument you’ll need to provide to reviews is the app ID. There are two options for sorting reviews: by most recent, or by whatever Google Play determines is most relevant. You also have the option to filter reviews by score.
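Putting that together, a single call might look like the sketch below; the lang and country values are assumptions, and filter_score_with is shown commented out:

```python
# One batch of up to 200 of an app's most recent reviews
rvws, token = reviews(
    app_ids[0],            # the app's ID string
    lang='en',             # language (assumed)
    country='us',          # country (assumed)
    sort=Sort.NEWEST,      # or Sort.MOST_RELEVANT
    count=200,             # how many reviews to retrieve
    # filter_score_with=5, # optionally keep only reviews with a given score
)
```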

The intended use of the count parameter is really to tell the function how many reviews you want it to retrieve before stopping. The google-play-scraper documentation offers the following insight:


“Setting count too high can cause problems. Because the maximum number of reviews per page supported by Google Play is 200, it is designed to pagination and recrawl by 200 until the number of results reaches count.”

Side note: reviews_all is akin to setting count to infinity, which seems pretty high to me.

I think count is better thought of and used as your batch size. Just set it to 200 reviews, return those reviews and your token, and use your token in the next iteration of the reviews function.

Walkthrough of Review Scraping

In this section, we’ll break down code that:

  • Iterates through a list of app IDs to scrape Google Play reviews
  • Stores the reviews periodically in a MongoDB collection
  • Prints updates on the status of the scraping process

Example output detailing progress throughout review scraping for a single app

We’ll do this one chunk at a time, but I’ll be nice and include the chunks all stitched together at the very end so you don’t have to fuss over getting the indentation correct.

Step 1: Set Up the Loop

Earlier we stored our lists of app names and app IDs. The list of app names isn’t strictly necessary for scraping, but it has a purpose, so please bear with me.


In this block of code, we begin our for loop to go through all our apps. Just make sure your lists of names and IDs match up. We’ll also print the start of our output for the first app.
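A sketch of that opening chunk; the exact formatting of the printed output is an assumption:

```python
for app_name, app_id in zip(app_names, app_ids):

    # Open this app's block of status output
    start = datetime.now().strftime('%H:%M:%S')
    print(f'***** {app_name} started at {start}')
```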

Step 2: Scrape First Batch of Reviews

Finally! The first several lines should look familiar: we use the reviews function to get our reviews and token. Right after that, we add 2 keys to each of our newly obtained review dictionaries. Attaching these identifiers is helpful because the data scraped for each review doesn’t explicitly identify which app the review was for. Potential crisis averted!
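A sketch of that chunk; the one-second pause and the buffer list are assumptions:

```python
    # Scrape the first batch of up to 200 of this app's newest reviews
    rvws, token = reviews(
        app_id,
        lang='en',
        country='us',
        sort=Sort.NEWEST,
        count=200,
    )

    # Add 2 keys so every review records which app it belongs to
    for r in rvws:
        r['app_name'] = app_name
        r['app_id'] = app_id

    # Running buffer of tagged reviews awaiting storage
    app_reviews = list(rvws)

    # Brief pause between requests (see the note below)
    sleep(1)
```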

A note on wait time: I’m not sure it’s entirely necessary to build that in. I did so because the reviews_all function includes a parameter for wait time. Depending on how many batches you run, our approach amounts to essentially the same thing: lots of continuous requests.

Step 3: Store the Review IDs from the First Batch

Each review comes with a unique identifier. Before collecting our next batch of reviews, we need to save these to compare against later.
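Along these lines:

```python
    # The unique ID of every review scraped so far, for comparison later
    pre_review_ids = []
    for rvw in rvws:
        pre_review_ids.append(rvw['reviewId'])
```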

Step 4: Set and Loop Through a Maximum Number of Batches


We obtained a continuation token with the first batch of reviews. This means that now we can loop through as many batches of 200 reviews as we want, picking up where we left off each time. To do this, we just feed in the newest token at each iteration.

In the code below, by using range(4999), I’ve set the maximum number of batches to 5,000 (we already got our first batch). At 200 reviews per batch, that means we’ll obtain up to the first 1 million reviews, if there even are that many.

But what if an app has fewer than 1 million reviews? Will the scraper stop when it reaches the end?

No. No it will not. Unless you tell it to. If an app only has 281 reviews, the scraper will still return a continuation token once you’ve scraped all 281. And that means our loop will happily keep grabbing those same 281 reviews for 5,000 batches. Obviously that’s something to avoid, and that’s why we’ve been storing review IDs. Note that we store the ones from the current batch in a new_review_ids list, similar to the pre_review_ids list created in Step 3.
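Here’s a sketch of the batch loop; per the library’s documentation, passing continuation_token carries the earlier parameters (lang, country, sort, count) forward:

```python
    # Up to 4,999 more batches of 200, on top of the first batch
    for batch_num in range(4999):

        # The token picks up exactly where the last batch left off
        rvws, token = reviews(app_id, continuation_token=token)

        # Tag the new reviews and note their IDs
        new_review_ids = []
        for r in rvws:
            new_review_ids.append(r['reviewId'])
            r['app_name'] = app_name
            r['app_id'] = app_id

        # Add this batch to the buffer of reviews awaiting storage
        app_reviews.extend(rvws)
```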

Step 5: Break the Loop if Nothing New is Being Added

Here we compare the set of review IDs we had prior to scraping the current batch with the set of review IDs we have after incorporating our current batch. If the length of the two sets is the same, that means we have stopped adding new reviews to our dataset. So we break the loop and move on to the next app.

If the lengths differ, we keep scraping. Before starting the next batch, we’ll reassign our current list of all review IDs to the pre_review_ids variable.
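A sketch of that check and reassignment:

```python
        # Combine the IDs we had before with the IDs from this batch
        all_review_ids = set(pre_review_ids + new_review_ids)

        # If the set didn't grow, nothing new was added: stop this app
        if len(all_review_ids) == len(set(pre_review_ids)):
            break

        # Otherwise keep scraping with an updated list of known IDs
        pre_review_ids = list(all_review_ids)
```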

Step 6: Store the Data and Print an Update After Every ith Batch

If you’re scraping tens of thousands or even millions of reviews, it’s nice to get the occasional update on how things are going. Perhaps more importantly, it’s nice to know your data is being safely stored as you go. The following code does both every 100 batches.
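A sketch, using 100 batches as the interval:

```python
        # Every 100th batch, store what we have and print an update
        if (batch_num + 1) % 100 == 0:
            review_collection.insert_many(app_reviews)
            app_reviews = []  # clear the buffer once it's safely stored

            now = datetime.now().strftime('%H:%M:%S')
            print(f'Batch {batch_num + 1} done at {now}; '
                  f'{len(pre_review_ids)} unique reviews so far')
```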

Step 7: Finishing with One App


At this point you’ve either scraped 1 million reviews for an app, or all the reviews there were to scrape. Congrats! The code below wraps up our nice informative output, makes sure to store any remaining reviews to our MongoDB collection, and waits a tick before starting on the next app.
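A sketch of that wrap-up; the two-second pause is an assumption:

```python
    # Store any reviews still sitting in the buffer
    if app_reviews:
        review_collection.insert_many(app_reviews)

    # Close out this app's block of status output
    end = datetime.now().strftime('%H:%M:%S')
    print(f'Done scraping {app_name} at {end}\n')

    # Wait a tick before starting on the next app
    sleep(2)
```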

Here’s a link to the chunks all stitched back together.

You’ll ultimately end up with some duplicate reviews in your dataset, but that’s nothing df.drop_duplicates(subset=['reviewId']) can’t fix once you’ve got it into a Pandas DataFrame.

Summary

We covered:

  • How to get set up with google-play-scraper and PyMongo
  • How to scrape app info from the Google Play Store
  • Why the reviews function is more useful than reviews_all
  • A step-by-step walkthrough of how to effectively scrape and store Google Play reviews for multiple apps

I hope you found this informative and are able to apply something you learned to your own work. Thanks for reading!
