Retrieve SafeGraph Patterns Data for Only Specific Points-of-Interest (POI) Using an Amazon Web Services Elastic Compute Cloud Instance

"What the heck did he just say?" Yes, even for a large proportion of the technical people who visit my blog, this very well may sound like gibberish. Further, even those who know what it means may still be asking, "Why go to all the trouble?"

What?

SafeGraph data is aggregated human movement data: cell phone tracking data aggregated to protect personally identifiable information (PII), reduce the size to something manageable, and provide valuable insights into where people are coming from and what locations are being visited. While SafeGraph provides a number of human movement data products, one I have been working with a decent amount recently is called Patterns. If you are only interested in the last few months, SafeGraph provides data access through a very intuitive browser-based interface called the data bar.

If, however, you are interested in understanding pre- and post-Covid trends, you need data going further back. This data is summarized by month and available as national datasets through Amazon S3 buckets. Each month's Patterns data is stored as a series of comma-separated values (CSV) files compressed using gzip (GZ) compression and split based on the size of the data for the month. Typically each month is a series of three to five files.

Patterns is based around the idea of Points of Interest (POIs), locations curated by SafeGraph where people congregate. While primarily businesses, these POIs include public recreation facilities such as parks, museums and campgrounds as well. This happens to be one interesting use I have been exploring lately: helping government agencies understand how public park usage is changing due to Covid.

SafeGraph Patterns data provides, for each POI, a count of visiting devices by the block group of their nighttime location (with considerations for protecting PII). Because Esri provides a mountain of demographic data describing exactly who people are by block group, such as Esri Tapestry Segmentation, combining SafeGraph Patterns with Esri demographic data provides a very rich picture of park patron behaviors based on who is visiting and where they are coming from: rich geographic context.

The first step, though, is getting the data.

Why AWS EC2?

The data is big, and moving data out of Amazon Web Services (AWS) can be costly. Processing the data within AWS dramatically reduces the time required to process it and the size of the data being downloaded, which in turn dramatically reduces the cost incurred by SafeGraph. Since I work primarily on pre-sales engagements partnering with SafeGraph, not incurring undue costs while exploring a proof-of-concept (POC) is a consideration for my work.

SafeGraph Patterns data is big. True, it is not ginormous. It is not trillions of records big, but it is still big enough that it takes time to download, space to store, and compute to process what you typically need for analysis. Once filtered to the relevant POIs being studied, the data is quite manageable.

This data resides in Amazon S3, but SafeGraph wisely does not have S3 Select enabled, as this incurs costs. Consequently, getting the required data requires iteratively retrieving each file, filtering it to just the needed records, and compiling the result. Since each Patterns file is a gzipped CSV file, this also requires unpacking the archive. Further, SafeGraph has current data and archived data in different directories.
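To make that retrieve, filter, and compile loop a little more concrete, here is a minimal sketch using only the Python standard library. It is not the code from my package, just an illustration of the idea. The column names (safegraph_place_id, raw_visit_counts) and the tiny made-up files are assumptions for the example, so check the header of your own Patterns files before borrowing any of it.

```python
import csv
import gzip
import io


def filter_patterns_rows(gz_bytes, poi_ids, id_column="safegraph_place_id"):
    """Return the rows of one gzipped CSV whose ID is in poi_ids.

    id_column is an assumption; check the header of your Patterns files.
    """
    wanted = set(poi_ids)
    # Decompress on the fly and decode to text so the csv module can parse it.
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle)
        return [row for row in reader if row[id_column] in wanted]


def compile_filtered(gz_files, poi_ids):
    """Filter each file in turn and stack the matching rows into one list."""
    matches = []
    for gz_bytes in gz_files:
        matches.extend(filter_patterns_rows(gz_bytes, poi_ids))
    return matches


def _make_part(rows):
    """Build a tiny made-up gzipped CSV standing in for one part of a month."""
    text = "safegraph_place_id,raw_visit_counts\n"
    text += "".join(f"{place_id},{visits}\n" for place_id, visits in rows)
    return gzip.compress(text.encode("utf-8"))


# Two hypothetical parts of one month's data, filtered to a single POI.
parts = [
    _make_part([("sg:a", 10), ("sg:b", 5)]),
    _make_part([("sg:c", 7), ("sg:a", 3)]),
]
result = compile_filtered(parts, {"sg:a"})
print(result)
```

Only the rows for the POIs of interest survive each pass, which is why the compiled result ends up so much smaller than what had to be read.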

While none of this is exorbitantly difficult to navigate and perform after a little investigation, I firmly believe in working harder and smarter to be lazy later (or efficient, however you want to look at it). To this end, I pulled together the workflow after the first time around into a project repository on GitHub with most of the hard stuff moved into a Python package. This significantly reduces the complexity and increases the reproducibility of working with this data with ArcGIS.

Now, the data procurement step, getting only the data associated with the POIs being studied, can be as little as three lines of Python. Still, the actual process of downloading and selecting the data is incredibly slow due to challenges of data gravity.

The term “data gravity” is a metaphor coined by software engineer Dave McCrory in 2010. He was trying to convey the idea that large masses of data exert a form of gravitational pull within IT systems...While data doesn’t literally exert a gravitational pull, McCrory used the concept to explain why smaller applications and other bodies of data seemed to gather around larger masses. More importantly, as applications and the datasets associated with them grow larger, they become more difficult to move.

-VXCnge.com

Rather than downloading all the data to extract only what is needed, by using an Amazon EC2 instance we can take the data extraction step to the data, enabling us to download only the data needed.

How to Do It

The general overview is...

  1. Create a Security Group for network rules
  2. Start up an Amazon EC2 instance running Ubuntu
  3. Connect via SSH
  4. Install Miniconda
  5. Clone the GitHub Repo
  6. Configure Jupyter
  7. Run Jupyter using Tmux
  8. Run the data procurement (EC2) notebook
  9. BONUS: Save the image as an Amazon Machine Image (AMI)

It sounds like a lot, but once you run through it once, it really is not altogether difficult. Do not be afraid.

The easiest way to get started is just to log into the AWS Management Console and click on EC2.

Create a Security Group

Now along the left side, click on Security Groups.

Next, click the Create security group button.

Give the security group a name, description and select a VPC (I just have one, so pretty easy). Now, add inbound rules for SSH, HTTPS...

...and a custom rule for port 8888, the port Jupyter runs on.

From there, simply scroll to the bottom and click Create security group.
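If you would rather script this than click through the console, the same three inbound rules can be described as plain Python data. This is only a sketch: the rule dictionaries follow the IpPermissions shape I believe boto3 expects, the wide-open 0.0.0.0/0 range is an assumption you may want to narrow to your own IP address, and the group ID in the comment is a placeholder.

```python
def build_ingress_rules(ports=(22, 443, 8888), cidr="0.0.0.0/0"):
    """Build one inbound TCP rule per port: SSH (22), HTTPS (443) and Jupyter (8888)."""
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": cidr}],
        }
        for port in ports
    ]


rules = build_ingress_rules()
print(rules)

# With boto3 installed and AWS credentials configured, something along these
# lines should apply the rules to an existing group ("sg-..." is a placeholder):
#   import boto3
#   ec2 = boto3.client("ec2")
#   ec2.authorize_security_group_ingress(GroupId="sg-...", IpPermissions=rules)
```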

Create EC2 Instance

Now, we get to create an instance to work with. Return to the main EC2 Dashboard...

...and click on Launch instance.

Search for Ubuntu and select Ubuntu Server. For what we are doing, any Ubuntu Server will work.

The next screen is where we have to find the right hardware for the task at hand. The SafeGraph data processing workflow I built keeps all the data in memory, so this instance needs a decent amount of memory. However, I have not yet figured out how to multithread it, so we do not need a lot of processing cores. Therefore, we need to find an instance with a decent amount of memory with less emphasis on processing cores.

I discovered the r5n.xlarge worked well. When I wrote this, it was just under $0.35 per hour, so if you work all the way through this, and shut it down, you might be out all of $2.00, enough to pay the paper boy in Better Off Dead.
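If you want to ballpark the bill before you start, the arithmetic is just hours times the hourly rate. Here is a quick sketch using the roughly $0.35 per hour I saw. The rate will drift, and this ignores storage and data transfer charges, so treat it as a rough estimate only.

```python
# Approximate on-demand rate for an r5n.xlarge when this was written
# (assumption: check current pricing for your region).
HOURLY_RATE_USD = 0.35


def instance_cost(hours, hourly_rate=HOURLY_RATE_USD):
    """Estimate the compute cost in dollars for running the instance."""
    return round(hours * hourly_rate, 2)


# An afternoon of work, roughly six hours, lands right around two dollars.
print(instance_cost(6))
```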

Continue through the next prompts to Step 4: Add Storage. We are cloning a repo, and moving some data around, so we need some storage capacity. I gave myself enough to work with, 500 GB, since I am going to delete all these resources after getting what I need.

Finally, take advantage of the security group you created. Select an existing security group, select the security group created in the earlier step, and click Review and Launch.

From the next pane, click on Launch. A modal window will appear. In this window create a new key pair, give it a name, and download it. You will use this to connect to your instance if using a standalone SSH client. Once you have downloaded the PEM file, you can then Launch Instances.

In the next window, click on View Instances. Congratulations, you just launched an instance, effectively a computer, in the cloud.

Connect via SSH

Now, we need to get some stuff set up on this new instance. To do this, we need to get connected first. You can use an SSH client, or you can just use the built-in console. Amazon makes it relatively easy to do both.

First, select the instance you just started, and click on Connect.

If you just want to use the browser terminal client provided by Amazon, you can just use the first option, EC2 Instance Connect.

This will open a console right in another browser tab.

Myself, I prefer to use iTerm on my Mac, so I opt for the SSH client. Amazon makes this easy with a snippet easily copied by clicking on the icon to the left of the full command line snippet.

After navigating to the directory where the PEM file is located (cd ~/Documents in my case), getting connected via SSH to the session is as simple as pasting the snippet copied above.

Now, we're connected and can get to work configuring the instance.

Install Miniconda

Now, we can get the Python environment installed and set up. We'll be using the lightweight Conda version, Miniconda. Head to the Miniconda page and copy the link for the Linux 64-bit version.

The link has a pretty standard naming convention, so you should be able to simply use this command to download it.

$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

Now, install Miniconda using the command...

$ bash ./Miniconda3-latest-Linux-x86_64.sh

It will prompt you to accept the license, but you have to read it first. Just hold down Enter until you reach the end and you get a prompt to "Please answer 'yes' or 'no'". Type yes. Hit Enter.

Next, you'll be asked where to install. By default this is in the user home directory (/home/ubuntu/miniconda3), and this works well, so simply hit Enter again. This is when it really gets to work downloading and setting up Conda.

Once it gets everything, it will ask you, "Do you wish to run the installer to initialize Miniconda3 by running conda init?" The default is no, but in this case you do, so type yes. Hit Enter, and you've installed Miniconda.

For the bash session to correctly recognize conda, you need to add it to the .bashrc file. Open this file in the nano editor using the command...

$ nano ~/.bashrc

Once in the editor, add the following to the top of the file.

# Miniconda PATH
export PATH=/home/ubuntu/miniconda3/bin:$PATH

Hit Ctrl+X, Y (for yes to save), and Enter. You've enabled conda in the bash shell. For the current bash session to recognize conda commands, use the following command.

$ source ~/.bashrc

You are now good to go. You can check the path using which python, and it will report the Miniconda path we just set, which is where it is now finding Python.

Clone the GitHub Repo

Now, get the resources for working with SafeGraph data using the command...

$ git clone https://github.com/knu2xs/safegraph-data-utilities.git

This will create a directory in the current working directory, the user home directory, to work in. Switch into it.

$ cd ./safegraph-data-utilities

Add Packages to the Python Environment

The repo we just downloaded contains an environment.yml file we can use to get all the required packages to run the tools included with the repo. We will create a new Conda environment to work with containing all the needed tools using the command...

$ conda env create -f environment.yml

Once all the text is done flowing up the screen, we now have all the Python packages we need in a new Conda environment named sg-data. Activate this environment using...

$ conda activate sg-data

Before getting too far we need to set a few configuration options.

Configure Jupyter Notebook Settings

Jupyter is designed to run on a local machine by default. To enable remote access we need to get a hashed password and set a few configuration options. Don't worry, it is not as bad as it sounds.

First, get a hashed password. This is the password you will use in the browser to access Jupyter. Make it something you can remember. I typically don't make them hard since I remove the instance fairly quickly.

In the SSH terminal enter...

$ ipython

This changes the prompt to an interactive Python prompt. In this interactive Python prompt, type the following commands.

>> from IPython.lib import passwd
>> passwd('your-nifty-password')

Copy the returned value, which is prefixed with sha1:, for use in a minute. It helps to save it in a text file so you don't lose it.

Once done in the IPython session, you can use exit() to get out of the session.
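If you are curious what that sha1: value actually is, my understanding is that it is three pieces joined by colons: the algorithm name, a random salt, and the digest of the password plus the salt. The sketch below mimics that layout with the standard library so you can see why storing the hash is safer than storing the password. The function names are my own, and this is a stand-in to show the idea, not the library's code, so keep using the value from passwd() in your configuration.

```python
import hashlib
import secrets


def make_hash(passphrase, salt=None):
    """Build a 'sha1:salt:digest' string in the style of the hash shown above."""
    if salt is None:
        salt = secrets.token_hex(6)  # 12 random hexadecimal characters
    digest = hashlib.sha1((passphrase + salt).encode("utf-8")).hexdigest()
    return f"sha1:{salt}:{digest}"


def check_hash(hashed, passphrase):
    """Re-hash the candidate passphrase with the stored salt and compare."""
    algorithm, salt, _digest = hashed.split(":", 2)
    candidate = make_hash(passphrase, salt)
    return algorithm == "sha1" and secrets.compare_digest(hashed, candidate)


hashed = make_hash("your-nifty-password")
print(hashed)
print(check_hash(hashed, "your-nifty-password"))
print(check_hash(hashed, "not-the-password"))
```

The server only ever needs the salted hash to check what you type into the browser, so the readable password never has to live in the configuration file.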

Now, create a Jupyter configuration file.

$ jupyter notebook --generate-config

Next, edit the file using nano.

$ nano ~/.jupyter/jupyter_notebook_config.py

In nano, add the following lines to the top, replacing the password hash with the value you just copied from above.

c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.password = u'passwd_hash'
c.NotebookApp.port = 8888

If you really want to find them, these settings are buried in the settings file somewhere, but it is just easier to put them at the top and be done with it.

Again, Ctrl+X, y, and Enter to save and exit nano.

See, that was not as hard as it sounded at first, was it?

Run Jupyter using Tmux

From here, if you want to leave the SSH session open while you work, you don't have to use Tmux, a terminal multiplexer, but I prefer to be able to start the Jupyter server and disconnect without losing my session in the browser. Tmux enables you to do this.

$ tmux

Now, your terminal session looks a little different.

Begin by activating our new Conda environment.

$ conda activate sg-data

You can now start up Jupyter Lab (the only way to work).

$ jupyter lab --no-browser

From here, if you want to exit to the SSH session, you can by hitting Ctrl+B and then D. A more complete list of Tmux commands can be found in this cheatsheet.

Finally, this is the fun part, where we can start playing around. Find the URL of your running instance back on the instance details page of the AWS Dashboard. Copy the public URL.

Paste this URL into your browser and append port 8888 onto the end. For the example instance I used above, the URL becomes...

http://ec2-18-218-182-84.us-east-2.compute.amazonaws.com:8888

You're presented with a rather boring page with a password prompt. Use the human-readable version of the password.
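If the browser just spins instead of showing the password prompt, the usual suspects are the security group rule for port 8888 or Jupyter not actually running. A quick way to tell the difference from your own machine is to see whether anything answers on that port. This is a small sketch using only the standard library, and the function name is my own.

```python
import socket


def port_is_open(host, port=8888, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False


# Example, using the public DNS name from above (swap in your own):
#   port_is_open("ec2-18-218-182-84.us-east-2.compute.amazonaws.com")
```

A False result while Jupyter is running in Tmux points back at the inbound rules, while True with no page in the browser suggests looking at the Jupyter configuration instead.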

Once in, the notebook you are looking for is ./notebooks/02a-sg-patterns-procure-ec2.ipynb. However, this post is already way too long, so I'll save that for a part deux. There's nothing like a good sequel, right?