Mortgage Data, and Working with Large Datasets

Since The Markup reporters Lauren Kirchner and Emmanuel Martinez released their story on bias in mortgage algorithms, I’ve been digging into the data behind their reporting and looking at potential additional patterns. The story is worth a read, and a re-read. They also do a great job showing their work, which includes releasing the code and data they used for their analysis.

Their reporting is based on the 2019 data, but the Consumer Financial Protection Bureau also has 2020 data, so I figured I’d grab that as well.

This is a sizeable dataset, and even though I have a decent workhorse of a machine, loading the datasets made my computer VERY unhappy.

To work around this, I did two things. First, I pulled the code from the Jupyter notebooks into plain Python scripts, which helped reduce memory usage and CPU load a bit, at least in my setup. But this wasn’t enough to process the full dataset without crashing, so I also temporarily increased my swap space by adding a swap file. I saved the commands as a bash script so I can run it whenever I need a temporary memory boost to prevent crashes.

I’ve worked with large datasets with tens of millions of records in the past, and I have never needed to do this. Writing to swap files can be very slow in its own right, and if there is a better way to prevent crashes when loading large datasets, I’d love to hear it. As I process data, I am deleting dataframes when I no longer need them and using Python’s gc module to free memory, but on my machine the crash happened while loading the datasets, before any of that cleanup could help. I would not recommend using this hack as a permanent solution, or on a machine that is not local.

The commands can be typed out individually, which eliminates the need for a script. But hey – why type out several lines when you can just type out one? This script was used on a Debian-flavored Linux system; YMMV if used in other setups.

In the script, you need to set two variables: the location of the swap file, and the size. Make sure that your hard drive has adequate room to support your swap file.
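One way to check for adequate room before creating the swap file is to compare the space df reports against the size you plan to use. This is a minimal sketch, not part of the original script: has_room is a hypothetical helper name, and the 8 GiB figure is only an example value.

```shell
# Succeeds when the filesystem holding DIR has at least NEEDED_KIB KiB available.
# has_room is a hypothetical helper, not a standard command.
has_room() {
    local dir=$1
    local needed_kib=$2
    local avail_kib
    # df -P prints one POSIX-format line per filesystem; column 4 is available space in KiB.
    avail_kib=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    [ "$avail_kib" -ge "$needed_kib" ]
}

# Example: is there room for an 8 GiB swap file on the root filesystem?
if has_room / $((8 * 1024 * 1024)); then
    echo "enough room"
else
    echo "not enough room"
fi
```

Leaving some headroom beyond the swap size itself is probably wise, since a completely full disk causes problems of its own.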

Before you run the script, run sudo free -h from the command line. This will show your default setup, with the amount of free memory on your system, and your default swap setup. After you run the shell script, re-run sudo free -h to see the changes.
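For a scripted before-and-after comparison, the swap totals that free reports can also be read from /proc/meminfo. This is a rough sketch that assumes a Linux system; swap_kib is a hypothetical helper name, and add-swap.sh in the usage comment stands in for whatever you call the swap script.

```shell
# Print total and free swap in KiB as "TOTAL FREE".
# Reads /proc/meminfo by default (Linux only); another file can be passed instead.
swap_kib() {
    awk '/^SwapTotal:/ {swap_total = $2}
         /^SwapFree:/  {swap_free = $2}
         END {print swap_total, swap_free}' "${1:-/proc/meminfo}"
}

# Usage (add-swap.sh is a placeholder name):
#   before=$(swap_kib)
#   ./add-swap.sh
#   after=$(swap_kib)
#   echo "before: $before"
#   echo "after:  $after"
```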

When you restart your computer, your system reverts to the default setup.

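#!/bin/bash

# Location and size of the temporary swap file; adjust both for your system.
SWAPDIR=/swapfile
SIZE=8G  # example value only; pick what your disk can spare

sudo fallocate -l "$SIZE" "$SWAPDIR"  # reserve the space on disk
sudo chmod 600 "$SWAPDIR"             # restrict access to root
sudo mkswap "$SWAPDIR"                # format the file as swap space
sudo swapon "$SWAPDIR"                # enable it until the next reboot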
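A restart turns the extra swap off, but the swap file itself stays on disk until it is deleted. To switch it off and reclaim the space without rebooting, something like the following should work. This is a sketch under assumptions: it expects the same SWAPDIR path as the script above, and the run and remove_swap helpers and the DRY_RUN switch are hypothetical additions for illustration.

```shell
#!/bin/bash

# Must match the path used when the swap file was created.
SWAPDIR=${SWAPDIR:-/swapfile}
# Set DRY_RUN=1 to print the commands instead of running them.
DRY_RUN=${DRY_RUN:-0}

# Run a command, or just print it when DRY_RUN is 1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Disable the swap file, then delete it.
remove_swap() {
    run sudo swapoff "$SWAPDIR"   # stop using the file as swap
    run sudo rm "$SWAPDIR"        # reclaim the disk space
}
```

Setting DRY_RUN=1 and then calling remove_swap prints the two commands first, which is a cheap way to double-check the path before deleting anything.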