Install Microsoft Presidio on Ubuntu 18.04

TLDR

Skip to the bottom. The code is down there.

What is Presidio?

Presidio is an anonymization tool created by Microsoft and open-sourced by some sweet unicorn in those hallowed halls. A few minutes spent in your favorite search engine poking around for redaction services or redaction libraries will yield a great big ole null set: results = {}. Why? Where are the open-source goodies we developers have become accustomed to? Well, redaction is hard. It’s not just regex ad nauseam until proper nouns and dates have been expunged. It’s a melange of natural language processing, machine learning, subject matter expertise, and diligent underpaid associates. So what is redaction? Redaction is the act of censoring or obscuring parts of a text for legal or security purposes.

Wicked fast and sexy, right?

Google Trends shows search interest in “redaction” following a roughly linear trend over time. There is a single outlier in April 2019. Any guesses? You got it: the Mueller Report. “Redacted” crashed into the American vocabulary like Kool-Aid Man.

Linear search trend for “redaction” with single outlier

You can see from the image of the Mueller Report below that redaction plays a key role in handling sensitive information. The output looks more like the work of Winston Smith from the Records Department. What other Winston Smiths are out there determining what information is sensitive?

Mueller Report redactions (shown in solid black)

Personally Identifiable Information (PII)

The full scope of PII can be explored in the wiki article on the topic; for now, let us assume that the General Data Protection Regulation (GDPR) covers a sufficient basis. Individuals are incrementally wresting their data rights back from data-hungry corporations seeking to profit off behavior patterns expressed via data feeds.

Enter redaction. While the behavior patterns are still present, linking those patterns to you by name becomes a monumental effort. Names, phone numbers, dates, crypto wallets, bank numbers, social security numbers, and any other custom pattern can be detected and scrubbed from the data set.
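To make "custom pattern" concrete, here is a toy, regex-only sketch that scrubs US-style phone numbers from a line of text. The function name redact_phone and the <PHONE> placeholder are made up for illustration, and this is not how Presidio works internally; as noted earlier, real redaction layers NLP and context on top of patterns like this one.

```shell
#!/bin/bash
# Toy regex-only redaction: replaces US-style phone numbers such as
# "(212) 555-1234" or "212-555-1234" with a placeholder.
# Illustration only; Presidio combines patterns with NLP and context.
redact_phone() {
  sed -E 's/\(?[0-9]{3}\)?[ -]?[0-9]{3}-[0-9]{4}/<PHONE>/g'
}

echo "I called him on (212) 555-1234 yesterday." | redact_phone
```

A pattern like this catches the easy cases, but it has no idea that "John Smith" is a name or that "Seattle" is a location, which is where a tool like Presidio earns its keep.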

You can try the demo here: https://presidio-demo.azurewebsites.net/

Installation

I tested this installation on a t2.medium with an extra 30 GB of storage. The smaller EC2 instances, short on memory, were crippled from the start. Disk space was brimming with neural nets and RPC definitions before the Docker containers could complete their builds. Bumping up to a larger instance (good-bye free tier) with the extra storage provided just the right amount of wiggle room to make it all work.
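If you want to know whether your box has that wiggle room before you start, a preflight check along these lines might help. The thresholds (20 GB free, 3500 MB of RAM) are my own guesses, not numbers from Presidio, and the sketch assumes Docker stores its images on the root filesystem and that you are on Linux with GNU df.

```shell
#!/bin/bash
# Hypothetical thresholds; the only data point is that a t2.medium
# (4 GiB RAM) with 30 GB of extra storage worked. Adjust to taste.
MIN_DISK_GB=${MIN_DISK_GB:-20}
MIN_MEM_MB=${MIN_MEM_MB:-3500}

# meets_minimum ACTUAL REQUIRED -> succeeds when ACTUAL >= REQUIRED
meets_minimum() {
  [ "$1" -ge "$2" ]
}

# Free space in GB on the root filesystem (GNU df; assumed Docker location)
free_disk_gb=$(df -BG --output=avail / 2>/dev/null | tail -n 1 | tr -dc '0-9')
# Total memory in MB, read from /proc/meminfo (Linux only)
total_mem_mb=$(awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo 2>/dev/null)

meets_minimum "${free_disk_gb:-0}" "$MIN_DISK_GB" \
  || echo "Low disk: ${free_disk_gb:-0}G free, want ${MIN_DISK_GB}G"
meets_minimum "${total_mem_mb:-0}" "$MIN_MEM_MB" \
  || echo "Low memory: ${total_mem_mb:-0}MB total, want ${MIN_MEM_MB}MB"
```

Both thresholds can be overridden from the environment, for example MIN_DISK_GB=30 bash preflight.sh, where preflight.sh is whatever you name the file.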

Installing Presidio on a fresh AWS EC2 Ubuntu 18.04 instance requires fulfilling multiple dependencies. This script will ease some of your pain. If you are installing on your dev machine, you most likely have Go, Make, and Docker already installed; building from source will be a cakewalk.

#!/bin/bash
# Make sure to add your docker credentials to the environment!
# export USERNAME=my_docker_username
# export PASSWORD=my_docker_password
### Install GoLang
# Standard preinstall steps
sudo apt-get -y update
sudo apt-get -y upgrade
# Download tarball and extract
cd /tmp
wget https://dl.google.com/go/go1.11.linux-amd64.tar.gz
sudo tar -xvf go1.11.linux-amd64.tar.gz
sudo mv go /usr/local
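# Update env vars (single quotes defer expansion until .bashrc is sourced)
echo 'export GOROOT=/usr/local/go' >> "$HOME/.bashrc"
echo 'export GOPATH=$HOME/go' >> "$HOME/.bashrc"
echo 'export PATH=$GOPATH/bin:$GOROOT/bin:$PATH' >> "$HOME/.bashrc"
# Export for the rest of this script too, so later steps can find go
export GOROOT=/usr/local/go GOPATH="$HOME/go"
export PATH="$GOPATH/bin:$GOROOT/bin:$PATH"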

### Install Make
sudo apt install -y make
### Install docker
# Installation requirements
sudo apt-get -y update
sudo apt-get -y install \
apt-transport-https \
ca-certificates \
curl \
gnupg-agent \
software-properties-common
# Get the keys
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository \
"deb [arch=amd64] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) \
stable"
# Another update :P
sudo apt-get -y update
sudo apt-get -y install docker-ce docker-ce-cli containerd.io
# Avoid those pesky permissions
sudo usermod -aG docker ${USER}
sudo chmod 666 /var/run/docker.sock # This is frowned upon :(
# Must login to docker! Don't forget to set your info
# --password-stdin keeps the password out of the process list
echo "$PASSWORD" | docker login --username "$USERNAME" --password-stdin
# Clone Presidio
cd
git clone https://github.com/microsoft/presidio.git
cd presidio
bash build.sh # Do the thing
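If the build falls over, a quick sanity check is to confirm that the three dependencies named above actually landed on your PATH. This is a small sketch; missing_tools is a made-up helper name, not part of Presidio or its build script.

```shell
#!/bin/bash
# missing_tools TOOL... -> prints each tool that is not on PATH and
# returns non-zero if any of them are missing.
missing_tools() {
  local missing=0
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# The three dependencies the install script sets up
missing_tools go make docker && echo "toolchain looks complete"
```

If go shows up as missing right after the install, open a new shell or source ~/.bashrc so the updated PATH takes effect.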

Testing

Just kidding. You should do your testing. There are multiple examples on the GitHub repo. I’ve copied one below, along with its output.

# From the presidio github
echo -n '{"text":"John Smith lives in New York. We met yesterday morning in Seattle. I called him before on (212) 555-1234 to verify the appointment. He also told me that his drivers license is AC333991", "analyzeTemplate":{"allFields":true} }' | http <api-service-address>/api/v1/projects/<my-project>/analyze
The output from the test case
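The http command in that example is HTTPie, which is not installed by the script above. If you would rather stick with curl, the same request can probably be sent as sketched here. I am assuming the endpoint accepts a plain JSON POST, which is what HTTPie sends by default when fed data on stdin. API_ADDRESS and PROJECT are hypothetical variable names standing in for the two placeholders in the original command.

```shell
#!/bin/bash
# Placeholders carried over from the example; fill in for your deployment.
API_ADDRESS=${API_ADDRESS:-"<api-service-address>"}
PROJECT=${PROJECT:-"<my-project>"}

# Prints the JSON body for an analyze request over the given text.
# Assumes the text contains no double quotes or backslashes.
analyze_payload() {
  printf '{"text":"%s","analyzeTemplate":{"allFields":true}}' "$1"
}

# Sends the payload with curl instead of HTTPie (assumed equivalent).
analyze() {
  curl -s -X POST -H 'Content-Type: application/json' \
    -d "$(analyze_payload "$1")" \
    "$API_ADDRESS/api/v1/projects/$PROJECT/analyze"
}

# Show the payload; swap in the analyze call once the API is reachable.
analyze_payload "John Smith lives in New York."
# analyze "John Smith lives in New York."
```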

Miles Hill
