Data Masking and Anonymization with Python

Irvic May 3, 2023 Data Security

Data masking and anonymization protect sensitive information while still allowing data engineers and analysts to work with the data. This guide covers common data masking and anonymization methods and their use cases, and provides code examples for masking data before it is used for analytics.

Data Masking

Data masking is the process of obscuring sensitive information by replacing it with fictitious or scrambled data while preserving the original data format. Some common data masking techniques include:

Substitution
Shuffling
Masking out
Number and date variance
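
The four techniques listed above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than a production implementation: the function names, the name pool, and the default parameters (10% variance, 30 days, last four digits visible) are assumptions made for this example.

```python
import random
from datetime import date, timedelta

rng = random.Random(42)  # fixed seed so the sketch is repeatable

def substitute_name(_original, pool=("Alex", "Sam", "Taylor")):
    """Substitution: replace the value with a fictitious one from a pool."""
    return rng.choice(pool)

def shuffle_column(values):
    """Shuffling: keep the same set of values but break the link to each row."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled

def mask_out(phone, visible=4, mask_char="X"):
    """Masking out: hide every digit except the last `visible` ones."""
    total_digits = sum(ch.isdigit() for ch in phone)
    seen = 0
    out = []
    for ch in phone:
        if ch.isdigit():
            seen += 1
            out.append(ch if seen > total_digits - visible else mask_char)
        else:
            out.append(ch)  # keep separators so the format is preserved
    return "".join(out)

def vary_number(value, pct=0.10):
    """Number variance: shift a number by up to +/- pct of its value."""
    return value * (1 + rng.uniform(-pct, pct))

def vary_date(d, max_days=30):
    """Date variance: shift a date by up to +/- max_days days."""
    return d + timedelta(days=rng.randint(-max_days, max_days))

print(mask_out("555-123-4567"))  # XXX-XXX-4567
```

Note that shuffling only helps when the column has enough distinct values; with three rows, as in the sample data used later, a shuffled column is easy to reverse by inspection.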

Data Anonymization

Data anonymization is the process of removing or transforming personally identifiable information (PII) in a dataset so that identifying individuals becomes difficult or impossible. Common data anonymization techniques include:

Generalization
Suppression
Pseudonymization
Differential privacy
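
As a rough sketch of the four techniques above, the following standard-library snippet shows one simple form of each. The function names, the age field, and the key are hypothetical and chosen only for illustration. The differential privacy part uses the Laplace mechanism for a count query, which is one common approach and not the only one.

```python
import hashlib
import hmac
import random

def generalize_age(age, width=10):
    """Generalization: replace an exact age with a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def suppress(record, fields=("email", "phone")):
    """Suppression: drop the listed fields from a record entirely."""
    return {k: v for k, v in record.items() if k not in fields}

def pseudonymize(value, key):
    """Pseudonymization: replace a value with a truncated keyed hash (HMAC-SHA256)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def laplace_noise(scale, rng):
    """Laplace(0, scale) noise, drawn as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_count(true_count, epsilon, rng):
    """Differential privacy: a count has sensitivity 1, so add Laplace(1/epsilon) noise."""
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)           # seeded only to make the sketch repeatable
key = b"hypothetical-secret"     # assumption: a real key would come from a secrets store

print(generalize_age(34))        # 30-39
print(suppress({"user_id": 1, "email": "johndoe@email.com", "phone": "555-123-4567"}))
print(pseudonymize("johndoe@email.com", key))
print(dp_count(3, epsilon=1.0, rng=rng))
```

Pseudonymized data can still be linked back to a person by anyone who holds the key, so it is usually treated as weaker protection than full anonymization.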

Use Case: Masking and Anonymizing Data for Analytics

In this use case, we will demonstrate how to import data into a SQLite database, mask sensitive information, and use the masked data for analytics.

Importing Data into a SQLite Database

First, we will create a sample CSV file containing user information, including sensitive data such as email addresses and phone numbers.

sample_data.csv:

user_id,first_name,last_name,email,phone
1,John,Doe,johndoe@email.com,555-123-4567
2,Jane,Smith,janesmith@email.com,555-987-6543
3,Michael,Johnson,michaelj@email.com,555-321-6789

Next, we will create a SQLite database and import the CSV data.

import sqlite3
import pandas as pd

# Create a SQLite database and connect to it
conn = sqlite3.connect('user_data.db')

# Read CSV data into a pandas DataFrame
data = pd.read_csv('sample_data.csv')

# Write the DataFrame to the SQLite database
data.to_sql('users', conn, if_exists='replace', index=False)

# Close the database connection
conn.close()

Masking Sensitive Data

Now that the data is in the SQLite database, we will create a Python script to mask the email addresses and phone numbers using the substitution method.

import sqlite3
import pandas as pd
import random
import string

# Function to generate random email addresses
def generate_random_email():
    domain = random.choice(['example.com', 'sample.com', 'test.com'])
    username = ''.join(random.choices(string.ascii_lowercase + string.digits, k=8))
    return f'{username}@{domain}'

# Function to generate random phone numbers
def generate_random_phone():
    return '555-' + ''.join(random.choices(string.digits, k=3)) + '-' + ''.join(random.choices(string.digits, k=4))

# Connect to the SQLite database
conn = sqlite3.connect('user_data.db')

# Read the user data from the database
data = pd.read_sql_query('SELECT * FROM users', conn)

# Mask email addresses and phone numbers; the original value is deliberately ignored
data['email'] = data['email'].apply(lambda _: generate_random_email())
data['phone'] = data['phone'].apply(lambda _: generate_random_phone())

# Write the masked data to a separate table; the original 'users' table still holds the unmasked values
data.to_sql('masked_users', conn, if_exists='replace', index=False)

# Close the database connection
conn.close()

Using Masked Data for Analytics

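With the email addresses and phone numbers masked, analysts can work with the masked_users table without seeing real contact details. For example, we can count users per email domain or analyze the distribution of first names in the dataset. Note that substitution replaces the whole address, so the domain counts describe the generated values rather than the original ones, and the first and last names in this table are still unmasked.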

import sqlite3
import pandas as pd
import matplotlib.pyplot as plt

# Connect to the SQLite database
conn = sqlite3.connect('user_data.db')

# Read the masked user data from the database
masked_data = pd.read_sql_query('SELECT * FROM masked_users', conn)

# Close the database connection
conn.close()

# Extract email domains and count their occurrences
email_domains = masked_data['email'].apply(lambda x: x.split('@')[1])
domain_counts = email_domains.value_counts()

# Plot the frequency of email domains
plt.bar(domain_counts.index, domain_counts.values)
plt.xlabel('Email Domain')
plt.ylabel('Frequency')
plt.title('Frequency of Email Domains in Masked Data')
plt.show()

# Analyze the distribution of first names
first_name_counts = masked_data['first_name'].value_counts()

# Plot the frequency of first names
plt.bar(first_name_counts.index, first_name_counts.values)
plt.xlabel('First Name')
plt.ylabel('Frequency')
plt.title('Frequency of First Names in Masked Data')
plt.show()

In this example, we have used the masked data to visualize the distribution of email domains and first names without exposing real email addresses or phone numbers. Names are left unchanged here, so a real pipeline would also need to mask or generalize them before sharing the table. Working from masked tables in this way lets data engineers and analysts perform their tasks while supporting compliance with privacy regulations and best practices.
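
If the real distribution of email domains is what the analysis needs, full substitution is not suitable because it replaces the domain as well. A possible alternative, assuming the domain on its own is not considered sensitive in your setting, is to mask out only the local part of the address. The helper name mask_email_local_part is hypothetical.

```python
from collections import Counter

def mask_email_local_part(email, keep=1, mask_char="*"):
    """Mask the part before '@', keeping the first `keep` characters and the domain."""
    local, sep, domain = email.partition("@")
    if not sep:
        # Not a well-formed address: mask everything rather than leak it
        return mask_char * len(email)
    visible = local[:keep]
    return visible + mask_char * (len(local) - len(visible)) + "@" + domain

emails = ["johndoe@email.com", "janesmith@email.com", "michaelj@email.com"]
masked = [mask_email_local_part(e) for e in emails]
print(masked[0])  # j******@email.com

# Domain counts now reflect the original data
domain_counts = Counter(m.split("@")[1] for m in masked)
print(domain_counts)  # Counter({'email.com': 3})
```

The trade-off is that rare domains, such as a small company's own domain, can themselves identify a person, in which case they may need to be generalized or suppressed.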

Data masking and anonymization are key techniques for protecting sensitive information. This guide has given an overview of common methods and their use cases, along with a practical example that data engineers can adapt. By building masking and anonymization into their workflows, data engineers can reduce the exposure of sensitive information while still enabling analytics.