Complete Guide to Data Profiling: Techniques, SQL Examples & Best Practices
Data profiling is the foundation of data quality. Learn how to analyze your data systematically, write profiling queries in SQL, and choose tools that automate the process.
What is Data Profiling?
Data profiling is the systematic examination of data to understand its structure, content, relationships, and quality characteristics. Think of it as running a comprehensive health check on your data.
When you profile a dataset, you're answering questions like:
- How many rows and columns exist?
- What data types are in each column?
- How many null or missing values are there?
- What's the distribution of values?
- Are there duplicates or unexpected patterns?
- How do tables relate to each other?
Without profiling, you're working with data blindly. You might assume a column always has values when it's actually 40% null. You might expect numeric IDs when they're actually strings with prefixes. Profiling reveals reality.
Data Profiling vs Data Quality
Profiling tells you what your data looks like: "This column has 15% null values."
Quality tells you if that's acceptable: "Our threshold is 5%, so this is a problem."
Profiling is the foundation that makes quality measurement possible.
Types of Data Profiling
Data profiling is typically divided into three categories, each revealing different aspects of your data:
1. Structure Discovery
Structure profiling examines the technical characteristics of your data:
- Data types: String, integer, float, boolean, timestamp, etc.
- Formats: Date patterns (YYYY-MM-DD vs MM/DD/YYYY), phone numbers, emails
- Lengths: Min/max character lengths for text fields
- Precision: Decimal places in numeric columns
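Structure checks like these translate directly into SQL. As a minimal sketch, assuming a hypothetical `customers` table with a `phone` text column (adjust names to your schema):

```sql
-- Sketch: length and null checks for a text column
-- (customers.phone is a placeholder; substitute your own table and column)
SELECT
  MIN(LENGTH(phone)) as min_length,
  MAX(LENGTH(phone)) as max_length,
  COUNT(*) - COUNT(phone) as null_count
FROM customers;
```

A wide gap between min and max length is often the first hint of mixed formats in a field.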
2. Content Discovery
Content profiling analyzes the actual values in your data:
- Null rates: Percentage of missing values
- Unique counts: Cardinality of each column
- Value distributions: Histograms and frequency counts
- Statistical measures: Min, max, mean, median, standard deviation
- Pattern analysis: Common formats, outliers, anomalies
3. Relationship Discovery
Relationship profiling identifies connections between data elements:
- Primary keys: Columns that uniquely identify rows
- Foreign keys: Columns that reference other tables
- Referential integrity: Whether FK values exist in parent tables
- Cross-table consistency: Values that should match across tables
Data Profiling in SQL: Practical Queries
Here are ready-to-use SQL queries for profiling your data. These work on most SQL databases (Snowflake, PostgreSQL, BigQuery, etc.) with minor syntax adjustments.
Basic Table Profile
Get an overview of your table:
```sql
-- Table overview
SELECT
  COUNT(*) as row_count,
  COUNT(*) - COUNT(DISTINCT primary_key_column) as duplicate_count
FROM your_table;
```

Column-Level Statistics
Profile a specific column:
```sql
-- Comprehensive column profile
SELECT
  COUNT(*) as total_rows,
  COUNT(column_name) as non_null_count,
  COUNT(*) - COUNT(column_name) as null_count,
  ROUND(100.0 * (COUNT(*) - COUNT(column_name)) / COUNT(*), 2) as null_percentage,
  COUNT(DISTINCT column_name) as unique_values,
  MIN(column_name) as min_value,
  MAX(column_name) as max_value
FROM your_table;
```

Numeric Column Statistics
For numeric columns, add statistical measures:
```sql
-- Numeric column profile
SELECT
  COUNT(*) as total_rows,
  COUNT(amount) as non_null_count,
  ROUND(AVG(amount), 2) as mean_value,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) as median_value,
  MIN(amount) as min_value,
  MAX(amount) as max_value,
  ROUND(STDDEV(amount), 2) as std_deviation
FROM orders;
```

Value Distribution (Top Values)
Find the most common values in a column:
```sql
-- Top 10 most frequent values
SELECT
  status,
  COUNT(*) as frequency,
  ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) as percentage
FROM orders
GROUP BY status
ORDER BY frequency DESC
LIMIT 10;
```

Data Freshness Check
Check how recent your data is:
```sql
-- Data freshness profile
-- (DATEDIFF syntax shown is Snowflake; adjust for your database)
SELECT
  MIN(created_at) as earliest_record,
  MAX(created_at) as latest_record,
  DATEDIFF('hour', MAX(created_at), CURRENT_TIMESTAMP()) as hours_since_last_update,
  COUNT(DISTINCT DATE(created_at)) as days_with_data
FROM events;
```

Full Table Profile Query
Profile all columns in one go (Snowflake/PostgreSQL syntax):
```sql
-- Generate profile for all columns (run for each column)
WITH column_stats AS (
  SELECT
    'customer_id' as column_name,
    COUNT(*) as total_rows,
    COUNT(customer_id) as non_null,
    COUNT(DISTINCT customer_id) as unique_values,
    'INTEGER' as inferred_type
  FROM customers
  UNION ALL
  SELECT
    'email' as column_name,
    COUNT(*) as total_rows,
    COUNT(email) as non_null,
    COUNT(DISTINCT email) as unique_values,
    'STRING' as inferred_type
  FROM customers
  -- Add more columns as needed
)
SELECT
  column_name,
  total_rows,
  non_null,
  total_rows - non_null as null_count,
  ROUND(100.0 * (total_rows - non_null) / total_rows, 2) as null_pct,
  unique_values,
  inferred_type
FROM column_stats;
```

Data Profiling Checklist
Use this checklist to ensure you're covering all aspects of data profiling:
Table Level
- ☐ Total row count
- ☐ Duplicate rows (if primary key exists)
- ☐ Table size (bytes)
- ☐ Last modified timestamp
Column Level
- ☐ Data type verification
- ☐ Null count and percentage
- ☐ Unique value count (cardinality)
- ☐ Min/Max values
- ☐ Value distribution (top N values)
- ☐ Pattern detection (for strings)
For Numeric Columns
- ☐ Mean, median, mode
- ☐ Standard deviation
- ☐ Outlier detection (values beyond 3 std dev)
- ☐ Negative values (if unexpected)
For Date/Time Columns
- ☐ Date range (min to max)
- ☐ Freshness (time since last update)
- ☐ Gaps in time series
- ☐ Future dates (if unexpected)
Relationships
- ☐ Foreign key integrity
- ☐ Orphaned records
- ☐ Cross-table value consistency
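The relationship items above can be checked with a LEFT JOIN. As a minimal sketch, assuming a hypothetical `orders.customer_id` that should reference `customers.id` (names are illustrative; adjust to your schema):

```sql
-- Sketch: count orphaned orders whose customer_id has no match in customers
-- (orders, customers, customer_id, and id are placeholder names)
SELECT COUNT(*) as orphaned_orders
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
WHERE o.customer_id IS NOT NULL
  AND c.id IS NULL;
```

A result above zero means referential integrity is broken; the same pattern, with the join reversed, finds parent rows that no child ever references.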
Data Profiling Tools Compared
Choosing the right tool depends on your team size, technical expertise, and automation needs. Here's how popular options compare:
| Tool | Approach | Best For | Profiling Style |
|---|---|---|---|
| Manual SQL | Write queries | Ad-hoc analysis | Manual, one-time |
| Sparvi | Automated SaaS | Small teams (3-15) | Automatic, continuous |
| Great Expectations | Python code | Python-heavy teams | Code-driven, flexible |
| Soda | YAML config | CI/CD integration | Config-driven |
| Monte Carlo | Automated SaaS | Enterprise | Automatic, ML-powered |
| dbt tests | SQL + YAML | dbt users | Build-time validation |
When to Use Each Approach
- Manual SQL: One-off investigations, learning what your data looks like
- Sparvi: Ongoing monitoring with minimal setup, automated baselines
- Great Expectations: Complex custom validations, Python-native workflows
- Soda: CI/CD pipeline integration, version-controlled configs
- Monte Carlo: Large-scale enterprise data estates, data mesh architectures
Automating Data Profiling
Manual SQL profiling works for ad-hoc analysis, but you need automation for continuous monitoring. Here's why:
The Problem with Manual Profiling
- Time-consuming: Writing queries for every table is tedious
- Not continuous: You only see problems when you think to check
- No baselines: You can't detect drift without historical data
- No alerts: Issues go unnoticed until they cause problems
What Automated Profiling Provides
- Instant profiles: Connect your warehouse and see all tables profiled automatically
- Continuous updates: Profiles refresh on schedule, catching changes
- Historical tracking: Compare today's profile to last week's
- Anomaly detection: Get alerted when metrics deviate from baselines
- Schema change detection: Know when columns are added, removed, or changed
Automate Your Data Profiling
Sparvi automatically profiles your tables when you connect your data warehouse. Get instant visibility into row counts, null rates, distributions, and more—with continuous monitoring to catch changes before they cause problems.
Try Sparvi Free

Frequently Asked Questions
What is data profiling?
Data profiling is the systematic analysis of datasets to understand their structure, content, quality, and statistical characteristics. It examines metadata like data types, null rates, unique values, and distributions to provide a complete picture of your data.
How do you profile data in SQL?
To profile data in SQL, use queries that calculate statistics like COUNT(*) for row counts, COUNT(DISTINCT column) for unique values, SUM(CASE WHEN column IS NULL THEN 1 ELSE 0 END) for null counts, and MIN/MAX/AVG for numeric ranges. Combine these to build a comprehensive profile.
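Combined into a single query, that pattern looks roughly like this (with `your_table` and `column_name` as placeholders):

```sql
-- Sketch of the profiling pattern described in the answer above
SELECT
  COUNT(*) as total_rows,
  SUM(CASE WHEN column_name IS NULL THEN 1 ELSE 0 END) as null_count,
  COUNT(DISTINCT column_name) as unique_values,
  MIN(column_name) as min_value,
  MAX(column_name) as max_value
FROM your_table;
```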
What are the types of data profiling?
The three main types of data profiling are: structure discovery (analyzing data types, formats, and patterns), content discovery (examining actual values, distributions, and statistics), and relationship discovery (identifying foreign keys, dependencies, and cross-table relationships).
What is the difference between data profiling and data quality?
Data profiling is the analysis process that discovers what your data looks like, while data quality is the assessment of whether that data meets your standards. Profiling tells you "15% of emails are null" while quality tells you "null rate should be under 5%, so this is a problem."
What tools are used for data profiling?
Data profiling tools range from manual SQL queries to automated solutions. Popular options include Sparvi (automated, continuous profiling), Great Expectations (code-based), Soda (YAML-based), and enterprise tools like Monte Carlo (ML-powered). The best choice depends on your team size and automation needs.