Guide · 15 min read

Complete Guide to Data Profiling: Techniques, SQL Examples & Best Practices

Data profiling is the foundation of data quality. Learn how to analyze your data systematically, write profiling queries in SQL, and choose tools that automate the process.

By Sparvi Team · December 9, 2025

What is Data Profiling?

Data profiling is the systematic examination of data to understand its structure, content, relationships, and quality characteristics. Think of it as running a comprehensive health check on your data.

When you profile a dataset, you're answering questions like:

  • How many rows and columns exist?
  • What data types are in each column?
  • How many null or missing values are there?
  • What's the distribution of values?
  • Are there duplicates or unexpected patterns?
  • How do tables relate to each other?

Without profiling, you're working with data blindly. You might assume a column always has values when it's actually 40% null. You might expect numeric IDs when they're actually strings with prefixes. Profiling reveals reality.

Data Profiling vs Data Quality

Profiling tells you what your data looks like: "This column has 15% null values."
Quality tells you if that's acceptable: "Our threshold is 5%, so this is a problem."
Profiling is the foundation that makes quality measurement possible.
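
To make the difference concrete, here is a minimal sketch that turns a profiling metric into a pass/fail quality check. The customers table, email column, and 5% threshold are all illustrative:

-- Turn a profiling metric (null rate) into a quality check (illustrative names)
SELECT
    ROUND(100.0 * (COUNT(*) - COUNT(email)) / COUNT(*), 2) AS email_null_pct,
    CASE
        WHEN (COUNT(*) - COUNT(email)) > 0.05 * COUNT(*) THEN 'FAIL'  -- over the 5% threshold
        ELSE 'PASS'
    END AS null_rate_check
FROM customers;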

Types of Data Profiling

Data profiling is typically divided into three categories, each revealing different aspects of your data:

1. Structure Discovery

Structure profiling examines the technical characteristics of your data:

  • Data types: String, integer, float, boolean, timestamp, etc.
  • Formats: Date patterns (YYYY-MM-DD vs MM/DD/YYYY), phone numbers, emails
  • Lengths: Min/max character lengths for text fields
  • Precision: Decimal places in numeric columns
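
Much of this structural metadata can be read straight from the information schema. A minimal sketch, assuming a hypothetical customers table (the exact column set of information_schema.columns varies by warehouse):

-- Declared type, length, and precision for every column in a table
SELECT
    column_name,
    data_type,
    character_maximum_length,
    numeric_precision,
    numeric_scale
FROM information_schema.columns
WHERE table_name = 'customers'
ORDER BY ordinal_position;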

2. Content Discovery

Content profiling analyzes the actual values in your data:

  • Null rates: Percentage of missing values
  • Unique counts: Cardinality of each column
  • Value distributions: Histograms and frequency counts
  • Statistical measures: Min, max, mean, median, standard deviation
  • Pattern analysis: Common formats, outliers, anomalies
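
Null rates, cardinality, and distributions are covered by the queries later in this guide. Pattern analysis is worth a quick sketch here, assuming a hypothetical customers.email column; LIKE is portable, and regex functions allow stricter checks:

-- Rough pattern check: share of non-null emails shaped like an address (illustrative names)
SELECT
    ROUND(100.0 * SUM(CASE WHEN email LIKE '%_@_%._%' THEN 1 ELSE 0 END) / COUNT(*), 2) AS pct_email_shaped
FROM customers
WHERE email IS NOT NULL;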

3. Relationship Discovery

Relationship profiling identifies connections between data elements:

  • Primary keys: Columns that uniquely identify rows
  • Foreign keys: Columns that reference other tables
  • Referential integrity: Whether FK values exist in parent tables
  • Cross-table consistency: Values that should match across tables
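
The most common relationship check is referential integrity. A minimal sketch, assuming hypothetical orders and customers tables joined on customer_id:

-- Orphaned records: orders that reference a customer that doesn't exist (illustrative names)
SELECT COUNT(*) AS orphaned_orders
FROM orders o
LEFT JOIN customers c
    ON o.customer_id = c.customer_id
WHERE o.customer_id IS NOT NULL
  AND c.customer_id IS NULL;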

Data Profiling in SQL: Practical Queries

Here are ready-to-use SQL queries for profiling your data. These work on most SQL databases (Snowflake, PostgreSQL, BigQuery, etc.) with minor syntax adjustments.

Basic Table Profile

Get an overview of your table:

-- Table overview
SELECT
    COUNT(*) as row_count,
    COUNT(*) - COUNT(DISTINCT primary_key_column) as duplicate_count
FROM your_table;

Column-Level Statistics

Profile a specific column:

-- Comprehensive column profile
SELECT
    COUNT(*) as total_rows,
    COUNT(column_name) as non_null_count,
    COUNT(*) - COUNT(column_name) as null_count,
    ROUND(100.0 * (COUNT(*) - COUNT(column_name)) / COUNT(*), 2) as null_percentage,
    COUNT(DISTINCT column_name) as unique_values,
    MIN(column_name) as min_value,
    MAX(column_name) as max_value
FROM your_table;

Numeric Column Statistics

For numeric columns, add statistical measures:

-- Numeric column profile
SELECT
    COUNT(*) as total_rows,
    COUNT(amount) as non_null_count,
    ROUND(AVG(amount), 2) as mean_value,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) as median_value,  -- Snowflake/PostgreSQL; BigQuery uses PERCENTILE_CONT(amount, 0.5) OVER ()
    MIN(amount) as min_value,
    MAX(amount) as max_value,
    ROUND(STDDEV(amount), 2) as std_deviation
FROM orders;

Value Distribution (Top Values)

Find the most common values in a column:

-- Top 10 most frequent values
SELECT
    status,
    COUNT(*) as frequency,
    ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) as percentage
FROM orders
GROUP BY status
ORDER BY frequency DESC
LIMIT 10;

Data Freshness Check

Check how recent your data is:

-- Data freshness profile
SELECT
    MIN(created_at) as earliest_record,
    MAX(created_at) as latest_record,
    DATEDIFF('hour', MAX(created_at), CURRENT_TIMESTAMP()) as hours_since_last_update,  -- Snowflake syntax; adjust for your warehouse
    COUNT(DISTINCT DATE(created_at)) as days_with_data  -- PostgreSQL: CAST(created_at AS DATE)
FROM events;

Full Table Profile Query

Profile all columns in one go (Snowflake/PostgreSQL syntax):

-- Profile multiple columns in one query (add one UNION ALL block per column)
WITH column_stats AS (
    SELECT
        'customer_id' as column_name,
        COUNT(*) as total_rows,
        COUNT(customer_id) as non_null,
        COUNT(DISTINCT customer_id) as unique_values,
        'INTEGER' as inferred_type
    FROM customers
    UNION ALL
    SELECT
        'email' as column_name,
        COUNT(*) as total_rows,
        COUNT(email) as non_null,
        COUNT(DISTINCT email) as unique_values,
        'STRING' as inferred_type
    FROM customers
    -- Add more columns as needed
)
SELECT
    column_name,
    total_rows,
    non_null,
    total_rows - non_null as null_count,
    ROUND(100.0 * (total_rows - non_null) / total_rows, 2) as null_pct,
    unique_values,
    inferred_type
FROM column_stats;

Data Profiling Checklist

Use this checklist to ensure you're covering all aspects of data profiling:

Table Level

  • ☐ Total row count
  • ☐ Duplicate rows (if primary key exists)
  • ☐ Table size (bytes)
  • ☐ Last modified timestamp

Column Level

  • ☐ Data type verification
  • ☐ Null count and percentage
  • ☐ Unique value count (cardinality)
  • ☐ Min/Max values
  • ☐ Value distribution (top N values)
  • ☐ Pattern detection (for strings)

For Numeric Columns

  • ☐ Mean, median, mode
  • ☐ Standard deviation
  • ☐ Outlier detection (values beyond 3 std dev; see the sketch after this list)
  • ☐ Negative values (if unexpected)
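
For the outlier item above, a z-score filter works in most warehouses. A sketch assuming a hypothetical orders.amount column; 3 standard deviations is a common but arbitrary cutoff:

-- Flag rows more than 3 standard deviations from the mean (illustrative names)
WITH stats AS (
    SELECT AVG(amount) AS mean_amount, STDDEV(amount) AS sd_amount
    FROM orders
)
SELECT o.*
FROM orders o
CROSS JOIN stats s
WHERE s.sd_amount > 0
  AND ABS(o.amount - s.mean_amount) > 3 * s.sd_amount;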

For Date/Time Columns

  • ☐ Date range (min to max)
  • ☐ Freshness (time since last update)
  • ☐ Gaps in time series (see the sketch after this list)
  • ☐ Future dates (if unexpected)
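
For the gaps item, pair each day that has data with the next one and flag non-consecutive pairs. A sketch assuming a hypothetical events.created_at column:

-- Days with data, paired with the next day that has data; non-consecutive pairs are gaps
WITH daily AS (
    SELECT DISTINCT CAST(created_at AS DATE) AS day_with_data
    FROM events
),
paired AS (
    SELECT
        day_with_data,
        LEAD(day_with_data) OVER (ORDER BY day_with_data) AS next_day_with_data
    FROM daily
)
SELECT
    day_with_data AS gap_starts_after,
    next_day_with_data AS data_resumes_on
FROM paired
WHERE next_day_with_data > day_with_data + 1;  -- date + integer works in Snowflake and PostgreSQL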

Relationships

  • ☐ Foreign key integrity
  • ☐ Orphaned records
  • ☐ Cross-table value consistency
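
For cross-table consistency, compare aggregates that should agree. A sketch assuming hypothetical orders and payments tables that should reconcile:

-- Totals that should match across tables (illustrative names)
SELECT
    (SELECT SUM(amount) FROM orders)   AS orders_total,
    (SELECT SUM(amount) FROM payments) AS payments_total,
    (SELECT SUM(amount) FROM orders) - (SELECT SUM(amount) FROM payments) AS discrepancy;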

Data Profiling Tools Compared

Choosing the right tool depends on your team size, technical expertise, and automation needs. Here's how popular options compare:

Tool                 Approach         Best For              Profiling Style
Manual SQL           Write queries    Ad-hoc analysis       Manual, one-time
Sparvi               Automated SaaS   Small teams (3-15)    Automatic, continuous
Great Expectations   Python code      Python-heavy teams    Code-driven, flexible
Soda                 YAML config      CI/CD integration     Config-driven
Monte Carlo          Automated SaaS   Enterprise            Automatic, ML-powered
dbt tests            SQL + YAML       dbt users             Build-time validation

When to Use Each Approach

  • Manual SQL: One-off investigations, learning what your data looks like
  • Sparvi: Ongoing monitoring with minimal setup, automated baselines
  • Great Expectations: Complex custom validations, Python-native workflows
  • Soda: CI/CD pipeline integration, version-controlled configs
  • Monte Carlo: Large-scale enterprise data estates, data mesh architectures

Automating Data Profiling

Manual SQL profiling works for ad-hoc analysis, but you need automation for continuous monitoring. Here's why:

The Problem with Manual Profiling

  • Time-consuming: Writing queries for every table is tedious
  • Not continuous: You only see problems when you think to check
  • No baselines: You can't detect drift without historical data
  • No alerts: Issues go unnoticed until they cause problems

What Automated Profiling Provides

  • Instant profiles: Connect your warehouse and see all tables profiled automatically
  • Continuous updates: Profiles refresh on schedule, catching changes
  • Historical tracking: Compare today's profile to last week's
  • Anomaly detection: Get alerted when metrics deviate from baselines
  • Schema change detection: Know when columns are added, removed, or changed

Automate Your Data Profiling

Sparvi automatically profiles your tables when you connect your data warehouse. Get instant visibility into row counts, null rates, distributions, and more—with continuous monitoring to catch changes before they cause problems.

Try Sparvi Free

Frequently Asked Questions

What is data profiling?

Data profiling is the systematic analysis of datasets to understand their structure, content, quality, and statistical characteristics. It examines metadata like data types, null rates, unique values, and distributions to provide a complete picture of your data.

How do you profile data in SQL?

To profile data in SQL, use queries that calculate statistics like COUNT(*) for row counts, COUNT(DISTINCT column) for unique values, SUM(CASE WHEN column IS NULL THEN 1 ELSE 0 END) for null counts, and MIN/MAX/AVG for numeric ranges. Combine these to build a comprehensive profile.

What are the types of data profiling?

The three main types of data profiling are: structure discovery (analyzing data types, formats, and patterns), content discovery (examining actual values, distributions, and statistics), and relationship discovery (identifying foreign keys, dependencies, and cross-table relationships).

What is the difference between data profiling and data quality?

Data profiling is the analysis process that discovers what your data looks like, while data quality is the assessment of whether that data meets your standards. Profiling tells you "15% of emails are null" while quality tells you "null rate should be under 5%, so this is a problem."

What tools are used for data profiling?

Data profiling tools range from manual SQL queries to automated solutions. Popular options include Sparvi (automated profiling with ML), Great Expectations (code-based), Soda (YAML-based), and enterprise tools like Monte Carlo. The best choice depends on your team size and automation needs.