Complete Guide to Data Profiling: Techniques, SQL Examples & Best Practices
Data profiling is the foundation of data quality. Learn how to analyze your data systematically, write profiling queries in SQL, and choose tools that automate the process.
What is Data Profiling?
Data profiling is the systematic examination of data to understand its structure, content, relationships, and quality characteristics. Think of it as running a comprehensive health check on your data.
When you profile a dataset, you're answering questions like:
- How many rows and columns exist?
- What data types are in each column?
- How many null or missing values are there?
- What's the distribution of values?
- Are there duplicates or unexpected patterns?
- How do tables relate to each other?
Without profiling, you're working with data blindly. You might assume a column always has values when it's actually 40% null. You might expect numeric IDs when they're actually strings with prefixes. Profiling reveals reality.
Data Profiling vs Data Quality
Profiling tells you what your data looks like: "This column has 15% null values."
Quality tells you if that's acceptable: "Our threshold is 5%, so this is a problem."
Profiling is the foundation that makes quality measurement possible.
Types of Data Profiling
Data profiling is typically divided into three categories, each revealing different aspects of your data:
1. Structure Discovery
Structure profiling examines the technical characteristics of your data:
- Data types: String, integer, float, boolean, timestamp, etc.
- Formats: Date patterns (YYYY-MM-DD vs MM/DD/YYYY), phone numbers, emails
- Lengths: Min/max character lengths for text fields
- Precision: Decimal places in numeric columns
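Structure checks like these translate directly into SQL. As a minimal sketch, assuming a hypothetical `customers` table with a `phone` text column (adjust names to your schema):

```sql
-- Sketch: length and null checks for a text column
-- (customers.phone is a placeholder; substitute your own table and column)
SELECT
  MIN(LENGTH(phone)) as min_length,
  MAX(LENGTH(phone)) as max_length,
  COUNT(*) - COUNT(phone) as null_count
FROM customers;
```

A wide gap between min and max length is often the first hint of mixed formats in a field.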
2. Content Discovery
Content profiling analyzes the actual values in your data:
- Null rates: Percentage of missing values
- Unique counts: Cardinality of each column
- Value distributions: Histograms and frequency counts
- Statistical measures: Min, max, mean, median, standard deviation
- Pattern analysis: Common formats, outliers, anomalies
3. Relationship Discovery
Relationship profiling identifies connections between data elements:
- Primary keys: Columns that uniquely identify rows
- Foreign keys: Columns that reference other tables
- Referential integrity: Whether FK values exist in parent tables
- Cross-table consistency: Values that should match across tables
Data Profiling in SQL: Practical Queries
Here are ready-to-use SQL queries for profiling your data. These work on most SQL databases (Snowflake, PostgreSQL, BigQuery, etc.) with minor syntax adjustments.
Basic Table Profile
Get an overview of your table:
```sql
-- Table overview
SELECT
  COUNT(*) as row_count,
  COUNT(*) - COUNT(DISTINCT primary_key_column) as duplicate_count
FROM your_table;
```

Column-Level Statistics
Profile a specific column:
```sql
-- Comprehensive column profile
SELECT
  COUNT(*) as total_rows,
  COUNT(column_name) as non_null_count,
  COUNT(*) - COUNT(column_name) as null_count,
  ROUND(100.0 * (COUNT(*) - COUNT(column_name)) / COUNT(*), 2) as null_percentage,
  COUNT(DISTINCT column_name) as unique_values,
  MIN(column_name) as min_value,
  MAX(column_name) as max_value
FROM your_table;
```

Numeric Column Statistics
For numeric columns, add statistical measures:
```sql
-- Numeric column profile
SELECT
  COUNT(*) as total_rows,
  COUNT(amount) as non_null_count,
  ROUND(AVG(amount), 2) as mean_value,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) as median_value,
  MIN(amount) as min_value,
  MAX(amount) as max_value,
  ROUND(STDDEV(amount), 2) as std_deviation
FROM orders;
```

Value Distribution (Top Values)
Find the most common values in a column:
```sql
-- Top 10 most frequent values
SELECT
  status,
  COUNT(*) as frequency,
  ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) as percentage
FROM orders
GROUP BY status
ORDER BY frequency DESC
LIMIT 10;
```

Data Freshness Check
Check how recent your data is:
```sql
-- Data freshness profile
-- (DATEDIFF syntax shown is Snowflake; adjust for your database)
SELECT
  MIN(created_at) as earliest_record,
  MAX(created_at) as latest_record,
  DATEDIFF('hour', MAX(created_at), CURRENT_TIMESTAMP()) as hours_since_last_update,
  COUNT(DISTINCT DATE(created_at)) as days_with_data
FROM events;
```

Full Table Profile Query
Profile all columns in one go (Snowflake/PostgreSQL syntax):
```sql
-- Generate profile for all columns (run for each column)
WITH column_stats AS (
  SELECT
    'customer_id' as column_name,
    COUNT(*) as total_rows,
    COUNT(customer_id) as non_null,
    COUNT(DISTINCT customer_id) as unique_values,
    'INTEGER' as inferred_type
  FROM customers
  UNION ALL
  SELECT
    'email' as column_name,
    COUNT(*) as total_rows,
    COUNT(email) as non_null,
    COUNT(DISTINCT email) as unique_values,
    'STRING' as inferred_type
  FROM customers
  -- Add more columns as needed
)
SELECT
  column_name,
  total_rows,
  non_null,
  total_rows - non_null as null_count,
  ROUND(100.0 * (total_rows - non_null) / total_rows, 2) as null_pct,
  unique_values,
  inferred_type
FROM column_stats;
```

Data Profiling Checklist
Use this checklist to ensure you're covering all aspects of data profiling:
Table Level
- ☐ Total row count
- ☐ Duplicate rows (if primary key exists)
- ☐ Table size (bytes)
- ☐ Last modified timestamp
Column Level
- ☐ Data type verification
- ☐ Null count and percentage
- ☐ Unique value count (cardinality)
- ☐ Min/Max values
- ☐ Value distribution (top N values)
- ☐ Pattern detection (for strings)
For Numeric Columns
- ☐ Mean, median, mode
- ☐ Standard deviation
- ☐ Outlier detection (values beyond 3 std dev)
- ☐ Negative values (if unexpected)
For Date/Time Columns
- ☐ Date range (min to max)
- ☐ Freshness (time since last update)
- ☐ Gaps in time series
- ☐ Future dates (if unexpected)
Relationships
- ☐ Foreign key integrity
- ☐ Orphaned records
- ☐ Cross-table value consistency
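The relationship items above can be checked with a LEFT JOIN. As a minimal sketch, assuming a hypothetical `orders.customer_id` that should reference `customers.id` (names are illustrative; adjust to your schema):

```sql
-- Sketch: count orphaned orders whose customer_id has no match in customers
-- (orders, customers, customer_id, and id are placeholder names)
SELECT COUNT(*) as orphaned_orders
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
WHERE o.customer_id IS NOT NULL
  AND c.id IS NULL;
```

A result above zero means referential integrity is broken; the same pattern, with the join reversed, finds parent rows that no child ever references.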
Data Profiling Tools Compared
Choosing the right tool depends on your team size, technical expertise, and automation needs. Here's how popular options compare:
| Tool | Approach | Best For | Profiling Style |
|---|---|---|---|
| Manual SQL | Write queries | Ad-hoc analysis | Manual, one-time |
| Sparvi | Automated SaaS | Small teams (3-15) | Automatic, continuous |
| Great Expectations | Python code | Python-heavy teams | Code-driven, flexible |
| Soda | YAML config | CI/CD integration | Config-driven |
| Monte Carlo | Automated SaaS | Enterprise | Automatic, ML-powered |
| dbt tests | SQL + YAML | dbt users | Build-time validation |
When to Use Each Approach
- Manual SQL: One-off investigations, learning what your data looks like
- Sparvi: Ongoing monitoring with minimal setup, automated baselines
- Great Expectations: Complex custom validations, Python-native workflows
- Soda: CI/CD pipeline integration, version-controlled configs
- Monte Carlo: Large-scale enterprise data estates, data mesh architectures
Automating Data Profiling
Manual SQL profiling works for ad-hoc analysis, but you need automation for continuous monitoring. Here's why:
The Problem with Manual Profiling
- Time-consuming: Writing queries for every table is tedious
- Not continuous: You only see problems when you think to check
- No baselines: You can't detect drift without historical data
- No alerts: Issues go unnoticed until they cause problems
What Automated Profiling Provides
- Instant profiles: Connect your warehouse and see all tables profiled automatically
- Continuous updates: Profiles refresh on schedule, catching changes
- Historical tracking: Compare today's profile to last week's
- Anomaly detection: Get alerted when metrics deviate from baselines
- Schema change detection: Know when columns are added, removed, or changed
Automate Your Data Profiling
Sparvi automatically profiles your tables when you connect your data warehouse. Get instant visibility into row counts, null rates, distributions, and more—with continuous monitoring to catch changes before they cause problems.
Try Sparvi Free

Frequently Asked Questions
What is data profiling?
Data profiling is the systematic analysis of datasets to understand their structure, content, quality, and statistical characteristics. It examines metadata like data types, null rates, unique values, and distributions to provide a complete picture of your data.
How do you profile data in SQL?
To profile data in SQL, use queries that calculate statistics like COUNT(*) for row counts, COUNT(DISTINCT column) for unique values, SUM(CASE WHEN column IS NULL THEN 1 ELSE 0 END) for null counts, and MIN/MAX/AVG for numeric ranges. Combine these to build a comprehensive profile.
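Combined into a single query, that pattern looks roughly like this (with `your_table` and `column_name` as placeholders):

```sql
-- Sketch of the profiling pattern described in the answer above
SELECT
  COUNT(*) as total_rows,
  SUM(CASE WHEN column_name IS NULL THEN 1 ELSE 0 END) as null_count,
  COUNT(DISTINCT column_name) as unique_values,
  MIN(column_name) as min_value,
  MAX(column_name) as max_value
FROM your_table;
```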
What are the types of data profiling?
The three main types of data profiling are: structure discovery (analyzing data types, formats, and patterns), content discovery (examining actual values, distributions, and statistics), and relationship discovery (identifying foreign keys, dependencies, and cross-table relationships).
What is the difference between data profiling and data quality?
Data profiling is the analysis process that discovers what your data looks like, while data quality is the assessment of whether that data meets your standards. Profiling tells you "15% of emails are null" while quality tells you "null rate should be under 5%, so this is a problem."
What tools are used for data profiling?
Data profiling tools range from manual SQL queries to automated solutions. Popular options include Sparvi (automated, continuous profiling), Great Expectations (code-based), Soda (YAML-based), and enterprise tools like Monte Carlo (ML-powered). The best choice depends on your team size and automation needs.