What is Data Validation?
Data validation is the process of checking data against predefined rules, constraints, and formats to ensure it meets data quality standards before being used for analysis or operations.
Data Validation Explained
Think of data validation as a quality control checkpoint. Before data enters your warehouse, feeds your dashboards, or drives business decisions, validation rules check that it meets your requirements.
An order amount should be positive. An email should contain "@". A date shouldn't be in the future. A foreign key should reference an existing record. These are all validation rules—and when data fails them, you want to know immediately.
Without validation, bad data silently enters your systems. You might not discover it until a stakeholder reports wrong numbers in a dashboard—by which point the damage is done.
Types of Data Validation
Data validation rules fall into several categories, each catching different types of problems:
1. Format Validation
Checks that data matches expected patterns and structures:
- Email addresses: Must contain "@" and a valid domain
- Phone numbers: Must match regional format patterns
- Dates: Must be valid dates in expected format (YYYY-MM-DD)
- URLs: Must be properly formatted web addresses
- Credit cards: Must pass Luhn algorithm check
Format validation catches obvious errors like "john.doe@" (missing domain) or "2025-13-45" (invalid date).
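As a minimal sketch, a query like the following flags customers whose email fails a simple pattern. The regular-expression operator and syntax vary by engine (REGEXP in MySQL, ~ in PostgreSQL), so treat this as illustrative:
-- Flag customers whose email does not match a basic address pattern
-- (MySQL-style REGEXP shown; adjust the operator and pattern for your engine)
SELECT customer_id, email
FROM customers
WHERE email NOT REGEXP '^[^@]+@[^@]+[.][^@]+$';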
2. Range Validation
Ensures numeric values fall within acceptable boundaries:
- Age: Must be between 0 and 150
- Percentage: Must be between 0 and 100
- Order amount: Must be positive and below $1M
- Quantity: Must be a positive integer
- Temperature: Must be within physically possible range
Range validation catches data entry errors and unit conversion bugs—like an order amount of $50,000,000 that should have been $500.
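Range rules can be enforced at write time with CHECK constraints, as sketched below. The table and column names are illustrative; the bounds mirror the examples in the list above:
-- Enforce ranges with CHECK constraints (table and column names are illustrative)
ALTER TABLE survey_responses
ADD CONSTRAINT chk_age CHECK (age BETWEEN 0 AND 150);

ALTER TABLE survey_responses
ADD CONSTRAINT chk_completion_pct CHECK (completion_pct BETWEEN 0 AND 100);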
3. Type Validation
Confirms data is the correct type:
- Numeric fields: Should not contain text
- Date fields: Should contain valid timestamps
- Boolean fields: Should only be true/false
- Integer fields: Should not contain decimals
Type validation catches schema drift and ETL bugs where data types change unexpectedly.
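When values arrive as text in a staging table, a safe cast can surface rows whose contents don't match the intended type. The sketch below assumes TRY_CAST (available in SQL Server and Snowflake; other engines offer similar safe-cast functions) and an illustrative staging table:
-- Find staged rows whose text values cannot be cast to the intended type
SELECT raw_id, amount_text, order_date_text
FROM staging_orders
WHERE (amount_text IS NOT NULL AND TRY_CAST(amount_text AS DECIMAL(10,2)) IS NULL)
OR (order_date_text IS NOT NULL AND TRY_CAST(order_date_text AS DATE) IS NULL);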
4. Referential Validation
Verifies relationships between tables:
- Foreign keys: Must reference existing primary keys
- Lookup values: Must exist in reference tables
- Parent-child: Child records must have valid parents
Referential validation catches orphaned records and broken relationships that cause JOIN failures.
5. Uniqueness Validation
Ensures no unwanted duplicates:
- Primary keys: Must be unique
- Email addresses: One per customer
- Transaction IDs: No duplicate entries
Uniqueness validation catches duplicate data that inflates metrics and causes incorrect aggregations.
6. Completeness Validation
Checks that required data is present:
- NOT NULL: Required fields must have values
- Conditional: If field A is set, field B must also be set
- Minimum length: Text fields must have minimum characters
Completeness validation catches missing data that would break downstream calculations.
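The NOT NULL and conditional rules above can be combined into one query that returns violating rows. This is a sketch: the status, ship_date, and tracking_number columns are illustrative.
-- Required fields missing, or conditional rule violated:
-- a shipped order must have a ship date and tracking number
SELECT order_id
FROM orders
WHERE customer_id IS NULL
OR order_date IS NULL
OR (status = 'shipped' AND (ship_date IS NULL OR tracking_number IS NULL));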
7. Business Rule Validation
Custom rules specific to your domain:
- Order rules: Ship date must be after order date
- Financial rules: Credits and debits must balance
- Inventory rules: Stock cannot be negative
- Status rules: Canceled orders cannot be shipped
Business rule validation catches logical errors that pass format checks but violate domain logic.
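Business rules usually span several columns or tables. The queries below sketch two of the rules listed above; the status value and the inventory table are illustrative:
-- Canceled orders should never have a ship date
SELECT order_id, status, ship_date
FROM orders
WHERE status = 'canceled' AND ship_date IS NOT NULL;

-- Inventory should never go negative
SELECT product_id, stock_on_hand
FROM inventory
WHERE stock_on_hand < 0;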
Data Validation in SQL
SQL provides several mechanisms for implementing data validation:
Database Constraints
Built-in constraints that enforce rules at the database level:
-- NOT NULL constraint
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT NOT NULL,
order_date DATE NOT NULL,
amount DECIMAL(10,2) NOT NULL
);
-- CHECK constraint for range validation
ALTER TABLE orders
ADD CONSTRAINT chk_positive_amount
CHECK (amount > 0);
-- UNIQUE constraint
ALTER TABLE customers
ADD CONSTRAINT uq_email
UNIQUE (email);
-- FOREIGN KEY constraint
ALTER TABLE orders
ADD CONSTRAINT fk_customer
FOREIGN KEY (customer_id)
REFERENCES customers(customer_id);
Database constraints prevent invalid data from being inserted but don't help with data that's already in the system.
Validation Queries
Custom SQL queries that check existing data:
-- Find orders with invalid amounts
SELECT order_id, amount
FROM orders
WHERE amount <= 0 OR amount > 1000000;
-- Find orphaned records
SELECT o.order_id
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
-- Find duplicate emails
SELECT email, COUNT(*) as count
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;
-- Find invalid email formats
SELECT customer_id, email
FROM customers
WHERE email NOT LIKE '%@%.%';
-- Find orders with ship_date before order_date
SELECT order_id, order_date, ship_date
FROM orders
WHERE ship_date < order_date;
Stored Procedures for Validation
Reusable validation logic:
-- Validation stored procedure
CREATE PROCEDURE validate_orders()
BEGIN
-- Check for negative amounts
SELECT 'Negative amounts' as rule,
COUNT(*) as violations
FROM orders WHERE amount < 0
UNION ALL
-- Check for missing customers
SELECT 'Orphaned orders' as rule,
COUNT(*) as violations
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL
UNION ALL
-- Check for future order dates
SELECT 'Future order dates' as rule,
COUNT(*) as violations
FROM orders WHERE order_date > CURRENT_DATE;
END;
Data Validation vs Data Verification
These terms are often confused but have distinct meanings:
| Aspect | Validation | Verification |
|---|---|---|
| Question | Is the data in the right format? | Is the data actually correct? |
| Example | Email contains @ symbol | Email belongs to this customer |
| Method | Automated rule checking | Cross-reference with source |
| When | At data entry/ingestion | During audits or reconciliation |
Validation is automated and happens continuously. Verification typically requires human review or external data sources.
Implementing Data Validation
At Ingestion
Validate data as it enters your system:
- API endpoints validate request payloads
- ETL pipelines check data before loading
- Streaming jobs validate events in real-time
- File uploads validate structure and content
Catching problems at ingestion prevents bad data from ever entering your warehouse.
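For ETL-style loads, a common pattern is to land data in a staging table, validate it there, and promote only rows that pass. A minimal sketch, assuming a staging_orders table that mirrors the orders table used earlier:
-- Promote only rows that pass basic checks from staging into the warehouse table
INSERT INTO orders (order_id, customer_id, order_date, amount)
SELECT s.order_id, s.customer_id, s.order_date, s.amount
FROM staging_orders s
JOIN customers c ON s.customer_id = c.customer_id  -- referential check
WHERE s.amount > 0                                  -- range check
AND s.order_date <= CURRENT_DATE;                   -- no future order dates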
During Transformation
Validate data as it moves through pipelines:
- dbt tests validate model outputs (see the sketch after this list)
- Intermediate checks catch transformation bugs
- Schema tests verify expected structure
- Row count checks catch partial loads
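A dbt singular test is just a SQL file in the project's tests/ directory that returns failing rows; the test fails if the query returns anything. The model name below is illustrative:
-- tests/assert_no_negative_amounts.sql
-- dbt runs this query and fails the test if any rows come back
SELECT order_id, amount
FROM {{ ref('stg_orders') }}
WHERE amount < 0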
Before Serving
Validate data before it reaches consumers:
- Dashboard data quality checks
- Report validation before distribution
- API response validation
- Export file verification
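One lightweight pre-serving check compares the table feeding a dashboard against expectations for freshness and volume before publishing. A sketch, with an illustrative table name and thresholds:
-- Flag staleness or an unexpectedly low row count before publishing
SELECT MAX(updated_at) AS last_update,
COUNT(*) AS row_count
FROM daily_revenue_report
HAVING MAX(updated_at) < CURRENT_DATE  -- not refreshed today
OR COUNT(*) < 1000;                    -- far fewer rows than expected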
Continuous Monitoring
Data observability platforms like Sparvi continuously validate data:
- Automated rule execution on schedule
- Alerts when validations fail
- Historical tracking of validation results
- Integration with incident management
Data Validation Best Practices
1. Validate Early, Validate Often
The earlier you catch bad data, the less damage it can do. Validate at every stage of your pipeline.
2. Make Rules Specific
Vague rules like "amount must be reasonable" are useless. Specific rules like "amount must be between $0.01 and $999,999.99" are actionable.
3. Document Your Rules
Every validation rule should have documentation explaining why it exists and what to do when it fails.
4. Prioritize Critical Data
Not all data is equally important. Focus validation efforts on data that drives key business decisions.
5. Handle Failures Gracefully
When validation fails, you need a plan: reject the data, quarantine it for review, or allow with warning.
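A common way to implement the quarantine option is to move failing rows into a separate table with a reason attached, so they can be reviewed without blocking the rest of the load. A sketch, with an illustrative quarantine table and reason text:
-- Quarantine rows that fail validation instead of silently dropping them
INSERT INTO orders_quarantine (order_id, customer_id, amount, failed_rule, quarantined_at)
SELECT order_id, customer_id, amount, 'non-positive amount', CURRENT_TIMESTAMP
FROM staging_orders
WHERE amount <= 0;

DELETE FROM staging_orders
WHERE amount <= 0;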
6. Monitor Validation Results
Track validation pass/fail rates over time. A sudden increase in failures indicates a systemic problem.
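Tracking can be as simple as appending each run's counts to a log table and charting the trend. A sketch, assuming an illustrative validation_results table:
-- Append one row per rule per run so failure rates can be tracked over time
INSERT INTO validation_results (run_at, rule_name, violation_count)
SELECT CURRENT_TIMESTAMP, 'non-positive amount', COUNT(*)
FROM orders
WHERE amount <= 0;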
Automate Data Validation with Sparvi
Sparvi makes it easy to create and monitor data validation rules. Define rules in SQL, run them on schedule, and get alerts when data fails validation—all without building custom infrastructure.
Learn More About Sparvi
Frequently Asked Questions
What is data validation?
Data validation is the process of checking data against predefined rules, constraints, and formats to ensure accuracy and integrity. It verifies that data meets business requirements before it's used for analysis, reporting, or operational processes.
What is data validation in SQL?
Data validation in SQL involves writing queries that check data against rules and constraints. This includes CHECK constraints, NOT NULL constraints, UNIQUE constraints, foreign key validations, and custom validation queries that verify data meets business requirements.
What are the types of data validation?
Common types of data validation include: format validation (checking data structure like email formats), range validation (ensuring values fall within acceptable limits), type validation (confirming correct data types), referential validation (verifying relationships between tables), and business rule validation (checking domain-specific requirements).
Why is data validation important?
Data validation is important because it prevents bad data from entering your systems and causing downstream problems. Without validation, incorrect data can lead to wrong business decisions, failed reports, compliance issues, and wasted engineering time fixing data quality problems.
What is the difference between data validation and data verification?
Data validation checks if data conforms to defined rules and formats (is the email address in the right format?). Data verification confirms the data is accurate and true (is this actually the customer's correct email address?). Validation ensures data is acceptable; verification ensures it's correct.