Documentation
Data Profiling Guide
Data profiling is the process of examining data to understand its structure, content, quality, and patterns. With Sparvi, you can easily profile your database tables to gain deep insights into your data.
Understanding Sparvi Profiles
When you profile a table with Sparvi, it computes a variety of metrics and statistics that give you a comprehensive view of your data. A profile includes:
Basic Metrics
- Row count: Total number of rows in the table
- Column count: Total number of columns
- Duplicate count: Number of duplicate rows
- Timestamp: When the profile was created
Completeness Metrics
For each column, Sparvi calculates:
- Null count: Number of NULL values
- Null percentage: Percentage of NULL values
- Distinct count: Number of unique values
- Distinct percentage: Percentage of unique values
Numeric Column Statistics
For numeric columns, Sparvi calculates:
- Min/Max: Minimum and maximum values
- Mean/Median: Average and middle values
- Standard Deviation: Measure of data dispersion
- Quartiles: 25th, 50th, and 75th percentiles
- Outliers: Values that fall significantly outside the normal range
Text Column Statistics
For text columns, Sparvi analyzes:
- Min/Max/Avg Length: Character length statistics
- Pattern Recognition: Common patterns in the data
- Format Consistency: Checks for consistent formatting
- Empty String Detection: Identifies empty strings (different from NULLs)
Date Column Statistics
For date and timestamp columns, Sparvi calculates:
- Min/Max Date: Earliest and latest dates
- Date Range: Span of time represented
- Temporal Patterns: Distribution across time periods
- Future Dates: Detection of dates in the future
- Unrealistic Dates: Identification of dates that are unreasonably old
Feeding Profiles into Monitors
Profile statistics are also the source data for column-metric monitors. The same null %, distinct count, min, max, avg, and stddev that the profile surfaces can be promoted to a monitor in two clicks, with ML or threshold evaluation. Segment any of them by a dimension column to track quality per region, product, or tenant.
This is how the typical Sparvi feedback loop closes:
- Statistical outliers, surface in the profile; promote to a column-metric monitor with statistical evaluation to alert on future drift
- Distribution shifts, visible in the histogram; promote to an avg or stddev monitor
- Completeness issues, visible in the null % column; promote to a null % monitor with a threshold ("never above 5%") or a statistical baseline
- Schema shifts, picked up automatically by Sparvi's schema-change tracking; no monitor required
Using Data Profiling in Sparvi Cloud
Sparvi Cloud provides automated data profiling through the web interface:
- Connect Your Database: Add your database connection in the Connections section
- Navigate to Data Explorer: Browse your database schemas and tables
- View Profiles: Click on any table to see its comprehensive data profile
- Automated Scheduling: Configure automated profiling schedules for continuous monitoring
- Historical Comparison: Track changes over time with automatic historical comparisons
- Promote to monitors: Add a monitor on any column statistic so future drift creates an issue automatically
Analyzing Profiles in Sparvi Cloud
Sparvi Cloud provides comprehensive profile analysis through interactive dashboards and visualizations.
Viewing Profile Metrics
In the Data Explorer, when you view a table profile, Sparvi Cloud displays:
- Table Overview: Row count, column count, duplicate count, and profile timestamp
- Column Completeness Dashboard: Visual representation of null percentages and distinct value counts for all columns
- Interactive Charts: Sortable and filterable views of all column statistics
Identifying Data Quality Issues
Sparvi Cloud automatically highlights problematic columns:
- High Null Rates: Columns with elevated null percentages are flagged with warning indicators
- Low Cardinality Issues: Columns with very low distinct value percentages are identified
- Sorting and Filtering: Sort columns by null percentage, distinct count, or other metrics to quickly find issues
Numeric Column Analysis
For numeric columns, Sparvi Cloud provides:
- Statistical Summary: View min, max, average, median, and standard deviation
- Distribution Visualizations: Histograms and box plots showing data distribution
- Outlier Detection: Automatically identified outliers with example values
- Trend Analysis: Historical comparison of numeric statistics over time
Monitor Outcomes Alongside Profiles
For tables that have monitors attached, the profile view shows each monitor's current value, baseline (or threshold), and last evaluation timestamp. Click into a monitor to see its trend chart, segmented breakdown (if segmented), and full issue history.
- Monitor list: Every monitor on the table, with status (healthy, warning, critical)
- Severity levels: Critical and warning issues derived from monitor evaluations
- Schema change tracking: Automatic detection of schema shifts with before/after comparison
- Filterable views: Filter monitors by status, evaluation mode, or affected column
Dashboard and Reporting
The Sparvi Cloud dashboard provides comprehensive visualizations:
- Monitor trends over time: Line charts showing each monitor's value, baseline, and crossings
- Severity breakdown: Visual distribution of critical and warning issues
- Issue type analysis: Charts showing whether issues come from monitors, validations, or schema changes
- Table health heatmap: Color-coded view of data quality across all tables
- Export options: Download reports and share with your team
Best Practices
- Profile Regularly: Run profiles on a regular schedule to track changes over time
- Use Historical Comparisons: Always compare current profiles with previous ones
- Start with Basic Metrics: Begin analysis with row counts and null percentages
- Focus on Outliers: Pay special attention to statistical outliers
- Look for Patterns: Analyze patterns in text data for inconsistencies
- Check Date Ranges: Ensure date ranges fall within expected bounds
- Track Schema Changes: Monitor changes to table structures over time
- Automate Alerting: Set up alerts for significant changes in key metrics
- Tune monitors: Adjust sensitivity, baseline window, or thresholds on the monitors derived from your profiles so signal matches your tolerance for noise
- Document Findings: Keep records of data quality issues for future reference
Next Steps
After profiling your data, consider:
- Setting up Validation Rules based on profile insights
- Exploring the Anomaly Detection Guide, how Sparvi monitors flag values that fall outside a learned baseline, including segment-aware detection
- Connecting your databases with Sparvi Cloud Database Connections