Documentation
Data Profiling Guide
Data profiling is the process of examining data to understand its structure, content, quality, and patterns. With Sparvi, you can easily profile your database tables to gain deep insights into your data.
Understanding Sparvi Profiles
When you profile a table with Sparvi, it computes a variety of metrics and statistics that give you a comprehensive view of your data. A profile includes:
Basic Metrics
- Row count: Total number of rows in the table
- Column count: Total number of columns
- Duplicate count: Number of duplicate rows
- Timestamp: When the profile was created
Completeness Metrics
For each column, Sparvi calculates:
- Null count: Number of NULL values
- Null percentage: Percentage of NULL values
- Distinct count: Number of unique values
- Distinct percentage: Percentage of unique values
Numeric Column Statistics
For numeric columns, Sparvi calculates:
- Min/Max: Minimum and maximum values
- Mean/Median: Average and middle values
- Standard Deviation: Measure of data dispersion
- Quartiles: 25th, 50th, and 75th percentiles
- Outliers: Values that fall significantly outside the normal range
Text Column Statistics
For text columns, Sparvi analyzes:
- Min/Max/Avg Length: Character length statistics
- Pattern Recognition: Common patterns in the data
- Format Consistency: Checks for consistent formatting
- Empty String Detection: Identifies empty strings (different from NULLs)
Date Column Statistics
For date and timestamp columns, Sparvi calculates:
- Min/Max Date: Earliest and latest dates
- Date Range: Span of time represented
- Temporal Patterns: Distribution across time periods
- Future Dates: Detection of dates in the future
- Unrealistic Dates: Identification of dates that are unreasonably old
Anomaly Detection
Sparvi automatically detects anomalies such as:
- Statistical Outliers: Values far from the statistical norm
- Pattern Deviations: Inconsistencies in data patterns
- Distribution Anomalies: Unusual distributions of values
- Completeness Issues: Unexpected NULL patterns
- Schema Shifts: Changes in table schema compared to historical data
Using Data Profiling in Sparvi Cloud
Sparvi Cloud provides automated data profiling through the web interface:
- Connect Your Database: Add your database connection in the Connections section
- Navigate to Data Explorer: Browse your database schemas and tables
- View Profiles: Click on any table to see its comprehensive data profile
- Automated Scheduling: Configure automated profiling schedules for continuous monitoring
- Historical Comparison: Track changes over time with automatic historical comparisons
- Anomaly Detection: Get alerts when unusual patterns or data quality issues are detected
Analyzing Profiles in Sparvi Cloud
Sparvi Cloud provides comprehensive profile analysis through interactive dashboards and visualizations.
Viewing Profile Metrics
In the Data Explorer, when you view a table profile, Sparvi Cloud displays:
- Table Overview: Row count, column count, duplicate count, and profile timestamp
- Column Completeness Dashboard: Visual representation of null percentages and distinct value counts for all columns
- Interactive Charts: Sortable and filterable views of all column statistics
Identifying Data Quality Issues
Sparvi Cloud automatically highlights problematic columns:
- High Null Rates: Columns with elevated null percentages are flagged with warning indicators
- Low Cardinality Issues: Columns with very low distinct value percentages are identified
- Sorting and Filtering: Sort columns by null percentage, distinct count, or other metrics to quickly find issues
Numeric Column Analysis
For numeric columns, Sparvi Cloud provides:
- Statistical Summary: View min, max, average, median, and standard deviation
- Distribution Visualizations: Histograms and box plots showing data distribution
- Outlier Detection: Automatically identified outliers with example values
- Trend Analysis: Historical comparison of numeric statistics over time
Anomaly Detection
Sparvi Cloud automatically detects and displays anomalies:
- Anomaly List: All detected anomalies with severity levels (high, medium, low)
- Anomaly Details: Description, affected columns, and detection timestamp
- Schema Change Tracking: Automatic detection of schema shifts with before/after comparison
- Filterable Views: Filter anomalies by type, severity, or affected table/column
Dashboard and Reporting
The Sparvi Cloud dashboard provides comprehensive visualizations:
- Anomaly Trends Over Time: Line charts showing anomaly detection patterns
- Severity Breakdown: Visual distribution of high, medium, and low severity issues
- Anomaly Type Analysis: Charts showing which anomaly types are most common
- Table Health Heatmap: Color-coded view of data quality across all tables
- Export Options: Download reports and share with your team
Best Practices
- Profile Regularly: Run profiles on a regular schedule to track changes over time
- Use Historical Comparisons: Always compare current profiles with previous ones
- Start with Basic Metrics: Begin analysis with row counts and null percentages
- Focus on Outliers: Pay special attention to statistical outliers
- Look for Patterns: Analyze patterns in text data for inconsistencies
- Check Date Ranges: Ensure date ranges fall within expected bounds
- Track Schema Changes: Monitor changes to table structures over time
- Automate Alerting: Set up alerts for significant changes in key metrics
- Adjust Thresholds: Customize anomaly detection thresholds for your data
- Document Findings: Keep records of data quality issues for future reference
Next Steps
After profiling your data, consider:
- Setting up Validation Rules based on profile insights
- Exploring Anomaly Detection for more advanced monitoring
- Connecting your databases with Sparvi Cloud Database Connections