Data Profiling Guide

Data profiling is the process of examining data to understand its structure, content, quality, and patterns. With Sparvi, you can easily profile your database tables to gain deep insights into your data.

Understanding Sparvi Profiles

When you profile a table with Sparvi, it computes a variety of metrics and statistics that give you a comprehensive view of your data. A profile includes:

Basic Metrics

  • Row count: Total number of rows in the table
  • Column count: Total number of columns
  • Duplicate count: Number of duplicate rows
  • Timestamp: When the profile was created

Completeness Metrics

For each column, Sparvi calculates:

  • Null count: Number of NULL values
  • Null percentage: Percentage of NULL values
  • Distinct count: Number of unique values
  • Distinct percentage: Percentage of unique values

Numeric Column Statistics

For numeric columns, Sparvi calculates:

  • Min/Max: Minimum and maximum values
  • Mean/Median: Average and middle values
  • Standard Deviation: Measure of data dispersion
  • Quartiles: 25th, 50th, and 75th percentiles
  • Outliers: Values that fall significantly outside the normal range

Text Column Statistics

For text columns, Sparvi analyzes:

  • Min/Max/Avg Length: Character length statistics
  • Pattern Recognition: Common patterns in the data
  • Format Consistency: Checks for consistent formatting
  • Empty String Detection: Identifies empty strings (different from NULLs)

Date Column Statistics

For date and timestamp columns, Sparvi calculates:

  • Min/Max Date: Earliest and latest dates
  • Date Range: Span of time represented
  • Temporal Patterns: Distribution across time periods
  • Future Dates: Detection of dates in the future
  • Unrealistic Dates: Identification of dates that are unreasonably old

Feeding Profiles into Monitors

Profile statistics are also the source data for column-metric monitors. The same null %, distinct count, min, max, avg, and stddev that the profile surfaces can be promoted to a monitor in two clicks, with ML or threshold evaluation. Segment any of them by a dimension column to track quality per region, product, or tenant.

This is how the typical Sparvi feedback loop closes:

  • Statistical outliers, surface in the profile; promote to a column-metric monitor with statistical evaluation to alert on future drift
  • Distribution shifts, visible in the histogram; promote to an avg or stddev monitor
  • Completeness issues, visible in the null % column; promote to a null % monitor with a threshold ("never above 5%") or a statistical baseline
  • Schema shifts, picked up automatically by Sparvi's schema-change tracking; no monitor required

Using Data Profiling in Sparvi Cloud

Sparvi Cloud provides automated data profiling through the web interface:

  1. Connect Your Database: Add your database connection in the Connections section
  2. Navigate to Data Explorer: Browse your database schemas and tables
  3. View Profiles: Click on any table to see its comprehensive data profile
  4. Automated Scheduling: Configure automated profiling schedules for continuous monitoring
  5. Historical Comparison: Track changes over time with automatic historical comparisons
  6. Promote to monitors: Add a monitor on any column statistic so future drift creates an issue automatically

Analyzing Profiles in Sparvi Cloud

Sparvi Cloud provides comprehensive profile analysis through interactive dashboards and visualizations.

Viewing Profile Metrics

In the Data Explorer, when you view a table profile, Sparvi Cloud displays:

  • Table Overview: Row count, column count, duplicate count, and profile timestamp
  • Column Completeness Dashboard: Visual representation of null percentages and distinct value counts for all columns
  • Interactive Charts: Sortable and filterable views of all column statistics

Identifying Data Quality Issues

Sparvi Cloud automatically highlights problematic columns:

  • High Null Rates: Columns with elevated null percentages are flagged with warning indicators
  • Low Cardinality Issues: Columns with very low distinct value percentages are identified
  • Sorting and Filtering: Sort columns by null percentage, distinct count, or other metrics to quickly find issues

Numeric Column Analysis

For numeric columns, Sparvi Cloud provides:

  • Statistical Summary: View min, max, average, median, and standard deviation
  • Distribution Visualizations: Histograms and box plots showing data distribution
  • Outlier Detection: Automatically identified outliers with example values
  • Trend Analysis: Historical comparison of numeric statistics over time

Monitor Outcomes Alongside Profiles

For tables that have monitors attached, the profile view shows each monitor's current value, baseline (or threshold), and last evaluation timestamp. Click into a monitor to see its trend chart, segmented breakdown (if segmented), and full issue history.

  • Monitor list: Every monitor on the table, with status (healthy, warning, critical)
  • Severity levels: Critical and warning issues derived from monitor evaluations
  • Schema change tracking: Automatic detection of schema shifts with before/after comparison
  • Filterable views: Filter monitors by status, evaluation mode, or affected column

Dashboard and Reporting

The Sparvi Cloud dashboard provides comprehensive visualizations:

  • Monitor trends over time: Line charts showing each monitor's value, baseline, and crossings
  • Severity breakdown: Visual distribution of critical and warning issues
  • Issue type analysis: Charts showing whether issues come from monitors, validations, or schema changes
  • Table health heatmap: Color-coded view of data quality across all tables
  • Export options: Download reports and share with your team

Best Practices

  1. Profile Regularly: Run profiles on a regular schedule to track changes over time
  2. Use Historical Comparisons: Always compare current profiles with previous ones
  3. Start with Basic Metrics: Begin analysis with row counts and null percentages
  4. Focus on Outliers: Pay special attention to statistical outliers
  5. Look for Patterns: Analyze patterns in text data for inconsistencies
  6. Check Date Ranges: Ensure date ranges fall within expected bounds
  7. Track Schema Changes: Monitor changes to table structures over time
  8. Automate Alerting: Set up alerts for significant changes in key metrics
  9. Tune monitors: Adjust sensitivity, baseline window, or thresholds on the monitors derived from your profiles so signal matches your tolerance for noise
  10. Document Findings: Keep records of data quality issues for future reference

Next Steps

After profiling your data, consider:

  1. Setting up Validation Rules based on profile insights
  2. Exploring the Anomaly Detection Guide, how Sparvi monitors flag values that fall outside a learned baseline, including segment-aware detection
  3. Connecting your databases with Sparvi Cloud Database Connections