
2.4. Data quality in machine learning

Types of production data quality issues, how to evaluate data quality, and how to interpret data quality metrics.

Video 4, by Emeli Dral.

What can go wrong with the input data?

In a complex ML system, many things can go wrong with the data. The golden rule is: garbage in, garbage out. You need to make sure that the data you feed the model is sound.

Some common data processing issues are:

  • Wrong source. E.g., a pipeline points to an older version of the table.

  • Lost access. E.g., permissions are not updated.

  • Bad SQL. Or not SQL. E.g., a query breaks when a user comes from a different time zone and performs an action "tomorrow."

  • Infrastructure update. E.g., an update to a dependency library changes how a value is computed.

  • Broken feature code. E.g., feature computation breaks at a corner case like a 100% discount.

Issues can also arise if the data schema changes or data is lost at the source (e.g., broken in-app logging or frozen sensor values). If you have several models interacting with each other, broken upstream models can affect downstream models.

Data quality metrics and analysis

Data profiling is a good starting point for monitoring data quality metrics. Based on the data type, you can come up with basic descriptive statistics for your dataset. For example, for numerical features, you can calculate:

  • Min and Max values

  • Quantiles

  • Unique values

  • Most common values

  • Share of missing values, etc.
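
A minimal sketch of this kind of profiling with pandas, assuming the current batch is already loaded into a DataFrame `df` (the columns here are made up for illustration):

```python
import pandas as pd

# Illustrative batch; in practice, load your production data instead.
df = pd.DataFrame({
    "price": [10.0, 12.5, None, 9.9, 10.0],
    "discount": [0.0, 0.1, 0.0, 1.0, 0.0],
})

# Basic descriptive statistics per numerical feature.
for col in df.select_dtypes("number").columns:
    stats = {
        "min": df[col].min(),
        "max": df[col].max(),
        "quantiles": df[col].quantile([0.25, 0.5, 0.75]).to_dict(),
        "n_unique": df[col].nunique(),
        "most_common": df[col].mode(dropna=True).tolist(),
        "missing_share": df[col].isna().mean(),
    }
    print(col, stats)
```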

Then, you can visualize and compare statistics and data distributions of the current data batch and reference data to ensure data stability.
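
For example, a quick way to put reference and current statistics side by side with pandas (the file names are placeholders for your own data sources):

```python
import pandas as pd

reference = pd.read_csv("reference.csv")  # e.g., training data or a past batch
current = pd.read_csv("current.csv")      # latest production batch

# Side-by-side descriptive statistics for a quick comparison.
side_by_side = pd.concat(
    {"reference": reference.describe(), "current": current.describe()}, axis=1
)
print(side_by_side)
```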

When it comes to monitoring data quality, you must define the conditions for alerting.

If you do not have reference data, you can set thresholds manually based on domain knowledge. “General ML data quality” can include characteristics such as:

  • no/low share of missing values

  • no duplicate columns/rows

  • no constant (or almost constant!) features

  • no highly correlated features

  • no target leaks (high correlation between feature and target)

  • no range violations (based on the feature context, e.g., negative age or sales).
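
These checks are easy to script directly. Below is a rough pandas sketch; every threshold and the `age` column are illustrative assumptions, not recommended values:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("current.csv")  # placeholder path for the current batch

# Absolute pairwise correlations between numerical features,
# keeping only the upper triangle to skip the diagonal and mirrored pairs.
corr = df.select_dtypes("number").corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

checks = {
    "low_missing_share": df.isna().mean().max() <= 0.05,    # illustrative threshold
    "no_duplicate_rows": not df.duplicated().any(),
    "no_duplicate_columns": not df.T.duplicated().any(),
    "no_constant_features": (df.nunique(dropna=False) > 1).all(),
    "no_high_correlation": not (upper > 0.95).any().any(),  # illustrative threshold
    # Hypothetical range check: "age" is a made-up column name.
    "no_range_violations": (df["age"] >= 0).all() if "age" in df else True,
}

for name, passed in checks.items():
    print(f"{name}: {'OK' if passed else 'ALERT'}")
```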

Since setting up these conditions manually can be tedious, it often helps to have a reference dataset.

If you have reference data, you can compare it with the current data and autogenerate test conditions based on the reference. For example, based on the training or past batch, you can monitor for:

  • expected data schema and column types

  • expected data completeness (e.g., 90% non-empty)

  • expected batch size (e.g., number of rows)

  • expected patterns for specific columns, such as:

    • non-unique (features) or unique (IDs)

    • specific data distribution types (e.g., normality)

    • expected ranges based on observed values

    • descriptive statistics: mean, median, quantiles, min-max (point estimates or statistical tests with a confidence interval).
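
For illustration, here is a simplified sketch of deriving a few such conditions from the reference and applying them to the current batch. The file names and tolerances are assumptions; the Evidently test presets used later in the course automate this kind of check:

```python
import pandas as pd

reference = pd.read_csv("reference.csv")  # e.g., training data or a past batch
current = pd.read_csv("current.csv")      # latest production batch

ref_num = reference.select_dtypes("number")

tests = {
    # Schema: same columns with the same types.
    "schema_matches": current.dtypes.to_dict() == reference.dtypes.to_dict(),
    # Completeness: each column is at least as complete as in the reference (5% slack).
    "completeness_ok": (current.isna().mean() <= reference.isna().mean() + 0.05).all(),
    # Batch size: within +/-50% of the reference row count (illustrative tolerance).
    "row_count_ok": 0.5 * len(reference) <= len(current) <= 1.5 * len(reference),
    # Ranges: numerical values stay within the ranges observed in the reference.
    "ranges_ok": all(
        current[col].dropna().between(ref_num[col].min(), ref_num[col].max()).all()
        for col in ref_num.columns
        if col in current
    ),
}

for name, passed in tests.items():
    print(f"{name}: {'OK' if passed else 'ALERT'}")
```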

Summing up

Monitoring data quality is critical to ensuring that ML models function reliably in production. Depending on the availability of reference data, you can manually set up thresholds based on domain knowledge or automatically generate test conditions based on the reference.

Up next: hands-on practice on how to evaluate and test data quality using Python and the Evidently library.