4.4. How to choose a reference dataset in ML monitoring

What a reference dataset is in ML monitoring, how to choose one for drift detection, and when to use multiple references.


Video 4. How to choose a reference dataset in ML monitoring, by Emeli Dral.

Why use a reference dataset?

There are two main uses for a reference dataset.

  1. You can use it to derive test conditions automatically, saving time and effort in setting up tests manually.

For example, by passing a previous batch of data as a reference, you can automatically generate conditions for data quality checks (to track feature ranges, the share of missing values, etc.) and model quality checks (to keep tabs on metrics like precision and accuracy).

  2. You can use a reference dataset as a baseline to detect data and prediction drift in production by comparing new data distributions against reference distributions.

A reference dataset can also be used to detect training-serving skew, as it provides a baseline for comparing training and production data.
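As an illustration of both uses, here is a minimal sketch with the Evidently API used in this course's code practices (evidently 0.4.x; the file paths are hypothetical, and newer Evidently versions expose a different interface):

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
from evidently.test_suite import TestSuite
from evidently.test_preset import DataQualityTestPreset

# Hypothetical batches: an earlier production batch as reference, the latest as current
reference = pd.read_csv("data/batch_2023_01.csv")
current = pd.read_csv("data/batch_2023_02.csv")

# Use case 1: the preset derives test conditions (feature ranges,
# share of missing values, etc.) from the reference batch automatically
suite = TestSuite(tests=[DataQualityTestPreset()])
suite.run(reference_data=reference, current_data=current)
suite.save_html("data_quality_tests.html")

# Use case 2: the reference serves as the baseline for drift detection
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("data_drift_report.html")
```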

What makes a good reference dataset?

Characteristics of a good reference dataset:

  • Reflects realistic data patterns, including cycles and seasonality.

  • Contains a large enough sample to derive meaningful statistics.

  • Includes realistic scenarios (e.g., sensor outages) to validate against new data.

What a reference dataset is not:

  • It is not the same as a training dataset. You can sometimes choose training data to be your reference, but they are not synonymous.

  • It is not a “golden dataset,” which serves a different purpose.

You always need a reference dataset if your goal is to compare distributions to detect data or prediction drift (e.g., using metrics like Wasserstein distance).
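To make the distribution comparison concrete, here is a small, library-agnostic sketch using SciPy's Wasserstein distance on a single numerical feature; the synthetic data and the alerting threshold are purely illustrative:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Hypothetical feature values from the reference and current batches
reference_values = np.random.normal(loc=100, scale=10, size=5_000)
current_values = np.random.normal(loc=105, scale=12, size=5_000)

# A two-sample metric needs both samples: no reference, no drift score
distance = wasserstein_distance(reference_values, current_values)

# The alerting threshold is a design choice, not a statistical given
DRIFT_THRESHOLD = 0.1 * reference_values.std()
print(f"Wasserstein distance: {distance:.2f}, drift: {distance > DRIFT_THRESHOLD}")
```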

However, having a reference dataset is not a must:

  • You can run one-sample statistical tests that compare the data against an expected distribution rather than a second dataset.

  • For most types of checks, you can manually specify test conditions, such as min-max feature ranges. This works well if you have a limited set of data with known expected behaviors.

Still, using a reference dataset is a great hack to automate the generation of test conditions!
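For contrast, this is roughly what manually specified conditions look like in the Evidently tests API used in this course (column names and thresholds are invented): no reference dataset is required, but every bound must be set and maintained by hand.

```python
import pandas as pd
from evidently.test_suite import TestSuite
from evidently.tests import (
    TestColumnValueMin,
    TestColumnValueMax,
    TestShareOfMissingValues,
)

current = pd.read_csv("data/batch_2023_02.csv")  # hypothetical production batch

# Manually specified conditions: no reference dataset needed,
# but you have to know (and maintain) the expected bounds yourself
manual_suite = TestSuite(tests=[
    TestColumnValueMin(column_name="age", gte=18),
    TestColumnValueMax(column_name="age", lte=100),
    TestShareOfMissingValues(lte=0.05),
])
manual_suite.run(reference_data=None, current_data=current)
manual_suite.save_html("manual_tests.html")
```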

Using training data as a reference

Using training data as a reference can be acceptable in specific contexts but is generally not recommended, since training data is often pre-processed and may carry biases that make it differ from live production data.

If the training data is all you have, it is OK to use it for the following types of checks:

  • To derive feature types and data schema.

  • To derive feature ranges (for numerical features) and value lists (for categorical features).

  • To derive feature correlations.

  • To detect training-serving skew.
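As a simple sketch of the first three points above, you can derive schema-level expectations directly from a training dataframe with plain pandas (the file path and columns are hypothetical):

```python
import pandas as pd

train = pd.read_csv("data/train.csv")  # hypothetical training set

# Feature types and data schema
schema = train.dtypes.to_dict()

# Feature ranges for numerical columns, value lists for categorical ones
numeric_ranges = train.select_dtypes(include="number").agg(["min", "max"])
categorical_values = {
    col: sorted(train[col].dropna().unique().tolist())
    for col in train.select_dtypes(include=["object", "category"]).columns
}

# Feature correlations
correlations = train.select_dtypes(include="number").corr()
```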

It is less optimal for:

  • Generating expectations about model quality.

  • Deriving expectations about data quality, e.g., the expected share of nulls.

  • Using it as a baseline for drift detection.

You can consider using hold-out validation data or previous batches of data instead.

Reference dataset for drift detection

When choosing a reference dataset for drift detection, make sure to pick a representative dataset that captures typical distributions and variations in the data. You can use historical data to decide on the appropriate windows; for example, you can compare the data using monthly, weekly, or daily windows.

You should make the following decisions:

  • What do you compare against? You can use training data (generally not recommended), hold-out validation data, or previous production batches.

  • What batch size to use? You need to determine the size of current and reference datasets for effective comparison – 1 day, 1 week, 1 year, 1000 objects, etc.

  • How to update reference data? You can have a static reference (e.g., which you update once a month) or shift the reference data dynamically (e.g., sliding window approach).

Analyzing historical data can help determine the most effective reference data strategy.
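For example, here is a rough pandas sketch of the static vs. sliding reference choice on a timestamped production log (the file path, column names, and window sizes are made up):

```python
import pandas as pd

data = pd.read_parquet("data/production_history.parquet")  # hypothetical log
data["timestamp"] = pd.to_datetime(data["timestamp"])
now = data["timestamp"].max()

# Current window: the last day of data
current = data[data["timestamp"] > now - pd.Timedelta(days=1)]

# Option 1: static reference, e.g., a fixed month updated manually
static_reference = data[
    (data["timestamp"] >= "2023-01-01") & (data["timestamp"] < "2023-02-01")
]

# Option 2: sliding reference, e.g., the 7 days preceding the current window
sliding_reference = data[
    (data["timestamp"] > now - pd.Timedelta(days=8))
    & (data["timestamp"] <= now - pd.Timedelta(days=1))
]
```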

Multiple references. It often makes sense to use multiple reference datasets. For example, you can have multiple comparison windows to capture seasonality and cyclic trends.
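For instance, to capture both weekly cycles and yearly seasonality, you could run the same drift check against two reference windows. A self-contained sketch under the same hypothetical setup as above:

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

data = pd.read_parquet("data/production_history.parquet")  # hypothetical log
data["timestamp"] = pd.to_datetime(data["timestamp"])
now = data["timestamp"].max()
current = data[data["timestamp"] > now - pd.Timedelta(weeks=1)]

# Two reference windows: the previous week (cyclic baseline)
# and the same week last year (seasonal baseline)
references = {
    "previous_week": data[
        (data["timestamp"] > now - pd.Timedelta(weeks=2))
        & (data["timestamp"] <= now - pd.Timedelta(weeks=1))
    ],
    "same_week_last_year": data[
        (data["timestamp"] > now - pd.Timedelta(weeks=53))
        & (data["timestamp"] <= now - pd.Timedelta(weeks=52))
    ],
}

for name, reference in references.items():
    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=reference, current_data=current)
    report.save_html(f"drift_vs_{name}.html")
```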

Sampling vs. entire dataset. If you have large datasets, you can consider using sampling, for example, random or stratified sampling.

  • Sampling is great for drift detection. In fact, statistical tests are designed to work with samples! It can save computational resources and speed up calculations. If you aim to detect a statistical distribution shift in the overall dataset, sampling is totally fine.

  • For detecting data quality anomalies, full datasets are preferable. If you are looking for data quality issues – e.g., individual outliers or duplicates – sampling can mask them.
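A minimal sketch of both sampling options with pandas and scikit-learn (the dataset, sample size, and stratification column are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

reference_full = pd.read_parquet("data/reference.parquet")  # hypothetical dataset

# Random sample: usually enough for distribution-level drift checks
reference_random = reference_full.sample(n=10_000, random_state=42)

# Stratified sample: preserves the share of an important segment column
reference_stratified, _ = train_test_split(
    reference_full,
    train_size=10_000,
    stratify=reference_full["customer_segment"],
    random_state=42,
)

# For data quality checks (outliers, duplicates), prefer the full dataset
duplicates = reference_full.duplicated().sum()
```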

Summing up

  • There is no “universal” reference dataset. It should be tailored to the specific use case and expectations of similarity to current data.

  • Hold-out validation data is preferred over training data for creating reference datasets, especially for drift detection. Use training data only if there is nothing else.

  • It is crucial to account for seasonality and historical patterns when choosing the reference dataset to ensure that it accurately represents the expected variations in data.

  • Historical data is a valuable resource for informing reference dataset selection.

Up next: custom metrics for ML monitoring.

Further reading: How to detect, evaluate and visualize historical drifts in the data.