
5.1. Introduction to data and ML pipeline testing

A brief introduction to different types of tests and testing conditions, and how to incorporate them into data and ML pipelines.

Video 1. Introduction to data and ML pipeline testing, by Emeli Dral

When to perform testing

The ML lifecycle involves multiple steps that require testing to ensure that ML models function properly. Critical areas to test include the following stages:

  • During the feature engineering stage: testing input data quality, since it affects the whole pipeline.

  • During model training (or retraining): model quality checks.

  • During model serving: validating incoming data and model outputs.

  • During performance monitoring: continuously testing the model quality to detect and resolve potential issues.

How to perform testing

There are different types of checks you can use to test data and ML pipelines:

Individual tests
A test is a metric with a condition. You can perform a certain evaluation or measurement on top of a data batch and compare it against a threshold or expectation. You can formulate almost anything as a test: assertions on feature values, expectations about model quality on a specific segment, etc. Whatever you can measure, you can design as a test.

Tests can be column-level (when metrics are calculated for a specific feature or column) or dataset-level (in this case, you calculate metrics for the whole dataset).
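
For illustration, here is a minimal sketch of a column-level and a dataset-level test written as plain pandas checks; the column names and thresholds are made up for the example.

```python
import pandas as pd

def test_column_missing_share(df: pd.DataFrame, column: str, max_share: float) -> bool:
    """Column-level test: the share of missing values in one column stays under a threshold."""
    return df[column].isna().mean() <= max_share

def test_dataset_row_count(df: pd.DataFrame, min_rows: int) -> bool:
    """Dataset-level test: the batch contains at least the expected number of rows."""
    return len(df) >= min_rows

batch = pd.DataFrame({"age": [25, 31, None, 47], "balance": [100.0, 250.5, 80.0, None]})

# Each test is a metric plus a condition that resolves to pass/fail.
results = {
    "age_missing_share_below_5%": test_column_missing_share(batch, "age", max_share=0.05),
    "at_least_1000_rows": test_dataset_row_count(batch, min_rows=1000),
}
print(results)
```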

Test suites
Individual tests can be grouped into test suites. For each test in a test suite, you can define test criticality and set alerting conditions: for example, based on the number of failed critical tests.
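
As a sketch, grouping tests into a suite could look roughly like this with the Evidently TestSuite API (class names as in the 0.4.x releases; the file paths and the simple alerting rule are illustrative assumptions):

```python
import pandas as pd
from evidently.test_suite import TestSuite
from evidently.tests import TestNumberOfRows, TestShareOfMissingValues, TestColumnDrift

reference = pd.read_csv("reference.csv")       # hypothetical paths to your data
current = pd.read_csv("current_batch.csv")

suite = TestSuite(tests=[
    TestNumberOfRows(),                         # dataset-level test
    TestShareOfMissingValues(lt=0.05),          # dataset-level test with an explicit condition
    TestColumnDrift(column_name="balance"),     # column-level drift test
])
suite.run(reference_data=reference, current_data=current)

# One simple way to express criticality: alert only when the drift test fails.
failed = [t["name"] for t in suite.as_dict()["tests"] if t["status"] == "FAIL"]
if any("drift" in name.lower() for name in failed):
    print("ALERT: a critical test failed:", failed)
```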

When you create a test, you must define its conditions. There are two main strategies for doing so:

Reference-based conditions
You can use a reference dataset to derive conditions automatically rather than set them manually for each individual test. This is a great option for certain types of checks, such as testing column types (which are easy to derive from a reference example), and for ad hoc testing, for example, when you import a new batch of data and want to explore the test results visually right away. However, be careful when designing alerting: auto-generated test conditions are not perfect and may be prone to false alerts or missed issues.

Manually defined conditions
With this approach, you specify conditions for each test manually. This method does not require additional data and can be great for encoding specific conditions based on domain expertise.

You can also combine reference-based and manual conditions. For example, you can manually pass conditions for specific features and use a reference dataset to define test conditions for the rest of your dataset. Combining these approaches is possible with tools like Evidently.
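
To illustrate the two strategies (again assuming an Evidently-style TestSuite API and made-up column names): leaving a test without parameters lets the condition be derived from the reference dataset, while an explicit parameter encodes a manual, domain-driven condition.

```python
import pandas as pd
from evidently.test_suite import TestSuite
from evidently.tests import TestColumnShareOfMissingValues, TestColumnValueMin

reference = pd.read_csv("reference.csv")    # hypothetical paths
current = pd.read_csv("current_batch.csv")

suite = TestSuite(tests=[
    # Reference-based: no condition is passed, so the expectation is derived
    # automatically from the reference dataset.
    TestColumnShareOfMissingValues(column_name="balance"),
    # Manually defined: domain knowledge says age can never be negative.
    TestColumnValueMin(column_name="age", gte=0),
])
suite.run(reference_data=reference, current_data=current)
suite.save_html("test_results.html")   # convenient for ad hoc visual exploration
```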

Test automation

If you want to test your data and ML models continuously, switching from ad hoc checks to automated testing is a good idea. You can use workflow managers like Airflow, Kubeflow, or Prefect to automate testing as part of the ML pipeline. If you run your ML model in batch, just add a testing step to your pipeline.
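
As a minimal sketch of what "adding a testing step" can mean in a batch pipeline, here is a Prefect 2-style flow where a failed data check stops the run before scoring; the task names, the file path, and the checks themselves are placeholders.

```python
import pandas as pd
from prefect import flow, task

@task
def load_batch(path: str) -> pd.DataFrame:
    return pd.read_csv(path)

@task
def run_data_tests(df: pd.DataFrame) -> bool:
    # Plug in any testing logic here: an Evidently TestSuite, plain assertions, etc.
    return len(df) >= 1000 and df["age"].isna().mean() <= 0.05

@task
def score_batch(df: pd.DataFrame) -> pd.DataFrame:
    scored = df.copy()
    scored["prediction"] = 0   # placeholder for a real model call
    return scored

@flow
def batch_prediction_pipeline(path: str = "current_batch.csv") -> pd.DataFrame:
    batch = load_batch(path)
    if not run_data_tests(batch):
        raise ValueError("Data quality tests failed, stopping the pipeline")
    return score_batch(batch)

if __name__ == "__main__":
    batch_prediction_pipeline()
```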

Recording test results

If you already use logging tools like MLflow, you can use them to log test results as well. Evidently also offers a monitoring dashboard where you can visualize individual and aggregate test results to track them over time.
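
For example, here is a sketch of logging test results to MLflow; the run name, metric name, and result dictionary are hypothetical, while log_metric and log_dict are standard MLflow calls.

```python
import mlflow

# Hypothetical summary of a test run, e.g. parsed from an Evidently TestSuite result.
test_results = {"share_of_missing_values": True, "column_drift_balance": False}

with mlflow.start_run(run_name="data_quality_checks"):
    # An aggregate metric that is easy to plot over time across runs.
    mlflow.log_metric("failed_tests", sum(not passed for passed in test_results.values()))
    # Keep the detailed per-test results as a JSON artifact for later inspection.
    mlflow.log_dict(test_results, "test_results.json")
```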

Example use case

The practical part of the module involves applying data and model quality tests to a toy dataset. Using the bank marketing dataset (from the UCI Machine Learning Repository), we will predict subscription outcomes from a marketing campaign.

You will design training and prediction pipelines as part of the code practice. For the training pipeline, you will prepare data, calculate features, do model training and scoring, and incorporate data, feature, and model quality checks.

For the prediction pipeline, you will simulate the production usage of the model in a batch scenario. You will also implement data quality and stability checks, score model output, validate model quality, and use quality checks to make informed decisions on model retraining.

And now, to practice!
