
3.6. Monitoring multimodal datasets

Strategies for monitoring data quality and data drift in multimodal datasets.



Video 6, by Emeli Dral

What is a multimodal dataset?

Often, we work not only with structured or unstructured data but with a combination of both. Common examples include product reviews, chats, support tickets, and emails. These applications may combine unstructured data, e.g., text, with structured metadata such as region, device, product type, or user category.

Both structured and unstructured data provide valuable signals. Considering signals from both data types is essential to build comprehensive ML models.

Monitoring strategies for multimodal data

We will cover three widely used strategies for monitoring multimodal datasets.

Strategy 1. Split and monitor independently. The approach is straightforward – split the dataset by data type and monitor structured and unstructured data independently (a code sketch follows this list):

  • Monitor structured data using descriptive statistics, share of missing values, distribution drift, correlation changes, etc.

  • Use raw text data analysis or embedding monitoring for unstructured data.

  • Combine monitoring results into a unified dashboard.
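
Here is a minimal sketch of the split-and-monitor strategy in plain Python (pandas and scipy). The dataset, column names, and the specific statistical tests are illustrative assumptions rather than course code; in practice, the Evidently reports used earlier in the course can run the same structured and text checks.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical multimodal batches: a free-text column plus structured metadata.
# Column names and values are made up for illustration.
ref = pd.DataFrame({
    "review_text": ["great phone", "battery died fast", "works as expected"] * 50,
    "price": rng.normal(300, 50, 150),
    "region": rng.choice(["EU", "US", "APAC"], 150),
})
cur = pd.DataFrame({
    "review_text": ["do not buy!!!", "refund please", "screen cracked"] * 50,
    "price": rng.normal(340, 60, 150),
    "region": rng.choice(["EU", "US", "APAC"], 150, p=[0.6, 0.3, 0.1]),
})

# Structured part: simple per-column drift checks on the metadata only.
_, p_num = stats.ks_2samp(ref["price"], cur["price"])
print(f"price KS test p-value: {p_num:.3f}")

ref_counts = ref["region"].value_counts(normalize=True)
cur_counts = cur["region"].value_counts(normalize=True).reindex(ref_counts.index, fill_value=0)
_, p_cat = stats.chisquare(f_obs=cur_counts * len(cur), f_exp=ref_counts * len(cur))
print(f"region chi-square p-value: {p_cat:.3f}")

# Unstructured part: raw-text statistics compared between batches.
for name, df in [("reference", ref), ("current", cur)]:
    lengths = df["review_text"].str.len()
    exclam_share = df["review_text"].str.contains("!").mean()
    print(f"{name}: mean text length {lengths.mean():.1f}, share of texts with '!' {exclam_share:.2f}")
```

The results of the two checks can then be pulled together into a single dashboard, as described above.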

Strategy 2. A joint structured dataset. This approach is based on turning unstructured data into structured data by using descriptors (see the sketch after this list):

  • Generate descriptors for unstructured data (e.g., text properties) to represent it in a structured form.

  • Combine these structured descriptors with existing metadata.

  • Perform a comprehensive analysis of the combined structured data. You can check for missing values, distribution drift, correlation changes, outliers, etc.
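
A minimal sketch of the descriptor approach, continuing with the ref and cur frames from the sketch above. The descriptor set (text length, word count, share of exclamation marks) is an illustrative assumption; the built-in text descriptors covered in lesson 3.3 serve the same purpose.

```python
import pandas as pd

def text_descriptors(texts: pd.Series) -> pd.DataFrame:
    """Turn raw text into simple structured descriptors (illustrative set)."""
    length = texts.str.len()
    return pd.DataFrame({
        "text_length": length,
        "word_count": texts.str.split().str.len(),
        "share_exclamations": texts.str.count("!") / length.clip(lower=1),
    })

# ref and cur are the multimodal batches from the Strategy 1 sketch above.
ref_joint = pd.concat([text_descriptors(ref["review_text"]), ref[["price", "region"]]], axis=1)
cur_joint = pd.concat([text_descriptors(cur["review_text"]), cur[["price", "region"]]], axis=1)

# The joint table now goes through the usual structured-data checks:
# missing values, descriptive statistics, distribution drift, correlations, outliers.
print(cur_joint.isna().mean())      # share of missing values per column
print(ref_joint.describe())         # reference descriptive statistics
print(cur_joint["text_length"].mean() - ref_joint["text_length"].mean())  # simple drift signal
```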

Strategy 3. Generate embeddings. As embeddings represent data as vectors in a high-dimensional space, you can combine structured features with embeddings to create an expanded feature space. For instance, if you have a 64-dimensional embedding and three structured features, the combined space would be 67-dimensional. You can then apply various methods like share of drifted components, domain classifier, or distance-based metrics to this combined data.
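
Below is a self-contained sketch of the combined feature space idea with a domain classifier check, using scikit-learn. The 64-dimensional embeddings and the three structured features are simulated for illustration; in a real setup, the embeddings would come from your text or image model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

def make_batch(n_rows: int, shift: float = 0.0) -> pd.DataFrame:
    """Simulate a batch: 64 embedding dimensions + 3 structured features = 67 columns."""
    embeddings = rng.normal(loc=shift, scale=1.0, size=(n_rows, 64))
    structured = rng.normal(loc=shift, scale=1.0, size=(n_rows, 3))
    columns = [f"emb_{i}" for i in range(64)] + ["price", "items", "rating"]
    return pd.DataFrame(np.hstack([embeddings, structured]), columns=columns)

reference = make_batch(500)            # reference batch
current = make_batch(500, shift=0.3)   # current batch with a simulated shift

# Domain classifier: label reference rows 0 and current rows 1, then check whether
# a model can tell the batches apart. ROC AUC near 0.5 means the batches look alike
# (no detectable drift); values well above 0.5 signal drift in the combined space.
features = pd.concat([reference, current], ignore_index=True)
labels = np.array([0] * len(reference) + [1] * len(current))
auc = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0),
    features, labels, cv=3, scoring="roc_auc",
).mean()
print(f"domain classifier ROC AUC: {auc:.2f}")
```

The same combined table could instead be checked with distance-based metrics or per-component drift tests (share of drifted components), as mentioned above.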

Summing up

We discussed three strategies for monitoring data quality and data drift in multimodal datasets. This concludes our module on ML monitoring for unstructured data. Here are some considerations to keep in mind:

  • If you have access to raw text data, do not ignore it. Interpretability wins! Evaluating metrics on raw text can provide a deep understanding of changes and potential issues with text data.

  • If you work with embeddings, numerous methods are also available to detect embedding drift.

  • When dealing with multimodal datasets, you can split data by type, leverage text descriptors, or generate a joint embedding dataset, depending on the specific use case and available data.

Enjoyed the content?

Star Evidently on GitHub to contribute back! This helps us create free, open-source tools and content for the community.
