3.2. Monitoring data drift on raw text data

How to detect and evaluate raw text data drift using a domain classifier and topic modeling.

Video 2. Monitoring data drift on raw text data, by Emeli Dral

Challenges of monitoring raw text data

Handling raw text data is more complex than dealing with structured tabular data. With structured data, you can usually define what “good” or “expected” data looks like: particular feature distributions or statistical values can signal data quality. For unstructured text, there is no equally straightforward way to define data quality or extract such a signal directly from the raw data.

When it comes to data drift detection, you can use two strategies that work directly with raw text data: a domain classifier and topic modeling.

Domain classifier

The domain classifier method, also known as model-based drift detection, compares the reference and current datasets by training a classifier that predicts which of the two datasets a given text belongs to. If the model can confidently tell the current texts from the reference texts, the two datasets are probably sufficiently different, which signals drift.

With large datasets, you can use the ROC AUC of this binary classifier directly as the “drift score”. With smaller datasets (< 1,000 examples), you can instead compare the model’s ROC AUC against that of a random classifier.

A benefit of using model-based drift detection on raw data is interpretability: you can identify the top words and the text examples that were easiest to classify to explain the drift and debug the model.
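
To make the idea concrete, here is a minimal sketch of a domain classifier in Python. It is not the Evidently implementation used later in the course: the TF-IDF features, the logistic regression model, and the `domain_classifier_drift_score` helper are illustrative choices.

```python
# A minimal sketch of the domain classifier method. TF-IDF + logistic
# regression are illustrative choices, not a prescribed implementation.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def domain_classifier_drift_score(reference_texts, current_texts, top_n=10):
    """Train a classifier to tell reference and current texts apart.

    Returns the held-out ROC AUC (the "drift score") and the words most
    characteristic of the current dataset, which help explain the drift.
    """
    texts = list(reference_texts) + list(current_texts)
    labels = np.array([0] * len(reference_texts) + [1] * len(current_texts))

    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.3, random_state=42, stratify=labels
    )

    # Binary classifier: does a text come from the reference or current data?
    vectorizer = TfidfVectorizer(max_features=10_000, stop_words="english")
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vectorizer.fit_transform(x_train), y_train)

    # ROC AUC ~0.5: datasets are indistinguishable; close to 1.0: likely drift.
    probs = clf.predict_proba(vectorizer.transform(x_test))[:, 1]
    drift_score = roc_auc_score(y_test, probs)

    # Words with the largest positive coefficients are most typical
    # of the current dataset.
    words = np.array(vectorizer.get_feature_names_out())
    top_current_words = words[np.argsort(clf.coef_[0])[-top_n:]][::-1]
    return drift_score, list(top_current_words)
```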

Topic modeling

Another strategy for evaluating raw text data is topic modeling. The goal is to categorize texts into interpretable topic clusters, so instead of a binary classification model, we use a clustering model.

How it works:

  • Apply the clustering model to new batches of data.

  • Monitor the size and share of different topics over time.

  • Changes in topics can indicate data drift.

Using this method can be challenging because building a good clustering model is hard:

  • There is no single “ideal” cluster structure to aim for.

  • Building accurate and interpretable clusters typically requires manual tuning.
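
The sketch below illustrates the monitoring loop described above. It is one possible setup, not the course implementation: NMF over TF-IDF is an illustrative choice of topic model, and the `fit_topic_model` and `topic_shares` helpers, as well as the `reference_texts` and `current_batch` variables in the usage comments, are hypothetical.

```python
# A minimal sketch of topic-based drift monitoring: fit a topic model on the
# reference data, then track topic shares in each new batch of texts.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer


def fit_topic_model(reference_texts, n_topics=10):
    """Fit a TF-IDF + NMF topic model on the reference dataset."""
    vectorizer = TfidfVectorizer(max_features=10_000, stop_words="english")
    nmf = NMF(n_components=n_topics, random_state=42)
    nmf.fit(vectorizer.fit_transform(reference_texts))
    return vectorizer, nmf


def topic_shares(texts, vectorizer, nmf):
    """Assign each text to its dominant topic and return the share of each topic."""
    weights = nmf.transform(vectorizer.transform(texts))
    dominant_topic = weights.argmax(axis=1)
    counts = np.bincount(dominant_topic, minlength=nmf.n_components_)
    return counts / counts.sum()


# Usage (assuming `reference_texts` and `current_batch` lists of raw texts):
# vectorizer, nmf = fit_topic_model(reference_texts)
# ref_shares = topic_shares(reference_texts, vectorizer, nmf)
# cur_shares = topic_shares(current_batch, vectorizer, nmf)
# A large shift in topic shares over time can indicate data drift:
# max_share_change = np.abs(cur_shares - ref_shares).max()
```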

Summing up

Defining data quality and tracking data drift for text data can be challenging. However, you can extract interpretable signals from text data to detect drift: methods such as a domain classifier and topic modeling help you monitor for drift and evaluate the quality of raw text data.

Further reading:

  • The domain classifier approach is described in more detail in the paper “Failing loudly: An Empirical Study of Methods for Detecting Dataset Shift”.

  • Monitoring NLP models in production: a tutorial on detecting drift in text data.

Up next: an exploration of alternative text drift detection methods that use descriptors.