4.5. Custom metrics in ML monitoring

Types of custom metrics. Business or product metrics, domain-specific metrics, and weighted metrics.

Video 5. Custom metrics in ML monitoring, by Emeli Dral

Types of custom metrics

While there is no strict division between “standard” and “custom” metrics, there is broad consensus on how to evaluate, say, classification model quality: metrics like precision and recall are fairly “standard.”

However, you often need to implement “custom” metrics to reflect specific aspects of model performance. These typically relate to business objectives or domain requirements and help capture the impact of an ML model in its operational context.

Here are some examples.

Business and product KPIs (or proxies). These metrics are aligned with key performance indicators that reflect the business goals and product performance.

Examples include:

  • Manufacturing optimization: raw materials saved.

  • Chatbots: number of successful chat completions.

  • Fraud detection: number of detected fraud cases over $50,000.

  • Recommender systems: share of recommendation blocks without clicks.

We recommend consulting business stakeholders before you even start building the model: they may suggest valuable KPIs, heuristics, and metrics worth tracking as early as the experimentation phase.

When direct measurement of a KPI is not possible, consider approximating the model impact. For example, you can assign an average “cost” to specific types of model errors based on domain knowledge.
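
Here is a minimal sketch of such a proxy metric for a fraud detection model. The per-error “costs” are hypothetical placeholders; in practice, you would estimate them together with domain experts.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical per-error costs based on domain knowledge: a missed fraud case
# (false negative) is assumed to be far more expensive than a false alarm
# (false positive) that only triggers a manual review.
COST_FALSE_POSITIVE = 10     # cost of one unnecessary manual review, in $
COST_FALSE_NEGATIVE = 500    # average loss from one missed fraud case, in $

def estimated_error_cost(y_true, y_pred) -> float:
    """Approximate the monetary impact of model errors on a batch of predictions."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fp * COST_FALSE_POSITIVE + fn * COST_FALSE_NEGATIVE

# Example: one false positive and one false negative -> 10 + 500 = 510
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1, 0, 1]
print(estimated_error_cost(y_true, y_pred))
```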

Domain-specific ML metrics. These are metrics that are commonly used in specific domains and industries.

Examples include:

  • Churn prediction in telecommunications: lift metrics (see the sketch after this list).

  • Recommender systems: serendipity or novelty metrics.

  • Healthcare: fairness metrics.

  • Speech recognition: word error rate.

  • Medical imaging: Jaccard index.
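
As an illustration of the first item, here is a minimal sketch of a lift metric for churn prediction: how much more prevalent churners are among the top-scored users compared to the overall churn rate. The function and variable names are illustrative, not taken from any library.

```python
import numpy as np

def lift_at_k(y_true, y_score, top_fraction: float = 0.1) -> float:
    """Lift in the top `top_fraction` of predictions: the churn rate among
    the highest-scored users divided by the overall churn rate."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n_top = max(1, int(len(y_true) * top_fraction))
    top_idx = np.argsort(-y_score)[:n_top]   # highest-scored users first
    rate_in_top = y_true[top_idx].mean()     # churn rate in the top segment
    base_rate = y_true.mean()                # churn rate in the full dataset
    return rate_in_top / base_rate if base_rate > 0 else float("nan")

# Example: if 30% of the top decile churns while the base rate is 10%,
# lift_at_k returns 3.0 -- the model finds churners 3x better than random.
```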

Weighted or aggregated metrics. Sometimes, you can design custom metrics as a “weighted” variation of other metrics. For example, you can adjust them to account for the importance of certain features or classes in your data.

Examples include:

  • Data drift weighted by feature importance (see the sketch after this list).

  • Measuring specific recommender system biases, for example, based on product popularity, price, or product group.

  • In imbalanced classification problems, you can weight precision and recall by class or by specific important user groups, for example, based on the estimated user Lifetime Value (LTV).
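
Below is a minimal sketch of the first idea: a dataset drift score where each feature’s drift result is weighted by its (hypothetical) importance. It uses a per-feature two-sample Kolmogorov-Smirnov test from SciPy for numerical columns; in practice, you could plug in any drift test or reuse results from your monitoring tool, such as Evidently.

```python
from typing import Dict

import pandas as pd
from scipy.stats import ks_2samp

def weighted_drift_share(reference: pd.DataFrame,
                         current: pd.DataFrame,
                         importances: Dict[str, float],
                         p_value_threshold: float = 0.05) -> float:
    """Share of drifted features weighted by feature importance, in [0, 1].
    A feature counts as drifted when the two-sample Kolmogorov-Smirnov
    p-value falls below the threshold (numerical features only)."""
    total_importance = sum(importances.values())
    score = 0.0
    for feature, importance in importances.items():
        p_value = ks_2samp(reference[feature].dropna(),
                           current[feature].dropna()).pvalue
        if p_value < p_value_threshold:
            score += importance / total_importance
    return score

# Hypothetical importances, e.g., taken from a trained model:
# importances = {"tenure": 0.5, "monthly_charges": 0.3, "support_calls": 0.2}
# weighted_drift_share(reference_df, current_df, importances)
```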

Summing up

There is no need to invent “custom” metrics just for the sake of it. However, you might want to implement them to:

  • better reflect important model qualities,

  • estimate the business impact of the model,

  • add metrics useful for product and business stakeholders and accepted within the domain.

Up next: optional code practice to create and implement a custom quality metric in the Evidently Python library.
