
Does your AI product actually work? How to develop the right metric system




In my first role as a product manager for machine learning (ML), a simple question inspired passionate debates across functions and leaders: How do we know whether this product actually works? The product I worked on served both internal and external customers. The model enabled internal teams to identify our customers' most important problems so they could prioritize the right set of experiences to fix them. With such a complex web of interdependencies among internal and external customers, choosing the right metrics to capture the product's impact was crucial to steering it toward success.

Not tracking whether your product is working well is like landing a plane without any instructions from air traffic control. There is absolutely no way you can make sound decisions for your customers without knowing what is going right or wrong. If you do not actively define the metrics, your team will define its own backup metrics. The risk of having several flavors of an "accuracy" or "quality" metric is that everyone develops their own version, which leads to a scenario in which you may not all be working toward the same outcome.

For example, when I reviewed my annual goal and the underlying metric with our engineering team, the immediate feedback was: "But this is a business metric; we already track precision and recall."

First determine what you want to know about your AI product

Once you take on the task of defining metrics for your product, where should you start? In my experience, the complexity of operating an ML product with multiple customers carries over into defining metrics for the model as well. What do I use to measure whether the model is working well? Measuring the outcomes of the internal teams that prioritize launches based on our models would not be fast enough. Measuring whether the customer adopted the solutions recommended by our model could risk drawing conclusions from a very broad adoption metric (what if the customer did not adopt the solution simply because they just wanted to reach a support agent?).

Fast-forward to the era of large language models (LLMs), where we no longer have just a single output from an ML model; we also have text answers, images and music as outputs. The dimensions of the product that require metrics now multiply quickly: formats, customers, type … the list goes on.

Across all my products, my first step when developing metrics is to distill what I want to know about the product's impact on customers into a few key questions. Identifying the right questions makes it easier to identify the right set of metrics. Here are some examples, with a small instrumentation sketch following the list:

  1. Did the customer get an output? → metric for coverage
  2. How long did the product take to provide an output? → metric for latency
  3. Did the user like the output? → metrics for customer feedback, customer adoption and retention
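To make the mapping from questions to metrics concrete, here is a minimal sketch (in Python) of how these three metrics might be computed from product event logs. The event schema and the compute_core_metrics helper are hypothetical illustrations, not part of any specific product's instrumentation; your own logging will look different.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional

@dataclass
class ProductEvent:
    """One logged request/response pair from the product (hypothetical schema)."""
    request_id: str
    output_returned: bool          # did the customer get an output?
    latency_ms: Optional[float]    # how long did the product take?
    thumbs_up: Optional[bool]      # explicit user feedback, if any

def compute_core_metrics(events: list[ProductEvent]) -> dict:
    """Coverage, latency and feedback metrics for the three key questions."""
    total = len(events)
    answered = [e for e in events if e.output_returned]
    latencies = [e.latency_ms for e in answered if e.latency_ms is not None]
    rated = [e for e in events if e.thumbs_up is not None]
    return {
        # Q1: Did the customer get an output? -> coverage
        "coverage_pct": 100 * len(answered) / total if total else 0.0,
        # Q2: How long did the product take? -> latency (median over answered requests)
        "median_latency_ms": median(latencies) if latencies else None,
        # Q3: Did the user like the output? -> share of positive explicit feedback
        "thumbs_up_pct": 100 * sum(e.thumbs_up for e in rated) / len(rated) if rated else None,
    }
```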

Once you have identified your key questions, the next step is to identify a set of sub-questions for "input" and "output" signals. Output metrics are lagging indicators, where you measure an event that has already happened. Input metrics and leading indicators can be used to identify trends or predict outcomes. Below are ways to add the right sub-questions for lagging and leading indicators to the questions above (a sketch of a metric catalog that records these tags follows the list). Not all questions need leading/lagging indicators.

  1. Did the customer get an output? → coverage
  2. How long did the product take to provide an output? → latency
  3. Did the user like the output? → customer feedback, customer adoption and retention
    1. Did the user indicate that the output was right/wrong? (output)
    2. Was the output good/fair? (input)
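One lightweight way to keep the input/output and leading/lagging distinctions from drifting is to record them alongside each metric definition. The sketch below assumes a hypothetical MetricDef structure and example entries; the tags shown are illustrative, not prescriptive.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass(frozen=True)
class MetricDef:
    """A single metric, tagged with the signal and indicator it represents."""
    question: str                       # the key question the metric answers
    name: str
    signal: Literal["input", "output"]
    indicator: Literal["leading", "lagging"]

# Hypothetical catalog for the key questions above.
METRIC_CATALOG = [
    MetricDef("Did the customer get an output?", "coverage_pct", "output", "lagging"),
    MetricDef("How long did the product take?", "median_latency_ms", "output", "lagging"),
    MetricDef("Did the user indicate the output was right/wrong?", "thumbs_up_pct", "output", "lagging"),
    MetricDef("Was the output good/fair?", "rubric_pass_pct", "input", "leading"),
]

# Example: list the leading indicators you can act on before outcomes land.
leading = [m.name for m in METRIC_CATALOG if m.indicator == "leading"]
print(leading)  # ['rubric_pass_pct']
```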

The third and final step is to identify the method for collecting the metrics. Most metrics are collected at scale through new instrumentation via data engineering. In some cases (as in question 3 above), especially for ML-based products, you have the option of manual or automated evaluations that assess the model outputs. While it is always best to develop automated evaluations, starting with manual evaluations for "Was the output good/fair?" and creating a rubric with definitions of good, fair and not good also lays the groundwork for a rigorous and tested automated evaluation process.
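As a starting point for the manual "Was the output good/fair?" evaluation, here is a minimal sketch of a rubric and a review record. The labels, fields and pass-rate helper are assumptions for illustration; the idea is that the same rubric definitions used for manual review can later seed an automated evaluation pipeline.

```python
from dataclasses import dataclass
from enum import Enum

class QualityLabel(str, Enum):
    """Rubric labels; the definitions in the comments are illustrative placeholders."""
    GOOD = "good"          # accurate, complete, and directly usable by the customer
    FAIR = "fair"          # usable with minor edits or caveats
    NOT_GOOD = "not_good"  # wrong, misleading, or unusable

@dataclass
class ManualReview:
    """One human-graded model output, recorded against the rubric."""
    example_id: str
    model_output: str
    label: QualityLabel
    reviewer: str
    notes: str = ""

def rubric_pass_rate(reviews: list[ManualReview]) -> float:
    """% of reviewed outputs rated good or fair (the 'input' metric above)."""
    if not reviews:
        return 0.0
    passing = sum(r.label in (QualityLabel.GOOD, QualityLabel.FAIR) for r in reviews)
    return 100 * passing / len(reviews)
```

Once the rubric stabilizes, these review records double as labeled examples for building or validating an automated grader.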

Example use cases: AI search, listing descriptions

The framework above can be applied to any ML-based product to identify its list of primary metrics. Let's take search as an example.

| Question | Metrics | Type of metric |
| --- | --- | --- |
| Did the customer get an output? → coverage | % of search sessions with search results shown to the customer | Output |
| How long did the product take to provide an output? → latency | Time taken to display search results to the user | Output |
| Did the user like the output? → customer feedback, customer adoption and retention: Did the user indicate that the output was right/wrong? | % of search sessions with thumbs-up feedback on the search results from the customer, or % of search sessions with clicks from the customer | Output |
| Did the user like the output? → customer feedback, customer adoption and retention: Was the output good/fair? | % of search results marked as "good/fair" for each search term, per the quality rubric | Input |
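Below is a rough sketch of how the search metrics in the table might be computed from session logs. The SearchSession fields are hypothetical stand-ins for whatever your instrumentation actually records.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SearchSession:
    """One search session, as it might be logged (hypothetical schema)."""
    session_id: str
    results_shown: bool                     # were search results displayed to the customer?
    time_to_results_ms: Optional[float] = None
    thumbs_up: bool = False                 # explicit feedback on the results
    clicked_result: bool = False

def search_metrics(sessions: list[SearchSession]) -> dict:
    total = len(sessions)
    shown = [s for s in sessions if s.results_shown]
    latencies = [s.time_to_results_ms for s in shown if s.time_to_results_ms is not None]
    return {
        # Coverage: % of search sessions with results shown to the customer
        "coverage_pct": 100 * len(shown) / total if total else 0.0,
        # Latency: average time taken to display search results to the user
        "avg_time_to_results_ms": sum(latencies) / len(latencies) if latencies else None,
        # Feedback: a thumbs-up or a click counts as a positive signal on the output
        "positive_feedback_pct": (
            100 * sum(s.thumbs_up or s.clicked_result for s in shown) / len(shown)
            if shown else 0.0
        ),
    }
```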

What about a product that generates descriptions for a listing (whether it is a menu item on DoorDash or a product listing on Amazon)?

| Question | Metrics | Type of metric |
| --- | --- | --- |
| Did the customer get an output? → coverage | % of listings with a generated description | Output |
| How long did the product take to provide an output? → latency | Time taken to generate descriptions for the user | Output |
| Did the user like the output? → customer feedback, customer adoption and retention: Did the user indicate that the output was right/wrong? | % of listings with generated descriptions that required edits from the content team/seller/customers | Output |
| Did the user like the output? → customer feedback, customer adoption and retention: Was the output good/fair? | % of listing descriptions marked as "good/fair", per the quality rubric | Input |
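And a parallel sketch for the listing-description example, covering coverage, the edit-rate (output, lagging) metric and the rubric (input, leading) metric from the table. The ListingRecord schema is again a hypothetical illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ListingRecord:
    """One listing, as the description-generation product might log it (hypothetical)."""
    listing_id: str
    description_generated: bool
    required_edits: bool = False          # edited by content team, seller, or customer
    rubric_label: Optional[str] = None    # "good", "fair", or "not_good" per the quality rubric

def listing_metrics(listings: list[ListingRecord]) -> dict:
    total = len(listings)
    generated = [rec for rec in listings if rec.description_generated]
    rated = [rec for rec in generated if rec.rubric_label is not None]
    return {
        # Coverage: % of listings with a generated description
        "coverage_pct": 100 * len(generated) / total if total else 0.0,
        # Output (lagging): % of generated descriptions that needed edits
        "edit_rate_pct": 100 * sum(rec.required_edits for rec in generated) / len(generated)
        if generated else 0.0,
        # Input (leading): % of rated descriptions marked good or fair per the rubric
        "rubric_pass_pct": 100 * sum(rec.rubric_label in ("good", "fair") for rec in rated) / len(rated)
        if rated else 0.0,
    }
```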

The approach described above can be extended to other ML-based products. I hope this framework helps you define the right metrics for your ML model.

Sharanya Rao is a group product manager at Intuit.

