Amazon Bedrock Model Evaluation - Course Notes
In order to choose a model, you may sometimes want to evaluate it, and you may want to bring some level of rigor to that evaluation.
Automatic Evaluation on Amazon Bedrock
On Amazon Bedrock you can do what's called Automatic Evaluation. This is to evaluate a model for quality control: you're going to give it some tasks.
Built-in Task Types
So you have some built-in task types such as:
- Text summarization
- Question and answer
- Text classification
- Open-ended text generation
You're going to choose one of these task types, and then you either add your own prompt datasets or use one of the built-in, curated prompt datasets from AWS on Amazon Bedrock. Thanks to all this, scores are calculated automatically.
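As a hedged sketch of what kicking off such a job could look like with boto3's bedrock client: create_evaluation_job is a real API, but the nested field names, the built-in dataset and metric identifiers, and the role ARN, bucket, and model ID below are from memory or pure placeholders, so check the boto3 documentation before relying on them.

```python
# Hedged sketch of starting an automatic evaluation job via boto3.
# Field names and identifiers are assumptions; ARNs and buckets are placeholders.
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

response = bedrock.create_evaluation_job(
    jobName="summarization-quality-check",
    roleArn="arn:aws:iam::123456789012:role/BedrockEvalRole",      # placeholder role
    outputDataConfig={"s3Uri": "s3://my-eval-bucket/results/"},     # placeholder bucket
    inferenceConfig={
        "models": [{"bedrockModel": {"modelIdentifier": "amazon.titan-text-express-v1"}}]
    },
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "Summarization",             # one of the built-in task types
                    "dataset": {"name": "Builtin.Gigaword"},  # built-in curated dataset (name assumed)
                    "metricNames": ["Builtin.Accuracy", "Builtin.Toxicity"],
                }
            ]
        }
    },
)
print(response["jobArn"])
```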
How Automatic Evaluation Works
So we have benchmark questions, and again, you can bring your own or use the ones provided by AWS. Because you're creating a benchmark, each benchmark question also needs a benchmark answer, which is what you would consider the ideal answer to that question.
Then you have the model to evaluate: you submit all the benchmark questions to it, and it generates answers, since it's a GenAI model.
Next, we need to compare the benchmark answers to the generated answers. Because this is an automatic evaluation, the comparison is done by another GenAI model, called a judge model, which looks at each benchmark answer and generated answer and is asked something along the lines of "can you tell if these answers are similar or not?"
The judge model then gives a grading score. There are different ways to calculate this grading score, for example BERTScore or F1, but no need to linger on that specific jargon for now.
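To make the judge-model idea concrete, here is a minimal sketch using boto3's bedrock-runtime Converse API; the judge model ID and the grading prompt are my own placeholders, not the exact mechanism Bedrock uses internally.

```python
# Minimal "LLM as judge" sketch: ask a second model to grade how similar
# a generated answer is to the benchmark (reference) answer.
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def judge_similarity(benchmark_answer: str, generated_answer: str) -> str:
    prompt = (
        "You are grading a model's answer against a reference answer.\n"
        f"Reference answer: {benchmark_answer}\n"
        f"Generated answer: {generated_answer}\n"
        "On a scale of 1 to 5, how similar are they? Reply with the number only."
    )
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # placeholder judge model
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

print(judge_similarity("Paris is the capital of France.",
                       "The capital of France is Paris."))
```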
Benchmark Datasets
A quick note on benchmark datasets: they're very helpful. A benchmark dataset is a curated collection of data designed specifically to evaluate the performance of a language model, and it can cover many different topics, complexities, or even linguistic phenomena.
Why Use Benchmark Datasets?
So why do you use benchmark datasets? Well, they're very helpful because you can measure:
- The accuracy of your model
- The speed and efficiency
- The scalability of your model because you may throw a lot of requests at it at the same time
Some benchmark datasets are specifically designed to let you quickly detect bias and potential discrimination against a group of people by your model, and this is something the exam can ask you about.
Using such a benchmark dataset therefore gives you a quick, low-administrative-effort way to evaluate your models for potential bias.
Of course, you can also create your own benchmark datasets, specific to your business, if you need to evaluate against specific business criteria.
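As an illustration, here is a small sketch of what a custom prompt dataset could look like; the JSON Lines field names (prompt, referenceResponse, category) follow my recollection of Bedrock's custom dataset format, so verify them against the documentation, and the contents are invented examples.

```python
# Writes a tiny custom prompt dataset as JSON Lines.
# Field names are assumed; the questions and answers are made up.
import json

examples = [
    {
        "prompt": "What is our return policy?",
        "referenceResponse": "Items can be returned within 30 days with a receipt.",
        "category": "CustomerSupport",
    },
    {
        "prompt": "Summarize this order confirmation email in one sentence.",
        "referenceResponse": "The order has shipped and will arrive by Friday.",
        "category": "Summarization",
    },
]

with open("custom-benchmark.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```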
Human Evaluations
Of course, we can also do human evaluations. It's the exact same idea: we have benchmark questions and benchmark answers, but this time humans review the generated answers against the benchmark answers and decide whether each one looks correct or not. These reviewers form a work team, which could be employees of your company, subject matter experts (SMEs), and so on.
How Can They Evaluate?
So how can they evaluate? Well, there's different types of metrics:
- Thumbs up or thumbs down
- Ranking
- And so on
The result is again a grading score, but this time with a human in the loop, which you may prefer. You can again choose from the built-in task types, or you can create a custom task, because with humans doing the evaluation you have a bit more freedom.
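As a minimal sketch (not a Bedrock API), here is one way thumbs-up/thumbs-down votes could be rolled up into a grading score; the questions and votes are made up.

```python
# Aggregate human thumbs up/down votes into per-question and overall approval rates.
ratings = [
    {"question": "Q1", "votes": ["up", "up", "down"]},
    {"question": "Q2", "votes": ["up", "up", "up"]},
]

for item in ratings:
    approval = item["votes"].count("up") / len(item["votes"])
    print(f'{item["question"]}: approval rate = {approval:.0%}')

overall = sum(i["votes"].count("up") for i in ratings) / sum(len(i["votes"]) for i in ratings)
print(f"Overall approval: {overall:.0%}")
```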
Foundation Model Evaluation Metrics
There are a few metrics you can use to evaluate the output of an FM from a generic perspective: ROUGE, BLEU, BERTScore, and perplexity. I'm going to give you a high-level overview of each, which should be more than enough for the exam.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. Its purpose, and I think that's what you need to understand from an exam perspective, is to evaluate automatic summarization and machine translation systems. So it's very much dedicated to these two things, and it comes in different variants.
ROUGE-N, where N is usually between one and four, measures the number of matching n-grams between the reference and the generated text.
What does that mean? You have a reference text, which is what you would like your foundation model's output to be, and then whatever text the foundation model actually generated. ROUGE counts how many n-grams match between the two.
With one-grams, you're counting how many individual words match, because a one-gram is just a word. With two-grams, you're looking at combinations of two consecutive words. So if you have "the apple fell from the tree," you're going to look at "the apple," "apple fell," "fell from," "from the," and "the tree," and again you count how many of these match between your reference text and your generated text.
If you take a very high N, for example 10-grams, a match means 10 words appearing in exactly the same order in both the reference and the generated text. Overall, ROUGE-N is very easy to compute and very easy to make sense of.
You also have ROUGE-L, which computes the longest common subsequence between the reference and the generated text: what is the longest sequence of words shared between the two? This makes a lot of sense, for example, for machine translation systems.
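Here is a minimal sketch of both ideas, assuming simple whitespace tokenization; real implementations (for example the rouge-score package) also handle stemming and report precision, recall, and F1.

```python
# ROUGE-N idea: count matching n-grams; ROUGE-L idea: longest common subsequence.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(reference, generated, n=2):
    ref, gen = reference.lower().split(), generated.lower().split()
    ref_ngrams = ngrams(ref, n)
    gen_ngrams = set(ngrams(gen, n))
    matches = sum(1 for g in ref_ngrams if g in gen_ngrams)
    return matches / len(ref_ngrams) if ref_ngrams else 0.0

def lcs_length(a, b):
    # classic dynamic-programming longest common subsequence, the core of ROUGE-L
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

reference = "the apple fell from the tree"
generated = "an apple fell from a tree"
print(rouge_n_recall(reference, generated, n=2))          # bigram recall: 0.4
print(lcs_length(reference.split(), generated.split()))   # LCS length in words: 4
```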
BLEU (Bilingual Evaluation Understudy)
Then you have BLEU. ROUGE, by the way, means red in French and BLEU means blue, so we have some colors here. BLEU stands for Bilingual Evaluation Understudy.
It evaluates the quality of generated text, especially for translations. It considers precision and also penalizes the output for being too brief.
It looks at a combination of n-grams, and the formula is a little different: if the translation is too short, for example, it gets a bad score. It's a slightly more advanced metric, and I'm not going to show the mechanism underneath because you don't need to know it, but it's very helpful for translations and that's what you need to remember.
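If you want to see what computing it looks like in practice, here is a quick example using NLTK's implementation; the sentences and weights are just illustrative.

```python
# BLEU with NLTK (pip install nltk): averages 1- through 4-gram precision
# and applies a brevity penalty; smoothing avoids zero scores on short sentences.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]   # list of reference token lists
candidate = ["the", "cat", "sat", "on", "the", "mat"]    # generated translation, tokenized

score = sentence_bleu(
    reference,
    candidate,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {score:.3f}")
```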
BERTScore
ROUGE and BLEU only look at words and combinations of words when they compare texts. But we have something a bit more advanced.
Thanks to AI, we now have BERTScore. Here we look at the semantic similarity between the generated text and the reference text. That means you compare the actual meaning of the texts and check whether those meanings are very similar.
How do we get at meaning? A model computes the embeddings of both texts, compares them, and can calculate the cosine similarity between them.
Embeddings are something we'll see very soon; they're a bunch of numbers that represent a text. If the numbers of two embeddings are very close, the texts are semantically similar.
So with BERTScore we're not looking at individual words; we're looking at the context and the nuance of the text. It's a very good metric now that we have access to AI.
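As a minimal sketch of that idea, here is a cosine similarity over two embedding vectors; the vectors are made up, and the real BERTScore metric matches token-level BERT embeddings rather than one vector per text.

```python
# Cosine similarity between two (placeholder) text embeddings:
# values near 1.0 indicate the texts are semantically close.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

embedding_reference = [0.12, 0.80, 0.33, 0.05]   # placeholder embedding of the reference text
embedding_generated = [0.10, 0.78, 0.35, 0.07]   # placeholder embedding of the generated text

print(f"Similarity: {cosine_similarity(embedding_reference, embedding_generated):.3f}")
```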
Perplexity
Perplexity measures how well the model predicts the next token, and lower is better: if a model is very confident about the next token, it is less perplexed and therefore likely more accurate.
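As a tiny worked example, assuming made-up token probabilities, perplexity is the exponential of the average negative log-probability of the tokens:

```python
# Perplexity from per-token probabilities (illustrative numbers).
import math

token_probabilities = [0.9, 0.8, 0.95, 0.6]  # model's probability for each next token

perplexity = math.exp(-sum(math.log(p) for p in token_probabilities) / len(token_probabilities))
print(f"Perplexity: {perplexity:.2f}")  # closer to 1.0 means a more confident model
```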
Practical Example
Just to give you a diagram: here we have a generative AI model trained on clickstream data, cart data, purchased items, and customer feedback, and we're going to generate dynamic product descriptions.
From this, we can compare the reference description to the generated one to compute the ROUGE or BLEU metric, and also look at similarity in terms of nuance with BERTScore.
All of this can be incorporated back into a feedback loop, so we can retrain the model and get better outputs based on the quality of these metric scores.
Business Metrics for Model Evaluation
On top of these ways of grading a foundation model, you may have business metrics to evaluate a model on. These are a little more difficult to evaluate, of course, but they could include:
- User satisfaction - Gather user feedback and assess satisfaction with the model's responses, for example the user satisfaction on an e-commerce platform
- Average revenue per user - If the GenAI app is successful, you hope this metric will go up
- Cross-domain performance - Is the model able to perform varied tasks across different domains?
- Conversion rates - What outcome do I want? If I want higher conversion rates, I monitor this metric and evaluate my model on it
- Efficiency - How efficient is the model in computation and resource utilization, and how much does it cost me?
So that's it for evaluating a foundation model.