Demystifying RAGAS: A Deep Dive into Evaluating Retrieval-Augmented Generation Pipelines (Part 3: Implementing RAGAS in Action) — Code Included!

Bishal Bose
5 min read · Jul 15, 2024

In the previous parts of this blog series, we explored the concept of Retrieval-Augmented Generation (RAG) and its evaluation challenges. We then introduced the RAGAS framework and its functionalities for automated and reference-free evaluation. Finally, we delved into the core metrics used by RAGAS to assess various aspects of your RAG pipeline.

Now, it’s time to get your hands dirty! This part will guide you through implementing RAGAS to evaluate your own RAG pipelines. We’ll provide code examples and a practical walk-through to get you started.

Setting Up the RAGAS Framework

Before diving into the code, let’s ensure you have the necessary tools:

Python Environment: You’ll need Python 3.8 or later installed on your system. You can check your Python version by running python --version or python3 --version in your terminal. If you don't have Python installed, head over to https://www.python.org/downloads/ to download and install it.

Libraries: We’ll be using several Python libraries for this implementation. You can install them using the pip command in your terminal:

pip install datasets transformers ragas
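To confirm everything installed correctly, you can run a quick import check in a Python shell (a minimal sanity check; it only verifies that the three libraries import without errors):

import datasets
import transformers
import ragas

print("All three libraries imported successfully")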

A Sample RAG Pipeline (Just for Demonstration)

For demonstration purposes, let’s create a simple RAG pipeline with the following components:

  • Retrieval Model: We’ll use a basic TF-IDF based retrieval model to find relevant passages from a pre-defined knowledge base.
  • Large Language Model (LLM): We’ll simulate an LLM using the facebook/bart-base pre-trained model from the Transformers library for text generation.

Note: This is a simplified example. In real-world scenarios, you’d likely use a more sophisticated retrieval model and a powerful LLM like GPT-3 or Jurassic-1 Jumbo.

Creating a Sample Knowledge Base

Here’s a sample knowledge base containing factual information about different countries:

[
  { "title": "France", "content": "France is a country located in Western Europe. The capital is Paris. The official language is French." },
  { "title": "Germany", "content": "Germany is a country located in Central Europe. The capital is Berlin. The official language is German." },
  { "title": "Italy", "content": "Italy is a country located in Southern Europe. The capital is Rome. The official language is Italian." }
]
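In the code below we’ll define this knowledge base inline as a Python list, but you could equally keep it in a file, say knowledge_base.json (a hypothetical filename), and load it with the standard json module:

import json

# Load the knowledge base from a JSON file (hypothetical path: knowledge_base.json)
with open("knowledge_base.json", "r", encoding="utf-8") as f:
    knowledge_base = json.load(f)

print(len(knowledge_base), "documents loaded")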

Writing Python Code for RAGAS Evaluation

Now, let’s write the Python code to utilize RAGAS for evaluating this sample RAG pipeline:

from datasets import Dataset
from ragas import evaluate
# RAGAS expects metric objects, not plain strings.
# Note: in newer RAGAS releases, context_relevancy is deprecated in favour of context_precision.
from ragas.metrics import context_relevancy, faithfulness, answer_correctness

# Define the sample knowledge base as a list of dictionaries
knowledge_base = [
    {"title": "France", "content": "..."},  # Replace "..." with the actual content from the previous example
    {"title": "Germany", "content": "..."},
    {"title": "Italy", "content": "..."}
]

# Create a Dataset object from the knowledge base
kb_dataset = Dataset.from_list(knowledge_base)

# Define a simple retrieval function (replace this with your actual retrieval model, e.g. TF-IDF)
def retrieve_documents(prompt):
    # Implement your retrieval logic here to find relevant passages from kb_dataset based on the prompt.
    # For simplicity, we just return the content of every document in this example.
    return [doc["content"] for doc in kb_dataset]

# Define the LLM for generation (replace this with your actual LLM)
def generate_text(prompt, retrieved_documents):
    # Implement your LLM logic here to generate text from the prompt and the retrieved documents.
    # For simplicity, we just concatenate the prompt and the retrieved passages in this example.
    return prompt + " Here's some info I found: " + " ".join(retrieved_documents)

# Define the evaluation metrics
metrics = [context_relevancy, faithfulness, answer_correctness]

# Build a sample evaluation set: questions, ground-truth answers, and the
# contexts/answers produced by our (placeholder) pipeline.
questions = ["What is the capital of France?", "What language do they speak in Germany?"]
ground_truths = ["Paris", "German"]
contexts = [retrieve_documents(q) for q in questions]
answers = [generate_text(q, ctx) for q, ctx in zip(questions, contexts)]

# Convert the data to a Dataset object with the column names RAGAS expects
dataset = Dataset.from_dict({
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truth": ground_truths,
})

# Evaluate the RAG pipeline using RAGAS
# (this calls a judge LLM under the hood, an OpenAI model by default, so OPENAI_API_KEY must be set)
score = evaluate(dataset, metrics=metrics)
print(score)

Code Breakdown:

Libraries:

  • datasets: This library helps us create and manage datasets in a structured format.
  • transformers: This library provides access to pre-trained transformer models for various NLP tasks, including text generation.
  • ragas: This is the core RAGAS library we'll be using for evaluation.

Sample Knowledge Base:

  • The code defines a list of dictionaries representing our knowledge base. Each dictionary entry has a “title” and “content” field containing factual information about different countries.

Creating a Dataset:

  • The Dataset.from_list function from the datasets library converts the knowledge base list into a structured Dataset object, giving our retrieval function an efficient, column-oriented view of the passages (see the short sketch below).
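As a quick illustration of what that object looks like, here is a minimal sketch using the datasets API to inspect its columns and rows:

from datasets import Dataset

knowledge_base = [
    {"title": "France", "content": "France is a country located in Western Europe. The capital is Paris. The official language is French."},
    {"title": "Germany", "content": "Germany is a country located in Central Europe. The capital is Berlin. The official language is German."},
]

kb_dataset = Dataset.from_list(knowledge_base)

# Inspect the structure of the Dataset
print(kb_dataset.column_names)   # ['title', 'content']
print(kb_dataset[0]["title"])    # 'France'
print(kb_dataset["content"])     # list with the content of every row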

Retrieval Function (Placeholder):

  • The retrieve_documents function is a placeholder for your actual retrieval model. In a real-world scenario, you'd implement your retrieval logic here, likely using TF-IDF, embeddings, or another ranking model to find the passages most relevant to the prompt (a minimal TF-IDF sketch follows this list).
  • For simplicity, this example just returns the content of every document in the knowledge base.
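As a rough sketch of what a real retrieve_documents might look like, here is a minimal TF-IDF retriever. It assumes scikit-learn is installed (not included in the pip command above) and reuses the kb_dataset object from the main example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# The raw passages from the knowledge base Dataset
passages = kb_dataset["content"]

# Fit a TF-IDF vectorizer on the knowledge-base passages once, up front.
vectorizer = TfidfVectorizer()
passage_vectors = vectorizer.fit_transform(passages)

def retrieve_documents(prompt, top_k=2):
    # Project the prompt into the same TF-IDF space and rank passages by cosine similarity.
    prompt_vector = vectorizer.transform([prompt])
    similarities = cosine_similarity(prompt_vector, passage_vectors)[0]
    top_indices = similarities.argsort()[::-1][:top_k]
    return [passages[i] for i in top_indices]

print(retrieve_documents("What is the capital of France?"))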

LLM Function (Placeholder):

  • The generate_text function simulates an LLM. In practice, you'd replace this with your actual LLM implementation, potentially using powerful models like GPT-3 or Jurassic-1 Jumbo (a minimal facebook/bart-base sketch follows this list).
  • This example simply concatenates the prompt with the retrieved passages for demonstration purposes.
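For completeness, here is a minimal sketch of a generate_text backed by the facebook/bart-base checkpoint via the Transformers pipeline API. Keep in mind that bart-base is a base seq2seq model, not an instruction-tuned LLM, so its output will be rough; it only stands in for a real generator:

from transformers import pipeline

# Load a small seq2seq model as a stand-in for a real LLM.
generator = pipeline("text2text-generation", model="facebook/bart-base")

def generate_text(prompt, retrieved_documents):
    # Stuff the retrieved passages into the model input as context.
    model_input = "Context: " + " ".join(retrieved_documents) + " Question: " + prompt
    output = generator(model_input, max_new_tokens=50)
    return output[0]["generated_text"]

print(generate_text(
    "What is the capital of France?",
    ["France is a country located in Western Europe. The capital is Paris."],
))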

Evaluation Metrics:

  • We define a list of metric objects we want to evaluate (imported from ragas.metrics):
  • context_relevancy: Measures how well-aligned the retrieved information is with the prompt.
  • faithfulness: Evaluates how accurately the generated text reflects the retrieved information.
  • answer_correctness: Compares the generated answer against the ground-truth answer. Depending on your task you may swap this for a more suitable metric; in a pure question-answering scenario, answer_relevancy assesses whether the generated text directly answers the prompt (see the snippet after this list).
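Swapping metrics only means importing a different metric object from ragas.metrics and changing the list. A minimal sketch (valid for the 0.1.x RAGAS releases current at the time of writing; names may differ in later versions):

from ragas.metrics import context_relevancy, faithfulness, answer_relevancy

# For a question-answering task, swap answer_correctness for answer_relevancy.
metrics = [context_relevancy, faithfulness, answer_relevancy]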

Sample Evaluation Dataset:

  • A sample evaluation dataset is created with Dataset.from_dict, containing the questions, the answers generated by our pipeline, the contexts returned by the retrieval function, and the ground-truth answers. This is the dataset RAGAS will score.

RAGAS Evaluation:

  • The evaluate function from the ragas library performs the evaluation. We provide the following arguments:
  • dataset: The dataset containing the questions, generated answers, retrieved contexts, and ground-truth answers.
  • metrics: The list of metric objects we want to compute.
  • Note that the retrieval and generation functions are not passed to evaluate itself; they are used beforehand to fill in the "contexts" and "answer" columns of the evaluation dataset (see the notes after this list on actually running the evaluation).
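Two practical notes on actually running the evaluation. RAGAS uses a judge LLM and an embedding model under the hood, and by default these are OpenAI models, so an API key must be available in your environment. The returned result prints as one aggregate value per metric and can be converted to a pandas DataFrame for a per-sample breakdown (a sketch, based on the 0.1.x API):

import os

# RAGAS calls an OpenAI model as its judge LLM (and OpenAI embeddings) by default,
# so an API key must be set before calling evaluate().
os.environ["OPENAI_API_KEY"] = "sk-..."  # replace with your own key

score = evaluate(dataset, metrics=metrics)

# Aggregate scores, one value per metric
print(score)

# Per-sample breakdown as a pandas DataFrame
df = score.to_pandas()
print(df.head())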

This part provided a practical example of using RAGAS for evaluation. While the code uses simplified functions for demonstration, it showcases the core workflow: defining a retrieval function and an LLM function, building an evaluation dataset, choosing metrics, and calling ragas.evaluate.

In the next part, we’ll explore The Future of RAGAS and Beyond, shedding light on how it has the potential to become the mainstream approach to RAG evaluation.

Stay updated with all my blogs & updates on LinkedIn. Welcome to my network. Follow me on LinkedIn here: https://www.linkedin.com/in/bishalbose294/



Written by Bishal Bose

Senior Lead Data Scientist @ MNC | Applied & Research Scientist | Google & AWS Certified | Gen AI | LLM | NLP | CV | TS Forecasting | Predictive Modeling
