Tutorial: Creating a Database for RAG using PDFs and PostgreSQL
Introduction
In this tutorial, we will guide you through the process of creating a database for Retrieval-Augmented Generation (RAG) applications, using PDFs as the source of information and PostgreSQL as the database that stores them. This setup leverages the OpenAI API to generate text embeddings, which are then stored in a PostgreSQL database to enable efficient search and information retrieval.
Before starting, ensure that PostgreSQL is installed and configured. If you need assistance, follow the official installation guide and create a new database.
Prerequisites
To follow along, you will need:
- Python installed on your machine
- PostgreSQL installed and configured
- Access to the OpenAI API
- The following Python libraries:
  - psycopg2
  - openai
  - PyPDF2
  - langchain
Hands-on
Step 1: Environment Setup
First, install the necessary libraries:
pip install psycopg2 openai PyPDF2 langchain
Set up the environment variable for the OpenAI API key. Ensure the OPENAI_API_KEY environment variable is set. For instructions, refer to this guide.
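If you prefer to set the key explicitly in code, here is a minimal sketch (the pre-1.0 openai package also picks up OPENAI_API_KEY automatically):

import os
import openai

# Read the key from the environment rather than hard-coding it
openai.api_key = os.environ["OPENAI_API_KEY"]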
Step 2: Extracting Text from PDFs
Use the following function to extract text from PDF files using PyPDF2:
from PyPDF2 import PdfReader

def extract_text_from_pdf(pdf_path):
    try:
        reader = PdfReader(pdf_path)
        text = ''
        for page in reader.pages:
            # extract_text() may return None for pages without a text layer
            text += page.extract_text() or ''
        return text
    except Exception as e:
        print(f"Error extracting text from PDF {pdf_path}: {e}")
        return None
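A quick usage sketch (the path below is only a placeholder):

# Hypothetical file path for illustration
text = extract_text_from_pdf('/path/to/sample.pdf')
if text:
    print(f"Extracted {len(text)} characters")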
Step 3: Generating Embeddings with OpenAI API
The class below generates embeddings using the OpenAI API, batching the inputs and retrying a batch when it hits a rate limit. The embed_documents method allows specifying a model, with the default being "text-embedding-3-large":
import openai
import time

class OpenAIEmbeddings:
    def embed_documents(self, texts, model="text-embedding-3-large"):
        valid_texts = [text for text in texts if isinstance(text, str) and text.strip()]
        if not valid_texts:
            print("No valid texts to generate embeddings.")
            return []
        all_embeddings = []
        batch_size = 2048  # maximum number of inputs the embeddings endpoint accepts per request
        for i in range(0, len(valid_texts), batch_size):
            batch_texts = valid_texts[i:i + batch_size]
            while True:
                try:
                    response = openai.Embedding.create(
                        input=batch_texts,
                        model=model
                    )
                    embeddings = [data['embedding'] for data in response['data']]
                    all_embeddings.extend(embeddings)
                    break
                except openai.error.RateLimitError as e:
                    print(f"Rate limit error: {e}. Waiting 60 seconds...")
                    time.sleep(60)  # wait, then retry the same batch
                except openai.error.InvalidRequestError as e:
                    print(f"Error generating embeddings: {e}")
                    break  # skip this batch and move on
        return all_embeddings

    def embed_query(self, text, model="text-embedding-3-large"):
        # LangChain vector stores call this method at query time
        result = self.embed_documents([text], model=model)
        return result[0] if result else []
Note: You can specify a different model based on your needs, considering factors like vector size, computational efficiency, and data type.
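For example, a quick sketch of calling the class with a smaller model (text-embedding-3-small returns 1536-dimensional vectors, versus 3072 for text-embedding-3-large):

embedder = OpenAIEmbeddings()
vectors = embedder.embed_documents(["What is RAG?"], model="text-embedding-3-small")
if vectors:
    print(len(vectors[0]))  # 1536 for text-embedding-3-small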
Step 4: PostgreSQL Database Setup
Note: We assume that you already have a database set up.
Configure your database connection details:
db_config = {
    'dbname': 'rag_test',
    'user': 'postgres',
    'password': '12345',
    'host': 'localhost',
    'port': '5432'
}

connection_string = f"postgresql://{db_config['user']}:{db_config['password']}@{db_config['host']}:{db_config['port']}/{db_config['dbname']}"
Note: For production, avoid exposing sensitive information in your code. Use environment variables or secure vaults to manage credentials.
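One way to do that, sketched here using the conventional libpq environment variable names:

import os

db_config = {
    'dbname': os.environ.get('PGDATABASE', 'rag_test'),
    'user': os.environ.get('PGUSER', 'postgres'),
    'password': os.environ['PGPASSWORD'],  # fail fast if the password is not set
    'host': os.environ.get('PGHOST', 'localhost'),
    'port': os.environ.get('PGPORT', '5432')
}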
Clearing and Rebuilding the Database
Given that our dataset consists of only 6 PDF files totaling 23MB, it’s efficient to clear and rebuild the entire database rather than updating or appending new data. This approach is feasible for small datasets and simplifies the data management process.
To clear existing tables:
import psycopg2

connection = psycopg2.connect(connection_string)
cursor = connection.cursor()

def clear_tables():
    # Drop the tables that the LangChain store creates, so everything is rebuilt from scratch
    cursor.execute("DROP TABLE IF EXISTS langchain_pg_embedding CASCADE;")
    cursor.execute("DROP TABLE IF EXISTS langchain_pg_collection CASCADE;")
    connection.commit()

clear_tables()
Step 5: Processing and Storing PDFs
Process the PDFs and store them in the database:
from langchain.vectorstores.pgembedding import PGEmbedding
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
import os

def process_and_store_pdfs(directory):
    documents = []
    if not os.path.exists(directory):
        print(f"The directory {directory} does not exist.")
        return
    for filename in os.listdir(directory):
        if filename.endswith('.pdf'):
            pdf_path = os.path.join(directory, filename)
            print(f"Processing {filename}...")
            content = extract_text_from_pdf(pdf_path)
            if content:
                # Split each document into overlapping chunks before embedding
                text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
                chunks = text_splitter.split_text(content)
                for chunk in chunks:
                    documents.append(Document(
                        page_content=chunk,
                        metadata={"title": filename}
                    ))
    if documents:
        embedding_function = OpenAIEmbeddings()
        vector_store = PGEmbedding(
            connection_string=connection_string,
            embedding_function=embedding_function
        )
        try:
            vector_store.add_documents(documents)
        except Exception as e:
            print(f"Error inserting documents into the database: {e}")
Set the path to your PDF files:
pdf_directory = '/path/to/your/pdfs'
process_and_store_pdfs(pdf_directory)
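To get a feel for how the splitter chunks text, you can try it on a short string with a deliberately small chunk size; a toy illustration:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=20, chunk_overlap=5)
chunks = splitter.split_text("Retrieval-Augmented Generation grounds answers in your documents.")
for c in chunks:
    print(repr(c))  # consecutive chunks share up to 5 characters of overlap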
Step 6: Data Insertion Verification
Verify that the data was correctly inserted:
try:
    with psycopg2.connect(connection_string) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT COUNT(*) FROM langchain_pg_embedding;")
            count = cur.fetchone()[0]
            print(f"Total number of documents in the 'langchain_pg_embedding' table: {count}")
except Exception as e:
    print(f"Error checking the 'langchain_pg_embedding' table: {e}")

# Close the cursor and connection opened earlier for clear_tables()
cursor.close()
connection.close()
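With the data verified, you can also sanity-check retrieval itself. A minimal sketch, reusing the connection string and the OpenAIEmbeddings class from Step 3 (the question is only an example; similarity_search relies on the embed_query method defined above):

from langchain.vectorstores.pgembedding import PGEmbedding

vector_store = PGEmbedding(
    connection_string=connection_string,
    embedding_function=OpenAIEmbeddings()
)
results = vector_store.similarity_search("What topics do these PDFs cover?", k=3)
for doc in results:
    print(doc.metadata["title"], doc.page_content[:100])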
Conclusion
This tutorial has guided you through setting up a system to extract text from PDFs, generate embeddings using the OpenAI API, and store the data in a PostgreSQL database. You can enhance this system by improving text extraction, refining text chunking methods, and optimizing database interactions.
Note: This tutorial is designed for hands-on practice. If you want to understand how an embedding database works in depth, see this article.
There are many ways to improve the efficiency of a RAG database, such as:
- Enhancing entity recognition and boundary detection
- Optimizing database schema and indexing (see the sketch after this list)
- Implementing advanced search and retrieval algorithms
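On the indexing point, the PGEmbedding store in LangChain exposes a create_hnsw_index helper backed by the pg_embedding extension. The parameter names below follow the LangChain release used here and may differ in yours; treat this as a sketch rather than a definitive recipe:

# Hedged sketch: build an HNSW index over the stored embeddings
vector_store.create_hnsw_index(
    max_elements=10000,    # rough upper bound on the number of stored vectors
    dims=3072,             # dimensionality of text-embedding-3-large embeddings
    m=8,                   # HNSW graph connectivity
    ef_construction=16,    # build-time accuracy/speed trade-off
    ef_search=16           # query-time accuracy/speed trade-off
)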