Document RAG
Retrieval-Augmented Generation (RAG) lets Rocco ground its enhancements in your uploaded research papers rather than in the LLM’s general training knowledge. This section explains how the pipeline works and how to use it programmatically.
What RAG Means Here
When you upload documents, Rocco:
Ingests them — splits each file into overlapping text chunks and enriches each chunk with source metadata
Embeds the chunks — converts each chunk to a dense vector using a sentence-transformer model
Stores the vectors in a FAISS index held in memory
Retrieves the most relevant chunks at enhancement time — querying the index with the description text and each failing rubric criterion to find supporting context
The retrieved chunks are passed to the Writer, which is instructed to cite them explicitly when adding new information to the description.
Document Ingestion
Class: src.ingestor.document_ingestor.DocumentIngestor
Supported formats:
PDF (.pdf): loaded with LangChain’s PyPDFLoader
DOCX (.docx, .doc): loaded with LangChain’s Docx2txtLoader
Chunking uses LangChain’s RecursiveCharacterTextSplitter:
Chunk size: 500 characters (default)
Chunk overlap: 100 characters — ensures context is not lost at boundaries
Separators tried in order:
"\n\n", "\n", ".", " ", ""
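To make the chunk size and overlap settings concrete, here is a minimal sketch of a fixed-window chunker. It is a simplified stand-in and not LangChain’s RecursiveCharacterTextSplitter, which additionally tries to break on the separators listed above before falling back to a hard cut. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts `step` characters after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the final window already reaches the end of the text
    return chunks


# 1200 characters of dummy text, split with the documented defaults.
text = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(text)
print([len(c) for c in chunks])
# The last 100 characters of one chunk reappear at the start of the next.
print(chunks[0][-100:] == chunks[1][:100])
```

With a 100-character overlap, a sentence that straddles a chunk boundary still appears intact in at least one chunk as long as it is shorter than the overlap.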
Each chunk is enriched with metadata:
| Field | Value |
|---|---|
| doc_title | Filename without extension (e.g., …) |
| page | Page number (PDFs only; …) |
| … | Sequential index within the document (0-based) |
| source | Full file path (set by LangChain) |
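The enrichment step can be sketched roughly as follows. This is an illustration under assumptions, not DocumentIngestor’s actual code: the `doc_title` and `page` keys are the ones read in the search example on this page, while `chunk_index` and `source` are assumed names for the sequential index and the file path, and `enrich_chunks` is a hypothetical helper.

```python
from pathlib import Path


def enrich_chunks(file_path, pages):
    """Attach metadata to chunks; `pages` is a list of (page_number_or_None, text)."""
    title = Path(file_path).stem  # filename without extension
    enriched = []
    for index, (page, text) in enumerate(pages):
        enriched.append({
            "page_content": text,
            "metadata": {
                "doc_title": title,
                "page": page,
                "chunk_index": index,  # assumed field name; 0-based position
                "source": str(file_path),  # assumed field name for the file path
            },
        })
    return enriched


# Hypothetical file name and chunk texts, for illustration only.
demo = enrich_chunks(
    "papers/berea_study.pdf",
    [(0, "Introduction ..."), (0, "Samples ..."), (1, "Imaging ...")],
)
for chunk in demo:
    print(chunk["metadata"])
```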
Embedding
Class: src.ingestor.embedder.DocumentEmbedder
Chunks are embedded using BAAI/bge-large-en-v1.5 via langchain-huggingface. This model produces 1024-dimensional dense vectors normalised to unit length, which makes cosine similarity equivalent to dot-product similarity.
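The equivalence can be checked with a few lines of plain Python. The sketch below uses small 3-dimensional toy vectors rather than real 1024-dimensional embeddings, and also shows the related identity that, for unit vectors, squared L2 distance equals 2 − 2 × cosine similarity, so an L2-based search and a cosine-based search rank results in the same order.

```python
import math


def normalise(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]
ua, ub = normalise(a), normalise(b)

# After normalisation the plain dot product already equals the cosine similarity.
print(math.isclose(dot(ua, ub), cosine(a, b)))

# For unit vectors, squared L2 distance is 2 - 2 * cosine.
sq_l2 = sum((x - y) ** 2 for x, y in zip(ua, ub))
print(math.isclose(sq_l2, 2 - 2 * cosine(a, b)))
```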
The model runs on CPU by default. To use a GPU, pass model_kwargs={"device": "cuda"} to
DocumentEmbedder.
Note
The embedding model is downloaded from HuggingFace on first use (~1.3 GB). Subsequent runs use the local cache.
Vector Store
Class: src.retriever.retriever.VectorStoreManager
Embeddings are stored in a FAISS flat-L2 index (via LangChain’s FAISS wrapper). Key methods:
add_documents(chunks): embed chunks and add them to the index
similarity_search(query, k): return the top-k most similar Document objects
similarity_search_with_score(query, k): same, but also returns the distance score
save(path) / load(path): persist and reload the index to/from disk
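A flat index performs an exhaustive search: the query vector is compared against every stored vector and the k closest are returned. The pure-Python sketch below imitates that behaviour on toy 2-dimensional vectors. It is a conceptual stand-in, not FAISS or VectorStoreManager code, and `flat_l2_search` is a hypothetical name. Because the score is a distance, lower values mean closer matches.

```python
import heapq


def flat_l2_search(query, vectors, k):
    """Exhaustively score every stored vector and return the k nearest.

    Returns (index, squared_l2_distance) pairs, nearest first.
    """
    scored = (
        (i, sum((q - x) ** 2 for q, x in zip(query, vec)))
        for i, vec in enumerate(vectors)
    )
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


stored = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
hits = flat_l2_search([0.8, 0.6], stored, k=2)
for index, distance in hits:
    print(index, round(distance, 2))
```

Exhaustive search is exact but its cost grows linearly with the number of stored chunks, which is usually acceptable for a handful of uploaded papers.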
Query Behavior
How Rocco Uses RAG
During enhancement, Rocco generates targeted search queries from the evaluation rubric to retrieve relevant context. Nine of the 10 evaluation criteria have associated keywords that guide the vector store search; Clarity and Accessibility has no automatic query:
| Rubric Criterion | Query Keywords |
|---|---|
| Self-Contained Description | data description, data summary, data overview |
| Context of Creation | research goals, objectives, study purpose, motivation |
| Porous Media Type | sample material, porous media type, lithology |
| Research Problem | research problem, research question, hypothesis |
| Reuse and Beneficiaries | applications, reuse, reproducibility, validation, machine learning, simulation |
| Methodology | methodology, experimental setup, x-ray imaging technique, data collection, image acquisition, scanning, image processing |
| Contents and Organization | dataset structure, organization, file, data, format, contents |
| Quality Control | quality control, validation, verification, calibration, inspection |
| Clarity and Accessibility | (No automatic query) |
| Keywords | keywords, relevant concepts, domain-specific terminology, nomenclature |
By default, Rocco queries all criteria (query_all=True). This retrieves relevant context across the full rubric, which generally produces stronger enhancements. Optionally, queries can be limited to only criteria that scored ≤0.5, focusing context retrieval on areas of weakness.
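The selection logic described above can be sketched as follows. This is an illustration under assumptions rather than Rocco’s implementation: `RUBRIC_KEYWORDS` holds only a shortened excerpt of the keyword table, `build_queries` is a hypothetical helper, joining keywords with commas is an assumed way of forming the query string, and a criterion with no recorded score is treated as failing.

```python
RUBRIC_KEYWORDS = {
    "Methodology": ["methodology", "experimental setup"],
    "Quality Control": ["quality control", "calibration"],
    "Clarity and Accessibility": [],  # no automatic query
}


def build_queries(scores, query_all=True, threshold=0.5):
    """Return one search query per selected criterion.

    scores: mapping of criterion name to its rubric score.
    """
    queries = {}
    for criterion, keywords in RUBRIC_KEYWORDS.items():
        if not keywords:
            continue  # criteria without keywords are never queried
        if not query_all and scores.get(criterion, 0.0) > threshold:
            continue  # passing criterion, skipped when focusing on weaknesses
        queries[criterion] = ", ".join(keywords)
    return queries


scores = {"Methodology": 0.25, "Quality Control": 1.0, "Clarity and Accessibility": 0.5}
print(build_queries(scores))                   # every keyword-backed criterion
print(build_queries(scores, query_all=False))  # only criteria scoring <= 0.5
```

Each resulting query string would then be passed to the vector store search, and the retrieved chunks collected for the Writer.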
Running Without the UI
from src.ingestor.document_ingestor import DocumentIngestor
from src.ingestor.embedder import DocumentEmbedder
from src.retriever.retriever import VectorStoreManager
# Build the pipeline
ingestor = DocumentIngestor(chunk_size=500, chunk_overlap=100)
embedder = DocumentEmbedder()
vsm = VectorStoreManager(embedder)
# Ingest documents
chunks = ingestor.ingest(["paper.pdf", "protocol.docx"])
vsm.add_documents(chunks)
# Save the index to reuse later
vsm.save("my_faiss_index/")
# Search
results = vsm.similarity_search_with_score("Berea sandstone segmentation method", k=5)
for doc, score in results:
    title = doc.metadata.get("doc_title", "unknown")
    page = doc.metadata.get("page", "?")
    print(f"{title} p.{page} (score {score:.3f})")
    print(doc.page_content[:200])
    print()
Loading a saved index:
vsm = VectorStoreManager(embedder)
vsm.load("my_faiss_index/")
results = vsm.similarity_search("porosity measurement", k=3)
Tips
What documents work best
Research publications by the data contributors that provide background context, analysis, applications, and methodology
Method papers describing the imaging or analysis technique used to produce the dataset
Technical protocols or instrument specifications
Multiple documents
Rocco blends results from all uploaded documents — more documents generally produce stronger context. However, contradictory information across documents may appear in the output; always review citations.
Large document sets
Ingestion time scales with total file size (chunking + embedding). For very large sets,
consider pre-building the index offline and loading it at runtime with vsm.load().
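One way to structure this is a small "load if present, otherwise build once" helper. The sketch below is hypothetical and uses stand-in callables so it runs on its own; in practice `build` would ingest the files, call vsm.add_documents(...) and vsm.save(...), and `load` would call vsm.load(...). The name `load_or_build` and the check for a non-empty directory are assumptions.

```python
import tempfile
from pathlib import Path


def load_or_build(index_dir, build, load):
    """Reuse a saved index when one exists, otherwise build it once.

    build(path) should ingest, embed, and save to `path`;
    load(path) should return the reloaded store.
    """
    path = Path(index_dir)
    if not path.is_dir() or not any(path.iterdir()):
        build(path)  # slow path: chunking + embedding
    return load(path)


# Demo with stand-in callables that only record what was called.
calls = []


def fake_build(path):
    calls.append("build")
    path.mkdir(parents=True, exist_ok=True)
    (path / "index.bin").write_bytes(b"stub")


def fake_load(path):
    calls.append("load")
    return sorted(p.name for p in path.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    index_dir = Path(tmp) / "my_faiss_index"
    first = load_or_build(index_dir, fake_build, fake_load)   # builds, then loads
    second = load_or_build(index_dir, fake_build, fake_load)  # loads only
print(calls)
```

The expensive embedding step then runs only when no saved index is found, and later runs pay only the cost of reading the index from disk.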
See Also
Writer — How retrieved chunks are used in description enhancement
Streamlit App — Uploading documents through the web UI
API Reference — Full class documentation