API Reference

LLM Client

class src.llm.client.LLMClient(api_url=None, api_key=None, model=None, provider=None, timeout=60)

Bases: object

Provider-agnostic LLM client supporting OpenAI, Anthropic, Gemini, DeepSeek, HuggingFace, Ollama, SambaNova, and any OpenAI-compatible API.

Parameters:

api_url (str)
api_key (str)
model (str)
provider (str)
timeout (int)

PROVIDER_URLS = {'anthropic': 'https://api.anthropic.com/v1', 'deepseek': 'https://api.deepseek.com/v1', 'gemini': 'https://generativelanguage.googleapis.com/v1beta/openai/', 'huggingface': 'https://router.huggingface.co/v1', 'ollama': 'http://localhost:11434/v1', 'openai': 'https://api.openai.com/v1', 'sambanova': 'https://ai.tejas.tacc.utexas.edu/v1'}
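
This reference does not say how the constructor combines `provider` and `api_url`. The following is a minimal sketch, assuming an explicit `api_url` takes precedence over the provider shortcut and that unknown providers are rejected. `resolve_base_url` is a hypothetical helper, not part of `LLMClient`, and the dictionary is a partial copy of the table above.

```python
from typing import Optional

# Partial copy of the documented PROVIDER_URLS table, for illustration only.
PROVIDER_URLS = {
    "openai": "https://api.openai.com/v1",
    "anthropic": "https://api.anthropic.com/v1",
    "ollama": "http://localhost:11434/v1",
}


def resolve_base_url(provider: Optional[str] = None,
                     api_url: Optional[str] = None) -> str:
    """Hypothetical helper: choose the endpoint a client would talk to.

    Assumption: an explicit api_url wins over the provider shortcut.
    """
    if api_url:
        return api_url.rstrip("/")
    if provider is None:
        raise ValueError("either provider or api_url must be given")
    try:
        return PROVIDER_URLS[provider.lower()].rstrip("/")
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None


print(resolve_base_url(provider="ollama"))
print(resolve_base_url(provider="openai", api_url="https://example.test/v1/"))
```

Under these assumptions a custom endpoint can be used with any provider name, which would match the "any OpenAI-compatible API" wording in the class description.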

list_models()

Return a list of model IDs available from the configured provider endpoint.

Return type: List[str]

send_prompt(prompt, context=None, params=None)

Send a prompt to the LLM and return the response text.

Parameters:

prompt (str)
context (str | None)
params (Dict[str, Any] | None)

Return type:

str

class src.llm.client.RoccoClient(api_url=None, api_key=None, model=None, provider=None, timeout=60)

Bases: LLMClient

Extends LLMClient with Rocco-specific operations: evaluating and improving dataset descriptions.

Parameters:

api_url (str)
api_key (str)
model (str)
provider (str)
timeout (int)

evaluate_description(draft_text, rubric, examples, context=None)

Evaluate a dataset description using the provided rubric and examples.

Parameters:

draft_text (str)
rubric (Dict[str, Any])
examples (List[Dict[str, Any]])
context (List[str] | None)

Return type:

str

improve_description(draft_text, context=None)

Improve a dataset description based on the provided context.

Parameters:

draft_text (str)
context (List[str] | None)

Return type:

str

Evaluator

class src.evaluator.evaluator.DescriptionEvaluator(model, rubric, examples)

Bases: object

Evaluates dataset descriptions against a rubric

Parameters:

model (RoccoClient)
rubric (List[Dict[str, Any]])
examples (List[Dict[str, Any]])

build_prompt(draft_text)

Combine rubric, examples, and draft into a prompt

Parameters: draft_text (str)
Return type: str

evaluate(draft_text)

Call the LLM and return a structured evaluation

Parameters: draft_text (str)
Return type: EvaluatorOutput

print_evaluation_result(evaluation_output)

Utility to print evaluation results

Parameters: evaluation_output (EvaluatorOutput)
Return type: None

Editor

class src.editor.editor.DescriptionEditor(model, rubric, vector_store_manager=None, use_rag=True, top_k_context=5)

Bases: object

Improves dataset descriptions

Parameters:

model (RoccoClient)
rubric (Dict)
vector_store_manager (VectorStoreManager | None)
use_rag (bool)
top_k_context (int)

save_session(filepath)

Save the current session to a file

Parameters: filepath (Path)
Return type: None

load_session(filepath)

Load a session from a file

Parameters: filepath (Path)
Return type: None

get_session_summary()

Get a summary of the current session

Return type: str

retrieve_context(query=None)

Retrieve relevant context from related papers

Parameters: query (str)
Return type: List[Document]

generate_search_query(draft_evaluation, query_all=True)

Generate search queries based on evaluation feedback

Parameters:

draft_evaluation (EvaluatorOutput)
query_all (bool)

Return type:

List[str]

build_prompt(draft_text, draft_evaluation, context=None, user_feedback=None, history_override=None)

Prepare prompt for improving the draft

Parameters:

draft_text (str)
draft_evaluation (EvaluatorOutput)
context (List[Document] | List[str] | None)
user_feedback (str | None)
history_override (List[Dict[str, str]] | None)

Return type:

str

enhance(draft_text, draft_evaluation, retrieve_context=True, context_override=None, query_all_criterion=True, user_feedback=None, history_override=None)

Improve the description draft using evaluation feedback and optional context from papers

Parameters:

draft_text (str)
draft_evaluation (EvaluatorOutput)
retrieve_context (bool)
context_override (List[str] | None)
query_all_criterion (bool)
user_feedback (str | None)
history_override (List[Dict[str, str]] | None)

Return type:

EditorOutput

print_enhancement_result(editor_output)

Utility to print enhancement results

Parameters: editor_output (EditorOutput)
Return type: None

reset_conversation_history()

Clear all stored conversation turns, starting a fresh refinement session.

Document Ingestor

class src.ingestor.document_ingestor.DocumentIngestor(chunk_size=500, chunk_overlap=100, separators=None)

Bases: BaseIngestor

Ingests documents (PDF, DOCX) for RAG pipeline

Parameters:

chunk_size (int)
chunk_overlap (int)
separators (List[str] | None)

ingest(file_paths)

Load and chunk document(s).

Parameters: file_paths (str | List[str]) – Single file path (str) or list of file paths (List[str])
Returns: List of chunked Document objects with enriched metadata
Return type: List[Document]
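
With the defaults `chunk_size=500` and `chunk_overlap=100`, consecutive chunks share up to 100 units of text. The sketch below illustrates that windowing with plain character slicing. It is a simplified stand-in rather than the ingestor's actual splitter: it assumes sizes are counted in characters, it ignores `separators`, and `chunk_text` is a hypothetical name.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> List[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1200))
pieces = chunk_text(sample)
print([len(p) for p in pieces])
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of indexing more text overall.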

Document Embedder

class src.ingestor.embedder.BaseEmbedder

Bases: ABC

Base class for all embedders

abstractmethod embed_documents(texts)

Embed a list of documents

Parameters: texts (List[str])
Return type: List[List[float]]

abstractmethod embed_query(text)

Embed a single query

Parameters: text (str)
Return type: List[float]

abstractmethod get_embeddings()

Get the underlying LangChain Embeddings object

Return type: Embeddings
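
A custom embedder has to implement all three abstract methods before it can be instantiated. The sketch below uses a local stand-in for `BaseEmbedder` so that it runs on its own. `ToyEmbedder` and its letter-frequency vectors are purely illustrative, and its `get_embeddings` returns the object itself rather than a real LangChain `Embeddings` instance.

```python
from abc import ABC, abstractmethod
from typing import List


class BaseEmbedder(ABC):
    """Local stand-in mirroring the documented interface."""

    @abstractmethod
    def embed_documents(self, texts: List[str]) -> List[List[float]]: ...

    @abstractmethod
    def embed_query(self, text: str) -> List[float]: ...

    @abstractmethod
    def get_embeddings(self): ...


class ToyEmbedder(BaseEmbedder):
    """Hypothetical embedder producing 26-dimensional letter-frequency vectors."""

    def embed_query(self, text: str) -> List[float]:
        counts = [0.0] * 26
        for ch in text.lower():
            if "a" <= ch <= "z":
                counts[ord(ch) - ord("a")] += 1.0
        total = sum(counts) or 1.0  # avoid division by zero for empty input
        return [c / total for c in counts]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self.embed_query(t) for t in texts]

    def get_embeddings(self):
        # Assumption for this sketch only: expose the embedder itself.
        return self


embedder = ToyEmbedder()
vectors = embedder.embed_documents(["abc", "aab"])
print(len(vectors), len(vectors[0]))
```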

class src.ingestor.embedder.DocumentEmbedder(model_name='BAAI/bge-large-en-v1.5', model_kwargs=None, encode_kwargs=None)

Bases: BaseEmbedder

HuggingFace embeddings implementation

Parameters:

model_name (str)
model_kwargs (Dict[str, Any] | None)
encode_kwargs (Dict[str, Any] | None)

embed_documents(texts)

Embed a list of document strings and return their dense vectors.

Parameters: texts (List[str])
Return type: List[List[float]]

embed_query(text)

Embed a single query string and return its dense vector.

Parameters: text (str)
Return type: List[float]

get_embeddings()

Return the underlying LangChain Embeddings object (e.g., for use with FAISS).

Return type: Embeddings

Vector Store Manager

class src.retriever.retriever.VectorStoreManager(embedder)

Bases: object

Manages vector store operations (create, save, load, query)

Parameters: embedder (DocumentEmbedder)

create_from_documents(documents)

Create a new vector store from documents.

Parameters: documents (List[Document]) – List of Document objects to index
Returns: Created vector store
Return type: VectorStore

add_documents(documents)

Add documents to an existing vector store.

Parameters: documents (List[Document]) – List of Document objects to add
Return type: None

save(path)

Save the vector store to disk.

Parameters: path (str | Path) – Directory path to save the vector store
Return type: None

load(path)

Load a vector store from disk.

Parameters: path (str | Path) – Directory path containing the saved vector store
Returns: Loaded vector store
Return type: VectorStore

similarity_search(query, k=4)

Search for similar documents.

Parameters:

query (str) – Query text
k (int) – Number of results to return

Returns:

List of most similar documents

Return type:

List[Document]

similarity_search_with_score(query, k=4)

Search for similar documents with similarity scores.

Parameters:

query (str) – Query text
k (int) – Number of results to return

Returns:

List of (document, score) tuples

Return type:

List[tuple[Document, float]]
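
Conceptually, both search methods rank stored vectors against the query embedding and keep the best `k`. The brute-force sketch below shows that idea with cosine similarity. This reference does not state the scoring metric of the underlying store, so cosine similarity (higher is better) is an assumption; some stores return a distance instead, where lower is better. `cosine` and `top_k` are hypothetical helpers, and plain strings stand in for `Document` objects.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          docs: List[Tuple[str, List[float]]],
          k: int = 4) -> List[Tuple[str, float]]:
    """Return the k (text, score) pairs most similar to query_vec, best first."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in docs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


docs = [("north", [0.0, 1.0]), ("east", [1.0, 0.0]), ("north-east", [1.0, 1.0])]
results = top_k([1.0, 0.1], docs, k=2)
print([text for text, _ in results])
```

When comparing scores across queries, it is worth checking which convention the configured vector store follows before applying a threshold.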

get_vector_store()

Get the underlying vector store object

Return type: VectorStore

Content Screener

class src.llm.content_screener.ContentScreener(model)

Bases: object

Screen contents for usefulness

Parameters: model (RoccoClient)

screen_user_content(content, context=None)

Screen user-provided content

Parameters:

content (str)
context (str)

Returns:

Screening result with keys:

is_valid (bool): Whether content is valid
issues (list): Issues found
confidence (float): Confidence score (0-1)
recommendation (str): Recommended action

Return type:

dict
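
A caller might branch on the returned dictionary before passing content on to the editor. The sketch below shows one way to do that. The 0.7 confidence threshold, the `should_accept` helper, and the sample result are assumptions for illustration and are not part of `ContentScreener`.

```python
from typing import Any, Dict


def should_accept(result: Dict[str, Any], min_confidence: float = 0.7) -> bool:
    """Hypothetical policy: accept only valid content screened with enough confidence."""
    is_valid = bool(result.get("is_valid", False))
    confidence = float(result.get("confidence", 0.0))
    return is_valid and confidence >= min_confidence


# Hand-written sample with the documented keys; not real screener output.
example = {
    "is_valid": True,
    "issues": [],
    "confidence": 0.92,
    "recommendation": "use as context",
}

if should_accept(example):
    print("accepted")
else:
    print("rejected:", example["issues"])
```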

Prompt Loader

Prompt loader utility for managing versioned YAML prompts.

src.prompts.loader.load_prompt(name)

Load a prompt from src/prompts/{name}.yaml.

Parameters:

name (str) – Prompt name (e.g., ‘evaluator’, ‘editor’, ‘content_screener’)

Returns:

Dict with keys: version, description, system (optional), user

Return type:

Dict

Raises:

FileNotFoundError – If prompt file does not exist
yaml.YAMLError – If YAML parsing fails
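
The documented keys suggest a simple shape check for a loaded prompt. The sketch below validates a plain dictionary of that shape and fills in the optional `system` key. `check_prompt` is a hypothetical helper, the sample values are invented, and whether `load_prompt` performs any such validation itself is not stated here.

```python
from typing import Any, Dict

# Keys listed in the reference; "system" is documented as optional.
REQUIRED_KEYS = ("version", "description", "user")


def check_prompt(prompt: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of prompt with 'system' defaulted, or raise if keys are missing."""
    missing = [key for key in REQUIRED_KEYS if key not in prompt]
    if missing:
        raise KeyError(f"prompt is missing keys: {missing}")
    # An existing "system" entry in prompt overrides the None default.
    return {"system": None, **prompt}


loaded = {
    "version": "1",
    "description": "Evaluate a dataset description",
    "user": "Draft: {{ draft_text }}",
}
print(sorted(check_prompt(loaded)))
```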

src.prompts.loader.render(template_str, **kwargs)

Render a Jinja2 template string with given variables.

Parameters:

template_str (str) – Template string with {{ variable }} placeholders
**kwargs – Variables to inject into template

Returns:

Rendered string

Return type:

str

Output Schemas

class src.llm.schemas.RubricItem(criterion, score, explanation=None)

Bases: object

One criterion from the evaluation rubric, with its score and explanation.

Parameters:

criterion (str)
score (float)
explanation (str)

criterion: str

score: float

explanation: str = None

class src.llm.schemas.EvaluatorOutput(total_score, rubric_breakdown, comments=None)

Bases: object

Structured output from DescriptionEvaluator

Parameters:

total_score (float)
rubric_breakdown (List[RubricItem])
comments (str | None)

total_score: float

rubric_breakdown: List[RubricItem]

comments: str | None = None
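
The two classes above nest naturally: an `EvaluatorOutput` carries one `RubricItem` per criterion. The sketch below re-declares them as local dataclasses, which is an assumption about how they are implemented; only the field names, types, and defaults come from this reference. `weakest_criteria`, the threshold, and the sample scores are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RubricItem:
    criterion: str
    score: float
    explanation: Optional[str] = None


@dataclass
class EvaluatorOutput:
    total_score: float
    rubric_breakdown: List[RubricItem]
    comments: Optional[str] = None


def weakest_criteria(output: EvaluatorOutput, threshold: float) -> List[str]:
    """Names of criteria scoring below threshold, lowest score first."""
    ranked = sorted(output.rubric_breakdown, key=lambda item: item.score)
    return [item.criterion for item in ranked if item.score < threshold]


evaluation = EvaluatorOutput(
    total_score=6.0,
    rubric_breakdown=[
        RubricItem("provenance", 1.0, "No source given"),
        RubricItem("methods", 3.0),
        RubricItem("units", 2.0),
    ],
)
print(weakest_criteria(evaluation, threshold=2.5))  # ['provenance', 'units']
```

A ranking like this is one plausible input for deciding which criteria to target when improving a draft.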

class src.llm.schemas.Citation(*, statement, source, quote, doc_title=None, page=None, chunk_index=None)

Bases: BaseModel

Citation schema for each statement in the improved description.

Parameters:

statement (str)
source (str)
quote (str)
doc_title (str | None)
page (int | None)
chunk_index (int | None)

statement: str

source: str

quote: str

doc_title: str | None

page: int | None

chunk_index: int | None

model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class src.llm.schemas.EditorOutput(*, original_text, suggested_text, rationale, citation=<factory>, context_used=<factory>)

Bases: BaseModel

Output from the description editor

Parameters:

original_text (str)
suggested_text (str)
rationale (str)
citation (List[Citation])
context_used (List[Dict[str, Any]])

original_text: str

suggested_text: str

rationale: str

citation: List[Citation]

context_used: List[Dict[str, Any]]

model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class src.llm.schemas.EditingSession(*, metadata, created_at, original_description=None, current_description=None, conversation_history=<factory>, rubric, config=<factory>)

Bases: BaseModel

Schema for saving/loading editing sessions

Parameters:

metadata (Dict[str, Any])
created_at (str)
original_description (str | None)
current_description (str | None)
conversation_history (List[Dict[str, str]])
rubric (Dict[str, Any])
config (Dict[str, Any])

metadata: Dict[str, Any]

created_at: str

original_description: str | None

current_description: str | None

conversation_history: List[Dict[str, str]]

rubric: Dict[str, Any]

config: Dict[str, Any]

get_summary()

Get a human-readable summary of the session

Return type: str

model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class src.llm.schemas.PDFChunk(chunk_id, text, embedding=None, source_pdf=None)

Bases: object

A single text chunk extracted from a PDF, optionally with its embedding vector.

Parameters:

chunk_id (str)
text (str)
embedding (List[float] | None)
source_pdf (str | None)

chunk_id: str

text: str

embedding: List[float] | None = None

source_pdf: str | None = None

Configuration

Environment variables (set in .env):

LLM_PROVIDER — Provider shortcut (openai, anthropic, ollama, etc.)
LLM_API_KEY — API key (required)
LLM_BASE_URL — Custom endpoint URL (optional)
LLM_MODEL — Model name (defaults to gpt-4o-mini)

See Configuration for all providers and setup.
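
The variables above map onto the `LLMClient` constructor arguments. The sketch below reads them from a mapping and applies the documented `gpt-4o-mini` default. `load_llm_settings` is a hypothetical helper; it does not parse `.env` files itself, so it assumes the values are already present in the process environment or passed in explicitly.

```python
import os
from typing import Dict, Mapping, Optional


def load_llm_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, Optional[str]]:
    """Collect LLM settings from env (defaults to os.environ)."""
    env = os.environ if env is None else env
    api_key = env.get("LLM_API_KEY")
    if not api_key:
        # The reference marks LLM_API_KEY as required.
        raise RuntimeError("LLM_API_KEY is required")
    return {
        "provider": env.get("LLM_PROVIDER"),
        "api_key": api_key,
        "api_url": env.get("LLM_BASE_URL"),  # optional custom endpoint
        "model": env.get("LLM_MODEL", "gpt-4o-mini"),
    }


settings = load_llm_settings({"LLM_PROVIDER": "ollama", "LLM_API_KEY": "dummy"})
print(settings["model"])
```

The returned keys are named after the constructor parameters, so under these assumptions the dictionary could be unpacked directly into a client.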
