API Reference

Auto-generated documentation for all public classes and functions. Click any section header to expand it.

LLM Client
class src.llm.client.LLMClient(api_url=None, api_key=None, model=None, provider=None, timeout=60)[source]

Bases: object

Provider-agnostic LLM client supporting OpenAI, Anthropic, Gemini, DeepSeek, HuggingFace, Ollama, SambaNova, and any OpenAI-compatible API.

Parameters:
  • api_url (str)

  • api_key (str)

  • model (str)

  • provider (str)

  • timeout (int)

PROVIDER_URLS = {'anthropic': 'https://api.anthropic.com/v1', 'deepseek': 'https://api.deepseek.com/v1', 'gemini': 'https://generativelanguage.googleapis.com/v1beta/openai/', 'huggingface': 'https://router.huggingface.co/v1', 'ollama': 'http://localhost:11434/v1', 'openai': 'https://api.openai.com/v1', 'sambanova': 'https://ai.tejas.tacc.utexas.edu/v1'}
list_models()[source]

Return a list of model IDs available from the configured provider endpoint.

Return type:

List[str]

send_prompt(prompt, context=None, params=None)[source]

Send a prompt to the LLM and return the response text.

Parameters:
  • prompt (str)

  • context (str | None)

  • params (Dict[str, Any] | None)

Return type:

str

class src.llm.client.RoccoClient(api_url=None, api_key=None, model=None, provider=None, timeout=60)[source]

Bases: LLMClient

RoccoClient extends LLMClient for specific Rocco interactions.

Parameters:
  • api_url (str)

  • api_key (str)

  • model (str)

  • provider (str)

  • timeout (int)

evaluate_description(draft_text, rubric, examples, context=None)[source]

Evaluate a dataset description using the provided rubric and examples.

Parameters:
  • draft_text (str)

  • rubric (Dict[str, Any])

  • examples (List[Dict[str, Any]])

  • context (List[str] | None)

Return type:

str

improve_description(draft_text, context=None)[source]

Improve a dataset description based on the provided context.

Parameters:
  • draft_text (str)

  • context (List[str] | None)

Return type:

str

Evaluator
class src.evaluator.evaluator.DescriptionEvaluator(model, rubric, examples)[source]

Bases: object

Evaluates dataset descriptions against a rubric

Parameters:
  • model (RoccoClient)

  • rubric (List[Dict[str, Any]])

  • examples (List[Dict[str, Any]])

build_prompt(draft_text)[source]

Combine rubric, examples, and draft into prompt

Parameters:

draft_text (str)

Return type:

str

evaluate(draft_text)[source]

Call the LLM and return structured evaluation

Parameters:

draft_text (str)

Return type:

EvaluatorOutput

print_evaluation_result(evaluation_output)[source]

Utility to print evaluation results

Parameters:

evaluation_output (EvaluatorOutput)

Return type:

None

Editor
class src.editor.editor.DescriptionEditor(model, rubric, vector_store_manager=None, use_rag=True, top_k_context=5)[source]

Bases: object

Improves dataset descriptions

Parameters:
save_session(filepath)[source]

Save the current session to a file

Parameters:

filepath (Path)

Return type:

None

load_session(filepath)[source]

Load a session from a file

Parameters:

filepath (Path)

Return type:

None

get_session_summary()[source]

Get a summary of the current session

Return type:

str

retrieve_context(query=None)[source]

Retrieve relevant context from related papers

Parameters:

query (str)

Return type:

List[Document]

generate_search_query(draft_evaluation, query_all=True)[source]

Generate search queries based on evaluation feedback

Parameters:
Return type:

List[str]

build_prompt(draft_text, draft_evaluation, context=None, user_feedback=None, history_override=None)[source]

Prepare prompt for improving the draft

Parameters:
  • draft_text (str)

  • draft_evaluation (EvaluatorOutput)

  • context (List[Document] | List[str] | None)

  • user_feedback (str | None)

  • history_override (List[Dict[str, str]] | None)

Return type:

str

enhance(draft_text, draft_evaluation, retrieve_context=True, context_override=None, query_all_criterion=True, user_feedback=None, history_override=None)[source]

Improve the description draft using evaluation feedback and optional context from papers

Parameters:
  • draft_text (str)

  • draft_evaluation (EvaluatorOutput)

  • retrieve_context (bool)

  • context_override (List[str] | None)

  • query_all_criterion (bool)

  • user_feedback (str | None)

  • history_override (List[Dict[str, str]] | None)

Return type:

EditorOutput

print_enhancement_result(editor_output)[source]

Utility to print enhancement results

Parameters:

editor_output (EditorOutput)

Return type:

None

reset_conversation_history()[source]

Clear all stored conversation turns, starting a fresh refinement session.

Document Ingestor
class src.ingestor.document_ingestor.DocumentIngestor(chunk_size=500, chunk_overlap=100, separators=None)[source]

Bases: BaseIngestor

Ingests documents (PDF, DOCX) for RAG pipeline

Parameters:
  • chunk_size (int)

  • chunk_overlap (int)

  • separators (List[str] | None)

ingest(file_paths)[source]

Load and chunk document(s).

Parameters:

file_paths (str | List[str]) – Single file path (str) or list of file paths (List[str])

Returns:

List of chunked Document objects with enriched metadata

Return type:

List[Document]

Document Embedder
class src.ingestor.embedder.BaseEmbedder[source]

Bases: ABC

Base class for all embedders

abstractmethod embed_documents(texts)[source]

Embed a list of documents

Parameters:

texts (List[str])

Return type:

List[List[float]]

abstractmethod embed_query(text)[source]

Embed a single query

Parameters:

text (str)

Return type:

List[float]

abstractmethod get_embeddings()[source]

Get the underlying LangChain Embeddings object

Return type:

Embeddings

class src.ingestor.embedder.DocumentEmbedder(model_name='BAAI/bge-large-en-v1.5', model_kwargs=None, encode_kwargs=None)[source]

Bases: BaseEmbedder

HuggingFace embeddings implementation

Parameters:
  • model_name (str)

  • model_kwargs (Dict[str, Any] | None)

  • encode_kwargs (Dict[str, Any] | None)

embed_documents(texts)[source]

Embed a list of document strings and return their dense vectors.

Parameters:

texts (List[str])

Return type:

List[List[float]]

embed_query(text)[source]

Embed a single query string and return its dense vector.

Parameters:

text (str)

Return type:

List[float]

get_embeddings()[source]

Return the underlying LangChain Embeddings object (e.g., for use with FAISS).

Return type:

Embeddings

Vector Store Manager
class src.retriever.retriever.VectorStoreManager(embedder)[source]

Bases: object

Manages vector store operations (create, save, load, query)

Parameters:

embedder (DocumentEmbedder)

create_from_documents(documents)[source]

Create a new vector store from documents.

Parameters:

documents (List[Document]) – List of Document objects to index

Returns:

Created vector store

Return type:

VectorStore

add_documents(documents)[source]

Add documents to existing vector store.

Parameters:

documents (List[Document]) – List of Document objects to add

Return type:

None

save(path)[source]

Save vector store to disk.

Parameters:

path (str | Path) – Directory path to save the vector store

Return type:

None

load(path)[source]

Load vector store from disk.

Parameters:

path (str | Path) – Directory path containing the saved vector store

Returns:

Loaded vector store

Return type:

VectorStore

Search for similar documents.

Parameters:
  • query (str) – Query text

  • k (int) – Number of results to return

Returns:

List of most similar documents

Return type:

List[Document]

similarity_search_with_score(query, k=4)[source]

Search for similar documents with similarity scores.

Parameters:
  • query (str) – Query text

  • k (int) – Number of results to return

Returns:

List of (document, score) tuples

Return type:

List[tuple[Document, float]]

get_vector_store()[source]

Get the underlying vector store object

Return type:

VectorStore

Content Screener
class src.llm.content_screener.ContentScreener(model)[source]

Bases: object

Screen contents for usefulness

Parameters:

model (RoccoClient)

screen_user_content(content, context=None)[source]

Screen user provided content

Returns:

Screening result with keys:
  • is_valid (bool): Whether content is valid

  • issues (list): Issues found

  • confidence (float): Confidence score (0-1)

  • recommendation (str): Recommended action

Return type:

dict

Parameters:
  • content (str)

  • context (str)

Prompt Loader

Prompt loader utility for managing versioned YAML prompts.

src.prompts.loader.load_prompt(name)[source]

Load a prompt from src/prompts/{name}.yaml.

Parameters:

name (str) – Prompt name (e.g., ‘evaluator’, ‘editor’, ‘content_screener’)

Returns:

version, description, system (optional), user

Return type:

Dict with keys

Raises:
  • FileNotFoundError – If prompt file does not exist

  • yaml.YAMLError – If YAML parsing fails

src.prompts.loader.render(template_str, **kwargs)[source]

Render a Jinja2 template string with given variables.

Parameters:
  • template_str (str) – Template string with {{ variable }} placeholders

  • **kwargs – Variables to inject into template

Returns:

Rendered string

Return type:

str

Output Schemas
class src.llm.schemas.RubricItem(criterion, score, explanation=None)[source]

Bases: object

One criterion from the evaluation rubric, with its score and explanation.

Parameters:
  • criterion (str)

  • score (float)

  • explanation (str)

criterion: str
score: float
explanation: str = None
class src.llm.schemas.EvaluatorOutput(total_score, rubric_breakdown, comments=None)[source]

Bases: object

Structured output from DescriptionEvaluator

Parameters:
  • total_score (float)

  • rubric_breakdown (List[RubricItem])

  • comments (str | None)

total_score: float
rubric_breakdown: List[RubricItem]
comments: str | None = None
class src.llm.schemas.Citation(*, statement, source, quote, doc_title=None, page=None, chunk_index=None)[source]

Bases: BaseModel

Citation schema for each statemnt in the improved description.

Parameters:
  • statement (str)

  • source (str)

  • quote (str)

  • doc_title (str | None)

  • page (int | None)

  • chunk_index (int | None)

statement: str
source: str
quote: str
doc_title: str | None
page: int | None
chunk_index: int | None
model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class src.llm.schemas.EditorOutput(*, original_text, suggested_text, rationale, citation=<factory>, context_used=<factory>)[source]

Bases: BaseModel

Output from the description editor

Parameters:
  • original_text (str)

  • suggested_text (str)

  • rationale (str)

  • citation (List[Citation])

  • context_used (List[Dict[str, Any]])

original_text: str
suggested_text: str
rationale: str
citation: List[Citation]
context_used: List[Dict[str, Any]]
model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class src.llm.schemas.EditingSession(*, metadata, created_at, original_description=None, current_description=None, conversation_history=<factory>, rubric, config=<factory>)[source]

Bases: BaseModel

Schema for saving/loading editing sessions

Parameters:
  • metadata (Dict[str, Any])

  • created_at (str)

  • original_description (str | None)

  • current_description (str | None)

  • conversation_history (List[Dict[str, str]])

  • rubric (Dict[str, Any])

  • config (Dict[str, Any])

metadata: Dict[str, Any]
created_at: str
original_description: str | None
current_description: str | None
conversation_history: List[Dict[str, str]]
rubric: Dict[str, Any]
config: Dict[str, Any]
get_summary()[source]

Get a human-readable summary of the session

Return type:

str

model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class src.llm.schemas.PDFChunk(chunk_id, text, embedding=None, source_pdf=None)[source]

Bases: object

A single text chunk extracted from a PDF, optionally with its embedding vector.

Parameters:
  • chunk_id (str)

  • text (str)

  • embedding (List[float] | None)

  • source_pdf (str | None)

chunk_id: str
text: str
embedding: List[float] | None = None
source_pdf: str | None = None
Configuration

Environment variables (set in .env):

  • LLM_PROVIDER — Provider shortcut (openai, anthropic, ollama, etc.)

  • LLM_API_KEY — API key (required)

  • LLM_BASE_URL — Custom endpoint URL (optional)

  • LLM_MODEL — Model name (defaults to gpt-4o-mini)

See Configuration for all providers and setup.


See Also