An AI-powered SaaS collaboration platform that gives small teams a secure, intelligent teammate capable of understanding documents, images, and complex mathematical content.
Overview
Collabrium is a multi-modal Retrieval-Augmented Generation (RAG) platform designed for small teams that need to collaborate around complex, compliance-sensitive documents. Rather than relying on a generic chatbot, teams upload their own materials — PDFs, reports, research papers — and interact with an AI teammate that has actually read and indexed their content.
The platform ingests PDFs and extracts text, images, figure captions, and LaTeX-style mathematical notation into specialized vector stores. Each content type is embedded with a model suited to it, which is intended to give more accurate retrieval than embedding everything with a single text model.
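The routing idea described above can be sketched in a few lines. This is a minimal illustration, not Collabrium's actual code: the `Chunk` class, the collection names, and the choice to store captions alongside body text are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """One extracted piece of a PDF (hypothetical structure)."""
    doc_id: str
    kind: str     # "text", "caption", "image", or "math"
    payload: str  # text content, or a file reference for images


# Hypothetical mapping from content type to a vector-store collection.
# Sending captions to the text store is an assumption for this sketch.
COLLECTION_FOR_KIND = {
    "text": "text_store",
    "caption": "text_store",
    "image": "image_store",
    "math": "math_store",
}


def route_chunks(chunks):
    """Group chunks by the collection they would be embedded into."""
    routed = defaultdict(list)
    for chunk in chunks:
        try:
            collection = COLLECTION_FOR_KIND[chunk.kind]
        except KeyError:
            raise ValueError(f"unsupported content type: {chunk.kind!r}") from None
        routed[collection].append(chunk)
    return dict(routed)
```

Keeping the mapping in one dictionary means a new content type only needs a new entry and a matching embedding step, which fits the modular pipeline the overview describes.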
Built as part of Harvard's AC215 Applied Machine Learning course, the project demonstrates a production-grade MLOps workflow — from Docker-based development environments to a modular ingestion pipeline that can be extended with new data sources and embedding strategies.
Architecture
Collabrium's pipeline routes each content type through the embedding model best suited to it, then uses Reciprocal Rank Fusion to combine results from multiple vector stores at query time.
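As a rough sketch of the fusion step, Reciprocal Rank Fusion scores each result by summing 1 / (k + rank) over every ranked list it appears in. The function name, the chunk IDs, and the constant k = 60 (a commonly used default for RRF) are assumptions for illustration; the project's own retriever may be configured differently.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several best-first lists of IDs into one scored ranking.

    Each ID receives 1 / (k + rank) from every list it appears in,
    with rank starting at 1. Assumes an ID appears at most once per list.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by ID so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical per-store results for one query, best match first.
text_hits = ["chunk_a", "chunk_b", "chunk_c"]
image_hits = ["chunk_b", "chunk_d"]
math_hits = ["chunk_b", "chunk_a"]

fused = reciprocal_rank_fusion([text_hits, image_hits, math_hits])
print([doc_id for doc_id, _ in fused])
# -> ['chunk_b', 'chunk_a', 'chunk_d', 'chunk_c']
```

Because only ranks are used, the raw similarity scores from different embedding models never need to be put on a common scale, which is what makes this method convenient for combining stores built on different models.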
Features
Extracts text, images, figure captions, and LaTeX math from PDFs, routing each to a specialist embedding model for optimal retrieval.
The CustomRAGRetriever uses Reciprocal Rank Fusion to combine ranked results from multiple vector stores, improving relevance when a query spans multiple content types.
Designed from the ground up for teams with compliance requirements — your documents stay in your vector store, not a third-party training pipeline.
Docker Compose development environment with Jupyter Lab. Code changes persist on disk, enabling a fast iteration loop without rebuilding images.
Hugging Face models handle LaTeX-style mathematical notation that general-purpose text embedders routinely misrepresent.
CLIP embeddings let users query figures, charts, and diagrams in natural language, not just the surrounding text.
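To illustrate the math-extraction feature listed above, the sketch below separates LaTeX-style formulas from surrounding prose so each can be sent to a different embedder. It assumes formulas appear in the extracted text delimited by `$...$` or `$$...$$`; how Collabrium actually detects math in PDFs is not stated, and `split_math` is a hypothetical helper.

```python
import re

# Display math ($$...$$) is tried before inline math ($...$) so that
# "$$" is not misread as the start of an inline span.
# Escaped dollars (\$) are not treated as inline delimiters.
MATH_PATTERN = re.compile(
    r"\$\$(?P<display>.+?)\$\$"                   # $$ ... $$ (may span lines)
    r"|(?<!\\)\$(?P<inline>[^$\n]+?)(?<!\\)\$",   # $ ... $ on a single line
    re.DOTALL,
)


def split_math(text):
    """Return (prose, formulas): the text with math removed, and the math bodies."""
    formulas = []

    def _collect(match):
        body = match.group("display") or match.group("inline")
        formulas.append(body.strip())
        return " "

    prose = MATH_PATTERN.sub(_collect, text)
    prose = re.sub(r"\s+", " ", prose).strip()  # tidy gaps left by removed math
    return prose, formulas
```

The prose would then go to the text embedder and each formula to the math-specific model, so a query phrased in notation can match the formula itself, not just nearby words.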
Technology
The project is primarily Python (42% of the codebase), with JavaScript and CSS for frontend components, and is containerized with Docker.
| Category | Technology |
|---|---|
| Vector Database | ChromaDB |
| Text Embeddings | Google Vertex AI |
| Image Embeddings | CLIP (OpenAI) |
| Math Embeddings | Hugging Face Transformers |
| Infrastructure | Docker, Docker Compose, Jupyter Lab |
| Languages | Python, JavaScript, CSS |
Team
Collabrium was a team capstone project for Harvard's AC215 Applied Machine Learning Engineering course.