Harvard AC215 Capstone • 2024

Collabrium

An AI-powered SaaS collaboration platform that gives small teams a secure, intelligent teammate capable of understanding documents, images, and complex mathematical content.

📄 View on GitHub 📰 Read on Medium

Overview

What is Collabrium?

Collabrium is a multi-modal Retrieval-Augmented Generation (RAG) platform designed for small teams that need to collaborate around complex, compliance-sensitive documents. Rather than relying on a generic chatbot, teams upload their own materials — PDFs, reports, research papers — and interact with an AI teammate that has actually read and indexed their content.

The platform ingests PDFs and extracts text, images, figure captions, and LaTeX-style mathematical notation into separate vector stores, one per modality. Each content type is embedded with a model suited to it, which is intended to give more accurate retrieval than embedding everything with a single general-purpose model.
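The routing idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual ingestion code: the `ExtractedItem` class, the store names, and the choice to send figure captions to the text store are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ExtractedItem:
    kind: str     # "text", "image", "caption", or "math"
    content: str  # raw text, LaTeX source, or a path to an extracted image


# Hypothetical mapping from content type to vector store name.
# Sending captions to the text store is an assumption for this sketch.
STORE_FOR_KIND = {
    "text": "text_store",
    "caption": "text_store",
    "image": "image_store",
    "math": "math_store",
}


def route(items):
    """Group extracted items by the store that should embed and index them."""
    batches = {}
    for item in items:
        store = STORE_FOR_KIND.get(item.kind)
        if store is None:
            raise ValueError(f"unsupported content type: {item.kind!r}")
        batches.setdefault(store, []).append(item)
    return batches


items = [
    ExtractedItem("text", "Section 2 describes the control procedure."),
    ExtractedItem("caption", "Figure 1: Ingestion pipeline."),
    ExtractedItem("image", "page3_fig1.png"),
    ExtractedItem("math", r"\frac{1}{n}\sum_i x_i"),
]
batches = route(items)
```

Each batch would then be handed to the embedding model for that modality before being written to its store.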

Built as part of Harvard's AC215 course, the project demonstrates a production-grade MLOps workflow, from Docker-based development environments to a modular ingestion pipeline that can be extended with new data sources and embedding strategies.

Architecture

How It Works

Collabrium's pipeline routes each content type through the embedding model best suited to it, then uses Reciprocal Rank Fusion to combine results from multiple vector stores at query time.

📄 PDF Ingestion: smart chunking and content-type detection

→

🏳 Multi-Modal Embedding: VertexAI for text, CLIP for images, HuggingFace for math

→

📊 ChromaDB: separate vector stores per modality

→

🔍 CustomRAG Retriever: Reciprocal Rank Fusion across stores

→

🤖 AI Teammate: grounded, context-aware responses

Features

Key Capabilities

📎

Multi-Modal Document Processing

Extracts text, images, figure captions, and LaTeX math from PDFs, routing each to a specialist embedding model for optimal retrieval.

📈

Reciprocal Rank Fusion

The CustomRAGRetriever combines ranked results from multiple vector stores to improve relevance when a query spans multiple content types.
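For readers unfamiliar with Reciprocal Rank Fusion, the sketch below shows the standard formula: each result scores the sum of 1 / (k + rank) over every list it appears in. It is a generic illustration rather than the CustomRAGRetriever's actual code; the function name, the document IDs, and the choice of k = 60 (a commonly used default) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    A document's fused score is the sum of 1 / (k + rank) over the lists
    that contain it, with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from three per-modality stores for one query.
text_hits = ["chunk-7", "chunk-2", "chunk-9"]
image_hits = ["fig-3", "chunk-2"]
math_hits = ["eq-5", "chunk-2", "chunk-7"]

fused = reciprocal_rank_fusion([text_hits, image_hits, math_hits])
```

Because only ranks are used, the fusion does not require similarity scores from different embedding models to be on a comparable scale. In this example `chunk-2` comes first because it appears in all three lists, even though it tops none of them.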

🔒

Secure & Compliance-Friendly

Designed from the ground up for teams with compliance requirements — your documents stay in your vector store, not a third-party training pipeline.

🛠

Containerized MLOps

Docker Compose development environment with Jupyter Lab. Code changes persist on disk, enabling a fast iteration loop without rebuilding images.

🧮

Math-Aware Embeddings

HuggingFace models embed LaTeX-style mathematical notation, which general-purpose text embedders often represent poorly.
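Before a chunk can be sent to a math-specific embedder, something has to decide that it contains math. One possible way to do that is a simple pattern check, sketched below. This is an assumed heuristic for illustration only; the pattern and the `looks_like_math` helper are hypothetical and are not taken from the project.

```python
import re

# Heuristic markers of LaTeX-style math in extracted text.
MATH_PATTERN = re.compile(
    r"\$\$.+?\$\$"                          # display math: $$ ... $$
    r"|\$[^$\n]+\$"                         # inline math: $ ... $
    r"|\\\(.+?\\\)"                         # inline math: \( ... \)
    r"|\\\[.+?\\\]"                         # display math: \[ ... \]
    r"|\\begin\{(equation|align)\*?\}",     # common math environments
    re.DOTALL,
)


def looks_like_math(chunk: str) -> bool:
    """Return True if the chunk contains a LaTeX-style math span."""
    return MATH_PATTERN.search(chunk) is not None


math_chunk = r"The loss is $L = \frac{1}{n}\sum_i e_i^2$ over the batch."
prose_chunk = "Quarterly revenue grew steadily across all regions."
is_math = looks_like_math(math_chunk)
is_prose_math = looks_like_math(prose_chunk)
```

A check like this will produce false positives on text with two dollar amounts on the same line, so a real pipeline would likely need a stricter rule or a fallback.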

📷

Image Understanding via CLIP

CLIP embeddings let users query figures, charts, and diagrams in natural language, not just the surrounding text.

Technology

Stack & Tools

Python makes up the largest share of the codebase (42%), with JavaScript and CSS for the frontend components. Everything is containerized with Docker.

Category	Technology
Vector Database	ChromaDB
Text Embeddings	Google VertexAI
Image Embeddings	CLIP (OpenAI)
Math Embeddings	HuggingFace Transformers
Infrastructure	Docker, Docker Compose, Jupyter Lab
Languages	Python, JavaScript, CSS

Team

Project Contributors

Collabrium was a team capstone project for Harvard's AC215 Applied Machine Learning Engineering course.

Mark Elliott, Kirk McGraw, Patrick Nguyen, Bentre Nguyen

Explore the full project

Source code, architecture docs, and a write-up on Medium

📄 GitHub Repository 📰 Medium Article