SumUP

SumUP is a data engineering platform designed to streamline clinical content management by automating the summarization of clinical data. It leverages economical Small Language Models (SLMs) deployed on Google Cloud Platform (GCP) to provide fast, API-based access to concise clinical summaries, reducing administrative overhead and enabling rapid information retrieval.

Architecture Overview

SumUP integrates several modern data technologies to enable real-time, scalable, and cost-effective summarization workflows:

Component	Description
MySQL Database	Stores staging summaries and clinical data.
Debezium Connector	Captures change data capture (CDC) events from MySQL and streams them to Kafka topics in real time.
Kafka	Acts as the backbone for streaming CDC events and summary data between components.
Vector Embeddings	Represent clinical content as vectors for efficient semantic search and retrieval.
SLM on GCP	Small Language Model deployed as a service API endpoint for generating and serving summaries.

Key Features

Automated Change Data Capture:
- Uses Debezium to monitor and stream all inserts, updates, and deletes from the MySQL database into Kafka topics, ensuring real-time data flow.
Real-Time Data Pipeline:
- Kafka serves as the central message bus, enabling scalable and decoupled processing of clinical content updates.
Semantic Summarization:
- Incoming clinical data is transformed into vector embeddings and processed by an SLM to generate concise, context-aware summaries.
Cloud-Native API:
- Summarization services are exposed via a robust API endpoint on GCP for easy integration with downstream applications.
Administrative Efficiency:
- Reduces the manual burden of browsing and summarizing clinical records, accelerating access to critical information.
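
As a rough illustration of the change data capture feature, the sketch below decodes one Debezium change event as a Kafka consumer might receive it. It relies on the standard Debezium envelope fields (`op`, `before`, `after`); the function name `parse_change_event`, the sample row, and its column names are assumptions made for this example, not code from this repository.

```python
import json
from typing import Optional

# Debezium's one-letter operation codes in the change event envelope.
OPERATIONS = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot"}


def parse_change_event(raw: Optional[bytes]) -> Optional[dict]:
    """Return the operation and the relevant row image from one Kafka message value."""
    if raw is None:  # tombstone record that can follow a delete
        return None
    event = json.loads(raw)
    # When the JSON converter embeds schemas, the envelope sits under "payload".
    payload = event.get("payload", event)
    op = payload.get("op")
    if op not in OPERATIONS:
        return None
    # Deletes carry the old row in "before"; other operations carry the new row in "after".
    row = payload.get("before") if op == "d" else payload.get("after")
    return {"operation": OPERATIONS[op], "row": row}


# Hypothetical update event; the columns are illustrative only.
sample = json.dumps({
    "payload": {
        "op": "u",
        "before": {"id": 7, "summary": "old text"},
        "after": {"id": 7, "summary": "new text"},
    }
}).encode("utf-8")

result = parse_change_event(sample)
print(result)
```

A consumer would typically pass the returned row on to the vectorization step for inserts and updates, and remove the matching embedding for deletes.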

Workflow

Data Ingestion:
- Clinical summaries and content are staged in a MySQL database.
Change Data Capture:
- Debezium connector monitors MySQL and streams change events (CDC) into Kafka topics.
Vectorization:
- Kafka consumers process the data, generating vector embeddings for semantic understanding.
Summarization:
- The SLM, deployed on GCP, consumes vectorized data and produces concise summaries via an API endpoint.
Delivery:
- Summaries are made available to client applications through the GCP-hosted API, supporting fast and economical access.
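
To make the vectorization and retrieval steps more concrete, here is a minimal sketch of similarity search over embeddings using cosine similarity. The three-dimensional vectors and document IDs are made up for illustration; a real deployment would obtain vectors from an embedding model and would usually delegate the search to a vector database rather than scanning a dictionary.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: List[float], index: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k entries of the index most similar to the query vector."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Made-up vectors standing in for real embeddings of clinical content.
index = {
    "visit-note-1": [0.9, 0.1, 0.0],
    "lab-result-2": [0.1, 0.9, 0.0],
    "visit-note-3": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]
matches = top_k(query, index, k=2)
print(matches)
```

The best-matching records are the ones that would be passed to the SLM as context for the summary.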

Getting Started

Prerequisites

Docker (for local development)
Access to a MySQL database
Kafka and ZooKeeper setup
Debezium MySQL Connector
GCP account for SLM deployment

Quick Start

Spin up MySQL, Kafka, and Debezium:
- Use Docker Compose to orchestrate containers for MySQL, Kafka, ZooKeeper, and Debezium.
Configure Debezium Connector:
- Register the Debezium MySQL source connector to monitor the target database and stream changes into Kafka.
Set Up Vectorization and SLM:
- Deploy vector embedding services and the SLM API on GCP.
Integrate API:
- Connect your client application to the SLM API endpoint for real-time summary retrieval.
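
The connector registration step can be sketched as a POST to the Kafka Connect REST API. The example below only builds the request and does not send it. The property names follow Debezium 2.x (`topic.prefix`, `schema.history.internal.*`); Debezium 1.x uses `database.server.name` and `database.history.*` instead. Every hostname, credential, database name, and the connector name is a placeholder assumed for illustration, so adjust them to match your Docker Compose setup.

```python
import json
import urllib.request

# All hostnames, credentials, and names below are placeholders for illustration.
connector = {
    "name": "sumup-mysql-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "change-me",
        "database.server.id": "184054",
        "topic.prefix": "sumup",
        "database.include.list": "clinical",
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schema-history.clinical",
    },
}


def build_registration_request(connect_url: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST that registers a connector with Kafka Connect."""
    return urllib.request.Request(
        url=f"{connect_url}/connectors",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_registration_request("http://localhost:8083", connector)
print(request.get_method(), request.full_url)
# To register the connector once Kafka Connect is running:
# urllib.request.urlopen(request)
```

Port 8083 is the default Kafka Connect REST port; if your containers expose a different one, change the URL accordingly.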

Example Use Case

A clinician needs a quick summary of a patient's recent medical history. Instead of manually browsing multiple records, the application queries the SumUP API, which returns a concise, up-to-date summary generated by the SLM, reducing time and cognitive load.

Why SumUP?

Speed: Real-time updates and summarization.
Cost-Effective: Uses SLMs for economical inference.
Scalable: Built on Kafka and cloud-native services.
Interoperable: API-based integration with clinical applications.

Scheduler Report Assistant (Chatbot Add-on)

SumUP now includes a natural-language Report Assistant for the Scheduler clinical research scheduling system. It replaces the Standard Reports and Report Builder tabs with a conversational interface powered by Groq (llama3-70b).

How it works

Browser → frontend/chatbot.html → POST /chat → FastAPI (main.py)
                                                    ↓
                                          Groq llama3-70b → NL→SQL
                                                    ↓
                                      MySQL scheduler DB (read-only)

Users ask questions in plain English ("Show all cancellations this month", "List open studies by PI") and get back a data table, the generated SQL, and a CSV export button.
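
The flow above can be sketched as a small function that takes the question, asks a model for SQL, runs the SQL, and returns both. This is not the code in main.py: the function names, the shape of the reply, and the table and column names are all assumptions, and the model call and database access are replaced by stand-ins so the sketch runs offline.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def answer_question(
    question: str,
    generate_sql: Callable[[str], str],
    run_query: Callable[[str], List[Row]],
) -> Dict[str, object]:
    """Translate a question to SQL, run it, and return the SQL together with the rows."""
    sql = generate_sql(question)
    rows = run_query(sql)
    return {"question": question, "sql": sql, "rows": rows}


# Stand-in for the model call; the table and column names are hypothetical.
def fake_generate_sql(question: str) -> str:
    return "SELECT study_name, pi_name FROM studies WHERE status = 'open'"


# Stand-in for the read-only MySQL connection.
def fake_run_query(sql: str) -> List[Row]:
    return [{"study_name": "Study A", "pi_name": "Dr. Example"}]


reply = answer_question("List open studies by PI", fake_generate_sql, fake_run_query)
print(reply["sql"])
print(reply["rows"])
```

Passing the model call and the query runner in as parameters keeps the flow testable without a Groq key or a database.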

Quick Start

# 1. Create read-only DB user
mysql -u root < setup.sql

# 2. Configure environment
cp .env.example .env   # add GROQ_API_KEY and DB_PASSWORD

# 3. Install and run
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
uvicorn main:app --port 8001 --reload

The frontend lives in frontend/. To install it, copy chatbot.html, chatbot.js, and chatbot.css into the Scheduler webapp's web/src/main/webapp/ directory.

Safety

Only SELECT statements are executed; all write statements are blocked
Results capped at 200 rows
DB user has SELECT-only privileges
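
One way the first two safeguards could be enforced in application code is sketched below. It is a deliberately coarse check and not the implementation in main.py; the function names are hypothetical. It rejects anything that is not a single statement starting with SELECT (which also rejects valid read-only queries that start with WITH), and it caps results by truncating the fetched rows instead of rewriting the SQL. The SELECT-only database user remains the real backstop.

```python
import re
from itertools import islice
from typing import Iterable, List, Tuple

MAX_ROWS = 200  # row cap stated in the Safety list


def ensure_select_only(sql: str) -> str:
    """Return the statement without trailing semicolons, or raise if it is not a single SELECT."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        # Conservative: also rejects semicolons inside string literals.
        raise ValueError("multiple statements are not allowed")
    if not re.match(r"select\b", statement, re.IGNORECASE):
        raise ValueError("only SELECT statements are allowed")
    return statement


def cap_rows(rows: Iterable[Tuple], max_rows: int = MAX_ROWS) -> List[Tuple]:
    """Keep at most max_rows rows from a cursor or any other iterable of rows."""
    return list(islice(rows, max_rows))


print(ensure_select_only("SELECT * FROM visits;"))
print(len(cap_rows((i,) for i in range(500))))
```

Checking the generated SQL before execution gives a clear error message to the user, while the database privileges guard against anything the check misses.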

Next Steps

Adding regex extraction of the patient name and patient MRN significantly improved accuracy: the extracted values enabled more targeted indexing of the vector database and gave the SLM more relevant context. Further improvements are planned:

Weighted encoding and NLP extraction of keywords (study_name, visit_template, etc.) mapped to structured data for more targeted SLM context
Chatbot follow-up questions for ambiguous queries
Role-aware query filtering (restrict result sets by user role)
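
The regex extraction mentioned at the top of this section might look roughly like the sketch below. The patterns are assumptions: they treat an MRN as 6 to 10 digits after the label "MRN" and a name as two or more capitalized words after "patient" or "patient name". The project's actual patterns and field names may differ.

```python
import re
from typing import Dict, Optional

# Assumed formats; adjust to the conventions used in your clinical data.
MRN_RE = re.compile(r"\bMRN\s*[:#]?\s*(\d{6,10})\b", re.IGNORECASE)
NAME_RE = re.compile(r"\b[Pp]atient(?:\s+[Nn]ame)?\s*:?\s*([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)")


def extract_patient_filters(text: str) -> Dict[str, Optional[str]]:
    """Pull a patient name and MRN out of free text; missing values come back as None."""
    mrn = MRN_RE.search(text)
    name = NAME_RE.search(text)
    return {
        "patient_mrn": mrn.group(1) if mrn else None,
        "patient_name": name.group(1) if name else None,
    }


filters = extract_patient_filters("Summarize visits for patient Jane Doe, MRN 1234567")
print(filters)
```

The extracted values can then be used to narrow the vector search to one patient before the remaining text is matched semantically.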

Acknowledgments

Debezium for CDC technology.
Apache Kafka for streaming infrastructure.
Google Cloud Platform for scalable model deployment.

