SumUP is a data engineering platform designed to streamline clinical content management by automating the summarization of clinical data. It leverages economical Small Language Models (SLMs) deployed on Google Cloud Platform (GCP) to provide fast, API-based access to concise clinical summaries, reducing administrative overhead and enabling rapid information retrieval.
SumUP integrates several modern data technologies to enable real-time, scalable, and cost-effective summarization workflows:
| Component | Description |
|---|---|
| MySQL Database | Stores staged summaries and clinical data. |
| Debezium Connector | Captures change data capture (CDC) events from MySQL and streams them to Kafka topics in real time. |
| Kafka | Acts as the backbone for streaming CDC events and summary data between components. |
| Vector Embeddings | Converts clinical content into vector representations for efficient semantic search and retrieval. |
| SLM on GCP | Small Language Model deployed as a service API endpoint for generating and serving summaries. |
- Automated Change Data Capture: Uses Debezium to monitor and stream all inserts, updates, and deletes from the MySQL database into Kafka topics, ensuring real-time data flow.
- Real-Time Data Pipeline: Kafka serves as the central message bus, enabling scalable and decoupled processing of clinical content updates.
- Semantic Summarization: Incoming clinical data is transformed into vector embeddings and processed by an SLM to generate concise, context-aware summaries.
- Cloud-Native API: Summarization services are exposed via a robust API endpoint on GCP for easy integration with downstream applications.
- Administrative Efficiency: Reduces the manual burden of browsing and summarizing clinical records, accelerating access to critical information.
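To make the change data capture feature concrete, here is a minimal sketch of how a Kafka consumer might interpret one Debezium change event. The envelope fields `op`, `before`, and `after` follow Debezium's documented event format. The table columns (`id`, `note`) and the function name `apply_change` are hypothetical and not taken from SumUP's code.

```python
import json

# Debezium operation codes: c = create, u = update, d = delete, r = snapshot read.
OP_NAMES = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot"}


def apply_change(raw_event: str, store: dict) -> str:
    """Apply one Debezium change event to an in-memory copy of a table.

    `store` maps primary key -> row. The `id` key column is an assumption.
    """
    event = json.loads(raw_event)
    # With the JSON converter and schemas enabled, the envelope sits under
    # "payload"; otherwise the envelope is the top-level object.
    payload = event.get("payload", event)
    op = payload["op"]
    if op == "d":
        # Deletes carry the old row in "before" and null in "after".
        store.pop(payload["before"]["id"], None)
    else:
        # Inserts, updates, and snapshot reads carry the new row in "after".
        row = payload["after"]
        store[row["id"]] = row
    return OP_NAMES.get(op, "unknown")


if __name__ == "__main__":
    table = {}
    created = '{"payload": {"op": "c", "before": null, "after": {"id": 1, "note": "Initial visit"}}}'
    print(apply_change(created, table), table)
```

A real consumer would also need to handle the tombstone record (a message with a null value) that Debezium can emit after a delete; that case is left out here for brevity.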
- Data Ingestion: Clinical summaries and content are staged in a MySQL database.
- Change Data Capture: The Debezium connector monitors MySQL and streams change events (CDC) into Kafka topics.
- Vectorization: Kafka consumers process the data, generating vector embeddings for semantic understanding.
- Summarization: The SLM, deployed on GCP, consumes vectorized data and produces concise summaries via an API endpoint.
- Delivery: Summaries are made available to client applications through the GCP-hosted API, supporting fast and economical access.
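The vectorization and summarization steps rely on finding the stored content most similar to a request. The sketch below illustrates that retrieval idea with cosine similarity. Note that `embed` is a toy bag-of-words stand-in, not the embedding model SumUP uses, and the function names and sample notes are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    notes = [
        "patient reports chest pain",
        "follow-up visit scheduled",
        "chest x-ray normal",
    ]
    print(top_k("chest pain", notes, k=1))
```

In a real deployment the top-ranked passages would be passed to the SLM as context; a vector database would replace the linear scan shown here.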
- Docker (for local development)
- Access to a MySQL database
- Kafka and Zookeeper setup
- Debezium MySQL Connector
- GCP account for SLM deployment
- Spin up MySQL, Kafka, and Debezium: Use Docker Compose to orchestrate containers for MySQL, Kafka, Zookeeper, and Debezium.
- Configure Debezium Connector: Register the Debezium MySQL source connector to monitor the target database and stream changes into Kafka.
- Set Up Vectorization and SLM: Deploy vector embedding services and the SLM API on GCP.
- Integrate API: Connect your client application to the SLM API endpoint for real-time summary retrieval.
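The connector registration step can be sketched as a request to the Kafka Connect REST API (`POST /connectors`). The property names below follow Debezium 2.x; older releases use different names for some of them. The connector name, topic prefix, server id, broker address, and Connect URL are placeholder assumptions, not values from this project.

```python
import json
import urllib.request


def build_connector_config(host: str, user: str, password: str, database: str) -> dict:
    """Build a Debezium MySQL source connector definition (placeholder values)."""
    return {
        "name": "sumup-mysql-connector",  # hypothetical connector name
        "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            "database.hostname": host,
            "database.port": "3306",
            "database.user": user,
            "database.password": password,
            "database.server.id": "184054",  # any id unique in the MySQL cluster
            "topic.prefix": "sumup",  # hypothetical prefix for Kafka topic names
            "database.include.list": database,
            "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
            "schema.history.internal.kafka.topic": "schema-history.sumup",
        },
    }


def build_registration_request(connect_url: str, connector: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST request that registers the connector."""
    return urllib.request.Request(
        url=f"{connect_url}/connectors",
        data=json.dumps(connector).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    connector = build_connector_config("mysql", "debezium", "change-me", "clinical")
    request = build_registration_request("http://localhost:8083", connector)
    print(request.get_method(), request.full_url)
    # To actually register: urllib.request.urlopen(request)
```

Keeping the request construction separate from the network call makes the configuration easy to inspect before anything is sent to Kafka Connect.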
A clinician needs a quick summary of a patient's recent medical history. Instead of manually browsing multiple records, the application queries the SumUP API, which returns a concise, up-to-date summary generated by the SLM, reducing time and cognitive load.
- Speed: Real-time updates and summarization.
- Cost-Effective: Uses SLMs for economical inference.
- Scalable: Built on Kafka and cloud-native services.
- Interoperable: API-based integration with clinical applications.
SumUP now includes a natural-language Report Assistant for the Scheduler clinical research scheduling system. It replaces the Standard Reports and Report Builder tabs with a conversational interface powered by Groq (llama3-70b).
    Browser → frontend/chatbot.html → POST /chat → FastAPI (main.py)
                                                          ↓
                                               Groq llama3-70b → NL→SQL
                                                          ↓
                                            MySQL scheduler DB (read-only)
Users ask questions in plain English ("Show all cancellations this month", "List open studies by PI") and get back a data table, the generated SQL, and a CSV export button.
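As an illustration of the three outputs described above (data table, generated SQL, CSV export), the following sketch assembles a response from query results. The response field names (`sql`, `rows`, `csv`) and both function names are hypothetical; the actual `/chat` response schema is not documented here, and the CSV may well be produced in the browser instead.

```python
import csv
import io


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize query result rows (one dict per row) to CSV text."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def build_chat_response(sql: str, rows: list[dict]) -> dict:
    """Bundle the generated SQL, the table rows, and a CSV export."""
    return {"sql": sql, "rows": rows, "csv": rows_to_csv(rows)}


if __name__ == "__main__":
    result = [{"study": "A-101", "pi": "Lee"}, {"study": "B-202", "pi": "Ortiz"}]
    response = build_chat_response("SELECT study, pi FROM studies", result)
    print(response["csv"])
```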
    # 1. Create read-only DB user
    mysql -u root < setup.sql
    # 2. Configure environment
    cp .env.example .env   # add GROQ_API_KEY and DB_PASSWORD
    # 3. Install and run
    python3 -m venv venv && source venv/bin/activate
    pip install -r requirements.txt
    uvicorn main:app --port 8001 --reload

The frontend lives in frontend/. Drop chatbot.html, chatbot.js, and chatbot.css into the Scheduler webapp's web/src/main/webapp/ directory.
- Only `SELECT` statements execute; all writes are blocked
- Results capped at 200 rows
- DB user has `SELECT`-only privileges
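The application-level safeguards listed above could be enforced along these lines. This is a minimal sketch, not the checks in main.py: the function names and the keyword denylist are assumptions, and the read-only database user remains the real safety net.

```python
import re
from itertools import islice

MAX_ROWS = 200  # row cap stated in the list above

# Conservative denylist of write/DDL keywords. It may reject harmless queries
# that mention these words inside string literals.
_WRITE_KEYWORDS = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|grant|revoke)\b",
    re.IGNORECASE,
)


def guard_sql(sql: str) -> str:
    """Return the statement if it is a single SELECT, else raise ValueError."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        raise ValueError("multiple statements are not allowed")
    if not re.match(r"select\b", statement, re.IGNORECASE):
        raise ValueError("only SELECT statements are allowed")
    if _WRITE_KEYWORDS.search(statement):
        raise ValueError("write keywords are not allowed")
    return statement


def cap_rows(rows, limit: int = MAX_ROWS) -> list:
    """Keep at most `limit` rows from any iterable of result rows."""
    return list(islice(rows, limit))


if __name__ == "__main__":
    print(guard_sql("SELECT name FROM studies WHERE status = 'open';"))
    print(len(cap_rows(range(500))))
```

Capping rows on the fetch side avoids rewriting the generated SQL, which may already contain its own `LIMIT` clause.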
Accuracy improved significantly after adding regex extraction of patient name and patient MRN. These fields enabled more targeted indexing of the vector database and gave the SLM more relevant context. Further improvements planned:
- Weighted encoding and NLP extraction of keywords (study_name, visit_template, etc.) mapped to structured data for more targeted SLM context
- Chatbot follow-up questions for ambiguous queries
- Role-aware query filtering (restrict result sets by user role)
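The regex extraction of patient name and MRN mentioned above might look like the following sketch. The patterns are assumptions made for illustration: an MRN is taken to be 6 to 10 digits following the token "MRN", and a name is taken to be two or more capitalized words after the word "patient". The real patterns and the name `extract_filters` are not from the project.

```python
import re

# Assumed formats: "MRN: 00123456", "MRN#123456", "mrn 1234567".
MRN_RE = re.compile(r"\bMRN[:#\s]*(\d{6,10})\b", re.IGNORECASE)
# Assumed format: "patient Jane Doe" (two or more capitalized words).
NAME_RE = re.compile(r"\bpatient\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)")


def extract_filters(question: str) -> dict:
    """Pull structured filters out of a free-text question, if present."""
    filters = {}
    if (match := MRN_RE.search(question)):
        filters["mrn"] = match.group(1)
    if (match := NAME_RE.search(question)):
        filters["patient_name"] = match.group(1)
    return filters


if __name__ == "__main__":
    print(extract_filters("Summarize visits for patient Jane Doe, MRN: 00123456"))
```

The extracted values could then be used as metadata filters on the vector database so that only records for the matching patient are retrieved as context.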
- Debezium for CDC technology.
- Apache Kafka for streaming infrastructure.
- Google Cloud Platform for scalable model deployment.