zavera/sum-up

SumUP

SumUP is a data engineering platform designed to streamline clinical content management by automating the summarization of clinical data. It leverages economical Small Language Models (SLMs) deployed on Google Cloud Platform (GCP) to provide fast, API-based access to concise clinical summaries, reducing administrative overhead and enabling rapid information retrieval.


Architecture Overview

SumUP integrates several modern data technologies to enable real-time, scalable, and cost-effective summarization workflows:

Component            Description
MySQL Database       Stores staging summaries and clinical data.
Debezium Connector   Captures change data (CDC) events from MySQL and streams them to Kafka topics in real time.
Kafka                Acts as the backbone for streaming CDC events and summary data between components.
Vector Embeddings    Converts clinical content into vector representations for efficient semantic search and retrieval.
SLM on GCP           Small Language Model deployed as a service API endpoint for generating and serving summaries.

Key Features

  • Automated Change Data Capture:
    • Uses Debezium to monitor and stream all inserts, updates, and deletes from the MySQL database into Kafka topics, ensuring real-time data flow.
  • Real-Time Data Pipeline:
    • Kafka serves as the central message bus, enabling scalable and decoupled processing of clinical content updates.
  • Semantic Summarization:
    • Incoming clinical data is transformed into vector embeddings and processed by an SLM to generate concise, context-aware summaries.
  • Cloud-Native API:
    • Summarization services are exposed via a robust API endpoint on GCP for easy integration with downstream applications.
  • Administrative Efficiency:
    • Reduces the manual burden of browsing and summarizing clinical records, accelerating access to critical information.
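
As an illustration of the change data capture feature, the sketch below reduces a Debezium change event to the operation and the row image a downstream consumer would care about. It relies on the standard Debezium envelope fields (`op`, `before`, `after`, `source`); the `clinical_notes` table, the column names, and the `parse_change_event` helper are hypothetical and not taken from this repository.

```python
import json
from typing import Any, Optional

# Debezium operation codes: c = create, u = update, d = delete, r = snapshot read.
OP_NAMES = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot"}


def parse_change_event(raw: Optional[bytes]) -> Optional[dict[str, Any]]:
    """Reduce a Debezium change event to the operation and the relevant row image."""
    if raw is None:
        # Tombstone record (null value) that can follow a delete; nothing to process.
        return None
    event = json.loads(raw)
    # With schemas enabled in the JSON converter, the envelope sits under "payload".
    payload = event.get("payload", event)
    op = payload.get("op")
    if op not in OP_NAMES:
        return None
    # Deletes only carry the "before" image; other operations carry "after".
    row = payload.get("before") if op == "d" else payload.get("after")
    return {
        "operation": OP_NAMES[op],
        "table": payload.get("source", {}).get("table"),
        "row": row,
    }


# Hypothetical event for an assumed clinical_notes table.
sample = json.dumps({
    "payload": {
        "op": "u",
        "before": {"id": 7, "note": "old text"},
        "after": {"id": 7, "note": "new text"},
        "source": {"table": "clinical_notes"},
    }
}).encode()

change = parse_change_event(sample)
print(change)
```

In a real consumer, `raw` would be the value of each Kafka record read from the connector's topic.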

Workflow

  1. Data Ingestion:
    • Clinical summaries and content are staged in a MySQL database.
  2. Change Data Capture:
    • Debezium connector monitors MySQL and streams change events (CDC) into Kafka topics.
  3. Vectorization:
    • Kafka consumers process the data, generating vector embeddings for semantic understanding.
  4. Summarization:
    • The SLM, deployed on GCP, consumes vectorized data and produces concise summaries via an API endpoint.
  5. Delivery:
    • Summaries are made available to client applications through the GCP-hosted API, supporting fast and economical access.
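
The vectorization and summarization steps above can be sketched as a similarity lookup: embed the stored notes, embed the query, and pass the closest notes to the SLM as context. The example below is a toy stand-in, not the project's implementation. It uses a bag-of-words `Counter` in place of a real embedding model, and the `embed`, `cosine`, and `top_k` helpers and the sample notes are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


# Hypothetical notes; real input would come from the Kafka consumers.
notes = [
    "Patient reports chest pain after exercise",
    "Follow-up visit scheduled for blood pressure review",
    "Chest X-ray shows no acute findings",
]
context = top_k("chest pain", notes, k=1)
print(context)
```

A production setup would swap `embed` for a learned embedding model and the linear scan for a vector database, but the ranking idea is the same.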

Getting Started

Prerequisites

  • Docker (for local development)
  • Access to a MySQL database
  • Kafka and ZooKeeper setup
  • Debezium MySQL Connector
  • GCP account for SLM deployment

Quick Start

  1. Spin up MySQL, Kafka, and Debezium:
    • Use Docker Compose to orchestrate containers for MySQL, Kafka, ZooKeeper, and Debezium.
  2. Configure Debezium Connector:
    • Register the Debezium MySQL source connector to monitor the target database and stream changes into Kafka.
  3. Set Up Vectorization and SLM:
    • Deploy vector embedding services and the SLM API on GCP.
  4. Integrate API:
    • Connect your client application to the SLM API endpoint for real-time summary retrieval.
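
For step 2, a connector registration payload might look like the sketch below. It assumes Debezium 2.x property names (`topic.prefix`, `schema.history.internal.*`); Debezium 1.x used `database.server.name` and `database.history.*` instead. The connector name, host names, credentials, topic names, and the `build_connector_config` helper are placeholders, not values from this repository.

```python
import json


def build_connector_config(host: str, user: str, password: str, database: str) -> dict:
    """Build a Kafka Connect registration payload for a Debezium MySQL source."""
    return {
        "name": "sumup-mysql-connector",  # hypothetical connector name
        "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            "database.hostname": host,
            "database.port": "3306",
            "database.user": user,
            "database.password": password,
            # Must be unique among all clients of the MySQL server.
            "database.server.id": "184054",
            # Hypothetical prefix; Debezium prepends it to every topic name.
            "topic.prefix": "sumup",
            "database.include.list": database,
            # Assumed broker address and topic for the schema history.
            "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
            "schema.history.internal.kafka.topic": "schema-history.sumup",
        },
    }


payload = build_connector_config("mysql", "debezium", "change-me", "clinical")
print(json.dumps(payload, indent=2))
```

The resulting JSON would typically be sent with a POST request to the Kafka Connect REST API at `/connectors` (port 8083 by default).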

Example Use Case

A clinician needs a quick summary of a patient's recent medical history. Instead of manually browsing multiple records, the application queries the SumUP API, which returns a concise, up-to-date summary generated by the SLM, reducing time and cognitive load.


Why SumUP?

  • Speed: Real-time updates and summarization.
  • Cost-Effective: Uses SLMs for economical inference.
  • Scalable: Built on Kafka and cloud-native services.
  • Interoperable: API-based integration with clinical applications.

Scheduler Report Assistant (Chatbot Add-on)

SumUP now includes a natural-language Report Assistant for the Scheduler clinical research scheduling system. It replaces the Standard Reports and Report Builder tabs with a conversational interface powered by Groq (llama3-70b).

How it works

Browser → frontend/chatbot.html → POST /chat → FastAPI (main.py)
                                                    ↓
                                          Groq llama3-70b → NL→SQL
                                                    ↓
                                      MySQL scheduler DB (read-only)

Users ask questions in plain English ("Show all cancellations this month", "List open studies by PI") and get back a data table, the generated SQL, and a CSV export button.

Quick Start

# 1. Create read-only DB user
mysql -u root < setup.sql

# 2. Configure environment
cp .env.example .env   # add GROQ_API_KEY and DB_PASSWORD

# 3. Install and run
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
uvicorn main:app --port 8001 --reload

The frontend lives in frontend/. To install it, copy chatbot.html, chatbot.js, and chatbot.css into the Scheduler webapp's web/src/main/webapp/ directory.

Safety

  • Only SELECT statements execute — all writes are blocked
  • Results capped at 200 rows
  • DB user has SELECT-only privileges
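
The first two safeguards can be approximated in application code as shown below. This is a minimal sketch of one way to enforce them, not the logic in main.py: `guard_sql`, `cap_rows`, and the keyword list are hypothetical. The check is deliberately conservative, so it also rejects harmless queries that contain a semicolon or a write keyword inside a string literal.

```python
import re
from itertools import islice
from typing import Any, Iterable

MAX_ROWS = 200  # matches the row cap stated above

# Keywords that should never appear in a read-only query.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|replace|grant|revoke)\b",
    re.IGNORECASE,
)


def guard_sql(sql: str) -> str:
    """Return a single SELECT statement or raise ValueError."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        raise ValueError("multiple statements are not allowed")
    if not re.match(r"select\b", statement, re.IGNORECASE):
        raise ValueError("only SELECT statements are allowed")
    if FORBIDDEN.search(statement):
        raise ValueError("write keyword found in query")
    return statement


def cap_rows(rows: Iterable[Any]) -> list[Any]:
    """Keep at most MAX_ROWS rows; with a DB-API cursor, fetchmany(MAX_ROWS) does the same."""
    return list(islice(rows, MAX_ROWS))


print(guard_sql("SELECT id, status FROM visits;"))
print(len(cap_rows(range(500))))
```

The SELECT-only database user remains the stronger guarantee, since it holds even if a crafted query slips past string checks like these.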

Next Steps

Accuracy was significantly improved by adding regex extraction of patient name and patient MRN. These fields enabled more targeted indexing of the vector database and supplied the most relevant context to the SLM. Further improvements planned:

  • Weighted encoding and NLP extraction of keywords (study_name, visit_template, etc.) mapped to structured data for more targeted SLM context
  • Chatbot follow-up questions for ambiguous queries
  • Role-aware query filtering (restrict result sets by user role)
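
The regex extraction of patient name and MRN described above could look roughly like the sketch below. The field formats are assumptions made for illustration ("Patient: Jane Doe" or "Patient Name: Jane Doe", and "MRN: 00123456" with 6 to 10 digits); the actual patterns used by the project are not shown in this README, and `extract_patient_keys` is a hypothetical helper.

```python
import re
from typing import Optional

# Assumed formats, for illustration only.
MRN_RE = re.compile(r"\bMRN[:#]?\s*(\d{6,10})\b", re.IGNORECASE)
NAME_RE = re.compile(r"\bPatient(?:\s+Name)?:\s*([A-Z][a-z]+(?: +[A-Z][a-z]+)+)")


def extract_patient_keys(text: str) -> dict[str, Optional[str]]:
    """Pull the patient name and MRN out of free text, or None when absent."""
    mrn = MRN_RE.search(text)
    name = NAME_RE.search(text)
    return {
        "mrn": mrn.group(1) if mrn else None,
        "name": name.group(1) if name else None,
    }


# Hypothetical note text.
note = "Patient Name: Jane Doe\nMRN: 00123456\nSeen for follow-up visit."
keys = extract_patient_keys(note)
print(keys)
```

The extracted values could then be stored as metadata alongside each embedding so that retrieval can filter by patient before ranking by similarity.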

Acknowledgments
