Patient Clinical Timeline Extraction

This repository provides a set of tools for constructing patient clinical timelines from free text in Electronic Healthcare Records (EHRs).

You can choose between four different methods for relation extraction between clinical entities and dates, which are used to construct the timelines.

These are:

- A naive model that, for each clinical entity, finds the nearest date within a maximum character distance
- A fine-tuned BERT model for binary relation classification between dates and clinical entities, with span pooling to focus on entity representations; designed to work with any BERT model from HuggingFace
- An LLM approach with few-shot prompting, for either binary relation classification or multi-relation extraction; designed to work with any text generation model from HuggingFace and/or any OpenAI API model
- RelCAT, a module of the MedCAT framework designed specifically for relation extraction between entities in clinical text
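
To make the naive approach concrete, here is a minimal sketch of nearest-date matching. It is not the repository's implementation: the function name `nearest_date`, the `max_distance` default, and the ISO-only date pattern are all assumptions made for illustration, and real clinical notes would need a much broader date extractor.

```python
import re
from typing import Optional

# Assumption: dates are written as YYYY-MM-DD. This pattern is illustrative only.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def nearest_date(text: str, entity_start: int, entity_end: int,
                 max_distance: int = 100) -> Optional[str]:
    """Return the date closest to the entity span, or None if no date
    lies within max_distance characters of it."""
    best_date, best_gap = None, None
    for match in DATE_PATTERN.finditer(text):
        if match.end() <= entity_start:
            gap = entity_start - match.end()    # date precedes the entity
        elif match.start() >= entity_end:
            gap = match.start() - entity_end    # date follows the entity
        else:
            gap = 0                             # date overlaps the entity span
        if gap <= max_distance and (best_gap is None or gap < best_gap):
            best_date, best_gap = match.group(), gap
    return best_date


note = "Seen on 2021-03-04 with chest pain. Follow-up booked for 2021-04-01."
start = note.index("chest pain")
end = start + len("chest pain")
print(nearest_date(note, start, end))
```

Measuring the gap between span edges rather than span centres keeps long entity mentions from being penalised; a different distance definition would be equally defensible.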

The repository provides utility functions for these extraction methods, as well as customisable notebooks for training, evaluation and inference.

Usage

Get Started

1. Clone the repo.
2. Create a virtual environment and install dependencies:
```
pip install -r requirements.txt
```
3. If you plan to use OpenAI LLMs, create a .env file in the root folder with the following:
```
OPENAI_API_KEY = 'your_key_here'
```

Data Format

The system works with CSV files containing the following columns:

- patient_id: Patient identifier
- doc_id: Document identifier
- document_timestamp: Document date (YYYY-MM-DD)
- note_text: Clinical note text

A sample file data/data.csv is provided as an example of the required format.
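
Before running a pipeline it can help to check that a file matches this format. The sketch below uses only the standard library; `validate_notes_csv` is a hypothetical helper, not a function provided by this repository, and the inline sample rows are made up for illustration.

```python
import csv
import io
from datetime import datetime

REQUIRED_COLUMNS = ["patient_id", "doc_id", "document_timestamp", "note_text"]


def validate_notes_csv(handle):
    """Return a list of problems found; an empty list means the file
    has the required columns and parseable document dates."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    problems = []
    for record_number, row in enumerate(reader, start=1):
        value = row["document_timestamp"]
        try:
            datetime.strptime(value, "%Y-%m-%d")
        except (ValueError, TypeError):  # TypeError covers a short row (value is None)
            problems.append(f"record {record_number}: bad document_timestamp {value!r}")
    return problems


# Illustrative data only: the second record uses the wrong date format.
sample = io.StringIO(
    "patient_id,doc_id,document_timestamp,note_text\n"
    "p1,d1,2021-03-04,Chest pain on admission.\n"
    "p1,d2,04/03/2021,Reviewed in clinic.\n"
)
print(validate_notes_csv(sample))
```

To check a real file, pass an open handle instead, for example `open("data/data.csv", newline="")`.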

Data Pipelines

There are three core pipelines: one for training and evaluation, one for inference, and one for post-processing to construct the patient timelines.

1. Training Pipeline via MedCAT Trainer

This pipeline uses the MedCAT Trainer tool to label entities, dates and relations in the clinical text, and uses this data to train, finetune and evaluate the various relation extraction methods.

Steps:

1. Upload the raw CSV file to MedCAT Trainer. Note that the column names will need to be changed; see the MedCAT Trainer documentation for further details.
2. Annotate entities, dates and relationships using the MedCAT Trainer interface.
3. Download the JSON export and run it through the create_training_dataset.ipynb notebook. An example JSON file is provided in data/MedCAT_Export.json if you wish to try the training process without doing the manual labelling.
4. This creates training_dataset.csv with all required columns, including ground truth relationships.
5. Use this dataset to train and evaluate any of the extraction methods in the notebooks_training folder.

2. Inference Pipeline

This pipeline is designed for processing new documents at scale and doesn't require manual labelling. It uses the MedCAT library to extract entities, utility functions to extract dates, and the models from the training pipeline to predict the relations between them.

Steps:

1. Run a new raw CSV file through the create_inference_dataset.ipynb notebook. The file must be in the exact format described in the Data Format section above.
2. This creates inference_dataset.csv with all required columns.
3. Use any of the extractors in the notebooks_inference folder to run inference on the data and generate predictions.
4. This creates files named {extractor_type}_predictions.json in the outputs folder.

3. Post-Processing Pipeline

This pipeline aggregates notes by individual patient and uses the predicted relations to construct the final patient timelines.

Steps:

1. Run a {extractor_type}_predictions.json file created by any of the inference pipeline extractors through the create_patient_timelines.ipynb notebook.
2. This creates the following, both saved in the outputs folder:
- Interactive timeline visualizations (saved as HTML files)
- JSON timeline summaries for each patient
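
The aggregation idea behind this pipeline can be sketched as follows. The schema of the predictions files is not documented here, so the record keys (`patient_id`, `entity`, `date`) and the function `build_timelines` are assumptions chosen for illustration, and the example records are invented; the notebook's actual logic may differ.

```python
from collections import defaultdict


def build_timelines(predictions):
    """Group predicted (date, entity) relations by patient and sort
    each patient's events chronologically."""
    timelines = defaultdict(list)
    for record in predictions:
        timelines[record["patient_id"]].append((record["date"], record["entity"]))
    # ISO dates (YYYY-MM-DD) sort chronologically when compared as strings.
    return {patient: sorted(events) for patient, events in timelines.items()}


# Hypothetical prediction records, for illustration only.
predictions = [
    {"patient_id": "p1", "entity": "chest pain", "date": "2021-03-04"},
    {"patient_id": "p2", "entity": "asthma", "date": "2019-11-20"},
    {"patient_id": "p1", "entity": "hypertension", "date": "2020-01-15"},
]
for patient, events in build_timelines(predictions).items():
    print(patient, events)
```

Storing the date first in each tuple lets a plain `sorted` call order the events without a custom key.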
