This repository provides a set of tools for constructing patient clinical timelines from free text in Electronic Healthcare Records (EHRs).
You can choose between four different methods for relation extraction between clinical entities and dates, which are used to construct the timelines.
These are:
- Naive model that, for each clinical entity, finds the nearest date within a maximum character distance
- Finetuned BERT model for binary relation classification between dates and clinical entities with span pooling to focus on entity representations, designed to work with any BERT model from HuggingFace
- LLM approach with few-shot prompting, either for binary relation classification or multi-relation extraction, designed to work with any text generation model from HuggingFace and/or any OpenAI API model
- RelCAT, a module of the MedCAT framework, specifically designed for relation extraction between entities in clinical text
The repository provides utility functions for these extraction methods, as well as customisable notebooks for training, evaluation and inference.
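To make the naive method concrete, here is a minimal sketch of nearest-date linking. It is not the repository's implementation: the function names, the date regex, and the `max_distance` default are assumptions chosen for illustration, and the entity spans are written by hand where the repository would obtain them from MedCAT.

```python
import re

# Assumption: dates appear as YYYY-MM-DD or DD/MM/YYYY. Real clinical notes
# contain many more formats, so this pattern is illustrative only.
DATE_PATTERN = re.compile(r"\b(?:\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{4})\b")


def find_dates(text):
    """Return (start, end, value) for each date-like match in the text."""
    return [(m.start(), m.end(), m.group()) for m in DATE_PATTERN.finditer(text)]


def char_gap(span_a, span_b):
    """Number of characters between two (start, end) spans; 0 if they overlap."""
    return max(span_a[0] - span_b[1], span_b[0] - span_a[1], 0)


def link_nearest_dates(entities, dates, max_distance=100):
    """Pair each entity with its nearest date within max_distance characters.

    Entities with no date in range are paired with None.
    """
    links = []
    for ent_start, ent_end, ent_text in entities:
        best = None
        best_gap = max_distance + 1
        for d_start, d_end, d_value in dates:
            gap = char_gap((ent_start, ent_end), (d_start, d_end))
            if gap < best_gap:
                best, best_gap = d_value, gap
        links.append((ent_text, best))
    return links


note = "Chest pain reported on 2023-01-05. Started aspirin 12/01/2023 after review."

# Hypothetical entity spans, written by hand for this example.
entities = []
for term in ("Chest pain", "aspirin"):
    start = note.index(term)
    entities.append((start, start + len(term), term))

links = link_nearest_dates(entities, find_dates(note))
print(links)
```

Lowering `max_distance` makes the linker more conservative: an entity whose closest date is further away than the limit is left unlinked rather than attached to a distant, possibly unrelated date.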
- Clone the repo
- Create a virtual environment and install dependencies:
pip install -r requirements.txt
- If you plan to use OpenAI LLMs then create a .env file in the root folder with the following:
OPENAI_API_KEY = 'your_key_here'
The system works with CSV files containing the following columns:
- patient_id: Patient identifier
- doc_id: Document identifier
- document_timestamp: Document date (YYYY-MM-DD)
- note_text: Clinical note text
A sample file data/data.csv is provided as an example of the required format.
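Before running a new file through the notebooks, it can help to check that it matches this format. The sketch below uses only the Python standard library; `validate_rows` is a hypothetical helper and not part of the repository, and the inline sample rows are invented for the example.

```python
import csv
import io
from datetime import datetime

REQUIRED_COLUMNS = ["patient_id", "doc_id", "document_timestamp", "note_text"]


def validate_rows(handle):
    """Return a list of problems found in a CSV file-like object."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    for row_no, row in enumerate(reader, start=1):  # counts data rows, not the header
        timestamp = row["document_timestamp"] or ""
        try:
            # Checks that the value parses as a year-month-day date.
            datetime.strptime(timestamp, "%Y-%m-%d")
        except ValueError:
            problems.append(f"row {row_no}: bad document_timestamp {timestamp!r}")
        if not (row["note_text"] or "").strip():
            problems.append(f"row {row_no}: empty note_text")
    return problems


# Invented sample data: the second row uses the wrong date format.
sample = (
    "patient_id,doc_id,document_timestamp,note_text\n"
    "P1,D1,2023-01-05,Chest pain reported.\n"
    "P1,D2,05/01/2023,Follow-up visit.\n"
)
problems = validate_rows(io.StringIO(sample))
print(problems)
```

To check a file on disk, open it with `open("data/data.csv", newline="")` and pass the handle to `validate_rows` in the same way.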
There are three core pipelines: one for training/evaluation, one for inference, and one for post-processing to construct the patient timelines.
The training/evaluation pipeline uses the MedCAT Trainer tool to label entities, dates and relations in the clinical text, and uses this data to train, finetune and evaluate the various relation extraction methods.
Steps:
- Upload the raw CSV file to MedCAT Trainer. Note that the column names will need to be changed; see the MedCAT Trainer documentation for further details
- Annotate entities, dates, and relationships using the MedCAT Trainer interface
- Download the JSON export and run it through the create_training_dataset.ipynb notebook. An example JSON file is provided in data/MedCAT_Export.json if you wish to try the training process without having to do the manual labelling
- This creates training_dataset.csv with all required columns, including ground truth relationships
- Use this dataset for training and evaluating any extraction methods in the notebooks_training folder
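For the binary relation classification setting, evaluation against the ground truth relationships typically comes down to precision, recall and F1 on the "related" class. The following is a minimal sketch of those metrics, not the evaluation code used in the notebooks; the function name and the label encoding (1 for related, 0 for not related) are assumptions.

```python
def binary_relation_scores(gold, predicted):
    """Precision, recall and F1 for binary labels (1 = related, 0 = not related)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # correctly linked pairs
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # spurious links
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # missed links

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented labels for six entity-date pairs.
gold = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]
scores = binary_relation_scores(gold, predicted)
print(scores)
```

Reporting precision and recall separately is useful here because the two error types differ in cost: a spurious link places an event on the wrong date in a timeline, while a missed link leaves the event out.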
The inference pipeline is designed for processing new documents at scale and doesn't require manual labelling. It uses the MedCAT library to extract entities, various utility functions to extract dates, and the models from the training pipeline to perform relation extraction and make predictions.
Steps:
- Run a new raw CSV file through the create_inference_dataset.ipynb notebook (note: this should be in the exact format described above in the Data Format section)
- This creates inference_dataset.csv with all required columns
- Use any of the extractors in the notebooks_inference folder to do inference on the data and generate predictions
- This will create files named {extractor_type}_predictions.json in the outputs folder
The post-processing pipeline aggregates notes by patient and uses the predicted relations to construct the final patient timelines.
Steps:
- Run a {extractor_type}_predictions.json file created by any of the inference pipeline extractors through the create_patient_timelines.ipynb notebook
- This will create:
  - Interactive timeline visualizations (saved as HTML files)
  - JSON timeline summaries for each patient
- Both can be found in the outputs folder
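The aggregation step can be pictured as grouping predicted entity-date links by patient and sorting them chronologically. The sketch below illustrates that idea only. The record fields (`patient_id`, `entity`, `date`) are hypothetical and do not reflect the actual schema of the predictions files, and `build_timelines` is not a function from the repository.

```python
import json
from collections import defaultdict
from datetime import date


def build_timelines(predictions):
    """Group predicted (entity, date) links by patient and sort them by date."""
    events_by_patient = defaultdict(set)  # a set drops links repeated across notes
    for record in predictions:
        event_date = date.fromisoformat(record["date"])
        events_by_patient[record["patient_id"]].add((event_date, record["entity"]))

    return {
        patient: [{"date": d.isoformat(), "entity": e} for d, e in sorted(events)]
        for patient, events in events_by_patient.items()
    }


# Invented predictions; the last record duplicates an earlier one.
predictions = [
    {"patient_id": "P1", "entity": "aspirin", "date": "2023-01-12"},
    {"patient_id": "P1", "entity": "chest pain", "date": "2023-01-05"},
    {"patient_id": "P2", "entity": "asthma", "date": "2021-06-30"},
    {"patient_id": "P1", "entity": "chest pain", "date": "2023-01-05"},
]
timelines = build_timelines(predictions)
print(json.dumps(timelines, indent=2))
```

Dates are parsed into `date` objects before sorting so that ordering is chronological by construction, and they are converted back to ISO strings only when the summary is serialised.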