wget https://github.com/synthetichealth/synthea/releases/download/master-branch-latest/synthea-with-dependencies.jarNextflow for Synthea EHR Data
Summary
This notebook describes a mini-project I put together to practice the basics of using Nextflow and explore simulated EHR data.
Synthea is an open-source synthetic health record generator that produces realistic patient-level clinical data (demographics, encounters, conditions, medications, procedures). It was developed by the non-profit MITRE and is commonly used for pipeline protoyping for large-scale HL7 FHIR data.
In this workflow, I used Synthea as a parameterized data generator inside a reproducible pipeline. Nextflow orchestrates Synthea runs with user-defined inputs (patient count, random seed, location), producing raw CSVs with simulated patient information. A Python ETL step then standardizes and aggregates the Synthea outputs into a tidy patient-level feature table. The final output from the Nextflow pipeline is a single Parquet table that I load into R in this quarto doc for visualizations.
Synthea
Links:
A Java executable runs the simulation engine and generates synthetic electronic health record data. You can specify parameters like the number of patients to generate, the location (city/state), and a random seed for reproducibility.
Download
Example run
I modified the synthea.properties file to set the output format to CSV.
# Generate 10,000 synthetic patients from Denver, Colorado
java -jar /path/to/synthea-with-dependencies.jar \
-c /path/to/synthea/synthea.properties \
-p 10000 \
-s 18 \
Colorado DenverParsing Synthea CSVs
I created a python script to parse and standardize the Synthea CSV outputs into a patient-level feature table. The script performs the following:
- Reads the relevant CSV files (patients, encounters, conditions, medications) with pandas.
- Uses the janitor library to clean and standardize column names.
- Calculates patient age from birthdate and filters patients based on optional age criteria.
- Counts the number of conditions, medications, encounters, and ER admissions for each patient.
- Generates binary flags for specific conditions (diabetes, hypertension, asthma) based on SNOMED codes.
- Outputs the final standardized table as a parquet file for hand-off to R.
parse_synthea.py
import argparse
from pathlib import Path
import pandas as pd
import janitor
def load_csvs(input_dir):
p = Path(input_dir)
files = {}
for name in ["patients", "encounters", "conditions", "medications"]:
fp = p / f"{name}.csv"
files[name] = pd.read_csv(fp).clean_names() if fp.exists() else pd.DataFrame()
return files
def filter_last_n_years(df, date_col, cutoff_date):
if df.empty or date_col not in df.columns:
return df
df = df.copy()
df[date_col] = pd.to_datetime(
df[date_col], errors="coerce", utc=True
).dt.tz_convert(None)
return df.loc[df[date_col] >= cutoff_date]
def compute_features(files, age_min=None, age_max=None):
patients = files["patients"]
patients["birthdate"] = pd.to_datetime(patients["birthdate"])
patients["age"] = (
(pd.Timestamp.today() - patients["birthdate"]).dt.days // 365
).astype(int)
if age_min is not None:
patients = patients[patients["age"] >= age_min]
if age_max is not None:
patients = patients[patients["age"] <= age_max]
# count conditions, meds, encounters in the last 10 years
cutoff_date = (pd.Timestamp.utcnow() - pd.DateOffset(years=10)).tz_localize(None)
cond = files["conditions"]
enc = files["encounters"]
meds = files["medications"]
cond = filter_last_n_years(cond, "start", cutoff_date)
enc = filter_last_n_years(enc, "start", cutoff_date)
meds = filter_last_n_years(meds, "start", cutoff_date)
cond_counts = cond.groupby("patient").size().rename("n_conditions")
med_counts = meds.groupby("patient").size().rename("n_medications")
enc_counts = enc.groupby("patient").size().rename("n_encounters")
er_counts = (
enc.loc[enc["description"].str.contains("Emergency room admission", na=False)]
.groupby("patient")
.size()
.rename("n_er_admissions")
)
df = patients.set_index("id")[["age", "gender", "race"]].copy()
df = (
df.join(cond_counts, how="left")
.join(med_counts, how="left")
.join(enc_counts, how="left")
.join(er_counts, how="left")
)
df["n_conditions"] = df["n_conditions"].fillna(0).astype(int)
df["n_medications"] = df["n_medications"].fillna(0).astype(int)
df["n_encounters"] = df["n_encounters"].fillna(0).astype(int)
condition_code_dict = {
"has_diabetes": {"44054006"},
"has_hypertension": {"59621000"},
"has_asthma": {"195967001"},
}
if not cond.empty and "code" in cond.columns:
cond_codes = cond["code"].astype(str)
for flag_name, codes in condition_code_dict.items():
patients_with_condition = cond.loc[cond_codes.isin(codes), "patient"]
df[flag_name] = df.index.isin(patients_with_condition).astype(int)
else:
for flag_name in condition_code_dict:
df[flag_name] = 0
# reset index (patient id)
df = df.reset_index().rename(columns={"index": "patient"})
return df
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--input-dir", required=True)
parser.add_argument("--out", required=True)
parser.add_argument("--age-min", type=int, default=None)
parser.add_argument("--age-max", type=int, default=None)
args = parser.parse_args()
files = load_csvs(args.input_dir)
table = compute_features(files, age_min=args.age_min, age_max=args.age_max)
outp = Path(args.out)
if outp.suffix == ".parquet":
table.to_parquet(outp, index=False)
else:
table.to_csv(outp, index=False)
if __name__ == "__main__":
main()Example run
python3 parse_synthea.py \
--input-dir /path/to/synthea/output/csv \
--out standardized_table.parquet \
--age-min 18 \
--age-max 120Nextflow
I created a simple pipeline to combine the Synthea data generation and parsing steps.
Download
Nextflow was just as easy to download and install as Synthea. Just need Java 17+. The documentation is a thing of beauty.
curl -s https://get.nextflow.io | bashmain.nf
nextflow.enable.dsl=2
params.patients = params.patients ?: 10000
params.city = params.city ?: Denver
params.seed = params.seed ?: 18
params.outdir = params.outdir ?: results
process generate_synthea {
tag { "${params.city}-${params.patients}" }
publishDir "${params.outdir}/synthea/${params.city}-${params.patients}", mode: 'copy'
output:
path "output/csv"
script:
"""
mkdir -p output
java -jar /home/jmcgir/scratch/synthea/synthea-with-dependencies.jar -c /home/jmcgir/scratch/synthea/make_csv/synthea.properties -p ${params.patients} -s ${params.seed} --output=./output ${params.city}
"""
}
process parse_and_standardize {
tag "parse"
publishDir "${params.outdir}/table", mode: 'copy'
input:
path synthea_dir
output:
path "standardized_table.parquet"
script:
"""
# pass the directory (synthea_dir) to the python ETL
python3 /home/jmcgir/scratch/nextflow/scripts/parse_synthea.py --input-dir ${synthea_dir} --out standardized_table.parquet
"""
}
workflow {
synthea_ch = generate_synthea()
parse_and_standardize(synthea_ch)
}Example run
nextflow run main.nf --patients 2000 --city Colorado Denver --seed 18 --outdir resultsExploring the data
R setup
Code
tidy_packages <- c("readr", "tidyr", "stringr", "ggplot2", "dplyr")
required_packages <- c(
"arrow",
"plotly",
"gt",
tidy_packages
)
invisible(lapply(required_packages, library, character.only = TRUE))
# Wong, B. Points of view: Color blindness. Nat Methods (2011).
bla <- '#000000'
blu <- '#0072b2'
grb <- '#56b4e9'
lir <- '#cc79a7'
gre <- '#009e73'
red <- '#d55e00'
org <- '#e69f00'
yel <- '#f0e442'
gry <- '#BBBBBB'
jam_cols <- c(blu, red, gre, org, grb, lir, gry, bla)
jam_shapes <- c(21, 22, 23, 24, 25)
options(ggplot2.discrete.colour = jam_cols)
options(ggplot2.discrete.fill = jam_cols)
main_theme <- theme(
text = element_text(size = 14),
axis.text = element_text(size = 12),
axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0)),
axis.title.x = element_text(margin = margin(t = 10, r = 0, b = 0, l = 0))
)
jam_theme <- theme_minimal() + main_theme
jam_theme_bw <- theme_bw() + main_theme
remove.boxplot.outliers <- function(fig) {
stopifnot("plotly" %in% class(fig))
fig$x$data <- lapply(
fig$x$data,
\(i) {
if (i$type != "box") {
return(i)
}
i$marker = list(opacity = 0)
i$hoverinfo = "none"
i
}
)
fig
}EDA plots
synth_table <- arrow::read_parquet(
"//vast.affymetrix.com/shares/home-dir/jmcgir/scratch/nextflow/results/table/standardized_table.parquet"
)
head(synth_table) |>
gt() |>
tab_header(title = "First 6 rows standardized_table.parquet")| First 6 rows standardized_table.parquet | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| id | age | gender | race | n_conditions | n_medications | n_encounters | n_er_admissions | has_diabetes | has_hypertension | has_asthma |
| 5bab1293-644c-3564-23b6-bd095720101a | 1 | F | white | 6 | 1 | 7 | NA | 0 | 0 | 0 |
| 35f037eb-435b-ebf4-7c1a-5c362ee0e670 | 2 | M | white | 6 | 0 | 10 | NA | 0 | 0 | 0 |
| 5f3f9386-383f-d368-4656-32abba73047c | 1 | M | native | 6 | 2 | 11 | NA | 0 | 0 | 0 |
| 712f82b7-745a-7589-cb2a-81d2aef0c8de | 2 | F | white | 5 | 1 | 11 | NA | 0 | 0 | 0 |
| cb1a2337-7b1e-72ff-b37e-d455e0ded5f7 | 0 | F | white | 2 | 0 | 3 | NA | 0 | 0 | 0 |
| c8161ec5-f98e-a4c3-6a87-973493b38fee | 0 | M | white | 4 | 3 | 8 | 1 | 0 | 0 | 0 |
n_patients <- nrow(synth_table)
p1 <- ggplot(synth_table, aes(age, fill = gender)) +
geom_histogram(bins = 30, color = bla) +
facet_wrap(~gender, nrow = 2) +
jam_theme_bw +
labs(
title = paste0("Age distribution of ", n_patients, " synthetic patients"),
x = "Age (years)",
y = "Number of patients"
)
p1d1 <- synth_table |>
select(
id,
n_medications,
has_diabetes,
has_hypertension,
has_asthma,
gender,
age
) |>
pivot_longer(
cols = !c(id, n_medications, gender, age),
names_to = "diagnosis",
values_to = "value"
) |>
filter(value == 1)
p1 <- ggplot() +
geom_boxplot(data = d1, aes(diagnosis, n_medications), outlier.shape = NA) +
geom_jitter(
data = d1,
aes(diagnosis, n_medications, label = id, label2 = age),
alpha = 0.7,
height = 0,
width = 0.1,
shape = 1
) +
jam_theme_bw +
labs(
title = "Medications prescribed for patients that have been daignosed in the past 10 years with:\nasthma, diabetes, and/or hypertension",
y = "Number of medications"
)
remove.boxplot.outliers(ggplotly(p1))p1 <- d1 |>
filter(diagnosis %in% c("has_diabetes", "has_hypertension")) |>
ggplot(aes(age, n_medications, color = diagnosis)) +
geom_point(shape = 1, size = 2) +
facet_wrap(~diagnosis, scales = "free") +
geom_smooth(method = "lm") +
jam_theme_bw +
labs(
title = "Age vs. number of medications prescribed in the past 10 years",
x = "Age (years)",
y = "Number of medications"
)
p1pca_features <- c(
"age",
"n_medications",
"n_encounters",
"n_er_admissions"
)
metadata_cols <- intersect(
c("id", "gender", "race", "has_diabetes", "age", "n_encounters"),
names(synth_table)
)
pca_table <- synth_table %>%
select(all_of(c("id", pca_features))) %>%
mutate(across(all_of(pca_features), as.numeric)) %>%
drop_na() %>%
tibble::column_to_rownames("id")
pca_fit <- prcomp(pca_table, scale. = TRUE, rank. = 20)
pc <- as.data.frame(pca_fit$x)
pc$id <- row.names(pc)
patient_metadata <- synth_table %>%
select(all_of(metadata_cols)) %>%
distinct()
pc <- inner_join(pc, patient_metadata, by = "id")
var_exp <- round(summary(pca_fit)$importance[2, ], 4) * 100
plot.pca <- function(column_to_color_by, pc_x, pc_y) {
pc_x_lab <- paste0("PC", pc_x)
pc_y_lab <- paste0("PC", pc_y)
p1 <- pc |>
ggplot(aes(
x = !!sym(pc_x_lab),
y = !!sym(pc_y_lab),
color = !!sym(column_to_color_by),
text = id
)) +
geom_point(size = 3) +
theme_minimal(base_size = 14) +
xlab(paste0("PC", pc_x, " (", var_exp[pc_x], "%)")) +
ylab(paste0("PC", pc_y, " (", var_exp[pc_y], "%)"))
return(p1)
}
ggplotly(plot.pca("age", 1, 2))ggplotly(plot.pca("n_encounters", 1, 2))loadings <- as.data.frame(pca_fit$rotation)
loadings$feature <- row.names(loadings)
loadings |>
select(feature, PC1, PC2) |>
arrange(desc(abs(PC1))) |>
gt() |>
tab_header(title = "PCA Loadings for PC1 and PC2")| PCA Loadings for PC1 and PC2 | ||
|---|---|---|
| feature | PC1 | PC2 |
| n_medications | 0.5971994 | 0.06567827 |
| n_encounters | 0.5735261 | 0.04265387 |
| n_er_admissions | 0.4175287 | 0.57654288 |
| age | 0.3742867 | -0.81330518 |
loading_vecs <- loadings |>
mutate(
PC1 = PC1 * max(abs(pc$PC1)),
PC2 = PC2 * max(abs(pc$PC2))
)
p1 <- ggplot(pc, aes(PC1, PC2)) +
geom_point(alpha = 0.4) +
geom_segment(
data = loading_vecs,
aes(x = 0, y = 0, xend = PC1, yend = PC2),
arrow = arrow(length = unit(0.2, "cm")),
color = "black"
) +
geom_text(
data = loading_vecs,
aes(x = PC1, y = PC2, label = feature),
vjust = 1.2
) +
jam_theme +
xlab(paste0("PC", 1, " (", var_exp[1], "%)")) +
ylab(paste0("PC", 2, " (", var_exp[2], "%)"))
p1Run time
Sys.time() - start_timeTime difference of 8.089688 secs