Nextflow for Synthea EHR Data

Author

Joe McGirr | Bioinformatics Scientist

Published

February 3, 2026

Summary

This notebook describes a mini-project I put together to practice the basics of using Nextflow and explore simulated EHR data.

Synthea is an open-source synthetic health record generator that produces realistic patient-level clinical data (demographics, encounters, conditions, medications, procedures). It was developed by the non-profit MITRE and is commonly used for pipeline protoyping for large-scale HL7 FHIR data.

In this workflow, I used Synthea as a parameterized data generator inside a reproducible pipeline. Nextflow orchestrates Synthea runs with user-defined inputs (patient count, random seed, location), producing raw CSVs with simulated patient information. A Python ETL step then standardizes and aggregates the Synthea outputs into a tidy patient-level feature table. The final output from the Nextflow pipeline is a single Parquet table that I load into R in this quarto doc for visualizations.

Synthea

Links:

A Java executable runs the simulation engine and generates synthetic electronic health record data. You can specify parameters like the number of patients to generate, the location (city/state), and a random seed for reproducibility.

Download

wget https://github.com/synthetichealth/synthea/releases/download/master-branch-latest/synthea-with-dependencies.jar

Example run

I modified the synthea.properties file to set the output format to CSV.

# Generate 10,000 synthetic patients from Denver, Colorado

java -jar /path/to/synthea-with-dependencies.jar \
-c /path/to/synthea/synthea.properties \
-p 10000 \
-s 18 \
Colorado Denver

Parsing Synthea CSVs

I created a python script to parse and standardize the Synthea CSV outputs into a patient-level feature table. The script performs the following:

  1. Reads the relevant CSV files (patients, encounters, conditions, medications) with pandas.
  2. Uses the janitor library to clean and standardize column names.
  3. Calculates patient age from birthdate and filters patients based on optional age criteria.
  4. Counts the number of conditions, medications, encounters, and ER admissions for each patient.
  5. Generates binary flags for specific conditions (diabetes, hypertension, asthma) based on SNOMED codes.
  6. Outputs the final standardized table as a parquet file for hand-off to R.

parse_synthea.py

import argparse
from pathlib import Path
import pandas as pd
import janitor


def load_csvs(input_dir):
    p = Path(input_dir)
    files = {}
    for name in ["patients", "encounters", "conditions", "medications"]:
        fp = p / f"{name}.csv"
        files[name] = pd.read_csv(fp).clean_names() if fp.exists() else pd.DataFrame()
    return files


def filter_last_n_years(df, date_col, cutoff_date):
    if df.empty or date_col not in df.columns:
        return df

    df = df.copy()

    df[date_col] = pd.to_datetime(
        df[date_col], errors="coerce", utc=True
    ).dt.tz_convert(None)

    return df.loc[df[date_col] >= cutoff_date]


def compute_features(files, age_min=None, age_max=None):
    patients = files["patients"]
    patients["birthdate"] = pd.to_datetime(patients["birthdate"])
    patients["age"] = (
        (pd.Timestamp.today() - patients["birthdate"]).dt.days // 365
    ).astype(int)

    if age_min is not None:
        patients = patients[patients["age"] >= age_min]
    if age_max is not None:
        patients = patients[patients["age"] <= age_max]

    # count conditions, meds, encounters in the last 10 years
    cutoff_date = (pd.Timestamp.utcnow() - pd.DateOffset(years=10)).tz_localize(None)
    cond = files["conditions"]
    enc = files["encounters"]
    meds = files["medications"]

    cond = filter_last_n_years(cond, "start", cutoff_date)
    enc = filter_last_n_years(enc, "start", cutoff_date)
    meds = filter_last_n_years(meds, "start", cutoff_date)

    cond_counts = cond.groupby("patient").size().rename("n_conditions")
    med_counts = meds.groupby("patient").size().rename("n_medications")
    enc_counts = enc.groupby("patient").size().rename("n_encounters")

    er_counts = (
        enc.loc[enc["description"].str.contains("Emergency room admission", na=False)]
        .groupby("patient")
        .size()
        .rename("n_er_admissions")
    )

    df = patients.set_index("id")[["age", "gender", "race"]].copy()
    df = (
        df.join(cond_counts, how="left")
        .join(med_counts, how="left")
        .join(enc_counts, how="left")
        .join(er_counts, how="left")
    )
    df["n_conditions"] = df["n_conditions"].fillna(0).astype(int)
    df["n_medications"] = df["n_medications"].fillna(0).astype(int)
    df["n_encounters"] = df["n_encounters"].fillna(0).astype(int)

    condition_code_dict = {
        "has_diabetes": {"44054006"},
        "has_hypertension": {"59621000"},
        "has_asthma": {"195967001"},
    }

    if not cond.empty and "code" in cond.columns:
        cond_codes = cond["code"].astype(str)

        for flag_name, codes in condition_code_dict.items():
            patients_with_condition = cond.loc[cond_codes.isin(codes), "patient"]
            df[flag_name] = df.index.isin(patients_with_condition).astype(int)
    else:
        for flag_name in condition_code_dict:
            df[flag_name] = 0

    # reset index (patient id)
    df = df.reset_index().rename(columns={"index": "patient"})
    return df


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-dir", required=True)
    parser.add_argument("--out", required=True)
    parser.add_argument("--age-min", type=int, default=None)
    parser.add_argument("--age-max", type=int, default=None)
    args = parser.parse_args()

    files = load_csvs(args.input_dir)
    table = compute_features(files, age_min=args.age_min, age_max=args.age_max)

    outp = Path(args.out)
    if outp.suffix == ".parquet":
        table.to_parquet(outp, index=False)
    else:
        table.to_csv(outp, index=False)


if __name__ == "__main__":
    main()

Example run

python3 parse_synthea.py \
--input-dir /path/to/synthea/output/csv \
--out standardized_table.parquet \
--age-min 18 \
--age-max 120

Nextflow

I created a simple pipeline to combine the Synthea data generation and parsing steps.

Download

Nextflow was just as easy to download and install as Synthea. Just need Java 17+. The documentation is a thing of beauty.

curl -s https://get.nextflow.io | bash

main.nf

nextflow.enable.dsl=2

params.patients = params.patients ?: 10000
params.city     = params.city ?: Denver
params.seed     = params.seed ?: 18
params.outdir   = params.outdir ?: results

process generate_synthea {
  tag { "${params.city}-${params.patients}" }
  publishDir "${params.outdir}/synthea/${params.city}-${params.patients}", mode: 'copy'

  output:
    path "output/csv"

  script:
  """
  mkdir -p output
  java -jar /home/jmcgir/scratch/synthea/synthea-with-dependencies.jar -c /home/jmcgir/scratch/synthea/make_csv/synthea.properties -p ${params.patients} -s ${params.seed} --output=./output ${params.city}
  """
}

process parse_and_standardize {
  tag "parse"
  publishDir "${params.outdir}/table", mode: 'copy'

  input:
    path synthea_dir

  output:
    path "standardized_table.parquet"

  script:
  """
  # pass the directory (synthea_dir) to the python ETL
  python3 /home/jmcgir/scratch/nextflow/scripts/parse_synthea.py --input-dir ${synthea_dir} --out standardized_table.parquet
  """
}

workflow {
  synthea_ch = generate_synthea()
  parse_and_standardize(synthea_ch)
}

Example run

nextflow run main.nf --patients 2000 --city Colorado Denver --seed 18 --outdir results

Exploring the data

R setup

Code
tidy_packages <- c("readr", "tidyr", "stringr", "ggplot2", "dplyr")

required_packages <- c(
  "arrow",
  "plotly",
  "gt",
  tidy_packages
)

invisible(lapply(required_packages, library, character.only = TRUE))

# Wong, B. Points of view: Color blindness. Nat Methods (2011).
bla <- '#000000'
blu <- '#0072b2'
grb <- '#56b4e9'
lir <- '#cc79a7'
gre <- '#009e73'
red <- '#d55e00'
org <- '#e69f00'
yel <- '#f0e442'
gry <- '#BBBBBB'

jam_cols <- c(blu, red, gre, org, grb, lir, gry, bla)
jam_shapes <- c(21, 22, 23, 24, 25)

options(ggplot2.discrete.colour = jam_cols)
options(ggplot2.discrete.fill = jam_cols)

main_theme <- theme(
  text = element_text(size = 14),
  axis.text = element_text(size = 12),
  axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0)),
  axis.title.x = element_text(margin = margin(t = 10, r = 0, b = 0, l = 0))
)
jam_theme <- theme_minimal() + main_theme
jam_theme_bw <- theme_bw() + main_theme

remove.boxplot.outliers <- function(fig) {
  stopifnot("plotly" %in% class(fig))
  fig$x$data <- lapply(
    fig$x$data,
    \(i) {
      if (i$type != "box") {
        return(i)
      }
      i$marker = list(opacity = 0)
      i$hoverinfo = "none"
      i
    }
  )
  fig
}

EDA plots

synth_table <- arrow::read_parquet(
  "//vast.affymetrix.com/shares/home-dir/jmcgir/scratch/nextflow/results/table/standardized_table.parquet"
)

head(synth_table) |>
  gt() |>
  tab_header(title = "First 6 rows standardized_table.parquet")
First 6 rows standardized_table.parquet
id age gender race n_conditions n_medications n_encounters n_er_admissions has_diabetes has_hypertension has_asthma
5bab1293-644c-3564-23b6-bd095720101a 1 F white 6 1 7 NA 0 0 0
35f037eb-435b-ebf4-7c1a-5c362ee0e670 2 M white 6 0 10 NA 0 0 0
5f3f9386-383f-d368-4656-32abba73047c 1 M native 6 2 11 NA 0 0 0
712f82b7-745a-7589-cb2a-81d2aef0c8de 2 F white 5 1 11 NA 0 0 0
cb1a2337-7b1e-72ff-b37e-d455e0ded5f7 0 F white 2 0 3 NA 0 0 0
c8161ec5-f98e-a4c3-6a87-973493b38fee 0 M white 4 3 8 1 0 0 0
n_patients <- nrow(synth_table)

p1 <- ggplot(synth_table, aes(age, fill = gender)) +
  geom_histogram(bins = 30, color = bla) +
  facet_wrap(~gender, nrow = 2) +
  jam_theme_bw +
  labs(
    title = paste0("Age distribution of ", n_patients, " synthetic patients"),
    x = "Age (years)",
    y = "Number of patients"
  )
p1

d1 <- synth_table |>
  select(
    id,
    n_medications,
    has_diabetes,
    has_hypertension,
    has_asthma,
    gender,
    age
  ) |>
  pivot_longer(
    cols = !c(id, n_medications, gender, age),
    names_to = "diagnosis",
    values_to = "value"
  ) |>
  filter(value == 1)

p1 <- ggplot() +
  geom_boxplot(data = d1, aes(diagnosis, n_medications), outlier.shape = NA) +
  geom_jitter(
    data = d1,
    aes(diagnosis, n_medications, label = id, label2 = age),
    alpha = 0.7,
    height = 0,
    width = 0.1,
    shape = 1
  ) +
  jam_theme_bw +
  labs(
    title = "Medications prescribed for patients that have been daignosed in the past 10 years with:\nasthma, diabetes, and/or hypertension",
    y = "Number of medications"
  )
remove.boxplot.outliers(ggplotly(p1))
p1 <- d1 |>
  filter(diagnosis %in% c("has_diabetes", "has_hypertension")) |>
  ggplot(aes(age, n_medications, color = diagnosis)) +
  geom_point(shape = 1, size = 2) +
  facet_wrap(~diagnosis, scales = "free") +
  geom_smooth(method = "lm") +
  jam_theme_bw +
  labs(
    title = "Age vs. number of medications prescribed in the past 10 years",
    x = "Age (years)",
    y = "Number of medications"
  )
p1

pca_features <- c(
  "age",
  "n_medications",
  "n_encounters",
  "n_er_admissions"
)

metadata_cols <- intersect(
  c("id", "gender", "race", "has_diabetes", "age", "n_encounters"),
  names(synth_table)
)

pca_table <- synth_table %>%
  select(all_of(c("id", pca_features))) %>%
  mutate(across(all_of(pca_features), as.numeric)) %>%
  drop_na() %>%
  tibble::column_to_rownames("id")

pca_fit <- prcomp(pca_table, scale. = TRUE, rank. = 20)

pc <- as.data.frame(pca_fit$x)
pc$id <- row.names(pc)


patient_metadata <- synth_table %>%
  select(all_of(metadata_cols)) %>%
  distinct()

pc <- inner_join(pc, patient_metadata, by = "id")

var_exp <- round(summary(pca_fit)$importance[2, ], 4) * 100

plot.pca <- function(column_to_color_by, pc_x, pc_y) {
  pc_x_lab <- paste0("PC", pc_x)
  pc_y_lab <- paste0("PC", pc_y)

  p1 <- pc |>
    ggplot(aes(
      x = !!sym(pc_x_lab),
      y = !!sym(pc_y_lab),
      color = !!sym(column_to_color_by),
      text = id
    )) +
    geom_point(size = 3) +
    theme_minimal(base_size = 14) +
    xlab(paste0("PC", pc_x, " (", var_exp[pc_x], "%)")) +
    ylab(paste0("PC", pc_y, " (", var_exp[pc_y], "%)"))

  return(p1)
}


ggplotly(plot.pca("age", 1, 2))
ggplotly(plot.pca("n_encounters", 1, 2))
loadings <- as.data.frame(pca_fit$rotation)
loadings$feature <- row.names(loadings)

loadings |>
  select(feature, PC1, PC2) |>
  arrange(desc(abs(PC1))) |>
  gt() |>
  tab_header(title = "PCA Loadings for PC1 and PC2")
PCA Loadings for PC1 and PC2
feature PC1 PC2
n_medications 0.5971994 0.06567827
n_encounters 0.5735261 0.04265387
n_er_admissions 0.4175287 0.57654288
age 0.3742867 -0.81330518
loading_vecs <- loadings |>
  mutate(
    PC1 = PC1 * max(abs(pc$PC1)),
    PC2 = PC2 * max(abs(pc$PC2))
  )

p1 <- ggplot(pc, aes(PC1, PC2)) +
  geom_point(alpha = 0.4) +
  geom_segment(
    data = loading_vecs,
    aes(x = 0, y = 0, xend = PC1, yend = PC2),
    arrow = arrow(length = unit(0.2, "cm")),
    color = "black"
  ) +
  geom_text(
    data = loading_vecs,
    aes(x = PC1, y = PC2, label = feature),
    vjust = 1.2
  ) +
  jam_theme +
  xlab(paste0("PC", 1, " (", var_exp[1], "%)")) +
  ylab(paste0("PC", 2, " (", var_exp[2], "%)"))
p1

Run time

Sys.time() - start_time
Time difference of 8.089688 secs