Nextflow for Synthea EHR Data

Author

Joe McGirr | Bioinformatics Scientist

Published

February 3, 2026

Summary

This notebook describes a mini-project I put together to practice the basics of using Nextflow and explore simulated EHR data.

Synthea is an open-source synthetic health record generator that produces realistic patient-level clinical data (demographics, encounters, conditions, medications, procedures). It was developed by the non-profit MITRE and is commonly used for pipeline protoyping for large-scale HL7 FHIR data.

In this workflow, I used Synthea as a parameterized data generator inside a reproducible pipeline. Nextflow orchestrates Synthea runs with user-defined inputs (patient count, random seed, location), producing raw CSVs with simulated patient information. A Python ETL step then standardizes and aggregates the Synthea outputs into a tidy patient-level feature table. The final output from the Nextflow pipeline is a single Parquet table that I load into R in this quarto doc for visualizations.

Synthea

Links:

A Java executable runs the simulation engine and generates synthetic electronic health record data. You can specify parameters like the number of patients to generate, the location (city/state), and a random seed for reproducibility.

Download

wget https://github.com/synthetichealth/synthea/releases/download/master-branch-latest/synthea-with-dependencies.jar

Example run

I modified the synthea.properties file to set the output format to CSV.

# Generate 10,000 synthetic patients from Denver, Colorado

java -jar /path/to/synthea-with-dependencies.jar \
-c /path/to/synthea/synthea.properties \
-p 10000 \
-s 18 \
Colorado Denver

Parsing Synthea CSVs

I created a python script to parse and standardize the Synthea CSV outputs into a patient-level feature table. The script performs the following:

Reads the relevant CSV files (patients, encounters, conditions, medications) with pandas.
Uses the janitor library to clean and standardize column names.
Calculates patient age from birthdate and filters patients based on optional age criteria.
Counts the number of conditions, medications, encounters, and ER admissions for each patient.
Generates binary flags for specific conditions (diabetes, hypertension, asthma) based on SNOMED codes.
Outputs the final standardized table as a parquet file for hand-off to R.

parse_synthea.py

import argparse
from pathlib import Path
import pandas as pd
import janitor


def load_csvs(input_dir):
    p = Path(input_dir)
    files = {}
    for name in ["patients", "encounters", "conditions", "medications"]:
        fp = p / f"{name}.csv"
        files[name] = pd.read_csv(fp).clean_names() if fp.exists() else pd.DataFrame()
    return files


def filter_last_n_years(df, date_col, cutoff_date):
    if df.empty or date_col not in df.columns:
        return df

    df = df.copy()

    df[date_col] = pd.to_datetime(
        df[date_col], errors="coerce", utc=True
    ).dt.tz_convert(None)

    return df.loc[df[date_col] >= cutoff_date]


def compute_features(files, age_min=None, age_max=None):
    patients = files["patients"]
    patients["birthdate"] = pd.to_datetime(patients["birthdate"])
    patients["age"] = (
        (pd.Timestamp.today() - patients["birthdate"]).dt.days // 365
    ).astype(int)

    if age_min is not None:
        patients = patients[patients["age"] >= age_min]
    if age_max is not None:
        patients = patients[patients["age"] <= age_max]

    # count conditions, meds, encounters in the last 10 years
    cutoff_date = (pd.Timestamp.utcnow() - pd.DateOffset(years=10)).tz_localize(None)
    cond = files["conditions"]
    enc = files["encounters"]
    meds = files["medications"]

    cond = filter_last_n_years(cond, "start", cutoff_date)
    enc = filter_last_n_years(enc, "start", cutoff_date)
    meds = filter_last_n_years(meds, "start", cutoff_date)

    cond_counts = cond.groupby("patient").size().rename("n_conditions")
    med_counts = meds.groupby("patient").size().rename("n_medications")
    enc_counts = enc.groupby("patient").size().rename("n_encounters")

    er_counts = (
        enc.loc[enc["description"].str.contains("Emergency room admission", na=False)]
        .groupby("patient")
        .size()
        .rename("n_er_admissions")
    )

    df = patients.set_index("id")[["age", "gender", "race"]].copy()
    df = (
        df.join(cond_counts, how="left")
        .join(med_counts, how="left")
        .join(enc_counts, how="left")
        .join(er_counts, how="left")
    )
    df["n_conditions"] = df["n_conditions"].fillna(0).astype(int)
    df["n_medications"] = df["n_medications"].fillna(0).astype(int)
    df["n_encounters"] = df["n_encounters"].fillna(0).astype(int)

    condition_code_dict = {
        "has_diabetes": {"44054006"},
        "has_hypertension": {"59621000"},
        "has_asthma": {"195967001"},
    }

    if not cond.empty and "code" in cond.columns:
        cond_codes = cond["code"].astype(str)

        for flag_name, codes in condition_code_dict.items():
            patients_with_condition = cond.loc[cond_codes.isin(codes), "patient"]
            df[flag_name] = df.index.isin(patients_with_condition).astype(int)
    else:
        for flag_name in condition_code_dict:
            df[flag_name] = 0

    # reset index (patient id)
    df = df.reset_index().rename(columns={"index": "patient"})
    return df


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-dir", required=True)
    parser.add_argument("--out", required=True)
    parser.add_argument("--age-min", type=int, default=None)
    parser.add_argument("--age-max", type=int, default=None)
    args = parser.parse_args()

    files = load_csvs(args.input_dir)
    table = compute_features(files, age_min=args.age_min, age_max=args.age_max)

    outp = Path(args.out)
    if outp.suffix == ".parquet":
        table.to_parquet(outp, index=False)
    else:
        table.to_csv(outp, index=False)


if __name__ == "__main__":
    main()

Example run

python3 parse_synthea.py \
--input-dir /path/to/synthea/output/csv \
--out standardized_table.parquet \
--age-min 18 \
--age-max 120

Nextflow

I created a simple pipeline to combine the Synthea data generation and parsing steps.

Download

Nextflow was just as easy to download and install as Synthea. Just need Java 17+. The documentation is a thing of beauty.

curl -s https://get.nextflow.io | bash

main.nf

nextflow.enable.dsl=2

params.patients = params.patients ?: 10000
params.city     = params.city ?: Denver
params.seed     = params.seed ?: 18
params.outdir   = params.outdir ?: results

process generate_synthea {
  tag { "${params.city}-${params.patients}" }
  publishDir "${params.outdir}/synthea/${params.city}-${params.patients}", mode: 'copy'

  output:
    path "output/csv"

  script:
  """
  mkdir -p output
  java -jar /home/jmcgir/scratch/synthea/synthea-with-dependencies.jar -c /home/jmcgir/scratch/synthea/make_csv/synthea.properties -p ${params.patients} -s ${params.seed} --output=./output ${params.city}
  """
}

process parse_and_standardize {
  tag "parse"
  publishDir "${params.outdir}/table", mode: 'copy'

  input:
    path synthea_dir

  output:
    path "standardized_table.parquet"

  script:
  """
  # pass the directory (synthea_dir) to the python ETL
  python3 /home/jmcgir/scratch/nextflow/scripts/parse_synthea.py --input-dir ${synthea_dir} --out standardized_table.parquet
  """
}

workflow {
  synthea_ch = generate_synthea()
  parse_and_standardize(synthea_ch)
}

Example run

nextflow run main.nf --patients 2000 --city Colorado Denver --seed 18 --outdir results

Exploring the data

R setup

Code

tidy_packages <- c("readr", "tidyr", "stringr", "ggplot2", "dplyr")

required_packages <- c(
  "arrow",
  "plotly",
  "gt",
  tidy_packages
)

invisible(lapply(required_packages, library, character.only = TRUE))

# Wong, B. Points of view: Color blindness. Nat Methods (2011).
bla <- '#000000'
blu <- '#0072b2'
grb <- '#56b4e9'
lir <- '#cc79a7'
gre <- '#009e73'
red <- '#d55e00'
org <- '#e69f00'
yel <- '#f0e442'
gry <- '#BBBBBB'

jam_cols <- c(blu, red, gre, org, grb, lir, gry, bla)
jam_shapes <- c(21, 22, 23, 24, 25)

options(ggplot2.discrete.colour = jam_cols)
options(ggplot2.discrete.fill = jam_cols)

main_theme <- theme(
  text = element_text(size = 14),
  axis.text = element_text(size = 12),
  axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0)),
  axis.title.x = element_text(margin = margin(t = 10, r = 0, b = 0, l = 0))
)
jam_theme <- theme_minimal() + main_theme
jam_theme_bw <- theme_bw() + main_theme

remove.boxplot.outliers <- function(fig) {
  stopifnot("plotly" %in% class(fig))
  fig$x$data <- lapply(
    fig$x$data,
    \(i) {
      if (i$type != "box") {
        return(i)
      }
      i$marker = list(opacity = 0)
      i$hoverinfo = "none"
      i
    }
  )
  fig
}

EDA plots

synth_table <- arrow::read_parquet(
  "//vast.affymetrix.com/shares/home-dir/jmcgir/scratch/nextflow/results/table/standardized_table.parquet"
)

head(synth_table) |>
  gt() |>
  tab_header(title = "First 6 rows standardized_table.parquet")

First 6 rows standardized_table.parquet
id	age	gender	race	n_conditions	n_medications	n_encounters	n_er_admissions	has_diabetes	has_hypertension	has_asthma
5bab1293-644c-3564-23b6-bd095720101a	1	F	white	6	1	7	NA	0	0	0
35f037eb-435b-ebf4-7c1a-5c362ee0e670	2	M	white	6	0	10	NA	0	0	0
5f3f9386-383f-d368-4656-32abba73047c	1	M	native	6	2	11	NA	0	0	0
712f82b7-745a-7589-cb2a-81d2aef0c8de	2	F	white	5	1	11	NA	0	0	0
cb1a2337-7b1e-72ff-b37e-d455e0ded5f7	0	F	white	2	0	3	NA	0	0	0
c8161ec5-f98e-a4c3-6a87-973493b38fee	0	M	white	4	3	8	1	0	0	0

n_patients <- nrow(synth_table)

p1 <- ggplot(synth_table, aes(age, fill = gender)) +
  geom_histogram(bins = 30, color = bla) +
  facet_wrap(~gender, nrow = 2) +
  jam_theme_bw +
  labs(
    title = paste0("Age distribution of ", n_patients, " synthetic patients"),
    x = "Age (years)",
    y = "Number of patients"
  )
p1

d1 <- synth_table |>
  select(
    id,
    n_medications,
    has_diabetes,
    has_hypertension,
    has_asthma,
    gender,
    age
  ) |>
  pivot_longer(
    cols = !c(id, n_medications, gender, age),
    names_to = "diagnosis",
    values_to = "value"
  ) |>
  filter(value == 1)

p1 <- ggplot() +
  geom_boxplot(data = d1, aes(diagnosis, n_medications), outlier.shape = NA) +
  geom_jitter(
    data = d1,
    aes(diagnosis, n_medications, label = id, label2 = age),
    alpha = 0.7,
    height = 0,
    width = 0.1,
    shape = 1
  ) +
  jam_theme_bw +
  labs(
    title = "Medications prescribed for patients that have been daignosed in the past 10 years with:\nasthma, diabetes, and/or hypertension",
    y = "Number of medications"
  )
remove.boxplot.outliers(ggplotly(p1))

p1 <- d1 |>
  filter(diagnosis %in% c("has_diabetes", "has_hypertension")) |>
  ggplot(aes(age, n_medications, color = diagnosis)) +
  geom_point(shape = 1, size = 2) +
  facet_wrap(~diagnosis, scales = "free") +
  geom_smooth(method = "lm") +
  jam_theme_bw +
  labs(
    title = "Age vs. number of medications prescribed in the past 10 years",
    x = "Age (years)",
    y = "Number of medications"
  )
p1

pca_features <- c(
  "age",
  "n_medications",
  "n_encounters",
  "n_er_admissions"
)

metadata_cols <- intersect(
  c("id", "gender", "race", "has_diabetes", "age", "n_encounters"),
  names(synth_table)
)

pca_table <- synth_table %>%
  select(all_of(c("id", pca_features))) %>%
  mutate(across(all_of(pca_features), as.numeric)) %>%
  drop_na() %>%
  tibble::column_to_rownames("id")

pca_fit <- prcomp(pca_table, scale. = TRUE, rank. = 20)

pc <- as.data.frame(pca_fit$x)
pc$id <- row.names(pc)


patient_metadata <- synth_table %>%
  select(all_of(metadata_cols)) %>%
  distinct()

pc <- inner_join(pc, patient_metadata, by = "id")

var_exp <- round(summary(pca_fit)$importance[2, ], 4) * 100

plot.pca <- function(column_to_color_by, pc_x, pc_y) {
  pc_x_lab <- paste0("PC", pc_x)
  pc_y_lab <- paste0("PC", pc_y)

  p1 <- pc |>
    ggplot(aes(
      x = !!sym(pc_x_lab),
      y = !!sym(pc_y_lab),
      color = !!sym(column_to_color_by),
      text = id
    )) +
    geom_point(size = 3) +
    theme_minimal(base_size = 14) +
    xlab(paste0("PC", pc_x, " (", var_exp[pc_x], "%)")) +
    ylab(paste0("PC", pc_y, " (", var_exp[pc_y], "%)"))

  return(p1)
}


ggplotly(plot.pca("age", 1, 2))

ggplotly(plot.pca("n_encounters", 1, 2))

loadings <- as.data.frame(pca_fit$rotation)
loadings$feature <- row.names(loadings)

loadings |>
  select(feature, PC1, PC2) |>
  arrange(desc(abs(PC1))) |>
  gt() |>
  tab_header(title = "PCA Loadings for PC1 and PC2")

PCA Loadings for PC1 and PC2
feature	PC1	PC2
n_medications	0.5971994	0.06567827
n_encounters	0.5735261	0.04265387
n_er_admissions	0.4175287	0.57654288
age	0.3742867	-0.81330518

loading_vecs <- loadings |>
  mutate(
    PC1 = PC1 * max(abs(pc$PC1)),
    PC2 = PC2 * max(abs(pc$PC2))
  )

p1 <- ggplot(pc, aes(PC1, PC2)) +
  geom_point(alpha = 0.4) +
  geom_segment(
    data = loading_vecs,
    aes(x = 0, y = 0, xend = PC1, yend = PC2),
    arrow = arrow(length = unit(0.2, "cm")),
    color = "black"
  ) +
  geom_text(
    data = loading_vecs,
    aes(x = PC1, y = PC2, label = feature),
    vjust = 1.2
  ) +
  jam_theme +
  xlab(paste0("PC", 1, " (", var_exp[1], "%)")) +
  ylab(paste0("PC", 2, " (", var_exp[2], "%)"))
p1

Run time

Sys.time() - start_time

Time difference of 8.089688 secs

--- title: "Nextflow for Synthea EHR Data" author: "Joe McGirr | Bioinformatics Scientist" date: "`r Sys.Date()`" editor: source fig-responsive: false format: html: theme: sandstone fontcolor: black toc: true toc-location: left code-fold: false code-tools: true df-print: kable fig-width: 13 fig-height: 8 embed-resources: true grid: body-width: 2000px execute: warning: false --- ```{r} #| include: false start_time <- Sys.time() ``` # Summary This notebook describes a mini-project I put together to practice the basics of using Nextflow and explore simulated EHR data. Synthea is an open-source synthetic health record generator that produces realistic patient-level clinical data (demographics, encounters, conditions, medications, procedures). It was developed by the non-profit [MITRE](https://www.mitre.org/who-we-are) and is commonly used for pipeline protoyping for large-scale HL7 FHIR data. In this workflow, I used Synthea as a parameterized data generator inside a reproducible pipeline. Nextflow orchestrates Synthea runs with user-defined inputs (patient count, random seed, location), producing raw CSVs with simulated patient information. A Python ETL step then standardizes and aggregates the Synthea outputs into a tidy patient-level feature table. The final output from the Nextflow pipeline is a single Parquet table that I load into R in this quarto doc for visualizations. # Synthea Links: - [Website](https://synthea.mitre.org/about) - [GitHub](https://github.com/synthetichealth/synthea/wiki/Basic-Setup-and-Running) A Java executable runs the simulation engine and generates synthetic electronic health record data. You can specify parameters like the number of patients to generate, the location (city/state), and a random seed for reproducibility. ## Download ```{bash} #| eval: false wget https://github.com/synthetichealth/synthea/releases/download/master-branch-latest/synthea-with-dependencies.jar ``` ## Example run I modified the `synthea.properties` file to set the output format to CSV. ```{bash} #| eval: false # Generate 10,000 synthetic patients from Denver, Colorado java -jar /path/to/synthea-with-dependencies.jar \ -c /path/to/synthea/synthea.properties \ -p 10000 \ -s 18 \ Colorado Denver ``` # Parsing Synthea CSVs I created a python script to parse and standardize the Synthea CSV outputs into a patient-level feature table. The script performs the following: 1. Reads the relevant CSV files (patients, encounters, conditions, medications) with pandas. 2. Uses the janitor library to clean and standardize column names. 3. Calculates patient age from birthdate and filters patients based on optional age criteria. 4. Counts the number of conditions, medications, encounters, and ER admissions for each patient. 5. Generates binary flags for specific conditions (diabetes, hypertension, asthma) based on SNOMED codes. 6. Outputs the final standardized table as a parquet file for hand-off to R. ## parse_synthea.py ```{python, eval = FALSE} import argparse from pathlib import Path import pandas as pd import janitor def load_csvs(input_dir): p = Path(input_dir) files = {} for name in ["patients", "encounters", "conditions", "medications"]: fp = p / f"{name}.csv" files[name] = pd.read_csv(fp).clean_names() if fp.exists() else pd.DataFrame() return files def filter_last_n_years(df, date_col, cutoff_date): if df.empty or date_col not in df.columns: return df df = df.copy() df[date_col] = pd.to_datetime( df[date_col], errors="coerce", utc=True ).dt.tz_convert(None) return df.loc[df[date_col] >= cutoff_date] def compute_features(files, age_min=None, age_max=None): patients = files["patients"] patients["birthdate"] = pd.to_datetime(patients["birthdate"]) patients["age"] = ( (pd.Timestamp.today() - patients["birthdate"]).dt.days // 365 ).astype(int) if age_min is not None: patients = patients[patients["age"] >= age_min] if age_max is not None: patients = patients[patients["age"] <= age_max] # count conditions, meds, encounters in the last 10 years cutoff_date = (pd.Timestamp.utcnow() - pd.DateOffset(years=10)).tz_localize(None) cond = files["conditions"] enc = files["encounters"] meds = files["medications"] cond = filter_last_n_years(cond, "start", cutoff_date) enc = filter_last_n_years(enc, "start", cutoff_date) meds = filter_last_n_years(meds, "start", cutoff_date) cond_counts = cond.groupby("patient").size().rename("n_conditions") med_counts = meds.groupby("patient").size().rename("n_medications") enc_counts = enc.groupby("patient").size().rename("n_encounters") er_counts = ( enc.loc[enc["description"].str.contains("Emergency room admission", na=False)] .groupby("patient") .size() .rename("n_er_admissions") ) df = patients.set_index("id")[["age", "gender", "race"]].copy() df = ( df.join(cond_counts, how="left") .join(med_counts, how="left") .join(enc_counts, how="left") .join(er_counts, how="left") ) df["n_conditions"] = df["n_conditions"].fillna(0).astype(int) df["n_medications"] = df["n_medications"].fillna(0).astype(int) df["n_encounters"] = df["n_encounters"].fillna(0).astype(int) condition_code_dict = { "has_diabetes": {"44054006"}, "has_hypertension": {"59621000"}, "has_asthma": {"195967001"}, } if not cond.empty and "code" in cond.columns: cond_codes = cond["code"].astype(str) for flag_name, codes in condition_code_dict.items(): patients_with_condition = cond.loc[cond_codes.isin(codes), "patient"] df[flag_name] = df.index.isin(patients_with_condition).astype(int) else: for flag_name in condition_code_dict: df[flag_name] = 0 # reset index (patient id) df = df.reset_index().rename(columns={"index": "patient"}) return df def main(): parser = argparse.ArgumentParser() parser.add_argument("--input-dir", required=True) parser.add_argument("--out", required=True) parser.add_argument("--age-min", type=int, default=None) parser.add_argument("--age-max", type=int, default=None) args = parser.parse_args() files = load_csvs(args.input_dir) table = compute_features(files, age_min=args.age_min, age_max=args.age_max) outp = Path(args.out) if outp.suffix == ".parquet": table.to_parquet(outp, index=False) else: table.to_csv(outp, index=False) if __name__ == "__main__": main() ``` ```{bash} #| eval: false #| include: false # import argparse # from pathlib import Path # import pandas as pd # test_dir = ( # # "//vast.affymetrix.com/shares/home-dir/jmcgir/scratch/synthea/make_csv/output/csv" # "//vast.affymetrix.com/shares/home-dir/jmcgir/scratch/nextflow/results/synthea/Portland-100/output/csv" # ) # files = load_csvs(test_dir) ``` ### Example run ```{bash} #| eval: false python3 parse_synthea.py \ --input-dir /path/to/synthea/output/csv \ --out standardized_table.parquet \ --age-min 18 \ --age-max 120 ``` # Nextflow I created a simple pipeline to combine the Synthea data generation and parsing steps. ## Download Nextflow was just as easy to download and install as Synthea. Just need Java 17+. The [documentation](https://www.nextflow.io/docs/latest/overview.html) is a thing of beauty. ```{bash} #| eval: false curl -s https://get.nextflow.io | bash ``` ## main.nf ```{groovy} #| eval: false nextflow.enable.dsl=2 params.patients = params.patients ?: 10000 params.city = params.city ?: Denver params.seed = params.seed ?: 18 params.outdir = params.outdir ?: results process generate_synthea { tag { "${params.city}-${params.patients}" } publishDir "${params.outdir}/synthea/${params.city}-${params.patients}", mode: 'copy' output: path "output/csv" script: """ mkdir -p output java -jar /home/jmcgir/scratch/synthea/synthea-with-dependencies.jar -c /home/jmcgir/scratch/synthea/make_csv/synthea.properties -p ${params.patients} -s ${params.seed} --output=./output ${params.city} """ } process parse_and_standardize { tag "parse" publishDir "${params.outdir}/table", mode: 'copy' input: path synthea_dir output: path "standardized_table.parquet" script: """ # pass the directory (synthea_dir) to the python ETL python3 /home/jmcgir/scratch/nextflow/scripts/parse_synthea.py --input-dir ${synthea_dir} --out standardized_table.parquet """ } workflow { synthea_ch = generate_synthea() parse_and_standardize(synthea_ch) } ``` ## Example run ```{bash} #| eval: false nextflow run main.nf --patients 2000 --city Colorado Denver --seed 18 --outdir results ``` ```{bash} #| eval: false #| include: false cd /home/jmcgir/scratch/nextflow micromamba activate jam_py3 /nfs/cyto2/usr/jmcgir/apps/nextflow/nextflow run /home/jmcgir/scratch/nextflow/scripts/main.nf --patients 2000 --city Colorado Denver --seed 18 --outdir results ``` # Exploring the data ## R setup ```{r} #| code-fold: true tidy_packages <- c("readr", "tidyr", "stringr", "ggplot2", "dplyr") required_packages <- c( "arrow", "plotly", "gt", tidy_packages ) invisible(lapply(required_packages, library, character.only = TRUE)) # Wong, B. Points of view: Color blindness. Nat Methods (2011). bla <- '#000000' blu <- '#0072b2' grb <- '#56b4e9' lir <- '#cc79a7' gre <- '#009e73' red <- '#d55e00' org <- '#e69f00' yel <- '#f0e442' gry <- '#BBBBBB' jam_cols <- c(blu, red, gre, org, grb, lir, gry, bla) jam_shapes <- c(21, 22, 23, 24, 25) options(ggplot2.discrete.colour = jam_cols) options(ggplot2.discrete.fill = jam_cols) main_theme <- theme( text = element_text(size = 14), axis.text = element_text(size = 12), axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0)), axis.title.x = element_text(margin = margin(t = 10, r = 0, b = 0, l = 0)) ) jam_theme <- theme_minimal() + main_theme jam_theme_bw <- theme_bw() + main_theme remove.boxplot.outliers <- function(fig) { stopifnot("plotly" %in% class(fig)) fig$x$data <- lapply( fig$x$data, \(i) { if (i$type != "box") { return(i) } i$marker = list(opacity = 0) i$hoverinfo = "none" i } ) fig } ``` ## EDA plots ```{r} synth_table <- arrow::read_parquet( "//vast.affymetrix.com/shares/home-dir/jmcgir/scratch/nextflow/results/table/standardized_table.parquet" ) head(synth_table) |> gt() |> tab_header(title = "First 6 rows standardized_table.parquet") n_patients <- nrow(synth_table) p1 <- ggplot(synth_table, aes(age, fill = gender)) + geom_histogram(bins = 30, color = bla) + facet_wrap(~gender, nrow = 2) + jam_theme_bw + labs( title = paste0("Age distribution of ", n_patients, " synthetic patients"), x = "Age (years)", y = "Number of patients" ) p1 d1 <- synth_table |> select( id, n_medications, has_diabetes, has_hypertension, has_asthma, gender, age ) |> pivot_longer( cols = !c(id, n_medications, gender, age), names_to = "diagnosis", values_to = "value" ) |> filter(value == 1) p1 <- ggplot() + geom_boxplot(data = d1, aes(diagnosis, n_medications), outlier.shape = NA) + geom_jitter( data = d1, aes(diagnosis, n_medications, label = id, label2 = age), alpha = 0.7, height = 0, width = 0.1, shape = 1 ) + jam_theme_bw + labs( title = "Medications prescribed for patients that have been daignosed in the past 10 years with:\nasthma, diabetes, and/or hypertension", y = "Number of medications" ) remove.boxplot.outliers(ggplotly(p1)) p1 <- d1 |> filter(diagnosis %in% c("has_diabetes", "has_hypertension")) |> ggplot(aes(age, n_medications, color = diagnosis)) + geom_point(shape = 1, size = 2) + facet_wrap(~diagnosis, scales = "free") + geom_smooth(method = "lm") + jam_theme_bw + labs( title = "Age vs. number of medications prescribed in the past 10 years", x = "Age (years)", y = "Number of medications" ) p1 ``` ```{r} pca_features <- c( "age", "n_medications", "n_encounters", "n_er_admissions" ) metadata_cols <- intersect( c("id", "gender", "race", "has_diabetes", "age", "n_encounters"), names(synth_table) ) pca_table <- synth_table %>% select(all_of(c("id", pca_features))) %>% mutate(across(all_of(pca_features), as.numeric)) %>% drop_na() %>% tibble::column_to_rownames("id") pca_fit <- prcomp(pca_table, scale. = TRUE, rank. = 20) pc <- as.data.frame(pca_fit$x) pc$id <- row.names(pc) patient_metadata <- synth_table %>% select(all_of(metadata_cols)) %>% distinct() pc <- inner_join(pc, patient_metadata, by = "id") var_exp <- round(summary(pca_fit)$importance[2, ], 4) * 100 plot.pca <- function(column_to_color_by, pc_x, pc_y) { pc_x_lab <- paste0("PC", pc_x) pc_y_lab <- paste0("PC", pc_y) p1 <- pc |> ggplot(aes( x = !!sym(pc_x_lab), y = !!sym(pc_y_lab), color = !!sym(column_to_color_by), text = id )) + geom_point(size = 3) + theme_minimal(base_size = 14) + xlab(paste0("PC", pc_x, " (", var_exp[pc_x], "%)")) + ylab(paste0("PC", pc_y, " (", var_exp[pc_y], "%)")) return(p1) } ggplotly(plot.pca("age", 1, 2)) ggplotly(plot.pca("n_encounters", 1, 2)) loadings <- as.data.frame(pca_fit$rotation) loadings$feature <- row.names(loadings) loadings |> select(feature, PC1, PC2) |> arrange(desc(abs(PC1))) |> gt() |> tab_header(title = "PCA Loadings for PC1 and PC2") loading_vecs <- loadings |> mutate( PC1 = PC1 * max(abs(pc$PC1)), PC2 = PC2 * max(abs(pc$PC2)) ) p1 <- ggplot(pc, aes(PC1, PC2)) + geom_point(alpha = 0.4) + geom_segment( data = loading_vecs, aes(x = 0, y = 0, xend = PC1, yend = PC2), arrow = arrow(length = unit(0.2, "cm")), color = "black" ) + geom_text( data = loading_vecs, aes(x = PC1, y = PC2, label = feature), vjust = 1.2 ) + jam_theme + xlab(paste0("PC", 1, " (", var_exp[1], "%)")) + ylab(paste0("PC", 2, " (", var_exp[2], "%)")) p1 ``` ## Run time ```{r} Sys.time() - start_time ```