CDC Data Analysis Exercise

Author

Anissa Waller Del Valle

Overview of the Data

I selected the Lyme disease public use line-listed data without georgraphy, 2022 - 2023 dataset, available on the Center for Disease Control and Prevention’s website. This contains publc health surveillance data that was reported to the CDC by U.S. states and territories through the National Notifiable Diseases Surveillance System (NNDSS). Data include both demographic characteristics (age, sex, race, ethnicity) and clinical characteristics (reported year, case status, omonth of illness onset, presecence of symptoms).

The dataset can be accessed through the following link: https://data.cdc.gov/d/9mtj-y2ba

Loading the Data

# Load libraries.
library("tidyverse")
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("dplyr")
library("ggplot2")
# I downloaded the dataset from the CDC's website and put it into the cdcdata-exercise folder. I want to double-check that my .CSV file is accessible.


list.files()
[1] "cdcdata-exercise.qmd"                                                              
[2] "cdcdata-exercise.rmarkdown"                                                        
[3] "Lyme_disease_public_use_line-listed_data_without_geography,_2022-2023_20260211.csv"
# Load the dataset.
lymedisease_rawdata <- read_csv("Lyme_disease_public_use_line-listed_data_without_geography,_2022-2023_20260211.csv")
Rows: 152018 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): Case_status, Sex, Age_cat_yrs, Race, Ethnicity, Onset_month, EM, A...
dbl  (1): Year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Determine the number of rows and columns. There should be 152K rows and 14 columns.  
dim(lymedisease_rawdata)
[1] 152018     14
# Explore the data.  
str(lymedisease_rawdata)
spc_tbl_ [152,018 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Year        : num [1:152018] 2022 2022 2022 2022 2022 ...
 $ Case_status : chr [1:152018] "Probable" "Probable" "Probable" "Probable" ...
 $ Sex         : chr [1:152018] "Male" "Female" "Female" "Female" ...
 $ Age_cat_yrs : chr [1:152018] "75-79" "75-79" "40-44" "30-34" ...
 $ Race        : chr [1:152018] "Unknown" "White" "Unknown" "Unknown" ...
 $ Ethnicity   : chr [1:152018] "Unknown" "Non-Hispanic" "Unknown" "Unknown" ...
 $ Onset_month : chr [1:152018] "Unknown" "Unknown" "Unknown" "Unknown" ...
 $ EM          : chr [1:152018] "Unknown" "Unknown" "Unknown" "Unknown" ...
 $ Arthritis   : chr [1:152018] "Unknown" "Unknown" "Unknown" "Unknown" ...
 $ Facial_palsy: chr [1:152018] "Unknown" "Unknown" "Unknown" "Unknown" ...
 $ Radiculo    : chr [1:152018] "Unknown" "Unknown" "Unknown" "Unknown" ...
 $ Lymph_men   : chr [1:152018] "Unknown" "Unknown" "Unknown" "Unknown" ...
 $ Enceph      : chr [1:152018] "Unknown" "Unknown" "Unknown" "Unknown" ...
 $ Av_block    : chr [1:152018] "Unknown" "Unknown" "Unknown" "Unknown" ...
 - attr(*, "spec")=
  .. cols(
  ..   Year = col_double(),
  ..   Case_status = col_character(),
  ..   Sex = col_character(),
  ..   Age_cat_yrs = col_character(),
  ..   Race = col_character(),
  ..   Ethnicity = col_character(),
  ..   Onset_month = col_character(),
  ..   EM = col_character(),
  ..   Arthritis = col_character(),
  ..   Facial_palsy = col_character(),
  ..   Radiculo = col_character(),
  ..   Lymph_men = col_character(),
  ..   Enceph = col_character(),
  ..   Av_block = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
head(lymedisease_rawdata)
# A tibble: 6 × 14
   Year Case_status Sex    Age_cat_yrs Race    Ethnicity    Onset_month EM     
  <dbl> <chr>       <chr>  <chr>       <chr>   <chr>        <chr>       <chr>  
1  2022 Probable    Male   75-79       Unknown Unknown      Unknown     Unknown
2  2022 Probable    Female 75-79       White   Non-Hispanic Unknown     Unknown
3  2022 Probable    Female 40-44       Unknown Unknown      Unknown     Unknown
4  2022 Probable    Female 30-34       Unknown Unknown      Unknown     Unknown
5  2022 Probable    Female 5-9         White   Non-Hispanic Unknown     Unknown
6  2022 Probable    Female 20-24       Unknown Non-Hispanic Unknown     Unknown
# ℹ 6 more variables: Arthritis <chr>, Facial_palsy <chr>, Radiculo <chr>,
#   Lymph_men <chr>, Enceph <chr>, Av_block <chr>
tail(lymedisease_rawdata)
# A tibble: 6 × 14
   Year Case_status Sex    Age_cat_yrs Race    Ethnicity    Onset_month EM     
  <dbl> <chr>       <chr>  <chr>       <chr>   <chr>        <chr>       <chr>  
1  2023 Confirmed   Male   5-9         White   Non-Hispanic 4           Unknown
2  2023 Probable    Male   35-39       Unknown Unknown      Unknown     Unknown
3  2023 Probable    Male   25-29       Unknown Unknown      Unknown     Unknown
4  2023 Probable    Female 55-59       Other   Unknown      Unknown     Unknown
5  2023 Probable    Male   30-34       Unknown Unknown      Unknown     Unknown
6  2023 Probable    Male   30-34       White   Unknown      Unknown     Unknown
# ℹ 6 more variables: Arthritis <chr>, Facial_palsy <chr>, Radiculo <chr>,
#   Lymph_men <chr>, Enceph <chr>, Av_block <chr>

Several variables contain a number of observations coded as Unknown, indicating incomplete reporting in sureillance data.

 # Explore the data to see how many observations are coded as Unknown per category. Add a line to convert the tibble to a dataframe to obtain a summary table of all counts and percentages for Unknown without R trunacting it.
 
 # Note that Year is numeric; there are no observations coded as Unknown.


 unknown_summary <- lymedisease_rawdata %>%
  summarise(
    case_status_unknown = sum(Case_status == "Unknown"),
    sex_unknown = sum(Sex == "Unknown"),
    age_unknown = sum(Age_cat_yrs == "Unknown"),
    race_unknown = sum(Race == "Unknown"),
    ethnicity_unknown = sum(Ethnicity == "Unknown"),
    onset_unknown = sum(Onset_month == "Unknown"),
    em_unknown = sum(EM == "Unknown"),
    arthritis_unknown = sum(Arthritis == "Unknown"),
    facial_palsy_unknown = sum(Facial_palsy == "Unknown"),
    radiculo_unknown = sum(Radiculo == "Unknown"),
    lymph_men_unknown = sum(Lymph_men == "Unknown"),
    enceph_unknown = sum(Enceph == "Unknown"),
    av_block_unknown = sum(Av_block == "Unknown"),
   
    # Report as percentage.
    case_status_unknown_pct = mean(Case_status == "Unknown") * 100,
    sex_unknown_pct = mean(Sex == "Unknown") * 100,
    age_unknown_pct = mean(Age_cat_yrs == "Unknown") * 100,
    race_unknown_pct = mean(Race == "Unknown") * 100,
    ethnicity_unknown_pct = mean(Ethnicity == "Unknown") * 100,
    onset_unknown_pct = mean(Onset_month == "Unknown") * 100,
    em_unknown_pct = mean(EM == "Unknown") * 100,
    arthritis_unknown_pct = mean(Arthritis == "Unknown") * 100,
    facial_palsy_unknown_pct = mean(Facial_palsy == "Unknown") * 100,
    radiculo_unknown_pct = mean(Radiculo == "Unknown") * 100,
    lymph_men_unknown_pct = mean(Lymph_men == "Unknown") * 100,
    enceph_unknown_pct = mean(Enceph == "Unknown") * 100,
    av_block_unknown_pct = mean(Av_block == "Unknown") * 100
  )
# Convert to data frame.
unknown_summary_df <- as.data.frame(unknown_summary)


# View table.
unknown_summary_df
  case_status_unknown sex_unknown age_unknown race_unknown ethnicity_unknown
1                   0        1550         581        65737             87014
  onset_unknown em_unknown arthritis_unknown facial_palsy_unknown
1        144722     148199            148639               148723
  radiculo_unknown lymph_men_unknown enceph_unknown av_block_unknown
1           148907            149276         149092           149241
  case_status_unknown_pct sex_unknown_pct age_unknown_pct race_unknown_pct
1                       0        1.019616       0.3821916         43.24291
  ethnicity_unknown_pct onset_unknown_pct em_unknown_pct arthritis_unknown_pct
1              57.23927          95.20057        97.4878              97.77724
  facial_palsy_unknown_pct radiculo_unknown_pct lymph_men_unknown_pct
1                 97.83249             97.95353              98.19627
  enceph_unknown_pct av_block_unknown_pct
1           98.07523             98.17324

I will focus on the following variables for this exercise: - Year. This is the reporting year. No observations coded as Unknown. - Case_Status. This is the surveillance case classification as determined by local and state public health authorities. No observations coded as Unknown. - Sex. This is the patient’s sex. 1.02% observations coded as Unknown. - Age. This is the patient’s age, in 5-year age categories. 0.38% observations coded as Unknown. - Race. This is the patient’s race. 43.2% observations coded as Unknown.

All other variables have over 50% of observations coded as Unknown.

Cleaning the Data

# Next, clean and process the dataset.
# Convert Unknown to NA.


lymedisease_cleandata <- lymedisease_rawdata %>%
  select(Year, Case_status, Sex, Age_cat_yrs, Race) %>% # Select variables.
  mutate(
    Case_status = ifelse(Case_status == "Unknown", NA, Case_status),
    Sex = ifelse(Sex == "Unknown", NA, Sex),
    Age_cat_yrs = ifelse(Age_cat_yrs == "Unknown", NA, Age_cat_yrs),
    Race = ifelse(Race == "Unknown", NA, Race)
  ) %>%


# Convert to factors for categorical analysis.
mutate(
    Case_status = as.factor(Case_status),
    Sex = as.factor(Sex),
    Age_cat_yrs = as.factor(Age_cat_yrs),
    Race = as.factor(Race)
  ) %>%
# Less than 2% of observations under the Sex and Age categories are coded as NA. I do not feel that imputing missing values is appropiate with this dataset, so I will instead remove the missing data for these categories. I will keep the missing data for the race category, as I do not feel that imputing missing data or removing the data entirely is appropiate here.


filter(!is.na(Case_status), !is.na(Sex), !is.na(Age_cat_yrs))


glimpse(lymedisease_cleandata)
Rows: 149,887
Columns: 5
$ Year        <dbl> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022…
$ Case_status <fct> Probable, Probable, Probable, Probable, Probable, Probable…
$ Sex         <fct> Male, Female, Female, Female, Female, Female, Male, Female…
$ Age_cat_yrs <fct> 75-79, 75-79, 40-44, 30-34, 5-9, 20-24, 70-74, 65-69, 60-6…
$ Race        <fct> NA, White, NA, NA, White, NA, White, White, White, NA, Whi…

Exploring the Data

# Now that the data is cleaned, create a summary table for the categorical variables.


summary_table <- lymedisease_cleandata %>%
  summarise(
    total = n(),
   
    # Case status counts.
    case_status_probable = sum(Case_status == "Probable"),
    case_status_confirmed = sum(Case_status == "Confirmed"),
   
    # Sex counts.
    sex_male = sum(Sex == "Male"),
    sex_female = sum(Sex == "Female"),
   
    # Age counts (all 5-year categories to 100+).
    age_0_4 = sum(Age_cat_yrs == "0-4"),
    age_5_9 = sum(Age_cat_yrs == "5-9"),
    age_10_14 = sum(Age_cat_yrs == "10-14"),
    age_15_19 = sum(Age_cat_yrs == "15-19"),
    age_20_24 = sum(Age_cat_yrs == "20-24"),
    age_25_29 = sum(Age_cat_yrs == "25-29"),
    age_30_34 = sum(Age_cat_yrs == "30-34"),
    age_35_39 = sum(Age_cat_yrs == "35-39"),
    age_40_44 = sum(Age_cat_yrs == "40-44"),
    age_45_49 = sum(Age_cat_yrs == "45-49"),
    age_50_54 = sum(Age_cat_yrs == "50-54"),
    age_55_59 = sum(Age_cat_yrs == "55-59"),
    age_60_64 = sum(Age_cat_yrs == "60-64"),
    age_65_69 = sum(Age_cat_yrs == "65-69"),
    age_70_74 = sum(Age_cat_yrs == "70-74"),
    age_75_79 = sum(Age_cat_yrs == "75-79"),
    age_80_84 = sum(Age_cat_yrs == "80-84"),
    age_85_89 = sum(Age_cat_yrs == "85-89"),
    age_90_94 = sum(Age_cat_yrs == "90-94"),
    age_95_99 = sum(Age_cat_yrs == "95-99"),
    age_100_plus = sum(Age_cat_yrs == "100+"),
   
    # Race counts, including NA to account for missing data.
    race_white = sum(Race == "White"),
    race_black = sum(Race == "Black or African American"),
    race_asian = sum(Race == "Asian"),
    race_other = sum(Race == "Other"),
    race_na = sum(is.na(Race))
  )


# Convert counts to percentages.
summary_table_pct <- summary_table %>%
  mutate(across(-total, ~ . / total * 100))


# Convert to a data frame.
as.data.frame(summary_table)
   total case_status_probable case_status_confirmed sex_male sex_female age_0_4
1 149887               144690                  5197    85421      63324    2761
  age_5_9 age_10_14 age_15_19 age_20_24 age_25_29 age_30_34 age_35_39 age_40_44
1    8175      7136      5761      5399      5501      6718      7443      8055
  age_45_49 age_50_54 age_55_59 age_60_64 age_65_69 age_70_74 age_75_79
1      8003     10019     12730     15002     15190     13176      9336
  age_80_84 age_85_89 age_90_94 age_95_99 age_100_plus race_white race_black
1      5420         0         0         0            0         NA         NA
  race_asian race_other race_na
1         NA         NA   64121
as.data.frame(summary_table_pct)
   total case_status_probable case_status_confirmed sex_male sex_female
1 149887             96.53272              3.467279 56.99027   42.24783
   age_0_4  age_5_9 age_10_14 age_15_19 age_20_24 age_25_29 age_30_34 age_35_39
1 1.842054 5.454109   4.76092  3.843562  3.602047  3.670098  4.482043  4.965741
  age_40_44 age_45_49 age_50_54 age_55_59 age_60_64 age_65_69 age_70_74
1  5.374048  5.339356  6.684369  8.493065  10.00887   10.1343  8.790622
  age_75_79 age_80_84 age_85_89 age_90_94 age_95_99 age_100_plus race_white
1  6.228692  3.616057         0         0         0            0         NA
  race_black race_asian race_other  race_na
1         NA         NA         NA 42.77956
# View table.
print(summary_table, n = 1, width = Inf)
# A tibble: 1 × 31
   total case_status_probable case_status_confirmed sex_male sex_female age_0_4
   <int>                <int>                 <int>    <int>      <int>   <int>
1 149887               144690                  5197    85421      63324    2761
  age_5_9 age_10_14 age_15_19 age_20_24 age_25_29 age_30_34 age_35_39 age_40_44
    <int>     <int>     <int>     <int>     <int>     <int>     <int>     <int>
1    8175      7136      5761      5399      5501      6718      7443      8055
  age_45_49 age_50_54 age_55_59 age_60_64 age_65_69 age_70_74 age_75_79
      <int>     <int>     <int>     <int>     <int>     <int>     <int>
1      8003     10019     12730     15002     15190     13176      9336
  age_80_84 age_85_89 age_90_94 age_95_99 age_100_plus race_white race_black
      <int>     <int>     <int>     <int>        <int>      <int>      <int>
1      5420         0         0         0            0         NA         NA
  race_asian race_other race_na
       <int>      <int>   <int>
1         NA         NA   64121
print (summary_table_pct, n = 1, width = Inf)
# A tibble: 1 × 31
   total case_status_probable case_status_confirmed sex_male sex_female age_0_4
   <int>                <dbl>                 <dbl>    <dbl>      <dbl>   <dbl>
1 149887                 96.5                  3.47     57.0       42.2    1.84
  age_5_9 age_10_14 age_15_19 age_20_24 age_25_29 age_30_34 age_35_39 age_40_44
    <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1    5.45      4.76      3.84      3.60      3.67      4.48      4.97      5.37
  age_45_49 age_50_54 age_55_59 age_60_64 age_65_69 age_70_74 age_75_79
      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1      5.34      6.68      8.49      10.0      10.1      8.79      6.23
  age_80_84 age_85_89 age_90_94 age_95_99 age_100_plus race_white race_black
      <dbl>     <dbl>     <dbl>     <dbl>        <dbl>      <dbl>      <dbl>
1      3.62         0         0         0            0         NA         NA
  race_asian race_other race_na
       <dbl>      <dbl>   <dbl>
1         NA         NA    42.8
# Explore the distribution of categorical variables.


ggplot(lymedisease_cleandata, aes(x = Case_status)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Distribution of Case Status", y = "Count", x = "Case Status")

# Sex distribution
ggplot(lymedisease_cleandata, aes(x = Sex)) +
  geom_bar(fill = "salmon") +
  labs(title = "Distribution of Sex", y = "Count", x = "Sex")

# Age distribution
ggplot(lymedisease_cleandata, aes(x = Age_cat_yrs)) +
  geom_bar(fill = "forestgreen") +
  labs(title = "Distribution of Age Categories", y = "Count", x = "Age Category") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Race distribution
ggplot(lymedisease_cleandata, aes(x = Race)) +
  geom_bar(fill = "purple") +
  labs(title = "Distribution of Race", y = "Count", x = "Race") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

  • The CDC classified Lyme disease cases as either confirmed or probable. When assessing for Lyme disease, health care providers will consider (1) signs and symptoms of the disease, (2), exposure to infected blacklegged ticks, and (3) results of laboratory tests. Confirmed cases must met all three criteria. In Fig. 1, we can see that most reported cases are coded as probable, rather than confirmed. Probable cases dominate as many patients never get a lab test – or the results of the lab test were not in prior to reporting it to the NNDSS.
  • In Fig. 2, we see that the number of males with confirmed or probable Lyme disease is higher than the number of females with confirmed or probable Lyme disease. This may be related to behavioral exposure as males, especially in rural or wooded areas, are more likely to engage in outdoor activities (e.g., hiking, hunting, landscaping).
  • In Fig. 3, we see that older individuals (50s - 70s) have higher counts of confirmed or probable Lyme disease in comparision to their younger counterparts. This may also be due to behavioral exposure. This may also be related to healthcare-seeking behavior, as older adults are more likely to seek care consistently.
  • In Fig. 4, we see that those who identify as White have higher counts of confirmed or probable Lyme disease in comparision to other races. This may be related to the fact that Lyme disease is most common in Northeastern and Upper Midwest states, where the population is predominately White.

Below I have split the data by year to see trends over 2022 - 2023.

# Count cases by Sex and Year.
sex_year_summary <- lymedisease_cleandata %>%
  group_by(Year, Sex) %>%
  summarise(count = n(), .groups = "drop")


# Plot.
ggplot(sex_year_summary, aes(x = Sex, y = count, fill = as.factor(Year))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Lyme Disease Cases by Sex and Year",
    x = "Sex",
    y = "Number of Cases",
    fill = "Year"
  ) +
  theme_minimal()

age_year_summary <- lymedisease_cleandata %>%
  group_by(Year, Age_cat_yrs) %>%
  summarise(count = n(), .groups = "drop")


# Plot
ggplot(age_year_summary, aes(x = Age_cat_yrs, y = count, fill = as.factor(Year))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Lyme Disease Cases by Age Group and Year",
    x = "Age Group (Years)",
    y = "Number of Cases",
    fill = "Year"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

race_year_summary <- lymedisease_cleandata %>%
  group_by(Year, Race) %>%
  summarise(count = n(), .groups = "drop")


# Plot
ggplot(race_year_summary, aes(x = Race, y = count, fill = as.factor(Year))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Lyme Disease Cases by Race and Year",
    x = "Race",
    y = "Number of Cases",
    fill = "Year"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

From these figures, we see that the number of confirmed or probable cases of Lyme disease increased from 2022 to 2023.

Synthetic Data Generation

This section was contirbuted by Talia C. Swanson

Creating Your Synthetic Dataset

——Chat GPT was used to understand why code would not preview correctly missing dataset creation ——

We will first begin with the steps to ensure that the vraibles, observations, and overview of our data looks good.

set.seed(123)

n <- nrow(lymedisease_cleandata)
  # This will ensure the synthetic data set contains the same number of observations as the original Lyme dataset#
  # This will geneate synthetic data for each variable of your choice 
simulate_cat <- function(data, varname, n) {
  probs <- prop.table(table(data[[varname]]))
  sample(
    x = names(probs), # Categories
    size = n,         # Synthetic data rows
    replace = TRUE,   
    prob=probs        # Observed probs
    )
}
  # Generate synthetic data for Year
  Year = sample(
    unique(lymedisease_cleandata$Year),
    size = n,
    replace = TRUE,
    prob = prop.table(table(lymedisease_cleandata$Year))
  )
  # Generate synthetic data for Case_status
  Case_status = sample(
    levels(lymedisease_cleandata$Case_status),
    size = n,
    replace = TRUE,
    prob = prop.table(table(lymedisease_cleandata$Case_status))
  )
  # Generate synthetic data for Sex
  Sex = sample(
    levels(lymedisease_cleandata$Sex),
    size = n,
    replace = TRUE,
    prob = prop.table(table(lymedisease_cleandata$Sex))
  )
  # Generate synthetic data for Age categories
  Age_cat_yrs = sample(
    levels(lymedisease_cleandata$Age_cat_yrs),
    size = n,
    replace = TRUE,
    prob = prop.table(table(lymedisease_cleandata$Age_cat_yrs))
  )
  # Generate synthetic data for Race
  Race = sample(
    levels(lymedisease_cleandata$Race),
    size = n,
    replace = TRUE,
    prob = prop.table(table(lymedisease_cleandata$Race))
  )
#Combine all synthetic variables into a dataset
lymedisease_synthetic <- data.frame(
  Year, Case_status, Sex, Age_cat_yrs, Race
)
# Convert categorical variables back to factors
lymedisease_synthetic <- lymedisease_synthetic %>%
  mutate(
    Year = as.factor(Year),
    Case_status = as.factor(Case_status),
    Sex = as.factor(Sex),
    Age_cat_yrs = as.factor(Age_cat_yrs),
    Race = as.factor(Race)
  )
# Look at first few rows
head(lymedisease_synthetic)
  Year Case_status    Sex Age_cat_yrs  Race
1 2023    Probable Female       60-64 White
2 2022    Probable   Male       65-69 White
3 2023    Probable   Male       50-54 White
4 2022    Probable   Male       60-64 White
5 2022    Probable Female       65-69 White
6 2023    Probable   Male       65-69 White
# Check structure of synthetic datatset
glimpse(lymedisease_synthetic)
Rows: 149,887
Columns: 5
$ Year        <fct> 2023, 2022, 2023, 2022, 2022, 2023, 2023, 2022, 2023, 2023…
$ Case_status <fct> Probable, Probable, Probable, Probable, Probable, Probable…
$ Sex         <fct> Female, Male, Male, Male, Female, Male, Male, Male, Female…
$ Age_cat_yrs <fct> 60-64, 65-69, 50-54, 60-64, 65-69, 65-69, 45-49, 70-74, 75…
$ Race        <fct> White, White, White, White, White, White, White, White, Wh…
# Quick summary of synthetic dataset
summary(lymedisease_synthetic)
   Year          Case_status             Sex         Age_cat_yrs   
 2022:60835   Confirmed:  5277   Female    :63582   65-69  :15414  
 2023:89052   Probable :144610   Male      :85178   60-64  :15126  
                                 Suppressed: 1127   70-74  :13101  
                                                    55-59  :12733  
                                                    50-54  : 9985  
                                                    75-79  : 9363  
                                                    (Other):74165  
                            Race       
 Asian/Pacific Islander       :  1656  
 Black                        :  2233  
 Native American/Alaska Native:   139  
 Other                        :  4026  
 Suppressed                   :  4907  
 White                        :136926  
                                       

Exploring Your Synthetic Data (singular variables)

Now we can explore and analyze the synthetic data in the same way as the original set. We will look at the distributions for each catergory by looking at their frequency. All graphs and visulizations are generated in hot pink to show distinction of my contribution.

summary_table_synth <- lymedisease_synthetic %>%  #synthetic data marker
  summarise(
    total = n(),
    case_status_probable = sum(Case_status == "Probable"),
    case_status_confirmed = sum(Case_status == "Confirmed"),
    sex_male = sum(Sex == "Male"),
    sex_female = sum(Sex == "Female"),
    age_0_4 = sum(Age_cat_yrs == "0-4"),
    # ... other age categories
    race_white = sum(Race == "White"),
    race_black = sum(Race == "Black or African American")
  )
summary_table_pct <- summary_table %>% # percents for easier analysis later #
  mutate(across(-total, ~ . / total * 100))

summary_table_pct
# A tibble: 1 × 31
   total case_status_probable case_status_confirmed sex_male sex_female age_0_4
   <int>                <dbl>                 <dbl>    <dbl>      <dbl>   <dbl>
1 149887                 96.5                  3.47     57.0       42.2    1.84
# ℹ 25 more variables: age_5_9 <dbl>, age_10_14 <dbl>, age_15_19 <dbl>,
#   age_20_24 <dbl>, age_25_29 <dbl>, age_30_34 <dbl>, age_35_39 <dbl>,
#   age_40_44 <dbl>, age_45_49 <dbl>, age_50_54 <dbl>, age_55_59 <dbl>,
#   age_60_64 <dbl>, age_65_69 <dbl>, age_70_74 <dbl>, age_75_79 <dbl>,
#   age_80_84 <dbl>, age_85_89 <dbl>, age_90_94 <dbl>, age_95_99 <dbl>,
#   age_100_plus <dbl>, race_white <dbl>, race_black <dbl>, race_asian <dbl>,
#   race_other <dbl>, race_na <dbl>
# Bar plot for case status
ggplot(lymedisease_synthetic, aes(x = Case_status)) +
  geom_bar(fill = "hotpink") +
  labs(title = "Synthetic Case Status Distribution", x = "Case Status", y = "Count")

# Bar plot for sex
ggplot(lymedisease_synthetic, aes(x = Sex)) +
  geom_bar(fill = "hotpink") +
  labs(title = "Synthetic Sex Distribution", x = "Sex", y = "Count")

# Bar plot for age
ggplot(lymedisease_synthetic, aes(x = Age_cat_yrs)) +
  geom_bar(fill = "hotpink") +
  labs(title = "Synthetic Age Distribution", x = "Age Category", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Bar plot for race
ggplot(lymedisease_synthetic, aes(x = Race)) +
  geom_bar(fill = "hotpink") +
  labs(title = "Synthetic Race Distribution", x = "Race", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Exploring Your Synthetic Data (multiple variables)

—— Chat GPT was used to help generate the color change for the synthetic data——

Now we will combine multiple variables for analysis, looking at multiple variables that describe our Lyme dataset.

# Count synthetic cases by Sex and Year 
sex_year_summary <- lymedisease_synthetic %>% # synthetic data marker
  group_by(Year, Sex) %>%
  summarise(count = n(), .groups = "drop")

# Plot
ggplot(sex_year_summary, aes(x = Sex, y = count, fill = as.factor(Year))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Synthetic Lyme Disease Cases by Sex and Year",
    x = "Sex",
    y = "Number of Cases",
    fill = "Year"
  ) +
  scale_fill_manual(values=c("lightpink", "hotpink")) + # synthetic colors
  theme_minimal()

# Count synthetic cases by Age category and Year
age_year_summary <- lymedisease_synthetic %>% # Synthetic data marker
  group_by(Year, Age_cat_yrs) %>%
  summarise(count = n(), .groups = "drop")

# Plot
ggplot(age_year_summary, aes(x = Age_cat_yrs, y = count, fill = as.factor(Year))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Synthetic Lyme Disease Cases by Age Group and Year",
    x = "Age Group (Years)",
    y = "Number of Cases",
    fill = "Year"
  ) +
  scale_fill_manual(values=c("pink", "hotpink")) + # synthetic colors
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The synthetic dataset was generated to match the structure and distributions of the original cleaned Lyme disease data. Categorical variables (Year, Case_status, Sex, Age_cat_yrs, Race) were sampled. The exploratory analysis (summary tables and bar plots) shows that the synthetic data roughly mirrors the trends in the original data , maintain simulated values.