Report for a Data Analysis Project

Authors

Anissa Waller Del Valle \(^{1, *, \land}\)
Emil Joson \(^{2, *}\)

Author affiliations

\(^{1}\) Department of Cellular Biology, University of Georgia, Athens, GA, USA.
\(^{2}\) Department of Plant Pathology, University of Georgia, Athens, GA, USA.

\(*\) These authors contributed equally to this work.

\(\land\) Corresponding author: aw59557@uga.edu

1 Methods

1.1 Data acquisition

The dataset used in this study was provided with the data-analysis project template and can be found in data/raw-data. The dataset was updated to include two new variables, as described below.

1.2 Description of data and data source

Two new variables were added to the dataset, and the updated dataset was saved as exampledata2.xlsx. The first variable is age, the age of each individual in years, reported as a numeric value >0 or as not applicable (NA). The second variable is marital status, reported as S (Single), M (Married), D (Divorced), W (Widowed), or NA. Both variables are documented in the Codebook sheet of the updated dataset. The other variables are (1) height, reported in centimeters as a numeric value >0, (2) weight, reported in kilograms as a numeric value >0, and (3) gender, reported as male, female, or other.
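The codebook rules above can be expressed as a small validation routine. The following is a minimal sketch in Python for illustration only; the project's own scripts are written in R and are not reproduced here. The field names (`height`, `weight`, `age`, `marital_status`) and the use of `None` to stand in for NA are assumptions, and gender is left out because its stored coding is not confirmed by the text.

```python
# Hypothetical validator for the codebook rules described in the text.
# Field names and the None-for-NA convention are assumptions.
MARITAL_CODES = {"S", "M", "D", "W"}


def is_positive_number(value):
    """True if value is a real number strictly greater than zero."""
    return (
        isinstance(value, (int, float))
        and not isinstance(value, bool)
        and value > 0
    )


def validate_record(rec):
    """Return a list of codebook violations for one record (empty if valid)."""
    problems = []
    # Height and weight must be numeric values > 0.
    for field in ("height", "weight"):
        if not is_positive_number(rec.get(field)):
            problems.append(f"{field} must be a numeric value > 0")
    # Age may be NA (None here); otherwise it must be numeric and > 0.
    age = rec.get("age")
    if age is not None and not is_positive_number(age):
        problems.append("age must be a numeric value > 0 or NA")
    # Marital status may be NA; otherwise it must be one of the four codes.
    marital = rec.get("marital_status")
    if marital is not None and marital not in MARITAL_CODES:
        problems.append("marital_status must be S, M, D, W, or NA")
    return problems


# Made-up records, used only to exercise the checks.
good = {"height": 170, "weight": 65, "age": None, "marital_status": "M"}
bad = {"height": 170, "weight": 65, "age": -3, "marital_status": "X"}
print(validate_record(good))
print(validate_record(bad))
```

Returning a list of messages rather than raising on the first problem makes it possible to report every violation in a record at once.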

1.3 Data import and cleaning

Data import and cleaning were performed using scripts located in code/processing-code. Cleaning included identifying and removing obvious outliers (e.g., a weight of 7,000 kg), among other related steps. The cleaned dataset was saved as processeddata2.rds in the results directory for use in downstream analysis.
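One simple way to flag obvious outliers such as the 7,000 kg weight is to compare each value against plausibility bounds. The sketch below is in Python for illustration and is not the project's R processing code. The bounds and field names are hypothetical, since the thresholds actually used are not stated.

```python
# Hypothetical plausibility bounds (centimeters and kilograms).
# The thresholds used by the project's cleaning scripts are not stated.
BOUNDS = {"height": (50, 250), "weight": (2, 500)}


def drop_outliers(records, bounds=BOUNDS):
    """Split records into (kept, dropped) using per-field bounds.

    Missing values (None) are kept, so that handling of missing data
    stays a separate decision from outlier removal.
    """
    kept, dropped = [], []
    for rec in records:
        plausible = all(
            rec.get(field) is None or low <= rec[field] <= high
            for field, (low, high) in bounds.items()
        )
        (kept if plausible else dropped).append(rec)
    return kept, dropped


# Made-up rows for illustration; the second mimics the 7,000 kg example.
rows = [
    {"height": 180, "weight": 80},
    {"height": 175, "weight": 7000},
    {"height": 165, "weight": None},
]
kept, dropped = drop_outliers(rows)
print(len(kept), "kept;", len(dropped), "dropped")
```

Returning the dropped rows alongside the kept ones makes it easy to log what was removed and why.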

1.4 Statistical analysis

Exploratory data analysis was performed using the script code/eda-code/eda-copy.qmd. A boxplot was generated to examine the distribution of height by marital status, and a scatterplot was generated to examine the relationship between weight and age. These figures were saved to the results/figures directory.

Linear models were used to further explore the data. In addition to the previously specified models, a third linear model was fit with height as the outcome and age and marital status as predictors. The results of this model were saved as resulttable3.rds in the results/able directory.
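A categorical predictor such as marital status enters a linear model as a set of 0/1 indicator columns, one per level except a reference level. The sketch below illustrates that encoding in Python; it is not the project's R code. It assumes the reference level is the alphabetically first one, which is R's default for a factor built from character data. That assumption would explain why the model 3 coefficient table in the Results lists terms for M, S, and W but not for D.

```python
def dummy_code(values, reference=None):
    """Treatment-code a categorical variable.

    Returns (columns, rows): one indicator column per non-reference
    level and one 0/1 row per input value. If no reference is given,
    the alphabetically first level is used (an assumption mirroring
    R's default behaviour).
    """
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]
    columns = [level for level in levels if level != reference]
    rows = [[1 if value == level else 0 for level in columns] for value in values]
    return columns, rows


# Made-up marital status values for illustration.
columns, rows = dummy_code(["S", "M", "D", "W", "M"])
print(columns)
for row in rows:
    print(row)
```

A row of all zeros corresponds to the reference level, so each fitted coefficient is read as a difference from that level.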

2 Results

2.1 Analysis

The dataset contained five variables:

- Height, reported in centimeters.
- Weight, reported in kilograms.
- Gender, reported as either male, female, or other.
- Age, reported in years.
- Marital Status, reported as either single, married, divorced, or widowed.

We first conducted exploratory analyses to examine the distributions of age, height, and weight. Figure 1 shows the distribution of height. Figure 2 shows the distribution of weight. Figure 3 shows the distribution of age.

Table 1 shows a summary of the data.

Table 1: Data summary table.

Figure 4 shows a scatterplot, produced by one of the R scripts, of the relationship between height and weight stratified by gender.

Figure 4: Height and weight stratified by gender.

Figure 5 shows the distribution of marital status. Figure 6 shows the boxplot of height by marital status, and Figure 7 shows the scatterplot of weight and age.

Figure 5: Distribution of Marital Status

Below is a summary of all linear model fits.

Table 2: Linear model fit table.

term	estimate	std.error	statistic	p.value
(Intercept)	149.6997661	19.7518528	7.5790240	0.0001285
Weight	0.2277371	0.2708841	0.8407177	0.4282860

term	estimate	std.error	statistic	p.value
(Intercept)	149.2726967	23.3823360	6.3839942	0.0013962
Weight	0.2623972	0.3512436	0.7470519	0.4886517
GenderM	-2.1244913	15.5488953	-0.1366329	0.8966520
GenderO	-4.7644739	19.0114155	-0.2506112	0.8120871

term	estimate	std.error	statistic	p.value
(Intercept)	205.9240122	47.5127075	4.3340829	0.0123101
Age	-0.5094225	0.9626066	-0.5292115	0.6246660
`Marital Status`M	-30.2686930	26.3026870	-1.1507833	0.3139342
`Marital Status`S	-25.6790274	30.1016654	-0.8530766	0.4416881
`Marital Status`W	-2.9908815	35.6700641	-0.0838485	0.9372056
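The columns of these tables are linked by simple arithmetic: each statistic is the estimate divided by its standard error, and a fitted value is the intercept plus the relevant coefficients. The Python sketch below checks this for the third model, using values copied from the table above. The dictionary keys are shortened labels chosen here. Treating D (Divorced) as the reference level is an inference from its absence among the terms. Given the large standard errors, the fitted values are illustrative only.

```python
# (estimate, std.error, statistic) copied from the third table above.
# Keys are shortened labels chosen for this sketch.
MODEL3 = {
    "Intercept": (205.9240122, 47.5127075, 4.3340829),
    "Age": (-0.5094225, 0.9626066, -0.5292115),
    "M": (-30.2686930, 26.3026870, -1.1507833),
    "S": (-25.6790274, 30.1016654, -0.8530766),
    "W": (-2.9908815, 35.6700641, -0.0838485),
}


def t_statistic(estimate, std_error):
    """The t statistic is the estimate divided by its standard error."""
    return estimate / std_error


def predict_height(age, marital_status):
    """Fitted height in cm for a given age and marital status code.

    Assumes D is the reference level, so it contributes no offset.
    """
    height = MODEL3["Intercept"][0] + MODEL3["Age"][0] * age
    if marital_status != "D":
        height += MODEL3[marital_status][0]
    return height


# Recompute each statistic and compare it with the reported value.
for term, (estimate, std_error, reported) in MODEL3.items():
    print(term, round(t_statistic(estimate, std_error), 4), reported)

print(round(predict_height(30, "M"), 2))
```

The same check applies to the first two tables, since their statistic columns are computed the same way.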