Report for a Data Analysis Project
Authors
- Anissa Waller Del Valle \(^{1, *, \land}\)
- Emil Joson \(^{2, *}\)
Author affiliations
- Department of Cellular Biology, University of Georgia, Athens, GA, USA.
- Department of Plant Pathology, University of Georgia, Athens, GA, USA.
\(*\) These authors contributed equally to this work.
\(\land\) Corresponding author: aw59557@uga.edu
1 Methods
1.1 Data acquisition
The dataset used in this study was provided with the data-analysis project template and can be found in data/raw-data. The dataset was updated to include two new variables, as described below.
1.2 Description of data and data source
Two new variables were added to the dataset and saved in exampledata2.xlsx. The first variable is age, reported as a numeric value >0 or non-applicable (NA) representing the age of each individual in years. The second variable is marital status, reported as either S (Single), M (Married), D (Divorced), W (Widowed), or NA. These variables are documented in the Codebook sheet of the updated dataset. Other variables include (1) height, reported in centimeters as a numeric value >0, (2) weight, reported in kilograms as a numeric value >0, and (3) gender, reported as either male, female, or other.
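The codebook rules for the two new variables can be checked programmatically. The following is a minimal sketch in base R, not the project's actual processing code. The example rows, the helper name `check_new_variables`, and the column names `Age` and `Marital Status` are assumptions made for illustration.

```r
# Hypothetical example rows; the real data are stored in exampledata2.xlsx.
example_data <- data.frame(
  Height = c(180, 175, 160),
  Weight = c(80, 70, 60),
  Age = c(34, NA, 51),
  `Marital Status` = c("M", "S", NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns one TRUE/FALSE flag per codebook rule.
# NA is treated as an allowed value for both variables.
check_new_variables <- function(d) {
  age <- d$Age
  status <- d$`Marital Status`
  c(
    age_ok = all(is.na(age) | age > 0),
    status_ok = all(is.na(status) | status %in% c("S", "M", "D", "W"))
  )
}

checks <- check_new_variables(example_data)
```

Both flags are `TRUE` only when every non-missing age is positive and every non-missing marital status is one of the four documented codes.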
1.3 Data import and cleaning
Data import and cleaning were performed using scripts located in code/processing-code. This involved looking for obvious outliers (e.g., a weight of 7,000 kg) and removing them from the dataset, among other related processes. The cleaned dataset was saved as processeddata2.rds in the results directory for use in downstream analysis.
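As an illustration of the outlier-removal step, the sketch below drops rows with an implausible weight and saves the result as an .rds file. It is not the project's actual cleaning script. The example rows, the 500 kg cutoff, and the temporary output directory are assumptions chosen only for demonstration.

```r
# Hypothetical raw rows, including one obvious outlier (7,000 kg).
raw <- data.frame(
  Height = c(180, 175, 160, 155),
  Weight = c(80, 7000, 60, 45)
)

# Assumed cutoff for a plausible human weight; not taken from the project.
max_plausible_weight_kg <- 500

# Keep rows whose weight is missing or within the plausible range.
keep <- is.na(raw$Weight) | raw$Weight <= max_plausible_weight_kg
cleaned <- raw[keep, , drop = FALSE]

# Save to a temporary directory here; the project uses its own output path.
out_file <- file.path(tempdir(), "processeddata2.rds")
saveRDS(cleaned, out_file)
reloaded <- readRDS(out_file)
```

Reading the file back with `readRDS()` returns the same data frame, which is how downstream scripts would pick up the cleaned data.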
1.4 Statistical analysis
Exploratory data analysis was performed using the scripts in code/eda-code/eda-copy.qmd. A boxplot was generated to examine the distribution of height in relation to marital status. A scatterplot was generated to examine the relationship between weight and age. These figures were saved to the results/figures directory.
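The two exploratory figures could be produced along the following lines. This is a sketch using base R graphics, which may differ from the plotting approach in the project's own scripts. The example rows, file names, PDF output format, and temporary output directory are all assumptions.

```r
# Hypothetical example rows using the variables described above.
d <- data.frame(
  Height = c(180, 175, 160, 155, 170, 165),
  Weight = c(80, 70, 60, 45, 75, 68),
  Age = c(34, 28, 51, 62, 45, 39),
  `Marital Status` = factor(c("M", "S", "D", "W", "M", "S")),
  check.names = FALSE
)

# Temporary stand-in for the project's figure directory.
fig_dir <- file.path(tempdir(), "figures")
dir.create(fig_dir, showWarnings = FALSE, recursive = TRUE)

# Boxplot: distribution of height within each marital status group.
box_file <- file.path(fig_dir, "height-marital-boxplot.pdf")
pdf(box_file)
boxplot(Height ~ `Marital Status`, data = d,
        xlab = "Marital status", ylab = "Height (cm)")
invisible(dev.off())

# Scatterplot: weight against age.
scatter_file <- file.path(fig_dir, "weight-age-scatterplot.pdf")
pdf(scatter_file)
plot(d$Age, d$Weight, xlab = "Age (years)", ylab = "Weight (kg)")
invisible(dev.off())
```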
Linear models were used to further explore the data. In addition to the previously specified models, a third linear model was fit with height as the outcome and age and marital status as predictors. The results of this model were saved as resulttable3.rds in the results/tables directory.
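A model of this form can be fit with `lm()`. The sketch below uses base R only and builds a coefficient table with the columns term, estimate, std.error, statistic, and p.value. The example rows and the temporary output path are assumptions, and the project's own script may build its table differently.

```r
# Hypothetical example rows; values are made up for illustration.
d <- data.frame(
  Height = c(180, 175, 160, 155, 170, 165, 172, 158, 168),
  Age = c(34, 28, 51, 62, 45, 39, 30, 70, 48),
  `Marital Status` = factor(c("M", "S", "D", "W", "M", "S", "D", "W", "M")),
  check.names = FALSE
)

# Height as the outcome, age and marital status as predictors.
# "D" is the reference level because factor levels sort alphabetically.
fit3 <- lm(Height ~ Age + `Marital Status`, data = d)

# Collect the coefficient summary into a plain data frame.
coefs <- coef(summary(fit3))
result3 <- data.frame(
  term = gsub("`", "", rownames(coefs), fixed = TRUE),
  estimate = coefs[, "Estimate"],
  std.error = coefs[, "Std. Error"],
  statistic = coefs[, "t value"],
  p.value = coefs[, "Pr(>|t|)"],
  row.names = NULL
)

# Save to a temporary directory here; the project uses its own output path.
saveRDS(result3, file.path(tempdir(), "resulttable3.rds"))
```

Each marital status coefficient is the estimated difference in height relative to the reference level, holding age fixed.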
2 Results
2.1 Analysis
The dataset contained five variables:
- Height, reported in centimeters.
- Weight, reported in kilograms.
- Gender, reported as either male, female, or other.
- Age, reported in years.
- Marital Status, reported as either single, married, divorced, or widowed.
We first conducted exploratory analyses to examine the distributions of age, height, and weight. Figure 1 shows the distribution of height. Figure 2 shows the distribution of weight. Figure 3 shows the distribution of age.
Table 1 shows a summary of the data.
Figure 4 shows a scatterplot, produced by one of the R scripts, of the relationship between height and weight stratified by gender.
Figure 5 shows the distribution of marital status. Figure 6 shows the boxplot of height by marital status. Likewise, Figure 7 shows the scatterplot of weight against age.
Below is a summary of all linear model fits.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 149.6997661 | 19.7518528 | 7.5790240 | 0.0001285 |
| Weight | 0.2277371 | 0.2708841 | 0.8407177 | 0.4282860 |

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 149.2726967 | 23.3823360 | 6.3839942 | 0.0013962 |
| Weight | 0.2623972 | 0.3512436 | 0.7470519 | 0.4886517 |
| GenderM | -2.1244913 | 15.5488953 | -0.1366329 | 0.8966520 |
| GenderO | -4.7644739 | 19.0114155 | -0.2506112 | 0.8120871 |

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 205.9240122 | 47.5127075 | 4.3340829 | 0.0123101 |
| Age | -0.5094225 | 0.9626066 | -0.5292115 | 0.6246660 |
| Marital StatusM | -30.2686930 | 26.3026870 | -1.1507833 | 0.3139342 |
| Marital StatusS | -25.6790274 | 30.1016654 | -0.8530766 | 0.4416881 |
| Marital StatusW | -2.9908815 | 35.6700641 | -0.0838485 | 0.9372056 |
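In each table, the statistic column is the estimate divided by its standard error. As a quick consistency check, the sketch below recomputes that ratio for the two rows of the first table, with the values copied from the report. The object names are chosen for illustration.

```r
# Values copied from the first model table in this report.
model1 <- data.frame(
  term = c("(Intercept)", "Weight"),
  estimate = c(149.6997661, 0.2277371),
  std.error = c(19.7518528, 0.2708841),
  statistic = c(7.5790240, 0.8407177)
)

# Recompute the test statistic as estimate / standard error.
recomputed <- model1$estimate / model1$std.error

# Largest absolute gap between reported and recomputed statistics;
# small differences are expected because the table values are rounded.
max_abs_diff <- max(abs(recomputed - model1$statistic))
```

The same check applies to the other two tables by swapping in their estimate and std.error columns.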