# loading packages
library(tidyverse)
library(knitr)
library(ggthemes)
library(ggrepel)
library(dslabs)
This is a Report Template
Slides: slides.html (go to slides.qmd to edit)
Remember: Your goal is to make your audience understand and care about your findings. By crafting a compelling story, you can effectively communicate the value of your data science project.
Carefully read this template since it has instructions and tips for writing!
Generalized Linear Mixed Models (GLMMs) are a flexible class of statistical models that combine the features of two powerful tools: Generalized Linear Models (GLMs) and Mixed-Effects Models (Agresti 2015). Like GLMs, GLMMs can model non-normal outcome variables, such as binary, count, or proportion data. However, they go a step further by incorporating random effects, which account for variation due to grouping or clustering in the data, correlated observations, and overdispersion.
In practical terms, GLMMs are especially useful when data points are not independent, such as when students are nested within schools, patients are treated within hospitals, or repeated measures are taken from the same subject over time. For example, Thall noted that longitudinal count data from clinical trials, where repeated measures come from the same subject, make it difficult to compare outcomes between subjects because it is hard to tell whether differences are driven by time or by treatment group; a generalized linear mixed model addresses this by modeling the dependence within each patient, incorporating covariate data, treating time as a function in the model, and accounting for between-patient variability, all while remaining flexible and tractable (Thall 1988). The random effects model the correlation within clusters and capture unobserved heterogeneity: differences between units that are not explained by the measured covariates.
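To make that setup concrete, here is a minimal sketch (not Thall's actual model) of how such a longitudinal count model might be specified in R with lme4; the data frame `trial_counts` and its columns (`events`, `treatment`, `visit`, `patient_id`) are hypothetical placeholders.

```r
# A minimal sketch: longitudinal counts with a per-patient random intercept,
# treatment as the fixed effect of interest, and visit/time as a covariate.
# `trial_counts`, `events`, `treatment`, `visit`, and `patient_id` are hypothetical.
library(lme4)

fit_long <- glmer(
  events ~ treatment + visit + (1 | patient_id),
  data   = trial_counts,
  family = poisson(link = "log")
)
summary(fit_long)
```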
GLMMs are good for:

- Handling hierarchical or grouped data (e.g., students within classrooms, patients within clinics) (Lee and Nelder 1996)
- Modeling non-normal outcomes (illustrated in the sketch after this list), such as:
  - Binary outcomes (using logistic GLMMs) (Wang et al. 2017)
  - Count data (using Poisson or negative binomial GLMMs) (Candy 2000)
  - Proportions or rates (Salinas Ruíz et al. 2023)
- Improving inference by accounting for both fixed effects (predictors of interest) and random effects (random variation across groups)
- Reducing bias and inflated Type I error rates that can result from ignoring data structure (Thompson et al. 2022)
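As referenced above, the sketch below illustrates how these outcome types might be written as GLMMs using lme4 formula syntax; every data frame and column name here is a hypothetical placeholder, not data used in this report.

```r
# Hypothetical examples of GLMM formula syntax in lme4 for the outcome types
# listed above.
library(lme4)

# Binary outcome: students nested within classrooms (logistic GLMM)
m_binary <- glmer(passed ~ hours_studied + (1 | classroom),
                  data = students, family = binomial)

# Count outcome: visits clustered within clinics (Poisson GLMM)
m_count <- glmer(visits ~ age + (1 | clinic),
                 data = patients, family = poisson)

# Proportion outcome: successes out of trials, clustered by site (binomial GLMM)
m_prop <- glmer(cbind(successes, trials - successes) ~ treatment + (1 | site),
                data = sites, family = binomial)
```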
GLMMs are ideal when your data is both complex in structure and involves non-Gaussian response variables, making them indispensable in fields like medicine, ecology, education, and the social sciences. Tawiah et al. describe zero-inflated Poisson GLMMs, an extension of the Poisson GLMM that allows for overdispersion due to a prevalence of zeros, which is common in health-sector data (Tawiah, Iddi, and Lotsi 2020). The paper compares a Poisson GLM, a zero-inflated Poisson GLM, a Poisson GLMM, and a zero-inflated Poisson GLMM applied to clustered maternal mortality data. Another paper, by Owili et al., uses a GLMM to investigate the impact of particulate matter on maternal and infant mortality globally (Owili et al. 2020). They use a Poisson link function and take year and country as random effects to account for differences across the global data.
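As a rough illustration of the zero-inflated approach that Tawiah, Iddi, and Lotsi describe, a zero-inflated Poisson GLMM could be specified with glmmTMB as sketched below; the data frame `mort` and its columns are hypothetical and are not the authors' actual variables.

```r
# Rough sketch of a zero-inflated Poisson GLMM using glmmTMB; `mort`,
# `deaths`, `facility_type`, and `district` are hypothetical placeholders.
library(glmmTMB)

zip_fit <- glmmTMB(
  deaths ~ facility_type + (1 | district),  # conditional count model with a cluster random intercept
  ziformula = ~ 1,                          # intercept-only zero-inflation component
  family    = poisson,
  data      = mort
)
summary(zip_fit)
```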
We wish to analyze federal maternal mortality deaths from the VSRR Provisional Maternal Death Counts and Rates dataset using a generalized linear mixed model with a Poisson link, since the outcome is count data. We wish to see whether ethnicity (a fixed effect) has any influence on maternal death counts by year (a random effect). Like other public health or clinical data, these data will present issues such as correlated observations and overdispersion, but a GLMM can parse through the noise and determine whether there are indeed patterns of maternal mortality among mothers of differing ethnicities.
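A minimal sketch of the model we have in mind is shown below, assuming a cleaned data frame `vsrr` with columns `deaths`, `ethnicity`, and `year`; these names are placeholders rather than the exact VSRR field names, and the final specification may change.

```r
# Sketch of the planned model: maternal death counts with ethnicity as a
# fixed effect and year as a random intercept. `vsrr`, `deaths`, `ethnicity`,
# and `year` are placeholder names.
library(lme4)

fit_poisson <- glmer(
  deaths ~ ethnicity + (1 | year),
  data   = vsrr,
  family = poisson(link = "log")
)
summary(fit_poisson)
```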
GLMMs can be considered an extension of GLMs, in which random effects are added to a GLM, or an extension of Linear Mixed Models (LMMs), in which a linear model with fixed and random effects is extended to non-normal distributions. Let

- \(\mathbf{y}\) be an \(N \times 1\) column vector of the outcome variable,
- \(\mathbf{X}\) be an \(N \times p\) matrix of the \(p\) predictor variables,
- \(\boldsymbol{\beta}\) be a \(p \times 1\) column vector of the fixed-effects coefficients,
- \(\mathbf{Z}\) be an \(N \times q\) design matrix for the \(q\) random effects,
- \(\mathbf{u}\) be a \(q \times 1\) column vector of the random effects, and
- \(\boldsymbol{\epsilon}\) be an \(N \times 1\) column vector of the residuals.
Then the general equation for the model is given by:
\[\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\mathbf{Z}\mathbf{u}+\boldsymbol{\epsilon}\] (Salinas Ruíz et al. 2023). GLMMs include a link function that relates the response variable \(\mathbf{y}\) to a linear predictor \(\boldsymbol{\eta}\), which excludes the residuals: \[\boldsymbol{\eta}=\mathbf{X}\boldsymbol{\beta}+\mathbf{Z}\mathbf{u}\] The link function \(g(\cdot)\) satisfies \[g(E(\mathbf{y}))=\boldsymbol{\eta}\] where \(E(\mathbf{y})\) is the expectation of \(\mathbf{y}\). The choice of link function depends on the outcome distribution. For this paper, our count data follow a negative binomial distribution, so we will use the log link function \[g(\cdot)=\log_e(\cdot)\]
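Continuing the hypothetical `vsrr` setup above, a negative binomial GLMM with a log link could be fit with glmmTMB, and the link relation \(g(E(\mathbf{y}))=\boldsymbol{\eta}\) can be checked by comparing predictions on the link and response scales. This is a sketch under those assumptions, not our final model code.

```r
# Sketch only: the hypothetical model with a negative binomial family and
# log link via glmmTMB, plus a check that g(E(y)) = eta, i.e. that
# link-scale predictions equal the log of response-scale predictions.
library(glmmTMB)

fit_nb <- glmmTMB(
  deaths ~ ethnicity + (1 | year),
  family = nbinom2(link = "log"),
  data   = vsrr
)

eta <- predict(fit_nb, type = "link")      # X*beta + Z*u on the log scale
mu  <- predict(fit_nb, type = "response")  # conditional mean E(y)
all.equal(eta, log(mu))                    # TRUE up to numerical tolerance
```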
Before deciding to use a GLMM for our data, we had to check some assumptions specific to our negative-binomial-distributed count data.
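One such check is for overdispersion. A rough sketch, again assuming the hypothetical `vsrr` data frame, compares group means and variances and computes a Pearson dispersion ratio from a Poisson fit; ratios well above 1 suggest overdispersion and support a negative binomial family.

```r
# Rough overdispersion check on the hypothetical `vsrr` counts.
library(dplyr)  # loaded with tidyverse above
library(lme4)

# Variance far exceeding the mean within groups is a sign of overdispersion
vsrr %>%
  group_by(ethnicity) %>%
  summarise(mean_deaths = mean(deaths), var_deaths = var(deaths))

# Pearson dispersion ratio from a Poisson fit
fit_pois   <- glmer(deaths ~ ethnicity + (1 | year), data = vsrr, family = poisson)
pearson    <- residuals(fit_pois, type = "pearson")
dispersion <- sum(pearson^2) / df.residual(fit_pois)
dispersion
```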
Describe your data sources and collection process.
Present initial findings and insights through visualizations.
Highlight unexpected patterns or anomalies.
A study was conducted to determine how…
# loading packages
library(tidyverse)
library(knitr)
library(ggthemes)
library(ggrepel)
library(dslabs)
# Load Data
kable(head(murders))
| state      | abb | region | population | total |
|------------|-----|--------|-----------:|------:|
| Alabama    | AL  | South  |    4779736 |   135 |
| Alaska     | AK  | West   |     710231 |    19 |
| Arizona    | AZ  | West   |    6392017 |   232 |
| Arkansas   | AR  | South  |    2915918 |    93 |
| California | CA  | West   |   37253956 |  1257 |
| Colorado   | CO  | West   |    5029196 |    65 |
ggplot1 <- murders %>% ggplot(mapping = aes(x = population/10^6, y = total))
ggplot1 +
  geom_point(aes(col = region), size = 4) +
  geom_text_repel(aes(label = abb)) +
  scale_x_log10() +
  scale_y_log10() +
  geom_smooth(formula = "y~x", method = lm, se = FALSE) +
  xlab("Populations in millions (log10 scale)") +
  ylab("Total number of murders (log10 scale)") +
  ggtitle("US Gun Murders in 2010") +
  scale_color_discrete(name = "Region") +
  theme_bw()
Explain your data preprocessing and cleaning steps.
Present your key findings in a clear and concise manner.
Use visuals to support your claims.
Tell a story about what the data reveals.
Summarize your key findings.
Discuss the implications of your results.