Supporting our vibrant community through resource sharing and mentorship

Research Blog

Resources for R

Software is expensive. I’m not sure which schools offer students free software packages, but mine doesn’t.

SPSS costs about $100, SAS costs around $120, and Stata costs about $250 per academic year. With every other expense I have to think about, I find it difficult to justify using these programs. Free data analysis software, such as R, is rarely used in courses, meaning that students don’t necessarily get a low-stakes introduction to coding and running analyses. I think that this lack of access to courses that use R can make learning the software even more intimidating later on, when we’re using our own data or working without access to instructors and mentors who can review all of our code.

Luckily, I first learned R in my MSW program in 2018, and I’ve been using it ever since. It’s been rocky at times, and I’ve definitely had moments where I wished I could be using an “easier” software. However, as I’ve pushed through, I’ve become more competent and an even bigger proponent of using it. 

This newsletter piece mainly serves to offer resources for students who hope to use R in their research. Although it can be intimidating at first, it does get easier, and there’s a plethora of online resources that can help.

The list of resources below might be updated in the future, or I might make it into a repository of sorts. Before going into the list, though, I do want to say that another resource for identifying code is ChatGPT, which has been invaluable to me as of late. Rather than spending 45 minutes troubleshooting and trying to identify the code I need through Googling, I’ve been able to just type my question into ChatGPT and receive an answer with applicable code. Here’s an example of using ChatGPT for identifying code.


Resources:

Learning Basics of R Coding:

Merging data and introduction to R, Hannah Boyke 2023: This links to a Google Drive folder that contains a video, some sample data, and a document that goes over the process for merging data sets. I made this document for a Lunch and Learn for my PhD program that shows how to merge data and troubleshoot the process. At the end, I include some preliminary information and code for running a variety of analyses and basic introductory information. 

UCLA Advanced Research Computing Statistical Methods and Data Analytics: This page links to UCLA’s data analysis examples, which contain easy-to-follow code for a variety of analyses, such as multivariate linear regression, binary and categorical models, count models like negative binomial regression, and even mixed effects logistic regression.

Dr. Miles Chen of UCLA has a YouTube playlist for an introductory stats class that uses R. This playlist can be really helpful for learning the basics of using R. UCLA’s YouTube channel in general has a lot of helpful information. 

Dr. Stef van Buuren, the co-creator of the mice package for multiple imputation in R, has a plethora of resources regarding missing data, including publications and code available for free on his website. Also, van Buuren has made his book on multiple imputation, Flexible Imputation of Missing Data, available as a free eBook here.

Data Visualization: 

The Urban Institute has amazing information about data visualization! Here’s a link to their GitHub, which has R code for visualization with the ggplot2 package. Here’s a link to their page about making maps in R, and here’s a link to their introduction to R, which focuses on data visualization.

The R Graph Gallery also provides a lot of great guidance for making graphs and other types of visualization.

Visualizing Models:

This vignette by Daniel Lüdecke introduces how to make marginal effects plots using the sjPlot and ggeffects packages.

This vignette, again by Daniel Lüdecke, introduces plotting interaction effects of regression models—specifically, the vignette reviews one-way, two-way, and three-way interactions. 

Making Reproducible Documents in R:

In addition to being free, R offers a variety of ways to develop research reports that support the reproducibility of research. We live in a time where neoliberalism has pervaded academic cultures, inundating researchers with a hyper-competitive culture that privileges publishing over all else. When our access to grant funding and our career prospects are tied to the number of publications we have, it’s not surprising that there have been so many recent scandals related to research misconduct (including with professors at Harvard and Stanford). As demonstrated by the blog of Nick Brown, PhD, making data public and placing it in a repository does little to prevent misconduct and fraud.

Developing reproducible research documents is one way to promote transparency, ethical research, and inter-researcher collaboration. The most basic way to begin doing this is to use R Notebooks and RMarkdown to create research reports that are used as the basis for articles. When doing this, it’s important to consider a couple of things: 

  • Write enough notes that someone who is completely removed from the project can understand each step of the analysis.

  • If the data is in a file other than an .RData file, use the `head()` function (or `names()`, if that is necessary for protecting privacy) and the `dim()` function to provide insight into the structure of the data set throughout the cleaning and analysis processes. It’s also necessary to document any changes that were made to the data set outside of R.

  • Similarly, make sure that there are extensive notes describing the variables before any analysis happens, as well as descriptions before and after any transformations. 

  • Although it might be “unattractive” and could potentially create a 100-page document, always have at least one version of the document that shows the code used to run everything that is done.

  • Making sure that random sampling, bootstrapping, or multiple imputation is reproducible requires more than just setting the seed. Differences can emerge from a variety of things, including the version of the package or whether the variables have been changed at all (for example, the original author might have stored a variable as a factor, while it is a character variable for someone else).

  • If you use any of these techniques, make sure that there are ways to check their fidelity beyond just the model results. For example, if drawing a random sample of states, make sure you indicate which states end up in that random draw.
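
To make a few of the points above more concrete, here is a minimal sketch in base R. It assumes a small, made-up data frame called `survey`; the data, the column names, and the seed value are all hypothetical and only there for illustration. It is one possible way to document structure, variable types, and a random draw, not the only way.

```r
# Hypothetical example data; the column names are made up for illustration.
survey <- data.frame(
  state = c("MI", "OH", "IN", "IL", "WI", "MN"),
  age_group = c("18-29", "30-44", "18-29", "45-64", "30-44", "65+"),
  score = c(12, 15, 9, 20, 14, 11),
  stringsAsFactors = FALSE
)

# Document the structure of the data without printing any values.
dim(survey)            # number of rows and columns
names(survey)          # variable names only
sapply(survey, class)  # storage type of each variable

# Make the storage type explicit so it does not depend on how the file was read.
survey$age_group <- factor(survey$age_group)

# Set the seed immediately before the random step.
set.seed(2023)
sampled_states <- sample(survey$state, size = 3)

# Record which states were drawn so that a reader can verify the draw.
print(sort(sampled_states))

# Record the R version and the loaded package versions used for this run.
sessionInfo()
```

In an R Notebook or RMarkdown report, keeping the output of `sessionInfo()` at the end of the document gives readers the R and package versions that produced the results, which is one of the things that setting the seed alone doesn’t capture.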

Here are some other resources for learning how to make reproducible documents:


There are many more considerations that can be addressed, but this should hopefully offer a baseline for future research!