Crack the Code: Four Tips for New “R” Users
By: Yiwen Zhu
It has been over four years since I created my first R script.
Over those four years, there were countless moments when I mumbled in front of the computer, asking myself, “But…but you just worked a minute ago! What happened?”
Now, as a full-time data analyst, I cannot be more grateful for being able to learn R and use it in my work. Despite the occasional frustrations, learning R has empowered me to work with different types of data and address a wide range of scientific questions.
If you are just getting started with R and feeling a little intimidated by coding, fear not. The tips compiled below should help you become more comfortable with R in no time.
1. Make use of the resources around you.
With so many resources available for learning R, it can be hard to choose where to start. Whether you are looking for online courses, in-person classes, or written tutorials, there are options to suit every type of learner. If you have not worked in R or any other programming language before, following a course is a good way to make sure your bases are covered before you tackle real data analyses and pick up more skills on the job.
For online resources, Coursera, edX, and DataCamp are great places to start, as they offer a large assortment of online classes. Browse institutions in your area to see if they are hosting online or in-person workshops for beginners. These resources are great when you would like to get a handle on basics such as loading data, learning how data is stored, and writing your first function.
Another great way to learn R is to befriend people around you who have more experience with it, especially if you work on similar types of data and have similar workflows. Ask to grab coffee and talk with them about specific things that they find helpful in their data analysis process, or whether there were any nerve-wracking moments that turned out to be good learning experiences. I have found it helpful to go through other people’s code with specific questions pertaining to my own data analysis in mind. I often ask questions like, “How did you examine the data distributions and identify important bivariate associations?” or “How did you go from regression output to making a table that can be exported to Excel?” It is valuable to see how other people approach writing scripts, which will eventually help you find the approach that best fits your style.
2. Chart out the intermediate steps.
No matter how long you have used R, code does not magically appear on the screen when your fingertips touch the keyboard. Especially when I work on data wrangling and visualization tasks, I often start by figuring out what the end product should look like by drawing it out on a piece of paper.
As a simple example, if you are interested in looking at the distribution of age in a sample and whether it differs by gender, the end product may be a box plot with one box for each gender. You can work backwards to sort out the intermediate steps by asking questions like:
In order to produce this plot, what should the data frame look like?
Should the data be stored as numeric or factor in these columns?
Is there anything that needs to be excluded?
By breaking down the big task into smaller steps, the process seems much less intimidating and more actionable.
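As a rough sketch of where those questions might lead, the snippet below builds a small made-up data frame (the values and the column names `age` and `gender` are invented purely for illustration), checks how the columns are stored, drops rows with missing values, and draws the box plot with base R. It is one possible route to the end product, not the only one.

```r
# Hypothetical example data: values and column names are invented for illustration
dat <- data.frame(
  age    = c(23, 35, 41, NA, 29, 52, 38, 47),
  gender = c("female", "male", "female", "male", "female", "male", NA, "female"),
  stringsAsFactors = FALSE
)

# What does the data frame look like, and how is each column stored?
str(dat)

# Store gender as a factor so that each level gets its own box
dat$gender <- factor(dat$gender)

# Is there anything that needs to be excluded? Here: rows with missing values
plot_dat <- dat[complete.cases(dat[, c("age", "gender")]), ]

# The end product: one box per gender
boxplot(age ~ gender, data = plot_dat,
        xlab = "Gender", ylab = "Age (years)")
```

Each commented step corresponds to one of the questions above, which keeps the script readable when you return to it later.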
Next, I suggest translating each step into pseudocode (i.e., natural language providing a step-by-step outline for the program) and writing it down line by line as comments in your script. Everyone uses pseudocode differently, but I see it as a way to assemble the structure of your code in plain English. For example, if you would like to categorize participants as children, adolescents, and adults, the comment could be “## recode age as a categorical variable: age < 10 – children, 10 <= age < 19 – adolescents, age >= 19 – adults.”
This exercise not only helps you be explicit about what exactly you want the code to do (which your future self will be grateful for), but also gives you an idea of what to search for when you get stuck.
If you find yourself stumped when trying to translate the pseudocode into actual R code, simply enter the keywords (e.g., “categorical, recode, R” for the example above) into a search engine, and there will likely be a detailed blog post or Stack Exchange answer showing you how to tackle it. I also recently discovered rseek, which allows you to search through a wide collection of amazing R resources.
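For the age example, one possible translation of the pseudocode comment into R uses the base function `cut()`. This is only a sketch: the `age` values are made up for illustration, and other approaches (such as nested `ifelse()` calls) would work too.

```r
# Hypothetical ages, invented for illustration
age <- c(4, 9, 10, 15, 18, 19, 42)

## recode age as a categorical variable:
## age < 10 - children, 10 <= age < 19 - adolescents, age >= 19 - adults
age_group <- cut(
  age,
  breaks = c(-Inf, 10, 19, Inf),
  labels = c("children", "adolescents", "adults"),
  right  = FALSE  # intervals are closed on the left, e.g. [10, 19)
)

# Cross-check the recoding against the original values
table(age, age_group)
```

Setting `right = FALSE` matters here: it makes a 10-year-old an adolescent and a 19-year-old an adult, as the comment specifies. Cross-tabulating the old and new variables is a quick way to confirm the boundaries landed where you intended.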
3. Write the code with your future self in mind.
Data analyses often involve a lot of repetition. We may need to recode ten variables in similar ways, fit the same model in four different subsamples, and look at eight types of exposure. Sometimes we are tempted to just copy and paste the same code and change the variable names. I have found that it is actually less time-consuming to write a function to automate all the downstream analyses, especially when you might need to rerun the analyses for a revision or repurpose them for a different project. (Note: To learn more about how to write a function, there are many helpful blog posts out there, such as this one.) Creating functions for repetitive tasks also makes it easier to troubleshoot and tweak the code, while copying and pasting leaves more room for error.
There is no “redo” button in R when you have modified an object incorrectly, which can happen quite a lot when you are figuring things out as a beginner. Reloading and rerunning everything from the top can be time-consuming and frustrating.
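To make the idea concrete, here is a minimal sketch of a function that fits the same model in several subsamples. The helper name `fit_by_group` is hypothetical (it does not come from any package), and the built-in `mtcars` data set stands in for real study data.

```r
# Fit the same linear model within each level of a grouping variable.
# `fit_by_group` is a hypothetical helper name, not from any package.
fit_by_group <- function(data, formula, group) {
  subsamples <- split(data, data[[group]])
  lapply(subsamples, function(d) lm(formula, data = d))
}

# mtcars ships with R; the "subsamples" here are cars with 4, 6, or 8 cylinders
models <- fit_by_group(mtcars, mpg ~ wt, group = "cyl")

# Pull the slope for wt out of each fitted model
sapply(models, function(m) coef(m)[["wt"]])
```

If the model needs to change for a revision, only the formula passed to the function has to be edited, rather than several pasted copies of the same code.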
One trick that has helped me is to save major versions of my data at critical stages (e.g., raw data, analytic sample, data with all derived variables). If the data is later accidentally modified, the “pristine” version can be retrieved by reloading the saved versions. Your future self will be so grateful for this type of second chance!
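One way to do this, sketched below under some assumptions, is with base R's `saveRDS()` and `readRDS()`. The folder and file names are hypothetical, a temporary directory is used so the example runs anywhere, and `mtcars` again stands in for the raw data.

```r
# Hypothetical folder for saved versions; a temporary directory is used here
# so the example runs anywhere. In a real project, use a project folder.
out_dir <- file.path(tempdir(), "data_versions")
dir.create(out_dir, showWarnings = FALSE)

raw <- mtcars  # stand-in for the raw data

# Save the pristine raw data before touching it
saveRDS(raw, file.path(out_dir, "01_raw.rds"))

# Derive a variable, then save that stage too
analytic <- raw
analytic$heavy <- analytic$wt > 3.5
saveRDS(analytic, file.path(out_dir, "02_with_derived.rds"))

# Oops: an accidental modification
analytic$mpg <- NULL

# Second chance: reload the saved version instead of rerunning everything
analytic <- readRDS(file.path(out_dir, "02_with_derived.rds"))
```

Numbering the file names by stage is just one convention; the point is that each critical stage has a copy on disk that can be reloaded.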
4. Find your style.
I am definitely someone who feels happier and more productive when my code is tidy and consistent. Finding a coding style that truly speaks to you takes time, and it evolves constantly, but here are a few things that have helped me.
· Keep the syntax consistent.
Style guides such as Google’s R style guide or the tidyverse style guide introduce some best practices into your coding routine. While focusing on details like whether or not to put a space somewhere may seem obsessive at first, following a style consistently can eventually make your code easier to read and interpret.
· Develop your own script template.
Since we each have our own habits and goals when we write scripts, having a template helps you create a routine that makes sense for the kind of tasks you work on. For example, I start my script with a brief note on what it is for, followed by the date and author. An “appendix” section is sometimes included at the end of my script to house snippets that are sanity checks not directly related to the main analyses, so that the main body of my script can flow logically and detail the analyses without interruptions.
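A template along these lines might look like the sketch below. The header fields and the appendix follow the description above; the numbered middle sections are only an assumed layout and should be adapted to your own workflow.

```r
## ---------------------------------------------------------------
## Purpose: <one or two sentences on what this script is for>
## Date:    <date>
## Author:  <name>
## ---------------------------------------------------------------

## 1. Load packages and data (assumed section) --------------------

## 2. Clean data and derive variables (assumed section) -----------

## 3. Main analyses (assumed section) -----------------------------

## Appendix: sanity checks ----------------------------------------
## Snippets not directly related to the main analyses go here
```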
· Pick one thing that works for you.
While there are many ways to achieve the same goal in R, I personally think that when you are getting started, it is helpful to pick one thing that works for you and keep practicing it when you encounter similar tasks. For example, I am the most comfortable with doing data cleaning using dplyr and other tidyverse packages, so when I have new questions about data wrangling, I try solutions that use these packages first. That way, your code will be compatible with previous scripts and you don’t need to learn new syntax.
Learning R has been a lot of fun for me. I am constantly amazed by how enthusiastic people in the R community are about sharing their knowledge and developing packages that make my life easier without asking for anything in return. If you are new to R, reach out to others whenever you have questions and don’t feel embarrassed – most R users have been there and made the same mistakes themselves. I hope you will end up enjoying R as much as I do, and that through your R adventures, you develop your skills to do inspiring, reproducible science.