As most data analysts know, a good data visualization starts with good data. Good data is a byproduct of something we call "data wrangling." Like wrangling cattle on a ranch, wrangling data takes a lot of blood, sweat, and tears, but above all it takes a good tool.
Excel can certainly perform most data wrangling tasks, but the issue comes down to time and efficiency. If you perform a data wrangling series only ONCE (say you open a CSV file, remove duplicates, format a column, create a pivot table of ethnicities, and then re-save the file as an Excel workbook), then sure, use Excel. But how many of you run a series like that only once? Almost never. Changes, revisions, and new data arrive all the time. So how do you streamline the work?
R (https://www.r-project.org/) and RStudio (https://www.rstudio.com/)
If you do any type of repetitive work in Excel, you owe it to yourself to install both R and RStudio (both are free) on your computer and then pick up a copy of "R for Excel Users" (https://www.amazon.com/Excel-Users-Introduction-Analysts/dp/1500566357).
As an example of what this can do, here is the R code for the wrangling steps above:
library(dplyr)    # provides %>%, distinct(), group_by(), and summarize()
library(writexl)  # provides write_xlsx() for saving Excel files

# Read the CSV, drop duplicate rows, and count rows per ethnicity
data <- read.csv("enrollment.csv") %>%
  distinct() %>%
  group_by(ethnicity) %>%
  summarize(total = n())

# Save the summary as an Excel file
write_xlsx(data, "ethnicity_summary.xlsx")
This code took me about a minute to write. The best part is that I can run this script and have R do all of the work for me; I don't have to manipulate the file in Excel to get the desired results. If the underlying data changes, I don't need to repeat all of the Excel steps, I just re-run my R script. And if I need to do the same thing next week, next month, or next quarter, I don't need to remember the EXACT steps I took; I just run the script again.
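As a quick sketch of how that re-use might look (the function name, file names, and the writexl package are my own illustration, not part of the original workflow), the same steps can be wrapped in a small function and pointed at whichever file arrives next:

library(dplyr)
library(writexl)

# Hypothetical wrapper: run the same wrangling on any enrollment CSV
summarize_enrollment <- function(csv_path, out_path) {
  read.csv(csv_path) %>%
    distinct() %>%
    group_by(ethnicity) %>%
    summarize(total = n()) %>%
    write_xlsx(out_path)  # write_xlsx() takes the data frame first, so it pipes cleanly
}

# Next quarter's data only needs one new line:
summarize_enrollment("enrollment_q3.csv", "ethnicity_summary_q3.xlsx")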
There is a learning curve to R, and you will need to dedicate some time to learning it, just as you did the first time you sat down with Excel. However, I have no doubt that it will save you time and increase your productivity.