If you work with data much, you don't need a statistical model to predict that the odds of consistently getting data in the format you need for analysis are pretty low. Those who do a great deal of data cleaning and reformatting often turn to scripting languages like Python or specialty tools such as OpenRefine or R.
But it turns out that there's a lot of data munging you can do in a plain old Excel spreadsheet -- if you know how to craft the proper formulas.
In a presentation at the recent 2014 Computer Assisted Reporting (CAR) conference, MaryJo Webster, senior data reporter with Digital First Media -- a newspaper group in New York -- shared some of her favorite Excel tricks. The goal of these tips, Webster said: Learn at least one new thing that will make you say, "Why didn't I know this before?"
Tip 1: Split dates into separate fields
You can extract the year, month and day from a date field in Excel into separate fields with the formulas =YEAR(CellWithDate), =MONTH(CellWithDate) and =DAY(CellWithDate). Splitting dates this way -- by year, month and day of month -- works in Microsoft Access as well, Webster said.
You can also get the day of the week for any date in Excel with =WEEKDAY(CellWithDate). By default it returns numbers, not the names of the days of the week, with 1 for Sunday, 2 for Monday and so on.
To display the name of the weekday instead of a number, apply a custom format to the cells with the weekday numbers, using Format cells > Custom; then type ddd in the Type text box to get three-letter abbreviations or dddd for the full day name.
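For readers who want to check the logic outside Excel, here is a minimal Python sketch of the same date splitting. It is not part of Webster's presentation; the function name split_date and the sample date are assumptions made for illustration. The weekday number is shifted to match Excel's default WEEKDAY numbering, where Sunday is 1.

```python
from datetime import date


def split_date(d):
    """Return (year, month, day, weekday_number) for a date.

    weekday_number follows Excel's default WEEKDAY numbering:
    1 for Sunday, 2 for Monday ... 7 for Saturday.
    """
    # isoweekday() gives Monday=1 ... Sunday=7, so shift it by one day.
    weekday_number = d.isoweekday() % 7 + 1
    return d.year, d.month, d.day, weekday_number


# Hypothetical sample date, chosen only for illustration (a Sunday).
sample = date(2014, 3, 2)
year, month, day, weekday_number = split_date(sample)
print(year, month, day, weekday_number)

# Rough equivalents of the ddd and dddd custom formats
# (the names printed depend on the system locale).
print(sample.strftime("%a"), sample.strftime("%A"))
```

The strftime codes %a and %A play roughly the role that the ddd and dddd custom formats play in Excel.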
Tip 2: Find someone's current age
If you have someone's date of birth, you can find his or her current age on whatever day you open the spreadsheet with the =DATEDIF() and =TODAY() functions. TODAY(), as you might guess, gives the current date. DATEDIF() gives the difference between two dates in units of years ("y"), months ("m") or days ("d"), using the syntax:
=DATEDIF(Date1, Date2, Unit of measure)
So, to get current age in years, use the formula:
=DATEDIF(CellWithBirthDate, TODAY(), "y")
Note that the years unit returns ages in whole numbers and does not round up.
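The whole-years behavior can be sketched in Python as follows. This is an illustration under stated assumptions, not Excel's own implementation: the function name age_in_years and the sample birth date are hypothetical, and edge cases such as Feb. 29 birthdays may not match DATEDIF exactly.

```python
from datetime import date


def age_in_years(birth_date, on_date):
    """Whole years between two dates, never rounded up."""
    years = on_date.year - birth_date.year
    # If this year's birthday has not happened yet, the last year is incomplete.
    if (on_date.month, on_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


# Hypothetical birth date, used only for illustration.
birth = date(1980, 6, 15)
print(age_in_years(birth, date(2014, 6, 14)))  # day before the birthday: 33
print(age_in_years(birth, date(2014, 6, 15)))  # on the birthday: 34
print(age_in_years(birth, date.today()))       # changes as time passes, like TODAY()
```

Passing date.today() as the second argument mirrors the role of TODAY() in the spreadsheet: the result updates whenever the code is run.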
Tip 3: Create multiple rows out of only one
Sometimes you need data in a format with one row for each observation, but what you already have comes with multiple observations for each row instead. In Webster's example of Affordable Care Act Exchange plan pricing, there is a column for prices in each age group: 1-20 years old, 21 years old, 22 years old and so on. However, some visualization and analysis tools require one row for each plan/price combination, not one row with multiple prices.
Tableau visualization software is one such tool that needs one data point per row, not multiple data points per row, so the vendor created a Tableau Reshaper Tool that works with recent versions of Excel on Windows.
You can download this free tool from the Tableau website. Although one add-in says it's for Excel 2010, it worked fine with Excel 2013 on my Windows 8 PC.
Several CAR attendees said they've spent hours reshaping large data sets by manually cutting and pasting, and the free Tableau tool will save them a lot of time. You don't need to have other Tableau software installed on your system to use it.
The columns you're keeping as row ID columns should be placed on the left, and all your data columns on the right. To use the reshaper tool, put your cursor on the first cell with data that you want transformed. Then go to the Tableau menu, choose reshape data and click OK.
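To make the wide-to-long idea concrete, here is a small Python sketch of this kind of reshaping. It does not describe how the Tableau add-in works internally; the function name reshape_wide_to_long, the output column names and the plan prices are all made up for illustration.

```python
def reshape_wide_to_long(header, rows, id_count):
    """Turn each row with several value columns into one row per value.

    The first id_count columns are kept as row IDs (the columns on the
    left); every remaining column becomes its own output row.
    """
    id_names = header[:id_count]
    value_names = header[id_count:]
    long_rows = []
    for row in rows:
        ids = row[:id_count]
        for name, value in zip(value_names, row[id_count:]):
            long_rows.append(ids + [name, value])
    return id_names + ["column", "value"], long_rows


# Hypothetical plan prices by age group, invented for this example.
header = ["plan", "1-20", "21", "22"]
rows = [
    ["Plan A", 100, 150, 152],
    ["Plan B", 90, 140, 141],
]

new_header, long_rows = reshape_wide_to_long(header, rows, id_count=1)
print(new_header)
for long_row in long_rows:
    print(long_row)
```

Two input rows with three price columns each become six output rows, one for each plan/price combination.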
Tip 4: Create more easily sortable data
Another common data format problem is when you get a "spreadsheet" that's less like a sortable table of data and more like a Word document with column headers. One example: a spreadsheet with the name of a team on one row followed by all the players on that team, then the name of another team right below followed by the players and so on. It's difficult to analyze a worksheet where column headers are interspersed with data, since you can't easily sort, filter or visualize data by team.
One way to deal with this is to add a new column with the team name for each player.
"The trick is that you need to have a pattern to follow," according to Webster. In the example above, the position column is empty for the team name rows but filled in for the player rows. By filling in just the first cell with the team name manually, you can then use this formula to automatically fill in the rest:
=IF(B3="",A3,C2)
That says: If cell B3 is blank, fill in the value of the cell in the first column of the same row (in this case A3). Otherwise, fill in the value from the cell that's just above it (in this case C2, which should be the team name from the row above for all the player rows). Make sure to start with the first player row after having manually entered the first header row.
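The same fill-down pattern can be sketched in Python. This is only an illustration of the logic described above; the function name add_team_column and the team and player names are hypothetical.

```python
def add_team_column(rows):
    """Add the team name to every row of (name, position) pairs.

    A row with an empty position is treated as a team header row,
    the same pattern the spreadsheet formula relies on.
    """
    result = []
    team = None
    for name, position in rows:
        if position == "":
            # Header row: the first column holds the team name.
            team = name
        # Otherwise carry down the team name from the rows above.
        result.append((name, position, team))
    return result


# Hypothetical roster, invented for this example.
rows = [
    ("Team A", ""),
    ("Player One", "Guard"),
    ("Player Two", "Center"),
    ("Team B", ""),
    ("Player Three", "Forward"),
]

with_team = add_team_column(rows)
# Drop the header rows once every player row carries its team name.
players = [row for row in with_team if row[1] != ""]
for player in players:
    print(player)
```

Once each player row has a team value, the header rows can be dropped and the data sorted or filtered by team.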
Tip 5: Create a new column
You probably know that you can do a search and replace in Excel with the same kind of Ctrl+F find-and-replace found in a typical text editor. But did you know that you can also create an entirely new column in Excel based on a search-and-replace of an existing column? That requires the SUBSTITUTE function, using the syntax:
=SUBSTITUTE(CellWithText, "oldtext", "newtext")
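As a rough Python analogue, the sketch below builds a new list from an existing one and leaves the original values untouched. The function name substitute_column and the street addresses are made up for illustration. Like SUBSTITUTE, Python's str.replace is case-sensitive and replaces every occurrence by default.

```python
def substitute_column(values, old_text, new_text):
    """Return a new list with old_text replaced by new_text in each value."""
    return [value.replace(old_text, new_text) for value in values]


# Hypothetical addresses, invented for this example.
streets = ["100 Main St.", "22 Oak St.", "5 Elm Ave."]
full_names = substitute_column(streets, "St.", "Street")

print(streets)     # the original column is unchanged
print(full_names)  # the new column holds the replaced text
```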
For more of Webster's Excel tips, including how to do data lookups from another worksheet using VLOOKUP(), download her PDF document My Favorite (Excel) Things 2014 and the sample spreadsheet.
Sharon Machlis is online managing editor at Computerworld. You can follow her on Twitter @sharon000, on Facebook, on Google+ or by subscribing to her RSS feeds.