Thursday, May 23, 2013

Basics: =IF functions/ Identifying duplicates manually

While my first two posts were great, as I was pondering what kind of super complicated spreadsheet to make next, I got fixated on the fact that my pace is pretty quick. I try to give brief explanations for everything. However if you do not have a foundation in use of some of the more basic Excel functions, you have no hope of understanding what I am saying when I get into a formula that consists of 8 nested =IF statements.

Because of this, today I'll be trying to give a 101 type presentation of what I find to be the most useful took in Excel, the =IF function.

If you go into Excel's help function, you will find the definition of their =IF statement:
IF(logical_test, [value_if_true], [value_if_false])
Couldn't have put it better myself, in fact, that's how I learned how to use them so quickly. To put this definition into simple English:
=IF(Here is my logical statement, if the logical statement is true do this, if the logical statement is false do this.)
Here are some simple explanations of the 3 components of an =IF function: Logical test, value if true, value if false.

Logical Statement

A logical statement is basically a mathematical equation. A logical can be as simple as "A2=1". In this instance, the =IF function is going to look at cell A2. If A2 is equal to 1, then the function will return the "value with true" that you have entered into the equation, if A2 is not equal to 1, then the value if false statement will be returned.

You can also create more complex logical arguments. You can use "greater than" and "less than" and multiple cell references in your logical statement. An example of this would be:
A2>2
A2>B2
You can also use "string" (letters or groups of letters) variables in your logical argument.
 A2="Complete"
You can make the logical argument just about as complicated as you need. In a single logical statement, you can analyze multiple cells to create your logical statement:
A2+B2>C2
Value if true and Value if false

The second 2 components are your return values. Both are in the exact same format. You can instruct the formula to populate the cell as blank or null by using the "" symbol. You can instruct the cell to populate as a word by putting the word in quotations (ie: "Complete"). You can also instruct the formula to perform another calculation, this is called "nesting".

So let us drop with the theory, and get into an example. Some of my earliest tinkerings in Excel were as an intern doing educational research for the State. On a constant basis, we were provided with datasets that had many duplicates. Versions of Excel after 2003 have a point and click feature that is capable of removing duplicates, but I did not have such luck in 2003 and needed to figure out a way to identify these duplicates. Here is a short example of the type of dataset I would have been looking at:



Given, this is only 28 rows which means you could manually find duplicates in a matter of a minute or two, but imagine that this is a document with 60 thousand rows. Well, the first step is to sort by the first row. The quickest way to do this is to hit "ctrl+a" on your keyboard to select the entire spreadsheet, and click the "sort ascending" button (alternatively: Data>Sort, and use that menu):



As you can see, there are a bunch of duplicate Student ID numbers. This was usually the case with student level data, seeing as how there are many undocumented families, and one child's SSN will be used for thousands of students. This was not an issue the Department of Education could do anything about, but we were frequently asked to identify which Students were showing up more than once so that these students could be removed from the data pool so as to not skew our analyses.

After sorting, the next step to identifying these students was to use an if statement. I would enter the following formula into the E2 cell:
=IF(A3=A2, 1,0)
Or in plain english: IF(The value in A3 is the same as the value in A2, populate as 1, if not, populate as 0.
 
I'd then autofill down to D29. Since I did not use any dollar signs in my formula, the autofill will automatically change the formula to keep it in context:
=IF(A4=A3,1,0)
=IF(A5=A4,1,0)
=IF(A6=A5,1,0)... etc.
 After entering my equation and autofilling, my spreadsheet would look like this:



This leaves a gap, however. Since I only made a simple =IF function, the last duplicate listed will not be counted. To compensate for this, I got clever. I did created a second equation that evaluated my first equation for a reverse evaluation.

My first equation looks downward and evaluates the next row down. So that means that if I apply reverse logic, that means the row below a "1" result must be a duplicate as well. To trigger a datapoint that I could zero in on, I created this second formula:
=IF(E1=1,1,0)
This cell could just check the formula result one row up to identify if the row was a duplicate or not. The result after autofill looked like this:






So now that I had my triggers built in to tell me where all the duplicates were, I created a third column to tell me all the positive results. I also would then turn on the filter function (Highlight top row>Data>Filter>Autofilter) so I could get a snapshot of both the dataset minus the duplicates as well as a list of all the duplicate data (this formula was "=E2+F2) entered into the G2 cell:



Using these 3 =IF functions creatively has now given me a simple view of all duplicates in the dataset. Any row that has a value of 0 in column G does not have any duplicates within the dataset. If the value in column G is greater than 0, that row has duplicates.

Filtered view of unique student numbers:



Filtered view of duplicate student numbers:


Being able to create these two views, I could just copy and paste each set of data into 2 new Excel spreadsheet and send my Director the completed data request: A spreadsheet of unique student data and a spreadsheet of duplicated student data.

This is one of my most frequently used tricks. As I have referenced, newer versions of Excel have a "Identify/Remove duplicates" feature. I don't know that I trust that feature yet. When I use my method as described above, and then use the Excel feature, I often get different results. Since I know my way is simple and accurate, I trust my results over Excel 2010's results.

This is just one creative application of =IF statements I have used. It really is easy, and simple =IF statements empower you to do just about anything in Excel, you just need to get creative with it.