4: Utilizing Best Practices for Preparing and Analyzing Data
4.4 Create a reusable data quality dashboard
4: Utilizing Best Practices for Preparing and Analyzing Data
4.4 Create a reusable data quality dashboard - Video Tutorials & Practice Problems
Video duration:
37m
Play a video:
<v Instructor>During this sub-lesson,</v> we will create a data quality dashboard that can be used to review the data within any column of a table within our model. Once completed, this dashboard can be used to evaluate the data quality and profile for the specified column and to identify values within the column that do not match an expected format. The data available for this demo includes an employee data table with their name, date of birth, age, phone number, and state. For a successful implementation, it is required that the user be able to enter a table name and then select a column for that table to analyze using the calculations and visuals on the dashboard. Furthermore, the design for our dashboard should allow it to be copied across multiple models for consistency and reusability. The ability to standardize such capability within our models will provide us with quick insight into our source data, which can help avoid some data integrity related issues we discussed earlier in this lesson. Throughout this demo, we will use the concepts and functions listed on this slide to generate the following information for our dashboard. Total number of values in the selected column, top 10 distinct or duplicate values, values that do not match a given format, count of empty cells or cells with errors, total number of distinct, unique, and duplicate values, count of values within the selected column that are not a number or maybe are zero, invalid date or dates that are in the future, and finally, the min, max, and averages for the values in the selected column. So with that, let's begin. So as you can see, I've already got a layout for our dashboard with a couple of data entry fields in the top left, which is for the table that we're gonna analyze and the column within that table that we're gonna look at. The couple of fields below that are some calculated properties for the selected column. To the right of this, we've got a couple of sections for the column quality and profile. And then below that, we've got another section for the data profile for the values within the column. To the right of that, we've got some empty space, where we're gonna place a couple of visuals for the top 10 distinct values and the top 10 duplicate values within the column. And below that, we've got a couple of boxes where we'll place the spill ranges that become the source for these two visuals. And we can also use these to validate the data within the visuals. Then to the extreme right, we've got the format check section where we can evaluate the values within the column for a specific format and then display some corresponding results in the section here. Off our screen to the right of the dashboard, we've got the tabular data that we're gonna reference. So here, we've got the name, date of birth, age, phone number, and state for the employees. We've named this table, DQD_tblEmpData. So this follows our standard naming convention where we've got the sheet acronym to start with with an underscore following that, then the tbl prefix because we're referencing tabular data, and then an appropriate name for the dataset. Coming back to our dashboard, we've also got some defined names that I've set up in advance just to save ourselves a little bit of time. Some of these defined names are listed right next to the fields or cells that they reference. And then there's some others at the bottom that we're gonna review as our demo progresses. You can see that all our defined names also follow a consistent naming convention. So we've got the ARR prefix for the arrays. We've got an FC_ for the format check section on the right, and then all of them in general have the DQD_ prefix added for the sheet that they're on. So with that, let's begin our demo and start building our dashboard. To start with, we'll specify the name of the table that we wanna analyze, which is DQD_tblEmpData. This is the table we just looked at. Now, if we click on the column to analyze cell and look at the dropdown in this cell, we'll see that it's already populated with the column names for the table that we just referenced. Now this is because I've already set up a data validation for the cell, but let's have a look at the formula we use for that data validation. So we'll go to data and then click on data validation. So here, we've used the list option under allow. And for the source, we've used the indirect function with a textual reference to the headers of the table. For that, we use the defined name for the table to analyze, which is referring to cell B4 here. And then we've concatenated the pound headers suffix for the structure reference to the headers of that table. So using this formula, we're able to display the column names and the dropdown for the cell. So let's select one of the column names. And then let's generate the structured reference for this column under column reference text. For this, we'll use the CONCAT function. We'll start with the table name and then open the square brackets. And then we wanna reference the column name and then close the square brackets. And then press enter. We can now use this structured reference for the selected column for other formulas within this dashboard. The first formula we'll use this for is to determine the data type of the first value within this column, using the type function within Excel. We'll also use the DQD_ColumnRef define name, which uses the column reference text that we determined up above and supplies that as an argument for the indirect function to then reference the actual values within the selected column. And finally, we'll use the Excel data types lookup table that we've created on the ref sheet, which maps the data type names to the numeric value returned by the type function. Now let's start entering our formula. We'll start by recreating the first value in this column using the index function. We'll use the DQD_ColumnRef define name and then supply the rule number argument as one because we wanna get the first value. And let's press enter. We will now use the type function to retrieve the data type for this value. And we can select the entire expression and use F9 to look at the calculated value. Using Ctrl + Z, we can get back to the formula. And we'll use the XLOOKUP function to look up this numeric type value and the type column off our Excel data types table. We can reference that using the acronym for the sheet. It is tabular data, and then now we can get to the Excel data tapes reference, and then supply the type column name. The value we wanna return is within the name column of the same table. And let's also provide the if not found argument. Then we just wanna return unknown if nothing's found. And then the search mode is zero for exact match. And then we press enter. As you can see, we get the text data type for the name column. Now let's try for the age column and we get number. So our formula seems to be working. Now let's integrate some stats for this column. We'll start with a count of all blank and non-blank values within this column. So we'll use the COUNTBLANK function and the COUNTA function for all non-blank values. As we press enter, we can confirm that we've got the right result, which is a total of 25 rows within our table. For the next stat, we only need a count of the blank values. So we'll use the COUNTBLANK function, and then we'll get a count for the number of errors in this column. I know I've got one error in this column, so let's see if we can get that result. For this formula, we first wanna check if the value is an error. So we'll use the ISERROR function. And if the value is ERROR, we wanna increase the count by one, or else leave the count as is. We wanna then sum all the counts to get the total number of errors in this column. As we press enter, we can confirm that we have the expected result. Now let's move on to the column profile section and look at the formula for the distinct count. Here, we'll start with the COUNTIF function and provide the column reference for both arguments, and then close the brackets. Using an array as the criteria argument for this function tells Excel to loop through the range for each value in the criteria array, and calculate the corresponding count. Using F9 on this expression, we can see that the count for each value in the column has been returned in an array. From here, if we divide one by each of these values and sum all the results, we'll get the total distinct values in this column. However, please note that we have some zeroes in this array as well, which will probably result in a divide by zero error once we enter the formula. We should thus handle the such error before summing the results. Using Ctrl + Z to get back to the expression, we can continue with the formula. We'll first divide one by this array result. And then check for the error. And in case we have an error, we want the value zero. And finally, we wanna sum up all the results. Pressing enter, we can see that there's 15 distinct values in this column. Using a similar approach, we can calculate the number of unique values in this column. Here, we'll use the same COUNTIF expression and then check for the value to be one. If it is one, you wanna increase the count by one. And if not, then we just wanna add another zero. Finally, we wanna just sum up all the results. For duplicate value count, we can subtract the number of unique values from the number of distinct values in this column. Next, for the not a number count. we can first determine a count for the numbers in this column, and then simply subtract that from the total count for this column. Here, we'll use the ISNUMBER function to check individual values in this column, which returns an array with Boolean values, which we can then use with a mathematical operator to convert to numeric values. We will then sum these results and subtract from the total count. For the zero value count, we can simply use the COUNTIFS function and check for zeros. Moving on to the data profile section, we can use the standard Excel functions for min, max, and average to get the corresponding values for this column. However, we probably wanna include some code to handle the error values within the column. For that, we'll start typing our standard function, and then use the IFERROR function within it. With this, we're telling Excel to replace any error values within this column with a blank value. We can use similar functions for the max value and the average value. Now before we author the formula for invalid dates, it is important to note that date and time values in Excel are stored as numbers, and our formulas will leverage this property. So for invalid dates, we're gonna count the number of values in this column that are either not a number, or that are a number less than zero. To do this, we'll first use the filter function for the values in this column that are not a number, or that are less than zero. It is important note that we've used a plus sign for the or operator in this expression. This is because we're dealing with array of results in this expression. And for a race, the plus sign acts as an or operator and the multiplication sign or star acts as the and operator. Now let's press enter on this expression. As you can see, we've got a divide by zero error here. That's because our age column does not have any dates within it. Thus, there's zero rows for the filter. To handle this better, we will use the IFERROR function around the FILTER function. And if there is an error, you wanna return zero. Now we've got one more thing left. That is to count the number of rows returned by the FILTER function. We'll use the ROWS function for that purpose. If we now select the date of birth column for our analysis, we'll see that our formula returns a single invalid date. And that matches our expected result. Continuing with the other date-related profile, we'll check for the number of dates that are in the future. To do this, we'll use the COUNTIFS function for the values in our formula and check for values where they're greater than today. For the criteria argument of this expression, we're concatenating a string with the greater than sign. With the greater than sign and then today's date. Excel uses this expression to qualify each value in our column to then return the final count that matches our criteria. Now before we look into the visuals for the top 10 distinct values and the top 10 duplicate values, let's change the column to analyze to our name column, and then look at a couple of defined names that we'll be using for this exercise. The first one we'll be looking at is DQD_arrValueCounts. Here, we'll generate a two-dimensional array with the distinct values in the first column and then the counts for those values in the second column of the two-dimensional array. To start with, we'll use the COUNTIFS function to get a list of counts for the values in this column. Next, we'll add the distinct values as a column to this array. For this, we'll use the CHOOSE function where we'll provide an array argument for the index numb argument. And then the column reference itself for the value one argument, and then leave the COUNTIF function as the value two argument Here, we're telling Excel to choose the first and second value from the list of values that follow the first argument. If you remember, commas within array constant are used to separate individual fields within the same row. Thus, with this array constant as the first argument, the return dataset will have the value and its count in the same row of the two-dimensional array. As you can see, we have some duplicate values in our dataset. To remove those, we'll use the UNIQUE function and provide our array formula as an argument. We're then able to copy this formula and provide it as the reference within our DQD_arrValueCounts define name. The other define name we wanna look at was the DQD_arrDuplicateValues. Here, we wanna return a two-dimensional array with the values that occur two or more times in the selected column and the count for those values. For this, we can just filter our value counts array that we just defined, where the count, which is the second column in the same two-dimensional array is greater than one. Now let's use these define names to generate the data for our visuals. To start with, we'll sort the value counts array based on the count column, which is the second index, and in descending order. Next, we wanna get the top 10 values from this list. So we'll use the INDEX function. And for the row number argument, we'll provide the first 10 indices. For the column argument, we need the first column, which is the value column, and the second column, which is the count column. Similarly, we'll enter the formula for the duplicate values dataset by replacing the array reference to the duplicate values array. As we press enter, we'll notice that some of the cells here returned an error. That's because we don't have enough values to provide the top 10 from this array. To handle this error, we'll again, use the IFERROR function. And in case of an error, we just want it to return a blank value. Now that we have our datasets ready, let's work on the charts above. We'll start with the distinct values chart and click on the left most header cell for our dataset. With this cell selected, we'll go under the insert tab on the ribbon bar, and then select the chart type we wanna use. Because we have text values for our category axes, typically a bar chart is suitable for such cases. We can now drag the chart to align it with the header above and press the Alt key while we're doing this to snap it to the cell edges. We'll then drag the bottom right edge of this chart to resize it to our desired space. We'll again, press the Alt key to snap it to the cell edges. With our chart resized, let's work on some of the formatting options. To start with, we'll select the value axis below, and click on the plus sign in the top right of the chart. Here, we'll uncheck the chart title and the gridline options. We'll also go under the axes sub menu by clicking on the black arrow and uncheck the primary horizontal axis. We can click on the plus sign again to hide the options menu. Next, we'll right-click on one of the bars in our series and select the format data series option. We will then go to the fill in line tab and select solid fill to get the orange shade. Next, we'll go to the chart elements dropdown at the top, and select the vertical category axis option. Here, we'll go to the axes options tab and then select the categories in reverse order checkbox. This will sort our chart data in the same order as the spilled range below. Now let's try to replicate the same process with the duplicate value. With the leftmost header cell selected again, we'll go under the insert tab on the ribbon bar. And then this time, we'll select the clustered column chart. As you can see, Excel failed to pick up the right chart data this time. This is due to the empty cells within our dataset. Now let's try to fix it. So we'll click on the chart, go under chart design, and then click on select data. Here, we'll notice that the chart data range shows the correct range selected for our dataset. However, the series and category axis labels aren't configured correctly. Let's work on the series first. This is where our count data should show up. So the series name should be count and the series values should be all numbers within this column. Moving on to the category axis labels. This should have all the values within the value column of our dataset. Accepting the selections, and then finally clicking okay on the select data source dialog box. We can see a more decent chart now. However, I think we can do better. However, before we proceed with the formatting, let's resize this chart and put it in the allotted space. We'll use the Alt button again to snap the edges of the chart with the edges of the cells and work to get it in the same size as the distinct values chart to the left of this. Now let's carry on with the formatting options. Again, we'll click on the category axis and then the plus sign. We'll uncheck the chart title and the gridlines. And then under the axes sub menu, we'll also disable the vertical axis. We'll click on the plus sign again to hide the context menu and then right-click on any bar in the series. Select format data series, go under fill and line tab, and then select solid fill option. From the colors, we'll select the orange color again. And then move on to the category axis from the chart elements dropdown. Here, we wanna have the individual category labels more readable. So we'll go under the text options and then the text box tab. And then we'll add a custom angle of minus 60 degrees. Next, let's fix some of the issues with the data and column profile cells. As you can see, our average value currently shows a divided by zero error. This is because for the name column, it cannot find values to calculate an average for. We'll fix that again, using the IFERROR function. And if there is an error for this formula, we need the value to be returned as zero. We'll also set center alignment for the column quality cells from the home tab. In addition to that, we should probably highlight some of these values in case they are greater than zero. We'll do that for the empty and error counts, for the duplicates, not a number, and then invalid dates, and dates in future. With the cell selected, we'll go under the home tab on the ribbon bar, expand the conditional formatting options, and then the highlight cell rules, and then select the greater than option Here, we wanna show a red formatting in case the value is greater than zero. As you can see, the selected cells with values greater than zero are now highlighted. Now to wrap up this demo, let's get to the format check section to the right. In this section, we'll use Excel wildcards to compare the values in our selected column to unexpected format and list some of the corresponding results below that. As you can see on the screen right now, Excel allows a couple of wild card options. A question mark to look for a single character, and an asterisks to look for a number of characters. Both these wildcard characters can be prefixed with the tilde symbol to search for the character itself. Now before we author any formulas for this section, I'll change the name of our selected column to phone, and then review the format we expect for the values within this column. You'll also notice in the top left that I've already defined the name for the cell containing this format. Now for the first formula in this section, we need to calculate the total number of values in this column. You'll probably remember that we've already calculated this value. So here, we can just reference the previous cell, press F4 to change this to an absolute reference, and then press enter. For the next formula, we'll use the COUNTIFS function to search the selected column for the expected format that we've listed above. Please note that we've used the COUNTIFS function because it inherently handles the wildcard searches. Now for the last part of this demo, we've got a spill range here that displays the first five values from our selected column that do not match the expected format. In addition to that, in the first column, it displays the row number in that original dataset for this specific value. As you can see, I've already authored a formula for the spill range. We'll try to replicate this formula in the cell below. To start with, we wanna generate a two-dimensional array with the first column being the row number for each value in our selected column, and the second column being the value itself. To do this, we'll use the CHOOSE function and a similar approach that we used previously on this dashboard. For the second argument of the CHOOSE function we'll use the SEQUENCE function to provide us an incremental row number for each value that we pick up from the third argument. Here, we'll provide the sequence function with an argument for the total number of rows in our selected column. And now for the third argument, we'll provide the column reference itself. Next, we wanna filter this array for the values that do not match our expected format. We'll use the FILTER function. And for the include argument, we'll use a combination of the ISERROR and the SEARCH functions. The SEARCH function also handles wildcards and returns an error if the search text is not found. So we'll provide the expected format as the fine text argument, and then we'll provide our column reference or the within text argument, and then we'll provide the start number to be one. To finish the FILTER function, we'll provide the IFEMPTY argument to be none found. As you can see, the formula now provides the values that do not match the expected format and the corresponding row numbers for these values. However, we wanna limit this array of values to a total of five rows. So we'll use an approach similar to what we did for the duplicate values dataset. And we'll use the IFERROR to handle the errors where the total number of rows returned is less than the maximum rows we expect, and then use the INDEX function for the current array formula. For the row numbers, we'll use a sequence with a five argument because we need a total of five rows. And then for the column argument, we'll use one comma two to get the row number and the value columns. We'll complete the INDEX function. And then in the next line, we'll provide a blank value for the IFERROR argument. As you can see from the blue border now, our spill range includes five rows. Okay, so with that, we have now completed a demo for the data quality dashboard. We hope you enjoyed it and that you'll be able to leverage these skills within your next Excel model.