2: Importing and transforming data using Power Query Editor
2.7 Rename columns and set data types
2: Importing and transforming data using Power Query Editor
2.7 Rename columns and set data types - Video Tutorials & Practice Problems
Video duration:
15m
Play a video:
<v Instructor>In this sub lesson,</v> we're going to explore data types, we'll discuss best practices when choosing data types, we'll discuss some best practices when renaming columns, and finally, we're gonna demo choosing data types and renaming columns. So you may have a query that looks something like this as you extract data from a source system. Maybe it has five columns and the columns have names that are kind of meaningless. So in this case here we have column one, column two, three, four, and five. And if we were to deliver this query to the data model nobody would know what these columns mean. So we wanna be as descriptive as we possibly can on these columns, because ultimately it's gonna help support self-service analytics. So I'm gonna choose a good column name for each of these. We also wanna make sure that we choose an appropriate data type for each one of these columns. So for example maybe column number one represents categories. So we're gonna go ahead and give that column name, category and then maybe the data types should be text. So we're gonna go and make sure that the data type for that particular column is set to text. So let's quickly go through the data types that are available in power query. So to start with, there is a text a data type which is a Unicode character string. We have the Any data type, which is the status given to a column that doesn't have an explicit data type definition. And one thing that we wanna do when we're going through a query is make sure that any columns of the any column data type are given an explicit data type. So it's up to you as the developer to go through and make sure that all of your columns get a proper data type. There's a true False data type, there's a Date data type, which represents only the date portion of the date with no time. There's a Time data type which represents only the time of day with no date portion and there's a date time data type was represents both. Now as a best practice while we're on the slide here, if you have analysis that requires both date and time for analysis, it's recommended to split this into two separate columns. And once again, that all depends on the type of analysis you're doing. If you're doing analysis that does not care about the time portion of the date, and you have that, go ahead and just get rid of it chop the time portion off and use the Date data type. But as time insignificance to then go ahead and keep it. So for example, if you were doing analysis on grocery stores and you want to understand traffic patterns across the time of day, time would be significant because we wanna know where we're getting more traffic in the morning, the afternoon, certain times of day, in that case, the time is significant. But as a best practice, we try to avoid the date time data type. So be explicit about which one of those two types that we want to go ahead and choose. There's the decimal number which represents a 64-bit eight-byte floating point number. It's the most common number type and corresponds to numbers as usually think of them. There's a fixed decimal number, also known as the currency type. This data type has a fixed location for the decimal separator. The decimal separator always has four digits to the right and allows for 19 digits of significance. There is a Whole Number data type which represents a 64-bit eight-byte integer value because it's an integer, it has no digits to the right of the decimal place. Other data types, just to make note of are the date, time, timezone, data type, the duration and percentage and the binary data type. If you'd like more details on these data types, you can go to the website listed at the bottom here and I'll go ahead and pull it up. And this will go into more in-depth discussion into the data types inside a power query. But for now the discussion that we've gone through in the first few slides is good enough to get you moving. Let's just briefly discuss some data type best practices. As a best practice, we wanna give columns a proper data type as soon as possible. And that is because when we actually get columns of information from our sources into power query, the type of data that column contains will influence all the different transformations that you can do from that point forward. So it's a best practice to choose a data type as quickly as possible. And that's largely one of the reasons why the power query editor will do a blanket determination of data types against your sources. We'll show you what that looks like as we move into the sublesson. We select the column data types based on how the column will be used and we also wanna consider storage considerations. We won't dive into the storage considerations too much right now because that's a little bit of an intermediate to advance the whole topic, but these are the things that are generally used to influence the type of data that you want to choose for a call. And the same reason that we wanna give a data type as early as possible we wanna actually go through and rename columns as quickly as we possibly can as well. And three of the reasons why we renamed columns are the following. They're used for better readability and self-service, so if I have a column that I'm expecting an end user to be able to immediately understand choosing a column name that makes sense and is descriptive is a good choice. So we wanna do it for readability and self-service. We'll also find that when we get into the reporting section choosing good column names is gonna encourage and help better results in the Q & A functionality. The last reason we wanna do it, is it just promotes consistency in the data model. So for example, let's just suppose that we have two columns that exist in two different tables and they both represent regions. We are better off in power query to go ahead and standardize the names of those columns across both tables and call them region. So we'll show you some examples as we move through here where we violate some of those best practices and standards, but where if we actually follow those best practices, we're gonna find that our power query edit experience is a whole lot quicker, seamless and more efficient. So let's go ahead and get ourselves into the Power BI desktop so we can go through and do some demos. So in our collection of files from GitHub, we can see that we've got a file here called SalesDataMart-SL 02_07, let's go ahead and open that file up and then we'll get ready to do some work in it. So I've got my Power BI desktop file open. I'm gonna go to the transform data command here get the power query editor window open. And what I'm gonna do here first is go into my order date query to go ahead and make sure that the data types have been set properly. So if I go into my order date query and just go across the top and take a look at each one of these columns, I can see that they've all been given the any data type designation. And the way I can tell that is at the very top, it says, ABC one, two, three and that is the symbol that represents or the icon that actually represents the, any data types. So I'm just gonna go ahead and zoom into that so you can see it a little bit easier. There it says ABC123. So on my Order Date column, what I'm gonna do, is I'm just gonna go down and make that a data type of date, like so. And we can see now that the icon has changed to this icon here and now that is representing a date with no time component to it. But if I look to the right here I can see that I've only managed to change one of the actual column data types. So what I'm gonna do here is actually show you how to blanket change all of these at once. So I'm just gonna zoom back out and what I'm gonna do here is I'm actually just gonna get rid of this change type step. So this was the step that was just added from me, changing from the any data type for order date, to the actual dates. So, I'm just gonna go ahead and get rid of that. And the way that we can go ahead and change all of the columns at once is by clicking on the first column here, by clicking on the order date here and going Ctrl + A which is gonna highlight all of the columns. Now, what we can do is to go up into the Transform tab here and what we can do in the any column group is there's this command here called Detect Data Type. So if I click that, it's gonna go through for every one of the columns in this particular query and go through and detect the data type that it figures should be chosen. So, we can see here in the order date, it shows date, so we're happy with that. Going across all these columns just doing a quick scan, it appears as though the data types have been set properly. The only one that I actually wanna change is this one here called Month Year. So if I was to go back into my Excel spreadsheet which I'll go ahead and bring up here right now, I can see in the ordered Excel spreadsheet, if I had to scroll to the right, I've got a column down here called Month Year. And my intention for that particular column was to have the first three letters being the abbreviation of the month, followed by a dash followed by the year. So the power crater was actually being a little friendly for me and as she attempted to change that into a date type in fact, it didn't just attempt it and actually did it but I really want the data format of like this. So I wanna force this value to be a text or the data type for that particular column. So I'm gonna go back into the Power BI desktop, into the power creator and I can see that the data type here has been set to Date which is what happened for us automatically when he did this blank detect data type but what I wanna do here is change this to a texted types and it clicked the icon next to Month Year, say text, I'm gonna get a dialogue box that pops up and says, do you want to replace the existing type conversion that was done in that blanket data type detect? Or do you wanna add a brand new step the applied steps? And what we're gonna do here is to say replace current. So I'm basically overriding what Power BI did for me automatically and replacing with text value. And now we can see that the Month Year column has the proper data type on it and the column is formatted the way that we're gonna wanna see it when it comes to report. So, carrying on with our data type selections. What I'm gonna do here is I wanna move to the United States query here and I go through all the columns here and I see that product key is a text value, which looks okay to me, order date is a date value, sales territory regions a text or to quantity is a whole number, which looks good to me and unit price is showing up as a text. So this is something that I wanna fix, but I'm gonna purposely leave this data type alone for now so that we can show you in a subsequent sublesson the ramifications of not doing this in this location and some of the benefits we can get by actually getting this data set properly. So we're gonna purposely leave this unit price as a text value. So next, let's go down to our product query and go through and take a look at each one of the data types. And what we're gonna do here is take a look at the product key that looks acceptable, sub category key is showing up as the Any data type, so I'm gonna go ahead and click on that and I'm gonna change this to text, like so, and I'm gonna go across my product description and once again, it's showing up as the Any data type and I'm gonna also change that to a text value Scrolling a little bit more to the right or going to the next column here, we can take a look at standard cost and this one is showing up as a text value, but I wanna make this one a fixed decimal and so I'm gonna go in here to fix decimal and Save and Replace Current like so, and now the standard cost has been given the data type of fixed decimal place. And lastly, let's go into our other countries query down below here, and what I wanna do, is I'm gonna go find my other ERP unit column and let's move over here, let's go ahead and find it at the very end here, we can see that the decimal, oh sorry, the data type was set to decimal number, we are actually gonna go ahead and change this to fix decimal. So I'll also choose replace current because Power BI did an automatic datatype detection for us and we're just gonna say, go ahead and replace what was currently there. Now keep in mind the air that I'm gonna try and trip here is later on in a subsequent sublesson we're going to upend the other country sales and the United States sales together. And keeping a mind here, that we actually now have two different data types on the unit price columns across both these queries. So I purposefully left them broken so we can show you some of the benefits of naming standardization and data type standardization early on in the process. So now that we've got all the data types set the way we want, now I'm gonna go through and do some column renaming. So sticking with the other country sales query, which I'm on right now, what we're going to do is go over to the column here called product alternate key and we're gonna put spaces between each one of these words here. So the number of different ways to rename a column, you can right click on the column header and go down a little bit lower and click Rename. So let's go ahead and do that in the first one. So I'm gonna just go ahead and put spaces in between each one of these words here. So it says product, space, alternate key with a space between all those and go ahead and hit it Enter, like so. And next, what I'm gonna do is I'm gonna go through and all of these columns have the word or the rest of the columns of other ERP in front of the call names. I wanna go through and get rid of other ERP across all these columns. So, what I'm gonna do this time instead of going into the dialog box, I'm just gonna double click on the column header itself and get rid of other ERP, go order space, date, hit, Enter and then I'm gonna go through and rename the rest of the columns following that technique. So what I'm gonna do here is let us scroll my data preview to the right a little bit and the rest of the columns here that have other ERP infront of them, I'm just gonna go ahead and change all those and get rid of other ERP in front of each one of these things and then put a space between all the words in these columns. So I'm just gonna go ahead and do that. So you can see that I've gone through at least the ones on the screen. You can see, have gone through and remove the other ERP prefix and put spaces in between each one of the words and gone through and done that in all of the columns themselves. If I want in my formula bar, I can actually drop this down here and we can see that here's the M code that is showing all of the different column renames. So the original name, and then the brand new name on the right hand side of this particular function here. You'll also notice given that I grouped all these renames together, it was able to actually put it all inside a one applied steps. So this is really nice reason for kind of doing this all at once and getting things standardized as quickly as possible but it wraps all into one step and doesn't scatter renamings all through your applied steps. So the one last column rename that we wanna point out is under product loads. So, I'm going over the product level query and we can see here that we have a column called product subcategory key with no spaces in between here. Now, ideally like to call us one product key with a space between, but I'm gonna leave this rename until a subsequent sublesson so I can once again, show you the value of coming through and giving your columns consistent names across your data model which really promotes ease in the data transmission steps within the power query editor. So with that, we're done talking about renaming columns and setting data types.