2.4 Store data in vectors - Video Tutorials & Practice Problems
Video duration: 16m
A different thing about R, but one that makes it so easy to use, especially for statisticians, is vectorization. A vector in R is similar to an array in other languages. It is a container that holds multiple elements all of the same type. So, to illustrate, let's create a new variable, x, that contains all of the first 10 numbers. It is assigned the same way as always, with the arrow symbol, but the vector is created with the c function. Just a lowercase c. And each element is inserted one at a time, separated by commas. So, for instance, one, two, three, and so on. So if we run that, we now have a variable that has each of the first 10 numbers. Vector creation is as simple as that. Just creating these vectors wouldn't be so amazing if R didn't operate on them so nicely. Let's say we want to multiply each element of x by the same number, say three. Doing that in many other languages would require writing a loop. But in R, we simply multiply x times three and we get three, six, nine, 12, 15, et cetera, et cetera. Each of the numbers was automatically multiplied by three. Similarly with addition. If we do x plus two, we get three, four, five, six, seven and so on. It also works in the same exact way with subtraction. x minus three, and we get negative two, negative one, zero. So forth and so on. As would be expected, division works as well. x divided by four: each element was divided by four. Exponentiation: So if we square everything, we get one, four, nine, 16, 25. And its opposite: square root. Being able to operate on each element of a vector all at the same time really makes R much easier to work with. Now, sometimes, you may want to create a vector and not be as verbose as typing in one, two, three, four, five. If they're all in a sequence, you can use the colon operator. For instance, to get the first 10 integers, you do 1:10, and they are all automatically created. This also works backwards. 10:1 gives you 10, nine, eight, seven, six. And...
it works over zero, so -2:3 gets negative two, negative one, zero, one, two, three. And, of course, backwards with a negative: 5:-7... and it works just fine. So now that we have multiple ways to create a vector, let's see what we can do with two vectors. So let's reassign x to be one through 10, and y to be negative five through four. We now have these two nice vectors, which are both the same length. They both have 10 elements in them. Let's add them together: x plus y gives us the element by element addition of those two vectors. It works the same way for subtraction, multiplication, and division. Element by element operation. It's even possible to raise, on an element by element basis, each element of x to the power of its corresponding element in y. So x to the y gives us one to the negative fifth, two to the negative fourth, three to the negative third, and so on. There are a number of helper functions to go along with vectors. For instance, let's say we want to see how long a vector is. We can do that by checking length of x and it tells us it has 10 elements. Much like y also has 10 elements. And we can also check what's the length of x plus y. Since it's just element by element addition, it should still have a length of 10. Now, interesting things happen when you want to add two vectors that have different lengths. For instance, x, which has a length of 10, and a vector we will create of just one and two. So what happens here is that the one in x gets added to the one in the second vector, the two in x gets added to the two in the second vector. Now it recycles. The three in x gets added to the first element in our vector. And so what happens is this vector, alternately, gets added back and forth to each element of x until they are all used up. Now in this case, the length of the longer vector was a multiple of the length of the shorter vector. It still works when...
you have vectors whose lengths are not a multiple of each other, but there's just a warning letting you know that the longer object length was not a multiple of the shorter object length. It still would have worked, however. You could also do element by element logical comparisons of a vector. So, for instance, x is less than or equal to five. Being that x is one through 10, remember, the first five are, indeed, less than or equal to five, and starting at six and onward, they are not less than or equal to five. You could also compare each element of a vector to another vector. So, for instance, x is greater than y. Across the board, that's all true. And, as a reminder, x is one through 10 and y is negative five through four. Checking the opposite, we see y is greater than x, and we get it's all false. We could alternately have said x is less than y and gotten the same result. Let's reassign x and y to different values and see how we can compare them then. Let's make x be 10 through one and y be negative four through five. So let's run both of these and then check is x less than y. We can see that most of them are false, but some of them are true. Now picking out which ones were true could be difficult. So, fortunately, there's a helper function, called any, which lets you see whether any of these are true. So we see here that yes, at least one of these comparisons was true. If you want a check of all of them, use the all function. Are all of these elements true? No, they are not. Now, vectors can hold all sorts of information. But remember, a vector has to be all of the same type. So it could all be numeric, or it could all be character, or it could all be dates. So now, let's assign a vector of sports to the variable q. Do q... gets c. We'll say hockey... football. Notice that each element in here has its own open and closing quote marks. Baseball... curling... Rugby. And I'll go back and make sure that's a capital R. Lacrosse... basketball... Tennis... cricket... and Soccer.
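The steps narrated so far (creating vectors, vectorized arithmetic, recycling, and element by element comparisons) can be sketched roughly as follows. The result names such as tripled and recycled are not from the video; they are added here only so the results can be inspected.

```r
# Create a vector with c(), or with the colon operator for a sequence
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
10:1      # counts backwards
-2:3      # crosses zero
5:-7      # backwards into negative numbers

# Arithmetic is applied to every element, no loop needed
tripled <- x * 3      # 3 6 9 12 15 ...
x + 2
x - 3
x / 4
x^2
sqrt(x)

# Two vectors of the same length are combined element by element
x <- 1:10
y <- -5:4
x + y
x^y
length(x + y)         # 10

# A shorter vector is recycled along the longer one
recycled <- x + c(1, 2)   # 2 4 4 6 6 8 8 10 10 12
# x + c(1, 2, 3) also returns a result, with a warning,
# because 10 is not a multiple of 3

# Element by element logical comparisons
x <= 5
x > y

# any() and all() summarize a logical vector
x <- 10:1
y <- -4:5
any_less <- any(x < y)    # TRUE
all_less <- all(x < y)    # FALSE
```

Only the last two elements of x are smaller than the matching elements of y here, which is why any is true while all is false.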
Now, you'll see that this sort of ran off my screen here. So what we could do is we could break up the vector on two lines, and as long as we run them both by highlighting them, they will run just fine. And we can check that by running q. And it gets displayed nicely. And we can see here it tells you the first element is hockey, then it goes through and the seventh element is basketball. So you sort of get a sense of which value in the vector is which element. As we saw before with square root, a number of functions are vectorized. So for characters, the nchar function is vectorized. Type in nchar of q. And we see that hockey has six letters, football has eight, baseball has eight, curling has seven, and so on. Now, it is important to note that every variable in R is technically a vector. Even if there's only one element in it. So, for instance, let's say F gets seven. If we run this, we still get that little one in brackets. Because even though it's a scalar, it's just one number, one integer, it is stored as a vector of length one. And that's an important distinction in R. So let's say we go back to x, which right now has the numbers 10 through one, and let's say we just want to grab one element out of there. Let's say we want to grab the first one. We use the square brackets to get at it. So you do x, square brackets, one. Now, in some other programming languages, the first element might be at position zero. In R, the first element is at position one. It's more natural. So if we run this, we get 10, which was the first element. If we want the first two elements, we can say give me one through two. Basically, the square brackets take a vector that indicates which positions we want to grab. In this case, we get 10 and nine. If we want two nonconsecutive elements, we use a vector, but just have to create it more manually. For instance, x... I want elements one and four. We create that vector using c, and we get 10 and seven. Now, vectors can have names.
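As a rough sketch of the character vector, nchar, and square bracket steps described above: the capitalization of the sport names is an assumption (the captions are inconsistent about it), and the lowercase name f is used here instead of F so that R's built-in shorthand for FALSE is not masked.

```r
# A character vector can be split across lines inside c()
q <- c("Hockey", "Football", "Baseball", "Curling", "Rugby",
       "Lacrosse", "Basketball", "Tennis", "Cricket", "Soccer")
q

# nchar() is vectorized: one count per element
letters_per_sport <- nchar(q)   # 6 8 8 7 5 8 10 6 7 6

# A single value is still a vector, of length one
f <- 7
length(f)       # 1
is.vector(f)    # TRUE

# Square brackets take a vector of positions, starting at 1
x <- 10:1
first <- x[1]             # 10
first_two <- x[1:2]       # 10 9
picked <- x[c(1, 4)]      # 10 7
```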
So let's start with something simple just by creating a vector of three elements: a, y, and r, and we'll call the names one, two, and last. We create it using c. We can say One... gets a. Two... gets y. And we'll say Last gets r. Notice I did not use the arrow assignment. Because inside a function, and c is a function, the arguments are given values using an equals sign, not the arrow. It's a quirk of the language, and something you have to get used to. Let's say we run this, and we can see our vector now has names. So that makes it easy to identify things. We can also add the names after the fact. So let's say we create a vector called w of just the first three numbers. We'll go ahead and take a look at that just to confirm. And we want to give this names. We can do that with the names function. So... names... of w... gets... And now remember, w has three elements, so its names is a three element vector. So this gets, and we'll say just a, b, and c. a... b... and c. And now when we run w, a is one, b is two, c is three. Due to some mathematical reasons, which become more apparent when we start modeling, R also can store characters as factors. So to illustrate this, let's go ahead and create a new vector that consists of q and a few more elements. So we will do q2 gets c... q... and let's say the first one's just hockey. What this does is it combines the whole vector q and now some new elements, such as hockey, lacrosse again, another hockey, I'll create a new one, water polo, and hockey and lacrosse one more time. So when we run this, and we can look at it, it's just a longer vector with a number of repeats in there. Those repeats are important because we are going to create a new variable called q2Factor. And we're just going to make that the factor version of q2. When we look at that, and again I'll use tab completion to save typing, it prints out all of the values in the vector, but it also has something here called levels.
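The naming and factor steps above might look roughly like this. The variable name named_vec is hypothetical (the video does not say what the first named vector is stored as), and the capitalized sport names are again an assumption.

```r
# Names supplied inside c() use an equals sign, not the arrow
named_vec <- c(One = "a", Two = "y", Last = "r")
named_vec

# Names can also be attached after the vector exists
w <- 1:3
names(w) <- c("a", "b", "c")
w

# Combine q with some repeated and new values, then convert to a factor
q <- c("Hockey", "Football", "Baseball", "Curling", "Rugby",
       "Lacrosse", "Basketball", "Tennis", "Cricket", "Soccer")
q2 <- c(q, "Hockey", "Lacrosse", "Hockey", "Water Polo",
        "Hockey", "Lacrosse")
q2Factor <- factor(q2)
q2Factor            # prints the values followed by the levels
levels(q2Factor)    # each unique value appears once
```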
A level is each unique value of the vector. For instance, hockey was in here one, two, three, four times, but it only shows up in the levels once. It's very important that the levels are the individual labels for the values. If we were to look at this factor as a numeric, we would see that each unique value is assigned an integer, and that's how it stores the data, and it only prints out these nice labels on top. So if we do as.numeric... of q2Factor, we see that hockey is six, football is five, later we get to hockey again, where it's six, and again, it's six. Lacrosse is in there a few times as seven. And what it did was it took all the values of the vector, got just the unique levels of it, and put them in alphabetical order. So baseball is number one, basketball is number two, cricket is three. And then it just assigns the numbers in place as needed. Being we're dealing with the real world, there are often times when data is missing. Dealing with missing data can be a bit difficult, but it is very natural in R. So there are two types of non-information in R: NA and NULL, and they are very different. NA represents missing data. NULL represents the absence of anything. So let's say we have a vector, z, and it takes on the values one, two, NA, eight, three, NA, and three. Looking at that, it prints the NAs. That's because missing data is very important in statistics. A missing data point doesn't mean it's invalid, it means it wasn't answered and care must be taken. You can check if elements are NA by doing is.na. That tells us the first two were not NA, but the third one was, indeed, as was the second to last one. Now, NA is a special value, it's not a character. And this is illustrated by making a character vector. So, zchar... gets... c... say, hockey, NA, and lacrosse. We'll run it. Notice NA does not have quotes. If we do is.na... of zchar, we see hockey's not NA, lacrosse is not NA, but NA is. This is very different than NULL.
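A minimal sketch of the integer codes behind a factor and of the NA checks described above, assuming capitalized sport names so that the alphabetical ordering of the levels is unambiguous. The name codes is hypothetical.

```r
q <- c("Hockey", "Football", "Baseball", "Curling", "Rugby",
       "Lacrosse", "Basketball", "Tennis", "Cricket", "Soccer")
q2 <- c(q, "Hockey", "Lacrosse", "Hockey", "Water Polo",
        "Hockey", "Lacrosse")
q2Factor <- factor(q2)

# Each level is stored as an integer, assigned in alphabetical order
codes <- as.numeric(q2Factor)
codes[1:3]          # Hockey, Football, Baseball: 6 5 1

# NA marks a missing value and is kept in the vector
z <- c(1, 2, NA, 8, 3, NA, 3)
z
is.na(z)            # FALSE FALSE TRUE FALSE FALSE TRUE FALSE

# NA is written without quotes, even in a character vector
zchar <- c("Hockey", NA, "Lacrosse")
is.na(zchar)        # FALSE TRUE FALSE
```

Note that the function is spelled is.na in lowercase; R is case sensitive.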
Because NULL is the absence of anything. If we overwrite z to be one... NULL... and three, the NULL will actually just be ignored, and it's just one and three. We could, however, assign NULL to d. Then we type in d, it comes up with NULL, and is.null... returns true. So missing data is very important, and there are ways to handle missing data because you can't just ignore it, you need to take it into account in your analysis. So that is the basics of vectors, a very important concept in R. It's important to remember that all data is stored as a vector, even if it's just a vector of one element. And running operations on vectors is highly efficient in R. Calling the function once and having it apply to every single element is much quicker in R than building a loop and applying that function to numerous elements in that vector. So always take advantage of the vectorization in R as much as possible.
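To close, here is a small sketch of the NULL behavior and of the vectorized call versus loop comparison mentioned above. The names roots_vectorized and roots_loop are hypothetical, and no timings are shown; the point is only that the single vectorized call produces the same values as the explicit loop.

```r
# NULL is dropped when placed inside a vector
z <- c(1, NULL, 3)
z               # 1 3
length(z)       # 2

# NULL can be assigned to a variable and tested with is.null()
d <- NULL
d
is.null(d)      # TRUE

# One vectorized call ...
x <- 1:10
roots_vectorized <- sqrt(x)

# ... versus the equivalent explicit loop
roots_loop <- numeric(length(x))
for (i in seq_along(x)) {
  roots_loop[i] <- sqrt(x[i])
}

# Both approaches give the same values
same_values <- isTRUE(all.equal(roots_vectorized, roots_loop))
```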