
Genetics

Learn the toughest concepts covered in biology with step-by-step video tutorials and practice problems by world-class tutors

15. Genomes and Genomics

Sequencing the Genome

1

concept

Sequencing Overview

8m
Hi, in this video we're going to be talking about sequencing the genome. Before we can study the genome and what the functions of all these different pieces of DNA are, we have to know its sequence. So I'm going to give a brief overview of how sequencing works. Obviously there are different techniques with different minute details, but this is just a general overview.
Sequencing genomes uses a few main steps that are common to many of the different sequencing methods. You start with genomic DNA: you have a bunch of DNA and you want to sequence it. The first thing you have to do is process it, and how you process it is that you chop it up into a bunch of pieces. These pieces have to be random, and they have to be overlapping, which means that not every piece is unique: some pieces share part of their sequence with other pieces. So if you have this little piece of DNA, another piece will overlap it here, maybe another piece will overlap here, and something else will overlap here, here, and here. All these different pieces have to be overlapping, and we'll figure out why in a minute.
But first, how do you chop up the DNA? You can chop it up using a special type of protein called a restriction enzyme. These are proteins that cut DNA. There are a bunch of them, and each one has a specific sequence or two where it will cut the DNA. So you can use combinations of restriction enzymes to get these overlapping segments and chop the entire DNA into small fragments. These fragments are given a special name: a read. Reads can vary in length depending on how you chop up the DNA and which restriction enzymes you use, but generally they're between 100 and 5,000 base pairs long.
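The chopping step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `digest` helper and the `toy_genome` string are invented for this example, and the two recognition sites used (GAATTC for EcoRI, GGATCC for BamHI) stand in for "a combination of restriction enzymes". Cutting two copies of the same DNA with different enzymes gives two sets of fragments whose boundaries fall in different places, so fragments from one set overlap fragments from the other.

```python
def digest(dna, site, cut_offset):
    """Cut `dna` at every occurrence of `site`.

    `cut_offset` is how many bases into the site the cut falls.
    Returns the fragments in their original left-to-right order.
    """
    fragments = []
    start = 0
    pos = dna.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(dna[start:cut])
        start = cut
        pos = dna.find(site, pos + 1)
    fragments.append(dna[start:])
    return [f for f in fragments if f]


# Hypothetical toy genome, invented for this example.
toy_genome = "ATGCGAATTCTTAGGATCCGTACGAATTCAAGGATCCTT"

# Two copies of the same DNA, each cut with a different enzyme.
ecori_fragments = digest(toy_genome, "GAATTC", 1)  # EcoRI cuts G^AATTC
bamhi_fragments = digest(toy_genome, "GGATCC", 1)  # BamHI cuts G^GATCC

print(ecori_fragments)
print(bamhi_fragments)
```

Because the two enzymes cut at different positions, the middle EcoRI fragment spans the boundary between the first two BamHI fragments, which is the kind of overlap the assembly step relies on later.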
So reads are super important, and we'll talk about them more in a second, but those are the overlapping fragments. Here we have DNA, in blue and pink, and you can see there's a sequence here. This is a restriction enzyme that comes in and chops here and chops here. So now we have these fragments of DNA, and they do have overlapping segments, which can be useful. But mainly what you need to know is that you generate fragments of DNA, you do this for the entire genome, and you generate millions of fragments, if not more.
So now you have all these fragments, and you have to sequence them. There are many different ways to do this; the one I'm going to talk about, which is mentioned in your book, is called pyrosequencing. What happens in pyrosequencing is that you take each read, attach it to a bead, and amplify it, which means you make multiple copies of that read. Having multiple copies means you have enough to detect a signal from it. If you have just one copy, whatever signal you're using is going to be really faint, so with multiple copies you can really amplify that signal.
The signal used in pyrosequencing is light. How this is done is that you have a machine, and like I said, you've attached the sequence to a bead, which sits on a plate in the machine. The machine takes each of the nucleotides, A, T, C, and G, and runs them individually, one at a time, across the plate where all your sequences are. Now, these are special nucleotides: when a nucleotide binds, it releases a molecule called pyrophosphate. When the pyrophosphate is released, it interacts with other chemicals in the reaction, and that converts it into a light signal.
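Here is a minimal sketch of that flow-and-detect idea, not a model of any real instrument. The `pyrosequence` function, the flow order `"ATCG"`, and the toy template are assumptions made for illustration. The sketch also ignores strand direction, and it assumes the light intensity equals the number of identical bases added in one flow, so a run of three Ts gives a peak of height 3 when A is flowed.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def pyrosequence(template, flow_order="ATCG", max_cycles=100):
    """Simulate flowing one nucleotide at a time over a template.

    Returns (flowgram, new_strand), where flowgram is a list of
    (nucleotide, light_intensity) pairs in the order they were flowed.
    """
    flowgram = []
    new_strand = []
    pos = 0  # next template position waiting for its partner base
    for _ in range(max_cycles):
        for nuc in flow_order:
            added = 0
            # The flowed nucleotide binds only where it is complementary.
            while pos < len(template) and COMPLEMENT[template[pos]] == nuc:
                added += 1
                pos += 1
            flowgram.append((nuc, added))  # added > 0 means a light signal
            new_strand.append(nuc * added)
        if pos >= len(template):
            break
    return flowgram, "".join(new_strand)


# Hypothetical template read, invented for this example.
flowgram, new_strand = pyrosequence("TTTCAG")

# The camera sees which nucleotide produced light, so the template is
# recovered as the complement of the bases that were incorporated.
decoded_template = "".join(COMPLEMENT[b] for b in new_strand)
print(flowgram)
print(decoded_template)
```

Flows with an intensity of 0 correspond to nucleotides that passed over the plate without binding, which is why only some flows produce a peak.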
So let's say you have a sequence and it's all Ts here. This isn't really going to happen, but if it did, and the machine puts in an A, the A will bind because it's complementary, and when it does it releases a molecule that gives off a light signal. Because you're doing this in a machine, there's a camera, and that camera detects the light signal. And because the nucleotides are passing by one at a time, the machine knows which nucleotide caused the light signal, and it can say, okay, this is the complementary sequence.
Here's an example of what a printout of this might look like. Each of these peaks represents a light signal. You can see there are lots of Gs. They've been running the nucleotides over and over, and I realize it's more complicated than I'm making it, which is why the x-axis isn't just A, T, C, and G. But generally you can see that right here, this G resulted in a light signal. That G is complementary to the actual sequence, so we know the sequence here is C. You can do this over and over again throughout the whole sequence, however long it is, 100 base pairs or 5,000 base pairs, and eventually the computer will spit out what the sequence is. Like I said, this is one way of doing it. There are a lot of different ways, such as shotgun sequencing and newer techniques, but this is the one that's highlighted in your book.
So once you know the sequence of each of these reads, and remember you probably have millions of them, you use computer software to overlap the sequences. Remember, when we originally designed the sequencing step, we made overlapping reads. So you use software to figure out where these overlapping sections are for every single read.
What the computer does is find those overlapping segments and say, okay, this is the sequence here, and this is the sequence on either side of it. The software continues to go through each segment until it finally connects all the overlapping segments. This is called sequence assembly: slowly taking each individual read, finding where it overlaps with the rest of the reads, and forming it all into one sequence, which is a consensus sequence.
Now, we may have gone over consensus sequences before. A consensus sequence is different from a conserved sequence. A conserved sequence is something that is exact between species, but a consensus sequence doesn't have to be exact. That's important, because if we're sequencing, for instance, the human genome, and we take my genetic material to sequence, that's not necessarily going to be completely representative of the human genome. Not everyone is a clone of me. Not everyone has blonde hair, or my eye color, or my height. So there are individual differences between the source of the genetic material, say me, and the genetic material of other members of the species. It has to be a consensus sequence because my sequence is close, but there might be single nucleotides that differ between me and you and other people. Those individual differences prevent a single sequence, my sequence, from truly representing the entire human genome.
Another thing that this requires is generally multiple reads of each base pair. For example, if you read that there's been tenfold coverage of the genome, that means each base pair is represented in about 10 individual reads on average, so about 10 individual fragments. That makes the number of reads even larger, because you have to have so many reads covering the whole genome. So here's what this looks like.
It's actually a lot more complicated than this, because you're dealing with millions or billions of reads, but this is essentially what it looks like. So let's see what this is. The red parts are things that we know, these are the overlapping regions, and the blue part is the unknown sequence. You get these red parts here and you find out where they overlap, and we can say, this represents this part of the genome. Now we have this whole sequence that we can compare to create a consensus sequence, so that when we finally get the full genome, which is this, we know that each base pair has been represented multiple times. We know that these overlapping segments are located in the proper locations, and that we can construct the entire genome from these reads.
So, like I said, that's an overview of sequencing. There are many different technologies that do this in slightly different ways, but that's generally how they all do it, even though some of the minor details might be different. So with that, let's now turn the page.
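To make the overlap-and-connect step and the idea of coverage concrete, here is a toy sketch. It is a simple greedy merge written for this example, not the algorithm used by real assembly software, which has to cope with sequencing errors, repeats, and millions of reads. The function names, the `min_len` threshold, and the three reads are all assumptions made for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def assemble(reads, min_len=3):
    """Greedily merge the pair of reads with the longest overlap until none is left."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no overlaps left; remaining reads stay separate
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


def coverage(sequence, reads):
    """Count how many reads cover each position of `sequence` (first exact match only)."""
    depth = [0] * len(sequence)
    for read in reads:
        start = sequence.find(read)
        if start != -1:
            for k in range(start, start + len(read)):
                depth[k] += 1
    return depth


# Hypothetical overlapping reads, invented for this example.
reads = ["ATGCGTAC", "GTACGTTA", "CGTTAGC"]
assembled = assemble(reads)
depth = coverage(assembled[0], reads)
print(assembled)
print(depth, sum(depth) / len(depth))
```

The average of `depth` plays the role of "fold coverage": with only three short reads it comes out well below 2, whereas a genome project would aim for every base pair to be covered many times over.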
2

concept

Traditional vs. Next-Gen

4m
Okay, so now let's talk about traditional versus next-generation sequencing. Of course, since the time that sequencing the genome became possible, many different methods have been developed over the years that have improved upon this technology. What your book calls traditional whole genome sequencing, or traditional WGS, requires the use of cells. It's traditional, so it's the earlier way the genome was sequenced. How this happens is that you generate DNA fragments, like we said before, and you put them into plasmids. Remember, plasmids are circular bacterial DNA. We give these plasmids a special name, vectors, because we're putting the genetic information into them and then putting them into bacteria, so they're vectors for the genetic material we're putting in.
So we generate the DNA fragments, we put them in vectors, these plasmids, and we put those into bacteria and grow up the bacteria. That's how you get multiple copies of that small read: as the bacteria replicate themselves, they replicate that DNA and make multiple copies of the fragment that you put into them. After you have enough bacteria, you have a ton of copies of the fragment. You can then extract that DNA back out of the bacteria and begin to read the sequence with whatever sequencing method you want to use, shotgun sequencing, pyrosequencing, whatever. So you get the reads, and then you again use computer software to overlap them and connect them, and in this case we call the results sequence contigs. They're called that because they're continuous (contiguous) sequences that the overlapping reads are arranged into. That was exactly like the picture I showed in the previous video of all those different reads being overlapped: that final sequence would be the contig.
Now, next-generation whole genome sequencing is very similar. We went over the basic sequencing steps, but this one does not use cells, so you don't need cells to amplify the DNA. Instead you use cell-free reactions, using various laboratory techniques, mainly PCR. If you're not familiar with PCR, don't worry about it, but if you are, PCR is a good way to amplify DNA, and then you can use sequencing software to determine the sequence. With the traditional method, you have to grow bacteria, and bacteria take up a lot of room. You have to grow a lot of them, and it's not very easy, if you have 10 billion reads, to grow 10 billion flasks of bacteria. Next-generation whole genome sequencing uses very small reaction volumes, and it's generally automated with the use of a robot, so you can potentially run billions of wells.
So this is an example of traditional whole genome sequencing. You start with DNA, this is the genome. You extract it, you fragment it, and you put it into these vectors. Remember, vectors are plasmids, circular bacterial DNA, and the green sequence here is the sequence you're interested in. You put them into bacteria; the bacteria grow, divide, replicate, and create many copies. You can isolate and extract the DNA, then you sequence the vector itself, and then you have a bunch of different fragments, represented by these arrows, which you overlap to determine the actual sequence.
So those are the two main types, traditional and next-gen. Traditional requires a lot more work, a lot more material, and growing in live cells, whereas next-gen is much more automated and can be done in a very small setting, with small reaction volumes in a machine, without cells. So with that, let's now move on.
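For a rough sense of why PCR is a good way to amplify DNA, here is a small arithmetic sketch. It rests on an assumption that is not spelled out in the video: in an idealized reaction every copy is duplicated once per cycle, so the number of copies doubles each cycle. The function names and the `efficiency` parameter are invented for this example, and real reactions fall short of perfect doubling.

```python
import math


def copies_after(cycles, starting_copies=1, efficiency=1.0):
    """Copies present after a number of cycles.

    efficiency=1.0 means perfect doubling each cycle (an idealization).
    """
    return starting_copies * (1 + efficiency) ** cycles


def cycles_needed(target_copies, starting_copies=1):
    """Smallest number of perfect-doubling cycles that reaches the target."""
    return math.ceil(math.log2(target_copies / starting_copies))


print(copies_after(30))          # one fragment after 30 ideal cycles
print(cycles_needed(1_000_000))  # cycles to reach a million copies
```

Under this idealization, a single fragment passes a million copies in 20 cycles and a billion in 30, all in one small reaction volume rather than in a flask of bacteria.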
3

concept

Sequencing Difficulties

8m
Hello everyone. In this lesson, we are going to be learning about the difficulties that come along with whole genome assembly. The entire genome can be particularly difficult to sequence because it has some characteristics that are difficult to track. For example, the majority of our genome, the majority of our DNA, is composed of repetitive sequences, for instance just AT AT AT AT for thousands of base pairs. They don't particularly code for anything, but if we're trying to sequence the entire genome, we need to know those sequences, where they belong, and how they align with the other sequences. This poses a problem because it's difficult to know where a repetitive sequence of DNA begins, where it ends, and where it overlaps with other repetitive sequences.
So these are some of the genome characteristics that make a genome very difficult to assemble. Repetitive DNA sequences are generally much longer than the known sequences of DNA, the reads, and they're generally much longer than the coding part of the genome. Repetitive sequences are very common in our genome, and that can make it difficult to determine where the overlaps begin, where they end, and where this entire giant string of As and Ts, or Gs and Cs, actually came from.
The way we combat this issue is with paired-end reads. Paired-end reads are a technique we use to put these repetitive sequences in the correct location and in the correct alignment. Alignment is very important, and I'll explain that in just a second. Paired-end reads are pairs of sequences that are read from opposite ends of a genomic insert.
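The problem with repeats can be shown with a short sketch. The `placements` helper and the toy genome below are made up for illustration: a read that includes unique flanking sequence fits in exactly one place, while a read that falls entirely inside an AT repeat fits equally well at many positions, so software cannot tell where it came from.

```python
def placements(read, genome):
    """Return every start position where `read` matches `genome` exactly."""
    hits = []
    start = genome.find(read)
    while start != -1:
        hits.append(start)
        start = genome.find(read, start + 1)
    return hits


# Hypothetical genome: unique flanks around a short AT repeat.
genome = "GGCTA" + "AT" * 10 + "CCGAT"

unique_read = "GCTAAT"  # starts in the unique left flank
repeat_read = "ATATAT"  # lies entirely inside the repeat

print(placements(unique_read, genome))
print(placements(repeat_read, genome))
```

In a real genome the repeat can run for thousands of base pairs, far longer than a read, so the number of equally good placements is much larger than in this toy case.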
So basically we have this giant repetitive sequence, and then we have known sequences of DNA on either end of it, and paired-end reads may span the gap and help determine the sequence between the two contigs. If we look at this diagram, what I want you guys to know is that the known sequence is represented by the arrows, and the unknown sequence, which is usually the repetitive one, is represented by the line. As you can see in our key here, it says roughly known length but not known sequence. So we have a general understanding that maybe there are 1,000 base pairs between these two known sequences, but we don't know the exact sequence, because it's probably repetitive, and we don't particularly need to know it. Each of these, wherever there are two arrows and an unknown piece in between, is a fragment, and we know the sequence that is represented by the arrows.
Now let me show you how this information can be useful, especially when you're trying to align the DNA. Let's have a look at an example. Say we have this sequence of DNA here, with the two arrows and the unknown piece in the middle. If I put the complementary strand next to it, we know that these two match up perfectly. They align correctly: they have complementary known reads, represented by the arrows, and complementary unknown sequences, which are probably repetitive. So this is matching, this is normal, this is complementary. But you can also see when things have been deleted, inserted, inverted, or duplicated. So paired-end reads are very helpful for determining how the chromosome has changed over time, whether it's had a sequence insertion, deletion, inversion, rearrangement, duplication, anything like that.
If you guys see a deletion, this is what it's probably going to look like. You have your known sequences, but you can see that some of the unknown sequence has been deleted, because it's not as long anymore. So we can see that the top strand has part of that unknown area deleted. There has been a deletion in here, when it used to be this particular size.
Okay guys, you can also see when there's been an inversion, because if there's been an inversion, you're going to see the known sequences change their direction. If you know the read on one end reads in this particular direction, and then that sequence is completely flipped, there has been an inversion. So this is what an inversion looks like. You have the normal sequence here, and then you have the normal read on one end, the unknown sequence, and then this known sequence going in the incorrect direction: an inversion has happened on this read. So you can see that an inversion has happened in this chromosome in this particular area. We don't know exactly where in here it started, but we know that it has happened. And since the unknown region is repetitive, we may never really know where in the DNA it was inverted.
Now, you can also see things like duplications. Let me scroll down a little so we have some more room. Let's say that this is normal right here, and then you have this one, and it's much longer. It looks similar to the deletion, and this could be a deletion or a duplication: you have to know what the normal length of this sequence is to know whether part of it was deleted or part of it was duplicated. So we would say that a duplication of the unknown sequence happened in this particular strand of DNA.
Because those two known sequences got farther apart, there's now more unknown sequence in between them, so some sort of duplication happened here.
You can also see things like repeat insertions. I'll just draw that for you guys really quick so you know what it might look like. This would be the normal one. Then you have what looks like a normal one again, but then what if you have this sequence added on? Let me get out of the way so you guys can see. So we have our unknown region with our reads, and then, wait, another unknown region and another read. This could be a repeat insertion, where that sequence of DNA was duplicated and then inserted into the DNA.
So basically, these paired-end reads are used to better understand DNA alignment and what may have happened to the DNA at some point in time: a deletion, inversion, duplication, or insertion. They're also used to understand regions of the genome that may have really repetitive unknown sequences of DNA that are flanked by known sequences. So we simply sequence up to a particular point, until we hit the repetitive DNA, and then we're probably not going to sequence any further, but we know the general length of that sequence. This helps us combat some of the difficulties that come along with sequencing the entire genome, which is made up of a lot of repetitive sequences. Okay, everyone, let's go on to our next lesson.
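The comparisons drawn in this lesson can be summarized as a small decision rule. The sketch below is a simplification under stated assumptions: the `ReadPair` class, the `classify` function, and the `tolerance` of 50 base pairs are all invented for illustration, and the rule only compares a fragment against the normal layout in the way the diagrams do. It is not how real variant-calling software works.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Two known end reads with an unknown stretch of roughly known length between them."""
    left_forward: bool   # True if the left read points in its normal direction
    right_forward: bool  # True if the right read points in its normal direction
    gap: int             # approximate length of the unknown region, in base pairs


def classify(pair, normal_gap, tolerance=50):
    """Compare one fragment against the normal layout and name the likely change."""
    if not (pair.left_forward and pair.right_forward):
        return "inversion"  # a known read has flipped direction
    if pair.gap < normal_gap - tolerance:
        return "deletion"  # the known reads moved closer together
    if pair.gap > normal_gap + tolerance:
        return "duplication or insertion"  # the known reads moved apart
    return "normal"


# Hypothetical fragments, with a normal gap of about 1,000 base pairs.
print(classify(ReadPair(True, True, 1000), normal_gap=1000))
print(classify(ReadPair(True, True, 600), normal_gap=1000))
print(classify(ReadPair(True, True, 1500), normal_gap=1000))
print(classify(ReadPair(True, False, 1000), normal_gap=1000))
```

The rule needs `normal_gap` as an input for the same reason given in the lesson: without knowing the normal length, a longer or shorter gap cannot be labeled as a duplication or a deletion.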
4

concept

Sanger Sequencing

4m
Okay, so now I want to talk about a new method of sequencing. Or really, it's not a new method; it's actually a very old method called Sanger sequencing. Sanger sequencing was one of the first methods used to sequence DNA. How Sanger sequencing worked is that it took advantage of special nucleotides. The bases are essentially A, T, C, and G, but they were specially made, and these specially made nucleotides were called dideoxynucleotides, or ddNTPs for short. They were made so that if one was incorporated into a new strand, it stopped the replication, it stopped the creation of that strand. So as soon as one of these nucleotides was added, say an A was needed and a ddATP was added, everything would stop: the polymerase would fall off and replication would not continue.
How you did this was with four separate reactions, each with a normal amount of everything you would need to replicate the DNA. Then in each reaction you placed a really small amount of one ddNTP, one for each of the nucleotides. So you had one reaction with a little bit of ddATP, one reaction with ddTTP, another with a small amount of ddCTP, and again one for G, ddGTP. Four reactions, each with a small amount of one of these. The reason you used a small amount is that you want the majority of replication to take place normally, and then just every once in a while, rarely, you want the incorporation of one of these ddNTPs. If incorporation is rare, you get a variety of different strands. If it were common, the DNA would barely replicate at all, because it would stop almost immediately, and you don't want that; you want this variety of different strands. I'll show you an example of that in a second, if that's not making sense.
Because the incorporation of a ddNTP stops elongation, the creation of the new DNA strand, you generate a variety of different strand lengths within each reaction, and each reaction has a set of strand lengths that differs from the others. So you take all of these reactions, and now you have a bunch of different strand lengths from where replication was stopped, depending on the ddNTP that was used. You can then run them out in ways that separate those strands by size. Because you separated everything into four different reactions, you know that one of them will always stop at As, one will always stop at Ts, one will always stop at Cs, and one will always stop at Gs. That is how you figure out which position has, say, a G: because that nucleotide caused the stop.
So here's an example. You have these four reactions, where pink is T, green is A, blue is G, and red is C. You have the sequence of DNA here, and you want to know the sequence, so you put it into the different reactions and generate all these different lengths. What you get is that you can say, okay, this one stopped here, and that was in the T reaction, so that must be a T. This one stopped here, and that was in the A reaction, so this is going to be an A. This one stopped here, and this was in the T reaction, so it's going to be a T. And so on and so forth for the Gs and Cs, all the way to the end, until you get this entire sequence. All these fragments stopped because they incorporated the appropriate ddNTP, which caused the stop and created these different fragments, and that allows you to figure out what the sequence is.
So that is how Sanger sequencing was done, and how some of the very earliest forms of DNA sequencing were performed. So with that, let's now move on.
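The four-reaction logic can be sketched as a toy simulation. Everything here is an assumption made for illustration: the function names, the short template, and the simplification that each reaction ends up with one stopped strand for every position where its ddNTP could have been incorporated. The sketch also ignores primers and strand direction.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def sanger_reactions(template):
    """Return, for each ddNTP reaction, the lengths of the stopped strands.

    The ddA reaction stops wherever the new strand needs an A, the ddT
    reaction wherever it needs a T, and so on.
    """
    new_strand = "".join(COMPLEMENT[base] for base in template)
    reactions = {base: [] for base in "ATCG"}
    for position, base in enumerate(new_strand, start=1):
        reactions[base].append(position)  # strand length when it stopped here
    return reactions


def read_by_size(reactions):
    """Sort all stopped strands by length and read off which reaction each came from."""
    base_at_length = {}
    for base, lengths in reactions.items():
        for length in lengths:
            base_at_length[length] = base
    return "".join(base_at_length[n] for n in sorted(base_at_length))


# Hypothetical template, invented for this example.
template = "TACGGT"
reactions = sanger_reactions(template)
new_strand = read_by_size(reactions)
recovered_template = "".join(COMPLEMENT[base] for base in new_strand)

print(reactions)
print(new_strand, recovered_template)
```

Reading the strands from shortest to longest gives the newly made strand one base at a time, and taking its complement recovers the template that was being sequenced.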
5
Problem

Restriction enzymes are proteins responsible for what?

6
Problem

What is the name of a short sequenced DNA fragment?

7
Problem

The purpose of a sequence assembly is to what?

8
Problem

Which of the following sequencing techniques requires the use of vectors?

9
Problem

Dideoxy nucleotides (ddNTPs) are used in Sanger sequencing because they have what function?
