HHMI BioInteractive: Bioinformatics

by Pearson
PETER SKEWES-COX: I'm Peter Skewes-Cox. I'm a third-year bioinformatics student in Joe DeRisi's lab. I'm particularly interested in deep sequencing data and how to analyze it to look for viruses in human disease. Pretty early on--
GRAHAM RUBY: I'm Graham Ruby, and I'm a postdoc in Joe DeRisi's lab. When I was in high school, I was interested in biology. But it really wasn't until graduate school that I started adding computer programming to the ways that I approached biology. One of the particular problems that I've been looking at is infants with respiratory illness. I go through mucus collected from infants, look at all of the nucleic acid that's in that mucus, and look for sequences of nucleotides that haven't been described before but look like they might belong to a pathogen. From there, you can go from this being an area of mystery in medicine to an area of knowledge where, if someone has that sickness, you can do something about it.
PETER SKEWES-COX: When people used to study genomes, they would study them one gene at a time. They would do experiments on those single genes, look at their data, and analyze it by hand. Nowadays, with the technology, we're able to do millions of experiments simultaneously. It's no longer feasible to go through and look at one gene at a time. Rather, you need to do this in an automated fashion if you ever want to graduate or publish a paper. So what I do is write computer programs that think like I do, and they're able to go through these data at a much faster rate than I can.
This is the QB3 server room, and it's got thousands and thousands of computers in here, all stored in cabinets stacked right on top of one another. We have one cabinet down here as well, and we have about 100 nodes in there, computers. When we do a deep sequencing run in the next building over, we're able to transfer the data across the network to this big computer down here.
So if you have a job that would normally take a month, you're able to do it on 30 computers in one day. Or if it would take a year to run on one computer, you could do it on 300 computers in about a day. You're really able to increase your throughput and your productivity. That noise is the sound of computational biology. It's thousands of computers humming, and thousands of computers humming produce a lot of heat. So we have to have air conditioning on down there. It's a cooled room and it's a loud room, so you don't want to spend too much time there at once.
GRAHAM RUBY: These huge amounts of data, and that increase in scale, are what really make it a lot more efficient to discover new pathogens of interest. Say you take a little fragment of five or six nucleotides of sequence, five or six letters, and you ask, OK, does this come from a virus? If you only look at one snippet of DNA, you have a very small chance that that fragment is actually going to come from the dangerous pathogen that you're interested in discovering. But if you have billions and billions of small fragments of DNA, then you have a very good probability that one of those snippets is going to come from the pathogen that's causing a problem for humans. That's why you have to systematize the way that you go through this data: so that you can collect the huge amounts of data that make it likely you'll discover something of interest, but still be able to go through all that data in a sufficiently detailed way that, if you see something of interest, you can identify it as being interesting.
PETER SKEWES-COX: When we actually do the deep sequencing run, we produce multiple terabytes of information, which is equivalent to thousands of CDs' worth of data. And I'll get on the order of 20 gigabytes of text. That won't even fit on your standard flash drive.
And so even though it's all just text data, just words and words and sequences, it takes a really long time to process these texts. You can't even open these files in Microsoft Word, for example. It will crash. You need special programs that go through line by line and do the analysis for you. We can process a human genome's worth of data on the order of hours now, which is really cool. It might take us a week or two to generate those data, but once we have the data in my hands and on these fast computer processors, we're able to go through them extremely quickly, which wasn't possible before.
When you get your deep sequencing data, you need to pay close attention so that you don't miss anything subtle, because the best projects and the best side projects often come out of these subtleties that you pick up on. Joe encourages us to do this, and it happens all the time. We look at something and say, oh wait, what's this? This isn't normal. Whereas if you just ran it through a pipeline, put something in, and expected the answer to come out, you'd miss all the subtleties along the way. So having all these nice computers and everything is great, but we also interact with them all the time. We're able to get our answers a little more quickly at each step, which is really awesome. But it still requires a lot of user intervention and smart, intelligent thinking about the problems that we're trying to solve.
GRAHAM RUBY: The diversity of pathogens, especially viruses, is quite different from what you might think about from the diversity of other species. They evolve so much faster. They don't take millions and millions and millions of years to build up differences in their genetic code. They take years or even months or sometimes even days, even in the course of a single infection. And so because they evolve so much faster, it's a lot harder, even if you know about a type of virus.
It can be a lot harder to identify that virus with a test that looks for some specific molecule, because those molecules change rapidly. The reason to focus so much on things like computer programming and computer science when you're doing this work goes back to a quote from Louis Pasteur, where he says, "chance favors the prepared mind." So the better you are at that, and the more elegant your analysis is on the computational side, the more able you'll be to notice really interesting and exciting biology when it's sitting right in front of you.
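
Peter's remark that sequencing output is too large to open in a word processor, and needs programs that "go through line by line," can be illustrated with a minimal sketch. The interview does not say which file format or tools the lab uses, so this assumes FASTA-style text (a `>` header line followed by sequence lines), and the function names `iter_fasta_records` and `summarize_reads` are hypothetical, chosen only for this illustration.

```python
from typing import Iterable, Iterator, Tuple


def iter_fasta_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-style lines, one record at a time.

    Assumption: '>' starts a header line; the lines that follow, up to the
    next header, are pieces of that record's sequence.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the finished record
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)  # emit the last record


def summarize_reads(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of reads, total bases), holding only one record in memory."""
    n_reads = 0
    n_bases = 0
    for _header, seq in iter_fasta_records(lines):
        n_reads += 1
        n_bases += len(seq)
    return n_reads, n_bases


if __name__ == "__main__":
    # Tiny in-memory stand-in for a multi-gigabyte file. An open file object
    # could be passed instead, since Python iterates files line by line.
    example = [">read1\n", "ACGT\n", "AC\n", ">read2\n", "GGGTTT\n"]
    print(summarize_reads(example))  # (2, 12)
```

Because the generator hands back one record at a time, memory use depends on the longest single read rather than on the size of the whole file, which is the property that makes a 20-gigabyte text file workable.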