10.3 DOM Manipulation at the Command Line
Video duration: 18m
Instructor: In this section, we're going to take the URL reading abilities we developed in the last section and apply them to a real problem that I once solved for myself. So let me give you a little background about how this problem arose, and then we'll see how to solve it with JavaScript.

A few years back, I decided to use some of the new online resources to brush up on the Spanish that I learned in high school and college. One of the most useful resources for this is Wikipedia, which has lots of articles in Spanish. For example, let's take a look at JavaScript in Spanish. Here's the article. And one of the things I realized was really useful to do was to copy the paragraphs and paste them into Google Translate, which would not only let me see the translation but also use text to speech to hear it spoken. For example, let's copy the first two paragraphs here, paste them in, and it detects that it's Spanish. Translated to English. And if you click down here, or use the native macOS text to speech, you can listen to it.

But there are two problems that I noticed. One is that copying a large number of paragraphs is really cumbersome. So if I wanted to copy the whole thing, this is annoying, and it also copies things like this table of contents that I don't want to copy. And I also noticed a smaller problem that's still annoying, which is that it copies these numbers, three, four, from the references in the original article. So if you use text to speech on this, it will read this sentence here as something like (speaks Spanish). So what is the word three, or tres, doing in that sentence? Well, it came along for the ride when I copied and pasted.

To solve these issues, I decided to write a script. The original script was in Ruby, but in fact, if anything, it's easier in JavaScript, mainly because JavaScript is so good at processing HTML. The steps are fairly simple.
What we wanna do is take an arbitrary URL at the command line, which we'll always think of as a Wikipedia URL, then manipulate the downloaded HTML as if it were a regular document object model, then remove these references, three and four, or tres and cuatro here, and then collect and output the paragraphs. Implementing this program in JavaScript is our task for the rest of this section.

Let's call it wikp, for Wikipedia paragraphs. Let's copy this here. Get the shebang line for convenience, and then, as almost always, just start with hello world. Just make sure it's working. All right. I'll add in some comments too, because for a utility script like this, I actually do wanna document it fairly thoroughly.

Now our first step is to take in the URL as a command line argument. I didn't know how to do this in JavaScript when I first started this tutorial, so I Googled "JavaScript node command line argument" and figured out how. Because we'll be reading from this URL, we'll start by loading the request library. And the way to take in a command line argument is to use an object called process and a property called argv. Well, actually, you know what? Let's just do this. We'll start with this. This isn't quite right, but we'll see in a second how it works. All right, wikp and then, aha, look at that. So we've got three elements in argv: /usr/local/bin/node, then the full path to the wikp executable, and then the argument. So that's index zero, one, and two is the URL. Let's call it with the real URL this time. Aha, it's working.

All right, now let's look at the palindrome URL code. This here. So we're gonna do this same thing here. Copy that. It's great to bootstrap off code you've already written. Okay, request URL. This is going to work. This is gonna give us the body. So let's just put it out here. Let's log it. This is the source for that JavaScript article. For scripts like this, I love running it almost every time I add new code. All right, that looked good.
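The argument handling described above can be sketched roughly as follows. This is a minimal illustration, not the script from the video: the helper name urlFromArgv, the usage message, and the example paths in the comments are all assumptions made for this sketch. The only parts taken from the transcript are process.argv and the fact that the URL sits at index 2.

```javascript
// Hypothetical helper (not from the video): return the first
// user-supplied argument from an argv-style array.
function urlFromArgv(argv) {
  // argv[0] is the path to the node binary, e.g. /usr/local/bin/node
  // argv[1] is the path to the script being run, e.g. .../wikp
  // argv[2] is the first argument typed after the script name
  return argv[2];
}

const url = urlFromArgv(process.argv);

if (url === undefined) {
  // Assumed usage message for illustration only.
  console.log("usage: wikp <url>");
} else {
  console.log(url);
}
```

Keeping the index lookup in a small function makes it easy to check against a hand-written array without running the script from a shell.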
You can see the closing body and HTML tags here. And, as promised, body here in this code is the entire source of the HTML page, not just the contents of the body tag. All right, well now we've got the contents of this HTML page, so how are we going to remove the references and collect the paragraphs? And the answer is that we're going to treat this page as if it were a DOM.

The specific task we're doing is called parsing. So I Googled "node parse HTML" and found a library called jsdom. In particular, it's a Node package, so we can install it like this. All right, and we'll use it like this. This code is just taken from the jsdom documentation. You might wonder why I'm using const instead of let as I did before. And the answer is it doesn't really matter which one I use in this case, and I'm just following the documentation. Being able to read code and not be confused by little details like this is a classic hallmark of technical sophistication.

As is the next line. I'm gonna do const, open curly brace, JSDOM, close curly brace, equals jsdom. What do these curly braces mean here? The truth is I actually don't know offhand. Again, this is taken right from the documentation, which is linked in the text. This is an essential skill: being able to copy and paste code that you don't necessarily understand in detail but that you can put to good use anyway. So here this all caps JSDOM is actually a constructor function that can be used to create the equivalent of document.

Let's refresh our memory on that. This is in main.js. We have document.addEventListener, document.querySelector, document.querySelector here. In order to simulate the DOM of the page we're downloading, we need something like this document. And again, the jsdom documentation tells us how to do it. And here we're gonna do this. And, again, these weird curly braces. I'm not sure exactly what they do, but it just works, as we'll see. So this is new JSDOM(body).window. Again, right from the documentation.
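For readers curious about the curly braces mentioned above, they are JavaScript's destructuring assignment, which binds a property of an object to a variable of the same name. The sketch below shows the pattern without needing jsdom installed. Note the loud assumption: fakeJsdom is a made-up stand-in object, not the real library, and its JSDOM class does no parsing at all. It only mimics the shape (a constructor whose instances have a window with a document) so the two destructuring lines can be run and inspected.

```javascript
// Made-up stand-in for what require("jsdom") returns. The real
// JSDOM constructor parses HTML; this one just stores the string.
const fakeJsdom = {
  JSDOM: class {
    constructor(html) {
      this.window = { document: { source: html } };
    }
  },
};

// Destructuring: bind the property named JSDOM to a variable JSDOM.
const { JSDOM } = fakeJsdom;

// Long-hand equivalent of the line above.
const LongHand = fakeJsdom.JSDOM;

// Same pattern again: pull the document property out of window.
const { document } = new JSDOM("<p>Hola</p>").window;

console.log(JSDOM === LongHand);
console.log(document.source);
```

With the real library, the two destructuring lines have the same form. Only the object on the right-hand side changes.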
I never in a million years would've guessed this. Let's see how it works. We've got this document.querySelector. Gonna copy that, 'cause we can just use the same thing. Let's console.log querySelector of, let's do a paragraph tag. This is just p. This will be the first paragraph in the document. I actually don't know what this is gonna do. Only one way to find out. Oh, look at that. So it's an HTML paragraph element, which is just the object returned by querySelector. By running the Google search "JavaScript DOM element print content", I discovered that this object has a textContent property, so I can do this. Aha, that is the first paragraph too. If we look here, look at that. This is the contents of the first paragraph. So this is great progress.

Now let's pull out all the paragraphs. The way to do this is with a powerful function closely related to querySelector called querySelectorAll. Let's bind it to a variable called paragraphs: document.querySelectorAll, and then p, like that. Let's add in some documentation here, just a little comment. So you can see that querySelectorAll is gonna work just like querySelector. Before, we selected an ID, a CSS ID. Now we've learned that you can also just give it a tag.

To pull out the references, we need to know a little bit about the structure of a Wikipedia page. Let's inspect it. How are we gonna find that reference? Inspect element. We can see here that this has a class equal to reference. All right, so this is a good example of how you can figure stuff out on your own. If I told you that there is a function called querySelector where if you give it a hash symbol and then a string, it would find the element corresponding to that CSS ID, how would you find all of the elements with a particular CSS class? Well, we've just seen querySelectorAll as how to find all the elements. And by analogy with this, we can say let references equals document.querySelectorAll. Remember, in CSS, the hash symbol is an ID.
The way you indicate a CSS class is with a dot. So this code should select all of the elements on the page with class equal to reference. Let's log a couple things just to track our progress. Just do the first paragraph and the same thing with the references. Right, so these are both the actual objects. We can do .textContent on both of these. There we go, look at that. Now note that this here is actually the number one, so it must be that there's an element with class equal to reference with this content. Ah, and here it is. You can see it's actually here in the sidebar. Let's close this down. That's what it's picking up. Let's look at maybe the second one. So one is the index. It's two. Aha, there it is. So this is the one that we're actually looking at in the paragraph, but it doesn't matter, because we're going to remove them all.

How do we do that? Well, the first step is to go to Google and type "JavaScript DOM remove element", or something like that. And you'll discover that the JavaScript objects that represent a particular element or node in the DOM have a remove method. Let's take a look at that. So what we have to do here is, for each reference in references, remove that reference. We saw that each reference is an HTML element, and by Googling "JavaScript DOM remove element", we can learn that this will remove that element from the DOM. I'm actually not sure what'll happen here. Look at that. Yeah, it removed it, and so now it's undefined.

So what we wanna do is go through all of these references and, for each one, call remove. As you might be able to guess from my suggestive choice of words, the way to do this is with forEach. We'll put in a documentation line too, and we can use the same idea to output all of the paragraphs. Oh, this won't quite work. Let's take a look. I still like to run it though. Right, these are the paragraph elements. And we saw before that we have to use the textContent property. Looks like that worked.
Let's copy and paste it. Now I can just scroll up. It's a little cumbersome though, so I'll do my favorite trick on macOS, which is to pipe to pbcopy. This is the pasteboard. Pb for pasteboard, copy. Puts it in the buffer, and then we can paste it in. And now you can see that the reference is gone. We saw that there was a three here for the third reference. Objetos tres. Now just objetos. And we've got all these paragraphs.

So that's it. Look at how few lines there are here. Even including the extra newlines we've added in a few places and the documentation comments, it's only 27 lines. Moreover, the ideas that we've developed here are applicable to a huge variety of different applications. We're now in a position to download any URL from the public web, convert it to a DOM, and select HTML elements based on the tag name, the CSS ID, or the CSS class. When you think about it, it really is remarkable how much you can accomplish in just 27 lines of code.