  • Hello, world.

  • This is CS50 on Twitch.

  • My name is Colton Ogden.

  • And today, for the first time, we're joined by Emily Hong.

  • Welcome, Emily.

  • What are we talking about today?

  • Today we're gonna talk a little bit about Web scraping.

  • Nice.

  • What is web scraping?

  • Yeah, web scraping is basically: you have the whole World Wide Web, right?

  • And as a normal computer user, you can go on a website and read information. There's so much information available on the web, and web scraping is a way to programmatically take that information off of those websites and do whatever you want with it, as opposed to an end user who might be going on Wikipedia or Google, interacting with things with the mouse, and reading information.

  • So we can make a script to maybe do some analytics and that sort of thing.

  • So if you think of the non-tech version of the same concept as web scraping, maybe you're reading a website, say Wikipedia, and you're copying down the words or information on paper.

  • And then we decide to do something with that paper. That obviously takes a really long time.

  • Yeah.

  • And so with our computer science knowledge, we can do some neat little tricks to speed up that process.

  • Are we gonna be using Wikipedia at all, by chance?

  • Yes.

  • What?

  • Yeah.

  • Fun fact.

  • Also, Emily and I are from the same exact... well, not the exact same place, but the same county in California, which is amazing.

  • Yeah.

  • Let's bring it over to your computer here just so we can see that.

  • And we did a little bit of cropping in advance; actually, I didn't realize that might mess up the slide.

  • For example, we don't see the whole image this way.

  • Yeah, that's fine.

  • If you just want to keep everything off of that.

  • But there is an image credit there.

  • Although I did think that Emily originally had created the image.

  • Or so.

  • Which is it?

  • It's a very beautiful representation of what?

  • Beautiful soup.

  • Yeah, exactly.

  • So what is beautiful soup?

  • Actually, because I'm curious.

  • Yeah.

  • So we talked a little bit about generally what web scraping is.

  • That's kind of a general computer science concept.

  • But then there are different ways to carry out that idea, and so today our little intro demonstration is gonna be doing web scraping, particularly with Python.

  • Okay, and Beautiful Soup is one particular library that someone has so graciously made and allowed us to use, and they called it Beautiful Soup because they visualized all of the Internet as this huge unorganized soup, exactly like an alphabet soup, like on the screen.

  • And Beautiful Soup will help us make it beautiful.

  • Is this sort of the most popular, I guess, web scraping library in Python?

  • Um, I'm actually not 100% sure.

  • I do know that it's one of the easier ones to use.

  • So maybe not as heavy duty as some other ones that you can find, but very friendly to use.

  • I'm very excited to dig in here a little bit.

  • It's making me a little hungry, too, but we have a few people in the chat.

  • Let me see.

  • Some folks are asking what the difference is between here and edX, "here" likely being a reference to CS50 on Twitch.

  • This is just a Twitch channel where we take in guests and we have from-scratch implementations of projects.

  • We talk about concepts.

  • edX is more like a traditional lecture-based setup, a lecture-based course platform; for example, CS50 is on edX.

  • But also the courses that I taught, that Brian taught, that Jordan taught.

  • Those are all on edX.

  • That is a little bit more of a traditional setup.

  • This is a more collaborative teaching approach.

  • And ut dough 988 is asking, what's the topic of today?

  • Web scraping with beautiful soup.

  • Hence the area.

  • Hence the title slide.

  • There is beautiful soup.

  • J s, uh, Pop.

  • What is it?

  • J.

  • J s pop?

  • Is that a thing?

  • Sure does.

  • I'm not entirely sure what you mean by that, ex polar dreams.

  • Oh, maybe like, oh, jsoup.

  • Um, jsoup is like a Java...

  • Oh, it's a Java HTML parser.

  • Probably similar.

  • I imagine it's very similar.

  • I'm not familiar with it, but based on at least what I can read here, it says open source Java HTML parser,

  • with DOM,

  • CSS, and jQuery-like methods for easy data extraction.

  • Sounds to me like it's probably similar; I have to imagine it's a cross-language version of this idea.

  • Yeah.

  • Yeah, probably.

  • I'm not sure.

  • Yeah, um, good questions all around.

  • And then Shin was saying Hello.

  • Emily and Colton.

  • Awesome.

  • Um, why don't we dive into a little bit of alphabet...

  • Beautiful Soup.

  • Yeah.

  • Yeah, so.

  • We talked a little bit about this, but... Sorry, we have our chat window a little bit large.

  • Don't want to cut off your slides.

  • I'm gonna go ahead and shrink this down just a little bit, like that.

  • Boom.

  • That should be okay.

  • Generically, web scraping is the idea of using computer software, in other words code, to extract information from websites.

  • And that's any website that you, as a user, could actually access normally. And so the idea with Beautiful Soup...

  • And, um, no more chat.

  • We're gonna hide that for a little bit and bring it back, apologies. But the idea with Beautiful Soup is, one, making maybe unorganized websites more organized.

  • So, I know in school, sometimes I'll look up a more niche academic concept and it'll be some plain HTML page with just some text.

  • Say you wanted to take that content and maybe spruce up the look of it, make it have fancy fonts or whatever; you could do that.

  • So that's the idea of taking unstructured data, or maybe poorly structured data, and making it more user friendly.

  • The other idea is, there's just a lot of information on the Internet, and it's constantly updating.

  • And so you can grab that information through some kind of web scraping of the HTML page and put it in some kind of database or spreadsheet that you can then manipulate. Sure, almost taking other people's database information and putting it into our own database.
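
The "grab it from the HTML page and put it in a spreadsheet" idea can be sketched as follows. This is only a minimal illustration, assuming a made-up HTML snippet and made-up column names; it uses Beautiful Soup's `find_all` to collect links and Python's built-in `csv` module to produce spreadsheet-style rows.

```python
import csv
import io

from bs4 import BeautifulSoup

# A made-up page fragment standing in for a real website (assumption).
page = """<ul>
<li><a href="http://example.com/elsie">Elsie</a></li>
<li><a href="http://example.com/lacie">Lacie</a></li>
</ul>"""

soup = BeautifulSoup(page, "html.parser")

# One row per link: the visible text and the address it points to.
rows = [(a.string, a["href"]) for a in soup.find_all("a")]

# Write the rows as CSV into an in-memory buffer; a real script
# would more likely open a file on disk instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "url"])  # hypothetical column names
writer.writerows(rows)

print(buffer.getvalue())
```

The resulting CSV text can be opened in any spreadsheet program or loaded into a database.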

  • Yeah, we'll talk a little bit about some things to consider before you scrape, but now: why web scrape at all?

  • I think everyone knows there's so much information and content being produced on the Internet.

  • These are just some little statistics, if you can imagine all the kind of stuff you can access on 1.8 billion websites.

  • I'm actually kinda surprised.

  • I almost expected there to be maybe 10 times more than that. Yeah, this might be a little outdated.

  • There's, uh Let's see if this website is still alive.

  • And also I wanna shout out, we have a couple of followers.

  • Thio, Tasha Pen.

  • Well, cool game.

  • What?

  • 12 and a meat Microsoft India, thanks for the follows.

  • Oh, wow.

  • Okay, so there's an actual live website that shows all this stuff. I'm really not sure how reliable it is, but it looks pretty professional.

  • Random number generator.

  • I don't know.

  • It seems it seems legit.

  • So let's say that wow.

  • 146 billion emails sent just today. So we probably can't access that kind of Internet information because it's not publicly available, but that is happening on the Internet.

  • Wow, this is fascinating.

  • Okay?

  • Yeah.

  • So just some things to consider the next time you post something on Facebook. I'm a little surprised at 800 Instagram posts per second, too; I feel like that would probably be more. But let's say at least.

  • So why web scrape? There's again just all kinds of stuff that happens on the Internet. Something that's particularly interesting: you could say, oh, why don't I just download the HTML page, get the information, and never have to deal with it again?

  • Sure.

  • But there are also a lot of websites that constantly update, right? So an example of that that always comes to mind is flight prices, although there are a lot of other services that centralize that information.

  • Or, again, maybe a blog post.

  • Maybe there's someone that is posting regularly and you want to take their information and, I don't know, showcase it somewhere else.

  • I don't know.

  • Or say you're a business and you want to keep track of your competitors; maybe that's a very specific task that no other company is doing.

  • It's kind of general: whatever information you need, when it would be too tedious for you to do it by hand.

  • Maybe I want to have a dashboard page which shows my five top websites.

  • Maybe they make their information publicly available.

  • So I have a parser that puts it all together in one page.

  • Maybe that'd be compelling.

  • Yeah, you could have your own curated news feeds.

  • Yeah, Yeah, something like that, Actually.

  • Transit screen?

  • I'm not sure if they use the same...

  • I think they use APIs for that.

  • But the same kind of idea?

  • Yeah.

  • Yeah.

  • Um, so anymore.

  • So web scraping is not the only way to collect information.

  • Speaking of, yeah.

  • Yeah.

  • Speaking of APIs: generally, web scraping is, I would say, the hackiest way to take information, because it's not really official.

  • You really just do it based on the HTML page that is given to you any time you load a website.

  • But before you go straight to web scraping, there are much easier ways to access information, the first of which is an API, which stands for application programming interface.

  • I don't know if you've done a stream on APIs yet?

  • I don't think we've touched on them.

  • I think in a couple of streams we might have briefly mentioned what they are, but not really gotten into too much detail.

  • Yeah.

  • So, just generally speaking, usually larger companies and organizations will actually create their own libraries and sets of functions for users to easily use their services.

  • I think the most common example is maybe Google.

  • A lot of people are familiar with Google Maps.

  • If you want to integrate their map on, say, your blog or something, you don't need to reinvent the wheel or scrape their maps.

  • They have a bunch of nice premade functions that you can use. And they often put this sort of behind a payment structure, you know?

  • Yeah, you can use it for free up to a certain amount of requests, where we ask their server for the information, but then past maybe 10,000 in a certain period of time,

  • they're like, no, you have to start paying for the service.

  • Yeah, yeah, yeah, for sure.

  • We had a couple of new followers as well.

  • I'm just gonna keep it on this page, I think. Here, which was...

  • Uh, Aaliyah.

  • Not a cast and cyber videos.

  • Thank you very much for the follows.

  • Yeah, so that's kind of the most official way to interact with another company's information, because having an API kind of signals from the organization that, yes, we do want you to access our data or maybe use our services.

  • And here's officially how we want you to do it, right.

  • So that is kind of the safest option; if that is available, definitely use that.

  • And you'll get obviously maybe some kind of customer support in that case, too.

  • Yeah, you kind of just get the whole suite of whatever the organization is wanting to give you. The other thing is publicly downloadable files.

  • Like if there's files of data already available.

  • One example I can think of is maybe your city or county has statistics for your area.

  • That could be interesting to analyze.

  • No need to, like, force it.

  • When you can take that from them; they usually have maybe, like, a CSV. They'd be sort of giving you the database up front.

  • Exactly.

  • Yeah.

  • So the reason why I really want to highlight these before we move forward is because these are really official notifications of, like, this is how we want you to interact with us.

  • So always look out for those.

  • But if maybe they're a smaller organization, or a personal blog or something, where they don't have the resources to necessarily put together an API, web scraping is a way for you to take initiative and say, hey, I'm gonna structure this how I want and build things from scratch. And even some big websites keep everything closed, at least API-wise.

  • They don't really expose one.

  • I think Wikipedia, I'm guessing, doesn't have an API.

  • I actually haven't looked.

  • There are probably plenty of websites.

  • Even if they do, there are probably plenty of websites that do have a ton of data, and Andre in the chat is actually pointing out

  • IMDb

  • doesn't have a public API.

  • There are certain times where you kind of have to do it yourself.

  • Do the hacky way of getting the information.

  • Yeah, I think that that kind of falls into two cases like one.

  • Maybe the organization very intentionally doesn't want you to access the information.

  • I was playing with Beautiful Soup a little.

  • And I was trying to scrape something, and I got a small page that I didn't recognize.

  • I was like, what is this?

  • And it said something like "bots not allowed."

  • Okay, okay.

  • Servers do have a little bit of protection against that.

  • Ideally, they anticipate this problem.

  • Also harsh G okra and, uh, last ages from Princeton, Right.

  • Thank you so much for the follows.

  • Yeah.

  • So that's kind of one case.

  • The other case: maybe they don't have the resources or just haven't gotten to that.

  • Yeah, I'm very excited to see a little bit of what we're doing with beautiful soup.

  • It looks like: what is Beautiful Soup?

  • So we talked about this a little; it's essentially a Python library that is able to parse and organize data from an HTML or XML file.

  • We're gonna be focusing on HTML files today.

  • And HTML files again are just the markup language that today's websites are written in.

  • It represents kind of the structure, outside of maybe the colors and the font size or whatever, just the real content of your website.

  • Okay.

  • Awesome.

  • Yeah.

  • So what does it do?

  • It parses it and breaks it down into a tree of objects.

  • Right?

  • And is this perhaps related to the DOM that pages are sort of constructed by? Ooh, yes. Amazing.

  • Got a flow here.

  • Yeah.

  • So if you've ever used HTML, or if you haven't, that's totally fine.

  • Basically, things are organized in tags, where the tags label what kind of information you're storing between them.

  • Okay, so at the highest level, all HTML files are a document, just to signal, like, hey, this is my website.

  • And then from there, it kind of splits into different things, depending on what you want on your website.

  • So each tag is kind of like a tree node.

  • Almost.

  • Yeah, like... hmm.

  • Well, I think of it more as a label.

  • And then I think of the DOM as actually translating that into the tree. But yeah, so you can think of, maybe I want a title on my website.

  • Maybe I want some text and a link or whatever; these are all denoted by tags in HTML. Sure.

  • Like an em tag, for example.

  • That's not a node.

  • That's a style attribute.

  • No.

  • Yeah, there's not necessarily a one-to-one mapping. So the general idea of HTML is, it's a markup.

  • It's just telling your browser, which is the thing that is actually making your website look like what it's looking like, what each item is.

  • So h1, for example, stands for a heading line.

  • Script is actually not something you would see; that's a JavaScript script. Title is actually in the head.

  • So that actually doesn't show up, either.

  • But something like a p tag, that would be paragraph text.

  • There's h1 through, I think, h6.

  • I believe they're just different styling. The world might think that you actually made this diagram yourself, because they don't see the attribution at the bottom.

  • Unfortunately, a disclaimer.

  • I did not make any of these images, unfortunately, but it's very it is very lovely looking very helpful.

  • Um, okay, so let's talk a little bit more about the object, the tree thing that you're talking about.

  • So HTML has all these tags so your browser can render a certain way, like I was saying, but what Beautiful Soup and other HTML parsing and interpreting libraries and programs do is they say, hey, you know, how you structure HTML is very similar to what we call a tree.

  • And you can move around and manipulate that tree to do different things.

  • And so I guess, like, what exactly is a tree?

  • A tree is this kind of structure you have.

  • Each of these boxes would be called a node, and you have various relationships between these nodes.

  • Either they're children of or descendants of, or they're siblings.

  • And so I think the most common example is like a family tree.

  • Maybe people are more familiar with that.

  • So maybe this is like the great great grandpa or whatever.

  • And this is like the next generation, the next generation; these are siblings, because they're horizontally even. Same thing here.

  • These are siblings; maybe they share the same parents and so on.
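
The parent, child, sibling, and descendant relationships described here can be sketched in plain Python, without any library. This is an illustrative toy, not how Beautiful Soup is implemented; the `Node` class and its method names are made up for this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One box in the tree diagram (hypothetical class for illustration)."""
    name: str
    # The parent link is left out of repr/compare to avoid circular recursion.
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: List["Node"] = field(default_factory=list)

    def add_child(self, name: str) -> "Node":
        child = Node(name, parent=self)
        self.children.append(child)
        return child

    def siblings(self) -> List["Node"]:
        """Other nodes that share this node's parent."""
        if self.parent is None:
            return []
        return [n for n in self.parent.children if n is not self]

    def descendants(self) -> List["Node"]:
        """Children, grandchildren, and so on, depth first."""
        found = []
        for child in self.children:
            found.append(child)
            found.extend(child.descendants())
        return found


# Build a tiny tree shaped like a simple HTML page.
html = Node("html")
head = html.add_child("head")
body = html.add_child("body")
title = head.add_child("title")
body.add_child("p")
body.add_child("p")

print([n.name for n in head.siblings()])     # ['body']
print([n.name for n in html.descendants()])  # ['head', 'title', 'body', 'p', 'p']
```

Here `head` and `body` are siblings because they share the parent `html`, and `title` is a descendant of `html` but not one of its direct children.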

  • And so basically what Beautiful Soup does is it takes this markup language that was only used to maybe render a certain look.

  • And it actually breaks that down into different objects, and objects are just

  • a fancy way to code such that there are certain features that you can access very conveniently.

  • Typically,

  • I think most people are used to seeing trees

  • kind of flipped, but programmers tend to represent their trees as being top down.

  • Yeah, Yeah.

  • I didn't even think about that.

  • It does tend to get lighter as you.

  • Yeah, Yeah, yeah, yeah, yeah.

  • Okay.

  • So kind of moving on.

  • I mentioned this idea of breaking down HTML into a tree of objects.

  • And so now, more specifically, into what Beautiful Soup does with those objects.

  • First and foremost is the tag.

  • It usually identifies each node of your tree in HTML, so a tag could be an a tag, a p tag, an h1 tag. Any of these map sort of one-to-one to the HTML.

  • Oh, yes. NavigableString is a fancy word, but it's basically a string.

  • It's navigable because of some features that Beautiful Soup added, but basically, say you have a p tag and the text is, like, "Hello".

  • Uh, the string object would be that string.

  • Hello.

  • So it's not strictly a tag in the sense that it's not a link or something, but specifically the text inside the link.

  • Okay, and we'll go a little more into that.
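
A minimal sketch of the "p tag whose text is Hello" case, assuming the `bs4` package is installed and using Python's built-in `html.parser` backend. It shows that the text inside the tag comes back as a `NavigableString`, which behaves like an ordinary Python string but also knows where it sits in the tree.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<p>Hello</p>", "html.parser")

text = soup.p.string  # the text inside the first <p> tag

print(type(text).__name__)    # NavigableString
print(isinstance(text, str))  # True: it is also a regular Python string
print(text == "Hello")        # True
print(text.parent.name)       # p (the tag that contains this string)
```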

  • The BeautifulSoup object is what they call the overall object of your whole HTML page.

  • And so we'll see what that looks like.

  • But basically that's maybe the very top node in our object tree. And a Comment is basically, if there's a comment in your HTML.

  • That doesn't usually render on the screen, but it's going to be in your file.

  • So it's a way to... Well, that's cool.

  • So you could even store metadata in comments, then, and parse that with Beautiful Soup.

  • Yeah, okay.

  • Most people probably don't design their web pages that way.

  • But you could theoretically do it.
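
The "metadata in comments" idea can be tried with a small sketch like the one below. The HTML snippet and the `last-updated` note are invented for illustration; `Comment` is the Beautiful Soup class for HTML comments, and filtering `find_all` with a function passed as `string=` is one common way to collect them.

```python
from bs4 import BeautifulSoup, Comment

# Invented snippet: a comment carrying a bit of metadata (assumption).
html = "<p><!-- last-updated: 2019 -->Hello</p>"
soup = BeautifulSoup(html, "html.parser")

# Keep only the strings in the document that are HTML comments.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

print([c.strip() for c in comments])  # ['last-updated: 2019']
```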

  • Yeah, I'm a fan of the conciseness and cleanliness of the slides, by the way, Yeah.

  • Um, so we talked a little bit about tags.

  • It corresponds pretty much one-to-one with HTML tags, and they use the same identifiers.

  • A p tag in Beautiful Soup is the same as a p tag in HTML.

  • And basically the object part of this is, instead of just representing a tag in HTML...

  • You can add style attributes, you can add maybe ids and classes.

  • Beautiful Soup basically encapsulates that into one object and allows you to access that information by using the dot operator, which we'll go into again later.

  • So it's essentially a dictionary that holds what kind of tag it is, but also some of the other information.

  • Okay, cool.

  • Sounds very robust.

  • So here's an example.

  • Here in the top line, we're using the BeautifulSoup constructor, and let me actually show it really quick.

  • Where I'm getting all of this information is from the Beautiful Soup documentation.

  • And so this is... How did I get here?

  • I just Googled Beautiful Soup.

  • And this is like the second thing that came up.

  • I can actually probably paste that in the chat.

  • So if I do "beautiful soup docs" and then go over here, for version 4.4.0...

  • I guess that doesn't matter too much, but okay.

  • Yeah, gonna paste that into the chat.

  • And then X polar dreams is asking Is there a link for the slides?

  • Do you even have a link for the slides? I actually do.

  • Should I share that now? You can share that.

  • You can send that to me by email, and then I'll post it in the chat.

  • Right.

  • And then I can actually... I'll switch

  • to our temporary pane here.

  • If you want your screen not visible while you send me an email.

  • I'll also quickly open it so everybody who wants to see the slides can see them.

  • And also astro T thank you very much for the follow, um, someone saying that I wear a specific striped shirt often I kind of do, actually.

  • Um Okay.

  • Okay.

  • Mr. Under was saying that web scraping isn't illegal per se, just hacky.

  • In the sense of being hacky, not in the sense of being hacking.

  • Um, and I am Watney.

  • Thank you very much for the follow as well.

  • And if anybody has any questions on what we're doing as we go along, definitely ask them; the chat is currently hidden just because... it looks like I brought it back up on the screen.

  • Oh, no, because it's a different view.

  • But I had hidden the chat briefly just because it was taking up a large part of the slide deck.

  • Um, but, uh, sorry.

  • Oh, did you send me the slides?

  • Just one second, everybody.

  • I don't see them... I might be having a slight delay, but we can go ahead and continue here.

  • I'll go ahead and bring us back to your screen there.

  • All right?

  • Yeah.

  • So this is the documentation.

  • Basically, these are, I assume, the creators of this library, and they're creating this nifty guide to help us use their product,

  • I guess, this library.

  • Okay, so yeah.

  • I changed the form of the documentation a little bit.

  • But here, before you can parse anything, you essentially have to create a BeautifulSoup object.

  • That's kind of how this library is going to take in the information and do all these convenient things.

  • Brian did do a nice little stream on object-oriented programming.

  • And so this kind of ties into that, where there's some object the library creators have defined.

  • It's a BeautifulSoup thing that does stuff and stores

  • information.

  • Exactly.

  • Yeah.

  • So we've made a BeautifulSoup object and we've called it soup.

  • This object is actually just one HTML tag: it's a bold tag, and the text is "Extremely bold". And so I took the soup, which is here, and I said, give me particularly the b tag, which happens to be the only tag in there.

  • And I set it equal to this variable, tag.

  • So now if I do tag.name, I should receive "b".

  • So name being the name of the tag. If I want the id of this tag...

  • And so this is where you kind of see the dictionary coming in, where "id" is the key.

  • The idea of the key and the value is a little confusing, but it makes it nicely indexable for us. Exactly; if you're familiar with an array, you usually index by number.

  • But in a dictionary you can index by anything you want.

  • So here we have id, and it's "boldest", as we can see here.

  • Then if we get attrs, it will actually return us a set, or I think a dictionary, of all the attributes.

  • So maybe if we had more attributes, maybe a class or something, it would actually return all of those things, right?
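
The slide's steps can be reproduced roughly as below. The markup follows what is described here (a single bold tag with id "boldest" and the text "Extremely bold"); the second tag with an added `class` is an assumption, included only to show what `attrs` returns when there is more than one attribute.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b id="boldest">Extremely bold</b>', "html.parser")
tag = soup.b  # the first (and only) <b> tag in the soup

print(tag.name)    # b
print(tag["id"])   # boldest (dictionary-style lookup by attribute name)
print(tag.attrs)   # {'id': 'boldest'}
print(tag.string)  # Extremely bold

# Hypothetical variation: the same tag with a class added.
richer = BeautifulSoup('<b id="boldest" class="loud">Hi</b>', "html.parser").b
print(richer.attrs)  # {'id': 'boldest', 'class': ['loud']}
```

Note that `class` comes back as a list, because an HTML element can carry several classes at once.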

  • Thanks.

  • Thio Caesar 41 10 and beauty pie One for the follow up.

  • I think I won.

  • They were getting better.

  • So a little more: the NavigableString is basically a normal Python string, except it's got some added searching functionality.

  • The BeautifulSoup object is the entire document, as I mentioned earlier, and a Comment is just a normal comment, so not too much there.

  • Okay, Pretty cool.

  • So this looks like it's getting into the meat and potatoes of what we do.

  • Yeah.

  • So if we understand this idea of every node of the HTML page being stored as some kind of object, with some fancy searching and indexing properties that Beautiful Soup has coded for us,

  • they also found a way to link all these pieces together to really form that tree.

  • Next, I'm gonna be going through some functions to be able to navigate that tree.

  • And maybe if you want a piece of information, like the last heading or something.

  • How do I get there?

  • And even maybe to find a specific piece of text.

  • Maybe you know the area that you're looking for says, like, "schedule", for example?

  • Yeah, I'm guessing it'd probably look for it in your document.

  • Yeah.

  • Hence the navigable string, right?

  • Searchable.

  • Yes.

  • Um, yeah.

  • So it's interesting to know how to parse one tag, but more often than not an HTML tree is just gonna have many tags.

  • And so this is how to get to what you need, right?

  • So first, we're gonna look at navigating the tree.

  • So here's our beautiful tree diagram that I did not create again.

  • Nobody knows, though, because there's the attribution.

  • So this is just a little more complicated than what we saw before.

  • We have our main HTML document, and then these different tags or categories within the HTML page.

  • So maybe an image, a heading; all of these things are accessible.

  • Okay. Form for N, thank you very much for the follow.

  • So, yeah.

  • Generically, like I was saying, Beautiful Soup provides functions to be able to move along this tree in a designated way.

  • So here's an HTML page example.

  • Looks kind of complicated.

  • Could honestly barely fit on this slide.

  • But just to give you some perspective, this is the HTML.

  • It's very, very simple.

  • The code looks kind of complicated.

  • You can imagine that, I don't know, Google isn't even that fancy, but a Facebook page has a lot more HTML tags. But yeah, so generally, this is a very simple HTML page.

  • We can see that this is the head.

  • This is actually kind of meta information that doesn't display directly on the screen where you'd typically see your website.

  • But the body is kind of the stuff that you actually see.

  • So we have a title under this p class.

  • We have it.

  • We've got a class "title".

  • This b tag, it's just bold.

  • Here's another p class.

  • And then, as you can see, it's not just that the tags surround content, but tags can be fit inside other tags.

  • Right?

  • And so that's kind of where we get this tree, child notion.

  • So a good example of that is here, this p tag with a class "story".

  • That's one big chunk of paragraph text,

  • essentially, that I just indicated, and within it

  • there are these a tags, and a tags in HTML are usually links.

  • So this is saying within my paragraph text I actually have three links, and this is what those links look like.

  • So the p is like a node here, and then branching off,

  • you get the three a tags?

  • Yeah.

  • So there's a... yeah, there's the three.

  • Kind of like this.

  • And then a sibling to this p class would be maybe this p class, but it doesn't really have anything, right?

  • And then the whole page ends. So there are three p tags coming off of the body node.

  • And one of the p tags has three a tags coming off of it. So you get essentially the tree structure you showed earlier?

  • Yeah.

  • That tree structure, just like I said, gets more and more complicated as you make things, right.

  • Yeah, so, a very simple HTML page.

  • So.

  • So how can we kind of access these things?

  • Um, we have this, actually.

  • Can you undo the crop?

  • I can undo the crop.

  • Yeah. Basically, like we did before with the b tag,

  • we can access certain things.

  • So say we've already made the soup and we called it the soup variable.

  • We can get the head by saying soup.head.

  • And this is just generally how you access information.

  • With objects, we use this dot operator.

  • So we're saying I want the head of the soup object, and it returns to me...

  • this, which is exactly what I see here. And then say I want this b tag, but I can't directly see that from the very high-level soup.

  • So I have to designate.

  • Okay.

  • First, I need to get the body, which is kind of the first level of things that I see.

  • So this body tag here is on the outside, and then within that body, I want the first b tag.

  • So this dot operator, you can think of it as going step by step.

  • I'm saying get the soup, get the body, and then within the body...

  • get the b tag.

  • We're like nesting, indexing into the object.

  • Yeah, and so this is... I don't know how to get this.

  • No, it's not that.

  • But, uh, basically, that will return to me that tag.
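
The step-by-step dot navigation can be sketched like this. The document below is a shortened stand-in for the page on the slide (an assumption about its exact markup), modeled on the "Dormouse's story" example the stream refers to.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the slide's HTML page (assumption).
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters ...</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

print(soup.head)    # <head><title>The Dormouse's story</title></head>
print(soup.body.b)  # <b>The Dormouse's story</b>

# Each dot returns the first matching tag found anywhere beneath the
# object on its left, so in this document soup.b finds the same tag.
print(soup.b == soup.body.b)  # True
```

Spelling out `soup.body.b` still helps when a page has several matching tags and you want the one inside a particular parent.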

  • And then Adam had asked, is NavigableString part of Python or Beautiful Soup?

  • That is a Beautiful Soup thing.

  • So it's particular to their tree structure and how to navigate it.

  • Um, okay, so that's using tags.

  • Another way is you can use contents.

  • And so contents is just kind of a way of saying like, What's inside this thing?

  • And as we saw before, there could be many things inside one thing.

  • So here we have the head tag, and we're going to say soup.head.

  • We knew that returned the head tag before.

  • Let's assign it to a new variable just for convenience.

  • And then, if I print that, it's going to tell me that that's the head tag, which is what we saw before.

  • And now here we're gonna use .contents, and it's going to tell me within the head tag.

  • There is one title tag, and within that there's actually a string: "The Dormouse's story".

  • So we can pretty easily get to whatever specific piece of information we want.

  • Yeah, so we kind of do have to know the layout of the page, then, exactly?

  • Not only that, but by using print statements and various things to reveal things to yourself, you can actually figure it out. Like, say I don't know what's in the head tag.

  • If I use this contents operator, it doesn't require me to know anything about the page ahead of time.

  • It's gonna actually return what was in there.

  • OK, yeah.

  • Exporter dreams is asking: contents returns a list?

  • It does return a list.

  • That's very astute.

  • So this is a list of length one. Say I wanted the contents of this p tag with the class.

  • There are a few things in there, so I think it would return, like, a list of length three,

  • with all those things. Makes sense. Wouldn't it give you that text as well?

  • The "Once upon a time"?

  • Yeah, so maybe it would be five, because there's, like, text,

  • and then an a tag, a tag, a tag, and then text again, right?

  • I'm not completely sure, off the top of my head, but it will give you those contents.

  • Something that our viewers could definitely try on their own with Beautiful Soup 4.
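
For viewers who want to try this, here is a small sketch of `.contents` on an assumed snippet modeled on the Dormouse page (the HTML and variable names are illustrative, not copied from the stream). In this particular snippet the story paragraph comes out as seven items: four pieces of text and three a tags.

```python
from bs4 import BeautifulSoup

# Assumed sample page (not taken from the stream itself).
html_doc = (
    "<html><head><title>The Dormouse's story</title></head><body>"
    "<p class='story'>Once upon a time there were three little sisters; "
    "<a id='link1'>Elsie</a>, <a id='link2'>Lacie</a> and "
    "<a id='link3'>Tillie</a>; and they lived at the bottom of a well.</p>"
    "</body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# .contents is a plain Python list of a tag's direct children.
head_tag = soup.head
print(head_tag.contents)           # one item: the <title> tag

# Indexing the list picks out a single child.
title_tag = head_tag.contents[0]
print(title_tag.contents)          # one item: the title text

# Text between tags counts as a child too.
story = soup.find("p", class_="story")
for child in story.contents:
    print(repr(child))
print(len(story.contents))
```

The exact count depends on the markup: any whitespace or punctuation between two tags becomes its own text item in the list.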

  • Yeah, I missed some of the stuff.

  • How do you, how does one use Beautiful Soup?

  • How does one use it on a website?

  • Yeah, so Beautiful Soup is a library that is available for anyone to use. How you would use it:

  • There are installation instructions in the documentation. You should be able to use pip, the Python package manager, to download it.

  • And then you use a simple import statement to use it in whatever code, and then you can actually use it.

  • Like I'm a little confused about what you mean by how do we use on a website?

  • But basically, if you want to parse that website's HTML page, you could either, like, manually download it and do something with it,

  • or you can use Python's, um, requests library,

  • an HTTP requests library, and we can talk about that a little later on. And we should clarify:

  • This is meant not to be used in an actual website that you deploy, but rather if you're running a script on your local machine or on a server or something. It's a back-end sort of use case, not something to be deployed in your actual web page.

  • Yes, it's for analyzing other Web pages.

  • Yeah, maybe if you wanted to make, like, a web app that does that, you could theoretically do that too?

  • Yeah, yeah, but typically, probably on the back end.

  • Lipstick said thank you very much for the clarification.

  • Also, thank you to peril cakes and C l prism for the follows.

  • Okay, so that's kind of the contents attribute here.

  • Specifically, say we only wanted the first item in contents, since, as someone astutely noticed, it is a list.

  • So that specifies that we just want the title, and then here, within the contents of the title tag, is actually the text of, I don't know how to get this to go away, the text of the title itself.

  • Oh, okay.

  • Oh, there we go.

  • I think I must just wait for your mouse to stop moving, I think.

  • Anyway, okay, so there's also this idea of descendants, and this ties back into the idea that the whole HTML page is a tree and you can go to parents, children, siblings, just like a family tree.

  • So descendants is saying everything below the head tag, like, what's inside there?

  • So we've identified the head tag.

  • It's the same Dormouse story that we were talking about.

  • We're going to write a simple for loop. head_tag.descendants is a list.

  • Oh, it's an iterable.

  • It's something we can use a for loop on,

  • and we're going to print every item in that list.

  • So if you can see here, it's just the title.

  • And then actually the content within that title.

  • So there are two items that are descendants, by virtue of it being iterable.

  • Super easy.

  • Thio Get off!

  • Python makes it nice, yeah. And thanks to Job's not Mobs and Free Sep

  • Bigger mystery for the follows.
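
The for loop over descendants might look something like this sketch. The HTML is an assumed one-line stand-in for the Dormouse page, and `head_tag` and `found` are illustrative names.

```python
from bs4 import BeautifulSoup

# Assumed sample page (not taken from the stream itself).
html_doc = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p>Once upon a time</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")
head_tag = soup.head

# .descendants walks everything below the tag: children,
# children of children, and so on.
for item in head_tag.descendants:
    print(repr(item))

# It is an iterable rather than a list, so wrap it in list() to count it.
found = list(head_tag.descendants)
print(len(found))
```

Here the two descendants are the title tag and then the text inside that title, whereas `.contents` on the same tag would only give the title tag.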

  • Um, okay, so you can also use string, and that will return just the Unicode string of whatever's inside.

  • And I think this is the NavigableString thing.

  • So this allows you to search for that specific object?

  • Very similar.

  • Just using the dot operator.
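
A minimal sketch of `.string`, again on an assumed stand-in page. The names `title_text` and `plain` are illustrative.

```python
from bs4 import BeautifulSoup, NavigableString

# Assumed sample page (not taken from the stream itself).
html_doc = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# .string gives the single piece of text inside a tag.
title_text = soup.title.string
print(title_text)
print(type(title_text).__name__)   # Beautiful Soup's own string type

# Convert to a plain Python str when the tree links are no longer needed.
plain = str(title_text)
```

A NavigableString behaves like an ordinary string but also remembers where it sits in the tree, which is why it belongs to Beautiful Soup rather than to Python itself.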

  • Very nice.

  • Okay, so, generally, uh, we have directions for moving up and down the tree: contents and children move down, and parent and parents, very intuitively, move up.

  • Um, also, we have ways to move horizontally in the tree.

  • Um, so, like siblings, exactly. Like, with these a tags, if you want the next sibling or next siblings, the only difference between the singular and plural here is that one is going to return a single object, and the other one is going to return a list of objects, depending on how many there are.

  • Okay, so yeah, very similar.

  • Just the other direction.

  • Previous sibling and previous siblings, previous elements.
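
As a rough sketch of moving up and sideways, assuming a tiny page with three links and no text between them (the HTML and names are illustrative). One detail worth noting: in Beautiful Soup 4 the plural forms such as `next_siblings` and `parents` are generators, so the sketch wraps them in `list()` to get an actual list.

```python
from bs4 import BeautifulSoup

# Assumed sample page (not taken from the stream itself).
html_doc = (
    "<html><body><p class='story'>"
    "<a id='link1'>Elsie</a><a id='link2'>Lacie</a><a id='link3'>Tillie</a>"
    "</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")
first = soup.find(id="link1")
last = soup.find(id="link3")

# Sideways: singular gives one object, plural gives all of them.
print(first.next_sibling)              # the link2 tag
later = list(first.next_siblings)      # link2 and link3
print(last.previous_sibling)           # the link2 tag
print(first.previous_sibling)          # None, nothing comes before link1

# Upwards: parent is one step, parents is every ancestor.
print(first.parent.name)
ancestors = [tag.name for tag in first.parents]
print(ancestors)
```

If the page had commas or line breaks between the links, those text pieces would show up as siblings as well.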

  • So that's kind of if you want to.

  • If you think of yourself as like, I don't know, a bug walking through the tree and you can only know directions.

  • That's kind of what these functions do.

  • But oftentimes you know exactly what you want.

  • And you could just say it like I see that one like, give me that one.

  • And that's where we get into searching.

  • So element would be more for, like, the case where you want to differentiate between a sibling element versus maybe a styling element, or styling tag, rather?

  • Yeah, I'm actually not 100% sure.

  • I think these might be equivalent.

  • And, like, it's just a difference in syntax.

  • But we should look into that.

  • And that's definitely something you can check in the documentation for full specificity.
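
On the open question here: as far as the Beautiful Soup documentation describes it, the two are not equivalent. `next_sibling` stays on the same level of the tree, while `next_element` is whatever was parsed immediately afterwards, which is often the text inside the tag. A small sketch on an assumed snippet (names are illustrative):

```python
from bs4 import BeautifulSoup

# Assumed sample page (not taken from the stream itself).
html_doc = "<p><a id='link1'>Elsie</a><a id='link2'>Lacie</a></p>"
soup = BeautifulSoup(html_doc, "html.parser")
first = soup.find(id="link1")

# Same level of the tree: the next a tag.
sibling = first.next_sibling
print(repr(sibling))

# Next thing in parse order: the text inside the first a tag.
element = first.next_element
print(repr(element))
```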

  • Yeah.

  • Okay.

  • So, searching the tree: like, maybe I've identified something that I want, and I don't want to go through, like, filtering through everything to get there.

  • So say I specifically want the title.

  • I can use this find_all function and the quotes, um, or access by class.

  • And so, I don't know if you've talked about it on the stream, but, like, IDs and classes? I did, yeah, I did a couple of streams on HTML and CSS.

  • We went over the basic sort of.

  • Yeah, yeah, yeah.

  • So a class, again,

  • is usually just a way of adding specific CSS.

  • But it kind of serves a double purpose here as an identifier, right.

  • So here we're saying we specifically want things with the class title, and then it's returning that, and then we can use multiple attributes, actually.

  • So what if we want p tags, p tags with the class title? It's going to give us that specific thing.
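
Searching by class, and by tag plus class, could be sketched like this. The page is an assumed stand-in; `class_` has a trailing underscore because `class` is a reserved word in Python.

```python
from bs4 import BeautifulSoup

# Assumed sample page (not taken from the stream itself).
html_doc = (
    "<html><body>"
    "<p class='title'><b>The Dormouse's story</b></p>"
    "<p class='story'>Once upon a time</p>"
    "</body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# Anything with the class "title", whatever the tag.
by_class = soup.find_all(class_="title")

# Only p tags that also have the class "title".
p_title = soup.find_all("p", class_="title")

print(by_class)
print(p_title)
```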

  • A meet was asking can we make a website and app without coding?

  • Um, and certainly I think there are graphical tools for doing this.

  • I mean, I think even scratch is kind of an example of that.

  • I mean, it's not a website or an app, but it's graphical, it's graphical programming.

  • But there are tools like Wix and WordPress that you can use to develop websites without needing to code anything.

  • Yeah, apps, um, it's been a long time since I've looked at, like, Xcode, like, iOS development.

  • But I think you can do certain amounts of things without needing to code, right?

  • Yeah.

  • I'm not sure.

  • I think maybe for like, you I There are easier ways to just create an interface.

  • But logic wise, I think really it was like learning too much.

  • I know.

  • I know.

  • For, like, uh, if you're doing Unity and Unreal game development, which is more where I have knowledge,

  • um, Unreal, for example, has a graphical programming language,

  • very similar to Scratch, but it's robust, like, you could do anything with it.

  • Basically, they're called Blueprints.

  • Unity has the equivalent type thing.

  • I imagine there are probably similar technologies that exist for other domains, like web and app development.

  • But I'm not entirely sure. I don't have much expertise in that area.

  • But some people in the chat are saying yes, that is possible.

  • Weebly and Wix, says Whip streak who,

  • uh, um, which is sort of what I mentioned.

  • Yeah.

  • WordPress, et cetera, et cetera.

  • Um, more ways to search the tree.

  • We can.

  • Sorry.

  • I misspoke.

  • The quotes are actually just for tags.

  • So we're looking for a tags here, and because I'm using find_all, it's actually going to return a list with everything.

  • So here we have the three links that we saw earlier.

  • Nice.

  • Um, so again, using find_all, this is all a tags and b tags, so just kind of like a double,

  • I don't know, search.

  • Um, here we can actually search particularly by id or class, and it uses Python's,

  • looks like a named argument, to do that.

  • That's quite cool.

  • So this is telling the find function I want the id

  • of link1.

  • Whereas before,

  • if we just have, like, a generic word, it's going to assume that's a tag.

  • So here

  • we're getting this specific,

  • I think this one.

  • Yeah, link1, because this has an id of that, and so it's going to return that object.
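
Putting the last few searches side by side, as a sketch on an assumed version of the Dormouse page (the HTML and variable names are illustrative):

```python
from bs4 import BeautifulSoup

# Assumed sample page (not taken from the stream itself).
html_doc = (
    "<html><body>"
    "<p class='title'><b>The Dormouse's story</b></p>"
    "<p class='story'>"
    "<a href='http://example.com/elsie' class='sister' id='link1'>Elsie</a>, "
    "<a href='http://example.com/lacie' class='sister' id='link2'>Lacie</a> and "
    "<a href='http://example.com/tillie' class='sister' id='link3'>Tillie</a>"
    "</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# A bare string in quotes is treated as a tag name.
links = soup.find_all("a")

# A list of names matches any of them, in document order.
a_and_b = soup.find_all(["a", "b"])

# A keyword argument searches by attribute; find returns one object.
link1 = soup.find(id="link1")

print(links)
print(a_and_b)
print(link1)
```

`find_all` always returns a list-like result, even for a single match, while `find` returns the first match on its own or `None` if nothing matches.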

  • So that's using attributes.

  • And also,

  • so if you think searching by id or class or tag is too restrictive, like, maybe I want

  • this very specific thing where, like, the word is this, and I don't know all these specifics, uh, do not fear.

  • There are ways you could do that.

  • A good thing about code is everything is super customizable.

  • So you can actually create your own function using the functions in Beautiful Soup,

  • but using some logic to maybe make it more complicated. And so this function is called has_class_but_no_id.

  • Basically, I want, uh, items that have a class but no id. So I'm using

  • these Beautiful Soup functions here, using some logic here, and then now I can use my custom function to actually search.

  • So you're basically writing a filter, sort of, yes, and using it as a higher-order function.

  • Or I guess a first-class function you're passing into find_all, and find_all will basically call this function for every element as it's looking through the tree.
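
A custom filter along these lines might be sketched as follows. The function name comes from the discussion above; its body and the sample HTML are assumptions about what was on screen.

```python
from bs4 import BeautifulSoup

# Assumed sample page (not taken from the stream itself).
html_doc = (
    "<html><body>"
    "<p class='title'><b>The Dormouse's story</b></p>"
    "<p class='story'>"
    "<a class='sister' id='link1'>Elsie</a>"
    "<a class='sister' id='link2'>Lacie</a>"
    "</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")


def has_class_but_no_id(tag):
    """Return True for tags that define a class but no id."""
    return tag.has_attr("class") and not tag.has_attr("id")


# find_all calls the function once per tag and keeps the ones
# for which it returns True.
matches = soup.find_all(has_class_but_no_id)
print(matches)
```

In this snippet only the two p tags match: the b tag has no class, and the a tags are ruled out because they have ids.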

  • Yes, that's pretty cool.

  • That's quite neat, um, and also thank you

  • Two Very watch it sell 58 2019 and D.

  • J.

  • Miroki, for the follows.

  • Appreciated.

  • Um, you could do more stuff when searching,

  • particularly, like, putting things together: a tag and this class, that kind of thing.

  • Search by string.

  • That's why we have this, like, NavigableString thing, or Beautiful Soup

  • wouldn't know that that string is whatever that is.

  • Um, so that's kind of neat.

  • You can also, say it's a huge web page and there are, like, 25 a tags,

  • and you're like, oh, I just want two of them. Well, there's another parameter called limit, which could be very efficient,

  • I'm assuming. Yes. Um, so just generically, instead of looking through the whole document, which maybe would take more time than you would want, you can say, specifically within the siblings,

  • I want to look for this particular one or whatever, kind of combining the navigating and searching.
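
These three ideas (searching by string, capping results with `limit`, and searching only among siblings) can be sketched together. The HTML and names are assumed for illustration.

```python
from bs4 import BeautifulSoup

# Assumed sample page (not taken from the stream itself).
html_doc = (
    "<html><body><p class='story'>"
    "<a id='link1'>Elsie</a>, <a id='link2'>Lacie</a> and "
    "<a id='link3'>Tillie</a>"
    "</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# Search the text itself instead of the tags.
exact = soup.find_all(string="Elsie")

# Stop after the first two matches instead of scanning the whole page.
first_two = soup.find_all("a", limit=2)

# Search only among the siblings that come after a known tag.
link1 = soup.find(id="link1")
after_link1 = link1.find_next_siblings("a")

print(exact)
print(first_two)
print(after_link1)
```

The `string` argument also accepts a compiled regular expression or a function when an exact match is too strict.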

  • It looks like CSS, sort of, a little bit.

  • Yeah, um, using select. select and find are very similar.

  • I think that the syntax of the arguments is just slightly different.

  • So, like, we want three of the p a tags.

  • And we want these particular tags and so on.
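
A short sketch of `select`, which takes CSS selector strings. The selectors and HTML here are assumptions chosen for illustration, not the ones shown on stream.

```python
from bs4 import BeautifulSoup

# Assumed sample page (not taken from the stream itself).
html_doc = (
    "<html><body><p class='story'>"
    "<a class='sister' id='link1'>Elsie</a>"
    "<a class='sister' id='link2'>Lacie</a>"
    "<a class='sister' id='link3'>Tillie</a>"
    "</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# a tags that are direct children of a p with class "story".
story_links = soup.select("p.story > a")

# select_one returns just the first match, like find.
second = soup.select_one("a#link2")

# Class selector, same syntax as in a stylesheet.
sisters = soup.select(".sister")

print(story_links)
print(second)
```

Anyone comfortable with CSS may find this shorter than chaining `find_all` calls; the results are the same kind of tag objects either way.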

  • Yeah.

  • So, different functions, basically, to pick out the pieces of the soup that you want.

  • Um, it looks like exporter dreams asks, can you grab any attributes?

  • Can you grab any attribute?

  • I think so.

  • Yeah.

  • It just gives you all the attributes that the tag has, right?

  • You just look for the specific one you want.

  • Yeah, if it exists, it should be there.
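
Grabbing attributes could look like this sketch on an assumed a tag (the HTML and names are illustrative):

```python
from bs4 import BeautifulSoup

# Assumed sample tag (not taken from the stream itself).
html_doc = "<a href='http://example.com/elsie' class='sister' id='link1'>Elsie</a>"
soup = BeautifulSoup(html_doc, "html.parser")
link = soup.find("a")

# All attributes at once, as a dictionary.
print(link.attrs)

# One attribute, dictionary style. Raises KeyError if it is missing.
href = link["href"]
print(href)

# .get returns None instead of raising when the attribute is missing.
missing = link.get("title")
print(missing)
```

Note that `class` comes back as a list, since an HTML element can carry several classes at once.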

  • Yeah.

  • Okay.

  • Um, Fie Foe says, hi, what's better,

  • Scrapy or Beautiful Soup?

  • Yeah.

  • I'm not super familiar with Scrapy.

  • I think that is one of them.

  • Or have you Do you want?

  • I think so.

  • I think it's more for, like, recursive like web indexing.

  • What?

  • Crawling like html parsing?

  • Yeah.

  • Final parting, like, just lighter use.

  • Maybe you want to look at one website or something, right?

  • Yeah.

  • So but probably similar, Like, concepts and how they work.

  • Okay.

  • Yeah.

  • So what can you do with the tree?

  • We kind of talked about observing it, like, taking information from it.

  • But we started with, like, oh, make the soup beautiful.

  • So there are actually some functions, which I think we're not going to go into too much today, that you can use to, say, add styling in a programmatic way.

  • Maybe I want to take all the a tags and change their font.

  • I could do that in a very efficient way.

  • Same thing with modifying the tree.

  • Maybe I wanted to add a new paragraph or move this paragraph somewhere else.

  • Beautiful Soup also enables you to do that.
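
Since the stream does not go into these functions, here is only a rough sketch of what such edits might look like, using an assumed page and an arbitrary inline style as the "font change".

```python
from bs4 import BeautifulSoup

# Assumed sample page (not taken from the stream itself).
html_doc = (
    "<html><body>"
    "<p class='title'>The Dormouse's story</p>"
    "<p class='story'><a id='link1'>Elsie</a><a id='link2'>Lacie</a></p>"
    "</body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# Change every a tag in one pass by setting an attribute on it.
for a in soup.find_all("a"):
    a["style"] = "font-family: monospace"

# Add a new paragraph at the end of the body.
new_p = soup.new_tag("p")
new_p.string = "The end."
soup.body.append(new_p)

# Move an existing paragraph: take it out, then put it back elsewhere.
title_p = soup.find("p", class_="title").extract()
soup.body.append(title_p)

print(soup.prettify())
```

These changes only affect the parsed copy in memory; the original website is untouched unless the result is written out and published somewhere.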

  • Yeah.

  • So next question is like, Okay, now I have the tools to get all this information.

  • Like, what am I going to do with it?

  • Um, so not gonna do with it.

  • Yeah, well, that's a different question.

  • So, a few things.

  • You can put your findings.

  • Maybe you're d


WEB SCRAPING TUTORIAL! ft. BeautifulSoup - CS50 on Twitch, EP. 40

Posted by 林宜悉 on January 14, 2021