
  • What's going on, everybody, and welcome to part two of our chatbot with Python and TensorFlow tutorial series.

  • In this tutorial, what we're going to be doing is beginning to build our database, which is going to store our parent comments and their paired best reply comments.

  • So the reason why we really want to do something like this is, first of all, a lot of these files are way, way too big for us to just read into RAM and then create the training files from, even for just individual months.

  • But chances are, if you want to create a really nice chatbot, you're eventually gonna want to work on many months of data, so possibly billions of comments, which you do have at your disposal. When that's the case, we probably want to have some sort of database.

  • Now, for the purposes here, just to keep things simple, we won't deal with MySQL servers or any other big database server type of thing. I'm just gonna use SQLite.

  • It'll help us get the job done, but feel free to use pretty much whatever you want.

  • But I'm gonna use SQLite here. Now, before we get too deep, I just want to address what all our data should be looking like.

  • So I want to bring this up here. This is basically what you get if you downloaded the Reddit data and extracted it.

  • It should look something like this: you should have years, basically 2007 to 2015. Again, BigQuery does have data all the way up to recent, like last month, whatever that would be.

  • Also, if you do go that route, just know that your format's going to be totally different from ours, so you'll have to adapt what you're doing if you want to go that way.

  • But if someone does share some way to officially pull from BigQuery, I'll probably append that to the end of this tutorial series, because there are also multiple models that I'm working on with chatbots, so I'm pretty confident that there will be some follow-up videos.

  • Anyway, enough of that. If you click on any of these, normally you'll just have all these compressed files, but if you extract them, it looks like this, basically. These files contain just a bunch of samples, and each sample looks like this.

  • All right, so this is just one sample. As you can see, there's a bunch of data here. It's obviously JSON, so it's key and value, and first of all there's a lot of wasted information here.

  • So just putting it into a database will severely decrease the size of this data, right? You just have one column name and then all the data, so this much data becomes just this. That makes more sense.
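
  • As a rough sketch of that size argument (not code from the video): the sample line below is made up and shortened, and the key names in COLUMNS are assumptions about the dump format. It parses one JSON line, keeps only the values for a few columns, and compares the character counts.

```python
import json

# Hypothetical, shortened sample; real records carry many more keys.
sample_line = (
    '{"id": "c0001", "parent_id": "t1_c0000", "link_id": "t3_abc", '
    '"name": "t1_c0001", "subreddit": "AskReddit", "body": "Hello there", '
    '"created_utc": "1430438400", "score": 4, "ups": 4, "downs": 0, "gilded": 0}'
)

# Columns assumed for illustration; the exact key names are not confirmed here.
COLUMNS = ("parent_id", "id", "body", "subreddit", "created_utc", "score")


def to_row(line):
    """Parse one JSON line and return only the values for COLUMNS, in order."""
    record = json.loads(line)
    return tuple(record.get(key) for key in COLUMNS)


row = to_row(sample_line)
print(row)
print(len(sample_line), "characters as JSON with every key repeated")
print(len(json.dumps(row)), "characters as bare column values")
```

  • In a table the column names are stored once rather than in every record, which is where most of the saving comes from, on top of dropping the keys we never use.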

  • Also, we don't need all of these. We don't need link ID, for example, and we don't really need name. We might be interested in created. We're probably not interested in when it was retrieved. Author flair we probably don't care about. You might; we probably don't, and so on.

  • Obviously we do care about things like score and ups and downs, and maybe whether they were gilded or not. We might care about those especially if we're trying to make some sort of very specific bot.

  • Same thing with the subreddit, if you wanted, again, to create some sort of really specific type of chatbot. For now I want it to be fairly general, but I care about at least score.

  • One thing to note, though: I'm fairly confident score is miscalculated. Downs are always zero, so score is always miscalculated, if I recall right. I can't remember if that's truly the case, but it's really quick to test. All I know is to take it with a grain of salt, because it's improper.

  • Anyway, I'm pretty sure it's the case that downs are always zero, but I can't remember whether ups minus the actual score would equal the downs, or whether score also always equals ups. Just know there's some sort of flaw there.
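
  • Since that is quick to test, here is one way the check could look. This is only a sketch: the three sample lines are invented, so the counts it prints say nothing about the real data, and the key names ups, downs, and score are assumed to match the dump.

```python
import json

# Hypothetical sample lines standing in for a slice of a real dump file.
sample_lines = [
    '{"score": 4, "ups": 4, "downs": 0}',
    '{"score": -2, "ups": -2, "downs": 0}',
    '{"score": 1, "ups": 1, "downs": 0}',
]


def summarize_votes(lines):
    """Count how often downs is zero and how often score equals ups."""
    total = downs_zero = score_equals_ups = 0
    for line in lines:
        row = json.loads(line)
        total += 1
        if row.get("downs") == 0:
            downs_zero += 1
        if row.get("score") == row.get("ups"):
            score_equals_ups += 1
    return {
        "total": total,
        "downs_zero": downs_zero,
        "score_equals_ups": score_equals_ups,
    }


summary = summarize_votes(sample_lines)
print(summary)
```

  • Run over a few thousand real lines, equal counts for total and downs_zero would support the "downs are always zero" suspicion, and score_equals_ups would answer the second question.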

  • Anyway, let's continue. Working in Python now, what I'm gonna go ahead and do is start building out the code that we're going to be using here.

  • So let's go ahead and import sqlite3 for our database, import json, and then from datetime import datetime.

  • sqlite3 is obviously for the database, json is to read that format, and datetime we're really just going to use to output some logging information, just so we know where we are.

  • As you might imagine, going through these huge files can take a lot of time, so sometimes I just like to put in simple outputs that tell us where we are at the time.
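
  • A minimal sketch of that kind of progress output, assuming we count rows as we read them. The interval LOG_EVERY, the message wording, and the helper name progress_message are all made up for illustration.

```python
from datetime import datetime

LOG_EVERY = 100000  # hypothetical interval between log lines


def progress_message(row_counter, now=None):
    """Build a short log line saying how many rows were read and when."""
    now = now or datetime.now()
    return "Total rows read: {}, time: {}".format(row_counter, now)


# Stand-in loop; in practice row_counter would advance once per line read.
for row_counter in range(1, 300001):
    if row_counter % LOG_EVERY == 0:
        print(progress_message(row_counter))
```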

  • Moving along, we're gonna say timeframe, and we're going to use 2015-05. Remember the format of the files when you download them; we're basically gonna be grabbing this one.

  • They all have RC. I don't know what RC stands for. It's probably not release candidate; Reddit comments, maybe. Anyway, they all have that same format. This is May of 2015, so this is the one that we want.

  • Alternatively, you could take a list of timeframes and iterate through them, building the database the same way I'm about to build the database.
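
  • For the list-of-timeframes alternative, something along these lines might work. The folder name reddit_data, the year subfolders, and the RC_ plus timeframe file name are assumptions about how the extracted data is laid out on disk; dump_path is a hypothetical helper.

```python
import os

DATA_DIR = "reddit_data"  # hypothetical root folder for the extracted files


def dump_path(timeframe, data_dir=DATA_DIR):
    """Return the assumed path of the extracted dump for a 'YYYY-MM' timeframe."""
    year = timeframe.split("-")[0]
    return os.path.join(data_dir, year, "RC_{}".format(timeframe))


timeframes = ["2015-03", "2015-04", "2015-05"]
for timeframe in timeframes:
    # Each timeframe would get the same treatment as the single month used here.
    print(timeframe, "->", dump_path(timeframe))
```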

  • Once we've done that, I'm also gonna have sql_transaction. We're gonna have this because, especially when you know you're going to be working with millions of rows, you don't want to insert rows one by one if you don't have to. That's really inefficient.

  • Instead, you want to build up a big transaction and then do it all at once, and it will be just gobs faster. So that's what we're going to use that for.
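
  • One way to batch inserts like that is sketched below. It is not necessarily how the series does it: the demo table, BATCH_SIZE, and the helpers flush and queue_row are made up, and it uses executemany with placeholders inside a single transaction per batch.

```python
import sqlite3

connection = sqlite3.connect(":memory:")  # in-memory database for the demo
c = connection.cursor()
c.execute("CREATE TABLE demo (id TEXT PRIMARY KEY, score INT)")

BATCH_SIZE = 1000     # hypothetical number of rows per transaction
sql_transaction = []  # rows waiting to be written


def flush(rows):
    """Write all pending rows in a single transaction, then clear the list."""
    with connection:  # commits on success, rolls back on error
        connection.executemany("INSERT INTO demo (id, score) VALUES (?, ?)", rows)
    rows.clear()


def queue_row(row):
    """Queue one row and flush once the batch is full."""
    sql_transaction.append(row)
    if len(sql_transaction) >= BATCH_SIZE:
        flush(sql_transaction)


for i in range(2500):
    queue_row(("comment_{}".format(i), i % 10))
flush(sql_transaction)  # write whatever is left over

row_count = c.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
print(row_count)
```

  • The saving comes from committing once per batch rather than once per row, so the batch size is a trade-off between speed and how much work is lost if a run is interrupted.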

  • Now what we're gonna do is build out the connection. That's going to be sqlite3.connect, and we're gonna connect to something .db (you don't strictly need the .db; that would still work, though), then .format and timeframe, so this will just be a database called whatever the month and year is.

  • Again, alternatively, if you wanted: what we're probably going to call our table is parent_reply or something like that. Instead, you could make the database parent_reply, and then each table name could be the month or something.

  • To me, the month and year isn't all that valuable; there's no real reason why you would separate those out. So I'm not gonna do that, but you could if you wanted.

  • Anyway, then we're gonna define our cursor, so that's just connection.cursor().
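
  • Put together, the connection and cursor might look like the sketch below. The temporary folder is only there so the demo does not leave a file behind; the cursor name c and the sqlite_version() query are illustrative choices.

```python
import os
import sqlite3
import tempfile

timeframe = "2015-05"

# A temporary folder keeps the demo tidy; a real run would more likely
# write the database next to the script.
workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "{}.db".format(timeframe))

connection = sqlite3.connect(db_path)  # database is named after the timeframe
c = connection.cursor()

# Any query will do to confirm the cursor works.
c.execute("SELECT sqlite_version()")
print(os.path.basename(db_path), c.fetchone()[0])
connection.close()
```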

  • Okay, now we're gonna go ahead and make create_table, so define create_table.

  • And then this is just going to be your typical c.execute: CREATE TABLE IF NOT EXISTS, and the table is going to be parent_reply.

  • Then we're gonna have all of our columns. First of all, we're gonna have parent_id. This is gonna be TEXT type, and it'll also be our primary key.

  • Yeah, this is gonna run way off the screen. I think you can get away with a triple quote here. We're gonna find out, just so I don't have to run everything off the screen so much. We'll see how it goes.

  • So, parent_id. Now we're going to need comment_id, and again that's going to be TEXT. It's not a primary key, but it should be unique, so UNIQUE.

  • Then we're gonna have parent, which will also be TEXT type, and then the comment itself, so the reply: comment will be TEXT type.

  • I'd also like to go ahead and log the subreddit, simply because I do see that being a useful thing to track in the future. Different subreddits have different ways of talking,

  • and if you want a smarter-sounding chatbot, you could go with more scientific and engineering types of subreddits. If you wanted a more... never mind, I'm not gonna get myself in trouble. We'll stop at that. Anyway, you could get different types of chatbots.

  • Then the unix time, which is just gonna be an integer, and finally we'll go ahead and take the score, which also should be an int.
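
  • Written out with the triple quote, the function described above could look roughly like this. The in-memory connection is a stand-in for the on-disk database, and TEXT for the subreddit column is an assumption, since its type is not stated.

```python
import sqlite3

connection = sqlite3.connect(":memory:")  # stand-in for the on-disk database
c = connection.cursor()


def create_table():
    """Create the parent_reply table unless it already exists."""
    c.execute("""
        CREATE TABLE IF NOT EXISTS parent_reply (
            parent_id  TEXT PRIMARY KEY,
            comment_id TEXT UNIQUE,
            parent     TEXT,
            comment    TEXT,
            subreddit  TEXT,   -- type assumed
            unix       INT,
            score      INT
        )
    """)


create_table()

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in c.execute("PRAGMA table_info(parent_reply)")]
print(columns)
```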

  • Okay, so with that we've got our query, and of course I did just run it off the screen anyway. But with that, we should create the table if it doesn't exist.

  • Then what we do at the end here is just start our main chunk: if __name__ equals '__main__', call create_table. This will just create the table if it doesn't exist.

  • The other thing to note is that if the database doesn't exist when you attempt to connect to it, SQLite creates the database. That's why we didn't have to create any database ourselves.

  • And finally, since this will only create the table if it doesn't exist, it's relatively cheap to run. So we'll go ahead and run that.
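
  • To see both points in action (the database file appearing on connect, and the table creation being safe to repeat), here is a small sketch. The helper count_tables and the temporary file location are made up for the demo, and create_table takes the connection as an argument here purely to keep the example self-contained.

```python
import os
import sqlite3
import tempfile


def create_table(connection):
    """Create parent_reply if it is missing; does nothing when it already exists."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS parent_reply "
        "(parent_id TEXT PRIMARY KEY, comment_id TEXT UNIQUE, parent TEXT, "
        "comment TEXT, subreddit TEXT, unix INT, score INT)"
    )
    connection.commit()


def count_tables(db_path):
    """Connect, make sure the table exists, and count the tables in the file."""
    connection = sqlite3.connect(db_path)
    create_table(connection)
    query = "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table'"
    total = connection.execute(query).fetchone()[0]
    connection.close()
    return total


if __name__ == "__main__":
    db_path = os.path.join(tempfile.mkdtemp(), "2015-05.db")  # demo location
    print(os.path.exists(db_path))  # checked before anything connects
    print(count_tables(db_path))    # first run: file and table get created
    print(count_tables(db_path))    # second run: table already there, nothing changes
    print(os.path.exists(db_path))
```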

  • That's all for now. In the next tutorial, we'll actually start working through the data. I'm not sure if we'll be able to insert much of it, because there's a lot of cleaning up of the data to do, but we'll at least start buffering through the data, cleaning it up, and getting it ready to insert into the database.

  • Anyway, if you have any questions, comments, concerns, whatever, feel free to leave them below. Otherwise, I will see you in the next tutorial.

Data Structure - Creating a Chatbot with Deep Learning, Python, and TensorFlow p.2

Posted by 林宜悉 on January 14, 2021