  • Hi, I'm Bill Appelbe and today

  • in seven minutes flat I'm going to explain how Hadoop works

  • and what you can do with it and what Big Data is.

  • I've done a lot of Big Data projects in Australia,

  • in Canada, and in the United States, and I'm also a Learning Tree instructor.

  • OK, so why Big Data? Firstly,

  • we all know that governments and businesses

  • are all gathering lots of data these days: movies, images, transactions.

  • But why? The answer is that data is incredibly valuable.

  • Analyzing all data lets us do things like detect fraud going years back.

  • These days, too, disk is cheap. We can afford to keep all that data.

  • But there's a catch. All that data won't fit anymore

  • on a single processor or single disk, so we have to distribute it

  • across thousands of nodes.

  • But there's a good side to that.

  • If it's distributed, and we run in parallel, we can compute

  • thousands of times faster and do things we couldn't possibly do before.

  • And that's the trick behind Hadoop.

  • OK, how does Hadoop work? Suppose what I wanted to do was look for an

  • image spread across many hundreds of files. So, first off,

  • Hadoop has to know where that data is. It goes and queries something called the

  • name node to find out

  • all the places where the data file is located. Once it has figured that out

  • it sends your job out to each one of those nodes.

  • Each one of those processors independently reads its input file.

  • Each one of them looks for the image and writes the result out to a local output

  • file.

  • That's all done in parallel. When they all report finished,

  • you're done.
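
The steps just described (ask the name node where the data lives, send the job to each of those nodes, let each node scan its own local data, finish when all have reported) can be sketched in plain Python. This is only an illustration under stated assumptions: it does not use any Hadoop API, threads stand in for separate machines, and `NAME_NODE`, `LOCAL_DATA`, `search_on_node`, and `run_job` are hypothetical names invented for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the name node: file name -> nodes holding part of it.
NAME_NODE = {"photos": ["node1", "node2", "node3"]}

# Hypothetical local data on each node (each string stands in for an image).
LOCAL_DATA = {
    "node1": ["cat", "dog"],
    "node2": ["tree", "cat"],
    "node3": ["car"],
}

def search_on_node(node, target):
    """Work done 'on' one node: scan only that node's local data."""
    return [i for i, item in enumerate(LOCAL_DATA[node]) if item == target]

def run_job(file_name, target):
    # 1. Ask the name node where the pieces of the file are located.
    nodes = NAME_NODE[file_name]
    # 2. Send the job to every one of those nodes; they work independently.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda n: search_on_node(n, target), nodes))
    # 3. The job is done once every node has reported its local result.
    return dict(zip(nodes, results))

matches = run_job("photos", "cat")
```

Here `matches` maps each node to the positions of the hits it found locally, with an empty list for a node that found nothing.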

  • Okay.

  • We've seen one simple example of what you might want to do with Hadoop:

  • image recognition. But there's a lot more to it than that.

  • For example, I can do statistical data analysis.

  • I might want to calculate means, averages, correlations,

  • all sorts of other data. For example, I might want to look at unemployment

  • versus population versus income versus states.

  • If I have all the data in Hadoop I can do that. I can also do machine

  • learning

  • and all sorts of other analysis. Once you've got the data in Hadoop

  • there's almost no limit to what you can do.

  • Okay, we've seen that in Hadoop data is

  • always distributed, both the input and the output.

  • There's more to it than that. The data is

  • also replicated. Copies are kept of all the data blocks

  • so if one node falls over, it doesn't affect the result.
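
To make the replication idea concrete, here is a small Python sketch. It is a toy model, not HDFS code: the round-robin placement, the choice of three copies per block, and the names `place_blocks` and `readable_blocks` are all assumptions made for illustration, and it assumes there are at least as many nodes as copies.

```python
REPLICATION = 3  # assumed number of copies kept of each block

def place_blocks(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement

def readable_blocks(placement, failed):
    """Blocks that still have at least one copy on a node that is up."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    }

nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(["b0", "b1", "b2"], nodes)

# One node falling over loses nothing: every block has another copy elsewhere.
still_ok = readable_blocks(placement, failed={"node2"})
```

In this toy model a block only becomes unreadable when every node holding one of its copies is down at the same time.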

  • That's how we get reliability. But sometimes we need to communicate between

  • nodes.

  • It's not enough that everybody processes their local data alone.

  • An example is counting or sorting.

  • In that case communication is required,

  • and the Hadoop trick for that is called MapReduce.

  • Let's look at an example of how

  • MapReduce works.

  • What we are going to do is take a little application called Count Dates

  • that counts the number of times a date occurred,

  • spread across many different files. The

  • first phase is called the map phase.

  • Each processor that has an input file reads the input file in,

  • counts the number of times those dates occurred,

  • and then writes it out as a set of key/value pairs.

  • After that's done we have what's called the shuffle phase.

  • Hadoop automatically sends all the 2000 data to one processor, all

  • the 2001 data to another processor and the 2002 data to another processor.

  • After that shuffle phase is complete

  • we can do what's called a reduce. In the reduce phase

  • all the 2000 data is summed up and written to the output file.

  • When everybody is complete with their summations,

  • they report done and the job is done.
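
The three phases of the Count Dates example (map, shuffle, reduce) can be sketched in a few lines of Python. This is a single-process illustration of the idea rather than a real Hadoop job: the function names `map_phase`, `shuffle_phase`, and `reduce_phase` and the sample `files` are assumptions made up for this sketch, and in Hadoop each phase would run across many machines.

```python
from collections import Counter, defaultdict

def map_phase(lines):
    """One mapper: count the dates in its own input file and
    emit the result as (key, value) pairs, here (year, count)."""
    return list(Counter(lines).items())

def shuffle_phase(mapped_outputs):
    """Bring together all values that share a key, so that all the
    2000 data ends up in one place, all the 2001 data in another."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for year, count in pairs:
            grouped[year].append(count)
    return grouped

def reduce_phase(grouped):
    """One reducer per key: sum the partial counts."""
    return {year: sum(counts) for year, counts in grouped.items()}

# Three hypothetical input files, each a list of years.
files = [
    ["2000", "2001", "2000"],
    ["2001", "2002"],
    ["2000", "2002", "2002"],
]

mapped = [map_phase(f) for f in files]
totals = reduce_phase(shuffle_phase(mapped))
```

With these sample files, `totals` is `{"2000": 3, "2001": 2, "2002": 3}`: each mapper only ever sees its own file, and the sums are only correct because the shuffle gathers every partial count for a year before the reduce runs.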

  • OK, we've seen a couple of great examples

  • of how Hadoop works.

  • The next question is how does Hadoop compare

  • to conventional relational databases, because they've dominated the market for

  • years.

  • We've seen one big difference, which is that in Hadoop

  • data is distributed across many nodes

  • and the processing of that data is distributed.

  • By contrast, in a conventional relational database,

  • conceptually all the data sits on one server and one database.

  • But there are more differences than that.

  • The biggest difference is that in Hadoop data

  • is write once, read many.

  • In other words, once you've written data, you are not allowed to modify it.

  • You can delete it, but you cannot modify it. By contrast,

  • in relational databases data can be written many times,

  • like the balance on your account. But in archival data

  • which Hadoop is optimized for, once you've written the data

  • you don't want to modify it. If it’s archival data about telephone calls or

  • transactions,

  • you don't want to change it once you've written it. There's another difference, too.

  • In relational databases we always use SQL.

  • By contrast Hadoop doesn't support SQL at all.

  • It supports lightweight versions of SQL

  • called NoSQL but not conventional SQL.

  • Also, Hadoop is not just a single product or platform. It's a very

  • rich ecosystem of tools and technologies and platforms,

  • almost all of which are open source and all work together.

  • So what’s in the Hadoop ecosystem?

  • At the lowest level, Hadoop just runs on commodity hardware and software.

  • You don't need to buy any special hardware,

  • and it runs on many operating systems. On top of that is the Hadoop layer,

  • which is MapReduce and the Hadoop Distributed File System.

  • On top of that is a set of tools and utilities

  • such as RHadoop, which is

  • statistical data processing using the R programming language.

  • There's a machine learning tool. There are also tools

  • for doing NoSQL, like Hive and Pig, and the neat thing about those tools

  • is they support semi-structured or unstructured data.

  • You don't have to have your data stored in a conventional

  • schema. Instead you can read the data and figure out the schema as you go along.
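
The idea of figuring out the schema as you read can be illustrated with a short Python sketch. This is not how Hive or Pig do it internally; it is a minimal illustration that assumes the data arrives as one JSON record per line, and `raw_records` and `infer_schema` are hypothetical names chosen for this example.

```python
import json

# Hypothetical semi-structured input: records do not all share the same fields.
raw_records = [
    '{"caller": "555-0100", "seconds": 42}',
    '{"caller": "555-0101", "seconds": 7, "roaming": true}',
    '{"caller": "555-0102"}',
]

def infer_schema(lines):
    """Read the records and collect, for each field name,
    the set of value types seen so far."""
    schema = {}
    for line in lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

schema = infer_schema(raw_records)
```

Nothing had to be declared before the data was stored: the `roaming` field only appears in the discovered schema because one record happened to carry it.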

  • Finally we have tools for getting data

  • into and out of the Hadoop file system like Sqoop.

  • That ecosystem is constantly evolving. So for example there's now a new tool for

  • managing the Pig tool called Lipstick on Pig.

  • And there are many more and that environment keeps being added to

  • all the time.

  • So now we have seen how Hadoop

  • works and what it can do.

  • I’m sure you've got more questions than that, such as how do I install Hadoop

  • and on what platforms, what are the differences between different

  • Hadoop versions, or how to do Extract,

  • Transform, and Load in Hadoop.

  • Answers to those questions are on our website at the following URL.

  • I really hope you enjoy this video. Take care,

  • Cheers!

What is Big Data and Hadoop?

Posted by Ron on January 14, 2021