What is Big Data and Hadoop?

Hi, I'm Bill Appelbe, and today,
in seven minutes flat, I'm going to explain how Hadoop works,
what you can do with it, and what Big Data is.
I've done a lot of Big Data projects in Australia,
in Canada, and in the United States, and I'm also a Learning Tree instructor.
OK, so why Big Data? Firstly,
we all know that governments and businesses
are all gathering lots of data these days: movies, images, transactions.
But why? The answer is that data is incredibly valuable.
Analyzing all the data lets us do things like detect fraud going back years.
These days, too, disk is cheap. We can afford to keep all that data.
But there's a catch. All that data won't fit anymore
on a single processor or single disk, so we have to distribute it
across thousands of nodes.
But there's a good side to that.
If it's distributed, and we run in parallel, we can compute
thousands of times faster and do things we couldn't possibly do before.
And that's the trick behind Hadoop.
OK, how does Hadoop work? Suppose what I wanted to do was look for an
image spread across many hundreds of files. So first off
Hadoop has to know where that data is. It goes and queries something called the
name node to find out
all the places where the data file is located. Once it has figured that out
it sends your job out to each one of those nodes.
Each one of those processors independently reads its input file.
Each one of them looks for the image and writes the result out to a local output
file.
That's all done in parallel. When they all report finished,
you're done.
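
The flow just described can be sketched in a few lines of Python. This is only an illustration, not Hadoop's actual API: `NAME_NODE`, `LOCAL_DISKS`, `search_on_node`, and `run_job` are hypothetical names, the "nodes" are plain dictionaries, and threads stand in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the name node: which node holds which file.
NAME_NODE = {
    "photos-000": "node-a",
    "photos-001": "node-b",
    "photos-002": "node-c",
}

# Hypothetical stand-in for each node's local disk: file name -> image labels.
LOCAL_DISKS = {
    "node-a": {"photos-000": ["cat", "dog"]},
    "node-b": {"photos-001": ["tree", "target"]},
    "node-c": {"photos-002": ["car"]},
}


def search_on_node(node, file_name, wanted):
    """Runs 'on' one node: read the local file and report whether it matches."""
    images = LOCAL_DISKS[node][file_name]
    return {"node": node, "file": file_name, "found": wanted in images}


def run_job(wanted):
    # Step 1: ask the name node where every file lives.
    placements = list(NAME_NODE.items())
    # Step 2: send the same job to each node; they all work independently.
    with ThreadPoolExecutor(max_workers=len(placements)) as pool:
        futures = [
            pool.submit(search_on_node, node, file_name, wanted)
            for file_name, node in placements
        ]
        # Step 3: the job is done once every node has reported back.
        return [future.result() for future in futures]


results = run_job("target")
hits = [r["file"] for r in results if r["found"]]
print(hits)
```

No node ever needs another node's data here, which is why this kind of job parallelizes so easily.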
OK.
We've seen one simple example of what you might want to do with Hadoop:
image recognition. But there's a lot more to it than that.
For example, I can do statistical data analysis.
I might want to calculate means, averages, correlations,
all sorts of other data. For example, I might want to look at unemployment
versus population versus income versus state.
If I have all the data in Hadoop I can do that. I can also do machine
learning
and all sorts of other analysis. Once you've got the data in Hadoop
there's almost no limit to what you can do.
Okay we've seen that in Hadoop data is
always distributed, both the input and the output.
There's more to it than that. The data is
also replicated. Copies are kept of all the data blocks
so if one node falls over, it doesn't affect the result.
That's how we get reliability. But sometimes we need to communicate between
nodes.
It's not enough that everybody processes their local data alone.
An example is counting or sorting.
In that case communication is required,
and the Hadoop trick for that is called MapReduce.
Let's look at an example of how
MapReduce works.
What we are going to do is take a little application called Count Dates
that counts the number of times a date occurred,
spread across many different files. The
first phase is called the map phase.
Each processor that has an input file, reads the input file in,
counts the number of times those dates occurred,
and then writes the counts out as a set of key/value pairs.
After that's done we have what's called the shuffle phase.
Hadoop automatically sends all the 2000 data to one processor, all
the 2001 data to another processor and the 2002 data to another processor.
After that shuffle phase is complete
we can do what's called a reduce. In the reduce phase
all the 2000 data is summed up and written to the output file.
When everybody is complete with their summations,
they report done and the job is done.
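
The three phases of the Count Dates example can be sketched in plain Python. This is a single-process illustration under stated assumptions, not Hadoop's MapReduce API: the input data and the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are made up for the example, and each inner list stands for one input file on one node.

```python
from collections import Counter, defaultdict

# Hypothetical input: each inner list is one input file held by one node.
INPUT_FILES = [
    ["2000", "2001", "2000"],
    ["2001", "2002"],
    ["2000", "2002", "2002"],
]


def map_phase(records):
    """Count the dates in one local file and emit (year, count) pairs."""
    return list(Counter(records).items())


def shuffle_phase(mapped_outputs):
    """Group the pairs by key, so all counts for one year end up together."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for year, count in pairs:
            grouped[year].append(count)
    return grouped


def reduce_phase(grouped):
    """Sum the partial counts for each year into one total."""
    return {year: sum(counts) for year, counts in grouped.items()}


mapped = [map_phase(records) for records in INPUT_FILES]
totals = reduce_phase(shuffle_phase(mapped))
print(sorted(totals.items()))
# prints [('2000', 3), ('2001', 2), ('2002', 3)]
```

In a real cluster the map calls run on different nodes, and the shuffle moves data over the network so that each reducer sees every pair for its key.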
OK, we've seen a couple of great examples
of how Hadoop works.
The next question is how does Hadoop compare
to conventional relational databases because they've dominated the market for
years.
We've seen one big difference, which is that in Hadoop
data is distributed across many nodes
and the processing of that data is distributed.
By contrast, in a conventional relational database,
conceptually all the data sits on one server and one database.
But there are more differences than that.
The biggest difference is that in Hadoop data
is write once read many.
In other words once you’ve written data, you are not allowed to modify it.
You can delete but you cannot modify it. By contrast
in relational databases data can be written many times,
like the balance on your account. But in archival data
which Hadoop is optimized for, once you’ve written the data
you don't want to modify it. If it’s archival data about telephone calls or
transactions,
you don't want to change it once you've written it. There's another difference too.
In relational databases we always use SQL.
By contrast, Hadoop doesn't support conventional SQL.
It supports lightweight SQL-like languages,
often grouped under the NoSQL label.
Also, Hadoop is not just a single product or platform. It's a very
rich ecosystem of tools and technologies and platforms,
almost all of which are open source, and they all work together.
So what’s in the Hadoop ecosystem?
At the lowest level, Hadoop just runs on commodity hardware and software.
You don't need to buy any special hardware,
it runs on many operating systems. On top of that is the Hadoop layer,
which is MapReduce and the Hadoop Distributed File System.
On top of that is a set of tools and utilities
such as RHadoop, which is
statistical data processing using the R programming language.
There's a machine learning tool. There are also tools
for doing NoSQL, like Hive and Pig, and the neat thing about those tools
is they support semi-structured or unstructured data.
You don't have to have your data stored in a conventional
schema. Instead you can read the data and figure out the schema as you go along.
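
The idea of figuring out the schema as you read can be illustrated with a small Python sketch. It is not how Hive or Pig work internally; the call records in `RAW_LINES` and the helper `infer_schema` are hypothetical, and JSON lines are assumed purely as a convenient semi-structured format.

```python
import json

# Hypothetical semi-structured input: records do not all share the same fields.
RAW_LINES = [
    '{"caller": "555-0100", "seconds": 42}',
    '{"caller": "555-0101", "seconds": 7, "roaming": true}',
    '{"caller": "555-0102"}',
]


def infer_schema(lines):
    """Read every record and collect each field name with the types seen for it."""
    schema = {}
    for line in lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


schema = infer_schema(RAW_LINES)
for field in sorted(schema):
    print(field, sorted(schema[field]))
```

Nothing had to be declared before the data was stored; the fields and their types are discovered only when the records are read.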
Finally we have tools for getting data
into and out of the Hadoop file system like Sqoop.
That ecosystem is constantly evolving. So for example there's now a new tool for
managing the Pig tool called Lipstick on Pig.
And there are many more and that environment keeps being added to
all the time.
So now we have seen how Hadoop
works and what it can do.
I'm sure you've got more questions than that, such as how do I install Hadoop
and on what platforms, the differences between different
Hadoop versions, or how to do Extract,
Transform and Load in Hadoop.
Answers to those questions are on our website at the following URL.
I really hope you enjoy this video. Take care,
Cheers!