  • This is two-factor authentication; so basically to get in

  • I need my card and then I need a PIN and this is a scrambler pad so basically

  • every time that you look at that the numbers are in a different order.

  • This is the High Performance Computing facility for the University of Nottingham.

  • SEAN>> What do you use it for?

  • All sorts of things it's basically to do with the high compute

  • research so for example students and researchers will use this for doing

  • calculations based on things like fluid dynamics, aerospace, genomics...

  • All sorts of things anything which requires - astronomy - that's what anything that

  • requires a large amount of compute.

  • SEAN>> And you've got earplugs in today for obvious reasons

  • Yes it's a little bit noisy in here yes yeah...

  • SEAN>> So we will do some talking outside (LINK IN DESCRIPTION) but can you show us a bit of it before we go outside?

  • Certainly yes yes the

  • main HPC facility which we call Minerva is...

  • ...and then we've got some extensions in, on the racks on here.

  • SEAN>> All of these blinking lights, what's going on? Is this data activity..

  • ...or processing what's going on there?

  • Both! The actual lights that you can see there, the brighter ones, those are

  • actually the storage, the disk storage the actual compute nodes don't actually

  • blink very much. The ones at the bottom there that's the network activity.

  • We do shut it down for maintenance once a year for a day or so

  • this at the moment is the third generation of HPC - the first one...

  • ...which was installed about eleven years ago and then we regularly refresh this.

  • SEAN>> So this one's been going for how long, or how long has this been?

  • This one's been going for about four years.

  • SEAN>> Okay and then I hear rumors of a new one on the horizon?

  • Yes we're in the procurement at the moment to put a replacement in

  • SEAN>> and will that mean this gets ripped out and the

  • whole new one just gets put in?

  • Good question, we would like to utilize as

  • much as possible because although it is old you know there is still life left in

  • it and we do try to - "sweat the assets" as they say but certainly some of

  • this will be replaced.

  • SEAN>> What's it running, would we recognise any of the operating system or any of that?

  • It's, yes, the... most of the nodes are running a version

  • of Linux and the storage is fairly standard but above

  • that we use PBS as our main scheduler.

  • SEAN>> How many people might be using this at one time?

  • At any one time they're probably running hundreds of jobs

  • SEAN>> Do they run for a long time? Might they be running years? How does it work?

  • We wouldn't have jobs that are run for years but certainly we could have jobs which are

  • running for months. Most of the jobs, you know, are probably only running for days.

  • SEAN>> Okay and so when you look at a system like this can you put a figure on how much it costs?

  • Capital cost for a system like this we're probably talking in terms of about

  • one and a half to two million pounds ($2.1m - $2.8m)

  • The ongoing costs - We have about 250 kilowatts of air conditioning.

  • When we run this flat out - this particular block here running flat out pulls about 70 kilowatts of power

  • and you're drawing that all the time so to run this whole facility you're talking

  • about thousands of pounds just purely in power costs and then of course there's

  • all the ongoing licensing and the support for that... So it's not insignificant.
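
As a rough sketch of the power arithmetic above: the 70 kW figure is from the interview, while the electricity price and the function name `energy_cost` are assumptions chosen purely for illustration. The 250 kW air-conditioning figure is left out because it is not clear whether it describes cooling capacity or electrical draw.

```python
# Rough electricity cost for a constant load.
# 70 kW comes from the interview; the tariff below is an ASSUMPTION.

HOURS_PER_YEAR = 24 * 365


def energy_cost(load_kw: float, hours: float, price_per_kwh: float) -> float:
    """Cost of drawing `load_kw` continuously for `hours` at a flat tariff."""
    return load_kw * hours * price_per_kwh


ASSUMED_PRICE_GBP_PER_KWH = 0.10  # hypothetical tariff, not from the interview

per_day = energy_cost(70, 24, ASSUMED_PRICE_GBP_PER_KWH)
per_year = energy_cost(70, HOURS_PER_YEAR, ASSUMED_PRICE_GBP_PER_KWH)
print(f"70 kW block: about £{per_day:,.0f} per day, £{per_year:,.0f} per year")
```

At that assumed tariff, a year of continuous running comes to roughly £61,000 for this one block, before cooling, licensing, or support are counted.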

  • SEAN>> So that's a lot of power is there a big red

  • switch somewhere someone has to pull to turn it on?

  • Yes there is - and no I'm not going to press it for you

  • SEAN>> So it's obviously a lot of

  • equipment and looks like it might be quite complicated

  • does it ever go horribly wrong? Does it ever have big problems?

  • Generally speaking it is pretty reliable. Individual nodes will fail.

  • Individual disks will fail but generally speaking the equipment itself is relatively...

  • ...modern computer equipment is inherently reliable - we probably have

  • more problems with the air conditioning than we do with the actual compute itself.

  • SEAN>> So the other thing I was thinking about when you look at this is, is this

  • totally bespoke or is it like a template or how does it work in terms of

  • how do you buy one of these - How would you go and buy a high-performance computer?

  • That's the $64,000 question basically you have to start to think

  • "What do we need it for?" because there is no one generic high performance compute job.

  • Different departments, different research, different requirements have

  • different computing requirements. Some are very very high performance computing you know

  • it's a lot of number crunching - others it's about manipulating data so there's a lot of

  • data movement. Other things it's about visualization.

  • So the first thing you've got to do is to say right "What is our mix of jobs?" because the way in

  • which you set it up for high analytics is a different hardware set to what you set

  • up for visualization and things like that. So that's the first thing you've got to do.

  • You've then basically got to say okay these are the jobs that we want to run.

  • Once you've actually got that you then go out to a supplier to say right

  • this is what we want to do, this is how much money we've got to spend. What can you give us?

  • Although this is fairly old now, you know there is still quite a lot

  • of life left in here okay it's not cutting edge - but it'll

  • still do a lot of the jobs because a lot of the jobs are purely about number crunching.

  • This is perfect for that so basically we will put the new one in - We

  • will try and keep as much as we can of the old one so that we "sweat our assets"

  • and that also means that we've got additional capacity for our

  • researchers to use as well and then basically we will then go for a

  • gradual replacement so as new processors come online and as new research projects

  • come you know the balance of the jobs will change so that means we may have to

  • strip out a particular type of node replace it with a different type of node

  • but you know so that will be far more organic in the future we're not

  • expecting in the future to do a complete rip and shred. Unless something

  • comes up and oh you know we build a new data center - but that's not on the cards at the moment.

  • The equipment itself is fairly generic, you know, these are standard

  • blade enclosures. The storage is standard storage - We have about two hundred and

  • forty terabytes in this block here - it's all connected up by InfiniBand

  • SEAN>> Is InfiniBand a speed of network?

  • It's a standard - This is 40 gigabit InfiniBand.

  • SEAN>> So at home you might have Gigabit - this is 40 of those?

  • Yes, 40 Gigabit yes - and also of course it's also multi path as well so..

  • ...because you know there's no point in doing a lot of calculations if you

  • can't then get the result of those calculations off.
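
To put the 1 gigabit versus 40 gigabit comparison into numbers, here is a minimal sketch. The 1 TB data size and the function name `transfer_seconds` are assumptions for illustration, and the model ignores encoding and protocol overhead, so real transfers would take longer than these ideal figures.

```python
# Ideal transfer time over a network link, ignoring all overheads.

def transfer_seconds(size_bytes: float, link_gbit_per_s: float) -> float:
    """Time to move `size_bytes` over a link rated at `link_gbit_per_s`."""
    bits = size_bytes * 8
    return bits / (link_gbit_per_s * 1e9)


ONE_TB = 1e12  # decimal terabyte; an illustrative size, not from the interview

for rate in (1, 40):
    seconds = transfer_seconds(ONE_TB, rate)
    print(f"{rate:>2} Gbit/s: {seconds:,.0f} s per terabyte")
```

Under these assumptions a terabyte takes 8,000 seconds (over two hours) at 1 Gbit/s and 200 seconds at 40 Gbit/s.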

  • There are effectively two types of jobs. There are parallel jobs where you've got a job running on

  • multiple nodes and then you've got single node jobs where basically

  • it's all running on one node. So again, with the parallel jobs you need network

  • connectivity to make sure you're not processing the same bit twice.
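
The idea of splitting one job across workers so that no piece is processed twice can be sketched as follows. This is only an illustration: threads on one machine stand in for compute nodes, the sum-of-squares workload is a made-up stand-in for research code, and all function names are hypothetical. A real cluster job would coordinate over the network, for example with a message-passing library.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(n_items: int, n_workers: int) -> list[range]:
    """Divide indices 0..n_items-1 into contiguous, non-overlapping chunks."""
    base, extra = divmod(n_items, n_workers)
    chunks = []
    start = 0
    for worker in range(n_workers):
        # The first `extra` workers each take one additional item.
        size = base + (1 if worker < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def sum_of_squares(chunk: range) -> int:
    """Per-worker calculation; a stand-in for a real workload."""
    return sum(i * i for i in chunk)


def parallel_sum_of_squares(n_items: int, n_workers: int) -> int:
    """Run each chunk on its own worker and combine the partial results."""
    chunks = split_range(n_items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(sum_of_squares, chunks))


serial = sum(i * i for i in range(1000))
print(parallel_sum_of_squares(1000, 4) == serial)
```

Because the chunks are disjoint and together cover every index, the combined result matches the single-worker answer.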

  • SEAN>> So for a researcher or someone who's a

  • part of a project what's the big benefit of doing this rather than letting their

  • office computer do it? Is it the speed of compute? Is it the fact that they can set

  • it off and come back another day or, what's the main benefit?

  • Yes it's the capacity. Because basically the job will start to run it will then

  • continue to run and then so for example

  • Christmas is a very very busy time for us because a lot of researchers will

  • start a job going then come back after Christmas and pick up the data

  • As I say, you could do these things at home, it's just that it would take you

  • months or years to do what this can do in days or hours.

  • SEAN>> Are they 'hot' swappable then?

  • Yes they are

  • SEAN>> (Joking) Come on then, let's pull one out...

  • No!

  • They're all single-phase power but because the phase on this rack is

  • different to the phase on this rack there is the possibility of having a

  • potential difference of more than 400 volts across the two racks. It's unlikely

  • because each of the... but from a "health and safety" point... and it's exactly the

  • same reason you'll see a lot of these have got laser [warning stickers], because we use laser optics.
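
The voltage between two racks on different phases can be estimated with a short calculation. This sketch assumes ideal sinusoidal phases of equal amplitude 120 degrees apart, and the 230 V and 240 V phase voltages are assumed typical UK values, not figures from the interview.

```python
import math


def line_to_line_rms(phase_rms: float, angle_deg: float = 120.0) -> float:
    """RMS voltage between two equal-amplitude phases `angle_deg` apart.

    The difference of two phasors of magnitude V separated by an angle
    theta has magnitude 2 * V * sin(theta / 2).
    """
    return 2 * phase_rms * math.sin(math.radians(angle_deg) / 2)


for phase_voltage in (230.0, 240.0):  # assumed supply voltages
    between = line_to_line_rms(phase_voltage)
    print(f"{phase_voltage:.0f} V per phase -> {between:.0f} V between phases")
```

With a 120 degree separation the result is the phase voltage multiplied by the square root of three, which lands at about 400 V for these inputs.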

  • SEAN>> For your networking?

  • Er, yes the fibre...

  • SEAN>> And what is that, the aircon?

  • Nope, that is the fire suppression

  • SEAN>> Oh let's go have a look at that then

  • The fire suppression system that we have in here is an IG55 system, which is an inert gas.

  • It's 50% Argon, 50% Nitrogen

  • basically if there is a fire in here all of the gas in there is released in one go that

  • replaces about half the atmosphere in here which takes the oxygen level down to

  • a point where it doesn't support combustion. It is just about breathable but you wouldn't want to

  • run a marathon in it you know it's like trying to run at the top of Mount Everest.
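
The effect of replacing part of the room's air with an oxygen-free gas can be sketched with a simple dilution model. This assumes the displaced air leaves the room one-for-one and the remainder mixes evenly, which is a simplification; the 20.9% starting figure is the approximate oxygen content of normal air, and the function name is hypothetical.

```python
AMBIENT_O2_PERCENT = 20.9  # approximate oxygen content of normal air by volume


def oxygen_after_discharge(fraction_replaced: float) -> float:
    """Oxygen percentage after `fraction_replaced` of the air is swapped
    for an inert gas that contains no oxygen (simple dilution model)."""
    if not 0.0 <= fraction_replaced <= 1.0:
        raise ValueError("fraction_replaced must be between 0 and 1")
    return AMBIENT_O2_PERCENT * (1.0 - fraction_replaced)


# "About half the atmosphere", as described in the interview.
print(f"{oxygen_after_discharge(0.5):.2f}% oxygen remaining")
```

Replacing half the air in this model roughly halves the oxygen level, to a little over 10%.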

  • SEAN>> So it suppresses the fire without damaging the kit?

  • Yes. The gas is released through these nozzles here.

  • SEAN>> They look like sprinklers but they're actually gas...

  • Gas nozzles, yes.

  • SEAN>> And how does it work with the cooling? Does it go in hot one side and out cold the other?

  • This is - Yes basically we use aisle

  • containment so this is the cold aisle where we put cold air

  • in it then goes through the equipment we'd expect to see a delta T in terms of

  • 20-odd degrees - and on the other side basically it gets vented through...
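
The relationship between heat load, temperature rise, and airflow can be sketched with the standard heat balance (power = mass flow x specific heat x temperature rise). The 70 kW load and the roughly 20 degree rise are taken from the interview; the air properties are approximate textbook values, and the sketch assumes all electrical power ends up as heat carried away by the air.

```python
AIR_SPECIFIC_HEAT = 1005.0  # J/(kg*K), approximate value for dry air
AIR_DENSITY = 1.2           # kg/m^3, approximate value near room temperature


def airflow_for_heat_load(heat_watts: float, delta_t_kelvin: float) -> tuple[float, float]:
    """Return (mass flow in kg/s, volume flow in m^3/s) needed to carry
    `heat_watts` away with a temperature rise of `delta_t_kelvin`."""
    mass_flow = heat_watts / (AIR_SPECIFIC_HEAT * delta_t_kelvin)
    volume_flow = mass_flow / AIR_DENSITY
    return mass_flow, volume_flow


mass, volume = airflow_for_heat_load(70_000, 20)
print(f"about {mass:.1f} kg/s of air, or {volume:.1f} cubic metres per second")
```

Under these assumptions the block needs roughly 3.5 kg of air per second, a little under 3 cubic metres per second, passing through it.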

  • SEAN>> So through that glass is going to be 20 degrees warmer? Can we go in?

  • Yeah.

  • OK I think I'd like to spend my time on this side...

  • If you come down here you can definitely feel the temperature difference.

  • So these are compute nodes.

  • SEAN>> ...and how many computers are in each one of those blocks then?

  • Each one of these, so in this particular one you've got 1 2 3 4... ...8 individual blades in this blade enclosure here.

  • You asked about the big red button? That's the big red button

  • SEAN>> That would turn it off and on?

  • No, that would turn it off.

  • SEAN>> Ah that's like a "Danger danger!" - press that?

  • Basically if I press that then everything will die immediately

  • SEAN>> let's stay away from the big red button then...

  • But that is the big red button, yes....

High Performance Computing (HPC) - Computerphile

  • Posted by 林宜悉 on 14 January 2021