字幕列表 影片播放 列印英文字幕 There's a parsing theme, devotees will know, and there's a Regexp theme. And our aim, in the end, is to unite these all into one, at the top, so that you're all totally happy - or completely unhappy - as the case may be. We have done Regular Expressions and the last one I did was really very simple and straightforward, and showed you absolutely nothing about what can go wrong. And some of you out there, coming up with the usual old jokes, quite right, about: "Oh! you're implementing using Regular Expressions. Now you've got two problems", Yeah! We'll look on in glee as we now take the story a bit further and see what can go right and how powerful Regular Expressions are. But also a little bit of what can go wrong and why you've got to be careful in your use. So what can it be this marvellous new example that consumes Life the Universe and Everything and exhibits a whole load of new features? And you're going to be quite convinced I'm off my head. It's Roman Numbers. Way way back when we first, here at Nottingham, got Lex and Yacc, in UNIX Version 7 - it was in the late 70s. So, believe me, I have been ... or, back in those days I was setting Roman Numbers as an example of what Lex - remember the Regular-Expression-based front-end to your compiler. Does tour identifiers and all that. I thought, all right, as an exercise, just write a Regular Expression that will accept Roman Numbers - all of the good ones and reject all the bad ones. Now, I can't emphasize too strongly: that rider is important! It is very easy, in a Regular Rxpression, quite easy, to get it to recognize what you want, but will it [also] reject all the stuff that's bad? And it is sometimes so easy to overlook [things]. I say with shame because, when I'd set that question, I scratched down a quick and easy answer - which I now see is full of holes .... Because if you input, into my regular expression MMII - again, looking ahead, Roman Numbers MMII [is] 2002. Fine. But then I thought ... well I thought now, I didn't think then: "What happens if I do MIIM? [winces at the thought]?" [Answer = ] 2002. [It] shouldn't be. There's a strict hierarchy in the way these work. So, at this stage we don't know how much, if anything, you remember about Roman Numerals. So, as an Extra Bits we've put out a simple tutorial on Roman Numerals based around this example, But for those of you that are totally happy with Roman numerals, we can just carry on with the main video now. So, just to locate ourselves then, and do a few examples to make everybody comfortable I'm going to write down MCDXCII. Start at the left. We've got a thousand [M] Then we see CD and, remember, this is one of these subtraction things CD is "a hundred less than 500" which is 400. MCD. So we're up to 1400 so far. XC? >> Sean: OK That's 90 >> DFB: 90. II? >> Sean: 2 >> DFB: 1492. What happens in 1492 Sean, in Western European folklore? "Columbus discovered horizons new in America (1492) >> Sean: I should have known that one >> DFB: Yeah! right. And, of course, the year we're in at the moment, as we film all of this, is - that's simple MMXX, 2020. And one which is dear to me, I'm sure you'll work out why. Yes, it's got a lot of these prepended abbreviation things in there look, MCM; 1900, XL: 10 less than 50 (40) IV: one less than 5 (4) -- 1944. 244 let's focus Let's focus ourselves now on how would we do a Regular Expression for all of this? In fact, let's make things very very simple and say: "All I'm gonna do at the moment is concentrate on getting a Regular Expression pattern that will either generate me one or two, or three, or four, or five - [writes on paper] forgive me I'll carry on down here. Or six, seven, eight, or nine. And I won't go any further than that. Because tens is a new adventure further to the left. So, those numbers then, from one to nine, which I've written know like that now. I hope you can see that if you're completely simple-minded about this and don't want to do complex Regular Expressions you really could write yourself a great long 'alternation' as it's called. In other words a great big sequence of ORs. And you could do everything from one up to thousands and just have this walloping great OR clause, that your Regular Expression parser will probably go bananas and say: "too many alternations", or something like this. So, no, that's not the way to do it. What one wants to do is to generalize and use, as it were, elegant Regular Expression capabilities to shorten this a bit. So, there we are the Roman numbers from I to IX, All written out. First of all let's try and separate these things into classes, and groups, that make sense. There are two weird things which involved the subtraction rule, you know, like IV (1 less than 5) IX (one less than 10). I think we will do those together, because they're weird in the same way, if you like. The other things are much more regular like you go from 1 2 and 3, which is with III. And similarly, here, you get the same pattern but coming after a V - a V on its own, or VI, VII, VIII, and so on. So, we've got three groupings: things involving Is with or without a V in front of them; and the two special cases IV and IX. So I would conjecture there's a wonderful piece of Regular Expression covering so far what we've done, but something new coming up - so do pay attention - would be this: my RE is going to be I - now this is square 'choice' brackets [], We have seen these before. But whereas, before, I put ranges inside them - you know like [A-Za-z] [0-9], any one of those digits. These aren't ranges; these are just explicit possibilities inside square brackets. It's called a 'choice' so I've got to have an 'I', if I'm on this side of the expression. But then it could be followed by either an X or a V, but it must be one or other of them, and only once. There's no repetition outside; there's no star there's no plus; no nothing. So those are the special cases taken care of, the IV and the IX. Now we put a walloping great big OR bar down the middle. Or if it's not a special case it's a regular straightforward case. But we've been clever with our REs now. Oh yes! look at this. Now, again, I think this is something we may have mentioned but probably not another piece of regular expression notation is: " ... if you put a question mark after something it means 0 or 1 of". So, what I'm saying here is I can either have a V, or I don't need a V. How can you not need a V? Well, what's following on here is going to be something that really is new. This is a counted repetition in curly braces zero - [should be 0,3]. Look at that! Bound to win an award! Let's just run through it. The two special cases are off on that left-hand side of the OR bar. But if you choose not to go down that route you go to here and it says: "Oh well if it's not one of the two special cases then it's like the following: anything between 0 and 3 Is. I mean you may not have Is at all - you can just have V on its own if you want - preceded by an optional V, nought or one these, >> Sean: Tthat's giving you the choice of either 1 2 3, or 5, or 6 7 8, right? >> DFB: Yeah, yeah, that's right. But it's all to do with sometimes having zero of something. Now, hereby hangs a tale about Regular Expressions. And those of you that keep on posting these notices about " ... you've got multiple problems with Regular Expressions !". Well. here's another one. It's the strange case of the 'empty alternative', which really can gang up to get you. Because what you're saying here, in this expression, is: " ... Is it possible for it to match nothing at all and yet that's a valid match?" Yes it is! Because if you choose to ignore what's on the left of the OR bar here - which you can do, it's either/or - we come here, and we say: " V, oh! I can have 0 or 1 of those. So, let's have 0 of them". The I ? I can have anything between 0 to 3 of those. So let's choose 0 of them. So, in other words choosing nothing at followed by nothing is a valid match for a Roman number [according to this RE]. And if you think about it it's true. Nothing in Roman Numbers is compulsory. You can have M's, or no M. Yyou can have C's or no C's. You know all of them. It's all optional. But as you start getting down to the units digits you think: "I've got to have something !" The paradox, then, about Roman Numbers done as a Regular Expression, is: you've got all sorts of cases that yield you nothing at all and yet you're going to be very disappointed if you don't have something! But saying "... overall you must have something" That's not a context-free assertion. That's a context-sensitive assertion. So we'll live with that for the moment. And I think also what I'd like to just show you now, if I can get the screen fired up ... and I'm using 'awk' here to demonstrate this, I'm going to add yet another little piece of notation to all of this. I'm going to say that you can, if you want to, insist that what you're matching must just be a valid Roman Number, on a line on its own, with no writing or messing about on either side of it. Just the number. Start of the number (^), end of the number ($). Finish. Right? And you denote that by saying start of string anchor, ^ End of string anchor, $. You're basically saying whatever pattern there is must fit between the start of the string, on a line on its own, and the end of the string - the newline right at the end. Now, this is the key thing about this. Those things [^$ markers] if you put them in, are not options. You can't have zero of them. If you say you must match the beginning of the line, you must match the beginning of the line. If you say you must match the end of the line, you cannot ignore it. You can't have zero of them. So, in a way, this is a kind of reality check and it's generally the case that if you use these correctly you won't end up with null strings being a pseudo-match. Because this thing is saying: "Well, whatever the finite [non-zero] length string is, it's got to fit between beginning of line end of line. Here's the 'awk' script that I hacked together, to try out a few alternatives. And I decided to go for this simple little piece first. Can I do everything from 1 to 9 in a single piece of Regular Expression. And you'll see here this is classic 'awk'. You, give a Regular Expression pattern, then you give an action. It's a C-like syntax, even in 'awk'. Basically what I'm saying is; "If I match that then I print out 'Pattern p1 matched' and so on". Now, if you look down here four of these things - because this is just a test-bed - four of them are commented out, they will not be executed. The one that will be executed is the one I've just been discussing with you, here, which says: "I must match start of string then here's my Regular Expression pattern - which I think will work from I to IX inclusive, and then it must match end of string". So, I'm hoping that by uncommenting this that this will work. And I'll type some good and bad Roman Numbers at it. And for every one it's happy with, it will say 'Pattern P0 matched is' And it will echo back what it your Roman Number [was] that you put in. So, what I want to do, now, is actually run that script, that we've been looking at. 'awk - f ' take your instructions from a file called 'test-roman.awk' OK, so when that runs now it should just hang there waiting for its input of a Roman number. And remember it is going to match that beginning of string, piece of regular expression, end of string. >> DFB: So. which one do you want me to try first Sean? >> Sean: OK, let's go really simply and just go for a number 5 [V]. >> DFB: V all on its own [waits] Look at that! That's a 'P0 matched is V >> Sean: All right - a bit trickier 4 [IV] because it's got that 'minus one' sort of idea. Let's put one in that shouldn't work. >> DFB: Go on that what do you want ? OK just put in C for Computerphile. >> DFB: C for Computerphile? Now you might say that's a valid Roman number. Yes, but it's it's not being covered by the bit of Regular Expression I've done, I've done a piece of Regular Expression that literally, I hope, does I to IX and nothing else. Let's see what happens there [with C] It echoes it back but it doesn't do anything with it. So let's now try IX, looking good. All right how about the ultimate test [which] is type in something utterly idiotic like HELLO No, doesn't do anything with it, doesn't recognize it. So, we're back showing the 'awk' script again and let me just remind you that what we've shown is that as long as we put the anchors in at the beginning and the end of this pattern then our minimal testing, but I think we've tried all the options, it's recognized V it's recognized IX. IV all this stuff. That one is absolutely fine and it rejects rubbish. I've actually put in, as a comment, StackOverflow's assertion that this [points at screen] is the thing that's the answer to life the universe and everything. Let's move right across to the right and the first thing you'll see here is the piece we've just done for the units as it were, I through to IX. But as you go leftwards can you see that the beauty of this is ... that has handled the units part of the Roman Numbers. It's dead easy to see, therefore, that it's like a repetition of the same idea, but with different symbols? When you come to the tens range What are the exceptions there XC (ten less than a hundred) XL (ten less than fifty). I's just like the I's and the X's and V's were before, look. Similar. X[CL]. Or it is is an optional L, for 50, followed by a tens digit, zero {0,3} of them. You can't say just {1,3} because L on its own is perfectly valid as a Roman Number Number. So, you see that piece of pattern - similar to the far-right one but just with a relabeling almost of the digits you're using- that copes with the tens. Up here [points to left of screen] it's coping with the hundreds, going on to the thousands. So, the story finishes off in here by saying: Ah! you can have CM (a hundred less than a thousand) i.e. 900 in other words. CD for 400. And the midpoint of the range, which everything pivots around. It was V there, it was L here, it is going to be D here. I hope you all know that D is 500? So there we are. how about you see three very similar pieces of pattern for the hundreds the tens and the units and right out here at the far left, M*. And M* means you can have zero or more 1000s. So, we've got the basis there for something that looks like it would do everything. The only problem is these empty matches. But that's such a big problem we probably ought to have an episode devoted to that all on its own.
A2 初級 RegEx羅馬數字 - Computerphile (RegEx Roman Numerals - Computerphile) 5 0 林宜悉 發佈於 2021 年 01 月 14 日 更多分享 分享 收藏 回報 影片單字