[MUSIC PLAYING] ELENA NIEDDU: I'm excited to be here and to talk to you about the In Codice Ratio project -- that is a project going on at Roma Tre University -- and about how TensorFlow helped us build a module that is able to transcribe ancient manuscripts in the Vatican Secret Archive.

So, some introduction first. This is our team. On the right, you can see paleographers and archivists. And on the left, there is us, a data science team. And that's why I think the name we chose, In Codice Ratio, reflects us very well, because it's a word play between the Italian and the Latin meanings of the word "codice." In Latin, "in codice ratio" would mean knowledge through manuscripts. But the word "codice" in Italian also means software code, so it's also knowledge through software, which is exactly what we're planning to do.

So you might ask yourselves, what brings paleographers, archivists, and data scientists together? Well, they have one problem in common: they both want to discover knowledge from big data. We are used to thinking of big data as something that happens on the web. But actually, historical archives are an endless source of historical information, of important information, of cultural information. And just to give you a sense of how large this information can be, let's compare for a second the size of the Vatican Secret Archive to the height of Mount Everest. If you were to take each shelf of the Vatican Secret Archive and stack them one on top of the other, you would get a stack 85 kilometers tall. That is about 10 times the height of Mount Everest. And the content spans the centuries and the continents. For example, there you have letters coming from China, from Europe, from Africa, and, of course, from the Americas.

So what is our goal? Our goal is to build tools and technology that enable historians, archivists, and scholars of the humanities in general to perform large-scale analysis on historical archives.
Because right now, the process, let me tell you, is entirely manual. You still have to go there, consult the documents manually, and be able to read that very challenging handwriting. And then, if you find information that may be linked to another collection, you have to follow it up all by yourself.

But first, we have to face the very first challenge. When you are dealing with web content -- for example, if you want to extract data from the internet -- well, that's already text. But when we say we're dealing with historical documents, those are often scans. And traditional OCR is fine for printed text. But then you get to this. This is medieval handwriting. It's Latin, a language nobody uses anymore. It's a handwriting nobody is able to write, or read for that matter, anymore. It's heavily abbreviated. And still, you want to get text out of it. So you might want to train a machine learning model. Of course you want to.

But then we come to the second challenge, and that is scalability in the data set collection process. Now, the graph you see there is on a logarithmic scale. And it shows you something you may already know, known as Zipf's law: very few words occur a huge number of times, and most words do not occur that often. What does that mean for us? If we want to collect data, for example, at the word level, at the vocabulary level, this means we have to annotate thousands of lines of text, which means hundreds of pages, OK? And similar systems do exist. They are state-of-the-art systems. But most paleographers, even when they know of these tools, get discouraged from using them, because they say, well, it's not cost-effective for me. Because it can take up to months, or even years, of work on these documents just to get a transcription system that they will maybe use once or twice -- I don't know -- whereas they would like to do it faster. So we asked ourselves, how can we scale on this task?
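To make Zipf's law concrete, here is a tiny Python illustration on a toy Latin phrase. The corpus is invented for illustration, not from the project's data: a couple of function words dominate the token count, while most distinct words appear only once, which is exactly why word-level annotation requires so many pages of text.

```python
from collections import Counter

# Toy corpus (invented Latin phrase, not from the project's archive).
corpus = ("in nomine domini et in nomine patris et filii "
          "et spiritus sancti amen et in saecula saeculorum").split()

counts = Counter(corpus)
ranked = counts.most_common()

# The top-ranked word covers a large share of all tokens...
top_word, top_count = ranked[0]
print(top_word, top_count)  # 'et' occurs 4 times in 17 tokens

# ...while most distinct words occur exactly once ("hapax legomena").
hapaxes = [w for w, c in counts.items() if c == 1]
print(len(hapaxes), "of", len(counts), "distinct words occur once")
```

The same skew holds at scale: annotating enough text to cover the rare words means annotating the frequent words thousands of times over.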
And so we decided to proceed by easier, simpler steps. The very first thing we did was to collect data for single characters. And this enabled us to involve not paleographers but people with much less experience. We built a custom crowdsourcing platform that worked pretty much like CAPTCHA solving. What you see there is an actual screen from the platform. The workers were presented with an image and with a target, and they had to match the target and select the matching areas inside the image. And in this way, we were able to involve more than 500 high school students. And in about two weeks' work, we made more than 40,000 annotations.

So now we had the data, and we wanted to build a model. When I started working on the project, I was pretty much a beginner in machine learning, and so TensorFlow helped me put into practice what I was studying in theory. It was a great help that I could rely on tutorials and on the community and, where everything else failed, even the source code. So we started experimenting, and we decided to start small first. We didn't want overkill; we wanted the model to fit exactly our data. So we started small and proceeded incrementally, in this phase, in a constant cycle of tuning hyperparameters and tuning the model -- choosing the best optimizer, the best weight initializers, the number of layers and the type of layers -- and then evaluating and training again. Then we used Keras. It was good for us because it allowed us to keep the code small and readable. And then, this is what we settled for. It might look trivial, but it allowed us to get up to a 94% average accuracy on our test characters.

So where does this fit in the whole scheme of the transcription system? It's there in the middle. And it's actually, so far, the only [INAUDIBLE] part, but we are planning to expand. And you will see how later -- we will see how later. And so we have the input image. So far, we're relying on an oversegmentation that is old-school.
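As a rough idea of what a deliberately small Keras character classifier of this kind could look like, here is a minimal sketch. The talk does not spell out the architecture, so everything here is an assumption for illustration: the input size (56x56 grayscale crops) and the number of character classes (22) are placeholders, and the layer choices are just a plausible "start small" configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Assumed values, not figures from the talk: crop size and class count
# are placeholders for a single-character Latin alphabet classifier.
NUM_CLASSES = 22

model = models.Sequential([
    layers.Input(shape=(56, 56, 1)),          # grayscale character crop
    layers.Conv2D(32, 3, activation="relu"),  # small conv stack
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                      # regularization for a small data set
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

A network this size keeps the code small and readable, matches a 40,000-example data set without overfitting badly, and leaves the hyperparameters (optimizer, initializers, layer count) easy to swap in the tuning loop the talk describes.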
It's a bit old-school, but it allows us to feed single characters, or combinations of characters, into the classifier, which then produces different candidate transcriptions, which are ranked according to a Latin language model, which we also built from publicly available sources. How good do we get? We get about 65% exact transcriptions. And we can get up to 80% if we consider minor spelling errors, or if the segmentation is perfect. If we had perfect segmentation, we could get up to 80%. We will see that this can be more challenging.

OK. So what are our plans for the future? We're very excited about the integration of TensorFlow and Keras. Because I described the process as being fully Keras, but what we actually found was that sometimes some features were lagging behind, and sometimes we wanted to get one part of the features from Keras or from TensorFlow. And so we found ourselves doing lots of -- I don't know if that's your experience as well -- but we found ourselves doing lots of back and forth between TensorFlow and Keras. And now we get the best of the two worlds, so we're very excited about that.

And so how do we plan to expand our machine learning system? First things first, we are trying U-Nets for semantic segmentation. These are the same networks that achieved very good results on medical imaging. And we're planning to use them to get rid of this tricky, old-school computer vision segmentation. And that would also achieve the result of having classification at the same time, because this is semantic segmentation we're talking about. These are some preliminary examples that work particularly well. Of course, there is still work we have to do. And then, of course, since there could still be ambiguity, we could do error correction and then transcription. But I think this would be, in itself, a significant improvement.

And another thing we're experimenting with is enlarging our data set, because we don't want to stick to characters. We want to evolve.
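The ranking step described above can be sketched with a toy example: a character-bigram "Latin language model" scores the candidate transcriptions coming out of the classifier, and the most Latin-like candidate wins. The training phrase and the candidates here are invented for illustration; the real project built its model from much larger publicly available Latin sources.

```python
import math
from collections import Counter

def train_bigram_model(text):
    """Return an add-one-smoothed log-probability function over character bigrams."""
    bigrams = Counter(zip(text, text[1:]))
    unigrams = Counter(text)
    vocab = len(set(text))
    def logprob(a, b):
        return math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
    return logprob

def score(candidate, logprob):
    """Average bigram log-probability; higher means more Latin-like."""
    pairs = list(zip(candidate, candidate[1:]))
    return sum(logprob(a, b) for a, b in pairs) / len(pairs)

# Tiny illustrative training text (not the project's real corpus).
latin = "pater noster qui es in caelis sanctificetur nomen tuum"
lp = train_bigram_model(latin)

# Two hypothetical readings of the same glyphs from the classifier.
candidates = ["pater", "pcter"]
best = max(candidates, key=lambda c: score(c, lp))
print(best)  # the well-formed Latin word wins
```

Here the misread "pcter" contains the bigram "pc," which the model has never seen, so the correct reading "pater" scores higher.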
We want to move to word-level, and even sentence-level, annotated data. But still, our focus is scalability in the data set collection, so we want to involve paleographers as little as possible. So for example, these are inputs generated by a GAN. But we are also planning on using, for example, a variational autoencoder, so that we can grow our data set with as little human interaction as we can. And in the end, this would bring us to actually use sequence models that could take full advantage of the sentence-level context, for example, and could even be able to solve things that we couldn't solve with single-character classification -- for example, abbreviations. In this kind of text, many words occur abbreviated, just like when you text: in some texts, you would say "me too" and use "2," the number, or "4U." And it's the same with this kind of manuscript. And that's one of the applications you could have. Also, we are planning to use sequence models to get to a neural language model, because so far, we have only experimented with statistical ones.

And one last thing before I let you go. I mentioned the people on the team, but there are so many people I would like to thank that were not on that slide. First of all Simone, who should have been here, but he couldn't make it. And he was my machine learning Jedi Master. And then Pi School of AI and Sébastien Bratières and Lukasz Kaiser for their amazing mentoring. And Marica Ascione, the high school teacher who actually allowed us to involve those students that were part of the platform. And, of course, all of the graduate and undergraduate students who worked with us and helped us achieve what we have achieved and what we plan to achieve in the future. And of course, thank you for your attention. [APPLAUSE] [MUSIC PLAYING]
In Codice Ratio: Machine Transcription in the Vatican Secret Archive (TF Dev Summit '19)