In Codice Ratio: Machine Transcription in the Vatican Secret Archive (TF Dev Summit '19)

  • [MUSIC PLAYING]

  • ELENA NIEDDU: I'm excited to be here

  • and to talk to you about the In Codice Ratio project--

  • that is a project going on at Roma Tre University--

  • and to talk to you about how TensorFlow

  • helped us build a module that is able to transcribe

  • ancient manuscripts in the Vatican Secret Archive.

  • So some introduction first.

  • This is our team.

  • On the right, you can see paleographers and archivists.

  • And on the left, there is us, a data science team.

  • And that's why I think the name we chose, In Codice Ratio,

  • reflects us very well.

  • Because it's a word play between the Italian

  • and the Latin meaning of the word "codice."

  • Now, in Latin, "in codice ratio" would mean knowledge

  • through manuscripts.

  • But the word "codice" in Italian also means software code,

  • so it's also knowledge through software,

  • which is exactly what we're planning to do.

  • And so you might ask yourselves, what

  • brings paleographers and archivists and data scientists

  • together?

  • Well, they have one problem in common.

  • They both want to discover knowledge from big data.

  • We are used to thinking of big data as something

  • that happens on the web.

  • But actually, historical archives

  • are an endless source of historical information,

  • of important information, of cultural information.

  • And just to give you a scale of how large this information can

  • be, let's just compare for a second

  • the size of the Vatican Secret Archive

  • to the height of Mount Everest.

  • Now, if you were to take each shelf of the Vatican Secret

  • Archive and stack it one on top of the other,

  • you would get a stack about 85 kilometers tall.

  • That is about 10 times the height of Mount Everest.

  • And the content spans the centuries and the continents.

  • For example, there, you have examples

  • of letters coming from China, from Europe, from Africa,

  • and, of course, from the Americas.

  • So what is our goal?

  • Our goal is to build tools and technology that

  • enable historians, archivists, and scholars

  • of the humanities in general to perform large-scale analysis

  • on historical archives.

  • Because right now, the process, let me tell you,

  • is entirely manual.

  • You still have to go there, consult the documents manually,

  • and be able to read that very challenging handwriting.

  • And then, if you find information

  • that may be linked to another collection,

  • then you have to do it all by yourself.

  • But first, we have to face the very first challenge.

  • When you are dealing with web content-- for example,

  • if you want to extract data from the internet-- well, that's

  • already text.

  • But when we are dealing with historical documents,

  • those are often scans.

  • And traditional OCR is fine for printed text.

  • But then you get to this.

  • This is medieval handwriting.

  • It's Latin, a language nobody uses anymore.

  • It's a handwriting nobody is able to write or read anymore,

  • for that matter.

  • It's heavily abbreviated.

  • And still, you want to get texts out of it.

  • So you might want to train a machine learning model.

  • Of course, you do.

  • But then, we come to the second challenge.

  • And that is scalability in the data set collection process.

  • Now, the graph you see there is a logarithmic scale.

  • And it might show you something that you already

  • know, which is known as Zipf's law: it tells you

  • that there are very few words occurring a humongous number of times.

  • And then, most of the words do not occur that often.

  • What does that mean for us?

  • That if we want to collect data, for example, at word level,

  • at vocabulary level, this means that we

  • have to annotate thousands of lines of text, which

  • means hundreds of pages, OK?
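
As a rough illustration of this Zipf-like distribution and why word-level annotation scales so badly, the minimal Python sketch below counts word frequencies in a plain-text corpus and reports how much of the text the most frequent words cover; the corpus filename and the rarity threshold are hypothetical.

```python
from collections import Counter

# Count word frequencies in a (hypothetical) plain-text Latin corpus.
with open("latin_corpus.txt", encoding="utf-8") as f:
    words = f.read().lower().split()

counts = Counter(words)
ranked = counts.most_common()

total = sum(counts.values())
top_100 = sum(c for _, c in ranked[:100])
print(f"{len(ranked)} distinct words, {total} tokens")
print(f"Top 100 words cover {top_100 / total:.0%} of all tokens")

# Zipf-like behaviour: frequency falls roughly as 1/rank,
# so most of the vocabulary appears only a handful of times.
rare = sum(1 for _, c in ranked if c <= 5)
print(f"{rare / len(ranked):.0%} of the vocabulary occurs 5 times or fewer")
```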

  • And similar systems do exist.

  • They are state of the art systems.

  • But most of the paleographers, even when

  • they know of these tools, get discouraged from using them

  • because they say, well, it's not cost-effective for me.

  • Because it can take up to months, or even years, of work

  • on these documents just to get a transcription system that they

  • will maybe use once or twice--

  • I don't know-- whereas they would like to do it faster.

  • So we asked ourselves, how can we scale on this task?

  • And so we decided to proceed by easier, simpler steps.

  • The very first thing that we did

  • was to collect data for single characters.

  • And this enabled us to involve not

  • paleographers but people with much less experience.

  • We built a custom crowdsourcing platform

  • that worked pretty much like CAPTCHA solving.

  • What you see there is an actual screen from the platform.

  • So the workers were presented with an image

  • and with a target.

  • And they had to match the target and select

  • the areas inside of the image.

  • And in this way, we were able to involve more than 500

  • high school students.

  • And in about two weeks' work, we made

  • more than 40,000 annotations.

  • So now that we had the data, we wanted to build a model.

  • When I started working at the project,

  • I was pretty much a beginner in machine learning.

  • And so TensorFlow helped me put into practice

  • what I was studying in theory.

  • And so it was a great help that I

  • could rely on tutorials and on the community

  • and, where everything else failed, even the source code.

  • So we started experimenting, and we

  • decided to start small first.

  • We didn't want overkill.

  • We wanted the model to fit our data exactly.

  • So we started small and proceeded incrementally

  • and, in this phase, in a constant cycle

  • of tuning hyperparameters and the model,

  • choosing the best optimizer, the best weight initializers,

  • the number of layers and the type of layers,

  • and then evaluating and training again.

  • Then we used Keras.

  • It was good for us because it allowed us to keep

  • the code small and readable.

  • And then, this is what we settled for.

  • It might look trivial.

  • But it allowed us to get up to a 94% average accuracy

  • on our test characters.
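
The talk does not show the exact architecture, only that the model was kept deliberately small and reached about 94% average accuracy on the test characters. The following Keras sketch is one plausible shape for such a character classifier; the input size and the number of character classes are assumptions, not the project's actual values.

```python
import tensorflow as tf
from tensorflow import keras

NUM_CLASSES = 23          # assumption: one class per character shape
INPUT_SHAPE = (56, 56, 1) # assumption: small grayscale character crops

# A deliberately small convolutional classifier, in the spirit of
# "start small and proceed incrementally" described in the talk.
model = keras.Sequential([
    keras.layers.Conv2D(32, 3, activation="relu", input_shape=INPUT_SHAPE),
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(64, 3, activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, validation_split=0.1, epochs=20)
```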

  • So where does this fit in the whole scheme

  • of the transcription system?

  • It's there in the middle.

  • And it's actually, so far, the only [INAUDIBLE] part,

  • but we are planning to expand.

  • And you will see how later-- we will see how later.

  • And so we have the input image.

  • So far, we're relying on an oversegmentation approach.

  • It's a bit old-school, but it allows

  • us to feed single characters or combinations of characters

  • into the classifier, which then produces

  • different candidate transcriptions that are ranked according

  • to a Latin language model, which we also built from publicly

  • available sources.
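
The talk does not specify what kind of Latin language model does the ranking; as a minimal sketch under that uncertainty, the snippet below builds a character-bigram model with add-one smoothing and uses it to rank candidate readings of a word. The tiny training string and the candidate list are purely illustrative.

```python
import math
from collections import Counter

def train_char_bigram(corpus: str):
    """Estimate character-bigram log-probabilities with add-one smoothing."""
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)
    vocab = len(set(corpus))
    def logprob(a, b):
        return math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
    return logprob

def score(candidate: str, logprob) -> float:
    """Log-likelihood of a candidate transcription under the model."""
    return sum(logprob(a, b) for a, b in zip(candidate, candidate[1:]))

# Hypothetical usage: rank the classifier's candidate readings of one word.
logprob = train_char_bigram("pater noster qui es in caelis sanctificetur nomen tuum")
candidates = ["nomen", "nomcn", "noinen"]
best = max(candidates, key=lambda c: score(c, logprob))
print(best)
```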

  • How good do we get?

  • We get about 65% exact transcription.

  • And we can get up to 80% if we consider minor spelling errors

  • or if the segmentation is perfect.

  • If we had perfect segmentation, we could get up to 80%.

  • We will see that this can be more challenging.

  • OK.

  • So what are our plans for the future?

  • We're very excited about the integration

  • of TensorFlow and Keras.

  • Because I described the process as being fully Keras.

  • What we actually found out was that sometimes some features

  • were lagging behind, and sometimes we

  • wanted to get some features from Keras

  • and others from TensorFlow.

  • And so we found ourselves doing lots of--

  • I don't know if that's your experience, as well--

  • but we found ourselves doing lots of back and forth

  • between TensorFlow and Keras.

  • And now, we get the best of both worlds,

  • so we're very excited about that.

  • And so how do we plan to expand our machine learning system?

  • First things first, we are trying U-Nets

  • for semantic segmentation.

  • These are the same networks that achieved very good results

  • on medical imaging.

  • And we're planning to use them to get rid

  • of this tricky, old-school computer vision segmentation.

  • And that would also give us classification

  • at the same time.

  • Because this is semantic segmentation we're talking about.

  • These are some preliminary examples

  • that work particularly well.

  • Of course, there is still work that we have to do.

  • And then, of course, since there could still be ambiguity,

  • we could do error correction and then transcription.

  • But I think this would be, in itself,

  • a significant improvement.
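
For readers unfamiliar with the architecture, here is a compact U-Net sketch in Keras: an encoder path, a bottleneck, and a decoder path with skip connections, ending in a per-pixel softmax so that segmentation and character classification happen together. The input size and number of classes are assumptions; this is not the project's actual network.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 24  # assumption: character classes plus background

def small_unet(input_shape=(128, 128, 1)):
    """A compact U-Net: two downsampling blocks, a bottleneck, and
    two upsampling blocks with skip connections."""
    inputs = keras.Input(shape=input_shape)

    c1 = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    c1 = layers.Conv2D(32, 3, padding="same", activation="relu")(c1)
    p1 = layers.MaxPooling2D()(c1)

    c2 = layers.Conv2D(64, 3, padding="same", activation="relu")(p1)
    c2 = layers.Conv2D(64, 3, padding="same", activation="relu")(c2)
    p2 = layers.MaxPooling2D()(c2)

    b = layers.Conv2D(128, 3, padding="same", activation="relu")(p2)

    u2 = layers.UpSampling2D()(b)
    u2 = layers.Concatenate()([u2, c2])          # skip connection
    c3 = layers.Conv2D(64, 3, padding="same", activation="relu")(u2)

    u1 = layers.UpSampling2D()(c3)
    u1 = layers.Concatenate()([u1, c1])          # skip connection
    c4 = layers.Conv2D(32, 3, padding="same", activation="relu")(u1)

    # Per-pixel class probabilities: segmentation and classification together.
    outputs = layers.Conv2D(NUM_CLASSES, 1, activation="softmax")(c4)
    return keras.Model(inputs, outputs)

model = small_unet()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```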

  • And another thing we're experimenting with

  • is enlarging our data set.

  • Because we don't want to stick to characters.

  • We want to evolve.

  • We want to move to word-level, and even

  • sentence-level, annotations.

  • But still, our focus is scalability

  • in the data set collection.

  • So we want to involve paleographers

  • as little as possible.

  • So for example, these are our generated inputs from a GAN.

  • But we are also planning on using,

  • for example, a variational autoencoder

  • so that we can evolve our data set

  • with little human interaction--

  • as little as we can.
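
Since the variational autoencoder is only mentioned as a plan, the following is a generic, minimal VAE sketch in Keras for generating synthetic character crops; the latent size, image size, and dense architecture are all assumptions rather than the project's implementation.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

LATENT_DIM = 16        # assumption: small latent space for character glyphs
IMAGE_DIM = 56 * 56    # assumption: flattened grayscale character crops

class Sampling(layers.Layer):
    """Reparameterization trick: sample z from N(mean, exp(log_var))."""
    def call(self, inputs):
        mean, log_var = inputs
        eps = tf.random.normal(tf.shape(mean))
        return mean + tf.exp(0.5 * log_var) * eps

# Encoder: character image -> latent mean and log-variance.
enc_in = keras.Input(shape=(IMAGE_DIM,))
h = layers.Dense(256, activation="relu")(enc_in)
z_mean = layers.Dense(LATENT_DIM)(h)
z_log_var = layers.Dense(LATENT_DIM)(h)
z = Sampling()([z_mean, z_log_var])
encoder = keras.Model(enc_in, [z_mean, z_log_var, z])

# Decoder: latent vector -> reconstructed character image.
dec_in = keras.Input(shape=(LATENT_DIM,))
h = layers.Dense(256, activation="relu")(dec_in)
dec_out = layers.Dense(IMAGE_DIM, activation="sigmoid")(h)
decoder = keras.Model(dec_in, dec_out)

class VAE(keras.Model):
    def __init__(self, encoder, decoder, **kwargs):
        super().__init__(**kwargs)
        self.encoder, self.decoder = encoder, decoder

    def train_step(self, data):
        if isinstance(data, tuple):
            data = data[0]
        with tf.GradientTape() as tape:
            z_mean, z_log_var, z = self.encoder(data)
            recon = self.decoder(z)
            # Reconstruction error plus KL divergence to the unit Gaussian.
            recon_loss = tf.reduce_mean(
                tf.reduce_sum(tf.square(data - recon), axis=-1))
            kl_loss = -0.5 * tf.reduce_mean(tf.reduce_sum(
                1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1))
            loss = recon_loss + kl_loss
        grads = tape.gradient(loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        return {"loss": loss}

vae = VAE(encoder, decoder)
vae.compile(optimizer="adam")
# vae.fit(flattened_character_images, epochs=30, batch_size=128)
# New synthetic characters: decoder.predict(tf.random.normal((n, LATENT_DIM)))
```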

  • And in the end, this would bring us to actually use

  • sequence models that could take full advantage of the sentence-

  • level context, for example, and could even

  • solve things that we couldn't solve

  • with single-character classification-- for example,

  • abbreviations.

  • In this kind of text, many words occur abbreviated,

  • just like when you write a text message.

  • In some texts, you would say "me too"

  • and use two, the number, or "4U."

  • And that's the same with this kind of manuscript.

  • And that's one of the applications you could have.

  • Also, we are planning to use sequence models

  • to get to a neural language model because so far,

  • we have only experimented with statistical ones.
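
As a sketch of what replacing the statistical language model with a neural one could look like, here is a minimal character-level LSTM language model in Keras; the corpus file, context length, and layer sizes are assumptions.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

SEQ_LEN = 40  # assumption: fixed context window in characters

# Assumption: a plain-text Latin corpus, e.g. from publicly available sources.
text = open("latin_corpus.txt", encoding="utf-8").read().lower()
chars = sorted(set(text))
char_to_id = {c: i for i, c in enumerate(chars)}

# Build (context, next character) training pairs.
ids = np.array([char_to_id[c] for c in text])
X = np.stack([ids[i:i + SEQ_LEN] for i in range(len(ids) - SEQ_LEN)])
y = ids[SEQ_LEN:]

# Character-level recurrent language model.
model = keras.Sequential([
    layers.Embedding(len(chars), 64),
    layers.LSTM(128),
    layers.Dense(len(chars), activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(X, y, batch_size=128, epochs=10)
# Once trained, the model can score candidate transcriptions character by
# character, taking the place of a purely statistical language model.
```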

  • And one last thing before I let you go.

  • I mentioned the people in the team,

  • but there are so many people I would like to thank

  • who were not on that slide.

  • And first of all Simone, who should have been here,

  • but he couldn't make it.

  • And he was my machine learning Jedi Master.

  • And then Pi School of AI and Sébastien Bratières and Lukasz

  • Kaiser for their amazing mentoring.

  • And Marica Ascione, who is the high school teacher that

  • actually allowed us to involve those students that

  • were part of the platform.

  • And, of course, all of the graduate

  • and undergraduate students that worked with us and helped us

  • achieve what we have achieved and what we

  • plan to achieve in the future.

  • And of course, thank you for your attention.

  • [APPLAUSE]

  • [MUSIC PLAYING]
