Frederic Kaplan: How to build an information time machine

This is an image of the planet Earth. It looks very much like the Apollo pictures that are very well known. But there is something different: you can click on it, and if you click on it, you can zoom in on almost any place on the Earth. For instance, this is a bird's-eye view of the EPFL campus. In many cases, you can also see how a building looks from a nearby street. This is pretty amazing. But there's something missing in this wonderful tour: it's time. I'm not really sure when this picture was taken. I'm not even sure it was taken at the same moment as the bird's-eye view.

In my lab, we develop tools to travel not only in space but also through time. The kind of question we're asking is: Is it possible to build something like Google Maps of the past? Can I add a slider on top of Google Maps and just change the year, seeing how it was 100 years before, 1,000 years before? Is that possible? Can I reconstruct social networks of the past? Can I make a Facebook of the Middle Ages? So, can I build time machines?

Maybe we can just say, "No, it's not possible." Or, maybe, we can think of it from an information point of view. This is what I call the information mushroom. Vertically, you have time, and horizontally, the amount of digital information available. Obviously, in the last 10 years, we have a lot of information. And obviously the more we go into the past, the less information we have. If we want to build something like Google Maps of the past, or Facebook of the past, we need to enlarge this space, we need to make it more like a rectangle.

How do we do that? One way is digitization. There's a lot of material available: newspapers, printed books, thousands of printed books. I can digitize all these. I can extract information from these. Of course, the more you go into the past, the less information you will have. So, it might not be enough. So, I can do what historians do. I can extrapolate. This is what we call, in computer science, simulation. If I take a log book, I can consider that it's not just the log book of a Venetian captain going on a particular journey. I can consider that it is actually a log book which is representative of many journeys of that period. I'm extrapolating. If I have a painting of a facade, I can consider that it's not just that particular building, but that it probably also shares the same grammar as buildings about which we have lost all information.

So if we want to construct a time machine, we need two things. We need very large archives, and we need excellent specialists. The Venice Time Machine, the project I'm going to talk to you about, is a joint project between EPFL and the Ca' Foscari University of Venice.

There's something very peculiar about Venice: its administration has been very, very bureaucratic. They've been keeping track of everything, almost like Google today. At the Archivio di Stato, you have 80 kilometers of archives documenting every aspect of the life of Venice over more than 1,000 years. You have every boat that goes out, every boat that comes in. You have every change that was made in the city. This is all there. We are setting up a 10-year digitization program which has the objective of transforming this immense archive into a giant information system. The objective we want to reach is 450 books digitized per day. Of course, digitizing is not enough, because most of these documents are in Latin, in Tuscan, in Venetian dialect, so you need to transcribe them, to translate them in some cases, to index them, and this is obviously not easy. In particular, traditional optical character recognition methods, which can be used for printed documents, do not work well on handwritten documents. So the solution is actually to take inspiration from another domain: speech recognition. This is a domain where something that seems impossible can actually be done, simply by adding constraints: if you have a very good model of the language that is used, and if you have a very good model of the documents and how they are structured. And these are administrative documents. They are well structured in many cases. If you divide this huge archive into smaller subsets, where each subset shares similar features, then there's a chance of success.
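
The idea of constraining recognition with a language model can be illustrated with a small sketch. It is not the project's actual pipeline: the candidate readings, the probabilities, the lexicon and the function names `rescore` and `transcribe` are all invented for illustration. Each word image gets several candidate readings from a recognizer, and a word-frequency model for one subset of the archive is used to choose among them.

```python
import math

# Hypothetical recognizer output: for each word image, candidate
# readings with a visual confidence. All numbers are invented.
candidates = [
    [("naue", 0.40), ("nave", 0.35), ("uave", 0.25)],
    [("partita", 0.45), ("parlita", 0.55)],
]

# Hypothetical language model for one subset of the archive:
# relative word frequencies (invented values).
lexicon = {"nave": 0.05, "partita": 0.02, "naue": 0.001}
UNKNOWN = 1e-6  # floor probability for words outside the lexicon


def rescore(word_candidates, lexicon, unknown=UNKNOWN):
    """Return the reading maximizing log P(visual) + log P(word)."""
    def score(item):
        text, visual_p = item
        return math.log(visual_p) + math.log(lexicon.get(text, unknown))
    return max(word_candidates, key=score)[0]


def transcribe(candidates, lexicon):
    """Rescore every word position independently."""
    return [rescore(c, lexicon) for c in candidates]


print(" ".join(transcribe(candidates, lexicon)))
```

With this toy data, the visual scores alone would favour "naue" and "parlita"; adding the lexicon prior yields "nave partita". A well-structured subset of documents would allow a sharper lexicon, which is the point of dividing the archive.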

If we reach that stage, then there's something else: we can extract events from these documents. Actually, probably 10 billion events can be extracted from this archive. And this giant information system can be searched in many ways. You can ask questions like, "Who lived in this palazzo in 1323?" "How much did a sea bream cost at the Rialto market in 1434?" "What was the salary of a glassmaker in Murano, maybe over a decade?" You can ask even bigger questions because it will be semantically coded. And then what you can do is put that in space, because much of this information is spatial. And from that, you can do things like reconstructing the extraordinary journey of this city, which managed to have a sustainable development over a thousand years, managing at all times to keep a form of equilibrium with its environment. You can reconstruct that journey, visualize it in many different ways. But of course, you cannot understand Venice if you just look at the city. You have to put it in a larger European context. So the idea is also to document all the things that worked at the European level. We can also reconstruct the journey of the Venetian maritime empire, how it progressively controlled the Adriatic Sea, how it became the most powerful medieval empire of its time, controlling most of the sea routes from the east to the south.
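
A minimal sketch of what searching such extracted events could look like follows. The `Event` fields, the `query` helper and every record are assumptions made up for illustration; they do not reflect the project's real schema or any real archival data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    kind: str                      # e.g. "residence", "price", "salary"
    place: str
    year: int
    subject: str
    value: Optional[float] = None  # numeric value, when the event has one


# Invented placeholder records, not historical facts.
events = [
    Event("residence", "Palazzo A", 1323, "Family X"),
    Event("residence", "Palazzo A", 1350, "Family Y"),
    Event("price", "Rialto market", 1434, "sea bream", 2.0),
    Event("price", "Rialto market", 1434, "eel", 3.5),
]


def query(events, **criteria):
    """Return the events whose fields equal every given criterion."""
    return [
        e for e in events
        if all(getattr(e, name) == wanted for name, wanted in criteria.items())
    ]


# "Who lived in this palazzo in 1323?"
residents = query(events, kind="residence", place="Palazzo A", year=1323)
print([e.subject for e in residents])
```

Because each event carries a place and a year, the same records could be filtered by time or plotted on a map.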

But you can even do other things, because in these maritime routes, there are regular patterns. You can go one step beyond and actually create a simulation system, a Mediterranean simulator which is capable of reconstructing even the information we are missing, and which would let you ask questions as if you were using a route planner.
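
A route-planner query of this kind might be sketched as below. This is a toy under stated assumptions, not the simulator described in the talk: the timetable, fares and risk figures are invented, the names `Sailing`, `earliest_arrival` and `trip_summary` are hypothetical, and the search is deterministic rather than a simulation with uncertainty.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sailing:
    origin: str
    destination: str
    depart_day: int     # day of the month
    arrive_day: int     # assumed to be later than depart_day
    fare: float         # arbitrary currency unit
    pirate_risk: float  # probability of an encounter on this leg


# Invented timetable for one month; not historical data.
SAILINGS = [
    Sailing("Corfu", "Modon", 3, 6, 4.0, 0.05),
    Sailing("Modon", "Constantinople", 5, 17, 10.0, 0.10),
    Sailing("Modon", "Constantinople", 8, 20, 10.0, 0.10),
    Sailing("Corfu", "Constantinople", 12, 22, 12.0, 0.08),
]


def earliest_arrival(sailings, start, goal, start_day):
    """Scan sailings in departure order, tracking the earliest known
    arrival at each port. Returns (arrival_day, legs) or None."""
    best = {start: (start_day, [])}
    for s in sorted(sailings, key=lambda s: s.depart_day):
        reached = best.get(s.origin)
        if reached is None or reached[0] > s.depart_day:
            continue  # cannot be at the origin in time for this sailing
        current = best.get(s.destination)
        if current is None or s.arrive_day < current[0]:
            best[s.destination] = (s.arrive_day, reached[1] + [s])
    return best.get(goal)


def trip_summary(legs):
    """Total fare and the chance of at least one pirate encounter,
    assuming the legs are independent."""
    fare = sum(leg.fare for leg in legs)
    safe = 1.0
    for leg in legs:
        safe *= 1.0 - leg.pirate_risk
    return fare, 1.0 - safe


arrival_day, legs = earliest_arrival(SAILINGS, "Corfu", "Constantinople", 1)
total_fare, risk = trip_summary(legs)
print(arrival_day, [leg.destination for leg in legs], total_fare, round(risk, 3))
```

In this toy timetable the connection through Modon arrives before the direct sailing, so the planner picks the two-leg trip. A real simulator would replace the fixed table with departures inferred from the regular patterns in the records.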

"If I am in Corfu in June 1323 and want to go to Constantinople, where can I take a boat?"

Probably we can answer this question with one or two or three days' precision.

"How much will it cost?"

"What are the chances of encountering pirates?"

Of course, you understand, the central scientific challenge of a project like this one is qualifying, quantifying and representing uncertainty and inconsistency at each step of this process. There are errors everywhere. There are errors in the documents: the wrong name for the captain, or boats that never actually took to sea. There are errors in translation, interpretative biases, and on top of that, if you add algorithmic processes, you're going to have errors in recognition, errors in extraction, so you have very, very uncertain data.

So how can we detect and correct these inconsistencies? How can we represent that form of uncertainty? It's difficult. One thing you can do is document each step of the process, coding not only the historical information but also what we call the meta-historical information: how historical knowledge is constructed, documenting each step. That will not guarantee that we actually converge toward a single story of Venice, but probably we can reconstruct a fully documented potential story of Venice. Maybe there's not a single map. Maybe there are several maps. The system should allow for that, because we have to deal with a new form of uncertainty, which is really new for this type of giant database.
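
One way to picture this is a data structure that keeps competing claims side by side, each with a record of the steps that produced it. The sketch below is only an illustration: the `Claim` fields, the `alternatives` helper, the confidence values and the records are all assumptions, not the project's design.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    question: str      # what is being asserted about the past
    value: str         # the asserted answer
    confidence: float  # subjective weight between 0 and 1
    provenance: tuple  # ordered steps that produced the claim


# Invented placeholder claims, not historical facts.
claims = [
    Claim("captain of ship S in 1323", "Name A", 0.6,
          ("register page scan", "handwriting recognition", "entity extraction")),
    Claim("captain of ship S in 1323", "Name B", 0.3,
          ("notarial act scan", "manual transcription")),
    Claim("ship S sailed in 1323", "yes", 0.7,
          ("register page scan", "handwriting recognition")),
]


def alternatives(claims):
    """Group claims by question and keep every alternative,
    ordered from most to least confident."""
    grouped = defaultdict(list)
    for claim in claims:
        grouped[claim.question].append(claim)
    return {
        question: sorted(group, key=lambda c: c.confidence, reverse=True)
        for question, group in grouped.items()
    }


by_question = alternatives(claims)
for claim in by_question["captain of ship S in 1323"]:
    print(claim.value, claim.confidence, " > ".join(claim.provenance))
```

Nothing is discarded here: the less likely reading stays available with its own chain of processing steps, which is what allows several maps or several stories to coexist.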

And how should we communicate this new research to a large audience? Again, Venice is extraordinary for that. With the millions of visitors that come every year, it's actually one of the best places to try to invent the museum of the future. Imagine that horizontally you see the reconstructed map of a given year, and vertically you see the documents that served for the reconstruction, paintings, for instance. Imagine an immersive system that lets you dive in and reconstruct the Venice of a given year, an experience you could share within a group. Conversely, imagine that you start from a document, a Venetian manuscript, and you show what you can construct out of it, how it is decoded, how the context of that document can be recreated. This is an image from an exhibit currently running in Geneva with that type of system.

So to conclude, we can say that research in the humanities is about to undergo an evolution which is maybe similar to what happened to the life sciences 30 years ago. It's really a question of scale. We see projects which are far beyond what any single research team can do, and this is really new for the humanities, which very often have the habit of working in small groups or with only a couple of researchers. When you visit the Archivio di Stato, you feel this is beyond what any single team can do, and that it should be a joint and common effort. So what we must do for this paradigm shift is actually foster a new generation of "digital humanists" who are going to be ready for this shift.

I thank you very much.

(Applause)
