TBRC launches collaboration with SOAS to improve access to Tibetan digital texts

29 January 2015

A delegation of researchers from SOAS, University of London, visited TBRC for two weeks this autumn. The delegation came to learn about the inner workings of the TBRC's data management, search mechanism, and new eText collection. The SOAS team, as part of the research project 'Tibetan in Digital Communication', funded by the UK's Arts and Humanities Research Council, has been working to develop software tools that will keep the Tibetan language thriving in the era of global digital communication. During the visit to the US, the researchers from London presented their project not only at the TBRC but also to Tibetan studies students at Harvard and Columbia. The collaboration established between the TBRC and SOAS has already resulted in an improved TBRC search interface. Further joint projects focussing on Tibetan word breaking and part-of-speech tagging are currently underway.

Tibetan word breaking

Whereas English puts a space between the words in a sentence, Tibetan writes all words together without marking where a word begins or ends. Where words start and end is vital information for almost all digital resources, but working out how to break Tibetan text into individual words is a hard problem. The team in London has prototyped a tool for word breaking, but with an accuracy of 93% it still makes about one mistake per sentence. The TBRC and SOAS are committed to working together to solve this problem.
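To see why 93% accuracy amounts to a mistake in nearly every sentence, the arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes the 93% figure is a per-word accuracy, that errors on different words are independent, and that a sentence is 15 words long. The sentence length and the function names are hypothetical and do not come from the SOAS project.

```python
def expected_errors(accuracy: float, words: int) -> float:
    """Expected number of wrongly segmented words in a text of `words` words,
    assuming `accuracy` is a per-word accuracy (an assumption)."""
    return (1.0 - accuracy) * words


def prob_at_least_one_error(accuracy: float, words: int) -> float:
    """Probability that a text of `words` words contains at least one error,
    assuming errors on different words are independent (an assumption)."""
    return 1.0 - accuracy ** words


# Hypothetical sentence length, chosen only for illustration.
SENTENCE_LENGTH = 15

errors = expected_errors(0.93, SENTENCE_LENGTH)
chance = prob_at_least_one_error(0.93, SENTENCE_LENGTH)
print(f"Expected errors per sentence: {errors:.2f}")
print(f"Chance of at least one error: {chance:.0%}")
```

Under these assumptions the tool would be expected to make about one error per 15-word sentence, and roughly two sentences in three would contain at least one error.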

Tibetan part-of-speech tagging

Almost all digital technologies used in modern life (web search, cellphone autocompletion, Google Translate, Apple's Siri, etc.) use as one of their building blocks a software tool that identifies the part-of-speech category of each word in a text. For example, in English 'chair' is a noun that you can sit on, but it is also a verb that you can do to a meeting. The SOAS team has developed this type of tool for Tibetan and has already achieved an accuracy of 99.9%, i.e. only one mistake in every thousand words. Implementing this tool across the TBRC's eText collection will improve search and will enable sophisticated tools for Tibetan linguistics and Tibetan studies more generally.