How An IIT Kharagpur Computer Scientist Is Unveiling Sanskrit Literature Through Artificial Intelligence

Dr Pawan Goyal is an associate professor at IIT Kharagpur.
  • Researchers led by IIT Kharagpur's Dr Pawan Goyal have developed an artificial intelligence-based system to process Sanskrit texts. The goal is to provide access to ancient literature.

Sanskrit may be the least spoken of the scheduled languages in India, but it has been mighty resilient — even perhaps making a small comeback — in recent years.

According to 2011 Census data, close to 25,000 people had Sanskrit as their mother tongue, and while that may not seem like much in a population of over 130 crore, it marks a growth of more than 10,000 speakers over a decade.

For all we know, this number may be even higher.

Dr Bibek Debroy, addressing a gathering in Paris for a 2016 UNESCO event, said, “Indians speak more than one language. For Sanskrit to be the first language or mother tongue is rare. But it can be the third or fourth language. We don’t capture that. Hence, we don’t know how many Indians speak Sanskrit.”

For its part, the Union government has been keen to preserve and promote Sanskrit. Among the more recent examples is the passing of the Central Sanskrit Universities Bill — granting central university status to three deemed-to-be Sanskrit universities — to broaden the scope of Sanskrit education and research in the country.

Additionally, the National Education Policy 2020 called for Sanskrit to be available for study across the school and higher education levels, notably lowering the grade at which a student can begin learning Sanskrit to the primary level.

The reasons to promote Sanskrit are good and many, but perhaps the most tempting is to unveil the vast treasure house of ancient Indian knowledge.

There’s more to Sanskrit literature than the Vedas, Puranas, and the epics Mahabharata and Ramayana. If Sanskrit is made more accessible to those who are interested, it will allow us to tap into an earlier era of Indian history, culture, and thought, and discover more about our past.

In this regard, the efforts of a computer scientist at the Indian Institute of Technology (IIT), Kharagpur, and his team are commendable.

Researchers led by Dr Pawan Goyal are working to make Sanskrit texts more accessible with the help of automated computational processing techniques.

The work combines machine learning and traditional linguistic knowledge to build an artificial intelligence system for processing ancient literature.

Dr Amrith Krishna (left) and Dr Pawan Goyal presenting their work at EMNLP 2018 (Brussels).

Dr Goyal studied electrical engineering at IIT Kanpur (2003-2007) and received his doctoral degree from the University of Ulster, United Kingdom, in 2011. He thereafter worked as a postdoctoral researcher at INRIA Paris Rocquencourt, developing the “Sanskrit Heritage Site” with Gérard Huet, for two years.

In 2013, he joined the Department of Computer Science and Engineering at IIT Kharagpur and has been there ever since, exploring the research areas of text mining, natural language processing, and Sanskrit computational linguistics.

Dr Goyal has also taught the course “natural language processing” as part of the National Programme on Technology Enhanced Learning (NPTEL) educational initiative by the seven IITs and the Indian Institute of Science.

He is also the recipient of various awards, like INAE Young Engineer Awards 2020, Google India AI/ML Research Awards 2020, and Facebook AI and Ethics Research Award India, 2019.

Dr Goyal spoke to Swarajya about his work combining two things dear to him — Sanskrit and computation.

1) You have devised a way to digitally process Sanskrit texts. What led you in this direction of research?

Sanskrit has a rich literary tradition spanning more than two millennia that encapsulates the cultural ethos of this civilisational nation. Works in Sanskrit, numbering more than 30 million extant manuscripts, include extensive epics, subtle and intricate philosophical, mathematical, and scientific treatises, and rich literary, poetic, and dramatic texts. The design and implementation of computer-aided processing tools is thus of paramount importance to analyse the enormous store of knowledge and literature available as Sanskrit text.

My first encounter with the computational aspects of Sanskrit was through Panini’s grammar. More recently, the Paninian computational system was recognised as a pioneer in information theory and informatics. While my initial efforts were concentrated on a computational implementation of Panini’s grammar, as I pursued my research in Natural Language Processing (NLP), I realised that NLP can be an effective tool for providing better access to Sanskrit knowledge.

The proposed AI-based system, used in conjunction with interactive tools such as the Sanskrit Heritage Reader, can help users analyse these manuscripts more easily through word-by-word analysis and translation, relations between words, poetry-to-prose conversion, search, question answering, and so on.

2) Your work with computer science and Sanskrit extends back to your days at the French National Institute for Research in Digital Science and Technology (Inria). You worked there on "The Sanskrit Heritage Site". Could you tell us about your work there and what you were trying to build?

The Sanskrit Heritage Site at INRIA provides various web services for Sanskrit users. Developed by Prof Gérard Huet, the main web service for this site is Sanskrit Heritage Reader which, given a Sanskrit sequence, provides all possible ways of doing word segmentation (संधि विच्छेद). At the back end, it uses elegant finite state machinery to provide all possible solutions in real time.
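To make the idea of "all possible ways of doing word segmentation" concrete, here is a minimal sketch in Python. It is not the Heritage Reader's finite state machinery: it uses plain memoised recursion, ignores sandhi entirely, and works on a tiny hypothetical lexicon in Roman transliteration that was made up for illustration. The names `LEXICON` and `all_segmentations` are assumptions, not part of any real tool.

```python
from functools import lru_cache

# Hypothetical toy lexicon in Roman transliteration (an assumption for
# illustration only; a real system uses a full lexicon and undoes sandhi).
LEXICON = frozenset({"na", "nara", "mati", "ramati"})


def all_segmentations(text, lexicon=LEXICON):
    """Return every way to split `text` into words found in `lexicon`."""

    @lru_cache(maxsize=None)
    def split_from(start):
        # Reaching the end of the input yields one (empty) completion.
        if start == len(text):
            return [[]]
        results = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in lexicon:
                # Extend every segmentation of the remainder with this word.
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    return split_from(0)


if __name__ == "__main__":
    for solution in all_segmentations("naramati"):
        print(" + ".join(solution))
```

Even this eight-letter toy input has two readings ("na + ramati" and "nara + mati"), which hints at why the number of candidate solutions grows quickly for long sentences.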

As can be seen from an example sentence from Kalidasa, the number of solutions can be really large for long sentences. One piece of our work was to develop this lean interface, which shows all the solutions in a compact form, so that a Sanskrit expert can then select the correct segmentation (pada patha) with a few clicks.

Computational work on an example sentence from Kalidasa

Another notable piece of work was to align the root form (lemma) information in this interface with the Sanskrit-English Monier Williams dictionary. Thus, the interface now offers lexicon access to both the Heritage dictionary (Sanskrit-French) and Monier Williams.

3) Sanskrit is considered a 'morphologically rich language'. What does that mean exactly, and how does it dictate your approach to working with it?

Sentence construction and comprehension in any language rely heavily on how the grammatical information is encoded. Sanskrit has a rich morphology, and the grammatical information is encoded using morphological markers. Thus, “सुन्दरः रामः रावणम् हन्ति” and “रावणम् हन्ति सुन्दरः रामः” mean the same because the morphological markers encode the information, but this is not the case for “Dog kills cat” and “Cat kills dog”, where the information is encoded in the word order. This also makes Sanskrit a relatively free word order language. A lot of the literature is in the form of verses and follows relatively free word order.
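The contrast between case marking and word order can be illustrated with a small Python sketch. It assumes a hypothetical, hand-written table of analyses for the four Sanskrit word forms in the example, and it makes the simplifying assumption that, in an active sentence, the nominative marks the agent and the accusative marks the patient. It is not a morphological analyser; the names `ANALYSES`, `sanskrit_roles`, and `english_roles` are invented for illustration.

```python
# Hypothetical hand-written analyses for the four word forms in the example.
# A real system would derive these with a morphological analyser.
ANALYSES = {
    "सुन्दरः": ("adjective", "nominative"),
    "रामः": ("noun", "nominative"),
    "रावणम्": ("noun", "accusative"),
    "हन्ति": ("verb", None),
}


def sanskrit_roles(sentence):
    """Assign roles from case markers alone, ignoring word positions."""
    roles = {"agent": [], "action": [], "patient": []}
    for word in sentence.split():
        part_of_speech, case = ANALYSES[word]
        if part_of_speech == "verb":
            roles["action"].append(word)
        elif case == "nominative":  # simplification: active voice only
            roles["agent"].append(word)
        elif case == "accusative":
            roles["patient"].append(word)
    # Sort so that the result does not depend on the input order.
    return {role: sorted(words) for role, words in roles.items()}


def english_roles(sentence):
    """Assign roles purely by position in a three-word English sentence."""
    agent, action, patient = sentence.lower().split()
    return {"agent": agent, "action": action, "patient": patient}


if __name__ == "__main__":
    print(sanskrit_roles("सुन्दरः रामः रावणम् हन्ति"))
    print(sanskrit_roles("रावणम् हन्ति सुन्दरः रामः"))
    print(english_roles("Dog kills cat"))
    print(english_roles("Cat kills dog"))
```

Both Sanskrit orderings produce the same role assignment, while swapping the English nouns swaps agent and patient.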

Keeping these in mind, we propose a generic graph-based framework that is able to take advantage of the free word order nature of the language. Our search-based structured prediction framework allows encoding of relevant linguistic information as constraints. Further, we automate feature learning for the structured-prediction model, based on the morphological properties expressed by the words in the language.

Flow diagram

4) Ever since the Rick Briggs article about Sanskrit and AI appeared in AI Magazine in 1985, there have been various interpretations of how the two (Sanskrit and AI) might be linked favourably or otherwise. From your point of view as an expert in the field, could you clarify what we should be taking away on this matter?

That was a very interesting article, which focused on how concepts from the Indian grammatical tradition could be used for knowledge representation in artificial intelligence research. The article itself is rigorous and interesting, but distorted versions of it spread uncontrollably on the Internet as disinformation. At some point it was being cited as saying that Sanskrit is the ultimate programming language, studied in secret NASA laboratories. These are just rumours. Sanskrit, like any other natural language such as English or Hindi, is definitely not directly usable as a programming language. These rumours are detrimental to the respect that the ancient science of Vyakaraṇa genuinely deserves.

5) How is it to computationally process Sanskrit relative to any other language(s)?

Sanskrit presents unique challenges in automated computational processing. In addition to the sheer volume and diversity, both stylistic and chronological, found in these texts, the linguistic peculiarities of the language pose several challenges in making these works accessible to the world.

First, following the oral tradition, the phonetic transformations at the word boundaries (sandhi) are reflected also in writing. Secondly, the language shows high lexical productivity in terms of inflection, derivation, and compounding. Thirdly, a lot of the literature is in the form of verses and follows relatively free word order.
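As a rough illustration of the first point, the Python sketch below joins two words using three well-known vowel sandhi rules, written in IAST transliteration: a + a gives ā, a + i gives e, and a + u gives o. It is a toy under stated assumptions: the rule table is deliberately incomplete (no long vowels, diphthongs, or consonant sandhi), and the names `VOWEL_SANDHI` and `join_with_sandhi` are invented for this example.

```python
# Three vowel sandhi rules in IAST transliteration. This table is a small,
# deliberately incomplete sample: it does not cover long vowels, words that
# begin with the diphthongs "ai"/"au", or any consonant sandhi.
VOWEL_SANDHI = {
    ("a", "a"): "ā",
    ("a", "i"): "e",
    ("a", "u"): "o",
}


def join_with_sandhi(left, right):
    """Fuse two words at their boundary if one of the sample rules applies."""
    boundary = (left[-1], right[0])
    if boundary in VOWEL_SANDHI:
        # The two boundary vowels are replaced by a single merged vowel.
        return left[:-1] + VOWEL_SANDHI[boundary] + right[1:]
    # No sample rule applies: keep the words separate.
    return left + " " + right


if __name__ == "__main__":
    print(join_with_sandhi("na", "asti"))      # nāsti
    print(join_with_sandhi("ca", "iti"))       # ceti
    print(join_with_sandhi("sūrya", "udaya"))  # sūryodaya
```

Because the merged vowel hides the original boundary, a reader or program going the other way has to consider several possible splits, which is what makes segmentation of written Sanskrit hard.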

At the International Sanskrit Computational Linguistics Symposium 2019, IIT Kharagpur. Dr Pawan Goyal lights the lamp. In attendance are IIT KGP's CSE head, Prof Dipanwita Roychowdhuri, IIT Bombay's Prof Malhar Kulkarni, the University of Hyderabad's Prof Amba Kulkarni, IIT KGP's Prof P K Das, and Prof Gerard Huet of INRIA Paris.

6) What doors do you see your AI system opening for us in the future, especially in regard to our relationship with Sanskrit? Will we see, for instance, better access to our ancient literature?

The current work published in the Computational Linguistics journal (published by the MIT Press) presents a machine learning system and addresses the tasks of word segmentation (संधि विच्छेद), morphological parsing (पद विश्लेषण), dependency parsing (कारक विश्लेषण), and poetry to prose conversion of Sanskrit text (अन्वय). We are now actively collaborating with several research groups to extend the application of the proposed system for automatic speech recognition and question-answering in Sanskrit.

The end objective is to develop robust tools that can be used to access the ancient literature seamlessly, just as we are able to access information available in other languages.

7) Does your work with Sanskrit have a personal significance for you?

Yes, it has a lot of personal significance. During my B.Tech days at IIT Kanpur, I attended various seminars and courses organised by Prof. Laxmidhar Behera on the essence of the Bhagavad Gita. Going through the books of His Divine Grace A C Bhaktivedanta Swami Srila Prabhupada, I developed immense respect for our ancient literature, in particular, the Bhagavad Gita and Srimad Bhagavatam. I have benefited greatly from this timeless knowledge in my own pursuit, and I hope that my small efforts will help in making this knowledge accessible to one and all.

Karan Kamble writes on science and technology. He occasionally wears the hat of a video anchor for Swarajya's online video programmes.

