Pakistani linguistic technology expert Dr. Agha Ali Raza and his team at ITU’s Center for Speech and Language Technologies (CSaLT) laboratory have released a corpus of Urdu sentences that covers all possible distinct sounds, called phonemes by linguists, used in everyday speech.
By Umair Rasheed
The prospect of a software developer producing an Urdu speech recognition program has become more likely, now that the most fundamental tool needed for it has been developed at Lahore’s Information Technology University.
Linguistic technology expert Dr. Agha Ali Raza and his team at ITU’s Center for Speech and Language Technologies (CSaLT) laboratory have released for public use a corpus of Urdu sentences that covers all possible distinct sounds, called phonemes by linguists, used in everyday speech. The corpus, comprising 708 sentences that cover all 63 phonemes, will soon be available for download at the CSaLT website.
Those interested in developing Urdu speech recognition software will now have access to the most basic ingredient needed for the purpose. They will just need a repository of words used in everyday speech to proceed with developing the application, says Dr. Raza.
“Speech recognition is a two-step process. The corpus will give the computer application access to all possible phonemes used in the formation of meaningful Urdu words from everyday speech,” he says. Though there are 63 distinct phonemes in Urdu, in everyday speech these don’t correspond to 63 distinct sounds. Dr. Raza explains that the sound made for a phoneme may vary from one utterance to another, depending on the phonemes used before and after it in a word. [...] accuracy is by adding to the repository (for future use) words spoken to them by users.
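The point about a phoneme sounding different depending on its neighbours can be illustrated with a short sketch. It counts each phoneme together with the phoneme before and after it, a unit that speech recognition work commonly calls a triphone. The word list, the phoneme labels and the function name `context_units` are invented for illustration; they are not taken from the CSaLT corpus, and the article does not say that the team used this representation.

```python
from collections import Counter


def context_units(phonemes):
    """Return (previous, current, next) triples for one word.

    '#' marks a word boundary. The labels are placeholders, not the
    transcription scheme used in the actual corpus.
    """
    padded = ["#"] + list(phonemes) + ["#"]
    return [tuple(padded[i - 1:i + 2]) for i in range(1, len(padded) - 1)]


# Hypothetical transcriptions, made up for this example only.
words = {
    "word_a": ["k", "a", "m"],
    "word_b": ["n", "a", "m"],
    "word_c": ["k", "a", "l"],
}

# Count every phoneme-in-context unit across the word list.
units = Counter(u for seq in words.values() for u in context_units(seq))
# Collect the plain phoneme inventory for comparison.
phoneme_set = {p for seq in words.values() for p in seq}

print(len(phoneme_set), "distinct phonemes")
print(len(units), "distinct phoneme-in-context units")
```

Even in this toy list the number of context-dependent units exceeds the number of phonemes, which is one way to see why 63 phonemes do not amount to only 63 sounds in everyday speech.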
Dr. Raza says he started work on the corpus under the supervision of Dr. Sarmad Hussain, as part of his master’s thesis at the Lahore-based National University of Computer and Emerging Sciences. He then proceeded to Carnegie Mellon University, where he completed his doctorate (PhD) in Language Technologies. He and Dr. Hussain were assisted by a colleague, Huda Sarfaraz, and two linguists, Inamullah and Zahid Sarfaraz, in compiling the list of 708 sentences for the corpus.
“We hope that release of this corpus will also prove beneficial for regional languages in the country and languages lacking ample linguistic resources all over the world. Those interested in working on those languages can follow our technique to develop similar corpora of sentences in those languages,” he says.
The corpus is available for download at: http://csalt.itu.edu.pk/
(MIT Technology Review)