Estonia crowdsources speech data for the preservation of the Estonian language
By Yogesh Hirdaramani
The eastern European country is launching a call for speakers of the Estonian language to donate their speech for a database that aims to support companies, public sector institutions, and research institutions in creating services based on speech technology.
Estonia aims to build an open database comprising 4,000 hours of spoken language to support the development of various language technologies. Image: Envato Elements
By creating an open database of 4,000 hours of spoken language, the nation aims to support companies, public sector institutions and research institutions in creating services and products based on speech technology. Speech technology can be used to record meetings, convert interviews into written form, and generate automatic subtitles for media.
Ott Velsberg, Chief Data Officer of Estonia, tells GovInsider that the campaign aims to promote the uptake of language technologies in public sector information systems as well as the private sector, including speech recognition technology, real-time subtitling solutions, and text-to-speech software. This will improve access to services and provide better ways for Estonians to interact with public and private sector services.
The campaign will support the Estonian Language Strategy 2021-2035, which aims to strengthen the status of Estonian and ensure that it remains the primary language in every sphere of life in the Republic.
According to the language roadmap, the use of Estonian has decreased over the past decade in certain fields such as the service sector, IT, and higher education due to the growth of the international workforce. Developing language technology that accounts for the Estonian language and its variants is a key goal of the language strategy, as it will support people’s participation in an increasingly digital society.
With open data, agencies no longer need to collect datasets for individual projects.
Velsberg says that so far, language datasets have mainly been collected to serve the needs of individual projects, and “workflows for their publication have not been firmly established”.
An open data portal containing abundant, high-quality language data, including speech data, translation materials, and sign language datasets, will empower more agencies to develop services that use language technologies, he explains. The portal currently aims to capture spontaneous speech in Estonian and its dialects, as spoken by both native and foreign speakers of the language.
To collect the data, Velsberg says, the nation plans to carry out a wide-scale publicity campaign across all media channels to raise awareness of the importance of language technology and the sustainability of the Estonian language.
Speech materials collected within the project will be transcribed, and all personally identifiable information will be removed, according to the official website. People will also have the option of requesting that their recordings be removed from the database.
Estonia’s current speech technologies
Speech recognition software is already commonplace in the Eastern European country.
At the beginning of 2022, the country launched Bürokratt, which provides people with voice-activated public services – the “Siri of digital public services”. With Bürokratt, citizens can access all public services within Estonia, from applying for benefits to renewing a passport, through voice-based interactions with an AI-based virtual assistant, according to Emerging Europe.
The Estonian Public Broadcasting agency has also introduced real-time subtitles generated by artificial intelligence for live programmes on television, reaching close to 20,000 people, says Velsberg.
The Estonian parliament uses speech recognition technology to prepare verbatim records of parliamentary sittings, which are revised by editors before being published, according to the official E-Estonia website.
Speech recognition technologies have supported agencies worldwide in improving public services and preserving lesser-used languages.
In Singapore, AI Singapore developed a speech recognition programme that can recognise the colloquial English spoken in the country. This supports the country’s civil defence force in transcribing emergency calls quickly, allowing it to dispatch emergency services with greater speed.
A non-profit media organisation in New Zealand, Te Hiku, has been accumulating an extensive audio-visual archive of Māori words, phrases, and idioms. It uses an open-source app to collect oral recordings in Indigenous languages, which will be used to train AI models.
The non-profit is currently collaborating with local and international data scientists to perfect speech tech tools, from apps that teach Te Reo Māori pronunciation to virtual assistants, says the United Nations’ International Telecommunication Union.
UNESCO’s Global Action Plan for the International Decade of Indigenous Languages (2022-2032) also calls on communication technology companies to help create an enabling environment for building the capacities of Indigenous institutions working to preserve and revitalise lesser-used languages.
As public services go digital and increasingly adopt language technologies to better serve people, agencies will need to account for spoken language. Initiatives like Estonia’s new campaign will play a crucial role in building the data infrastructure needed to power such technology in an inclusive manner.