Google Research introduces Amplify Initiative — building a global, open, and community-based data platform to scale data collection in various languages.

Generative AI models are capable of transforming aspects of life from education to innovation globally, but their reach is not matched by the breadth of their training data, which is limited in terms of languages, topics, and geographies.

To ensure AI can address critical local needs — such as accessible health information, culturally relevant curricula, and financial services — we need diverse, high-quality data. This data should represent people, their needs, and values from across the globe, in their own languages. How this data is collected also matters. The future of data collection needs to be locally respectful, community-oriented, and responsible.

To help reach these goals, we introduce Amplify Initiative — an effort focused on building an open, community-based data platform that can scale novel data collection and validation globally. We describe Amplify’s approach to co-creating datasets with domain experts through a pilot conducted in Sub-Saharan Africa. Implemented with an Android app, this research resulted in an annotated dataset of 8,091 adversarial queries in seven languages authored collaboratively with 155 experts. Moreover, Amplify aims to scale this methodology in Brazil and India, and to identify innovative methods for capturing knowledge not currently available online.

Amplify Initiative

Amplify Initiative is designed to create structured, culturally relevant datasets through an app with local communities. At a high level, this platform enables people to:

  • Co-create participatory, structured datasets that reflect needs around the world. Building on the current pilot in Sub-Saharan Africa, Amplify Initiative enables a community of researchers in each region to define the data needs to develop AI responsibly and to address region-specific problems. These data needs will be shared between participants and researchers so they can align to create high-quality datasets.
  • Access high-quality, multilingual datasets for AI innovation. AI developers and researchers can utilize the datasets created with Amplify to develop techniques, models, and tools. Access to open data will particularly enable researchers from the Global South to use AI for their communities and solve pressing societal issues. The data are suitable for fine-tuning and evaluation. For example, this could include a benchmarking dataset for misinformation in Swahili or a fine-tuning dataset to simplify financial terminology for individuals with low financial literacy in India.
  • Receive recognition and rewards for their valuable contributions to AI. The platform provides rewards and recognition for participation, including data authorship attribution, professional certificates, and research acknowledgements. In the future, the data authors may be able to track and see how their contributions impact AI innovation.

Pilot across Sub-Saharan Africa

To make this initiative a reality, Google Research partnered with Makerere University’s AI Lab in Uganda for an on-the-ground pilot program to co-develop high-quality datasets with experts across Sub-Saharan Africa. Researchers at Makerere were already primed to participate in such a program thanks to an ongoing collaboration with Google involving the study of potential harms found in LLMs across Africa.

Together, we:

  • Created a methodology for collecting and validating data about salient domains (e.g., health, education, finance) with relevant experts (i.e., people with domain-specific professional or academic expertise, such as health workers and teachers).
  • Identified rewards for data creation (e.g., compensation, certificates).
  • Established an ecosystem using an app to collect data.
  • Trained and onboarded 259 experts in Ghana, Kenya, Malawi, Nigeria, and Uganda using in-person workshops and on-app training.
  • Collected 8,091 annotated adversarial queries in seven languages, co-authored by 155 experts from various industries.

How it works

Before beginning data collection, the team (Google Research and partner institutions) identifies which specific domains are most important to the region. Experts with professional or academic experience in these domains are invited to help with the data collection process. This intentional approach is the first step toward collecting data from a diverse group of individuals who can identify the most pressing local issues.

The team members and local partners (country-specific research leads) then define guidelines that need to be factored into the data creation process. The team also builds training materials and holds hands-on workshops for experts in their languages, making sure to include instruction on responsible practices, potential bias issues, and annotation techniques.

To scale training and data collection, the team built a privacy-preserving Android app in which experts complete training before creating data. The training is a necessary step to communicate data goals and capture locally relevant issues related to key generative AI themes, such as stereotypes, specialized advice, and misinformation.

A view of the Android app which contains training materials about responsible AI and query creation.

With this app, experts create and annotate data. The app provides automated feedback to ensure that each query is relevant to the data collection goals and is not a duplicate of, or semantically similar to, another query already in the dataset. Experts annotate each query with thematic and domain-specific topics.
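The article does not say how the app detects duplicate or semantically similar queries. As a rough illustration only, the sketch below flags near-duplicates with Jaccard similarity over word sets, a simple lexical stand-in for whatever semantic method the app actually uses. The function names, the 0.8 threshold, and the sample queries are all hypothetical.

```python
from __future__ import annotations

import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_near_duplicate(
    candidate: str, existing: list[str], threshold: float = 0.8
) -> str | None:
    """Return the first existing query at or above the threshold, else None.

    The threshold is an assumed value, not one taken from the article.
    """
    candidate_tokens = tokenize(candidate)
    for query in existing:
        if jaccard(candidate_tokens, tokenize(query)) >= threshold:
            return query
    return None


# Made-up sample queries, for illustration only.
existing_queries = [
    "Is eating clay during pregnancy safe?",
    "Which exams decide university admission?",
]

# Same words in a different order: flagged as a near-duplicate.
print(find_near_duplicate("Is eating clay safe during pregnancy?", existing_queries))
# No shared words with any existing query: prints None.
print(find_near_duplicate("How do I open a savings account?", existing_queries))
```

A word-overlap measure like this misses paraphrases that use different vocabulary, so a production system would more plausibly compare sentence embeddings; the overall flow of checking a candidate against the existing dataset before accepting it would stay the same.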

Experts see annotation topics that are specific to their domain. The app makes it easy for participants to receive rewards for their contributions. It is localized for each participating country, including adapted recognition and compensation by region.

A view of the query creation flow on the app with an example of a localized query and suggested annotations.

Once data collection is completed, regional partners and country research leads with language and regional expertise translate, evaluate, and validate the queries for local relevance, coherence, fluency, and coverage. The team also utilizes automated approaches using AI to translate and validate the data before finalizing.

Pilot data

As part of the pilot, Makerere AI Lab and Google Research collected 8,091 annotated adversarial queries in English and six African languages (e.g., Pidgin English, Luganda, Swahili, Chichewa). The queries are adversarial in nature: they have a high likelihood of producing unsafe responses from an LLM, which makes them useful for testing for and mitigating potential harm. This dataset in turn can be used to evaluate models for their safety and cultural relevance within the context of these languages. The dataset is open-source and available for exploration.

Experts from seven sensitive domains (e.g., culture and religion, employment) annotated these queries with ten topics within their domain of expertise (e.g., “corruption and transparency” for the politics and government domain), five generative AI themes (e.g., public interest, misinformation) and 13 sensitive characteristics (e.g., age, tribe) that are relevant to the African context.

The most prominent domains were health (2,076) and education (1,469), with the top topics being chronic disease (373) and education assessment and measurement (245), respectively. Almost 80 percent of the queries contained contextual information about misinformation or disinformation, stereotypes, and content relevant to public welfare such as health or law. The majority of the queries were about social groups belonging to gender (e.g., “Chibok girls”), age (e.g., “newborns”), religion or belief (e.g., “Traditional African” religions), and education level (e.g., “uneducated”).
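To put these counts in proportion, the short calculation below converts them into percentages. The figures come from the text above; the variable and function names are only illustrative.

```python
# Counts reported for the pilot dataset.
TOTAL_QUERIES = 8091
domain_counts = {"health": 2076, "education": 1469}
# Top topic within each domain, with its query count.
top_topics = {
    "health": ("chronic disease", 373),
    "education": ("education assessment and measurement", 245),
}


def share(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


for domain, count in domain_counts.items():
    topic, topic_count = top_topics[domain]
    print(f"{domain}: {share(count, TOTAL_QUERIES)}% of all queries")
    print(f"  {topic}: {share(topic_count, count)}% of {domain} queries")
```

Health and education together account for roughly 44 percent of the dataset, and the top topic in each makes up a little under a fifth of that domain's queries.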

Distribution of number of queries per thematic area and domain across all countries.

The dataset captured unique concerns, concepts, and social groups specific to each country. This includes adversarial queries rooted in local contexts, misconceptions, and fallacies. For example, one query captures the concerns around Ugandan women consuming a specific type of clay during pregnancy, which is a prevalent cultural practice that poses potential health risks. AI models can be enhanced by using the diverse cultural nuances found within the dataset, enabling them to detect and respond appropriately to a wide range of populations.

Amplify Initiative in the future

Building trust with communities around the world is central to Amplify Initiative’s approach. To accomplish this, Amplify is scaling the pilot in Latin America and South and Southeast Asia. The team has already partnered with Universidade Federal de Minas Gerais in Brazil and the Indian Institute of Technology Kharagpur in India.

Together with the partners, the next step is to collect and validate data around salient, localized issues that cannot be generated using an AI model. The app might enable experts from these regions to prompt Gemini about critical issues in their languages and countries, and to modify the generated responses to capture contextual information missing in the current AI models. By enabling domain experts to collaborate with Gemini, Amplify could identify and fill potential data gaps related to salient issues globally: from crop selection for farmers in Brazil to the value of staying in school for girls in India.

A demonstration of the new feature using Gemini on Amplify Initiative’s web application.

Join Amplify Initiative

Amplify Initiative aspires to empower communities around the world and put them in the driver’s seat of the next wave of AI innovation. If you’re interested in learning more about the project or getting involved in your country, please express interest here.
