ORCA
Online Research @ Cardiff

BLEnD: A benchmark for LLMs on everyday knowledge in diverse cultures and languages

Myung, Junho, Lee, Nayeon, Zhou, Yi, Jin, Jiho, Afina Putri, Rifki, Antypas, Dimosthenis, Borkakoty, Hsuvas, Kim, Eunsu, Perez-Almendros, Carla, Ali Ayele, Abinew, Gutiérrez-Basulto, Víctor, Ibáñez-García, Yazmín, Lee, Hwaran, Hassan Muhammad, Shamsuddeen, Park, Kiwoong, Sabuhi Rzayev, Anar, White, Nina, Muhie Yimam, Seid, Taher Pilehvar, Mohammad, Ousidhoum, Nedjma, Camacho-Collados, Jose and Oh, Alice 2024. BLEnD: A benchmark for LLMs on everyday knowledge in diverse cultures and languages. Presented at: 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks, Vancouver, BC, Canada, 9-15 December 2024. Published in: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J. and Zhang, C. eds. NeurIPS Proceedings: Advances in Neural Information Processing Systems, vol. 37. Curran Associates, Inc., pp. 78104-78146.

PDF - Accepted Post-Print Version

Official URL: https://proceedings.neurips.cc/paper_files/paper/2...

Abstract

Large language models (LLMs) often lack culture-specific knowledge of daily life, especially across diverse regions and non-English languages. Existing benchmarks for evaluating LLMs' cultural sensitivities are limited to a single language or collected from online sources such as Wikipedia, which do not reflect the mundane everyday lifestyles of diverse regions. That is, information about the food people eat for their birthday celebrations, spices they typically use, musical instruments youngsters play, or the sports they practice in school is common cultural knowledge but uncommon in easily collected online sources, especially for underrepresented cultures. To address this issue, we introduce BLEnD, a hand-crafted benchmark designed to evaluate LLMs' everyday knowledge across diverse cultures and languages. BLEnD comprises 52.6k question-answer pairs from 16 countries/regions, in 13 different languages, including low-resource ones such as Amharic, Assamese, Azerbaijani, Hausa, and Sundanese. We construct the benchmark to include two formats of questions: short-answer and multiple-choice. We show that LLMs perform better for cultures that are highly represented online, with a maximum 57.34% difference in GPT-4, the best-performing model, in the short-answer format. For cultures represented by mid-to-high-resource languages, LLMs perform better in their local languages, but for cultures represented by low-resource languages, LLMs perform better in English than the local languages. We make our dataset publicly available at: https://github.com/nlee0212/BLEnD.

Item Type:	Conference or Workshop Item (Paper)
Date Type:	Publication
Status:	Published
Schools:	Schools > Computer Science & Informatics
Publisher:	Curran Associates, Inc.
Date of First Compliant Deposit:	5 November 2024
Date of Acceptance:	26 September 2024
Last Modified:	13 Feb 2025 12:15
URI:	https://orca.cardiff.ac.uk/id/eprint/173668
