Cross-lingual categorization: Datasets

Datasets used in

Frermann, Lea and Mirella Lapata (2021). Categorization in the Wild: Category and Feature Learning across
Languages. Proceedings of the 43rd Annual Meeting of the Cognitive Science Society.

[multilingual_categories.tsv] A dataset of 491 basic-level concepts grouped into 31 categories.
The concept set and its categorization are derived from [1,2,3]; see Frermann and Lapata (2021)
for further details. The original categories and concepts were created by native speakers of
English. The dataset was then translated into German, French, Mandarin Chinese and Arabic, in each
case by a native speaker of the target language. Categories are separated by empty lines. The first
line of each block gives the category name, and the remaining lines list the concepts belonging to that category.
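
The block layout described above can be read with a few lines of Python. This is a minimal sketch, not the authors' loading code: the function name and the sample category and concept names are illustrative assumptions, and each line is kept verbatim because the column layout for the individual languages is not described here.

```python
def parse_category_blocks(text):
    """Map each category name to its list of concept lines.

    Blocks are separated by empty lines; the first line of a block is the
    category name and the remaining lines are its concepts.
    """
    categories = {}
    block = []
    # The extra "" at the end flushes the final block if the file
    # does not end with an empty line.
    for line in text.splitlines() + [""]:
        if line.strip():
            block.append(line)  # kept verbatim (columns are not split here)
        elif block:
            categories[block[0]] = block[1:]
            block = []
    return categories


# Made-up sample in the described layout (not taken from the dataset).
sample = "fruit\napple\nbanana\n\ntool\nhammer\n"
categories = parse_category_blocks(sample)
print(categories)
```

To apply it to the real file, one would pass the contents of multilingual_categories.tsv (read as UTF-8 text) instead of the sample string.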

[multilingual_stimuli.zip] A zip file containing one corpus (set of stimuli derived from Wikipedia)
for each target language (en, ge, fr, zh, ar). In each language-specific file, one line corresponds
to one observation of a target concept (first column) in context (second column).
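
A stimuli file in this one-observation-per-line layout could be read as sketched below. The tab separator is an assumption (the column delimiter is not stated here), and the function name and sample lines are hypothetical; the example simply counts how many observations each target concept has.

```python
from collections import Counter


def read_observations(lines, sep="\t"):
    """Yield (concept, context) pairs, one per non-empty line.

    Assumption: the two columns are separated by `sep` (tab by default).
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Split on the first separator only, so the context stays intact.
        concept, _, context = line.partition(sep)
        yield concept, context


# Made-up sample lines (not taken from the corpus).
sample = [
    "apple\tshe ate an apple in the garden\n",
    "hammer\tthe hammer hit the nail\n",
    "apple\tapple trees bloom in spring\n",
]
counts = Counter(concept for concept, _ in read_observations(sample))
print(counts.most_common())
```

Because the function accepts any iterable of lines, an open file handle for one of the language-specific files can be passed in directly.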

[1] McRae, K., Cree, G. S., Seidenberg, M. S., & McNorgan, C. (2005). Semantic feature production norms for a large
set of living and nonliving things. Behavior Research Methods, 37(4), 547–559.

[2] Vinson, D., & Vigliocco, G. (2008). Semantic feature production norms for a large set of objects and
events. Behavior Research Methods, 40(1), 183–190.

[3] Fountain, T., & Lapata, M. (2010). Meaning representation in natural language categorization. Proceedings of
the Annual Meeting of the Cognitive Science Society, 32(32).

A poster and a short video presentation are also available.