Popular Names by Country Dataset

Human names are so ubiquitous and fundamental to our day-to-day lives that it's easy to forget how important -- and complex -- they can be. For example, did you know:

The U.S. Census removes all spaces from surnames, which makes culturally distinct surnames like "de la Cruz" (predominantly Hispanic), "dela Cruz" (Filipino), and "Delacruz" (anglicized) the same in the eyes of census reporting.¹
The majority of Indonesians do not have family names. Rather, their given names are geographically and culturally specific.²
In traditional Lithuanian culture, the ending of a woman's surname indicates whether she is married or not. Last names of married women end in -ienė, while those of unmarried girls end in -ytė, -iūtė, -utė, or -aitė.³
Many Portuguese surnames may be preceded by of/from (de, d') or of the/from the (do, da, dos, das), as in de Sousa, da Costa, d'Oliveira. Those elements are not part of the surname and are ignored when sorting alphabetically.⁴

If your program deals with international names and you're not testing aggressively, then your users are probably going to have a bad time. This dataset can help.
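As a rough illustration of two of the facts above, here is a minimal Python sketch. The function names are hypothetical, the case-folding step is an assumption (the text only says spaces are removed), and the particle list is just the one given above, not an exhaustive rule set.

```python
def census_style_key(surname: str) -> str:
    """Collapse a surname by dropping spaces; case-folding is an added assumption."""
    return surname.replace(" ", "").casefold()


# Particles taken from the examples in the text; illustrative, not exhaustive.
PORTUGUESE_PARTICLES = ("de", "do", "da", "dos", "das")


def portuguese_sort_key(surname: str) -> str:
    """Sort key that skips a leading particle such as "de", "da", or "d'"."""
    lowered = surname.casefold()
    if lowered.startswith("d'"):
        return lowered[2:]
    first, _, rest = lowered.partition(" ")
    if rest and first in PORTUGUESE_PARTICLES:
        return rest
    return lowered


# Three culturally distinct spellings collapse to a single key.
variants = ["de la Cruz", "dela Cruz", "Delacruz"]
collapsed = {census_style_key(v) for v in variants}

# Sorting by the surname proper rather than by the particle.
surnames = ["de Sousa", "da Costa", "d'Oliveira", "Pereira"]
ordered = sorted(surnames, key=portuguese_sort_key)
```

A naive `sorted(surnames)` would instead order these by raw character codes, placing "Pereira" first and grouping the other three under "d", which is the kind of bug that testing with international names tends to surface.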

The Need

Despite the importance of names, it's difficult to find name datasets online that are large, trustworthy, free, easy to use, and clear about their provenance, and that cover both given names (first names, forenames) and family names (last names, surnames) from across the world. Many available datasets do an outstanding job of addressing some of these needs, but not all. For example:

The U.S. Census provides outstanding data, but only for US surnames.
The U.S. Social Security Administration provides outstanding data, too, but only for US forenames.
FiveThirtyEight's most common name dataset is also US-only, since it's based on the Census.
solenium/names-dataset is simple and easy to use, but doesn't indicate what countries names come from, nor their popularities.
davidam/damegender is well-documented, easy to use, and multinational, but only covers forenames.
philipperemy/name-dataset is clearly comprehensive, but it gives no indication of the most popular names and includes no romanization. (Some users may also have legitimate concerns about the provenance of the data.)
census.name is paid only.

I needed a names dataset for software testing, so this was disappointing.

The Dataset

Since there wasn't one already -- or, at least, not one I could find -- I made one myself. It attempts to address the following needs:

Free -- This dataset is released under the Creative Commons CC0 license.
Popular names -- Included names are reported to be the most popular, by country. Each name includes a count of people with the name within the country when available.
Big Enough for Testing -- 2,370 Forenames and 2,278 Surnames, many with multiple representations, i.e., different spellings and native vs. romanized forms.
Multinational -- Forenames from 106 countries, Surnames from 75. Generally at least 10 each per country. In particular, there are many names from CJK and RTL languages available.
Clear provenance -- Data was pulled from https://en.wikipedia.org/wiki/Lists_of_most_common_surnames and https://en.wikipedia.org/wiki/List_of_most_popular_given_names during the week of July 8, 2023.
Easy-to-use -- Data is available in simple JSON formats.
Romanization -- Names in non-Latin scripts include a romanization, either as provided by the source or taken from Google Translate.

These data should allow users to build comprehensive test suites for important features involving names, particularly i18n and l10n.
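As one way to put the CJK and RTL coverage to use, the following Python sketch buckets names by script so a test suite can confirm it exercises each kind. The helper names are hypothetical, the classification is a simple heuristic built on the standard `unicodedata` module, and the sample names are common examples chosen for illustration rather than values quoted from the data files.

```python
import unicodedata


def is_rtl(name: str) -> bool:
    """True if any character has a right-to-left bidi class (R or AL)."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in name)


def is_cjk(name: str) -> bool:
    """True if any character's Unicode name marks it as an ideograph, kana, or Hangul."""
    prefixes = ("CJK UNIFIED IDEOGRAPH", "HIRAGANA", "KATAKANA", "HANGUL")
    return any(unicodedata.name(ch, "").startswith(prefixes) for ch in name)


def bucket(name: str) -> str:
    """Heuristic script bucket: "rtl", "cjk", or "other"."""
    if is_rtl(name):
        return "rtl"
    if is_cjk(name):
        return "cjk"
    return "other"


# Illustrative sample names, not quoted from the dataset files.
samples = [
    "Smith",
    "王",
    "김",
    "محمد",
    "כהן",
]
buckets = {name: bucket(name) for name in samples}
```

A test suite could assert that the set of buckets seen across its fixtures includes all three values before running name-handling checks such as storage round-trips, truncation, or display.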

The following countries are represented:

🇦🇩🇦🇪🇦🇱🇦🇲🇦🇷🇦🇹🇦🇺🇦🇼🇦🇿🇧🇦🇧🇩🇧🇪🇧🇬🇧🇴🇧🇷🇧🇾🇨🇦🇨🇭🇨🇱🇨🇳🇨🇴🇨🇷🇨🇺🇨🇾🇨🇿🇩🇪🇩🇰🇩🇴🇩🇿🇪🇨🇪🇪🇪🇬🇪🇸🇫🇮🇫🇯🇫🇴🇫🇷🇬🇧🇬🇪🇬🇬🇬🇮🇬🇱🇬🇶🇬🇷🇬🇹🇭🇷🇭🇹🇭🇺🇮🇪🇮🇱🇮🇲🇮🇳🇮🇶🇮🇷🇮🇸🇮🇹🇯🇪🇯🇲🇯🇴🇯🇵🇰🇬🇰🇭🇰🇷🇰🇼🇰🇿🇱🇧🇱🇮🇱🇰🇱🇹🇱🇺🇱🇻🇱🇾🇲🇦🇲🇨🇲🇩🇲🇪🇲🇰🇲🇱🇲🇳🇲🇹🇲🇽🇲🇾🇳🇱🇳🇴🇳🇵🇳🇿🇵🇦🇵🇪🇵🇫🇵🇭🇵🇰🇵🇱🇵🇷🇵🇹🇵🇾🇷🇴🇷🇸🇷🇺🇸🇦🇸🇮🇸🇰🇸🇲🇸🇷🇸🇻🇹🇭🇹🇯🇹🇳🇹🇷🇹🇼🇺🇦🇺🇸🇺🇾🇻🇪🇻🇳🇽🇰🇿🇦

The Data

The dataset consists of the following data files:

Surnames

common-surnames-by-country.csv -- This is the "master" surname file. All other surname files are generated from this file, either directly or indirectly. The format is not documented, but it's not hard to grok, especially if you refer to surnames2json.py.
common-surnames-by-country.json -- The same data as common-surnames-by-country.csv, but in a clearer JSON format.
common-surnames-by-country.min.json -- The same data as common-surnames-by-country.json, just minified.
common-surnames.txt -- Just want the names? Then this is the file for you. Contains all unique surnames, one per line.

Forenames

common-forenames-by-country.csv -- This is the "master" forename file. All other forename files are generated from this file, either directly or indirectly. The format is not documented, but it's not hard to grok, especially if you refer to forenames2json.py.
common-forenames-by-country.json -- The same data as common-forenames-by-country.csv, but in a clearer JSON format.
common-forenames-by-country.min.json -- The same data as common-forenames-by-country.json, just minified.
common-forenames.txt -- Just want the names? Then this is the file for you. Contains all unique forenames, one per line.
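Since the .txt files are described as one name per line, reading them takes only a few lines of Python. The sketch below is a minimal example under a couple of assumptions: the files are UTF-8 encoded, and blank lines can be skipped. The function name `read_names` is hypothetical, and an in-memory sample stands in for the real file so the snippet runs without the dataset.

```python
import io
from typing import Iterable


def read_names(lines: Iterable[str]) -> list[str]:
    """Return one name per non-blank line, with surrounding whitespace stripped."""
    return [line.strip() for line in lines if line.strip()]


# In-memory stand-in for a names file; the contents are illustrative only.
sample = io.StringIO("Smith\nGarcía\n王\n\n")
names = read_names(sample)

# With a real file (path and UTF-8 encoding are assumptions):
# with open("common-surnames.txt", encoding="utf-8") as f:
#     names = read_names(f)
```

Passing an explicit `encoding` avoids depending on the platform default, which matters for a file that mixes Latin, CJK, and RTL scripts.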

Downloading

You can get the data a few different ways:

Download the CSV files from this repo
Download the CSV, TXT, and JSON files from the releases on this repo
Clone this repo and run make

The License

The data is available under the Creative Commons CC0 license. This places the data into the public domain, so you can do whatever you'd like with it, but I'd appreciate a note if you find it useful! You can open a ticket to say hello 👋 if nothing else.

Maintenance

I have no plans to keep this dataset up to date over time. In my opinion, names do not change fast enough to require regular tracking for the purpose of testing. However, it is very important that the data represented here be trustworthy and easy to use, so if you find an error -- for example, if I left a comma in a name -- then please open a ticket!
