Datasets for Fair Machine Learning Research



Note: The original page is no longer maintained. This page is a clone of it, made to apply some fixes to the available data.


Credit Card Default Data Set [Research, Website]

Contains 20,000 individuals described by 23 attributes (e.g., gender, age). We removed individuals with missing attributes and reduced the sample size from 30,000 to 20,000.

Label is Default Payment (1:yes; 0:no).

Sensitive feature is Education Degree. We have binarized the original value (1:graduate school; 2:university; 3:high school; 4:others) into (1:lower education) if it is <=3 and (0:higher education) otherwise (as done in The Price of Fair PCA: One Extra Dimension).
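
The binarization rule above can be written as a one-line mapping. This is a minimal sketch that follows the threshold exactly as stated on this page; the function name `binarize_education` is hypothetical and is not part of the published data files.

```python
def binarize_education(code: int) -> int:
    """Return 1 for original codes <= 3 and 0 otherwise, as described above."""
    return 1 if code <= 3 else 0


# Original codes: 1 graduate school, 2 university, 3 high school, 4 others.
original_codes = [1, 2, 3, 4]
binary_values = [binarize_education(c) for c in original_codes]
print(binary_values)  # [1, 1, 1, 0]
```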

creditcarddefault.csv is the data set; each row is an individual; the 24th column is the label; the 3rd column is the sensitive feature.
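
A minimal sketch of reading such a file and pulling out the two columns named above, using only the Python standard library. It assumes the CSV is purely numeric and has no header row (neither is confirmed here; pass `has_header=True` if the first row holds column names). The helper names `read_numeric_csv` and `split_columns` are hypothetical, and the demo runs on a tiny made-up table rather than the real file.

```python
import csv
import io


def read_numeric_csv(handle, has_header=False):
    """Read a CSV stream into a list of float rows (assumes numeric cells)."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # skip the header row
    return [[float(cell) for cell in row] for row in reader if row]


def split_columns(rows, label_col, sensitive_col):
    """Split rows into (features, labels, sensitive) using 1-based columns.

    The label column is dropped from the features; the sensitive column is
    kept in the features and also returned on its own.
    """
    labels = [row[label_col - 1] for row in rows]
    sensitive = [row[sensitive_col - 1] for row in rows]
    features = [row[:label_col - 1] + row[label_col:] for row in rows]
    return features, labels, sensitive


# Made-up 4-column table: label in column 4, sensitive feature in column 3.
toy = io.StringIO("20,1,1,0\n35,2,0,1\n")
rows = read_numeric_csv(toy)
features, labels, sensitive = split_columns(rows, label_col=4, sensitive_col=3)
print(labels)     # [0.0, 1.0]
print(sensitive)  # [1.0, 0.0]
```

For the real file, the equivalent call would presumably be `split_columns(rows, label_col=24, sensitive_col=3)` after reading `creditcarddefault.csv` with `open(..., newline="")`.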

creditdefault_index.csv contains 50 random shuffles of individual indices; each row is a random shuffle.

Data Source


Communities and Crime Data Set [Research]

Contains 1,993 communities described by 101 attributes (e.g., population, household size).

Label is Crime Rate (1:high; 0:low).

Sensitive feature is Percentage of African American Residents. We have binarized the original value into (1:high) if it is >=50% and (0:low) otherwise.
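
The thresholding above can be sketched as follows. Whether the raw value is stored as a percentage (0 to 100) or as a fraction is not stated here, so the threshold is a parameter; the default of 50.0 assumes percentages. The function name `binarize_share` is hypothetical.

```python
def binarize_share(value: float, threshold: float = 50.0) -> int:
    """Return 1 (high) when value >= threshold, else 0 (low)."""
    return 1 if value >= threshold else 0


# Values expressed as percentages; exactly 50 counts as high.
print([binarize_share(v) for v in (12.5, 50.0, 73.0)])  # [0, 1, 1]

# The same rule for values expressed as fractions.
print(binarize_share(0.5, threshold=0.5))  # 1
```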

crimecommunity.csv is the data set; each row is a community; the 101st column is the label; the 1st column is the sensitive feature.

crimecommunity_index.csv contains 50 random shuffles of community indices.

Data Source


COMPAS Data Set [Research]

Contains 16,000 defendants described by 16 attributes (e.g., sex, ethnicity).

Label is Risk of Recidivism (1:high; 0:low).

Sensitive feature is Race (1:black; 0:white).

compas.csv is the data set; each row is a defendant; the 16th column is the label; the 15th column is the sensitive feature.

compas_index.csv contains 50 random shuffles of defendant indices.
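
One plausible use of a shuffle row is to reorder the data before taking a train/test split, so that repeated experiments reuse the same 50 permutations. The sketch below assumes that use, assumes each row of the index file is a permutation of row numbers, and treats the index base (0 or 1) as unknown. The names `apply_shuffle`, `split_train_test`, `one_based`, and `train_fraction` are hypothetical, and the data is made up.

```python
def apply_shuffle(items, shuffle_row, one_based=True):
    """Reorder items according to one row of shuffled indices."""
    offset = 1 if one_based else 0  # assumption: check the real file's base
    return [items[int(i) - offset] for i in shuffle_row]


def split_train_test(items, train_fraction=0.5):
    """Cut an already-shuffled list into a leading train part and the rest."""
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


# Made-up stand-ins for four defendants and one shuffle row.
defendants = ["a", "b", "c", "d"]
shuffle_row = [3, 1, 4, 2]

reordered = apply_shuffle(defendants, shuffle_row)
train, test = split_train_test(reordered, train_fraction=0.5)
print(train)  # ['c', 'a']
print(test)   # ['d', 'b']
```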

Data Source


COMPAS Data Set 2 [Website]

Contains ~17,000 defendants described by 16 attributes (e.g., sex, age, priors).

Label is Event of Recidivism (1:high; 0:low).

Sensitive feature is Race (1:black; 0:all others).

compas.csv is the data set; each row is a defendant; the 16th column is the label; the 3rd column is the sensitive feature.

compas_index.csv contains 50 random shuffles of defendant indices.


This webpage is maintained by Austin Okray (arokray@gmail.com).