Using Customized Models

Fangzhou Xie

October 18, 2024

library(rethnicity)
#> ══ WARNING: ══════════════════════════════════════════════════════ rethnicity ══
#> ! This package predicts race from names, with inherent limitations and bias risks. Use cautiously.
#> ! Critically examine methodology and results for biases and ethical implications.
#> ✖ Results should not be considered definitive and must NOT be used for discrimination of any kind.
#> ✖ Intended for academic research purposes only, NOT for commercial use.
#> ══ INFO: ═════════════════════════════════════════════════════════ rethnicity ══
#> ℹ For detailed documentation, visit: rethnicity homepage (<https://fangzhou-xie.github.io/rethnicity/index.html>) and methodology paper (<https://www.sciencedirect.com/science/article/pii/S2352711021001874>).
#> 
#> ══ CITATION: ═════════════════════════════════════════════════════ rethnicity ══
#> ℹ Please use `citation("rethnicity")` to cite my work, thanks!

Design of the Package

I built this package to help applied researchers study ethnic equality and inequality. More specifically, the package provides a method for predicting race from names. I designed it so that the method is powered by deep learning models but does not require installing any deep learning libraries, which is often a daunting task. As a result, the methods in this package are not designed to be updated, fine-tuned, or trained on custom datasets. This is the trade-off one has to accept for ease of use.

That said, from version 0.2.0 onward, I provide two additional lower-level functions, predict_fullname and predict_lastname, which allow users to supply their own customized models. (Prior to v0.2.0 there was only one function: predict_ethnicity. It is still the RECOMMENDED function for most people.)

Usage on Customized Models

Since the package disables training by design, you need to train your own model in Keras and then convert the trained model to .json format using the converter from the frugally-deep project.

Train the model in Keras

If you are reading this vignette, you most likely know what you are doing and have heard of Keras. Otherwise, you should stick to the default function, predict_ethnicity.

You can refer to the following links to see how I trained the models and create your own version: fullname model, lastname model.

Before training the model, you need to process your dataset: encode the outcome variable as integers, one-hot encode those integers with keras.utils.to_categorical(), and keep track of the mapping between integers and categories. For example, 0, 1, 2, 3 refer to asian, black, hispanic, white, respectively. You will need this mapping at prediction time; we will call it labels = c("asian", "black", "hispanic", "white").
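The label bookkeeping described above can be sketched as follows. This is a minimal illustration in plain Python, not the code used to train the packaged models: the sample `races` list and the helper `one_hot` are hypothetical. In a real pipeline, keras.utils.to_categorical() would perform the one-hot step.

```python
# Fixed label order: the position of each label is its integer code.
# The same order must later be passed to the R functions as `labels`.
labels = ["asian", "black", "hispanic", "white"]
label_to_index = {label: i for i, label in enumerate(labels)}

# Hypothetical outcome column from a training dataset.
races = ["white", "asian", "hispanic", "black", "white"]

# Step 1: encode each outcome as its integer code.
encoded = [label_to_index[r] for r in races]


def one_hot(indices, num_classes):
    """Plain-Python stand-in for the one-hot step (illustration only)."""
    return [[1.0 if j == i else 0.0 for j in range(num_classes)] for i in indices]


# Step 2: one-hot encode the integer codes.
targets = one_hot(encoded, len(labels))

print(encoded)     # [3, 0, 2, 1, 3]
print(targets[0])  # [0.0, 0.0, 0.0, 1.0]
```

Whatever order you pick here is the order the model's output units will follow, so write it down alongside the model file.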

Remember to save the model without the optimizer (more on this on the frugally-deep website):

model.save('keras_model.h5', include_optimizer=False)

Convert the Model to .json

Then use the convert_model.py script from frugally-deep to convert your model into .json format. This is what I did as well. You will encounter an error during conversion if the saved model includes the optimizer.

python convert_model.py keras_model.h5 keras_model.json

Predict with Your Own Model

Now that the model is trained and converted, you need the path to the model file. In the example below, I load one of the default models shipped with the package instead of training a new one.

# remember the list of labels we mentioned?
labels <- c("asian", "black", "hispanic", "white")

# change to your own model file path
model_path <- system.file("models", "fullname_aligned_distill.json", package = "rethnicity", mustWork = TRUE)

# run the prediction
predict_fullname(firstnames = "Alan", lastnames = "Turing", labels = labels, model_path = model_path)
#>   firstname lastname prob_asian prob_black prob_hispanic prob_white  race
#> 1      Alan   Turing 0.02842531  0.2051059    0.02074102  0.7457278 white
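The order of labels matters because each probability column is matched to a label by position. The following plain-Python sketch shows the idea using the probabilities printed above. It assumes the predicted race is simply the label with the highest probability, which is consistent with the output above, and it is not the package's internal code.

```python
labels = ["asian", "black", "hispanic", "white"]

# Probabilities copied from the example output above, in label order.
probs = [0.02842531, 0.2051059, 0.02074102, 0.7457278]

# Find the position of the largest probability, then look up its label.
best = max(range(len(probs)), key=lambda i: probs[i])
predicted = labels[best]

print(predicted)  # white
```

If labels were passed in a different order from the one used in training, the same probabilities would be attached to the wrong names.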

In fact, the same approach works for other targets: if you tweak the training code to predict, say, gender from names, the prediction functions will work with that model as well.