Image credit: “Language Models are Few-Shot Learners”, Brown et al., 2020

Guide: Finetune GPT-NEO (2.7 Billion Parameters) on a single GPU with Huggingface Transformers using DeepSpeed

Peter Albert
5 min read · Apr 10, 2021


GPT-NEO is a series of language models from EleutherAI that aims to replicate OpenAI’s GPT-3. EleutherAI’s current models (1.3 Billion and 2.7 Billion Parameters) are not yet as big as OpenAI’s biggest GPT-3 model, Davinci (175 Billion Parameters), but unlike OpenAI’s models, they are freely available to try out and finetune.

Finetuning large language models like GPT-NEO is often difficult, as these models are usually too big to fit on a single GPU.

This guide explains how to finetune GPT-NEO (2.7B Parameters) with just one command of the Huggingface Transformers library on a single GPU.

This is made possible by the DeepSpeed library and gradient checkpointing, which lower the model’s GPU memory requirements by trading them off against CPU RAM and extra compute.
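To give a rough idea of the gradient checkpointing half of that trade-off (this is not the repo’s run_clm.py, which handles it for you, and the exact API depends on your transformers version):

# Hedged sketch: enable gradient checkpointing on a Transformers model
from transformers import GPTNeoForCausalLM

model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")
model.gradient_checkpointing_enable()  # recompute activations in the backward pass instead of storing them all
model.config.use_cache = False         # the generation cache is incompatible with checkpointing during training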

I also explain how to set up a server on Google Cloud with a V100 GPU (16 GB VRAM), which you can use if you don’t have a GPU with enough VRAM (12+ GB) or enough regular RAM (60+ GB).

1. (Optional) Setup VM with V100 in Google Compute Engine

Note: The GPT-NEO model runs on any server with a GPU that has at least 16 GB VRAM and with at least 60 GB of RAM. If you use your own server instead of the setup described here, you will need to install CUDA and PyTorch on it.
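If you are not sure whether your own machine meets these requirements, a quick check along these lines can save you a failed run (illustrative snippet, not part of the repo):

# Rough sanity check of GPU VRAM and system RAM (Linux)
import os
import torch

vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
print(f"GPU VRAM: {vram_gb:.1f} GB (want 16+), RAM: {ram_gb:.1f} GB (want 60+)")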

Requirements

  1. Install the Google Cloud SDK.
  2. Register a Google Cloud account, create a project, and set up billing (only once billing is set up can you use the $300 sign-up credit for GPUs).
  3. Request a quota limit increase for “GPU All Regions” to 1. There is a step-by-step guide for this; note that the UI has changed a bit since it was written.
  4. Log in and initialize the Cloud SDK with gcloud auth login and gcloud init, and follow the steps until you are set up.

Create VM

  • Replace YOURPROJECTID in the command below with the project id from your GCE project.
  • You can add the --preemptible flag to the command below; this reduces your cost to about a third, but Google can then shut down your instance at any point. At the time of writing, this configuration costs only about $1.28/hour in GCE when using a preemptible instance.
  • You can change the zone if there are no resources available. Here is a list of all zones and whether they have V100 GPUs. Depending on the time of day, you might need to try a few.
  • We need a GPU server with at least 60 GB of RAM; otherwise the run will crash whenever the script wants to save/pickle the model. The setup below gives us as much RAM as possible with 12 CPU cores in GCE (without paying for extended memory). You also can’t use more than 12 CPU cores with a single V100 GPU in GCE.

Run this to create the instance:

gcloud compute instances create gpuserver \
--project YOURPROJECTID \
--zone us-west1-b \
--custom-cpu 12 \
--custom-memory 78 \
--maintenance-policy TERMINATE \
--image-family pytorch-1-7-cu110 \
--image-project deeplearning-platform-release \
--boot-disk-size 200GB \
--metadata "install-nvidia-driver=True" \
--accelerator="type=nvidia-tesla-v100,count=1"

After 5 minutes or so (the server needs to install the NVIDIA drivers first), you can connect to your instance with the command below. If you changed the zone, you will also need to change it here.

  • Replace YOURSDKACCOUNT with your Cloud SDK account name:
gcloud compute ssh YOURSDKACCOUNT@gpuserver --zone=us-west1-b

Don’t forget to shut down the server once you’re done; otherwise you will keep getting billed for it. This can be done from the GCE web UI.

The next time, you can restart the server from the same web UI.

2. Download script and install libraries

Run this to download the script and to install all libraries:

git clone https://github.com/Xirider/finetune-gpt2xl.git
chmod -R 777 finetune-gpt2xl/
cd finetune-gpt2xl
pip install -r requirements.txt
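
Before moving on, you can quickly confirm that the key libraries installed correctly and that PyTorch sees the GPU, for example with a short snippet like this (not part of the repo):

# Quick check that the main libraries import and the GPU is visible
import torch
import transformers
import deepspeed

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("deepspeed:", deepspeed.__version__)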

(Optional) If you want to use Wandb.ai for experiment tracking, you have to log in:

wandb login

3. Finetune GPT-NEO (2.7 Billion Parameters)

Add your own training data if you don’t want to train on the provided Shakespeare examples:

  • If you want to use your own training data, replace the example train.txt and validation.txt files in the folder with your own training data (using the same file names) and then run python text2csv.py. This converts your .txt files into single-column CSV files with a "text" header and puts all the text into a single line; a sketch of roughly what this conversion does follows after this list. We need to use .csv files instead of .txt files, because Huggingface's dataloader removes line breaks when loading text from a .txt file, which does not happen with .csv files.
  • If you want to feed the model separate examples instead of one continuous block of text, you need to put each of your examples on a separate line in the CSV train and validation files.
  • Be careful with the encoding of your text. If you don’t clean your text files, or if you just copy text from the web into a text editor, the dataloader from the datasets library might not load them.
  • Be sure to either log in to wandb.ai with wandb login or uninstall wandb completely. Otherwise it might cause a memory error during the run.
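
For reference, here is a hedged sketch of roughly what the text2csv.py conversion described above does (the actual script in the repo may differ in details): each .txt file becomes a single-column CSV with a "text" header and all of its text in one row.

# Illustrative reimplementation of the .txt -> .csv conversion described above
import csv

for name in ["train", "validation"]:
    with open(f"{name}.txt", encoding="utf-8") as f_in, \
         open(f"{name}.csv", "w", newline="", encoding="utf-8") as f_out:
        writer = csv.writer(f_out)
        writer.writerow(["text"])       # single "text" column header
        writer.writerow([f_in.read()])  # all the text ends up in one row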

Then start the training run with this command:

deepspeed --num_gpus=1 run_clm.py \
--deepspeed ds_config_gptneo.json \
--model_name_or_path EleutherAI/gpt-neo-2.7B \
--train_file train.csv \
--validation_file validation.csv \
--do_train \
--do_eval \
--fp16 \
--overwrite_cache \
--evaluation_strategy="steps" \
--output_dir finetuned \
--num_train_epochs 1 \
--eval_steps 15 \
--gradient_accumulation_steps 2 \
--per_device_train_batch_size 4 \
--use_fast_tokenizer False \
--learning_rate 5e-06 \
--warmup_steps 10
  • This uses a smaller “allgather_bucket_size” setting in the ds_config_gptneo.json file and a smaller batch size to further reduce GPU memory (with --per_device_train_batch_size 4 and --gradient_accumulation_steps 2, the effective batch size is 8). A rough sketch of this part of the config follows after this list.
  • I tried these hyperparameters and they worked well for me. But if you want to change them, you could try hyperparameters closer to the original EleutherAI training config. You can find these here.
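
For orientation, this is roughly what the ZeRO section of a DeepSpeed config like ds_config_gptneo.json looks like. The exact keys and values in the repo’s file may differ (for example, newer DeepSpeed releases replace the boolean cpu_offload with an offload_optimizer block), so treat this as a sketch rather than a drop-in replacement:

# Hedged sketch of a ZeRO stage 2 config with CPU offload and small buckets
import json

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "cpu_offload": True,            # keep optimizer state in CPU RAM
        "allgather_bucket_size": 5e7,   # smaller buckets lower peak GPU memory
        "reduce_bucket_size": 5e7,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

with open("my_ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)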

4. Generate text with a GPT-NEO 2.7 Billion Parameters model

I provide a script that allows you to interactively prompt your GPT-NEO model. If you just want to sample from the pretrained model without finetuning it yourself, replace “finetuned” with “EleutherAI/gpt-neo-2.7B”. Start it with this:

python run_generate_neo.py finetuned

Or use this snippet to generate text from your finetuned model within your code:

# credit to Suraj Patil - https://github.com/huggingface/transformers/pull/10848 - modified
from transformers import GPTNeoForCausalLM, AutoTokenizer

model = GPTNeoForCausalLM.from_pretrained("finetuned").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("finetuned")

text = "From off a hill whose concave"
ids = tokenizer(text, return_tensors="pt").input_ids.to("cuda")

# add the length of the prompt tokens to match with the mesh-tf generation
max_length = 400 + ids.shape[1]

gen_tokens = model.generate(
    ids,
    do_sample=True,
    min_length=max_length,
    max_length=max_length,
    temperature=0.9,
    use_cache=True
)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
print(gen_text)
