In this blog post, we walk through fine-tuning an Ada model to classify text into two sports categories: Baseball and Hockey. We use a dataset loaded with sklearn's fetch_20newsgroups and fine-tune Ada to distinguish between the two sports.
The dataset can be easily loaded using sklearn and is essentially a collection of newsgroup documents. The sample data mainly consists of emails to sports mailing lists, with 1197 examples evenly distributed between Baseball and Hockey.
categories = ['rec.sport.baseball', 'rec.sport.hockey']
Data is transformed into a pandas dataframe with prompt (the email text) and completion (the sport category) columns. For demonstration purposes, we will only use 300 examples; however, in real use cases, more data would likely improve performance.
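The loading and reshaping steps described above might look roughly like the following sketch. The helper names (`label_from_target_name`, `build_records`, `load_sports_records`) are hypothetical. The `subset`, `shuffle`, and `random_state` arguments, and the choice to take the label from the last component of the newsgroup name, are assumptions rather than settings confirmed in this post.

```python
def label_from_target_name(target_name):
    """Turn 'rec.sport.baseball' into 'baseball' (assumed labeling scheme)."""
    return target_name.split(".")[-1]


def build_records(texts, targets, target_names, limit=300):
    """Pair each text (prompt) with its sport label (completion).

    Only the first `limit` examples are kept, matching the 300 used for the demo.
    """
    records = []
    for text, target in list(zip(texts, targets))[:limit]:
        records.append({
            "prompt": text.strip(),
            "completion": label_from_target_name(target_names[target]),
        })
    return records


def load_sports_records(limit=300):
    """Download the two newsgroups with scikit-learn and build the records.

    Requires scikit-learn, plus network access the first time it runs.
    """
    from sklearn.datasets import fetch_20newsgroups  # imported lazily

    categories = ['rec.sport.baseball', 'rec.sport.hockey']
    dataset = fetch_20newsgroups(
        subset="train", shuffle=True, random_state=42, categories=categories
    )
    return build_records(dataset.data, dataset.target, dataset.target_names, limit)
```

Passing the result to `pd.DataFrame(load_sports_records())` would give a dataframe with prompt and completion columns.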
A crucial step in data preparation is to append a fixed separator (\n\n###\n\n) to the end of each prompt, so that it sits between the prompt and the completion. This separator tells the model where the input ends and the prediction begins. The data is then split into training and validation sets, so that accuracy can be measured on examples the model has not seen during training.
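As a rough illustration of this preparation step, the sketch below adds the separator, splits the examples, and writes them out as JSONL. It is a hand-rolled approximation, not the tooling used in this post. The helper names, the 80/20 split, and the fixed seed are assumptions. The leading space on the completion is also an assumption, inferred from the positive class " baseball" used in the training command.

```python
import json
import random

SEPARATOR = "\n\n###\n\n"  # suffix separator named in the post


def format_example(prompt, completion):
    """Append the separator to the prompt and prefix the completion with a space."""
    return {"prompt": prompt + SEPARATOR, "completion": " " + completion}


def train_valid_split(examples, valid_fraction=0.2, seed=42):
    """Shuffle a copy of the examples and split off a validation set.

    The 20% validation fraction and the seed are assumed values.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


def write_jsonl(examples, path):
    """Write one JSON object per line, the format expected for training files."""
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example) + "\n")
```

After formatting and splitting, `write_jsonl(train, "sport2_prepared_train.jsonl")` and `write_jsonl(valid, "sport2_prepared_valid.jsonl")` would produce the two files that the training command expects.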
With the dataset prepared, the Ada model is fine-tuned to classify text as Baseball or Hockey. Ada is chosen because it is cost-effective and its performance on classification tasks is often comparable to that of larger models.
!openai api fine_tunes.create -t "sport2_prepared_train.jsonl" -v "sport2_prepared_valid.jsonl" --compute_classification_metrics --classification_positive_class " baseball" -m ada
The model trains successfully, reaching 99.6% accuracy on the validation set. Because the validation examples were held out of training, this accuracy is a reasonable estimate of how the model will perform on new, unseen data of the same kind.
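One way to read off the validation accuracy is from the results file of the fine-tuning job. The sketch below assumes that file has been saved as CSV (for example with the legacy CLI's `openai api fine_tunes.results`) and that it has a `classification/accuracy` column that is filled in only on some steps. Both are assumptions about tooling not shown in this post, and `final_accuracy` is a hypothetical helper.

```python
import csv
import io


def final_accuracy(csv_text, column="classification/accuracy"):
    """Return the last non-empty value in the accuracy column, or None."""
    last = None
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = (row.get(column) or "").strip()
        if value:
            last = float(value)
    return last
```

Given the contents of the results file as a string, `final_accuracy(text)` returns the most recently reported validation accuracy.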
With the trained model, predictions can be generated easily. For each input prompt (an email or other text data), the model predicts whether the text is more likely related to Baseball or Hockey.
ft_model = 'ada:ft-openai-2021-07-30-12-26-20'
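A prediction call might be structured as in the sketch below. It assumes the prompt must end with the same separator used in training and that the label fits in a single token, so `max_tokens` is 1. The helpers `build_request` and `classify` are hypothetical, and the actual API call is passed in as `complete` so the example does not depend on a particular client version.

```python
SEPARATOR = "\n\n###\n\n"  # must match the separator used in training


def build_request(model, text):
    """Build completion parameters for a single classification."""
    return {
        "model": model,
        "prompt": text + SEPARATOR,
        "max_tokens": 1,   # assumption: the label is a single token
        "temperature": 0,  # pick the most likely label deterministically
    }


def classify(text, model, complete):
    """Classify `text` with the fine-tuned model.

    `complete` is a callable that accepts the request parameters as keyword
    arguments and returns the generated completion text.
    """
    return complete(**build_request(model, text)).strip()
```

With the legacy `openai` Python package (versions before 1.0), `complete` could wrap `openai.Completion.create(**params)` and return the `text` field of the first choice. `classify(email_text, ft_model, complete)` would then return the predicted sport.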
Interestingly, the model is not limited to the type of data it was trained on. Despite being trained on emails, it can also successfully classify tweets into the correct sports categories. This versatility shows the model’s ability to generalize and apply its learned knowledge to different text formats and contexts.
In this example, we fine-tuned an Ada model for a two-class sports classification task. The model reaches high accuracy on the validation set and also generalizes from emails to tweets. The same steps can be adapted to fine-tune models for other text classification tasks.