How to Efficiently Chunk Text for OpenAI API Calls

data engineering

Publish Date: 2023-04-11

When working with the OpenAI API, you need to manage your text input carefully, especially with large amounts of text. Each request is subject to a token limit, so long text must be broken into smaller chunks before you make API calls. In this blog post, we will discuss two methods for splitting text into chunks that stay within the token limit.

Method 1: Using the textwrap Library

The first method uses textwrap, a module in Python’s standard library that wraps text into lines of a specified width. This method gives only a rough approximation, because it splits text by character count rather than by token count. However, it can still be useful for quick and easy chunking.

Here’s how to use the textwrap library to wrap your text into chunks:

import textwrap

# textwrap.wrap() splits the single paragraph in `article` so that every line is
# at most `width` characters long (10 here). It returns a list of lines without
# trailing newlines.

article = "how are you ....."
chunks = textwrap.wrap(article, 10, replace_whitespace=False)
for chunk in chunks:
    print(chunk)
    # call openai api

results:

how are
you .....
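
The character-based approach can be tied loosely to a token budget by converting tokens to characters with a rule of thumb. The sketch below assumes roughly four characters per token, a commonly quoted approximation for English text and not an exact figure. The helper name chunk_by_estimated_tokens and its parameters are hypothetical and not part of textwrap.

```python
import textwrap


def chunk_by_estimated_tokens(text, max_tokens, chars_per_token=4):
    """Split text into chunks whose character length approximates a token budget.

    chars_per_token is an assumed average (about 4 for English text); the real
    token count of each chunk can be higher or lower.
    """
    # Convert the token budget into a character width using the assumed ratio.
    width = max(1, max_tokens * chars_per_token)
    # textwrap.wrap breaks on whitespace, so words stay whole where possible.
    return textwrap.wrap(text, width, replace_whitespace=False)


article = "how are you doing? great, see you soon."
chunks = chunk_by_estimated_tokens(article, max_tokens=5)
for chunk in chunks:
    # Print the character length next to each chunk for inspection.
    print(len(chunk), repr(chunk))
```

Because the ratio is only an estimate, it is safer to pick max_tokens well below the real limit when using this approach.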

Method 2: Using a Custom Class with Tiktoken

The second method is more precise: it splits text by actual token count using the tiktoken library, which counts the tokens in a string locally, without making an API call, so you can check each chunk against the token limit before sending it. The example also uses RecursiveCharacterTextSplitter from the langchain package to do the splitting.

Packages to install:

pip install tiktoken
pip install langchain

Here’s how to create a custom class that uses tiktoken to chunk your text:

import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

encoding = tiktoken.encoding_for_model("text-davinci-003")


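class MyClass:
    def count_token(self, text):
        # Count tokens locally with tiktoken; no API call is made.
        return len(encoding.encode(text))

    def process_article(self, article):
        # Try the coarsest separator first (paragraphs, then lines, then words,
        # then single characters), aiming to keep each chunk within chunk_size.
        # chunk_size and chunk_overlap are both measured with length_function,
        # so here they are token counts rather than character counts.
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=10,
            chunk_overlap=2,
            length_function=self.count_token,
            separators=['\n\n', '\n', ' ', ''],
        )
        return splitter.split_text(article)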
    
    

article = '''how are you doing? great, see you soon.'''
my_instance = MyClass()

chunks = my_instance.process_article(article)

for chunk in chunks:
    print(chunk)
    # call openai API

results:

how are you doing?
doing? great, see you
you soon.
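
To make the idea behind length-based splitting with overlap easier to see, here is a minimal dependency-free sketch. It is not how RecursiveCharacterTextSplitter is implemented; it is a simple greedy word packer. The function split_by_token_budget and its parameters are hypothetical, and the default counter (one token per whitespace-separated word) is a stand-in assumption so the example runs without tiktoken.

```python
def split_by_token_budget(text, max_tokens, overlap_words=0, length_function=None):
    """Greedily pack whitespace-separated words into chunks within a budget."""
    if length_function is None:
        # Stand-in counter (assumption): one token per whitespace-separated word.
        length_function = lambda s: len(s.split())

    chunks = []
    current = []
    for word in text.split():
        if current and length_function(" ".join(current + [word])) > max_tokens:
            # The next word would exceed the budget, so close the current chunk.
            chunks.append(" ".join(current))
            # Carry the last few words over so neighbouring chunks share context.
            current = current[-overlap_words:] if overlap_words else []
            if length_function(" ".join(current + [word])) > max_tokens:
                # The overlap does not fit next to this word, so drop it.
                current = []
        # A single word larger than the budget still becomes its own chunk.
        current = current + [word]
    if current:
        chunks.append(" ".join(current))
    return chunks


article = "how are you doing? great, see you soon."
chunks = split_by_token_budget(article, max_tokens=4, overlap_words=1)
for chunk in chunks:
    print(chunk)
```

To count real tokens instead of words, you could pass length_function=lambda s: len(encoding.encode(s)), using the encoding object created with tiktoken in the example above.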

Conclusion

In summary, when working with the OpenAI API, it’s crucial to manage your text input efficiently to stay within the token limits. The two methods discussed in this blog post provide different ways to wrap your text into chunks, with the first method using the textwrap library for a rough estimate and the second method using a custom class with tiktoken for a more precise token count. Choose the method that best suits your needs and ensure a smooth experience when making API calls.

robot learner

https://datasciencebyexample.github.io/2023/04/11/chunk-text-for-openai-api-calls/

Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If you reproduce them, please credit the source: robot learner.
