Putting SSE to Work: A Load Test LLM API Case Study

3 min read
Jun 11, 2024 10:31:41 AM

Server-Sent Events (SSE) enable servers to push real-time updates to clients over an HTTP connection, creating a one-way data stream from the server to the client. SSE is an excellent choice for real-time applications like an LLM (large language model) chat service such as ChatGPT or Anthropic's Claude.

OpenAI and others have extended the SSE standard by adding support for the POST method. They did this to let users pass prompts that would otherwise be truncated by web servers because of their length if sent as query parameters. To support this use case, Gatling has updated its SSE support to include the POST method.
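To make this concrete, here is an abridged sketch of what such a streamed completion looks like on the wire. The chunk payloads are illustrative; the exact fields vary by model and API version:

data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hi"},"finish_reason":null}]}

data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]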

Let's dive into an example to illustrate how to load test an LLM using Gatling. We'll create a simulation to load test a ChatGPT endpoint that uses SSE to stream responses.


Step 1: Setting Up the Project

First, ensure you have Gatling installed and set up. If you don't, follow the installation instructions in the documentation and download the example project from GitHub.
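Because we'll run the test through the Maven wrapper later on, the project also needs the Gatling Maven plugin on its build path. Here's a minimal sketch of the relevant pom.xml fragment; the version property is a placeholder, so pin it to the latest release:

<plugin>
  <groupId>io.gatling</groupId>
  <artifactId>gatling-maven-plugin</artifactId>
  <!-- placeholder: pin to the latest release from Maven Central -->
  <version>${gatling-maven-plugin.version}</version>
</plugin>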


Step 2: Configuring the HTTP Protocol


Picture yourself at a grand theater in Paris, comfortably seated and admiring the set and ambiance. In Gatling, just as the theater environment shapes the audience experience, the HTTP protocol provides the framework for your test scenarios. The baseUrl defines where the performance takes place, guiding all interactions to the correct destination.

In your Gatling project, configure the HTTP protocol to specify the base URL of the ChatGPT (OpenAI) API. We use sseUnmatchedInboundMessageBufferSize to buffer inbound messages that don't match any check, so we can process them later in the scenario.

import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import io.gatling.javaapi.core.*;
import io.gatling.javaapi.http.*;

public class SSELLM extends Simulation {

  // Read the OpenAI API key from the api_key environment variable
  String apiKey = System.getenv("api_key");

  HttpProtocolBuilder httpProtocol =
      http.baseUrl("https://api.openai.com/v1/chat")
          // Buffer up to 100 inbound SSE messages that don't match any check,
          // so we can consume them later with processUnmatchedMessages
          .sseUnmatchedInboundMessageBufferSize(100);


Step 3: Defining the Scenario

Now that the play has started, the actors take the stage and follow their scripts. In Gatling, we call this a scenario: it defines the steps your test will take (connecting, parsing messages, user interactions, etc.).

In our case, our scenario is pretty small:

  • Connect to the OpenAI completions endpoint and send a prompt using SSE.
  • Process all the messages until ChatGPT sends us {"data":"[DONE]"}.
  • Close the SSE connection.
ScenarioBuilder prompt = scenario("Scenario").exec(
    // Open the SSE stream with a POST request carrying the prompt
    sse("Connect to LLM and get Answer")
        .post("/completions")
        .header("Authorization", "Bearer " + apiKey)
        .body(StringBody("{\"model\": \"gpt-3.5-turbo\",\"stream\":true,\"messages\":[{\"role\":\"user\",\"content\":\"Just say HI\"}]}"))
        .asJson(),
    // Keep processing buffered messages until the stop flag is set
    asLongAs("#{stop.isUndefined()}").on(
        sse.processUnmatchedMessages((messages, session) ->
            messages.stream()
                .anyMatch(message -> message.message().contains("{\"data\":\"[DONE]\"}"))
                ? session.set("stop", true)
                : session)
    ),
    // Close the SSE connection once the stream is complete
    sse("close").close()
);

The processUnmatchedMessages method lets us process the buffered inbound messages. This function catches all the messages that ChatGPT sent us, and when we receive {"data":"[DONE]"}, we set a stop variable to true in order to exit the loop.
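If you only need to wait for the final marker rather than inspect every chunk, Gatling's sse.checkMessage offers a leaner alternative. The following is a minimal sketch; the promptWithCheck name and the 60-second timeout are our own assumptions, not part of the original project:

// Sketch: block until a message containing the [DONE] marker arrives,
// instead of looping over unmatched messages
ScenarioBuilder promptWithCheck = scenario("ScenarioWithCheck").exec(
    sse("Connect to LLM and await DONE")
        .post("/completions")
        .header("Authorization", "Bearer " + apiKey)
        .body(StringBody("{\"model\": \"gpt-3.5-turbo\",\"stream\":true,\"messages\":[{\"role\":\"user\",\"content\":\"Just say HI\"}]}"))
        .asJson()
        // Fail the request if [DONE] doesn't arrive within 60 seconds (illustrative timeout)
        .await(60).on(
            sse.checkMessage("wait for DONE").check(substring("[DONE]"))
        ),
    sse("close").close()
);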


Step 4: Injecting Users

As the audience arrives and fills the seats, the theater comes alive. In Gatling, this is the injection profile. It lets you choose how and when users enter your test: gradually, all at once, in waves, and so on.

In our tutorial, we will simulate a low number of users (10) arriving all at once. Want to use a different arrival profile? Check out our various injection profiles; a sketch of an alternative follows the snippet below.

  {
    setUp(
        // Inject all 10 virtual users at once
        prompt.injectOpen(atOnceUsers(10))
    ).protocols(httpProtocol);
  }
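If you'd rather model a gradual arrival instead, here is a hedged sketch of an alternative setUp block; the figures are illustrative, not recommendations:

  {
    setUp(
        prompt.injectOpen(
            // Ramp from 0 to 50 users over 30 seconds...
            rampUsers(50).during(30),
            // ...then sustain an arrival rate of 2 new users per second for a minute
            constantUsersPerSec(2).during(60)
        )
    ).protocols(httpProtocol);
  }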


Step 5: Running the Simulation

Run the simulation to see how the LLM handles the load. Set your API key, then use the following commands to execute the test:

export api_key=your_token # on Linux and macOS
set api_key=your_token # on Windows
./mvnw gatling:test


Step 6: Analyzing the Results

After the simulation completes, Gatling prints the path to an HTML report in the terminal. Review metrics such as response times and the number of successful and failed connections to spot potential issues with your service.


Conclusion

By updating its SSE support to add the POST method, Gatling enables load testing for applications that rely on it, LLMs chief among them. This practical example using the OpenAI API demonstrates how you can use Gatling to ensure your applications effectively manage user demand. So don't streSSE about it: use Gatling to keep your servers and users happy.