Server-Sent Events (SSE) enable servers to push real-time updates to clients over a single HTTP connection, creating a one-way data stream from server to client. SSE is an excellent fit for real-time applications like LLM (large language model) chat services such as ChatGPT or Anthropic's Claude.
OpenAI and others have extended the SSE standard by adding the POST method. They did this to let users send prompts that, because of their length, would otherwise be truncated by web servers if passed as query parameters. To support this use case, Gatling has updated its SSE support to include the POST method.
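To make this concrete, here is a rough sketch of the exchange (the request and chunk payloads below are abbreviated; real OpenAI chunks carry more fields, such as id and model metadata): the client sends a single POST with the prompt in the body, and the server answers with a text/event-stream of data: events, terminated by a [DONE] marker.

  POST /v1/chat/completions HTTP/1.1
  Host: api.openai.com
  Content-Type: application/json
  Authorization: Bearer <your_api_key>

  {"model": "gpt-3.5-turbo", "stream": true, "messages": [{"role": "user", "content": "Just say HI"}]}

  HTTP/1.1 200 OK
  Content-Type: text/event-stream

  data: {"choices":[{"delta":{"content":"HI"},"index":0,"finish_reason":null}]}

  data: [DONE]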
Let's dive into an example to illustrate how to load test an LLM using Gatling. We'll create a simulation that load tests OpenAI's chat completions endpoint, which uses SSE to stream responses.
First, ensure you have Gatling installed and set up. If you don't have Gatling installed, you can follow the installation instructions in the documentation and download the project from GitHub.
Picture yourself at a grand theater in Paris, comfortably seated and admiring the set and ambiance. In Gatling, just as the theater environment shapes the audience experience, the HTTP protocol provides the framework for your test scenarios. The baseUrl defines where the performance takes place, guiding all interactions to the correct destination.
In your Gatling project, configure the HTTP protocol to specify the base URL of the OpenAI API. We use sseUnmatchedInboundMessageBufferSize to buffer inbound messages that no check is waiting for, so we can process them later in the scenario:
import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import io.gatling.javaapi.core.*;
import io.gatling.javaapi.http.*;

public class SSELLM extends Simulation {

  // Read the OpenAI API key from the environment
  String apiKey = System.getenv("api_key");

  HttpProtocolBuilder httpProtocol =
      http.baseUrl("https://api.openai.com/v1/chat")
          .sseUnmatchedInboundMessageBufferSize(100);
In our case, our scenario is pretty small: we connect to the LLM, send a prompt, and loop over the inbound messages until we receive the final {"data":"[DONE]"} message.
  ScenarioBuilder prompt = scenario("Scenario").exec(
      sse("Connect to LLM and get Answer")
          .post("/completions")
          .header("Authorization", "Bearer " + apiKey)
          .body(StringBody("{\"model\": \"gpt-3.5-turbo\",\"stream\":true,\"messages\":[{\"role\":\"user\",\"content\":\"Just say HI\"}]}"))
          .asJson(),
      asLongAs("#{stop.isUndefined()}").on(
          sse.processUnmatchedMessages((messages, session) -> {
              // Exit the loop once the end-of-stream marker arrives
              return messages.stream()
                  .anyMatch(message -> message.message().contains("{\"data\":\"[DONE]\"}"))
                      ? session.set("stop", true)
                      : session;
          })
      ),
      sse("close").close()
  );
The processUnmatchedMessages method allows us to process the inbound messages. It catches every message the API sends us, and when we receive {"data":"[DONE]"}, we set a stop session attribute to true in order to exit the loop.
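Beyond detecting the end of the stream, you might also want to collect the model's answer as it arrives. Here is a minimal sketch, assuming the same processUnmatchedMessages callback; the answer session key is our own illustrative name, not part of the original simulation:

  sse.processUnmatchedMessages((messages, session) -> {
      // Hypothetical variant: append each raw chunk to an "answer" session attribute
      String previous = session.contains("answer") ? session.getString("answer") : "";
      String chunks = messages.stream()
          .map(message -> message.message())
          .filter(body -> !body.contains("[DONE]"))
          .collect(java.util.stream.Collectors.joining());
      Session updated = session.set("answer", previous + chunks);
      // Still exit the loop when the end-of-stream marker arrives
      return messages.stream()
          .anyMatch(message -> message.message().contains("{\"data\":\"[DONE]\"}"))
              ? updated.set("stop", true)
              : updated;
  })

You could then read the answer attribute later in the scenario, for example to log it or to assert on its content.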
In our tutorial, we will simulate a low number of users (10 users) arriving at once on our service. Do you want to use different user arrival profiles? Check out our various injection profiles, and see the alternative sketched after the snippet below.
  {
    setUp(
        prompt.injectOpen(atOnceUsers(10))
    ).protocols(httpProtocol);
  }
}
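As an illustration (not part of the original simulation), the same setUp could instead ramp users progressively; the figures below, 50 users over 30 seconds, are arbitrary:

  {
    setUp(
        // Ramp 50 virtual users linearly over a 30-second window
        prompt.injectOpen(rampUsers(50).during(30))
    ).protocols(httpProtocol);
  }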
Run the simulation to see how the LLM handles the load. Use the following commands to execute the test:
export api_key=your_token   # on Linux and Mac
set api_key=your_token      # on Windows

./mvnw gatling:test         # on Linux and Mac
mvnw.cmd gatling:test       # on Windows
After the simulation completes, Gatling prints a link to the HTML report in the terminal. Review metrics like response times and the number of successful and failed connections to spot potential issues with your service.
By updating its SSE support to add the POST method, Gatling enables load testing for applications that use it, LLMs being a prime example. This practical example using the OpenAI API demonstrates how you can use Gatling to ensure your applications effectively manage user demand. So, don't streSSE about it and use Gatling to keep your servers and users happy.