I used o3 to find a remote zeroday in the Linux SMB implementation

admin

May 24, 2025 - 19:45

In this post I’ll show you how I found a zeroday vulnerability in the Linux kernel using OpenAI’s o3 model. I found the vulnerability with nothing more complicated than the o3 API – no scaffolding, no agentic frameworks, no tool use.

Recently I’ve been auditing ksmbd for vulnerabilities. ksmbd is “a linux kernel server which implements SMB3 protocol in kernel space for sharing files over network”. I started this project specifically to take a break from LLM-related tool development, but after the release of o3 I couldn’t resist using the bugs I had found in ksmbd as a quick benchmark of o3’s capabilities. In a future post I’ll discuss o3’s performance across all of those bugs, but here we’ll focus on how o3 found a zeroday vulnerability during my benchmarking. The vulnerability it found is CVE-2025-37899 (fix here), a use-after-free in the handler for the SMB ‘logoff’ command. Understanding the vulnerability requires reasoning about concurrent connections to the server, and how they may share various objects in specific circumstances. o3 was able to comprehend this and spot a location where a particular object that is not reference counted is freed while still being accessible by another thread. As far as I’m aware, this is the first public discussion of a vulnerability of that nature being found by an LLM.

Before I get into the technical details, the main takeaway from this post is this: with o3, LLMs have made a leap forward in their ability to reason about code, and if you work in vulnerability research you should start paying close attention. If you’re an expert-level vulnerability researcher or exploit developer, the machines aren’t about to replace you. In fact, it is quite the opposite: they are now at a stage where they can make you significantly more efficient and effective. If you have a problem that can be represented in fewer than 10k lines of code, there is a reasonable chance o3 can either solve it or help you solve it.

Aside: If you work at a frontier lab and want to discuss evaluating your model’s capabilities on these sorts of tasks then drop me an email via firstname.lastname @ gmail.com.

o3 re-finds CVE-2025-37778

Let’s first discuss CVE-2025-37778, a vulnerability I found manually but which o3 was also able to find. CVE-2025-37778 is a use-after-free vulnerability. The issue occurs during the Kerberos authentication path when handling a “session setup” request from a remote client. To save us referring to CVE numbers, I will refer to this vulnerability as the “kerberos authentication vulnerability”.

The root cause looks as follows:

static int krb5_authenticate(struct ksmbd_work *work,
			     struct smb2_sess_setup_req *req,
			     struct smb2_sess_setup_rsp *rsp)
{
...
	if (sess->state == SMB2_SESSION_VALID) 
		ksmbd_free_user(sess->user);
	
	retval = ksmbd_krb5_authenticate(sess, in_blob, in_len,
					 out_blob, &out_len);
	if (retval) {
		ksmbd_debug(SMB, "krb5 authentication failed\n");
		return -EINVAL;
	}
...

If krb5_authenticate detects that the session state is SMB2_SESSION_VALID then it frees sess->user. The assumption here appears to be that afterwards either ksmbd_krb5_authenticate will reinitialise it to a new valid value, or that after returning from krb5_authenticate with a return value of -EINVAL that sess->user will not be used elsewhere. As it turns out, this assumption is false. We can force ksmbd_krb5_authenticate to not reinitialise sess->user, and we can access sess->user even if krb5_authenticate returns -EINVAL.
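To make the failure mode concrete, here is a minimal userspace sketch of the pattern, not the kernel code. It assumes the failing authentication path is one of those that leaves sess->user untouched. The names model_free_user, model_krb5_authenticate and session_user_is_dangling, and the live flag that stands in for "this memory has been freed", are all hypothetical and exist only so the stale pointer can be observed without actually touching freed memory.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum { SESSION_IN_PROGRESS, SESSION_VALID };

/* Hypothetical stand-in for the user object: live == false models an
 * object that has already been freed. */
struct user {
	bool live;
};

struct session {
	int state;
	struct user *user;
};

/* Stand-in for ksmbd_free_user(): marks the object dead rather than
 * calling free(), so the model never touches freed memory. */
static void model_free_user(struct user *u)
{
	u->live = false;
}

/* Models the shape of krb5_authenticate(): free on a valid session, then
 * either install a fresh user (auth_ok) or bail out with -EINVAL. */
static int model_krb5_authenticate(struct session *s, bool auth_ok,
				   struct user *fresh)
{
	if (s->state == SESSION_VALID)
		model_free_user(s->user);

	if (!auth_ok)
		return -EINVAL;	/* s->user still points at the dead object */

	s->user = fresh;
	return 0;
}

/* True when the session still holds a pointer to a dead object. */
static bool session_user_is_dangling(const struct session *s)
{
	return s->user != NULL && !s->user->live;
}
```

Calling model_krb5_authenticate on a session in the SESSION_VALID state with auth_ok set to false returns -EINVAL and leaves session_user_is_dangling reporting true, which is the state any later reader of the pointer would trip over.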

This vulnerability is a nice benchmark for LLM capabilities as:

It is interesting by virtue of being part of the remote attack surface of the Linux kernel.
It is not trivial as it requires:
- (a) Figuring out how to get sess->state == SMB2_SESSION_VALID in order to trigger the free.
- (b) Realising that there are paths in ksmbd_krb5_authenticate that do not reinitialise sess->user and reasoning about how to trigger those paths.
- (c) Realising that there are other parts of the codebase that could potentially access sess->user after it has been freed.
While it is not trivial, it is also not insanely complicated. I could walk a colleague through the entire code-path in 10 minutes, and you don’t really need to understand a lot of auxiliary information about the Linux kernel, the SMB protocol, or the remainder of ksmbd, outside of connection handling and session setup code. I calculated how much code you would need to read at a minimum if you read every ksmbd function called along the path from a packet arriving to the ksmbd module to the vulnerability being triggered, and it works out at about 3.3k LoC.

OK, so we have the vulnerability we want to use for evaluation, now what code do we show the LLM to see if it can find it? My goal here is to evaluate how o3 would perform were it the backend for a hypothetical vulnerability detection system, so we need to ensure we have clarity on how such a system would generate queries to the LLM. In other words, it is no good arbitrarily selecting functions to give to the LLM to look at if we can’t clearly describe how an automated system would select those functions. The ideal use of an LLM is that we give it all the code from a repository, it ingests it and spits out results. However, due to context window limitations and regressions in performance that occur as the amount of context increases, this isn’t practically possible right now.

Instead, I thought one possible way that an automated tool could generate context for the LLM was through expansion of each SMB command handler individually. So, I gave the LLM the code for the ‘session setup’ command handler, including the code for all functions it calls, and so on, up to a call depth of 3 (this being the depth required to include all of the code necessary to reason about the vulnerability). I also included all of the code for the functions that read data off the wire, parse an incoming request, select the command handler to run, and then tear down the connection after the handler has completed. Without this the LLM would have to guess at how various data structures were set up and that would lead to more false positives. In the end, this comes out at about 3.3k LoC (~27k tokens).
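The expansion step can be sketched as a breadth-first walk over a call graph that stops at a fixed depth. This is only an illustration of the idea and not the tooling I used: the call_graph structure, the MAX_FUNCS limit and the expand_context function are hypothetical, and a real tool would first have to extract the call graph from the source.

```c
#include <stddef.h>
#include <string.h>

#define MAX_FUNCS 8	/* hypothetical cap, keeps the sketch allocation-free */

struct call_graph {
	size_t n;					/* number of functions, <= MAX_FUNCS */
	unsigned char calls[MAX_FUNCS][MAX_FUNCS];	/* calls[a][b] != 0: a calls b */
};

/* Marks in selected[] every function reachable from root in at most
 * max_depth calls (root itself is depth 0) and returns how many were
 * marked. */
static size_t expand_context(const struct call_graph *g, size_t root,
			     unsigned max_depth,
			     unsigned char selected[MAX_FUNCS])
{
	size_t frontier[MAX_FUNCS], next[MAX_FUNCS];
	size_t nf = 0, nn, count = 0;

	memset(selected, 0, MAX_FUNCS);
	if (root >= g->n)
		return 0;

	selected[root] = 1;
	frontier[nf++] = root;
	count = 1;

	for (unsigned depth = 0; depth < max_depth && nf > 0; depth++) {
		nn = 0;
		for (size_t i = 0; i < nf; i++) {
			size_t caller = frontier[i];

			for (size_t callee = 0; callee < g->n; callee++) {
				if (g->calls[caller][callee] && !selected[callee]) {
					selected[callee] = 1;	/* each function is added once */
					next[nn++] = callee;
					count++;
				}
			}
		}
		memcpy(frontier, next, nn * sizeof(next[0]));
		nf = nn;
	}
	return count;
}
```

With a chain of calls f0 -> f1 -> f2 -> f3 -> f4 and a depth of 3, starting from f0 selects f0 through f3 and leaves f4 out. The source of every selected function would then be concatenated into the prompt.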

The final decision is what prompt to use. You can find the system prompt and the other information I provided to the LLM in the .prompt files in this Github repository. The main points to note are:

I told the LLM to look for use-after-free vulnerabilities.
I gave it a brief, high level overview of what ksmbd is, its architecture, and what its threat model is.
I tried to strongly guide it to not report false positives, and to favour not reporting any bugs over reporting false positives. I have no idea if this helps, but I’d like it to help, so here we are. In fact my entire system prompt is speculative in that I haven’t run a sufficient number of evaluations to determine if it helps or hinders, so consider it equivalent to me saying a prayer, rather than anything resembling science or engineering. Once I have run those evaluations I’ll let you know.

To run the query I then use the llm tool (github) like:

$ llm --sf system_prompt_uafs.prompt                \
        -f session_setup_code.prompt                \
        -f ksmbd_explainer.prompt                   \
        -f session_setup_context_explainer.prompt   \
        -f audit_request.prompt

My experiment harness executes this N times (N=100 for this particular experiment) and saves the results. It’s worth noting that if you rerun this you may not get identical results to me, as between running the original experiment and writing this blog post I had removed the code context in session_setup_code.prompt and had to regenerate it. I believe it is effectively identical, but have not re-run the experiment.

o3 finds the kerberos authentication vulnerability in 8 of the 100 runs. In another 66 of the runs o3 concludes there is no bug present in the code (false negatives), and the remaining 28 reports are false positives. In other words, with a ratio of 1:4.5 of true positives to false positives we would have had to go through, at most, 5 false positive reports to get to one of the true positives*. For comparison, Claude Sonnet 3.7 finds it in 3 out of 100 runs and Claude Sonnet 3.5 does not find it in 100 runs.

* Note: This does not mean that the ratio of true positives to false positives would be 1:4.5 if you were to run this approach to using o3 over the entire ksmbd code-base. Recall I explained that this experiment mimics a tool checking each handler individually, and iteratively expanding the code it gives to the LLM from each handler. So, an entire run of this approach on ksmbd would involve 100 queries times the number of handlers times the maximum expansion depth. There’s no reason to believe the TP:FP ratio from this experiment is an accurate predictor of the ratio you’d get from that full run.
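As a rough back-of-the-envelope sketch of the query count described in this note, the product can be written out directly. The function name full_run_queries is hypothetical, and the handler count in the usage note below is a made-up placeholder, not a figure from ksmbd.

```c
#include <stdint.h>

/* Upper bound on LLM queries for a full run: every handler is expanded
 * at every depth from 1 to max_depth, and each resulting context is
 * queried runs_per_context times. */
static uint64_t full_run_queries(uint64_t runs_per_context,
				 uint64_t handlers,
				 uint64_t max_depth)
{
	return runs_per_context * handlers * max_depth;
}
```

For example, with 100 runs per context, a placeholder figure of 20 handlers and a maximum depth of 3, full_run_queries(100, 20, 3) gives 6000 queries, each producing a report that may need triage.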

For the curious, I have uploaded a sample report from o3 (here) and Sonnet 3.7 (here). One aspect I found interesting is their presentation of results. With o3 you get something that feels like a human-written bug report, condensed to just present the findings, whereas with Sonnet 3.7 you get something like a stream of thought, or a work log. There are pros and cons to both. o3’s output is typically easier to follow due to its structure and focus. On the other hand, sometimes it is too brief, and clarity suffers.

o3 finds a 0-day (CVE-2025-37899)

Having confirmed that o3 can find the kerberos authentication vulnerability (CVE-2025-37778) when given the code for the session setup command handler, I wanted to see if it could find it if I gave it the code for all of the command handlers. This is a harder problem as the command handlers are all found in smb2pdu.c, which has ~9k LoC. However, if o3 can still find vulnerabilities when given all of the handlers in one go then it suggests we can build a more straightforward wrapper for o3 that simply hands it entire files, covering a variety of functionality, rather than going handler by handler. It’s worth noting that while the top level command handlers are found in smb2pdu.c, not all of the functions they call are found in that file. There is functionality for virtual file system access, IPC, crypto etc. that these top level handlers make use of but that is found in other files, and I did not provide those files to the model in this experiment.

Combining the code for all of the handlers with the connection setup and teardown code, as well as the command handler dispatch routines, ends up at about 12k LoC (~100k input tokens), and as before I ran the experiment 100 times.

o3 finds the kerberos authentication vulnerability in 1 out of 100 runs with this larger number of input tokens, so a clear drop in performance, but it does still find it. More interestingly however, in the output from the other runs I found a report for a similar, but novel, vulnerability. This vulnerability is also due to a free of sess->user, but this time in the session logoff handler.

int smb2_session_logoff(struct ksmbd_work *work)
{
	struct ksmbd_conn *conn = work->conn;
	struct ksmbd_session *sess = work->sess;

	...

	ksmbd_close_session_fds(work);
	ksmbd_conn_wait_idle(conn);

	...

	down_write(&conn->session_lock);
	sess->state = SMB2_SESSION_EXPIRED;
	up_write(&conn->session_lock);

	if (sess->user) {
		ksmbd_free_user(sess->user);
		sess->user = NULL;
	}
	...

I’ll let o3 explain the issue:

Short description
=================
While one ksmbd worker thread is still executing requests that use
sess‑>user, another thread that processes an SMB2 LOGOFF for the same
session frees that structure.  No synchronisation protects the pointer,
so the first thread dereferences freed memory – a classic
use‑after‑free that leads to kernel memory corruption and arbitrary code
execution in kernel context.

Affected code path
------------------
1.  A second transport is bound to an existing session
    (SMB 3.0 or later, `conn->binding == true`).

2.  Worker‑A (running on connection C2) receives any normal request
    (e.g. WRITE).
       • `smb2_check_user_session()` stores a pointer to the already
         existing `struct ksmbd_session` in `work->sess` and
         increments the session ref‑count, **but it does not take
         any reference on `sess->user`.**
       • The pointer to `sess` is kept for the whole request
         processing; every helper that needs authentication data
         accesses `sess->user` directly.

3.  Before Worker‑A finishes, Worker‑B (running on the first connection
    C1) processes an SMB2 LOGOFF for the same session and executes
    `smb2_session_logoff()` (smb2pdu.c).

Relevant part of smb2_session_logoff()
--------------------------------------
```c
...
	if (sess->user) {
		ksmbd_free_user(sess->user);   /* (1) frees memory     */
		sess->user = NULL;             /* (2) clear the field  */
...
```
`ksmbd_free_user()` ultimately ends in simple `kfree(user)`.

4.  Worker‑B **does not wait for the other connections that are still
    using the session**; it only waits for running requests on *its own*
    connection (`ksmbd_conn_wait_idle(conn)`).

5.  Worker‑A continues to execute (e.g. inside `smb2_open()`,
   `smb2_write()`, …) and dereferences `sess->user`:

```c
/* examples (many of them) */
if (user_guest(sess->user))        ← dereference after free
ksmbd_compare_user(sess->user, …)
sess->user->uid
```

Because the memory was already `kfree()`‑ed in step (1) the access is to
freed memory.  Depending on exact timing it is either:
• a use‑after‑free (pointer still points into now‑reused slab object),
  enabling controlled kernel‑memory overwrite, or
• a NULL–deref (if Worker‑A reads after step (2)), still a DoS.

Reading this report I felt my expectations shift on how helpful AI tools are going to be in vulnerability research. If we were to never progress beyond what o3 can do right now, it would still make sense for everyone working in VR to figure out what parts of their work-flow will benefit from it, and to build the tooling to wire it in. Of course, part of that wiring will be figuring out how to deal with the signal to noise ratio of ~1:50 in this case, but that’s something we are already making progress on.

One other interesting point of note is that when I found the kerberos authentication vulnerability the fix I proposed was as follows:

diff --git a/fs/smb/server/smb2pdu.c b/fs/smb/server/smb2pdu.c
index d24d95d15d87..57839f9708bb 100644
--- a/fs/smb/server/smb2pdu.c
+++ b/fs/smb/server/smb2pdu.c
@@ -1602,8 +1602,10 @@ static int krb5_authenticate(struct ksmbd_work *work,
 	if (prev_sess_id && prev_sess_id != sess->id)
 		destroy_previous_session(conn, sess->user, prev_sess_id);
 
-	if (sess->state == SMB2_SESSION_VALID)
+	if (sess->state == SMB2_SESSION_VALID) {
 		ksmbd_free_user(sess->user);
+		sess->user = NULL;
+	}
 
 	retval = ksmbd_krb5_authenticate(sess, in_blob, in_len,
 					 out_blob, &out_len);
-- 
2.43.0

When I read o3’s bug report above I realised this was insufficient. The logoff handler already sets sess->user = NULL, but is still vulnerable as the SMB protocol allows two different connections to “bind” to the same session and there is nothing on the kerberos authentication path to prevent another thread making use of sess->user in the short window after it has been freed and before it has been set to NULL. I had already made use of this property to hit a prior vulnerability in ksmbd but I didn’t think of it when considering the kerberos authentication vulnerability.
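Since the underlying problem is that the object is not reference counted, one conventional way to close this kind of window is to have each user of the object hold its own reference. The following is a minimal userspace sketch of that idea using C11 atomics. It is not the fix that was applied to ksmbd, and the struct layouts and the names user_alloc, user_put, session_borrow_user and session_logoff are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; not the ksmbd structures. */
struct user {
	atomic_int refs;
	int uid;
};

struct session {
	struct user *user;
};

/* Allocates a user holding one reference, owned by the session. */
static struct user *user_alloc(int uid)
{
	struct user *u = malloc(sizeof(*u));

	if (!u)
		return NULL;
	atomic_init(&u->refs, 1);
	u->uid = uid;
	return u;
}

/* Drops one reference; frees the object and returns true only when the
 * caller held the last one. atomic_fetch_sub returns the old value. */
static bool user_put(struct user *u)
{
	if (atomic_fetch_sub(&u->refs, 1) == 1) {
		free(u);
		return true;
	}
	return false;
}

/* Request side: pin the user for the lifetime of one request.
 * Assumption: in real concurrent code, reading sess->user and taking the
 * reference must be serialised against session_logoff() (for example
 * under a lock); that serialisation is not modelled here. */
static struct user *session_borrow_user(struct session *sess)
{
	struct user *u = sess->user;

	if (u)
		atomic_fetch_add(&u->refs, 1);
	return u;
}

/* Logoff side: detach the user and drop only the session's reference.
 * Returns true if that was the last reference and the user was freed. */
static bool session_logoff(struct session *sess)
{
	struct user *u = sess->user;

	sess->user = NULL;
	return u ? user_put(u) : false;
}
```

In this model a worker that called session_borrow_user before the logoff can keep reading its pointer after session_logoff returns, and the memory is only released when that worker calls user_put. Setting the field to NULL then only stops new borrowers, and the lifetime of the object is decided by the count rather than by whichever thread reaches the free first.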

Having realised this, I went through o3’s results from searching for the kerberos authentication vulnerability again and noticed that in some of its reports it had made the same error as me, while in others it had realised that setting sess->user = NULL was insufficient to fix the issue due to the possibilities offered by session binding. That is quite cool as it means that had I used o3 to find and fix the original vulnerability I would have, in theory, done a better job than without it. I say ‘in theory’ because right now the false positive to true positive ratio is probably too high to say definitively that I would have gone through each report from o3 with the diligence required to spot its solution. Still, that ratio is only going to get better.

Conclusion

LLMs exist at a point in the capability space of program analysis techniques that is far closer to humans than anything else we have seen. Considering the attributes of creativity, flexibility, and generality, LLMs are far more similar to a human code auditor than they are to symbolic execution, abstract interpretation or fuzzing. Since GPT-4 there have been hints of the potential for LLMs in vulnerability research, but the results on real problems have never quite lived up to the hope or the hype. That has changed with o3, and we have a model that can do well enough at code reasoning, Q&A, programming and problem solving that it can genuinely enhance human performance at vulnerability research.

o3 is not infallible. Far from it. There’s still a substantial chance it will generate nonsensical results and frustrate you. What is different is that for the first time the chance of getting correct results is sufficiently high that it is worth your time and your effort to try to use it on real problems.