Reliable automatic code fixes with AI

Introduction

At Mobb, we are doing our best to provide automatic fixes for vulnerabilities detected by Static Application Security Testing Tools (SAST). 

Today's vulnerability detection capabilities

Static application security testing tools scan your code and recognize potentially dangerous coding flaws such as command execution, SQL injection, or cross-site scripting.

A conventional way to prevent such vulnerabilities is to sanitize user inputs. However, that's easier said than done. Each dangerous function requires a different sanitization approach: a SQL query needs to be parameterized (and the way you provide parameters varies between libraries), system command arguments should be quoted, and so on. The number of variations is almost endless.

Our previous blog demonstrated how using ChatGPT to generate code fixes often leads to unreliable and unstable results. Organizations relying solely on AI to generate code fixes are at risk, which is why we've taken a Hybrid approach to AI.

This publication will demonstrate how Mobb curbs AI hallucinations to achieve predictable results. Be warned — this article is full of juicy technical details.

What are Hybrid fixes?

Internally, we divide all code fixes into three main categories based on the amount of AI used to modify the code.

Algorithmic fixes

The first category is purely algorithmic or rule-based fixes—no AI is used. In some cases, complex logic parses the code and matches potentially dangerous code lines against the vulnerability reports provided by the SAST tools. A huge advantage of this approach is that even if we miss something, we can always tune our logic to satisfy user needs.

As an example of an algorithmic fix, consider a Log Injection vulnerability. Log Injection allows an attacker to inject a newline separator (\r\n) into a log message and introduce unintended entries into your log collector.

For example:


```
log.info("Logged in user: " + username);
```
If, in this example, we set the username to "Kirill\r\nLogged in user: Jonathan", we will see the following in the logs:
```
[INFO] Logged in user: Kirill
[INFO] Logged in user: Jonathan
```
To prevent this vulnerability, we can replace new line separators:
```
log.info(("Logged in user: " + username).replace("\n", "").replace("\r", ""));
```

As you can see, AI is redundant for this type of fix. In such cases, AI will always be slower and more expensive than simple code logic. We have no intention of using AI for its own sake: we use it only when no better alternative exists.
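
A rule-based rewrite like the Log Injection fix needs very little machinery. The Python sketch below is illustrative only: it assumes the logging call fits on a single line and has the shape `log.<level>(...)`, and the `LOG_CALL` pattern and `fix_log_injection` name are hypothetical, not Mobb's implementation.

```python
import re

# Hypothetical pattern: a one-line call such as log.info(<expr>);
LOG_CALL = re.compile(
    r'^(?P<indent>\s*)(?P<call>log\.(?:info|warn|error|debug))'
    r'\((?P<arg>.*)\);\s*$'
)


def fix_log_injection(line: str) -> str:
    """Rewrite a single source line (without its line terminator).

    Lines that do not look like a logging call are returned unchanged.
    """
    match = LOG_CALL.match(line)
    if match is None:
        return line
    # Wrap the logged expression and strip CR/LF from the final string.
    # "\\n" here emits the two characters \n into the generated source code.
    sanitized = f'({match.group("arg")}).replace("\\n", "").replace("\\r", "")'
    return f'{match.group("indent")}{match.group("call")}({sanitized});'


print(fix_log_injection('log.info("Logged in user: " + username);'))
```

Because the rule is plain code, a missed pattern (a multi-line call, a different logger name) can be handled by extending the rule rather than by re-prompting a model.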

Pure AI fixes

The second category is pure LLM AI fixes. In this method, we process all the relevant data on the issue, feed it to the AI model, and, with some prompt engineering, ask it to fix the code. Our analysis shows that even with data preprocessing and detailed prompt engineering, the AI models "do the work" in only very few cases. As a result, while this method scores better than a developer using ChatGPT, fewer than half of the fixes it produces are accurate, and we mark them as "experimental" in the Mobb platform. More info about the challenges of this approach can be found in our recent blog.

It's outside the scope of this publication to describe all the code transformations and LLM prompt customizations we use to achieve better results in this category. In general, we see no easy wins in fully automatic, AI-only fixes.

Hybrid fixes

Last but not least, the third category of fixes. In this method, we mix the best of two worlds. We use an algorithmic approach to parse the code and find vulnerable lines of code. We use an LLM to provide parts of the fix, which would otherwise be hard or impossible to generate algorithmically.

Case study

As an example of a Hybrid fix, we will demonstrate how we solve the Null Dereference vulnerability reported by the OpenText Fortify SAST scanner. 

The vulnerability

As mentioned above, we will focus on fixing one specific vulnerability – Null Dereference in C#. First, let's have a quick look at it:


```
string cmd = Environment.GetEnvironmentVariable("cmd");
string trimmed = cmd.Trim();
Console.WriteLine(trimmed);
```



In this code, Fortify will report the line `string trimmed = cmd.Trim();` as vulnerable because `cmd` can be null, which can crash the application with an unexpected `System.NullReferenceException`. An unhandled `System.NullReferenceException` may lead to denial of service if an attacker figures out a way to trigger it continuously.

The fix

There are several ways to fix this problem, but the most straightforward is wrapping the problematic lines with an if.


```
string cmd = Environment.GetEnvironmentVariable("cmd");
if (cmd != null)
{
    string trimmed = cmd.Trim();
    Console.WriteLine(trimmed);
}
```

 

Note how we also need to move all subsequent lines of the same code block into the if block, because any of them may depend on the problematic line.

If any of the subsequent lines is a return statement, our fix may affect the execution flow and break the code. Example:


```
static string Test()
{
    string cmd = Environment.GetEnvironmentVariable("cmd");
    if (cmd != null)
    {
        string trimmed = cmd.Trim();
        return trimmed;
    }
}
// error CS0161: 'Program.Test()': not all code paths return a value
```

 

To avoid breaking the code, we can detect cases that include return statements within the same code block after the vulnerable line and ignore such cases. Fortunately, in the testing dataset we collected, only a small portion of the samples were unfixable due to this problem.

Wrapping the problematic lines with an if statement is easy programmatically. However, creating the accurate "if" condition is more complex. Consider the following example:


```
Console.WriteLine($"Setting value: {settings["test"].val[0]["foo"]}");
```

 

We must consider all possible options because we don't know which part can be null. The correct solution for this line of code will look similar to the following:


```
if (settings != null && settings.ContainsKey("test") && settings["test"] != null &&
 settings["test"].val != null && settings["test"].val.Count > 0 &&
 settings["test"].val[0] != null && settings["test"].val[0].ContainsKey("foo"))
{
    Console.WriteLine($"Setting value: {settings["test"].val[0]["foo"]}");
}
```

 

This is where AI comes into the picture. Implementing this algorithmically would require iterating over every variable in the line, determining its type, working out how to access its nested properties, and then repeating the process for each property. We'd also need to handle edge cases, such as non-nullable properties on some library objects. For generative AI, however, producing an accurate condition is easy because it "knows" the C# documentation and is good at producing code from a well-defined description.
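
To make the difficulty concrete, here is a deliberately naive Python sketch of the algorithmic route. It assumes the expression is a plain dotted member chain and simply null-checks every prefix; the `naive_null_guard` name is hypothetical. It has no notion of indexers, `ContainsKey`, `Count`, or non-nullable types, which is exactly where a purely algorithmic generator becomes hard to write.

```python
def naive_null_guard(chain: str) -> str:
    """Build a C# null-check condition for a dotted chain such as 'a.b.c'.

    Every prefix of the chain except the final member is compared to null.
    Returns an empty string when there is nothing to check.
    """
    parts = chain.split(".")
    prefixes = [".".join(parts[:i]) for i in range(1, len(parts))]
    return " && ".join(f"{prefix} != null" for prefix in prefixes)


print(naive_null_guard("cmd.Trim"))
print(naive_null_guard("user.Profile.Name.Length"))
```

For `settings["test"].val[0]["foo"]` this approach already falls short: it would need to know that `settings` is a dictionary, that `val` is a list that may be empty, and so on for every type involved.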

The algorithmic fix part

To produce fixes for this issue successfully, we first need to find specific places in the code. Let's see step by step what we do under the hood.

AST parsing

The best way to programmatically analyze the source code is to parse it to a convenient representation. The most common way to do that is with an Abstract Syntax Tree (AST for short). AST is actively used in many code analysis tools, such as linters, formatters, and security scanners.

In our code, we use Python bindings for Tree-sitter. Tree-sitter supports parsing many different programming languages; it is incredibly fast and has a very convenient API for interacting with the parsed tree.

Let's look at this code for example:


```
static string Test()
{
    string cmd = Environment.GetEnvironmentVariable("cmd");
    string trimmed = cmd.Trim();
    return trimmed;
}
```

 

Here's a simplified AST representation of it:


```
local_function_statement (representing the function definition)
    block (representing the function body surrounded by curly braces)
        variable_declaration (the string cmd definition line)
        variable_declaration (the vulnerable string trimmed = cmd.Trim(); line)
        return_statement (the line with the return)
```

 

Feel free to use an online live Tree-sitter playground to understand the AST structure better.

Detecting fixable cases

We already have a reference point since Fortify reported the vulnerable line for us. We can easily find the AST nodes matching the vulnerable line.

However, we also need to find all subsequent lines within the same code block and wrap them in the if.


```
static string Test()
{
    string cmd = Environment.GetEnvironmentVariable("cmd");
    string trimmed = cmd.Trim(); // << this is what we have
    return trimmed; // << this is what we also need to include
}
```

 

To do that, we first need to traverse up to the closest parent block node (the function block in this case) and select all statements within the block that appear after the vulnerable line.

Additionally, as we concluded before, we should refrain from attempting a fix when a return statement follows the vulnerable line within the same block. This check can be done using the Tree-sitter query language.
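
The selection logic can be sketched in a few lines of Python. To keep the example dependency-free, it uses a small hypothetical `Node` class as a stand-in for real syntax-tree nodes (this is not the Tree-sitter API), and it checks for return statements with a plain loop rather than a Tree-sitter query. All names here are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical stand-in for a syntax-tree node (not the Tree-sitter API)."""
    type: str
    line: int
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def statements_to_wrap(vulnerable: Node) -> Optional[List[Node]]:
    """Return the statements to move into the `if`, or None if unfixable."""
    # Climb until the direct parent is a block, so `stmt` is one of its statements.
    stmt = vulnerable
    while stmt.parent is not None and stmt.parent.type != "block":
        stmt = stmt.parent
    if stmt.parent is None:
        return None  # the node is not inside any block
    block = stmt.parent
    # Keep the vulnerable statement and everything after it in the same block.
    tail = [s for s in block.children if s.line >= stmt.line]
    # Wrapping a return statement could leave a code path without a return value.
    if any(s.type == "return_statement" for s in tail):
        return None
    return tail


func = Node("local_function_statement", 1)
body = func.add(Node("block", 2))
body.add(Node("variable_declaration", 3))
vulnerable = body.add(Node("variable_declaration", 4))
body.add(Node("expression_statement", 5))
print([s.line for s in statements_to_wrap(vulnerable)])
```

Returning `None` for the unfixable cases keeps the decision explicit: the caller can skip the finding instead of emitting a fix that no longer compiles.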

Prompting the LLM

Now we have all the components we need to ask the LLM for a fix. To make the result as predictable as possible, we don't want to give the LLM more context than it needs to answer our question.

In this specific case, we only inform the LLM that we have a NullReferenceException in one C# code line and explicitly ask it to surround this line with the if statement to prevent potential exceptions.

The resulting prompt is short and provides little space for the LLM to make a mistake. The simplified version of the prompt:


```
I have a NullReferenceException in C# code in line `string trimmed = cmd.Trim();`. You must surround the line of code with an `if` statement to avoid potential NullReferenceException.
```
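
Building such a prompt is a simple templating step. The Python sketch below reuses the simplified prompt text shown above; the `PROMPT_TEMPLATE` and `build_prompt` names are hypothetical, and the production prompt is presumably more elaborate.

```python
# Template based on the simplified prompt above; the real prompt may differ.
PROMPT_TEMPLATE = (
    "I have a NullReferenceException in C# code in line `{line}`. "
    "You must surround the line of code with an `if` statement "
    "to avoid potential NullReferenceException."
)


def build_prompt(vulnerable_line: str) -> str:
    """Insert the reported line, without surrounding whitespace, into the template."""
    return PROMPT_TEMPLATE.format(line=vulnerable_line.strip())


print(build_prompt("    string trimmed = cmd.Trim();"))
```

Only the single reported line goes into the prompt; the rest of the file never reaches the model, which keeps the response focused on the condition.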

 

One of our goals when we generate the fix is to provide the customer with as few changes as possible. To do that, we parse the LLM response with the same AST parser and extract only the if condition.

Consider the response (it is a real response from ChatGPT):


````
Certainly! To avoid a `NullReferenceException`, you should check if `cmd` is `null` before attempting to call the `Trim` method. Here's an example of how you can do this with an `if` statement:

```csharp
string trimmed = null;

if (cmd != null)
{
    trimmed = cmd.Trim();
    // Your other code using the trimmed string
}
else
{
    // Handle the case when cmd is null
    // You can assign a default value to trimmed or log an error, etc.
}
```

This way, you ensure that the `Trim` method is only called if `cmd` is not `null`, preventing a `NullReferenceException`.
````
Note: We don't use ChatGPT; we will discuss the model we use later in this publication. It was used here only as an example.

The AI model added many unwanted things, but using the parser, we extract only `cmd != null` from the answer.
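
The extraction itself is done with the AST parser. As a dependency-free stand-in (a swapped-in technique, not the AST-based approach), the Python sketch below finds the first `if (` in the response and balances parentheses, skipping over string literals. The `extract_if_condition` name is hypothetical, and the sketch assumes the first `if (` in the response belongs to the generated code.

```python
import re
from typing import Optional


def extract_if_condition(response: str) -> Optional[str]:
    """Return the text between the parentheses of the first `if (...)`."""
    match = re.search(r"\bif\s*\(", response)
    if match is None:
        return None
    start = match.end()
    depth = 1            # we are already inside the opening parenthesis
    in_string = False
    i = start
    while i < len(response):
        ch = response[i]
        if in_string:
            if ch == "\\":
                i += 1   # skip the escaped character
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return response[start:i].strip()
        i += 1
    return None          # unbalanced parentheses


reply = "Certainly!\n\nif (cmd != null)\n{\n    trimmed = cmd.Trim();\n}\n"
print(extract_if_condition(reply))
```

Everything else in the response (the extra declaration, the `else` branch, the explanation) is discarded, so the change presented to the user stays minimal.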

Rendering the fix

At this point, we have all the pieces of the puzzle. We know all the places in the original code we need to replace, and we have AI-generated if conditions. All we need to do now is glue them together. 

But it may be harder than it looks. We need to track how your source code is aligned, whether you use tabs or spaces to pad lines, and whether you use Windows \r\n or Unix \n as a line separator. All these tiny details make the fix look better. Nobody wants to merge poorly formatted code to their main branch.

From the original example:


```
string cmd = Environment.GetEnvironmentVariable("cmd");
// insert one line before the vulnerable line with the original offset from the line start
// use AI generated condition in the if
if (cmd != null)
// insert a curly brace with the original offset
{
    // add proper amount of tabs and spaces at the beginning of the line
    string trimmed = cmd.Trim();
    // add proper amount of tabs and spaces at the beginning of the line
    Console.WriteLine(trimmed);
// insert a curly brace with the original offset
}
```
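
The rendering step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: line indices are 0-based, the extra indentation level is a tab when the vulnerable line is indented with tabs and four spaces otherwise, and the `wrap_in_if` name is hypothetical.

```python
def wrap_in_if(source: str, first: int, last: int, condition: str) -> str:
    """Wrap lines first..last (0-based, inclusive) in `if (condition) { ... }`."""
    # Preserve the file's line separator (Windows \r\n or Unix \n).
    newline = "\r\n" if "\r\n" in source else "\n"
    lines = source.split(newline)

    # Reuse the leading whitespace of the vulnerable line for `if`, `{` and `}`.
    target = lines[first]
    indent = target[: len(target) - len(target.lstrip())]

    # Assumption: one extra level is a tab for tab-indented code, else 4 spaces.
    step = "\t" if "\t" in indent else "    "

    wrapped = [f"{indent}if ({condition})", f"{indent}{{"]
    # Indent the wrapped statements; leave blank lines untouched.
    wrapped += [step + line if line.strip() else line
                for line in lines[first:last + 1]]
    wrapped.append(f"{indent}}}")
    return newline.join(lines[:first] + wrapped + lines[last + 1:])


code = (
    '    string cmd = Environment.GetEnvironmentVariable("cmd");\n'
    "    string trimmed = cmd.Trim();\n"
    "    Console.WriteLine(trimmed);"
)
print(wrap_in_if(code, 1, 2, "cmd != null"))
```

Inferring the indentation unit from the surrounding lines, instead of hard-coding it, is what keeps the resulting diff small and consistent with the rest of the file.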


Finally, the fix is done, and we can present it to the user.

The AI part

You may have noticed that we haven't discussed the specific LLM AI model much in this publication. Let's dive into that.

Testing

The most important thing when interacting with an LLM is to ensure your prompt produces the correct result, and that's difficult to achieve; any change to the prompt could lead to undesirable results.

To ensure that, we have implemented a testing pipeline to verify we are obtaining the desired outcome. It looks like this:

We run the tests for every model we want to evaluate and for every prompt change, and we carefully review the true positives (good fixes).
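
One way to picture such a pipeline is a regression harness that re-runs a fixed set of samples whenever the model or the prompt changes. The Python sketch below is an assumption about the general shape, not a description of Mobb's pipeline: `generate_condition` stands for any callable that maps a vulnerable line to an `if` condition, and comparing against an expected condition after stripping whitespace is a simplification of the manual review described above.

```python
from typing import Callable, Iterable, Tuple


def pass_rate(generate_condition: Callable[[str], str],
              samples: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (vulnerable_line, expected_condition) samples that match."""
    samples = list(samples)
    if not samples:
        return 0.0

    def normalize(condition: str) -> str:
        # Ignore whitespace differences between generated and expected text.
        return "".join(condition.split())

    passed = sum(
        1 for line, expected in samples
        if normalize(generate_condition(line)) == normalize(expected)
    )
    return passed / len(samples)


def fake_model(line: str) -> str:
    """Stand-in for a model call so the example runs offline."""
    return "cmd != null"


samples = [
    ("string trimmed = cmd.Trim();", "cmd != null"),
    ("int n = name.Length;", "name != null"),
]
print(pass_rate(fake_model, samples))
```

Tracking a single number per model and prompt version makes regressions visible immediately, while the individual mismatches remain available for manual review.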

The model

We considered several models, including our own fine-tuned ones. For the hybrid fixes, we chose Mixtral 8x7B. In this publication, we won't go into the details of how we evaluated the different models and why we picked this specific one.

Hosting the model

Running a self-hosted LLM is expensive, and we had to keep that in mind throughout without compromising the quality of the results. So why not use a serverless provider? Well, I am glad you asked. Mobb guarantees the security of our users' data, and source code is one of their most valuable assets. We take that very seriously and assure our customers that their code is not shared with undesired third parties. On top of that, because Mobb holds SOC 2 Type 2 certification, we can only use services that comply with the same requirements.

That means we don't just have to select the model we want to run but also how it will be served. Our first choice was to use text-generation-inference. Text Generation Inference (TGI) is a toolkit for deploying and serving LLMs. TGI enables high-performance text generation for the most popular open-source LLMs (from the GitHub description).

The first tests showed a start-up time of about 3 minutes. Most of that time appeared to be spent sharding the model across the GPUs (yes, that is plural): running Mixtral 8x7B with TGI requires at least 2 GPUs (in our case, A100s with 80 GB each). That means we couldn't run any inference until the server was up and running. Why not keep it running all the time?

Well, as mentioned earlier, we need to consider the cost. Our ideal scenario was to pay for what we use and not more than that.

In the second iteration, after the solution was built end to end, we evaluated the llama.cpp HTTP server with a quantized version of Mixtral 8x7B. With this combination, we reduced the start-up time to ~20 seconds on a single GPU. This is the setup we are using at the moment.

Knowing the average length of the prompts we need to handle, we can run ~10 requests in parallel, each taking ~2 seconds. This architecture scales up with the load the platform needs to process. When the load is low, the servers shut down so that we don't pay for resources we are not using.
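
On the client side, bounded parallelism like this can be sketched with a thread pool. The Python example below is an illustration under assumptions: `complete` is a hypothetical callable that sends one prompt to the inference server and returns the generated text, and `MAX_PARALLEL` simply mirrors the ~10 concurrent requests mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

MAX_PARALLEL = 10  # mirrors the ~10 parallel requests mentioned above


def complete_all(prompts: List[str],
                 complete: Callable[[str], str]) -> List[str]:
    """Run `complete` over all prompts, at most MAX_PARALLEL at a time.

    Results are returned in the same order as the prompts.
    """
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        return list(pool.map(complete, prompts))


def fake_complete(prompt: str) -> str:
    """Stand-in for a real inference call so the example runs offline."""
    return prompt.upper()


print(complete_all(["first prompt", "second prompt"], fake_complete))
```

As a rough back-of-the-envelope figure under those numbers, ten concurrent requests of about two seconds each correspond to roughly five requests per second per server.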

This project is ongoing; even though we are already serving our users, we keep researching better ways to use LLMs to remediate vulnerabilities and to test the changes we make to the code.

Our conclusion?

Generative AI is still very far away from being a silver bullet in the automatic software code remediation domain. But with vigilant supervision, it may become a really helpful tool.

At Mobb, we believe in Hybrid fixes. In our experience, they are very stable and require significantly less effort to implement than pure algorithmic fixes. In fact, we believe this so much that we wrote a couple of patents on this approach.


Jonathan Santilli and Kirill Efimov
Kirill Efimov is a highly skilled software engineer and security expert with a strong background in software development and team leadership. Jonathan Santilli brings over a decade of experience to the field of cybersecurity. Together they're paving the path at Mobb for AppSec.