Introduction

The Fixes feature allows users to remediate vulnerabilities by leveraging GenAI. When requesting a fix, users receive either a detailed step-by-step guide or the code modifications that should close the vulnerability.
Caution
As with all GenAI-generated answers, review the accuracy of the proposed solution before committing it to the codebase.

Public Oath

Fluid Attacks offers a GenAI-assisted vulnerability remediation service. No sensitive or customer-specific information will be used or stored by a third party; customer code will not be used to train an LLM, and any results will be viewable only by the customer.

Architecture

Fixes architecture
  1. Fix request: Automatic fixes can be requested from Retrieves via the 'Get Custom Fix' and 'Apply Suggested Fix' functionalities, or from Views via the 'How to fix' option in the vulnerability modal.

  2. Subscriptions: These functionalities each rely on one of two GraphQL subscriptions, getCustomFix or getSuggestedFix, passing the required parameters accordingly (a client-side sketch follows this list).

  3. Validation and prompt construction: After validating the provided inputs, the backend (Integrates) gathers the vulnerability context, which includes the URL of the S3 object where the vulnerable file is located.

  4. Fixes API: Integrates sends a request to the Fixes WebSocket API Gateway, authenticating with a pre-signed URL and transmitting the previously obtained vulnerability context.

  5. Fixes Lambda: Through the API, the Fixes Lambda is instructed to:

    • Retrieve the vulnerable code from S3.

    • Analyze the code.

    • Extract the vulnerable snippet.

    • Generate a prompt with instructions for the AI model, either to produce a remediation guide or to directly remediate the vulnerable code snippet.

  6. Sending the prompt to the LLM: From the Lambda, the prompt is sent via the Boto client to a large language model (LLM) hosted on Amazon Bedrock, using an inference profile (see the Lambda-side sketch after this list).

  7. LLM response: The LLM processes the input and generates a response.
    Since the complete output may take several seconds, it is returned as a streamed response over WebSockets, where the response is progressively delivered in chunks.
    This technique improves the user experience by enabling partial results to be displayed as they are generated.

  8. Transmission to the final client: Integrates relays the streamed response to the Retrieves or Views client through the initial GraphQL subscription.

  9. Displaying the result: The response is shown to the user either:

    • as a Markdown-formatted remediation guide, or

    • as structured text containing the remediated code snippet with placeholders to replace the vulnerable code.
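
To illustrate step 2, here is a minimal client-side sketch of consuming one of these subscriptions from Python with the gql library. The endpoint URL, authorization header, argument names, and the assumption that the subscription resolves to a plain string of streamed chunks are illustrative placeholders, not the actual Integrates schema.

```python
import asyncio

from gql import Client, gql
from gql.transport.websockets import WebsocketsTransport

# Hypothetical selection set and arguments; only the field name comes from these docs.
GET_CUSTOM_FIX = gql(
    """
    subscription GetCustomFix($groupName: String!, $vulnerabilityId: ID!) {
      getCustomFix(groupName: $groupName, vulnerabilityId: $vulnerabilityId)
    }
    """
)


async def stream_custom_fix(group_name: str, vulnerability_id: str) -> None:
    transport = WebsocketsTransport(
        url="wss://app.example.com/api",  # hypothetical GraphQL endpoint
        headers={"Authorization": "Bearer <token>"},  # hypothetical auth header
    )
    async with Client(transport=transport) as session:
        # Each yielded result carries the next chunk of the streamed fix.
        async for result in session.subscribe(
            GET_CUSTOM_FIX,
            variable_values={
                "groupName": group_name,
                "vulnerabilityId": vulnerability_id,
            },
        ):
            print(result["getCustomFix"], end="", flush=True)


asyncio.run(stream_custom_fix("example-group", "example-vuln-id"))
```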

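The Lambda-side flow (steps 5 through 8) could look roughly like the sketch below, built on boto3. The bucket, object key, inference-profile ID, WebSocket endpoint, and prompt are placeholders; the real Lambda also extracts the vulnerable snippet and switches between the guide and code-fix prompts, which is omitted here.

```python
import boto3

s3 = boto3.client("s3")
bedrock = boto3.client("bedrock-runtime")
# Client used to push chunks back through the Fixes WebSocket API Gateway.
gateway = boto3.client(
    "apigatewaymanagementapi",
    endpoint_url="https://example.execute-api.us-east-1.amazonaws.com/prod",  # hypothetical
)


def handle_fix_request(connection_id: str, bucket: str, key: str, where: str) -> None:
    # Retrieve the vulnerable file from S3.
    vulnerable_code = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode()

    # Build the prompt (the real prompt engineering is omitted).
    prompt = (
        "You are a secure-coding assistant. Propose a fix for the vulnerable "
        f"code reported at {where}:\n\n{vulnerable_code}"
    )

    # Stream the completion from a Bedrock-hosted LLM through an inference profile.
    response = bedrock.converse_stream(
        modelId="us.anthropic.claude-3-5-sonnet-20240620-v1:0",  # hypothetical profile
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )

    # Relay each text chunk to the caller over the WebSocket connection.
    for event in response["stream"]:
        delta = event.get("contentBlockDelta", {}).get("delta", {}).get("text")
        if delta:
            gateway.post_to_connection(ConnectionId=connection_id, Data=delta.encode())
```
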
Data security and privacy

As this service requires sending user code to a third-party GenAI model, measures must be taken to ensure the safety of the whole process:

Amazon Bedrock

AWS infrastructure hosts the LLMs used by this service.

Amazon Bedrock doesn’t store or log prompts and completions, nor does it use them to train any AWS models or distribute them to third parties. See the Bedrock data protection guide.

Data, both at rest and in transit, is also encrypted. See the data encryption guide.

As an additional precaution, this service has been disabled for vulnerabilities related to leaked secrets in code.

Evaluation

Fixes includes two LLM-as-a-Judge evaluators within its testing pipeline.

An LLM-as-a-Judge evaluator is a workflow that uses an AI model to measure the quality of the responses generated by another AI-powered system. In the case of Fixes, every time a commit is pushed, the evaluator runs the getCustomFix and getSuggestedFix functionalities multiple times, using different inputs taken from a collection of test cases that simulate real scenarios encountered in production.

Each output generated by Fixes is evaluated individually by a language model, which determines whether the response meets the quality criteria. To do this, the evaluator relies on a set of rubrics that define what a valid response must satisfy. Based on these rules, the LLM assigns a score of 0 or 1 for each execution, depending on whether the response meets the established quality standards.

Once all test cases have been processed, the evaluator computes the average of all the scores (zeros and ones). This average becomes the final score of the evaluation.
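
A stripped-down sketch of this loop is shown below, assuming a hypothetical run_fixes callable that exercises getCustomFix or getSuggestedFix for a given test case, a one-sentence rubric, and a Bedrock-hosted judge model; the real rubrics, test cases, and judge configuration are internal details.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

RUBRIC = (
    "Score 1 if the proposed remediation is correct, addresses the reported "
    "vulnerability, and preserves the original behavior; otherwise score 0. "
    "Answer with a single digit."
)


def judge(fix_output: str, test_case: dict) -> int:
    prompt = (
        f"{RUBRIC}\n\nVulnerability: {test_case['description']}\n\n"
        f"Proposed fix:\n{fix_output}"
    )
    response = bedrock.converse(
        modelId="us.anthropic.claude-3-5-sonnet-20240620-v1:0",  # hypothetical judge model
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    answer = response["output"]["message"]["content"][0]["text"].strip()
    return 1 if answer.startswith("1") else 0


def evaluate(test_cases: list[dict], run_fixes) -> float:
    # One 0/1 score per test case; the mean is the final score of the evaluation.
    scores = [judge(run_fixes(case), case) for case in test_cases]
    return sum(scores) / len(scores)
```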

It is important to note that because both Fixes and the evaluators are powered by LLMs, the results are not fully deterministic. Two identical executions do not guarantee the same score. For this reason, although the goal is to set the threshold as close to 1 as possible, achieving a perfectly consistent score of 1 is difficult in practice.

Quality Control

These evaluations run as jobs within the CI pipeline. Each job is configured with a threshold between 0 and 1. If the final score of any evaluator does not exceed this threshold, the job will fail and prevent the changes from moving to production.

This ensures that every deployment of Fixes maintains or improves the quality of the generated responses. If a degradation in quality is detected, the deployment is blocked and cannot proceed to production.
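
The gate itself can be as simple as the following sketch; the threshold value and the exact comparison are per-job configuration, shown here only to make the rule concrete.

```python
import sys

THRESHOLD = 0.9  # hypothetical value; each CI job configures its own


def gate(final_score: float) -> None:
    # The job fails unless the evaluator's final score exceeds the threshold.
    if final_score <= THRESHOLD:
        print(f"Score {final_score:.2f} does not exceed {THRESHOLD}; blocking deployment")
        sys.exit(1)
    print(f"Score {final_score:.2f} exceeds {THRESHOLD}; deployment may proceed")
```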

Historical Records

To manage these evaluations, we rely on LangSmith, which allows us to store and review the full history of all executions. You can browse this history directly at smith.langchain.com.
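
For example, past evaluation runs can also be retrieved programmatically with the LangSmith SDK; the project name below is hypothetical.

```python
from langsmith import Client

client = Client()  # reads the LangSmith API key from the environment

# List the most recent evaluation runs in a hypothetical project.
for run in client.list_runs(project_name="fixes-evaluations", limit=10):
    print(run.start_time, run.name, run.id)
```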

To Do

  1. Use Amazon Bedrock Guardrails to sanitize code snippets and remove sensitive information before feeding the prompt to the LLM.
  2. Instead of getting the context from criteria and adding it to the prompt, use RAG to give the model a knowledge base to consult, improve the quality of the results, and simplify the prompt.
  3. Consider using a provisioned, open-source LLM on transparency grounds.
Tip
Have an idea to simplify our architecture or noticed docs that could use some love? Don't hesitate to open an issue or submit improvements.