
Illustrated by Jeff Prymowicz
It's no secret that Claude and ChatGPT are being used to write code. An entire style of coding, called "vibe coding," exists in which the user has the AI generate code that can then be used elsewhere. Whether that code is run as-is or used as a starting point depends on the project, user, and use case. But there's no doubt that this trend will continue for the foreseeable future.
At Cloud Security Partners, not only do we discuss the use and risks of such practices, we have been outright asked by clients about improving the code quality. The challenge posed is that the clients want to produce better-quality code when prompting the LLM. Their development cycles are moving faster and faster and they want to limit potential bottlenecks introduced by fixing issues in the generated code.
While we certainly hope that security reviews are performed against any and all generated code, vibe-coded or otherwise, we’ll outline our findings and recommendations below to get your tools to produce better quality code.
Starting Points
To begin, we want to generate code without explicit instructions concerning secure coding practices. Using this code, we will determine which areas are lacking and can be improved with the refinements made later in this chapter. We will operate under the assumption that a user is asking for code to perform some action rather than calling out specific security risks in the prompt. This assumption works for two reasons. First, we want a benchmark to compare against later to show improvements in the generation process. Second, it is more realistic; a user will likely ask for what they want, not for what they don't want (e.g., security flaws).
We tested using OpenAI’s Codex and GPT-OSS-120b. Results can change from model to model and even within the same model. Parameters such as temperature affect the randomness of the output when given the same input. Altering the input prompt in some minor way may also affect how the output is generated. As a result, testing methodologies are extremely difficult to nail down compared to traditional software testing. But, for the purposes of this analysis, this is what we have to work with.
We started by generating several web applications in different languages and frameworks. These were:
- PHP + Laravel
- Python + Flask
- Python + Django
- Ruby on Rails
- C# + .NET Core MVC
- Java + Spring Boot
- JavaScript + ExpressJS
For the purposes of later remediation, we will focus on the Python and Flask application.
We created an AI Skill for Codex to generate sample code. At first, no mentions of security techniques, security libraries, or security-related desires were provided - don’t worry, we’ll add those later. The user explicitly called the skill and requested a sample e-commerce application in the framework of choice. This app would need to handle multiple users, handle “orders”, and store various products. The users need to have usernames and passwords, and must be able to log in or out.
Once the application was generated, Semgrep ran against the generated code and we performed a quick manual analysis of the code. With this step, we are trying to determine a baseline of the security posture of the generated code without specific prompting or references provided.
Starting Results
All applications were generated but, generally speaking, additional work was required. For example, the Django generation produced the custom application code, but the core Django project files were missing and some of the templates were not generated correctly.
But we're not here to determine how pretty or easy the application was to spin up. We want to know the security posture of the applications generated. While nothing as severe as SQL injection arose, there were other issues.
- Missing CSRF protection was common. Multiple generations either lacked CSRF protections on specific endpoints or did not have them at all.
- The Rails project contained an IDOR vulnerability, allowing users to view other users’ carts and orders via unscoped .find() calls.
- The ExpressJS application contained a list of low-risk findings like weak cookie flags and a hardcoded secret.
- The Flask application used an environment variable for the application secret, but defaulted to an extremely weak and hard-coded secret if that environment variable was missing.
While not particularly damaging, these results are in line with our internal experience. Coding patterns that prevent entire vulnerability classes tend to appear in LLM output (e.g., parameterized SQL queries or utilizing ORMs). However, context-specific adjustments or framework configurations may be lacking. For example, CSRF protection is not always applied correctly to every endpoint. Or, as another example, utilizing Rails's .find() without scoping can be fine for products, but applying that same logic to orders or shopping carts is an issue, as users can view other users' data.
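To make the scoping issue concrete, here is a minimal, framework-free Python sketch of the same pattern; the data store and function names are our own illustration, not code from the generated applications:

```python
# Hypothetical in-memory "orders" store standing in for the database.
ORDERS = {
    1: {"owner_id": 10, "items": ["book"]},
    2: {"owner_id": 20, "items": ["pen"]},
}

def find_order_unscoped(order_id):
    # Vulnerable pattern: trusts the id alone, so any authenticated
    # user can read any other user's order (the IDOR).
    return ORDERS[order_id]

def find_order_for_user(order_id, current_user_id):
    # Safe pattern: every lookup is scoped to the requesting user.
    order = ORDERS.get(order_id)
    if order is None or order["owner_id"] != current_user_id:
        raise LookupError("order not found")
    return order
```

Note that the scoped version treats "not yours" the same as "does not exist," which also avoids leaking which order ids are valid.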
There is room for improvement here.
Prompt Engineering
The only option available to improve the generated quality is to better refine our prompts. We must be able to either specify what we want from the LLM or to give better examples for the LLM to follow. We will refer to this as prompt engineering though the inclusion of full documentation may stretch that definition.
The idea of providing better instructions sounds easy but it is far from it. LLMs do better as the requirements become clearer and the input remains short and concise. “Respond with JSON in the following format.” “Does the following HTML sample for the X template engine properly escape all output data?” Giving clear goals with small, concise requests tends to work well.
As we become more generic, our outputs become worse. "Does the following code look malicious?" This is highly contextual. An explicit webshell is easier to flag than a threat actor quietly removing permission checks; the latter can look like a natural change to the application's permissions model, muddying the waters a bit. As another example, would changing which image file a page loads be malicious? Website defacement would fit that bill. We will get better outputs if we provide a request such as "does the following code sample allow remote code execution in my web application?" because the expectation is clearer.
It may be tempting to just instruct the LLM to provide "secure code," but this falls into the "generic" category of fixes. We need to be more precise about what we want. One technique we can use is One-Shot or Few-Shot Prompting, where we provide explicit examples of the code that we expect to see. For example, we might want to tell the LLM exactly how to embed a CSRF token in a Flask application.
Generate a CSRF token for all POST requests. The following is how a CSRF token is embedded into the form using CSRFProtect from Flask-WTF.
<form method="post">
{{ form.csrf_token }}
</form>
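For context, the template snippet above assumes CSRF protection has been enabled on the server side; with Flask-WTF, that setup looks roughly like the following configuration sketch (not code taken from the generated applications):

```python
from flask import Flask
from flask_wtf import CSRFProtect

app = Flask(__name__)
# The secret key signs the CSRF tokens; load it from the environment
# in practice, never a hard-coded string like this placeholder.
app.config["SECRET_KEY"] = "load-from-environment-in-practice"
CSRFProtect(app)  # POST requests without a valid csrf_token are rejected
```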
Unfortunately, this is language and framework specific; it will be harder to get good results from generic requests. However, skills for Claude Desktop and OpenAI's Codex make this easier. Skills can include reference files containing specifics for your language and/or framework of choice. An excellent example comes from OpenAI's sample skills repository. Their Security Best Practices skill includes reference files for several sample frameworks: Flask, Go (generic), ExpressJS, Django, and more. The skill instructs the LLM to use the appropriate reference.
Before moving on to the improved results with better prompting, we would like to share a word of warning. LLM outputs tend to degrade as the input and output grow. While not focused on inputs, multiple papers from organizations like Google and Anthropic have noted that even heavy reasoning can degrade the quality of the generated output; the reasoning generates additional output tokens that are then fed back in as additional input. In our experience, letting the LLM generate more output tends to let it drift from the intended results. While the LLMs haven't "gone rogue" and created unrelated code, they may skip steps or stop partway through the intended workflow. All of this concludes with a simple warning: though it may be tempting to throw tons of examples and documentation at the LLM, try to be concise and brief with your examples. Excessive input may "distract" the LLM and degrade the output quality.
Better Prompts and Skills
We generated the earlier applications utilizing an AI Skill we made ourselves. The format of a skill can include scripts and references that the LLM can use if prompted correctly. While we could add scripts, we will be focusing on the references.
- {skill directory}
  - SKILL.md
  - references
    - python-flask-documentation.md
    - python-django-documentation.md
    - (...)
These reference files are also Markdown, just like the SKILL.md file. We can add sections for the types of vulnerabilities to fix or the behaviors that are desired. A short, concise statement about what to do, followed by an example, is a great starting point. For example:
### Unvalidated Redirects
Do not pass request parameters to a redirect() call. All redirects must be hard-coded, static targets and not user defined. An example is below.
return redirect(url_for("main.index"))
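If some redirects genuinely must vary at runtime, a reference section can also show a constrained alternative. The allowlist approach below is our own illustrative sketch, not text from the skill:

```python
# Map user-supplied "next" values onto a fixed set of internal
# endpoint names, falling back to a safe default rather than
# redirecting to an arbitrary, user-controlled target.
ALLOWED_TARGETS = {"index": "main.index", "profile": "main.profile"}

def safe_redirect_target(requested: str) -> str:
    # The return value is an endpoint name for url_for(), never a raw URL.
    return ALLOWED_TARGETS.get(requested, "main.index")
```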
We will create several of these sections for the vulnerabilities that we care about. For the purposes of this experiment, we will focus just on the Python and Flask implementation. This section will depend heavily on the vulnerabilities that we want to correct and may require adjustments before the LLM regularly makes use of them. Based on our generations and observations, we added sections for the following vulnerabilities:
- Unvalidated Redirects
- Cookie Settings
- HttpOnly, SameSite, Secure
- CSRF
- Secret Values in Source Code
These were the sections that regularly needed improvement.
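As an illustration of what a cookie-settings section can contain, Flask's session-cookie flags can be expressed as a small configuration block. The values below are a sketch of sensible defaults, not the exact reference text:

```python
# Flask configuration keys controlling the session cookie's flags.
SECURE_COOKIE_CONFIG = {
    "SESSION_COOKIE_HTTPONLY": True,   # not readable from JavaScript
    "SESSION_COOKIE_SECURE": True,     # only sent over HTTPS
    "SESSION_COOKIE_SAMESITE": "Lax",  # limits cross-site submission
}

# In an application factory: app.config.update(SECURE_COOKIE_CONFIG)
```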
As an additional measure, we added an instruction to the skill to analyze the generated code. This instruction is intentionally vague: a simple step along the lines of "analyze the code for security issues file-by-file and patch any identified issues." While we stated earlier that vague instructions are poor practice, this is an attempt to replicate what others may be tempted to do.
Improved Results
While not perfect, improvements were observed. The included reference instructed the LLM not to create default secret strings but to raise an exception when the environment variable is missing, preventing the application from starting. While disruptive, this keeps secrets out of source control and prevents launching the application with an easily guessable secret key. The LLM followed the instructions and generated code that only used environment variables.
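The resulting pattern looks roughly like this fail-closed sketch (the helper name is our own illustration):

```python
import os

def load_secret_key() -> str:
    # Fail closed: refuse to start rather than fall back to a
    # hard-coded, guessable default secret.
    key = os.environ.get("SECRET_KEY")
    if not key:
        raise RuntimeError("SECRET_KEY environment variable is required")
    return key

# In the app factory: app.secret_key = load_secret_key()
```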
Semgrep still reported CSRF findings, but these were false positives. The earlier-mentioned Flask-WTF CSRF protection provides several means of inserting CSRF tokens into forms, and Semgrep flagged instances where a CSRF token was indeed being inserted.
A new issue arose while generating the source code: user passwords were set without validation. This does not mean that login checks were broken; it means that no password policy was enforced when a password was chosen. While we can and do add this to our skill, it is a good example of why skill references should be iterative. One round of generated code may differ from another, and a single generated application is not sufficient to test any further prompts against.
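A reference section addressing this gap can be as simple as the following check; the exact thresholds here are illustrative, not a recommendation from the skill itself:

```python
import re

def validate_password(password: str) -> bool:
    """Reject passwords that fail a minimal policy: a length floor
    plus upper-case, lower-case, and digit character classes."""
    return (
        len(password) >= 12
        and re.search(r"[A-Z]", password) is not None
        and re.search(r"[a-z]", password) is not None
        and re.search(r"\d", password) is not None
    )
```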
While we could call this a rousing success, there are still issues to be transparent about here. First, as discussed in the prior paragraph, new and undocumented issues may occur in future generations. This will not be a set-it-and-forget-it solution. As models update and as more code is observed, these skills and references may need to be adjusted and added to.
Second, the generations were not foolproof. Even when instructing the LLM to use the reference, things do not always go according to plan. For example, the reference file was not always consulted; short of explicitly copying and pasting its contents into the skill, whether the reference file was used was hit or miss. Better results came from instructing the agent to explicitly look in the skill's references directory, so keep that in mind for your own experiments. As another example, some generations did not apply CSRF protections adequately: despite being instructed to use the example code for all HTML forms submitting POST requests, the LLM did not always do so.
We included a step in the skill to analyze the security of the generated code. In our attempts to generate secure code, this step appeared to provide no real value. An unvalidated redirect was still included in the generated code with this measure in place. Cookie settings were not added. Weak, guessable, default application secrets were generated and went undetected. CSRF protections were not added adequately. Ultimately, including the examples in the framework-specific reference was far more valuable than a vague step instructing the LLM to review its own work. Giving clear and concise examples for specific problems was far more effective than telling the LLM to simply do a better job.
Conclusion
Ultimately, if you are trying to generate secure code, try to create prompts or references that are as technically accurate as possible. Include examples, explicitly name frameworks and libraries, and leave as little room for interpretation as possible. If you have multiple frameworks and languages in place, create framework-specific prompts and/or references to provide accurate information without distracting the LLM with unnecessary information.
Even with explicit instructions, the outputs will not be foolproof. Iterate on your prompts and still perform downstream analysis of the generated code, whether human review or pipeline tooling. Do not be tempted to rely on a vague "review the code for security issues" instruction; in our testing, it did little beyond wasting input and output tokens at inference time.
Thank you for reading. As always, if you have any questions or concerns about your code or infrastructure, feel free to reach out to us at Cloud Security Partners. If you have concerns about your code generation and AI capabilities, we can help out there as well.
About the Author
Sean Lyford is a Senior Security Consultant with Cloud Security Partners. He has over 11 years of experience within the information security and development fields. Sean focuses on application and cloud security practices.
Sean has a career of both application security consulting and software engineering. As a software engineer, Sean has experience with high-level web applications, AI/ML integrations, and network application development. With his experience as a software engineer, Sean is able to effectively communicate with development teams and provide remediation guidance and prioritization.
In his free time, Sean spoils good walks (i.e., plays golf) and enjoys video games.
