State-of-the-art LLMs like Claude 3.5 Sonnet and GPT-4o have greatly improved the performance of generative AI applications, including AI code assistants. However, these LLMs are trained on vast amounts of data collected from all corners of the internet, including code that may have restrictions on how it can be used, which introduces the risk of IP infringement. Since copyright law around the use of AI-generated content is still unsettled, engineering teams at enterprises want to strike a balance: leveraging the performance gains that come from these powerful models while minimizing the likelihood of copyleft-licensed code getting into their codebases.
To support these goals, Tabnine is thrilled to announce Provenance and Attribution, a new feature that can drastically reduce the risk of IP infringement when using models like Anthropic’s Claude, OpenAI’s GPT-4o, and Cohere’s Command R+ for software development. Tabnine now checks the code generated within our AI chat against the publicly visible code on GitHub, flags any matches it finds, and references the source repository and its license type. This critical information makes it easier for you to review code suggestions and decide if they meet your specific requirements and policies.
It is well known that popular models like Anthropic’s Claude 3.5 Sonnet, OpenAI’s GPT-4o, and Cohere’s Command R+ are trained on a large corpus of data. This includes publicly available information like content from websites, text on internet forums, and information in code repositories. Some of these repos contain permissively licensed code (for example, code under BSD and MIT licenses), while many other repos contain code that has restrictions on how it can be used (for example, code under copyleft licenses like GPL).
A model’s training dataset is crucial in shaping its output, as LLMs inherently tend to replicate patterns from their training data. This means that third-party models like Claude 3.5 Sonnet and GPT-4o can potentially regenerate code that exists in their training dataset, including code with copyleft licensing. If you inadvertently accept such a suggestion, you introduce nonpermissive code into your codebase, exposing you to IP infringement.
In the past, Tabnine addressed this by offering a license-compliant model, Tabnine Protected 2, an LLM purpose-built for software development and trained exclusively on permissively licensed code. This model remains a critical offering, as many companies believe that even using an LLM trained on unlicensed software may introduce risk. With Tabnine’s new Provenance and Attribution capability, we can now also support legal and compliance teams that are comfortable using a wider variety of models, as long as those models don’t inject unlicensed code into the codebase.
Models trained on larger pools of data beyond permissively licensed open source code can offer superior performance (particularly in languages and frameworks where little freely usable code exists), but they run the risk of introducing nonpermissive code into your codebase. Tabnine’s Provenance and Attribution capability eliminates this tradeoff and offers the ability to choose between two levels of protection: you can continue to use Tabnine Protected 2, which is trained only on permissively licensed code, or you can use any supported third-party model and rely on Provenance and Attribution to flag generated code that matches publicly visible open source code.
The Provenance and Attribution capability supports the full breadth of software development activities inside of Tabnine, including generating code, fixing code, generating test cases, implementing Jira issues, and more. After you invoke an AI agent or submit your prompt in AI chat, Tabnine checks the generated code against all the publicly visible code on GitHub.
If Tabnine finds a match, it notifies you within the chat interface and lists the source repo. Additionally, Tabnine informs you about the license type for the code in the repo (e.g., GPL, BSD, MIT). Because Tabnine evaluates code the way a developer would rather than by simple text comparison, it flags not only output that exactly matches open source code but also functional or implementation matches, such as cases where the overall logic is the same but the variable names have changed. Tabnine also keeps a log so that Tabnine administrators can track matches to open source code across the organization.
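To make the idea of a functional match concrete, here is a minimal, hypothetical illustration (the snippets and names below are ours and don’t come from any particular repository): the second function reproduces the logic of the first with only the names changed, which is exactly the kind of near-duplicate a plain text diff would miss.

```python
# Hypothetical example for illustration only; not taken from any real repository.

# Imagine this function lives in a copyleft-licensed open source repo.
def running_average(values):
    total = 0.0
    for value in values:
        total += value
    return total / len(values) if values else 0.0

# A generated suggestion with the same structure and logic, only renamed.
# An exact text match would miss this, but a functional comparison flags it.
def mean(nums):
    acc = 0.0
    for n in nums:
        acc += n
    return acc / len(nums) if nums else 0.0
```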
This feature makes it easy for you to review the output from Tabnine and determine if it complies with the requirements of your specific project, use case, or team.
In the near future, we plan to add a capability that also allows you to identify specific repos for Tabnine to check generated code against. For example, you can specify repos maintained by competitors and thus ensure that your competitors’ code is not introduced into your codebase. In addition, we plan to add a censorship capability that allows Tabnine administrators to remove any matching code before it’s displayed to the developer. For example, a Tabnine administrator could censor any code that matches GPL-licensed repositories.
Provenance and Attribution is currently in private preview and is available to all Tabnine Enterprise customers. Existing Tabnine Enterprise customers should contact our Support team to request access. Once enabled, the Provenance and Attribution capability works with all available models: Anthropic, OpenAI, Cohere, Llama, Mistral, and Tabnine.
Not yet a customer? If you’d like to learn more about how Tabnine can help you accelerate software development and ship better, more compliant code, we encourage you to reach out to us.
To learn more about the Provenance and Attribution capability, please check out the Docs or attend our Tabnine Live session on January 9 at 8 am PT.