Source Code Protection with DLP

Chris Roney, June 25, 2024, Data Loss Prevention

Tech enterprises that develop custom applications, either for internal use or for sale, have a precious asset that deserves just as much protection as customers’ personal information or credit card numbers – the source code. For many, the source code is the absolute heart of the business. If your company develops a unique IT product, its source code should be as heavily guarded as the original recipe for the world’s most famous fried chicken. Unfortunately, for many reasons, guarding the source code is much more difficult than guarding a recipe stored in a steel safe with multiple locks.

Why is the source code such a precious commodity?

Many commercial applications have a major competitive advantage because nobody except the authors knows exactly how these applications work. They could use unique algorithms or clever programming tricks, which make them more accurate, faster, or in any way better than the competition. Such secrets are hidden in the source code.

Understanding why the source code is so precious also requires a little understanding of how computer programs are created. In almost all modern architectures, the programmer writes the program in a so-called high-level language such as Java, Python, Ruby, or C++ – this is exactly what is called the source code. These languages are easy to understand for humans and allow for easy collaboration between people. Unfortunately, they are not so easy to understand for the computer’s processors, which need very basic sets of instructions.

Every time the developers create a new version of the source code, it has to be converted into these instructions for the computer. Some languages, such as Python and Ruby, are typically converted while the program is running, using special applications called interpreters. Many commercial applications, however, are built with a special application called a compiler, which takes the source code and converts it ahead of time into the actual application, which can be run on the computer, published online, or downloaded and installed by your customers.

But you may think that if an application is downloaded or published online, anyone should be able to understand how it works and get to the source code, right? It’s not so. Just as you can’t take a bite of fried chicken, analyze it, and come up with all the original spices used to make it (and then copy it), you can’t easily convert the application back into the source code. This is why businesses feel safe making applications publicly available but would not do the same with their source code.

Who would be interested in stealing your source code?

While the obvious culprit behind source code theft attempts would be your sneaky competitor, possibly from a less friendly part of the world where the laws look the other way, many more parties would find your unique algorithms a tasty treat:

Dark web dealers: Probably the biggest threat to your source code comes from professionals who make their sorry living by stealing from others and then either selling the loot to the highest bidder or asking for a ransom to not publicize it. Just like a jewel thief rarely plans to exhibit the trophy in their own home for enjoyment, most source code thieves are not interested in what’s inside at all. They look for an easy victim, get the source code, and either put it up for bidding on the dark web or blackmail the victim for substantial funds.

Wannabe hackers and pseudo-activists: Another group that is often interested not in your algorithms but in the act of theft itself is the so-called “script kiddies”. These are most often troubled young people who try to improve their self-image by participating in illegal hacking communities, where they get the most respect if they manage to break protection and publicly expose their trophies. A similar case may happen with people who believe that your secrets should be known to the public so that everyone can benefit from your research “for the good of the world”. The damage in these cases can be much worse than with professional thieves because your precious information ends up being available for anyone to download.

Nation-state actors: It’s not just your shady competitors from the other side of the world that may be interested in your unique algorithms but also the governments behind them. Their goal may not be taking advantage of your inventions but rather weakening your position and the position of your country or its allies. The theft itself helps them power their propaganda machine with proof of how easy a target their enemies can be.

The three threats above are just examples and there could be others that would benefit either from the stolen content, from its exposure, or from the fact of the theft itself. Last but not least, attackers can quietly use stolen source code to explore vulnerabilities in the software. This makes the source code one of the most targeted jewels in today’s information-driven world.

How can source code be stolen?

Modern apps are developed by teams that include numerous roles, and all of these roles need some kind of access to the source code. That unfortunately means that attackers have multiple potential points of entry to steal your valuable information, and you have to protect every single one of these points. All in all, stealing source code is far easier than stealing a secret formula, and does not require a scheme as intricate as the one depicted in the movie The Spanish Prisoner.

Again, to understand why unprotected source code is very easy to steal, we have to understand how today’s software is made. In almost every business that builds applications, the entire source code of their applications is stored using special software called a revision control system, and the place where the source code is placed is usually simply called the repository. This special software makes it easy to track any changes made by any of the teams involved in building and modifying the source code, revert them if needed, or even create several versions of the software at the same time.

The most popular revision control system today is called Git, and primary repositories can be stored either on a dedicated company server or managed using cloud solutions such as GitHub and GitLab. These repositories are usually very well protected, and the main thing businesses have to worry about is using strong passwords and multi-factor authentication to access them. Unfortunately, every person who works on the software, such as a developer/programmer or a tester providing quality assurance, needs to copy the entire source code to their local computer. That means that your precious code is on tens or even hundreds of laptops, often all around the world!

This fact alone means that source code is very easy to steal if, instead of targeting the well-protected central repository, the attacker targets the weak spot: those developers or testers and their laptops. It’s much easier to fool a person using social engineering, or even to get physical access to such a laptop, than to break through the protection of cloud giants or break into dedicated company servers.

Famous source code theft cases

Source code theft is not just a theory, it’s an everyday reality. While such cases are not as often exposed by popular media as personal information theft, and we don’t hear as much about them as about the Ashley Madison hack, now even documented by Netflix, they happen almost every year and they happen even to the biggest market players. Here are some famous cases and their consequences.

2018 – Apple: An intern who left the company took a copy of the source code for iOS and gave it to a community that specializes in breaking operating system protection mechanisms (“jailbreaking”). Subsequently, the community published part of the source code publicly, which showed how Apple implemented secure boot mechanisms.
2020 – Mercedes: The car giant publicly exposed their entire source code repository by mistakenly publishing a security token (similar to a password) that could be found using Google Search. This security hole was, luckily, exposed by a security engineer (Shubham Mittal of RedHunt Labs), but it’s difficult to say if anyone else managed to access this sensitive data before.
2022 – Microsoft: The company was the victim of an attack from the hacking group Lapsus$, which accessed the code for Bing and Cortana among others. According to Microsoft, the attackers got in through a single compromised account that granted them limited access.
2023 – Riot Games: This game publisher had the code of one of the most famous games stolen: League of Legends. The attackers demanded a ransom and, supposedly, the attack was orchestrated using social engineering.
March 2024 – Microsoft: This time the giant had its source code stolen by one of the most notorious Russian nation-state actors, Midnight Blizzard, who were previously famous for the SolarWinds hack. This attack was orchestrated using password spraying, which means trying a small number of commonly used passwords against many accounts, assuming that at least some of those accounts are protected by a weak password (which is unfortunately true).

The above are just a few cases, but there were many others, for example, Adobe in 2013, Symantec in 2012 and 2019, Microsoft Windows 10 in 2017, Snapchat in 2018, and more. High-profile source code theft may get less media attention than personal information breaches, but it is clearly not a rare event.

The challenges of preventing source code theft with DLP

As we mentioned before, the easiest access points for stealing the source code are the endpoints – primarily laptops that are used by software developers and software testers and which, at almost any time, hold an entire copy of the application’s source code in a completely unprotected local Git repository, which is simply a directory on the local disk. The only type of security software that can help to protect such a widely exposed asset is data loss prevention (DLP) software for endpoints.
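To illustrate how exposed a local copy is, here is a minimal Python sketch that lists every Git working copy under a given folder. The function name `find_git_working_copies` is hypothetical and chosen only for this illustration; the sketch relies on nothing more than the fact that a working copy contains a `.git` entry, and it is not how any particular DLP product discovers repositories.

```python
import os
from pathlib import Path


def find_git_working_copies(root):
    """Return sorted paths of directories under `root` that contain a `.git` entry."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # A regular clone has a `.git` directory; submodules and linked
        # worktrees use a `.git` file instead, so check for both.
        if ".git" in dirnames or ".git" in filenames:
            found.append(Path(dirpath))
        # Skip Git's internal metadata, but keep walking the rest of the
        # tree so that nested repositories are still found.
        if ".git" in dirnames:
            dirnames.remove(".git")
    return sorted(found)


if __name__ == "__main__":
    # Assumption: the current user's home directory is a sensible place to look.
    for repo in find_git_working_copies(Path.home()):
        print(repo)
```

Anyone with access to the laptop, or any malware running as the logged-in user, could run something equally simple, which is why the endpoint deserves as much attention as the central repository.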

DLP software works by allowing the owner of the device to access the protected resource but not allowing this sensitive content to be copied and pasted into other applications such as a browser window, a messenger app, or an email client. But the first challenge is actually identifying the resources that need protection, in this case – the source code.

While it is relatively easy for security software to scan all files on the local disk, discovering source code in those files is not as easy as it may seem. Today’s high-level programming languages are made to be easy for a human to understand, so they use words that sound like natural language. While some of them are full of constructs that are not common in other content, such as semicolons and brackets, many others avoid such syntax and look even more like ordinary text. All in all, DLP software must be able to recognize not just the most common and easy-to-recognize programming languages such as Java, but also more challenging ones such as Python, Ruby, or SQL. This is where Endpoint Protector DLP software shines with its unique N-gram-based text categorization algorithm, which reaches 98% accuracy in identifying programming languages.
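To give a feel for the general idea behind N-gram-based text categorization, here is a minimal Python sketch of the classic rank-based approach usually attributed to Cavnar and Trenkle. It is not Endpoint Protector's algorithm, whose details are not described here. The function names, the tiny `SAMPLES` training snippets, and the `top_k` value are all assumptions made for illustration; a real classifier would be trained on far more text per language.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Rank the most frequent character n-grams (lengths 1..max_n) in `text`."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Map each n-gram to its rank; rank 0 is the most frequent.
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top_k))}


def out_of_place(doc_profile, category_profile, max_penalty):
    """Sum of rank differences; n-grams unknown to the category cost max_penalty."""
    total = 0
    for gram, rank in doc_profile.items():
        if gram in category_profile:
            total += abs(rank - category_profile[gram])
        else:
            total += max_penalty
    return total


def classify(text, profiles, top_k=300):
    """Return the label whose profile is closest to the profile of `text`."""
    doc = ngram_profile(text, top_k=top_k)
    return min(profiles, key=lambda label: out_of_place(doc, profiles[label], top_k))


# Hypothetical, deliberately tiny training samples (illustration only).
SAMPLES = {
    "python": "def add(a, b):\n    return a + b\n\nfor item in items:\n    print(item)\n",
    "java": "public class Main {\n    public static void main(String[] args) {\n"
            "        System.out.println(1);\n    }\n}\n",
    "english": "The quarterly report is attached. Please review it before the meeting on Monday.\n",
}
PROFILES = {label: ngram_profile(text) for label, text in SAMPLES.items()}

if __name__ == "__main__":
    print(classify("def greet(name):\n    print(name)\n", PROFILES))
```

Each category is reduced to a ranked list of its most frequent short character sequences, and a file is assigned to the category whose ranking it disturbs the least. Because the method looks at character patterns rather than keywords, it can separate code from prose even when the code reads almost like natural language.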

The importance of efficiency when protecting source code

Accuracy in identifying source code, for which Endpoint Protector is already well known, is just one reason why this software is especially well suited to securing source code. However, there is no way for DLP software to distinguish between proprietary source code and other source code on the local disk, such as open-source libraries or operating system source code (e.g. in macOS systems). Therefore, efficiency in processing is very important due to the potential need to cover more than just the most sensitive source code fragments.

Last but not least, the source code is not a commodity that is downloaded just for a moment and then never used again, as is the case with many types of personal information. The source code is on a developer’s disk permanently and it’s in a permanent state of flux. Every day, the developer modifies the files, copies their fragments, creates new files, opens them in the development environment, and synchronizes everything with the cloud/server repository, potentially downloading a lot of new information created by other developers. Therefore, any type of DLP activity must also be very efficient so that it does not slow down any of these intense access processes.

In terms of efficient prevention of source code theft from the endpoint, Endpoint Protector ticks all the boxes. And, as a bonus, we can admit that yes, we protect the Endpoint Protector source code using our own DLP, too!