Data Loss Prevention – Content Awareness: Human vs Computer Classification

Data Loss Prevention (DLP) is a relatively new area of information security and assurance.  A host of software products, controls and solutions have found their way onto the market to help facilitate DLP, whether the losses in question are malicious or inadvertent.  The market seems fledgling but is maturing as time goes on.  People are only just starting to understand the effects of losing data, most of which is lost by mistake. Around 77% of data loss is "inadvertent" and unintended. Basically, people make mistakes; a much lower percentage of data loss is malicious.  Compliance seems to be a major driver for the implementation of these solutions, and many key security players are positioning DLP as a core element of their ongoing strategy.  The question I have is: at this stage, are we ready to effectively apply AI (Artificial Intelligence) based systems, where the intended objective is for those AI systems to scan, analyse and, more importantly, classify information as sensitive or unimportant?

The DLP market does seem to be a slow starter, with a very small percentage of companies intending to deploy, and a further fraction of that minority actually having a deployed system.  The bulk of these solutions are what Gartner terms "content aware".  They generally monitor network/email traffic and at the same time deploy agents which can scan internal network resources (file shares, etc.) for sensitive data which is available where it shouldn't be.  The idea is that when sensitive information is located, it should be either removed, quarantined, blocked in transit, or authorised to remain in place or be distributed.  The problem is that while it is easy enough to recognise information like credit card numbers, it becomes exponentially more difficult for these systems to understand more qualitative content. Qualitative content (e.g. information expressed in verbose literal wording rather than distinctive formats or patterns) is difficult for an AI system to match against a particular pattern or template in order to classify it effectively.  Examples of this type of information might include a new product idea for an investment bank, a groundbreaking formula for a new medicine in a pharmaceutical company, or perhaps even a World Cup winning team strategy for a national football team.  Information of this nature is usually specific on a company-by-company basis and also a case-by-case basis. One sports team strategy may not look anything like another.
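To illustrate why structured data is the easy case, here is a minimal sketch (in Python, my choice for illustration) of the kind of pattern-plus-checksum rule a content-aware scanner might use for card numbers: a regular expression finds digit runs, and the Luhn checksum weeds out most random digit sequences. No equivalent deterministic test exists for "a new product idea" expressed in free prose.

```python
import re

# Candidate card numbers: 13-16 digits, optionally separated by spaces
# or dashes. A simplified illustration, not a production DLP rule.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")

def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_card_numbers(text: str) -> list[str]:
    """Scan text for digit runs that look like cards and pass Luhn."""
    hits = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits
```

A scanner built this way will reliably flag `4111 1111 1111 1111` in an outbound email, but the same approach has nothing to key on when the sensitive content is a paragraph describing a match-day strategy.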

It is for this reason that the term "False Positive" is becoming widely used in the market, and anyone who has worked with DLP systems (or tried to deploy one) will certainly understand what a False Positive is.  A False Positive is where the system has incorrectly classified an information asset and blocked the normal use or distribution of that information because it believes it needs to be protected.  False positives can become a nightmare in administrative terms and also hinder the day-to-day working of individuals within an organisation.  They create the need for an extensive amount of "tuning" to allow the right balance of security to be applied. The deploying organisation has to decide what an acceptable level of false positives is and trade the restrictions that will be applied off against the new security afforded to the information.  The problem is that this creates a massive amount of work, not just for the administrators, but also for the deployment teams, and in some cases additional work is required of the users.  After spending all of this effort to tune the system to understand what the business is all about, they can then be hit with another round of tuning when the business or organisational model changes. Constant tuning may be required to keep pace with the business: what is hot information today could be of no importance to the business tomorrow.

Introducing the capability for users themselves to classify information is becoming increasingly important. How many times have you heard an organisation say this before: "Our employees are our most valuable asset."? This is a phrase which most organisations like to throw onto a website or into a brochure where appropriate, but when it comes to DLP this couldn't be closer to the truth.  By using the intelligence of your employees, and their natural ability to understand what is sensitive information and what isn't, we can significantly improve our ability to prevent data loss.  Giving users control over their own information, by introducing the capability to ensure that they make an informed judgement on the classification of the information they are working with, allows us to implement appropriate controls on that information.

Boldon James provides an Information Classification product called ICS (SAFEmail ICS works with Microsoft Outlook and SAFEoffice ICS works with Microsoft Office documents).  Using a system like ICS with some simple firewall filtering rules can decrease data loss in a simple way, without introducing the extensive overheads of deploying, tuning and maintaining an AI based system.  This does not, however, have to be an either/or decision. ICS-style user classification can be deployed alongside an AI based DLP system to help it do its job better. By having clear labels in both content and metadata, we can reduce not only the analysis burden on the DLP system but also the number of false positives, further decreasing the cost of deployment and maintenance. It will be interesting to see the AI based DLP market develop, but we are many years off having the power and capability for a system to do a better job than the human brain.
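To show how cheaply a gateway can act on a user-assigned label, here is a minimal Python sketch of such a filtering rule. The header name `X-Classification` and the label values are hypothetical examples for illustration, not the actual SAFEmail ICS label format; the point is that the decision reads a deterministic label rather than analysing the message body, so there is no pattern matching and nothing to tune.

```python
from email.message import EmailMessage

# Hypothetical label vocabulary; real deployments would use the
# organisation's own classification scheme.
BLOCKED_LABELS = {"CONFIDENTIAL", "SECRET"}

def may_leave_network(msg: EmailMessage) -> bool:
    """Allow a message out only if its user-assigned classification
    label is not in the blocked set. Unlabelled mail is treated as
    UNCLASSIFIED here; a stricter policy might block it instead."""
    label = (msg.get("X-Classification") or "UNCLASSIFIED").strip().upper()
    return label not in BLOCKED_LABELS

# Usage: a message the author has labelled Confidential is stopped
# at the gateway without any content inspection.
msg = EmailMessage()
msg["Subject"] = "Q3 product roadmap"
msg["X-Classification"] = "Confidential"
blocked = not may_leave_network(msg)
```

Because the label was applied by the person who understands the content, the rule is a simple lookup; the AI system, if one is deployed alongside, only needs to analyse the residue of unlabelled or ambiguous traffic.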

If you would like to find out more about ICS, please contact me or Boldon James on the link provided.

Boldon James – SAFEmail ICS & SAFEOffice ICS