How automatic data analysis simplifies lawyers' work

There are entire industries, and not small ones, that are known only to specialists. The e-discovery market falls into this category. It is worth billions of dollars and continues to grow. Dozens of IT corporations compete for a slice of this pie, including giants such as Symantec, HP and IBM. Yet how much does the average person know about it?

Meanwhile, many companies could hardly function normally without modern e-discovery software. Take, for example, Transatlantic Reinsurance Company, one of the largest U.S. reinsurers.

Reinsurance companies take on part of the risk that other insurers cannot bear alone. This business is tightly regulated by the state. In practice, that means a huge burden on the reinsurer's lawyers, who time and again have to comb the archives to answer requests from regulators or from the company's own legal counsel and management.

The archives in this case are not cupboards full of dusty folders. They consist of electronic documents, and that, oddly enough, only complicates the work. The point is that electronic documents multiply much faster than paper ones. Large companies often have to retain all email between employees, along with the various databases related to their operations. As a result, handling a single incoming request can mean sorting through hundreds of thousands of documents. And since accuracy matters above all in such cases, a fair amount of the work is done by hand.

It is this time-consuming process that is called e-discovery. Finding the right documents usually takes several stages. First comes an initial selection by keyword, which machines handle easily. The selected documents are then filtered again using various software tools. Finally, every document that the machine has not filtered out is reviewed by people.
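The first, automated stage can be pictured with a minimal Python sketch. The function name `select_by_keywords`, the sample messages and the keyword list are all hypothetical, and a real product would query an index rather than scan every document in turn.

```python
import re


def select_by_keywords(documents, keywords):
    """Return the ids of documents that contain at least one keyword.

    Matching is case-insensitive and on whole words only.
    """
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return [doc_id for doc_id, text in documents.items() if pattern.search(text)]


# Hypothetical sample archive: document id -> text.
documents = {
    "mail-001": "Please send the reinsurance treaty to the regulator.",
    "mail-002": "Lunch on Friday?",
    "mail-003": "The TREATY renewal is attached.",
}

selected = select_by_keywords(documents, ["treaty", "claim"])
```

Everything in `selected` would still go on to the later filtering stages and, in the end, to human review.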

At Transatlantic Reinsurance Company, the initial selection and filtering of documents was handled by deputy general counsel Edward Kelley. A keyword search usually took one to three working days (if the files were stored on tape, that grew to at least a week). The legal department then spent several weeks processing the documents by hand. Information discovered at this stage often forced the team to expand the keyword list, which pushed completion back by another few days.

All this would be fine if the workload of e-discovery stayed at a stable level. Alas, it does not: over the past twenty years, the number of documents that lawyers at large companies have to sift through has grown by several orders of magnitude. In the early nineties, a request involving even several thousand documents was an exceptional event; now millions are not uncommon. In the near future, audio and video recordings will be added to text documents. They are hard to filter automatically, which certainly does not make matters simpler.

Yet even plain text is not so simple. Classic e-discovery software cannot even rank documents by relevance, as internet search engines do. According to statistics, 85% of the documents it finds are not needed, but to identify exactly which ones, lawyers have to look through all of them and pass judgment on each.

Reviewing documents by hand is the longest and most expensive part of e-discovery. According to research by the RAND Corporation, the average cost of reviewing one gigabyte of documents is $18,000, and a more thorough analysis raises the price considerably. In one of the cases RAND examined, processing each gigabyte cost $358,000.

The high-profile litigation between Samsung and Apple illustrates how expensive e-discovery can be. When Samsung was asked to produce the electronic correspondence of its management, the Korean company's lawyers said that processing such a request would cost millions of dollars. The request had to be fulfilled anyway, and it cost Samsung even more: the documents that were found helped convince the court that Apple was right.

This was not the first case whose outcome depended largely on e-discovery. The main evidence supporting the antitrust charges against Microsoft was email written by Bill Gates and other company executives. The investigation into the bankrupt energy company Enron also rested on a study of its managers' correspondence.

Given how much money is circulating in the global e-discovery market, the interest IT companies show in it is hardly surprising. The rapid development of machine learning, which makes automatic document classification possible, has only spurred competition. Gartner analysts predict that by 2017 this market will be twice its 2012 size.

In Gartner's diagram, the horizontal axis shows how innovative a solution is, and the vertical axis shows the vendor's ability to implement it. The most innovative and most feasible solutions are concentrated in the upper right corner.

Experts were looking for ways to reduce the burden on lawyers even before the phrase “machine learning” came into vogue. In 2008, Symantec Research Center published a paper titled “Reducing the costs of e-discovery by filtering included emails.” The method it proposed reduced the number of emails to be reviewed by 20% and at the same time grouped them by meaning for quick and easy viewing.

The method builds on a well-known fact: when replying to an email, it is customary to quote the original text. As a result, the same piece of text can end up in dozens of messages. If the correspondence reaches a lawyer, he will have to look through all of them, even if the keyword appeared only in the first message and the rest merely quoted it.

Getting rid of quotes without losing anything important is not so easy. The conversation threading used by email clients is not reliable enough. The authors proposed identifying quotes by comparing the text itself. This brings difficulties of its own: not every matching piece of text is a quote. For example, a paragraph consisting of the single word “yes” can occur in messages that have nothing to do with each other.

The algorithm developed at Symantec Research Lab uses a probabilistic data structure called a “Bloom filter” to check which paragraphs a text contains and then compares messages with each other. A Bloom filter guarantees no false negatives: if an element is in the set, the filter always confirms it. The reverse, however, is not true: it may occasionally report an element that was never added. This property is very important for the application. Along the way, the algorithm filters out duplicated signatures, greetings and other verbal clutter.
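The property that matters here is that a Bloom filter never misses an element that was added, though it may occasionally report one that was not. A small Python sketch can show this. It is an illustration under stated assumptions, not the algorithm from the Symantec paper: the class, the `quoted_fraction` helper, the filter size, the number of hash functions and the `min_length` cutoff (which skips tiny paragraphs such as a lone “yes”) are all hypothetical choices.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def paragraphs(text):
    """Split on blank lines; normalize whitespace and case."""
    return [" ".join(p.split()).lower() for p in text.split("\n\n") if p.strip()]


def quoted_fraction(earlier, later, min_length=20):
    """Share of the later message's paragraphs that probably appear in the earlier one.

    Paragraphs shorter than min_length are ignored (assumption).
    """
    seen = BloomFilter()
    for p in paragraphs(earlier):
        if len(p) >= min_length:
            seen.add(p)
    candidates = [p for p in paragraphs(later) if len(p) >= min_length]
    if not candidates:
        return 0.0
    return sum(p in seen for p in candidates) / len(candidates)


original = "Please review the attached treaty before Friday.\n\nYes"
reply = (
    "Done, I reviewed the whole document today.\n\n"
    "Please review the attached treaty before Friday.\n\nYes"
)
fraction = quoted_fraction(original, reply)
```

In this example, one of the reply's two substantial paragraphs is a quote from the original, so a reviewer who has read the original only needs to see the new one.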

The idea did not go to waste: similar functionality is now built into Symantec eDiscovery Platform, Symantec's solution for e-discovery. It combines Symantec's own work, the technology of Clearwell, a company that became part of Symantec in 2011, and Symantec Enterprise Vault, an application for managing corporate email and file archives. This is the solution that Transatlantic Reinsurance Company decided to deploy to cope with the growing burden on its lawyers.

Symantec eDiscovery Platform is a major step forward compared with simple keyword search. It analyzes the content and metadata of documents, compares them with the known organizational structure of the company, and looks for patterns of communication. Together with linguistic and statistical analysis of the content, this helps group documents by topic. Email correspondence is filtered, stripped of duplicates, and grouped into threads of questions and answers.

But the most interesting technology in Symantec eDiscovery Platform is a recent addition called predictive coding. It lets you train the computer to tell documents apart by their content. To do this, the user manually sorts a portion of the documents into categories. A special algorithm analyzes the user's decisions and tries to classify the remaining documents on its own. The user corrects the machine's mistakes, and this continues until the accuracy is good enough for practical use.
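The train-and-correct loop can be sketched in a few dozen lines of Python. This is only an illustration of the general idea and says nothing about how Symantec's product works internally: the naive Bayes classifier, the function names, the batch size, the stopping threshold and the sample documents are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class NaiveBayes:
    """Tiny multinomial naive Bayes text classifier with add-one smoothing."""

    def fit(self, texts, labels):
        self.labels = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.labels}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.labels:
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


def predictive_coding(documents, ask_reviewer, seed_size=4, batch_size=2,
                      target_accuracy=1.0):
    """Label a seed set by hand, then let the reviewer correct machine guesses
    batch by batch until a batch reaches the target accuracy."""
    ids = list(documents)
    labeled = {d: ask_reviewer(d) for d in ids[:seed_size]}
    remaining = ids[seed_size:]
    model = NaiveBayes().fit([documents[d] for d in labeled], list(labeled.values()))
    while remaining:
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        guesses = {d: model.predict(documents[d]) for d in batch}
        truth = {d: ask_reviewer(d) for d in batch}  # reviewer confirms or corrects
        labeled.update(truth)
        model.fit([documents[d] for d in labeled], list(labeled.values()))
        accuracy = sum(guesses[d] == truth[d] for d in batch) / len(batch)
        if accuracy >= target_accuracy:
            break
    # Whatever is left is classified by the machine alone.
    predictions = {d: model.predict(documents[d]) for d in remaining}
    return labeled, predictions


# Hypothetical archive and a stand-in for the human reviewer.
documents = {
    "d1": "reinsurance treaty renewal terms",
    "d2": "lunch menu friday",
    "d3": "treaty claim regulator request",
    "d4": "office party friday evening",
    "d5": "regulator asks about treaty terms",
    "d6": "friday lunch party",
    "d7": "claim under reinsurance treaty",
    "d8": "party menu for the office",
}
reviewer_labels = {
    "d1": "relevant", "d2": "irrelevant", "d3": "relevant", "d4": "irrelevant",
    "d5": "relevant", "d6": "irrelevant", "d7": "relevant", "d8": "irrelevant",
}

labeled, predictions = predictive_coding(documents, reviewer_labels.get)
```

Here the reviewer labels six documents by hand and the model classifies the last two. On a real archive the same split is what saves time: people see a sample, and the machine handles the rest.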

According to one study, predictive coding reduces the time required to review documents by 80%. This means that the cost of e-discovery can be cut by more than half.

Edward Kelly of Transatlantic Reinsurance Company describes the benefit for lawyers eloquently: “There is a gulf between reviewing documents by hand and using Clearwell. The difference is about the same as between cooking dinner in a microwave and cooking it over a fire started by rubbing two sticks together.”