About nine months ago, Google's search results improved thanks to breakthrough work from its AI team. Google released the model behind that improvement, BERT, and the applications have proven endless. Using AI to analyse text has never been easier.
This revolution has not yet arrived in the legal world. Providers still train many small, task-specific models rather than a single larger model, which can deliver superior performance on the examples it was trained on and reasonable results on problems it has never seen before, thanks to its broader understanding of how language works.
The language model revolution:
Language models attract headlines and excitement from the big tech firms. BERT assists Google searches; OpenAI, the research group co-founded by Elon Musk, initially withheld GPT-2 because of its potential for misuse; Satya Nadella personally announced Microsoft's contribution to the field in February; and GPT-3 will represent a huge step forward in the state of the art once we have access to computers that can actually run it. The sheer variety of applications drives interest: language models can complete your sentences in Gmail, make chatbots seem more human and augment machine translation.
In the legal sphere there hasn’t really been a similar revolution. Existing players still advertise the fact that they use hundreds of different models to solve problems for their users.
Moving from words to sentences:
AI in language is generally divided into different tasks such as text classification, sentiment analysis and named entity recognition. Previously, each of these problems was addressed with a bespoke algorithm, individually designed by engineers to spot the specific pattern they were looking for.
Early algorithms simply counted things like word frequency: if a clause mentions assignment every third word, it is probably an assignment clause. This 'bag of words' approach can work quite well for some problems and serves as a good baseline against which to test a more sophisticated solution. In a similar vein, rule-based searches (looking for a word like 'assignment' within five words of, or in the same sentence as, 'termination') take the same hand-crafted approach and can be effective.
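To make that concrete, here is a minimal sketch of both baselines in Python. The clause text and the five-word window are invented purely for illustration.

```python
# A minimal sketch of the 'bag of words' and rule-based baselines described above.
# The clause text and the five-word proximity rule are illustrative assumptions.
import re
from collections import Counter

clause = (
    "Neither party may assign this Agreement without prior written consent. "
    "Any purported assignment in breach permits immediate termination."
)

# Bag of words: count word frequencies and flag the clause if 'assign' terms are common.
words = re.findall(r"[a-z]+", clause.lower())
counts = Counter(words)
assignment_hits = sum(count for word, count in counts.items() if word.startswith("assign"))
print(f"'assign*' terms: {assignment_hits} of {len(words)} words")

# Rule-based search: an 'assign' word within five words of a 'terminat' word.
def within_n_words(tokens, a_prefix, b_prefix, n=5):
    a_positions = [i for i, w in enumerate(tokens) if w.startswith(a_prefix)]
    b_positions = [i for i, w in enumerate(tokens) if w.startswith(b_prefix)]
    return any(abs(i - j) <= n for i in a_positions for j in b_positions)

print("assignment near termination:", within_n_words(words, "assign", "terminat"))
```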
One advance represented each word as a high-dimensional vector of numbers, which allows computers to do 'verbal arithmetic'. For example, take the vector for 'king', subtract the vector for 'man', add the vector for 'woman' and what you should get is something close to 'queen'. Pushing the idea further, you can represent the meaning of a sentence or paragraph by averaging the vectors of its words. Representing words like this allows computers to recognise that two passages are similar even if the words aren't identical.
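Here is a rough sketch of both ideas, assuming pretrained GloVe vectors fetched through the gensim library; any pretrained word-embedding model would do.

```python
# A sketch of 'verbal arithmetic' and sentence averaging, using pretrained GloVe
# vectors via gensim's downloader (an assumption; any word-embedding model works).
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # downloads roughly 130MB on first run

# king - man + woman should land near 'queen'.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# A crude sentence representation: the average of its word vectors.
def sentence_vector(sentence):
    words = [w for w in sentence.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

a = sentence_vector("the tenant may not assign this lease")
b = sentence_vector("the lessee cannot transfer the agreement")
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"similarity despite different wording: {cosine:.2f}")
```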
Extracting the meaning of a sentence requires more than understanding each word on its own: it requires understanding the context, the grammar of the sentence and the semantics of the surrounding sentences. This was the leap that BERT, building on previous advances like ELMo, made: understanding sentences as collections of words in a specific order, and taking that order into account when working out their meaning.
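To see what 'contextual' means in practice, here is a small sketch using the public bert-base-uncased model via the Hugging Face transformers library (an illustration with public tools, not a description of any particular product). The same word, 'agreement', gets a different vector depending on the sentence it appears in.

```python
# A sketch of contextual representations: the same word gets a different vector
# in each sentence, because BERT looks at the surrounding words.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # one vector per token
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

legal = word_vector("the parties signed the agreement", "agreement")
grammar = word_vector("the verb must be in agreement with its subject", "agreement")
print(torch.cosine_similarity(legal, grammar, dim=0))  # noticeably below 1.0
```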
Machine learning requires a lot of data to work, and having humans label that data by hand is expensive and tedious. BERT addressed this with 'unsupervised' training over huge volumes of text to generate contextual representations of words: the model was asked to fill in words that had been masked out of a chunk of text. This forced it, over time, to learn how words, sentences and paragraphs fit together.
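You can see the result of that training by asking a pretrained BERT to fill in a masked word. The snippet below uses the fill-mask pipeline from Hugging Face transformers; the example sentence is our own invention.

```python
# The masked-word objective in action, using a pretrained BERT via the fill-mask
# pipeline (an illustration of the trained model, not the training procedure itself).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT was trained to restore masked tokens like this one.
for guess in fill_mask("this agreement may be terminated by either [MASK]."):
    print(f"{guess['token_str']:>12}  {guess['score']:.2f}")
```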
This produced rich representations of words that pay attention to their context and can be used for a variety of tasks. Essentially, through exposure to vast amounts of data, BERT automates the feature engineering that engineers used to do by hand, and it tends to do it better because it doesn't just match surface patterns in sentences.
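As a sketch of what 'used for a variety of tasks' means, the snippet below reuses BERT's representations as features for a small clause classifier. The four labelled clauses are invented and far too few for a real system; the point is the pattern of one general model feeding a downstream task.

```python
# A sketch of reusing BERT representations for a downstream task: mean-pooled
# embeddings as features for a simple clause classifier. The tiny labelled
# dataset is invented for illustration only.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    return hidden.mean(dim=0).numpy()  # average token vectors into one feature vector

clauses = [
    ("Neither party may assign this Agreement without consent.", "assignment"),
    ("This Agreement may not be transferred to any third party.", "assignment"),
    ("Either party may terminate this Agreement on 30 days' notice.", "termination"),
    ("This Agreement ends automatically upon insolvency.", "termination"),
]
X = [embed(text) for text, _ in clauses]
y = [label for _, label in clauses]

classifier = LogisticRegression(max_iter=1000).fit(X, y)
print(classifier.predict([embed("The supplier may cancel the contract with notice.")]))
```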
The next steps:
At Della we take full advantage of this insight: a single model can handle as many data points as you like, in multiple languages. So the question is why others aren't talking about using the same approach. Perhaps the current generation of legal service providers is falling into the very trap they would have told their customers to avoid: attachment to legacy techniques.