The internet has a problem. A glance at the million most visited websites suggests that 60% of them are in English. Russian comes next at 8.3%, closely followed by Turkish. This isn’t obviously linked to AI, technology or even law, but it does affect how researchers reach the first step on the journey to understanding contracts and other documents: the data.
Many of the advances that we’ve seen have come from using the huge volume of data that the internet provides through Wikipedia, the Google Books corpus or even just links coming from Reddit. More data tends to produce better models, so with so much English data we expect good performance in English. But how can AI become bilingual, or even multilingual, when there is a dearth of data to learn from?
How can we solve the issue? – Cheese and Noodles
One step was building specific models for specific languages. A group of French researchers released a French model called, jokes aside, CamemBERT. Similar individual models have been released for many other languages, including Vietnamese (PhoBERT) and Polish (HerBERT).
This approach works but it is expensive and time-consuming. You need to gather a huge volume of text in the target language to train the model. Then you need to recreate, in that language, the specialised datasets that were painstakingly and expensively assembled for English, so that the model can perform the classification, generation or other task you want.
What’s the solution? – Multilingual models
The alternative approach is to use one giant model to represent more than one language. The theory is that it can observe similarities in vocabulary and structure between languages and take advantage of them. The Romance languages, which descend from Latin, are obvious examples where knowledge of one would likely assist with learning another.
What’s the benefit? – Reuse of existing materials
The benefit for lower-resource languages is that we can take advantage of the datasets that already exist in English. For example, if you want to train a question answering system in Spanish, you can start by training a multilingual model on a dataset in English and then fine-tune it with Spanish examples. The amount of Spanish data required is much smaller than would be needed to build such a system from scratch, which makes it much cheaper. This is because knowledge can transfer from one domain and language to another.
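The two-stage recipe described above, train on plentiful English data and then fine-tune on a handful of Spanish examples, can be sketched with a deliberately tiny stand-in. This is an illustration under stated assumptions, not how a production system works: a real setup would fine-tune a large multilingual neural model, whereas here a simple perceptron plays that role, and the three-number feature vectors (hypothetical, invented for this example) stand in for the language-neutral representation such a model would produce.

```python
def predict(weights, features):
    """Return 1 if the weighted sum of the features is positive, else 0."""
    score = sum(w * x for w, x in zip(weights, features))
    return 1 if score > 0 else 0


def train(weights, examples, epochs=5, learning_rate=1.0):
    """Perceptron training that starts from the given weights.

    Passing in already-trained weights is what makes the second call below
    "fine-tuning" rather than training from scratch.
    """
    weights = list(weights)  # copy so the caller's weights are not modified
    for _ in range(epochs):
        for features, label in examples:
            error = label - predict(weights, features)
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
    return weights


def accuracy(weights, examples):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(predict(weights, f) == label for f, label in examples)
    return correct / len(examples)


# Hypothetical toy data: each example is ([bias, feature_1, feature_2], label).
# The values are made up; they stand in for shared multilingual features.
english_examples = [([1, 2, 0], 1), ([1, 0, 2], 0), ([1, 3, 1], 1), ([1, 1, 3], 0)]
spanish_examples = [([1, 1, 1], 1), ([1, 1, 2], 0)]

# Stage 1: train on the larger English dataset, starting from zero weights.
pretrained = train([0.0, 0.0, 0.0], english_examples)

# Stage 2: fine-tune on the few Spanish examples, starting from stage 1's weights.
fine_tuned = train(pretrained, spanish_examples)

print("Spanish accuracy before fine-tuning:", accuracy(pretrained, spanish_examples))
print("Spanish accuracy after fine-tuning: ", accuracy(fine_tuned, spanish_examples))
```

On this toy data the English-only weights already get one of the two Spanish examples right, and a few passes over the Spanish examples fix the other without breaking the English ones. Scoring on the same two examples used for fine-tuning is only acceptable in a toy; a real evaluation would hold out separate Spanish test data.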
The AI/NLP community has been building large datasets in this area for a while now. One of the pioneering efforts was SQuAD (the Stanford Question Answering Dataset), which initially contained 100,000+ question-answer pairs manually generated at considerable expense from Wikipedia articles and became the de facto benchmark for simple reading comprehension. This dataset is far from alone, however, with new sets with different focuses emerging all the time. For example, there are now datasets of questions that are impossible to answer, questions about newspaper articles or tweets, and questions whose answers must be understandable by a five-year-old.
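To make the shape of such a dataset concrete, here is a minimal sketch of one SQuAD-style entry. The field names (context, qas, question, answers, text, answer_start, is_impossible) follow the published SQuAD JSON layout, with is_impossible coming from SQuAD 2.0. The lease sentence, the questions and the helper spans_are_consistent are invented for illustration and are not taken from the real dataset.

```python
# Hypothetical passage and answer, invented for this example.
context = "The tenant shall pay rent of 1,200 euros per month to the landlord."
answer_text = "1,200 euros per month"

paragraph = {
    "context": context,
    "qas": [
        {
            "id": "example-1",
            "question": "How much is the rent?",
            "is_impossible": False,
            "answers": [
                # answer_start is the character offset of the answer in the context
                {"text": answer_text, "answer_start": context.index(answer_text)},
            ],
        },
        {
            "id": "example-2",
            "question": "When does the lease end?",
            "is_impossible": True,  # the passage does not say
            "answers": [],
        },
    ],
}


def spans_are_consistent(paragraph):
    """Check that every answer is a literal span of the context at its offset."""
    text = paragraph["context"]
    for qa in paragraph["qas"]:
        if qa["is_impossible"]:
            if qa["answers"]:
                return False  # an unanswerable question must not carry answers
            continue
        if not qa["answers"]:
            return False  # an answerable question needs at least one answer
        for answer in qa["answers"]:
            start = answer["answer_start"]
            end = start + len(answer["text"])
            if text[start:end] != answer["text"]:
                return False
    return True


print(spans_are_consistent(paragraph))
```

Because each answer is stored as a span of the passage rather than free text, someone has to read every passage and mark the exact span, which is a large part of why datasets like this are costly to build.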
Why is this important for the legal sector? – Law is expensive
If general machine learning datasets are expensive to build, this is doubly true for legal ones. First, the overwhelming majority of legal documents are confidential, so getting hold of documents to extract examples from can be difficult. Second, the expertise needed to extract examples is expensive, because it takes extensive training to confidently extract the correct information from a document.
How accurate is NLP at answering questions in non-English languages? – Very accurate
Transfer learning means that NLP in less common languages is going to get better faster than it did in English. Some examples from my own experience support this. We can take contracts in Swedish, ask questions in English and get sensible answers. This is because, whilst there are jurisdictional differences, a lot of the things people want to know are similar: who’s involved, how much does it cost and what happens when it goes wrong.
An Austrian law firm wanted to test how accurate Della could be at reviewing Austrian leases. The firm put together a list of 77 questions that needed to be answered. They wanted to see whether they could use Della for the following use cases: multi-jurisdictional lease review and reporting, mass document review, contract risk assessment and repapering. Without any prior training on Austrian German contracts, Della achieved 50% accuracy on the first answer it suggested. After exposure to only a small number of Austrian contracts, Della’s accuracy reached 89%.
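‘Accuracy on the first answer it suggested’ is what is usually called top-1 accuracy: a question counts as correct only if the highest-ranked suggestion matches the expected answer. The sketch below shows one way to compute it. The function names, the normalisation rule and the four sample questions are assumptions made for illustration; they are not Della’s evaluation method or the firm’s 77 questions.

```python
def normalise(answer):
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(answer.casefold().split())


def top1_accuracy(suggestions, expected):
    """Share of questions whose first suggested answer matches the expected one.

    suggestions: question -> list of suggested answers, best first
    expected:    question -> the reviewer's expected answer
    """
    if not expected:
        raise ValueError("at least one expected answer is required")
    correct = 0
    for question, gold in expected.items():
        ranked = suggestions.get(question, [])
        # Only the first suggestion counts; no suggestion at all counts as wrong.
        if ranked and normalise(ranked[0]) == normalise(gold):
            correct += 1
    return correct / len(expected)


# Hypothetical review questions and answers, invented for this example.
expected = {
    "Who is the landlord?": "Alpha GmbH",
    "Who is the tenant?": "Beta AG",
    "What is the monthly rent?": "1,200 euros",
    "What is the notice period?": "three months",
}
suggestions = {
    "Who is the landlord?": ["alpha gmbh", "Beta AG"],
    "Who is the tenant?": ["Beta AG"],
    "What is the monthly rent?": ["1,200 euros", "three months"],
    "What is the notice period?": ["1,200 euros", "three months"],
}

print(top1_accuracy(suggestions, expected))  # 3 of 4 first answers match: 0.75
```

Note that the notice-period question has the right answer in second place, yet it still counts as wrong. That strictness is what makes a first-answer figure a demanding measure.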
Join the ‘giant model’ SQuAD
Not only does this approach to transfer learning show promise in terms of accuracy, it also does so without breaking the bank or eating into your precious time. Della has already joined the ‘SQuAD’ and is trained on 20+ languages. And the best part? The more exposure the AI has to different languages, the more accurate the model becomes for all future users of the platform.
Let us show you how Della’s giant multilingual model can save you time and money on your contract review.