[Image: a robot judge in a courtroom]

Development of Language Models to Process Legal Language

One possibility at the intersection of law and technology that has intrigued me for years is getting computer systems to “understand” the substance of all kinds of contracts. This could open up a world of possibilities for automating legal processes and making legal services more accessible.

Imagine being able to ask your virtual assistant, “Hey Siri/Alexa, I’m thinking of moving to Chicago. Which contracts that I’m a party to would prevent this or would need to be updated?” The system would be aware of all your personal agreements (your employment contract, company policies, car lease, apartment lease, health, car, life, and renter’s insurance policies, and service agreements for cell phone, internet, cable, and so on) and provide a synthesized answer. The system could even make the necessary adjustments for you, including giving appropriate notice to withdraw from some agreements and entering into new ones, such as a new lease on an apartment. This idea is not as far-fetched as it was just a few years ago.

At the core of this idea is the challenge of getting computer systems to understand the substance of contracts. To a computer system, most contracts are just unstructured data. Up until about five years ago, there were two main ways to go about building a system that could understand contracts.

The first approach involves developing a law-specific programming language. Since much of a contract can be broken down into if/then statements, contracts could be expressed in a type of programming language instead of (or in addition to) English, making them directly readable by computers. Such languages already exist in finance and in some insurance settings. However, this approach depends on the contract being expressed in a machine-readable format at the time of drafting. The CodeX center at Stanford has several projects in this area, generally referred to as computable contracts.
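
To make that concrete, here is a minimal sketch in Python of a lease termination clause expressed as executable if/then logic. The clause, field names, and numbers are all hypothetical; this illustrates the idea rather than any existing computable-contract language:

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical lease termination clause expressed as executable logic
# rather than (or in addition to) English prose.

@dataclass
class TerminationClause:
    notice_days: int = 60                  # required written notice period
    early_termination_fee: float = 1500.00

    def notice_is_sufficient(self, notice_given: date, move_out: date) -> bool:
        """True if the tenant's notice satisfies the clause."""
        return (move_out - notice_given).days >= self.notice_days

    def fee_owed(self, notice_given: date, move_out: date) -> float:
        """Fee owed when the notice period is too short."""
        if self.notice_is_sufficient(notice_given, move_out):
            return 0.0
        return self.early_termination_fee

clause = TerminationClause()
print(clause.notice_is_sufficient(date(2023, 1, 1), date(2023, 4, 1)))  # True (90 days)
print(clause.fee_owed(date(2023, 1, 1), date(2023, 2, 1)))              # 1500.0 (only 31 days)
```

Once a clause exists in this form, a computer can evaluate it directly, with no parsing of English prose required.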

The second approach involves using machine learning: gather a large set of contracts, break them down into their important terms and clauses, label all of that data, and train an ML model on it. That contract model could then be used to analyze new contracts. LawGeex is an example of a company that has been using this approach.
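
A minimal sketch of that pipeline, using scikit-learn as the ML toolkit (my choice for illustration; the approach isn’t tied to any particular library), might look like this. The clauses, labels, and clause types below are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy hand-labeled training data; a real system would need thousands
# of labeled examples per clause type.
clauses = [
    "Either party may terminate this agreement with 30 days' written notice.",
    "Tenant shall pay rent of $1,200 on the first day of each month.",
    "This agreement shall be governed by the laws of the State of Illinois.",
    "Landlord may terminate the lease upon material breach by the tenant.",
]
labels = ["termination", "payment", "governing_law", "termination"]

# A classic text-classification pipeline: TF-IDF features feeding a
# logistic regression classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(clauses, labels)

# Classify a clause from a new contract.
print(model.predict(["Rent is due monthly in the amount of $950."]))  # likely ['payment']
```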

While the machine learning approach is more doable than creating a law-specific programming language, it still requires a ton of work gathering and labeling legal language, just to build what might be a fragile AI model. There are also challenges in capturing the nuance and context of legal language, as well as the potential for bias in the data used to train the model.

Evolutions in Natural Language Processing

All of this takes place in the context of what is known in the AI world as natural language processing (NLP). One of the earliest breakthroughs in modern NLP was the release of word2vec in 2013. This learning algorithm helped computers capture semantic relationships between words by learning vector representations of them.

Word vectors are important in NLP because they provide a way for computers to understand the meaning and context of words. By representing words as vectors, NLP models can perform mathematical operations on them, such as addition and subtraction, to infer relationships between words. For example, the vector for “king” minus the vector for “man” plus the vector for “woman” would result in a vector close to the vector for “queen”. This ability to capture semantic relationships between words was a significant advancement in the field of NLP and opened up new possibilities for applications such as text classification and information retrieval.
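
You can try this analogy yourself using the gensim library and a set of pretrained vectors. The sketch below assumes gensim’s downloader and one of its hosted models (here GloVe vectors, a close cousin of word2vec):

```python
import gensim.downloader as api

# Load pretrained word vectors. "glove-wiki-gigaword-100" is one of
# gensim's downloadable models (~130 MB on first run); any pretrained
# word2vec/GloVe vectors would demonstrate the same idea.
vectors = api.load("glove-wiki-gigaword-100")

# The classic analogy: vector("king") - vector("man") + vector("woman")
# lands closest to vector("queen").
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # e.g. [('queen', 0.77)]
```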

Additionally, transfer learning became a dominant paradigm in NLP with the release of BERT in 2018, which pre-trains a large neural network on a vast corpus of text data and fine-tunes it for a specific downstream task. More recent advancements include the release of GPT-3 in 2020, a neural language model with 175 billion parameters (models of this scale are often referred to as large language models, or LLMs). It achieved impressive results on various language tasks such as language translation and text completion. This family of models exploded in popularity in late 2022 with the release of ChatGPT, a public application built on an underlying GPT model.
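
As a rough sketch of the pre-train/fine-tune pattern, here is how you might load a pretrained BERT with a fresh classification head using the Hugging Face transformers library (one possible toolchain, not something this article prescribes):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Step 1: load a model pretrained on a vast general-text corpus.
# AutoModelForSequenceClassification attaches a fresh classification
# head (here with 2 labels) on top of the pretrained BERT encoder.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Step 2 (not shown): fine-tune the whole network on labeled examples
# from the downstream task, e.g. with Hugging Face's Trainer API.

# Until fine-tuned, the new head's outputs are random:
inputs = tokenizer("Either party may terminate with 30 days' notice.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits)
```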

With this as background, in my next article I’ll walk through using GPT-3 and Python to break down complex legal documents for basic question answering.