I Built a Legal Document Classifier Because I Got Tired of Watching People Sort PDFs

I spent two years as a solo CTO at a legal tech company. One of the first things I noticed was a paralegal spending half her morning opening PDFs, skimming the first page, and dragging them into folders. Contracts here, filings there, NDAs in that one, correspondence in the other.

She was fast at it. She'd been doing it for years. And it was an enormous waste of her time.

That's legal document classification. Before anything useful happens to a document, before it gets reviewed, routed, stored, tracked, someone has to look at it and decide what it is. In a five-person firm doing twenty documents a week, fine. In a legal department processing hundreds a week across multiple matters and jurisdictions, you're burning real hours on what is essentially a sorting exercise.

Nobody writes incident reports about this. It never makes the priority list. It just quietly eats time, every day, across every team I've ever seen.

The tools that exist don't actually solve this

I need to be specific here because "why don't you just use [software]" is the first thing everyone says.

Document management systems like iManage, NetDocuments, and Worldox are filing cabinets, not classifiers. They'll organize your documents beautifully once you already know what they are. They won't tell you what something is in the first place. Some of them have tagging features, but tagging is just classification with extra steps and a better UI. You still need a human reading the document and picking the right label.

Enterprise AI platforms exist too. The big eDiscovery tools. They're built for litigation, priced for AmLaw 100 firms, and require weeks of onboarding. If you're a mid-size legal team that just needs to know whether the PDF that landed in your inbox is an NDA or a services agreement, you don't need a platform. You need an answer.

Why this is harder than it sounds

I assumed this would be straightforward when I started building. I was wrong.

Legal documents are structurally wild. A contract drafted by a New York firm looks nothing like one drafted in London. An NDA from a startup that used some template off the internet reads completely differently from one a Big Four firm spent three weeks negotiating. I had contracts in my training data that never used the word "contract" anywhere in the document.

Then there are scanned documents. A surprising amount of legal work still involves paper. Contracts pulled from archives, faxed filings, photographed signature pages. I spent more time than I'd like to admit dealing with OCR edge cases where a scanned lease agreement got classified as correspondence because the first page was a cover letter.
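To make that failure mode concrete, here's a minimal sketch of the shape of the fix: score every page and combine, instead of trusting page one. Everything in it is a placeholder for illustration, the toy keyword "model", the labels, the decay weighting; it shows the aggregation idea, not LegalSort's actual pipeline.

```python
from collections import defaultdict

# Toy stand-in for a real model: keyword counts turned into pseudo-probabilities.
KEYWORDS = {
    "lease": ["lease", "tenant", "landlord", "premises"],
    "correspondence": ["dear", "sincerely", "regards", "enclosed"],
}

def classify_page(text: str) -> dict[str, float]:
    text = text.lower()
    hits = {label: sum(text.count(w) for w in words)
            for label, words in KEYWORDS.items()}
    total = sum(hits.values()) or 1
    return {label: n / total for label, n in hits.items()}

def classify_document(pages: list[str]) -> tuple[str, float]:
    """Score every page, weighted so early pages matter more but never exclusively."""
    totals: dict[str, float] = defaultdict(float)
    weight_sum = 0.0
    for i, text in enumerate(pages):
        w = 1.0 / (1.0 + 0.25 * i)  # page 1 counts most, but can't win alone
        for label, p in classify_page(text).items():
            totals[label] += w * p
        weight_sum += w
    label, score = max(totals.items(), key=lambda kv: kv[1])
    return label, score / weight_sum
```

A cover letter on page one still pulls toward "correspondence", but ten pages of lease language outvote it.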

And the taxonomy problem is real. Legal documents don't fit into ten neat categories. You need dozens of types with subcategories, and the boundaries between them are genuinely fuzzy. Is a letter of intent a contract? Sometimes. Is a settlement agreement a filing or correspondence? Depends on context. Every edge case is a judgment call, and judgment calls are exactly what's hard to automate.
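If it helps to picture the problem, here's the kind of structure I mean, with made-up labels rather than LegalSort's real type list. The interesting part is the second map: some types legitimately belong to more than one parent, which is exactly what a flat pick-one-folder scheme can't represent.

```python
# Illustrative only: invented labels, not a real taxonomy.
TAXONOMY = {
    "contract": ["nda", "services_agreement", "lease", "letter_of_intent"],
    "filing": ["complaint", "motion", "brief", "notice"],
    "correspondence": ["demand_letter", "cover_letter", "engagement_letter"],
}

# The fuzzy cases are the whole problem: one document, two defensible parents.
MULTI_PARENT = {
    "settlement_agreement": ["contract", "filing"],      # negotiated deal, filed with the court
    "letter_of_intent": ["contract", "correspondence"],  # binding? depends who you ask
}
```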

What actually matters in practice

After building this for a while, I've landed on a few things that separate a useful classifier from a demo.

First, confidence scores change everything. "This is a contract" is a guess. "This is a contract, specifically a non-disclosure agreement, 96% confidence" is something you can route automatically. The difference determines whether you've eliminated the bottleneck or just moved it: without a score you can act on, someone ends up reviewing every one of the classifier's outputs instead of classifying documents themselves.
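Here's the routing logic in miniature. The threshold, the queue name, and the Classification shape are all illustrative assumptions, not LegalSort's API; the point is that a human only ever sees the uncertain tail.

```python
from dataclasses import dataclass

@dataclass
class Classification:
    doc_type: str      # e.g. "contract"
    subtype: str       # e.g. "nda"
    confidence: float  # 0.0 - 1.0

# Illustrative cutoff; in practice you'd tune it per document type,
# weighted by how expensive a misroute is.
AUTO_ROUTE_THRESHOLD = 0.90

def route(result: Classification) -> str:
    """Auto-file confident calls; send everything else to a human queue."""
    if result.confidence >= AUTO_ROUTE_THRESHOLD:
        return f"{result.doc_type}/{result.subtype}"
    return "review_queue"

# route(Classification("contract", "nda", 0.96)) -> "contract/nda"
# route(Classification("contract", "lease", 0.61)) -> "review_queue"
```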

Second, speed matters more than I expected. Classification has to happen at intake speed. Seconds, not minutes, definitely not overnight batch processing. If the tool is slower than the paralegal, nobody will use it.

Third, it has to handle the ugly documents. Clean digital PDFs are the easy case. The hard case is a photographed contract with coffee stains and scanned upside down. If your classifier only works on pristine inputs, you've solved maybe 60% of the actual problem.
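For the upside-down case specifically, there's a cheap preprocessing step: Tesseract's orientation-and-script detection. A rough sketch, assuming pytesseract, Pillow, and a local Tesseract install, with error handling elided (OSD fails on pages with too little text):

```python
import pytesseract
from PIL import Image

def normalize_orientation(path: str) -> Image.Image:
    """Rotate a scanned page upright before OCR gets anywhere near it."""
    img = Image.open(path)
    osd = pytesseract.image_to_osd(img)  # output includes a line like "Rotate: 180"
    rotate = int(next(line.split(":")[1] for line in osd.splitlines()
                      if line.startswith("Rotate")))
    if rotate:
        # OSD reports the clockwise correction; PIL rotates counter-clockwise.
        img = img.rotate(-rotate, expand=True)
    return img
```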

The cost nobody quantifies

Here's what bugs me about how legal teams think about this. Everyone knows they spend too much time sorting documents. Everyone agrees it's inefficient. And then it gets deprioritized because there's a deal closing Thursday and a filing deadline next week and the intake process has been broken for three years so what's another quarter.

But the cost compounds in ways people don't track. Misfiled documents that delay closings. Junior associates doing sorting work instead of legal analysis. The downstream friction in every single workflow that starts with "first, figure out what this document is." Which is basically all of them.

Legal ops people are starting to get this. The intake layer, those first few minutes after a document arrives, is the highest-leverage point in the whole pipeline. Get classification right and routing, review, storage, search, and compliance all get faster. Leave it manual and every step downstream inherits that drag.

What I'm building

I'm not going to pretend this post isn't partly about my product. It is. I built LegalSort to solve exactly this problem: classify incoming legal documents across 70+ types, handle scanned documents, return results in seconds.

But I'm also genuinely curious how other legal teams handle this today. If you're in legal ops and you've built a workflow around document intake, whether it's manual, semi-automated, or held together with shared drives and hope, I'd like to hear what's working and what isn't.
