
Fifteen years of case files. Precedent search down from hours to seconds.

A UK commercial litigation firm had 2.3 million pages of internal case files and a research process that ran on partners' memory. We built a retrieval system that reads legal context, not keywords, and cites every answer back to its source.

73%

Research time cut

2,000+

Queries per day

2.3m

Pages indexed

8 wks

To production

The challenge

A mid-sized UK commercial litigation firm. Partners had access to fifteen years of case files — about 2.3 million pages of mixed-format material. PDFs of court orders. Scanned letters. Drafted contracts. Internal memos. Email threads exported as PDF. When working on a new matter, partners would spend half a day per case finding analogous precedents. Longer if the case touched multiple practice areas.

Public case-law databases (Westlaw, LexisNexis) covered the published precedents. The firm's own internal work — the privileged material that actually mattered for their unique strategies — was effectively in a black box. Searches happened by partner memory, partner email, or "ask the senior associate who was on that case eight years ago."

They'd considered hiring a knowledge management specialist. The cost-benefit didn't work. They'd looked at off-the-shelf legal-AI tools, but most were trained on public case law, not their internal files. A Big Four firm had quoted £180k+ for a fourteen-month "knowledge management transformation" with no commitment to a working search interface at the end of it.

The diagnostic

Two weeks. We sat with three partners, watched them research a live case, and read enough historical files to understand the document landscape. Three findings:

  • 01 · The corpus was usable, but heterogeneous. About 60% of files were native PDFs, mostly modern. 30% were scanned (the older material). 10% were Word, email, and miscellaneous. An OCR and parsing pipeline would be needed before any retrieval work.
  • 02 · The right unit of retrieval was the paragraph, not the document. Partners didn't want documents — they wanted specific reasoning, specific clauses, specific precedents. Document-level retrieval would have been useless. Paragraph-level with metadata (case ID, document type, year) was what mattered.
  • 03 · Citation was non-negotiable. Partners would not trust an AI answer that didn't link back to the exact source. Anything else was professional risk.

We quoted eight weeks fixed price.

What we built

Architecture · ingestion + query
Ingestion (one-time): PDFs & scans (2.3m pages, mixed format) → OCR + parse (extract text, log errors for review) → chunk (paragraph or clause, with back-refs) → embed (self-hosted, no data egress) → vector store (on-prem, behind firm SSO)

Query (runtime): partner query (natural language) → hybrid retrieval (semantic + metadata filters) → re-rank (top 50 → top 10) → cited results (snippet + source PDF link)

Ingestion pipeline

OCR for scanned material with manual review of unclear pages. Parsing logic to handle the firm's filing structure (case ID embedded in filename, document type inferred from folder structure plus filename heuristics). Each document got a metadata record. Errors were logged for human review rather than dropped silently.
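The filename and folder heuristics above can be sketched as follows. This is a minimal illustration, not the firm's actual parsing logic: the case-ID pattern, folder names, and field names are all hypothetical.

```python
import re
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical filing conventions, for illustration only.
CASE_ID_RE = re.compile(r"(?P<case_id>[A-Z]{2,4}-\d{4}-\d{3,5})")
DOC_TYPE_FOLDERS = {
    "orders": "court_order",
    "correspondence": "letter",
    "contracts": "contract",
    "memos": "internal_memo",
}

@dataclass
class DocMetadata:
    path: str
    case_id: Optional[str]
    doc_type: str
    needs_review: bool  # flagged for a human, never dropped silently

def extract_metadata(path: str) -> DocMetadata:
    """Infer case ID from the filename and document type from the folder path."""
    p = PurePosixPath(path)
    match = CASE_ID_RE.search(p.name)
    case_id = match.group("case_id") if match else None
    parts = [part.lower() for part in p.parts]
    doc_type = next(
        (t for folder, t in DOC_TYPE_FOLDERS.items() if folder in parts),
        "unknown",
    )
    # Anything unclassifiable goes to the review queue instead of the bin.
    needs_review = case_id is None or doc_type == "unknown"
    return DocMetadata(str(p), case_id, doc_type, needs_review)
```

Documents that fail either heuristic come back with `needs_review=True`, matching the log-for-review-rather-than-drop policy described above.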

Chunking strategy

Paragraph-level chunks for prose documents (memos, opinions, correspondence). Clause-level chunks for contracts and orders. Each chunk carried back-references to its parent document and the surrounding two paragraphs as context — so retrieval surfaced enough surrounding text to make the result intelligible without having to open the source.
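In outline, the chunking step looks like this. A minimal sketch assuming prose documents already split into paragraphs; the `Chunk` structure and two-paragraph window mirror the description above.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str   # back-reference to the parent document
    index: int    # position within the parent document
    text: str     # the paragraph itself (what gets embedded)
    context: str  # ± two surrounding paragraphs (what gets displayed)

def chunk_paragraphs(doc_id: str, paragraphs: list[str], window: int = 2) -> list[Chunk]:
    """Paragraph-level chunks, each carrying enough surrounding text
    to be intelligible without opening the source document."""
    chunks = []
    for i, para in enumerate(paragraphs):
        lo = max(0, i - window)
        hi = min(len(paragraphs), i + window + 1)
        chunks.append(Chunk(
            doc_id=doc_id,
            index=i,
            text=para,
            context="\n\n".join(paragraphs[lo:hi]),
        ))
    return chunks
```

Clause-level chunking for contracts and orders would follow the same shape, with a clause splitter in place of the paragraph splitter.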

Retrieval

Vector embeddings over the chunks, with a hybrid retrieval combining semantic similarity (the meaning of the query) with metadata filtering (practice area, year range, document type). A re-ranking step on the top fifty results before surfacing the top ten to the partner.
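The filter-then-rank-then-re-rank flow can be sketched like this. The data shapes are illustrative, and `rerank` stands in for whatever more expensive scorer (e.g. a cross-encoder) does the second pass; none of these names come from the actual build.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_retrieve(query_vec, chunks, filters, rerank, k_first=50, k_final=10):
    """Hard metadata filters first, then a semantic shortlist,
    then a re-rank pass over the shortlist."""
    # 1. Metadata filtering: practice area, year range, document type.
    candidates = [c for c in chunks
                  if all(c["meta"].get(k) == v for k, v in filters.items())]
    # 2. Semantic shortlist: top k_first by embedding similarity.
    candidates.sort(key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    shortlist = candidates[:k_first]
    # 3. Re-rank the shortlist with a slower, better scorer; keep top k_final.
    shortlist.sort(key=rerank, reverse=True)
    return shortlist[:k_final]
```

Filtering before the vector search keeps the expensive steps working on a small, relevant candidate set.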

Citation

Every retrieved result links back to the source PDF with the matched paragraph highlighted. Partners can click through to read the surrounding context, then click again to open the full document. No hallucinated citations — if the system can't cite it, it doesn't surface it.
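The "no citation, no result" rule reduces to a guard at the end of the pipeline. A sketch under assumed names: `resolve_source` is a hypothetical lookup from chunk back-references to a source PDF location.

```python
def cited_results(results, resolve_source):
    """Keep only results that resolve to a real source location.
    A result the system cannot cite is dropped, never surfaced."""
    out = []
    for r in results:
        link = resolve_source(r["doc_id"], r["index"])  # e.g. PDF path + anchor
        if link is not None:
            out.append({**r, "source": link})
    return out
```

Because the link is resolved from the chunk's stored back-reference rather than generated, a citation can be missing but never invented.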

Interface

A simple web app, internal-only, behind the firm's existing Single Sign-On. Partners type a natural-language query — "show me cases where we argued frustration of contract due to regulatory change" — and get back ranked results with snippets and citations. An optional generative summary was added later, after partners trusted the retrieval enough to want a synthesis on top.

Security

Everything on-premise. No data leaves the firm's network. Embeddings generated from a self-hosted model. Vector store on the firm's existing infrastructure, behind their existing access controls.

Build phase

First week was the ingestion pipeline. Slow start — OCR quality on the older material was uneven, and chunking heuristics needed tuning for each document type. By week three, partners could query a subset of the corpus (one practice area, last five years) and start seeing useful results.

Weeks four through six expanded the corpus and tuned retrieval. Partners started using the prototype on live work, which surfaced edge cases — same client name spelled differently across older files, citations to overruled statutes that needed flagging, document scans where two pages had been bound out of order.

Week seven was the citation system and UI polish. Week eight was SSO integration and partner rollout. The generative summary feature came in a follow-up engagement two months later, once the firm had used the retrieval system enough to know what they wanted from it.

Outcomes

Six months in production:

  • 73% reduction in research time for typical precedent searches. Partners measured this on their own time logs.
  • 2,000+ queries per day across the firm; individual partners use it dozens of times a day.
  • Junior associates onboard faster. The system gives them the same depth of access as a senior associate's institutional memory.
  • An unexpected one: partners rediscovered cases they'd forgotten about. Useful precedents from twelve years ago that no one on the current team had worked on.

What translates

The pattern works wherever you have:

  • A large historical document set with internal value — legal, medical, financial advisory, regulatory, archives.
  • Domain expertise in your team that the documents reflect. The system surfaces your team's knowledge, not generic public knowledge.
  • Tolerance for a build that involves an ingestion phase as well as a retrieval phase. Older or messier corpora take longer.
  • Need for citation or audit. Regulated industries especially.

It only works as a fast project when every document is already digital, well-tagged, and in a modern format. Most corpora aren't.

Got an archive worth searching properly?

We start with a two-week diagnostic. If the numbers don't work, we tell you and refund the second week.

Start with a diagnostic →