Building a Zero-Dependency PDF Parser to Survive Vercel Serverless

How I dodged DOMMatrix is not defined and shipped a RAG agent that actually deploys.

I was three hours into building a "chat with your PDFs" demo on Next.js 16 when I hit the kind of error that makes you question your career.

ReferenceError: DOMMatrix is not defined
    at /var/task/node_modules/pdfjs-dist/build/pdf.js:...

The PDF was loading fine on localhost. The moment I deployed to Vercel — boom. Function crash. The error surfaced from pdfjs-dist, the underlying library used by pdf-parse and roughly every "easy" PDF-text-extraction package on npm.

The reason is simple and brutal: pdfjs-dist was designed to run in the browser. It depends on DOMMatrix, canvas, and a handful of other browser APIs that are part of the Web Platform but are not built into Node.js. On your laptop, the gap can stay hidden, for example when an optional native canvas package happens to be installed and supplies the missing pieces. On Vercel's serverless runtime, there is no DOM, no shim, no mercy.

I needed a different path.

This post is about that path. I ended up writing a zero-dependency PDF parser using nothing but Node's built-in zlib and a few regular expressions. It is not pretty. It will never replace a real PDF library. But it's small, it's serverless-safe, and it ships.

If you're building anything that has to extract text from user-uploaded PDFs in a Vercel/Lambda/Cloudflare Workers context, this is the cleanest workaround I've found.

A 60-second PDF anatomy lesson

Before the parser makes sense, you need to know what's actually inside a .pdf file. I knew almost nothing about this when I started; an afternoon of reading the Adobe PDF 1.7 reference clarified things faster than I expected.

A PDF is a tree of objects, encoded as a mix of ASCII and binary. The text you actually see on screen lives inside content streams — blobs of PostScript-like instructions that say "move the cursor here, set this font, show this string."

A typical content stream looks something like this:

BT
  /F1 12 Tf
  72 720 Td
  (Hello, world.) Tj
  0 -14 Td
  [(Multiple) -250 (segments)] TJ
ET

Two operators do almost all the work:

Tj — show a single string. Syntax: (some text) Tj.
TJ — show an array of strings with kerning adjustments. Syntax: [(Multiple) -250 (segments)] TJ.

If you can extract the strings inside the parentheses of every Tj and TJ operator in every content stream, you have, approximately, the text of the PDF.

There's one catch. Modern PDFs almost always compress their content streams using zlib (the FlateDecode filter in PDF terminology). A raw PDF byte-stream contains long runs of binary garbage that decompress into the readable PostScript-like operators above.

So the parser needs to do three things:

1. Find every stream ... endstream block.
2. Detect whether it's FlateDecode-compressed and inflate if so.
3. Pull text out of Tj / TJ operators with regex.

That's the entire algorithm.
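The compression step is less mysterious than it sounds: FlateDecode data is an ordinary zlib stream. The round trip below is a small illustration, not part of the parser. It reuses the sample content stream from earlier, and deflateSync merely stands in for whatever PDF writer produced a real file.

```typescript
import { deflateSync, inflateSync } from 'zlib';

// The sample content stream from the anatomy section, flattened onto one line.
const plain = Buffer.from('BT /F1 12 Tf 72 720 Td (Hello, world.) Tj ET', 'latin1');

// Roughly what a PDF writer does when it applies the FlateDecode filter.
const compressed = deflateSync(plain);

// What the parser has to undo before any regex can see the text.
const restored = inflateSync(compressed);

console.log(compressed[0].toString(16));  // 78, the usual first byte of a zlib stream
console.log(restored.toString('latin1')); // BT /F1 12 Tf 72 720 Td (Hello, world.) Tj ET
```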

Step 1 — Find every stream

A PDF stream is delimited by the literal keywords stream and endstream, with newlines in between. The content between them is opaque bytes — either binary (compressed) or PostScript (raw).

The robust way to read a PDF in Node without losing bytes is to read it as latin1. Latin-1 maps every byte 0–255 to a unique character, so the round-trip is lossless. (UTF-8 would corrupt anything non-ASCII; the Buffer API is overkill for text scanning.)

import { inflateSync } from 'zlib'; // built-in, no npm install

function extractPdfText(buffer: Buffer): string {
  const raw = buffer.toString('latin1');
  const texts: string[] = [];

  const streamRegex = /stream\r?\n([\s\S]*?)\r?\nendstream/g;
  let m: RegExpExecArray | null;

  while ((m = streamRegex.exec(raw)) !== null) {
    const streamData = m[1];
    // ... step 2 below
  }

  return texts.join(' ').replace(/\s+/g, ' ').trim();
}

[\s\S]*? is the lazy "match anything including newlines" idiom. JavaScript's equivalent of Python's re.DOTALL is the s flag (ES2018 and later), which makes . match newlines too; [\s\S] is the older spelling and works everywhere.
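The claim that latin1 is lossless while UTF-8 is not can be checked directly. This is a standalone sanity check rather than parser code: it pushes every possible byte value through both encodings and compares the result with the input.

```typescript
// Every possible byte value, 0x00 through 0xff.
const allBytes = Buffer.from(Array.from({ length: 256 }, (_, i) => i));

// Decode to a string and encode back, once per encoding.
const viaLatin1 = Buffer.from(allBytes.toString('latin1'), 'latin1');
const viaUtf8 = Buffer.from(allBytes.toString('utf8'), 'utf8');

console.log(viaLatin1.equals(allBytes)); // true: one byte maps to one character and back
console.log(viaUtf8.equals(allBytes));   // false: invalid sequences become U+FFFD
```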

Step 2 — Decompress if FlateDecode

Each PDF stream is preceded by a dictionary that tells you which filter (if any) was used to encode the data. The dictionary lives in the bytes just before the stream keyword. We don't need to fully parse it: checking whether FlateDecode appears in the ~500 bytes preceding the stream is enough for most real-world PDFs. The heuristic's weak spot is small streams, where a 500-byte window can reach back into the previous object's dictionary.

const streamStart = m.index;
const preceding = raw.slice(Math.max(0, streamStart - 500), streamStart);
const isFlate = preceding.includes('FlateDecode');

let content: string;
if (isFlate) {
  try {
    const compressed = Buffer.from(streamData, 'latin1');
    const decompressed = inflateSync(compressed);
    content = decompressed.toString('latin1'); // content streams are bytes, not UTF-8; latin1 keeps them intact
  } catch {
    continue; // corrupted stream, skip
  }
} else {
  content = streamData;
}

inflateSync is part of Node's built-in zlib module — no npm install, no bundling. It throws on malformed input, which is fine because PDFs sometimes contain partial or corrupted streams that aren't worth crashing the whole upload over.

After inflation, content is the PostScript-like text we actually want.
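If the fixed lookback window ever misfires, for instance by picking up FlateDecode from the previous object's dictionary, one option is to clip the window at the previous endobj or endstream keyword. The helper below is a sketch under that assumption; usesFlate and its parameters are hypothetical names, not part of the original parser.

```typescript
// Hypothetical helper: returns true if the dictionary right before `streamStart`
// mentions FlateDecode. `raw` is the latin1-decoded file, `streamStart` is m.index.
function usesFlate(raw: string, streamStart: number, maxLookback = 500): boolean {
  const windowStart = Math.max(0, streamStart - maxLookback);
  // The dictionary cannot start before the previous object or stream ended.
  const previousEnd = Math.max(
    raw.lastIndexOf('endobj', streamStart),
    raw.lastIndexOf('endstream', streamStart),
  );
  const dictionary = raw.slice(Math.max(windowStart, previousEnd), streamStart);
  return dictionary.includes('FlateDecode');
}

// Two tiny objects: the first claims FlateDecode, the second is stored uncompressed.
const sample =
  '1 0 obj\n<< /Filter /FlateDecode /Length 3 >>\nstream\nxyz\nendstream\nendobj\n' +
  '2 0 obj\n<< /Length 5 >>\nstream\nhello\nendstream\nendobj\n';

const firstStream = sample.indexOf('stream\nxyz');
const secondStream = sample.indexOf('stream\nhello');

console.log(usesFlate(sample, firstStream));  // true
console.log(usesFlate(sample, secondStream)); // false
```

With the plain 500-byte window, the second stream would be treated as compressed, inflateSync would throw, and its text would be skipped.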

Step 3 — Pull text from Tj and TJ

Two regexes do most of the work.

// Skip streams that don't have any text operators
if (!content.includes('Tj') && !content.includes('TJ')) continue;

// Undo the escapes that matter here: \n, octal (\101), then \( \) and \\
const unescapeLiteral = (s: string) =>
  s
    .replace(/\\n/g, '\n')
    .replace(/\\(\d{3})/g, (_, oct) => String.fromCharCode(parseInt(oct, 8)))
    .replace(/\\([()\\])/g, '$1');

// (text) Tj
for (const t of content.matchAll(/\(([^)\\]*(?:\\.[^)\\]*)*)\)\s*Tj/g)) {
  const s = unescapeLiteral(t[1]);
  if (s.trim()) texts.push(s);
}

// [(text) -250 (more) ...] TJ
for (const t of content.matchAll(/\[([^\]]+)\]\s*TJ/g)) {
  for (const p of t[1].matchAll(/\(([^)\\]*(?:\\.[^)\\]*)*)\)/g)) {
    const s = unescapeLiteral(p[1]);
    if (s.trim()) texts.push(s);
  }
}

The (?:\\.[^)\\]*)* mess is there to handle escaped characters inside parentheses — PDFs allow \) and \\ inside string literals, which would otherwise terminate the match early. The \\(\d{3}) pass converts octal escapes (\101 → A) that appear in some PDFs.
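The PDF string syntax has a few more escapes than the chained replacements above cover (\r, \t, \b, \f, one- and two-digit octal, and a backslash at end of line as a line continuation). The function below is a sketch of a single-pass decoder for them; unescapePdfString is a hypothetical name. Doing it in one pass also avoids misreading an escaped backslash followed by n as a newline.

```typescript
// Hypothetical helper: decode the backslash escapes of a PDF literal string in one pass.
function unescapePdfString(s: string): string {
  return s.replace(/\\(\r\n|\r|\n|[0-7]{1,3}|[\s\S])/g, (_, esc: string) => {
    // Octal escape: one to three digits, high-order overflow dropped.
    if (/^[0-7]/.test(esc)) return String.fromCharCode(parseInt(esc, 8) & 0xff);
    switch (esc) {
      case 'n': return '\n';
      case 'r': return '\r';
      case 't': return '\t';
      case 'b': return '\b';
      case 'f': return '\f';
      case '\n':
      case '\r':
      case '\r\n':
        return ''; // backslash at end of line: the string continues on the next line
      default:
        return esc; // \( \) \\ and unknown escapes: keep the character, drop the backslash
    }
  });
}

console.log(unescapePdfString('Total \\(net\\)')); // Total (net)
console.log(unescapePdfString('\\101BC'));         // ABC
```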

This is enough to extract the text from a clinical paper, a textbook chapter, or most other text-based PDFs you'll feed a RAG pipeline.
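For reference, here is one way the three steps might fit together in a single file. It is an assembled sketch rather than the exact code from the repo, with one deliberate variation: Tj and TJ are matched by a single regex so the output keeps the order in which strings appear in the stream. The names SHOW_TEXT, LITERAL, and decodeLiteral are mine, and the buffer at the bottom is a synthetic PDF-shaped fragment, not a valid PDF file.

```typescript
import { deflateSync, inflateSync } from 'zlib';

// A parenthesised PDF string literal, allowing backslash escapes such as \) and \\.
const LITERAL = /\(((?:[^)\\]|\\[\s\S])*)\)/g;
// Either "(text) Tj" or "[...] TJ", matched in one pass so output keeps stream order.
const SHOW_TEXT = /\(((?:[^)\\]|\\[\s\S])*)\)\s*Tj|\[([^\]]+)\]\s*TJ/g;

// Minimal escape handling: \n, three-digit octal, then \( \) and \\.
function decodeLiteral(s: string): string {
  return s
    .replace(/\\n/g, '\n')
    .replace(/\\([0-7]{3})/g, (_, oct: string) => String.fromCharCode(parseInt(oct, 8)))
    .replace(/\\([()\\])/g, '$1');
}

function extractPdfText(buffer: Buffer): string {
  const raw = buffer.toString('latin1');
  const texts: string[] = [];

  for (const m of raw.matchAll(/stream\r?\n([\s\S]*?)\r?\nendstream/g)) {
    const start = m.index ?? 0;
    const preceding = raw.slice(Math.max(0, start - 500), start);

    let content = m[1];
    if (preceding.includes('FlateDecode')) {
      try {
        // latin1 on both sides keeps the bytes intact.
        content = inflateSync(Buffer.from(m[1], 'latin1')).toString('latin1');
      } catch {
        continue; // corrupted or non-zlib stream, skip it
      }
    }

    for (const op of content.matchAll(SHOW_TEXT)) {
      // Group 1 is set for Tj; group 2 holds the whole array body for TJ.
      const literals = op[1] !== undefined
        ? [op[1]]
        : Array.from((op[2] ?? '').matchAll(LITERAL), (p) => p[1]);
      for (const lit of literals) {
        const s = decodeLiteral(lit);
        if (s.trim()) texts.push(s);
      }
    }
  }

  return texts.join(' ').replace(/\s+/g, ' ').trim();
}

// Smoke test on a synthetic, PDF-shaped buffer.
const body = deflateSync(
  Buffer.from('BT (Hello, world.) Tj [(Multiple) -250 (segments)] TJ ET', 'latin1'),
);
const fakePdf = Buffer.concat([
  Buffer.from(`1 0 obj\n<< /Filter /FlateDecode /Length ${body.length} >>\nstream\n`, 'latin1'),
  body,
  Buffer.from('\nendstream\nendobj\n', 'latin1'),
]);

console.log(extractPdfText(fakePdf)); // Hello, world. Multiple segments
```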

What this does NOT handle

I want to be honest: this parser is good enough for RAG, not good enough for everything.

It does not handle:

Scanned PDFs (image-only, no text streams). For those, you need OCR — Tesseract, AWS Textract, or Google Document AI.
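Complex font encodings (ToUnicode CMaps). Some PDFs encode characters with non-standard glyph IDs that need a CMap lookup, and often write those strings in hex form (<0048> Tj), which the parenthesis-only regexes above never match. Most consumer-generated PDFs (Word, Pages, LaTeX) are fine; weird editorial PDFs sometimes return garbage or nothing.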
Table structure. Text comes out in content-stream order, which usually matches reading order, but you lose the row/column relationships. Fine for chunking, bad for "extract this table as JSON."
Encrypted PDFs. Their streams are encrypted on top of the compression, so without a decryption step inflation fails and the stream is skipped, or the output is garbage.

For RAG, none of these usually matter. You're going to chunk the text anyway, embed it, and let semantic search do the heavy lifting.
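When one of these cases does show up, it helps to notice before embedding an empty or garbled document. The guard below is a rough sketch: assessExtraction is a hypothetical helper, both thresholds are illustrative guesses rather than tuned values, and it assumes the text came from a latin1-based parser like this one, so every character is in the 0x00 to 0xff range.

```typescript
type ExtractionQuality = 'ok' | 'empty' | 'garbled';

// Hypothetical guard; both thresholds are illustrative guesses, not tuned values.
function assessExtraction(
  text: string,
  minChars = 200,
  minReadableRatio = 0.9,
): ExtractionQuality {
  const trimmed = text.trim();
  if (trimmed.length < minChars) return 'empty'; // likely scanned or encrypted

  let readable = 0;
  for (let i = 0; i < trimmed.length; i++) {
    const code = trimmed.charCodeAt(i);
    // Printable ASCII, tab, newline, and the printable Latin-1 range count as readable.
    const isReadable =
      (code >= 0x20 && code <= 0x7e) ||
      code === 0x09 ||
      code === 0x0a ||
      (code >= 0xa0 && code <= 0xff);
    if (isReadable) readable++;
  }
  return readable / trimmed.length >= minReadableRatio ? 'ok' : 'garbled';
}

console.log(assessExtraction(''));                                 // empty
console.log(assessExtraction('The quick brown fox. '.repeat(20))); // ok
```

An upload route could return a clear "this PDF needs OCR" error on 'empty' instead of storing a document with zero useful chunks.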

Wiring it into the ingestion pipeline

Once the text is out, the rest of the ingestion is standard:

async function processDocument(blobUrl: string, sessionId: string, documentId: string) {
  const res = await fetch(blobUrl);
  if (!res.ok) throw new Error(`Failed to download PDF: ${res.status}`);
  const buffer = await res.arrayBuffer();
  const text = extractPdfText(Buffer.from(buffer));

  const chunks = chunkText(text);              // ~500 tokens each, 50-token overlap
  const embeddings = await embedBatch(chunks); // Voyage AI batch endpoint

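  // supabase-js reports failures in the returned object instead of throwing
  const { error: insertError } = await supabase.from('chunks').insert(
    chunks.map((c, i) => ({
      document_id: documentId,
      session_id: sessionId,
      content: c.content,
      embedding: embeddings[i],
      chunk_index: i,
      page_number: c.pageNumber,
    }))
  );
  if (insertError) throw insertError;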

  await supabase
    .from('documents')
    .update({ status: 'ready', chunk_count: chunks.length })
    .eq('id', documentId);
}
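The chunking step is not shown in this post. As a rough idea of what a sliding window with overlap looks like, here is a minimal sketch that approximates tokens with whitespace-separated words. chunkByWords is a hypothetical stand-in, not the chunkText from the repo, and it does not track page numbers.

```typescript
interface Chunk {
  content: string;
}

// Hypothetical stand-in for chunkText: windows of `size` words, each sharing
// `overlap` words with the previous one. Words are only a rough proxy for tokens.
function chunkByWords(text: string, size = 500, overlap = 50): Chunk[] {
  if (overlap >= size) throw new RangeError('overlap must be smaller than size');

  const words = text.split(/\s+/).filter(Boolean);
  const chunks: Chunk[] = [];

  for (let start = 0; start < words.length; start += size - overlap) {
    chunks.push({ content: words.slice(start, start + size).join(' ') });
    if (start + size >= words.length) break; // this window already reached the end
  }
  return chunks;
}

const demo = chunkByWords('one two three four five six seven', 4, 1);
console.log(demo.map((c) => c.content)); // [ 'one two three four', 'four five six seven' ]
```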

A subtle point that bit me: I originally ran processDocument in the background using after() from next/server. That dies on the Hobby plan, because after() only keeps the function alive up to its maximum duration, so ingestion that runs past the 10-second hard cap is cut off. The fix was unromantic: run ingestion synchronously inside the upload route handler, fast enough that a 1–10 page PDF finishes in 2–5 seconds. The user gets status: 'ready' directly in the response and never sees a "processing" spinner.

What I learned

Two takeaways I'll carry into every future serverless LLM project:

Treat browser-only dependencies as a serverless red flag. If a package is built on top of pdfjs-dist, canvas, jsdom, or any DOM API, assume it'll break in serverless and look for a Node-native alternative — even if it means writing a worse, simpler parser yourself. The maintenance cost of "weird unfixable production crash on Friday afternoon" outweighs the convenience of npm install pdf-parse.

A worse, smaller, owned parser beats a fancier one you can't debug. This parser is ~80 lines. I understand every byte of it. When something goes wrong, the stack trace points at a regex I wrote, not at three layers of abstraction inside a library. That ownership is worth a lot when you're shipping under a timeline.

The full code, with chunking, embedding, the dual memory system, and the chat agent on top, lives at:

🔗 github.com/NeryC/rag-agent-memory
🔗 Live demo: rag-agent-memory.vercel.app

If you're hitting the same DOMMatrix is not defined wall — paste my parser into your lib/ folder, npm uninstall pdf-parse, and ship.