dupehound: catch the code your AI wrote twice

Coding agents don't browse your repo before they write. They can't — they work from whatever fits in context. So when you ask a model to add date formatting, it writes formatDate. Three tickets later, another model writes renderTimestamp. A month after that, stringifyDate appears in the billing service. Three functions, same logic, aging separately, each one drifting under future edits.

dupehound is a Rust CLI that finds them. It fingerprints every function's structure — not its text — so renamed identifiers and swapped literals don't hide a copy. 1

GitHub repository: Rafaelpta/dupehound

https://github.com/Rafaelpta/dupehound

MIT license. v0.1.0, released 2026-06-12. Written in Rust with tree-sitter. No network calls, no API keys, no ML model. 1

Why an index, not a model

The author put it plainly in the README: "An LLM can't do this job. Duplicate detection compares every function against every other; a model samples what fits in context, an index checks everything." 1

That's not a knock on LLMs; it's a constraint. A context window is finite, while an inverted index covers every function in the repo. dupehound parses each function body with tree-sitter, normalizes every identifier, string, and number to a sentinel value, then applies winnowing (Schleimer, Wilkerson & Aiken, SIGMOD 2003) to generate structural fingerprints. Those fingerprints go into an inverted index, and exact Jaccard similarity over them decides which functions cluster together. The algorithm is deterministic: same input, same verdict. 1

A CI merge gate has to be reproducible. A probabilistic model that might return different results on two identical commits is not a usable gate. dupehound is.
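To make the pipeline concrete, here is a minimal Python sketch of the same three ideas: normalize tokens, fingerprint with winnowing, compare with Jaccard. It is not dupehound's code. A regex tokenizer stands in for tree-sitter, and the names normalize, winnow, and jaccard, the keyword list, and the k-gram and window sizes are all assumptions chosen for illustration.

```python
import re
import zlib

# Assumption: a regex tokenizer stands in for tree-sitter. It is only
# good enough for the small snippets below.
TOKEN_RE = re.compile(
    r"(?P<str>\"[^\"\n]*\"|'[^'\n]*')"   # string literal
    r"|(?P<num>\d+(?:\.\d+)?)"           # numeric literal
    r"|(?P<id>[A-Za-z_]\w*)"             # identifier or keyword
    r"|(?P<op>\S)"                       # any other single character
)

# Hypothetical keyword list; keywords are kept, other names are erased.
KEYWORDS = {"function", "return", "const", "let", "for", "of", "if", "else"}


def normalize(source):
    """Replace identifiers, strings, and numbers with sentinel tokens."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "str":
            tokens.append("STR")
        elif kind == "num":
            tokens.append("NUM")
        elif kind == "id" and text not in KEYWORDS:
            tokens.append("ID")
        else:
            tokens.append(text)
    return tokens


def winnow(tokens, k=5, window=4):
    """Hash every k-gram, then keep the minimum hash of each window."""
    # crc32 is used because it is stable across runs, unlike hash().
    hashes = [
        zlib.crc32(" ".join(tokens[i:i + k]).encode())
        for i in range(len(tokens) - k + 1)
    ]
    if not hashes:
        return set()
    if len(hashes) < window:
        return {min(hashes)}
    return {
        min(hashes[i:i + window])
        for i in range(len(hashes) - window + 1)
    }


def jaccard(a, b):
    """Exact Jaccard similarity of two fingerprint sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


FORMAT_DATE = '''
function formatDate(d) {
  const y = d.getFullYear();
  const m = String(d.getMonth() + 1).padStart(2, "0");
  return y + "-" + m;
}
'''

RENDER_TIMESTAMP = '''
function renderTimestamp(ts) {
  const year = ts.getFullYear();
  const month = String(ts.getMonth() + 1).padStart(2, "0");
  return year + "/" + month;
}
'''

SUM_ITEMS = '''
function sumItems(items) {
  let total = 0;
  for (const item of items) { total += item.price; }
  return total;
}
'''

fp_format = winnow(normalize(FORMAT_DATE))
fp_render = winnow(normalize(RENDER_TIMESTAMP))
fp_sum = winnow(normalize(SUM_ITEMS))

print(jaccard(fp_format, fp_render))  # renamed copy
print(jaccard(fp_format, fp_sum))     # unrelated function
```

Because names and literals collapse to sentinels before hashing, the two date functions yield identical fingerprint sets even though no line of their text matches, while the unrelated function scores lower. Nothing in the sketch is random, which is the property the gate depends on.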

Three subcommands

scan [path]: runs the full analysis and reports duplicate clusters sorted by deletable lines. Each cluster lists the representative function (marked with ★) and every copy, along with line counts and similarity percentages. The summary box reports a slop score: the percentage of significant lines that could be deleted by keeping only one copy of each duplicate cluster. 1

Here's what the output looks like on a codebase with a date-formatting problem:

$ dupehound scan .
dupehound v0.1.0 — scanned 19 files · 370 lines · 27 functions in 21ms
╭─────────────────────────────────────────────────────────╮
│ SLOP SCORE 36.1% grade F                                │
│ 127 of 352 significant lines are deletable duplicates  │
╰─────────────────────────────────────────────────────────╯
● Cluster 1 ─ 4 copies · 100% similar · 42 deletable lines ─────────────
★ src/utils/date.ts:1 formatDate 14 lines
  src/api/timestamps.ts:1 renderTimestamp 14 lines 100% █████████
  src/jobs/report_dates.ts:1 stringifyDate 14 lines 100% █████████
  src/billing/dates.ts:1 humanizeDate 14 lines 100% █████████
★ = representative (kept) · dupehound scan --explain 1 shows the code
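The numbers in this sample output are consistent with a simple reading of the score: a cluster's deletable lines are the lines of every copy except the representative, and the slop score is deletable lines divided by significant lines. The Python sketch below checks that arithmetic. The function names are hypothetical, and how dupehound decides which lines count as significant is not covered here.

```python
def deletable_lines(copy_line_counts):
    """Lines removable from one cluster if only one copy is kept.

    Assumes the first entry is the representative that stays.
    """
    return sum(copy_line_counts[1:])


def slop_score(deletable, significant):
    """Deletable lines as a percentage of significant lines."""
    return 100.0 * deletable / significant


# Cluster 1 from the sample output: four copies of 14 lines each.
cluster_1 = [14, 14, 14, 14]
print(deletable_lines(cluster_1))       # 42

# Totals from the summary box: 127 of 352 significant lines.
print(round(slop_score(127, 352), 1))   # 36.1
```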

history: charts duplication across git history using monthly snapshots. Useful for spotting the commit range where the slop score started climbing, which often correlates with the start of an AI-assisted sprint. 1

check: the CI gate. It fails when an incoming change introduces a new duplicate of an existing function, and it points to the function already in the codebase that the new code replicates, so the author can reuse or refactor it instead of merging a copy. 1

The history and check subcommands need git on PATH. scan runs anywhere.
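The idea behind a gate like check can be sketched in a few lines of Python, assuming a fingerprint set already exists for each function. The inverted index maps each fingerprint to the functions containing it, so a new function is compared only against functions it shares a fingerprint with. The names build_index and find_existing_copy, the 0.8 threshold, the integer fingerprints, and the src/cart/total.ts path are all hypothetical. This is an illustration of the technique, not dupehound's implementation.

```python
from collections import defaultdict


def build_index(fingerprints_by_function):
    """Map each fingerprint to the set of functions that contain it."""
    index = defaultdict(set)
    for name, prints in fingerprints_by_function.items():
        for fp in prints:
            index[fp].add(name)
    return index


def find_existing_copy(new_prints, index, known, threshold=0.8):
    """Return (name, similarity) of the closest known function, or None.

    Only functions sharing at least one fingerprint are compared.
    The threshold is an assumed value for illustration.
    """
    candidates = set()
    for fp in new_prints:
        candidates |= index.get(fp, set())

    best = None
    for name in sorted(candidates):  # sorted for a stable result
        other = known[name]
        similarity = len(new_prints & other) / len(new_prints | other)
        if similarity >= threshold and (best is None or similarity > best[1]):
            best = (name, similarity)
    return best


# Toy fingerprint sets; real ones would come from the winnowing step.
known = {
    "src/utils/date.ts:formatDate": {11, 12, 13, 14, 15},
    "src/cart/total.ts:sumItems": {21, 22, 23, 24},
}
index = build_index(known)

match = find_existing_copy({11, 12, 13, 14, 15}, index, known)
print(match)  # ('src/utils/date.ts:formatDate', 1.0)

novel = find_existing_copy({31, 32, 33}, index, known)
print(novel)  # None
```

A gate built this way would exit non-zero whenever the lookup returns a match, and print the matched name as the pointer to the original.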

Language support and benchmarks

dupehound supports 11 languages: TypeScript, TSX, JavaScript, Python, Rust, Go, Java, Ruby, Swift, C, and C++. 1

On speed: a scan of VS Code (2.97 million lines, 53,000 functions) completes in 3.6 seconds on a laptop. 1 Grade calibration against known open-source projects:

Project	Slop score	Grade
express	0.0%	A
gin	0.2%	A
tokio	1.1%	A
fastapi	1.7%	A
vscode	2.8%	A

These serve as a sanity check for the grading scale — well-maintained OSS projects with human reviewers cluster near zero.

A real scenario

Your team has been shipping features with Claude Code for six weeks. The codebase grew from 40k lines to 70k. Code review is harder to do thoroughly because the PRs are large and the code looks coherent. A senior engineer on the team suspects there's structural repetition building up but doesn't have a way to measure it.

Run dupehound scan . from the repo root. The scan takes under two seconds. If the slop score comes back at 15%+, you have a real problem: thousands of lines that are logically redundant, each one a future maintenance surface. The cluster report shows exactly which functions to consolidate, ranked by the lines you'd save.

Add dupehound check to your pre-commit hook or CI workflow. From that point, every PR that re-implements an existing function gets rejected with a pointer to the original — before it's merged, not three months later when two diverged copies both have bugs.
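One way to wire this into a git hook is a small Python wrapper that runs the command and passes its exit status through. This sketch assumes that dupehound check can be invoked with no arguments and that it signals failure with a non-zero exit status. Neither detail is confirmed here, so check the tool's own help output before relying on it. The run_gate name is hypothetical.

```python
import subprocess
import sys


def run_gate(command=("dupehound", "check")):
    """Run the gate command and return its exit status.

    Assumption: the command exits non-zero when it finds a new duplicate.
    A missing binary is treated as a failure (127) so the gate fails closed.
    """
    try:
        return subprocess.run(list(command)).returncode
    except FileNotFoundError:
        print(f"{command[0]} not found on PATH", file=sys.stderr)
        return 127
```

Saved as an executable .git/hooks/pre-commit with a final line of sys.exit(run_gate()), this would block a commit whenever the command exits non-zero. Failing closed on a missing binary is a design choice; a team that prefers not to block contributors who lack the tool could return 0 there instead.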

[Image: dupehound project mascot] 1

Install

Three paths:

Cargo (cross-platform, builds from source):

cargo install dupehound

Homebrew (macOS and Linux):

brew install rafaelpta/dupehound/dupehound

Prebuilt binaries — macOS, Linux, and Windows builds are available on the releases page. 2

scan has no external dependencies; history and check need git on PATH.

Momentum

22 stars, 5 forks, 18 commits. The repository was created June 11, and v0.1.0 was released June 12. 1 It is two days old at time of writing. There are no community discussion threads yet; the tool hasn't had time to surface on HN or Reddit. That's the leading-edge window: the kind of project that could land on r/rust or Show HN once a few teams hit a painful enough slop score and go looking for a tool like this.

The GitClear research the README cites gives the background: duplicated code blocks grew 8× in 2024, the first year in which copy-pasted lines outnumbered moved ones in aggregate across the repos they track. 1 dupehound is a direct response to that.

Version: v0.1.0
Language: Rust
License: MIT
Released: 2026-06-12

Caveats

v0.1.0, 18 commits. The core scan/history/check loop works, but edge-case handling in less common language grammars and very large monorepos isn't battle-tested yet. File issues — the author is clearly engaged.
Function-level granularity only. dupehound fingerprints function bodies, not arbitrary code blocks or multi-function sequences. If your duplication pattern is copy-pasted logic inside a single large function, it won't catch it.
No Windows Homebrew. The brew tap works on macOS and Linux. Windows users need the prebuilt binary from releases or a cargo install. 2
check is a pre-merge gate, not a remediation tool. It stops new duplication from entering; it doesn't generate consolidation patches or suggest refactors. The scan output tells you what exists; cleaning it up is on you.

Quick start: cargo install dupehound && dupehound scan .

Cover image: AI-generated illustration of duplicate function clusters