How I Built a Phishing Detector Using Only Classical Algorithms — No ML Required
Every day, millions of people fall victim to phishing attacks. The attacker doesn't hack your bank — they just register paypa1.com, build a fake login page, and wait. You type your password. They steal it.
I built PhishGuard: a phishing detection system that catches these attacks using nothing but classical algorithms from my Algorithms course. No machine learning. No training data. No internet connection required.
Here's exactly how it works.
The Problem With Existing Solutions
When I started this project, I looked at how phishing is typically detected:
- Blacklists maintain a database of known phishing domains. The problem? They're always one step behind. A newly registered
g00gle.comis invisible until someone reports it. - Machine Learning models are accurate but require massive labeled datasets, significant compute, and constant retraining as attack patterns evolve.
My question: a phishing domain is always visually similar to the real one. paypa1 looks like paypal. micosoft looks like microsoft. Can we detect that similarity algorithmically — without any of the overhead? The answer is yes.
The Core Insight
Phishing domains share a fundamental property: they must be recognizable to the victim. An attacker can't register xkq92bz.com and expect anyone to confuse it with PayPal. They need something close. That means phishing detection is, at its core, a string similarity problem.
If we can measure how similar a suspicious domain is to a database of trusted domains — and flag anything above a threshold — we have a working detector.
The Three Algorithms
PhishGuard uses three algorithms in sequence, each solving a different part of the problem.
Algorithm 1 — LCS (Dynamic Programming)
The Longest Common Subsequence algorithm finds the longest sequence of characters shared between two strings in the same order. It's a classic Dynamic Programming problem.
def compute_lcs(X, Y):
m, n = len(X), len(Y)
dp = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(1, m + 1):
for j in range(1, n + 1):
if X[i-1] == Y[j-1]:
dp[i][j] = dp[i-1][j-1] + 1
else:
dp[i][j] = max(dp[i-1][j], dp[i][j-1])
return dp[m][n]
def similarity(X, Y):
return (compute_lcs(X, Y) / max(len(X), len(Y))) * 100
The similarity score is the LCS length divided by the longer string's length — giving a percentage between 0 and 100.
Example: X = "google", Y = "google" (after normalizing g00gle) → LCS = 6 → score = 100% → PHISHING
Complexity: $O(m \times n)$ time and space, where m and n are string lengths. Domain names are short (typically under 15 characters), so this is extremely fast in practice.
Algorithm 2 — Greedy Selection
Here's the problem: if our trusted database has 10,000 domains, running LCS against all of them for every query is expensive. That's $10,000 \times O(m \times n)$ operations per URL.
The Greedy algorithm solves this by pruning the database first using two cheap heuristics — no LCS involved:
- Length proximity: if
|len(input) - len(domain)| > 4, skip it. Typos rarely change domain length significantly. - Character overlap: if the first characters don't match AND the Jaccard similarity of character sets is below 40%, skip it.
def greedy_select(hostname, domain_db, k=20):
candidates = []
for domain in domain_db:
if abs(len(hostname) - len(domain)) <= 4:
if hostname[0] == domain[0] or char_overlap(hostname, domain) >= 0.4:
candidates.append(domain)
return candidates[:20]
This reduces 10,000 domains to at most 20 candidates in $O(N)$ time. LCS then runs only on those 20. The expensive $O(m \times n)$ computation runs at most 20 times per query, making it effectively constant regardless of database size.
Algorithm 3 — Max-Heap Ranking
After scoring all candidates, we need the top 3 most similar domains efficiently. A Max-Heap extracts the maximum element in $O(\log k)$ time. Python's heapq module is a Min-Heap by default, so we store negative scores to simulate Max-Heap behavior:
import heapq
def rank_results(candidates, hostname):
heap = []
for domain in candidates:
score = similarity(hostname, domain)
heapq.heappush(heap, (-score, domain)) # negate = max-heap trick
top3 = []
for _ in range(min(3, len(heap))):
neg_score, domain = heapq.heappop(heap)
top3.append((-neg_score, domain))
return top3
Complexity: $O(k \log k)$ where $k \le 20$ — effectively constant.
The Full Pipeline
Total complexity: $O(N)$ — dominated by the linear Greedy scan.
Handling Real-World Attack Variants
Basic LCS isn't enough. Attackers use tricks that pure string comparison misses, like Leet-speak substitution (g00gle) or Homoglyph attacks (аmazon.com using Cyrillic 'а'). The solution is normalization before comparison:
_LEET = str.maketrans("013456789", "oleasgtbg")
def full_normalize(s):
# Converts homoglyphs to ASCII and replaces leet-speak numbers
return s.lower().translate(_LEET)
Test Results
| Input | Attack Type | Score | Result |
|---|---|---|---|
| g00gle.com | Leet-speak | 100% | ✅ PHISHING |
| paypa1.com | Digit substitution | 100% | ✅ PHISHING |
| micosoft.com | Missing letter | 88.9% | ✅ PHISHING |
| pay-pal.com | Hyphen insertion | 100% | ✅ PHISHING |
| amazon.com | Legitimate | — | ✅ SAFE |
Optimizing the Bottleneck
After profiling execution time, a clear bottleneck emerged: Greedy scan took 4,328ms (99.8% of total time). The fix? An Inverted Bigram Index. Instead of scanning all N domains, we map every 2-character substring to the domains containing it at startup.
| DB Size | Phase 2 (Greedy Scan) | Phase 3 (Bigram Index) | Speedup |
|---|---|---|---|
| 1,000 | 5.52ms | 1.27ms | 4.3x |
| 5,000 | 21.73ms | 4.21ms | 5.2x |
| 10,000 | 43.97ms | 8.35ms | 5.3x |
What I Learned
- Algorithm selection matters more than implementation: A clever algorithm design beats micro-optimization every time.
- Profile before optimizing: I assumed LCS was the bottleneck. Profiling revealed Greedy was consuming 99.8% of time.
- Classical algorithms still solve real problems: You don't always need complex ML; a solid pipeline can run under 10ms offline and be fully explainable.
Limitations
To be honest about what this system doesn't do yet: The 75% threshold was set manually, the database needs scaling to millions of records, and it currently performs string-only analysis without checking SSL certificates.