[ DATE: 2026-05-24 ] | [ CATEGORY: CYBERSECURITY & ALGORITHMS ]

How I Built a Phishing Detector Using Only Classical Algorithms — No ML Required

Every day, millions of people fall victim to phishing attacks. The attacker doesn't hack your bank — they just register paypa1.com, build a fake login page, and wait. You type your password. They steal it.

I built PhishGuard: a phishing detection system that catches these attacks using nothing but classical algorithms from my Algorithms course. No machine learning. No training data. No internet connection required.

Here's exactly how it works.

The Problem With Existing Solutions

When I started this project, I looked at how phishing is typically detected:

Blacklists maintain a database of known phishing domains. The problem? They're always one step behind. A newly registered g00gle.com is invisible until someone reports it.
Machine Learning models are accurate but require massive labeled datasets, significant compute, and constant retraining as attack patterns evolve.

My question: a phishing domain is always visually similar to the real one. paypa1 looks like paypal. micosoft looks like microsoft. Can we detect that similarity algorithmically — without any of the overhead? The answer is yes.

The Core Insight

Phishing domains share a fundamental property: they must be recognizable to the victim. An attacker can't register xkq92bz.com and expect anyone to confuse it with PayPal. They need something close. That means phishing detection is, at its core, a string similarity problem.

If we can measure how similar a suspicious domain is to a database of trusted domains — and flag anything above a threshold — we have a working detector.

The Three Algorithms

PhishGuard uses three algorithms in sequence, each solving a different part of the problem.

Algorithm 1 — LCS (Dynamic Programming)

The Longest Common Subsequence algorithm finds the longest sequence of characters shared between two strings in the same order. It's a classic Dynamic Programming problem.

def compute_lcs(X, Y):
    m, n = len(X), len(Y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i-1] == Y[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    return dp[m][n]

def similarity(X, Y):
    return (compute_lcs(X, Y) / max(len(X), len(Y))) * 100

The similarity score is the LCS length divided by the longer string's length — giving a percentage between 0 and 100.

Example: X = "google", Y = "google" (after normalizing g00gle) → LCS = 6 → score = 100% → PHISHING

Complexity: $O(m \times n)$ time and space, where m and n are string lengths. Domain names are short (typically under 15 characters), so this is extremely fast in practice.

Algorithm 2 — Greedy Selection

Here's the problem: if our trusted database has 10,000 domains, running LCS against all of them for every query is expensive. That's $10,000 \times O(m \times n)$ operations per URL.

The Greedy algorithm solves this by pruning the database first using two cheap heuristics — no LCS involved:

Length proximity: if |len(input) - len(domain)| > 4, skip it. Typos rarely change domain length significantly.
Character overlap: if the first characters don't match AND the Jaccard similarity of character sets is below 40%, skip it.

def greedy_select(hostname, domain_db, k=20):
    candidates = []
    for domain in domain_db:
        if abs(len(hostname) - len(domain)) <= 4:
            if hostname[0] == domain[0] or char_overlap(hostname, domain) >= 0.4:
                candidates.append(domain)
    return candidates[:20]

This reduces 10,000 domains to at most 20 candidates in $O(N)$ time. LCS then runs only on those 20. The expensive $O(m \times n)$ computation runs at most 20 times per query, making it effectively constant regardless of database size.

Algorithm 3 — Max-Heap Ranking

After scoring all candidates, we need the top 3 most similar domains efficiently. A Max-Heap extracts the maximum element in $O(\log k)$ time. Python's heapq module is a Min-Heap by default, so we store negative scores to simulate Max-Heap behavior:

import heapq

def rank_results(candidates, hostname):
    heap = []
    for domain in candidates:
        score = similarity(hostname, domain)
        heapq.heappush(heap, (-score, domain))  # negate = max-heap trick
    
    top3 = []
    for _ in range(min(3, len(heap))):
        neg_score, domain = heapq.heappop(heap)
        top3.append((-neg_score, domain))
    return top3

Complexity: $O(k \log k)$ where $k \le 20$ — effectively constant.

The Full Pipeline

URL Input ↓ Hostname Extraction → strip protocol, www, path ↓ Greedy Filter → 10,000 domains → 20 candidates O(N) ↓ LCS Scoring → similarity % for each candidate O(m×n) × 20 ↓ Max-Heap Ranking → Top 3 results O(k log k) ↓ Verdict: PHISHING (≥75%) or SAFE

Total complexity: $O(N)$ — dominated by the linear Greedy scan.

Handling Real-World Attack Variants

Basic LCS isn't enough. Attackers use tricks that pure string comparison misses, like Leet-speak substitution (g00gle) or Homoglyph attacks (аmazon.com using Cyrillic 'а'). The solution is normalization before comparison:

_LEET = str.maketrans("013456789", "oleasgtbg")

def full_normalize(s):
    # Converts homoglyphs to ASCII and replaces leet-speak numbers
    return s.lower().translate(_LEET)

Test Results

Input	Attack Type	Score	Result
g00gle.com	Leet-speak	100%	✅ PHISHING
paypa1.com	Digit substitution	100%	✅ PHISHING
micosoft.com	Missing letter	88.9%	✅ PHISHING
pay-pal.com	Hyphen insertion	100%	✅ PHISHING
amazon.com	Legitimate	—	✅ SAFE

Optimizing the Bottleneck

After profiling execution time, a clear bottleneck emerged: Greedy scan took 4,328ms (99.8% of total time). The fix? An Inverted Bigram Index. Instead of scanning all N domains, we map every 2-character substring to the domains containing it at startup.

DB Size	Phase 2 (Greedy Scan)	Phase 3 (Bigram Index)	Speedup
1,000	5.52ms	1.27ms	4.3x
5,000	21.73ms	4.21ms	5.2x
10,000	43.97ms	8.35ms	5.3x

What I Learned

Algorithm selection matters more than implementation: A clever algorithm design beats micro-optimization every time.
Profile before optimizing: I assumed LCS was the bottleneck. Profiling revealed Greedy was consuming 99.8% of time.
Classical algorithms still solve real problems: You don't always need complex ML; a solid pipeline can run under 10ms offline and be fully explainable.

Limitations

To be honest about what this system doesn't do yet: The 75% threshold was set manually, the database needs scaling to millions of records, and it currently performs string-only analysis without checking SSL certificates.