<- // back to main portfolio

How I Built a Phishing Detector Using Only Classical Algorithms — No ML Required

Every day, millions of people fall victim to phishing attacks. The attacker doesn't hack your bank — they just register paypa1.com, build a fake login page, and wait. You type your password. They steal it.

I built PhishGuard: a phishing detection system that catches these attacks using nothing but classical algorithms from my Algorithms course. No machine learning. No training data. No internet connection required.

Here's exactly how it works.

The Problem With Existing Solutions

When I started this project, I looked at how phishing is typically detected:

My question: a phishing domain is always visually similar to the real one. paypa1 looks like paypal. micosoft looks like microsoft. Can we detect that similarity algorithmically — without any of the overhead? The answer is yes.

The Core Insight

Phishing domains share a fundamental property: they must be recognizable to the victim. An attacker can't register xkq92bz.com and expect anyone to confuse it with PayPal. They need something close. That means phishing detection is, at its core, a string similarity problem.

If we can measure how similar a suspicious domain is to a database of trusted domains — and flag anything above a threshold — we have a working detector.

The Three Algorithms

PhishGuard uses three algorithms in sequence, each solving a different part of the problem.

Algorithm 1 — LCS (Dynamic Programming)

The Longest Common Subsequence algorithm finds the longest sequence of characters shared between two strings in the same order. It's a classic Dynamic Programming problem.

def compute_lcs(X, Y):
    m, n = len(X), len(Y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i-1] == Y[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    return dp[m][n]

def similarity(X, Y):
    return (compute_lcs(X, Y) / max(len(X), len(Y))) * 100

The similarity score is the LCS length divided by the longer string's length — giving a percentage between 0 and 100.

Example: X = "google", Y = "google" (after normalizing g00gle) → LCS = 6 → score = 100% → PHISHING

Complexity: $O(m \times n)$ time and space, where m and n are string lengths. Domain names are short (typically under 15 characters), so this is extremely fast in practice.

Algorithm 2 — Greedy Selection

Here's the problem: if our trusted database has 10,000 domains, running LCS against all of them for every query is expensive. That's $10,000 \times O(m \times n)$ operations per URL.

The Greedy algorithm solves this by pruning the database first using two cheap heuristics — no LCS involved:

def greedy_select(hostname, domain_db, k=20):
    candidates = []
    for domain in domain_db:
        if abs(len(hostname) - len(domain)) <= 4:
            if hostname[0] == domain[0] or char_overlap(hostname, domain) >= 0.4:
                candidates.append(domain)
    return candidates[:20]

This reduces 10,000 domains to at most 20 candidates in $O(N)$ time. LCS then runs only on those 20. The expensive $O(m \times n)$ computation runs at most 20 times per query, making it effectively constant regardless of database size.

Algorithm 3 — Max-Heap Ranking

After scoring all candidates, we need the top 3 most similar domains efficiently. A Max-Heap extracts the maximum element in $O(\log k)$ time. Python's heapq module is a Min-Heap by default, so we store negative scores to simulate Max-Heap behavior:

import heapq

def rank_results(candidates, hostname):
    heap = []
    for domain in candidates:
        score = similarity(hostname, domain)
        heapq.heappush(heap, (-score, domain))  # negate = max-heap trick
    
    top3 = []
    for _ in range(min(3, len(heap))):
        neg_score, domain = heapq.heappop(heap)
        top3.append((-neg_score, domain))
    return top3

Complexity: $O(k \log k)$ where $k \le 20$ — effectively constant.

The Full Pipeline

URL Input ↓ Hostname Extraction → strip protocol, www, path ↓ Greedy Filter → 10,000 domains → 20 candidates O(N) ↓ LCS Scoring → similarity % for each candidate O(m×n) × 20 ↓ Max-Heap Ranking → Top 3 results O(k log k) ↓ Verdict: PHISHING (≥75%) or SAFE

Total complexity: $O(N)$ — dominated by the linear Greedy scan.

Handling Real-World Attack Variants

Basic LCS isn't enough. Attackers use tricks that pure string comparison misses, like Leet-speak substitution (g00gle) or Homoglyph attacks (аmazon.com using Cyrillic 'а'). The solution is normalization before comparison:

_LEET = str.maketrans("013456789", "oleasgtbg")

def full_normalize(s):
    # Converts homoglyphs to ASCII and replaces leet-speak numbers
    return s.lower().translate(_LEET)

Test Results

Input Attack Type Score Result
g00gle.com Leet-speak 100% ✅ PHISHING
paypa1.com Digit substitution 100% ✅ PHISHING
micosoft.com Missing letter 88.9% ✅ PHISHING
pay-pal.com Hyphen insertion 100% ✅ PHISHING
amazon.com Legitimate ✅ SAFE

Optimizing the Bottleneck

After profiling execution time, a clear bottleneck emerged: Greedy scan took 4,328ms (99.8% of total time). The fix? An Inverted Bigram Index. Instead of scanning all N domains, we map every 2-character substring to the domains containing it at startup.

DB Size Phase 2 (Greedy Scan) Phase 3 (Bigram Index) Speedup
1,000 5.52ms 1.27ms 4.3x
5,000 21.73ms 4.21ms 5.2x
10,000 43.97ms 8.35ms 5.3x

What I Learned

Limitations

To be honest about what this system doesn't do yet: The 75% threshold was set manually, the database needs scaling to millions of records, and it currently performs string-only analysis without checking SSL certificates.