算法学习（一）

kuteliyafuka2025-07-082025-07-08

算法学习（一）

ref:
1、Charging Station: Generating the Neighborhood of a String(飞书)
2、Motif Finding Is More Difficult Than You Think(飞书)

🧬 Motif Finding 笔记

🎯 核心问题：Motif Enumeration

Motif Enumeration 是一种基础的 motif（模式）发现方法。目标是在一组 DNA 序列中找出所有长度为 k，且在每个序列中都至少出现一次、与真实 motif 最多相差 d 个碱基（即允许 d 个 mismatch）的模式。

以下是 Motif Enumeration 的伪代码：

1 MotifEnumeration(Dna, k, d)
2     Patterns ← an empty set
3     for each k-mer Pattern in the first string in Dna
4         for each k-mer Pattern’ differing from Pattern by at most d mismatches
5             if Pattern' appears in each string from Dna with at most d mismatches
6                 add Pattern' to Patterns
7     remove duplicates from Patterns
8     return Patterns

📌 学习内容记录

🧠 Frequent Words Problem

问题描述：
假设有 10 条 DNA 序列，每条序列中都包含一段相同的 motif（长度为 15）。如何通过统计出现频率最多的子串，发现这一段 motif？

示例代码（简化）如下：

text = "atcgatcgatcgatcg"
k = 4
count_dict = {}

for i in range(len(text) - k + 1):
    seq = text[i:i + k]
    count_dict[seq] = count_dict.get(seq, 0) + 1

# 最后找出频率最高的 k-mers
frequent_kmers = [kmer for kmer, count in count_dict.items() if count == max(count_dict.values())]

实际中我们会把这个逻辑扩展到所有的 DNA 序列中，对所有 k-mers 进行频率统计。

🧬 Mismatch Problem

问题描述：
给定一个模式 Pattern 和一个最大允许错配数 d，如何找到所有与 Pattern 的汉明距离 ≤ d 的近邻序列？
这类问题需要生成一个“邻居集合”，称为 Neighbors(Pattern, d)。

递归实现如下：

1 Neighbors(Pattern, d)
2     if d == 0:
3         return {Pattern}
4     if len(Pattern) == 1:
5         return {"A", "C", "G", "T"}
6     Neighborhood = set()
7     SuffixNeighbors = Neighbors(Suffix(Pattern), d)
8     for Text in SuffixNeighbors:
9         if HammingDistance(Suffix(Pattern), Text) < d:
10            for x in ["A", "C", "G", "T"]:
11                Neighborhood.add(x + Text)
12        else:
13            Neighborhood.add(Pattern[0] + Text)
14    return Neighborhood

✅ 关键思想解释：

将 Pattern 分为首字符和后缀：
Pattern = FirstSymbol + Suffix(Pattern)
递归构造 Suffix(Pattern) 的 d-neighbors。
若 Text 与 Suffix(Pattern) 的距离 < d，则说明我们还可以“消耗”1个错配机会，因此前缀可任意选择（A, C, G, T）。
否则，保留原始前缀字符。