这句话中的一些术语可以具体解释一下。「信号序列」指的是一个按顺序排列的信号,例如字符串从前到后的每一个字符、数组从 1 到 n 的每一个数、数从高到低的每一位等。「判断是否满足某种规则」,可以理解为:我们关心这个序列是否属于某个特定的集合。这个集合由我们事先设定好的规则来定义,例如「所有长度为偶数的二进制串」或「所有回文串」。
Algorithm Hopcroft’s Algorithm(Q,Σ,δ,q0,F):Input. DFA A=(Q,Σ,δ,q0,F).Output. A partition of Q into equivalence classes of the minimal DFA.Method. 1234567891011121314P←{F,Q∖F}W←{F}while W=∅choose and remove any A∈Wfor each c∈ΣS←{q∈Q∣δ(q,c)∈A}for each Y∈P such that S∩Y=∅ and Y∖S=∅Y1←S∩Y,Y2←Y∖SP←(P∖{Y})∪{Y1,Y2}if Y∈WW←(W∖{Y})∪{Y1,Y2}elseadd the smaller of Y1 and Y2 to Wreturn P
算法实现时,复杂度的瓶颈在于 S 的计算。直接遍历所有 q∈Q 进而判断 δ(q,c)∈A 是否成立是不可行的。因此,需要在算法运行前,预处理反向转移边 {q∈Q∣δ(q,c)=a},从而,利用这些反向转移,遍历 a∈A,就可以得到集合 S。这样做可以保证每条转移 δ(q,c)=a 只会在 a 属于某个证据时才会遍历到;而前文的证据筛选方法保证了,算法中实际用到的包含 a 的证据序列 A1⊃A2⊃⋯⊃Ak 中,前一个至少是后一个的两倍大小,因此,k∈O(logn)。也就是说,每条转移边至多只会遍历 O(logn) 次,而总的转移数目是 n∣Σ∣ 的,因此,总的复杂度就是 O(n∣Σ∣logn) 的。
// DFA minimization via Hopcroft's algorithm.// Complexity: O(n * m * log(n)).DFA DFA::hopcroft_minimize() const { // Construct inverse transition maps: // - pre[c] stores states sorted by the target of transition c. // - pos[c][s] is the start index in pre[c] of transitions going to state s. std::vector<std::vector<int>> pre(m), pos(m); for (int c = 0; c < m; ++c) { pre[c].assign(n, 0); pos[c].assign(n + 1, 0); // Counting sort. for (int i = 0; i < n; ++i) ++pos[c][trans[c][i]]; for (int i = 0; i < n; ++i) pos[c][i + 1] += pos[c][i]; for (int i = 0; i < n; ++i) pre[c][--pos[c][trans[c][i]]] = i; } // Partition element structure: // - os: starting index in the state list. // - sz: number of states in this class. // - cnt: temporary count of marked states during refinement. struct EquivClasses { int os, sz, cnt; EquivClasses(int os, int sz, int cnt) : os(os), sz(sz), cnt(cnt) {} }; // Partition and helper data structures. std::vector<EquivClasses> ec; // Current list of equivalence classes. std::vector<int> ids(n); // Permutation of states, grouped by ECs. std::vector<int> par(n); // Maps state to its EC index. std::vector<bool> tag(n); // Temporary marking for splitting. std::queue<int> evidences; // Worklist of ECs to check. // Initial partition by acceptance label. std::iota(ids.begin(), ids.end(), 0); std::sort(ids.begin(), ids.end(), [&](int l, int r) { return acc[l] < acc[r]; }); for (int l = 0, r; l < n; l = r) { for (r = l; r < n && acc[ids[r]] == acc[ids[l]]; ++r) par[ids[r]] = ec.size(); if (l) evidences.push(ec.size()); // Add all but first class to worklist. ec.emplace_back(l, r - l, 0); } // Refinement loop. while (!evidences.empty()) { int cr = evidences.front(); evidences.pop(); for (int c = 0; c < m; ++c) { std::vector<int> todo; for (int i = ec[cr].os; i < ec[cr].os + ec[cr].sz; ++i) { for (int k = pos[c][ids[i]]; k < pos[c][ids[i] + 1]; ++k) { int j = pre[c][k]; if (!tag[j]) { if (!ec[par[j]].cnt) todo.push_back(par[j]); ++ec[par[j]].cnt; tag[j] = true; } } } // Perform splits. for (int i : todo) { int ti = i; if (ec[i].cnt != ec[i].sz) { // Split into two: larger vs smaller segment. bool majority_tagged = ec[i].cnt * 2 >= ec[i].sz; int mid = std::partition(ids.begin() + ec[i].os, ids.begin() + ec[i].os + ec[i].sz, [&](int x) { return tag[x] == majority_tagged; }) - ids.begin() - ec[i].os; // Assign new EC index to the smaller segment. for (int j = ec[i].os + mid; j < ec[i].os + ec[i].sz; ++j) par[ids[j]] = ec.size(); evidences.push(ec.size()); if (!majority_tagged) ti = ec.size(); ec.emplace_back(ec[i].os + mid, ec[i].sz - mid, 0); ec[i].sz = mid; } // Clear temporary counters and tags. ec[i].cnt = 0; for (int j = ec[ti].os; j < ec[ti].os + ec[ti].sz; ++j) tag[ids[j]] = false; } } } // Build minimized DFA. DFA res(m, ec.size(), par[q0]); for (const auto& e : ec) { int i = ids[e.os]; // Representative state. res.acc[par[i]] = acc[i]; for (int c = 0; c < m; ++c) res.trans[c][par[i]] = par[trans[c][i]]; } return res;}
这一参考实现允许自动机的状态带有任何整数取值的标签,而并非简单的「接受」与「不接受」的二元标签。从参考实现可以看出,与基础 Hopcroft 算法的唯一不同就在于初始划分和证据集合的构造。这种拓展的自动机也称为 Moore 机。它的一个应用可以看本节的第二个例题。
例题
本节通过两道例题介绍如何实际应用 DFA 最小化的技巧。
例题
给定一个长度为 n 的 01? 串 a,初始变量 x=0,我们按顺序遍历每一位 ai 并执行如下操作:
#include <bitset>#include <cmath>#include <iostream>#include <unordered_map>#include <vector>constexpr int B = 10, L = 91;using state = std::bitset<L>;DFA raw_dfa(B), dfa(B);std::unordered_map<state, int> ids;// Construct a DFA by DFS.int dfs(const state& cr) { if (ids.count(cr)) return ids[cr]; int id = ids[cr] = raw_dfa.n++; // Check if accepted. for (int i = 0; i < B; ++i) { if (cr[i]) { raw_dfa.acc.push_back(i); break; } } // Construct transitions recursively. for (int c = 0; c < B; ++c) { raw_dfa.trans[c].push_back(0); } for (int c = 0; c < B; ++c) { state nt; for (int i = 0; i < L; ++i) { if (cr[i]) { if (i + c < L) nt[i + c] = true; nt[std::abs(i - c)] = true; } } raw_dfa.trans[c][id] = dfs(nt); } return id;};// DP.std::vector<long long> memo;std::vector<int> nums;long long sol(int x, int len, bool lim, int k) { if (!len) return dfa.acc[x] <= k; auto key = (x * 20 + len) * B + k; if (!lim && memo[key] != -1) return memo[key]; long long res = 0; for (int c = 0; c <= (lim ? nums[len - 1] : B - 1); ++c) res += sol(dfa.trans[c][x], len - 1, lim && c == nums[len - 1], k); return lim ? res : memo[key] = res;};long long calc(long long n, int k) { nums.clear(); for (; n; n /= B) nums.push_back(n % B); return sol(dfa.q0, nums.size(), true, k);};int main() { // Construct a DFA. for (int c = 0; c < B; ++c) raw_dfa.trans[c].reserve(20000); dfs(1); // DFA minimization. dfa = raw_dfa.hopcroft_minimize(); memo.assign(dfa.n * 20 * B, -1); // Queries. int t; std::cin >> t; for (; t; --t) { long long l, r; int k; std::cin >> l >> r >> k; std::cout << (calc(r, k) - calc(l - 1, k)) << '\n'; } return 0;}
Knuutila, Timo. “Re-describing an algorithm by Hopcroft.” Theoretical Computer Science 250, no. 1-2 (2001): 333-363.
Hopcroft, John E., Rajeev Motwani, and Jeffrey D. Ullman. “Introduction to automata theory, languages, and computation.” Acm Sigact News 32, no. 1 (2001): 60-65.
算法实现中有一处细节:对于一个证据 A,有可能检验完一部分字符后,这个证据集合就已经分裂为 B 和 C 了。不妨设 ∣B∣≥∣C∣。由于参考实现中,较小的集合 C 插入到了证据队列的末尾,而较大的证据集合 B 替换到了集合 A 原来的位置。算法继续运行时,实际只是利用证据 B 检验剩余的字符。这样做是正确的。这是因为对于已经检验完的字符,至少验证了 A 和 C 两个集合;而对于尚未检验的字符,至少验证了 B 和 C 两个集合。 ↩