一、背景
需求非常簡(jiǎn)單,給定一組關(guān)鍵詞,需要將商品名稱中出現(xiàn)過的關(guān)鍵字替換掉;
如:skuName="HUAWEI Pura 70 Pro 國(guó)家補(bǔ)貼500元 羽砂黑 12GB+512GB 超高速風(fēng)馳閃拍 華為鴻蒙智能手機(jī)" 需要替換成
skuName="HUAWEI Pura 70 Pro 羽砂黑 12GB+512GB 超高速風(fēng)馳閃拍 華為鴻蒙智能手機(jī)" 這里的關(guān)鍵字"國(guó)家補(bǔ)貼500元";
直接skuName.replace("國(guó)家補(bǔ)貼500元", ""),不就可以了嗎?如果是一組,那就循環(huán)替換就完了嘛,再考慮到關(guān)鍵字前綴問題,對(duì)這一組關(guān)鍵詞,按字符長(zhǎng)度進(jìn)行排序,先替換長(zhǎng)的關(guān)鍵詞,再替換短的就ok了;
如果這一組關(guān)鍵詞非常多,上千個(gè)怎么辦?真實(shí)場(chǎng)景也是這樣的,一般需要替換的關(guān)鍵詞都是比較多,并且使用String.replace上線后,直接CPU打滿,基本不可用;
這個(gè)字段替換本質(zhì)上與敏感詞過濾是一樣的原理,針對(duì)敏感詞的深入研究,出現(xiàn)了 Aho-Corasick(AC自動(dòng)機(jī)) 算法;
Aho-Corasick(AC自動(dòng)機(jī))是一種多模式字符串匹配算法,結(jié)合了Trie樹的前綴匹配能力和KMP算法的失敗跳轉(zhuǎn)思想,能夠在單次文本掃描中高效匹配多個(gè)模式串。其核心優(yōu)勢(shì)在于時(shí)間復(fù)雜度為O(n + m + z)(n為文本長(zhǎng)度,m為模式串總長(zhǎng)度,z為匹配次數(shù)),適用于敏感詞過濾、基因序列分析等場(chǎng)景。
?
二、方案
針對(duì)這幾種算法進(jìn)行對(duì)比;
字符串替換,定義一個(gè)接口,通過4個(gè)不同的方案實(shí)現(xiàn),進(jìn)行性能對(duì)比
public interface Replacer { String replaceKeywords(String text); }
2.1 String.replace 方案
這種方案最簡(jiǎn)單,也是關(guān)鍵詞少的時(shí)候,最有效,最好用的;
public class StrReplacer implements Replacer {
private final List keyWordList;
public StrReplacer(String keyWords) {
this.keyWordList = Lists.newArrayList(keyWords.split(";"));
// 按關(guān)鍵字長(zhǎng)度降序排序,確保長(zhǎng)關(guān)鍵字優(yōu)先匹配
keyWordList.sort((a, b) -> Integer.compare(b.length(), a.length()));
}
/**
* 替換文本中所有匹配的關(guān)鍵字為空字符串
*/
@Override
public String replaceKeywords(String text) {
String newTxt = text;
for (String s : keyWordList) {
newTxt = newTxt.replace(s, "");
}
return newTxt;
}
}
2.2 使用正則替換
String.replace本質(zhì),還是使用正則進(jìn)行替換的,通過代碼實(shí)現(xiàn)使用編譯好的正則進(jìn)行替換性能會(huì)好于直接使用replace;
String.replace的實(shí)現(xiàn)
public String replace(CharSequence target, CharSequence replacement) {
return Pattern.compile(target.toString(), Pattern.LITERAL).matcher(
this).replaceAll(Matcher.quoteReplacement(replacement.toString()));
}
使用正則替換的實(shí)現(xiàn)
public class PatternReplacer implements Replacer {
// 預(yù)編譯正則表達(dá)式模式
private final Pattern pattern;
public PatternReplacer(String keyWords) {
List keywords = Lists.newArrayList(keyWords.split(";"));
// 按關(guān)鍵字長(zhǎng)度降序排序,確保長(zhǎng)關(guān)鍵字優(yōu)先匹配
keywords.sort((a, b) -> Integer.compare(b.length(), a.length()));
// 轉(zhuǎn)義每個(gè)關(guān)鍵字并用|連接
String regex = keywords.stream()
.map(Pattern::quote)
.collect(Collectors.joining("|"));
this.pattern = Pattern.compile(regex);
}
// 替換方法
@Override
public String replaceKeywords(String skuName) {
return pattern.matcher(skuName).replaceAll("");
}
}
2.3 使用Aho-Corasick(AC自動(dòng)機(jī)) 算法實(shí)現(xiàn)
在java中已有現(xiàn)成的算法實(shí)現(xiàn),源代碼github-robert-bor/aho-corasick,?
引入jar包
org.ahocorasick/groupId?>
ahocorasick/artifactId?>
0.6.3/version?>
/dependency?>
基于 Aho-Corasick 算法的字符串替換實(shí)現(xiàn)
public class AhoCorasickReplacer implements Replacer {
private final Trie trie;
public AhoCorasickReplacer(String keyWords) {
// 構(gòu)建Aho-Corasick自動(dòng)機(jī)
Trie.TrieBuilder builder = Trie.builder().ignoreOverlaps().onlyWholeWords();
//trie.caseInsensitive();
//trie.onlyWholeWords();
for (String s : keyWords.split(";")) {
builder.addKeyword(s);
}
this.trie = builder.build();
}
/**
* 替換文本中所有匹配的關(guān)鍵字為空字符串
*/
@Override
public String replaceKeywords(String text) {
if (text == null || text.isEmpty()) {
return text;
}
StringBuilder result = new StringBuilder();
Collection emits = trie.parseText(text); // 獲取所有匹配結(jié)果
int lastEnd = 0;
for (Emit emit : emits) {
int start = emit.getStart();
int end = emit.getEnd();
// 添加未匹配的前綴部分
if (start > lastEnd) {
result.append(text, lastEnd, start);
}
// 跳過匹配的關(guān)鍵字(即替換為空)
lastEnd = end + 1; // 注意:end是閉區(qū)間,需+1移動(dòng)到下一個(gè)字符
}
// 添加剩余未匹配的后綴部分
if (lastEnd <= text.length() - 1) {
result.append(text.substring(lastEnd));
}
return result.toString();
}
}
2.4 自己實(shí)現(xiàn)Trie樹算法實(shí)現(xiàn)
通過deepseek等人工智能,是非常容易自己實(shí)現(xiàn)一個(gè)Trie樹,我們就只實(shí)現(xiàn)字符串替換的功能,其他的就不使用了;
Trie樹,又叫字典樹,前綴樹(Prefix Tree),單詞查找樹,是一種多叉樹的結(jié)構(gòu).

結(jié)構(gòu)說(shuō)明: 表示根節(jié)點(diǎn)(空節(jié)點(diǎn))
每個(gè)節(jié)點(diǎn)表示一個(gè)字符
粉色節(jié)點(diǎn)表示單詞結(jié)束標(biāo)記(使用 CSS class 實(shí)現(xiàn))
路徑示例:
root → c → a → t 組成 "cat"
root → c → a → r 組成 "car"
root → d → o → g 組成 "dog"
public class TrieKeywordReplacer implements Replacer {
private final Trie trie;
@Override
public String replaceKeywords(String text) {
return trie.replaceKeywords(text, "");
}
public TrieKeywordReplacer(String keyWords) {
Trie trie = new Trie();
for (String s : keyWords.split(";")) {
trie.insert(s);
}
this.trie = trie;
}
static class TrieNode {
Map children;
boolean isEndOfWord;
public TrieNode() {
children = new HashMap?>();
isEndOfWord = false;
}
}
static class Trie {
private TrieNode root;
public Trie() {
root = new TrieNode();
}
private synchronized void insert(String word) {
TrieNode node = root;
for (char c : word.toCharArray()) {
if (node.children.get(c) == null) {
node.children.put(c, new TrieNode());
}
node = node.children.get(c);
}
node.isEndOfWord = true;
}
public String replaceKeywords(String text, String replacement) {
StringBuilder result = new StringBuilder();
int i = 0;
while (i < text.length()) {
TrieNode node = root;
int j = i;
TrieNode endNode = null;
int endIndex = -1;
while (j < text.length() && node.children.get(text.charAt(j)) != null) {
node = node.children.get(text.charAt(j));
if (node.isEndOfWord) {
endNode = node;
endIndex = j;
}
j++;
}
if (endNode != null) {
result.append(replacement);
i = endIndex + 1;
} else {
result.append(text.charAt(i));
i++;
}
}
return result.toString();
}
}
}
4個(gè)實(shí)現(xiàn)類對(duì)象的大小對(duì)比
| 類 | 對(duì)象大小 |
|---|---|
| StrReplacer | 12560 |
| PatternReplacer | 21592 |
| TrieKeywordReplacer | 184944 |
| AhoCorasickReplacer | 253896 |
性能對(duì)比
說(shuō)明:待替換一組關(guān)鍵詞共 400個(gè);JDK1.8
| StrReplacer | PatternReplacer | TrieKeywordReplacer | AhoCorasickReplacer | |
| 單線程循環(huán)1w次,平均單次性能(ns) | 21843ns | 28846ns | 532ns | 727ns |
| 名稱中只有1個(gè)待替換的關(guān)鍵詞,2個(gè)并發(fā)線程,循環(huán)1w次,平均單次性能(ns),機(jī)器 CPU 30%左右 | 23444ns | 39984ns | 680ns | 1157ns |
| 名稱中只有20待替換的關(guān)鍵詞,2個(gè)并發(fā)線程,循環(huán)1w次,平均單次性能(ns),機(jī)器 CPU 30%左右 | 252738ns | 114740ns | 33900ns | 113764ns |
| 名稱中只有無(wú)待替換的關(guān)鍵詞,2個(gè)并發(fā)線程,循環(huán)1w次,平均單次性能(ns),機(jī)器 CPU 30%左右 | 22248ns | 9253ns | 397ns | 738ns |
?
通過性能對(duì)比,自己實(shí)現(xiàn)的Trie樹的性能是最好的,因?yàn)橹蛔隽颂鎿Q的邏輯,沒有實(shí)現(xiàn)其他功能,其次是使用AhoCorasick算法,因?yàn)槭褂?AhoCorasick算法,實(shí)現(xiàn)字符串替換是最基本的功能,AhoCorasick算法,還能精準(zhǔn)的匹配到在什么地方,出現(xiàn)過多少次等信息,功能非常強(qiáng)大;
通過對(duì)比編譯好的正則性能確實(shí)是比使用原生String.replace;
public class ReplacerTest {
@Test
public void testTrieKeywordReplacer(){
//String name = skuName;
//String expected = v2;
//String name = "三星Samsung Galaxy S25+ 超擬人AI助理 驍龍8至尊版 AI拍照 翻譯手機(jī) 游戲手機(jī) 12GB+256GB 冷川藍(lán)";
//String expected = name;
String name = keyWords;
String expected = v1;
int cnt = 2;
Replacer replacer = new TrieKeywordReplacer(keyWords);
check(replacer, name, expected);
for (int i = 0; i < cnt; i++) {
checkExec(replacer, name);
}
}
@Test
public void 替換所有關(guān)鍵字() throws InterruptedException {
//String name = skuName;
//String expected = v2;
//String name = "三星Samsung Galaxy S25+ 超擬人AI助理 驍龍8至尊版 AI拍照 翻譯手機(jī) 游戲手機(jī) 12GB+256GB 冷川藍(lán)";
//String expected = name;
String name = keyWords;
String expected = v1;
int cnt = 2;
System.out.println("替換:" + name);
Replacer replacer = new StrReplacer(keyWords);
check(replacer, name, expected);
for (int i = 0; i < cnt; i++) {
checkExec(replacer, name);
}
replacer = new PatternReplacer(keyWords);
check(replacer, name, expected);
for (int i = 0; i < cnt; i++) {
checkExec(replacer, name);
}
replacer = new TrieKeywordReplacer(keyWords);
check(replacer, name, expected);
for (int i = 0; i < cnt; i++) {
checkExec(replacer, name);
}
replacer = new AhoCorasickReplacer(keyWords);
check(replacer, name, expected);
for (int i = 0; i < cnt; i++) {
checkExec(replacer, name);
}
}
@Test
public void 無(wú)關(guān)鍵字替換() throws InterruptedException {
//String name = skuName;
//String expected = v2;
String name = "三星Samsung Galaxy S25+ 超擬人AI助理 驍龍8至尊版 AI拍照 翻譯手機(jī) 游戲手機(jī) 12GB+256GB 冷川藍(lán)";
String expected = name;
//String name = keyWords;
//String expected = v1;
int cnt = 1;
System.out.println("替換:" + name);
Replacer replacer = new StrReplacer(keyWords);
check(replacer, name, expected);
for (int i = 0; i < cnt; i++) {
checkExec(replacer, name);
}
replacer = new PatternReplacer(keyWords);
check(replacer, name, expected);
for (int i = 0; i < cnt; i++) {
checkExec(replacer, name);
}
replacer = new TrieKeywordReplacer(keyWords);
check(replacer, name, expected);
for (int i = 0; i < cnt; i++) {
checkExec(replacer, name);
}
replacer = new AhoCorasickReplacer(keyWords);
check(replacer, name, expected);
for (int i = 0; i < cnt; i++) {
checkExec(replacer, name);
}
}
@Test
public void 有1個(gè)關(guān)鍵字替換() throws InterruptedException {
//String name = skuName;
//String expected = v2;
//String name = "三星Samsung Galaxy S25+ 超擬人AI助理 驍龍8至尊版 AI拍照 翻譯手機(jī) 游戲手機(jī) 12GB+256GB 冷川藍(lán)";
//String expected = name;
//String name = keyWords;
//String expected = v1;
String name = "HUAWEI Pura 70 Pro 國(guó)家補(bǔ)貼500元 羽砂黑 12GB+512GB 超高速風(fēng)馳閃拍 華為鴻蒙智能手機(jī)";
String expected = "HUAWEI Pura 70 Pro 500元 羽砂黑 12GB+512GB 超高速風(fēng)馳閃拍 華為鴻蒙智能手機(jī)";
int cnt = 1;
System.out.println("替換:" + name);
Replacer replacer = new StrReplacer(keyWords);
check(replacer, name, expected);
for (int i = 0; i < cnt; i++) {
checkExec(replacer, name);
}
replacer = new PatternReplacer(keyWords);
check(replacer, name, expected);
for (int i = 0; i < cnt; i++) {
checkExec(replacer, name);
}
replacer = new TrieKeywordReplacer(keyWords);
check(replacer, name, expected);
for (int i = 0; i < cnt; i++) {
checkExec(replacer, name);
}
replacer = new AhoCorasickReplacer(keyWords);
check(replacer, name, expected);
for (int i = 0; i < cnt; i++) {
checkExec(replacer, name);
}
}
static void check(Replacer replacer, String name, String expected) {
System.out.println(replacer.getClass().getName()+",對(duì)象大?。?+ObjectSizeCalculator.getObjectSize(replacer));
String newTxt = replacer.replaceKeywords(name);
//System.out.println(newTxt);
Assert.assertEquals(replacer.getClass().getName() + ",對(duì)比不一致!", expected, newTxt);
}
void checkExec(Replacer replacer, String name) {
String newTxt = replacer.replaceKeywords(name);
int nThreads = 2;
ExecutorService executorService = Executors.newFixedThreadPool(nThreads);
CountDownLatch downLatch = new CountDownLatch(nThreads);
int i = 0;
while (i++ < nThreads) {
executorService.submit(new Runnable() {
@Override
public void run() {
int i = 0;
long ns = System.nanoTime();
while (i++ < 100000) {
replacer.replaceKeywords(name);
}
String name = replacer.getClass().getName();
downLatch.countDown();
System.out.println(StringUtils.substring(name, name.length() - 50, name.length()) + "ti=" + i + ", t耗時(shí):" + (System.nanoTime() - ns) / i + "ns");
}
});
}
executorService.shutdown();
try {
downLatch.await();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
最后
1、使用現(xiàn)成的AhoCorasick算法進(jìn)行實(shí)現(xiàn),是性能與穩(wěn)定性最優(yōu)的選擇,非常強(qiáng)調(diào)性能,還是可以自己實(shí)現(xiàn)Trie樹來(lái)實(shí)現(xiàn);
2、在真實(shí)的使用過程中,因?yàn)榇蟛糠值纳唐访Q最多出現(xiàn)幾個(gè)關(guān)鍵詞,并且待替換的關(guān)鍵詞往往都是比較多的,可以將這么關(guān)鍵詞找出找出幾個(gè)有代表性能的詞,做前置判斷,商品名稱中是否存在;再進(jìn)行全量替換;
如待替換的關(guān)鍵詞有:政府補(bǔ)貼、國(guó)補(bǔ)、支持國(guó)補(bǔ); 那么我們并不是直接就循環(huán)這個(gè)待替換的關(guān)鍵詞組,而是找出這么關(guān)鍵詞中都有的關(guān)鍵字”補(bǔ)”先判斷商品名稱中是否存在“補(bǔ)”字后,再做處理; 這里的前置判斷,還可以使用布隆過濾器實(shí)現(xiàn);
public String replaceKeywords (String skuName){
Replacer replacer = new AhoCorasickReplacer(keyWords);
if(skuName.contains("補(bǔ)")){
return replacer.replaceKeywords(skuName);
} else {
return skuName;
}
}
參考
- [1] Aho-Corasick 算法 AC自動(dòng)機(jī)實(shí)現(xiàn)?
- [2] Trie字典樹?
審核編輯 黃宇
-
算法
+關(guān)注
關(guān)注
23文章
4784瀏覽量
98099 -
字符串
+關(guān)注
關(guān)注
1文章
596瀏覽量
23171
發(fā)布評(píng)論請(qǐng)先 登錄
求助 LabVIEW 字符串比較
打開工程后工程中的字體沒有顯示,如字符串,數(shù)字等控件不能預(yù)覽顯示字體?
字符串控件與靜態(tài)字符串控件中預(yù)覽字符顯示亂碼,如何修改顯示正常?
字符串關(guān)聯(lián)數(shù)字變量如何使用?我們的地址都是16位數(shù)據(jù),可以使用16位數(shù)字變量顯示字符串嗎?
飛凌嵌入式ElfBoard-標(biāo)準(zhǔn)IO接口之格式化輸入
飛凌嵌入式ElfBoard-標(biāo)準(zhǔn)IO接口之格式化輸出
如何使用 NuMaker 板和 Mbed OS 上的連接字符串連接到 Azure IoT?
LM3466 多串 LED 電流平衡器技術(shù)手冊(cè)
字符串替換研究
評(píng)論