Fixing Gojieba's AddWord interface having no effect when adding keywords

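A recent project needed Chinese word segmentation, to power a word cloud and sentiment analysis.

Since it isn't a critical part of the business, my first instinct was to plug in an external cloud vendor's API:

lean on a big company's machine-learning models, ship the feature quickly, and free up my time to keep working on campaign projects.

The New Year holiday has only just ended, but it already feels like there is plenty to do: lots of things are unfinished, and a lot of code still needs optimizing, migrating, and so on.

I looked at Tencent Cloud's API and it's too expensive. At our data volume it would cost a lot of RMB per day!

Ugh... I'll just build it myself, haha!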

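I have to say, gojieba (Go 结巴) is a pleasure to use, and compared with other libraries it is very fast!

The original is implemented in C++, but I mainly work in Go, and the author also provides a Go binding: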

https://github.com/yanyiwu/gojieba

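Following the official demo, I had a first version done in no time. Impressive!

For some sentences, though, the segmentation was not what I expected.

Take the sentence 我是奥斯卡，速来，带我飞，快点进群 (roughly "I'm Oscar, come quick, carry me, hurry up and join the group"):

速来 gets split into two separate characters,

and 带我飞 becomes 带我 and 飞.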

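The stakeholders wanted a custom-dictionary feature: add words in the admin backend, then have gojieba read them and add them in!

A perfectly reasonable request! The official demo even provides an AddWord interface for adding words dynamically!

And yet I ended up working overtime again.

In my own tests, some words had no effect even after calling AddWord.

At first I thought my database-reading logic was broken! I spent ages looking and found nothing.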

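Later I ran the official demo directly and found:

AddWord("带我飞") gets the new word recognized;
AddWord("速来") still doesn't work! It is still 速 and 来!

I was baffled and had no idea what to do!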

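Digging through the Issues, I found someone who had hit the same problem.

That issue has not been resolved and is still Open.

What now! Nothing for it but to grit my teeth and dig into the C++...

The conclusion: it is not really a bug. The C++ function behind AddWord has an overload that takes 3 parameters: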

bool InsertUserWord(const string& word, int freq, const string& tag = UNKNOWN_TAG);

Gojieba only passes the word parameter, with no frequency, so the weight falls back to the default, which is WordWeightMedian:

DictTrie(const string& dict_path, const string& user_dict_paths = "", UserWordWeightOption user_word_weight_opt = WordWeightMedian) {
    Init(dict_path, user_dict_paths, user_word_weight_opt);
}

If you're interested, see the C++ code:
https://github.com/yanyiwu/gojieba/blob/master/deps/cppjieba/DictTrie.hpp#L35
https://github.com/yanyiwu/gojieba/blob/master/deps/cppjieba/DictTrie.hpp#L173

So what is WordWeightMedian? It looks like the weight of the median node.

https://github.com/yanyiwu/gojieba/blob/master/deps/cppjieba/DictTrie.hpp#L238

  void SetStaticWordWeights(UserWordWeightOption option) {
    XCHECK(!static_node_infos_.empty());
    vector<DictUnit> x = static_node_infos_;
    sort(x.begin(), x.end(), WeightCompare);
    min_weight_ = x[0].weight;
    max_weight_ = x[x.size() - 1].weight;
    median_weight_ = x[x.size() / 2].weight; // weight of the median node
    switch (option) {
     case WordWeightMin:
       user_word_default_weight_ = min_weight_;
       break;
     case WordWeightMedian:
       user_word_default_weight_ = median_weight_; // assigned here
       break;
     default:
       user_word_default_weight_ = max_weight_;
       break;
    }
  }
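To make the selection logic easier to follow from the Go side, here is a minimal sketch of the same idea. It is not gojieba code: the names `WeightOption` and `defaultUserWeight` are made up for illustration, and the weights are invented rather than taken from jieba.dict.utf8.

```go
package main

import (
	"fmt"
	"sort"
)

// WeightOption mirrors the three cases of the C++ switch above.
// The Go names are hypothetical; gojieba does not export them.
type WeightOption int

const (
	WeightMin WeightOption = iota
	WeightMedian
	WeightMax
)

// defaultUserWeight sorts a copy of the dictionary weights and returns
// the smallest one, the one at index len/2, or the largest one.
func defaultUserWeight(weights []float64, opt WeightOption) float64 {
	if len(weights) == 0 {
		panic("dictionary must not be empty")
	}
	sorted := append([]float64(nil), weights...) // leave the input untouched
	sort.Float64s(sorted)
	switch opt {
	case WeightMin:
		return sorted[0]
	case WeightMedian:
		return sorted[len(sorted)/2]
	default:
		return sorted[len(sorted)-1]
	}
}

func main() {
	// Invented log-weights, only for illustration.
	weights := []float64{-3.2, -15.1, -8.7, -16.9, -12.4}
	fmt.Println(defaultUserWeight(weights, WeightMin))    // -16.9
	fmt.Println(defaultUserWeight(weights, WeightMedian)) // -12.4
	fmt.Println(defaultUserWeight(weights, WeightMax))    // -3.2
}
```

Note that "median" here simply means the entry in the middle of the sorted list, so it says nothing about how heavy the most common words are.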

This is where user_word_default_weight_ is used:

bool InsertUserWord(const string& word, const string& tag = UNKNOWN_TAG) {
    DictUnit node_info;
    if (!MakeNodeInfo(node_info, word, user_word_default_weight_, tag)) {
      return false;
    }
    active_node_infos_.push_back(node_info);
    trie_->InsertNode(node_info.word, &active_node_infos_.back());
    return true;
}

bool InsertUserWord(const string& word,int freq, const string& tag = UNKNOWN_TAG) {
    DictUnit node_info;
    double weight = freq ? log(1.0 * freq / freq_sum_) : user_word_default_weight_ ;
    if (!MakeNodeInfo(node_info, word, weight , tag)) {
      return false;
    }
    active_node_infos_.push_back(node_info);
    trie_->InsertNode(node_info.word, &active_node_infos_.back());
    return true;
}
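The difference between the two overloads comes down to how the weight is chosen. A small Go sketch of that rule, under the assumption that it matches the C++ above: `userWordWeight` is a hypothetical helper, and `freqSum` and the default weight are invented numbers.

```go
package main

import (
	"fmt"
	"math"
)

// userWordWeight follows the rule in the C++ overloads above:
// a non-zero freq becomes log(freq / freqSum), while freq == 0
// falls back to the default weight (the median, by default).
func userWordWeight(freq int, freqSum, defaultWeight float64) float64 {
	if freq != 0 {
		return math.Log(float64(freq) / freqSum)
	}
	return defaultWeight
}

func main() {
	const freqSum = 1_000_000.0 // invented total frequency of the dictionary
	const medianWeight = -12.0  // invented default weight

	fmt.Printf("%.2f\n", userWordWeight(0, freqSum, medianWeight))      // falls back to the default
	fmt.Printf("%.2f\n", userWordWeight(1000, freqSum, medianWeight))   // log(0.001)
	fmt.Printf("%.2f\n", userWordWeight(100000, freqSum, medianWeight)) // log(0.1)
}
```

The larger the frequency, the closer the log-weight gets to zero, which is what lets an explicit frequency outrank the default.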

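In other words, a word added through Gojieba's AddWord interface gets the WordWeightMedian weight by default.

If the new word's weight is too low, adding it may have no effect (my understanding is that the pieces it would otherwise be split into carry more weight than the median; not sure if that's right)!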
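That guess can be checked with a toy model. The sketch below assumes a jieba-style segmenter that picks the split with the highest sum of log-weights; it is a simplification written for this post, not cppjieba's actual code, and `bestSplit` plus every number in it are invented.

```go
package main

import (
	"fmt"
	"math"
)

// bestSplit returns the segmentation of s with the highest summed
// log-weight. Multi-character pieces must be dictionary words; a single
// character missing from dict gets the unknown penalty.
func bestSplit(s string, dict map[string]float64, unknown float64) []string {
	r := []rune(s)
	n := len(r)
	score := make([]float64, n+1) // score[i]: best total for r[i:]; score[n] is 0
	next := make([]int, n+1)      // next[i]: end index of the piece starting at i
	for i := n - 1; i >= 0; i-- {
		score[i] = math.Inf(-1)
		for j := i + 1; j <= n; j++ {
			w, ok := dict[string(r[i:j])]
			if !ok {
				if j > i+1 {
					continue
				}
				w = unknown
			}
			if c := w + score[j]; c > score[i] {
				score[i], next[i] = c, j
			}
		}
	}
	var out []string
	for i := 0; i < n; i = next[i] {
		out = append(out, string(r[i:next[i]]))
	}
	return out
}

func main() {
	// Invented weights: two common single characters and a user word
	// that was added at a low, median-like weight.
	dict := map[string]float64{"速": -9.0, "来": -6.0, "速来": -16.0}
	fmt.Println(bestSplit("速来", dict, -20)) // [速 来], since -9 + -6 = -15 beats -16

	dict["速来"] = -10.0 // the same word with a higher weight
	fmt.Println(bestSplit("速来", dict, -20)) // [速来], since -10 beats -15
}
```

Under these assumptions the added word only wins when its own weight beats the combined weight of its pieces, which fits what happened with 速来.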

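At the moment there is no way to specify the default weight at initialization (for your own dictionary you would usually want larger weights), and the AddWord interface cannot take a weight either.

Stuck either way! I went through the code and there really is no way around it: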

https://github.com/yanyiwu/gojieba/blob/master/deps/cppjieba/DictTrie.hpp#L176

private:
void Init(const string& dict_path, const string& user_dict_paths, UserWordWeightOption user_word_weight_opt) {
    LoadDict(dict_path);
    freq_sum_ = CalcFreqSum(static_node_infos_);
    CalculateWeight(static_node_infos_, freq_sum_);
    SetStaticWordWeights(user_word_weight_opt); // set the default weight

    if (user_dict_paths.size()) {
      LoadUserDict(user_dict_paths); // load the user dictionary
    }
    Shrink(static_node_infos_);
    CreateTrie(static_node_infos_);
}

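And the median is not the mean:

if only I could give my own words a very, very large weight (say 100000),

put them into jieba.dict.utf8 (rather than the user dictionary), and pull median_weight up that way.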
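A quick way to see why that trick cannot work: a few huge entries barely move the median of a list, even though they drag the mean a long way. The sketch below uses invented word frequencies, and `median` follows the same "element at index len/2 of the sorted list" rule as the C++ code.

```go
package main

import (
	"fmt"
	"sort"
)

// median returns the element at index len/2 of a sorted copy of xs.
func median(xs []float64) float64 {
	sorted := append([]float64(nil), xs...)
	sort.Float64s(sorted)
	return sorted[len(sorted)/2]
}

// mean returns the arithmetic mean of xs.
func mean(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

func main() {
	// Invented frequencies: most dictionary words are rare.
	base := []float64{1, 2, 3, 3, 3, 3, 4, 5, 8}
	// The same list plus two custom words with a huge frequency.
	boosted := append(append([]float64(nil), base...), 100000, 100000)

	fmt.Println(median(base), median(boosted)) // 3 3
	fmt.Printf("%.1f %.1f\n", mean(base), mean(boosted))
}
```

In a real dictionary with hundreds of thousands of entries, a handful of extra words would shift the middle position by only a few ranks.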

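As far as I can tell, none of that works. I'll have to hack gojieba myself. That's it for now, I'll pick it up after some sleep. Good night!

I'm awake!

Update, 2023-01-31 10:03:

I've patched it for now and opened a pull request upstream; hoping it gets merged, haha: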
https://github.com/Lofanmi/gojieba/commit/8e5f5e5cbb2a960483166aa4f2b42d289e599b90

If you're in a hurry, add a replace directive to your project's go.mod for now:

replace github.com/yanyiwu/gojieba v1.2.0 => github.com/Lofanmi/gojieba v0.0.0-20230131015425-8e5f5e5cbb2a
