-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathlocal-search.xml
131 lines (62 loc) · 358 KB
/
local-search.xml
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
<?xml version="1.0" encoding="utf-8"?>
<search>
<entry>
<title>语义搜索效果差?试试重排</title>
<link href="/2025/01/15/%E8%AF%AD%E4%B9%89%E6%90%9C%E7%B4%A2%E6%95%88%E6%9E%9C%E5%B7%AE%EF%BC%9F%E8%AF%95%E8%AF%95%E9%87%8D%E6%8E%92/"/>
<url>/2025/01/15/%E8%AF%AD%E4%B9%89%E6%90%9C%E7%B4%A2%E6%95%88%E6%9E%9C%E5%B7%AE%EF%BC%9F%E8%AF%95%E8%AF%95%E9%87%8D%E6%8E%92/</url>
<content type="html"><">[1]</span></a></sup> </p><blockquote><p>本文首发于 Zilliz 公众号。文中代码的 Notebook 在<a href="https://pan.baidu.com/s/15z5hYqnuH3jXqUsYEcQO8g?pwd=jg5y">这里</a>下载。</p></blockquote><h2 id="基于统计的重排"><a href="#基于统计的重排" class="headerlink" title="基于统计的重排"></a>基于统计的重排</h2><p>基于统计的重排用于混合搜索,它可以把多种搜索结果综合起来,重新排序。除了前面介绍的稠密向量和稀疏向量,还可以综合文本向量和图片向量。</p><p>怎么综合呢?有两种方法,一种是 WeightedRanker ——分数加权平均算法,通过设置权重计算得分,后面简称权重策略。另一种是 RRFRanker(Reciprocal Rank Fusion)——逆序排名融合算法,通过排名的倒数来计算得分,后面简称 RRF 策略。</p><h3 id="权重策略"><a href="#权重策略" class="headerlink" title="权重策略"></a>权重策略</h3><p>权重策略就是设置权重。权重值范围从0到1,数值越大表示重要性越大。计算方法很简单,初始得分乘以权重,就是最终得分。<sup id="fnref:2" class="footnote-ref"><a href="#fn:2" rel="footnote"><span class="hint--top hint--rounded" aria-label="[Reranking](https://milvus.io/docs/reranking.md#Weighted-Scoring-WeightedRanker)">[2]</span></a></sup></p><p>$$\text{WeightedRanker_score} = \sum_{i=1}^{N} (w_i \times \text{score_n}_i) $$</p><p>打个比方,假设某班级考了语文和数学两门课,统计出学生每门科目的分数和排名。学生就相当于向量数据库中的文档,学生这两门课的分数,就相当于文档在不同搜索结果中的得分。</p><p>假设学生的成绩如下表所示:</p><table><thead><tr><th>学生编号</th><th>数学成绩</th><th>语文成绩</th></tr></thead><tbody><tr><td>S1</td><td>100</td><td>50</td></tr><tr><td>S2</td><td>95</td><td>55</td></tr><tr><td>S3</td><td>80</td><td>70</td></tr><tr><td>S4</td><td>65</td><td>85</td></tr><tr><td>S5</td><td>80</td><td>75</td></tr><tr><td>S6</td><td>75</td><td>80</td></tr><tr><td>S7</td><td>70</td><td>85</td></tr><tr><td>S8</td><td>65</td><td>80</td></tr><tr><td>S9</td><td>60</td><td>85</td></tr><tr><td>S10</td><td>55</td><td>95</td></tr></tbody></table><p>在权重策略下,综合得分公式为:<br>$$score = 数学成绩 * 0.7 + 语文成绩 * 0.3$$<br>根据公式计算出学生们的综合分数,排名如下:</p><table><thead><tr><th>权重综合排名</th><th>权重综合得分</th><th>学生编号</th><th>数学成绩</th><th>语文成绩</th></tr></thead><tbody><tr><td>1</td><td>85</td><td>S1</td><td>100</td><td>50</td></tr><tr><td>2</td><td>83</td><td>S2</td><td>95</td><td>55</td></tr><tr><td>3</td><td>78.5</td><td>S5</td><td>80</td><td>75</td></tr><tr><td>4</td><td>77</td><td>S3</td><td>80</td><td>70</td></tr><tr><td>5</td><td>76.5</td><td>S6</td><td>75</td><td>80</td></tr><tr><td>6</td><td>74.5</td><td>S7</td><td>70</td><td>85</td></tr><tr><td>7</td><td>71</td><td>S4</td><td>65</td><td>85</td></tr><tr><td>8</td><td>69.5</td><td>S8</td><td>65</td><td>80</td></tr><tr><td>9</td><td>67.5</td><td>S9</td><td>60</td><td>85</td></tr><tr><td>10</td><td>67</td><td>S10</td><td>55</td><td>95</td></tr></tbody></table><h3 id="RRF-策略"><a href="#RRF-策略" class="headerlink" title="RRF 策略"></a>RRF 策略</h3><p>RRF 策略的计算方式稍微复杂一点:<br>$$\text{RRF_score}(d) = \sum_{i=1}^{N} \frac{1}{k + \text{rank}_i (d)}$$</p><p>公式中的 rank 是初始分数的排名,k 是平滑参数。从公式中可以看出,排名越靠前,rank 的值越小,综合得分越高。同时, k 的值越大,排名对分数的影响越小。</p><p>我们使用 RRF 策略重新计算分数和排名。参数 k 一般为60,为方便演示,这里设为 10,公式变成:<br>$$\text{RRF_score}(d) = \sum_{i=1}^{N} \frac{1}{10 + \text{rank}_i (d)}$$</p><p>RRF 策略根据排名计算分数,所以我们先列出数学和语文的排名。<br>数学成绩排名:</p><table><thead><tr><th>数学排名</th><th>数学成绩</th><th>学生编号</th></tr></thead><tbody><tr><td>1</td><td>100</td><td>S1</td></tr><tr><td>2</td><td>95</td><td>S2</td></tr><tr><td>3</td><td>80</td><td>S3</td></tr><tr><td>3</td><td>80</td><td>S5</td></tr><tr><td>5</td><td>75</td><td>S6</td></tr><tr><td>6</td><td>70</td><td>S7</td></tr><tr><td>7</td><td>65</td><td>S8</td></tr><tr><td>7</td><td>65</td><td>S4</td></tr><tr><td>9</td><td>60</td><td>S9</td></tr><tr><td>10</td><td>55</td><td>S10</td></tr></tbody></table><p>语文成绩排名:</p><table><thead><tr><th>语文排名</th><th>语文成绩</th><th>学生编号</th></tr></thead><tbody><tr><td>1</td><td>95</td><td>S10</td></tr><tr><td>2</td><td>85</td><td>S4</td></tr><tr><td>2</td><td>85</td><td>S7</td></tr><tr><td>2</td><td>85</td><td>S9</td></tr><tr><td>5</td><td>80</td><td>S6</td></tr><tr><td>5</td><td>80</td><td>S8</td></tr><tr><td>7</td><td>75</td><td>S5</td></tr><tr><td>8</td><td>70</td><td>S3</td></tr><tr><td>9</td><td>55</td><td>S2</td></tr><tr><td>10</td><td>50</td><td>S1</td></tr></tbody></table><p>接下来使用 RRF 策略计算综合得分,重新排名:</p><table><thead><tr><th>RRF 综合排名</th><th>RRF 综合得分</th><th>学生编号</th><th>数学成绩</th><th>语文成绩</th></tr></thead><tbody><tr><td>1</td><td>0.1458</td><td>S7</td><td>70</td><td>85</td></tr><tr><td>2</td><td>0.1421</td><td>S4</td><td>65</td><td>85</td></tr><tr><td>3</td><td>0.1409</td><td>S1</td><td>100</td><td>50</td></tr><tr><td>3</td><td>0.1409</td><td>S10</td><td>55</td><td>95</td></tr><tr><td>5</td><td>0.1359</td><td>S2</td><td>95</td><td>55</td></tr><tr><td>5</td><td>0.1359</td><td>S9</td><td>60</td><td>85</td></tr><tr><td>7</td><td>0.1357</td><td>S5</td><td>80</td><td>75</td></tr><tr><td>8</td><td>0.1334</td><td>S6</td><td>75</td><td>80</td></tr><tr><td>9</td><td>0.1325</td><td>S3</td><td>80</td><td>70</td></tr><tr><td>10</td><td>0.1255</td><td>S8</td><td>65</td><td>80</td></tr></tbody></table><p>比较两个排名可以发现,在权重策略下,数学的权重较大,偏科学生 S1虽然语文只有50分,也能因为数学100分而排在第一名。而 RRF 策略注重的是各科的排名,而不是分数,所以 S1的数学虽然排名第一,但是语文排名第10,综合排名下降到第三。</p><p>学生 S7 正好相反,在权重策略下,即使他语文得了85的高分,但是权重只占30%,而高权重的数学只得了70分,所以综合排名靠后,排在第六名。在 RRF 策略下,他的数学和语文排名分别是第六名和第二名,语文的高排名拉高了综合排名,上升到了第一名。</p><p>通过比较两种策略的排名结果,我们发现了这样的规律,如果你更看重搜索结果的得分,就使用权重策略,你还可以通过调整权重来调整得分;如果你更看重搜索结果的排名,就使用RRF策略。</p><h2 id="基于深度学习的重排"><a href="#基于深度学习的重排" class="headerlink" title="基于深度学习的重排"></a>基于深度学习的重排</h2><p>和基于统计的重排相比,基于深度学习的重排更加复杂,通常被称为 Cross-encoder Reranker,交叉编码重排,后面简称“重排模型”。</p><p>粗排和重排模型有什么区别呢?粗排搜索速度更快,重排模型准确性更高。</p><p>为什么粗排搜索速快?粗排使用的是双塔模型(Dual-Encoder),“双塔”指的是它有两个独立的编码器,分别把查询和文档向量化,然后通过计算向量之间的相似度(比如余弦相似度Cosine)搜索结果并且排序。双塔模型的优势在于搜索效率高,因为可以提前计算文档向量,搜索时只需要向量化查询即可。而重排模型则是在搜索时现场编码。就好比两个饭店,一个使用预制菜,一个现场热炒,上菜速度肯定不一样。</p><p>重排模型的优势则是准确性高。它把查询和文档组成数据对后输入给编码器编码,然后给它们的相似程度打分,针对性强。这就相当于公司招聘人才,粗排是根据专业、学历和工作年限等几个指标快速筛选简历,挑选出多位候选者。重排则是通过面试详细了解候选者做过什么项目,遇到了什么挑战,解决了什么难题,然后判断他有多适合应聘的岗位(文档与查询有多相似)。</p><p><img src="https://picgo233.oss-cn-hangzhou.aliyuncs.com/img/202501042343533.png" alt="image.png|600"><br>图片来源:自己画的</p><p>所以,重排模型适合那些对回答准确性要求高的场景,比如专业知识库或者客服系统。不适合追求高响应速度和低成本的场景,比如网页搜索、电商,这种场景建议使用基于统计的重排。</p><p>你还可以把粗排和重排模型结合起来。比如,先通过粗排筛选出10个候选结果,再用重排模型重新排名。既可以提高搜索速度,也能保证准确度。</p><h2 id="代码实践"><a href="#代码实践" class="headerlink" title="代码实践"></a>代码实践</h2><p>版本说明:<br>Milvus 版本:>=2.5.0<br>pymilvus 版本:>=2.5.0</p><p>接下来我们通过代码实践一下,看看这些重排方法实际效果到底如何。</p><p>我们会使用“敏捷的狐狸跳过懒惰的狗。”作为查询,从下面10个句子中搜索出语义相似的句子。你可以先猜一猜,粗排、基于统计的重排以及基于深度学习的重排,哪个效果最好。</p><p>文档:</p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><code class="hljs json"><span class="hljs-punctuation">[</span><br><span class="hljs-punctuation">{</span><span class="hljs-attr">"content"</span><span class="hljs-punctuation">:</span> <span class="hljs-string">"灵活的狐跳过了懒散的犬。"</span><span class="hljs-punctuation">}</span><span class="hljs-punctuation">,</span><br><span class="hljs-punctuation">{</span><span class="hljs-attr">"content"</span><span class="hljs-punctuation">:</span> <span class="hljs-string">"一只敏捷的狐在公园里跳过了那只懒犬。"</span><span class="hljs-punctuation">}</span><span class="hljs-punctuation">,</span><br><span class="hljs-punctuation">{</span><span class="hljs-attr">"content"</span><span class="hljs-punctuation">:</span> <span class="hljs-string">"那只懈怠的犬正在大树下睡觉。"</span><span class="hljs-punctuation">}</span><span class="hljs-punctuation">,</span><br><span class="hljs-punctuation">{</span><span class="hljs-attr">"content"</span><span class="hljs-punctuation">:</span> <span class="hljs-string">"在公园里,那只棕色的狐狸正在跳。"</span><span class="hljs-punctuation">}</span><span class="hljs-punctuation">,</span><br><span class="hljs-punctuation">{</span><span class="hljs-attr">"content"</span><span class="hljs-punctuation">:</span> <span class="hljs-string">"犬跃过了狐。"</span><span class="hljs-punctuation">}</span><span class="hljs-punctuation">,</span><br><span class="hljs-punctuation">{</span><span class="hljs-attr">"content"</span><span class="hljs-punctuation">:</span> <span class="hljs-string">"树下有一个小池塘。"</span><span class="hljs-punctuation">}</span><span class="hljs-punctuation">,</span><br><span class="hljs-punctuation">{</span><span class="hljs-attr">"content"</span><span class="hljs-punctuation">:</span> <span class="hljs-string">"动物如狗和狐狸生活在公园里。"</span><span class="hljs-punctuation">}</span><span class="hljs-punctuation">,</span><br><span class="hljs-punctuation">{</span><span class="hljs-attr">"content"</span><span class="hljs-punctuation">:</span> <span class="hljs-string">"池塘靠近公园里的大树。"</span><span class="hljs-punctuation">}</span><span class="hljs-punctuation">,</span><br><span class="hljs-punctuation">{</span><span class="hljs-attr">"content"</span><span class="hljs-punctuation">:</span> <span class="hljs-string">"懒狗跳过了狐狸。"</span><span class="hljs-punctuation">}</span><span class="hljs-punctuation">,</span><br><span class="hljs-punctuation">{</span><span class="hljs-attr">"content"</span><span class="hljs-punctuation">:</span> <span class="hljs-string">"那只灵巧的狐狸轻松地跨过了那只懒散的狗。"</span><span class="hljs-punctuation">}</span><span class="hljs-punctuation">,</span><br><span class="hljs-punctuation">{</span><span class="hljs-attr">"content"</span><span class="hljs-punctuation">:</span> <span class="hljs-string">"狐迅速地跳过了那只不活跃的犬。"</span><span class="hljs-punctuation">}</span><br><span class="hljs-punctuation">]</span><br></code></pre></td></tr></table></figure><p>首先创建集合。我们为集合设置稠密向量“dense_vectors”和稀疏向量“sparse_vectors”两个字段,分别储存稠密向量和稀疏向量,用于混合搜索。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-keyword">from</span> pymilvus <span class="hljs-keyword">import</span> MilvusClient, DataType<br><span class="hljs-keyword">import</span> time<br><br><span class="hljs-comment"># 检查并且删除同名集合</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">check_and_drop_collection</span>(<span class="hljs-params">collection_name</span>):<br> <span class="hljs-keyword">if</span> milvus_client.has_collection(collection_name):<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"集合 <span class="hljs-subst">{collection_name}</span> 已经存在"</span>)<br> <span class="hljs-keyword">try</span>:<br> milvus_client.drop_collection(collection_name)<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"删除集合:<span class="hljs-subst">{collection_name}</span>"</span>)<br> <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span><br> <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"删除集合时出现错误: <span class="hljs-subst">{e}</span>"</span>)<br> <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span><br> <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span><br><br><br><span class="hljs-comment"># 创建模式</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">create_schema</span>():<br> schema = milvus_client.create_schema(<br> auto_id=<span class="hljs-literal">True</span>,<br> enable_dynamic_field=<span class="hljs-literal">True</span>,<br> description=<span class="hljs-string">""</span><br> )<br> schema.add_field(field_name=<span class="hljs-string">"id"</span>, datatype=DataType.INT64, is_primary=<span class="hljs-literal">True</span>, max_length=<span class="hljs-number">256</span>)<br> schema.add_field(field_name=<span class="hljs-string">"content"</span>, datatype=DataType.VARCHAR, max_length=<span class="hljs-number">256</span>)<br> schema.add_field(field_name=<span class="hljs-string">"dense_vectors"</span>, datatype=DataType.FLOAT_VECTOR, dim=<span class="hljs-number">1024</span>)<br> schema.add_field(field_name=<span class="hljs-string">"sparse_vectors"</span>, datatype=DataType.SPARSE_FLOAT_VECTOR)<br> <span class="hljs-keyword">return</span> schema<br><br><br><span class="hljs-comment"># 创建集合</span><br><span class="hljs-keyword">import</span> time<br><span class="hljs-keyword">def</span> <span class="hljs-title function_">create_collection</span>(<span class="hljs-params">collection_name, schema, timeout = <span class="hljs-number">3</span></span>):<br> <span class="hljs-comment"># 创建集合</span><br> <span class="hljs-keyword">try</span>:<br> milvus_client.create_collection(<br> collection_name=collection_name,<br> schema=schema,<br> shards_num=<span class="hljs-number">2</span><br> )<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"开始创建集合:<span class="hljs-subst">{collection_name}</span>"</span>)<br> <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"创建集合的过程中出现了错误: <span class="hljs-subst">{e}</span>"</span>)<br> <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span><br><br> <span class="hljs-comment"># 检查集合是否创建成功</span><br> start_time = time.time()<br> <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:<br> <span class="hljs-keyword">if</span> milvus_client.has_collection(collection_name):<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"集合 <span class="hljs-subst">{collection_name}</span> 创建成功"</span>)<br> <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span><br> <span class="hljs-keyword">elif</span> time.time() - start_time > timeout:<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"创建集合 <span class="hljs-subst">{collection_name}</span> 超时"</span>)<br> <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span><br> time.sleep(<span class="hljs-number">1</span>)<br><br><span class="hljs-keyword">class</span> <span class="hljs-title class_">CollectionDeletionError</span>(<span class="hljs-title class_ inherited__">Exception</span>):<br> <span class="hljs-string">"""删除集合失败"""</span><br><br><br><br>collection_name = <span class="hljs-string">"test_rank"</span><br>uri=<span class="hljs-string">"http://localhost:19530"</span><br>milvus_client = MilvusClient(uri=uri)<br><br><span class="hljs-comment"># 如果无法删除集合,抛出异常</span><br><span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> check_and_drop_collection(collection_name):<br> <span class="hljs-keyword">raise</span> CollectionDeletionError(<span class="hljs-string">'删除集合失败'</span>)<br><span class="hljs-keyword">else</span>:<br> <span class="hljs-comment"># 创建集合的模式</span><br> schema = create_schema()<br> <span class="hljs-comment"># 创建集合并等待成功</span><br> create_collection(collection_name, schema)<br></code></pre></td></tr></table></figure><p>然后,定义把文档向量化的函数。我们使用 bge_m3 生成稠密向量和稀疏向量。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-keyword">from</span> tqdm <span class="hljs-keyword">import</span> tqdm<br><span class="hljs-keyword">import</span> torch<br><span class="hljs-keyword">from</span> pymilvus.model.hybrid <span class="hljs-keyword">import</span> BGEM3EmbeddingFunction<br><br><span class="hljs-comment"># 初始化嵌入模型的实例</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">init_embedding_model</span>():<br> <span class="hljs-comment"># 检查是否有可用的CUDA设备</span><br> device = <span class="hljs-string">"cuda:0"</span> <span class="hljs-keyword">if</span> torch.cuda.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">"cpu"</span><br> <span class="hljs-comment"># 根据设备选择是否使用fp16</span><br> use_fp16 = device.startswith(<span class="hljs-string">"cuda"</span>)<br> <span class="hljs-comment"># 创建嵌入模型实例</span><br> bge_m3_ef = BGEM3EmbeddingFunction(<br> model_name=<span class="hljs-string">"BAAI/bge-m3"</span>,<br> device=device,<br> use_fp16=use_fp16<br> )<br> <span class="hljs-keyword">return</span> bge_m3_ef<br><br><br><span class="hljs-comment"># 把查询向量化</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">vectorize_query</span>(<span class="hljs-params">query, encoder</span>):<br> <span class="hljs-comment"># 验证参数是否符合要求</span><br> <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> (<span class="hljs-built_in">isinstance</span>(query, <span class="hljs-built_in">list</span>) <span class="hljs-keyword">and</span> <span class="hljs-built_in">all</span>(<span class="hljs-built_in">isinstance</span>(text, <span class="hljs-built_in">str</span>) <span class="hljs-keyword">for</span> text <span class="hljs-keyword">in</span> query)):<br> <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">"query必须为字符串列表。"</span>)<br> <span class="hljs-keyword">return</span> encoder.encode_queries(query)<br><br><br><span class="hljs-comment"># 把文档向量化</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">vectorize_docs</span>(<span class="hljs-params">docs, encoder</span>):<br> <span class="hljs-comment"># 验证参数是否符合要求</span><br> <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> (<span class="hljs-built_in">isinstance</span>(docs, <span class="hljs-built_in">list</span>) <span class="hljs-keyword">and</span> <span class="hljs-built_in">all</span>(<span class="hljs-built_in">isinstance</span>(text, <span class="hljs-built_in">str</span>) <span class="hljs-keyword">for</span> text <span class="hljs-keyword">in</span> docs)):<br> <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">"docs必须为字符串列表。"</span>)<br> <span class="hljs-keyword">return</span> encoder.encode_documents(docs)<br><br><br><span class="hljs-comment"># 初始化嵌入模型实例</span><br>bge_m3_ef = init_embedding_model()<br></code></pre></td></tr></table></figure><p>接下来,生成向量并且导入向量数据库。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-keyword">import</span> json<br><br><span class="hljs-comment"># 把文件中的指定字段向量化</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">vectorize_file</span>(<span class="hljs-params">input_file_path, encoder, field_name</span>):<br> <span class="hljs-keyword">with</span> <span class="hljs-built_in">open</span>(input_file_path, <span class="hljs-string">'r'</span>, encoding=<span class="hljs-string">'utf-8'</span>) <span class="hljs-keyword">as</span> file:<br> data_list = json.load(file)<br> docs = [data[field_name] <span class="hljs-keyword">for</span> data <span class="hljs-keyword">in</span> data_list]<br> <span class="hljs-comment"># 向量化文档</span><br> <span class="hljs-keyword">return</span> vectorize_docs(docs, encoder), data_list<br><br><br><span class="hljs-comment"># 把数据插入数据库</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">insert_data</span>(<span class="hljs-params"></span><br><span class="hljs-params"> collection_name: <span class="hljs-built_in">str</span>,</span><br><span class="hljs-params"> data_list,</span><br><span class="hljs-params"> dense_vectors,</span><br><span class="hljs-params"> sparse_vectors,</span><br><span class="hljs-params"> batch_size: <span class="hljs-built_in">int</span> = <span class="hljs-number">1000</span></span>):<br> <span class="hljs-comment"># 定义默认稀疏向量,避免生成稀疏向量失败。元素值全为0,表示它在所有维度上都不重要</span><br> default_sparse_vector = {<span class="hljs-number">0</span>: <span class="hljs-number">1.0</span>, <span class="hljs-number">1</span>: <span class="hljs-number">1.0</span>, <span class="hljs-number">2</span>: <span class="hljs-number">1.0</span>, <span class="hljs-number">3</span>: <span class="hljs-number">1.0</span>}<br><br> <span class="hljs-comment"># 接收稠密向量和稀疏向量</span><br> <span class="hljs-keyword">for</span> data, dense_vector, sparse_vector <span class="hljs-keyword">in</span> <span class="hljs-built_in">zip</span>(data_list, dense_vectors, sparse_vectors):<br> data[<span class="hljs-string">'dense_vectors'</span>] = dense_vector<br> <span class="hljs-comment"># 如果稀疏向量为空,则使用默认稀疏向量,避免入库时报错</span><br> <span class="hljs-keyword">if</span> sparse_vector <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:<br> data[<span class="hljs-string">'sparse_vectors'</span>] = sparse_vector<br> <span class="hljs-keyword">else</span>:<br> data[<span class="hljs-string">'sparse_vectors'</span>] = default_sparse_vector<br><br> <span class="hljs-comment"># 分批入库</span><br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"正在将数据插入集合:<span class="hljs-subst">{collection_name}</span>"</span>)<br> total_count = <span class="hljs-built_in">len</span>(data_list)<br> <span class="hljs-keyword">with</span> tqdm(total=total_count, desc=<span class="hljs-string">"插入数据"</span>) <span class="hljs-keyword">as</span> progress_bar:<br> <span class="hljs-comment"># 每次插入 batch_size 条数据</span><br> <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(<span class="hljs-number">0</span>, total_count, batch_size): <br> batch_data = data_list[i:i + batch_size]<br> res = milvus_client.insert(<br> collection_name=collection_name,<br> data=batch_data<br> )<br> progress_bar.update(<span class="hljs-built_in">len</span>(batch_data))<br><br><br>input_file_path = <span class="hljs-string">"docs_rank.json"</span><br>field_name = <span class="hljs-string">"content"</span><br>vectors, data_list = vectorize_file(input_file_path, bge_m3_ef, field_name)<br>docs_dense_vectors = vectors[<span class="hljs-string">'dense'</span>]<br>docs_sparse_vectors = vectors[<span class="hljs-string">'sparse'</span>]<br>insert_data(collection_name, data_list, docs_dense_vectors, docs_sparse_vectors)<br></code></pre></td></tr></table></figure><p>数据入库后,为它们创建索引。因为数据库中同时包含了两个向量,所以使用混合搜索,需要分别创建稠密向量和稀疏向量的索引。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br></pre></td><td class="code"><pre><code class="hljs python">index_params = milvus_client.prepare_index_params()<br><br><span class="hljs-comment"># 创建密集向量索引参数</span><br>index_params.add_index(<br> index_name=<span class="hljs-string">"IVF_FLAT"</span>,<br> field_name=<span class="hljs-string">"dense_vectors"</span>,<br> index_type=<span class="hljs-string">"IVF_FLAT"</span>,<br> metric_type=<span class="hljs-string">"COSINE"</span>,<br> params={<span class="hljs-string">"nlist"</span>: <span class="hljs-number">128</span>}<br>)<br><br><span class="hljs-comment"># 创建稀疏向量索引参数</span><br>index_params.add_index(<br> index_name=<span class="hljs-string">"sparse"</span>,<br> field_name=<span class="hljs-string">"sparse_vectors"</span>,<br> index_type=<span class="hljs-string">"SPARSE_INVERTED_INDEX"</span>,<br> <span class="hljs-comment"># 目前仅支持IP</span><br> metric_type=<span class="hljs-string">"IP"</span>,<br> params={<span class="hljs-string">"drop_ratio_build"</span>: <span class="hljs-number">0.2</span>}<br>)<br></code></pre></td></tr></table></figure><p>加载集合。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-built_in">print</span>(<span class="hljs-string">f"正在加载集合:<span class="hljs-subst">{collection_name}</span>"</span>)<br>milvus_client.load_collection(collection_name=collection_name)<br><br><span class="hljs-comment"># 验证加载状态</span><br><span class="hljs-built_in">print</span>(milvus_client.get_load_state(collection_name=collection_name))<br></code></pre></td></tr></table></figure><p>为了实现混合搜索,还需要定义混合搜索函数。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 混合搜索</span><br><span class="hljs-keyword">from</span> pymilvus <span class="hljs-keyword">import</span> AnnSearchRequest, WeightedRanker, RRFRanker<br><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">perform_hybrid_search</span>(<span class="hljs-params"></span><br><span class="hljs-params"> collection_name, </span><br><span class="hljs-params"> query, </span><br><span class="hljs-params"> ranker,</span><br><span class="hljs-params"> output_fields,</span><br><span class="hljs-params"> limit_dense,</span><br><span class="hljs-params"> limit_sparse,</span><br><span class="hljs-params"> limit_hybrid</span><br><span class="hljs-params"> </span>):<br> <span class="hljs-comment"># 获取查询的稠密向量和稀疏向量</span><br> query_vectors = vectorize_query(query, bge_m3_ef)<br> query_dense_vectors = [query_vectors[<span class="hljs-string">'dense'</span>][<span class="hljs-number">0</span>]]<br> query_sparse_vectors = [query_vectors[<span class="hljs-string">'sparse'</span>][[<span class="hljs-number">0</span>]]]<br> <span class="hljs-comment"># 创建稠密向量的搜索参数</span><br> dense_search_params = {<br> <span class="hljs-comment"># 查询向量</span><br> <span class="hljs-string">"data"</span>: query_dense_vectors, <br> <span class="hljs-string">"anns_field"</span>: <span class="hljs-string">"dense_vectors"</span>,<br> <span class="hljs-string">"param"</span>: {<br> <span class="hljs-string">"metric_type"</span>: <span class="hljs-string">"COSINE"</span>,<br> <span class="hljs-string">"params"</span>: {<br> <span class="hljs-string">"nprobe"</span>: <span class="hljs-number">16</span>,<br> <span class="hljs-string">"radius"</span>: <span class="hljs-number">0.1</span>,<br> <span class="hljs-string">"range_filter"</span>: <span class="hljs-number">1</span><br> }<br> },<br> <span class="hljs-string">"limit"</span>: limit_dense<br> }<br> <span class="hljs-comment"># 创建稠密向量的搜索请求</span><br> dense_req = AnnSearchRequest(**dense_search_params)<br> <br> <span class="hljs-comment"># 创建稀疏向量的搜索参数</span><br> sparse_search_params = {<br> <span class="hljs-string">"data"</span>: query_sparse_vectors,<br> <span class="hljs-string">"anns_field"</span>: <span class="hljs-string">"sparse_vectors"</span>,<br> <span class="hljs-string">"param"</span>: {<br> <span class="hljs-string">"metric_type"</span>: <span class="hljs-string">"IP"</span>,<br> <span class="hljs-string">"params"</span>: {<span class="hljs-string">"drop_ratio_search"</span>: <span class="hljs-number">0.2</span>}<br> },<br> <span class="hljs-string">"limit"</span>: limit_sparse<br> }<br> <span class="hljs-comment"># 创建稀疏向量的搜索请求</span><br> sparse_req = AnnSearchRequest(**sparse_search_params)<br> <span class="hljs-comment"># 执行混合搜索</span><br> start_time = time.time()<br> res = milvus_client.hybrid_search(<br> collection_name=collection_name,<br> reqs=[dense_req, sparse_req],<br> ranker = ranker,<br> limit=limit_hybrid,<br> output_fields=output_fields<br> )<br> end_time = time.time()<br> total_time = end_time - start_time<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"搜索时间:<span class="hljs-subst">{total_time:<span class="hljs-number">.3</span>f}</span>"</span>)<br> <span class="hljs-keyword">return</span> res<br></code></pre></td></tr></table></figure><p>最后再定义一个打印函数,方便查看搜索结果。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-keyword">def</span> <span class="hljs-title function_">print_vector_results</span>(<span class="hljs-params">res</span>): <br> <span class="hljs-keyword">for</span> hits <span class="hljs-keyword">in</span> res:<br> <span class="hljs-keyword">for</span> hit <span class="hljs-keyword">in</span> hits:<br> entity = hit.get(<span class="hljs-string">"entity"</span>)<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"content: <span class="hljs-subst">{entity[<span class="hljs-string">'content'</span>]}</span>"</span>)<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"distance: <span class="hljs-subst">{hit[<span class="hljs-string">'distance'</span>]:<span class="hljs-number">.4</span>f}</span>"</span>)<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">"-"</span>*<span class="hljs-number">50</span>)<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"数量:<span class="hljs-subst">{<span class="hljs-built_in">len</span>(hits)}</span>"</span>)<br></code></pre></td></tr></table></figure><h2 id="对比搜索结果"><a href="#对比搜索结果" class="headerlink" title="对比搜索结果"></a>对比搜索结果</h2><p>准备工作就绪,先分别看下稠密向量和稀疏向量的搜索结果。在混合搜索的权重策略下,调整权重,一个设置1,另一个设置为0,就可以只查看一种搜索结果。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><code class="hljs python">query = [<span class="hljs-string">"敏捷的狐狸跳过懒惰的狗。"</span>]<br>ranker=WeightedRanker(<span class="hljs-number">1</span>, <span class="hljs-number">0</span>)<br>output_fields = [<span class="hljs-string">"content"</span>]<br>limit_dense = <span class="hljs-number">10</span><br>limit_sparse = <span class="hljs-number">10</span><br>limit_hybrid = <span class="hljs-number">10</span><br><br>res_dense = perform_hybrid_search(collection_name, query, ranker, output_fields, limit_dense, limit_sparse, limit_hybrid)<br>print_vector_results(res_dense)<br></code></pre></td></tr></table></figure><p>稠密向量的搜索结果勉强及格,正确答案分别排在第一、第三、第四和第五。让人不满意的是,语义和查询完全相反的句子,却排在了第二和第六,而且前6个搜索结果的得分相差很小,区别不明显。</p><p>另外,留意一下搜索时间是0.012秒,后面要和基于深度学习的重排做比较。</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">搜索时间:0.012<br>content: 灵活的狐跳过了懒散的犬。<br>distance: 0.9552<br>--------------------------------------------------<br>content: 懒狗跳过了狐狸。<br>distance: 0.9444<br>--------------------------------------------------<br>content: 一只敏捷的狐在公园里跳过了那只懒犬。<br>distance: 0.9373<br>--------------------------------------------------<br>content: 那只灵巧的狐狸轻松地跨过了那只懒散的狗。<br>distance: 0.9366<br>--------------------------------------------------<br>content: 狐迅速地跳过了那只不活跃的犬。<br>distance: 0.9194<br>--------------------------------------------------<br>content: 犬跃过了狐。<br>distance: 0.9025<br>--------------------------------------------------<br>content: 在公园里,那只棕色的狐狸正在跳。<br>distance: 0.8456<br>--------------------------------------------------<br>content: 动物如狗和狐狸生活在公园里。<br>distance: 0.8303<br>--------------------------------------------------<br>content: 那只懈怠的犬正在大树下睡觉。<br>distance: 0.7702<br>--------------------------------------------------<br>content: 树下有一个小池塘。<br>distance: 0.7174<br>--------------------------------------------------<br>数量:10<br></code></pre></td></tr></table></figure><p>调整权重,再来看看稀疏向量的结果。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><code class="hljs python">ranker=WeightedRanker(<span class="hljs-number">0</span>, <span class="hljs-number">1</span>)<br>res_sparse = perform_hybrid_search(collection_name, query, ranker, output_fields, limit_dense, limit_sparse, limit_hybrid)<br>print_vector_results(res_sparse)<br></code></pre></td></tr></table></figure><p><del>稀疏向量的搜索结果不太理想,第一个搜索结果是正确的,第二、第三个搜索结果和查询相差太大,另一个正确答案在第四位才出现。搜索时间是0.014秒,和稠密向量相当。</del></p><p>稀疏向量的搜索结果就更差了,正确答案分别排在第二、第三、第六和第七。这是因为我特意用语义相近但是文本不同的词做了替换,比如用“犬”代替“狗”,“懈怠”代替“懒”,导致它们较难命中查询中的词,得分较低。如果你想了解稀疏向量是如何参与搜索并且计算得分的,可以看看 <a href="http://jiangjunqiao.top/2024/12/11/%E9%97%A8%E5%A4%96%E6%B1%89%E5%A6%82%E4%BD%95%E2%80%9C%E5%86%92%E5%85%85%E2%80%9D%E4%B8%93%E5%AE%B6%EF%BC%9F%E5%90%91%E9%87%8F%E5%B5%8C%E5%85%A5%E4%B9%8B%E7%A8%80%E7%96%8F%E5%90%91%E9%87%8F/">门外汉如何“冒充”专家?向量嵌入之稀疏向量</a> 这篇文章。</p><p>搜索时间是0.014秒,和稠密向量相当。</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">搜索时间:0.014<br>content: 懒狗跳过了狐狸。<br>distance: 0.5801<br>--------------------------------------------------<br>content: 那只灵巧的狐狸轻松地跨过了那只懒散的狗。<br>distance: 0.5586<br>--------------------------------------------------<br>content: 一只敏捷的狐在公园里跳过了那只懒犬。<br>distance: 0.5553<br>--------------------------------------------------<br>content: 在公园里,那只棕色的狐狸正在跳。<br>distance: 0.5502<br>--------------------------------------------------<br>content: 动物如狗和狐狸生活在公园里。<br>distance: 0.5476<br>--------------------------------------------------<br>content: 灵活的狐跳过了懒散的犬。<br>distance: 0.5441<br>--------------------------------------------------<br>content: 狐迅速地跳过了那只不活跃的犬。<br>distance: 0.5336<br>--------------------------------------------------<br>content: 犬跃过了狐。<br>distance: 0.5192<br>--------------------------------------------------<br>content: 那只懈怠的犬正在大树下睡觉。<br>distance: 0.5006<br>--------------------------------------------------<br>content: 树下有一个小池塘。<br>distance: 0.0000<br>--------------------------------------------------<br>数量:10<br></code></pre></td></tr></table></figure><p>接下来是重点了,我们分别使用权重策略和 RRF 策略,看看重排后的结果如何。</p><p>先来看看权重策略中,权重是如何影响综合得分的。我们给稠密向量设置更高的权重——0.8,稀疏向量的权重则设置为0.2。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><code class="hljs python">ranker=WeightedRanker(<span class="hljs-number">0.8</span>, <span class="hljs-number">0.2</span>)<br>res_Weighted = perform_hybrid_search(collection_name, query, ranker, output_fields, limit_dense, limit_sparse, limit_hybrid)<br>print_vector_results(res_Weighted)<br></code></pre></td></tr></table></figure><p>综合排名第一的结果“灵活的狐跳过了懒散的犬。”,在稠密向量中的得分是0.9552,排名也是第一,与第二名相差0.108。</p><p>它在稀疏向量中的得分是0.5441,排名第六。虽然排名低,但是得分与第一名只差0.036分,而且权重只占0.2,对综合得分仍然是第一。因为稠密向量的权重高,综合排名基本和稠密向量的排名一致。</p><p>搜索时间是0.022秒。</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">搜索时间:0.022<br>content: 灵活的狐跳过了懒散的犬。<br>distance: 0.8730<br>--------------------------------------------------<br>content: 懒狗跳过了狐狸。<br>distance: 0.8716<br>--------------------------------------------------<br>content: 那只灵巧的狐狸轻松地跨过了那只懒散的狗。<br>distance: 0.8610<br>--------------------------------------------------<br>content: 一只敏捷的狐在公园里跳过了那只懒犬。<br>distance: 0.8609<br>--------------------------------------------------<br>content: 狐迅速地跳过了那只不活跃的犬。<br>distance: 0.8423<br>--------------------------------------------------<br>content: 犬跃过了狐。<br>distance: 0.8259<br>--------------------------------------------------<br>content: 在公园里,那只棕色的狐狸正在跳。<br>distance: 0.7865<br>--------------------------------------------------<br>content: 动物如狗和狐狸生活在公园里。<br>distance: 0.7738<br>--------------------------------------------------<br>content: 那只懈怠的犬正在大树下睡觉。<br>distance: 0.7163<br>--------------------------------------------------<br>content: 树下有一个小池塘。<br>distance: 0.5739<br>--------------------------------------------------<br>数量:10<br></code></pre></td></tr></table></figure><p>接下来,我们来看看权重策略下的第一名,在 RRF 策略中表现如何。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><code class="hljs python">ranker = RRFRanker(k=<span class="hljs-number">10</span>)<br>res_rrf = perform_hybrid_search(collection_name, query, ranker, output_fields, limit_dense, limit_sparse, limit_hybrid)<br>print_vector_results(res_rrf)<br></code></pre></td></tr></table></figure><p>“灵活的狐跳过了懒散的犬。”在 RRF 策略中的排名从第一下滑到了第四。因为这次注重的是排名,它在稠密向量中虽然排名第一,但是在稀疏向量中的排名只有第六,拉低了综合排名。</p><p>排名第一是“懒狗跳过了狐狸。”,因为它在两个搜索结果中的排名都很高,分别是第二和第一。</p><p>搜索时间是0.022秒,和权重策略的搜索时间差不多。</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">搜索时间:0.022<br>content: 懒狗跳过了狐狸。<br>distance: 0.1742<br>--------------------------------------------------<br>content: 那只灵巧的狐狸轻松地跨过了那只懒散的狗。<br>distance: 0.1548<br>--------------------------------------------------<br>content: 一只敏捷的狐在公园里跳过了那只懒犬。<br>distance: 0.1538<br>--------------------------------------------------<br>content: 灵活的狐跳过了懒散的犬。<br>distance: 0.1534<br>--------------------------------------------------<br>content: 在公园里,那只棕色的狐狸正在跳。<br>distance: 0.1303<br>--------------------------------------------------<br>content: 狐迅速地跳过了那只不活跃的犬。<br>distance: 0.1255<br>--------------------------------------------------<br>content: 动物如狗和狐狸生活在公园里。<br>distance: 0.1222<br>--------------------------------------------------<br>content: 犬跃过了狐。<br>distance: 0.1181<br>--------------------------------------------------<br>content: 那只懈怠的犬正在大树下睡觉。<br>distance: 0.1053<br>--------------------------------------------------<br>content: 树下有一个小池塘。<br>distance: 0.0500<br>--------------------------------------------------<br>数量:10<br></code></pre></td></tr></table></figure><p>终于,轮到我们最期待的重排模型上场了。其实,因为返回的搜索结果数量和文档中的句子数量相同,对任何一个搜索结果重排,或者直接对文档重排,效果都是一样的。不过,为了和实际应用中的粗排、重排流程一致,我们还是对粗排结果重排,比如稀疏向量的搜索结果。</p><p>首先,我们要以字符串列表的形式,获取稀疏向量的搜索结果,以满足重排模型的输入要求。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 获取稀疏向量的搜索结果</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">get_init_res_list</span>(<span class="hljs-params">res, field_name</span>):<br> res_list = []<br> <span class="hljs-keyword">for</span> hits <span class="hljs-keyword">in</span> res:<br> <span class="hljs-keyword">for</span> hit <span class="hljs-keyword">in</span> hits:<br> entity = hit.get(<span class="hljs-string">"entity"</span>)<br> res_list.append(entity[field_name])<br> <span class="hljs-keyword">return</span> res_list<br><br><span class="hljs-comment"># 为了显示重排的效果,我们对搜索结果最差的稀疏向量做重排</span><br>init_res_list = get_init_res_list(res_sparse, field_name)<br></code></pre></td></tr></table></figure><p>接下来,定义重排模型。这里使用的是 bge_m3的重排模型。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-keyword">from</span> pymilvus.model.reranker <span class="hljs-keyword">import</span> BGERerankFunction<br><br><span class="hljs-comment"># 定义重排函数</span><br>bge_rf = BGERerankFunction(<br> model_name=<span class="hljs-string">"BAAI/bge-reranker-v2-m3"</span>,<br> device=<span class="hljs-string">"cpu"</span><br>)<br><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">perform_reranking</span>(<span class="hljs-params">query: <span class="hljs-built_in">str</span>, documents: <span class="hljs-built_in">list</span>, top_k: <span class="hljs-built_in">int</span> = <span class="hljs-number">10</span></span>) -> <span class="hljs-built_in">list</span>:<br> <span class="hljs-comment"># 获取重排结果</span><br> start_time = time.time()<br> rerank_res = bge_rf(<br> <span class="hljs-comment"># query参数是字符串</span><br> query=query[<span class="hljs-number">0</span>],<br> <span class="hljs-comment"># documents参数是字符串列表</span><br> documents=documents,<br> top_k=top_k,<br> )<br> end_time = time.time()<br> total_time = end_time - start_time<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"搜索时间:<span class="hljs-subst">{total_time:<span class="hljs-number">.3</span>f}</span>"</span>)<br> <br> <span class="hljs-keyword">return</span> rerank_res<br><br>top_k = <span class="hljs-number">10</span><br>rerank_res = perform_reranking(query, init_res_list, top_k)<br></code></pre></td></tr></table></figure><p>前面我提到过重排模型会花更多的时间,我们先对比下时间。第一次使用重排模型花了3.2秒,后面再使用一般用时0.4秒,这可能是因为第一次需要加载重排模型到内存中,花的时间较多。所以我们按照用时0.4秒计算。</p><p>基于统计的重排用时在0.014-0.022秒之间,按照最慢的0.022秒计算。两者时间相差18倍。</p><p>重排模型多花了这么多时间,效果怎么样呢?打印搜索结果看看吧。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-keyword">for</span> hit <span class="hljs-keyword">in</span> rerank_res:<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"content: <span class="hljs-subst">{hit.text}</span>"</span>)<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"score: <span class="hljs-subst">{hit.score:<span class="hljs-number">.4</span>f}</span>"</span>)<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">"-"</span>*<span class="hljs-number">50</span>)<br></code></pre></td></tr></table></figure><p>我对重排结果还是比较满意的。四个正确答案排在前四名,而且得分非常接近满分1分。而且,它们和其他搜索结果在得分上终于拉开了较大的差距。</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">content: 灵活的狐跳过了懒散的犬。<br>score: 0.9998<br>--------------------------------------------------<br>content: 狐迅速地跳过了那只不活跃的犬。<br>score: 0.9997<br>--------------------------------------------------<br>content: 那只灵巧的狐狸轻松地跨过了那只懒散的狗。<br>score: 0.9987<br>--------------------------------------------------<br>content: 一只敏捷的狐在公园里跳过了那只懒犬。<br>score: 0.9980<br>--------------------------------------------------<br>content: 犬跃过了狐。<br>score: 0.3730<br>--------------------------------------------------<br>content: 懒狗跳过了狐狸。<br>score: 0.2702<br>--------------------------------------------------<br>content: 在公园里,那只棕色的狐狸正在跳。<br>score: 0.1924<br>--------------------------------------------------<br>content: 动物如狗和狐狸生活在公园里。<br>score: 0.0972<br>--------------------------------------------------<br>content: 那只懈怠的犬正在大树下睡觉。<br>score: 0.0059<br>--------------------------------------------------<br>content: 树下有一个小池塘。<br>score: 0.0000<br>--------------------------------------------------<br></code></pre></td></tr></table></figure><h2 id="总结"><a href="#总结" class="headerlink" title="总结"></a>总结</h2><p>通过对比我们发现,基于统计的重排速度快,准确性一般,适合追求高响应速度和低成本的场景,比如网页搜索、电商。</p><p>它有权重和 RRF 两个策略。如果你更看重某种类型的搜索结果,建议使用权重策略。如果你没有明显的偏好,希望在不同搜索结果中,排名都靠前的结果能够胜出,建议使用 RRF 策略。</p><p>基于深度学习的重排速度慢,但是准确性高,适合对回答准确性要求高的场景,比如专业知识库或者客服系统。</p><h2 id="藏宝图"><a href="#藏宝图" class="headerlink" title="藏宝图"></a>藏宝图</h2><p>如果你还想了解更多重排的知识,可以参考下面的文章:</p><ul><li><a href="https://zilliz.com.cn/blog/rag-reranker-therole-and-tradeoffs">提高 RAG 应用准确度,时下流行的 Reranker 了解一下?</a></li><li><a href="https://mp.weixin.qq.com/s/t_ybJrc_Dhi5keIsgqy1vg">一文玩转 Milvus 新特性之 Hybrid Search</a></li><li><a href="https://milvus.io/docs/rerankers-overview.md">Rerankers Overview</a></li><li><a href="https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker#model-list">BGE 重排模型-github</a></li><li><a href="https://milvus.io/docs/rerankers-bge.md">BGE重排模型在Milvus中的使用</a></li><li><a href="https://milvus.io/docs/rerankers-cross-encoder.md">Cross Encoder 重排模型在 Milvus 中的使用</a></li></ul><h2 id="参考"><a href="#参考" class="headerlink" title="参考"></a>参考</h2><section class="footnotes"><div class="footnote-list"><ol><li><span id="fn:1" class="footnote-text"><span><a href="https://zilliz.com.cn/blog/Hybrid-Search">一文玩转 Milvus 新特性之 Hybrid Search</a><a href="#fnref:1" rev="footnote" class="footnote-backref"> ↩</a></span></span></li><li><span id="fn:2" class="footnote-text"><span><a href="https://milvus.io/docs/reranking.md#Weighted-Scoring-WeightedRanker">Reranking</a><a href="#fnref:2" rev="footnote" class="footnote-backref"> ↩</a></span></span></li></ol></div></section>]]></content>
<categories>
<category>向量数据库</category>
<category>原理探秘</category>
</categories>
</entry>
<entry>
<title>写给新读者的导航</title>
<link href="/2025/01/07/%E5%86%99%E7%BB%99%E6%96%B0%E8%AF%BB%E8%80%85%E7%9A%84%E5%AF%BC%E8%88%AA/"/>
<url>/2025/01/07/%E5%86%99%E7%BB%99%E6%96%B0%E8%AF%BB%E8%80%85%E7%9A%84%E5%AF%BC%E8%88%AA/</url>
<content type="html"><![CDATA[<p>你好啊朋友,我是江浩,一名AI大陆的探险者,目前主要关注向量数据库和大语言模型领域。</p><p>在博客里,我会探秘AI神奇能力的背后原理。别担心,我会用有趣的言语和生动的类比来解释这些原理。你是否好奇,孙悟空 + 红楼梦 - 西游记 = ?那就来了解下向量嵌入吧。当你接触到向量嵌入后,你可能会问,既然已经有了稠密向量,为什么还需要稀疏向量?嗯,如果说稠密向量是领域专家,那么稀疏向量就是一个聪明的门外汉,面对领域外的知识,请教后者反而更合适。</p><p>除了原理探秘,我还会用AI开发一些有趣的应用,比如,用白话文搜索语义相似的古诗词,让你体验一把“文艺青年”的乐趣。或者,开发一个“鲁迅说没有”的RAG应用,验证把所谓的“鲁迅名言”是否属实,它的原文又是怎样的。甚至,我还想和牛魔王对话,问问他更爱铁扇公主还是玉面狐狸。哈哈,有趣的想法太多,慢慢实现。</p><p>对了,我在探索AI大陆时,也采集了不少鲜美的果实————优质资源,我会整理好了分享给你。</p><p>AI大陆有趣又神奇,朋友,我邀请你和我同行。</p><p>ChangeLog<br>2025-01-07</p>]]></content>
<categories>
<category>杂谈</category>
</categories>
</entry>
<entry>
<title>门外汉如何“冒充”专家?向量嵌入之稀疏向量</title>
<link href="/2024/12/11/%E9%97%A8%E5%A4%96%E6%B1%89%E5%A6%82%E4%BD%95%E2%80%9C%E5%86%92%E5%85%85%E2%80%9D%E4%B8%93%E5%AE%B6%EF%BC%9F%E5%90%91%E9%87%8F%E5%B5%8C%E5%85%A5%E4%B9%8B%E7%A8%80%E7%96%8F%E5%90%91%E9%87%8F/"/>
<url>/2024/12/11/%E9%97%A8%E5%A4%96%E6%B1%89%E5%A6%82%E4%BD%95%E2%80%9C%E5%86%92%E5%85%85%E2%80%9D%E4%B8%93%E5%AE%B6%EF%BC%9F%E5%90%91%E9%87%8F%E5%B5%8C%E5%85%A5%E4%B9%8B%E7%A8%80%E7%96%8F%E5%90%91%E9%87%8F/</url>
<content type="html"><![CDATA[<p>在 <a href="http://jiangjunqiao.top/2024/10/11/%E5%AD%99%E6%82%9F%E7%A9%BA-%E7%BA%A2%E6%A5%BC%E6%A2%A6-%E8%A5%BF%E6%B8%B8%E8%AE%B0-%EF%BC%9F%E5%90%91%E9%87%8F%E5%B5%8C%E5%85%A5%E4%B9%8B%E7%A8%A0%E5%AF%86%E5%90%91%E9%87%8F/">孙悟空 + 红楼梦 - 西游记 = ?向量嵌入之稠密向量</a> 这篇文章中,我们已经知道了文本怎么变成稠密向量,并且还能够表达文本的语义。但是,对于嵌入模型的“专业领域”外的文本,它的效果不尽如人意。</p><p>打个比方,假设你身体不舒服去看医生,医生完全理解你的描述,他会判断病因然后做出诊断。但是,如果你问医生“人工智能如何影响汽车行业?”,医生大概会觉得你不仅身体不舒服,脑子也需要治一治。医生不懂这方面的知识。</p><p>想要获得答案,你可以去找人工智能或者汽车领域的专家。当然,你还有另一个选择,去找一位聪明的门外汉,“冒充”专家。</p><blockquote><p>本文首发于 Zilliz 公众号。文中代码的 Notebook 在<a href="https://pan.baidu.com/s/1TCCz9KZelyNFgiUyPIvVnQ?pwd=9zh6">这里</a>下载。</p></blockquote><h2 id="聪明的门外汉——BM25"><a href="#聪明的门外汉——BM25" class="headerlink" title="聪明的门外汉——BM25"></a>聪明的门外汉——BM25</h2><p>稠密向量(Dense Vector)的维度较低,一般在几百到上千左右,每个维度的元素一般都不为零。相对的,还有一种稀疏向量(Sparse Vector),它的维度远远超过稠密向量,一般有几万甚至十万,但是大部分维度的元素都为零,只有少数元素是非零的。</p><p>稀疏向量分成统计得到的稀疏向量和学习得到的稀疏向量两种,我们先聊聊第一种,代表就是 BM25。</p><p>BM25 是一位聪明的门外汉,你问他领域外的知识,他虽然不理解,但是他会找到问题中的关键词,比如“人工智能”和“汽车”,然后去查文档,把文档中和关键词最相关的信息告诉你。</p><p>那么,这位门外汉具体是怎么做的呢?</p><p>首先,他会搜索成百上千篇相关文档,并且快速地翻一遍,了解这些文档中有哪些专业术语。什么样的词是专业术语呢?作为一个聪明的门外汉,他决定通过单词出现的频率判断。像“的”、“是”、“了”等常见词肯定不是专业术语,反而是那些出现频率比较低的词,更可能是专业术语。</p><p>这就好比你有两个微信群,一个是工作群,平时消息不多,但是一旦有消息,不是领导布置任务,就是同事反馈进度,都很重要,你把这个群置顶了。</p><p>另一个是吃喝玩乐群,一群朋友在群里聊天吹水,一整天消息不断,但是没那么重要,错过就错过了,忙的时候你还会设置成“信息免打扰”。</p><p>对你来说,不同的群权重不同,门外汉也会为不同的词设置不同的权重。他会为文档中出现的词建立一个词汇表,并且根据单词出现的频率赋予权重,出现的频率越低,权重越大,越可能是专业术语。</p><p>然后,他要判断哪些文档和“人工智能”以及“汽车”这两个专业术语更相关。他会对照词汇表,数一数每篇文档中这两个术语出现的频率,频率越高,相关性越大。</p><p>以上是 BM25 的极简版解释,实际算法要复杂很多。公式越多,读者越少,所以下面我就简单介绍下 BM25 算法的工作原理。</p><p>首先,BM25对文档集合做分词处理,得到一张词汇表。词汇表的单词(准确来说是 token)的数量,就是稀疏向量的维度。</p><p>然后,对查询也做分词处理,比如,如果查询是“人工智能如何影响汽车行业?”,分词得到“人工智能“”、“影响”和“汽车行业”这三个词。</p><p>接下来,计算<strong>文档集合中的每个词</strong>的逆文档频率 IDF,以及<strong>查询中的某个词在指定文档</strong>中的词频 TF。</p><p>逆文档频率 IDF(Inverse Document Frequency),很绕口的一个名字。简单来说,它用来计算某个词在<strong>文档集合</strong>中出现的次数。出现次数越少,数值越大。门外汉用它给出现频率低的专业术语,赋予更大的权重。</p><p>$$<br>\text{IDF}(q_i) = \log \left( \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} \right)<br>$$</p><p>其中:</p><ul><li>${IDF}(q_i)$ 是单词 $q_i$ 的逆文档频率。</li><li>$N$ 是文档总数。</li><li>$n(q_i)$ 是语料库中,包含查询词 $(q_i)$ 的文档数量。</li><li>$0.5$ 是一个平滑因子,用于避免分母为零的情况。</li></ul><p>词频TF(Term Frequency)表示查询中的某个词,在<strong>指定文档</strong>中出现的频率,频率越大数值越大,也就意味着查询和该文档的相关性更高。</p><p>$$<br>\text{TF}(q_i, d) = \frac{f(q_i, d) \cdot (k_1 + 1)}{f(q_i, d) + k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})}<br>$$<br>其中:</p><ul><li>${TF}(q_i, d)$ 是查询词 $q_i$ 在文档 $d$ 中的词频。反映了查询词在文档中的重要性。</li><li>$q$ 是查询。</li><li>$d$ 是语料库中的某一个文档。</li><li>$q_i$ 是查询中的第 $i$ 个 token。</li><li>$f (q_i, d)$ 是查询词 $q_i$ 在文档 $d$ 中出现的次数。</li><li>$k_1$ 是一个调节参数,用于控制词频的影响。 $k_1$ 取值在1.2到2之间</li><li>$b$ 是一个调节参数,用于控制文档长度对词频的影响。$b$ 取值为0.75。</li><li>$|d|$ 是文档的长度。文档长度指的是分词后的 token 数量。</li><li>${avgdl}$ 是语料库中所有文档的平均长度。</li></ul><p>最后,根据 IDF 和 TF 计算 BM25分数,用来表示查询与指定文档的相关程度。</p><p>$$\text{BM25}(q, d) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \text{TF}(q_i, d)$$</p><h2 id="BM25-代码实践"><a href="#BM25-代码实践" class="headerlink" title="BM25 代码实践"></a>BM25 代码实践</h2><p>好啦,纸上谈兵到此结束,下面我们用代码实际操练一番吧。先做点准备工作。</p><p>版本说明:<br>Milvus 版本:>=2.5.0<br>pymilvus 版本:>=2.5.0</p><p>假如下面的字符串列表就是我们的文档集合,每个字符串是一个文档:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">docs = [<br> "机器学习正在改变我们的生活方式。",<br> "深度学习在图像识别中表现出色。",<br> "自然语言处理是计算机科学的重要领域。",<br> "自动驾驶依赖于先进的算法。",<br> "AI可以帮助医生诊断疾病。",<br> "金融领域广泛应用数据分析技术。",<br> "生产效率可以通过自动化技术提高。",<br> "机器智能的未来充满潜力。",<br> "大数据支持是机器智能发展的关键。",<br> "量子隧穿效应使得电子能够穿过经典力学认为无法穿过的势垒,这在半导体器件中有着重要的应用。"<br>]<br></code></pre></td></tr></table></figure><p>使用BM25对第一个文档“机器学习正在改变我们的生活方式。”做分词处理:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-keyword">from</span> pymilvus.model.sparse.bm25.tokenizers <span class="hljs-keyword">import</span> build_default_analyzer<br><span class="hljs-keyword">from</span> pymilvus.model.sparse <span class="hljs-keyword">import</span> BM25EmbeddingFunction<br><br><span class="hljs-comment"># 使用支持中文的分析器</span><br>analyzer = build_default_analyzer(language=<span class="hljs-string">"zh"</span>)<br><br><span class="hljs-comment"># 分析器对文本做分词处理</span><br>tokens1 = analyzer(docs[<span class="hljs-number">0</span>])<br><span class="hljs-built_in">print</span>(tokens1)<br></code></pre></td></tr></table></figure><p>分词结果如下:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">['机器', '学习', '改变', '生活', '方式']<br></code></pre></td></tr></table></figure><p>接下来对整个文档集合做分词处理,并且计算文档集合的 IDF 等参数:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 创建BM25EmbeddingFunction实例,传入分词器,以及其他参数</span><br>bm25_ef = BM25EmbeddingFunction(analyzer)<br><br><span class="hljs-comment"># 计算文档集合的参数</span><br>bm25_ef.fit(docs)<br><br><span class="hljs-comment"># 保存训练好的参数到磁盘以加快后续处理</span><br>bm25_ef.save(<span class="hljs-string">"bm25_params.json"</span>)<br></code></pre></td></tr></table></figure><p>我们看下参数有哪些内容:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-keyword">import</span> json<br><br>file_path = <span class="hljs-string">"bm25_params.json"</span><br><span class="hljs-keyword">with</span> <span class="hljs-built_in">open</span>(file_path, <span class="hljs-string">'r'</span>, encoding=<span class="hljs-string">'utf-8'</span>) <span class="hljs-keyword">as</span> file:<br> bm25_params = json.load(file)<br> <span class="hljs-built_in">print</span>(bm25_params)<br></code></pre></td></tr></table></figure><p><code>corpus_size</code> 是文档数量,<code>avgdl</code>、<code>idf_value</code> 等参数都在前面的公式中出现过。</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">{'version': 'v1', 'corpus_size': 10, 'avgdl': 5.4, 'idf_word': ['机器', '学习', '改变', '生活', '方式', '深度', '图像识别', '中', '表现出色', '自然语言', '计算机科学', '领域', '自动', '驾驶', '依赖于', '先进', '算法', 'AI', '医生', '诊断', '疾病', '金融', '广泛应用', '数据分析', '技术', '生产', '效率', '自动化', '提高', '智能', '未来', '充满', '潜力', '大', '数据', '支持', '发展', '关键', '量子', '隧穿', '效应', '电子', '穿过', '经典力学', '势垒', '半导体器件'], 'idf_value': [0.7621400520468966, 1.2237754316221157, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.2237754316221157, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.2237754316221157, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.2237754316221157, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.2237754316221157, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.845826690498331, 1.845826690498331], 'k1': 1.5, 'b': 0.75, 'epsilon' `idf_word` 是 BM25对文档集合的分词结果,也就是前面提到的词汇表。词汇表中单词的数量,也是稀疏向量的维度。<hexoPostRenderCodeBlock><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># BM25词汇表中的单词数量</span><br><span class="hljs-built_in">print</span>(<span class="hljs-string">f"BM25词汇表中的单词数量:<span class="hljs-subst">{<span class="hljs-built_in">len</span>(bm25_params[<span class="hljs-string">'idf_word'</span>])}</span>"</span>)<br><br><span class="hljs-comment"># BM25稀疏向量的维度</span><br><span class="hljs-built_in">print</span>(<span class="hljs-string">f"BM25稀疏向量维度:<span class="hljs-subst">{bm25_ef.dim}</span>"</span>)<br></code></pre></td></tr></table></figure><p>返回的结果:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">BM25词汇表中的单词数量:46<br>BM25稀疏向量维度:46<br></code></pre></td></tr></table></figure><p>需要的参数计算好了,接下来就可以生成文档集合的稀疏向量了。文档集合中有10篇文档,也就是10个字符串,而稀疏向量的维度是46,所以文档集合的稀疏向量是一个10行46列的矩阵。每一行表示一个文档的稀疏向量。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 生成文档集合的稀疏向量</span><br>sparse_vectors_bm25 = bm25_ef.encode_documents(docs)<br><br><span class="hljs-comment"># 打印文档集合的稀疏向量</span><br><span class="hljs-built_in">print</span>(sparse_vectors_bm25)<br></code></pre></td></tr></table></figure><p>输出结果:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">(0, 0)1.0344827586206897<br>(0, 1)1.0344827586206897<br>(0, 2)1.0344827586206897<br>(0, 3)1.0344827586206897<br>(0, 4)1.0344827586206897<br>::<br>(9, 7)0.7228915662650603<br>(9, 38)0.7228915662650603<br>(9, 39)0.7228915662650603<br>(9, 40)0.7228915662650603<br>(9, 41)0.7228915662650603<br>(9, 42)1.1214953271028039<br>(9, 43)0.7228915662650603<br>(9, 44)0.7228915662650603<br>(9, 45)0.7228915662650603<br></code></pre></td></tr></table></figure><p>我们来看下第一个文档“机器学习正在改变我们的生活方式。”的稀疏向量:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 第一个文档的稀疏向量</span><br><span class="hljs-built_in">print</span>(<span class="hljs-built_in">list</span>(sparse_vectors_bm25)[<span class="hljs-number">0</span>])<br></code></pre></td></tr></table></figure><p>结果为:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">(0, 0)1.0344827586206897<br>(0, 1)1.0344827586206897<br>(0, 2)1.0344827586206897<br>(0, 3)1.0344827586206897<br>(0, 4)1.0344827586206897<br></code></pre></td></tr></table></figure><p>你发现了吧,第一个文档的稀疏向量只有5个非零元素,因为它的分词结果是5个单词,对应上了。而且,每个元素的值都相同,说明它们的逆文档频率 IDF 和词频 TF 都是一样的。</p><p>第一个文档的分词结果:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">['机器', '学习', '改变', '生活', '方式']<br></code></pre></td></tr></table></figure><p>文档集合处理好了,我们再给出一个查询的句子,就可以执行搜索了。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><code class="hljs python">query = [<span class="hljs-string">"自动驾驶如何影响汽车行业?"</span>]<br><br><span class="hljs-comment"># 把查询文本向量化</span><br>query_sparse_vectors_bm25 = bm25_ef.encode_queries(query)<br><br><span class="hljs-comment"># 打印稀疏向量</span><br><span class="hljs-built_in">print</span>(query_sparse_vectors_bm25)<br><br><span class="hljs-comment"># 查询的分词结果</span><br><span class="hljs-built_in">print</span>(analyzer(query[<span class="hljs-number">0</span>]))<br></code></pre></td></tr></table></figure><p>查看查询的稀疏向量,以及它的分词结果。</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><code class="hljs plaintext"> (0, 12)1.845826690498331<br> (0, 13)1.845826690498331<br>['自动', '驾驶', '影响', '汽车行业']<br></code></pre></td></tr></table></figure><p>你可能会有疑问,为什么查询分词后得到4个单词,但是它的稀疏向量只有2维?因为这4个单词中,词汇表中只有“自动”和“驾驶”,没有“影响”和“汽车行业”,后两个词的 BM25 分数为0。</p><p>哎,毕竟是门外汉啊。</p><h2 id="刚入门的新人——splade"><a href="#刚入门的新人——splade" class="headerlink" title="刚入门的新人——splade"></a>刚入门的新人——splade</h2><p>如果说稠密向量是精通特定领域的专家,统计得到的稀疏向量 BM25是聪明的门外汉,那么学习得到的稀疏向量 splade 就是刚入门的新人。他理解领域内专业术语的语义,而且能够举一反三,增加更多语义相近的词,一起查找。但是他毕竟还是新人,并不精通,还是通过数专业术语出现的次数,找到最相关的文档。</p><p>具体来说,splade 是这样工作的:<br>首先,splade 先对句子分词,通过嵌入模型 BERT (BERT 相关内容详见 <a href="http://jiangjunqiao.top/2024/10/11/%E5%AD%99%E6%82%9F%E7%A9%BA-%E7%BA%A2%E6%A5%BC%E6%A2%A6-%E8%A5%BF%E6%B8%B8%E8%AE%B0-%EF%BC%9F%E5%90%91%E9%87%8F%E5%B5%8C%E5%85%A5%E4%B9%8B%E7%A8%A0%E5%AF%86%E5%90%91%E9%87%8F/">02-孙悟空 + 红楼梦 - 西游记 = ?向量嵌入之稠密向量</a>)得到单词的向量。向量可以表达语义,所以 splade 能够“举一反三”,找到更多语义相似的单词。</p><p>比如,对于“人工智能如何影响汽车行业”这个句子,分词得到“人工智能”和“汽车”两个 单词,以及与“人工智能”相似的“AI”等单词。</p><p>splade 也有一张词汇表,不过它不需要像 bm25 那样根据文档集合统计,而是预先就有的,来源于 BERT。</p><p>接下来,splade 生成这些单词的稀疏向量。它会计算每个单词出现在词汇表中的每个位置的概率。也就是说,单词和词汇表中某个位置的词在语义上越接近,计算得到的概率越大。这个概率就是单词的权重。</p><p>以“人工智能”为例,假设词汇表中第5个词也是“人工智能”,两个词完全一样,计算得到的概率就很高,比如40%。而词汇表第8个词是“机器学习”,两个词比较相似,概率是20%。而词汇表中其他的词和“人工智能”语义相差较远,概率很小,忽略不计。最后,“人工智能”的权重就是 $40% + 20% = 60%$。</p><p>然后再用相同的方法,计算出“AI”和“汽车”的权重,得到稀疏向量:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs python">sparse_vector = {<span class="hljs-string">"人工智能"</span>: <span class="hljs-number">0.6</span>,<span class="hljs-string">"AI"</span>: <span class="hljs-number">0.5</span>,<span class="hljs-string">"汽车"</span>: <span class="hljs-number">0.1</span>}<br></code></pre></td></tr></table></figure><h2 id="splade-代码实践"><a href="#splade-代码实践" class="headerlink" title="splade 代码实践"></a>splade 代码实践</h2><p>老规矩,我们还使用代码验证下前面的内容。这次使用英文的文档集合:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><code class="hljs plaintext"># 使用英文<br>docs_en = [<br> "Machine learning is changing our way of life.",<br> "Deep learning performs exceptionally well in image recognition.",<br> "Natural language processing is an important field in computer science.",<br> "Autonomous driving relies on advanced algorithms.",<br> "AI can help doctors diagnose diseases.",<br> "Data analysis technology is widely applied in the financial field.",<br> "Production efficiency can be improved through automation technology.",<br> "The future of machine intelligence is full of potential.",<br> "Big data support is key to the development of machine intelligence.",<br> "The quantum tunneling effect allows electrons to pass through potential barriers that classical mechanics consider impassable, which has important applications in semiconductor devices."<br>]<br></code></pre></td></tr></table></figure><p>生成文档集合的稀疏向量:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-keyword">from</span> pymilvus.model.sparse <span class="hljs-keyword">import</span> SpladeEmbeddingFunction<br><br>query_en = [<span class="hljs-string">"How does artificial intelligence affect the automotive industry?"</span>]<br><br>model_name = <span class="hljs-string">"naver/splade-cocondenser-selfdistil"</span><br><br><span class="hljs-comment"># 实例化splade嵌入模型</span><br>splade_ef = SpladeEmbeddingFunction(<br> model_name = model_name, <br> device=<span class="hljs-string">"cpu"</span><br>)<br><br><span class="hljs-comment"># 生成文档集合的稀疏向量</span><br>sparse_vectors_splade = splade_ef.encode_documents(docs_en)<br><span class="hljs-built_in">print</span>(sparse_vectors_splade)<br></code></pre></td></tr></table></figure><p>和 BM25 一样,我们同样得到一个稀疏向量矩阵:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">(0, 1012)0.053256504237651825<br>(0, 2003)0.22995686531066895<br>(0, 2047)0.08765587955713272<br>::<br>(9, 27630)0.2794925272464752<br>(9, 28688)0.02786295674741268<br>(9, 28991)0.12241243571043015<br></code></pre></td></tr></table></figure><p>splade 的词汇表是预先准备好的,词汇表中的单词数量同样也是稀疏向量的维度。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># splade词汇表中的单词数量</span><br><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AutoModelForMaskedLM, AutoTokenizer<br>tokenizer = AutoTokenizer.from_pretrained(model_name)<br><span class="hljs-built_in">print</span>(<span class="hljs-string">f"splade词汇表中的单词数量:<span class="hljs-subst">{tokenizer.vocab_size}</span>"</span>)<br><br><span class="hljs-built_in">print</span>(<span class="hljs-string">f"splade稀疏向量维度:<span class="hljs-subst">{splade_ef.dim}</span>"</span>)<br></code></pre></td></tr></table></figure><p>二者相同:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">splade词汇表中的单词数量:30522<br>splade稀疏向量维度:30522<br></code></pre></td></tr></table></figure><p>我们再来看看查询的分词结果及其稀疏向量:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 查看查询的分词</span><br>tokens = tokenizer.tokenize(query_en[<span class="hljs-number">0</span>])<br><span class="hljs-built_in">print</span>(<span class="hljs-string">f"“<span class="hljs-subst">{query_en[<span class="hljs-number">0</span>]}</span>” 的分词结果:\n<span class="hljs-subst">{tokens}</span>"</span>)<br><span class="hljs-built_in">print</span>(<span class="hljs-string">f"tokens数量:<span class="hljs-subst">{<span class="hljs-built_in">len</span>(tokens)}</span>"</span>)<br><br><span class="hljs-comment"># 生成查询的稀疏向量</span><br>query_sparse_vectors_splade = splade_ef.encode_queries(query_en)<br><span class="hljs-built_in">print</span>(query_sparse_vectors_splade)<br></code></pre></td></tr></table></figure><p>结果如下:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">“How does artificial intelligence affect the automotive industry?” 的分词结果:<br>['how', 'does', 'artificial', 'intelligence', 'affect', 'the', 'automotive', 'industry', '?']<br><br>tokens数量:9<br><br> (0, 2054)0.139632448554039<br> (0, 2079)0.08572433888912201<br> (0, 2106)0.22006677091121674<br> (0, 2126)0.038961488753557205<br> (0, 2129)0.6875206232070923<br> (0, 2138)0.5343469381332397<br> (0, 2194)0.32417890429496765<br> (0, 2224)0.011731390841305256<br> (0, 2339)0.33811360597610474<br> ::<br> (0, 26060)0.0731586366891861<br></code></pre></td></tr></table></figure><p>比较分词的数量和稀疏向量的维度,你有没有发现有什么不对劲的地方?没错,分词数量和稀疏向量的维度不一样。这就是 splade 和 BM25的重要区别,splade 能够“举一反三”,它在最初9个分词的基础上,又增加了其他语义相近的单词。</p><p>那么,查询现在一共有多少个单词呢?或者说,它的稀疏向量的非零元素有多少呢?</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 获取稀疏向量的非零索引</span><br>nonzero_indices = query_sparse_vectors_splade.indices[query_sparse_vectors_splade.indptr[<span class="hljs-number">0</span>]:query_sparse_vectors_splade.indptr[<span class="hljs-number">1</span>]]<br><br><span class="hljs-comment"># 构建稀疏词权重列表</span><br>sparse_token_weights = [<br> (splade_ef.model.tokenizer.decode(col), query_sparse_vectors_splade[<span class="hljs-number">0</span>, col])<br> <span class="hljs-keyword">for</span> col <span class="hljs-keyword">in</span> nonzero_indices<br>]<br><br><span class="hljs-comment"># 按权重降序排序</span><br>sparse_token_weights = <span class="hljs-built_in">sorted</span>(sparse_token_weights, key=<span class="hljs-keyword">lambda</span> item: item[<span class="hljs-number">1</span>], reverse=<span class="hljs-literal">True</span>)<br><br><span class="hljs-comment"># 查询句只有9个tokens,splade通过举一反三,生成的稀疏向量维度增加到了98个。</span><br><span class="hljs-built_in">print</span>(<span class="hljs-string">f"splade 稀疏向量非零元素数量:<span class="hljs-subst">{<span class="hljs-built_in">len</span>(sparse_token_weights)}</span>"</span>)<br></code></pre></td></tr></table></figure><p>一共有98个:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">splade 稀疏向量非零元素数量:98<br></code></pre></td></tr></table></figure><p>具体是哪些单词?我们打印出来看一下:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 比如,和“artificial intelligence”语义相近的 “ai”,和“automotive”语义相近的“car”。</span><br><span class="hljs-keyword">for</span> token <span class="hljs-keyword">in</span> sparse_token_weights:<br> <span class="hljs-built_in">print</span>(token)<br></code></pre></td></tr></table></figure><p>splade 增加了大量语义相近的单词,比如和“artificial intelligence”语义相近的 “ai”,和“automotive”语义相近的“car”和“vehicle”。</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">('artificial', 2.588431)<br>('intelligence', 2.3582284)<br>('car', 1.590975)<br>('automotive', 1.4835068)<br>('vehicle', 0.798108)<br>('ai', 0.676852)<br> ::<br></code></pre></td></tr></table></figure><h2 id="搜索实践"><a href="#搜索实践" class="headerlink" title="搜索实践"></a>搜索实践</h2><p>我们已经了解了两种稀疏向量的特点,以及生成方法,下面就在搜索中体会下它们的区别吧。</p><p>我们需要用 Milvus 创建集合,然后导入数据,创建索引,加载数据,就可以搜索了。这个过程我在 <a href="http://jiangjunqiao.top/2024/09/16/%E5%A6%82%E4%BD%95%E5%81%87%E8%A3%85%E6%96%87%E8%89%BA%E9%9D%92%E5%B9%B4%EF%BC%8C%E6%80%8E%E4%B9%88%E6%8A%8A%E5%A4%A7%E7%99%BD%E8%AF%9D%E2%80%9C%E5%8F%98%E6%88%90%E2%80%9D%E5%8F%A4%E8%AF%97%E8%AF%8D%EF%BC%9F/">如何假装文艺青年,怎么把大白话“变成”古诗词?</a> 中有详细介绍,就不多赘述了。</p><p>创建集合。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-keyword">from</span> pymilvus <span class="hljs-keyword">import</span> MilvusClient, DataType<br><span class="hljs-keyword">import</span> time<br><br><span class="hljs-comment"># 删除同名集合</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">check_collection</span>(<span class="hljs-params">collection_name</span>):<br> <span class="hljs-keyword">if</span> milvus_client.has_collection(collection_name):<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"集合 <span class="hljs-subst">{collection_name}</span> 已经存在"</span>)<br> <span class="hljs-keyword">try</span>:<br> milvus_client.drop_collection(collection_name)<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"删除集合:<span class="hljs-subst">{collection_name}</span>"</span>)<br> <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span><br> <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"删除集合时出现错误: <span class="hljs-subst">{e}</span>"</span>)<br> <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span><br> <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span><br><br><br><span class="hljs-comment"># 创建模式</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">create_schema</span>():<br> schema = milvus_client.create_schema(<br> auto_id=<span class="hljs-literal">True</span>,<br> enable_dynamic_field=<span class="hljs-literal">True</span>,<br> num_partitions=<span class="hljs-number">16</span>,<br> description=<span class="hljs-string">""</span><br> )<br> <span class="hljs-comment"># 添加字段到schema</span><br> schema.add_field(field_name=<span class="hljs-string">"id"</span>, datatype=DataType.INT64, is_primary=<span class="hljs-literal">True</span>, max_length=<span class="hljs-number">256</span>)<br> schema.add_field(field_name=<span class="hljs-string">"text"</span>, datatype=DataType.VARCHAR, max_length=<span class="hljs-number">256</span>)<br> <span class="hljs-comment"># bm25稀疏向量</span><br> schema.add_field(field_name=<span class="hljs-string">"sparse_vectors_bm25"</span>, datatype=DataType.SPARSE_FLOAT_VECTOR)<br> <span class="hljs-comment"># splade稀疏向量</span><br> schema.add_field(field_name=<span class="hljs-string">"sparse_vectors_splade"</span>, datatype=DataType.SPARSE_FLOAT_VECTOR)<br> <span class="hljs-keyword">return</span> schema<br><br><br><span class="hljs-comment"># 创建集合</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">create_collection</span>(<span class="hljs-params">collection_name, schema, timeout = <span class="hljs-number">3</span></span>):<br> <span class="hljs-comment"># 创建集合</span><br> <span class="hljs-keyword">try</span>:<br> milvus_client.create_collection(<br> collection_name=collection_name,<br> schema=schema,<br> shards_num=<span class="hljs-number">2</span><br> )<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"开始创建集合:<span class="hljs-subst">{collection_name}</span>"</span>)<br> <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"创建集合的过程中出现了错误: <span class="hljs-subst">{e}</span>"</span>)<br> <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span><br><br> <span class="hljs-comment"># 检查集合是否创建成功</span><br> start_time = time.time()<br> <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:<br> <span class="hljs-keyword">if</span> milvus_client.has_collection(collection_name):<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"集合 <span class="hljs-subst">{collection_name}</span> 创建成功"</span>)<br> <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span><br> <span class="hljs-keyword">elif</span> time.time() - start_time > timeout:<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"创建集合 <span class="hljs-subst">{collection_name}</span> 超时"</span>)<br> <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span><br> time.sleep(<span class="hljs-number">1</span>)<br><br><br><span class="hljs-comment"># 定义删除集合失败的异常类</span><br><span class="hljs-keyword">class</span> <span class="hljs-title class_">CollectionDeletionError</span>(<span class="hljs-title class_ inherited__">Exception</span>):<br> <span class="hljs-string">"""删除集合失败"""</span><br><br>collection_name = <span class="hljs-string">"docs"</span><br>uri=<span class="hljs-string">"http://localhost:19530"</span><br>milvus_client = MilvusClient(uri=uri)<br>check_collection(collection_name)<br><br><span class="hljs-comment"># 检查并删除同名集合</span><br><span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> check_collection(collection_name):<br> <span class="hljs-comment"># 无法删除集合,抛出异常</span><br> <span class="hljs-keyword">raise</span> CollectionDeletionError(<span class="hljs-string">'删除集合失败'</span>)<br><span class="hljs-keyword">else</span>:<br> <span class="hljs-comment"># 创建集合的模式</span><br> schema = create_schema()<br> <span class="hljs-comment"># 创建集合并等待成功</span><br> create_collection(collection_name, schema)<br></code></pre></td></tr></table></figure><p>导入数据。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 准备数据</span><br>entities = [<br> {<br> <span class="hljs-comment"># 文本字段</span><br> <span class="hljs-string">"text"</span>: docs[i],<br> <span class="hljs-string">"text_en"</span>: docs_en[i],<br> <span class="hljs-comment"># bm25稀疏向量字段</span><br> <span class="hljs-string">"sparse_vectors_bm25"</span>: <span class="hljs-built_in">list</span>(sparse_vectors_bm25)[i].reshape(<span class="hljs-number">1</span>, -<span class="hljs-number">1</span>),<br> <span class="hljs-comment"># splade稀疏向量字段</span><br> <span class="hljs-string">"sparse_vectors_splade"</span>: <span class="hljs-built_in">list</span>(sparse_vectors_splade)[i].reshape(<span class="hljs-number">1</span>, -<span class="hljs-number">1</span>),<br> }<br> <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(<span class="hljs-built_in">len</span>(docs))<br>]<br><br><span class="hljs-comment"># 导入数据</span><br>milvus_client.insert(collection_name=collection_name, data=entities)<br></code></pre></td></tr></table></figure><p>创建索引。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 创建索引参数</span><br>index_params = milvus_client.prepare_index_params()<br><br><span class="hljs-comment"># 为稀疏向量bm25创建索引参数</span><br>index_params.add_index(<br> index_name=<span class="hljs-string">"sparse_vectors_bm25"</span>,<br> field_name=<span class="hljs-string">"sparse_vectors_bm25"</span>,<br> <span class="hljs-comment"># SPARSE_INVERTED_INDEX是传统的倒排索引,SPARSE_WAND使用Weak-AND算法来减少搜索过程中的完整IP距离计算</span><br> index_type=<span class="hljs-string">"SPARSE_INVERTED_INDEX"</span>,<br> <span class="hljs-comment"># 目前仅支持IP</span><br> metric_type=<span class="hljs-string">"IP"</span>,<br> <span class="hljs-comment"># 创建索引时,排除向量值最小的20%的向量。对于稀疏向量来说,向量值越大,说明在该维度上的重要性越大。范围[0,1]。</span><br> params={<span class="hljs-string">"drop_ratio_build"</span>: <span class="hljs-number">0.2</span>}<br>)<br><br><br><span class="hljs-comment"># 为稀疏向量splade创建索引参数</span><br>index_params.add_index(<br> index_name=<span class="hljs-string">"sparse_vectors_splade"</span>,<br> field_name=<span class="hljs-string">"sparse_vectors_splade"</span>,<br> <span class="hljs-comment"># SPARSE_INVERTED_INDEX是传统的倒排索引,SPARSE_WAND使用Weak-AND算法来减少搜索过程中的完整IP距离计算</span><br> index_type=<span class="hljs-string">"SPARSE_INVERTED_INDEX"</span>,<br> <span class="hljs-comment"># 目前仅支持IP</span><br> metric_type=<span class="hljs-string">"IP"</span>,<br> <span class="hljs-comment"># 创建索引时,排除向量值最小的20%的向量。对于稀疏向量来说,向量值越大,说明在该维度上的重要性越大。范围[0,1]。</span><br> params={<span class="hljs-string">"drop_ratio_build"</span>: <span class="hljs-number">0.2</span>}<br>)<br><br><span class="hljs-comment"># 创建索引</span><br>milvus_client.create_index(<br> collection_name=collection_name,<br> index_params=index_params<br>)<br></code></pre></td></tr></table></figure><p>查看索引是否创建成功。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 查看索引信息</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">show_index_info</span>(<span class="hljs-params">collection_name: <span class="hljs-built_in">str</span></span>) -> <span class="hljs-literal">None</span>:<br> <span class="hljs-string">"""</span><br><span class="hljs-string"> 显示指定集合中某个索引的详细信息。</span><br><span class="hljs-string"></span><br><span class="hljs-string"> 参数:</span><br><span class="hljs-string"> collection_name (str): 集合的名称。</span><br><span class="hljs-string"></span><br><span class="hljs-string"> 返回:</span><br><span class="hljs-string"> None: 该函数仅打印索引信息,不返回任何值。</span><br><span class="hljs-string"> """</span><br> <span class="hljs-comment"># 查看集合的所有索引</span><br> indexes = milvus_client.list_indexes(<br> collection_name=collection_name <br> )<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"已经创建的索引:<span class="hljs-subst">{indexes}</span>"</span>)<br> <span class="hljs-built_in">print</span>()<br> <span class="hljs-comment"># 查看索引信息</span><br> <span class="hljs-keyword">if</span> indexes:<br> <span class="hljs-keyword">for</span> index <span class="hljs-keyword">in</span> indexes:<br> index_details = milvus_client.describe_index(<br> collection_name=collection_name, <br> <span class="hljs-comment"># 指定索引名称,这里假设使用第一个索引</span><br> index_name=index<br> )<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"索引 <span class="hljs-subst">{index}</span> 详情:<span class="hljs-subst">{index_details}</span>"</span>)<br> <span class="hljs-built_in">print</span>()<br> <span class="hljs-keyword">else</span>:<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"集合 <span class="hljs-subst">{collection_name}</span> 中没有创建索引。"</span>)<br><br><span class="hljs-comment"># 示例</span><br>show_index_info(collection_name)<br></code></pre></td></tr></table></figure><p>如果创建成功,你会看到下面的输出:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">已经创建的索引:['sparse_vectors_bm25', 'sparse_vectors_splade']<br><br>索引 sparse_vectors_bm25 详情:{'drop_ratio_build': '0.2', 'index_type': 'SPARSE_INVERTED_INDEX', 'metric_type': 'IP', 'field_name': 'sparse_vectors_bm25', 'index_name': 'sparse_vectors_bm25', 'total_rows': 0, 'indexed_rows': 0, 'pending_index_rows': 0, 'state': 'Finished'}<br><br>索引 sparse_vectors_splade 详情:{'drop_ratio_build': '0.2', 'index_type': 'SPARSE_INVERTED_INDEX', 'metric_type': 'IP', 'field_name': 'sparse_vectors_splade', 'index_name': 'sparse_vectors_splade', 'total_rows': 0, 'indexed_rows': 0, 'pending_index_rows': 0, 'state': 'Finished'}<br></code></pre></td></tr></table></figure><p>加载集合。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 加载集合</span><br><span class="hljs-built_in">print</span>(<span class="hljs-string">f"正在加载集合:<span class="hljs-subst">{collection_name}</span>"</span>)<br>milvus_client.load_collection(collection_name=collection_name)<br><br><span class="hljs-comment"># 验证加载状态</span><br><span class="hljs-built_in">print</span>(milvus_client.get_load_state(collection_name=collection_name))<br></code></pre></td></tr></table></figure><p>如果加载成功,会显示:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">正在加载集合:docs<br>{'state': <LoadState: Loaded>}<br></code></pre></td></tr></table></figure><p>加载完成,下面就是重头戏了,搜索。</p><p>定义搜索函数:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 定义稀疏向量搜索参数</span><br>search_params_sparse_vectors = {<br> <span class="hljs-string">"metric_type"</span>: <span class="hljs-string">"IP"</span>,<br> <span class="hljs-string">"params"</span>: {<span class="hljs-string">"drop_ratio_search"</span>: <span class="hljs-number">0.2</span>},<br>}<br><br><span class="hljs-comment"># 执行向量搜索</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">vector_search</span>(<span class="hljs-params"></span><br><span class="hljs-params"> query_vectors,</span><br><span class="hljs-params"> field_name,</span><br><span class="hljs-params"> search_params,</span><br><span class="hljs-params"> output_fields,</span><br><span class="hljs-params"> </span>):<br> <span class="hljs-comment"># 向量搜索</span><br> res = milvus_client.search(<br> collection_name=collection_name,<br> <span class="hljs-comment"># 指定查询向量。</span><br> data=query_vectors,<br> <span class="hljs-comment"># 指定要搜索的向量字段</span><br> anns_field=field_name,<br> <span class="hljs-comment"># 设置搜索参数</span><br> search_params=search_params,<br> output_fields=output_fields<br> )<br> <span class="hljs-keyword">return</span> res<br></code></pre></td></tr></table></figure><p>再定义一个打印结果的函数,方便查看结果。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 打印向量搜索结果</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">print_vector_results</span>(<span class="hljs-params">res</span>):<br> <span class="hljs-keyword">for</span> hits <span class="hljs-keyword">in</span> res:<br> <span class="hljs-keyword">for</span> hit <span class="hljs-keyword">in</span> hits:<br> entity = hit.get(<span class="hljs-string">"entity"</span>)<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"text: <span class="hljs-subst">{entity[<span class="hljs-string">'text'</span>]}</span>"</span>)<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"distance: <span class="hljs-subst">{hit[<span class="hljs-string">'distance'</span>]:<span class="hljs-number">.3</span>f}</span>"</span>)<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">"-"</span>*<span class="hljs-number">50</span>)<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"数量:<span class="hljs-subst">{<span class="hljs-built_in">len</span>(hits)}</span>"</span>)<br></code></pre></td></tr></table></figure><p>首先,我们使用 BM25搜索。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 使用稀疏向量BM25搜索</span><br>query1 = [<span class="hljs-string">"人工智能如何影响汽车行业?"</span>]<br><br>query_sparse_vectors_bm25 = bm25_ef.encode_queries(query1)<br><br>field_name = <span class="hljs-string">"sparse_vectors_bm25"</span><br>output_fields = [<span class="hljs-string">"text"</span>]<br><span class="hljs-comment"># 指定搜索的分区,或者过滤搜索</span><br>res_sparse_vectors_bm25 = vector_search(query_sparse_vectors_bm25, field_name, search_params_sparse_vectors, output_fields)<br><br>print_vector_results(res_sparse_vectors_bm25)<br></code></pre></td></tr></table></figure><p>但是并没有搜索到任何结果:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">数量:0<br></code></pre></td></tr></table></figure><p>为什么呢?我们查看下 query1的分词结果:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 查看query1的分词结果</span><br><span class="hljs-built_in">print</span>(analyzer(query1[<span class="hljs-number">0</span>]))<br></code></pre></td></tr></table></figure><p>分词结果只有“人工智能”一个词:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">['人工智能', '影响', '汽车行业']<br></code></pre></td></tr></table></figure><p>BM25的词汇表中虽然有“智能”这个词,但是并不包含“人工智能”、“影响”和“汽车行业”这些词,所以没有返回任何结果。</p><p>我们把“人工智能”替换成“机器智能”,就可以搜索到了。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 使用稀疏向量BM25搜索</span><br>query2 = [<span class="hljs-string">"机器智能如何影响汽车行业?"</span>]<br><br>query_sparse_vectors_bm25 = bm25_ef.encode_queries(query2)<br><br>field_name = <span class="hljs-string">"sparse_vectors_bm25"</span><br>output_fields = [<span class="hljs-string">"text"</span>]<br><span class="hljs-comment"># 指定搜索的分区,或者过滤搜索</span><br>res_sparse_vectors_bm25 = vector_search(query_sparse_vectors_bm25, field_name, search_params_sparse_vectors, output_fields)<br><br>print_vector_results(res_sparse_vectors_bm25)<br></code></pre></td></tr></table></figure><p>而且,这次还搜索到了包含“机器学习”的句子。</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">text: 机器智能的未来充满潜力。<br>distance: 2.054<br>--------------------------------------------------<br>text: 大数据支持是机器智能发展的关键。<br>distance: 1.752<br>--------------------------------------------------<br>text: 机器学习正在改变我们的生活方式。<br>distance: 0.788<br>--------------------------------------------------<br>数量:3<br></code></pre></td></tr></table></figure><p>这是因为分词时把“机器智能“分成了“机器”和“智能”两个词,所以能搜索到更多句子。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 查看query2的分词结果</span><br><span class="hljs-built_in">print</span>(analyzer(query2[<span class="hljs-number">0</span>]))<br></code></pre></td></tr></table></figure><p>分词结果:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">['机器', '智能', '影响', '汽车行业']<br></code></pre></td></tr></table></figure><p>接下来,我们使用 splade 搜索,看看和 BM25的搜索结果有什么不同。</p><p>先定义一个打印结果的函数。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 打印向量搜索结果</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">print_vector_results_en</span>(<span class="hljs-params">res</span>):<br> <span class="hljs-keyword">for</span> hits <span class="hljs-keyword">in</span> res:<br> <span class="hljs-keyword">for</span> hit <span class="hljs-keyword">in</span> hits:<br> entity = hit.get(<span class="hljs-string">"entity"</span>)<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"text_en: <span class="hljs-subst">{entity[<span class="hljs-string">'text_en'</span>]}</span>"</span>)<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"distance: <span class="hljs-subst">{hit[<span class="hljs-string">'distance'</span>]:<span class="hljs-number">.3</span>f}</span>"</span>)<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">"-"</span>*<span class="hljs-number">50</span>)<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"数量:<span class="hljs-subst">{<span class="hljs-built_in">len</span>(hits)}</span>"</span>)<br></code></pre></td></tr></table></figure><p>然后使用 splade 搜索。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><code class="hljs python">query1_en = [<span class="hljs-string">"How does artificial intelligence affect the automotive industry?"</span>]<br><br>query_sparse_vectors_splade = splade_ef.encode_queries(query1_en)<br><br>field_name = <span class="hljs-string">"sparse_vectors_splade"</span><br>output_fields = [<span class="hljs-string">"text_en"</span>]<br>res_sparse_vectors_splade = vector_search(query_sparse_vectors_splade, field_name, search_params_sparse_vectors, output_fields)<br><br>print_vector_results_en(res_sparse_vectors_splade)<br></code></pre></td></tr></table></figure><p>比较 BM25 和 splade 的搜索结果,我们很容易发现它们之间的区别。splade 的文档集合中并不包含“artificial intelligence”这个词,但是由于它具有“举一反三”的能力,仍然搜索到了包含“AI”、“machine intelligence”以及“Autonomous”的句子,返回了更多结果(其实是返回了所有文档)。</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">text_en: The future of machine intelligence is full of potential.<br>distance: 10.020<br>--------------------------------------------------<br>text_en: Big data support is key to the development of machine intelligence.<br>distance: 8.232<br>--------------------------------------------------<br>text_en: AI can help doctors diagnose diseases.<br>distance: 7.291<br>--------------------------------------------------<br>text_en: Autonomous driving relies on advanced algorithms.<br>distance: 7.213<br>--------------------------------------------------<br>text_en: Production efficiency can be improved through automation technology.<br>distance: 6.999<br>--------------------------------------------------<br>text_en: Machine learning is changing our way of life.<br>distance: 6.863<br>--------------------------------------------------<br>text_en: Data analysis technology is widely applied in the financial field.<br>distance: 5.064<br>--------------------------------------------------<br>text_en: The quantum tunneling effect allows electrons to pass through potential barriers that classical mechanics consider impassable, which has important applications in semiconductor devices.<br>distance: 3.695<br>--------------------------------------------------<br>text_en: Deep learning performs exceptionally well in image recognition.<br>distance: 3.464<br>--------------------------------------------------<br>text_en: Natural language processing is an important field in computer science.<br>distance: 3.044<br>--------------------------------------------------<br>数量:10<br></code></pre></td></tr></table></figure><p>如果把查询中的“artificial intelligence”替换成“machine intelligence”,仍然会返回所有结果,但是权重有所不同。</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">text_en: The future of machine intelligence is full of potential.<br>distance: 15.128<br>--------------------------------------------------<br>text_en: Big data support is key to the development of machine intelligence.<br>distance: 12.945<br>--------------------------------------------------<br>text_en: Machine learning is changing our way of life.<br>distance: 12.763<br>--------------------------------------------------<br>text_en: Production efficiency can be improved through automation technology.<br>distance: 7.446<br>--------------------------------------------------<br>text_en: AI can help doctors diagnose diseases.<br>distance: 6.055<br>--------------------------------------------------<br>text_en: Autonomous driving relies on advanced algorithms.<br>distance: 5.309<br>--------------------------------------------------<br>text_en: Data analysis technology is widely applied in the financial field.<br>distance: 4.857<br>--------------------------------------------------<br>text_en: The quantum tunneling effect allows electrons to pass through potential barriers that classical mechanics consider impassable, which has important applications in semiconductor devices.<br>distance: 3.356<br>--------------------------------------------------<br>text_en: Deep learning performs exceptionally well in image recognition.<br>distance: 3.317<br>--------------------------------------------------<br>text_en: Natural language processing is an important field in computer science.<br>distance: 2.688<br>--------------------------------------------------<br>数量:10<br></code></pre></td></tr></table></figure><h2 id="藏宝图"><a href="#藏宝图" class="headerlink" title="藏宝图"></a>藏宝图</h2><p>如果你想深入研究稀疏向量,可以参考下面的资料:</p><ul><li><a href="https://zilliz.com.cn/blog/mastering-bm25-a-deep-dive-into-the-algorithm-and-application-in-milvu">精通BM25:深入探讨算法及其在Milvus中的应用</a></li><li><a href="https://mp.weixin.qq.com/s/ZvId2vm8PDdA1fW3hJY1bA">详解如何通过稀疏向量优化信息检索</a></li><li><a href="https://github.com/naver/splade">splade的github</a></li><li><a href="https://en.wikipedia.org/wiki/Okapi_BM25">BM25-wiki</a></li></ul><h2 id="注释"><a href="#注释" class="headerlink" title="注释"></a>注释</h2><section class="footnotes"><div class="footnote-list"><ol><li><span id="fn:1" class="footnote-text"><span>0.25&#125;<br></code></pre></td></tr></table></figure></hexoPostRenderCodeBlock><a href="#fnref:1" rev="footnote" class="footnote-backref"> ↩</a></span></span></li><li><span id="fn:1" class="footnote-text"><span>BM25公式中并没有 epsilon 这个参数。在模型中,它用于平滑处理,以避免除以零的情况,特别是在文档长度(dl)为零的情况下。epsilon 通常是一个小的正数,如0.5,它被加到文档长度的归一化公式中,确保公式的稳定性。<a href="#fnref:1" rev="footnote" class="footnote-backref"> ↩</a></span></span></li></ol></div></section>]]></content>
<categories>
<category>向量数据库</category>
<category>原理探秘</category>
</categories>
</entry>
<entry>
<title>鲁迅到底说没说?RAG之分块</title>
<link href="/2024/10/29/%E9%B2%81%E8%BF%85%E5%88%B0%E5%BA%95%E8%AF%B4%E6%B2%A1%E8%AF%B4%EF%BC%9FRAG%E4%B9%8B%E5%88%86%E5%9D%97/"/>
<url>/2024/10/29/%E9%B2%81%E8%BF%85%E5%88%B0%E5%BA%95%E8%AF%B4%E6%B2%A1%E8%AF%B4%EF%BC%9FRAG%E4%B9%8B%E5%88%86%E5%9D%97/</url>
<content type="html">< ,增加了一些字段。`luxun_sample.json` 为鲁迅部分作品,方便试用。`luxun.json` 为完整的鲁迅作品集。">[1]</span></a></sup>的文本格式:</p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><code class="hljs json"><span class="hljs-punctuation">[</span> <br><span class="hljs-punctuation">{</span><br><span class="hljs-attr">"book"</span><span class="hljs-punctuation">:</span> <span class="hljs-string">"伪自由书"</span><span class="hljs-punctuation">,</span> <br><span class="hljs-attr">"title"</span><span class="hljs-punctuation">:</span> <span class="hljs-string">"最艺术的国家"</span><span class="hljs-punctuation">,</span> <br><span class="hljs-attr">"author"</span><span class="hljs-punctuation">:</span> <span class="hljs-string">"鲁迅"</span><span class="hljs-punctuation">,</span> <br><span class="hljs-attr">"type"</span><span class="hljs-punctuation">:</span> <span class="hljs-string">""</span><span class="hljs-punctuation">,</span> <br><span class="hljs-attr">"source"</span><span class="hljs-punctuation">:</span> <span class="hljs-string">""</span><span class="hljs-punctuation">,</span> <br><span class="hljs-attr">"date"</span><span class="hljs-punctuation">:</span> <span class="hljs-string">""</span><span class="hljs-punctuation">,</span> <br><span class="hljs-attr">"content"</span><span class="hljs-punctuation">:</span> <span class="hljs-string">"我们中国的最伟大最永久,而且最普遍的“艺术”是男人扮女人... </span><br><span class="hljs-string">}, </span><br><span class="hljs-string">{</span><br><span class="hljs-string">"</span>book<span class="hljs-string">": "</span>伪自由书<span class="hljs-string">", </span><br><span class="hljs-string">"</span>title<span class="hljs-string">": "</span>王道诗话<span class="hljs-string">", </span><br><span class="hljs-string">"</span>author<span class="hljs-string">": "</span>鲁迅<span class="hljs-string">", </span><br><span class="hljs-string">"</span>type<span class="hljs-string">": "</span><span class="hljs-string">", </span><br><span class="hljs-string">"</span>source<span class="hljs-string">": "</span><span class="hljs-string">", </span><br><span class="hljs-string">"</span>date<span class="hljs-string">": "</span><span class="hljs-string">", </span><br><span class="hljs-string">"</span>content<span class="hljs-string">": "</span>《人权论》是从鹦鹉开头的,据说古时候有一只高飞远走的鹦哥儿... <br><span class="hljs-punctuation">}</span><span class="hljs-punctuation">,</span> <br>...<br><span class="hljs-punctuation">]</span><br></code></pre></td></tr></table></figure><p>文本中的“content”字段的值,就是一篇文章。有的文章字数多达几万字,用几百维的向量根本无法表达文章的语义细节。怎么办?就像前面说的,既然全文字数太多,我们就把文章切成几块,对每个块再做向量化。这个操作叫做“分块”。</p><h2 id="根据固定字数分块"><a href="#根据固定字数分块" class="headerlink" title="根据固定字数分块"></a>根据固定字数分块</h2><p>最简单的分块方法是 <code>fixed_chunk</code>(固定分块),是按照字数分块,比如每隔150个字就分割一次。比如,对于《最艺术的国家》这篇文章使用 <code>fixed_chunk</code>,再通过 <a href="https://chunkviz.up.railway.app/">ChunkViz</a> 把分块结果可视化,如下图所示:<br><img src="https://picgo233.oss-cn-hangzhou.aliyuncs.com/img/202410082047910.png"></p><p>我们用代码来实现 <code>fixed_chunk</code>。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-keyword">import</span> json<br><br><span class="hljs-comment"># 固定分块</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">fixed_chunk</span>(<span class="hljs-params"></span><br><span class="hljs-params"> input_file_path,</span><br><span class="hljs-params"> output_file_path, </span><br><span class="hljs-params"> chunk_size, </span><br><span class="hljs-params"> field_name</span><br><span class="hljs-params"> </span>):<br> <span class="hljs-keyword">with</span> <span class="hljs-built_in">open</span>(input_file_path, <span class="hljs-string">'r'</span>, encoding=<span class="hljs-string">'utf-8'</span>) <span class="hljs-keyword">as</span> file:<br> data_list = json.load(file)<br> chunk_data_list = []<br> <span class="hljs-keyword">for</span> data <span class="hljs-keyword">in</span> data_list:<br> <span class="hljs-comment"># 获取指定字段的值</span><br> text = data[field_name]<br> <span class="hljs-comment"># 对指定字段分割</span><br> chunks = [text[i:i + chunk_size] <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(<span class="hljs-number">0</span>, <span class="hljs-built_in">len</span>(text), chunk_size)]<br> <span class="hljs-keyword">for</span> idx, chunk <span class="hljs-keyword">in</span> <span class="hljs-built_in">enumerate</span>(chunks):<br> chunk_data_list.append({<br> <span class="hljs-comment"># 使用原始文章的 id 生成chunk的id</span><br> <span class="hljs-string">"id"</span>: <span class="hljs-string">f'<span class="hljs-subst">{data[<span class="hljs-string">"book"</span>]}</span>#<span class="hljs-subst">{data[<span class="hljs-string">"title"</span>]}</span>#chunk<span class="hljs-subst">{idx}</span>'</span>,<br> <span class="hljs-string">"book"</span> : data[<span class="hljs-string">"book"</span>],<br> <span class="hljs-string">"title"</span> : data[<span class="hljs-string">"title"</span>],<br> <span class="hljs-string">"chunk"</span> : chunk,<br> <span class="hljs-comment"># window 字段在这里只是占位,没有实际作用,后面会详细介绍它的用处</span><br> <span class="hljs-string">"window"</span>: <span class="hljs-string">""</span>,<br> <span class="hljs-string">"method"</span>: <span class="hljs-string">"fixed_chunk"</span><br> })<br> <span class="hljs-keyword">with</span> <span class="hljs-built_in">open</span>(output_file_path, <span class="hljs-string">'w'</span>, encoding=<span class="hljs-string">'utf-8'</span>) <span class="hljs-keyword">as</span> json_file:<br> json.dump(chunk_data_list, json_file, ensure_ascii=<span class="hljs-literal">False</span>, indent=<span class="hljs-number">4</span>)<br><br><span class="hljs-comment"># 执行固定分块的函数</span><br>input_file_path = <span class="hljs-string">"luxun_sample.json"</span><br>output_file_path = <span class="hljs-string">"luxun_sample_fixed_chunk.json"</span><br>chunk_size = <span class="hljs-number">150</span><br>field_name = <span class="hljs-string">"content"</span><br><br>fixed_chunk(input_file_path, output_file_path, chunk_size, field_name)<br></code></pre></td></tr></table></figure><p>运行代码,得到 <code>luxun_sample_fixed_chunk.json</code> 文件,格式和上文中的可视化结果一致。</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">[<br> {<br> "id": "伪自由书 #最艺术的国家 #chunk0 ",<br> "book": "伪自由书",<br> "title": "最艺术的国家",<br> "author": "鲁迅",<br> "type": "",<br> "source": "",<br> "date": "",<br> "chunk": "我们中国的最伟大最永久,而且最普遍的“艺术”是男人扮女人...",<br> "window": "",<br> "method": "fixed_chunk"<br> },<br> {<br> "id": "伪自由书 #最艺术的国家 #chunk1 ",<br> "book": "伪自由书",<br> "title": "最艺术的国家",<br> "author": "鲁迅",<br> "type": "",<br> "source": "",<br> "date": "",<br> "chunk": "民国。然而这民国年久失修...",<br> "window": "",<br> "method": "fixed_chunk"<br> },<br> ...<br>]<br></code></pre></td></tr></table></figure><p>你可能已经发现了,<code>fixed_chunk</code> 经常在句子中间分割,导致句子不连贯,语义的完整性被破坏。</p><h2 id="根据标点符号分割"><a href="#根据标点符号分割" class="headerlink" title="根据标点符号分割"></a>根据标点符号分割</h2><p>怎么解决这个问题呢?我们可以在标点符号处分割。但是这还不够,因为这样分割的话,块与块之间仍然是相互独立的了,缺少关联。打个比方,如果看《生活大爆炸》这样的单元剧,我们跳着看也没关系,不影响理解剧情。但是如果看《天龙八部》这样的连续剧,上一集讲的还是段誉为救钟灵去万劫谷拿解药,下一集他就瞬移到了少室山,用六脉神剑大战慕容复。我们会一头雾水,这中间到底发生了什么?</p><p>所以,连续剧的开头有“前情提要”,结尾有“下集预告”。同样,为了保证块与块之间语义的连贯,我们也要设计一个“重叠”部分,让下一个块的开头部分,重复上一个块的结尾部分。</p><p>听起来很复杂?不用担心,我们可以使用 LlamaIndex<sup id="fnref:2" class="footnote-ref"><a href="#fn:2" rel="footnote"><span class="hint--top hint--rounded" aria-label="LlamaIndex 是一个用于构建带有上下文增强功能的生成式 AI 应用的框架,支持大型语言模型(LLMs)。">[2]</span></a></sup> 库轻松实现这种分块方法—— <code>semantic_chunk</code> 。</p><p>安装 LlamaIndex 库。</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs shell">pip install llama_index==0.11.16<br></code></pre></td></tr></table></figure><p>定义 <code>semantic_chunk</code> 分块函数。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 导入SentenceSplitter用来分块</span><br><span class="hljs-keyword">from</span> llama_index.core.node_parser <span class="hljs-keyword">import</span> SentenceSplitter<br><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">semantic_chunk</span>(<span class="hljs-params"></span><br><span class="hljs-params"> input_file_path, </span><br><span class="hljs-params"> output_file_path, </span><br><span class="hljs-params"> <span class="hljs-comment"># 块的大小</span></span><br><span class="hljs-params"> chunk_size,</span><br><span class="hljs-params"> <span class="hljs-comment"># 重叠部分的大小</span></span><br><span class="hljs-params"> chunk_overlap,</span><br><span class="hljs-params"> <span class="hljs-comment"># 指定分块的字段</span></span><br><span class="hljs-params"> field_name,</span><br><span class="hljs-params"> </span>) :<br> <span class="hljs-comment"># 初始化 SentenceSplitter,设置分块的参数</span><br> text_splitter = SentenceSplitter(<br> <span class="hljs-comment"># 指定段落分隔符</span><br> paragraph_separator=<span class="hljs-string">"\n\n\n"</span>,<br> <span class="hljs-comment"># 指定主要分隔符</span><br> separator=<span class="hljs-string">"。"</span>,<br> <span class="hljs-comment"># 指定次要分隔符</span><br> secondary_chunking_regex=<span class="hljs-string">"[^,.;、。:]+[,.;、。:]?"</span>,<br> <span class="hljs-comment"># 指定块的大小</span><br> chunk_size=chunk_size, <br> <span class="hljs-comment"># 指定重叠部分的大小</span><br> chunk_overlap=chunk_overlap,<br> )<br> <span class="hljs-keyword">with</span> <span class="hljs-built_in">open</span>(input_file_path, <span class="hljs-string">'r'</span>, encoding=<span class="hljs-string">'utf-8'</span>) <span class="hljs-keyword">as</span> file:<br> data_list = json.load(file)<br> chunk_data_list = []<br> <span class="hljs-keyword">for</span> data <span class="hljs-keyword">in</span> data_list:<br> text = data[field_name]<br> chunks = text_splitter.split_text(text)<br> <span class="hljs-keyword">for</span> idx, chunk <span class="hljs-keyword">in</span> <span class="hljs-built_in">enumerate</span>(chunks):<br> chunk_data_list.append({<br> <span class="hljs-comment"># 使用原始文章的 id 生成chunk的id</span><br> <span class="hljs-string">"id"</span>: <span class="hljs-string">f'<span class="hljs-subst">{data[<span class="hljs-string">"book"</span>]}</span>#<span class="hljs-subst">{data[<span class="hljs-string">"title"</span>]}</span>#chunk<span class="hljs-subst">{idx}</span>'</span>,<br> <span class="hljs-string">"book"</span> : data[<span class="hljs-string">"book"</span>],<br> <span class="hljs-string">"title"</span> : data[<span class="hljs-string">"title"</span>],<br> <span class="hljs-string">"chunk"</span> : chunk,<br> <span class="hljs-comment"># window 字段在这里只是占位,没有实际作用,后面会详细介绍它的用处</span><br> <span class="hljs-string">"window"</span>: <span class="hljs-string">""</span>,<br> <span class="hljs-string">"method"</span>: <span class="hljs-string">"semantic_chunk"</span><br> })<br> <span class="hljs-keyword">with</span> <span class="hljs-built_in">open</span>(output_file_path, <span class="hljs-string">'w'</span>, encoding=<span class="hljs-string">'utf-8'</span>) <span class="hljs-keyword">as</span> json_file:<br> json.dump(chunk_data_list, json_file, ensure_ascii=<span class="hljs-literal">False</span>, indent=<span class="hljs-number">4</span>)<br><br><span class="hljs-comment"># 执行标点符号分块</span><br>input_file_path = <span class="hljs-string">"luxun_sample.json"</span><br>output_file_path = <span class="hljs-string">"luxun_sample_semantic_chunk.json"</span><br>chunk_size = <span class="hljs-number">150</span><br>chunk_overlap = <span class="hljs-number">20</span><br>field_name = <span class="hljs-string">"content"</span><br><br>semantic_chunk(<br> input_file_path, <br> output_file_path, <br> chunk_size, <br> chunk_overlap,<br> field_name<br>)<br></code></pre></td></tr></table></figure><p>执行上面的代码,得到 <code>luxun_sample_semantic_chunk.json</code> 文件,我们来看一下分块的结果:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">[<br> {<br> "id": "伪自由书#最艺术的国家#chunk0",<br> "book": "伪自由书",<br> "title": "最艺术的国家",<br> "author": "鲁迅",<br> "type": "",<br> "source": "",<br> "date": "",<br> "chunk": "我们中国的最伟大最永久,而且最普遍的“艺术”是男人扮女人...中国的固有文化是科举制度,外加捐班之类。",<br> "window": "",<br> "method": "semantic_chunk"<br> },<br> {<br> "id": "伪自由书#最艺术的国家#chunk1",<br> "book": "伪自由书",<br> "title": "最艺术的国家",<br> "author": "鲁迅",<br> "type": "",<br> "source": "",<br> "date": "",<br> "chunk": "外加捐班之类。当初说这太不像民权...这对于民族是不忠,对于祖宗是不孝,",<br> "window": "",<br> "method": "semantic_chunk"<br> },<br> ...<br>]<br></code></pre></td></tr></table></figure><p>果然是在我们设置的标点符号处分块的,而且附带重叠部分,这样就能保证块与块之间语义的连贯了。</p><h2 id="根据句子分块"><a href="#根据句子分块" class="headerlink" title="根据句子分块"></a>根据句子分块</h2><p>对于上面的分块结果,你可能还不满意。虽然它根据标点符号分割,但是并不一定在句号处分割,无法保证句子的完整性。比如,对于这句话 <code>我们中国的最伟大最永久,而且最普遍的“艺术”是男人扮女人。这艺术的可贵,是在于两面光,或谓之“中庸”---男人看见“扮女人”,女人看见“男人扮”。</code> 可能分割成 <code>我们中国的最伟大最永久,而且最普遍的“艺术”是男人扮女人。这艺术的可贵</code> 和 <code>是在于两面光,或谓之“中庸”---男人看见“扮女人”,女人看见“男人扮”</code> 两个块。</p><p>为了解决这个问题,又诞生了一种分块方法,它根据句子而不是字数分割,也就是说,根据“。”、“!”和“?”这三个表示句子结束的标点符号分割,而不会受到字数的限制。但是,这种分割方式怎么实现重叠的功能呢?这也简单,把整个句子作为重叠部分就行了,叫做“窗口句子”。这种分块方法叫做 <code>window_chunk</code>。</p><p>比如,对于句子 <code>ABCD</code>,设置窗口大小为1,表示原始句子的左右各1个句子为“窗口句子”。分块如下:<br>第一个句子:A。窗口句子:B。因为第一个句子的左边没有句子。<br>第二个句子:B。窗口句子:A 和 C。<br>第三个句子:C。窗口句子:B 和 D。<br>第四个句子:D。窗口句子:C。因为最后一个句子的右边没有句子。</p><p>前面两种分块方法,都是对 chunk 字段向量化。而这种分块方法,除了对 chunk 字段(也就是原始句子)向量化外,还会把窗口句子作为原始句子的上下文,以元数据的形式储存在文件中。</p><p>原始句子用来做向量搜索,而在生成回答时,窗口句子和原始句子会一起传递给大模型。这样做的好处是,只向量化原始句子,节省了储存空间。提供窗口句子作为原始句子的上下文,可以帮助大模型理解原始句子的语境。</p><p>理解原理了,我们用代码来实现吧。</p><p>导入依赖。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 导入句子分块需要的依赖</span><br><span class="hljs-keyword">import</span> re<br><span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> <span class="hljs-type">List</span><br><span class="hljs-keyword">from</span> llama_index.core <span class="hljs-keyword">import</span> Document<br><span class="hljs-keyword">from</span> llama_index.core.node_parser <span class="hljs-keyword">import</span> SentenceWindowNodeParser<br></code></pre></td></tr></table></figure><p>定义函数 <code>split_text_into_sentences</code>,用来分割中英文句子。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 分割中英文句子</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">split_text_into_sentences</span>(<span class="hljs-params">text</span>):<br> <span class="hljs-comment"># 使用正则表达式识别中英文句子结束符</span><br> sentence_endings = re.<span class="hljs-built_in">compile</span>(<span class="hljs-string">r'(?<=[。!?.!?])'</span>)<br> sentences = sentence_endings.split(text)<br> <span class="hljs-keyword">return</span> [s.strip() <span class="hljs-keyword">for</span> s <span class="hljs-keyword">in</span> sentences <span class="hljs-keyword">if</span> s.strip()]<br></code></pre></td></tr></table></figure><p>定义函数 <code>window_chunk</code>,基于句子对文本分块。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 根据句子对文本分块</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">window_chunk</span>(<span class="hljs-params"></span><br><span class="hljs-params"> input_file_path, </span><br><span class="hljs-params"> output_file_path,</span><br><span class="hljs-params"> field_name,</span><br><span class="hljs-params"> window_size</span><br><span class="hljs-params"> </span>):<br> <span class="hljs-comment"># 设置用于文本解析的节点解析器</span><br> node_parser = SentenceWindowNodeParser.from_defaults(<br> window_size=window_size,<br> <span class="hljs-comment"># 为窗口元数据指定一个键名为"window",用于在解析过程中存储窗口数据</span><br> window_metadata_key=<span class="hljs-string">"window"</span>,<br> <span class="hljs-comment"># 为原始文本元数据指定一个键名为"original_text",用于在解析过程中存储原始文本</span><br> original_text_metadata_key=<span class="hljs-string">"original_text"</span>,<br> sentence_splitter = split_text_into_sentences<br> )<br> <br> <span class="hljs-keyword">with</span> <span class="hljs-built_in">open</span>(input_file_path, <span class="hljs-string">'r'</span>, encoding=<span class="hljs-string">'utf-8'</span>) <span class="hljs-keyword">as</span> file:<br> data_list = json.load(file)<br> chunk_data_list = []<br> <span class="hljs-keyword">for</span> data <span class="hljs-keyword">in</span> data_list:<br> text = data[field_name]<br> <span class="hljs-comment"># 将分割后的句子处理成节点。节点包含多个句子,类似于块</span><br> document = Document(text=text)<br> nodes = node_parser.get_nodes_from_documents([document])<br> <span class="hljs-keyword">for</span> idx, node <span class="hljs-keyword">in</span> <span class="hljs-built_in">enumerate</span>(nodes):<br> chunk = node.metadata[<span class="hljs-string">"original_text"</span>]<br> window = node.metadata[<span class="hljs-string">"window"</span>]<br> chunk_data_list.append({<br> <span class="hljs-string">"id"</span>: <span class="hljs-string">f'<span class="hljs-subst">{data[<span class="hljs-string">"book"</span>]}</span>#<span class="hljs-subst">{data[<span class="hljs-string">"title"</span>]}</span>#chunk<span class="hljs-subst">{idx}</span>'</span>,<br> <span class="hljs-string">"book"</span>: data[<span class="hljs-string">"book"</span>],<br> <span class="hljs-string">"title"</span>: data[<span class="hljs-string">"title"</span>],<br> <span class="hljs-string">"chunk"</span>: chunk,<br> <span class="hljs-string">"window"</span>: window,<br> <span class="hljs-string">"method"</span>: <span class="hljs-string">"window_chunk"</span><br> })<br> <br> <span class="hljs-keyword">with</span> <span class="hljs-built_in">open</span>(output_file_path, <span class="hljs-string">'w'</span>, encoding=<span class="hljs-string">'utf-8'</span>) <span class="hljs-keyword">as</span> json_file:<br> json.dump(chunk_data_list, json_file, ensure_ascii=<span class="hljs-literal">False</span>, indent=<span class="hljs-number">4</span>)<br><br><span class="hljs-comment"># 执行句子分块</span><br>input_file_path = <span class="hljs-string">"luxun_sample.json"</span><br>output_file_path = <span class="hljs-string">"luxun_sample_window_chunk.json"</span><br>field_name = <span class="hljs-string">"content"</span><br>window_size = <span class="hljs-number">1</span><br><br>window_chunk(<br> input_file_path,<br> output_file_path,<br> field_name,<br> window_size<br>)<br></code></pre></td></tr></table></figure><p>让我们来看下分块的结果,字段“chunk”是原始句子,“window”里面包含了原始句子和窗口句子。</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">[<br> {<br> "id": "伪自由书#最艺术的国家#chunk0",<br> "book": "伪自由书",<br> "title": "最艺术的国家",<br> "author": "鲁迅",<br> "type": "",<br> "source": "",<br> "date": "",<br> "chunk": "我们中国的最伟大最永久,而且最普遍的“艺术”是男人扮女人。",<br> "window": "我们中国的最伟大最永久,而且最普遍的“艺术”是男人扮女人。 这艺术的可贵,是在于两面光,或谓之“中庸”---男人看见“扮女人”,女人看见“男人扮”。",<br> "method": "window_chunk"<br> },<br> {<br> "id": "伪自由书#最艺术的国家#chunk1",<br> "book": "伪自由书",<br> "title": "最艺术的国家",<br> "author": "鲁迅",<br> "type": "",<br> "source": "",<br> "date": "",<br> "chunk": "这艺术的可贵,是在于两面光,或谓之“中庸”---男人看见“扮女人”,女人看见“男人扮”。",<br> "window": "我们中国的最伟大最永久,而且最普遍的“艺术”是男人扮女人。 这艺术的可贵,是在于两面光,或谓之“中庸”---男人看见“扮女人”,女人看见“男人扮”。 表面上是中性,骨子里当然还是男的。",<br> "method": "window_chunk"<br> },<br> ...<br>]<br></code></pre></td></tr></table></figure><h2 id="创建向量数据库"><a href="#创建向量数据库" class="headerlink" title="创建向量数据库"></a>创建向量数据库</h2><p>文本分块完成,接下来就是文本向量化,导入向量数据库了,这部分你应该比较熟悉了,我直接给出代码。</p><p>定义函数 vectorize_file,向量化 json 文件中指定的字段。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 向量化json文件中指定的字段</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">vectorize_file</span>(<span class="hljs-params">input_file_path, encoder, field_name</span>):<br> <span class="hljs-keyword">with</span> <span class="hljs-built_in">open</span>(input_file_path, <span class="hljs-string">'r'</span>, encoding=<span class="hljs-string">'utf-8'</span>) <span class="hljs-keyword">as</span> file:<br> data_list = json.load(file)<br> docs = [data[field_name] <span class="hljs-keyword">for</span> data <span class="hljs-keyword">in</span> data_list]<br> <span class="hljs-comment"># 向量化文档</span><br> <span class="hljs-keyword">return</span> vectorize_docs(docs, encoder), data_list<br></code></pre></td></tr></table></figure><p>为了比较 RAG 使用不同分块方法的效果,我们把三个分块文件全部向量化。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 向量化固定分块的文件</span><br>fixed_vectors, fixed_data_list = vectorize_file(<span class="hljs-string">"luxun_sample_fixed_chunk.json"</span>, bge_m3_ef, <span class="hljs-string">"chunk"</span>)<br>fixed_dense_vectors = fixed_vectors[<span class="hljs-string">'dense'</span>]<br><br><span class="hljs-comment"># 向量化通过标点符号分块的文件</span><br>semantic_vectors, semantic_data_list = vectorize_file(<span class="hljs-string">"luxun_sample_semantic_chunk.json"</span>, bge_m3_ef, <span class="hljs-string">"chunk"</span>)<br>semantic_dense_vectors = semantic_vectors[<span class="hljs-string">'dense'</span>]<br><br><span class="hljs-comment"># 向量化通过句子分块的文件</span><br>window_vectors, window_data_list = vectorize_file(<span class="hljs-string">"luxun_sample_window_chunk.json"</span>, bge_m3_ef, <span class="hljs-string">"chunk"</span>)<br>window_dense_vectors = window_vectors[<span class="hljs-string">'dense'</span>]<br></code></pre></td></tr></table></figure><p>接下来创建集合。为了能够在同一个集合中区分三种分块方法的搜索结果,我们设置参数 <code>partition_key_field</code> 的值为 <code>method</code>,它表示采用的分块方法。Milvus 会根据 <code>method</code> 字段的值,把数据插入到对应的分区中。打个比方,如果把集合看作一个 excel 文件,partition (分区)就是表格的工作表(Worksheet)。一个 excel 文件包含多张工作表,不同的数据填写在对应的工作表中。相应的,我们把不同的数据插入到对应分区中,搜索时指定分区,就可以提高搜索效率。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 创建集合</span><br><span class="hljs-keyword">from</span> pymilvus <span class="hljs-keyword">import</span> MilvusClient, DataType<br><span class="hljs-keyword">import</span> time<br><br><span class="hljs-comment"># 删除同名集合</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">check_collection</span>(<span class="hljs-params">collection_name</span>):<br> <span class="hljs-keyword">if</span> milvus_client.has_collection(collection_name):<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"集合 <span class="hljs-subst">{collection_name}</span> 已经存在"</span>)<br> <span class="hljs-keyword">try</span>:<br> milvus_client.drop_collection(collection_name)<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"删除集合:<span class="hljs-subst">{collection_name}</span>"</span>)<br> <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span><br> <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"删除集合时出现错误: <span class="hljs-subst">{e}</span>"</span>)<br> <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span><br> <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span><br><br><span class="hljs-comment"># 创建模式</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">create_schema</span>():<br> schema = milvus_client.create_schema(<br> auto_id=<span class="hljs-literal">False</span>,<br> enable_dynamic_field=<span class="hljs-literal">True</span>,<br> partition_key_field=<span class="hljs-string">"method"</span>,<br> num_partitions=<span class="hljs-number">16</span>,<br> description=<span class="hljs-string">""</span><br> )<br> schema.add_field(field_name=<span class="hljs-string">"id"</span>, datatype=DataType.VARCHAR, is_primary=<span class="hljs-literal">True</span>, max_length=<span class="hljs-number">100</span>)<br> schema.add_field(field_name=<span class="hljs-string">"book"</span>, datatype=DataType.VARCHAR, max_length=<span class="hljs-number">100</span>)<br> schema.add_field(field_name=<span class="hljs-string">"title"</span>, datatype=DataType.VARCHAR, max_length=<span class="hljs-number">100</span>)<br> schema.add_field(field_name=<span class="hljs-string">"chunk"</span>, datatype=DataType.VARCHAR, max_length=<span class="hljs-number">4000</span>)<br> schema.add_field(field_name=<span class="hljs-string">"window"</span>, datatype=DataType.VARCHAR, max_length=<span class="hljs-number">6000</span>)<br> schema.add_field(field_name=<span class="hljs-string">"method"</span>, datatype=DataType.VARCHAR, max_length=<span class="hljs-number">30</span>)<br> schema.add_field(field_name=<span class="hljs-string">"dense_vectors"</span>, datatype=DataType.FLOAT_VECTOR, dim=<span class="hljs-number">1024</span>)<br> <span class="hljs-keyword">return</span> schema<br><br><span class="hljs-comment"># 创建集合</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">create_collection</span>(<span class="hljs-params">collection_name, schema, timeout</span>):<br> <span class="hljs-keyword">try</span>:<br> milvus_client.create_collection(<br> collection_name=collection_name,<br> schema=schema,<br> shards_num=<span class="hljs-number">2</span><br> )<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"开始创建集合:<span class="hljs-subst">{collection_name}</span>"</span>)<br> <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"创建集合的过程中出现了错误: <span class="hljs-subst">{e}</span>"</span>)<br> <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span><br><br> <span class="hljs-comment"># 检查集合是否创建成功</span><br> start_time = time.time()<br> <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:<br> <span class="hljs-keyword">if</span> milvus_client.has_collection(collection_name):<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"集合 <span class="hljs-subst">{collection_name}</span> 创建成功"</span>)<br> <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span><br> <span class="hljs-keyword">elif</span> time.time() - start_time > timeout:<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"创建集合 <span class="hljs-subst">{collection_name}</span> 超时"</span>)<br> <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span><br> time.sleep(<span class="hljs-number">1</span>)<br><br>collection_name = <span class="hljs-string">"LuXunWorks_sample"</span><br>uri=<span class="hljs-string">"http://localhost:19530"</span><br>milvus_client = MilvusClient(uri=uri)<br>timeout = <span class="hljs-number">10</span><br><br><span class="hljs-comment"># 检查并删除集合</span><br><span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> check_collection(collection_name):<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"无法删除集合 <span class="hljs-subst">{collection_name}</span>,停止创建。"</span>)<br><span class="hljs-keyword">else</span>:<br> <span class="hljs-comment"># 创建集合的模式</span><br> schema = create_schema()<br> <span class="hljs-comment"># 创建集合并等待成功</span><br> create_collection(collection_name, schema, timeout)<br></code></pre></td></tr></table></figure><p>把数据插入到向量数据库。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-keyword">from</span> tqdm <span class="hljs-keyword">import</span> tqdm<br><span class="hljs-keyword">def</span> <span class="hljs-title function_">insert_data</span>(<span class="hljs-params"></span><br><span class="hljs-params"> collection_name,</span><br><span class="hljs-params"> data_list,</span><br><span class="hljs-params"> dense_vectors,</span><br><span class="hljs-params"> batch_size=<span class="hljs-number">1000</span></span>): <br> <span class="hljs-comment"># 接收稠密向量</span><br> <span class="hljs-keyword">for</span> data, dense_vector <span class="hljs-keyword">in</span> <span class="hljs-built_in">zip</span>(data_list, dense_vectors):<br> data[<span class="hljs-string">'dense_vectors'</span>] = dense_vector<br><br> <span class="hljs-comment"># 分批入库</span><br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"正在将数据插入集合:<span class="hljs-subst">{collection_name}</span>"</span>)<br> total_count = <span class="hljs-built_in">len</span>(data_list)<br> <span class="hljs-keyword">with</span> tqdm(total=total_count, desc=<span class="hljs-string">"插入数据"</span>) <span class="hljs-keyword">as</span> progress_bar:<br> <span class="hljs-comment"># 每次插入 batch_size 条数据</span><br> <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(<span class="hljs-number">0</span>, total_count, batch_size): <br> batch_data = data_list[i:i + batch_size]<br> res = milvus_client.insert(<br> collection_name=collection_name,<br> data=batch_data<br> )<br> progress_bar.update(<span class="hljs-built_in">len</span>(batch_data))<br><br>insert_data(collection_name, fixed_data_list, dense_vectors=fixed_dense_vectors)<br>insert_data(collection_name, semantic_data_list, dense_vectors=semantic_dense_vectors)<br>insert_data(collection_name, window_data_list, dense_vectors=window_dense_vectors)<br></code></pre></td></tr></table></figure><p>创建索引。我们使用倒排索引,首先创建索引参数。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><code class="hljs python">index_params = milvus_client.prepare_index_params()<br><br>index_params.add_index(<br> <span class="hljs-comment"># 指定索引名称</span><br> index_name=<span class="hljs-string">"IVF_FLAT"</span>,<br> <span class="hljs-comment"># 指定创建索引的字段</span><br> field_name=<span class="hljs-string">"dense_vectors"</span>,<br> <span class="hljs-comment"># 设置索引类型</span><br> index_type=<span class="hljs-string">"IVF_FLAT"</span>,<br> <span class="hljs-comment"># 设置度量方式</span><br> metric_type=<span class="hljs-string">"IP"</span>,<br> <span class="hljs-comment"># 设置索引聚类中心的数量</span><br> params={<span class="hljs-string">"nlist"</span>: <span class="hljs-number">128</span>}<br>)<br></code></pre></td></tr></table></figure><p>接下来创建索引。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><code class="hljs python">milvus_client.create_index(<br> <span class="hljs-comment"># 指定为创建索引的集合</span><br> collection_name=collection_name,<br> <span class="hljs-comment"># 使用前面创建的索引参数创建索引</span><br> index_params=index_params<br>)<br></code></pre></td></tr></table></figure><p>验证下索引是否成功创建。查看集合的所有索引。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><code class="hljs python">res = milvus_client.list_indexes(<br> collection_name=collection_name<br>)<br><span class="hljs-built_in">print</span>(res)<br></code></pre></td></tr></table></figure><p>返回我们创建的索引 <code>['IVF_FLAT']</code>。再查看下索引的详细信息。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><code class="hljs python">res = milvus_client.describe_index(<br> collection_name=collection_name,<br> index_name=<span class="hljs-string">"IVF_FLAT"</span><br>)<br><span class="hljs-built_in">print</span>(res)<br></code></pre></td></tr></table></figure><p>返回下面的索引信息,表示索引创建成功:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">{'nlist': '128', 'index_type': 'IVF_FLAT', 'metric_type': 'IP', 'field_name': 'dense_vectors', 'index_name': 'IVF_FLAT', 'total_rows': 0, 'indexed_rows': 0, 'pending_index_rows': 0, 'state': 'Finished'}<br></code></pre></td></tr></table></figure><p>接下来加载集合到内存。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-built_in">print</span> (<span class="hljs-string">f"正在加载集合:<span class="hljs-subst">{collection_name}</span>"</span>)<br>milvus_client.load_collection (collection_name=collection_name)<br></code></pre></td></tr></table></figure><p>验证下加载状态。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-built_in">print</span> (milvus_client.get_load_state (collection_name=collection_name))<br></code></pre></td></tr></table></figure><p>如果返回 <code>{'state': <LoadState: Loaded>}</code>,说明加载完成。接下来,我们定义搜索函数。</p><p>先定义搜索参数。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><code class="hljs python">search_params = {<br> <span class="hljs-comment"># 度量类型</span><br> <span class="hljs-string">"metric_type"</span>: <span class="hljs-string">"IP"</span>,<br> <span class="hljs-comment"># 搜索过程中要查询的聚类单元数量。增加nprobe值可以提高搜索精度,但会降低搜索速度</span><br> <span class="hljs-string">"params"</span>: {<span class="hljs-string">"nprobe"</span>: <span class="hljs-number">16</span>}<br>}<br></code></pre></td></tr></table></figure><p>再定义搜索函数。还记得前面我们在创建集合时,设置的 <code>partition_key_field</code> 吗?它会根据 <code>method</code> 字段的值,把数据插入到相应的分区中。而搜索函数中的 <code>filter</code> 参数,就是用来指定在哪个分区中搜索的。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 把查询向量化</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">vectorize_query</span>(<span class="hljs-params">query, encoder</span>):<br> <span class="hljs-comment"># 验证参数是否符合要求</span><br> <span class="hljs-keyword">if</span> encoder <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span>:<br> <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">"嵌入模型未初始化。"</span>)<br> <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> (<span class="hljs-built_in">isinstance</span>(query, <span class="hljs-built_in">list</span>) <span class="hljs-keyword">and</span> <span class="hljs-built_in">all</span>(<span class="hljs-built_in">isinstance</span>(text, <span class="hljs-built_in">str</span>) <span class="hljs-keyword">for</span> text <span class="hljs-keyword">in</span> query)):<br> <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">"query必须为字符串列表。"</span>)<br> <span class="hljs-keyword">return</span> encoder.encode_queries(query)<br><br><span class="hljs-comment"># 搜索函数</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">vector_search</span>(<span class="hljs-params"></span><br><span class="hljs-params"> query, </span><br><span class="hljs-params"> search_params,</span><br><span class="hljs-params"> limit,</span><br><span class="hljs-params"> output_fields,</span><br><span class="hljs-params"> partition_name</span><br><span class="hljs-params"> </span>):<br> <span class="hljs-comment"># 将查询转换为向量</span><br> query_vectors = [vectorize_query(query, bge_m3_ef)[<span class="hljs-string">'dense'</span>][<span class="hljs-number">0</span>]]<br> <span class="hljs-comment"># 向量搜索</span><br> res = milvus_client.search(<br> collection_name=collection_name,<br> <span class="hljs-comment"># 指定查询向量</span><br> data=query_vectors,<br> <span class="hljs-comment"># 指定搜索的字段</span><br> anns_field=<span class="hljs-string">"dense_vectors"</span>,<br> <span class="hljs-comment"># 设置搜索参数</span><br> search_params=search_params,<br> <span class="hljs-comment"># 设置搜索结果的数量</span><br> limit=limit,<br> <span class="hljs-comment"># 设置输出字段</span><br> output_fields=output_fields,<br> <span class="hljs-comment"># 在指定分区中搜索</span><br> <span class="hljs-built_in">filter</span>=<span class="hljs-string">f"method =='<span class="hljs-subst">{partition_name}</span>'"</span><br> )<br> <span class="hljs-keyword">return</span> res<br></code></pre></td></tr></table></figure><p>再定义一个打印搜索结果的函数,方便查看。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 打印向量搜索结果</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">print_vector_results</span>(<span class="hljs-params">res</span>):<br> <span class="hljs-comment"># hit是搜索结果中的每一个匹配的实体</span><br> res = [hit[<span class="hljs-string">"entity"</span>] <span class="hljs-keyword">for</span> hit <span class="hljs-keyword">in</span> res[<span class="hljs-number">0</span>]]<br> <span class="hljs-keyword">for</span> item <span class="hljs-keyword">in</span> res:<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"title: <span class="hljs-subst">{item[<span class="hljs-string">'title'</span>]}</span>"</span>)<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"chunk: <span class="hljs-subst">{item[<span class="hljs-string">'chunk'</span>]}</span>"</span>)<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"method: <span class="hljs-subst">{item[<span class="hljs-string">'method'</span>]}</span>"</span>)<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">"-"</span>*<span class="hljs-number">50</span>) <br></code></pre></td></tr></table></figure><p>下面我们就来看一看,<code>fixed_chunk</code>、<code>semantic_chunk</code> 和 <code>window_chunk</code> 三位选手在向量搜索上表现如何。首先搜索第一个句子:“世上本没有路,走的人多了,也便成了路。”。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 比较不同分块方法产生的搜索结果</span><br>query1 = [<span class="hljs-string">"世上本没有路,走的人多了,也便成了路。"</span>]<br>limit = <span class="hljs-number">1</span><br>output_fields = [<span class="hljs-string">"title"</span>, <span class="hljs-string">"chunk"</span>, <span class="hljs-string">"window"</span>, <span class="hljs-string">"method"</span>]<br><br><span class="hljs-comment"># 定义分块方法列表</span><br>chunk_methods = [<span class="hljs-string">"fixed_chunk"</span>, <span class="hljs-string">"semantic_chunk"</span>, <span class="hljs-string">"window_chunk"</span>]<br><br><span class="hljs-comment"># 定义一个函数来执行搜索并打印结果</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">compare_chunk_methods</span>(<span class="hljs-params">query, search_params, limit, output_fields, methods</span>):<br> <span class="hljs-keyword">for</span> method <span class="hljs-keyword">in</span> methods:<br> res = vector_search(query, search_params, limit, output_fields, method)<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"<span class="hljs-subst">{method}</span> 的搜索结果是:\n"</span>)<br> print_vector_results(res)<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">"*"</span> * <span class="hljs-number">50</span>)<br><br><span class="hljs-comment"># 调用函数进行比较</span><br>compare_chunk_methods(query1, search_params, limit, output_fields, chunk_methods)<br></code></pre></td></tr></table></figure><p><code>fixed_chunk</code> 选手的确搜索到了原文,但是并不完整。这也是 <code>fixed_chunk</code> 分块的典型问题。</p><p>搜索结果如下:</p><blockquote><p>的人多了,也便成了路。一九二一年一月。 </p></blockquote><p><code>semantic_chunk</code> 选手的表现让人失望,它并没有搜索到原文。它的搜索结果是:</p><blockquote><p>跨过了灭亡的人们向前进。什么是路?就是从没路的地方践踏出来的,从只有荆棘的地方开辟出来的。以前早有路了,以后也该永远有路。人类总不会寂寞,因为生命是进步的,是乐天的。昨天,我对我的朋友 L 说,“一个人死了,在死者自身和他的眷属是悲惨的事,</p></blockquote><p>但是它给我们带来了意外收获,搜索结果的意思和原文有些类似。这也是向量数据库语义搜索功能的体现。</p><p>原文其实在这个块中:</p><blockquote><p>“我的愿望茫远罢了。我在朦胧中,眼前展开一片海边碧绿的沙地来,上面深蓝的天空中挂着一轮金黄的圆月。我想:希望本是无所谓有,无所谓无的。这正如地上的路;其实地上本没有路,走的人多了,也便成了路。一九二一年一月。”</p></blockquote><p><code>semantic_chunk</code> 选手没有搜索到它,可能是因为这个块的前半部分和查询句子的语义相差较远。这也反应了分块对搜索结果的影响。</p><p>最后出场的 <code>window_chunk</code> 选手,给出了标准答案:</p><blockquote><p>这正如地上的路;其实地上本没有路,走的人多了,也便成了路。</p></blockquote><p>恭喜 <code>window_chunk</code> 选手完美找到了原文。因为它基于句子分割,能够更好地保存句子的语义。</p><p>我们再来看看第二个句子,三位选手的表现如何。搜索句子:“我家墙外有两株树,一株是枣树,还有一株也是枣树。”</p><p><code>fixed_chunk</code> 选手给出的句子仍然不完整,但是包含了完整的原文:</p><blockquote><p>在我的后园,可以看见墙外有两株树,一株是枣树,还有一株也是枣树。这上面的夜的天空,奇怪而高,我生平没有见过这样的奇怪而高的天空。他仿佛要离开人间而去,使人们仰面不再看见。然而现在却非常之蓝,闪闪地䀹着几十个星星的眼,冷眼。他的口角上现出微笑,似乎自以为大有深意,而将繁霜洒在我的园里的野花草上。我不知</p></blockquote><p><code>semantic_chunk</code> 选手这次正常发挥,也找到了原文:</p><blockquote><p>在我的后园,可以看见墙外有两株树,一株是枣树,还有一株也是枣树。这上面的夜的天空,奇怪而高,我生平没有见过这样的奇怪而高的天空。他仿佛要离开人间而去,使人们仰面不再看见。然而现在却非常之蓝,闪闪地䀹着几十个星星的眼,冷眼。他的口角上现出微笑,</p></blockquote><p><code>window_chunk</code> 选手依旧给出了完美答案:</p><blockquote><p>在我的后园,可以看见墙外有两株树,一株是枣树,还有一株也是枣树。</p></blockquote><p>虽然三位选手都找到了原文,但是 <code>window_chunk</code> 选手返回的原文不但完整,而且没有包含无关内容,减少了干扰信息。</p><p>再来看看最后一个句子:</p><blockquote><p>“猛兽总是独行,牛羊才成群结对。”</p></blockquote><p><code>fixed_chunk</code> 选手找到了类似的句子,但是包含了较多的无关内容:</p><blockquote><p>兽是单独的,牛羊则结队;野牛的大队,就会排角成城以御强敌了,但拉开一匹,定只能牟牟地叫。人民与牛马同流,——此就中国而言,夷人别有分类法云,——治之之道,自然应该禁止集合:这方法是对的。其次要防说话。人能说话,已经是祸胎了,而况有时还要做文章。所以苍颉造字,夜有鬼哭。鬼且反对,而况于官?猴子不会说话</p></blockquote><p><code>semantic_chunk</code> 和 <code>fixed_chunk</code> 表现类似:</p><blockquote><p>牛羊则结队;野牛的大队,就会排角成城以御强敌了,但拉开一匹,定只能牟牟地叫。人民与牛马同流,——此就中国而言,夷人别有分类法云,——治之之道,自然应该禁止集合:这方法是对的。其次要防说话。人能说话,已经是祸胎了,而况有时还要做文章。所以苍颉造字,夜有鬼哭。</p></blockquote><p>我们最后看看 <code>window_chunk</code> 选手的表现:</p><blockquote><p>猛兽是单独的,牛羊则结队;野牛的大队,就会排角成城以御强敌了,但拉开一匹,定只能牟牟地叫。</p></blockquote><p>别忘了 <code>window_chunk</code> 选手除了搜索到的原始句子,还能提供“窗口句子”作为上下文:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 查看`window_chunk`方法的窗口句子</span><br>method = <span class="hljs-string">"window_chunk"</span><br>res_window_chunk = vector_search(query3, search_params, limit, output_fields, method)<br>res_window_chunk = [hit[<span class="hljs-string">"entity"</span>] <span class="hljs-keyword">for</span> hit <span class="hljs-keyword">in</span> res_window_chunk[<span class="hljs-number">0</span>]]<br><span class="hljs-keyword">for</span> item <span class="hljs-keyword">in</span> res_window_chunk:<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"window: <span class="hljs-subst">{item[<span class="hljs-string">'window'</span>]}</span>"</span>)<br></code></pre></td></tr></table></figure><p>窗口句子如下:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">window: 然亦可见至道嘉猷,人同此心,心同此理,固无华夷之限也。 猛兽是单独的,牛羊则结队;野牛的大队,就会排角成城以御强敌了,但拉开一匹,定只能牟牟地叫。 人民与牛马同流,——此就中国而言,夷人别有分类法云,——治之之道,自然应该禁止集合:这方法是对的。<br></code></pre></td></tr></table></figure><p>在 RAG 应用中,把上下文句子一起传递给大模型,能让大模型更好地理解句子的语义,作出更好的回答。</p><h2 id="调用大模型的-API"><a href="#调用大模型的-API" class="headerlink" title="调用大模型的 API"></a>调用大模型的 API</h2><p>创建向量数据库这部分想必你已经轻车熟路了,下面我们来完成 RAG 应用的最后一个部分:生成。我们要把搜索到的句子传递给大模型,让它根据提示词重新组装成回答。</p><p>首先,我们要创建一个大模型的 api key,用来调用大模型。我使用的是 <a href="https://platform.deepseek.com/api_keys">deepseek</a>。为了保护 api key 的安全,把 api key 设置为环境变量“DEEPSEEK_API_KEY”。请把 <code><you_api_key></code> 替换成你自己的 api key。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-keyword">import</span> os<br>os.environ[<span class="hljs-string">'DEEPSEEK_API_KEY'</span>] = <you_api_key><br></code></pre></td></tr></table></figure><p>然后,再从环境变量中读取 api key。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs python">deepseek_api_key = os.getenv(<span class="hljs-string">"DEEPSEEK_API_KEY"</span>)<br></code></pre></td></tr></table></figure><p>deepseek 使用与 OpenAI 兼容的 API 格式,我们可以使用 OpenAI SDK 来访问 DeepSeek API。</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><code class="hljs shell"><span class="hljs-meta prompt_"># </span><span class="language-bash">安装 openai 库</span><br>pip install openai<br></code></pre></td></tr></table></figure><p>接下来创建 openai 客户端实例。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 导入openai库</span><br><span class="hljs-keyword">from</span> openai <span class="hljs-keyword">import</span> OpenAI<br><br><span class="hljs-comment"># 导入os库</span><br><span class="hljs-keyword">import</span> os<br><br><span class="hljs-comment"># 创建openai客户端实例</span><br>OpenAI_client = OpenAI(api_key=deepseek_api_key, base_url=<span class="hljs-string">"https://api.deepseek.com"</span>)<br></code></pre></td></tr></table></figure><p>根据 <a href="https://api-docs.deepseek.com/zh-cn/">deepseek api 文档</a>的说明,定义生成响应的函数 <code>generate_response</code>。<code>model</code> 是我们使用的大模型,这里是 <code>deepseek-chat</code>。<code>temperature</code> 决定大模型回答的随机性,数值在0-2之间,数值越高,生成的文本越随机;值越低,生成的文本越确定。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 定义生成响应的函数</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">generate_response</span>(<span class="hljs-params"></span><br><span class="hljs-params"> system_prompt, </span><br><span class="hljs-params"> user_prompt, </span><br><span class="hljs-params"> model, </span><br><span class="hljs-params"> temperature</span><br><span class="hljs-params"> </span>):<br> <span class="hljs-comment"># 大模型的响应</span><br> response = OpenAI_client.chat.completions.create(<br> model=model,<br> messages=[<br> <span class="hljs-comment"># 设置系统信息,通常用于设置模型的行为、角色或上下文。</span><br> {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span>: system_prompt},<br> <span class="hljs-comment"># 设置用户消息,用户消息是用户发送给模型的消息。</span><br> {<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: user_prompt},<br> ],<br> <span class="hljs-comment"># 设置温度</span><br> temperature=temperature, <br> stream=<span class="hljs-literal">True</span><br> )<br> <span class="hljs-comment"># 遍历响应中的每个块</span><br> <span class="hljs-keyword">for</span> chunk <span class="hljs-keyword">in</span> response:<br> <span class="hljs-comment"># 检查块中是否包含选择项</span><br> <span class="hljs-keyword">if</span> chunk.choices:<br> <span class="hljs-comment"># 打印选择项中的第一个选项的增量内容,并确保立即刷新输出</span><br> <span class="hljs-built_in">print</span>(chunk.choices[<span class="hljs-number">0</span>].delta.content, end=<span class="hljs-string">""</span>, flush=<span class="hljs-literal">True</span>)<br></code></pre></td></tr></table></figure><p>响应函数接收的参数中,<code>system_prompt</code> 是系统提示词,主要用于设置模型的行为、角色或上下文。你可以理解为这是系统给大模型的提示词,而且始终有效。我们可以使用下面的提示词规范大模型的响应:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs python">system_prompt = <span class="hljs-string">"你是鲁迅作品研究者,熟悉鲁迅的各种作品。"</span><br></code></pre></td></tr></table></figure><p><code>user_prompt</code> 是用户提示词,是用户发给大模型的。大模型会在系统提示词和用户提示词的共同作用下,生成响应。用户提示词由查询句子 <code>query</code> 和向量数据库搜索到的句子组成。对于 <code>fixed_chunk</code> 和 <code>semantic_chunk</code>,我们需要获取 <code>chunk</code> 字段的值。对于 <code>window_chunk</code>,我们需要获取 <code>window</code> 字段的值。定义下面的函数可以帮助我们方便获取想要的值。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-keyword">def</span> <span class="hljs-title function_">get_ref_info</span> (query, search_params, limit, output_fields, method):<br> res = vector_search (query, search_params, limit, output_fields, method)<br> <span class="hljs-keyword">for</span> hit <span class="hljs-keyword">in</span> res[<span class="hljs-number">0</span>]:<br> ref_info = {<br> <span class="hljs-string">"ref"</span>: hit[<span class="hljs-string">"entity"</span>][<span class="hljs-string">"window"</span>] <span class="hljs-keyword">if</span> method == <span class="hljs-string">"window_chunk"</span> <span class="hljs-keyword">else</span> hit[<span class="hljs-string">"entity"</span>][<span class="hljs-string">"chunk"</span>],<br> <span class="hljs-string">"title"</span>: hit[<span class="hljs-string">"entity"</span>][<span class="hljs-string">"title"</span>]<br> }<br> <span class="hljs-keyword">return</span> ref_info<br></code></pre></td></tr></table></figure><p>最后,针对不同的分块方法,获取对应的响应。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-keyword">for</span> method <span class="hljs-keyword">in</span> chunk_methods:<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"分块方法: <span class="hljs-subst">{method}</span>"</span>)<br> <span class="hljs-comment"># 获取参考信息</span><br> ref_info = get_ref_info(query, search_params, limit, output_fields, method)<br> <span class="hljs-comment"># 生成用户提示词</span><br> user_prompt = (<br> <span class="hljs-string">f"请你根据提供的参考信息,查找是否有与问题语义相似的内容。参考信息:<span class="hljs-subst">{ref_info}</span>。问题:<span class="hljs-subst">{query}</span>。\n"</span><br> <span class="hljs-string">f"如果找到了相似的内容,请回复“鲁迅的确说过类似的话,原文是[原文内容],这句话来自[文章标题]”。\n"</span><br> <span class="hljs-string">f"[原文内容]是参考信息中ref字段的值,[文章标题]是参考信息中title字段的值。如果引用它们,请引用完整的内容。\n"</span><br> <span class="hljs-string">f"如果参考信息没有提供和问题相关的内容,请回答“据我所知,鲁迅并没有说过类似的话。”"</span><br>)<br> <span class="hljs-comment"># 生成响应</span><br> generate_response(system_prompt, user_prompt, model, temperature)<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">"\n"</span> + <span class="hljs-string">"*"</span> * <span class="hljs-number">50</span> + <span class="hljs-string">"\n"</span>)<br></code></pre></td></tr></table></figure><p>好啦,一切准备就绪,让我们看看使用不同分块方法的 RAG,究竟有什么区别。先看第一句话,“世上本没有路,走的人多了,也便成了路。”,搜索结果:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">分块方法: fixed_chunk<br>鲁迅的确说过类似的话,原文是“的人多了,也便成了路。 一九二一年一月。”,这句话来自《故乡》。<br>**************************************************<br><br>分块方法: semantic_chunk<br>鲁迅的确说过类似的话,原文是“跨过了灭亡的人们向前进。什么是路?就是从没路的地方践踏出来的,从只有荆棘的地方开辟出来的。以前早有路了,以后也该永远有路。人类总不会寂寞,因为生命是进步的,是乐天的。昨天,我对我的朋友L说,‘一个人死了,在死者自身和他的眷属是悲惨的事,”,这句话来自《六十六生命的路》。<br>**************************************************<br><br>分块方法: window_chunk<br>鲁迅的确说过类似的话,原文是“我想:希望本是无所谓有,无所谓无的。 这正如地上的路;其实地上本没有路,走的人多了,也便成了路。 一九二一年一月。”,这句话来自《故乡》。<br>**************************************************<br></code></pre></td></tr></table></figure><p><code>fixed_chunk</code> 选手虽然给出了原文,但是遗憾的是不够完整。<code>semantic_chunk</code> 选手没有搜索到原文,但是给出的句子语义也和原文类似,算是意外收获。而 <code>window_chunk</code> 选手则给出了标准答案。</p><p>再来看看第二句,“我家墙外有两株树,一株是枣树,还有一株也是枣树。”搜索结果:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">分块方法: fixed_chunk<br>鲁迅的确说过类似的话,原文是在我的后园,可以看见墙外有两株树,一株是枣树,还有一株也是枣树。这上面的夜的天空,奇怪而高,我生平没有见过这样的奇怪而高的天空。他仿佛要离开人间而去,使人们仰面不再看见。然而现在却非常之蓝,闪闪地䀹着几十个星星的眼,冷眼。他的口角上现出微笑,似乎自以为大有深意,而将繁霜洒在我的园里的野花草上。我不知,这句话来自秋夜。<br>**************************************************<br><br>分块方法: semantic_chunk<br>鲁迅的确说过类似的话,原文是“在我的后园,可以看见墙外有两株树,一株是枣树,还有一株也是枣树。这上面的夜的天空,奇怪而高,我生平没有见过这样的奇怪而高的天空。他仿佛要离开人间而去,使人们仰面不再看见。然而现在却非常之蓝,闪闪地䀹着几十个星星的眼,冷眼。他的口角上现出微笑,”,这句话来自《秋夜》。<br>**************************************************<br><br>分块方法: window_chunk<br>鲁迅的确说过类似的话,原文是“在我的后园,可以看见墙外有两株树,一株是枣树,还有一株也是枣树。 这上面的夜的天空,奇怪而高,我生平没有见过这样的奇怪而高的天空。”,这句话来自《秋夜》。<br>**************************************************<br></code></pre></td></tr></table></figure><p>三位选手表现差不多,<code>window_chunk</code> 选手给出的结果更精准。</p><p>最后来看看第三句,“猛兽总是独行,牛羊才成群结对。”搜索结果:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">分块方法: fixed_chunk<br>鲁迅的确说过类似的话,原文是“兽是单独的,牛羊则结队;野牛的大队,就会排角成城以御强敌了,但拉开一匹,定只能牟牟地叫。人民与牛马同流,——此就中国而言,夷人别有分类法云,——治之之道,自然应该禁止集合:这方法是对的。其次要防说话。人能说话,已经是祸胎了,而况有时还要做文章。所以苍颉造字,夜有鬼哭。鬼且反对,而况于官?猴子不会说话”,这句话来自《春末闲谈》。<br>**************************************************<br><br>分块方法: semantic_chunk<br>鲁迅的确说过类似的话,原文是“牛羊则结队;野牛的大队,就会排角成城以御强敌了,但拉开一匹,定只能牟牟地叫。人民与牛马同流,——此就中国而言,夷人别有分类法云,——治之之道,自然应该禁止集合:这方法是对的。其次要防说话。人能说话,已经是祸胎了,而况有时还要做文章。所以苍颉造字,夜有鬼哭。”,这句话来自《春末闲谈》。<br>**************************************************<br><br>分块方法: window_chunk<br>鲁迅的确说过类似的话,原文是“猛兽是单独的,牛羊则结队;野牛的大队,就会排角成城以御强敌了,但拉开一匹,定只能牟牟地叫。”,这句话来自《春末闲谈》。<br>**************************************************<br></code></pre></td></tr></table></figure><p>和上一句的搜索结果相似,三位选手都找到了正确的句子,<code>window_chunk</code> 选手的答案最标准。请为 <code>window_chunk</code> 选手的精彩表现鼓掌。</p><h2 id="更多探索"><a href="#更多探索" class="headerlink" title="更多探索"></a>更多探索</h2><p>其实,RAG 的响应和很多因素相关,你可以多多尝试,看看结果有什么不同。比如,修改 <code>vector_search</code> 函数的 <code>limit</code> 参数,让向量数据库多返回几个句子,增加命中概率。或者增加 <code>generate_response</code> 函数的 <code>temperature</code> 参数,看看 RAG 的响应如何变化。还有提示词,它直接影响大模型如何回答。</p><p>另外,你还可以基于本应用,开发其他功能,比如鲁迅作品智能问答功能,解答关于鲁迅作品的问题。或者鲁迅作品推荐功能,输入你想要阅读的作品类型,让 RAG 为你做推荐。玩法多多,祝你玩得开心。</p><h2 id="藏宝图"><a href="#藏宝图" class="headerlink" title="藏宝图"></a>藏宝图</h2><p>老规矩,推荐一些资料供你参考。<br><a href="https://chunkviz.up.railway.app/">ChunkViz</a> 是一个在线网站,提供分块可视化功能。</p><p>想了解 RAG 更多有趣应用,可以看看这个视频:<a href="https://www.bilibili.com/video/BV1da4y1k78p/?spm_id_from=333.337.search-card.all.click&vd_source=ad92e3138da83a643ab3f5883c7664c7">当我开发出史料检索RAG应用,正史怪又该如何应对?</a>。想了解更多技术细节,看这里: <a href="https://zilliz.com.cn/blog/milvus-rag-bilibili">揭秘「 B 站最火的 RAG 应用」是如何炼成的</a>。</p><p>想了解更多分块技术,可以阅读<a href="https://zilliz.com.cn/blog/guide-to-chunking-sreategies-for-rag">检索增强生成(RAG)的分块策略指南</a>和<a href="https://safjan.com/from-fixed-size-to-nlp-chunking-a-deep-dive-into-text-chunking-techniques/">从固定大小到NLP分块 - 文本分块技术的深入研究</a>两篇文章。</p><h2 id="注释"><a href="#注释" class="headerlink" title="注释"></a>注释</h2><section class="footnotes"><div class="footnote-list"><ol><li><span id="fn:1" class="footnote-text"><span>鲁迅作品集数据基于 <a href="https://github.com/sun510001/luxun_dataset">luxun_dataset</a> ,增加了一些字段。<code>luxun_sample.json</code> 为鲁迅部分作品,方便试用。<code>luxun.json</code> 为完整的鲁迅作品集。<a href="#fnref:1" rev="footnote" class="footnote-backref"> ↩</a></span></span></li><li><span id="fn:2" class="footnote-text"><span>LlamaIndex 是一个用于构建带有上下文增强功能的生成式 AI 应用的框架,支持大型语言模型(LLMs)。<a href="#fnref:2" rev="footnote" class="footnote-backref"> ↩</a></span></span></li></ol></div></section>]]></content>
<categories>
<category>向量数据库</category>
<category>原理探秘</category>
</categories>
</entry>
<entry>
<title>孙悟空 + 红楼梦 - 西游记 = ?向量嵌入之稠密向量</title>
<link href="/2024/10/11/%E5%AD%99%E6%82%9F%E7%A9%BA-%E7%BA%A2%E6%A5%BC%E6%A2%A6-%E8%A5%BF%E6%B8%B8%E8%AE%B0-%EF%BC%9F%E5%90%91%E9%87%8F%E5%B5%8C%E5%85%A5%E4%B9%8B%E7%A8%A0%E5%AF%86%E5%90%91%E9%87%8F/"/>
<url>/2024/10/11/%E5%AD%99%E6%82%9F%E7%A9%BA-%E7%BA%A2%E6%A5%BC%E6%A2%A6-%E8%A5%BF%E6%B8%B8%E8%AE%B0-%EF%BC%9F%E5%90%91%E9%87%8F%E5%B5%8C%E5%85%A5%E4%B9%8B%E7%A8%A0%E5%AF%86%E5%90%91%E9%87%8F/</url>
<content type="html"><![CDATA[<p>一起来开个脑洞,如果孙悟空穿越到红楼梦的世界,他会成为谁?贾宝玉,林黛玉,还是薛宝钗?这看似一道文学题,但是我们不妨用数学方法来求解:<code>孙悟空 + 红楼梦 - 西游记 = ?</code></p><p>文字也能做运算?当然不行,但是把文字转换成数字之后,就可以用来计算了。而这个过程,叫做 “向量嵌入”。为什么要做向量嵌入?因为具有语义意义的数据,比如文本或者图像,人可以分辨相关程度,但是无法量化,更不能计算。比如,对于一组词“孙悟空、猪八戒、沙僧、西瓜、苹果、香蕉“,我会把“孙悟空、猪八戒、沙僧”分成一组,“西瓜、苹果、香蕉”分成另一组。但是,如果进一步提问,“孙悟空”是和“猪八戒”更相关,还是和“沙僧”更相关呢?这很难回答。</p><p>而把这些信息转换成向量后,相关程度就可以通过它们在向量空间中的距离量化。甚至于,我们可以做 <code>孙悟空 + 红楼梦 - 西游记 = ?</code> 这样的脑洞数学题。</p><blockquote><p>本文首发于 Zilliz 公众号。文中代码的 Notebook 在<a href="https://pan.baidu.com/s/1VtPt-6Y_hhxKn9uB4AMNFg?pwd=7zv9">这里</a>下载。</p></blockquote><h2 id="文字是怎么变成向量的"><a href="#文字是怎么变成向量的" class="headerlink" title="文字是怎么变成向量的"></a>文字是怎么变成向量的</h2><p>怎么把文字变成向量呢?首先出现的是词向量,其中的代表是 word2vec 模型。它先准备一张词汇表,给每个词随机赋予一个向量,然后利用大量语料,通过 CBOW(Continuous Bag-of-Words)和 Skip-Gram 两种方法训练模型,不断优化字词的向量。</p><p>CBOW 使用上下文(周围的词)预测目标词<sup id="fnref:1" class="footnote-ref"><a href="#fn:1" rel="footnote"><span class="hint--top hint--rounded" aria-label="严格来说,“目标词”不是单词而是“token”。token 是组成句子的基本单元。对于英文来说,token可以简单理解为单词,还可能是子词(subword)或者标点符号,比如“unhappiness” 可能会被分割成“un”和“happiness“。对于汉字来说,则是字、词或者短语,汉字不会像英文单词那样被分割。">[1]</span></a></sup>,而 Skip-Gram 则相反,通过目标词预测它的上下文。举个例子,对于“我爱吃冰淇淋”这个句子,CBOW方法已知上下文“我爱“和”冰淇淋”,计算出中间词的概率,比如,“吃”的概率是90%,“喝”的概率是7%,“玩”的概率是3%。然后再使用损失函数预测概率与实际概率的差异,最后通过反向传播算法,调整词向量模型的参数,使得损失函数最小化。训练词向量模型的最终目的,是捕捉词汇之间的语义关系,使得相关的词在向量空间中距离更近。</p><p>打个比方,最初的词向量模型就像一个刚出生的孩子,对字词的理解是模糊的。父母在各种场景下和孩子说话,时不时考一考孩子,相当于用语料库训练模型。只不过训练模型的过程是不断迭代神经网络的参数,而教孩子说话,则是改变大脑皮层中神经元突触的连接。</p><p>比如,父母会在吃饭前跟孩子说:<br>“肚子饿了就要…”<br>“要吃饭。”</p><p>如果答错了,父母会纠正孩子:<br>“吃饭之前要…”<br>“要喝汤。”<br>“不对,吃饭之前要洗手。”</p><p>这就是在调整模型的参数。</p><p>好了,纸上谈兵结束,咱们用代码实际操练一番吧。</p><p>版本说明:<br>Milvus 版本:>=2.5.0<br>pymilvus 版本:>=2.5.0</p><p>安装依赖:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs shell">pip install gensim scikit-learn transformers matplotlib<br></code></pre></td></tr></table></figure><p>从 gensim.models 模块中导入 KeyedVectors 类,它用于存储和操作词向量。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-keyword">from</span> gensim.models <span class="hljs-keyword">import</span> KeyedVectors<br></code></pre></td></tr></table></figure><p>在<a href="https://github.com/Embedding/Chinese-Word-Vectors/blob/master/README_zh.md">这里</a>下载中文词向量模型 <code>Literature 文学作品</code>,并且加载该模型。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 加载中文词向量模型</span><br>word_vectors = KeyedVectors.load_word2vec_format(<span class="hljs-string">'sgns.literature.word'</span>, binary=<span class="hljs-literal">False</span>)<br></code></pre></td></tr></table></figure><p>词向量模型其实就像一本字典。在字典里,每个字对应的是一条解释,在词向量模型中,每个词对应的是一个向量。</p><p>我们使用的词向量模型是300维的,数量太多,可以只显示前4个维度的数值:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-built_in">print</span>(<span class="hljs-string">f"'孙悟空'的向量的前四个维度:<span class="hljs-subst">{word_vectors[<span class="hljs-string">'孙悟空'</span>].tolist()[:<span class="hljs-number">4</span>]}</span>"</span>)<br></code></pre></td></tr></table></figure><p>输出结果为:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">'孙悟空'的向量的前四个维度:[-0.09262000024318695, -0.034056998789310455, -0.16306699812412262, -0.05771299824118614]<br></code></pre></td></tr></table></figure><h2 id="语义更近,距离更近"><a href="#语义更近,距离更近" class="headerlink" title="语义更近,距离更近"></a>语义更近,距离更近</h2><p>前面我们提出了疑问,“孙悟空”是和“猪八戒”更相关,还是和“沙僧”更相关呢?在 <a href="http://jiangjunqiao.top/2024/09/16/%E5%A6%82%E4%BD%95%E5%81%87%E8%A3%85%E6%96%87%E8%89%BA%E9%9D%92%E5%B9%B4%EF%BC%8C%E6%80%8E%E4%B9%88%E6%8A%8A%E5%A4%A7%E7%99%BD%E8%AF%9D%E2%80%9C%E5%8F%98%E6%88%90%E2%80%9D%E5%8F%A4%E8%AF%97%E8%AF%8D%EF%BC%9F/">如何假装文艺青年,怎么把大白话“变成”古诗词?</a> 这篇文章中,我们使用内积 <code>IP</code> 计算两个向量的距离,这里我们使用余弦相似度来计算。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-built_in">print</span>(<span class="hljs-string">f"'孙悟空'和'猪八戒'向量的余弦相似度是:<span class="hljs-subst">{word_vectors.similarity(<span class="hljs-string">'孙悟空'</span>, <span class="hljs-string">'猪八戒'</span>):<span class="hljs-number">.2</span>f}</span>"</span>)<br><br><span class="hljs-built_in">print</span>(<span class="hljs-string">f"'孙悟空'和'沙僧'向量的余弦相似度是:<span class="hljs-subst">{word_vectors.similarity(<span class="hljs-string">'孙悟空'</span>, <span class="hljs-string">'沙僧'</span>):<span class="hljs-number">.2</span>f}</span>"</span>)<br></code></pre></td></tr></table></figure><p>返回:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">'孙悟空'和'猪八戒'向量的余弦相似度是:0.60<br>'孙悟空'和'沙僧'向量的余弦相似度是:0.59<br></code></pre></td></tr></table></figure><p>看来,孙悟空还是和猪八戒更相关。但是我们还不满足,我们还想知道,和孙悟空最相关的是谁。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 查找与“孙悟空”最相关的4个词</span><br>similar_words = word_vectors.most_similar(<span class="hljs-string">"孙悟空"</span>, topn=<span class="hljs-number">4</span>)<br><span class="hljs-built_in">print</span>(<span class="hljs-string">f"与'孙悟空'最相关的4个词分别是:"</span>)<br><span class="hljs-keyword">for</span> word, similarity <span class="hljs-keyword">in</span> similar_words:<br><span class="hljs-built_in">print</span>(<span class="hljs-string">f"<span class="hljs-subst">{word}</span>, 余弦相似度为:<span class="hljs-subst">{similarity:<span class="hljs-number">.2</span>f}</span>"</span>)<br></code></pre></td></tr></table></figure><p>返回:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">与'孙悟空'最相关的4个词分别是:<br>悟空, 余弦相似度为:0.66<br>唐僧, 余弦相似度为:0.61<br>美猴王, 余弦相似度为:0.61<br>猪八戒, 余弦相似度为:0.60<br></code></pre></td></tr></table></figure><p>“孙悟空”和“悟空”、“美猴王”相关,这容易理解。为什么它还和“唐僧”、“猪八戒”相关呢?前面提到的词向量模型的训练原理解释,就是因为在训练文本中,“唐僧”、“猪八戒”经常出现在“孙悟空”这个词的上下文中。这不难理解——在《西游记》中,孙悟空经常救唐僧,还喜欢戏耍八戒。</p><p>前面提到,训练词向量模型是为了让语义相关的词,在向量空间中距离更近。那么,我们可以测试一下,给出四组语义相近的词,考一考词向量模型,看它能否识别出来。</p><p>第一组:西游记,三国演义,水浒传,红楼梦<br>第二组:西瓜,苹果,香蕉,梨<br>第三组:长江,黄河</p><p>首先,获取这四组词的词向量:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 导入用于数值计算的库</span><br><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np<br><br><span class="hljs-comment"># 定义要可视化的单词列表</span><br>words = [<span class="hljs-string">"西游记"</span>, <span class="hljs-string">"三国演义"</span>, <span class="hljs-string">"水浒传"</span>, <span class="hljs-string">"红楼梦"</span>, <br> <span class="hljs-string">"西瓜"</span>, <span class="hljs-string">"苹果"</span>, <span class="hljs-string">"香蕉"</span>, <span class="hljs-string">"梨"</span>, <br> <span class="hljs-string">"长江"</span>, <span class="hljs-string">"黄河"</span>]<br><br><span class="hljs-comment"># 使用列表推导式获取每个单词的向量</span><br>vectors = np.array([word_vectors[word] <span class="hljs-keyword">for</span> word <span class="hljs-keyword">in</span> words])<br></code></pre></td></tr></table></figure><p>然后,使用 PCA (Principal Component Analysis,组成分分析)把200维的向量降到2维,一个维度作为 x 坐标,另一个维度作为 y 坐标,这样就把高维向量投影到平面了,方便我们在二维图形上显示它们。换句话说,PCA 相当于《三体》中的二向箔,对高维向量实施了降维打击。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 导入用于降维的PCA类</span><br><span class="hljs-keyword">from</span> sklearn.decomposition <span class="hljs-keyword">import</span> PCA<br><br><span class="hljs-comment"># 创建PCA对象,设置降至2维</span><br>pca = PCA(n_components=<span class="hljs-number">2</span>)<br><br><span class="hljs-comment"># 对词向量实施PCA降维</span><br>vectors_pca = pca.fit_transform(vectors)<br></code></pre></td></tr></table></figure><p>最后,在二维图形上显示降维后的向量。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 导入用于绘图的库</span><br><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt<br><span class="hljs-comment"># 创建一个5x5英寸的图</span><br>fig, axes = plt.subplots(<span class="hljs-number">1</span>, <span class="hljs-number">1</span>, figsize=(<span class="hljs-number">7</span>, <span class="hljs-number">7</span>))<br><br><span class="hljs-comment"># 设置中文字体</span><br>plt.rcParams[<span class="hljs-string">'font.sans-serif'</span>] = [<span class="hljs-string">'Heiti TC'</span>]<br><span class="hljs-comment"># 确保负号能够正确显示</span><br>plt.rcParams[<span class="hljs-string">'axes.unicode_minus'</span>] = <span class="hljs-literal">False</span> <br><br><span class="hljs-comment"># 使用PCA降维后的前两个维度作为x和y坐标绘制散点图</span><br>axes.scatter(vectors_pca[:, <span class="hljs-number">0</span>], vectors_pca[:, <span class="hljs-number">1</span>])<br><br><span class="hljs-comment"># 为每个点添加文本标注</span><br><span class="hljs-keyword">for</span> i, word <span class="hljs-keyword">in</span> <span class="hljs-built_in">enumerate</span>(words):<br> <span class="hljs-comment"># 添加注释,设置文本内容、位置、样式等</span><br> <span class="hljs-comment"># 要显示的文本(单词)</span><br> axes.annotate(word,<br> <span class="hljs-comment"># 点的坐标</span><br> (vectors_pca[i, <span class="hljs-number">0</span>], vectors_pca[i, <span class="hljs-number">1</span>]), <br> <span class="hljs-comment"># 文本相对于点的偏移量</span><br> xytext=(<span class="hljs-number">2</span>, <span class="hljs-number">2</span>), <br> <span class="hljs-comment"># 指定偏移量的单位</span><br> textcoords=<span class="hljs-string">'offset points'</span>, <br> <span class="hljs-comment"># 字体大小</span><br> fontsize=<span class="hljs-number">10</span>, <br> <span class="hljs-comment"># 字体粗细</span><br> fontweight=<span class="hljs-string">'bold'</span>) <br><br><span class="hljs-comment"># 设置图表标题和字体大小</span><br>axes.set_title(<span class="hljs-string">'词向量'</span>, fontsize=<span class="hljs-number">14</span>)<br><br><span class="hljs-comment"># 自动调整子图参数,使之填充整个图像区域</span><br>plt.tight_layout()<br><br><span class="hljs-comment"># 在屏幕上显示图表</span><br>plt.show()<br></code></pre></td></tr></table></figure><p>从图中可以看出,同一组词的确在图中的距离更近。</p><p><img src="https://picgo233.oss-cn-hangzhou.aliyuncs.com/img/202409161007773.png" alt="600"></p><p>既然可以把高维向量投影到二维,那么是不是也能投影到三维呢?当然可以,那样更酷。你可以在 <a href="https://projector.tensorflow.org/">TensorFlow Embedding Projector</a> 上尝试下,输入单词,搜索与它最近的几个词,看看它们在三维空间上的位置关系。</p><p>比如,输入 <code>apple</code>,最接近的5个词分别是 <code>OS</code>、<code>macintosh</code>、<code>amiga</code>、<code>ibm</code> 和 <code>microsoft</code>。</p><p><img src="https://picgo233.oss-cn-hangzhou.aliyuncs.com/img/202409161023693.png" alt="600"></p><h2 id="如果孙悟空穿越到红楼梦"><a href="#如果孙悟空穿越到红楼梦" class="headerlink" title="如果孙悟空穿越到红楼梦"></a>如果孙悟空穿越到红楼梦</h2><p>回到我们开篇的问题,把文本向量化后,就可以做运算了。如果孙悟空穿越到红楼梦,我们用下面的数学公式表示:<br><code>孙悟空 + 红楼梦 - 西游记</code></p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><code class="hljs python">result = word_vectors.most_similar(positive=[<span class="hljs-string">"孙悟空"</span>, <span class="hljs-string">"红楼梦"</span>], negative=[<span class="hljs-string">"西游记"</span>], topn=<span class="hljs-number">4</span>)<br><br><span class="hljs-built_in">print</span>(<span class="hljs-string">f"孙悟空 + 红楼梦 - 西游记 = <span class="hljs-subst">{result}</span>"</span>)<br></code></pre></td></tr></table></figure><p>答案为:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">孙悟空 + 红楼梦 - 西游记 = [('唐僧', 0.4163001477718353), ('贾宝玉', 0.41606390476226807), ('妙玉', 0.39432790875434875), ('沙和尚', 0.3922004997730255)]<br></code></pre></td></tr></table></figure><p>你是不是有点惊讶,因为答案中的“唐僧”和“沙和尚”根本就不是《红楼梦》中的人物。这是因为虽然词向量可以反映字词之间的语义相关性,但是它终究是在做数学题,不能像人类一样理解“孙悟空 + 红楼梦 - 西游记”背后的含义。答案中出现“唐僧”和“沙和尚”是因为它们和“孙悟空”更相关,而出现“贾宝玉”和“妙玉”则是因为它们和“红楼梦”更相关。</p><p>不过,这样的测试还蛮有趣的,你也可以多尝试一下,有的结果还蛮符合直觉的。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><code class="hljs python">result = word_vectors.most_similar(positive=[<span class="hljs-string">"牛奶"</span>, <span class="hljs-string">"发酵"</span>], topn=<span class="hljs-number">1</span>)<br><span class="hljs-built_in">print</span>(<span class="hljs-string">f"牛奶 + 发酵 = <span class="hljs-subst">{result[<span class="hljs-number">0</span>][<span class="hljs-number">0</span>]}</span>"</span>)<br><br>result = word_vectors.most_similar(positive=[<span class="hljs-string">"男人"</span>, <span class="hljs-string">"泰国"</span>], topn=<span class="hljs-number">1</span>)<br><span class="hljs-built_in">print</span>(<span class="hljs-string">f"男人 + 泰国 = <span class="hljs-subst">{result[<span class="hljs-number">0</span>][<span class="hljs-number">0</span>]}</span>"</span>)<br></code></pre></td></tr></table></figure><p>计算的结果如下:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">牛奶 + 发酵 = 变酸<br>男人 + 泰国 = 女人<br></code></pre></td></tr></table></figure><h2 id="待优化!"><a href="#待优化!" class="headerlink" title="待优化!"></a>待优化!</h2><p>尝试把上面的计算题降维,显示在图像上,看看是否满足两个向量相加,等于第三个向量</p><h2 id="一词多义怎么办"><a href="#一词多义怎么办" class="headerlink" title="一词多义怎么办"></a>一词多义怎么办</h2><p>前面说过,词向量模型就像一本字典,每个词对应一个向量,而且是唯一一个向量。但是,在语言中一词多义的现象是非常常见的,比如对于 “苹果” 这个词,既可以指一种水果,也可以指一家电子产品公司。词向量模型在训练 “苹果”这个词的向量时,这两种语义都会考虑到,所以它在向量空间中将位于“水果”和 “电子产品公司”之间。这就好像你3月20号过生日,你同事3月30号过生日,你的领导为了给你们两个人一起过庆祝生日,选择了3月25号——不是任何一个人的生日。</p><p>为了解决一词多义的问题,BERT(Bidirectional Encoder Representations from Transformers)模型诞生了。它是一种基于深度神经网络的预训练语言模型,使用 Transformer 架构,通过自注意力机制同时考虑一个 token 的前后上下文,并且根据上下文环境更新该 token 的向量。</p><p>比如,“苹果”这个目标词的初始向量是从词库中获取的,向量的值是固定的。当注意力模型处理“苹果“这个词时,如果发现上下文中有“手机”一词,会给它分配更多权重,“苹果”的向量会更新,靠近“手机”的方向。如果上下文中有“水果”一词,则会靠近“水果”的方向。</p><p>注意力模型分配权重是有策略的。它只会给上下文中与目标词关系紧密的词分配更多权重。所以,BERT 能够理解目标词与上下文之间的语义关系,根据上下文调整目标词的向量。</p><p>BERT 的预训练分成两种训练方式。第一种训练方式叫做“掩码语言模型(Masked Language Model,MLM)”,和 word2vec 相似,它会随机选择句子中的一些词遮住,根据上下文信息预测这个词,再根据预测结果与真实结果的差异调整参数。第二种训练方式叫做“下一句预测(Next Sentence Prediction,NSP)”,每次输入两个句子,判断第二个句子是否是第一个句子的下一句,然后同样根据结果差异调整参数。</p><p>说了这么多,BERT 模型的效果究竟怎么样?让我们动手试试吧。首先导入 BERT 模型,定义一个获取句子中指定单词的向量的函数。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 从transformers库中导入BertTokenizer类和BertModel类</span><br><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> BertTokenizer, BertModel<br><br><span class="hljs-comment"># 加载分词器 BertTokenizer</span><br>bert_tokenizer = BertTokenizer.from_pretrained(<span class="hljs-string">'bert-base-chinese'</span>)<br><br><span class="hljs-comment"># 加载嵌入模型 BertModel</span><br>bert_model = BertModel.from_pretrained(<span class="hljs-string">'bert-base-chinese'</span>)<br><br><span class="hljs-comment"># 使用BERT获取句子中指定单词的向量</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">get_bert_emb</span>(<span class="hljs-params">sentence, word</span>):<br> <span class="hljs-comment"># 使用 bert_tokenizer 对句子编码</span><br> <span class="hljs-built_in">input</span> = bert_tokenizer(sentence, return_tensors=<span class="hljs-string">'pt'</span>)<br> <span class="hljs-comment"># 将编码传递给 BERT 模型,计算所有层的输出</span><br> output = bert_model(**<span class="hljs-built_in">input</span>)<br> <span class="hljs-comment"># 获取 BERT 模型最后一层的隐藏状态,它包含了每个单词的嵌入信息</span><br> last_hidden_states = output.last_hidden_state<br> <span class="hljs-comment"># 将输入的句子拆分成单词,并生成一个列表</span><br> word_tokens = bert_tokenizer.tokenize(sentence)<br> <span class="hljs-comment"># 获取目标单词在列表中的索引位置</span><br> word_index = word_tokens.index(word)<br> <span class="hljs-comment"># 从最后一层隐藏状态中提取目标单词的嵌入表示</span><br> word_emb = last_hidden_states[<span class="hljs-number">0</span>, word_index + <span class="hljs-number">1</span>, :]<br> <span class="hljs-comment"># 返回目标单词的嵌入表示</span><br> <span class="hljs-keyword">return</span> word_emb<br></code></pre></td></tr></table></figure><p>然后通过 BERT 和词向量模型分别获取两个句子中指定单词的向量。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><code class="hljs python">sentence1 = <span class="hljs-string">"我今天很开心。"</span><br>sentence2 = <span class="hljs-string">"我打开了房门。"</span><br>word = <span class="hljs-string">"开"</span><br><br><span class="hljs-comment"># 使用 BERT 模型获取句子中指定单词的向量</span><br>bert_emb1 = get_bert_emb(sentence1, word).detach().numpy()<br><br>bert_emb2 = get_bert_emb(sentence2, word).detach().numpy()<br><br><span class="hljs-comment"># 使用词向量模型获取指定单词的向量</span><br>word_emb = word_vectors[word]<br></code></pre></td></tr></table></figure><p>最后,查看这三个向量的区别。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-built_in">print</span>(<span class="hljs-string">f"在句子 '<span class="hljs-subst">{sentence1}</span>' 中,'<span class="hljs-subst">{word}</span>'的向量的前四个维度:<span class="hljs-subst">{bert_emb1[: <span class="hljs-number">4</span>]}</span>"</span>)<br><br><span class="hljs-built_in">print</span>(<span class="hljs-string">f"在句子 '<span class="hljs-subst">{sentence2}</span>' 中,'<span class="hljs-subst">{word}</span>'的向量的前四个维度:<span class="hljs-subst">{bert_emb2[: <span class="hljs-number">4</span>]}</span>"</span>)<br><br><span class="hljs-built_in">print</span>(<span class="hljs-string">f"在词向量模型中, '<span class="hljs-subst">{word}</span>' 的向量的前四个维度:<span class="hljs-subst">{word_emb[: <span class="hljs-number">4</span>]}</span>"</span>)<br></code></pre></td></tr></table></figure><p>结果为:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">在句子 '我今天很开心。' 中,'开'的向量的前四个维度:[1.4325644 0.05137304 1.6045816 0.01002912]<br><br>在句子 '我打开了房门。' 中,'开'的向量的前四个维度:[ 0.9039772 -0.5877741 0.6639165 0.45880783]<br><br>在词向量模型中, '开' 的向量的前四个维度:[ 0.260962 0.040874 0.434256 -0.305888]<br></code></pre></td></tr></table></figure><p>BERT 模型果然能够根据上下文调整单词的向量。不妨再比较下余弦相似度:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 导入用于计算余弦相似度的函数</span><br><span class="hljs-keyword">from</span> sklearn.metrics.pairwise <span class="hljs-keyword">import</span> cosine_similarity<br><br><span class="hljs-comment"># 计算两个BERT嵌入向量的余弦相似度</span><br>bert_similarity = cosine_similarity([bert_emb1], [bert_emb2])[<span class="hljs-number">0</span>][<span class="hljs-number">0</span>]<br><span class="hljs-built_in">print</span>(<span class="hljs-string">f"在 '<span class="hljs-subst">{sentence1}</span>' 和 '<span class="hljs-subst">{sentence2}</span>' 这两个句子中,两个 '<span class="hljs-subst">{word}</span>' 的余弦相似度是: <span class="hljs-subst">{bert_similarity:<span class="hljs-number">.2</span>f}</span>"</span>)<br><br><span class="hljs-comment"># 计算词向量模型的两个向量之间的余弦相似度</span><br>word_similarity = cosine_similarity([word_emb], [word_emb])[<span class="hljs-number">0</span>][<span class="hljs-number">0</span>]<br><span class="hljs-built_in">print</span>(<span class="hljs-string">f"在词向量中, '<span class="hljs-subst">{word}</span>' 和 '<span class="hljs-subst">{word}</span>' 的余弦相似度是: <span class="hljs-subst">{word_similarity:<span class="hljs-number">.2</span>f}</span>"</span>)<br></code></pre></td></tr></table></figure><p>观察结果发现,不同句子中的“开”语义果然不同:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">在 '我今天很开心。' 和 '我打开了房门。' 这两个句子中,两个 '开' 的余弦相似度是: 0.69<br><br>在词向量中, '开' 和 '开' 的余弦相似度是: 1.00<br></code></pre></td></tr></table></figure><h2 id="怎么获得句子的向量"><a href="#怎么获得句子的向量" class="headerlink" title="怎么获得句子的向量"></a>怎么获得句子的向量</h2><p>我们虽然可以通过 BERT 模型获取单词的向量,但是怎么获得句子的向量呢?最简单的方法就是让 BERT 输出句子中每个单词的向量,然后计算向量的平均值。但是,这种不分重点一刀切的效果肯定是不好的,就好像我和千万富豪站在一起,计算我们的平均资产,然后得出结论,这两个人都是千万富翁,这显然不能反映真实情况。更关键的是,使用这种方法,并不能反映句子中词的顺序,而词序对句子语义的影响是非常大的。<br>``<br>所以,想要反映句子的语义,必须使用专门的句子嵌入模型。它能够直接生成句子级别的嵌入表示,更好地捕捉句子中的上下文信息,从而生成更准确的句子向量。</p><p>句子嵌入模型是怎么训练的?一种常见方法是使用句子对。每次输入两个句子,分别生成它们的嵌入向量,计算相似度,然后与句子对自带的相似度做比较,通过差异调整嵌入模型的参数。</p><p>BGE_M3 模型就是这样一个嵌入模型,而且支持中文。</p><p>真的这么好用?是骡子是马,拉出来遛遛,我们比较一下这两种生成句子嵌入的方法。</p><p>首先,定义一个使用 BERT 模型获取句子向量的函数。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 导入 PyTorch 库</span><br><span class="hljs-keyword">import</span> torch<br><br><span class="hljs-comment"># 使用 BERT 模型获取句子的向量</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">get_bert_sentence_emb</span>(<span class="hljs-params">sentence</span>):<br> <span class="hljs-comment"># 使用 bert_tokenizer 对句子进行编码,得到 PyTorch 张量格式的输入</span><br> <span class="hljs-built_in">input</span> = bert_tokenizer(sentence, return_tensors=<span class="hljs-string">'pt'</span>)<br> <span class="hljs-comment"># print(f"input: {input}")</span><br> <span class="hljs-comment"># 将编码后的输入传递给 BERT 模型,计算所有层的输出</span><br> output = bert_model(**<span class="hljs-built_in">input</span>)<br> <span class="hljs-comment"># print(f"output: {output}")</span><br> <span class="hljs-comment"># 获取 BERT 模型最后一层的隐藏状态,它包含了每个单词的嵌入信息</span><br> last_hidden_states = output.last_hidden_state<br> <span class="hljs-comment"># 将所有词的向量求平均值,得到句子的表示</span><br> sentence_emb = torch.mean(last_hidden_states, dim=<span class="hljs-number">1</span>).flatten().tolist()<br> <span class="hljs-comment"># 返回句子的嵌入表示</span><br> <span class="hljs-keyword">return</span> sentence_emb<br></code></pre></td></tr></table></figure><p>然后,定义一个用 bge_m3模型获取句子向量的函数。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 导入 bge_m3 模型</span><br><span class="hljs-keyword">from</span> pymilvus.model.hybrid <span class="hljs-keyword">import</span> BGEM3EmbeddingFunction<br><br><span class="hljs-comment"># 使用 bge_m3 模型获取句子的向量</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">get_bgem3_sentence_emb</span>(<span class="hljs-params">sentence, model_name=<span class="hljs-string">'BAAI/bge-m3'</span></span>):<br> bge_m3_ef = BGEM3EmbeddingFunction(<br> model_name=model_name,<br> device=<span class="hljs-string">'cpu'</span>,<br> use_fp16=<span class="hljs-literal">False</span><br> )<br> vectors = bge_m3_ef.encode_documents([sentence])<br> <span class="hljs-keyword">return</span> vectors[<span class="hljs-string">'dense'</span>][<span class="hljs-number">0</span>].tolist()<br></code></pre></td></tr></table></figure><p>接下来,先计算下 BERT 模型生成的句子向量之间的余弦相似度。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><code class="hljs python">sentence1 = <span class="hljs-string">"我喜欢这部电影!"</span><br>sentence2 = <span class="hljs-string">"这部电影太棒了!"</span><br>sentence3 = <span class="hljs-string">"我讨厌这部电影。"</span><br><br><span class="hljs-comment"># 使用 BERT 模型获取句子的向量</span><br>bert_sentence_emb1 = get_bert_sentence_emb(sentence1)<br>bert_sentence_emb2 = get_bert_sentence_emb(sentence2)<br>bert_sentence_emb3 = get_bert_sentence_emb(sentence3)<br><br><span class="hljs-built_in">print</span>(<span class="hljs-string">f"'<span class="hljs-subst">{sentence1}</span>' 和 '<span class="hljs-subst">{sentence2}</span>' 的余弦相似度: <span class="hljs-subst">{cosine_similarity([bert_sentence_emb1], [bert_sentence_emb2])[<span class="hljs-number">0</span>][<span class="hljs-number">0</span>]:<span class="hljs-number">.2</span>f}</span>"</span>)<br><span class="hljs-built_in">print</span>(<span class="hljs-string">f"'<span class="hljs-subst">{sentence1}</span>' 和 '<span class="hljs-subst">{sentence3}</span>' 的余弦相似度: <span class="hljs-subst">{cosine_similarity([bert_sentence_emb1], [bert_sentence_emb3])[<span class="hljs-number">0</span>][<span class="hljs-number">0</span>]:<span class="hljs-number">.2</span>f}</span>"</span>)<br><span class="hljs-built_in">print</span>(<span class="hljs-string">f"'<span class="hljs-subst">{sentence2}</span>' 和 '<span class="hljs-subst">{sentence3}</span>' 的余弦相似度: <span class="hljs-subst">{cosine_similarity([bert_sentence_emb2], [bert_sentence_emb3])[<span class="hljs-number">0</span>][<span class="hljs-number">0</span>]:<span class="hljs-number">.2</span>f}</span>"</span>)<br></code></pre></td></tr></table></figure><p>结果是:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">'我喜欢这部电影!' 和 '这部电影太棒了!' 的余弦相似度: 0.93<br>'我喜欢这部电影!' 和 '我讨厌这部电影。' 的余弦相似度: 0.94<br>'这部电影太棒了!' 和 '我讨厌这部电影。' 的余弦相似度: 0.89<br></code></pre></td></tr></table></figure><p>很明显,前两个句子语义相近,并且与第三个句子语义相反。但是使用 BERT 模型的结果却是三个句子语义相近。</p><p>最后看看 bge_m3模型的效果如何:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 使用 bge_m3 模型获取句子的向量</span><br>bgem3_sentence_emb1 = get_bgem3_sentence_emb(sentence1)<br>bgem3_sentence_emb2 = get_bgem3_sentence_emb(sentence2)<br>bgem3_sentence_emb3 = get_bgem3_sentence_emb(sentence3)<br><br><span class="hljs-built_in">print</span>(<span class="hljs-string">f"'<span class="hljs-subst">{sentence1}</span>' 和 '<span class="hljs-subst">{sentence2}</span>' 的余弦相似度: <span class="hljs-subst">{cosine_similarity([bgem3_sentence_emb1], [bgem3_sentence_emb2])[<span class="hljs-number">0</span>][<span class="hljs-number">0</span>]:<span class="hljs-number">.2</span>f}</span>"</span>)<br><span class="hljs-built_in">print</span>(<span class="hljs-string">f"'<span class="hljs-subst">{sentence1}</span>' 和 '<span class="hljs-subst">{sentence3}</span>' 的余弦相似度: <span class="hljs-subst">{cosine_similarity([bgem3_sentence_emb1], [bgem3_sentence_emb3])[<span class="hljs-number">0</span>][<span class="hljs-number">0</span>]:<span class="hljs-number">.2</span>f}</span>"</span>)<br><span class="hljs-built_in">print</span>(<span class="hljs-string">f"'<span class="hljs-subst">{sentence2}</span>' 和 '<span class="hljs-subst">{sentence3}</span>' 的余弦相似度: <span class="hljs-subst">{cosine_similarity([bgem3_sentence_emb2], [bgem3_sentence_emb3])[<span class="hljs-number">0</span>][<span class="hljs-number">0</span>]:<span class="hljs-number">.2</span>f}</span>"</span>)<br></code></pre></td></tr></table></figure><p>结果是:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">'我喜欢这部电影!' 和 '这部电影太棒了!' 的余弦相似度: 0.86<br>'我喜欢这部电影!' 和 '我讨厌这部电影。' 的余弦相似度: 0.65<br>'这部电影太棒了!' 和 '我讨厌这部电影。' 的余弦相似度: 0.57<br></code></pre></td></tr></table></figure><p>从余弦相似度可以看出,前两个句子语义相近,和第三个句子语义较远。看来 bge_m3 模型确实可以捕捉句子中的上下文信息。</p><h2 id="藏宝图"><a href="#藏宝图" class="headerlink" title="藏宝图"></a>藏宝图</h2><p>本文主要通过执行代码直观展示向量嵌入的原理和模型,如果你想进一步了解技术细节,这里有一些资料供你参考。</p><h3 id="词向量模型"><a href="#词向量模型" class="headerlink" title="词向量模型"></a>词向量模型</h3><p>word2vect 模型论文:</p><ul><li><a href="https://arxiv.org/abs/1301.3781">Efficient Estimation of Word Representations in Vector Space</a></li><li><a href="https://arxiv.org/abs/1310.4546">Distributed Representations of Words and Phrases and their Compositionality</a></li></ul><h3 id="中文词向量模型"><a href="#中文词向量模型" class="headerlink" title="中文词向量模型"></a>中文词向量模型</h3><ul><li><a href="https://github.com/Embedding/Chinese-Word-Vectors">Chinese-Word-Vectors</a> 项目提供了上百种预训练的中文词向量,这些词向量是基于不同的表征、上下文特征和语料库训练的,可以用于各种中文自然语言处理任务。</li><li><a href="https://ai.tencent.com/ailab/nlp/en/embedding.html">腾讯 AI Lab 中英文词和短语的嵌入语料库</a></li><li><a href="https://github.com/lzhenboy/word2vec-Chinese">word2vec-Chinese</a> 介绍了如何训练中文 Word2Vec 词向量模型。</li></ul><h3 id="BERT-模型"><a href="#BERT-模型" class="headerlink" title="BERT 模型"></a>BERT 模型</h3><p>BERT 模型论文:</p><ul><li><a href="https://arxiv.org/abs/1810.04805">BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</a></li><li><a href="https://arxiv.org/abs/2004.12832">ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT</a></li></ul><p>BERT 模型的 GitHub:<a href="https://github.com/google-research/bert">bert</a></p><p>介绍 ColBERT 模型的博客:<a href="https://zilliz.com/learn/explore-colbert-token-level-embedding-and-ranking-model-for-similarity-search">Exploring ColBERT: A Token-Level Embedding and Ranking Model for Efficient Similarity Search</a></p><h3 id="bge-m3-模型"><a href="#bge-m3-模型" class="headerlink" title="bge_m3 模型"></a>bge_m3 模型</h3><p>介绍 bge_m3模型的博客:<a href="https://zilliz.com/learn/bge-m3-and-splade-two-machine-learning-models-for-generating-sparse-embeddings#BERT-The-Foundation-Model-for-BGE-M3-and-Splade">Exploring BGE-M3 and Splade: Two Machine Learning Models for Generating Sparse Embeddings</a></p><h3 id="注意力模型"><a href="#注意力模型" class="headerlink" title="注意力模型"></a>注意力模型</h3><p>注意力模型论文:<a href="https://arxiv.org/abs/1706.03762">Attention Is All You Need</a></p><h3 id="模型库"><a href="#模型库" class="headerlink" title="模型库"></a>模型库</h3><ul><li><a href="https://radimrehurek.com/gensim/">gensim</a> 包含了 word2vec 模型和 GloVe(Global Vectors for Word Representation)模型。</li><li><a href="https://huggingface.co/transformers/">Transformers</a> 是 Hugging Face 开发的一个开源库,专门用于自然语言处理(NLP)任务,它提供了大量预训练的 Transformer 模型,如 BERT、GPT、T5 等,并且支持多种语言和任务。</li><li><a href="https://github.com/ymcui/Chinese-BERT-wwm">Chinese-BERT-wwm</a> 是哈工大讯飞联合实验室(HFL)发布的中文 BERT 模型。</li><li><a href="https://milvus.io/docs/embeddings.md">pymilvus.model</a> 是 PyMilvus 客户端库的一个子包,提供多种嵌入模型的封装,用于生成向量嵌入,简化了文本转换过程。</li></ul><h2 id="注释"><a href="#注释" class="headerlink" title="注释"></a>注释</h2><section class="footnotes"><div class="footnote-list"><ol><li><span id="fn:1" class="footnote-text"><span>严格来说,“目标词”不是单词而是“token”。token 是组成句子的基本单元。对于英文来说,token可以简单理解为单词,还可能是子词(subword)或者标点符号,比如“unhappiness” 可能会被分割成“un”和“happiness“。对于汉字来说,则是字、词或者短语,汉字不会像英文单词那样被分割。<a href="#fnref:1" rev="footnote" class="footnote-backref"> ↩</a></span></span></li></ol></div></section>]]></content>
<categories>
<category>向量数据库</category>
<category>原理探秘</category>
</categories>
</entry>
<entry>
<title>如何假装文艺青年,怎么把大白话“变成”古诗词?</title>
<link href="/2024/09/16/%E5%A6%82%E4%BD%95%E5%81%87%E8%A3%85%E6%96%87%E8%89%BA%E9%9D%92%E5%B9%B4%EF%BC%8C%E6%80%8E%E4%B9%88%E6%8A%8A%E5%A4%A7%E7%99%BD%E8%AF%9D%E2%80%9C%E5%8F%98%E6%88%90%E2%80%9D%E5%8F%A4%E8%AF%97%E8%AF%8D%EF%BC%9F/"/>
<url>/2024/09/16/%E5%A6%82%E4%BD%95%E5%81%87%E8%A3%85%E6%96%87%E8%89%BA%E9%9D%92%E5%B9%B4%EF%BC%8C%E6%80%8E%E4%B9%88%E6%8A%8A%E5%A4%A7%E7%99%BD%E8%AF%9D%E2%80%9C%E5%8F%98%E6%88%90%E2%80%9D%E5%8F%A4%E8%AF%97%E8%AF%8D%EF%BC%9F/</url>
<content type="html"><![CDATA[<p>午后细雨绵绵,你独倚窗边,思绪万千,于是拿出手机,想发一条朋友圈抒发情怀,随便展示一下文采。奈何好不容易按出几个字,又全部删除。“今天的雨好大”展示不出你的文采。你灵机一动,如果有一个搜索引擎,能搜索出和“今天的雨好大”意思相近的古诗词,岂不妙哉!</p><p>使用向量数据库就可以实现,代码还不到100行,一起来试试吧。我们会从零开始安装向量数据库 Milvus,向量化古诗词数据集,然后创建集合,导入数据,创建索引,最后实现语义搜索功能。</p><blockquote><p>本文首发于 Zilliz 公众号。文中代码的 Notebook 在<a href="https://pan.baidu.com/s/1Su0U65G6ZXXuzUy8VO3_cA?pwd=esca">这里</a>下载。</p></blockquote><h2 id="0-准备工作"><a href="#0-准备工作" class="headerlink" title="0 准备工作"></a>0 准备工作</h2><p>首先安装向量数据库 Milvus。因为 Milvus 是运行在 docker 中的,所以需要先安装 Docker Desktop。MacOS 系统安装方法:<a href="https://docs.docker.com/desktop/install/mac-install/">Install Docker Desktop on Mac</a> ,Windows 系统安装方法:<a href="https://docs.docker.com/desktop/install/windows-install/">Install Docker Desktop on Windows</a></p><p>然后安装 Milvus。Milvus 版本:>=2.5.0<br>下载安装脚本:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs shell">curl -sfL https://raw.githubusercontent.com/milvus-io/milvus/master/scripts/standalone_embed.sh -o standalone_embed.sh<br></code></pre></td></tr></table></figure><p>运行 Milvus:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs shell">bash standalone_embed.sh start<br></code></pre></td></tr></table></figure><p>安装依赖。pymilvus >= 2.5.0</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs shell">pip install pymilvus==2.5.0 "pymilvus[model]" torch <br></code></pre></td></tr></table></figure><p>下载古诗词数据集<sup id="fnref:1" class="footnote-ref"><a href="#fn:1" rel="footnote"><span class="hint--top hint--rounded" aria-label="古诗词数据集来自 [chinese-poetry](https://github.com/BushJiang/chinese-poetry),数据结构做了调整。">[1]</span></a></sup> TangShi.json。它的格式是这样的:</p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><code class="hljs json"><span class="hljs-punctuation">[</span><br><span class="hljs-punctuation">{</span><br><span class="hljs-attr">"author"</span><span class="hljs-punctuation">:</span> <span class="hljs-string">"太宗皇帝"</span><span class="hljs-punctuation">,</span><br><span class="hljs-attr">"paragraphs"</span><span class="hljs-punctuation">:</span> <span class="hljs-punctuation">[</span><br><span class="hljs-string">"秦川雄帝宅,函谷壮皇居。"</span><br><span class="hljs-punctuation">]</span><span class="hljs-punctuation">,</span><br><span class="hljs-attr">"title"</span><span class="hljs-punctuation">:</span> <span class="hljs-string">"帝京篇十首 一"</span><span class="hljs-punctuation">,</span><br><span class="hljs-attr">"id"</span><span class="hljs-punctuation">:</span> <span class="hljs-number">20000001</span><span class="hljs-punctuation">,</span><br><span class="hljs-attr">"type"</span><span class="hljs-punctuation">:</span> <span class="hljs-string">"唐诗"</span><br><span class="hljs-punctuation">}</span><span class="hljs-punctuation">,</span><br>...<br><span class="hljs-punctuation">]</span><br></code></pre></td></tr></table></figure><p>准备就绪,正式开始啦。</p><h2 id="1-向量化文本"><a href="#1-向量化文本" class="headerlink" title="1 向量化文本"></a>1 向量化文本</h2><p>为了实现语义搜索,我们需要先把文本向量化。你可以理解为把不同类型的信息(如文字、图像、声音)翻译成计算机可以理解的数字语言。计算机理解了,才能帮你找到语义相近的诗句。</p><p>先定义两个函数,一个用来初始化嵌入模型(也就是用来向量化的模型)的实例,另一个是调用嵌入模型的实例,把输入的文档向量化。</p><p>初始化嵌入模型的实例:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-keyword">from</span> tqdm <span class="hljs-keyword">import</span> tqdm<br><span class="hljs-keyword">import</span> torch<br><span class="hljs-keyword">from</span> pymilvus.model.hybrid <span class="hljs-keyword">import</span> BGEM3EmbeddingFunction<br><br><span class="hljs-comment"># 初始化嵌入模型的实例</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">init_embedding_model</span>():<br> <span class="hljs-comment"># 检查是否有可用的CUDA设备</span><br> device = <span class="hljs-string">"cuda:0"</span> <span class="hljs-keyword">if</span> torch.cuda.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">"cpu"</span><br> <span class="hljs-comment"># 根据设备选择是否使用fp16</span><br> use_fp16 = device.startswith(<span class="hljs-string">"cuda"</span>)<br> <span class="hljs-comment"># 创建嵌入模型实例</span><br> bge_m3_ef = BGEM3EmbeddingFunction(<br> model_name=<span class="hljs-string">"BAAI/bge-m3"</span>,<br> device=device,<br> use_fp16=use_fp16<br> )<br> <span class="hljs-keyword">return</span> bge_m3_ef<br><br>bge_m3_ef = init_embedding_model()<br></code></pre></td></tr></table></figure><p>向量化文档:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 把文档向量化</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">vectorize_docs</span>(<span class="hljs-params">docs, encoder</span>):<br> <span class="hljs-comment"># 验证参数是否符合要求</span><br> <span class="hljs-keyword">if</span> encoder <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span>:<br> <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">"嵌入模型未初始化。"</span>)<br> <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> (<span class="hljs-built_in">isinstance</span>(docs, <span class="hljs-built_in">list</span>) <span class="hljs-keyword">and</span> <span class="hljs-built_in">all</span>(<span class="hljs-built_in">isinstance</span>(text, <span class="hljs-built_in">str</span>) <span class="hljs-keyword">for</span> text <span class="hljs-keyword">in</span> docs)):<br> <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">"docs必须为字符串列表。"</span>)<br> <span class="hljs-keyword">return</span> encoder.encode_documents(docs)<br></code></pre></td></tr></table></figure><p>准备好后,我们就可以向量化整个数据集了。首先读取数据集 TangShi.json 中的数据,把其中的 paragraphs 字段向量化,然后写入 TangShi_vector.json 文件。如果你是第一次使用 Milvus,运行下面的代码时还会安装必要的依赖。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-keyword">import</span> json<br><span class="hljs-comment"># 读取 json 文件,把paragraphs字段向量化</span><br><span class="hljs-keyword">with</span> <span class="hljs-built_in">open</span>(<span class="hljs-string">"TangShi.json"</span>, <span class="hljs-string">'r'</span>, encoding=<span class="hljs-string">'utf-8'</span>) <span class="hljs-keyword">as</span> file:<br>data_list = json.load(file)<br>docs = [data[<span class="hljs-string">'paragraphs'</span>][<span class="hljs-number">0</span>] <span class="hljs-keyword">for</span> data <span class="hljs-keyword">in</span> data_list]<br><br><span class="hljs-comment"># 向量化文本数据</span><br>vectors = vectorize_docs(docs, bge_m3_ef)<br><br><span class="hljs-comment"># 将向量添加到原始文本中</span><br><span class="hljs-keyword">for</span> data, vector <span class="hljs-keyword">in</span> <span class="hljs-built_in">zip</span>(data_list, vectors[<span class="hljs-string">'dense'</span>]):<br> <span class="hljs-comment"># 将 NumPy 数组转换为 Python 的普通列表</span><br>data[<span class="hljs-string">'vector'</span>] = vector.tolist()<br><br><span class="hljs-comment"># 将更新后的文本内容写入新的json文件</span><br><span class="hljs-keyword">with</span> <span class="hljs-built_in">open</span>(<span class="hljs-string">"TangShi_vector.json"</span>, <span class="hljs-string">'w'</span>, encoding=<span class="hljs-string">'utf-8'</span>) <span class="hljs-keyword">as</span> outfile:<br>json.dump(data_list, outfile, ensure_ascii=<span class="hljs-literal">False</span>, indent=<span class="hljs-number">4</span>)<br></code></pre></td></tr></table></figure><p>如果一切顺利,你会得到 TangShi_vector.json 文件,它增加了 vector 字段,它的值是一个字符串列表,也就是“向量”。</p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br></pre></td><td class="code"><pre><code class="hljs json"><span class="hljs-punctuation">[</span><br><span class="hljs-punctuation">{</span><br><span class="hljs-attr">"author"</span><span class="hljs-punctuation">:</span> <span class="hljs-string">"太宗皇帝"</span><span class="hljs-punctuation">,</span><br><span class="hljs-attr">"paragraphs"</span><span class="hljs-punctuation">:</span> <span class="hljs-punctuation">[</span><br><span class="hljs-string">"秦川雄帝宅,函谷壮皇居。"</span><br><span class="hljs-punctuation">]</span><span class="hljs-punctuation">,</span><br><span class="hljs-attr">"title"</span><span class="hljs-punctuation">:</span> <span class="hljs-string">"帝京篇十首 一"</span><span class="hljs-punctuation">,</span><br><span class="hljs-attr">"id"</span><span class="hljs-punctuation">:</span> <span class="hljs-number">20000001</span><span class="hljs-punctuation">,</span><br><span class="hljs-attr">"type"</span><span class="hljs-punctuation">:</span> <span class="hljs-string">"唐诗"</span><span class="hljs-punctuation">,</span><br><span class="hljs-attr">"vector"</span><span class="hljs-punctuation">:</span> <span class="hljs-punctuation">[</span><br><span class="hljs-number">0.005114779807627201</span><span class="hljs-punctuation">,</span><br><span class="hljs-number">0.033538609743118286</span><span class="hljs-punctuation">,</span><br><span class="hljs-number">0.020395483821630478</span><span class="hljs-punctuation">,</span><br>...<br><span class="hljs-punctuation">]</span><br><span class="hljs-punctuation">}</span><span class="hljs-punctuation">,</span><br><span class="hljs-punctuation">{</span><br><span class="hljs-attr">"author"</span><span class="hljs-punctuation">:</span> <span class="hljs-string">"太宗皇帝"</span><span class="hljs-punctuation">,</span><br><span class="hljs-attr">"paragraphs"</span><span class="hljs-punctuation">:</span> <span class="hljs-punctuation">[</span><br><span class="hljs-string">"绮殿千寻起,离宫百雉余。"</span><br><span class="hljs-punctuation">]</span><span class="hljs-punctuation">,</span><br><span class="hljs-attr">"title"</span><span class="hljs-punctuation">:</span> <span class="hljs-string">"帝京篇十首 一"</span><span class="hljs-punctuation">,</span><br><span class="hljs-attr">"id"</span><span class="hljs-punctuation">:</span> <span class="hljs-number">20000002</span><span class="hljs-punctuation">,</span><br><span class="hljs-attr">"type"</span><span class="hljs-punctuation">:</span> <span class="hljs-string">"唐诗"</span><span class="hljs-punctuation">,</span><br><span class="hljs-attr">"vector"</span><span class="hljs-punctuation">:</span> <span class="hljs-punctuation">[</span><br><span class="hljs-number">-0.06334448605775833</span><span class="hljs-punctuation">,</span><br><span class="hljs-number">0.0017451602034270763</span><span class="hljs-punctuation">,</span><br><span class="hljs-number">-0.0010646708542481065</span><span class="hljs-punctuation">,</span><br>...<br><span class="hljs-punctuation">]</span><br><span class="hljs-punctuation">}</span><span class="hljs-punctuation">,</span><br>...<br><span class="hljs-punctuation">]</span><br></code></pre></td></tr></table></figure><h2 id="2-创建集合"><a href="#2-创建集合" class="headerlink" title="2 创建集合"></a>2 创建集合</h2><p>接下来我们要把向量数据导入向量数据库。当然,我们得先在向量数据库中创建一个集合,用来容纳向量数据。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 创建milvus_client实例</span><br><span class="hljs-keyword">from</span> pymilvus <span class="hljs-keyword">import</span> MilvusClient, DataType<br><br>milvus_client = MilvusClient(uri=<span class="hljs-string">"http://localhost:19530"</span>)<br><span class="hljs-comment"># 指定集合名称</span><br>collection_name = <span class="hljs-string">"TangShi"</span><br></code></pre></td></tr></table></figure><p>为了避免向量数据库中存在同名集合,产生干扰,创建集合前先删除同名集合。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 检查同名集合是否存在,如果存在则删除</span><br><span class="hljs-keyword">if</span> milvus_client.has_collection(collection_name):<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"集合 <span class="hljs-subst">{collection_name}</span> 已经存在"</span>)<br> <span class="hljs-keyword">try</span>:<br> <span class="hljs-comment"># 删除同名集合</span><br> milvus_client.drop_collection(collection_name)<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"删除集合:<span class="hljs-subst">{collection_name}</span>"</span>)<br> <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"删除集合时出现错误: <span class="hljs-subst">{e}</span>"</span>)<br></code></pre></td></tr></table></figure><p>我们把数据填入 excel 表格前,需要先设计好表头,规定有哪些字段,各个字段的数据类型是怎样的。向量数据库也是一样,它的“表头”就是 schema,模式。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-keyword">from</span> pymilvus <span class="hljs-keyword">import</span> DataType<br><br><span class="hljs-comment"># 创建集合模式</span><br>schema = MilvusClient.create_schema(<br> auto_id=<span class="hljs-literal">False</span>,<br> enable_dynamic_field=<span class="hljs-literal">True</span>,<br> description=<span class="hljs-string">"TangShi"</span><br>)<br><br><span class="hljs-comment"># 添加字段到schema</span><br>schema.add_field(field_name=<span class="hljs-string">"id"</span>, datatype=DataType.INT64, is_primary=<span class="hljs-literal">True</span>)<br>schema.add_field(field_name=<span class="hljs-string">"vector"</span>, datatype=DataType.FLOAT_VECTOR, dim=<span class="hljs-number">1024</span>)<br>schema.add_field(field_name=<span class="hljs-string">"title"</span>, datatype=DataType.VARCHAR, max_length=<span class="hljs-number">1024</span>)<br>schema.add_field(field_name=<span class="hljs-string">"author"</span>, datatype=DataType.VARCHAR, max_length=<span class="hljs-number">256</span>)<br>schema.add_field(field_name=<span class="hljs-string">"paragraphs"</span>, datatype=DataType.VARCHAR, max_length=<span class="hljs-number">10240</span>)<br>schema.add_field(field_name=<span class="hljs-string">"type"</span>, datatype=DataType.VARCHAR, max_length=<span class="hljs-number">128</span>)<br></code></pre></td></tr></table></figure><p>模式创建好了,接下来就可以创建集合了。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 创建集合</span><br><span class="hljs-keyword">try</span>:<br> milvus_client.create_collection(<br> collection_name=collection_name,<br> schema=schema,<br> shards_num=<span class="hljs-number">2</span><br> )<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"开始创建集合:<span class="hljs-subst">{collection_name}</span>"</span>)<br><span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"创建集合的过程中出现了错误: <span class="hljs-subst">{e}</span>"</span>)<br></code></pre></td></tr></table></figure><h2 id="3-入库"><a href="#3-入库" class="headerlink" title="3 入库"></a>3 入库</h2><p>接下来把文件导入到 Milvus。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 读取和处理文件</span><br><span class="hljs-keyword">with</span> <span class="hljs-built_in">open</span>(<span class="hljs-string">"TangShi_vector.json"</span>, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> file:<br> data = json.load(file)<br> <span class="hljs-comment"># paragraphs的值是列表,需要获取列表中的字符串,以符合Milvus插入数据的要求</span><br> <span class="hljs-keyword">for</span> item <span class="hljs-keyword">in</span> data:<br> item[<span class="hljs-string">"paragraphs"</span>] = item[<span class="hljs-string">"paragraphs"</span>][<span class="hljs-number">0</span>]<br><br><span class="hljs-comment"># 将数据插入集合</span><br><span class="hljs-built_in">print</span>(<span class="hljs-string">f"正在将数据插入集合:<span class="hljs-subst">{collection_name}</span>"</span>)<br>res = milvus_client.insert(<br> collection_name=collection_name,<br> data=data<br>)<br></code></pre></td></tr></table></figure><p>导入成功了吗?我们来验证下。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-built_in">print</span>(<span class="hljs-string">f"插入的实体数量: <span class="hljs-subst">{res[<span class="hljs-string">'insert_count'</span>]}</span>"</span>)<br></code></pre></td></tr></table></figure><p>返回插入实体的数量,看来是成功了。</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs shell">插入的实体数量: 4307<br></code></pre></td></tr></table></figure><h2 id="4-创建索引"><a href="#4-创建索引" class="headerlink" title="4 创建索引"></a>4 创建索引</h2><p>向量已经导入 Milvus,现在可以搜索了吗?别急,为了提高搜索效率,我们还需要创建索引。什么是索引?一些大部头图书的最后,一般都会有索引,它列出了书中出现的关键术语以及对应的页码,帮助你快速找到它们的位置。如果没有索引,那就只能用笨方法,从第一页开始一页一页往后找了。</p><p><img src="https://picgo233.oss-cn-hangzhou.aliyuncs.com/img/202408232208878.jpeg" alt="图片来源:自己拍的《英国皇家园艺学会植物繁育手册:用已有植物打造完美新植物》"><br>图片来源:自己拍的《英国皇家园艺学会植物繁育手册:用已有植物打造完美新植物》</p><p>Milvus 的索引也是如此。如果不创建索引,虽然也可以搜索,但是速度很慢,它会逐一比较查询向量与数据库中每一个向量,通过指定方法计算出两个向量之间的 <strong>距离</strong>,找出距离最近的几个向量。而创建索引之后,搜索速度会大大提升。</p><p>索引有不同的类型,适合不同的场景使用,我们以后会详细讨论这个问题。这里我们使用 IVF_FLAT。另外,计算<strong>距离</strong>的方法也有多种,我们使用 IP,也就是计算两个向量的内积。这些都是索引的参数,我们先创建这些参数。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 创建IndexParams对象,用于存储索引的各种参数</span><br>index_params = client.prepare_index_params()<br><span class="hljs-comment"># 设置索引名称</span><br>vector_index_name = <span class="hljs-string">"vector_index"</span><br><span class="hljs-comment"># 设置索引的各种参数</span><br>index_params.add_index(<br><span class="hljs-comment"># 指定为"vector"字段创建索引</span><br>field_name=<span class="hljs-string">"vector"</span>,<br><span class="hljs-comment"># 设置索引类型</span><br>index_type=<span class="hljs-string">"IVF_FLAT"</span>,<br><span class="hljs-comment"># 设置度量类型</span><br>metric_type=<span class="hljs-string">"IP"</span>,<br><span class="hljs-comment"># 设置索引聚类中心的数量</span><br>params={<span class="hljs-string">"nlist"</span>: <span class="hljs-number">128</span>},<br><span class="hljs-comment"># 指定索引名称</span><br>index_name=vector_index_name<br>)<br></code></pre></td></tr></table></figure><p>索引参数创建好了,现在终于可以创建索引了。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-built_in">print</span>(<span class="hljs-string">f"开始创建索引:<span class="hljs-subst">{vector_index_name}</span>"</span>)<br><br><span class="hljs-comment"># 创建索引</span><br>client.create_index(<br><span class="hljs-comment"># 指定为哪个集合创建索引</span><br>collection_name=collection_name,<br><span class="hljs-comment"># 使用前面创建的索引参数创建索引</span><br>index_params=index_params<br>)<br></code></pre></td></tr></table></figure><p>我们来验证下索引是否创建成功了。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><code class="hljs python">indexes = client.list_indexes(<br>collection_name=collection_name<br>)<br><span class="hljs-built_in">print</span>(<span class="hljs-string">f"列出创建的索引:<span class="hljs-subst">{indexes}</span>"</span>)<br></code></pre></td></tr></table></figure><p>返回了包含索引名称的列表,索引名称 vector_index 正是我们之前创建的。</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs shell">列出创建的索引:['vector_index']<br></code></pre></td></tr></table></figure><p>再来查看下索引的详情。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 查看索引详情</span><br>index_details = client.describe_index(<br>collection_name=collection_name,<br><span class="hljs-comment"># 指定索引名称,这里假设使用第一个索引</span><br>index_name=<span class="hljs-string">"vector_index"</span><br>)<br><span class="hljs-built_in">print</span>(<span class="hljs-string">f"索引vector_index详情:<span class="hljs-subst">{index_details}</span>"</span>)<br></code></pre></td></tr></table></figure><p>返回了一个包含索引详细信息的字典,可以我们之前设置的索引参数,比如 nlist,index_type 和 metric_type 等等。</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">索引vector_index详情:{'nlist': '128', 'index_type': 'IVF_FLAT', 'metric_type': 'IP', 'field_name': 'vector', 'index_name': 'vector_index', 'total_rows': 0, 'indexed_rows': 0, 'pending_index_rows': 0, 'state': 'Finished'}<br></code></pre></td></tr></table></figure><h2 id="5-加载索引"><a href="#5-加载索引" class="headerlink" title="5 加载索引"></a>5 加载索引</h2><p>索引创建成功了,现在可以搜索了吗?等等,我们还需要把集合中的数据和索引,从硬盘加载到内存中。因为在内存中搜索更快。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-built_in">print</span>(<span class="hljs-string">f"正在加载集合:<span class="hljs-subst">{collection_name}</span>"</span>)<br>client.load_collection(collection_name=collection_name)<br></code></pre></td></tr></table></figure><p>加载完成了,仍然验证下。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-built_in">print</span>(client.get_load_state(collection_name=collection_name))<br></code></pre></td></tr></table></figure><p>返回加载状态 Loaded,没问题,加载完成。</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs shell">{'state': <LoadState: Loaded>}<br></code></pre></td></tr></table></figure><h2 id="6-搜索"><a href="#6-搜索" class="headerlink" title="6 搜索"></a>6 搜索</h2><p>经过前面的一系列准备,现在我们终于可以回到开头的问题了,用现代白话文搜索语义相似的古诗词。</p><p>首先,把我们要搜索的现代白话文“翻译”成向量。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 获取查询向量</span><br>query = <span class="hljs-string">"今天的雨好大"</span><br>query_vectors = [vectorize_docs([query], bge_m3_ef)[<span class="hljs-string">'dense'</span>][<span class="hljs-number">0</span>].tolist()]<br></code></pre></td></tr></table></figure><p>然后,设置搜索参数,告诉 Milvus 怎么搜索。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 设置搜索参数</span><br>search_params = {<br><span class="hljs-comment"># 设置度量类型</span><br><span class="hljs-string">"metric_type"</span>: <span class="hljs-string">"IP"</span>,<br><span class="hljs-comment"># 指定在搜索过程中要查询的聚类单元数量,增加nprobe值可以提高搜索精度,但会降低搜索速度</span><br><span class="hljs-string">"params"</span>: {<span class="hljs-string">"nprobe"</span>: <span class="hljs-number">16</span>}<br>}<br></code></pre></td></tr></table></figure><p>最后,我们还得告诉它怎么输出结果。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 指定返回搜索结果的数量</span><br>limit = <span class="hljs-number">3</span><br><span class="hljs-comment"># 指定返回的字段</span><br>output_fields = [<span class="hljs-string">"author"</span>, <span class="hljs-string">"title"</span>, <span class="hljs-string">"paragraphs"</span>]<br></code></pre></td></tr></table></figure><p>一切就绪,让我们开始搜索吧!</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><code class="hljs python">res1 = milvus_client.search(<br> collection_name=collection_name,<br> <span class="hljs-comment"># 指定查询向量</span><br> data=query_vectors,<br> <span class="hljs-comment"># 指定搜索的字段</span><br> anns_field=<span class="hljs-string">"vector"</span>,<br> <span class="hljs-comment"># 设置搜索参数</span><br> search_params=search_params,<br> limit=limit,<br> output_fields=output_fields<br>)<br><span class="hljs-built_in">print</span>(res1)<br></code></pre></td></tr></table></figure><p>得到下面的结果:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">data: [<br> "[<br> {<br> 'id': 20002740,<br> 'distance': 0.6542239189147949,<br> 'entity': {<br> 'title': '郊庙歌辞 享太庙乐章 大明舞',<br> 'paragraphs': '旱望春雨,云披大风。',<br> 'author': '张说'<br> }<br> },<br> {<br> 'id': 20001658,<br> 'distance': 0.6228379011154175,<br> 'entity': {<br> 'title': '三学山夜看圣灯',<br> 'paragraphs': '细雨湿不暗,好风吹更明。',<br> 'author': '蜀太妃徐氏'<br> }<br> },<br> {<br> 'id': 20003360,<br> 'distance': 0.6123768091201782,<br> 'entity': {<br> 'title': '郊庙歌辞 汉宗庙乐舞辞 积善舞',<br> 'paragraphs': '云行雨施,天成地平。',<br> 'author': '张昭'<br> }<br> }<br> ]"<br>]<br></code></pre></td></tr></table></figure><p>在搜索结果中,id、title 等字段我们都了解了,只有 distance 是新出现的。它指的是搜索结果与查询向量之间的“距离”,具体含义和度量类型有关。我们使用的度量类型是 IP 内积,数字越大表示搜索结果和查询向量越接近。</p><p>为了增加可读性,我们写一个输出函数:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 打印向量搜索结果</span><br><span class="hljs-keyword">def</span> <span class="hljs-title function_">print_vector_results</span>(<span class="hljs-params">res</span>):<br> <span class="hljs-comment"># hit是搜索结果中的每一个匹配的实体</span><br> res = [hit[<span class="hljs-string">"entity"</span>] <span class="hljs-keyword">for</span> hit <span class="hljs-keyword">in</span> res[<span class="hljs-number">0</span>]]<br> <span class="hljs-keyword">for</span> item <span class="hljs-keyword">in</span> res:<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"title: <span class="hljs-subst">{item[<span class="hljs-string">'title'</span>]}</span>"</span>)<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"author: <span class="hljs-subst">{item[<span class="hljs-string">'author'</span>]}</span>"</span>)<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"paragraphs: <span class="hljs-subst">{item[<span class="hljs-string">'paragraphs'</span>]}</span>"</span>)<br> <span class="hljs-built_in">print</span>(<span class="hljs-string">"-"</span>*<span class="hljs-number">50</span>) <br> <span class="hljs-built_in">print</span>(<span class="hljs-string">f"数量:<span class="hljs-subst">{<span class="hljs-built_in">len</span>(res)}</span>"</span>)<br></code></pre></td></tr></table></figure><p>重新输出结果:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs python">print_vector_results(res1)<br></code></pre></td></tr></table></figure><p>这下搜索结果容易阅读了。</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><code class="hljs shell">title: 郊庙歌辞 享太庙乐章 大明舞<br>author: 张说<br>paragraphs: 旱望春雨,云披大风。<br>--------------------------------------------------<br>title: 三学山夜看圣灯<br>author: 蜀太妃徐氏<br>paragraphs: 细雨湿不暗,好风吹更明。<br>--------------------------------------------------<br>title: 郊庙歌辞 汉宗庙乐舞辞 积善舞<br>author: 张昭<br>paragraphs: 云行雨施,天成地平。<br>--------------------------------------------------<br>数量:3<br></code></pre></td></tr></table></figure><p>如果你不想限制搜索结果的数量,而是返回所有质量符合要求的搜索结果,可以修改搜索参数。比如,在搜索参数中增加 radius 和 range_filter 参数,它们限制了距离 distance 的范围在0.55到1之间。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 修改搜索参数,设置距离的范围</span><br>search_params = {<br> <span class="hljs-string">"metric_type"</span>: <span class="hljs-string">"IP"</span>,<br> <span class="hljs-string">"params"</span>: {<br> <span class="hljs-string">"nprobe"</span>: <span class="hljs-number">16</span>,<br> <span class="hljs-string">"radius"</span>: <span class="hljs-number">0.55</span>,<br> <span class="hljs-string">"range_filter"</span>: <span class="hljs-number">1.0</span><br> }<br>}<br></code></pre></td></tr></table></figure><p>然后调整搜索代码,删除 limit 参数。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><code class="hljs python"><br>res2 = milvus_client.search(<br> collection_name=collection_name,<br> <span class="hljs-comment"># 指定查询向量</span><br> data=query_vectors,<br> <span class="hljs-comment"># 指定搜索的字段</span><br> anns_field=<span class="hljs-string">"vector"</span>,<br> <span class="hljs-comment"># 设置搜索参数</span><br> search_params=search_params,<br> <span class="hljs-comment"># 删除limit参数</span><br> output_fields=output_fields<br>)<br><br><span class="hljs-built_in">print</span>(res2)<br></code></pre></td></tr></table></figure><p>可以看到,输出结果的 distance 都大于0.55。</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">data: [<br> "[<br> {<br> 'id': 20002740,<br> 'distance': 0.6542239189147949,<br> 'entity': {<br> 'author': '张说',<br> 'title': '郊庙歌辞 享太庙乐章 大明舞',<br> 'paragraphs': '旱望春雨,云披大风。'<br> }<br> },<br> {<br> 'id': 20001658,<br> 'distance': 0.6228379011154175,<br> 'entity': {<br> 'author': '蜀太妃徐氏',<br> 'title': '三学山夜看圣灯',<br> 'paragraphs': '细雨湿不暗,好风吹更明。'<br> }<br> },<br> ...<br> {<br> 'id': 20001375, <br> 'distance': 0.5891480445861816, <br> 'entity': {<br> 'author': '上官昭容', <br> 'title': '游长宁公主流杯池二十五首 二十', <br> 'paragraphs': '瀑溜晴疑雨,丛篁昼似昏。'<br> }<br>}<br> ]"<br>]<br></code></pre></td></tr></table></figure><p>也许你还想知道你最喜欢的李白,有没有和你一样感慨今天的雨真大,没问题,我们增加filter参数就可以了。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># 通过表达式过滤字段author,筛选出字段“author”的值为“李白”的结果</span><br><span class="hljs-built_in">filter</span> = <span class="hljs-string">f"author == '李白'"</span><br><br>res3 = client.search(<br>collection_name=collection_name,<br><span class="hljs-comment"># 指定查询向量</span><br>data=query_vectors,<br><span class="hljs-comment"># 指定搜索的字段</span><br>anns_field=<span class="hljs-string">"vector"</span>,<br><span class="hljs-comment"># 设置搜索参数</span><br>search_params=search_params,<br><span class="hljs-comment"># 通过表达式实现标量过滤,筛选结果</span><br><span class="hljs-built_in">filter</span>=<span class="hljs-built_in">filter</span>,<br><span class="hljs-comment"># 指定返回搜索结果的数量</span><br>limit=limit,<br><span class="hljs-comment"># 指定返回的字段</span><br>output_fields=output_fields<br>)<br><span class="hljs-built_in">print</span>(res3)<br></code></pre></td></tr></table></figure><p>返回的结果为空值。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><code class="hljs python">data: [<span class="hljs-string">'[]'</span>] <br></code></pre></td></tr></table></figure><p>这是因为我们前面设置了 distance 的范围在0.55到1之间,放大范围可以获得更多结果。把 “radius” 的值修改为0.2,再次运行命令,让我们看看李白是怎么感慨的。</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">data: [<br> "[<br> {<br> 'id': 20004246,<br> 'distance': 0.46472394466400146,<br> 'entity': {<br> 'author': '李白',<br> 'title': '横吹曲辞 关山月',<br> 'paragraphs': '明月出天山,苍茫云海间。'<br> }<br> },<br> {<br> 'id': 20003707,<br> 'distance': 0.4347272515296936,<br> 'entity': {<br> 'author': '李白',<br> 'title': '鼓吹曲辞 有所思',<br> 'paragraphs': '海寒多天风,白波连山倒蓬壶。'<br> }<br> },<br> {<br> 'id': 20003556,<br> 'distance': 0.40778297185897827,<br> 'entity': {<br> 'author': '李白',<br> 'title': '鼓吹曲辞 战城南',<br> 'paragraphs': '去年战桑干源,今年战葱河道。'<br> }<br> }<br> ]"<br>]<br></code></pre></td></tr></table></figure><p>我们观察搜索结果发现, distance 在0.4左右,小于之前设置的0.55,所以被排除了。另外,distance 数值较小,说明搜索结果并不是特别接近查询向量,而这几句诗词的确和“雨”的关系比较远。</p><p>如果你希望搜索结果中直接包含“雨”字,可以使用 query 方法做标量搜索。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-comment"># paragraphs字段包含“雨”字</span><br><span class="hljs-built_in">filter</span> = <span class="hljs-string">f"paragraphs like '%雨%'"</span><br><br>res4 = client.query(<br>collection_name=collection_name,<br><span class="hljs-built_in">filter</span>=<span class="hljs-built_in">filter</span>,<br>output_fields=output_fields,<br>limit=limit<br>)<br><span class="hljs-built_in">print</span>(res4)<br></code></pre></td></tr></table></figure><p>标量查询的代码更简单,因为它免去了和向量搜索相关的参数,比如查询向量 data,指定搜索字段的 anns_field 和搜索参数 search_params,搜索参数只有 filter 。</p><p>观察搜索结果发现,标量搜索结果的数据结构少了一个 “[]”,我们在提取具体字段时需要注意这一点。</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">data: [<br> "{<br> "author": "太宗皇帝",<br> "title": "咏雨",<br> "paragraphs": "罩云飘远岫,喷雨泛长河。",<br> "id": 20000305<br> },<br> {<br> "author": "太宗皇帝",<br> "title": "咏雨",<br> "paragraphs": "和气吹绿野,梅雨洒芳田。",<br> "id": 20000402<br> },<br> {<br> "author": "太宗皇帝",<br> "title": "赋得花庭雾",<br> "paragraphs": "还当杂行雨,髣髴隐遥空。",<br> "id": 20000421<br> }"<br>]<br></code></pre></td></tr></table></figure><p>filter 表达式还有丰富的用法,比如同时搜索两个字段,author 字段指定为 “杜甫”,同时 paragraphs 字段仍然要求包含“雨”字:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><code class="hljs python"><span class="hljs-built_in">filter</span> = <span class="hljs-string">f"author == '杜甫' && paragraphs like '%雨%'"</span><br><br>res5 = client.query(<br>collection_name=collection_name,<br><span class="hljs-built_in">filter</span>=<span class="hljs-built_in">filter</span>,<br>output_fields=output_fields,<br>limit=limit<br>)<br><span class="hljs-built_in">print</span>(res5)<br></code></pre></td></tr></table></figure><p>返回杜甫含有“雨”字的诗句:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><code class="hljs plaintext">data: [<br>"{<br>'title': '横吹曲辞 前出塞九首 七', <br>'paragraphs': '驱马天雨雪,军行入高山。', <br>'id': 20004039, <br>'author': '杜甫'<br>}"<br>] <br></code></pre></td></tr></table></figure><p>更多标量搜索的表达式可以参考<a href="https://milvus.io/docs/get-and-scalar-query.md#Use-Basic-Operators">Get & Scalar Query</a>。</p><h2 id="总结"><a href="#总结" class="headerlink" title="总结"></a>总结</h2><p>可能这样的搜索结果并没有让你很满意,这里面有多个原因。首先,数据集太小了。只有4000多个句子,语义更接近的句子可能没有包含其中。其次,嵌入模型虽然支持中文,但是古诗词并不是它的专长。这就好像你找了个翻译帮你和老外交流,翻译虽然懂普通话,但是你满嘴四川方言,翻译也只能也蒙带猜,翻译质量可想而知。</p><p>如果你希望优化搜索功能,可以在 <a href="https://github.com/BushJiang/chinese-poetry">chinese-poetry</a> 下载完整的古诗词数据集,再找找专门用于古诗词的嵌入模型,相信搜索效果会有较大提升。</p><p>另外,我在以上代码的基础上,开发了一个命令行应用,有兴趣可以玩玩:<a href="https://github.com/BushJiang/searchPoems">语义搜索古诗词</a></p><h2 id="注释"><a href="#注释" class="headerlink" title="注释"></a>注释</h2><section class="footnotes"><div class="footnote-list"><ol><li><span id="fn:1" class="footnote-text"><span>古诗词数据集来自 <a href="https://github.com/BushJiang/chinese-poetry">chinese-poetry</a>,数据结构做了调整。<a href="#fnref:1" rev="footnote" class="footnote-backref"> ↩</a></span></span></li></ol></div></section>]]></content>
<categories>
<category>向量数据库</category>
<category>趣味应用</category>
</categories>
</entry>
</search>