Commit 65a8c47

Merge pull request #90 from BillFarber/task/extendExamples
Adding setup for MarkLogic 12 examples
2 parents c1306e6 + fd87194 commit 65a8c47

File tree: 6 files changed, +195 −1 lines changed

examples/langchain/README.md (+58 −1)

@@ -26,23 +26,31 @@ is available):

    docker-compose up -d --build

## Deploy With Gradle

Then deploy a small REST API application to MarkLogic, which includes a basic non-admin MarkLogic user
named `langchain-user`:

    ./gradlew -i mlDeploy

## Install Python Libraries

Next, create a new Python virtual environment - [pyenv](https://github.com/pyenv/pyenv) is recommended for this -
and install the
[langchain example dependencies](https://python.langchain.com/docs/use_cases/question_answering/quickstart#dependencies),
along with the MarkLogic Python Client:

    pip install -U langchain langchain_openai langchain-community langchainhub openai chromadb bs4 marklogic_python_client

## Load Sample Data

Then run the following Python program to load text data from the langchain quickstart guide
into two different collections in the `langchain-test-content` database:

    python load_data.py

## Create Python Environment File

Create a ".env" file to hold your AzureOpenAI environment values. It should look
something like this.
@@ -89,4 +97,53 @@ query using the `marklogic_contextual_query_retriever.py` module in this project

This retriever builds a term-query using words from the question. Then the term-query is
added to the structured query and the merged query is used to select from the documents
loaded via `load_data.py`.
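The merge step described above can be sketched in plain Python. Note that the helper name `merge_term_query` and the exact structured-query JSON shape below are illustrative assumptions, not the actual code in `marklogic_contextual_query_retriever.py`:

```python
# Illustrative sketch only: the helper name and the structured-query JSON
# shape are assumptions, not the retriever module's real implementation.

def merge_term_query(structured_query: dict, question: str) -> dict:
    """Append a term-query built from the question's words to a structured query."""
    term_query = {"term-query": {"text": question.split()}}
    queries = list(structured_query.get("query", {}).get("queries", []))
    queries.append(term_query)
    return {"query": {"queries": queries}}

# A base structured query constraining results to a collection.
base = {"query": {"queries": [{"collection-query": {"uri": ["posts"]}}]}}
merged = merge_term_query(base, "What is task decomposition")
```

The merged query keeps the original constraints and adds the term-query, so the term-query narrows the original selection rather than replacing it.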
## Testing using MarkLogic 12EA Vector Search

### MarkLogic 12EA Setup

To try out this functionality, you will need access to an instance of MarkLogic 12
(currently internal or Early Access only). You may use
[docker-compose](https://docs.docker.com/compose/) to instantiate a new MarkLogic
instance with port 8003 available (you can use your own MarkLogic instance too, just be
sure that port 8003 is available):

    docker-compose -f docker-compose-12.yml up -d --build

### Deploy With Gradle

You will also need to deploy the application. However, for this example, you will need
to include an additional switch on the command line to deploy a TDE schema that takes
advantage of the vector capabilities in MarkLogic 12.

    ./gradlew -i mlDeploy -PmlSchemasPath=src/main/ml-schemas-12

### Install Python Libraries

As above, if you have not yet installed the Python libraries, install them with pip:
```
pip install -U langchain langchain_openai langchain-community langchainhub openai chromadb bs4 marklogic_python_client
```

### Create Python Environment File

The Python script for this example also generates LLM embeddings and includes them in
the documents stored in MarkLogic. In order to generate the embeddings, you'll need to
add the following environment variables (with your values) to the .env file created
above.

```
AZURE_EMBEDDING_DEPLOYMENT_NAME=text-test-embedding-ada-002
AZURE_EMBEDDING_DEPLOYMENT_MODEL=text-embedding-ada-002
```

### Load Sample Data

Then run the following Python program to load text data from the langchain quickstart
guide into two different collections in the `langchain-test-content` database. Note that
this script is different from the one in the earlier setup section and loads the data
into different collections.

```
python load_data_with_embeddings.py
```
docker-compose-12.yml (new file, +17)

```yaml
version: '3.8'
name: marklogic_python_example_langchain-12

services:

  marklogic:
    image: "ml-docker-db-dev-tierpoint.bed-artifactory.bedford.progress.com/marklogic/marklogic-server-ubi:12.0.nightly-ubi-2.0.1"
    platform: linux/amd64
    environment:
      - MARKLOGIC_INIT=true
      - MARKLOGIC_ADMIN_USERNAME=admin
      - MARKLOGIC_ADMIN_PASSWORD=admin
    volumes:
      - ./docker/marklogic/logs:/var/opt/MarkLogic/Logs
    ports:
      - "8000-8003:8000-8003"
```
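MarkLogic can take a little while to initialize after `docker-compose ... up -d`. As a convenience, you could wait until port 8003 accepts connections before running the load scripts; this stdlib helper is a hypothetical sketch, not part of this commit:

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 120.0) -> bool:
    """Poll until a TCP port accepts connections, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            time.sleep(1)
    return False


# Example usage: wait_for_port("localhost", 8003) before running load_data.py.
```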
load_data_with_embeddings.py (new file, +78)

```python
# Based on example at
# https://python.langchain.com/docs/use_cases/question_answering/quickstart .

import os

import bs4
from dotenv import load_dotenv
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import AzureOpenAIEmbeddings
from marklogic import Client
from marklogic.documents import DefaultMetadata, Document

load_dotenv()
embeddings = AzureOpenAIEmbeddings(
    azure_deployment=os.environ["AZURE_EMBEDDING_DEPLOYMENT_NAME"]
)

loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
)
splits = text_splitter.split_documents(docs)

client = Client("http://localhost:8003", digest=("langchain-user", "password"))

marklogic_docs = [DefaultMetadata(collections="posts_with_embeddings")]
for split in splits:
    text = split.page_content
    embedding = embeddings.embed_query(text)
    doc = Document(
        None,
        {"text": text, "embedding": embedding},
        extension=".json",
        directory="/post/",
    )
    marklogic_docs.append(doc)

client.documents.write(marklogic_docs)
print(
    f"Number of documents written to collection 'posts': {len(marklogic_docs)-1}"
)

loader = WebBaseLoader(
    web_paths=(
        "https://raw.githubusercontent.com/langchain-ai/langchain/master/docs/docs/modules/state_of_the_union.txt",
    )
)
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
)
splits = text_splitter.split_documents(docs)

marklogic_docs = [DefaultMetadata(collections="sotu_with_embeddings")]
for split in splits:
    text = split.page_content
    embedding = embeddings.embed_query(text)
    doc = Document(
        None,
        {"text": text, "embedding": embedding},
        extension=".json",
        directory="/sotu/",
    )
    marklogic_docs.append(doc)

client.documents.write(marklogic_docs)
print(
    f"Number of documents written to collection 'sotu': {len(marklogic_docs)-1}"
)
```
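As a rough intuition for the splitter settings above (`chunk_size=1000`, `chunk_overlap=100`): the real `RecursiveCharacterTextSplitter` prefers paragraph and sentence boundaries before falling back to fixed sizes, but the size/overlap arithmetic reduces to something like this simplified sketch:

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive fixed-size chunking: each chunk shares `chunk_overlap`
    characters with the previous one (simplified, boundary-unaware)."""
    step = chunk_size - chunk_overlap
    return [
        text[i:i + chunk_size]
        for i in range(0, max(len(text) - chunk_overlap, 1), step)
    ]


sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample, chunk_size=1000, chunk_overlap=100)
# Produces 3 chunks; consecutive chunks share a 100-character overlap.
```

The overlap means a sentence falling on a chunk boundary still appears intact in at least one chunk, which helps retrieval quality.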
Content database configuration (new file, +4)

```json
{
  "database-name": "%%DATABASE%%",
  "schema-database": "%%SCHEMAS_DATABASE%%"
}
```
Schemas database configuration (new file, +3)

```json
{
  "database-name": "%%SCHEMAS_DATABASE%%"
}
```
TDE template (new file, +35)

```json
{
  "template": {
    "context": "/",
    "collections": [
      "posts_with_embeddings"
    ],
    "rows": [
      {
        "schemaName": "demo",
        "viewName": "posts",
        "columns": [
          {
            "name": "uri",
            "scalarType": "string",
            "val": "xdmp:node-uri(.)"
          },
          {
            "name": "embedding",
            "scalarType": "vector",
            "val": "vec:vector(embedding)",
            "dimension": "1536",
            "invalidValues": "reject",
            "nullable": true
          },
          {
            "name": "text",
            "scalarType": "string",
            "val": "text",
            "nullable": true
          }
        ]
      }
    ]
  }
}
```
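The `embedding` column above is declared as a 1536-dimension vector, which is what MarkLogic 12's vector functions (such as the `vec:vector` constructor used in the template) operate on. For intuition, similarity between two embeddings is typically measured with cosine similarity; this pure-Python sketch shows the arithmetic only, since MarkLogic evaluates it natively server-side:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors:
    ~1.0 = same direction, ~0.0 = orthogonal (unrelated)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # ≈ 1.0 (identical)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # ≈ 0.0 (orthogonal)
```

A vector-search query ranks rows of the `demo.posts` view by this kind of similarity between a question's embedding and each stored `embedding` value.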
