Commit 812d6a7

Add multilingual extraction and classification support
1 parent d17d876 commit 812d6a7

18 files changed

Lines changed: 950 additions & 107 deletions

src/classifier/prompt-sections.ts

Lines changed: 7 additions & 0 deletions
@@ -58,6 +58,13 @@ DON'T use for: generic ("go to movies"), places (use placeName), bands (use wiki
 export const SHARED_CATEGORIES_SECTION = `CATEGORIES: ${VALID_CATEGORIES.join(', ')}
 ("other" should be used only as a last resort. Only use it if no other category applies.)`
 
+export const SHARED_LANGUAGE_SECTION = `LANGUAGE HANDLING:
+- Chats may be in any language or mix multiple languages in one conversation.
+- Treat translated equivalents of "we should", "let's go", "want to visit", enthusiasm, and agreement the same as English.
+- Output JSON field names exactly as specified. Use concise English for generic activity descriptions when natural.
+- Preserve proper nouns, venue names, media titles, addresses, and city/place names in their original script unless a standard English name is obvious.
+- Do not skip a candidate just because the surrounding context is not English.`
+
 export const SHARED_NORMALIZATION = `NORMALIZATION:
 - Distinct categories: cafe≠restaurant, bar≠restaurant
 - KEEP mediaKey specificity: "glow worm cave" not "cave", "hot air balloon" not "balloon"
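
The first LANGUAGE HANDLING rule says a single chat can mix languages. As a rough illustration of what "mixed" looks like at the character level, here is a minimal sketch that reports which Unicode scripts occur in a message. It is not part of this commit: `detectScripts` and the script list are hypothetical, and a script is not the same thing as a language (Spanish and German both come out as Latin).

```typescript
// Illustrative subset of Unicode scripts to probe for; not a list taken from the project.
const SCRIPT_NAMES = ['Latin', 'Cyrillic', 'Arabic', 'Devanagari', 'Han', 'Hiragana', 'Katakana', 'Hangul']

// Build each pattern with the RegExp constructor; the "u" flag enables \p{Script=...}.
const SCRIPT_PATTERNS = SCRIPT_NAMES.map((script) => ({
  script,
  pattern: new RegExp(`\\p{Script=${script}}`, 'u'),
}))

// Return the names of all probed scripts that appear at least once in the text,
// in the order they are listed in SCRIPT_NAMES.
function detectScripts(text: string): string[] {
  return SCRIPT_PATTERNS.filter(({ pattern }) => pattern.test(text)).map(({ script }) => script)
}

// The Japanese test candidate from prompt.test.ts mixes three scripts.
const japaneseScripts = detectScripts('あのレストランに行ってみたい')
const englishScripts = detectScripts("let's go")
```

A check like this could only flag that a message is not plain Latin text; deciding what the message means is still left to the classifier prompt.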

src/classifier/prompt.test.ts

Lines changed: 10 additions & 0 deletions
@@ -146,6 +146,16 @@ describe('Classifier Prompt', () => {
     expect(prompt).toContain('adult content')
   })
 
+  it('includes multilingual analysis instructions', () => {
+    const candidates = [createCandidate(1, 'あのレストランに行ってみたい')]
+
+    const prompt = buildClassificationPrompt(candidates, TEST_CONTEXT)
+
+    expect(prompt).toContain('Chats may be in any language')
+    expect(prompt).toContain('Preserve proper nouns')
+    expect(prompt).toContain('あのレストランに行ってみたい')
+  })
+
   it('tags agreement candidates with [AGREE]', () => {
     const candidates = [createCandidate(1, 'Sounds great!', [], [], 'agreement')]
 

src/classifier/prompt.ts

Lines changed: 5 additions & 0 deletions
@@ -18,6 +18,7 @@ import {
   SHARED_EXAMPLES,
   SHARED_IMAGE_SECTION,
   SHARED_INCLUDE_RULES,
+  SHARED_LANGUAGE_SECTION,
   SHARED_LINK_SECTION,
   SHARED_NORMALIZATION,
   SHARED_SKIP_RULES,
@@ -186,6 +187,8 @@ ${formatted}
 
 ${buildUserContextSection(context.homeCountry, context.timezone)}
 
+${SHARED_LANGUAGE_SECTION}
+
 WHY THESE MESSAGES:
 You're seeing messages pre-filtered by heuristics (regex patterns like "let's go", "we should try") and semantic search (embeddings). We intentionally cast a wide net - you'll see some false positives. Your job is to identify the real activities worth saving.
 
@@ -250,6 +253,8 @@ ${formatted}
 
 ${buildUserContextSection(context.homeCountry, context.timezone)}
 
+${SHARED_LANGUAGE_SECTION}
+
 These are AGREEMENT messages - phrases like "sounds great!", "I'm keen!", "let's do it!". Your job is to find WHAT they are agreeing to by looking at the messages BEFORE the >>> candidate.
 
 The >>> message is the agreement itself. Look at the context BEFORE it to find the activity being agreed to.
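
Both prompt templates get the same insertion: the language section sits directly after the user context block and before the task-specific instructions. A minimal sketch of that ordering follows. `buildPromptSketch` and `LANGUAGE_SECTION_STUB` are hypothetical stand-ins; the real builders in prompt.ts take more inputs and emit many more sections than shown here.

```typescript
// Stand-in for SHARED_LANGUAGE_SECTION (only the first rule is reproduced here).
const LANGUAGE_SECTION_STUB =
  'LANGUAGE HANDLING:\n- Chats may be in any language or mix multiple languages in one conversation.'

// Hypothetical, simplified builder: joins the sections in the order the diff establishes.
function buildPromptSketch(formattedCandidates: string, userContext: string): string {
  return [
    formattedCandidates,
    userContext,
    LANGUAGE_SECTION_STUB, // inserted after the user context, as in both hunks above
    'WHY THESE MESSAGES:',
  ].join('\n\n')
}

const sketchedPrompt = buildPromptSketch('1. あのレストランに行ってみたい', 'USER CONTEXT: (placeholder)')
```

Keeping the text in one shared constant means the classification and agreement prompts cannot drift apart on language rules.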

src/extraction/embeddings/index.test.ts

Lines changed: 3 additions & 0 deletions
@@ -450,6 +450,9 @@ describe('Embeddings Module', () => {
       expect(DEFAULT_ACTIVITY_QUERIES.length).toBeGreaterThan(0)
       expect(DEFAULT_ACTIVITY_QUERIES.some((q) => q.includes('restaurant'))).toBe(true)
       expect(DEFAULT_ACTIVITY_QUERIES.some((q) => q.includes('hiking'))).toBe(true)
+      expect(DEFAULT_ACTIVITY_QUERIES.some((q) => q.includes('deberíamos'))).toBe(true)
+      expect(DEFAULT_ACTIVITY_QUERIES.some((q) => q.includes('行ってみたい'))).toBe(true)
+      expect(DEFAULT_ACTIVITY_QUERIES.some((q) => q.includes('我們應該'))).toBe(true)
     })
   })
 })

src/extraction/embeddings/queries/activity-types.json

Lines changed: 139 additions & 9 deletions
@@ -26,7 +26,20 @@
     "let's get dim sum",
     "we should get tapas",
     "let's go to the food market",
-    "we should check out the farmers market"
+    "we should check out the farmers market",
+    "restaurante cafe bar tapas",
+    "restaurante café bar vinho",
+    "restaurant café bar bistro",
+    "restaurant kneipe biergarten",
+    "ristorante pizzeria gelato",
+    "restauracja kawiarnia bar",
+    "ресторан кафе бар",
+    "restoran kafe makanan",
+    "مطعم مقهى طعام",
+    "レストラン カフェ 居酒屋",
+    "식당 카페 술집",
+    "餐厅 咖啡馆 酒吧",
+    "餐廳 咖啡館 酒吧"
   ],
   "nature_outdoors": [
     "we should go hiking",
@@ -57,7 +70,22 @@
     "let's go diving",
     "we should try kayaking",
     "let's go paddle boarding",
-    "we should go surfing"
+    "we should go surfing",
+    "senderismo playa montaña parque",
+    "trilha praia cachoeira parque",
+    "randonnée plage montagne parc",
+    "wanderung strand berg park",
+    "escursione spiaggia montagna parco",
+    "wandeling strand natuurpark",
+    "wędrówka plaża góry park",
+    "поход пляж горы парк",
+    "mendaki pantai gunung taman",
+    "yürüyüş plaj dağ park",
+    "شاطئ جبل حديقة",
+    "ハイキング ビーチ 山 公園",
+    "등산 해변 산 공원",
+    "徒步 海滩 山 公园",
+    "徒步 海灘 山 公園"
   ],
   "entertainment": [
     "we should go to that concert",
@@ -87,7 +115,22 @@
     "let's play mini golf",
     "we should try laser tag",
     "let's go to go karts",
-    "we should go to the theme park"
+    "we should go to the theme park",
+    "concierto cine festival teatro",
+    "spettacolo cinema festival teatro",
+    "concert cinéma festival théâtre",
+    "konzert kino festival theater",
+    "concerto cinema festival teatro",
+    "concert bioscoop festival theater",
+    "koncert kino festiwal teatr",
+    "концерт кино фестиваль театр",
+    "konser film festival teater",
+    "konser sinema festival tiyatro",
+    "حفلة سينما مهرجان مسرح",
+    "コンサート 映画 フェス",
+    "콘서트 영화 축제 공연",
+    "演唱会 电影 节日 演出",
+    "演唱會 電影 節日 表演"
   ],
   "culture_arts": [
     "we should visit the museum",
@@ -118,7 +161,21 @@
     "let's do a food tour",
     "we should do the ghost tour",
     "let's do a pub crawl",
-    "we should do wine tasting"
+    "we should do wine tasting",
+    "museo galería exposición",
+    "museu galeria exposição",
+    "musée galerie exposition",
+    "museum galerie ausstellung",
+    "museo galleria mostra",
+    "muzeum galeria wystawa",
+    "музей галерея выставка",
+    "museum galeri pameran",
+    "müze galeri sergi",
+    "متحف معرض",
+    "美術館 博物館 展覧会",
+    "박물관 미술관 전시",
+    "博物馆 美术馆 展览",
+    "博物館 美術館 展覽"
   ],
   "travel_accommodation": [
     "let's book an airbnb",
@@ -136,7 +193,22 @@
     "we should do a city break",
     "let's do a staycation",
     "I found cheap flights",
-    "there's a travel deal"
+    "there's a travel deal",
+    "viaje hotel escapada vacaciones",
+    "viagem hotel férias",
+    "voyage hôtel vacances",
+    "reise hotel urlaub",
+    "viaggio hotel vacanza",
+    "reis hotel vakantie",
+    "podróż hotel wakacje",
+    "поездка отель отпуск",
+    "यात्रा होटल छुट्टी",
+    "perjalanan hotel liburan",
+    "seyahat otel tatil",
+    "رحلة فندق عطلة",
+    "旅行 ホテル 休暇",
+    "여행 호텔 휴가",
+    "旅行 酒店 假期"
   ],
   "sports_adventure": [
     "we should go skiing",
@@ -168,7 +240,22 @@
     "let's go on safari",
     "we should go whale watching",
     "let's go dolphin watching",
-    "we should do a wildlife tour"
+    "we should do a wildlife tour",
+    "esquí surf buceo escalada",
+    "esqui surfe mergulho escalada",
+    "ski surf plongée escalade",
+    "ski surfen tauchen klettern",
+    "sci surf immersione arrampicata",
+    "skiën surfen duiken klimmen",
+    "narciarstwo surfing nurkowanie wspinaczka",
+    "лыжи серфинг дайвинг скалолазание",
+    "ski selancar menyelam panjat",
+    "kayak sörf dalış tırmanış",
+    "تزلج غوص تسلق",
+    "スキー サーフィン ダイビング",
+    "스키 서핑 다이빙 클라이밍",
+    "滑雪 冲浪 潜水 攀岩",
+    "滑雪 衝浪 潛水 攀岩"
   ],
   "social_events": [
     "let's throw a party",
@@ -189,7 +276,22 @@
     "we should take a workshop",
     "let's take a class",
     "we should take a lesson",
-    "let's sign up for that course"
+    "let's sign up for that course",
+    "fiesta cumpleaños reunión taller",
+    "festa aniversário encontro oficina",
+    "fête anniversaire atelier",
+    "party geburtstag treffen kurs",
+    "festa compleanno laboratorio",
+    "feest verjaardag workshop",
+    "impreza urodziny warsztaty",
+    "вечеринка день рождения мастер-класс",
+    "pesta ulang tahun lokakarya",
+    "parti doğum günü atölye",
+    "حفلة ورشة عيد ميلاد",
+    "パーティー ワークショップ 誕生日",
+    "파티 생일 워크숍",
+    "派对 生日 工作坊",
+    "派對 生日 工作坊"
   ],
   "shopping": [
     "let's go to the mall",
@@ -199,7 +301,21 @@
     "we should go to the market",
     "let's explore the bazaar",
     "we should get souvenirs",
-    "let's go to the gift shop"
+    "let's go to the gift shop",
+    "mercado tienda bazar recuerdos",
+    "mercado loja lembranças",
+    "marché boutique souvenirs",
+    "markt laden souvenirs",
+    "mercato negozio souvenir",
+    "rynek sklep pamiątki",
+    "рынок магазин сувениры",
+    "pasar toko suvenir",
+    "pazar mağaza hediyelik",
+    "سوق متجر هدايا",
+    "市場 ショッピング お土産",
+    "시장 쇼핑 기념품",
+    "市场 商店 纪念品",
+    "市場 商店 紀念品"
   ],
   "family_kids": [
     "let's go to the playground",
@@ -210,6 +326,20 @@
     "we should visit the petting zoo",
     "let's do a farm visit",
     "we should go to the science museum",
-    "let's try the trampoline park"
+    "let's try the trampoline park",
+    "zoológico acuario parque infantil",
+    "zoológico aquário parque infantil",
+    "zoo aquarium aire de jeux",
+    "zoo aquarium spielplatz",
+    "zoo acquario parco giochi",
+    "zoo akwarium plac zabaw",
+    "зоопарк аквариум детская площадка",
+    "kebun binatang akuarium taman bermain",
+    "hayvanat bahçesi akvaryum oyun alanı",
+    "حديقة حيوان حوض أسماك",
+    "動物園 水族館 遊園地",
+    "동물원 수족관 놀이터",
+    "动物园 水族馆 游乐园",
+    "動物園 水族館 遊樂園"
   ]
 }
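
The JSON above is a map from category name to query phrases, while the embeddings test works against a flat `DEFAULT_ACTIVITY_QUERIES` list. How the project derives one from the other is not shown on this page, so the following is only a sketch under that assumption: `flattenQueries`, the `QueryMap` type, and the sample data (including the made-up `repeats` category) are hypothetical.

```typescript
// Hypothetical type: category name -> query phrases, mirroring the JSON layout above.
type QueryMap = Record<string, string[]>

// Flatten every category's phrases into one list, dropping exact duplicates
// (compared after NFC normalization) while keeping first-seen order.
function flattenQueries(queriesByCategory: QueryMap): string[] {
  const seen = new Set<string>()
  const flat: string[] = []
  for (const category of Object.keys(queriesByCategory)) {
    const phrases = queriesByCategory[category] ?? []
    for (const phrase of phrases) {
      const key = phrase.normalize('NFC')
      if (!seen.has(key)) {
        seen.add(key)
        flat.push(phrase)
      }
    }
  }
  return flat
}

// Sample data: two phrases per real category, plus a made-up category that repeats one.
const sampleQueries: QueryMap = {
  culture_arts: ['we should visit the museum', '博物館 美術館 展覽'],
  family_kids: ['we should go to the science museum', '動物園 水族館 遊樂園'],
  repeats: ['we should visit the museum'],
}
const flatQueries = flattenQueries(sampleQueries)
```

Normalizing before comparison matters for accented phrases, where the same visible text can be encoded as composed or decomposed code points.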

src/extraction/embeddings/queries/agreement.json

Lines changed: 57 additions & 1 deletion
@@ -6,5 +6,61 @@
   "that looks awesome",
   "i'm keen to do that",
   "i'd be keen",
-  "let's book it"
+  "let's book it",
+  "suena bien",
+  "me apunto",
+  "qué buena pinta",
+  "parece bom",
+  "boa ideia",
+  "eu topo",
+  "klingt gut",
+  "ich bin dabei",
+  "sieht gut aus",
+  "ça a l'air bien",
+  "partant",
+  "bonne idée",
+  "sembra bello",
+  "ci sto",
+  "bella idea",
+  "klinkt goed",
+  "ik ben erbij",
+  "ziet er leuk uit",
+  "låter bra",
+  "jag är på",
+  "ser kul ut",
+  "lyder godt",
+  "jeg er på",
+  "ser sjovt ud",
+  "høres bra",
+  "jeg er med",
+  "ser gøy ut",
+  "brzmi dobrze",
+  "jestem za",
+  "dobry pomysł",
+  "звучит здорово",
+  "я за",
+  "выглядит классно",
+  "अच्छा लगता है",
+  "मैं तैयार हूँ",
+  "बहुत बढ़िया",
+  "kedengarannya bagus",
+  "aku ikut",
+  "ide bagus",
+  "kulağa güzel",
+  "ben varım",
+  "iyi fikir",
+  "يبدو رائع",
+  "أنا معكم",
+  "فكرة جيدة",
+  "よさそう",
+  "いいね",
+  "面白そう",
+  "좋겠다",
+  "좋아",
+  "재밌겠다",
+  "听起来不错",
+  "聽起來不錯",
+  "看起来很好",
+  "看起來很好",
+  "好主意"
 ]
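
The classifier prompt describes candidates as pre-filtered by "semantic search (embeddings)", and these phrases are the queries for agreement messages. A common way to score a message against such queries is cosine similarity between embedding vectors; whether this project uses exactly that metric is an assumption. The sketch below uses made-up three-dimensional vectors purely for illustration; real embeddings come from a model and have far more dimensions. `EmbeddedQuery`, `cosineSimilarity`, and `bestMatch` are hypothetical names.

```typescript
// Hypothetical shape: a query phrase paired with its precomputed embedding.
interface EmbeddedQuery {
  text: string
  vector: number[]
}

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vectors must have the same length')
  let dot = 0
  let normA = 0
  let normB = 0
  a.forEach((x, i) => {
    const y = b[i] ?? 0
    dot += x * y
    normA += x * x
    normB += y * y
  })
  if (normA === 0 || normB === 0) return 0
  return dot / Math.sqrt(normA * normB)
}

// Return the query with the highest similarity to the message, or null if there are none.
function bestMatch(messageVector: number[], queries: EmbeddedQuery[]): { text: string; score: number } | null {
  let best: { text: string; score: number } | null = null
  for (const query of queries) {
    const score = cosineSimilarity(messageVector, query.vector)
    if (best === null || score > best.score) best = { text: query.text, score }
  }
  return best
}

// Toy vectors, invented for this example only.
const toyQueries: EmbeddedQuery[] = [
  { text: 'klingt gut', vector: [0.9, 0.1, 0.0] },
  { text: "let's book it", vector: [0.1, 0.9, 0.0] },
]
const agreementMatch = bestMatch([0.8, 0.2, 0.0], toyQueries)
```

With a multilingual embedding model, a message and a query in different languages can still land close together, which is presumably why short native phrases are listed per language rather than translated at query time.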
Lines changed: 2 additions & 2 deletions
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:311165f70eabbe70131dfbd6365fb6b05d07c74df6898c06d6fa809caba55090
-size 3465505
+oid sha256:7cc1019f6519505f38a4815fef9393b3814fe0b5ca8bf46f3906a5c8ccb9a046
+size 6982893
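
The last diff is a Git LFS pointer, so only the object hash and byte size change in the repository; the page does not show which file it belongs to. A minimal sketch for reading such a pointer and comparing the two sizes follows. `parseLfsPointer` and `LfsPointer` are hypothetical helpers, and the parser handles only the three keys visible above.

```typescript
interface LfsPointer {
  version: string
  oid: string
  size: number
}

// Parse "key value" lines from a Git LFS pointer file into a typed object.
function parseLfsPointer(text: string): LfsPointer {
  const fields = new Map<string, string>()
  for (const line of text.split('\n')) {
    const trimmed = line.trim()
    if (trimmed === '') continue
    const space = trimmed.indexOf(' ')
    if (space === -1) throw new Error(`malformed pointer line: ${trimmed}`)
    fields.set(trimmed.slice(0, space), trimmed.slice(space + 1))
  }
  const version = fields.get('version')
  const oid = fields.get('oid')
  const size = Number(fields.get('size'))
  if (!version || !oid || !Number.isInteger(size)) throw new Error('missing or invalid pointer field')
  return { version, oid, size }
}

const oldPointer = parseLfsPointer(
  'version https://git-lfs.github.com/spec/v1\n' +
    'oid sha256:311165f70eabbe70131dfbd6365fb6b05d07c74df6898c06d6fa809caba55090\n' +
    'size 3465505\n',
)
const newPointer = parseLfsPointer(
  'version https://git-lfs.github.com/spec/v1\n' +
    'oid sha256:7cc1019f6519505f38a4815fef9393b3814fe0b5ca8bf46f3906a5c8ccb9a046\n' +
    'size 6982893\n',
)

// Ratio of new size to old size for the tracked object.
const growthRatio = newPointer.size / oldPointer.size
```

For this commit the tracked object grows from 3,465,505 to 6,982,893 bytes, roughly doubling in size.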
