Analyzer, Character Filter, Tokenizer, Token Filters, Tokens, Terms
1️⃣ Text Analysis
Definition: When we want to search unstructured (free-form) text in OpenSearch, the text is first split into terms, and those terms are stored in an inverted index.
Purpose: To make search efficient: a query looks up terms in the index instead of scanning every document.
Example:
Document: "OpenSearch is awesome"
After analysis → tokens: ["opensearch", "is", "awesome"]
These are stored in the inverted index.
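To make the idea of an inverted index concrete, here is a minimal plain-Python sketch. It is not how OpenSearch implements its index; the `analyze` and `build_inverted_index` helpers and the second sample document are assumptions made up for illustration, and the "analysis" is just lowercasing plus a whitespace split.

```python
from collections import defaultdict

def analyze(text):
    # Simplified stand-in for an analyzer: lowercase, then split on whitespace.
    return text.lower().split()

def build_inverted_index(docs):
    # Map each term to the sorted list of document IDs that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Document 2 is a hypothetical extra document added for the demo.
docs = {1: "OpenSearch is awesome", 2: "Search is fun"}
index = build_inverted_index(docs)

print(index["is"])          # [1, 2]
print(index["opensearch"])  # [1]
```

A search for "opensearch" then becomes a dictionary lookup rather than a scan over every document.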
2️⃣ Analyzer
Definition: The main component of the text analysis process.
Role: Prepares text for search by running it through character filters, a tokenizer, and token filters.
Example: With the standard analyzer, the input "OpenSearch is awesome" → tokens: ["opensearch", "is", "awesome"]
3️⃣ Tokenizer
Definition: The part of the analyzer that splits text into individual tokens (words).
Example:
Input: "OpenSearch is awesome"
Tokenizer output: ["OpenSearch", "is", "awesome"]
Tokens also carry metadata such as position and character offsets, which are used for phrase search, highlighting, etc.
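The tokenizer step, including the metadata attached to each token, can be sketched in plain Python as follows. This is only an illustration under simple assumptions: the `Token` type and `tokenize` function are hypothetical names, and splitting on runs of word characters is a rough approximation of what a real tokenizer does.

```python
import re
from typing import NamedTuple

class Token(NamedTuple):
    text: str
    position: int      # ordinal position of the token in the stream
    start_offset: int  # character offset where the token starts
    end_offset: int    # character offset just past the token's end

def tokenize(text):
    # Treat each run of word characters as one token; case is left untouched.
    return [
        Token(m.group(), pos, m.start(), m.end())
        for pos, m in enumerate(re.finditer(r"\w+", text))
    ]

tokens = tokenize("OpenSearch is awesome")
for t in tokens:
    print(t.text, t.position, t.start_offset, t.end_offset)
# OpenSearch 0 0 10
# is 1 11 13
# awesome 2 14 21
```

Positions make phrase queries possible (is "is" directly followed by "awesome"?), while offsets tell a highlighter which characters of the original text to mark.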
4️⃣ Token Filter
Definition: Applied after the tokenizer; modifies, adds, or removes tokens.
Examples:
Lowercase conversion → "OpenSearch" → "opensearch"
Stopword removal → "is" is removed
Synonym addition → "TV" → "television"
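The three filter examples above can be chained in a small plain-Python sketch. The stop list, the synonym map, and all function names here are assumptions chosen for the demo, not OpenSearch defaults, and token positions are left out to keep it short.

```python
STOPWORDS = {"is", "a", "the"}      # hypothetical stop list
SYNONYMS = {"tv": ["television"]}   # hypothetical synonym map

def lowercase_filter(tokens):
    # Modify: lowercase every token.
    return [t.lower() for t in tokens]

def stop_filter(tokens):
    # Remove: drop tokens that appear in the stop list.
    return [t for t in tokens if t not in STOPWORDS]

def synonym_filter(tokens):
    # Add: keep each token and append its synonyms right after it.
    out = []
    for t in tokens:
        out.append(t)
        out.extend(SYNONYMS.get(t, []))
    return out

def apply_filters(tokens, filters):
    # Filters run in order; each one receives the previous one's output.
    for f in filters:
        tokens = f(tokens)
    return tokens

result = apply_filters(
    ["OpenSearch", "is", "awesome", "TV"],
    [lowercase_filter, stop_filter, synonym_filter],
)
print(result)  # ['opensearch', 'awesome', 'tv', 'television']
```

Order matters in this sketch: the stop list and synonym map hold lowercase entries, so the lowercase filter has to run first.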
5️⃣ Token
Definition: A small unit produced during the text analysis process.
Example:
"OpenSearch" is one token: a single unit with metadata such as position, offset, etc.
6️⃣ Term
Definition: The value stored directly in the inverted index and used for search matching.
Example: Token "opensearch" → Term "opensearch" is stored in the inverted index.
💡 Difference: Token = an intermediate unit during analysis; Term = the final unit stored in the index.
7️⃣ Character Filter
Definition: The first component of an analyzer; modifies the text before tokenization.
Example:
Remove HTML tags → <p>Hello</p> → "Hello"
Replace & → "and"
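Both character filter examples can be imitated with a few lines of plain Python. This is a rough sketch under stated assumptions: `html_strip` and `mapping_filter` are hypothetical helper names, and the regex only removes things that look like tags, which is far cruder than a real HTML-aware filter.

```python
import re

def html_strip(text):
    # Drop anything that looks like an HTML tag, e.g. <p> or </p>.
    return re.sub(r"<[^>]+>", "", text)

def mapping_filter(text, mappings):
    # Replace each source string with its mapped replacement.
    for src, dst in mappings.items():
        text = text.replace(src, dst)
    return text

print(html_strip("<p>Hello</p>"))  # Hello

cleaned = mapping_filter(html_strip("<p>Tom & Jerry</p>"), {"&": "and"})
print(cleaned)  # Tom and Jerry
```

Because these functions work on the raw string, they run before any token exists, which is exactly what separates a character filter from a token filter.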
8️⃣ Normalizer
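Definition: A special kind of analyzer that skips tokenization and applies only character-level operations, so the whole value stays a single token. It is typically used with keyword fields.
Example:
Lowercase normalizer → "OpenSearch" → "opensearch"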
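The effect of a normalizer on exact-match lookups can be sketched like this. The `normalize` and `exact_match` functions and the sample values are assumptions for illustration only; the key point is that the value is never split into words.

```python
def normalize(value):
    # No tokenization: the whole value is treated as one unit and lowercased.
    return value.lower()

# Hypothetical values stored in a keyword-style field.
stored = {normalize(v) for v in ["OpenSearch Dashboards", "Data Prepper"]}

def exact_match(query):
    # The query goes through the same normalizer before the lookup.
    return normalize(query) in stored

print(exact_match("OPENSEARCH dashboards"))  # True
print(exact_match("opensearch"))             # False
```

The second lookup fails because "opensearch" is only part of a stored value; with a full analyzer the value would have been split and the single word would match.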
9️⃣ Stemming
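Definition: Reducing words to their root/base form.
Example:
"running" → "run"
"cats" → "cat"
Purpose → Better matching: different forms of the same word ("run", "runs", "running") reduce to the same term, so a search for one form finds the others. This is not the same as synonym matching, which is handled by a synonym token filter.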
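A toy suffix-stripping stemmer is enough to reproduce the two examples above. This is deliberately not the Porter algorithm or any stemmer shipped with OpenSearch; `toy_stem` is a hypothetical function with hand-picked rules that will get many English words wrong.

```python
def toy_stem(word):
    # Rule 1: strip "ing" from longer words, then undouble a final consonant
    # ("runn" -> "run"); l, s and z are left doubled ("fall" stays "fall").
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
            stem = stem[:-1]
        return stem
    # Rule 2: strip a plural "s", but leave words ending in "ss" alone.
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

for w in ["running", "cats", "runs", "class"]:
    print(w, "->", toy_stem(w))
# running -> run
# cats -> cat
# runs -> run
# class -> class
```

Since "running" and "runs" both reduce to "run", a query for either form would look up the same term in the index.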
Flow Diagram (Text Analysis)
Raw text → Character Filter → Tokenizer → Token Filters → Tokens → Terms (inverted index)
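The flow above can be composed end to end in a short plain-Python sketch. Every function name here is an assumption for illustration, and each stage is a simplified stand-in (regex tag stripping, word-character tokenization, lowercasing only) rather than OpenSearch's actual behavior.

```python
import re

def char_filter(text):
    # Stage 1: clean the raw string (strip tags, replace "&" with "and").
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", "and")

def tokenizer(text):
    # Stage 2: split the cleaned string into tokens.
    return re.findall(r"\w+", text)

def token_filters(tokens):
    # Stage 3: transform the tokens (lowercase only in this sketch).
    return [t.lower() for t in tokens]

def analyze(text):
    # Raw text -> character filter -> tokenizer -> token filters -> terms.
    return token_filters(tokenizer(char_filter(text)))

terms = analyze("<p>OpenSearch is awesome & fast</p>")
print(terms)  # ['opensearch', 'is', 'awesome', 'and', 'fast']
```

The returned list is what would be written to the inverted index as terms.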
💡 Summary Table
| Component | Role | Example |
|---|---|---|
| Text Analysis | Split text → terms | "OpenSearch is awesome" → ["opensearch","is","awesome"] |
| Analyzer | Prepares text | Standard analyzer |
| Tokenizer | Splits text | "OpenSearch is awesome" → ["OpenSearch","is","awesome"] |
| Token Filter | Modifies, adds, or removes tokens | lowercase, stopword removal, synonyms |
| Token | Unit of text | "opensearch" |
| Term | Stored in index | "opensearch" |
| Character Filter | Modifies text before tokenization | Remove HTML, replace & → "and" |
| Normalizer | Character-level operation only | Lowercase only |
| Stemming | Reduce to root | "running" → "run" |