Analyzer, Character Filter, Tokenizer, Token Filters, Tokens, Terms


1️⃣ Text Analysis

  1. Definition: To search unstructured (free-form) text, OpenSearch first splits the text into terms and stores those terms in an inverted index.

  2. Purpose: To make search efficient: a query looks up terms in the index instead of scanning every document.

  3. Example:

Document: "OpenSearch is awesome"
After analysis → tokens: ["opensearch", "is", "awesome"]

These are stored in the inverted index.
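To make the idea of an inverted index concrete, here is a minimal Python sketch. It is not how OpenSearch implements its index internally; the `analyze` helper is a rough stand-in (whitespace split plus lowercasing) and the second document is made up for illustration.

```python
from collections import defaultdict

def analyze(text):
    """Rough stand-in for an analyzer: split on whitespace and lowercase."""
    return [word.lower() for word in text.split()]

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "OpenSearch is awesome",
    2: "Search is fast",  # hypothetical second document
}
inverted_index = build_inverted_index(docs)
print(inverted_index["is"])          # [1, 2]
print(inverted_index["opensearch"])  # [1]
```

A search for "is" becomes a single dictionary lookup that returns both document IDs, which is why the index makes search efficient.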


2️⃣ Analyzer

  1. Definition: The main component of the text analysis process. An analyzer combines character filters, a tokenizer, and token filters.

  2. Role: Prepares text for search.

  3. Example: With the standard analyzer, the input "OpenSearch is awesome" → tokens: ["opensearch","is","awesome"]
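One way to see what an analyzer produces is the `_analyze` API. The sketch below only builds the request body in Python. Sending it needs a running cluster, so the client call is left as a comment; the `opensearch-py` client and the `localhost:9200` address are assumptions, not something this post sets up.

```python
import json

# Request body for POST /_analyze: run the built-in standard analyzer on a string.
analyze_request = {
    "analyzer": "standard",
    "text": "OpenSearch is awesome",
}

print(json.dumps(analyze_request, indent=2))

# With a running cluster and the opensearch-py client (both assumed here),
# the request could be sent roughly like this:
#   from opensearchpy import OpenSearch
#   client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
#   response = client.indices.analyze(body=analyze_request)
#   print([t["token"] for t in response["tokens"]])
```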


3️⃣ Tokenizer

Definition: The part of the analyzer that splits text into individual tokens (words).

Example:

Input: "OpenSearch is awesome"
Tokenizer output: ["OpenSearch", "is", "awesome"]

Each token also carries position metadata, which is used for phrase search, highlighting, etc.
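The following sketch imitates a tokenizer in plain Python to show what the position metadata looks like. It is a simplification (it splits on runs of word characters) and not the actual standard tokenizer; the `simple_tokenize` function is a hypothetical helper, and the dictionary keys mirror the fields returned by the `_analyze` API.

```python
import re

def simple_tokenize(text):
    """Split on runs of word characters; record position and character offsets."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "token": match.group(),
            "position": position,           # ordinal position in the token stream
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index just past the last character
        })
    return tokens

tokens = simple_tokenize("OpenSearch is awesome")
for token in tokens:
    print(token)
```

Note that the tokenizer keeps the original casing ("OpenSearch"); lowercasing is the job of a token filter that runs afterwards.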


4️⃣ Token Filter

  1. Definition: Applied after the tokenizer; token filters modify, add, or remove tokens.

  2. Examples:

    Lowercase conversion: "OpenSearch" → "opensearch"

    Stopword removal: "is" is removed

    Synonym addition: "TV" → "television"
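The three filter types above can be sketched as small Python functions applied in sequence. This is only an illustration of the idea: the stopword list, the synonym map, and the sample tokens are made up, and the function names are hypothetical rather than OpenSearch APIs.

```python
STOPWORDS = {"is", "a", "the"}     # tiny illustrative stopword list
SYNONYMS = {"tv": ["television"]}  # illustrative synonym map

def lowercase_filter(tokens):
    """Modify tokens: convert each one to lowercase."""
    return [t.lower() for t in tokens]

def stop_filter(tokens):
    """Remove tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def synonym_filter(tokens):
    """Add tokens: emit each token followed by its synonyms, if any."""
    out = []
    for t in tokens:
        out.append(t)
        out.extend(SYNONYMS.get(t, []))
    return out

def apply_filters(tokens, filters):
    """Run the filters one after another, in the given order."""
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

tokens = ["OpenSearch", "is", "awesome", "on", "TV"]
result = apply_filters(tokens, [lowercase_filter, stop_filter, synonym_filter])
print(result)  # ['opensearch', 'awesome', 'on', 'tv', 'television']
```

Order matters here: the stopword and synonym lookups use lowercase keys, so the lowercase filter has to run first.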


5️⃣ Token

  1. Definition: A small unit produced during the text analysis process.

  2. Example: "OpenSearch" is one token; its metadata includes position, offset, etc.


6️⃣ Term

  1. Definition: The value stored directly in the inverted index and used for search matching.

  2. Example: Token "opensearch" → Term "opensearch", stored in the inverted index

💡 Difference: Token = intermediate unit of the analysis step, Term = final unit stored in the index


7️⃣ Character Filter

  1. Definition: The first component of an analyzer; it modifies the text before tokenization.

  2. Example:

    Remove HTML tags: <p>Hello</p> → "Hello"

    Replace & → "and"
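Both examples can be approximated in a few lines of Python. This is a rough sketch of the idea, not the behaviour of OpenSearch's built-in `html_strip` and `mapping` character filters; the regex is a simplification, and the helper names and sample string are made up.

```python
import re

def strip_html(text):
    """Drop anything that looks like an HTML tag (a rough approximation)."""
    return re.sub(r"<[^>]+>", "", text)

def map_chars(text, mapping):
    """Replace each key of `mapping` with its value."""
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text

raw = "<p>Salt & pepper</p>"
cleaned = map_chars(strip_html(raw), {"&": "and"})
print(cleaned)  # Salt and pepper
```

The output is still one string: character filters work on the raw text, and only afterwards does the tokenizer split it.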


8️⃣ Normalizer

  1. Definition: A special kind of analyzer that skips tokenization and performs only character-level operations, so the whole input stays a single token.

  2. Example:

    Lowercase normalizer: "OpenSearch" → "opensearch"
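A normalizer is typically attached to a `keyword` field. The sketch below builds an index-creation body as a Python dictionary; the normalizer name `lowercase_normalizer` and the field `product_name` are hypothetical names chosen for this example.

```python
import json

index_body = {
    "settings": {
        "analysis": {
            "normalizer": {
                # "lowercase_normalizer" is a made-up name for this example
                "lowercase_normalizer": {
                    "type": "custom",
                    "filter": ["lowercase"],  # no tokenizer is configured
                }
            }
        }
    },
    "mappings": {
        "properties": {
            # "product_name" is a hypothetical field
            "product_name": {
                "type": "keyword",
                "normalizer": "lowercase_normalizer",
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```

With a mapping like this, "OpenSearch" and "opensearch" would be indexed as the same single term for exact-match queries on that field.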


9️⃣ Stemming

  1. Definition: Reducing words to their root/base form.

  2. Example:

    1. "running" → "run"

    2. "cats" → "cat"

  3. Purpose → Search efficiency and matching different forms of the same word (a query for "run" also finds "running")
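To illustrate the idea, here is a toy suffix-stripping function in Python. It is deliberately naive and is not the algorithm used by OpenSearch's stemmer filters; the rules and the `toy_stem` name are assumptions made for this example and will mis-stem many real words.

```python
VOWELS = set("aeiou")

def toy_stem(word):
    """Strip a few common English suffixes. Not a real stemming algorithm."""
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        # Undo consonant doubling: "runn" -> "run"
        if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in VOWELS:
            stem = stem[:-1]
        return stem
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

for word in ["running", "cats", "run", "class"]:
    print(word, "->", toy_stem(word))
# running -> run
# cats -> cat
# run -> run
# class -> class
```

Because "running" and "run" reduce to the same term, a query for either form can match documents containing the other.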


Flow Diagram (Text Analysis)

Raw text → Character Filter → Tokenizer → Token Filters → Tokens → Terms (inverted index)
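The same flow can be written down as a custom analyzer definition. The sketch below builds the index settings as a Python dictionary using the built-in `html_strip` character filter, `standard` tokenizer, and `lowercase` and `stop` token filters; the analyzer name `my_text_analyzer` is hypothetical.

```python
import json

index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                # "my_text_analyzer" is a made-up name for this example
                "my_text_analyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip"],    # 1. character filter
                    "tokenizer": "standard",          # 2. tokenizer
                    "filter": ["lowercase", "stop"],  # 3. token filters, applied in order
                }
            }
        }
    }
}

print(json.dumps(index_settings, indent=2))
```

The three keys map directly onto the first three stages of the diagram; the tokens that come out of the last filter are what get stored as terms.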

💡 Summary Table

| Component | Role | Example |
| --- | --- | --- |
| Text Analysis | Split text → terms | "OpenSearch is awesome" → ["opensearch","is","awesome"] |
| Analyzer | Prepares text | Standard analyzer |
| Tokenizer | Splits text | "OpenSearch is awesome" → ["OpenSearch","is","awesome"] |
| Token Filter | Modify tokens | lowercase, stopword removal |
| Token | Unit of text | "opensearch" |
| Term | Stored in index | "opensearch" |
| Character Filter | Pre-tokenization modify | Remove HTML, replace & → "and" |
| Normalizer | Character-level operation only | Lowercase only |
| Stemming | Reduce to root | "running" → "run" |


 
