Phase 2: Embeddings & Semantic Search
Keyword search fails when words do not match exactly.
If a resume says "team management" and a job description asks for "leadership," a basic search returns zero results. The words are different, but the meaning is the same.
Phase 2 solves this by using embeddings and semantic search.
How it works:
• Tokenization: Models do not read words. They operate on numbers. A tokenizer splits text into small pieces called tokens and maps each one to an integer Token ID. Common words usually become a single token. Rare words are split into several tokens.
• Embeddings: A Token ID is just a label. The embedding layer turns that ID into a vector. A vector is a long list of numbers that represents meaning. Instead of one number, a model uses many dimensions to describe a concept. For search, an embedding model produces a single vector for a whole chunk of text.
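• Dimensions: Think of these numbers as coordinates. As an intuition, one dimension might represent "frontend vs backend." Another might represent "web vs systems." In real models, individual dimensions rarely have such clean labels, and meaning is spread across many of them. Together, these high-dimensional vectors allow the model to place "React" and "JavaScript" near each other in a mathematical space.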
• Semantic Search: When you ask a question, the system converts your question into a vector. It then compares your vector to the vectors of your stored documents.
• Cosine Similarity: This measures the cosine of the angle between two vectors. If the vectors point in nearly the same direction, the score is close to 1 and the texts are treated as similar. This allows the system to find "resignation requirements" even if you only searched for "notice period."
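The search and cosine similarity steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the three-dimensional vectors are hand-picked, hypothetical values standing in for real embeddings (which typically have hundreds or thousands of dimensions), and the names `DOCUMENTS`, `search`, and `cosine_similarity` are assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical stored documents with hand-made 3-dimensional vectors.
# A real system would get these vectors from an embedding model.
DOCUMENTS = {
    "resignation requirements": [0.9, 0.1, 0.2],
    "vacation policy": [0.2, 0.9, 0.1],
    "office parking": [0.1, 0.2, 0.9],
}


def search(query_vector, documents, top_k=2):
    """Score every document against the query and return the best matches."""
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for text, vector in documents.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:top_k]


# Pretend this is the embedding of the query "notice period".
query = [0.8, 0.2, 0.1]
results = search(query, DOCUMENTS)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

With these toy vectors, "resignation requirements" ranks first because its vector points in almost the same direction as the query, even though the two phrases share no words. This brute-force loop compares the query against every document, which is fine for a handful of vectors but is the part a vector database replaces at scale.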
Key lessons for production:
- Vector Databases: Comparing a query against millions of vectors one by one is slow. Databases like Pinecone or Qdrant build approximate nearest-neighbor indexes so the closest matches come back in milliseconds.
- Model Migrations: Every embedding model uses a different mathematical space. You cannot compare an OpenAI vector with a Cohere vector. If you change models, you must re-embed all your data.
- Cost vs ROI: Re-embedding millions of chunks is expensive. Companies often stay with older models unless the accuracy gain justifies the migration cost.
- Raw Text Storage: Always store your raw text chunks alongside the vectors. If you upgrade your model later, you can re-embed the stored text to create new vectors.
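The migration lesson can be sketched as follows. This is a rough illustration under stated assumptions: `toy_embed_v1` and `toy_embed_v2` are made-up stand-ins for two different embedding models (they are not real APIs and do not capture meaning), and the `Chunk` record and `migrate` helper are hypothetical names chosen for this example. The point it shows is that keeping the raw text and the model name next to each vector makes re-embedding possible.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chunk:
    text: str            # raw text, always kept so it can be re-embedded
    vector: List[float]  # embedding produced by `model`
    model: str           # which model produced the vector


def toy_embed_v1(text: str) -> List[float]:
    """Hypothetical 'old' model: 2 dimensions."""
    return [float(len(text)), float(text.count(" "))]


def toy_embed_v2(text: str) -> List[float]:
    """Hypothetical 'new' model: 3 dimensions, a different vector space."""
    vowels = sum(ch in "aeiou" for ch in text.lower())
    return [float(len(text)), float(text.count(" ")), float(vowels)]


def index_chunks(texts: List[str], embed: Callable[[str], List[float]], model: str) -> List[Chunk]:
    """Embed raw texts and store text, vector, and model name together."""
    return [Chunk(text=t, vector=embed(t), model=model) for t in texts]


def migrate(chunks: List[Chunk], embed: Callable[[str], List[float]], model: str) -> List[Chunk]:
    """Re-embed every chunk from its stored raw text using the new model."""
    return [Chunk(text=c.text, vector=embed(c.text), model=model) for c in chunks]


def comparable(a: Chunk, b: Chunk) -> bool:
    """Vectors are only comparable when the same model produced them."""
    return a.model == b.model


old_index = index_chunks(["notice period", "team management"], toy_embed_v1, "toy-v1")
new_index = migrate(old_index, toy_embed_v2, "toy-v2")
```

After the migration, every chunk in `new_index` carries a vector from the new model, while `comparable(old_index[0], new_index[0])` returns `False` because the two vectors live in different spaces. Without the stored `text` field, the `migrate` step would have nothing to work from.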
Phase 2 is where the intelligence happens.
Source: https://dev.to/surajrkhonde/phase-2-embeddings-semantic-search-3lco
