Metadata-Version: 2.4
Name: justembed
Version: 0.1.1a4
Summary: Your First Step Into Semantic Search. Experience embeddings hands-on with no cloud accounts required.
Author-email: Krishnamoorthy Sankaran <krishnamoorthy.sankaran@sekrad.org>
License: MIT
Project-URL: Homepage, https://github.com/sekarkrishna/justembed
Project-URL: Repository, https://github.com/sekarkrishna/justembed
Project-URL: Issues, https://github.com/sekarkrishna/justembed/issues
Keywords: semantic-search,embeddings,offline,onnx,nlp,justembed,lens
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: fastapi>=0.104.0
Requires-Dist: uvicorn[standard]>=0.24.0
Requires-Dist: jinja2>=3.1.0
Requires-Dist: python-multipart>=0.0.6
Requires-Dist: onnxruntime>=1.15.0
Requires-Dist: tokenizers>=0.13.0
Requires-Dist: numpy<2.0.0,>=1.20.0
Requires-Dist: duckdb>=0.9.0
Requires-Dist: psutil>=5.9.0
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: skl2onnx>=1.15.0
Requires-Dist: onnx<1.19.0,>=1.14.0
Requires-Dist: ml-dtypes<0.5.0,>=0.4.0
Requires-Dist: tqdm>=4.65.0
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: httpx>=0.25.0; extra == "dev"
Dynamic: license-file

# JustEmbed

**Your First Step Into Semantic Search**

Experience embeddings hands-on. No cloud accounts, no setup complexity, no commitment. Just your laptop and your curiosity.

[![PyPI version](https://badge.fury.io/py/justembed.svg)](https://pypi.org/project/justembed/)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

**Author**: Krishnamoorthy Sankaran  
**Email**: krishnamoorthy.sankaran@sekrad.org  
**GitHub**: https://github.com/sekarkrishna/justembed  
**PyPI**: https://pypi.org/project/justembed/

---

## What is JustEmbed?

JustEmbed is a focused tool for **semantic search**: finding text by its meaning rather than by matching keywords. It's designed as your entry point into the embedding ecosystem, letting you experience how semantic search works before committing to cloud platforms or production tools.

### For Non-Technical Users

Upload your documents through a web interface and search by meaning. No coding required, no technical knowledge needed. See exactly how your text is processed and understand what's happening at each step.

### For Developers

JustEmbed provides a simple Python API (`import justembed as je`) that lets you experiment with embeddings locally. Build confidence with semantic search concepts before moving to production vector databases like Pinecone, Weaviate, or Qdrant.

---

## Quick Start

### Installation

```bash
pip install justembed
```

### Web Interface

```bash
justembed begin --workspace ~/my_documents
```

Open http://localhost:5424 in your browser.

### Python API

```python
import justembed as je

je.begin(workspace="~/docs")
je.create_kb("my_kb")
je.add(kb="my_kb", file="document.txt")
results = je.query("search term", kb="my_kb")
```

---

## Understanding Semantic Search

Traditional keyword search looks for exact word matches. Semantic search compares meaning, so it can find relevant text that shares no words with your query.

**Example**: Imagine a document with these paragraphs:

1. "Volcanoes erupt with molten lava at temperatures exceeding 1000°C..."
2. "Industrial smelting uses high-temperature furnaces above 800°C..."
3. "Igloos are dome-shaped shelters built from compressed snow..."
4. "Icebergs float in cold ocean waters at sub-zero temperatures..."

Search for **"hot"**:
- Traditional search: no results (the word "hot" doesn't appear anywhere)
- Semantic search: returns paragraphs 1 and 2 (it relates "hot" to heat and high temperatures)

This is what JustEmbed lets you experience.
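
The keyword half of this comparison is easy to reproduce. The snippet below is a minimal sketch using only the standard library; `keyword_search` is a hypothetical helper written for this illustration and is not part of JustEmbed.

```python
import re

paragraphs = [
    "Volcanoes erupt with molten lava at temperatures exceeding 1000°C...",
    "Industrial smelting uses high-temperature furnaces above 800°C...",
    "Igloos are dome-shaped shelters built from compressed snow...",
    "Icebergs float in cold ocean waters at sub-zero temperatures...",
]

def keyword_search(query, docs):
    """Return the 1-based numbers of the docs that contain the query as a whole word."""
    pattern = re.compile(rf"\b{re.escape(query)}\b", re.IGNORECASE)
    return [i for i, doc in enumerate(docs, start=1) if pattern.search(doc)]

# No paragraph contains the literal word "hot", so nothing matches.
print(keyword_search("hot", paragraphs))
# "cold" appears only in the iceberg paragraph.
print(keyword_search("cold", paragraphs))
```

A literal match can only find words that are actually on the page, which is the gap embeddings are meant to close.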

---

## Core Concepts

### 1. Chunking
Documents are broken into smaller pieces (chunks) for efficient searching. JustEmbed's UI shows you exactly how your text will be chunked before processing.
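
To make the idea concrete, here is a minimal sketch of rule-based chunking. It is not JustEmbed's actual algorithm: it assumes paragraphs are separated by blank lines and counts whitespace-separated words as "tokens". The names `max_tokens` and `merge_threshold` are borrowed from the `je.add()` options, but how JustEmbed interprets them may differ.

```python
import re

def chunk_text(text, max_tokens=300, merge_threshold=50):
    """Split text into chunks on blank lines, treating each word as one token."""
    paragraphs = [p.split() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks = []
    for words in paragraphs:
        # Merge a short paragraph into the previous chunk when the result still fits.
        if (chunks and len(words) < merge_threshold
                and len(chunks[-1]) + len(words) <= max_tokens):
            chunks[-1] = chunks[-1] + words
            continue
        # Split an oversized paragraph into consecutive windows of max_tokens words.
        for start in range(0, len(words), max_tokens):
            chunks.append(words[start:start + max_tokens])
    return [" ".join(chunk) for chunk in chunks]

document = (
    "Volcanoes erupt with molten lava.\n\n"
    "Lava cools.\n\n"
    "Igloos are built from compressed snow."
)
for number, chunk in enumerate(chunk_text(document, max_tokens=8, merge_threshold=3), start=1):
    print(number, chunk)
```

Because the rules involve no randomness, running the function twice on the same text yields the same chunks.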

### 2. Embedding
Each chunk is converted to a list of numbers (an embedding) that represents its meaning. Chunks with similar meanings get embeddings that lie close together.

### 3. Searching
When you search, your query is converted to an embedding and compared to all chunk embeddings. Results are ranked by a similarity score from 0.0 to 1.0, where a higher score means a closer match.
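
The comparison step can be sketched with cosine similarity, one common way to compare embeddings. This README does not state which measure JustEmbed uses or how it maps scores to its range, so treat the snippet as an illustration only. The three-dimensional vectors are hand-made stand-ins; real embeddings have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank(query_vec, chunk_vecs, top_k=3):
    """Return the top_k (score, name) pairs, highest score first."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in chunk_vecs.items()]
    return sorted(scored, reverse=True)[:top_k]

# Hypothetical toy embeddings, chosen by hand for this example.
chunk_vecs = {
    "volcano": [0.9, 0.1, 0.2],
    "smelting": [0.8, 0.2, 0.1],
    "igloo": [0.1, 0.9, 0.2],
    "iceberg": [0.1, 0.8, 0.3],
}
query_vec = [1.0, 0.0, 0.1]  # stands in for the query "hot"

for score, name in rank(query_vec, chunk_vecs, top_k=2):
    print(f"{name}: {score:.3f}")
```

With these toy vectors, the two heat-related chunks rank first because their vectors point in nearly the same direction as the query.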

---

## Complete API Reference

### Workspace Management

```python
# Start workspace
je.begin(workspace="~/my_docs", port=5424)

# Register existing workspace
je.register_workspace("~/shared_workspace")

# List workspaces
workspaces = je.list_workspaces()

# Deregister (data stays on disk)
je.deregister_workspace("~/old_workspace", confirm=True)

# Stop server
je.terminate()
```

### Knowledge Bases

```python
# Create with default model
je.create_kb("general_kb")

# Create with custom model
je.create_kb("medical_kb", model_type="custom", model_name="medical_v1")

# List all KBs
kbs = je.list_kbs()

# Delete KB
je.delete_kb("old_kb")
```

### Adding Documents

```python
# From file
je.add(kb="my_kb", file="document.txt")

# From text
je.add(kb="my_kb", text="Your content...", filename="custom.txt")

# With chunking options
je.add(
    kb="my_kb",
    file="document.txt",
    max_tokens=300,
    merge_threshold=50,
    split_by_headings=True,
    split_by_paragraphs=True
)
```

### Searching

```python
# Basic search
results = je.query("search term", kb="my_kb")

# Search all KBs
results = je.query("search term", kb="all")

# Advanced options
results = je.query(
    query="search term",
    kb="my_kb",
    top_k=10,
    mode="retrieve"  # or "count"
)

# Results structure
for result in results:
    print(f"Score: {result['score']:.3f}")
    print(f"Text: {result['text']}")
    print(f"File: {result['file']}")
    print(f"KB: {result['kb']}")
```
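
Since each result is a dictionary with `score`, `text`, `file`, and `kb` keys, ordinary Python is enough for post-processing. The sketch below drops weak matches; `filter_results` and the 0.7 cutoff are assumptions made for this example, and the sample list is hand-written data in the documented shape, not real JustEmbed output.

```python
def filter_results(results, min_score=0.7):
    """Keep results at or above min_score, highest score first."""
    kept = [r for r in results if r["score"] >= min_score]
    return sorted(kept, key=lambda r: r["score"], reverse=True)

# Hypothetical sample data shaped like the results shown above.
sample = [
    {"score": 0.55, "text": "The igloo was decorated with fireworks...",
     "file": "notes.txt", "kb": "my_kb"},
    {"score": 0.82, "text": "Volcanoes erupt with molten lava...",
     "file": "geology.txt", "kb": "my_kb"},
]

for result in filter_results(sample):
    print(f"{result['score']:.2f} {result['file']}")
```

A sensible cutoff depends on your documents and model, so it is worth inspecting a few queries before settling on a value.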

### Custom Model Training

```python
# Train from file
je.train_model(
    model_name="medical_v1",
    file="medical_textbook.txt",
    embedding_dim=128,
    max_features=5000
)

# Train from text
je.train_model(
    model_name="legal_v1",
    text="Your training corpus...",
    embedding_dim=128
)

# List models
models = je.list_models()
```

---

## Key Features

### Domain-Specific Models

Train models that understand your domain's vocabulary:

```python
# Medical domain
medical_text = """
Pyrexia, commonly known as fever, is elevated body temperature.
Renal function refers to kidney performance.
A UTI affects the bladder and kidneys.
"""

je.train_model("medical_v1", text=medical_text)
je.create_kb("medical_kb", model_type="custom", model_name="medical_v1")

# After you add documents to medical_kb, a query for "fever" can match "pyrexia" and "kidney" can match "renal"
```

### Multiple Knowledge Bases

Organize by topic, each with its own model:

```python
je.create_kb("medical_kb", model_type="custom", model_name="medical_v1")
je.create_kb("legal_kb", model_type="custom", model_name="legal_v1")
je.create_kb("general_kb")  # Uses default E5-Small model
```

### Workspace Sharing

Share by zipping the workspace folder:

```python
# Create and populate
je.begin(workspace="~/shared_kb")
je.create_kb("team_kb")
je.add(kb="team_kb", file="docs.txt")

# Zip ~/shared_kb and share

# Recipient unzips to ~/received_kb, then registers and uses it
je.register_workspace("~/received_kb")
je.begin(workspace="~/received_kb")
results = je.query("search", kb="team_kb")
```
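
The zip and unzip steps can be done with Python's standard library. This is a minimal sketch: `zip_workspace` and `unzip_workspace` are hypothetical helpers, not part of the JustEmbed API. It is an assumption, but probably a safe one, that the server should be stopped with `je.terminate()` before archiving so that no database files are mid-write.

```python
import shutil
from pathlib import Path

def zip_workspace(workspace, archive_base):
    """Create <archive_base>.zip from the workspace folder and return the archive path."""
    root = Path(workspace).expanduser()
    base = Path(archive_base).expanduser()
    return shutil.make_archive(str(base), "zip", root_dir=str(root))

def unzip_workspace(archive, target):
    """Extract a shared archive into target, creating the folder if needed."""
    target_dir = Path(target).expanduser()
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(archive), str(target_dir))
    return target_dir

# Sender (example paths):
#   zip_workspace("~/shared_kb", "~/shared_kb_export")
# Recipient:
#   unzip_workspace("~/Downloads/shared_kb_export.zip", "~/received_kb")
```

After extracting, the recipient can call `je.register_workspace()` on the target folder as shown above.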

---

## Architecture

```
User Interface (Web UI / Python API)
           ↓
    FastAPI Server
           ↓
Embedder Layer (E5-Small / Custom Models)
           ↓
Storage Layer (DuckDB / File System)
```

### Design Decisions

**Offline-First**: Everything runs locally. No API keys, no cloud dependencies, and no internet connection needed after installation.

**ONNX Models**: Portable, CPU-friendly, small size (~8-15 MB). Works on any platform.

**DuckDB Storage**: Embedded database, no separate server. Fast columnar storage.

**Deterministic Chunking**: Rule-based, predictable. Same input always produces same chunks.

**Privacy**: Your data never leaves your machine. No telemetry, no tracking.

---

## Understanding Limitations

### Context Matters

**Example**: "The igloo was decorated with fireworks for the winter celebration."

Searching for "hot" might return this (score: 0.5-0.6) because "fireworks" associates with heat.

**What this reveals**: Embeddings capture word associations, not deep understanding. Production systems use larger context windows, attention mechanisms, and re-ranking.

### Domain Specificity

Without domain training, "fever" scores similarly whether it appears in a medical or a financial context. Custom models learn domain-specific meanings.

**What this shows**: Why domain-specific training matters. Production systems use massive pre-training and fine-tuning.

### No Generation

JustEmbed finds similar text. It doesn't generate new text, answer questions, or summarize.

**What this demonstrates**: Embeddings are one component. Full LLMs combine embeddings, generation, reasoning, memory, and tools.

### Scale

JustEmbed is designed for 1-1000 documents, 1-10 queries per second, and a single user.

**What this illustrates**: Production systems handle millions of documents, thousands of queries per second, and many concurrent users.

---

## The Complete Picture

JustEmbed focuses on the **embedding layer**, the foundation of semantic search. As a rough estimate, this is about 2-3% of what full LLM systems provide.

### What JustEmbed Covers
- Text chunking
- Embedding generation
- Vector similarity search
- Basic model training

### What Production Systems Add
- Massive pre-training (billions of parameters)
- Text generation
- Reasoning and inference
- Long context windows (100K+ tokens)
- Memory and conversation history
- Safety and alignment
- Optimization (quantization, distillation)
- Distributed infrastructure
- Tool integration
- Multimodal understanding

After experiencing JustEmbed, you'll appreciate the engineering behind systems like GPT-4, Claude, or Gemini.

---

## JustEmbed vs Production Tools

| Feature | JustEmbed | Vector DBs | Full LLMs |
|---------|-----------|------------|-----------|
| **Purpose** | Learn embeddings | Production search | Complete AI |
| **Setup** | `pip install` | Cloud account | API keys |
| **Cost** | Free | $70-500/mo | $0.002-0.06/1K tokens |
| **Scale** | 1-1K docs | Millions | Unlimited |
| **Speed** | <100ms | <10ms | <1s |
| **Offline** | ✅ Yes | ❌ No | ❌ No |
| **Privacy** | ✅ Local | ⚠️ Cloud | ⚠️ Cloud |
| **Learning** | Gentle | Moderate | Steep |
| **Generation** | ❌ No | ❌ No | ✅ Yes |

---

### When to Use What

**Use JustEmbed**:
- Learning about embeddings
- Small collections (1-1000 docs)
- Privacy-critical applications
- Offline environments
- Quick prototypes
- Building confidence

**Graduate to Vector DBs**:
- Scaling beyond 1000 docs
- Production reliability
- Sub-10ms latency
- Team collaboration
- Advanced features

**Move to Full LLMs**:
- Need text generation
- Require reasoning
- Conversational AI
- Multi-modal applications

---

## Requirements

- Python 3.8+
- 500 MB disk space
- 1 GB RAM
- CPU (no GPU required)
- No internet connection needed after installation

---

## Guarantees

**Technical**:
- Deterministic (same input → same output)
- No hallucinations (only returns your text)
- Offline (works without internet)
- Private (data never leaves your machine)
- No tracking or telemetry

**File System**:
- Writes only to workspace and `~/.cache/justembed/`
- Reads only files you upload
- Never deletes files outside workspace

---

## License

MIT License

---

## Author

**Krishnamoorthy Sankaran**

- Email: krishnamoorthy.sankaran@sekrad.org
- GitHub: https://github.com/sekarkrishna/justembed
- PyPI: https://pypi.org/project/justembed/

---

## Support

- Issues: https://github.com/sekarkrishna/justembed/issues
- Discussions: https://github.com/sekarkrishna/justembed/discussions
- Email: krishnamoorthy.sankaran@sekrad.org

---

## Citation

```bibtex
@software{justembed2026,
  title = {JustEmbed: Your First Step Into Semantic Search},
  author = {Sankaran, Krishnamoorthy},
  year = {2026},
  url = {https://github.com/sekarkrishna/justembed}
}
```

---

## Acknowledgments

- E5-Small model: Microsoft Research
- ONNX Runtime: Microsoft
- FastAPI: Sebastián Ramírez
- DuckDB: DuckDB Labs
- scikit-learn: scikit-learn developers

---

**JustEmbed** - Start here. Build confidence. Graduate to production tools when ready.
