Apply for a CPU & Token community grant: Community project

#1
by jsdosanj - opened

The Mission

Five hundred years of Sikh history, philosophy, and scripture — 758 million words spanning sacred Gurbani, historical manuscripts, classical ਸਟੀਕ commentaries, ਸਾਹਿਤ literature, and peer-reviewed academic ਖੋਜ research — has never been accessible through a single verified, citable scholarly interface. Traditional search engines return unverified blog posts. PDF repositories require institutional access. Ancient Gurmukhi and Shahmukhi ligatures defeat standard OCR and keyword search entirely.

This project is a non-commercial, open-science, open-access ਸੇਵਾ (selfless service) effort to build a high-integrity Retrieval-Augmented Generation (RAG) pipeline that gives the global Sikh diaspora, academic researchers, university students, and ਪ੍ਰਚਾਰਕ parchariks the kind of access to sources that was previously available only to those with institutional library access and decades of scholarly training.

The underlying dataset, jsdosanj/SikhLibrary, is gated — and has already received 4 access requests from external users who found the dataset independently and want to build on it. This is unsolicited early evidence of demand from the research community before the tool has even been widely announced.


What Is Already Built and Fully Working

This is not a prototype, a proof-of-concept, or a grant application for funding to begin development. The system is live, production-grade, and serving real users today at jsdosanj/SikhLibrarian.

Corpus & Index

  • 331,771 semantic chunks ingested, embedded, and indexed from the complete jsdosanj/SikhLibrary dataset
  • 758 million+ words across 5 scholarly categories:
    • ਗੁਰਬਾਣੀ (Gurbani) — sacred scripture
    • ਗ੍ਰੰਥ (Granths) — historical canonical texts
    • ਸਟੀਕ (Steeks) — classical scholarly commentaries
    • ਸਾਹਿਤ (Literature) — Sikh historical and devotional literature
    • ਖੋਜ (Research) — peer-reviewed academic scholarship
  • FAISS IndexFlatIP vector index (931 MB) — inner-product search with normalised embeddings, pre-built and stored in jsdosanj/SikhLibrarian-storage, loaded at startup with zero rebuild time
  • SQLite document store (5.4 GB): indexed random-access retrieval of full chunk text by integer ID (a B-tree key lookup, O(log n) rather than a scan), enabling instant source excerpt display without scanning the full corpus
  • meta.json metadata store — per-chunk category, display name, source file, and chunk index for citation generation
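
The batch fetch by integer ID can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the production code: the table name `chunks`, its columns, and the helper `fetch_chunks` are assumptions, since the real schema of the store is not described here.

```python
import sqlite3

def fetch_chunks(conn, chunk_ids):
    """Fetch chunk texts for a list of integer IDs in a single query."""
    if not chunk_ids:
        return {}
    # One "?" placeholder per ID keeps the query parameterised.
    placeholders = ",".join("?" for _ in chunk_ids)
    rows = conn.execute(
        f"SELECT id, text FROM chunks WHERE id IN ({placeholders})",
        list(chunk_ids),
    ).fetchall()
    return dict(rows)  # {chunk_id: chunk_text}

# Hypothetical schema: an INTEGER PRIMARY KEY gives an indexed lookup per ID.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, text TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO chunks (id, text) VALUES (?, ?)",
    [(1, "first chunk"), (2, "second chunk"), (3, "third chunk")],
)
texts = fetch_chunks(conn, [3, 1])
```

SQLite caps the number of bound parameters per statement, so a very large ID list would need to be split into several queries.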

Application Stack (app.py — production)

Embedding model : sentence-transformers/all-MiniLM-L6-v2 (90 MB, CPU)
Vector search   : FAISS IndexFlatIP (cosine similarity via normalised IP)
Doc retrieval   : SQLite — batch fetch by integer ID
LLM             : Qwen/Qwen2.5-72B-Instruct via HF Inference API (external)
Framework       : Gradio 6 with queue (max_size=20, concurrency_limit=1 for LLM)
Citation format : Chicago Manual of Style 17th edition
Hallucination guard : ਅੰਗ reference cross-verification against retrieved context
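
Why an inner-product index yields cosine similarity once embeddings are normalised can be checked with a few lines of plain Python. The vectors below are made-up three-dimensional examples, not real 384-dimensional embeddings, and the helper names are illustrative only.

```python
import math

def l2_normalise(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

query = [1.0, 2.0, 2.0]   # made-up example vectors
doc = [2.0, 0.0, 1.0]

# Inner product of the unit-length vectors ...
ip_score = inner_product(l2_normalise(query), l2_normalise(doc))
# ... equals cosine similarity of the original vectors.
cos_score = cosine_similarity(query, doc)
```

Because the two scores agree, normalising once at indexing time and once per query lets a plain inner-product search rank by cosine similarity.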

Research Mode (No LLM — Instant, Always Free)

Research mode runs entirely on Space hardware with zero external API calls:

  1. User query is sanitised (Gurmukhi Unicode preserved, injection patterns stripped)
  2. Query embedded via all-MiniLM-L6-v2 → 384-dimensional float32 vector
  3. FAISS searches 331,771 vectors for top-500 candidates
  4. Results filtered by relevance threshold (0.45) and optional category
  5. Sources grouped by file, ranked by maximum semantic score
  6. Best-scoring chunk per source fetched from SQLite in a single batch query
  7. Chicago Manual of Style 17th ed. citation generated per source with relevance %, section numbers, and 80-word source excerpt
  8. Plain-text export file generated for download into academic papers
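
Steps 4 to 6 (threshold filter, optional category filter, grouping by source, ranking by best score) can be sketched as follows. The function `rank_sources`, the metadata keys `source` and `category`, and the sample data are assumptions for illustration; only the 0.45 threshold comes from the description above.

```python
RELEVANCE_THRESHOLD = 0.45  # threshold stated in step 4

def rank_sources(hits, meta, category=None, threshold=RELEVANCE_THRESHOLD):
    """Filter hits, group them by source file, and rank sources by best score.

    hits: iterable of (chunk_id, score) pairs from the vector search.
    meta: dict mapping chunk_id -> {"source": str, "category": str}.
    Returns a list of (source, best_chunk_id, best_score), best first.
    """
    best = {}
    for chunk_id, score in hits:
        info = meta[chunk_id]
        if score < threshold:
            continue
        if category is not None and info["category"] != category:
            continue
        source = info["source"]
        # Keep only the highest-scoring chunk for each source file.
        if source not in best or score > best[source][1]:
            best[source] = (chunk_id, score)
    ranked = sorted(best.items(), key=lambda item: item[1][1], reverse=True)
    return [(source, cid, score) for source, (cid, score) in ranked]

# Hypothetical metadata and search hits, for illustration only.
meta = {
    10: {"source": "a.txt", "category": "Gurbani"},
    11: {"source": "a.txt", "category": "Gurbani"},
    20: {"source": "b.txt", "category": "Research"},
    30: {"source": "c.txt", "category": "Research"},
}
hits = [(10, 0.62), (11, 0.71), (20, 0.80), (30, 0.30)]
ranked = rank_sources(hits, meta)
research_only = rank_sources(hits, meta, category="Research")
```

The best chunk IDs returned here are what a single batch query to the document store would then fetch.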

This entire pipeline completes in under 500ms on the current 2 vCPU hardware for a single user. It is permanently free, requires no inference credits, and is the primary mode of use for the scholarly audience this tool serves.

Learn Mode (Qwen2.5-72B with Hallucination Verification)

Learn mode adds LLM synthesis on top of the retrieval pipeline:

  1. FAISS retrieves top-5 highest-relevance chunks (configurable)
  2. Full chunk texts fetched from SQLite and trimmed to 1,400 words each
  3. A structured system prompt assembles retrieved context + Chicago citation metadata
  4. Qwen2.5-72B-Instruct streams a PhD-level scholarly response in English, with all theological terms in Punjabi Unicode (ਸੰਤ-ਸਿਪਾਹੀ, ਨਾਮ ਸਿਮਰਨ, ਵਾਹਿਗੁਰੂ — never romanised transliterations)
  5. Post-generation, every ਅੰਗ reference in the response is cross-verified against the retrieved context — unverified citations are struck through with a warning before display
  6. Multi-turn conversation history (last 6 turns) is preserved for follow-up questions
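
The cross-verification in step 5 can be approximated with a regular expression. This is a sketch under stated assumptions: the citation shape (the word ਅੰਗ or "Ang" followed by a number), the `~~…~~` strike-through markup, and the `[unverified]` warning text are all hypothetical stand-ins for whatever the application actually uses.

```python
import re

# Assumed citation shape: the word ਅੰਗ (or "Ang") followed by a page number.
ANG_PATTERN = re.compile(r"(?:ਅੰਗ|\bAng)\s+(\d+)")

def ang_numbers(text):
    """Return the set of ਅੰਗ numbers cited in a piece of text."""
    return {int(m.group(1)) for m in ANG_PATTERN.finditer(text)}

def flag_unverified(response, context):
    """Strike through any ਅੰਗ reference that does not appear in the context."""
    allowed = ang_numbers(context)
    unverified = []

    def replace(match):
        number = int(match.group(1))
        if number in allowed:
            return match.group(0)          # verified: leave untouched
        unverified.append(number)
        return f"~~{match.group(0)}~~ [unverified]"

    return ANG_PATTERN.sub(replace, response), unverified

# Made-up context and response, for illustration only.
context = "ਅੰਗ 1 opens with the Mool Mantar. See also Ang 8."
response = "The Mool Mantar appears on ਅੰਗ 1, and the theme returns on ਅੰਗ 999."
checked, unverified = flag_unverified(response, context)
```

A check of this kind only confirms that a cited number appears in the retrieved context; it does not confirm that the surrounding claim is accurate.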

The LLM runs on HF's Inference API servers, not the Space hardware. The Space only handles embedding and retrieval. This is the architectural decision that makes CPU Upgrade — not GPU — the correct hardware choice.

Engineering Hardening Already Implemented

  • Input sanitisation: Gurmukhi Unicode (U+0A00–U+0A7F), Devanagari (U+0900–U+097F), and Latin Extended (U+00C0–U+024F) preserved; all other non-standard characters stripped; max 500 characters enforced
  • Retry logic with backoff: 402 (billing exhausted) detected and never retried; 429/503/502/timeout retried up to 3× with 5s/15s/30s delays; exceptions raised during post-stream cleanup are caught so that completed responses are kept rather than discarded
  • Gradio queue: max_size=20, default_concurrency_limit=1 for LLM requests; Research mode is unaffected and runs concurrently
  • Background index loading: FAISS index and SQLite store load in a daemon thread — the UI is available immediately at startup, with a clear loading status shown to users
  • Graceful degradation: If SQLite is unavailable, Research mode continues without excerpts; if LLM credits are exhausted, a clear message directs users to Research mode
  • No traceback leakage: All exceptions caught and mapped to user-friendly messages; server-side errors logged only to Space logs
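
Two of the items above, input sanitisation and retry with backoff, can be sketched in a few lines. This is an illustration rather than the code in app.py: `sanitise_query`, `call_with_retry`, and the `ApiError` type are hypothetical names, and keeping printable ASCII alongside the three listed Unicode ranges is an assumption.

```python
import re
import time

MAX_QUERY_CHARS = 500
# Keep printable ASCII (assumed) plus Latin Extended, Devanagari, and Gurmukhi.
_DISALLOWED = re.compile(r"[^\x20-\x7E\u00C0-\u024F\u0900-\u097F\u0A00-\u0A7F]")

def sanitise_query(raw):
    """Collapse whitespace, strip disallowed characters, enforce the length cap."""
    cleaned = " ".join(raw.split())
    cleaned = _DISALLOWED.sub("", cleaned)
    cleaned = " ".join(cleaned.split())
    return cleaned[:MAX_QUERY_CHARS]

RETRY_DELAYS = (5, 15, 30)      # seconds before retries 1, 2, and 3
RETRYABLE = {429, 502, 503}     # 402 is deliberately absent: never retried

class ApiError(Exception):
    """Hypothetical error type carrying an HTTP status code."""
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status

def call_with_retry(fn, sleep=time.sleep):
    """Call fn(); retry transient failures up to three times with backoff."""
    for delay in RETRY_DELAYS + (None,):
        try:
            return fn()
        except ApiError as err:
            if err.status not in RETRYABLE or delay is None:
                raise
        except TimeoutError:
            if delay is None:
                raise
        sleep(delay)
```

Passing `sleep` as a parameter keeps the backoff schedule testable without real waiting.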

Dependencies (requirements.txt)

sentence-transformers>=2.7.0,<4.0
faiss-cpu>=1.8.0,<2.0
huggingface_hub>=0.23.0,<2.0
numpy>=1.26.0,<2.0

Gradio is provided by the HF Spaces runtime at the version pinned in README.md (sdk_version). The total install footprint is minimal — the heaviest dependency is the sentence-transformer model at 90 MB, downloaded once and cached.


Current Hardware Constraint and Why It Is the Bottleneck

The Space currently runs on CPU Basic: 2 vCPU / 16 GB RAM (free tier).

Every user query on Research mode requires:

Operation                                       Duration   CPU cores consumed
Sentence-transformer tokenisation + inference   ~130ms     1 vCPU
FAISS IndexFlatIP search (331K vectors)         ~20ms      1 vCPU
SQLite batch fetch + snippet generation         ~5ms       I/O-bound
Gradio response serialisation                   ~5ms       minimal
─────────────────────────────────────────────────────────────────────
Total per request                               ~160ms     ~1 vCPU

With 2 vCPUs and ~160ms of CPU time per request, maximum throughput is approximately 12 requests/second (2 ÷ 0.160 s ≈ 12.5). Under Python's GIL, realistic concurrent throughput is closer to 6–8 requests/second.

At 50 concurrent users sending queries every 30 seconds, the arrival rate is 1.67 req/s, well within capacity. But real usage is bursty: a single share in a Sikh diaspora WhatsApp group or Reddit post can send dozens of users to the tool at once. A 30-person burst all hitting Send within 5 seconds is a 6 req/s arrival rate, which saturates the 2-vCPU system and produces 3–8 second queue waits for every subsequent user.

At that point, users leave. The tool fails its mission.


Why CPU Upgrade (8 vCPU / 32 GB RAM) Is the Correct Solution

Throughput Analysis

8 vCPUs × GIL efficiency (~0.55 for CPU-bound Python) × (1 / 0.160s)
= ~27 requests/second sustained throughput

100 concurrent users, 1 query per 30s = 3.3 req/s arriving
→ 12% utilisation → average queue wait < 50ms ✓

300 concurrent users, 1 query per 30s = 10 req/s arriving
→ 37% utilisation → average queue wait < 200ms ✓

500 concurrent users, 1 query per 30s = 16.7 req/s arriving
→ 62% utilisation → average queue wait < 500ms ✓

100–300 simultaneous users: fully handled with sub-500ms Research mode response times.
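
The throughput figures above follow from simple arithmetic, reproduced here as a small script. The function names are illustrative; the inputs (8 vCPUs, ~0.55 GIL efficiency, 0.160 s per request, one query per 30 s) are the estimates stated in this section, not measurements.

```python
def sustained_throughput(vcpus, gil_efficiency, service_time_s):
    """Requests/second the hardware can sustain under the stated model."""
    return vcpus * gil_efficiency / service_time_s

def utilisation(users, seconds_between_queries, capacity_rps):
    """Return (arrival rate in req/s, fraction of capacity used)."""
    arrival_rps = users / seconds_between_queries
    return arrival_rps, arrival_rps / capacity_rps

capacity = sustained_throughput(vcpus=8, gil_efficiency=0.55, service_time_s=0.160)

for users in (100, 300, 500):
    arrival, load = utilisation(users, 30, capacity)
    print(f"{users} users: {arrival:.1f} req/s arriving, {load:.0%} utilisation")
```

The unrounded capacity is 27.5 req/s, so the printed utilisation can differ by about one percentage point from the figures above, which divide by the rounded value of 27.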

RAM Analysis

FAISS index resident in RAM:        ~510 MB
SQLite mmap (OS-managed):           ~200 MB
Sentence-transformer model:         ~400 MB
meta.json loaded:                   ~200 MB
Python + Gradio + library overhead: ~600 MB
─────────────────────────────────────────────
Baseline (shared across all users): ~1.9 GB

Per-user marginal cost:
  Gradio WebSocket connection:      ~1 MB
  In-flight request buffer:         ~3–5 MB
  ─────────────────────────────────────────
  Per user:                         ~5 MB

500 users × 5 MB =                  ~2.5 GB
─────────────────────────────────────────────
Total at 500 users:                 ~4.4 GB of 32 GB (14% RAM utilisation)

RAM is not the bottleneck at any realistic user count on the 32 GB tier. The 32 GB headroom is insurance against OS page cache growth, Gradio's async event loop buffers, and burst spikes — not a requirement for baseline operation.
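
The RAM estimate can be rebuilt from the numbers already given. The FAISS figure follows from the vector count, the embedding width, and 4 bytes per float32; the remaining components are the estimates listed above, taken as given.

```python
N_VECTORS = 331_771        # chunks in the index
DIM = 384                  # all-MiniLM-L6-v2 embedding width
BYTES_PER_FLOAT32 = 4

# A flat index stores the raw vectors, so its size is n * d * 4 bytes.
faiss_bytes = N_VECTORS * DIM * BYTES_PER_FLOAT32
faiss_mb = faiss_bytes / 1e6

# Baseline components in MB, as estimated in the breakdown above.
baseline_mb = 510 + 200 + 400 + 200 + 600
per_user_mb = 5
users = 500

total_gb = (baseline_mb + users * per_user_mb) / 1000
ram_fraction = total_gb / 32   # share of the 32 GB tier
```

The raw vectors come to roughly 510 MB, which matches the resident figure used in the baseline.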

Why Not GPU?

A GPU tier would cost $288–$1,800/month and provide zero measurable benefit for this architecture:

  • The embedding model (all-MiniLM-L6-v2, 90 MB) is trivially small — GPU overhead for a single 384-dim inference call exceeds CPU time
  • The LLM (Qwen2.5-72B) runs on HF's Inference API servers in HF's datacenter — the Space hardware has no role in LLM inference
  • FAISS IndexFlatIP is an in-memory CPU operation — there is no GPU-accelerated FAISS index in use
  • GPU VRAM would sit entirely idle

The CPU Upgrade tier is architecturally correct at 1/13th the cost of an Nvidia T4.

Identified Optimizations Ready to Deploy Post-Grant

Two zero-cost code changes are already designed and ready to implement once load testing on the upgraded hardware confirms they are needed:

  1. ProcessPoolExecutor for embedding: Bypasses Python's GIL by running SentenceTransformer.encode() in a subprocess pool. Expected to give near-linear scaling across all 8 vCPUs. Estimated throughput improvement: ~2×, extending comfortable capacity from 300 to 600+ concurrent users on the same hardware.

  2. Query embedding cache: An lru_cache (or Redis-backed cache) on _embed() keyed by sanitized query string. Popular queries (scripture titles, common theological terms, recurring research topics) will be embedded once and served from cache for all subsequent users. Estimated cache hit rate for a 10,000-entry LRU: ~35–50% for a mature Sikh studies tool, eliminating embedding cost for roughly a third to half of queries.


Evidence of Demand

  • 4 independent access requests to the gated jsdosanj/SikhLibrary dataset — from researchers who discovered it organically, before any public announcement of this tool
  • The dataset spans 5 scholarly categories with material relevant to Punjabi studies programs, Sikh theology departments, cultural heritage organizations, and diaspora communities across India, Canada, the UK, the United States, and East Africa
  • The global Sikh diaspora numbers 25–30 million people, the majority of whom have no access to scholarly Sikh primary sources in a searchable, citable format
  • No equivalent tool exists: SikhiToTheMax provides scripture lookup but no semantic search, no citations, and no cross-text synthesis. Digitized archives at SGPC and universities are not publicly searchable in this way.

Impact Statement

Research mode is permanently free and credit-independent. Any user worldwide — a PhD student in Toronto, an educator in rural Punjab, a high school student writing a history essay in Nairobi — can get fully cited Chicago Manual of Style scholarly sources from 758 million words of primary texts in under 500ms. No account required. No credits. No institutional access. No cost.

Learn mode provides PhD-level theological and historical analysis grounded in retrieved primary sources, with automatic hallucination detection for ਅੰਗ references. The system is designed not to present fabricated scripture or invented citations as verified: every ਅੰਗ reference is either matched against the retrieved context or explicitly flagged.

The dataset (jsdosanj/SikhLibrary) is open for academic access requests, allowing researchers to build their own tools on the same corpus. The 4 existing access requests demonstrate this is already happening.

This is a cultural preservation project as much as a software project. The texts being indexed include manuscripts that exist in only a handful of physical copies worldwide. Making them semantically searchable and citable — free, forever — is an act of ਸੇਵਾ for the next generation of Sikh scholars, students, and communities.

A CPU Upgrade grant of $21.60/month is the precise upgrade needed to ensure this tool serves its audience at scale without degraded performance, queue timeouts, or infrastructure failure during the bursts of traffic that follow community sharing events. It is the minimum viable upgrade, the architecturally correct upgrade, and the most cost-efficient use of grant funds for the impact delivered.

Updated grant proposal to include Inference tokens alongside CPU upgrade.

jsdosanj changed discussion title from Apply for a GPU community grant: Personal project to Apply for a GPU community grant: Community project
jsdosanj changed discussion title from Apply for a GPU community grant: Community project to Apply for a CPU & Token community grant: Community project
