{"name":"NUDGE: Lightweight Non-Parametric Embedding Fine-Tuning for Retrieval","description":"NUDGE is a lightweight, non-parametric tool designed to fine-tune pre-trained embeddings, significantly enhancing retrieval and RAG pipelines. It operates by adjusting data embeddings directly, rather than modifying model parameters, to maximize accuracy. This approach often leads to over 10% improvement in retrieval accuracy and runs in minutes.","github":"https://github.com/szeighami/nudge","url":"https://osrepos.com/repo/szeighami-nudge","source":"osrepos.com","sourceDescription":"This repository profile is provided by osrepos.com, an open source repository discovery platform.","repositoryProfile":"https://osrepos.com/repo/szeighami-nudge","generatedFor":"open source discovery and AI-assisted research","markdown":"https://osrepos.com/repo/szeighami-nudge.md","json":"https://osrepos.com/repo/szeighami-nudge.json","topics":["Python","Embeddings","Fine-tuning","Retrieval","RAG","Machine Learning","NLP"],"keywords":["Python","Embeddings","Fine-tuning","Retrieval","RAG","Machine Learning","NLP"],"stars":null,"summary":"NUDGE is a lightweight, non-parametric tool designed to fine-tune pre-trained embeddings, significantly enhancing retrieval and RAG pipelines. It operates by adjusting data embeddings directly, rather than modifying model parameters, to maximize accuracy. This approach often leads to over 10% improvement in retrieval accuracy and runs in minutes.","content":"## Introduction\nNUDGE is a lightweight, non-parametric tool designed for fine-tuning pre-trained embeddings, specifically for enhancing retrieval and RAG (Retrieval Augmented Generation) pipelines. 
Presented in the ICLR'25 paper \"NUDGE: Lightweight Non-Parametric Embedding Fine-Tuning\", this tool can significantly improve retrieval accuracy, often by over 10%, and runs in minutes.\n\nUnlike conventional fine-tuning, which updates model parameters, NUDGE adjusts the data embeddings themselves. It solves a constrained optimization problem, moving data embeddings closer to the embeddings of training queries for which they are the ground-truth answers. The repository offers two variants, NUDGE-M and NUDGE-N, each with distinct optimization constraints.\n\n<p align=\"center\">\n<img src=\"https://github.com/szeighami/nudge/blob/main/nudge_overview.jpg\" width=\"500\" alt=\"NUDGE Overview\">\n</p>\n\nAs illustrated above, NUDGE modifies data embeddings within a defined region to maximize similarity with training queries.\n\n## Installation\nTo get started with NUDGE, install it using pip:\n\n```bash\npip install nudge-ft\n```\n\n## Examples\nNUDGE operates directly on pre-computed embeddings. You need to have your documents and training/validation queries already embedded, along with ground-truth answers for the queries.\n\nHere's a basic workflow:\n\n```python\nfrom nudge import NUDGEN, NUDGEM\n\n# Each query set pairs query embeddings with the ground-truth answers for those queries\ntrain_set = {'q_embs': train_q_embs, 'q_ans_indx': train_q_ans_indx}\nval_set = {'q_embs': val_q_embs, 'q_ans_indx': val_q_ans_indx}\n\n# Fine-tune the pre-computed data embeddings with each variant\nfinetuned_embs_nudge_n = NUDGEN().finetune_embeddings(data_embs, train_set, val_set)\nfinetuned_embs_nudge_m = NUDGEM().finetune_embeddings(data_embs, train_set, val_set)\n```\n\nFor a complete end-to-end example, you can fine-tune embeddings on the `nfcorpus` dataset. This involves embedding data and queries using `sentence_transformers` and then applying NUDGE. 
A detailed example is available in the repository's [notebook](https://github.com/szeighami/nudge/blob/main/example.ipynb){target=\"_blank\"}, or you can run it with `python example.py`.\n\n```python\n# Install dependencies first (run in a shell, not in Python):\n#   pip install sentence_transformers datasets\n# The `util` imports below are repository helpers; this assumes you run from a clone of the repo.\n\n# Load dataset and embed\nfrom util.utils import load_hf_datasets, embed_data_and_query_sets\ndataset_name = 'nfcorpus'\ndataset, query_sets = load_hf_datasets(dataset_name)\ndata_emb, query_sets = embed_data_and_query_sets(dataset, query_sets, \"BAAI/bge-small-en-v1.5\")\n\n# Fine-tune Embeddings\nfrom nudge import NUDGEN\nfinetuned_embs_nudge_n = NUDGEN().finetune_embeddings(data_emb, query_sets['train'], query_sets['dev'])\n\n# Use fine-tuned embeddings for retrieval\nfrom util.knnretriever import kNNRetriever\nnudge_n_res = kNNRetriever(finetuned_embs_nudge_n).retrieve_topk_from_emb_batch(k=10, q_embeds=query_sets['test']['q_embs'])\n\n# Use non-fine-tuned embeddings to answer queries (for comparison)\nno_ft_res = kNNRetriever(data_emb).retrieve_topk_from_emb_batch(k=10, q_embeds=query_sets['test']['q_embs'])\n\n# Compare accuracy\nfrom util.utils import calc_metrics_batch\nmetrics = [('recall',10), ('ndcg',10)]\nno_ft_accs = calc_metrics_batch(metrics, no_ft_res, query_sets['test']['q_ans_indx'], query_sets['test']['q_ans_indx_rel'])\nnudgen_accs = calc_metrics_batch(metrics, nudge_n_res, query_sets['test']['q_ans_indx'], query_sets['test']['q_ans_indx_rel'])\nprint(f\"No Fine-Tuning {metrics[0][0]}@{metrics[0][1]}: {no_ft_accs[0]*100:.1f}, {metrics[1][0]}@{metrics[1][1]}: {no_ft_accs[1]*100:.1f}\")\nprint(f\"NUDGE-N {metrics[0][0]}@{metrics[0][1]}: {nudgen_accs[0]*100:.1f}, {metrics[1][0]}@{metrics[1][1]}: {nudgen_accs[1]*100:.1f}\")\n```\n\nIn this example, the fine-tuned embeddings typically score noticeably higher than the baseline on both recall@10 and nDCG@10. For larger datasets, NUDGE also provides an optimization to reduce memory usage by filtering out data records not relevant to training or validation queries. 
An example for larger datasets is available in this [notebook](https://github.com/szeighami/nudge/blob/main/example_large_datasets.ipynb){target=\"_blank\"}.\n\n## Why use NUDGE?\nNUDGE offers several advantages for anyone working with embeddings for retrieval:\n*   **Efficiency:** Fine-tuning is lightweight and runs in minutes.\n*   **Performance Boost:** It often improves retrieval accuracy by over 10%, without altering the underlying embedding model.\n*   **Non-Parametric Approach:** By adjusting data embeddings directly, NUDGE avoids the complexity and computational cost of fine-tuning the embedding model's parameters.\n*   **Ease of Integration:** NUDGE fits into existing embedding-based retrieval pipelines, requiring only pre-computed embeddings and ground-truth answers for the training and validation queries.\n*   **Scalability:** For larger datasets, an optimization reduces memory usage by filtering out data records not relevant to training or validation queries.\n\n## Links\n*   **GitHub Repository:** [szeighami/nudge](https://github.com/szeighami/nudge){target=\"_blank\"}\n*   **Paper:** [NUDGE: Lightweight Non-Parametric Embedding Fine-Tuning](https://arxiv.org/pdf/2409.02343){target=\"_blank\"}\n*   **Overview Blog Post:** [NUDGE: Lightweight Non-Parametric Embedding Fine-Tuning](https://data-people-group.github.io/blogs/2024/09/05/nudge/){target=\"_blank\"}","metrics":{"detailViews":9,"githubClicks":3},"dates":{"published":null,"modified":"2026-03-04T21:50:06.000Z"}}