Understand Large Codebases Faster Using GitIngest | Oxford Protein Informatics Group

As researchers, we often have to deal with large and ugly codebases. This is not new, I know. But fear not: we now have large language models (LLMs) like ChatGPT and friends, which make things a little faster! In this blog post I will show you how to use GitIngest to make this even faster with your favourite LLM.

No more copy-pasting files individually, writing a paragraph explaining the directory structure, or, even worse, relying on an LLM to use web search to find the codebase. As the codebase grows, these methods become less reliable. GitIngest makes a whole codebase prompt-friendly: one prompt will be all you need!

Simply take your favourite GitHub repo URL (publicly available, ideally) and replace “hub” with “ingest” in the domain name, so that github.com becomes gitingest.com. See the example below.
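If you want to do the swap in a script rather than by hand, here is a minimal sketch. The function name `to_gitingest_url` is my own invention, not part of GitIngest. It rewrites only the host, so a repo whose name happens to contain "hub" is left alone.

```python
from urllib.parse import urlsplit, urlunsplit


def to_gitingest_url(github_url: str) -> str:
    """Swap the github.com host for gitingest.com and keep the rest of the URL."""
    parts = urlsplit(github_url)
    if parts.netloc.lower() not in {"github.com", "www.github.com"}:
        raise ValueError(f"not a GitHub URL: {github_url}")
    # Only the host changes; the path (owner/repo/tree/...) stays untouched.
    return urlunsplit(parts._replace(netloc="gitingest.com"))


print(to_gitingest_url(
    "https://github.com/google-research/google-research/tree/master/mol_dqn"
))
```

A plain text replace of "hub" with "ingest" works for most URLs too; going through the host is just a little safer.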

On GitIngest, you can access all of the code as text, as well as the directory structure of the repo. Take only the structure, or take parts of the code, and feed them into your favourite LLM along with your questions or the things you’d like to understand better.

As a bonus, you get the token count and some extra information!

I like Gemini 2.5 Pro for this task as it has a context window of 1 million tokens.
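To check quickly whether a digest will fit, you can compare the estimate against your model's context window. The sketch below assumes the estimate is a string like "133.7k" (the "M" suffix is my guess at how larger repos are reported). The `reserve` parameter is a made-up allowance for your own question and the model's reply. The 100,000-token window in the second call is a hypothetical smaller model, not a quoted limit.

```python
def parse_token_estimate(text: str) -> int:
    """Turn an estimate such as '133.7k' into an integer token count."""
    multipliers = {"k": 1_000, "m": 1_000_000}  # 'm' is an assumed suffix
    text = text.strip().lower()
    if text and text[-1] in multipliers:
        return round(float(text[:-1]) * multipliers[text[-1]])
    return round(float(text))


def fits_in_context(estimate: str, context_window: int, reserve: int = 8_000) -> bool:
    """True if the digest plus a reserve for prompt and reply fits in the window."""
    return parse_token_estimate(estimate) + reserve <= context_window


print(fits_in_context("133.7k", 1_000_000))  # a 1-million-token window
print(fits_in_context("133.7k", 100_000))    # a hypothetical smaller window
```

If the digest does not fit, take only the directory structure plus the handful of files you actually care about.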

Happy coding!

# your favourite GitHub URL

https://github.com/google-research/google-research/tree/master/mol_dqn

# replace "hub" with "ingest"

https://gitingest.com/google-research/google-research/tree/master/mol_dqn

# the estimated token count for feeding all of the code into an LLM is also given; this might matter for a model like Claude

Estimated tokens: 133.7k

# example directory structure output (I won't paste the code output or I'll break the blog)

Directory structure:
└── mol_dqn/
    ├── README.md
    ├── requirements.txt
    ├── chemgraph/
    │   ├── __init__.py
    │   ├── all_800_mols.json
    │   ├── multi_obj_opt.py
    │   ├── multi_obj_opt_test.py
    │   ├── optimize_logp.py
    │   ├── optimize_logp_of_800_molecules.py
    │   ├── optimize_logp_of_800_molecules_test.py
    │   ├── optimize_qed.py
    │   ├── optimize_qed_test.py
    │   ├── target_sas.py
    │   ├── target_sas_eval.ipynb
    │   ├── target_sas_test.py
    │   ├── configs/
    │   │   ├── bootstrap_dqn.json
    │   │   ├── bootstrap_dqn_opt_800.json
    │   │   ├── bootstrap_dqn_step1.json
    │   │   ├── bootstrap_dqn_step2.json
    │   │   ├── multi_obj_dqn.json
    │   │   ├── naive_dqn.json
    │   │   ├── naive_dqn_opt_800.json
    │   │   └── target_sas.json
    │   └── dqn/
    │       ├── __init__.py
    │       ├── deep_q_networks.py
    │       ├── deep_q_networks_test.py
    │       ├── molecules.py
    │       ├── molecules_test.py
    │       ├── run_dqn.py
    │       ├── run_dqn_test.py
    │       ├── py/
    │       │   ├── __init__.py
    │       │   ├── molecules.py
    │       │   └── molecules_test.py
    │       └── tensorflow_core/
    │           ├── __init__.py
    │           └── core.py
    ├── experimental/
    │   ├── deep_q_networks_noise.py
    │   ├── eval_800_mols.py
    │   ├── max_qed_with_sim.py
    │   ├── multi_obj.py
    │   ├── multi_obj_gen.py
    │   ├── multi_obj_opt.py
    │   ├── optimize_logp.py
    │   ├── optimize_qed.py
    │   ├── optimize_qed_final_reward.py
    │   ├── optimize_qed_max_steps.py
    │   ├── optimize_qed_noise.py
    │   ├── optimize_qed_t.py
    │   ├── optimize_weight_noise.py
    │   └── target_logp.py
    └── plot/
        ├── drug_20_smiles.json
        ├── episode_length_qed.json
        ├── plot.py
        ├── q_values_20.json
        └── target_sas_results.csv
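
If you only want to feed part of the code to the LLM, it helps to turn the tree above into a flat list of paths you can filter. This is a rough sketch that assumes the layout shown here: four characters of indentation per level and a trailing "/" on directories. The helper `tree_to_paths` and the shortened `SAMPLE_TREE` are mine, not something GitIngest provides.

```python
SAMPLE_TREE = """\
Directory structure:
└── mol_dqn/
    ├── README.md
    ├── chemgraph/
    │   ├── __init__.py
    │   ├── optimize_qed.py
    │   ├── optimize_qed_test.py
    │   └── dqn/
    │       ├── molecules.py
    │       └── molecules_test.py
    └── plot/
        └── plot.py
"""


def tree_to_paths(tree: str) -> list[str]:
    """Flatten a directory tree listing into full file paths."""
    paths = []
    stack = []  # directory names from the root down to the current level
    for line in tree.splitlines():
        marker = line.find("──")
        if marker == -1:
            continue  # header line or blank line
        depth = (marker - 1) // 4  # each nesting level adds four characters
        name = line[marker + 2:].strip()
        del stack[depth:]  # climb back up to this entry's parent
        if name.endswith("/"):
            stack.append(name.rstrip("/"))
        else:
            paths.append("/".join(stack + [name]))
    return paths


paths = tree_to_paths(SAMPLE_TREE)
# Keep the Python sources and leave out the unit tests.
sources = [p for p in paths if p.endswith(".py") and not p.endswith("_test.py")]
print("\n".join(sources))
```

From there you can decide which files are worth the tokens before pasting anything into a prompt.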

Author

Sanaz Kazeminia
