Code Llama: Open Code Models Built on Llama 2 (7B-70B)

Quick answer

Code Llama is Meta AI’s family of open code models built on top of Llama 2. It scores up to 67% on HumanEval and 65% on MBPP, which the paper reports as state of the art among open models. The family first shipped in August 2023 in 7B, 13B, and 34B sizes, and a 70B size followed in January 2024. Each size comes in three flavors: the base Code Llama, a Python-tuned Code Llama - Python, and an instruction-following Code Llama - Instruct. The license permits both research and commercial use, which is the part that actually mattered to the ecosystem.

Specializing Llama 2 for code

Code Llama does not train a code model from scratch. It starts from Llama 2, a general-purpose language model, and keeps training it on code-heavy data, then adds task-specific stages. That choice is the whole point: by inheriting Llama 2’s language reasoning, the code models pick up programming ability without losing the natural-language understanding needed to follow instructions and explain themselves.

The release splits into three lines, each aimed at a different use. The base models handle general code generation. Code Llama - Python adds further training on Python data for teams whose stack is Python-first. Code Llama - Instruct is tuned to follow natural-language instructions and behave more safely, which makes it the variant most assistant-style products should deploy. Each line spans 7B to 70B, so you can trade quality against the GPU you can afford.
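To make the size trade-off concrete, here is a back-of-the-envelope sketch of the memory needed just to hold the weights. It assumes nominal parameter counts, 2 bytes per parameter for 16-bit weights, and 0.5 bytes for 4-bit quantization. It ignores activations, the KV cache, and runtime overhead, so real requirements are higher. The function name and precision table are illustrative and not part of any Code Llama tooling.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumptions: nominal parameter counts (7B = 7e9, and so on); activations,
# KV cache, and runtime overhead are ignored, so treat results as a lower bound.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}  # illustrative precisions


def weight_memory_gb(params_billion: float, precision: str = "fp16") -> float:
    """Return approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for size in (7, 13, 34, 70):
        fp16 = weight_memory_gb(size)
        int4 = weight_memory_gb(size, "int4")
        print(f"{size}B: fp16 ~{fp16:.0f} GB, int4 ~{int4:.1f} GB")
```

Under these assumptions the 7B model needs roughly 14 GB for 16-bit weights while the 70B model needs roughly 140 GB, which is why the smaller sizes are the ones most people run locally.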

Infilling and long context

Two engineering features separate Code Llama from a plain code completer. First, infilling: the 7B, 13B, and 70B base and Instruct variants can fill in a missing span using both the code before and after the gap, not just what precedes it. That is what an IDE actually needs. You rarely write code strictly left to right; you insert into the middle of an existing file.
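As a rough illustration of infilling, the sketch below splits a file at a cursor marker and arranges it as prefix, suffix, and an empty middle slot for the model to fill. The sentinel strings, the cursor marker, and both helper names are assumptions made for this example. The exact special tokens and spacing depend on the tokenizer, so check the model documentation before relying on them.

```python
# Sketch of a prefix-suffix-middle infilling prompt.
# Assumption: PRE/SUF/MID are placeholder sentinel strings; the real special
# tokens and their spacing must come from the model's tokenizer documentation.

PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"  # assumed sentinel markers
CURSOR = "<CURSOR>"                         # hypothetical editor marker for the gap


def build_infill_prompt(source: str, cursor: str = CURSOR) -> str:
    """Split source at the cursor and lay it out as prefix, suffix, middle slot."""
    if source.count(cursor) != 1:
        raise ValueError("source must contain exactly one cursor marker")
    prefix, suffix = source.split(cursor)
    # The model sees the code before AND after the gap, then generates the middle.
    return f"{PRE} {prefix} {SUF}{suffix} {MID}"


def splice(source: str, completion: str, cursor: str = CURSOR) -> str:
    """Put the generated middle back where the cursor was."""
    return source.replace(cursor, completion)


if __name__ == "__main__":
    src = "def add(a, b):\n    <CURSOR>\n\nprint(add(1, 2))\n"
    print(build_infill_prompt(src))
    print(splice(src, "return a + b"))
```

The point of the layout is that the call site after the gap is visible to the model, so it can infer that the missing body should return a value.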

Second, context length. All models are trained on 16k-token sequences but show improvements on inputs of up to 100k tokens. For code that matters more than for prose: a single function rarely contains the whole problem, and feeding the model a large slice of a real repository is often the difference between a plausible-looking answer and a correct one.
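A minimal sketch of what feeding the model a slice of a repository can look like: pack files into a fixed token budget, skipping any that do not fit. The 4-characters-per-token estimate is a crude assumption standing in for the real tokenizer, and the function names and the file-header comment format are hypothetical.

```python
# Sketch: greedily pack source files into a fixed token budget.
# Assumption: ~4 characters per token is a crude heuristic, not the real tokenizer.

from typing import Dict, List, Tuple


def estimate_tokens(text: str) -> int:
    """Crude token estimate: assumed 4 characters per token, rounded up."""
    return (len(text) + 3) // 4


def pack_files(files: Dict[str, str], budget: int) -> Tuple[str, List[str]]:
    """Add files in the given order, skipping any that would exceed the budget."""
    parts: List[str] = []
    included: List[str] = []
    used = 0
    for path, text in files.items():
        chunk = f"# file: {path}\n{text}\n"
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            continue  # this file does not fit; a smaller one later still might
        parts.append(chunk)
        included.append(path)
        used += cost
    return "".join(parts), included


if __name__ == "__main__":
    repo = {"a.py": "x = 1\n" * 10, "big.py": "y = 2\n" * 1000, "b.py": "z = 3\n"}
    prompt, kept = pack_files(repo, budget=50)
    print(kept)
```

In practice you would order the files by relevance to the task first, since a greedy pass keeps whatever comes early.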

Key results

Code Llama reached state-of-the-art performance among open models on several code benchmarks, with scores of up to 67% on HumanEval and 65% on MBPP. The sharpest single data point: Code Llama - Python 7B outperforms Llama 2 70B on both HumanEval and MBPP. A model one-tenth the size beats the largest model in its own base family purely through code specialization. All variants also beat every other publicly available model on MultiPL-E, the multi-language coding benchmark, showing the gains were not confined to Python.
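For context on what these percentages measure: HumanEval and MBPP judge functional correctness, meaning a generated solution counts only if it passes the task's unit tests. The sketch below illustrates the idea with a hypothetical four-task mini benchmark and one completion per task. It is not the official harness, and the tasks, completions, and function names are invented for illustration. Real harnesses sandbox execution, which this sketch does not do.

```python
# Sketch: score completions by running them against unit tests.
# Assumptions: the tasks and completions are hypothetical, and exec() runs here
# without any sandbox. Do not run untrusted model output this way.

from typing import Dict, List


def passes(completion_src: str, test_src: str) -> bool:
    """Return True if the completion runs and satisfies every assertion."""
    namespace: dict = {}
    try:
        exec(completion_src, namespace)  # define the candidate function
        exec(test_src, namespace)        # run the task's assertions against it
        return True
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a miss.
        return False


def score(tasks: List[Dict[str, str]]) -> float:
    """Fraction of tasks whose single completion passes its tests."""
    return sum(passes(t["completion"], t["tests"]) for t in tasks) / len(tasks)


TASKS = [
    {"completion": "def is_even(n):\n    return n % 2 == 0\n",
     "tests": "assert is_even(2)\nassert not is_even(3)\n"},
    {"completion": "def double(x):\n    return x + x\n",
     "tests": "assert double(4) == 8\n"},
    {"completion": "def last(xs):\n    return xs[0]\n",      # logic bug
     "tests": "assert last([1, 2, 3]) == 3\n"},
    {"completion": "def square(x)\n    return x * x\n",      # syntax error
     "tests": "assert square(3) == 9\n"},
]

if __name__ == "__main__":
    print(f"score: {score(TASKS):.0%}")
```

Two of the four hypothetical completions pass, so this toy run scores 50%. The benchmark figures quoted above are the same kind of ratio over the full task sets.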

Limits and open questions

The honest caveats. The headline 67%/65% comes from the largest, most specialized configurations; the 7B you can run on a laptop scores meaningfully lower, so “Code Llama hits 67%” is not a claim about the model most people will actually run. HumanEval and MBPP are short, self-contained function-writing tasks. They say little about multi-file changes, debugging a real repository, or maintaining a large codebase, which is where coding assistants are still weak. Open weights do not mean safe output either: the models can still produce insecure or subtly wrong code, so tests, review, and dependency hygiene remain on you, not the model.

FAQ

What is Code Llama and who made it?

Code Llama is a family of open large language models for code, first released by Meta AI in August 2023 and built by continuing to train Llama 2 on code data. It now ships in 7B, 13B, 34B, and 70B sizes under a license that allows research and commercial use.

How good is Code Llama on HumanEval and MBPP?

Code Llama reaches up to 67% on HumanEval and 65% on MBPP, the best scores among openly available models at release. Notably, the Python-specialized Code Llama - Python 7B beats the much larger general Llama 2 70B on both benchmarks.

Can Code Llama edit code in the middle of a file?

Yes. The 7B, 13B, and 70B base and Instruct variants support infilling, meaning they complete a missing span using both the surrounding code before and after it. That is the behavior an IDE autocomplete actually needs, rather than only continuing from the end.

How long an input can Code Llama handle?

Code Llama is trained on 16k-token sequences and shows improvements on inputs up to 100k tokens, enough to feed it large chunks of a real codebase rather than a single isolated function.

Which Code Llama variant should I use?

Use Code Llama - Instruct for assistant-style products that follow natural-language requests, Code Llama - Python if your stack is Python-first, and the base Code Llama for raw code generation or further fine-tuning. Pick the largest size your hardware budget allows, since quality generally improves with size. If you need infilling, stick to the 7B, 13B, or 70B base and Instruct models.

Code Llama’s real contribution was not a benchmark record but a permissively licensed open baseline good enough to build on. Read the paper at https://arxiv.org/abs/2308.12950.