OpenAI has released two open-source artificial intelligence models, gpt-oss-120b and gpt-oss-20b, marking the company's first open release to the AI research community since GPT-2 in 2019.
The new models are now publicly available through Hugging Face and are said to perform on par with OpenAI’s proprietary o3 and o3-mini models.
The announcement was made by OpenAI CEO Sam Altman via a post on X (formerly Twitter), where he highlighted that “gpt-oss-120b performs about as well as o3 on challenging health issues.”
According to OpenAI, the models are designed with advanced capabilities such as native reasoning, chain-of-thought (CoT) transparency, and tool use — including web search and Python code execution.
Built on a mixture-of-experts (MoE) architecture, the models are optimised for efficiency: gpt-oss-120b contains 117 billion total parameters but activates just 5.1 billion per token, while gpt-oss-20b holds 21 billion parameters with 3.6 billion active per token. Both models support a context window of up to 128,000 tokens, making them suitable for long-form applications.
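As a rough illustration of why an MoE model touches only a fraction of its weights per token, the toy layer below routes each token to its top-k experts and runs only those. All sizes and names here are illustrative, not the actual gpt-oss configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 64, 8, 2  # toy sizes, not gpt-oss's real config

# One tiny feed-forward "expert" weight matrix per slot.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x):
    """Route a single token vector to its top_k experts only."""
    logits = x @ router
    top = np.argsort(logits)[-top_k:]                # indices of chosen experts
    w = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over chosen
    # Only top_k of the n_experts matrices are ever multiplied; the rest sit
    # idle, which is why active parameters are far fewer than total parameters.
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

token = rng.standard_normal(d_model)
out = moe_layer(token)
active_fraction = top_k / n_experts
print(out.shape, active_fraction)  # (64,) 0.25
```

In gpt-oss-120b the equivalent ratio is 5.1 billion active out of 117 billion total parameters, which is what keeps per-token inference cost low despite the large overall model.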
These models are also compatible with OpenAI’s Responses API and agent-based workflows, enabling developers to integrate them easily into complex systems. Developers can adjust the depth of CoT reasoning to prioritise either response quality or low latency, depending on application needs.
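A minimal sketch of what such an integration might look like, assuming the model is served behind an OpenAI-compatible chat endpoint (for example via a local inference server). The model name, the convention of steering reasoning depth through a "Reasoning: low" system hint, and the payload shape are assumptions for illustration; check your serving stack's documentation:

```python
import json

# Hypothetical request body for a locally served gpt-oss model.
payload = {
    "model": "gpt-oss-20b",
    "messages": [
        # Assumed convention: reasoning effort steered via the system prompt,
        # here favouring low latency over maximum response quality.
        {"role": "system", "content": "Reasoning: low"},
        {"role": "user", "content": "Summarise mixture-of-experts routing."},
    ],
    "max_tokens": 256,
}

# What an OpenAI-compatible endpoint would receive on the wire.
body = json.dumps(payload)
print(body[:40])
```

Switching the system hint to a higher reasoning setting would trade latency for more thorough chain-of-thought, which is the configuration choice the article describes.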
OpenAI has emphasised that safety was a primary concern during development.
The models were trained with careful filtering to eliminate data linked to chemical, biological, radiological, and nuclear (CBRN) threats. Additional measures were taken to safeguard against prompt injection attacks and to ensure that the models refuse unsafe prompts. Although the models are openly released, OpenAI asserts that they cannot easily be fine-tuned by malicious actors to produce harmful outputs.
During internal testing, gpt-oss-120b outperformed o3-mini in areas such as competitive programming (Codeforces), general problem solving (MMLU, Humanity’s Last Exam), and tool calling (TauBench). However, it reportedly trails slightly behind o3 on other benchmarks such as GPQA Diamond.
The training corpus was primarily English-language and focused on domains such as STEM, programming, and general knowledge. For post-training, OpenAI implemented reinforcement learning (RL)-based fine-tuning to enhance the models' performance and safety features.