The Rise of Open-Source Language Models: Evaluating Starling-7B

The field of large language models (LLMs) continues to advance at a rapid pace. The latest development comes with the release of Starling-7B – an open-source 7 billion parameter model that aims to match the performance of commercial models like GPT-4 in most areas, with some key exceptions.

In this post, we’ll take a closer look at Starling-7B, how it was developed, and evaluate its strengths and weaknesses compared to proprietary LLMs. Specifically, we’ll focus on its performance in reasoning, mathematical, and coding tasks.

While Starling-7B represents impressive progress for open-source LLMs, it also highlights areas where further work is needed, especially in domains requiring logical thinking. Nonetheless, the model shows the potential for community-driven efforts to push the boundaries of what’s possible.

Starling-7B Development

The Starling-7B is an open large language model (LLM) developed by a team including Banghua Zhu, Evan Frick, Tianhao Wu, Hanlin Zhu, and Jiantao Jiao. It is trained by Reinforcement Learning from AI Feedback (RLAIF) and is finetuned from the Openchat 3.5 model. The model utilizes the GPT-4 labeled ranking dataset, berkeley-nest/Nectar, and a new reward training and policy tuning pipeline. Starling-7B-alpha has achieved a score of 8.09 in MT Bench with GPT-4 as a judge, outperforming every model to date on MT-Bench except for OpenAI’s GPT-4 and GPT-4 Turbo.

The model is released along with the ranking dataset Nectar, the reward model Starling-RM-7B-alpha, and an online demo in LMSYS Chatbot Arena. The model is licensed for non-commercial use only and is subject to the data distillation License of LLaMA, Terms of Use of the data generated by OpenAI, and Privacy Practices of ShareGPT. The developers express their gratitude to various organizations and the open-source community for their support and contributions to the project.

The Starling-7B is a language model that has been trained using reinforcement learning and has shown impressive performance in MT Bench evaluations. It is part of a larger project that includes the development of a ranking dataset and a reward model. The model is available for non-commercial use and is hosted on HuggingFace.

Starling-7B Performance

The Starling-7B performance is characterized by its ability to beat Openchat 3.5 and come close to GPT-4. It is a reward model trained from Llama2-7B-Chat and fine-tuned on mistral, following the exact chat template and usage as Openchat 3.5. The model’s performance is discussed in various contexts, including comparisons with GPT-4 and other models, as well as issues related to line feed code and prompt templates.

Final Thoughts

The release of Starling-7B represents admirable progress for open-source language models. However, the claim that it “performs almost as well as GPT-4” is likely an overstatement that should be re-evaluated.

I’ve grown wary of claims that tiny models can genuinely compete with or beat GPT-4. Too often, these suggestions stem from benchmarks exaggeration or other questionable practices. While Starling-7B appears to be a legitimate model making strides within its weight class, directly pitting it against GPT-4 triggers skepticism rather than good faith.

Especially concerning is the considerable gap in coding capabilities compared to GPT-4. Code generation requires precise logical thinking – an area still needing improvement in Starling-7B. Additionally, there is no disclosure of the sources of training data – an omission that further raises suspicions.

Rather than sensationalized headlines claiming to beat the leading commercial models, the open-source community would be better served with transparent and realistic assessments. There is impressive work being done, but it does a disservice when the incremental progress is overstated. By maintaining high standards of evaluation and expectation setting, we will build trust and interest in these models for the right reasons.