The Machine Learning Foundations team at Microsoft Research has been on a roll with its Phi series, a suite of small language models (SLMs). Phi-2, the latest addition at 2.7 billion parameters, is turning heads by punching well above its weight: on reasoning and language understanding it matches or outperforms much larger models, some of them up to 25 times its size.
The results speak volumes. Phi-2 outperforms its predecessor, Phi-1.5, in common sense reasoning, language understanding, math, and coding. Just as impressive is its behavior on safety and bias, where it scores better than larger models such as Llama-7B.
In AI, bigger has often meant better, and Phi-2 challenges that notion. Its results come less from size than from strategic training choices.
The Phi series began with Phi-1, a 1.3 billion-parameter model that led existing SLMs on Python coding benchmarks such as HumanEval and MBPP. The team then moved on to Phi-1.5, which matches the performance of models five times its size in common sense reasoning and language understanding.
Phi-2, a 2.7 billion-parameter language model, now demonstrates reasoning and language understanding capabilities that match or exceed those of models up to 25 times larger. Key to its success are strategic training data curation and innovations in model scaling, including knowledge transfer from prior models.
How did they achieve this feat? The team’s strategy combined careful data selection with new scaling techniques. They curated the training data to emphasize high-quality educational content and carefully filtered web data. They also embedded knowledge from their previous models to give Phi-2 a head start.
Two factors are pivotal: training data quality and model scaling. On the data side, the team focuses on “textbook-quality” data, combining synthetic datasets designed to teach common sense reasoning with filtered educational web data. On the scaling side, the team embedded Phi-1.5’s knowledge into Phi-2, which accelerated training convergence and boosted benchmark scores.
Thanks to its compact size, Phi-2 is a practical tool for researchers exploring mechanistic interpretability, safety improvements, and task-specific fine-tuning. It has been released on Azure AI Studio to foster research and development on language models.
Phi-2 is a Transformer-based model trained on 1.4T tokens drawn from synthetic and web datasets for NLP and coding. Training took 14 days on 96 A100 GPUs. Notably, Phi-2 shows better behavior on toxicity and bias than other models, even though it has undergone neither alignment through reinforcement learning nor fine-tuning.
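To put those training figures in perspective, here is a minimal back-of-the-envelope sketch in Python. It uses only the numbers quoted above (2.7 billion parameters, 1.4T tokens, 14 days, 96 A100 GPUs). The function names are illustrative, and the calculation assumes that all 96 GPUs ran for the full 14 days, which the source does not state explicitly.

```python
# Rough training-budget arithmetic from the figures quoted in the article.
# Assumption (not stated in the source): all 96 GPUs were busy for the
# entire 14 days, so these are upper-bound style estimates.

PARAMETERS = 2.7e9        # Phi-2 parameter count
TRAINING_TOKENS = 1.4e12  # tokens seen during training
TRAINING_DAYS = 14
NUM_GPUS = 96


def gpu_hours(days: int, gpus: int) -> int:
    """Total GPU-hours if every GPU runs for the whole period."""
    return days * 24 * gpus


def tokens_per_parameter(tokens: float, parameters: float) -> float:
    """How many training tokens were seen per model parameter."""
    return tokens / parameters


def tokens_per_gpu_second(tokens: float, days: int, gpus: int) -> float:
    """Average training throughput per GPU, in tokens per second."""
    return tokens / (gpu_hours(days, gpus) * 3600)


if __name__ == "__main__":
    print(f"GPU-hours:            {gpu_hours(TRAINING_DAYS, NUM_GPUS):,}")
    print(f"Tokens per parameter: "
          f"{tokens_per_parameter(TRAINING_TOKENS, PARAMETERS):.0f}")
    print(f"Tokens per GPU-second: "
          f"{tokens_per_gpu_second(TRAINING_TOKENS, TRAINING_DAYS, NUM_GPUS):.0f}")
```

Under these assumptions the run comes to roughly 32,000 GPU-hours and a little over 500 training tokens per parameter, a token-heavy budget for a model of this size.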
Evaluated on academic benchmarks and proprietary datasets, Phi-2 outperforms models such as Mistral and Llama-2, and even surpasses Google’s larger Gemini Nano 2. Model evaluation remains challenging, however, and benchmark results still need to be confirmed by testing on concrete use cases.
Extensive testing on research prompts is consistent with the benchmark results and shows that Phi-2 can solve complex problems. For instance, Phi-2 solved a physics problem and corrected a student’s erroneous solution, demonstrating its reasoning abilities even though it was not fine-tuned for such tasks.