This is Google's ranking of the best AI systems for Android app development.
Google has launched a new benchmark to determine which AI models are best at Android app development. The idea, according to the company, is to evaluate models on realistic development tasks so that developers can choose tools that genuinely boost productivity. As expected, Gemini 3.1 Pro tops the list, with Claude Opus 4.6 and GPT-5.2 Codex in second and third place.
Google argues that generic benchmarks are inadequate for measuring competence in Android development: writing generic Python code is not the same as managing the full lifecycle of a mobile app with a clean architecture. An Android-specific benchmark, the company believes, will serve as a fundamental reference point and keep developers from wasting time on ineffective tools.
According to the ranking, Google's and Anthropic's models are the best for app development. Gemini 3.1 Pro Preview scored 72.4%, the average success rate on the 100 tasks across 10 runs. The top model's confidence interval, a metric used to gauge the statistical reliability of the results, spans roughly 65% to 79%.
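As an illustration of how such a score and interval can be derived (Google's exact statistical procedure is not published in the article; the run counts below are hypothetical), one can average the pass rate over repeated runs and attach a normal-approximation 95% confidence interval:

```python
import math

def score_with_ci(successes_per_run, total_tasks, z=1.96):
    """Average pass rate over repeated runs, plus a normal-approximation
    95% confidence interval on that mean. Illustrative sketch only."""
    rates = [s / total_tasks for s in successes_per_run]
    mean = sum(rates) / len(rates)
    # Sample variance across runs, then standard error of the mean
    var = sum((r - mean) ** 2 for r in rates) / (len(rates) - 1)
    se = math.sqrt(var / len(rates))
    return mean, (mean - z * se, mean + z * se)

# Hypothetical results: successes out of 100 tasks in each of 10 runs
runs = [74, 71, 70, 75, 73, 72, 70, 74, 72, 73]
mean, (lo, hi) = score_with_ci(runs, 100)
```

With these made-up run counts the mean lands at 72.4%; in practice, a wider spread between runs widens the reported interval, which is what the 65%–79% band expresses.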
Following Google are Claude Opus 4.6 and GPT-5.2 Codex, with scores of 66.6% and 62.5%, respectively. Next come Claude Opus 4.5 and Gemini 3 Pro, though Claude Sonnet 4.6 also ranks highly. That mid-range Anthropic model scores more than three times as high as Gemini 2.5 Flash, which manages only 16.1%.
Unlike other performance benchmarks, AndroidBench consists of 100 tasks selected from an initial pool of approximately 39,000 GitHub pull requests. Google kept only repositories with more than 500 stars and changes made within the last three years, ensuring the benchmark tests models against current practices rather than outdated code.
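The kind of filtering described, popularity plus recency, can be sketched as follows. The field names (`repo_stars`, `merged_at`) are assumptions for illustration, not AndroidBench's actual data schema:

```python
from datetime import datetime, timedelta, timezone

def select_candidate_prs(pull_requests, min_stars=500, max_age_years=3):
    """Keep only pull requests from popular repositories with recent changes.
    Hypothetical sketch of the selection criteria described in the article."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=365 * max_age_years)
    return [
        pr for pr in pull_requests
        if pr["repo_stars"] > min_stars and pr["merged_at"] >= cutoff
    ]

now = datetime.now(timezone.utc)
sample = [
    {"repo_stars": 1200, "merged_at": now - timedelta(days=100)},      # kept
    {"repo_stars": 300,  "merged_at": now - timedelta(days=100)},      # too few stars
    {"repo_stars": 1200, "merged_at": now - timedelta(days=365 * 5)},  # too old
]
kept = select_candidate_prs(sample)
```

Only the first entry survives both filters; a pipeline like this would then be followed by the manual curation step that narrows ~39,000 candidates down to 100 tasks.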
According to the AndroidBench website, the highest scores go to models that perform well across four key areas: user interface, synchronization, persistence, and dependency injection.
Of the tests, 71% are based on Kotlin and 25% on Java. And while most of the selected GitHub repositories are applications, 58% of the benchmark's tasks involve library development. Tasks range from fixes as small as 27 lines to changes exceeding 400 lines, covering nearly the full scope of a seasoned software developer's work.
To prevent a model from succeeding simply by having memorized the code during training, Google employs contamination safeguards and manually checks the steps the model follows. This ensures that Gemini's 72.4% success rate reflects genuine problem-solving rather than recall.
According to the AndroidBench chart, this is the ranking of the best AI models for developing Android applications:
Gemini 3.1 Pro Preview: 72.4%
Claude Opus 4.6: 66.6%
GPT-5.2 Codex: 62.5%
Claude Opus 4.5: 61.9%
Gemini 3 Pro Preview: 60.4%
Claude Sonnet 4.6: 58.4%
Claude Sonnet 4.5: 54.2%
Gemini 3 Flash Preview: 42%
Gemini 2.5 Flash: 16.1%
