Apple's Ferret UI: A Glimpse into Apple's Future
Apple just released a paper on its Ferret-UI (user interface) agents, which may well be the future of Siri, as they offer a groundbreaking approach to navigating user interfaces. Apple has been quiet as a church about its AI efforts, even though some information has leaked out, such as reports of Apple negotiating with news platforms like the New York Times to license content for a suspected upcoming AI offering. The development of Ferret-UI, however, showcases the company's commitment to advancing technology in this space. Ferret-UI would allow Siri to perform tasks that go beyond basic voice commands by delving into the realm of visual elements within user interfaces. This innovative model can not only perceive and describe visual elements in great detail but also propose goal-oriented actions and deduce overall screen functionality through functional inference.
The Ferret-UI model stands out from previous multimodal large language models (MLLMs) due to its ability to read UI screens directly, without requiring external detection modules or screen view files. This self-sufficiency allows for advanced interactions with single-screen applications and opens up possibilities for improving accessibility, and, with WWDC around the corner, most likely for changing how we interact with Apple products as well.
To train the Ferret model to examine UIs, a collection of iPhone and Android screens was used, including datasets such as Rico for Android screens and AMP for iPhone screens across various apps. The training data covered both elementary tasks, like widget captioning, and more complex reasoning tasks, like detailed description, conversation perception, conversation interaction, and function inference.
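To make the distinction between elementary and advanced tasks concrete, here is an illustrative sketch of what two such training samples might look like. The field names, file names, and values are assumptions for illustration only, not Ferret-UI's actual data schema:

```python
# Hypothetical training samples; schema and values are assumed, not from the paper.

elementary_sample = {
    "task": "widget_captioning",           # elementary: describe a single element
    "screen": "screenshot_001.png",        # hypothetical screenshot file
    "region": {"x": 24, "y": 612, "w": 120, "h": 44},  # element bounding box
    "target": "Blue 'Sign In' button",
}

advanced_sample = {
    "task": "function_inference",          # advanced: reason about the whole screen
    "screen": "screenshot_002.png",
    "question": "What is the overall purpose of this screen?",
    "target": "A login screen where the user enters credentials to access the app.",
}

def describe(sample: dict) -> str:
    """Summarize a sample for logging or inspection."""
    return f"{sample['task']} on {sample['screen']}"

print(describe(elementary_sample))  # widget_captioning on screenshot_001.png
print(describe(advanced_sample))    # function_inference on screenshot_002.png
```

The key difference is scope: elementary samples are grounded in a single element's region, while advanced samples ask the model to reason over the entire screen.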
In experiments comparing Ferret with other models such as CogAgent, Fuyu, and GPT-4V across 14 different UI-navigation tasks, Ferret proved its efficacy, excelling at handling different aspect ratios and identifying sub-images within different UIs. Training took approximately one day for the base model (Ferret-UI-base), while the enhanced version (Ferret-UI-anyres) required around three days on multiple GPUs.

With these impressive capabilities detailed in the research paper, the question is whether this technology will feature prominently at Apple's upcoming WWDC, which will most likely be an Apple/AI bromance. Could Ferret be Apple's answer to competing AI agents on the market? Only time will tell whether this innovative approach becomes integrated into mainstream products like Siri or leads to new advancements in Apple's artificial intelligence technology.