Google Launches Gemini, Its New Multimodal AI Model

Over the past year, major tech players such as OpenAI, Microsoft, Meta, and Google have been racing to build multimodal AI systems.

Sundar Pichai, CEO of Alphabet and Google, and Demis Hassabis, CEO of Google DeepMind, have unveiled Gemini, a highly anticipated generative AI system. Google describes it as the most advanced and versatile of its AI models, capable of understanding and generating text, audio, code, video, and images. According to Google's reported benchmark results, Gemini outperforms OpenAI's GPT-4 on most tests of general tasks, reasoning, math, and code. The launch follows Google's earlier release of PaLM 2, the model family that powers a number of Google products.

What is Google Gemini?

The inaugural release, Gemini 1.0, is built for tasks that require integrating multiple data types. It is designed to be flexible and scalable enough to run on everything from large data centers to mobile devices. Google reports that the models exceed current state-of-the-art results on many benchmarks and are capable of sophisticated reasoning and problem-solving, in some scenarios even outperforming human experts.

Now, let's dive into the technical breakthroughs that underpin Gemini's extraordinary capabilities.

Proficiency in Handling - Text, Video, Code, Image, and Audio

Gemini 1.0 is natively multimodal: it is trained jointly across text, image, audio, and video. This joint training allows the model to comprehend and generate content across all of these data types. It is particularly proficient in handling:

Text

Gemini offers advanced language understanding, reasoning, synthesis, and problem-solving over textual information. Its proficiency in text-based tasks places it among the top-performing large language models (LLMs), outperforming inference-optimized models such as GPT-3.5 and rivaling some of the most capable models, such as PaLM 2 and Claude 2.

Code

Gemini Ultra excels at coding, a prevalent use case for current LLMs. It has been evaluated on both conventional and internal benchmarks covering a range of coding-related tasks. On HumanEval, a standard code-completion benchmark in which the model maps function descriptions to Python implementations, the instruction-tuned Gemini Ultra correctly implements 74.4% of the problems.

Additionally, on the newly introduced held-out evaluation benchmark for Python code generation tasks, Natural2Code, with no web leakage, Gemini Ultra achieves the highest score of 74.9%. These results highlight Gemini's outstanding competence in coding scenarios, positioning it at the forefront of AI models in this domain.
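To make the scoring idea concrete, here is a minimal sketch of how a HumanEval-style score can be computed: each problem has a set of checks, and the score is the fraction of problems whose generated function passes all of them. The problems, the stand-in "model outputs", and the helper names below are hypothetical and far simpler than the real benchmark.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical benchmark: problem name -> list of (arguments, expected result).
PROBLEMS: Dict[str, List[Tuple[tuple, object]]] = {
    "add": [((1, 2), 3), ((-1, 1), 0)],
    "is_even": [((4,), True), ((7,), False)],
}

# Stand-ins for model-generated completions, one per problem.
def add(a, b):
    return a + b

def is_even(n):
    return n % 2 == 1  # deliberately wrong, so this problem fails

CANDIDATES: Dict[str, Callable] = {"add": add, "is_even": is_even}

def passes(fn: Callable, checks: List[Tuple[tuple, object]]) -> bool:
    """A completion counts only if every check passes without raising."""
    try:
        return all(fn(*args) == expected for args, expected in checks)
    except Exception:
        return False

def pass_rate(candidates: Dict[str, Callable],
              problems: Dict[str, List[Tuple[tuple, object]]]) -> float:
    """Fraction of problems solved by their candidate completion."""
    solved = sum(passes(candidates[name], checks)
                 for name, checks in problems.items())
    return solved / len(problems)

print(f"{pass_rate(CANDIDATES, PROBLEMS):.0%}")  # prints "50%"
```

A score such as 74.4% is read the same way: roughly three out of four problems received an implementation that passed its tests.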

Image

Gemini performs comparably to OpenAI’s GPT-4V or previous state-of-the-art models in image understanding and generation. Gemini Ultra consistently outperforms existing approaches, even in zero-shot scenarios, particularly for OCR-related image understanding tasks without an external OCR engine. It demonstrates strong performance across diverse tasks, including answering questions on natural images and scanned documents, as well as understanding infographics, charts, and science diagrams.

Gemini can output images directly without relying on an intermediate natural language description, avoiding potential bottlenecks in the model's ability to express images. This unique capability enables the model to generate images with prompts using interleaved image and text sequences in a few-shot setting. For instance, a user could prompt the model to suggest images and text for a blog post or website design.

Video Understanding

Gemini's video understanding is evaluated on several held-out benchmarks. With 16 frames sampled from each video, Gemini models show strong temporal reasoning. In November 2023, Gemini Ultra achieved state-of-the-art results on few-shot video captioning and zero-shot video question-answering tasks.

In one qualitative example, Gemini Ultra analyzes the ball-striking mechanics of a soccer player in a video and reasons about how the player could improve their game. These findings point to advanced video understanding capabilities, an important step toward a capable generalist agent.
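The 16-frame sampling mentioned above can be pictured with a short sketch. The article does not say how the frames are chosen, so the even spacing used here is an assumption, and `sample_frame_indices` is a hypothetical helper rather than anything from Google's pipeline.

```python
from typing import List

def sample_frame_indices(total_frames: int, num_samples: int = 16) -> List[int]:
    """Pick num_samples frame indices spread evenly across a clip.

    Assumption: frames are equally spaced. Each index is the midpoint
    of one of num_samples equal-length segments of the video.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    if total_frames <= num_samples:
        # Short clip: use every frame that exists.
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]

# A 160-frame clip yields indices 5, 15, 25, ..., 155.
print(sample_frame_indices(160))
```

Whatever the exact scheme, the point is that the model sees a fixed, small number of frames per clip rather than the full video.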

Audio Understanding

Gemini Nano-1 and Gemini Pro are assessed on tasks such as automatic speech recognition (ASR) and automatic speech translation (AST). The Gemini models are compared against the Universal Speech Model (USM) and Whisper across diverse benchmarks.

Gemini Pro stands out, surpassing the USM and Whisper models on all ASR and AST tasks for both English and multilingual test sets. The gain on the FLEURS benchmark is especially large, in part because Gemini Pro was trained with the FLEURS dataset. Even when trained without FLEURS, Gemini Pro still outperforms Whisper, with a word error rate (WER) of 15.8. Gemini Nano-1 also outperforms USM and Whisper on all datasets except FLEURS. Gemini Ultra's audio performance has not yet been evaluated, but its larger scale is expected to yield better results.
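For readers unfamiliar with the metric, word error rate counts the word-level substitutions, deletions, and insertions needed to turn the model's transcript into the reference, divided by the number of reference words; lower is better. The sketch below is a plain edit-distance implementation for illustration, with simple whitespace tokenization as an assumption, and is not the evaluation code used for Gemini.

```python
from typing import List

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage of the reference length."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (all but the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)

# One wrong word out of four reference words.
print(word_error_rate("a b c d", "a x c d"))  # prints 25.0
```

On this scale, a WER of 15.8 means roughly 16 word errors per 100 reference words.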

The Three Gemini Model Sizes

The model comes in three sizes, with each size specifically tailored to address different computational limitations and application requirements:

Gemini Ultra

The Gemini architecture scales efficiently on TPU accelerators, enabling the most capable model, Gemini Ultra, to achieve state-of-the-art performance across diverse and complex tasks, including reasoning and multimodal tasks.

Gemini Pro

Gemini Pro is optimized for performance, cost, and latency, and performs well across a wide range of tasks. It demonstrates robust reasoning abilities and extensive multimodal capabilities.

Gemini Nano

Gemini Nano is the most efficient model and is designed to run on-device. It comes in two versions: Nano-1, with 1.8B parameters, for low-memory devices, and Nano-2, with 3.25B parameters, for high-memory devices. Distilled from larger Gemini models, it is quantized to 4 bits for deployment and delivers best-in-class performance for its size.
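To give a feel for what 4-bit quantization means, the sketch below maps floating-point weights to 16 integer levels and back. The article does not describe Gemini Nano's actual scheme, so the symmetric, single-scale approach and the function names here are assumptions chosen for simplicity.

```python
from typing import List, Tuple

def quantize_4bit(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to signed 4-bit codes in [-8, 7] with one shared scale.

    Assumption: symmetric quantization with a single scale per tensor.
    """
    if not weights:
        raise ValueError("weights must not be empty")
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the 4-bit codes."""
    return [c * scale for c in codes]

weights = [1.4, -0.6, 0.0, 0.25]
codes, scale = quantize_4bit(weights)
print(codes)                     # prints [7, -3, 0, 1]
print(dequantize(codes, scale))  # approximate weights; 0.25 comes back as ~0.2
```

Each weight now needs only 4 bits instead of 16 or 32, at the cost of a small rounding error, which is the trade-off that makes on-device deployment practical.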

Now, let’s look at the technical capabilities of the Gemini models.

Technical Capabilities

Developing the Gemini models demanded innovations in training algorithms, datasets, and infrastructure. The Pro model benefits from scalable infrastructure, completing pretraining in weeks using a fraction of Ultra's resources. The Nano series builds on advances in distillation and training algorithms to produce top-tier small language models for diverse tasks, powering on-device experiences. Let’s dive into the technical innovations:

Training Infrastructure

The Gemini models were trained on Tensor Processing Units (TPUs), specifically TPUv5e and TPUv4, with Gemini Ultra using a large fleet of TPUv4 accelerators across multiple data centers. Scaling up from the prior flagship model, PaLM 2, posed infrastructure challenges and required solutions for hardware failures and network communication at unprecedented scale. The “single controller” programming model of Jax and Pathways simplified the development workflow, while keeping redundant copies of the model state in memory significantly sped up recovery from unplanned hardware failures. Silent Data Corruption (SDC) at this scale was addressed with techniques such as deterministic replay and proactive SDC scanners.

Training Dataset

The Gemini models are trained on a diverse dataset that is both multimodal and multilingual, incorporating web documents, books, code, and media data. The models use the SentencePiece tokenizer; training the tokenizer on a large sample of the entire corpus improves the vocabulary and model performance and enables efficient tokenization of non-Latin scripts. The dataset size varies with model size, and quality and safety filters are applied, including heuristic rules and model-based classifiers. Data mixtures and weights are determined through ablations on smaller models, and training is staged so that the mixture composition can be adjusted over the course of pretraining.

Gemini’s Architecture

Although complete details are undisclosed, the researchers state that Gemini models are built on Transformer decoders, with architecture and model optimization improvements for stable training at scale. The models are written in Jax and trained on TPUs. The architecture resembles earlier models such as Flamingo, CoCa, and PaLI, featuring a separate text and vision encoder.

Ethical Considerations

Gemini follows a structured approach to responsible deployment: identifying, measuring, and managing the foreseeable downstream societal impacts of the models.

Safety Testing and Quality Assurance

Emphasizing responsible development, Gemini focuses on safety testing and quality assurance. Rigorous evaluation targets set by Google DeepMind’s Responsibility and Safety Council (RSC) across key policy domains underscore Gemini's commitment to upholding ethical standards. Safety considerations are integral to the development process, ensuring Gemini meets the highest quality and ethical responsibility standards. Gemini Ultra undergoes trust and safety evaluations, including red-teaming by external parties, and is refined through fine-tuning and reinforcement learning from human feedback (RLHF) before wide availability.

Potential Risks and Challenges

The creation of a multimodal AI model introduces specific risks, and Gemini prioritizes risk mitigation across various aspects, aligning with Google’s AI Principles.

Application & Performance Enhancements

Gemini Pro x Google Bard Chatbot

Google's counterpart to ChatGPT, Bard, is now powered by Gemini Pro. Bard, an experimental conversational AI service by Google, was initially driven by LaMDA (Language Model for Dialogue Applications). It combines extensive knowledge with large language models to deliver creative and informative responses, aiming to simplify complex topics and engage users in meaningful conversations.

Gemini Nano x Pixel 8 Pro

Designed for on-device applications, Gemini Nano will be released as a feature update on the Pixel 8 Pro. This integration introduces two enhanced features: Summarize in Recorder and Smart Reply in Gboard. Gemini Nano ensures sensitive data stays on the device, providing offline functionality. Summarize in Recorder offers condensed insights from recorded content without a network connection, while Smart Reply in Gboard, powered by Gemini Nano, suggests high-quality responses with conversational awareness.

Generative Search

Gemini AI is now employed for the Search Generative Experience (SGE), resulting in a 40% reduction in latency for English searches in the U.S. This enhancement accelerates the search process and enhances the quality of search results. Gemini's application in Search represents a significant stride toward a more efficient and refined generative search experience, potentially reshaping how users interact with information through Google Search.

Google Platform Integrations

In the upcoming months, Gemini is poised to expand its presence across various Google products and services, offering enhanced functionalities and experiences. Users can expect Gemini's integration in key platforms such as Search, Ads, Chrome, and Duet AI.

What’s Next?

Gemini 1.0's prospects focus on new applications and use cases enabled by its capabilities:

Complex image understanding: Gemini's ability to parse complex images opens possibilities in visual data interpretation.
Multimodal reasoning: The model's capacity to reason over interleaved images, audio, and text sequences is promising for applications requiring diverse information integration.
Educational applications: Gemini's advanced reasoning skills can enhance personalized learning and intelligent tutoring systems.
Multilingual communication: Proficiency in multiple languages makes Gemini valuable for improving communication and translation services.
Information summarization and extraction: Gemini's ability to process and synthesize information suits summarization and data extraction tasks.
Creative applications: The model's potential for creative tasks, generating novel content, or assisting in creative processes, is significant.
