
Meta Introduces V-JEPA

The V-JEPA model, proposed by Yann LeCun, is a non-generative model that learns by predicting missing parts of a video in an abstract representation space. It’s pretty much learning by watching video. Unlike generative approaches, V-JEPA has the flexibility to discard unpredictable information, leading to improved training efficiency. It takes a self-supervised learning approach and is pre-trained entirely with unlabeled data, using labels only for task adaptation after pre-training. V-JEPA’s masking methodology involves blocking out portions of videos in both space and time to force the model to develop a deeper understanding of the scene. This approach allows the model to focus on higher-level conceptual information rather than details irrelevant for downstream tasks. V-JEPA’s efficiency lies in its ability to pre-train once without labeled data and then reuse parts of the model for various tasks efficiently.

Abstract Representations: Unlocking Object Interactions in Raw Video Data

At the heart of V-JEPA’s capabilities lies its unique ability to predict object interactions by learning abstract representations from raw video data. Through self-supervised learning, the model excels at predicting missing parts of video segments, gaining insights into latent features that define how elements in a scene interact.
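
To make the idea of predicting in representation space concrete, here is a toy PyTorch sketch of the general JEPA recipe: a context encoder sees only the visible patches, a separate target encoder embeds the full clip, and a predictor is trained to regress the target embeddings of the masked patches. This is an illustration of the principle under simplified assumptions (random patch embeddings, plain transformer encoders, no EMA update), not Meta’s actual V-JEPA code.

import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 256
context_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2
)
target_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2
)
predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

# Pretend these are embedded spacetime patches of a batch of video clips: (batch, patches, dim)
patches = torch.randn(8, 196, dim)
mask = torch.rand(8, 196) < 0.5               # True = hidden from the context encoder

with torch.no_grad():                          # targets come from a separate encoder
    targets = target_encoder(patches)          # (in V-JEPA this is an EMA of the context encoder)

context = context_encoder(patches * (~mask).unsqueeze(-1))   # zero out the masked patches
preds = predictor(context)

# Regress the representations of the masked patches only, never raw pixels.
loss = F.smooth_l1_loss(preds[mask], targets[mask])
loss.backward()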

Key Ideas:

  • Non-Generative Model: V-JEPA doesn’t focus on reconstructing videos pixel by pixel. Instead, it learns to predict missing pieces of a video within a conceptual, or abstract, space of representations.
  • Abstract Representation Space: Think of this space like a set of high-level features that describe important parts of a video (objects, actions, relationships). V-JEPA understands videos through these features, not just their raw pixels.
  • Comparison with I-JEPA: V-JEPA is an extension of I-JEPA. Both systems aim to learn by comparing pieces of data in this abstract representation space, rather than directly comparing pixels.
  • Flexibility and Efficiency: Since V-JEPA targets the important concepts rather than every single pixel, it can ignore irrelevant details. This makes it faster and more efficient during training. Data that’s unpredictable or noisy gets less focus.

Stability and Efficiency: Setting V-JEPA Apart

V-JEPA’s distinctive approach results in a more stable and efficient system, marking a departure from traditional AI models. Its adaptability and stability make it a standout choice for various applications, particularly in fields like robotics and self-driving cars, where understanding the environment is crucial for effective decision-making.

Versatility in Action: Adaptable Without Direct Parameter Fine-Tuning

One of V-JEPA’s key strengths lies in its versatility. The model serves as a foundation for various tasks and can be easily adapted without the need for direct parameter fine-tuning. This flexibility positions V-JEPA as a powerful tool for industries requiring quick and efficient implementation.

Future Prospects: Bridging the Gap to Natural Intelligence

While V-JEPA currently outperforms other models in video reasoning over several seconds, Meta’s research team is pushing boundaries further. The goal is to enhance the model’s time horizon and bridge the gap between JEPA and natural intelligence by exploring multimodal representations, indicating a commitment to continuous innovation.

Path Towards Advanced Machine Intelligence (AMI)

While V-JEPA has primarily focused on perceptual tasks related to video understanding, the next phase involves leveraging the model’s predictive abilities for planning and sequential decision-making. By training JEPA models on video data without extensive supervision, there is potential for these models to passively learn from visual inputs and quickly adapt to new tasks with minimal labeled data. This progression hints at the broader applications of V-JEPA in embodied AI systems and contextual AI assistants for augmented reality devices. The future prospects of V-JEPA lie in its ability to revolutionize machine intelligence by bridging the gap between human-like learning processes and efficient task completion across various domains.

Yann LeCun’s Endorsement: Advocating for the Promise of JEPA

Yann LeCun’s longstanding advocacy for JEPA raises intriguing questions about the technology’s relatively limited attention in the broader research community. With the success of V-JEPA, the promise of JEPA as a paradigm-shifting approach gains further credence, challenging established norms in AI research.

V-JEPA could potentially play a significant role in Llama 3’s advancements, offering enhanced video reasoning and understanding for improved user experiences. Llama 2, released not long ago, was seen as a massive advancement in open-source AI. As rumors circulate about a potential July release, the integration of V-JEPA could signify a leap forward in Llama 3’s capabilities, providing users with a more sophisticated and intuitive AI experience.

Mistral Large is Officially Released – Partners With Microsoft

Mistral has finally released their largest model to date, Mistral Large. It’s a cutting-edge language model with top-tier reasoning capabilities, proficient in English, French, Spanish, German, and Italian, and excelling at tasks like text understanding, transformation, and code generation. Mistral Large ranks as the world’s second-best model available through an API, just behind GPT-4. It offers a 32K-token context window for precise information recall and supports function calling. Mistral AI has partnered with Microsoft to make their models available on Azure, providing access through Azure AI Studio and Azure Machine Learning. Mistral Large outperforms other models in multilingual tasks and excels in coding and math challenges. You can test the model yourself on their site.

Mistral Comparison

Mistral Large is a cutting-edge text generation model with top-tier reasoning capabilities. This comes just after Mistral released their 7B model late last year, followed shortly by their 8x7B MoE model; they seem to be moving fast. The new Mistral Large model excels in complex multilingual tasks like text understanding, transformation, and code generation. It ranks as the world’s second-best model available through an API, just behind GPT-4. Detailed benchmarks show its strong performance across a range of tasks, making it a powerful tool for developers and researchers.

Key Features of Mistral Large:

  1. Multilingual Proficiency: Fluent in English, French, Spanish, German, and Italian with a deep understanding of grammar and cultural nuances.
  2. Large Context Window: With a 32K tokens context window, it can recall precise information from extensive documents.
  3. Precise Instruction-Following: Enables developers to create custom moderation policies efficiently, as demonstrated in setting up system-level moderation for le Chat.
  4. Function Calling Capability: Built-in function calling combined with constrained output mode on la Plateforme facilitates application development and modernization of tech stacks at scale (a rough sketch of a function-calling request follows this list).
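
To make the function-calling feature concrete, here is a rough sketch of what a tool-augmented request to Mistral Large could look like over the raw HTTP API. The endpoint and payload shape follow Mistral’s public chat-completions API as documented around this time; field names can change between versions, and the get_exchange_rate tool is purely hypothetical, so treat this as illustrative rather than canonical.

import os
import requests

# Hypothetical tool definition in the OpenAI-style JSON schema that Mistral's API accepts.
tools = [{
    "type": "function",
    "function": {
        "name": "get_exchange_rate",
        "description": "Return the exchange rate between two currencies.",
        "parameters": {
            "type": "object",
            "properties": {
                "base": {"type": "string"},
                "quote": {"type": "string"},
            },
            "required": ["base", "quote"],
        },
    },
}]

response = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json={
        "model": "mistral-large-latest",
        "messages": [{"role": "user", "content": "What is the EUR/USD rate right now?"}],
        "tools": tools,
        "tool_choice": "auto",
    },
    timeout=30,
)
# If the model decides to call the tool, the tool call (name plus JSON arguments)
# appears in the returned assistant message instead of plain text.
print(response.json()["choices"][0]["message"])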

Side note: Mistral Large is priced roughly 20% cheaper than GPT-4 Turbo, though it is also a slightly weaker model. I’m curious to see how things play out and whether that trade-off is worthwhile for many applications. An interesting question is whether the 20% discount will be enough of a selling point.

Mistral Large Reasoning Capabilities

Mistral Large’s performance is compared to the top-leading LLM models on commonly used benchmarks, showcasing its powerful reasoning capabilities. The figure in question reports the performance of pre-trained models on standard benchmarks.

Mistral-Microsoft Partnership

The partnership between Microsoft and Mistral AI aims to accelerate AI innovation by leveraging Azure’s cutting-edge AI infrastructure to develop and deploy next-generation large language models (LLMs). Mistral AI’s flagship commercial model, Mistral Large, is now available on Azure AI, offering state-of-the-art reasoning and knowledge capabilities for various text-based applications. This collaboration focuses on supercomputing infrastructure support, scaling premium models through Models as a Service (MaaS), and exploring AI research and development opportunities, including training purpose-specific models for select customers like the European public sector. Here is a tweet by Microsoft’s CEO Satya Nadella.

This partnership between Microsoft and Mistral AI is particularly interesting, considering Microsoft’s significant investment and role as a computing provider to OpenAI. The collaboration brings together the strengths of both companies, with Mistral AI focusing on developing advanced large language models and Microsoft providing its powerful Azure AI infrastructure.

Mistral’s previous two models are seen as a positive example of open sourcing leading to commercial success with LLMs. However, some may feel conflicted, given the company’s strong pro-open-source stance and the potential influence of Microsoft now that it has acquired a stake. There is uncertainty about Mistral’s future open-sourcing practices. If they do stop, releasing the full weights of Miqu for community fine-tuning would be a good gesture, especially since Mixtral proved disappointing to fine-tune.

Closing Thoughts

Another set of releases and, again, no AI has definitively beaten GPT-4, which was in private beta well over a year ago. Gemini Advanced is the only one at a similar level; Mistral Large sits below it. On deck: Gemini 1.5 Ultra… and GPT-5 (maybe Llama 3? Grok 2? Claude 3?). Sadly, they chose not to open-source Mistral Medium. Previously, Mistral AI offered open-source models like open-mistral-7B and open-mixtral-8x7b, aligning with their earlier promise of openness and contributing to the open-source community. Despite moving towards a more commercially oriented stance, Mistral AI still maintains elements of openness, allowing users to deploy and manage their models independently, supporting portability across clouds and infrastructures, and enabling extensive customization and fine-tuning.

Mistral had always maintained that they would retain the largest models for their own use. In all honesty, it would be a foolish decision if they were to simply replicate OpenAI’s approach. Although Mistral Large is a capable model, it falls short of GPT-4 in terms of intelligence and lacks the flexibility of Gemini 1.5 Pro. Therefore, it wouldn’t be logical to invest in the third-best option when there are alternatives available that offer superior intelligence and a larger context window.

A Guide to Stable Diffusion Inpainting


Have you ever stared at a beautiful image, wishing you could tweak just one little detail? Maybe add a soaring bird into a serene landscape, or transform a blank canvas into a portal to another world, or maybe even remove a watermark? Well, hold onto your brush (or mouse!), because Stable Diffusion Inpainting is here to make your artistic dreams a reality.

This powerful tool might sound complex, but this guide will demystify the magic of Stable Diffusion Inpainting, equipping you with the knowledge and confidence to breathe life into your images.

Whether you’re a seasoned graphic designer or a curious newbie, this guide is your passport to the exhilarating world of Stable Diffusion Inpainting. So, grab your virtual paintbrush, buckle up, and let’s embark on this creative adventure together.

How it Works

Stable Diffusion Inpainting is a process that uses the Stable Diffusion model to regenerate or restore missing parts of an image. You can also run it programmatically with the Diffusers library; a minimal sketch follows the steps below.

How Stable Diffusion Inpainting Works:

Stable Diffusion Inpainting involves the following steps:

  1. Creating an Inpaint Mask: Use tools like the paintbrush to create a mask indicating the area to be regenerated.
  2. Selecting Inpainting Settings: Choose settings such as image size, face restoration, and mask content like latent noise or latent nothing.
  3. Using an Inpaint Model: Optionally, you can use a Stable Diffusion model specifically trained for inpainting to achieve better results.
  4. Generating Images: The model generates images based on the provided mask and settings, aiming to seamlessly blend with the original image style.

Inpainting with Stable Diffusion allows for not only restoring missing parts but also creating entirely new content within an existing image, providing a wide range of creative possibilities.
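
If you want to try this programmatically rather than through a UI, here is a minimal sketch using Hugging Face’s Diffusers library. It assumes a CUDA GPU, local landscape.png and mask.png files, and the commonly used runwayml/stable-diffusion-inpainting checkpoint; any inpainting-trained checkpoint can be swapped in.

import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("landscape.png").convert("RGB").resize((512, 512))
mask = Image.open("mask.png").convert("RGB").resize((512, 512))   # white = area to regenerate

result = pipe(
    prompt="a soaring bird over the mountains",
    image=image,
    mask_image=mask,
    num_inference_steps=30,
).images[0]
result.save("inpainted.png")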

Limitations of Stable Diffusion Inpainting

Stable Diffusion Inpainting, despite its advantages, has some limitations that are important to consider:

  1. Complexity of Images: It excels in handling images with complex structures like textures, edges, and sharp transitions; however, inpainting very intricate or detailed areas may still pose challenges.
  2. Resource Intensive: The process can be resource-intensive, requiring significant GPU RAM for certain workflows. For example, a workflow may need around 6.1 GB of GPU RAM, which can be reduced to 4.9 GB by choosing a smaller SAM model.
  3. Artifacts in Output: While Stable Diffusion Inpainting generally produces stable, smooth results where other inpainting techniques may leave visible artifacts, a completely artifact-free result is still not guaranteed.

Despite these limitations, Stable Diffusion Inpainting remains a powerful tool for image restoration and manipulation across various fields such as film restoration, photography, medical imaging, and digital art.

How to Use

There are many paid/cloud services out there where you can use inpainting. Photoshop, for example, recently introduced Generative Fill, a new feature powered by Adobe Firefly, a family of creative AI models. Generative Fill enables users to easily manipulate images by adding, extending, or removing content using simple text prompts. This feature not only enhances the user experience but also expands the possibilities within Photoshop.

Open Source & Local

When you’re ready to dive into the world of Stable Diffusion Inpainting without reliance on paid or cloud services, the open-source community offers a robust and flexible solution with AUTOMATIC1111. This is an interface tailored for users who prefer to harness their own computing power for the art of inpainting.

Using AUTOMATIC1111

AUTOMATIC1111 stands as a beacon for enthusiasts of open-source software, providing access to the capabilities of Stable Diffusion models right on your local machine. But before you embark on this voyage, there are a few prerequisites to address:

  1. Compatible Hardware: You’ll need a machine equipped with a high-performance GPU. Since Stable Diffusion Inpainting can be resource-intensive, having a capable graphics processing unit is crucial for smooth operation and brisk processing times.
  2. Installation: The setup of AUTOMATIC1111 requires a bit of technical know-how. You’ll need to clone the repository from GitHub, install the necessary dependencies, and configure your environment. Detailed instructions are typically provided within the repository’s documentation to guide you through this process.
  3. Model Selection: Although AUTOMATIC1111 furnishes the framework for inpainting, you’ll also need to select a Stable Diffusion model that meets your creative needs. There are various models available, each trained to generate certain styles or types of images.
  4. Running the Software: Once you have everything in place, running the software is a matter of following the user interface’s prompts to upload your images, specify the area to inpaint, and choose your desired settings to let the magic happen.

Choosing to go the open-source and local route with AUTOMATIC1111 is about taking full control of your Stable Diffusion Inpainting adventures. It’s for those who have the technical inclination and the hardware to back it up, offering a sense of freedom and independence from subscription services. It’s a rewarding path for the savvy user, offering limitless possibilities to those who take the time to learn it.

Phind AI Releases Their 70B Model


Phind AI has released their Phind-70B model, which is their largest and most performant model to date. This model is based on the CodeLlama-70B model and is fine-tuned on an additional 50 billion tokens, resulting in significant improvements. Phind-70B supports a context window of 32K tokens and scores 82.3% on HumanEval, surpassing the latest GPT-4 Turbo score of 81.1% in their evaluation. It also runs at up to 80 tokens per second, providing high-quality answers for technical topics efficiently.

Phind-70B is faster than GPT-4 Turbo, running at 80+ tokens per second compared to GPT-4 Turbo’s ~20 tokens per second. This speed is achieved by running NVIDIA’s TensorRT-LLM library on H100 GPUs, with ongoing optimizations to further enhance inference speed. Phind-70B is available for free to try without a login, with the option to access higher limits by subscribing to Phind Pro.

Phind-70B Examples

Solving Hard Leetcode Problems

You are given an m x n integer array grid where grid[i][j] could be:

    1 representing the starting square. There is exactly one starting square.
    2 representing the ending square. There is exactly one ending square.
    0 representing empty squares we can walk over.
    -1 representing obstacles that we cannot walk over.

Return the number of 4-directional walks from the starting square to the ending square, that walk over every non-obstacle square exactly once.

This is from Leetcode 980. Unique Paths III.

Phind-70B’s answer

from typing import List

class Solution:
    def uniquePathsIII(self, grid: List[List[int]]) -> int:
        self.directions = [(0,  1), (0, -1), (1,  0), (-1,  0)]  # Right, Left, Down, Up
        self.empty_squares =  1  # Start counting from  1 to include the start square
        self.paths =  0
        self.start = None

        # Find the starting point and count empty squares
        for i in range(len(grid)):
            for j in range(len(grid[0])):
                if grid[i][j] ==  1:
                    self.start = (i, j)
                elif grid[i][j] ==  0:
                    self.empty_squares +=  1

        def dfs(x, y, empty_squares):
            if not (0 <= x < len(grid)) or not (0 <= y < len(grid[0])) or grid[x][y] <  0:
                return  0
            if grid[x][y] ==  2:
                return  1 if empty_squares ==  0 else  0
            grid[x][y] = -2  # Temporarily mark as visited
            self.paths += sum(dfs(x + dx, y + dy, empty_squares -  1) for dx, dy in self.directions)
            grid[x][y] =  0  # Unmark after exploring all paths
            return  0

        dfs(*self.start, self.empty_squares)
        return self.paths

Unfortunately it was not able to get the initial test to pass.

Frontend UI Code

Prompt:

Generate a header using tailwind css. We want a logo on the left, a home, about and contact link in the center. On the right we want a Log In and Sign Up buttons
<header class="bg-gray-800 text-white p-4 flex items-center justify-between">
  <!-- Logo -->
  <div class="flex items-center">
    <img src="path/to/your/logo.png" alt="Logo" class="h-8 w-auto">
  </div>

  <!-- Navigation Links -->
  <nav class="hidden md:flex space-x-4">
    <a href="#" class="hover:text-gray-300">Home</a>
    <a href="#" class="hover:text-gray-300">About</a>
    <a href="#" class="hover:text-gray-300">Contact</a>
  </nav>

  <!-- Log In and Sign Up Buttons -->
  <div class="flex space-x-4">
    <button class="bg-blue-500 hover:bg-blue-700 text-white font-bold py-2 px-4 rounded">
      Log In
    </button>
    <button class="bg-green-500 hover:bg-green-700 text-white font-bold py-2 px-4 rounded">
      Sign Up
    </button>
  </div>
</header>

Ok this one was actually pretty impressive and had great coherence.

Working With Libraries

Prompt:

Given a NumPy array arr with shape (3, 4, 5), reshape it into a new array with shape (6, 10).
import numpy as np

# Assuming arr is your original array with shape (3,  4,  5)
# arr = np.random.rand(3,  4,  5)  # Example of how to create such an array

# Reshape the array
new_arr = arr.reshape(6,  10)

print(new_arr.shape)  # This should output: (6,  10)

Explanation:

  1. arr.reshape(2, 3, 4, 5): This reshapes the array arr into a four-dimensional array with shape (2, 3, 4, 5). This is an intermediate step to achieve the desired shape of (6, 10).
  2. arr.transpose(0, 2, 1, 3): This transposes the axes of the array arr in the following order: (0, 2, 1, 3). This effectively swaps the first and second dimensions, and the third and fourth dimensions. This results in an array with shape (2, 4, 3, 5).
  3. arr.reshape(6, 10): Finally, this reshapes the array arr into a two-dimensional array with shape (6, 10). This achieves the desired output shape.

This solution demonstrates an understanding of NumPy’s reshape and transpose functions, as well as the concept of reshaping multidimensional arrays.

Closing Thoughts

We’ve all seen the benchmarks claiming Phind AI can beat GPT-4. Benchmarks tend not to be so reliable; many models seem to be contaminated just to approach proprietary models on coding benchmarks. Overall it’s a decent model. I wouldn’t say it’s better than GPT-4, but it is certainly better than most, for programming at least. They also don’t specify what it actually beats GPT-4 on. Every time I see one of these, it “wins” on incredibly simple, small functions while failing to understand more complex requests.

Stable Diffusion 3 is Here


Stable Diffusion 3 is a new text-to-image model by Stability AI that is currently in an early preview phase. This model boasts improved performance in multi-subject prompts, image quality, and spelling abilities. The suite of Stable Diffusion 3 models ranges from 800M to 8B parameters, offering users scalability and quality options. This comes shortly after Stability AI released Stable Cascade. The model combines a diffusion transformer architecture and flow matching. Safety measures have been implemented to prevent misuse, with ongoing collaboration with experts and the community to ensure responsible AI practices. The aim is to make generative AI open, safe, and universally accessible. Users interested in commercial use of other image models can visit Stability AI’s Membership page or Developer Platform. To follow updates on Stable Diffusion 3, users can connect with Stability AI on various social media platforms.

Stable Diffusion 3 Examples

Prompt Coherence

Prompt: “Photo of a red sphere on top of a blue cube. Behind them is a green triangle, on the right is a dog, on the left is a cat”

For those who don’t see why this is impressive, compare the SDXL and DALL-E attempts below.

Stable Diffusion 3 seems to have pretty good prompt coherence. This is very big news if SD3 can understand prompts this well.

Stable Diffusion 3 can handle text

Some images shared by Emad Mostaque, CEO of Stability AI.

Architecture

The suite ranges from 800M to 8B parameters and uses a diffusion transformer architecture similar to the one behind OpenAI’s Sora. It’s Stability’s most capable text-to-image model yet, combining that diffusion transformer with flow matching, which allows it to scale efficiently and generate high-quality images from text descriptions called “prompts”, with greatly improved performance on multi-subject prompts, image quality, and spelling.

Diffusion Transformers (DiTs) leverage the power of transformer architecture, which has proven to be highly effective in various natural language processing tasks, and adapt it for image generation.

The use of transformers in DiTs allows for better scalability, robustness, and efficiency compared to traditional U-Net backbones. By replacing the U-Net architecture with transformers, DiTs can process images more effectively and generate higher-quality results. This is evident in the research findings, which show that higher forward pass complexity (measured in Gflops) leads to lower Fréchet Inception Distance (FID) scores, indicating better performance.
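
The “flow matching” half of that combination can sound abstract, so here is a toy sketch of the training objective on 2-D points instead of image latents: interpolate linearly between data and noise, and train a small network to predict the velocity along that path. This is a conceptual illustration under simplified assumptions, not Stability’s SD3 training code.

import torch
import torch.nn as nn

# A tiny stand-in for the diffusion transformer: it sees a noisy sample and a time t.
model = nn.Sequential(nn.Linear(2 + 1, 64), nn.SiLU(), nn.Linear(64, 2))

x0 = torch.randn(512, 2)                  # "data" (2-D points standing in for image latents)
noise = torch.randn_like(x0)
t = torch.rand(512, 1)                    # random time in [0, 1] for each sample

xt = (1 - t) * x0 + t * noise             # straight-line path between data and noise
target_velocity = noise - x0              # the velocity of that path is the regression target

pred = model(torch.cat([xt, t], dim=-1))
loss = ((pred - target_velocity) ** 2).mean()
loss.backward()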

Is Stable Diffusion 3 Open Source?

Like prior SD models, the weights/parameters will be released openly after the current feedback and improvement phase. Stability notes that they are open-data for their language models, but not for other modalities.

This model is not yet widely available but is being offered for early preview through a waitlist to gather insights for further improvements before an open release. Stability AI emphasizes safety practices by implementing safeguards throughout the training, testing, evaluation, and deployment phases to prevent misuse of Stable Diffusion 3.

Google Releases Gemma, an Open Sourced LLM

Gemma is a new generation of open models from Google aimed at assisting developers and researchers in building AI responsibly. Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. It includes model weights in two sizes, Gemma 2B and Gemma 7B, each released with pre-trained and instruction-tuned variants; you can try the model on Perplexity Labs now.

Google is releasing a Responsible Generative AI Toolkit to provide guidance and essential tools for creating safer AI applications with Gemma. This comes just a week after Google announced Gemini Pro 1.5. The models are available worldwide and can run on various platforms, including laptops, workstations, and Google Cloud. Gemma is designed with AI Principles at the forefront, and extensive evaluations have been conducted to ensure the safety and reliability of the models. The Responsible Generative AI Toolkit includes safety classification, debugging, and guidance tools.

Gemma also supports a wide variety of tools and systems, including multi-framework tools, cross-device compatibility, and cutting-edge hardware platforms. Furthermore, free credits for research and development are being offered to enable developers and researchers to start working with Gemma. The models are optimized for Google Cloud, and advanced customization is available with fully-managed Vertex AI tools or with self-managed GKE. Researchers can apply for Google Cloud credits of up to $500,000 to accelerate their projects. More information about Gemma and quickstart guides can be accessed on ai.google.dev/gemma.
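
If you’d rather run Gemma locally than use a hosted demo, a minimal sketch with Hugging Face Transformers looks roughly like the following. It assumes you have accepted Gemma’s terms on the Hub, are logged in, and have enough GPU memory for the 7B instruction-tuned variant; adjust the model id or dtype for your setup.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-7b-it"  # instruction-tuned 7B variant on the Hugging Face Hub

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("Write a haiku about open models.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))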

Pushing open source

Google is no stranger to open source, having introduced the Transformer architecture and released many open-source tools. With Gemma, the company is reinforcing its commitment to open-source innovation and creating a ripple of anticipation across the AI landscape. This isn’t Google’s first foray into open AI work: the Transformer research and libraries like TensorFlow laid cornerstones for what has become a thriving ecosystem of machine learning models and applications. Gemma’s introduction marks a substantial stride forward for open-source AI, not only providing a new suite of powerful models but also championing responsible AI practices.

Comparing to Other Models

A Redditor from the r/LocalLLaMA community recently benchmarked Gemma 7B against other pre-trained 7Bs on the Open LLM Leaderboard, noting its standout performance boost on GSM8K, which highlights Gemma’s enhanced capabilities in mathematical reasoning.

Despite the specialized improvement in mathematical tasks with the 7B Gemma model, it doesn’t quite surpass the overall performance of the Mistral 7B. It seems to have a slight edge over Llama 7B, perhaps due to its architectural similarities to Llama 2, such as RMSNorm and RoPE. Notable differences include the use of a GeLU activation function instead of SiLU, a larger context window of 8k (twice that of Llama 2), wider intermediate layers, and fewer layers overall, at 28 compared to Llama’s 32. After hours of hands-on testing, it’s clear that Gemma 7B falls short of becoming a go-to over the established Mistral 7B models for broader applications.

Introducing Groq One of The Fastest LLM Chats

In the rapidly advancing realm of artificial intelligence, speed and efficiency are not just goals; they are imperatives. As AI models grow increasingly complex, the quest for faster, more responsive computing has led to a groundbreaking innovation: Groq’s Tensor Streaming Processor (TSP), the engine behind its Language Processing Unit (LPU), which stands poised to redefine the landscape of AI computation, with response rates clocking in at an astonishing nearly 500 tokens per second. Note that Groq is not an LLM itself; the underlying models are Mixtral or Llama, and the performance gain comes from hardware rather than from algorithmic improvements.
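
If you want to sanity-check the throughput claims yourself, a rough sketch using Groq’s Python client might look like the following. The groq package, the model id, and the crude characters-per-token estimate are all assumptions that may not match current GroqCloud offerings, so check their documentation.

import os
import time

from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

start = time.time()
completion = client.chat.completions.create(
    model="mixtral-8x7b-32768",  # assumed model id served by GroqCloud at the time
    messages=[{"role": "user", "content": "Explain what an LPU is in three sentences."}],
)
elapsed = time.time() - start

text = completion.choices[0].message.content
approx_tokens = len(text) / 4  # very rough heuristic: ~4 characters per token
print(f"~{approx_tokens / elapsed:.0f} tokens/s over {elapsed:.2f}s")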

Traditional GPUs, with their parallel processing capabilities and multitude of cores, have long been the standard bearers in the field of AI and graphics rendering. However, these GPUs operate on the SIMD (Single Instruction, Multiple Data) model, a structure that, while powerful, comes with its own set of complexities, particularly when it comes to scheduling and latency. Enter Groq’s LPU, a novel design that sidesteps these issues by adopting a deterministic performance model specifically catered to AI workflows.

The LPU’s architecture eschews the conventional approach in favor of a streamlined, every-clock-cycle-counts design, ensuring a level of consistent latency and throughput that was once thought unachievable. For developers, this translates to unprecedented precision in performance prediction and optimization, a pivotal advantage for real-time AI applications.

This design is not only a beacon of performance but also of energy efficiency. By eliminating the need to manage multiple threads and by maximizing core utilization, the LPU ensures more computations per watt than ever before. Energy efficiency, combined with the LPU’s scalability—wherein multiple TSPs can be seamlessly linked without the common bottlenecks present in GPU clusters—heralds a new era of simplified hardware expansion for large-scale AI models.

The implications extend far beyond mere technical specs. LPUs promise to shape the future of AI application serving, offering a robust alternative to the highly sought-after A100s and H100s. With Groq’s TSP, we stand on the precipice of a transformative leap in performance—one that could very well accelerate the pace of AI innovation and broaden the horizons of what is computationally possible.

Potential Applications

Autonomous agents

Building autonomous agents, for example with LangChain, stands to gain substantially from the increased tokens-per-second (T/s) throughput provided by advanced processors like Groq’s Tensor Streaming Processor (TSP). Autonomous agents, ranging from virtual assistants to sophisticated robots, require rapid processing of data to interact with their environment effectively and make autonomous decisions. Here’s how faster T/s can be beneficial in this context:

  1. Real-Time Decision Making: Autonomous agents must process a vast array of inputs to make decisions in real time. The faster T/s rate allows for quicker analysis of sensor data, which is critical for agents that operate in dynamic or unpredictable environments.
  2. Improved Perception: Agents rely on processing visual, auditory, and other sensory data to perceive their surroundings. Accelerated T/s rates can lead to more advanced perception capabilities, enabling agents to understand and react to complex scenarios with higher accuracy.
  3. Interactive Learning: Machine learning algorithms, especially those involving reinforcement learning where an agent improves through trial and error, can greatly benefit from faster processing. With more computations per second, agents can iterate and learn from interactions much quicker.
  4. Advanced Natural Language Understanding: For agents that interact with humans, rapid T/s enables sophisticated language models to parse, understand, and generate language in real-time, leading to more natural and fluid conversations.
  5. Dynamic Path Planning: In robotics, quick processing speeds can facilitate more efficient path planning and obstacle avoidance, as the agent can reassess and adjust its trajectory instantaneously in response to changes in the environment.
  6. Enhanced Multi-agent Coordination: Faster T/s processing can improve the coordination among multiple autonomous agents, such as a fleet of drones or autonomous vehicles, allowing them to operate in harmony and respond to each other’s actions promptly.
  7. Human-like Reflexes: When speed is critical, such as in medical robots or in disaster response scenarios, the ability for an autonomous agent to respond quickly and appropriately can make the difference in outcomes.
  8. Robust Simulations for Training: Training autonomous agents often involves simulations that can be computationally intensive. High T/s rates can make these simulations more efficient, leading to better-trained agents in a shorter amount of time.

The development of autonomous agents that can respond and adapt to their environment in real time is a challenging task, and the demand for computational speed is ever-present. With the advancements in processors and higher T/s rates, it is becoming increasingly possible to create agents that are not only responsive and efficient but also capable of complex, nuanced interactions and behaviors that more closely mimic human-like intelligence.

How Did Groq Do It?

Groq’s LPU (Language Processing Unit) is faster and more energy-efficient than Nvidia GPUs for inference tasks. Unlike Nvidia GPUs, which require high-speed data delivery and High Bandwidth Memory (HBM), Groq’s LPUs use SRAM, which is roughly 20 times faster and consumes less power. Groq’s LPUs also use a Temporal Instruction Set Computer architecture, reducing the need to reload data from memory and sidestepping HBM shortages. Groq claims that its powerful chip and software could replace GPUs in AI tasks, potentially eliminating the need for specialized storage solutions. Does this mean LLMs were the killer app for TPU clouds?

This tweet goes more in-depth on their hardware.

OpenAI Releases Sora Text-to-Video

Unless you’re living under a rock, you already know: OpenAI has released its first text-to-video model, and it is impressive.

Sora is an AI model developed by OpenAI that can create realistic and imaginative scenes from text instructions. It is a text-to-video model capable of generating videos up to a minute long while maintaining visual quality and adherence to the user’s prompt. Sora is designed to understand and simulate the physical world in motion, with the goal of training models that help people solve problems requiring real-world interaction. The model can generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background.

Examples

Sora can generate multiple videos side-by-side at the same time.

Sora can combine videos

Sora can follow up and edit videos.

The Architecture

According to Sora’s technical report, Sora’s architecture involves turning visual data into patches, compressing videos into a lower-dimensional latent space, training a network that reduces the dimensionality of visual data, and extracting a sequence of spacetime patches which act as transformer tokens. Sora is a diffusion model that scales effectively as a video model and can generate videos with variable durations, resolutions, and aspect ratios. It can also be prompted with other inputs, such as pre-existing images or video, enabling a wide range of image and video editing tasks. Additionally, Sora exhibits emerging simulation capabilities, such as 3D consistency, long-range coherence, object permanence, interacting with the world, and simulating digital worlds. However, it also has limitations in accurately modeling the physics of basic interactions and other failure modes.

Sora is a comprehensive diffusion transformer model that processes text or images and generates video pixel output. By analyzing vast volumes of video data using gradient descent, Sora acquires an internal understanding of physical dynamics, essentially forming a trainable simulation or “world model.” While Sora doesn’t directly integrate Unreal Engine 5 (UE5) into its processing loop, it can incorporate text and video pairs created with UE5 into its training data as synthetic examples.
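
The “spacetime patches” idea is easier to see in code. Here is a toy illustration of only the patchification step: a raw video tensor is cut into small space-time blocks and flattened into a token sequence, which is what the transformer would then operate on. The shapes and patch sizes are arbitrary, and this deliberately skips the latent compression network that Sora applies first.

import torch

frames, height, width, channels = 16, 64, 64, 3
video = torch.randn(frames, height, width, channels)    # a raw video clip

pt, ph, pw = 2, 8, 8                                     # patch size in time, height, width
patches = (
    video.reshape(frames // pt, pt, height // ph, ph, width // pw, pw, channels)
         .permute(0, 2, 4, 1, 3, 5, 6)                   # group the patch dimensions together
         .reshape(-1, pt * ph * pw * channels)           # one row per spacetime patch
)
print(patches.shape)   # torch.Size([512, 384]) -> 512 "tokens" of dimension 384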

Limitations

Sora’s emergent physics understanding is still fragile and imperfect.

Despite extensive research and testing, OpenAI acknowledges that it cannot predict all the beneficial ways people will use the technology, nor all the ways people will abuse it. The model is based on a diffusion architecture and uses a transformer architecture similar to GPT models. It builds on past research in DALL·E and GPT models, using the recaptioning technique from DALL·E 3 to follow the user’s text instructions in the generated video more faithfully. Sora serves as a foundation for models that can understand and simulate the real world, a capability believed to be an important milestone for achieving AGI.

Putting Together an OpenAI Agent With LlamaIndex 

Thanks to the new OpenAI API that supports function calling, creating your own agent has never been easier!

In this tutorial notebook, we’ll demonstrate how to build an OpenAI agent in just 50 lines of code or less. Despite its brevity, our agent is fully featured and capable of carrying on conversations while utilizing various tools. Using LlamaIndex, we will build an agent that gets the current price of a stock using the Yahoo Finance API.

Setting Up

The main things we need are:

  1. the OpenAI API (using our own llama_index LLM class)
  2. a place to keep conversation history
  3. a definition for tools that our agent can use.

If you’re opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.

%pip install llama-index-agent-openai
%pip install llama-index-llms-openai
!pip install llama-index

Next, we need to set up the foundation for working with LlamaIndex agents, tools, and OpenAI’s LLMs, while also ensuring proper asynchronous execution.

import json
from typing import Sequence, List

from llama_index.llms.openai import OpenAI
from llama_index.core.llms import ChatMessage
from llama_index.core.tools import BaseTool, FunctionTool

import nest_asyncio

nest_asyncio.apply()

Here we will create a function using the Yahoo Finance API, to get the current price of a stock.

import yfinance as yf
from functools import cache
from typing import Union

@cache
def get_stock_price(ticker: str) -> Union[float, None]:
  """
  Retrieves the current price of a stock using yfinance, with caching for performance.

  Args:
    ticker: The ticker symbol of the stock.

  Returns:
    The current price of the stock, or None if an error occurs.
  """
  try:
    stock = yf.Ticker(ticker)
    return stock.info["regularMarketPrice"]
  except (KeyError, Exception) as e:
    print(f"Error retrieving price for {ticker}: {e}")
    return None

stock_tool = FunctionTool.from_defaults(fn=get_stock_price)

Agent Definition

Now, we define our agent that’s capable of holding a conversation and calling tools in under 50 lines of code.

The meat of the agent logic is in the chat method. At a high-level, there are 3 steps:

  1. Call OpenAI to decide which tool (if any) to call and with what arguments.
  2. Call the tool with the arguments to obtain an output
  3. Call OpenAI to synthesize a response from the conversation context and the tool output.

The reset method simply resets the conversation context, so we can start another conversation.

class YourOpenAIAgent:
    def __init__(
        self,
        tools: Sequence[BaseTool] = [],
        llm: OpenAI = OpenAI(temperature=0, model="gpt-3.5-turbo-0613"),
        chat_history: List[ChatMessage] = [],
    ) -> None:
        self._llm = llm
        self._tools = {tool.metadata.name: tool for tool in tools}
        self._chat_history = chat_history

    def reset(self) -> None:
        self._chat_history = []

    def chat(self, message: str) -> str:
        chat_history = self._chat_history
        chat_history.append(ChatMessage(role="user", content=message))
        tools = [
            tool.metadata.to_openai_tool() for _, tool in self._tools.items()
        ]

        ai_message = self._llm.chat(chat_history, tools=tools).message
        additional_kwargs = ai_message.additional_kwargs
        chat_history.append(ai_message)

        tool_calls = ai_message.additional_kwargs.get("tool_calls", None)
        # parallel function calling is now supported
        if tool_calls is not None:
            for tool_call in tool_calls:
                function_message = self._call_function(tool_call)
                chat_history.append(function_message)
                ai_message = self._llm.chat(chat_history).message
                chat_history.append(ai_message)

        return ai_message.content

    def _call_function(self, tool_call: dict) -> ChatMessage:
        id_ = tool_call["id"]
        function_call = tool_call["function"]
        tool = self._tools[function_call["name"]]
        output = tool(**json.loads(function_call["arguments"]))
        return ChatMessage(
            name=function_call["name"],
            content=str(output),
            role="tool",
            additional_kwargs={
                "tool_call_id": id_,
                "name": function_call["name"],
            },
        )

The agent serves as a bridge between the user and the LLM, managing conversation flow and tool integration. Tools extend the agent’s capabilities with custom functions. The agent maintains a chat history for context. It handles tool calls requested by the LLM, enabling dynamic interactions.

agent = YourOpenAIAgent(tools=[stock_tool])
agent.chat("Hi")
'Hello! How can I assist you today?'
agent.chat("What is the stock price of appl")

LlamaIndex ships several agent implementations, some more fully featured than others. For example, it provides an OpenAIAgent.

OpenAIAgent 

This agent implementation not only adheres to the BaseChatEngine and BaseQueryEngine interfaces, making it seamlessly compatible with the LlamaIndex framework, but also boasts several advanced features such as support for multiple function calls per conversation turn, streaming capabilities, async endpoints, and callback and tracing functionality.
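
The snippet below references multiply_tool and add_tool; those aren’t built into LlamaIndex, so for completeness here is how you might define them, following the same FunctionTool pattern used for stock_tool above.

from llama_index.core.tools import FunctionTool

def multiply(a: int, b: int) -> int:
    """Multiply two integers and return the result."""
    return a * b

def add(a: int, b: int) -> int:
    """Add two integers and return the result."""
    return a + b

multiply_tool = FunctionTool.from_defaults(fn=multiply)
add_tool = FunctionTool.from_defaults(fn=add)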

from llama_index.agent.openai import OpenAIAgent
from llama_index.llms.openai import OpenAI
llm = OpenAI(model="gpt-3.5-turbo-0613")
agent = OpenAIAgent.from_tools(
    [multiply_tool, add_tool], llm=llm, verbose=True
)

Streaming Chat

One key advantage is the ability to receive responses in a streaming fashion, allowing for incremental and real-time interaction with the model. This can be particularly useful for applications where immediate feedback or step-by-step processing is required, such as conversational interfaces, real-time translation, or content generation. Additionally, streaming chat supports async endpoints, callbacks and tracing, and async streaming chat, providing flexibility and efficiency in handling conversations and responses.

response = agent.stream_chat(
    "What is 121 * 2? Once you have the answer, use that number to write a"
    " story about a group of mice."
)

response_gen = response.response_gen

for token in response_gen:
    print(token, end="")

Google Releases Gemini 1.5 With 10M Context Window

Google has released its next-generation AI model, Gemini 1.5. It is a significant advancement over the previous model, Gemini 1.0 Ultra, and offers dramatic improvements across various dimensions. Gemini 1.5 Pro, the first model released for early testing, achieves comparable quality to 1.0 Ultra while using less compute. This is just 2 months after the initial release of Gemini.

One of the key breakthroughs is its long-context understanding, with a capacity to process up to 1 million tokens, enabling entirely new capabilities and applications for developers and enterprise customers. The model is built upon leading research on Transformer and Mixture-of-Experts (MoE) architecture, making it more efficient to train and serve. It also delivers enhanced performance, outperforming its predecessor on 87% of the benchmarks used for developing large language models (LLMs). Additionally, extensive ethics and safety testing have been conducted to ensure responsible deployment of the model.

Gemini 1.5 is Here

This is one of the most shocking examples:

With only instructional materials (500 pages of linguistic documentation, a dictionary, and ≈ 400 parallel sentences) all provided in context, Gemini 1.5 Pro is capable of learning to translate from English to Kalamang, a language spoken by fewer than 200 speakers in western New Guinea in the east of Indonesian Papua, and therefore almost no online presence. Moreover, we find that the quality of its translations is comparable to that of a person who has learned from the same materials.

– Gemini 1.5 is Google’s next-generation AI model, offering significant improvements over the previous model, Gemini 1.0 Ultra.

– It achieves comparable quality to 1.0 Ultra while using less compute and introduces a breakthrough in long-context understanding, with a capacity to process up to 1 million tokens.

Architecture

Gemini 1.5 is a state-of-the-art deep learning model built on top of cutting-edge research in Transformer and Mixture-of-Experts (MoE) architectures. Unlike traditional Transformers that use one large neural network, MoE models are composed of smaller “expert” networks.

MoE models dynamically activate only the most relevant expert pathways within their neural network based on the input they receive, significantly improving efficiency compared to conventional approaches. Google has been at the forefront of developing and implementing MoE techniques for deep learning through various groundbreaking research papers like Sparsely-Gated MoE, GShard-Transformer, Switch-Transformer, M4, and more.

Gemini 1.5 leverages these advancements in model architecture to learn complex tasks faster while maintaining high-quality results. It is also more efficient during both training and serving. These efficiencies enable Google’s teams to iterate quickly, train advanced versions of Gemini rapidly, and continue working towards further optimizations.
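
To ground the MoE idea, here is a toy top-k routing layer in PyTorch: a small gating network scores the experts for each token, and only the selected experts are evaluated. This is purely illustrative; Gemini 1.5’s actual MoE design has not been published beyond the description above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim, num_experts)    # the small routing network
        self.top_k = top_k

    def forward(self, x):                          # x: (tokens, dim)
        scores = self.gate(x)                      # (tokens, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # mixing weights for the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):             # only the selected experts run for each token
            for e, expert in enumerate(self.experts):
                hit = indices[:, slot] == e
                if hit.any():
                    out[hit] += weights[hit, slot].unsqueeze(-1) * expert(x[hit])
        return out

tokens = torch.randn(10, 64)
print(ToyMoE()(tokens).shape)   # torch.Size([10, 64])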

Impressive Context lengths

The impressive context length of Gemini 1.5 cannot be overstated, especially when it comes to navigating complex, dense codebases. The ability to process up to 1 million tokens, which is over 30K lines of code, dramatically expands the horizon of possibilities for software development and maintenance. Programmers and engineers can now leverage the AI to understand and work with larger sections of code in a single pass, allowing for more comprehensive analysis and quicker troubleshooting. Equally if not more impressive is its near-perfect retrieval accuracy, which ensures that the most relevant and useful information is available when needed, minimizing both the risk of overlooking crucial details buried in massive code repositories and the risk of hallucinations.

This technological leap places significant competitive pressure on Retrieval-Augmented Generation (RAG) models, which may struggle to keep up with the vast context window and precision of Gemini 1.5. Google’s tech report suggests that performance remains robust even when scaling up to staggering sizes like 10 million tokens. As developers embrace this expansion in context size, they’re unlocking opportunities for AI-assisted programming that were once considered science fiction. However, the cost of managing such a voluminous stream of tokens remains a topic for discussion. The financial and computational resources required to sustain these capabilities are substantial, and whether they justify the benefits is yet to be seen. Additionally, the future of Gemini hints at an evolution toward multi-modal learning, with plans to ingest various media types such as files and videos, further enriching the context and utility of the AI — a step beyond its current limitation to image inputs.