Google claims it is the largest visual-language model (VLM) ever developed and that it can perform a variety of tasks without the need for retraining.
According to Google, when given a high-level command, such as "bring me the rice chips from the drawer," PaLM-E can generate a plan of action for a mobile robot platform with an arm (developed by Google Robotics) and execute the actions by itself.
PaLM-E does this by analyzing data from the robot's camera without needing a pre-processed scene representation. This eliminates the need for a human to pre-process or annotate the data and allows for more autonomous robotic control. It's also resilient and can react to its environment.
For example, the PaLM-E model can guide a robot to get a chip bag from a kitchen -- and with PaLM-E integrated into the control loop, it becomes resistant to interruptions that might occur during the task. In a video example, a researcher grabs the chips from the robot and moves them, but the robot locates the chips and grabs them again. In another example, the same PaLM-E model autonomously controls a robot through tasks with complex sequences that previously required human guidance.
PaLM-E is a next-token predictor, and it's called "PaLM-E" because it's based on Google's existing large language model (LLM) called "PaLM" (which is similar to the technology behind ChatGPT). Google has made PaLM "embodied" by adding sensory information and robotic control. Since it's based on a language model, PaLM-E takes continuous observations, like images or sensor data, and encodes them into a sequence of vectors with the same dimensionality as the language model's token embeddings.
This allows the model to "understand" the sensory information in the same way it processes language. In addition to the RT-1 robotics transformer, PaLM-E draws from Google's previous work on ViT-22B, a vision transformer model revealed in February. ViT-22B has been trained on various visual tasks, such as image classification, object detection, semantic segmentation, and image captioning.
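To make the encoding idea concrete, here is a minimal NumPy sketch of how continuous visual features can be projected into a language model's embedding space and interleaved with text tokens. The dimensions, the `projection` matrix, and the `encode_observation` function are all made up for illustration; they do not reflect PaLM-E's actual architecture or sizes.

```python
import numpy as np

# Hypothetical dimensions for illustration only (not PaLM-E's actual sizes).
VIT_FEATURE_DIM = 1024   # dimensionality of the vision encoder's patch features
LLM_EMBED_DIM = 2048     # dimensionality of the language model's token embeddings
NUM_PATCHES = 16         # number of image patches the vision encoder emits

rng = np.random.default_rng(0)

# In a real system this projection would be learned during training; here it is
# random. It maps each visual feature vector into the same space as the
# language model's token embeddings.
projection = rng.normal(size=(VIT_FEATURE_DIM, LLM_EMBED_DIM)) * 0.02

def encode_observation(patch_features):
    """Project continuous visual features into 'soft tokens' the LLM can consume."""
    return patch_features @ projection  # shape: (num_patches, LLM_EMBED_DIM)

# Stand-ins for a vision transformer's output and two embedded text tokens.
image_features = rng.normal(size=(NUM_PATCHES, VIT_FEATURE_DIM))
text_embeddings = rng.normal(size=(2, LLM_EMBED_DIM))

soft_tokens = encode_observation(image_features)

# Interleave the visual "tokens" with language tokens into one input sequence,
# which the language model then processes exactly as it would ordinary text.
sequence = np.concatenate([text_embeddings, soft_tokens], axis=0)
print(sequence.shape)  # (18, 2048)
```

Because the projected visual vectors share the embedding space of text tokens, the language model needs no architectural changes to consume them; the sensory input simply becomes more positions in its input sequence.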