November 14, 2025
On the morning of the 22nd, the IEEE International Conference on Computer Vision (ICCV) announced the winner of this year's Best Paper Award.
The Best Paper Award went to the team led by the young scholar Zhu Junyan of Carnegie Mellon University, whose paper "Generating Physically Stable and Buildable Brick Structures from Text" stood out among this year's submissions.
ICCV ranks among the top three international conferences in computer vision and is held every two years. This year's conference received 11,239 valid submissions, of which the program committee recommended 2,699 papers for acceptance, for a final acceptance rate of 24%. This represents a significant increase in submissions over the previous edition.
Zhu Junyan completed his undergraduate studies at Tsinghua University. Currently, he serves as an Assistant Professor at the School of Computer Science at Carnegie Mellon University and is a former research scientist at Adobe. His primary research interests encompass computer vision, graphics, computational photography, and generative models.
The award-winning paper introduces BrickGPT, the first method to generate physically stable and interconnected brick assembly models from text prompts alone. (Image source: https://arxiv.org/pdf/2505.05469)
Creating real-world objects with existing generative techniques remains a formidable challenge. Zhu Junyan's team set out to tackle the problem of generating physically realizable objects: developing a method that produces brick assembly structures directly from free-form text prompts while ensuring both physical stability and buildability.
The team built StableText2Brick, a new large-scale dataset containing over 47,000 distinct brick assembly structures.
To handle both brick sequences and text, the researchers fine-tuned a pre-trained large language model (LLM) for the brick structure generation task. To further improve the stability and buildability of the designs, they incorporated block-by-block rejection sampling and a physics-aware rollback mechanism during inference.
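To make this concrete, the sketch below shows one plausible way a brick structure could be serialized into text for LLM fine-tuning. The exact prompt/response format used by BrickGPT is not specified here; the field names and the "HxW (x,y,z)" layout are illustrative assumptions only.

```python
# Minimal sketch: serializing a brick structure as text for LLM fine-tuning.
# The format below (one brick per line, footprint then grid position) is an
# assumption for illustration, not the paper's actual tokenization.

def brick_to_text(brick):
    """Serialize one brick as 'HxW (x,y,z)': footprint followed by grid position."""
    h, w = brick["size"]          # footprint of the brick in grid units
    x, y, z = brick["position"]   # placement in the build volume (z = layer)
    return f"{h}x{w} ({x},{y},{z})"

def make_training_example(caption, bricks):
    """Pair a text caption with the brick sequence, ordered bottom-to-top
    so the model learns a buildable assembly order."""
    ordered = sorted(bricks, key=lambda b: b["position"][2])
    response = "\n".join(brick_to_text(b) for b in ordered)
    return {"prompt": caption, "response": response}

example = make_training_example(
    "a simple two-layer wall",
    [{"size": (2, 4), "position": (0, 0, 0)},
     {"size": (2, 4), "position": (0, 0, 1)}],
)
print(example["response"])
# 2x4 (0,0,0)
# 2x4 (0,0,1)
```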
The base model is already able to generate brick structures through in-context learning, which highlights the potential of pre-trained LLMs for this task.
To ensure physical stability, a stability analysis could be applied at every generation step, resampling any brick that would cause the structure to collapse. However, running this check at every step is inefficient, so the team adopts block-by-block rejection sampling in conjunction with physics-aware rollback to strike a balance between stability and diversity.
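The sketch below illustrates how such an inference loop might combine the two mechanisms: cheap validity checks drive per-brick rejection sampling, while a heavier stability check triggers rollback to the last stable prefix. The sampler, checks, and retry budgets are placeholders, not the paper's actual implementation or settings.

```python
# Schematic sketch: rejection sampling plus physics-aware rollback during
# autoregressive generation. `sample_next_brick`, `is_valid`, and `is_stable`
# are placeholder callables for the model's sampler, the cheap per-brick
# validity checks, and the stability analysis; budgets are illustrative.

def generate_structure(sample_next_brick, is_valid, is_stable,
                       max_bricks=200, max_resamples=20, max_rollbacks=5):
    structure = []          # bricks accepted so far
    rollbacks = 0
    while len(structure) < max_bricks:
        # Rejection sampling: resample until the candidate passes the cheap
        # validity checks (format, workspace bounds, collisions).
        candidate = None
        for _ in range(max_resamples):
            brick = sample_next_brick(structure)
            if brick is None:          # model signalled end of sequence
                return structure
            if is_valid(brick, structure):
                candidate = brick
                break
        if candidate is None:
            break                      # no valid brick found; stop generating

        structure.append(candidate)

        # Physics-aware rollback: if the partial structure is no longer
        # stable, revert to the last stable prefix and continue.
        if not is_stable(structure):
            while structure and not is_stable(structure):
                structure.pop()
            rollbacks += 1
            if rollbacks > max_rollbacks:
                break
    return structure
```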
To address these physical constraints, the model integrates physical verification directly into autoregressive inference. First, when the model generates a brick and its position, the brick must be properly formatted and must not lie outside the designated build workspace. Second, the newly added brick must not collide with the existing structure.
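A minimal sketch of these two per-brick validity checks follows, using an occupancy-grid representation. The 20-unit workspace size and the dictionary layout of a brick are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of the per-brick validity checks described above:
# (1) the brick is well-formed and stays inside the build workspace, and
# (2) it does not overlap any cell already occupied by the structure.

WORKSPACE = 20  # assumed edge length of the cubic build volume, in grid units

def cells(brick):
    """All grid cells covered by a brick with footprint (h, w) at (x, y, z)."""
    h, w = brick["size"]
    x, y, z = brick["position"]
    return {(x + i, y + j, z) for i in range(h) for j in range(w)}

def in_workspace(brick):
    """Check that every cell of the brick lies inside the build volume."""
    return all(0 <= cx < WORKSPACE and 0 <= cy < WORKSPACE and 0 <= cz < WORKSPACE
               for cx, cy, cz in cells(brick))

def collides(brick, occupied):
    """True if the brick overlaps any already-occupied cell."""
    return not cells(brick).isdisjoint(occupied)

def is_valid(brick, structure):
    occupied = set().union(*(cells(b) for b in structure)) if structure else set()
    return in_workspace(brick) and not collides(brick, occupied)
```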
In evaluations, the team's method generated high-quality, diverse, and novel brick structures that closely matched the given text prompts. It outperformed all baseline methods and simplified variants in both effectiveness and stability, while maintaining a high degree of text similarity.
Due to limited computational resources, the team has not yet scaled to larger 3D datasets, and generation is currently restricted to a fixed grid and 21 object categories. Recent 3D generation methods, by contrast, can create a much wider variety of objects.
In addition, the method currently supports only a fixed set of commonly used toy bricks. In future work, the team plans to expand the brick library to a broader range of sizes and brick types, such as ramp bricks and tile bricks, to enable more diverse and complex designs.
Nevertheless, the experimental results clearly demonstrate that Zhu Junyan's team's method outperforms LLM backbone models and some recent text-to-3D generation methods, marking a significant breakthrough in LLM research.
References:
https://arxiv.org/pdf/2505.05469