Date of Award

8-2025

Document Type

Restricted Project: Campus only access

Degree Name

Master of Science in Computer Science

Department

School of Computer Science and Engineering

First Reader/Committee Chair

Chen, Qiuxiao

Abstract

Zero-Shot Semantic Segmentation (ZSSeg) addresses the challenge of segmenting images into semantic classes that were not seen during training. Traditional ZSSeg methods rely on fixed visual-language mappings and often generalize poorly to unseen categories. Recent advances like CLIP have provided a breakthrough by aligning image and text embeddings in a shared space. However, such models still struggle to incorporate domain-specific knowledge and often require extensive manual prompt engineering. This project aims to overcome these limitations by introducing domain-adaptive prompt learning into a CLIP-based ZSSeg pipeline.
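
To make the zero-shot mechanism concrete, the following is a minimal sketch of how CLIP scores an image against class-name text embeddings in the shared space. It uses the public CLIP package; the class names, prompt template, and image path are placeholders for illustration, not the project's actual configuration.

    # Minimal sketch of CLIP zero-shot scoring; class names, prompt template,
    # and image path are placeholders, not the project's actual setup.
    import torch
    import clip                        # OpenAI CLIP package
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/16", device=device)

    class_names = ["grass", "sky", "person", "car"]
    texts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feat = model.encode_text(texts)
        image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

    print(dict(zip(class_names, probs[0].tolist())))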

Our approach enhances CLIP-based segmentation with learnable domain-agnostic and domain-specific prompts. The domain-agnostic prompt captures general knowledge across visual domains, while the domain-specific prompt encodes contextual cues from specific training datasets. These prompts are trained using a contrastive learning objective in a classification task before being integrated into a segmentation model. The proposed method significantly reduces manual engineering and enables better adaptation across domain shifts, such as synthetic-to-real transitions or changes in lighting, texture, or style.
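
As a rough sketch of what such a prompt module might look like, the code below prepends learnable domain-agnostic and domain-specific context vectors to frozen class-name token embeddings. The class name DomainAdaptivePrompts, the context lengths, and the tensor shapes are assumptions chosen for illustration, not the project's exact implementation.

    # Illustrative prompt module; names, context lengths, and shapes are assumed.
    import torch
    import torch.nn as nn

    class DomainAdaptivePrompts(nn.Module):
        def __init__(self, n_cls, n_dom, ctx_len=8, dom_ctx_len=4, embed_dim=512):
            super().__init__()
            # shared (domain-agnostic) context vectors, reused for every class and domain
            self.agnostic_ctx = nn.Parameter(0.02 * torch.randn(ctx_len, embed_dim))
            # one block of context vectors per domain (e.g. synthetic vs. real)
            self.specific_ctx = nn.Parameter(0.02 * torch.randn(n_dom, dom_ctx_len, embed_dim))
            self.n_cls = n_cls

        def forward(self, class_token_embeds, domain_idx):
            # class_token_embeds: (n_cls, embed_dim) frozen CLIP embeddings of class names
            agn = self.agnostic_ctx.unsqueeze(0).expand(self.n_cls, -1, -1)
            dom = self.specific_ctx[domain_idx].unsqueeze(0).expand(self.n_cls, -1, -1)
            cls = class_token_embeds.unsqueeze(1)           # (n_cls, 1, embed_dim)
            # per-class prompt sequence: [agnostic ctx | domain ctx | class token]
            return torch.cat([agn, dom, cls], dim=1)

In the classification stage, these per-class prompt sequences would be passed through the frozen CLIP text encoder and matched to image features with a cross-entropy loss over cosine similarities, which is the contrastive objective referred to above.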

The project is implemented by extending the DAPrompt framework, originally designed for classification, to a semantic segmentation setting with MaskFormer as the base architecture. A dual-stage training process is followed: (1) prompt learning via classification on COCO-Stuff 164k, and (2) integration of the learned prompts into a MaskFormer segmentation model. Experimental results show that the learnable prompts improve performance over baseline models. Notably, domain-specific prompts lead to more accurate predictions for unseen classes and better segmentation confidence across domain boundaries.
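
A hypothetical outline of the two stages is sketched below. The function names, the MaskFormer call signature, and the way class embeddings are consumed are placeholders meant only to convey the structure of the pipeline, not the project's actual code.

    # Hypothetical two-stage outline; function names and signatures are placeholders.
    import torch
    import torch.nn.functional as F

    def stage1_prompt_learning(prompt_module, text_encoder, image_feats, labels,
                               class_token_embeds, domain_idx, optimizer):
        # Stage 1: learn the prompts with a classification (contrastive) loss.
        prompts = prompt_module(class_token_embeds, domain_idx)      # (n_cls, L, d)
        text_feats = F.normalize(text_encoder(prompts), dim=-1)      # frozen text encoder
        image_feats = F.normalize(image_feats, dim=-1)
        logits = 100.0 * image_feats @ text_feats.T                  # cosine-similarity logits
        loss = F.cross_entropy(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    def stage2_mask_classification(maskformer, images, text_feats):
        # Stage 2: score MaskFormer's per-query mask embeddings against the
        # frozen, prompt-derived class embeddings (placeholder interface).
        mask_embeds, mask_logits = maskformer(images)
        class_logits = mask_embeds @ text_feats.T
        return class_logits, mask_logits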

Extensive evaluations on COCO-Stuff demonstrate that domain-adaptive prompts generalize better than standard CLIP prompts. Ablation studies examine the effects of prompt length, threshold settings, and prompt embedding initialization strategies. The best configuration, which uses both domain-agnostic and domain-specific prompts, achieves competitive mean Intersection-over-Union (mIoU) scores, validating the effectiveness of the proposed approach.
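
For reference, mIoU is the per-class intersection-over-union averaged over classes. A standard way to compute it from predicted and ground-truth label maps is sketched below; this is generic evaluation code, not the project's own scripts.

    # Generic mIoU computation from predicted and ground-truth label maps.
    import numpy as np

    def mean_iou(pred, gt, n_cls, ignore_index=255):
        valid = gt != ignore_index
        conf = np.bincount(n_cls * gt[valid].astype(int) + pred[valid].astype(int),
                           minlength=n_cls * n_cls).reshape(n_cls, n_cls)
        inter = np.diag(conf)
        union = conf.sum(0) + conf.sum(1) - inter
        iou = np.where(union > 0, inter / np.maximum(union, 1), np.nan)
        return np.nanmean(iou), iou                      # mean IoU and per-class IoU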

This project contributes a new perspective on integrating domain knowledge into vision-language models through learned prompt structures. The pipeline remains lightweight, modular, and easily extendable to other segmentation architectures and datasets. Moreover, by reducing the dependence on labeled data and manual prompt tuning, the method paves the way for more robust deployment of semantic segmentation in real-world applications such as autonomous driving, medical imaging, and remote sensing.

In conclusion, this work demonstrates that domain-adaptive prompt learning is a viable and impactful strategy for enhancing zero-shot segmentation. It effectively bridges the gap between general-purpose vision-language models and the nuanced requirements of domain-specific tasks. Future work will explore fine-tuning the currently frozen image encoder, introducing content-based prompt selection strategies, and extending the framework to multimodal segmentation and temporal video data.
