VLM
CLIP is a multimodal machine learning model proposed by OpenAI. Trained with contrastive learning on large-scale image-text pairs, it can process both images and text and map them into a shared vector space. This example demonstrates using CLIP on the RDK platform for image library management and text-based image search.
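The core idea can be illustrated with a short host-side sketch. The snippet below uses OpenAI's reference `clip` Python package rather than the on-device hobot_clip pipeline; the model name, image path, and text prompts are illustrative assumptions.

```python
# Minimal sketch of the CLIP idea: images and texts are encoded into the
# same vector space, so their similarity can be compared directly.
# Assumes the reference "clip" package (pip install git+https://github.com/openai/CLIP.git);
# "example.jpg" and the prompts are placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["a photo of a cat", "a photo of a dog"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)   # image embedding, shape (1, 512)
    text_features = model.encode_text(texts)     # text embeddings, shape (2, 512)

    # Normalize, then cosine similarity between the image and each text.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = (image_features @ text_features.T).squeeze(0)

print(similarity)  # higher score = closer image-text match
```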
Code repository: ( https://github.com/D-Robotics/hobot_clip.git )
Application scenarios: use CLIP to extract image features for image library management, text-to-image retrieval, and image-to-image retrieval (see the retrieval sketch below).
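As a rough illustration of the text-to-image retrieval scenario, the sketch below builds a small embedding index over a gallery directory and ranks images against a text query by cosine similarity. The `gallery/` directory, the query string, and the use of the reference `clip` package are assumptions for demonstration, not part of hobot_clip.

```python
# Sketch of text-to-image search over a small image library.
import glob
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# 1. Build the gallery index: encode every image once and keep the embeddings.
paths = sorted(glob.glob("gallery/*.jpg"))  # placeholder image library
with torch.no_grad():
    gallery = torch.cat([
        model.encode_image(preprocess(Image.open(p)).unsqueeze(0).to(device))
        for p in paths
    ])
    gallery /= gallery.norm(dim=-1, keepdim=True)

# 2. Search: encode the text query and rank images by cosine similarity.
with torch.no_grad():
    query = model.encode_text(clip.tokenize(["a red car on the street"]).to(device))
    query /= query.norm(dim=-1, keepdim=True)

scores = (query @ gallery.T).squeeze(0)
topk = scores.topk(min(5, len(paths)))
for score, idx in zip(topk.values.tolist(), topk.indices.tolist()):
    print(f"{paths[idx]}: {score:.3f}")
```

In practice the gallery embeddings would be computed once and stored (e.g. in a database or on disk), so that each text query only requires one text encoding plus a similarity lookup.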
More information
Please refer to the official D-Robotics documentation: https://developer.d-robotics.cc/rdk_doc/Robot_development/boxs/function/hobot_clip