Note: Instead of two queries, we could have taken a single long query and run named-entity extraction over it to determine the most likely keywords to run the initial search with. For this post, we won't be using this approach.

Figure 2 below depicts the architecture design of our application and the technical stack used for each of the components.

import torch
from transformers import CLIPModel, CLIPProcessor


class SimilarityUtil:
    def __init__(self):
        # CLIP_MODEL and CLIP_PREPROCESSOR hold the Hugging Face
        # checkpoint identifiers defined earlier in the script.
        self.model = CLIPModel.from_pretrained(CLIP_MODEL)
        self.processor = CLIPProcessor.from_pretrained(CLIP_PREPROCESSOR)
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

    def perform_sim_search(self, images, query_phrase, top_k=3):
        """
        Performs similarity search between the images and query.

        :param images: A list of PIL images initially retrieved with respect to
            some entity. These images should be semantically similar to this query.
        :param query_phrase: A list containing a single text query.
        :param top_k: Number of top images to return from `images`.
        :return: Top-k indices matching the query semantically and their
            similarity scores.
        """
        model = self.model.to(self.device)

        # Obtain the text-image similarity scores.
        with torch.no_grad():
            inputs = self.processor(
                text=query_phrase, images=images, return_tensors="pt", padding=True
            )
            inputs = inputs.to(self.device)
            outputs = model(**inputs)

        # Image-text similarity scores.
        logits_per_image = outputs.logits_per_image.cpu()
        (top_indices, top_scores) = self.sort_scores(logits_per_image, top_k)
        return (top_indices, top_scores)

    def sort_scores(self, scores, top_k):
        """
        Sorts the scores in a descending manner.

        :param scores: Image-text similarity scores.
        :param top_k: Number of top scores to return.
        """
        values, indices = scores.squeeze().topk(top_k)
        top_indices, top_scores = [], []
        for score, index in zip(values, indices):
            top_indices.append(int(index.numpy()))
            score = score.numpy().tolist()
            top_scores.append(round(score, 3))
        return (top_indices, top_scores)

CLIP_MODEL uses a ViT-base model to encode the images for generating meaningful embeddings with respect to the provided query. The text-based query is also encoded using a Transformer-based model for generating the embeddings. These two embeddings are matched with one another during inference. To know more about the particular methods we are using for the CLIP model, please refer to this documentation from Hugging Face.

In the code above, we first invoke the CLIP model with the images and the natural language query. This gives us a vector (logits_per_image) that contains the similarity scores between each of the images and the query. We then sort the vector in a descending manner. Note that we initialize the CLIP model while instantiating SimilarityUtil to save on model-loading time.

This is the meat of our application, and we have tackled it already. If you want to interact with this utility in a live manner, you can check out this Colab Notebook.
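For a quick local check, here is a minimal, hypothetical usage sketch of SimilarityUtil. It assumes CLIP_MODEL and CLIP_PREPROCESSOR both point to a checkpoint such as "openai/clip-vit-base-patch32"; the image paths and query are illustrative, not from the original post.

from PIL import Image

# Assumed checkpoint identifiers; the original script defines these constants.
CLIP_MODEL = "openai/clip-vit-base-patch32"
CLIP_PREPROCESSOR = "openai/clip-vit-base-patch32"

# Load a few locally available images (paths are illustrative).
images = [Image.open(path) for path in ["dog_1.jpg", "dog_2.jpg", "dog_3.jpg"]]

similarity_util = SimilarityUtil()
top_indices, top_scores = similarity_util.perform_sim_search(
    images, ["a dog playing in the snow"], top_k=2
)
print(top_indices)  # indices into `images`, best match first
print(top_scores)   # similarity scores, rounded to three decimal places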
Now, we need to collate our utilities for fetching images from Pixabay and for performing the natural language image search inside a single script, perform_search.py. Following is the main class of that script:

class Searcher:
    def __init__(self):
        self.similarity_model = SimilarityUtil()

    def get_similar_images(self, keyword, semantic_query, pixabay_max, top_k):
        """
        Finds semantically similar images.

        :param keyword: Keyword to search with on Pixabay.
        :param semantic_query: Query to find semantically similar images
            retrieved from Pixabay.
        :param pixabay_max: Number of maximum images to retrieve from Pixabay.
        :param top_k: Number of top images to return.
        :return: Tuple of top_k URLs and the similarity scores of the images
            present inside the URLs.
        """
        images_redis_key = keyword + "_images"
        urls_redis_key = keyword + "_urls"

        # Serve the images and URLs from the cache if this keyword has been
        # searched before; otherwise fetch from Pixabay and populate the cache.
        if redis_client.exists(images_redis_key) and redis_client.exists(
            urls_redis_key
        ):
            keyword_images = redis_client.get(images_redis_key)
            keyword_image_urls = redis_client.get(urls_redis_key)
        else:
            (keyword_images, keyword_image_urls) = fetch_images_tag(
                keyword, pixabay_max
            )
            redis_client.set(images_redis_key, keyword_images)
            redis_client.set(urls_redis_key, keyword_image_urls)

        (top_indices, top_scores) = self.similarity_model.perform_sim_search(
            keyword_images, semantic_query, top_k
        )
        top_urls = [keyword_image_urls[index] for index in top_indices]
        return (top_urls, top_scores)

Here, we are just calling the utilities we had previously developed to return the URLs of the most similar images and their scores. What is even more important here is the caching capability.
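One detail worth calling out: Redis stores values as bytes, so Python objects like lists of PIL images or URLs cannot be cached directly with set() and get(). Below is a minimal sketch of pickle-based helpers that could sit behind the caching calls above; redis_set and redis_get are hypothetical names, not part of the original utilities.

import pickle

import redis

# Assumption: a local Redis server on the default port.
redis_client = redis.Redis(host="localhost", port=6379)


def redis_set(key, value):
    # Serialize an arbitrary Python object (e.g., a list of PIL images
    # or a list of URLs) into bytes before caching it.
    redis_client.set(key, pickle.dumps(value))


def redis_get(key):
    # Deserialize the cached bytes back into the original Python object.
    return pickle.loads(redis_client.get(key))

With helpers like these, the cache-miss branch would call redis_set() and the cache-hit branch redis_get() in place of the raw client calls.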
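To tie it together, here is a hypothetical end-to-end call to Searcher; the keyword, query, and counts are illustrative, and the Pixabay and Redis utilities are assumed to be configured as earlier in the post.

searcher = Searcher()
top_urls, top_scores = searcher.get_similar_images(
    keyword="dog",
    semantic_query=["a dog playing in the snow"],
    pixabay_max=50,
    top_k=3,
)
for url, score in zip(top_urls, top_scores):
    print(score, url)

The first call for a given keyword fetches from Pixabay and populates the cache; repeated calls with the same keyword skip the network round trip entirely.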