🥑➡️📝 FoodExtract-Vision with a fine-tuned SmolVLM2-500M

Overview

Extract food and drink items from images in a structured way. The base model's outputs fail to follow the desired structure, whereas the fine-tuned model adheres to it reliably. That said, the fine-tuned model could still be improved with respect to extracting the right food/drink items. Both models use the input prompt:

Classify the given input image into food or not and if edible food or drink items are present, extract those to a list. If no food/drink items are visible, return empty lists.
Only return valid JSON in the following form:
```json
{
  'is_food': 0, # int - 0 or 1 based on whether food/drinks are present (0 = no foods visible, 1 = foods visible)
  'image_title': '', # str - short food-related title for what foods/drinks are visible in the image, leave blank if no foods present
  'food_items': [], # list[str] - list of visible edible food item nouns
  'drink_items': [] # list[str] - list of visible edible drink item nouns
}
```
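For a concrete picture of how either model is prompted, here's a minimal inference sketch, assuming a recent `transformers` release with SmolVLM2 support; the checkpoint id and image URL below are placeholders, and the parsing step reflects the single-quoted, dict-style output format the prompt above asks for:

```python
# A sketch only: assumes a recent transformers release with SmolVLM2 support.
# MODEL_ID and the image URL are placeholders, not published artifacts.
import ast

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "your-username/FoodExtract-Vision-SmolVLM2-500M"  # hypothetical repo id
PROMPT = "Classify the given input image into food or not..."  # full prompt from above

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/pizza.jpg"},  # placeholder image
        {"type": "text", "text": PROMPT},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated = model.generate(**inputs, do_sample=False, max_new_tokens=256)
# Decode only the newly generated tokens, not the echoed prompt.
text = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]

# The prompt's template uses single quotes, so outputs tend to be
# Python-dict-like rather than strict JSON; ast.literal_eval copes with both.
payload = text
if "```" in payload:
    payload = payload.split("```json")[-1].split("```")[0]
result = ast.literal_eval(payload.strip())
print(result["food_items"], result["drink_items"])
```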

Both models see exactly the same prompt and inference path; the only difference is that one has been fine-tuned on the structured data while the other hasn't. Notable next steps would be:

  • Remove the input prompt: train the model to go straight from image -> text (no text prompt on input); this would save on inference tokens.
  • Fine-tune on more real-world data: the model is currently trained on only 1k food images (from Food101) and 500 non-food images (random internet images); training on real-world data would likely improve performance significantly.
  • Fix the repetitive generation: the model can sometimes get stuck in a repetitive generation pattern, e.g. "onions", "onions", "onions", etc. Decoding-time constraints could help reduce this (see the sketch after this list).
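On the repetition point, a cheap first step before any retraining is to constrain decoding. The sketch below uses standard `transformers` `generate()` arguments (applied to the same `model`/`inputs` as in the sketch above), not anything specific to this model:

```python
# Decoding-time guards against loops like "onions", "onions", "onions", ...
generated = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=256,
    repetition_penalty=1.2,   # down-weight tokens the model has already emitted
    no_repeat_ngram_size=3,   # hard-block any repeated 3-gram in the output
)
```

Note that `no_repeat_ngram_size` is a blunt instrument: it would also block a list that legitimately repeats an item, so the durable fix is still better training data or a repetition-aware fine-tuning objective.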
Examples