vision (module)
The purpose of this document is to define the requirements and design of the vision system in the OpenAdapt process automation library.
Motivation:
We want to incorporate vision into OpenAdapt so that we can ask a model questions about recorded screenshots; currently we rely only on WindowEvents.
Literature review:
Many vision models have appeared recently; an evaluation of some of them can be found here.
Ultimately the goal is to create a ReplayStrategy with vision capabilities so that we can feed it the screenshots in a recording, rather than relying on WindowEvents
As all of the vision models require a large amount of GPU memory, we started out by using the Modal library to run them in a cloud container. All of the models can be run in Modal following a similar structure (see the sketch after this list):
Create an image: install all requirements, download the model weights (so they are saved in the container image and don't have to be re-downloaded), and do any other preliminary work needed to set up the model (usually just following the instructions in the GitHub repo). Non-Hugging Face models must be git cloned into the container.
Create a Modal class for the model:
Assign it a GPU and a timeout, e.g. @stub.cls(gpu="A100", timeout=18000).
Use the __enter__ method to set the processor, model, device, and any other attributes the class needs.
Create a method with the Modal @method decorator to generate completions given an image and a prompt. For Hugging Face models this is straightforward: build the model inputs from the image and question, generate the completion with the model's generate method, then decode the output. For non-Hugging Face models this is more involved, since every repo structures its model differently; these repos usually ship a Gradio demo file, so you can follow the same approach without using Gradio.
@stub.local_entrypoint(): open the image(s) using PIL and invoke the generate method defined above via generate.call so it runs in the cloud. The model is asked every question in OpenAdapt/openadapt/vision/questions.txt for each image in OpenAdapt/openadapt/vision/images.
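For illustration, here is a minimal sketch of that structure. It assumes the 2023-era Modal API (modal.Stub, __enter__ for setup, .call for remote invocation) and uses a Hugging Face BLIP-2 checkpoint as a stand-in model; the names VisionModel and download_weights are illustrative, not part of the repo.

```python
import modal

stub = modal.Stub("openadapt-vision")


def download_weights():
    # Pre-download the checkpoint so it is baked into the container image
    # and does not need to be re-downloaded on every cold start.
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
    Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")


image = (
    modal.Image.debian_slim()
    .pip_install("torch", "transformers", "accelerate", "Pillow")
    .run_function(download_weights)
)


@stub.cls(gpu="A100", timeout=18000, image=image)
class VisionModel:
    def __enter__(self):
        # Load the processor and model onto the GPU once per container.
        import torch
        from transformers import Blip2ForConditionalGeneration, Blip2Processor

        self.device = "cuda"
        self.processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
        self.model = Blip2ForConditionalGeneration.from_pretrained(
            "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
        ).to(self.device)

    @modal.method()
    def generate(self, image, prompt: str) -> str:
        # Build inputs from the image and question, generate, then decode.
        import torch

        inputs = self.processor(images=image, text=prompt, return_tensors="pt").to(
            self.device, torch.float16
        )
        output = self.model.generate(**inputs, max_new_tokens=64)
        return self.processor.decode(output[0], skip_special_tokens=True).strip()


@stub.local_entrypoint()
def main():
    from pathlib import Path

    from PIL import Image

    questions = Path("openadapt/vision/questions.txt").read_text().splitlines()
    model = VisionModel()
    for image_path in sorted(Path("openadapt/vision/images").iterdir()):
        screenshot = Image.open(image_path)
        for question in questions:
            answer = model.generate.call(screenshot, question)
            print(f"{image_path.name} | {question} -> {answer}")
```

Running this file with `modal run` executes the local entrypoint on your machine while the class methods run in the cloud container.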
Currently, we are manually testing each model and entering the completions into this spreadsheet.
The first time a model is run it takes a long time, since the weights need to be downloaded. After that, they are saved in the container image and subsequent runs are much faster.
Creating and tagging the dataset is currently a manual process.
To create a dataset and add to it, make a recording, then visualize it and find the window event timestamp and screenshot timestamp you would like to add to the dataset. Then, run `python openadapt/scripts/tag_dataset.py <window_event_timestamp> <screenshot_timestamp> [<dataset_id>]`.
If you want to create a new dataset and insert the first entry, leave dataset_id blank.
If you want to add to an existing dataset, pass the dataset_id.
Currently, the dataset created here uses dataset_id=1.
Once you are done and would like to save the dataset locally for finetuning, run `python openadapt/scripts/generate_dataset.py <dataset_id>` and the dataset will be saved at openadapt/ml/data/vision_dataset as a JSON file containing the window states and a directory of images. The images are named after their id, and the window states use this id to preserve the relationship.
Note: Currently all the dataset images were taken on a Windows computer with 1920 x 1080 resolution.
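To make the id-based pairing concrete, here is a hypothetical sketch of loading the generated dataset for finetuning. The file names (window_states.json, <id>.png) and JSON layout are assumptions for illustration; check generate_dataset.py for the actual output format.

```python
import json
from pathlib import Path

from PIL import Image

# Illustrative path; the actual output lives under openadapt/ml/data/vision_dataset.
DATASET_DIR = Path("openadapt/ml/data/vision_dataset")


def load_vision_dataset(dataset_dir: Path = DATASET_DIR):
    """Yield (screenshot, window_state) pairs matched on the shared id."""
    window_states = json.loads((dataset_dir / "window_states.json").read_text())
    for entry in window_states:
        image_path = dataset_dir / "images" / f"{entry['id']}.png"
        yield Image.open(image_path), entry


if __name__ == "__main__":
    for screenshot, window_state in load_vision_dataset():
        print(screenshot.size, window_state["id"])
```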
Out-of-the-box performance of selected multimodal LLMs from this evaluation:
https://docs.google.com/spreadsheets/d/1etgr3LnM_NrAMwMmFLUttmXmBrJ4Uia9-notBzyXk6c/edit?usp=sharing
Currently the next step is to finetune the best vision model using the vision dataset here. Based on the testing done so far, we suspect the model chosen for finetuning will be MiniGPT-4, Otter, or InstructBLIP.
For the dataset, we would also like a way to generate data similar to what has been recorded, but with different parameters, e.g. Windows theme settings, window sizes/positions, etc. We would like to create a script (vision/augment.py) to automate the process of producing varying screenshots and window states. This could be done either by modifying the screenshots directly, or by replaying the recordings in a way that recreates the same screenshots with different window themes, sizes, and positions; a rough sketch of the first approach follows.
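The sketch below modifies a screenshot directly and updates the paired window state to match. The window-state keys ("left", "top", "width", "height") and the specific transforms are assumptions for illustration, not the actual augment.py design.

```python
import copy
import random

from PIL import Image, ImageOps


def augment(screenshot: Image.Image, window_state: dict) -> tuple[Image.Image, dict]:
    """Return an augmented (screenshot, window_state) pair.

    Assumes the window state stores its geometry under "left", "top",
    "width", and "height"; adjust to the real schema as needed.
    """
    augmented = screenshot.copy().convert("RGB")
    state = copy.deepcopy(window_state)

    left, top = state["left"], state["top"]
    width, height = state["width"], state["height"]

    # Cut out the window region and rescale it to simulate a different window size.
    window = augmented.crop((left, top, left + width, top + height))
    scale = random.uniform(0.8, 1.2)
    new_size = (int(width * scale), int(height * scale))
    window = window.resize(new_size)

    # Blank the original region with a flat colour before pasting the moved
    # window, so the old window does not remain visible underneath.
    augmented.paste(Image.new("RGB", (width, height), (240, 240, 240)), (left, top))
    new_left = max(0, left + random.randint(-50, 50))
    new_top = max(0, top + random.randint(-50, 50))
    augmented.paste(window, (new_left, new_top))

    # Crudely simulate a theme change by inverting the colours half the time.
    if random.random() < 0.5:
        augmented = ImageOps.invert(augmented)

    state.update(left=new_left, top=new_top, width=new_size[0], height=new_size[1])
    return augmented, state
```

Direct image manipulation like this is cheap to iterate on; replaying recordings with different OS theme and window settings would be slower but would produce more faithful screenshots.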