Caption-IT : Image to Sequence Deep Learning Model : Development to Deployment!

Bishal Bose
23 min read · Sep 19, 2020


Today we will look into another interesting domain of Deep Learning: the Seq-to-Seq model, specifically Image-to-Sequence.

Source : Google Images

Disclaimer: This is going to be a full-fledged, end-to-end tutorial of a project, so it will be quite lengthy. Now that that's out of the way, let's go ahead.

Ever wondered how Google finds you the perfect image for the keywords you type into Google Image Search?

Google stores relevant captions, in the form of keywords, for the images in its repository and retrieves the best match whenever you search for specific keywords.

Now, how does it generate captions for so many images across the whole Google repository? By humans? By machines? By software?

Well, as we all know, Google has some of the most cutting-edge AI applications of all time, and one of those applications is what we call an Image Caption Generator: it takes an image as input, tries to understand the context of the image, and generates the best possible output in the form of text that, of course, describes the image.

Today we will try to implement this ourselves, keeping it simple, elegant and easy to understand. Of course it will not be as capable as Google's model, but it won't be dumb either.

Before jumping into the essence of it, I would like you to look at the research paper "Show and Tell: A Neural Image Caption Generator". This is where we took our reference from, and probably what Google itself built on around the time it was released.

Let’s dive deep:

Now, before I start explaining, I would like to make a few clarifications. I will explain everything from scratch, and we will build a model together, train it, and even deploy it. Yes, you read that right: we will deploy this model. There are certain assumptions I have made to keep this blog at a readable length. Below are the assumptions:

  1. You know the basics of Python, as there is going to be a lot of code involved.
  2. You know the basics of Deep Learning models and how they function, as explaining these concepts is a vast area in itself.

Okay, now that we are done with the pre-requisites, let’s move on with the core concept.

For any ML or DL model the main ingredient is data, so let's first understand what data we have and where we got it from.

The data is publicly available here.

This is the Flickr 8k dataset, which consists of 8091 images with 5 captions per image, roughly 40K captions in total.

The images are in .jpg format with dimensions of roughly 500x600 pixels (varying per image). The captions are stored in an Excel file against the filenames of the images, so that we can map each image to its captions.

Now, let’s look at one such example.

Image

Captions:

  1. A dog is jumping over grass
  2. A dog is airborne
  3. Dog Playing in grass
  4. Dog Jumping
  5. Dog jumping on lawn during daytime photo

So, as we can see, there are 5 relevant captions for a single image, which will be used for training.

Data Analysis:

Let us do some data analysis. After successfully converting the data into a dataframe, it looks like this:

pandas dataframe

As you can see, the filename is repeated and we have 5 unique captions for it.

Now, let’s check some stats…

Let’s check the total number of unique images in the dataframe.

Check the number of files in the image directory, to make sure there is no mismatch in the data.

We found one odd filename which has a '.' before the number, which could have created issues in training. When I checked manually, no such file exists in the image directory, so we can remove that row from our dataframe.
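For reference, here is a minimal sketch of these checks. It assumes the captions have been loaded into a dataframe with columns filename and caption, and that the images live in a directory called Images; both names are placeholders and may differ from the original setup.

```python
import os
import pandas as pd

# Hypothetical paths; adjust to wherever the dataset was extracted.
captions_path = "captions.txt"
images_dir = "Images"

# Load the caption file into a dataframe with columns: filename, caption.
df = pd.read_csv(captions_path, names=["filename", "caption"], header=0)

# Unique images referenced by the captions vs. files actually on disk.
print("Unique filenames in dataframe:", df["filename"].nunique())
print("Files in image directory    :", len(os.listdir(images_dir)))

# Drop any caption rows whose image file does not actually exist on disk.
existing = set(os.listdir(images_dir))
df = df[df["filename"].isin(existing)].reset_index(drop=True)
```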

So much for the physical image files on our system; now let's move on to Data Pre-Processing.

Data Pre-Processing:

We will first process our text data from its raw format into a clean, standardized format. For that we will write a function and pass all our captions through it. Let's look at the function:
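The exact regular expressions used in the original notebook are not reproduced here, but under that caveat a minimal sketch of such a cleaning function could look like this:

```python
import re

def clean_caption(caption: str) -> str:
    """Lower-case a caption, strip non-letters, single letters and extra spaces."""
    caption = caption.lower()
    caption = re.sub(r"[^a-z ]", " ", caption)    # keep only letters and spaces
    caption = re.sub(r"\b[a-z]\b", " ", caption)  # drop stray single-letter tokens
    return re.sub(r"\s+", " ", caption).strip()

print(clean_caption("A dog is jumping over grass!"))  # -> "dog is jumping over grass"
```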

As you can see, I have made use of regular expressions to match patterns and replace some characters.

Now we pass our captions to this function: we iterate over the dataframe and pass each individual caption to it.

Let’s have a look at the captions after preprocessing.

Vectorizing Text Data:

Now, to convert words into tokens we can use scikit-learn's CountVectorizer, a powerful existing class that takes care of several things at once:

1. analyzer='word' makes sure that individual words are the unit of tokenization.

2. min_df=10 makes sure that only words appearing in at least 10 captions are kept in the vocabulary.
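As a sketch (the toy captions and relaxed min_df below are only there to make the snippet self-contained; the notebook fits on the full cleaned caption list with min_df=10):

```python
from sklearn.feature_extraction.text import CountVectorizer

# In the real notebook this would be the full list of cleaned captions.
captions = ["dog is jumping over grass", "dog is airborne"]

vectorizer = CountVectorizer(analyzer="word", min_df=1)  # min_df=10 on the full dataset
vectorizer.fit(captions)

vocab_words = vectorizer.get_feature_names_out()  # get_feature_names() on older scikit-learn
print(vocab_words)
```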

We get the feature names (words) from the vectorizer and store them in index-mapping dicts, which will be used later in this module: every word from the vectorizer gets a numeric index, stored as an Index_to_Word mapping and as the reverse Word_to_Index mapping.

We add 4 additional tokens to our dict: "a", "startSeq", "endSeq" and "<pad>". Their significance will be explained later in this blog.

We reverse the word-to-index map into an index-to-word map; thanks to the power of Python this is just one line of code. We need it for future reference.

The vocab size is set to the length of the dict + 1, because index 0 is not an actual word but the padding token; hence the vocab size is the total number of words plus one, which in our case is len(word_index_mapping) + 1.
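A minimal sketch of how these mappings could be built (the variable names word_to_index, index_to_word and vocab_size are assumptions; the original may use different names):

```python
# vocab_words comes from the fitted CountVectorizer above; a toy list is shown here.
vocab_words = ["airborne", "dog", "grass", "is", "jumping", "over"]
words = list(vocab_words) + ["a", "startSeq", "endSeq", "<pad>"]

# Index 0 is reserved for padding, so real words start at 1.
word_to_index = {word: idx for idx, word in enumerate(words, start=1)}
index_to_word = {idx: word for word, idx in word_to_index.items()}  # the one-line reversal

vocab_size = len(word_to_index) + 1  # +1 for the reserved padding index 0
print("Vocab size:", vocab_size)

# Quick integrity check: word -> index -> word should round-trip.
assert index_to_word[word_to_index["dog"]] == "dog"
```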

Now let's check the integrity of the dicts: take any random word, pass it as a key into word_index_mapping to get its index, then pass that index as a key into index_word_mapping. If the word you started with and the word you get back are the same, everything is consistent.

So, here we have successfully converted our words to indices and vice versa, which means we can move further.

Data Preparation for Training:

So, we have 8091 images in all. We will take 7091 images as training data, and the remaining 1000 will be used to evaluate the model.

We achieve this by taking the first 7091 images as training data and then taking a set difference to separate out the rest.

Now we split the dataframe into train and test.

Now we have to prepend a start token, "startSeq", and append an end token, "endSeq", to each caption, so that our model understands where a caption starts and where it ends. Basically they mark the start point and end point.
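A sketch of this split and tokenisation, continuing from the caption dataframe df used earlier (column names are assumptions):

```python
# df is the caption dataframe from earlier with columns: filename, caption.
unique_images = list(df["filename"].unique())

train_images = set(unique_images[:7091])          # first 7091 images for training
test_images = set(unique_images) - train_images   # set difference for the rest

train_df = df[df["filename"].isin(train_images)].copy()
test_df = df[df["filename"].isin(test_images)].copy()

# Wrap every caption with the start and end tokens.
train_df["caption"] = "startSeq " + train_df["caption"] + " endSeq"
test_df["caption"] = "startSeq " + test_df["caption"] + " endSeq"
```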

Let’s preview our captions now:

Okay, they look good…

Now we have to find the maximum caption length. We do this because we want all our input sequences to be the same length: we take the length of the longest caption and pad every other sequence up to that length. Let's have a look at how to achieve it…

This is how we get the max length from our list of captions; the max length here is 39, which means the longest caption is 39 words long.
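As a sketch, assuming the training captions come from the train_df built above:

```python
# Length of the longest caption, measured in words (start/end tokens included).
all_captions = train_df["caption"].tolist()
max_caption_length = max(len(caption.split()) for caption in all_captions)
print("Max caption length:", max_caption_length)  # 39 in the author's run
```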

Next, let's get word vectors for our vocabulary. We only embed the words that are present in our word_index_mapping: the GloVe file contains a total of 400,000 words, while we only have 1955, so we only need those 1955 words in vector form. And this is how we achieve it.

The reason the count is 1952 and not 1955 is that we added 3 extra tokens of our own (startSeq, endSeq and <pad>) which are not actual words and therefore have no GloVe vectors.

We assemble these vectors into a weight matrix so that it can be used for training in our model.
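A sketch of building this embedding matrix. The GloVe file name and the 200-dimensional embedding size are assumptions; the original may use a different GloVe variant.

```python
import numpy as np

embedding_dim = 200               # assumption: depends on which GloVe file is used
glove_path = "glove.6B.200d.txt"  # hypothetical path to the downloaded GloVe file

# Load only the vectors for words that are in our vocabulary.
embeddings = {}
with open(glove_path, encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        word = parts[0]
        if word in word_to_index:
            embeddings[word] = np.asarray(parts[1:], dtype="float32")

# Row i of the matrix holds the vector for the word with index i;
# words without a GloVe vector (startSeq, endSeq, <pad>) stay all-zero.
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, idx in word_to_index.items():
    if word in embeddings:
        embedding_matrix[idx] = embeddings[word]
```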

So, that’s all the processing we need for our text data, now let’s look at our Image data.

Working on Image Data:

First, check that we can load an image using the cv2 library.

Since cv2.imshow() is not allowed in Google Colab (it crashes the Jupyter session), we use Google's own cv2_imshow() instead.
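A quick sketch of this check in Colab (the filename is hypothetical; cv2_imshow comes from google.colab.patches):

```python
import cv2
from google.colab.patches import cv2_imshow  # Colab-friendly replacement for cv2.imshow

img = cv2.imread("Images/example.jpg")  # hypothetical filename
print(img.shape)                        # e.g. (height, width, 3)
cv2_imshow(img)
```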

Now we will convert our images into vectors using a deep learning model named InceptionV3 with pre-trained ImageNet weights. This lets us leverage the hard work of the people who came up with InceptionV3 and trained it on the ImageNet dataset.

For that we download the InceptionV3 model from Keras, load it with the ImageNet weights, and drop the final layers of the model, since they are the fully connected layers used for classification and we do not need them. What we will use is the feature vector produced by the layer just before that classification head.
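A minimal sketch of building such a feature extractor; here the global-average-pooling output (a 2048-dimensional vector) is used, which is one common way to drop the classification head and may differ slightly from the author's exact cut:

```python
from tensorflow.keras.applications.inception_v3 import InceptionV3
from tensorflow.keras.models import Model

base = InceptionV3(weights="imagenet")  # full classifier, trained on ImageNet
# Re-wire the model so its output is the layer just before the softmax classifier.
image_encoder = Model(inputs=base.input, outputs=base.layers[-2].output)
print(image_encoder.output_shape)       # (None, 2048)
```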

The next task is to convert all the images and store their vectorized form in a pickle file so that we can use it directly for training. Why do this? That will become clear once we look at the training process, so have patience till then and follow along…

Let's see what each line signifies: we simply check whether the pickle file has already been created. If not, we create it and store the features using pickle; if it already exists, we just load it with pickle.load().

So, before we pass our images to the InceptionV3 model, we need to apply some pre-processing, and that's exactly what we do:

Step 1: read the image file.

Step 2: convert the image from BGR to RGB.

Step 3: resize the image to 299 x 299, as the Inception model expects input of that dimension.

Step 4: expand_dims adds an extra (batch) dimension to the image.

Step 5: Keras's preprocess_input standardizes the image (for InceptionV3 it scales pixel values into the range -1 to 1), among other such operations.

Step 6: finally, we pass the image through the Inception model to generate its feature vector, and reshape it to 1 x vector.shape[1].

Step 7: we then store the generated NumPy array as a value in a dict, keyed by the filename.
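Putting these steps together, along with the pickle caching described above, a rough sketch might look like this; the function, variable and file names are assumptions:

```python
import os
import pickle
import cv2
import numpy as np
from tensorflow.keras.applications.inception_v3 import preprocess_input

FEATURES_FILE = "train_image_features.pkl"  # hypothetical cache file name

def encode_image(path, image_encoder):
    img = cv2.imread(path)                         # Step 1: read the file
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)     # Step 2: BGR -> RGB
    img = cv2.resize(img, (299, 299))              # Step 3: InceptionV3 input size
    img = np.expand_dims(img, axis=0)              # Step 4: add a batch dimension
    img = preprocess_input(img.astype("float32"))  # Step 5: InceptionV3 preprocessing
    vec = image_encoder.predict(img, verbose=0)    # Step 6: 1 x 2048 feature vector
    return vec.reshape((1, vec.shape[1]))

if os.path.exists(FEATURES_FILE):
    with open(FEATURES_FILE, "rb") as f:
        photo_features = pickle.load(f)
else:
    photo_features = {}
    for filename in train_images:                  # Step 7: dict keyed by filename
        photo_features[filename] = encode_image(os.path.join("Images", filename), image_encoder)
    with open(FEATURES_FILE, "wb") as f:
        pickle.dump(photo_features, f)
```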

This same procedure is followed for Test data as well.

That’s all the pre-processing required for Images.

Now, we can move ahead with Model Architecture.

Image-to-Sequence Model :

Image Encoder: our image encoder is built on InceptionV3's output, so we just pass that feature vector through a Dropout layer followed by a Dense layer. That's all we need, as the hard part is already taken care of by the pre-trained model.

Text Encoder: our text encoder is a simple Embedding layer + Dropout + LSTM layer.

Decoder: our decoder takes the outputs of both encoders, merges them, passes the result through a Dense layer and finally a softmax layer, and returns the output.
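A sketch of this architecture in Keras. The layer sizes (256 units, 0.5 dropout) are assumptions based on common choices for this kind of merge model, not necessarily the author's exact values:

```python
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

embedding_dim, units = 200, 256  # assumed sizes
feature_dim = 2048               # InceptionV3 feature vector length
# max_caption_length and vocab_size come from the earlier steps.

# Image encoder branch: Dropout + Dense on top of the InceptionV3 features.
image_input = Input(shape=(feature_dim,))
x1 = Dropout(0.5)(image_input)
x1 = Dense(units, activation="relu")(x1)

# Text encoder branch: Embedding + Dropout + LSTM over the partial caption.
text_input = Input(shape=(max_caption_length,))
x2 = Embedding(vocab_size, embedding_dim, mask_zero=True)(text_input)
x2 = Dropout(0.5)(x2)
x2 = LSTM(units)(x2)

# Decoder: merge both branches, Dense, then softmax over the vocabulary.
merged = add([x1, x2])
decoder = Dense(units, activation="relu")(merged)
output = Dense(vocab_size, activation="softmax")(decoder)

model = Model(inputs=[image_input, text_input], outputs=output)
model.summary()
```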

Understanding the model isn't tough, but understanding how it gets trained is. Training is the complex part for this kind of model, because the data is passed in a slightly different way than usual.

But not to worry, it will be covered as well. For now let’s see the Model’s Architecture.

Now we set the weights of our Embedding layer to the matrix of GloVe vectors we created above, and we set its trainable parameter to False, because we do not want those values to be updated during training.

Now we can select our optimizer and loss function and set up our model for training.

We choose Adam as our optimizer and categorical_crossentropy as our loss function. Why categorical_crossentropy? It will be explained in a short while, please bear with me…
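A sketch of these two steps (freezing the GloVe embedding weights and compiling), assuming the model and embedding_matrix from the sketches above:

```python
from tensorflow.keras.layers import Embedding

# Freeze the Embedding layer and load the pre-computed GloVe matrix into it.
embedding_layer = next(l for l in model.layers if isinstance(l, Embedding))
embedding_layer.set_weights([embedding_matrix])
embedding_layer.trainable = False

# Categorical crossentropy, because the model predicts one word (a category) at a time.
model.compile(optimizer="adam", loss="categorical_crossentropy")
```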

Before we can set up our training routine, let's put some utility functions in place:

These two methods, for loading and saving the loss, will be used during training to save and retrieve the loss values, so that we can plot them and check how our model is performing.
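The originals aren't shown here; a simple pickle-based version (an assumption about how they might be implemented) could be:

```python
import os
import pickle

LOSS_FILE = "losses.pkl"  # hypothetical file name

def save_loss(losses):
    """Persist the list of per-epoch loss values."""
    with open(LOSS_FILE, "wb") as f:
        pickle.dump(losses, f)

def load_loss():
    """Load previously saved losses, or start with an empty list."""
    if os.path.exists(LOSS_FILE):
        with open(LOSS_FILE, "rb") as f:
            return pickle.load(f)
    return []
```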

Training:

We will train our model with the help of a generator function, which will generate the data in the format we require. As I mentioned earlier, training such a model is a bit tricky. Let's first understand how the data is supposed to be passed into the model.

Let’s understand with the help of an example:

[photo feature vector]

Caption: startSeq dog jumping on lawn during daytime photo endSeq.

Let's understand, step by step, how the data is passed into the model for training.

Step 1: we pass the whole photo feature vector along with "startSeq" (as its index value) to our model, and train it so that "dog" is the output word, given the photo feature and "startSeq".

Step 2: we again input the whole photo feature, and this time we pass "startSeq" plus the previously generated output, i.e. "dog", to the model. Remember that we never pass the actual words, only their indices, as the machine only understands numbers. We then train it so that, given the photo feature and these 2 words, it generates the next word, "jumping".

And so on: you can see how the data is passed in, one word at a time.

Let's understand how the model generates its output. The model produces a categorical output: it assigns the highest probability to the word it should generate, and we receive that output as an index. Then, using the index_to_word dict from above, we can recover the word.

Since the model's output is a categorical variable, the loss is categorical crossentropy.

Once the target reaches endSeq we stop generating training pairs for that image, move on to the next photo-caption pair, and build the same kind of dataset from it for further training.

And this is how our model is trained.

Now, the main question is how to create a generator function that will give us this breakdown of the data, in a form that can be passed to our model for training.

This is the function that generates the photo-caption training pairs for a single image, given the name of the image.
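Here is a sketch of what such a function could look like. The name get_photo_caption_pair matches the one referenced later in this post; the body follows the training scheme described above and is an assumption, not the author's exact code.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def get_photo_caption_pair(filename, photo_features, caption_df,
                           word_to_index, max_caption_length, vocab_size):
    """Expand one image's captions into (photo, partial caption) -> next word pairs."""
    X_image, X_text, y = [], [], []
    photo = photo_features[filename][0]  # 2048-d feature vector
    captions = caption_df[caption_df["filename"] == filename]["caption"]

    for caption in captions:
        # Caption as a list of word indices, skipping out-of-vocabulary words.
        seq = [word_to_index[w] for w in caption.split() if w in word_to_index]
        for i in range(1, len(seq)):
            in_seq, out_seq = seq[:i], seq[i]
            in_seq = pad_sequences([in_seq], maxlen=max_caption_length)[0]
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
            X_image.append(photo)
            X_text.append(in_seq)
            y.append(out_seq)

    return np.array(X_image), np.array(X_text), np.array(y)
```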

Let's understand what is inside this function. If you understand this, you understand everything in this module, because it is likely the toughest part of the blog.

Now that we understand intuitively how the data is passed, let's see how to write a function that generates the same format for a single image.

I pass the filename into the function, and then I need 2 things:

1. Photo Feature of that image, which we will retrieve from the dictionary we have generated above.

2. The caption associated with it, which we will get from the dataframe we have with us.

Since 1 photo has 5 captions, we need to create separate training pairs from each image-caption combination.

Now our “seq” variable stores the sequence of our caption in index form.

We will run a loop over our “seq” variable and generate the input and output values from it.

The range of the loop is set from 1 to (len - 1).

So, on the first iteration:

in_seq = seq[:1], i.e. index 0

out_seq = seq[1], i.e. index 1

in_seq is padded up to the maximum sequence length, so that all our inputs have the same length; here we make use of max_caption_length.

out_seq is converted to a one-hot categorical vector so that our model can predict it as a category.

We consider one word at a time, create the sequence, and append it to the input and output lists. This builds the training dataset for a single image; we still need to create the generator function for our model.

Let's look at the train generator function, which will generate this data pattern for however many images we pass to it.
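A sketch of such a generator, built on top of the get_photo_caption_pair sketch above; the batching-by-number-of-images logic follows the description below:

```python
import numpy as np

def train_generator(filenames, photo_features, caption_df, word_to_index,
                    max_caption_length, vocab_size, num_images_per_batch):
    """Yield batches covering `num_images_per_batch` images at a time, forever."""
    X_image, X_text, y, n = [], [], [], 0
    while True:  # Keras keeps pulling batches from this generator indefinitely
        for filename in filenames:
            xi, xt, yy = get_photo_caption_pair(filename, photo_features, caption_df,
                                                word_to_index, max_caption_length,
                                                vocab_size)
            X_image.append(xi); X_text.append(xt); y.append(yy)
            n += 1
            if n == num_images_per_batch:
                yield [np.concatenate(X_image), np.concatenate(X_text)], np.concatenate(y)
                X_image, X_text, y, n = [], [], [], 0  # reset for the next batch
```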

Now, as we already know about the get_photo_caption_pair() method, let’s understand what this method does.

We pass the total number of images we want per batch; let's say we pass 3. The counter variable "n" is initially set to 0. There is an infinite while loop, and inside it a for loop that runs over all the images in our dataset. We get the data pattern for each photo from the method above, and then check whether n has reached the requested number of images. If so, we yield the accumulated batch and reset our variables to their initial values. You may need to understand what "yield" does to fully grasp this method; if you already do, then you already understand how it works.

Now, let's see how the training actually happens…

We first load the saved weights of the model, if any exist; otherwise we start from scratch.

Then we set the number of epochs, the steps per epoch, and the number_of_images to be used by our train generator, and finally train our model using the Adam optimizer.

We create a generator from the train_generator function, pass it to our model, and get the loss value for each epoch. If the loss is lower than the previous best, we save the weights; otherwise we don't.
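A sketch of this training loop, using the pieces defined in the earlier sketches (the epoch count, batch size and weights filename are illustrative values, not the author's):

```python
import os

epochs = 20                # illustrative
number_of_images = 6       # images per generator batch (illustrative)
steps = len(train_images) // number_of_images
weights_file = "model_weights.h5"  # hypothetical file name

if os.path.exists(weights_file):
    model.load_weights(weights_file)  # resume from a previous run

losses = load_loss()
best_loss = min(losses) if losses else float("inf")

for epoch in range(epochs):
    generator = train_generator(list(train_images), photo_features, train_df,
                                word_to_index, max_caption_length, vocab_size,
                                number_of_images)
    history = model.fit(generator, epochs=1, steps_per_epoch=steps, verbose=1)
    loss = history.history["loss"][0]
    losses.append(loss)
    save_loss(losses)
    if loss < best_loss:              # only keep the best weights so far
        best_loss = loss
        model.save_weights(weights_file)
```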

This is what the training looks like…

And this is our model’s loss value while training.

Finally, when training is done, we can move on to the evaluation phase of our model.

Evaluation:

Just as training the model was a bit tricky, evaluation is also a bit different. Now that you understand the training process, it won't be hard to understand how prediction happens.

Let’s understand how that happens.

The get_caption_for_photo method is where all the magic happens. We have already separated our test data from the train data; we give this method a filename, and it retrieves the file from the location we set up earlier.

We have already extracted the photo features of our images with the InceptionV3 model and saved them to a dict using pickle, so we make use of that to get the photo feature.

Caption generation happens word by word, so we start with startSeq as the text input, along with the photo feature.

We run a loop up to max_caption_length, since we know no caption should be longer than that. We create the data sequence the same way as in training: we take the indices of the words in "in_text", convert them to a padded vector, and form our input sequence.

Then we pass both inputs (photo feature and text) to our model. It returns the index of a word, which we convert to a word using the index_to_word dict and append to our in_text variable. In the next iteration this longer sequence becomes the text input, and the same process repeats until the model outputs endSeq.

Finally we return the caption after removing startSeq and endSeq from the sentence.
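A sketch of this greedy decoding loop, assuming the same helper objects as in the earlier sketches:

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def get_caption_for_photo(photo_feature, model, word_to_index, index_to_word,
                          max_caption_length):
    """Greedy decoding: repeatedly predict the most likely next word."""
    in_text = "startSeq"
    for _ in range(max_caption_length):
        seq = [word_to_index[w] for w in in_text.split() if w in word_to_index]
        seq = pad_sequences([seq], maxlen=max_caption_length)
        preds = model.predict([photo_feature, seq], verbose=0)
        word = index_to_word[int(np.argmax(preds))]
        in_text += " " + word
        if word == "endSeq":
            break
    # Strip the start/end tokens before returning the caption.
    return " ".join(w for w in in_text.split() if w not in ("startSeq", "endSeq"))
```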

This method of generating caption is known as Greedy Search.

End Results:

Let's see some end results from our model. First, some relevant captions.

It wouldn't be fair not to show the incorrect results too, because, as said, the model isn't perfect. So below are some of the incorrect results.

Deployment:

Well, deployment is an entirely different phase from building a machine learning model. We need to build a UI for our model and work on saving and loading it. The code will be simple, and we will build an API for accessing the model and getting results out of it.

Before we move on to deployment, let's state some basic assumptions:

· You know what an API is.

· You know how to save a model.

· You know the basics of HTML and JavaScript.

Before we move on to the deployment phase, we have to save certain artifacts from our ML code, namely:

· The Image-to-Seq model, saved with the Keras model save method as model.h5

· We need to save our index_to_word dict and word_to_index dict as pickle files.

The code for these steps will not be shown: as mentioned, we assume you know the basics of Python, and saving a pickle file is quite easy. If you don't know how to save pickle files, I'd ask you to google it; it's very simple.

Now let's understand what code we need to write to deploy the model. Let me make one thing clear: we will keep the code as clean, elegant and small as possible. The less time it takes to load into memory, the better.

Hence any unnecessary code needs to be removed.

Let’s look at what needs to be written.

We will name our code file app.py. We will explain why shortly; we could name it anything, but for our convenience we are naming it app.py.

We first set up all the imports we need, importing only the libraries we actually reference, which helps keep the code light.

After the import section is done, we will now see the code line by line.

Line 15: This creates our Flask application; __name__ is the name of our module, i.e. app.py.

Lines 17–18: These are parameter values carried over from our ML code; we use them as static values since our model was trained with them.

Lines 21–22: This is where we load our word_to_index and index_to_word dicts.

Line 24: This is where we set our vocab_size.

Lines 26–27: As we saw, we used the InceptionV3 model to vectorize our images, so new images must also be vectorized the same way, and the configuration must match the one used when training our ML model.

Lines 29–30: Here we load our model, which is saved under the name 'model.h5'.
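Since the screenshot of this section isn't reproduced here, the following is a sketch of roughly what these lines amount to. The line numbers won't match exactly, and file names such as word_to_index.pkl are assumptions:

```python
import pickle
from flask import Flask, render_template, request
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.applications.inception_v3 import InceptionV3

app = Flask(__name__)                       # line 15: the Flask application

max_caption_length = 39                     # lines 17-18: static values from training
embedding_dim = 200

with open("word_to_index.pkl", "rb") as f:  # lines 21-22: load the two dicts
    word_to_index = pickle.load(f)
with open("index_to_word.pkl", "rb") as f:
    index_to_word = pickle.load(f)

vocab_size = len(word_to_index) + 1         # line 24: vocab size

base = InceptionV3(weights="imagenet")      # lines 26-27: same image encoder as training
image_encoder = Model(inputs=base.input, outputs=base.layers[-2].output)

model = load_model("model.h5")              # lines 29-30: the trained caption model
```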

Before moving on to the next section: as mentioned earlier, we have to design our own UI to accept input from the user and show the results, so we will have 2 HTML pages, as follows:

· Home.html

· Results.html

Our first page is Home.html, and the end result is shown in Results.html. We will look at their design a little later; I mention them now because they are about to be introduced in our code, and I don't want you to get lost.

Okay, moving on…

This specifies our home page. Input is taken from this page and passed on to our server, where the code resides, for execution. Whenever someone hits our server they are routed to this page by default, because of '/'.

@app.route is a Flask decorator that maps a URL to the function below it. render_template sends the specified HTML page to the client's browser. For now this explanation is enough to understand what's happening; we will not go into more depth.
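As a sketch, the home route is just:

```python
@app.route("/")
def home():
    # Default landing page; serves templates/Home.html to the browser.
    return render_template("Home.html")
```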

Now this is the code where all the magic happens, and we will understand every line of it. Don't be scared; we have already seen most of it. This is our prediction code, similar to get_caption_for_photo. The only difference is that, rather than taking the picture features from a saved file or dict, we convert the image on the fly and then predict. So everything you see here is what you saw earlier in our ML code, just gathered in one place instead of scattered about, plus some additional logic for fetching the input data, which we will look at now.

To understand how this works, we need to understand what our input from the UI is: the user provides an image link from the internet, which should be a direct link we can download the image from. We receive the link, download the image into a folder, and the rest of the process is the conversion and prediction we have already built.

@app.route('/predict', methods=['POST'])

This line specifies that our server submits the data with the POST method; when we click the submit button in the UI we are routed to this method via the URL '/predict'.

Line 38: defines the count variable, just to keep track of how many images we have processed so far.

Lines 39–41: we retrieve the image from the input URL, download it, and store it in a folder. The URL comes from the input box in the form on the HTML page; we set the image name, fetch the file, and store it at that location.

Lines 42–46: here we pre-process the image and convert it into its vectorized form before feeding it to the model.

Lines 47–61: these are the same as the predict method in our ML code, i.e. get_caption_for_photo, so they will be skipped.

Lines 62–71: these lines just clear the used variables and delete the stored image, so that we do not run out of space.

Line 72: we return the prediction from our model and the URL of the image as parameters; they will be used in the HTML page.

Lines 74–75: this is the run block: if we run this file directly it calls app.run(), which starts our server.
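For completeness, here is a condensed sketch of what the /predict route does. Helper names like encode_image and get_caption_for_photo refer to the sketches earlier; urllib's urlretrieve and the storage path are assumptions and may differ from the original code:

```python
import os
from urllib.request import urlretrieve

count = 0  # line 38: simple counter for processed images

@app.route("/predict", methods=["POST"])
def predictCaption():
    global count
    count += 1

    # Lines 39-41: fetch the image URL from the form and download it locally.
    image_url = request.form["imageSource"]
    local_path = f"static/image_{count}.jpg"  # hypothetical storage location
    urlretrieve(image_url, local_path)

    # Lines 42-46: pre-process and vectorize the image exactly as in training.
    photo_feature = encode_image(local_path, image_encoder)

    # Lines 47-61: greedy decoding, same as get_caption_for_photo.
    caption = get_caption_for_photo(photo_feature, model, word_to_index,
                                    index_to_word, max_caption_length)

    # Lines 62-71: clean up the stored image so we don't run out of space.
    os.remove(local_path)

    # Line 72: hand the prediction and image URL to the results template.
    return render_template("Results.html", prediction=caption, urlImage=image_url)

if __name__ == "__main__":
    app.run()  # lines 74-75: start the development server
```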

Now, we will make a simple UI for home and results page.

  1. Home.html

This is how the UI looks; we will only look at how to create the form and the essential fields in it.

We will focus on certain parts of this code, i.e.

Form name='imageName' action="{{ url_for('predictCaption') }}": here the action field points to the app.route for the predictCaption() method.

We are mainly interested in the textarea name='imageSource', as this field is used in the method to retrieve the image from the URL.

2. Results.html

We will focus on image.src="{{ urlImage }}"; this is the parameter passed through render_template on line 72.

{{ prediction }} is likewise a parameter passed through render_template on line 72.

Also, we have to keep these files in a folder named 'templates'.

Next comes something called a Procfile. This file is named exactly 'Procfile' and, yes, it has no extension. It looks like this:
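The screenshot isn't reproduced here, but for a Flask app living in app.py the Procfile is typically a single line like:

```
web: gunicorn app:app
```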

Next is requirements.txt, which lets the server machine install all the packages required by our Python modules.

Gunicorn is mandatory, since it is referenced by our Procfile.

These are all the files we need; we have to upload them to a public GitHub repository, which is pretty basic stuff you can google.

That's all we need. Now everything depends on how we set up our Heroku deployment environment; here is a step-by-step guide.

https://dashboard.heroku.com/login

Visit this link and create your own profile.

Click on create new app.

Give the desired name and choose region as US.

Navigate to Deploy tab, then choose GitHub as the deployment method and connect your Heroku app to your GitHub project.

Scroll down to the manual deploy section and click the Deploy Branch button.

Wait for it to show that your project is deployed to Heroku.

Finally click on Open app to see your application go live.

You can visit mine here: https://bbose-caption-it.herokuapp.com/

Please read through the steps for proper usage of the application.

If you made it this far, thank you for reading this blog. I know it was a lot, but I'm sure you enjoyed it.

You can find the whole code in my GitHub profile, here.

So what’s stopping you? Go ahead, create and deploy, and don’t forget to Have FUN !!!

Stay updated with all my blogs & updates on LinkedIn. Welcome to my network. Follow me on LinkedIn here: https://www.linkedin.com/in/bishalbose294/


Written by Bishal Bose

Senior Lead Data Scientist @ MNC | Applied & Research Scientist | Google & AWS Certified | Gen AI | LLM | NLP | CV | TS Forecasting | Predictive Modeling
