# IMDB 
The Movie Review Data (often referred to as the IMDB dataset) is a collection of movie reviews retrieved from the imdb.com website in the early 2000s by Bo Pang and Lillian Lee. The reviews were collected and made available as part of their research on natural language processing.

The Large Movie Review Dataset contains 25,000 highly polar movie reviews (good or bad) for training and the same number again for testing.

## Sentiment Analysis
Sentiment analysis aims to determine the attitude of a spoken or written text with respect to some topic or the overall contextual polarity or emotional reaction to the subject. 

In this exercise we are going to use Deep Learning to analyze film reviews and work out from the text whether a film received a positive review or a negative one.

In [37]:
# python modules import party ;-)
import numpy as np

# required model layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import LSTM
# optional extra layers
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Dropout

# utilities for sequence processing
from tensorflow.keras.preprocessing import sequence

# access to the IMDB dataset of Keras
from tensorflow.keras.datasets import imdb

# fix random seed for reproducibility
seed = 7
np.random.seed(seed)

## Keras IMDB dataset
Keras ships with several Deep Learning benchmark datasets. IMDB is one of them because it is widely used as the "Hello World" case for sentiment analysis and context extraction.

In the Keras IMDB dataset, reviews have been preprocessed and encoded as a sequence of word indexes (integers). Words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data.
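The frequency-based indexing described above can be illustrated with a minimal sketch. The toy corpus and the helper `build_frequency_index` are invented for illustration; this is not the code used to build the Keras dataset, and the alphabetical tie-breaking rule is an assumption made only to keep the result deterministic.

```python
from collections import Counter

def build_frequency_index(tokenized_reviews):
    """Map each word to its frequency rank (1 = most frequent word)."""
    counts = Counter(word for review in tokenized_reviews for word in review)
    # sort by descending count, then alphabetically so that ties are deterministic
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

# toy corpus, invented for illustration only
toy_reviews = [
    "the film was great".split(),
    "the plot was weak".split(),
    "the film was long".split(),
]

toy_index = build_frequency_index(toy_reviews)

# each review becomes a sequence of integers, one per word
encoded = [[toy_index[word] for word in review] for review in toy_reviews]

print(toy_index["the"], toy_index["was"], toy_index["film"])  # 1 2 3
print(encoded[0])                                             # [1, 3, 2, 4]
```

Frequent words get small integers, so keeping only indexes below a threshold is equivalent to keeping only the most common words.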

The <code><a href='https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification'>imdb.load_data()</a></code> function returns two tuples, one with the training sequences and labels and one with the test sequences and labels.

The function takes additional arguments including:
- <code>num_words</code>: the number of most frequent words to keep (less frequent words are replaced by the <code>oov_char</code> value, which is 2 by default)
- <code>skip_top</code>: the number of most frequent words to skip (e.g. to drop "the", "an", etc.)
- <code>maxlen</code>: the maximum length of reviews to support (longer are truncated).

The words have been replaced by integers that indicate the absolute popularity of the word in the dataset. The sentences in each review are therefore comprised of a sequence of integers.

We can reconstruct the original review text using:<br>
<code>word_index = imdb.get_word_index(path="imdb_word_index.json")</code><br>
which returns a dictionary where keys are words (str) and values are indexes (integer) (e.g. word_index["giraffe"] might return 1234).
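A minimal sketch of the decoding step follows. It assumes the default <code>load_data</code> settings, where index 0 is padding, 1 marks the start of a review, 2 stands for an out-of-vocabulary word and real word indexes are shifted by <code>index_from=3</code>. The helper `decode_review`, the toy dictionary and the placeholder tokens such as `<unk>` are hypothetical names chosen for illustration; with the real data you would pass the dictionary returned by <code>imdb.get_word_index()</code>.

```python
def decode_review(encoded_review, word_index, index_from=3):
    """Turn a list of integer indexes back into a readable string."""
    # invert the mapping, shifting by index_from as load_data does by default
    index_to_word = {index + index_from: word for word, index in word_index.items()}
    # reserved indexes under the default load_data settings (token names are ours)
    index_to_word.update({0: "<pad>", 1: "<start>", 2: "<unk>"})
    return " ".join(index_to_word.get(i, "<unk>") for i in encoded_review)

# toy stand-in for the dictionary returned by get_word_index()
toy_word_index = {"the": 1, "film": 2, "great": 3}

# <start>, "the" (1+3), "film" (2+3), out-of-vocabulary, "great" (3+3)
toy_encoded = [1, 4, 5, 2, 6]

print(decode_review(toy_encoded, toy_word_index))
# <start> the film <unk> great
```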


In [38]:
print('Loading data...')

# maximum number of unique most frequent words to consider
max_features = 20000

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

print('train sequences length:', len(x_train))
print('test  sequences length:', len(x_test))

Loading data...
train sequences length: 25000
test  sequences length: 25000


### Dataset normalization

Always verify that the input dataset is "well normalized" or, in the case of word sequences, that it has the expected shape.

We are dealing with a set of sequences of numbers, and each sequence has a different length. A neural network layer requires a FIXED input shape, so we have to transform our input to satisfy this constraint.

Keras provides some preprocessing utility functions for text, images and sequences. 

The <code><a href='https://keras.io/preprocessing/sequence/#pad_sequences'>sequence.pad_sequences</a></code> function does the job:

keras.preprocessing.sequence.pad_sequences(
  sequences, maxlen=None, dtype='int32', 
  padding='pre', truncating='pre', value=0.0)

The <code>pad_sequences</code> function transforms a list of num_samples sequences (lists of integers) into a 2D Numpy array of shape (num_samples, num_timesteps). num_timesteps is either the maxlen argument if provided, or the length of the longest sequence otherwise.

- sequences that are shorter than num_timesteps are padded with <code>value</code> (at the beginning by default).

- sequences longer than num_timesteps are truncated so that they fit the desired length. 

The position where padding or truncation happens is determined by the arguments padding and truncating, respectively.
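The padding and truncation rules above can be mimicked with a small plain-Python sketch. `pad_sequences_sketch` is a hypothetical helper written for illustration; it returns nested lists rather than a Numpy array and is not the Keras implementation.

```python
def pad_sequences_sketch(sequences, maxlen, padding="pre", truncating="pre", value=0):
    """Plain-Python imitation of the padding/truncation rules (assumes maxlen >= 1)."""
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops items from the start, 'post' drops them from the end
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # 'pre' puts the filler before the data, 'post' puts it after
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded

toy = [[1, 2], [3, 4, 5, 6, 7]]

print(pad_sequences_sketch(toy, maxlen=4))
# [[0, 0, 1, 2], [4, 5, 6, 7]]

print(pad_sequences_sketch(toy, maxlen=4, padding="post", truncating="post"))
# [[1, 2, 0, 0], [3, 4, 5, 6]]
```

With the default `'pre'` settings a long review keeps its last words, which is often where the final verdict appears.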

In [39]:
# check the mean length of the sequences (training and test reviews together)
all_reviews = list(x_train) + list(x_test)
review_lengths = [len(x) for x in all_reviews]
mean_review_length = int(np.mean(review_lengths))
print("Mean words in a review:", mean_review_length)

Mean words in a review: 234


In [40]:
# cut each review to at most maxlen words
maxlen = mean_review_length
# maxlen = 100

x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test  shape:', x_test.shape)

x_train shape: (25000, 234)
x_test  shape: (25000, 234)


### Word Embedding
Words are mapped to integers, but unfortunately, neural networks work with floats.

A recent breakthrough in the field of natural language processing is called word embedding. This is a technique where words are encoded as real-valued vectors in a higher-dimensional space, where the similarity between words in terms of meaning translates to closeness in the vector space.

Keras provides the Embedding layer to convert positive integer representations of words into a word embedding representation. The Embedding layer can only be used as the first layer in a model.

The layer takes arguments that define the mapping, including the maximum number of expected words, also called the vocabulary size (i.e. one more than the largest integer index that will appear in the input). The layer also lets you specify the dimensionality of each word vector, called the output dimension.
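Conceptually, an embedding is a lookup table with one row of floats per word index. The sketch below uses Numpy indexing to show the shapes involved; the sizes are toy values, the random matrix stands in for weights that a real layer would learn during training, and this is not the Keras implementation.

```python
import numpy as np

vocab_size = 10   # toy value; the notebook uses max_features = 20000
output_dim = 4    # toy value; the notebook uses embedding_size = 64

rng = np.random.default_rng(0)
# one row of output_dim floats per word index; in a real layer these are trained
embedding_matrix = rng.normal(size=(vocab_size, output_dim))

# a batch of 2 padded sequences with 3 timesteps each
batch = np.array([[0, 3, 7],
                  [0, 0, 9]])

# the lookup is plain integer indexing: shape (2, 3) becomes (2, 3, 4)
embedded = embedding_matrix[batch]

print(embedded.shape)         # (2, 3, 4)
print(embedding_matrix.size)  # 40 weights = vocab_size * output_dim
```

Every index in the input must be smaller than `vocab_size`, otherwise the lookup falls outside the table.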

## Long Short Term Memory (LSTM) Layer

Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies.

They work very well on a large variety of problems, especially those involving sequential data such as text, speech, signals, trends and so on.

<img src=imgs/LSTM-3-cells.png width=400pt>

(see: http://colah.github.io/posts/2015-08-Understanding-LSTMs)

Keras provides a special recurrent neural network layer for Long Short Term Memory neurons, with only one required parameter to set: <code>units</code>, the dimensionality of the output space, that is the size of the hidden state vector the layer produces (not the length of the input sequence).

So LSTM(units=N) means that the LSTM cell has N units: at every timestep the hidden state is a vector of size N. The number of timesteps is determined automatically by the input (for example: if you feed the LSTM an input tensor of shape (150, 300), the same cell, with the same weights, is applied over 150 timesteps).

Also keep an eye on the following parameters:
- <code>return_sequences</code>: whether to return the last output in the output sequence, or the full sequence.
- <code>return_state</code>: whether to return the last state in addition to the output.
- <code>stateful</code>:  if True, the last state for each sample at index i in a batch will be used as initial state for the sample of index i in the following batch (default False).
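To make the role of <code>units</code> and <code>return_sequences</code> concrete, here is a minimal Numpy sketch of an LSTM forward pass over a single sequence. It uses the standard LSTM gate equations with random, untrained weights; `lstm_forward` and its arguments are hypothetical names for illustration, and the Keras layer differs in many details (batching, initializers, optimized kernels).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(inputs, W, U, b, units, return_sequences=False):
    """Run one LSTM cell over a (timesteps, features) array."""
    h = np.zeros(units)  # hidden state, a vector of size units
    c = np.zeros(units)  # cell (memory) state, also of size units
    outputs = []
    for x_t in inputs:   # the same weights are reused at every timestep
        z = x_t @ W + h @ U + b          # shape (4 * units,)
        i, f, g, o = np.split(z, 4)      # input, forget, candidate, output
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)       # update the memory
        h = o * np.tanh(c)               # new hidden state
        outputs.append(h)
    # full sequence of hidden states, or only the last one
    return np.stack(outputs) if return_sequences else h

timesteps, features, units = 150, 300, 8
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(features, 4 * units))  # input weights
U = rng.normal(scale=0.1, size=(units, 4 * units))     # recurrent weights
b = np.zeros(4 * units)                                # biases
x = rng.normal(size=(timesteps, features))

last = lstm_forward(x, W, U, b, units)
full = lstm_forward(x, W, U, b, units, return_sequences=True)
print(last.shape)  # (8,)
print(full.shape)  # (150, 8)
```

The last row of the full sequence is the same vector returned when `return_sequences` is False, which is what a following Dense layer would consume.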

The LSTM layer has many other optional parameters for tuning internal details, such as weight initialization and the activation functions to use. Other parameters exist for optimization purposes, such as unrolling the LSTM loop (useful for short sequences) and the implementation mode to use (some are better for CPU, others for GPU).

## Model

In [41]:
print('Build model...')

# embedding subspace dimensions (features per word)
embedding_size = 64 

# LSTM
lstm_size = 70

# Convolution (optional layer)
model_use_convolution_layer=False
kernel_size = 5
filters = 64
pool_size = 4

model = Sequential()

# EMBEDDING: turn each integer into a vector of embedding_size floats
# the model will take as input an integer matrix of size (batch, input_length).
# the largest input integer should be no larger than max_features - 1 (vocabulary size).
# here model.input_shape == (None, maxlen), None is the batch dimension
model.add(Embedding(max_features, embedding_size, input_length=maxlen))
# here model.output_shape == (None, maxlen, embedding_size), None is the batch dimension

# use dropout to reduce overfitting
model.add(Dropout(0.25))

# we can use one or more convolution layers
if model_use_convolution_layer:

    # 1D CONVOLUTION: inspect a window of kernel_size length elements
    model.add(Conv1D(filters, kernel_size, activation='relu'))

    model.add(MaxPooling1D(pool_size=pool_size))

# LONG SHORT TERM MEMORY LAYER
model.add(LSTM(lstm_size))

# last layer: a single sigmoid unit for binary (positive/negative) classification
model.add(Dense(1, activation='sigmoid'))

# binary_crossentropy pairs with sigmoid activation
# in classification problems of yes/no kind
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# print model summary and used parameters
model.summary()

Build model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 234, 64)           1280000   
_________________________________________________________________
dropout_5 (Dropout)          (None, 234, 64)           0         
_________________________________________________________________
lstm_5 (LSTM)                (None, 70)                37800     
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 71        
Total params: 1,317,871
Trainable params: 1,317,871
Non-trainable params: 0
_________________________________________________________________
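The parameter counts in the summary above can be checked by hand. The sketch below uses the standard formulas for each layer type: an embedding table has one row per word, an LSTM has four gates each with input weights, recurrent weights and a bias, and a dense layer has one weight per input plus a bias. The variable values repeat the ones set in the notebook.

```python
max_features = 20000   # vocabulary size
embedding_size = 64    # features per word
lstm_size = 70         # LSTM units

# Embedding: one vector of embedding_size floats per word
embedding_params = max_features * embedding_size

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (lstm_size * (embedding_size + lstm_size) + lstm_size)

# Dense(1): one weight per LSTM unit plus one bias
dense_params = lstm_size * 1 + 1

total_params = embedding_params + lstm_params + dense_params

print(embedding_params)  # 1280000
print(lstm_params)       # 37800
print(dense_params)      # 71
print(total_params)      # 1317871
```

Almost all of the weights sit in the embedding table, so reducing `max_features` or `embedding_size` is the quickest way to shrink the model.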


## Training

In [42]:
print('Train...')

# Training parameters
batch_size = 256
epochs = 2

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test))

# report model score and accuracy
score = model.evaluate(x_test, y_test, batch_size=batch_size)
print("Accuracy: %.2f%%" % (score[1]*100))

Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2
Accuracy: 87.35%
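The two numbers returned by <code>model.evaluate</code> are the loss and the accuracy. As a rough sketch of what they measure, the plain-Python functions below compute binary cross-entropy and accuracy from sigmoid outputs, assuming the usual 0.5 decision threshold. The labels and probabilities are invented toy values, not results from the model above, and the helper names are hypothetical.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(p == t) for p, t in zip(predictions, y_true))
    return correct / len(y_true)

# toy values, invented for illustration only
toy_labels = [1, 0, 1, 1, 0]
toy_probs = [0.91, 0.20, 0.45, 0.73, 0.62]

print("Loss: %.4f" % binary_crossentropy(toy_labels, toy_probs))
print("Accuracy: %.2f%%" % (binary_accuracy(toy_labels, toy_probs) * 100))
# Accuracy: 60.00%
```

The loss rewards confident correct predictions and penalizes confident wrong ones, while accuracy only counts which side of the threshold each prediction falls on.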


# Exercises

1. Play with model's parameters to enhance accuracy:
  - number of <code>epochs</code> and <code>batch_size</code>
  - embedding subspace size (<code>embedding_size</code>)
  - LSTM units (<code>lstm_size</code>)

1. We used 25,000 samples for training, which is often too few for good training results. Try moving some of the testing sequences into the training set, train the model again and check whether accuracy improves.

1. The final verdict on a film often comes in the last sentences of the review. Try building new input sequences that keep only the last <code>last_max</code> words of the original sequences. Run the model again and check whether prediction accuracy improves.

1. Try to classify topics from the
   <a href='https://keras.io/datasets/#reuters-newswire-topics-classification'>Reuters Newswire dataset</a> using an LSTM model.