{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# IMDB \n", "The Movie Review Data (often referred to as the IMDB dataset) is a collection of movie reviews retrieved from the imdb.com website in the early 2000s by Bo Pang and Lillian Lee. The reviews were collected and made available as part of their research on natural language processing.\n", "\n", "The Large Movie Review Dataset contains 25,000 highly polar moving reviews (good or bad) for training and the same amount again for testing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sentiment Analisys\n", "Sentiment analysis aims to determine the attitude of a spoken or written text with respect to some topic or the overall contextual polarity or emotional reaction to the subject. \n", "\n", "In this exercise we are going to use Deep Learning to analise film reviews and understand from the text if the film had a positive, good review or a bad one." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "# python modules import party ;-)\n", "import numpy as np\n", "\n", "# required model layers\n", "from tensorflow.keras.models import Sequential\n", "from tensorflow.keras.layers import Embedding\n", "from tensorflow.keras.layers import LSTM\n", "# optional extra layers\n", "from tensorflow.keras.layers import Dense\n", "from tensorflow.keras.layers import Conv1D, MaxPooling1D, Dropout\n", "\n", "# utilitis for sequence processing \n", "from tensorflow.keras.preprocessing import sequence\n", "\n", "# access to the IMDB dataset of Keras\n", "from tensorflow.keras.datasets import imdb\n", "\n", "# fix random seed for reproducibility\n", "seed = 7\n", "np.random.seed(seed)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Keras IMDB dataset\n", "Keras comes with many Deep Learning benchmark datasets, and the IMDB is one of those because it is very common and used as the \"Hello World\" case for sentiment analysis and context extraction.\n", "\n", "In the Keras IMDB dataset, reviews have been preprocessed and encoded as a sequence of word indexes (integers). Words are indexed by overall frequency in the dataset, so that for instance the integer \"3\" encodes the 3rd most frequent word in the data.\n", "\n", "The imdb.load_data() returns two tuple for the train and validation pair sequences.\n", "\n", "The functions takes additional arguments including:\n", "- num_words: the number of unique words to load ( words with a lower integer are marked as zero in the returned data)\n", "- skip_top: the number of top words to skip (i.e: avoid all “the”, \"an\", etc)\n", "- maxlen: the maximum length of reviews to support (longer are truncated).\n", "\n", "The words have been replaced by integers that indicate the absolute popularity of the word in the dataset. The sentences in each review are therefore comprised of a sequence of integers.\n", "\n", "We can reconstruct original review text using:
\n", "word_index = reuters.get_word_index(path=\"reuters_word_index.json\")
\n", "which returns a dictionary where key are words (str) and values are indexes (integer) (i.e: word_index[\"giraffe\"] return 1234).\n" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading data...\n", "train sequences length: 25000\n", "test sequences length: 25000\n" ] } ], "source": [ "print('Loading data...')\n", "\n", "# maximum number of unique most frequent words to consider\n", "max_features = 20000\n", "\n", "(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)\n", "\n", "print('train sequences length:', len(x_train))\n", "print('test sequences length:', len(x_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dataset normalization\n", "\n", "Always verify that the input dataset is \"well normalized\" or, in the case of word sequences, in the right shape expected.\n", "\n", "We are dealing with a set of sequences of numbers, each sequence has a different length. A neural network layer requires a FIXED dataset shape, so we have to transform and manipulate our input to fit to this constraint.\n", "\n", "Keras provides some preprocessing utility functions for text, images and sequences. \n", "\n", "The sequence.pad_sequencies function does the job:\n", "\n", "keras.preprocessing.sequence.pad_sequences(\n", " sequences, maxlen=None, dtype='int32', \n", " padding='pre', truncating='pre', value=0.0)\n", "\n", "the pad_sequence function transforms a list of num_samples sequences (lists of integers) into a 2D Numpy array of shape (num_samples, num_timesteps). num_timesteps is either the maxlen argument if provided, or the length of the longest sequence otherwise.\n", "\n", "- sequences that are shorter than num_timesteps are padded with value at the end.\n", "\n", "- sequences longer than num_timesteps are truncated so that they fit the desired length. \n", "\n", "The position where padding or truncation happens is determined by the arguments padding and truncating, respectively." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean words in a review: 234\n" ] } ], "source": [ "# check the mean length of the sequences\n", "review_lengths = [len(x) for x in all_reviews]\n", "mean_review_length = int(np.mean(review_lengths))\n", "print(\"Mean words in a review:\", mean_review_length)" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "x_train shape: (25000, 234)\n", "x_test shape: (25000, 234)\n" ] } ], "source": [ "# cut reviews's length after maxlen words\n", "maxlen = mean_review_length\n", "# maxlen = 100\n", "\n", "x_train = sequence.pad_sequences(x_train, maxlen=maxlen)\n", "x_test = sequence.pad_sequences(x_test, maxlen=maxlen)\n", "print('x_train shape:', x_train.shape)\n", "print('x_test shape:', x_test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Words Embedding\n", "Words are mapped to integers, but unfortunately, neural networks work with floats.\n", "\n", "A recent breakthrough in the field of natural language processing is called word embedding. 
This is a technique where words are encoded as real-valued vectors in a high-dimensional space, where similarity between words in terms of meaning translates into closeness in the vector space.\n", "\n", "Keras provides the Embedding layer to convert positive integer representations of words into a word embedding representation. The Embedding layer can only be used as the first layer in a model.\n", "\n", "The layer takes arguments that define the mapping, including the maximum number of expected words, also called the vocabulary size (i.e. the largest integer word index that will appear in the input, plus one). The layer also lets you specify the dimensionality of each word vector, called the output dimension." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Long Short Term Memory (LSTM) Layer\n", "\n", "Long Short Term Memory networks – usually just called \"LSTMs\" – are a special kind of RNN, capable of learning long-term dependencies.\n", "\n", "They work very well on a large variety of problems, especially those involving data sequences such as text, speech, signals, trends and so on.\n", "\n", "(see: http://colah.github.io/posts/2015-08-Understanding-LSTMs)\n", "\n", "Keras provides a recurrent neural network layer for Long Short Term Memory cells, with only one required parameter: the dimensionality of the output space, i.e. the size of the hidden state vector.\n", "\n", "So LSTM(units=N) means that each cell of the LSTM layer has N neurons (at every timestep the hidden state is a vector of size N), while the number of timesteps the LSTM is unrolled over is determined automatically from the input (for example, if you feed the LSTM an input tensor of shape (150, 300), the cell is applied 150 times); see the sketch at the end of this section.\n", "\n", "In addition, always keep an eye on the following parameters:\n", "- return_sequences: whether to return only the last output of the output sequence, or the full sequence.\n", "- return_state: whether to return the last state in addition to the output.\n", "- stateful: if True, the last state for each sample at index i in a batch will be used as the initial state for the sample at index i in the following batch (default False).\n", "\n", "The LSTM layer has many other optional parameters for tuning internal details, such as weight initialization and the activation functions to use; others concern optimization, such as unrolling of the LSTM sequence (worthwhile for short sequences) and the implementation algorithm to use (some are better suited to CPU, others to GPU)."
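, "\n", "For instance, here is a minimal illustrative sketch (not part of the original notebook; the vocabulary size, embedding size and sequence length below are arbitrary) showing how units and return_sequences affect the output shape:\n", "\n", "from tensorflow.keras.models import Sequential\n", "from tensorflow.keras.layers import Embedding, LSTM\n", "\n", "demo = Sequential()\n", "demo.add(Embedding(1000, 16, input_length=150))    # output: (None, 150, 16)\n", "demo.add(LSTM(70, return_sequences=True))\n", "print(demo.output_shape)    # (None, 150, 70): one 70-dim vector per timestep\n", "demo.add(LSTM(70))\n", "print(demo.output_shape)    # (None, 70): only the last hidden state"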
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Build model...\n", "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "embedding_6 (Embedding) (None, 234, 64) 1280000 \n", "_________________________________________________________________\n", "dropout_5 (Dropout) (None, 234, 64) 0 \n", "_________________________________________________________________\n", "lstm_5 (LSTM) (None, 70) 37800 \n", "_________________________________________________________________\n", "dense_5 (Dense) (None, 1) 71 \n", "=================================================================\n", "Total params: 1,317,871\n", "Trainable params: 1,317,871\n", "Non-trainable params: 0\n", "_________________________________________________________________\n" ] } ], "source": [ "print('Build model...')\n", "\n", "# embedding subspace dimensions (features per word)\n", "embedding_size = 64 \n", "\n", "# LSTM output dimensions (hidden state size)\n", "lstm_size = 70\n", "\n", "# Convolution (optional layer)\n", "model_use_convolution_layer = False\n", "kernel_size = 5\n", "filters = 64\n", "pool_size = 4\n", "\n", "model = Sequential()\n", "\n", "# EMBEDDING: turn each integer into a vector of embedding_size floats.\n", "# The model will take as input an integer matrix of size (batch, input_length);\n", "# the largest input integer should be no larger than max_features-1 (vocabulary size).\n", "# Here model.input_shape == (None, maxlen), None is the batch dimension\n", "model.add(Embedding(max_features, embedding_size, input_length=maxlen))\n", "# here model.output_shape == (None, maxlen, embedding_size), None is the batch dimension\n", "\n", "# use dropout to mitigate overfitting\n", "model.add(Dropout(0.25))\n", "\n", "# we can optionally add a convolution + pooling stage\n", "if model_use_convolution_layer:\n", "\n", "    # 1D CONVOLUTION: inspect a sliding window of kernel_size elements\n", "    model.add(Conv1D(filters, kernel_size, activation='relu'))\n", "\n", "    model.add(MaxPooling1D(pool_size=pool_size))\n", "\n", "# LONG SHORT TERM MEMORY LAYER\n", "model.add(LSTM(lstm_size))\n", "\n", "# final layer for binary classification with sigmoid activation\n", "model.add(Dense(1, activation='sigmoid'))\n", "\n", "# binary_crossentropy pairs with a sigmoid activation\n", "# in yes/no classification problems\n", "model.compile(loss='binary_crossentropy',\n", "              optimizer='adam',\n", "              metrics=['accuracy'])\n", "\n", "# print model summary and parameter counts\n", "model.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train...\n", "Train on 25000 samples, validate on 25000 samples\n", "Epoch 1/2\n", "25000/25000 [==============================] - 85s 3ms/step - loss: 0.4922 - acc: 0.7554 - val_loss: 0.3652 - val_acc: 0.8560\n", "Epoch 2/2\n", "25000/25000 [==============================] - 82s 3ms/step - loss: 0.2350 - acc: 0.9112 - val_loss: 0.3080 - val_acc: 0.8735\n", "25000/25000 [==============================] - 18s 710us/step\n", "Accuracy: 87.35%\n" ] } ], "source": [ "print('Train...')\n", "\n", "# Training parameters\n", "batch_size = 256\n", "epochs = 2\n", "\n", "model.fit(x_train, y_train,\n", "          
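# note: the test set is reused here as validation data, so the val_acc\n", "          # reported after each epoch is measured on the same x_test that is\n", "          # evaluated again below\n", "          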
batch_size=batch_size,\n", "          epochs=epochs,\n", "          validation_data=(x_test, y_test))\n", "\n", "# report model score and accuracy on the test set\n", "score = model.evaluate(x_test, y_test, batch_size=batch_size)\n", "print(\"Accuracy: %.2f%%\" % (score[1]*100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercises\n", "\n", "1. Play with the model's parameters to improve accuracy:\n", "    - number of epochs and batch_size\n", "    - embedding subspace size (embedding_size)\n", "    - LSTM units (lstm_size)\n", "\n", "1. We used 25,000 samples for training, which is often too few for good training results. Try to move some sequences from the test set into the training set, train the model again and check whether accuracy improves.\n", "\n", "1. The final verdict on a film is often given in the last sentences of the review. Try to build new input sequences that keep only the last maxlen words of each original sequence. Run the model again and check whether prediction accuracy improves.\n", "\n", "1. Try to classify topics from the\n", "    Reuters Newswire dataset using an LSTM model." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }