Feedback recurrent neural network-based embedded vector and its application in topic model

While mining topics in a document collection, in order to capture the relationships between words and further improve the effectiveness of discovered topics, this paper proposes a feedback recurrent neural network-based topic model. We represent each word as a one-hot vector and embed each document into a low-dimensional vector space. During document embedding, we apply the long short-term memory method to capture the backward relationships between words and propose a feedback recurrent neural network to capture the forward relationships between words. In the topic model, we use pairs of original and mutated documents as positive samples and pairs of original and random documents as negative samples to train the model. The experiments show that the proposed model not only consumes less running time and memory but also achieves better effectiveness during topic analysis.


Introduction
Natural language processing (NLP) [1] is an interdisciplinary field spanning linguistics, statistics, computer science, artificial intelligence, and so on. Statistical theory is one of the most important tools for analyzing natural language documents. In a document collection, each term (word or phrase) is denoted by a one-hot vector, and each sentence, paragraph, or document is denoted by a term frequency vector, in which each element records the number of occurrences of the corresponding term. With these term frequency vectors, researchers can compute the similarities between documents and thus discover the underlying topics in the collection [2]. Topic models can be classified into statistical semantic models [3][4][5][6][7][8][9][10] and embedded vector models [11][12][13]. When capturing the semantics of documents, statistical semantic models compute the similarities between documents with a term co-occurrence matrix, while embedded vector models use neighbors to represent the meaning of a target term; however, neither of them can describe the order of terms in a document.
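As an illustration of the vector representation described above (this example is not from the paper), term-frequency vectors and the cosine similarity between documents can be sketched in Java as follows; the class and method names are our own:

```java
import java.util.Arrays;
import java.util.List;

public class TfVectors {
    // Build a term-frequency vector for a document over a fixed vocabulary;
    // each element counts the occurrences of the corresponding term.
    static int[] tfVector(List<String> doc, List<String> vocab) {
        int[] tf = new int[vocab.size()];
        for (String w : doc) {
            int idx = vocab.indexOf(w);
            if (idx >= 0) tf[idx]++;
        }
        return tf;
    }

    // Cosine similarity between two term-frequency vectors.
    static double cosine(int[] a, int[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        List<String> vocab = Arrays.asList("topic", "model", "word");
        // d1 has counts [1, 2, 0]; d2 has counts [2, 1, 0].
        List<String> d1 = Arrays.asList("topic", "model", "model");
        List<String> d2 = Arrays.asList("topic", "topic", "model");
        System.out.println(cosine(tfVector(d1, vocab), tfVector(d2, vocab)));
    }
}
```

Note that with one-hot and term-frequency representations the vocabulary fixes the dimensionality, which motivates the low-dimensional embeddings discussed later.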
In order to better capture the order relationships between terms, and thus improve the effectiveness of topic discovery, this paper classifies the order relationships between terms into forward dependencies and backward dependencies and proposes a feedback recurrent neural network-based topic model. To capture backward dependencies, this paper denotes each term with a one-hot vector and applies an LSTM recurrent neural network [14] to compute its corresponding embedded vector. To capture forward dependencies, this paper designs a feedback mechanism for the recurrent neural network.

Related works
Statistical semantic models, such as latent semantic analysis (LSA) [3], probabilistic latent semantic analysis (PLSA) [4], and latent Dirichlet allocation (LDA) [5], are powerful tools for mining the underlying topics in a document collection. PLSA is the probabilistic version of LSA; both of them compute the similarities between documents with term co-occurrence statistics, such as TF-IDF [6], but they ignore the order relationships between terms. For two sentences containing synonyms, both LSA and PLSA consider them dissimilar, whereas LDA can capture the semantics behind the terms and consider them similar. The LDA model assumes that a document collection is generated by a distribution over topics, and that each topic is a probability distribution over terms. Based on the LDA model, researchers have also proposed HDP-LDA [7], ADM-LDA [8], Tri-LDA [9], MB-LDA [10], and so on. Although LDA and its extensions take the semantics between terms into consideration, they ignore the order of terms in a document. For example, "Tom told Jim" and "Jim told Tom" have different semantics, but the LDA model can never differentiate them.
In recent years, embedded vector models, such as Word2vec [11], GloVe [12], and DSSM [13], have become more and more popular, and deep learning for NLP has attracted more and more researchers. By embedding term frequency vectors into a continuous multi-dimensional vector space (whose dimension is much smaller than that of the one-hot vectors), the similarities between documents can be computed accurately and quickly. When generating embedded vectors, Word2vec uses one of the neighbors of the target term, while GloVe uses the average of all neighbors of the target term to represent it. In addition, DSSM denotes a document with a set of letter trigrams, and this trigram set can only capture the relationships between adjacent terms.
In order to better capture the order relationships between terms, and thus improve the topic discovering effectiveness, this paper classifies the order relationships between terms into forward dependence and backward dependence and proposes a feedback recurrent neural network-based topic model.

Feedback recurrent neural network-based embedded vectors
In this section, we embed documents into a continuous vector space. First, we generate an embedded vector for each document with a simple recurrent neural network; then we capture backward dependencies of words by adding memory to each neural cell; finally, we capture forward dependencies by introducing feedback links.

Generating embedded vectors with recurrent neural network
Through their recurrent connections, recurrent neural networks can model the relationship between the current word and its previous one. In theory, recurrent neural networks can model sequences of any length.
Given a document collection D, after removing stop words, a total of T distinct words remain, so each word w_t (1 ≤ t ≤ T) can be denoted by a one-hot vector x_t. In this paper, we use a four-latent-layer recurrent neural network to compute the embedded vector for each document, and the structure of the proposed network is shown in Fig. 1, where the arrows denote the data dependencies between neural cells, x_t and y_t are the input and output for the t-th word respectively, and the dimension of the latent layers is much smaller than that of the input one-hot vector.
We apply the sigmoid function σ(x) = (1 + exp(−x))^−1, taken element-wise, as the activation function for each neural cell. With the sigmoid function, the data of the latent and output layers can be computed as follows.
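The layer equations themselves are missing from this extraction; for a four-latent-layer recurrent network with sigmoid activations, a standard forward computation would take the following form (a reconstruction, not necessarily the paper's exact parameterization; the softmax output is our assumption):

```latex
\begin{aligned}
h^1_t &= \sigma\!\left(W_1 x_t + U_1 h^1_{t-1} + b_1\right) \\
h^j_t &= \sigma\!\left(W_j h^{j-1}_t + U_j h^j_{t-1} + b_j\right), \quad j = 2, 3, 4 \\
y_t &= \operatorname{softmax}\!\left(V h^4_t + b_y\right)
\end{aligned}
```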
While learning parameters from documents, we use the output y_t of x_t as the prediction of the next word x_{t+1}, and the objective function is given in Eq. 4, where ||U||^2 is the regularization term and λ is its weight.

Adding memory to neural cell
For each neural cell of the above neural network, we apply long short-term memory (LSTM) [15] to represent the backward dependencies between words. The structure of LSTM is shown in Fig. 2, and the computation for each unit is as follows.
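The per-unit equations are not reproduced in this extraction; the standard LSTM formulation from [15], which the surrounding text appears to follow, is:

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
i_t &= \sigma\!\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
o_t &= \sigma\!\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
\tilde{c}_t &= \tanh\!\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh\!\left(c_t\right)
\end{aligned}
```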
In the above equations, the sigmoid function σ(⋅) has the same meaning as before, and ⊙ is the element-wise (Hadamard) product. Details of LSTM can be found in [15].

Adding feedback to recurrent neural network
LSTM-based recurrent neural networks can only capture backward dependencies between the words of a document. In order to capture forward dependencies between words, we need to add feedback links between neural cells. Figure 3 illustrates the proposed feedback recurrent neural network, which can capture the relationships between x_t and both x_{t+1} and x_{t+2}.
In Fig. 3, each dotted line marks a time slot, and the computation of h^1_t and h^2_t is the same as in Fig. 1. However, the computation of h^3_t and h^4_t depends on three neural cells of the previous time slot and proceeds as follows.
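The exact feedback equations are not recoverable from this extraction; one plausible form, assuming h^3_t and h^4_t draw on the three upper cells h^2_{t−1}, h^3_{t−1}, h^4_{t−1} of the previous time slot, is:

```latex
\begin{aligned}
h^3_t &= \sigma\!\left(W_3 h^2_t + U_{32} h^2_{t-1} + U_{33} h^3_{t-1} + U_{34} h^4_{t-1}\right) \\
h^4_t &= \sigma\!\left(W_4 h^3_t + U_{42} h^2_{t-1} + U_{43} h^3_{t-1} + U_{44} h^4_{t-1}\right)
\end{aligned}
```

Under this form, information from later inputs reaches the upper layers of earlier time slots, which is what allows x_t to be related to x_{t+1} and x_{t+2}.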
In the proposed four-latent-layer feedback recurrent neural network, we can capture the relationships between x_t and both x_{t+1} and x_{t+2}. If the number of latent layers is k, then we can capture the relationships between x_t and its following k − 2 words. Thus, the proposed feedback recurrent neural network can capture forward dependencies between words.

Embedded vector-based topic model
Given a document d, let v_d denote its embedded vector, where the dimension of v_d is p. Given d, we randomly choose a word w from d, substitute w with a random word w′ ≠ w, and obtain a mutated document d′. Since d and d′ differ only in w and w′, we assume that they have similar topics. When choosing the substituted word w′, we initially chose synonyms, but the results were poor: by substituting a word with its synonym, the two documents are almost identical, so the generalization ability of the model is very weak. After embedding d and d′ into the continuous vector space, we obtain v_d and v′_d. Then, we use the topic model described in Fig. 4 to analyze the underlying topics in a document collection. We assume that there are k topics in the collection. Taking v_d and v′_d as input, if the probability that both d and d′ belong to the i-th topic is z_i, then we can estimate it as follows.
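The estimator itself is missing from this extraction; a plausible bilinear form, consistent with each W_i being a p × p matrix acting on the two embeddings, is:

```latex
z_i = \sigma\!\left(v_d^{\top} W_i\, v'_d\right), \quad i = 1, \dots, k
```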
where W_i is a p × p matrix, and the whole W is a p × p × k tensor.
While training the above model with (d, d′) pairs, if d′ is a mutated version of d, then z_i = 1; if d′ is a randomly chosen document other than d, then z_i = 0. When learning the parameters of the model, we apply the sum of squared errors as the loss function together with L2 regularization, and the objective is
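The objective is not shown in this extraction; given the squared-error loss and L2 regularization described above, it would take a form such as the following, where ẑ_i is the model's estimate and z_i the label:

```latex
J = \sum_{(d,\, d')} \sum_{i=1}^{k} \left(z_i - \hat{z}_i\right)^2 + \lambda \sum_{i=1}^{k} \lVert W_i \rVert_2^2
```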

Experiments
The experimental platform is a laptop with an Intel Core i5-2450 2.5 GHz CPU and 4 GB of memory. The operating system is Ubuntu 10.04, and all algorithms are implemented in Java.

Datasets
In the experiments, we use two public datasets, NIPS and RML, from [16], and the statistics of the datasets are shown in Table 1, where W is the number of distinct words after removing stop words, D is the number of documents (training and test), T is the number of documents in the test data, and N is the total number of words after removing stop words.

Baseline algorithms
We denote our proposed topic model by GRNN and compare it with PLSA [4], HDP-LDA [7], seTF-IDF [6], and DSSM [13]. PLSA is the basic probabilistic latent semantic analysis method; HDP-LDA extends LDA by modeling topics with a hierarchical Dirichlet process; seTF-IDF is a term frequency-inverse document frequency method that takes semantics into consideration; and DSSM analyzes semantics by denoting each document with a set of letter trigrams.

Metrics
We apply point-wise mutual information (PMI) [17] and perplexity [18] to measure the performance of the algorithms. PMI measures the mutual information between the words of a topic, and its computation is given in Eq. 13; a bigger PMI means a better algorithm. Perplexity measures how well the model predicts held-out documents, and smaller is better; its computation is given in Eq. 14.
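Eqs. 13 and 14 are missing from this extraction; the standard definitions of the two metrics, which the description matches, are:

```latex
\mathrm{PMI}(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)}, \qquad
\mathrm{perplexity}(D_{\mathrm{test}}) = \exp\!\left(-\frac{\sum_{d} \log p(\mathbf{w}_d)}{\sum_{d} N_d}\right)
```

where the PMI of a topic is typically averaged over pairs of its top words, and N_d is the number of words in document d.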
Comparison of efficiency
We compared the running time and maximum memory usage of the algorithms on the two datasets, and the results are in Figs. 5 and 6, respectively. In the comparison of running time, GRNN is clearly better than the other algorithms. The running times of PLSA, HDP-LDA, seTF-IDF, and DSSM on the RML dataset are about one half of those on NIPS, while the running time of GRNN on RML is about a quarter of that on NIPS. For maximum memory usage on the two datasets, GRNN uses 520 MB and 230 MB respectively, both smaller than the other algorithms.

Comparison of perplexity
By increasing the number of topics in both datasets, we compared the perplexity of the algorithms, and the results are in Figs. 7 and 8. In these two figures, the perplexities of PLSA and HDP-LDA decrease, the perplexity of seTF-IDF shows no clear trend, and the perplexities of DSSM and GRNN are almost unchanged. GRNN has the lowest perplexity and DSSM the second lowest; both are better than the others.

Comparison of PMI
As in the comparison of perplexity, we compared the PMI of the algorithms by increasing the number of topics in both datasets, and the results are in Figs. 9 and 10. As the number of topics increases, the PMIs of all algorithms increase. On the NIPS dataset, when the number of topics is smaller than 45, the PMI of DSSM is the biggest; when the number of topics is bigger than 45, the PMI of GRNN is the biggest. On the RML dataset, the PMI of GRNN is bigger than the others all the time.

Conclusion
In this paper, we proposed a feedback recurrent neural network to embed documents into a continuous vector space and an embedded vector-based topic model. We applied a long short-term memory recurrent neural network to capture backward dependencies between words and feedback links between neural cells to capture forward dependencies between words. With the proposed model, we can capture the relationships between the target word and its two following words. Extensive experiments validate the effectiveness and efficiency of the proposed model.