# Feedback recurrent neural network-based embedded vector and its application in topic model

Lian-sheng Li^{1}, Sheng-jiang Gan^{2} (corresponding author) and Xiang-dong Yin^{1}

**2017**:5

https://doi.org/10.1186/s13639-016-0038-6

© The Author(s). 2016

**Received: **27 December 2015

**Accepted: **8 June 2016

**Published: **16 July 2016

## Abstract

While mining topics in a document collection, in order to capture the relationships between words and further improve the effectiveness of the discovered topics, this paper proposes a feedback recurrent neural network-based topic model. We represent each word as a one-hot vector and embed each document into a low-dimensional vector space. During document embedding, we apply the long short-term memory method to capture the backward relationships between words and propose a feedback recurrent neural network to capture the forward relationships between words. In the topic model, we use original and muted document pairs as positive samples and original and random document pairs as negative samples to train the model. The experiments show that the proposed model not only requires less running time and memory but also achieves better effectiveness in topic analysis.

### Keywords

Wireless sensor networks · Data aggregation · Aggregation tree · Aggregation delay

## 1 Review

### 1.1 Introduction

Natural language processing (NLP) [1] is an interdisciplinary field drawing on linguistics, statistics, computer science, artificial intelligence, and so on. Statistical theory is one of the most important tools for analyzing natural language documents. In a document collection, each term (word or phrase) is denoted by a one-hot vector, and each sentence, paragraph, or document is denoted by a term frequency vector, in which each element counts the occurrences of the corresponding term. With these term frequency vectors, researchers can compute the similarities between documents and thus discover the underlying topics in the collection [2]. Topic models can be classified into statistical semantic models [3–10] and embedded vector models [11–13]. When capturing the semantics of documents, a statistical semantic model computes the similarities between documents from the co-occurrence matrix of terms, while an embedded vector model uses neighboring terms to represent the meaning of a target term; however, neither can describe the order of terms in a document.
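As a concrete illustration of these two representations, the sketch below builds a one-hot vector for a term and a term frequency vector for a document; the vocabulary and document are invented for the example:

```python
# Build one-hot and term-frequency vectors for a toy corpus.
vocab = ["topic", "model", "word", "vector"]   # hypothetical vocabulary
doc = ["topic", "model", "topic", "vector"]    # hypothetical document (stop words removed)

# One-hot vector for a single term: 1 at the term's index, 0 elsewhere.
def one_hot(term):
    return [1 if t == term else 0 for t in vocab]

# Term frequency vector: element i counts occurrences of vocab[i] in doc.
def term_freq(doc):
    return [doc.count(t) for t in vocab]

print(one_hot("model"))   # [0, 1, 0, 0]
print(term_freq(doc))     # [2, 1, 0, 1]
```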

In order to better capture the order relationships between terms, and thus improve the effectiveness of topic discovery, this paper classifies the order relationships between terms into forward dependence and backward dependence and proposes a feedback recurrent neural network-based topic model. To capture backward dependences, this paper denotes each term by a one-hot vector and applies the LSTM recurrent neural network [14] to compute its corresponding embedded vector. To capture forward dependences, this paper designs a feedback mechanism for the recurrent neural network.

#### 1.1.1 Related works

Statistical semantic models, such as latent semantic analysis (LSA) [3], probabilistic latent semantic analysis (PLSA) [4], and latent Dirichlet allocation (LDA) [5], are powerful tools for mining the underlying topics in a document collection. PLSA is the probabilistic version of LSA, and both compute the similarities between documents from term co-occurrences, such as TF-IDF [6], but they ignore the order relationships between terms. Both LSA and PLSA consider two sentences that use synonyms to be dissimilar, whereas LDA can capture the semantics behind the terms and consider them similar. The LDA model assumes that a document collection is generated by a distribution over topics, and that each topic is a probabilistic distribution over terms. Based on the LDA model, researchers have also proposed HDP-LDA [7], ADM-LDA [8], Tri-LDA [9], MB-LDA [10], and so on. Although LDA and its extended models take the semantics between terms into consideration, they ignore the term order in a document. For example, “Tom told Jim” and “Jim told Tom” have different semantics, but the LDA model can never differentiate them.

In recent years, embedded vector models, such as Word2vec [11], GloVe [12], and DSSM [13], have become more and more popular, and deep learning for NLP has attracted more and more researchers. By embedding term frequency vectors into a continuous multi-dimensional vector space (whose dimension is much smaller than that of the one-hot vectors), the similarities between documents can be computed accurately and quickly. When generating embedded vectors, Word2vec uses one of the neighbors of the target term, while GloVe uses the average of all neighbors of the target term to represent it. In addition, DSSM denotes a document by a set of triple characters, and this tri-character set can only capture the relationships between two adjacent terms.
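A common way to compare documents in such an embedded space is cosine similarity; the sketch below uses invented 3-dimensional vectors and is not tied to any particular embedding model:

```python
import math

# Cosine similarity between two embedded document vectors.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical 3-dimensional embeddings of two similar documents.
v1 = [0.2, 0.8, 0.1]
v2 = [0.25, 0.7, 0.05]
print(cosine(v1, v2))   # close to 1 for similar documents
```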

In order to better capture the order relationships between terms, and thus improve the topic discovering effectiveness, this paper classifies the order relationships between terms into forward dependence and backward dependence and proposes a feedback recurrent neural network-based topic model.

#### 1.1.2 Feedback recurrent neural network-based embedded vectors

In this section, we embed documents into a continuous vector space. First, we generate an embedded vector for each document with a simple recurrent neural network; we then capture backward dependences of words by adding memory to each neural cell, and finally capture forward dependences by introducing feedback links.

### Generating embedded vectors with recurrent neural network

Through their recurrent connections, recurrent neural networks can model the relationship between the current word and its previous one. Theoretically, recurrent neural networks can model sequences of any length.

Given a document collection *D*, after removing stop words there are in total *T* different words remaining, so each word *w*_{ t } (1 ≤ *t* ≤ *T*) can be denoted by a one-hot vector *x*_{ t }. In this paper, we use a four-latent-layer recurrent neural network to compute the embedded vector for each document; the structure of the proposed network is shown in Fig. 1, where the arrows denote the data dependences between neural cells, *x*_{ t } and *y*_{ t } are the input and output for the *t*-th word, respectively, and the dimension of the latent layers is much smaller than that of the input one-hot vector.

We use the sigmoid function *σ*(*x*) = (1 + exp(−*x*))^{− 1} as the activation function for each neural cell, where *σ*(*x*) is applied element-wise. With the sigmoid function, the data of the latent and output layers can be computed as follows.

We treat the output *y*_{ t } of *x*_{ t } as the prediction of the next word *x*_{ t + 1}, and the objective function is given in Eq. 4.

where \( \sum_{t=1}^{T-1}{\left({\boldsymbol{y}}_t-{\boldsymbol{x}}_{t+1}\right)}^2 \) is the prediction loss, \( \sum_{i=1}^{5}{\left\Vert {\boldsymbol{W}}^i\right\Vert}_2+\sum_{j=1}^{4}{\left\Vert {\boldsymbol{U}}^j\right\Vert}_2 \) is the regularization term, *λ* is the weight of the regularization term, and \( {\left\Vert \boldsymbol{W}\right\Vert}_2 \) is the sum of all squared elements in **W**.

With the document collection *D*, we can train the above model until it stabilizes and then obtain an embedded vector for each document. For each document *d*, we input *x*_{ t } (*t* = 1, …, *T*) in turn, compute all latent vectors, and use \( {\boldsymbol{h}}_T^4 \) to represent the embedded vector of *d*, i.e., \( {\boldsymbol{v}}_d={\boldsymbol{h}}_T^4 \).
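The forward pass described above can be sketched as follows. This toy version uses a single latent layer with random untrained weights (the paper's model has four latent layers and learned parameters), so it only illustrates how the final hidden state becomes the document embedding:

```python
import math, random

random.seed(0)
V, H = 5, 3   # vocabulary size and latent dimension (toy values)

# Random weights standing in for trained W (input->hidden) and U (hidden->hidden).
W = [[random.uniform(-0.1, 0.1) for _ in range(V)] for _ in range(H)]
U = [[random.uniform(-0.1, 0.1) for _ in range(H)] for _ in range(H)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def step(x, h):
    # h_t = sigmoid(W x_t + U h_{t-1}), applied element-wise.
    return [sigmoid(sum(W[i][j] * x[j] for j in range(V)) +
                    sum(U[i][j] * h[j] for j in range(H)))
            for i in range(H)]

# A toy document: word indices turned into one-hot vectors x_t.
doc = [0, 3, 1, 4]
h = [0.0] * H
for idx in doc:
    x = [1.0 if j == idx else 0.0 for j in range(V)]
    h = step(x, h)

v_d = h   # the final hidden state serves as the document embedding
print(len(v_d))  # 3
```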

### Adding memory to neural cell

In the above equations, the sigmoid function *σ*(⋅) has the same meaning as before, and ⊙ denotes the element-wise (Hadamard) product. Details of LSTM can be found in [15].
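As a reference point, the standard LSTM update from [15] can be sketched as below; the gate names and the use of `*` for the element-wise product ⊙ follow the usual formulation, and the weights are random stand-ins rather than trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 5, 3   # input and hidden sizes (toy values)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate: input (i), forget (f), output (o), candidate (g).
Wx = {g: rng.uniform(-0.1, 0.1, (H, V)) for g in "ifog"}
Wh = {g: rng.uniform(-0.1, 0.1, (H, H)) for g in "ifog"}

def lstm_step(x, h, c):
    i = sigmoid(Wx["i"] @ x + Wh["i"] @ h)   # input gate
    f = sigmoid(Wx["f"] @ x + Wh["f"] @ h)   # forget gate
    o = sigmoid(Wx["o"] @ x + Wh["o"] @ h)   # output gate
    g = np.tanh(Wx["g"] @ x + Wh["g"] @ h)   # candidate cell state
    c = f * c + i * g                        # '*' is the element-wise product (the ⊙ above)
    h = o * np.tanh(c)
    return h, c

h = c = np.zeros(H)
for idx in [0, 3, 1]:                        # toy word indices
    x = np.eye(V)[idx]
    h, c = lstm_step(x, h, c)
print(h.shape)  # (3,)
```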

### Adding feedback to recurrent neural network


In the proposed four-latent-layer feedback recurrent neural network, we can capture the relationships between *x*_{ t } and *x*_{ t + 1} and *x*_{ t + 2}. If the number of latent layers is *k*, then we can capture the relationships between *x*_{ t } and its following *k* − 2 words. The proposed feedback recurrent neural network can therefore capture forward dependences between words.
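The exact wiring of the feedback links is not spelled out here, so the sketch below shows one plausible reading, purely for illustration: each latent layer additionally receives the previous-step state of the layer above it. The weight names `W`, `U`, and `F` are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
V, H, K = 4, 3, 4   # vocab size, hidden size, number of latent layers (toy values)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = [rng.uniform(-0.1, 0.1, (H, V if l == 0 else H)) for l in range(K)]  # bottom-up weights
U = [rng.uniform(-0.1, 0.1, (H, H)) for l in range(K)]                   # recurrent weights
F = [rng.uniform(-0.1, 0.1, (H, H)) for l in range(K - 1)]               # assumed feedback from layer l+1

def step(x, prev):
    # prev[l] is layer l's state at time t-1; layer l also receives
    # prev[l+1], the previous-step state of the layer above (the feedback link).
    new, inp = [], x
    for l in range(K):
        z = W[l] @ inp + U[l] @ prev[l]
        if l < K - 1:
            z = z + F[l] @ prev[l + 1]
        h = sigmoid(z)
        new.append(h)
        inp = h
    return new

state = [np.zeros(H) for _ in range(K)]
for idx in [0, 2, 1, 3]:                     # toy word indices
    state = step(np.eye(V)[idx], state)
print(len(state), state[-1].shape)
```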

#### 1.1.3 Embedded vector-based topic model

From Section 1.1, we know that each document *d* ∈ *D* can be denoted by a vector *v*_{ d } (\( {\boldsymbol{v}}_d={\boldsymbol{h}}_T^4 \)), where the dimension of *v*_{ d } is *p*. Given *d*, we randomly choose a word *w* from *d*, substitute *w* with a random word *w*′ ≠ *w*, and thus obtain a muted document *d*′. Since the only difference between *d* and *d*′ is between *w* and *w*′, we assume that they have similar topics. When choosing the muted word *w*′, we tried synonyms at first, but the results were poor: substituting a word with its synonym leaves the two documents almost identical, so the generalization ability of the model is very weak.
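The sampling scheme can be sketched as follows; the corpus and vocabulary are invented, and `mutate` is a hypothetical helper name:

```python
import random

random.seed(0)

def mutate(doc, vocab):
    """Positive sample: copy doc and replace one randomly chosen word
    with a different random vocabulary word (not a synonym, per the text)."""
    i = random.randrange(len(doc))
    choices = [w for w in vocab if w != doc[i]]
    d_prime = list(doc)
    d_prime[i] = random.choice(choices)
    return d_prime

vocab = ["topic", "model", "word", "vector", "data"]     # hypothetical vocabulary
corpus = [["topic", "model", "word"], ["data", "vector", "model"]]

d = corpus[0]
pos = (d, mutate(d, vocab))                              # muted pair, label z = 1
neg = (d, random.choice([c for c in corpus if c != d]))  # random pair, label z = 0
print(sum(a != b for a, b in zip(*pos)))  # exactly one position differs
```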

By embedding *d* and *d*′ into the continuous vector space, we obtain *v*_{ d } and *v*_{ d′}. We then use the topic model described in Fig. 4 to analyze the underlying topics in a document collection, assuming there are *k* topics. Taking *v*_{ d } and *v*_{ d′} as input, if the probability that both *d* and *d*′ belong to the *i*-th topic is *z*_{ i }, then we can estimate it as follows.

Here, **W**_{ i } is a *p* × *p* matrix, and the whole tensor **W** is *p* × *p* × *k*.

For the training pairs (*d*, *d*′), if *d*′ is a muted document of *d*, then *z*_{ i } = 1; if *d*′ is a randomly chosen document other than *d*, then *z*_{ i } = 0. When learning the parameters of the model, we use the sum of squared errors as the loss function together with L_{2} regularization; the objective is then as follows.
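Since the scoring and objective equations are not reproduced above, the sketch below assumes a bilinear form *z*_{ i } = *σ*(*v*_{ d }^{T} **W**_{ i } *v*_{ d′}), which matches the stated *p* × *p* × *k* tensor shape; treat it as an illustrative guess rather than the paper's exact formula:

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 4, 3                              # embedding dimension, number of topics (toy values)
W = rng.uniform(-0.1, 0.1, (k, p, p))    # the p x p x k tensor: one p x p matrix per topic

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def topic_scores(vd, vdp):
    # Assumed scoring form: z_i = sigmoid(v_d^T W_i v_d'), one score per topic.
    return sigmoid(np.einsum("i,kij,j->k", vd, W, vdp))

def objective(vd, vdp, z_true, lam=0.01):
    # Sum of squared errors plus an L2 penalty on the tensor, as in the text.
    z = topic_scores(vd, vdp)
    return np.sum((z - z_true) ** 2) + lam * np.sum(W ** 2)

vd, vdp = rng.uniform(size=p), rng.uniform(size=p)
z_pos = np.ones(k)                       # muted-document pair: z_i = 1
print(float(objective(vd, vdp, z_pos)) > 0)  # True
```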

#### 1.1.4 Experiments

The experimental platform is a laptop with an Intel Core i5-2450 2.5 GHz CPU and 4 GB of memory. The operating system is Ubuntu 10.04, and all algorithms are implemented in Java.

#### 1.1.5 Datasets

In the statistics below, *W* is the number of distinct words after removing stop words, *D* is the number of documents (training and test data), *T* is the number of documents in the test data, and *N* is the total number of words after removing stop words.

Statistics of datasets

|  | NIPS | RML |
|---|---|---|
| *W* | 13,649 | 16,994 |
| *D* | 1740 | 19,813 |
| *T* | 348 | 6188 |
| *N* | 23.0 M | 1.27 M |

#### 1.1.6 Baseline algorithms

We denote our proposed topic model by GRNN and compare it with PLSA [4], HDP-LDA [7], seTF-IDF [6], and DSSM [13]. PLSA is the basic probabilistic latent semantic analysis method; HDP-LDA extends LDA by modeling topics with a hierarchical Dirichlet generation process; seTF-IDF is a term frequency-inverse document frequency method that takes semantics into consideration; and DSSM analyzes semantics by denoting each document with a set of triple characters.

#### 1.1.7 Metrics

#### 1.1.8 Comparison of perplexity

#### 1.1.9 Comparison of PMI
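PMI-based topic coherence, as in [17], is commonly estimated from document-level co-occurrence of a topic's top words; the sketch below shows this standard estimator on a toy corpus (the paper's exact estimator may differ):

```python
import math
from itertools import combinations

def topic_pmi(top_words, docs, eps=1e-12):
    """Average pairwise PMI of a topic's top words, estimated from
    document-level co-occurrence (the usual coherence score)."""
    n = len(docs)
    def p(*ws):
        # Fraction of documents containing all the given words.
        return sum(all(w in d for w in ws) for d in docs) / n
    scores = [math.log((p(a, b) + eps) / (p(a) * p(b) + eps))
              for a, b in combinations(top_words, 2)]
    return sum(scores) / len(scores)

# Toy corpus: "topic" and "model" always co-occur, so their PMI is positive.
docs = [{"topic", "model", "word"}, {"topic", "model"}, {"vector", "word"}]
print(topic_pmi(["topic", "model"], docs) > 0)  # True
```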

## 2 Conclusions

In this paper, we proposed a feedback recurrent neural network that embeds documents into a continuous vector space, and an embedded vector-based topic model. We applied a long short-term memory recurrent neural network to capture backward dependences between words and feedback links between neural cells to capture forward dependences. With the proposed model, we can capture the relationships between the target word and its two following words. Extensive experiments validate the effectiveness and efficiency of the proposed model.

## Declarations

### Acknowledgements

The work was supported by the following funds: Science Research Foundation of Hunan Province Education Department (14C0483), Science Research Foundation for Distinguished Young Scholars of Hunan Province Education Department (14B070), and Science and Technology Project of Hunan Province of China (2014FJ6095).

### Competing interests

The authors declare that they have no competing interests.

**Open Access**This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## Authors’ Affiliations

## References

1. CD Manning, H Schutze, *Foundations of statistical natural language processing* (MIT Press, Cambridge, MA, 1999)
2. DM Blei, Probabilistic topic models. Communications of the ACM **55**(4), 77–84 (2012)
3. N Evangelopoulos, X Zhang, VR Prybutok, Latent semantic analysis: five methodological recommendations. European Journal of Information Systems **21**(1), 70–86 (2012)
4. F Zhuang, G Karypis, X Ning et al., Multi-view learning via probabilistic latent semantic analysis. Information Sciences **199**, 20–30 (2012)
5. SP Crain, K Zhou, SH Yang et al., Dimensionality reduction and topic modeling: from latent semantic indexing to latent Dirichlet allocation and beyond, in *Mining text data* (Springer, USA, 2012), pp. 129–161
6. A Aizawa, An information-theoretic perspective of tf-idf measures. Information Processing & Management **39**(1), 45–65 (2003)
7. J Paisley, C Wang, DM Blei et al., Nested hierarchical Dirichlet processes. IEEE Transactions on Pattern Analysis and Machine Intelligence **37**(2), 256–270 (2015)
8. A Bagheri, M Saraee, JF De, ADM-LDA: an aspect detection model based on topic modeling using the structure of review sentences. Journal of Information Science **40**(5), 621–636 (2014)
9. W Ou, Z Xie, X Jia, B Xie, Detection of topic communities in social networks based on Tri-LDA model, in *Proceedings of the 4th International Conference on Computer Engineering and Networks* (Springer International Publishing, 2015), pp. 1245–1253
10. C Zhang, J Sun, Large scale microblog mining using distributed MB-LDA, in *Proceedings of the 21st International Conference Companion on World Wide Web* (ACM, New York, 2012), pp. 1035–1042
11. T Mikolov, K Chen, G Corrado et al., Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
12. J Pennington, R Socher, CD Manning, GloVe: global vectors for word representation, in *Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)*, pp. 1532–1543 (2014)
13. P Huang, X He, J Gao et al., Learning deep structured semantic models for web search using clickthrough data, in *Proceedings of the 22nd ACM International Conference on Information & Knowledge Management* (ACM, New York, 2013), pp. 2333–2338
14. K Du, MNS Swamy, Recurrent neural networks, in *Neural Networks and Statistical Learning* (Springer, London, 2014), pp. 337–353
15. S Hochreiter, J Schmidhuber, Long short-term memory. Neural Computation **9**(8), 1735–1780 (1997)
16. WL Buntine, S Mishra, Experiments with non-parametric topic models, in *Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining* (ACM, New York, 2014), pp. 881–890
17. D Newman, JH Lau, K Grieser, Automatic evaluation of topic coherence, in *Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics* (Association for Computational Linguistics, Stroudsburg, PA, 2010), pp. 100–108
18. HM Wallach, I Murray, R Salakhutdinov et al., Evaluation methods for topic models, in *Proceedings of the 26th Annual International Conference on Machine Learning* (ACM, New York, 2009), pp. 1105–1112