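One can use a hash function in place of an explicit dictionary when converting text documents to a vector space representation (the "hashing trick"). The dictionary size N has to be specified in advance, and each document has to be tokenized into terms. Then Hash(term) mod N is the term's index in the vector space model (VSM). Distinct terms can collide on the same index, so N should be chosen large relative to the expected vocabulary. More details at: http://en.wikipedia.org/wiki/Feature_hashing and http://www.shogun-toolbox.org/static/notebook/current/HashedDocDotFeatures.html.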
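The scheme above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the original text: the helper names (`tokenize`, `term_index`, `hashed_vector`), the regex tokenizer, and the choice of MD5 as the hash are all hypothetical. MD5 is used because Python's built-in `hash()` for strings is randomized per process, so it would not give the same index across runs.

```python
import hashlib
import re


def tokenize(text):
    """Split text into lowercase alphanumeric terms (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_index(term, n):
    """Map a term to an index in [0, n) as Hash(term) mod N.

    MD5 is an illustrative choice of a stable hash, not a requirement.
    """
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


def hashed_vector(text, n):
    """Build a term-count vector of length n without storing a dictionary."""
    vec = [0] * n
    for term in tokenize(text):
        # Colliding terms simply add to the same slot.
        vec[term_index(term, n)] += 1
    return vec


if __name__ == "__main__":
    N = 16  # dictionary size, fixed up front
    print(hashed_vector("the cat sat on the mat", N))
```

The vector always has length N, and its entries sum to the number of tokens in the document, regardless of how many distinct terms appear. Signed variants of the trick, which reduce the bias introduced by collisions, are described on the Wikipedia page linked above.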