I am trying to build a NaiveBayes classifier with Spark's MLlib, which takes a set of documents as input.
I'd like to use some things as features (e.g. authors, explicit tags, implicit keywords, category), but looking at the documentation it seems that a LabeledPoint contains only doubles, i.e. it looks like
Instead, what I have as output from the rest of my code would be something like
I could write my own conversion, but that seems odd. How am I supposed to handle this with MLlib?
I believe the answer is in the HashingTF class (i.e. hashing features), but I don't understand how it works: it appears to take some sort of capacity value, but my list of keywords and topics is effectively unbounded (or rather, unknown up front).
HashingTF uses the hashing trick to map a potentially unbounded number of features to a vector of bounded size. Feature collisions are possible, but you can make them rarer by choosing a larger number of features in the constructor.
In order to create features based not only on the content of a feature but also on some metadata (e.g. having a tag of 'cats' as opposed to having the word 'cats' in the document), you could feed the HashingTF class something like 'tag:cats', so that a tag containing a word hashes to a different slot than the word itself.
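A sketch of that prefixing idea, assuming a hypothetical Doc type with words, tags and authors fields (the field names and the 'tag:'/'author:' prefixes are my own illustration, not anything from the question):

```scala
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical document: free-text words plus metadata fields.
case class Doc(words: Seq[String], tags: Seq[String], authors: Seq[String])

// Prefix each token with its field, so "cats" the tag and "cats" the word
// hash to different slots.
def tokens(d: Doc): Seq[String] =
  d.words ++ d.tags.map("tag:" + _) ++ d.authors.map("author:" + _)

// 2^20 slots (HashingTF's default) keeps collisions rare for vocabularies
// of ordinary size.
val hashingTF = new HashingTF(numFeatures = 1 << 20)

def featurize(docs: RDD[Doc]): RDD[Vector] =
  hashingTF.transform(docs.map(tokens))
```

Because the prefix is part of the hashed string, you never have to enumerate the tag or author vocabulary in advance.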
If you've created feature count vectors using HashingTF, you can turn them into bag-of-words features by setting any count above zero to 1. You can also create TF-IDF vectors using the IDF class like so:
val tfIdf = new IDF().fit(featureCounts).transform(featureCounts)
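Both steps can be sketched as below, assuming an existing RDD[Vector] of term counts named featureCounts (the binarization helper is my own illustration, not an MLlib API):

```scala
import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.linalg.{SparseVector, Vector, Vectors}
import org.apache.spark.rdd.RDD

// featureCounts: RDD[Vector] of term counts, e.g. from HashingTF.transform.
def toBagOfWords(featureCounts: RDD[Vector]): RDD[Vector] =
  featureCounts.map {
    // Clamp every stored count to 1.0, preserving sparsity where possible.
    case sv: SparseVector =>
      Vectors.sparse(sv.size, sv.indices, sv.values.map(_ => 1.0))
    case v =>
      Vectors.dense(v.toArray.map(x => if (x > 0) 1.0 else 0.0))
  }

// TF-IDF exactly as in the answer: fit the IDF model on the counts,
// then use it to reweight those same counts.
def toTfIdf(featureCounts: RDD[Vector]): RDD[Vector] =
  new IDF().fit(featureCounts).transform(featureCounts)
```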
In your case it looks like you've already computed the counts of words per document. This won't work with the
HashingTF class since it's designed to do the counting for you.
This paper has some arguments about why feature collisions aren't much of a problem in language applications. The essential reasons are that most words are uncommon (a property of natural languages) and that collisions are independent of word frequencies (a property of hashing), so it is unlikely that two words common enough to help one's model will both hash to the same slot. – mrmcgreg (answered Dec 9 '14)

Thanks, just one extra clarification: if I understand correctly, the value passed to HashingTF is basically used as the mod value that bounds the number of features to a given maximum? If so, shouldn't it just be Double.MAX_VALUE? Or is the idea that it can restrict different kinds of features to given ranges and limit cross-collisions? (i.e. put one kind of feature in 1..N and another in N..2N; you'd have collisions within a kind but not across kinds.) – riffraff (Dec 16 '14)

Yes, the computation looks like features[hash(feature) % numFeatures] += 1. The resulting vectors are usually fed into some model, so using Double.MAX_VALUE would imply a gigantic model; one of the main motivations of the hashing trick is memory reduction. You certainly could create features the way you are suggesting, but I'm not sure how to evaluate the benefits of such an approach. – mrmcgreg (Dec 16 '14)

Ah, of course. I was thinking of sparse vectors, so I didn't consider the array size. Thanks for your help! – riffraff (Dec 17 '14)
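That mod computation can be sketched without Spark at all (hashingTrick is a made-up helper name; Spark's own implementation differs in details such as the hash function used):

```scala
// Minimal, Spark-free sketch of features[hash(feature) % numFeatures] += 1.
def hashingTrick(tokens: Seq[String], numFeatures: Int): Array[Double] = {
  val features = new Array[Double](numFeatures)
  for (t <- tokens) {
    // floorMod keeps the index non-negative even when hashCode is negative.
    val idx = Math.floorMod(t.hashCode, numFeatures)
    features(idx) += 1.0
  }
  features
}

// Every token lands in some slot, so the slot values always sum to the
// number of tokens; with only 16 slots, collisions are quite likely.
val vec = hashingTrick(Seq("cats", "cats", "dogs"), 16)
```

With a tiny numFeatures like 16 the memory saving is obvious but collisions are frequent, which is why real defaults are large (e.g. 2^20).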
I found this nice example of a Naive Bayes text classifier. It contains exactly what you need. – Max (answered Dec 10 '14)