
Memcache Memory Allocation

wbj0110 2013-12-05



I am trying to build a NaiveBayes classifier with Spark's MLLib which takes as input a set of documents.

I'd like to put some things in as features (i.e. authors, explicit tags, implicit keywords, category), but looking at the documentation it seems that a LabeledPoint contains only doubles, i.e. it looks like LabeledPoint[Double, List[Pair[Double,Double]]].

Instead, what I have as output from the rest of my code is something like LabeledPoint[Double, List[Pair[String,Double]]].

I could make up my own conversion, but it seems odd. How am I supposed to handle this using MLLib?

I believe the answer lies in the HashingTF class (i.e. hashing features), but I don't understand how it works. It appears to take some sort of capacity value, but my list of keywords and topics is effectively unbounded (or rather, unknown at the start).

Tags: java, apache-spark, machine-learning, apache-spark-mllib, feature-selection

Asked Dec 6 '14 at 18:01 by riffraff; edited Apr 25 '16 at 12:33 by zero323

2 Answers

Accepted answer (awarded a +50 bounty):

HashingTF uses the hashing trick to map a potentially unbounded number of features to a vector of bounded size. There is the possibility of feature collisions, but the collision risk can be reduced by choosing a larger number of features in the constructor.

In order to create features based not only on the content of a feature but also on some metadata (e.g. having the tag 'cats' as opposed to having the word 'cats' in the document), you could feed the HashingTF class something like 'tag:cats', so that a tag containing a word hashes to a different slot than the word alone.
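The mechanism above can be sketched in a few lines of plain Python (illustrative only, not Spark's actual HashingTF implementation; `zlib.crc32` stands in for whatever hash function the real class uses):

```python
import zlib

def hash_features(tokens, num_features=16):
    """Map an unbounded set of feature strings to a fixed-length count vector."""
    vec = [0.0] * num_features
    for tok in tokens:
        # The hashing trick: index = hash(feature) mod vector size.
        vec[zlib.crc32(tok.encode("utf-8")) % num_features] += 1.0
    return vec

# Prefixing metadata keeps a tag distinct from the same word in the body text:
doc = ["cats", "are", "great", "tag:cats"]
vec = hash_features(doc)
```

Note that "tag:cats" is hashed independently of "cats", so the two usually land in different slots (though, as with any hashed feature pair, they can still collide).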


If you've created feature count vectors using HashingTF you can use them to create bag of words features by setting any counts above zero to 1. You can also create TF-IDF vectors using the IDF class like so:

val tfIdf = new IDF().fit(featureCounts).transform(featureCounts)
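The bag-of-words step mentioned above (clamping counts above zero to 1) can be sketched in plain Python (illustrative only, not the Spark API):

```python
# Hashed term counts for one document (hypothetical values).
counts = [0.0, 2.0, 0.0, 5.0, 1.0]

# Bag-of-words indicators: 1.0 wherever the term occurred at all.
bag_of_words = [1.0 if c > 0 else 0.0 for c in counts]
# → [0.0, 1.0, 0.0, 1.0, 1.0]
```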

In your case it looks like you've already computed the counts of words per document. This won't work with the HashingTF class since it's designed to do the counting for you.

This paper has some arguments about why feature collisions aren't much of a problem in language applications. The essential reasons are that most words are uncommon (a property of natural language) and that collisions are independent of word frequency (a property of hashing), so it's unlikely that two words common enough to be useful to a model will both hash to the same slot.

Answered Dec 9 '14 at 23:50 by mrmcgreg; edited Apr 19 '16 at 13:11

Comments:

riffraff (Dec 16 '14): Thanks, just one extra clarification: if I understand correctly, numFeatures in HashingTF is basically used as the mod value that bounds the number of features to a given maximum? If so, shouldn't it just be Double.MAX_VALUE? Or is the idea to restrict different kinds of features to given ranges and limit cross-collisions? (i.e. put one kind of feature in 1..N and another in N..2N, so you'd get collisions within a kind but not across kinds.)

mrmcgreg (Dec 16 '14): Yes, the computation looks like features[hash(feature) % numFeatures] += 1. The vectors that are created are usually used as input to some model, so using Double.MAX_VALUE would imply a gigantic model. One of the main motivations of the hashing trick is memory reduction. You certainly could create features the way you are suggesting, but I'm not sure how to evaluate the benefits of such an approach.

riffraff (Dec 17 '14): Ah, of course, I was thinking of sparse vectors so didn't consider the array size. Thanks for your help!
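A quick back-of-the-envelope sketch (plain Python) of why numFeatures directly bounds model size for a dense linear model, assuming one double-precision weight per slot (the 2**20 default below is Spark's HashingTF default at the time):

```python
# numFeatures is the length of the dense weight vector a linear model
# must store: one slot per hashed feature index.
num_features = 2 ** 20              # ~1M slots, HashingTF's default
bytes_per_weight = 8                # one 64-bit double per slot
model_size_mb = num_features * bytes_per_weight / 2 ** 20
# → 8.0 MB for a single weight vector; a vector sized Double.MAX_VALUE
#   could never fit in memory, which is why a bounded mod value is used.
```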

I found this nice example of a Naive Bayes text classifier. It contains exactly what you need.

Answered Dec 10 '14 at 21:48 by Max


Memcached allocates memory in units of pages; by default a page is 1 MB, which can be changed at startup with the -I parameter. When memory is needed, memcached carves out a new page and assigns it to the slab class that requires it. Once a page has been allocated, until a restart...
