Do your friends ever have a conversation and you just nod along even though you're not really sure what they're talking about? After reading Michael G. Noll's article, you will basically understand MapReduce. For the original, search for "Writing An Hadoop MapReduce Program In Python".
Let's write a word-count program: given, say, foo foo quux labs foo bar quux, foo appears 3 times and bar once.
# mapper.py
import sys

for line in sys.stdin:
    # emit a <word, 1> pair for every token on the line
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))
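On the example line, the mapper just emits one <word, 1> pair per token; the sort step later brings identical words together:

$ echo "foo foo quux labs foo bar quux" | python mapper.py
foo	1
foo	1
quux	1
labs	1
foo	1
bar	1
quux	1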
# reducer.py
import sys

cur_word, cur_cnt = None, 0
for line in sys.stdin:
    word, count = line.strip().split('\t')
    count = int(count)
    if cur_word == word:
        cur_cnt += count
    else:
        # input is sorted by word, so cur_word is finished once the key changes
        if cur_word:
            print('%s\t%s' % (cur_word, cur_cnt))
        cur_cnt = count
        cur_word = word
# flush the last word
if cur_word is not None:
    print('%s\t%s' % (cur_word, cur_cnt))
$ cat data | python mapper.py | sort -k1,1 | python reducer.py
# sort -k1,1 sorts on the first field, i.e. the words (foo, bar), not the counts (1, 2).
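Running the whole pipeline on the example line gives the per-word totals:

$ echo "foo foo quux labs foo bar quux" | python mapper.py | sort -k1,1 | python reducer.py
bar	1
foo	3
labs	1
quux	2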
$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
-file /home/hduser/mapper.py -mapper /home/hduser/mapper.py \
-file /home/hduser/reducer.py -reducer /home/hduser/reducer.py \
-input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output
Why not just use a Python dict instead of this roundabout pipeline? Because the data to be processed can be huge. For example, with a vocabulary of 50,000 words, the 2-grams (bigram language model) can number up to 50,000 × 50,000 word pairs. Google's machine translation used 5-grams; with a 100,000-word vocabulary that is up to 10^25 word-word-word-word-word tuples. (Both figures are upper bounds.)
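This is also why the same pattern scales to n-grams: the mapper never holds all pairs in memory, it just streams them out and lets sort plus the reducer do the aggregation. A minimal sketch of a bigram-counting mapper (bigram_mapper.py is my own illustrative name, not from Noll's article; the reducer.py above works unchanged because it only sums counts per key):

# bigram_mapper.py  (illustrative sketch)
import sys

for line in sys.stdin:
    words = line.strip().split()
    # emit each adjacent word pair as one key with count 1;
    # memory use stays constant no matter how many distinct pairs exist
    for w1, w2 in zip(words, words[1:]):
        print('%s %s\t%s' % (w1, w2, 1))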
With the Hadoop framework you are, so to speak, inside the establishment: just concentrate on writing a good mapper and reducer, and the organization arranges everything else, e.g. copying your program to n machines in the cluster and running it there.
MapReduce was first popularized as a programming model in 2004 by Jeffrey Dean and Sanjay Ghemawat of Google. My guess at how it came about: there was a need; there was prior art, e.g. "The Carnegie Mellon Statistical Language Modeling (CMU SLM) Toolkit is a set of unix software tools designed to facilitate language modeling work in the research community", and the foo 1 style of counting already existed; Dean had the idea of distilling it into a general-purpose model, the ability to implement it, and a boss who backed him, so it got done.
An example of a language model used in pinyin-to-character conversion: to convert shi ji wen ti, shi ji = 实际 (practical) is more likely than 世纪 (century), 史记 (Records of the Grand Historian), or 试剂 (reagent) (chemistry papers being the exception); storing 解决实际 ("solve practical") as a single phrase is the crude alternative.
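A toy sketch of how the 2-gram model decides (all counts below are made up purely for illustration): instead of storing 解决实际 as one phrase, the converter scores each candidate for shi ji by how often it follows the previous word 解决 in the corpus.

# toy 2-gram choice: pick the candidate with the highest bigram count
# (all counts are hypothetical, for illustration only)
bigram_count = {
    ('解决', '实际'): 9000,   # "solve a practical ..." is common
    ('解决', '世纪'): 30,
    ('解决', '史记'): 5,
    ('解决', '试剂'): 40,     # would only win in a chemistry corpus
}
candidates = ['实际', '世纪', '史记', '试剂']
best = max(candidates, key=lambda c: bigram_count.get(('解决', c), 0))
print(best)   # -> 实际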
Now let's look at a diagram to review:
[figure]
Finally: why has MapReduce been phased out? https://zhuanlan.zhihu.com/p/145639231