Configuring MapReduce in Nutch
I ran into problems configuring MapReduce in Nutch, and adjusting a number of configuration parameters did not fix them. I then searched everywhere for material, but little of it offered practical guidance: most of it discusses the subject from the perspective of clusters and distributed file systems, which made my head spin. The most comprehensive treatment I found is Dean, Jeffrey & Ghemawat, Sanjay (2004). "MapReduce: Simplified Data Processing on Large Clusters".
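But I do not need to design it. I only want to know how to configure it properly in Nutch so that I can then write programs against it transparently. I now understand the general mechanism, but I still have not worked everything out and gotten the system running. Being stuck here is frustrating and has cost me a lot of time. There is really no need to stay fixated on this: I can use Nutch 0.7 for now, and once some expert writes a complete Nutch MapReduce tutorial, I can simply follow it step by step.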
Wikipedia turns out to be blocked in China, so I found the material below through other channels. I am reposting it here to share with everyone, and I invite anyone interested in this technology to discuss it.
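MapReduce is a C++ programming tool developed by Google for parallel computation over large data sets (larger than 1 TB). The concepts "Map" and "Reduce", and the main ideas behind them, are borrowed from functional programming languages, along with features borrowed from vector programming languages. [1]
In the current software implementation, the user specifies a Map function, which maps a set of key/value pairs to a new set of key/value pairs, and a Reduce function that runs concurrently and merges all of the mapped key/value pairs that share the same key.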
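To make the key/value description above concrete, here is a minimal single-process sketch in Python. It is not Nutch's or Google's API: the names `map_word_count`, `reduce_sum`, and `run_map_reduce` are hypothetical, and word counting is only an assumed example workload.

```python
from collections import defaultdict


def map_word_count(doc_name, text):
    """Map: turn one (document name, text) pair into a list of (word, 1) pairs."""
    return [(word, 1) for word in text.split()]


def reduce_sum(word, counts):
    """Reduce: merge every value that shares the same key into one result."""
    return word, sum(counts)


def run_map_reduce(inputs, map_fn, reduce_fn):
    """Hypothetical driver: run map_fn, group the output by key, then run reduce_fn."""
    grouped = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            grouped[out_key].append(out_value)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


documents = [("a.txt", "to be or not to be"), ("b.txt", "to do")]
word_counts = run_map_reduce(documents, map_word_count, reduce_sum)
```

In a real cluster the map calls and the per-key reduce calls would run on different machines; the grouping step in the middle is what guarantees that each reduce call sees all values for its key.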
Contents
1 Map and reduce
2 Distribution and reliability
3 Uses
4 Other Implementations
5 References
6 External link
Map and reduce
Put simply, a map function applies a specified operation to every element of a conceptual list of independent elements (for example, a list of test scores). Suppose someone discovers that every student's score was overstated by one point; they can define a "subtract one" map function to correct the error. Each element is operated on independently, and the original list is not modified, because a new list is created to hold the new answers. This means that the Map operation is highly parallelizable, which is very useful for applications with high performance requirements and for parallel computing in general.
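The "subtract one" example can be sketched in a few lines of Python. The scores are made-up sample data, and the thread pool is only an assumed stand-in for real parallel workers; it illustrates that the elements can be processed independently.

```python
from concurrent.futures import ThreadPoolExecutor


def subtract_one(score):
    """Correct a single score that was overstated by one point."""
    return score - 1


scores = [91, 78, 85, 100]

# Sequential map: builds a new list and leaves the original untouched.
corrected = list(map(subtract_one, scores))

# Same map spread over worker threads; results come back in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    corrected_in_parallel = list(pool.map(subtract_one, scores))
```

Because `subtract_one` depends only on its own argument, both versions produce the same new list, and `scores` itself is never changed.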
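A reduce operation, by contrast, combines the elements of a list in an appropriate way. Continuing the example, what if someone wants to know the class average? They can define a reduce function that halves the list by adding each element to its neighbor, and apply it recursively until only one element remains; dividing that element by the number of students gives the average. Although reduce is not as parallel as map, it always produces a single simple answer and the large-scale computations are relatively independent of one another, so reduce functions are also useful in highly parallel environments.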
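The pairwise halving described above might look like the following Python sketch. The function names and scores are hypothetical, and a loop is used in place of recursion; carrying an unpaired last element over to the next round is an assumption the text does not spell out.

```python
def halve_by_pairwise_sum(values):
    """Add each element to its neighbor, roughly halving the list."""
    halved = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
    if len(values) % 2 == 1:
        halved.append(values[-1])  # unpaired last element is carried over
    return halved


def reduce_to_total(values):
    """Repeat the halving step until a single element (the sum) remains."""
    if not values:
        raise ValueError("cannot reduce an empty list")
    while len(values) > 1:
        values = halve_by_pairwise_sum(values)
    return values[0]


scores = [90, 77, 84, 99, 70]
average = reduce_to_total(scores) / len(scores)
```

Within one halving round every pair can be added independently, which is where the parallelism in a reduce of this shape comes from; only the rounds themselves have to run one after another.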
Distribution and reliability
MapReduce achieves reliability by distributing the large-scale operations on the data set to each node on the network; each node periodically reports back its completed work and status updates. If a node stays silent for longer than a preset interval, the master node (similar to the master server in Google File System) records the node as dead and sends the data assigned to it to other nodes. Individual operations use atomic operations for naming file outputs as a double check to ensure that no conflicting threads are running in parallel; when files are renamed, it is also possible to copy them to another name in addition to the name of the task (allowing for side effects).
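The timeout-based failure handling can be illustrated with a small Python sketch. This is not how Google's or Nutch's master is actually written: the `Master` class, its method names, the node and task names, and the round-robin reassignment are all assumptions, and timestamps are passed in explicitly so the example stays deterministic.

```python
class Master:
    """Hypothetical master that tracks node reports and reassigns work."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat = {}  # node -> time of its last report
        self.assignments = {}     # node -> list of task ids
        self.dead = set()

    def assign(self, node, task, now):
        self.last_heartbeat.setdefault(node, now)
        self.assignments.setdefault(node, []).append(task)

    def heartbeat(self, node, now):
        """Record a periodic status report from a live node."""
        if node not in self.dead:
            self.last_heartbeat[node] = now

    def check_nodes(self, now):
        """Mark silent nodes as dead and hand their tasks to live nodes."""
        for node, seen in self.last_heartbeat.items():
            if node not in self.dead and now - seen > self.timeout_seconds:
                self.dead.add(node)
        live = sorted(n for n in self.last_heartbeat if n not in self.dead)
        for node in sorted(self.dead):
            orphaned = self.assignments.get(node, [])
            if not orphaned:
                continue
            if not live:
                raise RuntimeError("no live nodes left to take over the work")
            del self.assignments[node]
            for i, task in enumerate(orphaned):
                # Assumed policy: spread orphaned tasks round-robin.
                self.assignments.setdefault(live[i % len(live)], []).append(task)


master = Master(timeout_seconds=10)
master.assign("node-a", "split-0", now=0)
master.assign("node-b", "split-1", now=0)
master.heartbeat("node-a", now=8)  # node-b never reports back
master.check_nodes(now=12)
```

After `check_nodes`, `node-b` has been silent for more than the preset interval, so it is recorded as dead and its task moves to `node-a`.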
The reduce operations work in much the same way. Because they parallelize
less well than map operations, the master node attempts to schedule each
reduce operation on the node holding the data being operated on, or as close
to it as possible; this property is desirable for Google because it conserves
bandwidth, which is scarce on their internal networks.
Uses
According to Google, they use MapReduce in a wide range of
applications, including: "distributed grep, distributed sort, web link-graph
reversal, term-vector per host, web access log stats, inverted index
construction, document clustering, machine learning, statistical machine
translation..." Most significantly, when MapReduce was finished, it was used to
completely regenerate Google's index of the Internet, and replaced the old ad
hoc programs that updated the index.
MapReduce generates a large number of intermediate, temporary files, which
are generally managed by, and accessed through, Google File System, for greater
performance.
Other Implementations
The Nutch project has developed an
experimental implementation [2] of MapReduce.
References
Dean, Jeffrey & Ghemawat, Sanjay (2004). "MapReduce: Simplified Data Processing on Large Clusters". Retrieved Apr. 6, 2005.
[1] "Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages." From "MapReduce: Simplified Data Processing on Large Clusters".
External link
Interpreting the Data: Parallel Analysis with Sawzall: a paper on an internal tool at Google, Sawzall, which acts as an interface to MapReduce, intended to make MapReduce much easier to use.
Discussion on Lambda the Ultimate.
Retrieved from "http://zh.wikipedia.org/wiki/MapReduce"