This semester I'm taking a Distributed Systems course. One assignment is to build a MapReduce framework and implement the simple WordCount demo, which requires setting up a Hadoop environment. My group configured this two years ago when I took Data Warehousing, but I wasn't the one responsible back then, so this time I'm walking through the whole process myself. At the time we used the Spark + Hive + Hadoop stack to run SQL queries against the data warehouse, following the blog of the teammate who was in charge.

Building the Image

# Installing Docker itself is not covered here
# Pull the ubuntu base image
docker pull ubuntu:latest

Download JDK 1.8 from the official site, i.e. jdk-8u281-linux-x64.tar.gz, and create a Dockerfile alongside it with the following content:

FROM ubuntu:latest
MAINTAINER duanmu
ADD jdk-8u281-linux-x64.tar.gz /usr/local/
ENV JAVA_HOME /usr/local/jdk1.8.0_281
ENV CLASSPATH $JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
ENV PATH $PATH:$JAVA_HOME/bin

Dockerfile

Then build the image, naming it ubuntu-with-jdk:

docker build -t ubuntu-with-jdk .
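
Before creating a container from it, it doesn't hurt to confirm that the JDK really landed on the image's PATH; a minimal check, assuming the build above succeeded:

docker run --rm ubuntu-with-jdk java -version   # should report 1.8.0_281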

Configuring the Image

Create a container based on the image above, name it ubuntu_hadoop, set its hostname to charlie, and enter the container.

docker run -it --name=ubuntu_hadoop -h charlie ubuntu-with-jdk

The container now comes with Java; run java -version to check.

Container with Java installed

Installing Common Tools

apt-get update
apt-get install vim
apt-get install wget

Creating the Hadoop Directory

The directory layout here also stays consistent with the article I referenced; download the Hadoop package:

mkdir -p /your/path/to/hadoop/
cd /your/path/to/hadoop
wget http://mirrors.ustc.edu.cn/apache/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz
tar -xvzf hadoop-3.2.4.tar.gz

Modifying Environment Variables

  1. First the system environment variables: add the JDK, Hadoop, and related paths;
$ vim ~/.bashrc

export JAVA_HOME=/usr/local/jdk1.8.0_281
export HADOOP_HOME=/your/path/to/hadoop/hadoop-3.2.4
export HADOOP_CONFIG_HOME=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin

source ~/.bashrc
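
After sourcing, a quick sanity check confirms that both the JDK and the freshly extracted Hadoop binaries are on the PATH (the output should mention 1.8.0_281 and Hadoop 3.2.4 respectively):

java -version
hadoop version
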
  2. Hadoop's own environment variables also need to be modified (hadoop-env.sh under $HADOOP_CONFIG_HOME);
vim $HADOOP_CONFIG_HOME/hadoop-env.sh
# Add the following line
export JAVA_HOME=/usr/local/jdk1.8.0_281
# Format the Hadoop namenode
hadoop namenode -format
  3. Running Hadoop later may produce the errors below, so configure against them up front;
$ start-all.sh
Starting namenodes on [hadoop]
ERROR: Attempting to operate on hdfs namenode as root
ERROR: but there is no HDFS_NAMENODE_USER defined. Aborting operation.
Starting datanodes
ERROR: Attempting to operate on hdfs datanode as root
ERROR: but there is no HDFS_DATANODE_USER defined. Aborting operation.

Set the relevant environment variables in the image:

$ vim /etc/profile

export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root

$ source /etc/profile
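
Note that /etc/profile is only read by login shells. As an alternative (or extra safety), the same definitions can be appended to hadoop-env.sh, which the start/stop scripts source themselves; a minimal sketch, assuming the HADOOP_CONFIG_HOME variable set earlier:

cat >> $HADOOP_CONFIG_HOME/hadoop-env.sh <<'EOF'
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
EOF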

Creating Folders

cd $HADOOP_HOME
mkdir tmp
mkdir namenode
mkdir datanode

Modifying the Configuration Files

  • $HADOOP_CONFIG_HOME/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/your/path/to/hadoop/hadoop-3.2.4/tmp</value>
    </property>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
        <final>true</final>
    </property>
</configuration>
  • $HADOOP_CONFIG_HOME/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/your/path/to/hadoop/hadoop-3.2.4/namenode</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/your/path/to/hadoop/hadoop-3.2.4/datanode</value>
        <final>true</final>
    </property>
</configuration>
  • $HADOOP_CONFIG_HOME/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>master:9001</value>
    </property>
</configuration>
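
After editing the three files, it's worth checking that Hadoop actually picks the values up. hdfs getconf can query individual keys (a quick sanity check; fs.default.name is deprecated and resolves through fs.defaultFS):

hdfs getconf -confKey fs.defaultFS       # expect hdfs://master:9000
hdfs getconf -confKey dfs.replication    # expect 2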

Passwordless SSH Login

Since the distributed nodes need to reach each other over SSH, passwordless login is a must; otherwise the nodes will fail to access one another.

  1. First install SSH
apt-get install net-tools
apt-get install ssh
  2. Create the sshd directory, generate a key pair, and append the public key to authorized_keys; if you also want passwordless login from your own computer, append your computer's id_rsa.pub as well (see the sketch after this code block)
mkdir -p /var/run/sshd

cd ~/
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cd .ssh
cat id_rsa.pub >> authorized_keys
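
For passwordless login from the host machine, one option is to copy the host's public key into the container with docker cp and append it as well; a sketch, assuming the host key pair is ~/.ssh/id_rsa(.pub) and the container is still named ubuntu_hadoop:

# on the host
docker cp ~/.ssh/id_rsa.pub ubuntu_hadoop:/tmp/host_id_rsa.pub
# inside the container
cat /tmp/host_id_rsa.pub >> ~/.ssh/authorized_keys
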
  3. Modify the ssh and sshd configuration

ssh_config is the client-side configuration and sshd_config the server-side one. The nodes need to reach each other, so the same key is generated while assembling the image and every container cloned from it can log into the others.

# /etc/ssh/ssh_config
StrictHostKeyChecking no # ask -> no

# /etc/ssh/sshd_config
PasswordAuthentication no # disable password authentication
# enable key-based authentication
RSAAuthentication yes
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys
  4. Test whether you can connect to yourself without a password; if you see output like the picture below, it's OK
service ssh restart # restart the ssh service
ssh localhost

ssh localhost

Exporting the Image

With the steps above, the Hadoop image is fully configured; it now serves as the template for the namenode and the two datanodes, so all that's left is to commit it as an image, where ubuntu:hadoop is the image name and tag.

docker commit CONTAINER_ID ubuntu:hadoop

Exporting the image
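
If you are unsure of the container ID, it can be looked up by name before committing, and the new tag verified afterwards (a small sketch, assuming the container is still named ubuntu_hadoop):

CONTAINER_ID=$(docker ps -aqf "name=ubuntu_hadoop")
docker commit "$CONTAINER_ID" ubuntu:hadoop
docker images ubuntu:hadoop   # the new image should be listed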

Cluster Test

Running the Containers

9870 and 8088 are the two mapped ports; it helps to run these in tmux so the containers keep running in the background.

docker run -it  -h master --name=master -p 9870:9870 -p 8088:8088 ubuntu:hadoop
docker run -it -h slave1 --name=slave1 ubuntu:hadoop
docker run -it -h slave2 --name=slave2 ubuntu:hadoop

Modifying the Configuration

Modify the workers file (or the slaves file on older Hadoop versions) in the master container, i.e. list its two slaves on the master:

localhost
slave1
slave2
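
For reference, the file can also be written in one go from inside the master container (a sketch; the path assumes the HADOOP_CONFIG_HOME variable set earlier):

cat > $HADOOP_CONFIG_HOME/workers <<'EOF'
localhost
slave1
slave2
EOF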

Mark each node's IP address in the hosts file, i.e. for master, slave1, and slave2, add the other containers' IP addresses to /etc/hosts:

127.0.0.1       localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
172.17.0.3 master
172.17.0.4 slave1
172.17.0.5 slave2
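
The IP addresses above are whatever Docker's default bridge happened to assign, so look them up on the host rather than copying mine; a small helper:

for c in master slave1 slave2; do
  docker inspect -f "{{.NetworkSettings.IPAddress}} $c" "$c"
done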

Starting the Nodes

The sbin directory contains many handy scripts; here start-all.sh is used to bring up all three nodes directly. To shut them down, run stop-all.sh. In my case the cluster was already running, so I stopped it first and then started it again.

Starting the nodes
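
To confirm the daemons actually came up, jps (bundled with the JDK) can be run inside each container, and the namenode can report its live datanodes (a quick check; the exact process list depends on the configuration above):

jps                                  # master should list NameNode/SecondaryNameNode, slaves DataNode
hdfs dfsadmin -report | grep Live    # run on master; should show the live datanodes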

Now you can visit localhost:9870 and localhost:8088 to monitor the cluster's status.

  1. localhost:8088

localhost:8088

  2. localhost:9870

localhost:9870

Demo Test

Hadoop's bundled examples include a WordCount sample; unpacking its jar shows the contents below.

WordCount example

To run this .jar file, first create the relevant directories:

# Create the input directory and put LICENSE.txt into it
hadoop fs -mkdir /input
hadoop fs -ls /
hadoop fs -put LICENSE.txt /input
hadoop fs -ls /input

# Run the .jar file
hadoop jar /your/path/to/hadoop/hadoop-3.2.4/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.4.jar wordcount /input /output
hadoop fs -cat /output/part-r-00000 # view the results in the output directory

WordCount run example

WordCount statistics result
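
One gotcha when re-running the example: MapReduce refuses to start if the output directory already exists, so /output has to be removed (or a different output path chosen) before the job is submitted again; a minimal cleanup:

hadoop fs -rm -r /output    # delete the previous results before re-running the jar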


When I get the chance, I'll explore configuring Hive on top of Hadoop and running queries with Spark. Stay tuned…