# New Server Initialization
[toc]
## 1. Configure the Server

### Change the IP
Edit the network interfaces file:

```bash
sudo vi /etc/network/interfaces
```

```
auto enp3s0f0
```
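Only the first line of the original file listing survives above. As a rough sketch of what a static-IP stanza for this interface typically looks like (the addresses below are placeholders, not values from the original):

```
auto enp3s0f0
iface enp3s0f0 inet static
    address 192.168.1.10
    netmask 255.255.255.0
    gateway 192.168.1.1
```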
Disable NetworkManager:

```bash
sudo systemctl stop NetworkManager.service
sudo systemctl disable NetworkManager.service
```

Restart the network:

```bash
sudo systemctl restart networking.service
```
### Install sshd

```bash
sudo apt-get update
```
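The rest of this listing was cut off; on Debian/Ubuntu the sshd daemon ships in the `openssh-server` package, so the step presumably continued with:

```bash
sudo apt-get install openssh-server
```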
## 2. Initialization

Set the server's hostname, user authorization, and so on via Puppet.
## 3. Install Applications

### 3.1. Install Hadoop

#### 3.1.1. Preparation

The groundwork needed to run a Hadoop cluster:
1. Install the JDK

   Install OpenJDK:

   ```bash
   sudo add-apt-repository ppa:openjdk-r/ppa
   sudo apt-get update
   sudo apt-get install openjdk-8-jdk
   java -version
   ```
2. Configure SSH

   Set up passwordless login to localhost:

   ```bash
   ssh-keygen -t rsa
   ssh-copy-id localhost
   ```

   Verify that `ssh localhost` logs in without a password.
3. Unpack the distribution

   Unpack the downloaded Hadoop release:

   ```bash
   sudo useradd -u 1006 -s /usr/sbin/nologin -M hadoop
   sudo mkdir /opt/.hadoop_versions
   sudo tar -zxvf hadoop-2.6.0-cdh5.8.4.tar.gz -C /opt/.hadoop_versions/
   cd /opt
   sudo ln -s /opt/.hadoop_versions/hadoop-2.6.0-cdh5.8.4 hadoop
   sudo chown -R hadoop:hadoop /opt/.hadoop_versions
   ```

   Keeping each unpacked version under the hidden directory and pointing the stable symlink `/opt/hadoop` at the active one makes later upgrades a matter of re-pointing the symlink.
Edit `etc/hadoop/hadoop-env.sh` and, at a minimum, set `JAVA_HOME` to the root of the Java installation:

```bash
export JAVA_HOME=/usr/java/latest
```
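Note that the OpenJDK package installed above does not create `/usr/java/latest`; a quick way to find the right path on Ubuntu (the exact directory name is an assumption about this setup):

```bash
# list installed JVMs and point JAVA_HOME at the JDK 8 directory
ls /usr/lib/jvm/
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```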
Try the following command:

```bash
bin/hadoop
```

This will display the usage documentation for the `hadoop` script.
#### 3.1.2. Hadoop's Three Modes

You can now start a Hadoop cluster in one of the three supported modes:

- Standalone mode
- Pseudo-distributed mode
- Fully-distributed mode
##### Standalone

- Install OpenJDK
- Unpack the distribution
- Test:

```bash
mkdir input
cp etc/hadoop/*.xml input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep input output 'dfs[a-z.]+'
cat output/*
```
##### Pseudo-Distributed

Hadoop can also run on a single node in so-called pseudo-distributed mode, where each Hadoop daemon runs as a separate Java process.

- Install OpenJDK
- Unpack the distribution
- Configure passwordless SSH to localhost
- Edit the configuration files:
`etc/hadoop/core-site.xml`:

```xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
```
`etc/hadoop/hdfs-site.xml`:

```xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
```
Then perform the following steps.

Format the filesystem:

```bash
bin/hdfs namenode -format
```
Start the NameNode and DataNode daemons:

```bash
sbin/start-dfs.sh
```

Logs are written to the `$HADOOP_LOG_DIR` directory (defaults to `$HADOOP_HOME/logs`). Browse the web interface for the NameNode:

NameNode - http://localhost:50070/
Make the HDFS directories required to execute MapReduce jobs (replace `<username>` with the user that will run the jobs, e.g. the output of `whoami`):

```bash
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/<username>
```

Copy the input files into the distributed filesystem:

```bash
bin/hdfs dfs -put etc/hadoop input
```
Run some of the provided examples:

```bash
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep input output 'dfs[a-z.]+'
```
Examine the output files:

```bash
bin/hdfs dfs -get output output
cat output/*
```

or view them directly on HDFS:

```bash
bin/hdfs dfs -cat output/*
```

When you're done, stop the daemons with:

```bash
sbin/stop-dfs.sh
```
**Using YARN in pseudo-distributed mode**

To run a MapReduce job on YARN in pseudo-distributed mode, a few parameters need to be adjusted. This assumes steps 1-4 above have already been carried out.

- Configure parameters:
`etc/hadoop/mapred-site.xml`:

```xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
```
`etc/hadoop/yarn-site.xml`:

```xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
```
Start the ResourceManager and NodeManager daemons:

```bash
sbin/start-yarn.sh
```

Check the processes:

```
chongxiang@dev-04-dev-ofc:/opt$ jps
25619 DataNode
25430 NameNode
26471 NodeManager
25896 SecondaryNameNode
26187 ResourceManager
48030 Jps
```

Browse the web interface for the ResourceManager:

ResourceManager - http://localhost:8088/
Run a MapReduce job (the example `grep` job from the HDFS section can be re-run; it will now be scheduled by YARN).

When you're done, stop the daemons with:

```bash
sbin/stop-yarn.sh
```
Configure the environment variables:

```bash
export HADOOP_HOME=/opt/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
# YARN_CONF_DIR must point at the configuration directory, not the install root
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
```
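These exports only apply to the current shell; to make them persistent, one option (an assumption, not from the original) is to append them to `~/.bashrc` and reload:

```bash
# after appending the export lines above to ~/.bashrc
source ~/.bashrc
hadoop version   # sanity check: PATH should now resolve the hadoop script
```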
#### Fully-Distributed Hadoop Deployment

Covered in a separate post.
### 3.2. Install Spark (Standalone Mode)

Unpack the downloaded Spark release:

```bash
sudo useradd -u 1008 -s /usr/sbin/nologin -M spark
sudo mkdir /opt/.spark_versions
sudo tar -zxvf spark-1.6.3-bin-hadoop2.6.tgz -C /opt/.spark_versions/
cd /opt
sudo ln -s /opt/.spark_versions/spark-1.6.3-bin-hadoop2.6 spark
sudo chown -R spark:spark /opt/.spark_versions
# directory for event logs
sudo mkdir /data/sparklogs
```
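A likely gap in the listing above: `/data/sparklogs` is created by root, so the user actually running Spark jobs cannot write event logs into it. A hedged fix (the ownership choice is an assumption):

```bash
# let the spark user own the event-log directory
sudo chown spark:spark /data/sparklogs
# or open it up if jobs run as several different users
sudo chmod 1777 /data/sparklogs
```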
Configuration file `/opt/spark/conf/spark-defaults.conf`:

```
spark.eventLog.enabled            true
spark.eventLog.dir                file:///data/sparklogs
spark.history.fs.logDirectory     file:///data/sparklogs
spark.history.fs.cleaner.enabled  true
spark.history.fs.cleaner.interval 1d
spark.history.fs.cleaner.maxAge   3d
spark.serializer                  org.apache.spark.serializer.KryoSerializer
```
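The settings above enable event logging and log cleaning for the Spark history server; to actually browse past jobs, the history server that ships with Spark can be started (18080 is its default port):

```bash
/opt/spark/sbin/start-history-server.sh
# then browse http://localhost:18080/
```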
Set the environment variables:

```bash
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
```
Test the environment:

```bash
# local mode
./bin/spark-submit --master local ./examples/src/main/python/pi.py 10
# yarn mode
./bin/spark-submit --master yarn ./examples/src/main/python/pi.py 10
```
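For the yarn mode above to find the cluster, Spark requires that `HADOOP_CONF_DIR` or `YARN_CONF_DIR` point at the directory holding the Hadoop client configuration, e.g.:

```bash
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
```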
### 3.3. Install miniconda2

```bash
wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
```
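The listing stops after the download; the installer is normally run with the shell (it prompts for the install prefix and PATH setup):

```bash
bash Miniconda2-latest-Linux-x86_64.sh
```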
Common conda commands:

```bash
# list environments
conda info --envs
# dump an annotated copy of all config options
conda config --describe > ./.condarc
# create, activate, deactivate, and delete an environment
conda create -n jupyter_note
source activate jupyter_note
source deactivate
conda remove --name jupyter_note --all
# create an environment from a YAML spec
conda env create -f environment.yml
# export the installed packages and rebuild an environment from the list
conda list --export > package-list.txt
conda create -n myenv --file package-list.txt
# install from an explicit spec file
conda install --name MyEnvironment --file explicit-spec-file.txt
```
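For reference, a minimal `environment.yml` that the `conda env create -f` command above would accept; the name and package list here are illustrative, not from the original:

```
name: jupyter_note
dependencies:
  - python=2.7
  - jupyter
```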
### 3.4. Jupyter

```bash
# set the password in the Jupyter configuration file
```
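The commands themselves were truncated from this listing. A common way to do this (a reconstruction; the original commands are lost) is:

```bash
# create ~/.jupyter/jupyter_notebook_config.py
jupyter notebook --generate-config
# hash a password and store it in ~/.jupyter/jupyter_notebook_config.json
jupyter notebook password
```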