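HttpClient is a library for handling HTTP requests. It has a rich API and is one of the most commonly used tools for writing web crawlers in Java. The API can change considerably between HttpClient versions. At the time of writing the latest version is 4.5.1, and the examples below (built with Maven) use that version to introduce the basics.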
1. Building with Maven
Add the HttpClient dependency to pom.xml:
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.1</version>
</dependency>
2. Initializing basic parameters
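import org.apache.http.client.HttpClient;
import org.apache.http.config.SocketConfig;
import org.apache.http.impl.client.HttpClientBuilder;

private HttpClient httpclient;

public void init() {
    // httpClientConnectionManager = new BasicHttpClientConnectionManager();
    HttpClientBuilder builder = HttpClientBuilder.create();
    // Socket read timeout (SO_TIMEOUT) in milliseconds
    SocketConfig socketConfig = SocketConfig.custom().setSoTimeout(5000).build();
    builder.setDefaultSocketConfig(socketConfig);
    // Maximum number of connections per route
    builder.setMaxConnPerRoute(10);
    // builder.setRetryHandler();
    // builder.setConnectionManager(httpClientConnectionManager);
    httpclient = builder.build();
}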
3. HttpGet request
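import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.util.EntityUtils;

public String get(String url) {
    HttpGet httpget = new HttpGet(url);
    // Set the User-Agent header
    httpget.addHeader(
            "user-agent",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) "
                    + "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36");
    int port = 9999; // proxy port
    HttpHost proxy = new HttpHost("your proxy hostName", port);
    // Per-request settings such as timeouts and the proxy
    RequestConfig requestConfig = RequestConfig.custom()
            .setConnectTimeout(5000)
            .setSocketTimeout(5000)
            .setProxy(proxy).build();
    httpget.setConfig(requestConfig);
    try {
        // Execute the request; this returns an HttpResponse
        HttpResponse response = httpclient.execute(httpget);
        HttpEntity entity = response.getEntity();
        // Read the body as a String. The official documentation discourages EntityUtils
        // unless the server is trusted and the response is known to be of limited length.
        String content = EntityUtils.toString(entity);
        return content;
    } catch (ClientProtocolException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}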
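The comment about EntityUtils in the code above refers to the risk of buffering a response of unknown size entirely in memory. One possible alternative is to read the entity's stream yourself with a size cap. The following is a minimal sketch using only the JDK; `BoundedReader`, `readLimited`, and the byte limit are hypothetical names and values chosen for illustration, not part of HttpClient.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of HttpClient.
class BoundedReader {

    /**
     * Reads the whole stream and decodes it with the given charset.
     * Throws IOException as soon as more than maxBytes have been read.
     */
    static String readLimited(InputStream in, int maxBytes, Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            total += n;
            if (total > maxBytes) {
                throw new IOException("Response body exceeds " + maxBytes + " bytes");
            }
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), charset);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a response stream, so the sketch runs without a network call.
        InputStream in = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(readLimited(in, 1024, StandardCharsets.UTF_8)); // prints "hello"
    }
}
```

In the `get` method, this could be applied to the stream returned by `entity.getContent()`, wrapped in a try-with-resources block so that the stream is closed and the connection released. The charset here is assumed to be UTF-8; a real crawler would normally take it from the response's Content-Type header.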
4. HttpPost request
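import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.HttpResponse;
import org.apache.http.NameValuePair;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public String post(String url, Map<String, Object> params) {
    HttpPost httpPost = new HttpPost(url);
    // Set the User-Agent header
    httpPost.addHeader(
            "user-agent",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36");
    // Build the form fields from params, e.g. "username" -> "vip", "password" -> "secret"
    List<NameValuePair> nvps = new ArrayList<NameValuePair>();
    if (params != null) {
        for (Map.Entry<String, Object> entry : params.entrySet()) {
            nvps.add(new BasicNameValuePair(entry.getKey(), String.valueOf(entry.getValue())));
        }
    }
    try {
        httpPost.setEntity(new UrlEncodedFormEntity(nvps));
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
    int port = 9999; // proxy port
    HttpHost proxy = new HttpHost("your proxy hostName", port);
    // Per-request settings such as timeouts and the proxy
    RequestConfig requestConfig = RequestConfig.custom()
            .setConnectTimeout(5000).setSocketTimeout(5000).setProxy(proxy).build();
    httpPost.setConfig(requestConfig);
    try {
        // Execute the request; this returns an HttpResponse
        HttpResponse response = httpclient.execute(httpPost);
        HttpEntity entity = response.getEntity();
        // Read the body as a String. The official documentation discourages EntityUtils
        // unless the server is trusted and the response is known to be of limited length.
        String content = EntityUtils.toString(entity);
        return content;
    } catch (ClientProtocolException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}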
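UrlEncodedFormEntity sends the name/value pairs as an application/x-www-form-urlencoded request body. To make that format concrete, here is a small JDK-only sketch that approximates the encoding with `java.net.URLEncoder`. `FormBodySketch` and `encode` are hypothetical names, UTF-8 is an assumed charset, and the output is meant to illustrate the wire format rather than reproduce HttpClient's encoder exactly.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; HttpClient does this work inside UrlEncodedFormEntity.
class FormBodySketch {

    /** Joins the entries as name=value pairs separated by '&', percent-encoding names and values. */
    static String encode(Map<String, Object> params) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, Object> entry : params.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(entry.getKey(), "UTF-8"));
                body.append('=');
                body.append(URLEncoder.encode(String.valueOf(entry.getValue()), "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java platform.
            throw new IllegalStateException(e);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output order is predictable.
        Map<String, Object> params = new LinkedHashMap<String, Object>();
        params.put("username", "vip");
        params.put("password", "secret");
        System.out.println(encode(params)); // prints "username=vip&password=secret"
    }
}
```

Characters that have a special meaning in the form format, such as `&` and `=`, are percent-encoded inside names and values, and spaces become `+`.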
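5. What HttpClient is mainly used for
HttpClient's main job is to fetch a page's source by simulating a normal HTTP request. The source obtained this way is usually the same as what you see in a browser via right-click, "View page source". This is the first step of scraping data. The next step is to parse out the data you actually need, typically with tools such as Jsoup, HTMLCleaner, Gson, FastJson, or regular expressions.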
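As a minimal illustration of the parsing step, the sketch below pulls the page title out of an HTML string with a regular expression from the JDK. `TitleExtractor` and `extractTitle` are hypothetical names, and the HTML string stands in for content returned by the `get` method. A regular expression is adequate for a simple, well-known pattern like this one; for anything structural, an HTML parser such as Jsoup is usually the more robust choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration.
class TitleExtractor {

    // Case-insensitive, and DOTALL so that a title spanning several lines still matches.
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the trimmed text of the first title element, or null if there is none. */
    static String extractTitle(String html) {
        Matcher matcher = TITLE.matcher(html);
        return matcher.find() ? matcher.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><title> Example Page </title></head><body></body></html>";
        System.out.println(extractTitle(html)); // prints "Example Page"
    }
}
```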