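HttpClient is a library for handling HTTP requests. It has a rich API and is one of the most commonly used tools for writing web crawlers in Java. The API can change considerably between versions. At the time of writing the latest version was 4.5.1, and the examples below (built with Maven) use that version to introduce basic usage. See the official Apache HttpComponents site for more.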

1. Maven build

Add the HttpClient dependency to pom.xml:

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.1</version>
</dependency>

2. Initializing basic parameters

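    import org.apache.http.client.HttpClient;
    import org.apache.http.config.SocketConfig;
    import org.apache.http.impl.client.HttpClientBuilder;

    private HttpClient httpclient;

    public void init() {
        HttpClientBuilder builder = HttpClientBuilder.create();

        // Socket read timeout (SO_TIMEOUT) of 5 seconds
        SocketConfig socketConfig = SocketConfig.custom().setSoTimeout(5000)
                .build();
        builder.setDefaultSocketConfig(socketConfig);

        // At most 10 concurrent connections per route
        builder.setMaxConnPerRoute(10);

        // Optional: plug in a custom connection manager or retry handler
        //httpClientConnectionManager = new BasicHttpClientConnectionManager();
        //builder.setConnectionManager(httpClientConnectionManager);
        //builder.setRetryHandler();

        httpclient = builder.build();
    }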

3. HttpGet request

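    import java.io.IOException;
    import org.apache.http.HttpEntity;
    import org.apache.http.HttpHost;
    import org.apache.http.HttpResponse;
    import org.apache.http.client.ClientProtocolException;
    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.util.EntityUtils;

    public String get(String url) {
        HttpGet httpget = new HttpGet(url);

        // Set the User-Agent header (a string literal cannot span lines,
        // so the value is concatenated)
        httpget.addHeader(
                "user-agent",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) "
                        + "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36");

        int port = 9999; // proxy port
        HttpHost proxy = new HttpHost("your proxy hostName", port);

        // Per-request settings such as timeouts and the proxy
        RequestConfig requestConfig = RequestConfig.custom()
                .setConnectTimeout(5000)
                .setSocketTimeout(5000)
                .setProxy(proxy).build();
        httpget.setConfig(requestConfig);

        try {
            // Execute the request; this returns an HttpResponse
            HttpResponse response = httpclient.execute(httpget);
            HttpEntity entity = response.getEntity();

            // Read the body as a String. The official documentation discourages
            // EntityUtils unless the server is trusted and the body is known to
            // be of limited length. Some responses carry no entity at all.
            return entity != null ? EntityUtils.toString(entity) : null;
        } catch (ClientProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }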
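As far as I know, in HttpClient 4.x `EntityUtils.toString(entity)` decodes the body with the charset declared in the response's Content-Type header and falls back to ISO-8859-1 when none is declared, which can garble pages in other encodings. The sketch below uses only the JDK and shows one way to pick the charset from a Content-Type value with an explicit fallback. `CharsetSketch` and `charsetOf` are hypothetical names for illustration, not part of HttpClient.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetSketch {
    // Matches charset=UTF-8 or charset="UTF-8", case-insensitively
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or the fallback. */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(charsetOf("text/html", StandardCharsets.UTF_8));
    }
}
```

With the body read as bytes (for example via `EntityUtils.toByteArray(entity)`), the result could then be decoded with `new String(bytes, charset)`.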

4. HttpPost request

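    import java.io.IOException;
    import java.io.UnsupportedEncodingException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import org.apache.http.HttpEntity;
    import org.apache.http.HttpHost;
    import org.apache.http.HttpResponse;
    import org.apache.http.NameValuePair;
    import org.apache.http.client.ClientProtocolException;
    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.client.entity.UrlEncodedFormEntity;
    import org.apache.http.client.methods.HttpPost;
    import org.apache.http.message.BasicNameValuePair;
    import org.apache.http.util.EntityUtils;

    public String post(String url, Map<String, Object> params) {
        HttpPost httpPost = new HttpPost(url);

        // Set the User-Agent header
        httpPost.addHeader(
                "user-agent",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36");

        // Build the form fields from the params argument,
        // e.g. username=vip, password=secret
        List<NameValuePair> nvps = new ArrayList<NameValuePair>();
        if (params != null) {
            for (Map.Entry<String, Object> entry : params.entrySet()) {
                nvps.add(new BasicNameValuePair(entry.getKey(), String.valueOf(entry.getValue())));
            }
        }
        try {
            httpPost.setEntity(new UrlEncodedFormEntity(nvps));
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }

        int port = 9999; // proxy port
        HttpHost proxy = new HttpHost("your proxy hostName", port);

        // Per-request settings such as timeouts and the proxy
        RequestConfig requestConfig = RequestConfig.custom()
                .setConnectTimeout(5000).setSocketTimeout(5000).setProxy(proxy).build();
        httpPost.setConfig(requestConfig);

        try {
            // Execute the request; this returns an HttpResponse
            HttpResponse response = httpclient.execute(httpPost);
            HttpEntity entity = response.getEntity();

            // Read the body as a String. The official documentation discourages
            // EntityUtils unless the server is trusted and the body is known to
            // be of limited length. Some responses carry no entity at all.
            return entity != null ? EntityUtils.toString(entity) : null;
        } catch (ClientProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }

        return null;
    }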
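To make the form entity less opaque, here is a JDK-only sketch of roughly what an `application/x-www-form-urlencoded` body looks like once name/value pairs are encoded. It is an approximation for illustration, not HttpClient's own implementation. `FormEncodingSketch` and `encode` are hypothetical names, and UTF-8 is an assumed charset.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class FormEncodingSketch {
    /** Joins the entries as name=value pairs separated by '&', URL-encoding each part. */
    static String encode(Map<String, Object> params) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, Object> e : params.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(String.valueOf(e.getValue()), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is required to be available on every JVM
            throw new IllegalStateException(ex);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output order is predictable
        Map<String, Object> params = new LinkedHashMap<String, Object>();
        params.put("username", "vip");
        params.put("password", "a b&c");
        System.out.println(encode(params)); // username=vip&password=a+b%26c
    }
}
```

Spaces become `+` and reserved characters such as `&` are percent-encoded, which is why values should never be concatenated into the body by hand.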

5. What HttpClient is mainly used for

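HttpClient mainly works by simulating a normal HTTP request to fetch a page's source. The source obtained this way is usually the same as what the browser shows via right-click > View Page Source, so content rendered later by JavaScript is not included. Fetching the source is the first step of scraping. The next step is to parse out the data you actually need, for which common choices include Jsoup, HTMLCleaner, Gson, FastJson, and regular expressions.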
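As a small illustration of the parsing step, the sketch below pulls the `<title>` text out of a fetched page with a regular expression from the JDK. `TitleExtractSketch` and `extractTitle` are hypothetical names. A regex like this is only reasonable for simple, well-known page shapes, and a real HTML parser such as Jsoup is usually the safer choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleExtractSketch {
    // DOTALL lets the title span several lines; the match is case-insensitive
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the trimmed text of the first title element, or null if there is none. */
    static String extractTitle(String html) {
        if (html == null) {
            return null;
        }
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><title> Example Page </title></head><body></body></html>";
        System.out.println(extractTitle(html)); // Example Page
    }
}
```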
