Scraping audio for 20,000 words with Python - JD Cloud Developer Community

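Word audio can be found almost everywhere these days, but fetching it yourself is a good way to practise. This post records the whole process, from finding a target site to the finished code.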

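The actual need:
I had downloaded a list of 20,000 words, but it contained only the words and no audio. That would not do.
As someone who likes automation, I was not going to hunt for audio files online by hand, so I simply wrote a program.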

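Steps

1. Find a dictionary lookup site and locate the address of the word pronunciation
2. Download and save the audio with Python

Let's go through them one by one.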

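1. Site and address

After a lot of searching I found that someone had already written this, but against a foreign site that was awkward to work with, so I looked for a Chinese site instead. On almost all of them the audio address is not directly visible in the page because playback is triggered by JavaScript, so I dug the address out of the JS code:
1. The site:
http://www.chadanci.com/
2. Find the pronunciation a tag on the page: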

<a onmouseover="asplay('and', 0)" onclick="asplay('and', 0)" class="play_word" href="javascript:;" title="真人发音"></a>

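3. Find the JS code behind this function:
In the browser's Sources panel, find: http://www.chadanci.com/images/js/_xml_content.js
A method from that file, which shows the file.php endpoint: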

function play_sentence(liju){
    $.ajax({
        type: "GET",
        url: "/e/extend/s/file.php?type=sentence",
        data: "q="+encodeURIComponent(liju),
        success: function(url){
            var asound = getFlashObject("asound");
            if(asound){
                asound.SetVariable("f",url);
                asound.GotoFrame(1);
            }
        }
    });
}

4. Build the query address:

http://www.chadanci.com/e/extend/s/file.php?type=0&world=and

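The meaning is clear: type=0 is the British pronunciation, type=1 is the American one, and the world parameter (spelled that way in the URL) carries the word.

5. Requesting this page returns the address of the MP3 audio, which can then be downloaded directly.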
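The query address from step 4 can be assembled programmatically. The following is a minimal sketch in Python 3 (the listings in this post use Python 2); `build_lookup_url` and `BASE_URL` are hypothetical names introduced here, and how the site treats encoded spaces or other special characters has not been verified.

```python
from urllib.parse import urlencode

# Endpoint taken from step 4 above.
BASE_URL = "http://www.chadanci.com/e/extend/s/file.php"

def build_lookup_url(word, accent=0):
    """Return the lookup URL for `word`; accent 0 = British, 1 = American."""
    if accent not in (0, 1):
        raise ValueError("accent must be 0 (British) or 1 (American)")
    # The query key is spelled "world" because that is how it appears in the URL above.
    return BASE_URL + "?" + urlencode({"type": accent, "world": word})

print(build_lookup_url("and"))
print(build_lookup_url("ice cream", accent=1))
```

Using `urlencode` rather than plain string concatenation keeps the URL valid when an entry in the word list contains a space or punctuation.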

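2. Download and save with Python

Since this is a plain GET link, the server probably does not pay much attention to crawlers, so I skipped proxies and distributed crawling.

The initial idea was very simple:
1. Read the word list from a text file
2. Build the link and download
3. Write the audio to a file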

import sys
import urllib2

def download(word):
    # Ask the lookup page for the MP3 address (type=0: British pronunciation)
    url = "http://www.chadanci.com//e/extend/s/file.php?type=0&world="+word

    req = urllib2.Request(url)
    req.add_header("User-Agent",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36")

    res_data = urllib2.urlopen(req)
    mp3_url = res_data.read().strip()
    #print mp3_url
    # read() returns a string, never None, so check for an empty response
    if not mp3_url:
        return
    try:
        f = urllib2.urlopen(mp3_url)
        # The mp3/ directory must already exist
        with open("mp3/"+word+".mp3", "wb") as fword:
            fword.write(f.read())
    except Exception as e:
        print "error1", e

def process(file_name):
    done = set() # words already downloaded in this run
    # Progress handling
    with open(file_name) as f:
        words = f.readlines()
        num = len(words)
        i = 0
        width = max(num//100, 1) # words per progress mark; at least 1 to avoid modulo by zero
        p = '#'
        while i<num:
            word = words[i].strip()
            # print word
            try:
                if word not in done:
                    download(word)
                    # Add to the downloaded set
                    done.add(word)
            except Exception as e:
                print "error2", e
            i+=1
            if i%width==0:
                p+='#'
            # Refresh the progress display in place
            sys.stdout.write(str((i*1.0/num)*100)+"% :"+p+"->"+"\r")
            sys.stdout.flush()

if __name__=='__main__':
    process('word.txt')
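The in-place progress display used in process() can be isolated into a small helper. This is a sketch in Python 3, not the code from this post; `render_progress` and `show_progress` are hypothetical names. It derives the bar length from the ratio of finished to total words, so it also works for word lists with fewer than 100 entries.

```python
import sys

def render_progress(done, total, slots=100):
    """Return a one-line progress string such as '50.0% :#####->'."""
    if total <= 0:
        return "0.0% :->"
    percent = done * 100.0 / total
    # Number of '#' marks is proportional to the share of finished items.
    bar = "#" * (done * slots // total)
    return "%.1f%% :%s->" % (percent, bar)

def show_progress(done, total):
    # "\r" moves the cursor back to the start of the line,
    # so the next write overwrites the previous progress string.
    sys.stdout.write(render_progress(done, total) + "\r")
    sys.stdout.flush()

for i in range(1, 51):
    show_progress(i, 50)
sys.stdout.write("\n")
```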

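Result:
The program would hang on certain words, and then the whole run stalled. I made the following fixes:

1. Set a timeout:

res_data = urllib2.urlopen(req,timeout=3)

2. Use multithreading

3. Improve the code

A single run may not finish everything, so the downloaded words are also written to a file.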

#!/usr/bin/env python
# coding=utf-8
import urllib2
import threading
import sys

# Thread class
class MyThread(threading.Thread):
    def __init__(self,target,args):
        super(MyThread,self).__init__()
        self.target = target
        self.args = args

    def run(self):
        self.target(self.args)

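def download(word):
    # Ask the lookup page for the MP3 address (type=0: British pronunciation)
    url = "http://www.chadanci.com//e/extend/s/file.php?type=0&world="+word

    req = urllib2.Request(url)
    req.add_header("User-Agent",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36")

    # Timeout so that one unresponsive request cannot stall the whole run
    res_data = urllib2.urlopen(req,timeout=3)
    mp3_url = res_data.read().strip()
    #print mp3_url
    # read() returns a string, never None, so check for an empty response
    if not mp3_url:
        return
    try:
        # The audio request needs a timeout as well
        f = urllib2.urlopen(mp3_url,timeout=3)
        with open("mp3/"+word+".mp3", "wb") as fword:
            fword.write(f.read())
        # Record the finished word so that a later run can skip it
        with open("done_word.txt","ab") as done:
            done.write(word+"\n")
    except Exception as e:
        print "error1", e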

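def process(file_name):
    # Load the words downloaded by earlier runs into a set
    done = set()
    try:
        with open("done_word.txt") as f:
            done = set(line.strip() for line in f)
    except IOError:
        pass # first run: no record file yet

    # Continue downloading
    with open(file_name) as f:
        words = f.readlines()
        num = len(words)
        i = 0
        width = max(num//100, 1) # words per progress mark; at least 1 to avoid modulo by zero
        p = '#'
        while i<num:
            word = words[i].strip()
            # print word
            try:
                if word not in done:
                    download(word)
                    # Add to the downloaded set
                    done.add(word)
            except Exception as e:
                print "error2", e
            i+=1
            if i%width==0:
                p+='#'
            # Refresh the progress display in place
            sys.stdout.write(str((i*1.0/num)*100)+"% :"+p+"->"+"\r")
            sys.stdout.flush()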

def main():
    t1 = MyThread(process,'word.txt')
    t1.start()
    t1.join()

if __name__=='__main__':
    main()

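Result: the program no longer hangs, but the speed is nothing to boast about. It took half an hour to download a little over 6,000 words. Note that main() starts only one worker thread and joins it right away, so the words are still fetched one at a time.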
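One way to get real concurrency is to hand the words to a pool of worker threads. The following is a sketch in Python 3 using the standard `concurrent.futures` module, not the code from this post. The names `fetch_bytes`, `download_word`, `download_all` and `fake_fetch` are hypothetical, the worker count of 8 is a guess, and whether the site tolerates parallel requests has not been tested. The demonstration at the bottom swaps the network for a stand-in function so that it runs offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.parse import quote
from urllib.request import Request, urlopen

# Lookup address from step 4 (type=0: British pronunciation).
LOOKUP = "http://www.chadanci.com/e/extend/s/file.php?type=0&world="

def fetch_bytes(url, timeout=3):
    """GET a URL and return the response body as bytes."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.read()

def download_word(word, fetch=fetch_bytes):
    """Look up the MP3 address for `word`, then return the audio bytes."""
    mp3_url = fetch(LOOKUP + quote(word)).decode("utf-8").strip()
    if not mp3_url:
        raise ValueError("no audio address returned for %r" % word)
    return fetch(mp3_url)

def download_all(words, fetch=fetch_bytes, workers=8):
    """Download many words concurrently; return (audio_by_word, failed_words)."""
    audio, failed = {}, []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(download_word, w, fetch): w for w in words}
        for fut in as_completed(futures):
            word = futures[fut]
            try:
                audio[word] = fut.result()
            except Exception:
                # Collect failures so they can be retried in a later run.
                failed.append(word)
    return audio, failed

# Offline demonstration: fake_fetch stands in for the network.
def fake_fetch(url, timeout=3):
    if url.startswith(LOOKUP):
        word = url[len(LOOKUP):]
        # Simulate a word for which the site returns no audio address.
        if word == "zzz":
            return b""
        return ("http://example.invalid/%s.mp3" % word).encode("utf-8")
    return b"MP3:" + url.encode("utf-8")

audio, failed = download_all(["and", "zzz", "the"], fetch=fake_fetch, workers=2)
print(sorted(audio), failed)
```

The sketch returns the audio as bytes and leaves writing the mp3/ files and the done_word.txt record to the caller, which keeps all file writes in one thread.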

4. Summary

Techniques used:
1. Fetching web pages with urllib2
2. File handling
3. System output and in-place progress refresh
4. Multithreading