Commits

luoboiqingcai  committed 9f21468

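Because the Duxiu database changed its algorithm, the image URLs are now different on every request, so the approach of first saving the page locally with Firefox and then downloading the images no longer works. In addition, JavaScript now hides the image URLs more deeply, so they can no longer be extracted directly from the page with a regular expression as before; as a result, downloading the images over the network has stopped working too. Frustrating...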
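To illustrate the failure described above, here is a minimal sketch of how a line-based regular expression stops matching once the page no longer exposes the URL in plain text. The pattern is `pattern_for_duxiu_r` from downloadlib.py; both sample page lines and the helper `find_img_prefix` are hypothetical and invented for illustration, not taken from a real Duxiu page.

```python
import re

# Pattern copied from downloadlib.py (see the diff in this commit).
pattern_for_duxiu_r = r'\s+var str = "(.+)";'

# Hypothetical page lines, invented for illustration only.
old_style_line = '    var str = "/img/abc123/";'   # URL visible in the source
new_style_line = '    loadPage(decode(token));'    # URL built at run time by JavaScript


def find_img_prefix(line):
    """Return the captured string, or None when the pattern does not match."""
    m = re.match(pattern_for_duxiu_r, line)
    return m.group(1) if m else None


old_result = find_img_prefix(old_style_line)
new_result = find_img_prefix(new_style_line)
print(old_result)
print(new_result)
```

When the URL is assembled by JavaScript at run time, no line of the static HTML contains it, so a purely textual match has nothing to capture.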

  • Parent commits 8cb4b2a

Files changed (1)

File downloadlib.py

 __contact__ = "sf.cumt@gmail.com"
 __licence__ = "MIT"
 
-duxiu_source_list = ["www2.zhengzhifl.cn","www.zirankxzl.cn","www.zhexuezj.cn","www.junshilei.cn",]
-
-DEBUG = False
+#duxiu_source_list = ["www2.zhengzhifl.cn","www.zirankxzl.cn","www.zhexuezj.cn","www.junshilei.cn",]
+duxiu_source_list = ["www.zhengzhifl.cn",]
+DEBUG = True
 TEM_FILE = "remote_webpage_content.html"
 
 class DownloadLib(object):
         url
             relative URL of the preview image
         """
-        s = self.img_url_prex + url
+        self.log.debug("the full url of img,url:%s;"%url)
         try:
-            img = requests.get(s).content
+            img = requests.get(url).content
         except requests.exceptions.RequestException as e:
             self.log.warn('first time download failed: %s'%e)
             time.sleep(10)
             try:
-                img = requests.get(s).content
+                img = requests.get(url).content
             except requests.exceptions.RequestException as e:
                 self.log.warn('try second time failed: %s'%e)
                 self.log.info("%s wasn't downloaded."%s)
             content = self.get_content(url,remotep=False)
             for line in content:
                 imgurls = re.findall(p,line)
+                self.log.debug("line:%s;"%line)
+                self.log.debug("imgurls in downloaded html file:%s;"%imgurls)
+                if len(imgurls)>1:
+                    break
             content.close()
         return [{'url':self.img_url_prex+imgurl,'localname':os.path.join(self.prex,self.get_localname(imgurl))} for imgurl in imgurls]
 
         self.log.debug("entering downloadit")
         count = 0
         img_pairs =self.get_img_pairs(strategy,url,sp,ep)
+        self.log.debug("img_pairs:%s"%img_pairs)
         for ul in img_pairs:
             img = self.get_img(ul['url'])
+            self.log.debug("length of img:%d"%len(img))
             count += 1
             try:
                 f = open(ul['localname'],'wb')
         return resume_point
 
 if __name__ == '__main__':
-    pattern_for_duxiu_l = r'<input .+?src="drspath_files/(.+?)" scr="(.+?)" .*?>' # regex for extracting image URLs from a page saved locally with a '''browser'''.
+    pattern_for_duxiu_l = r'<input .+?src=".+?" scr="(.+?)" .*? type="image">' # regex for extracting image URLs from a page saved locally with a '''browser'''.
     pattern_for_duxiu_r = r'\s+var str = "(.+)";' # regex for extracting '''remote''' image URLs.
     pattern_for_chaoxing =  pattern_for_duxiu_r
     lnp = r'.+/(\d+)\?\.' # regex for deriving the local file name from the full remote image URL.
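
The following sketch shows what the new `pattern_for_duxiu_l` and `lnp` capture. The two patterns are copied from the diff above; the sample `<input>` line and its `example.invalid` URL are assumptions made up for illustration, since the commit does not include a real saved page.

```python
import re

# Patterns copied from the diff above.
pattern_for_duxiu_l = r'<input .+?src=".+?" scr="(.+?)" .*? type="image">'
lnp = r'.+/(\d+)\?\.'

# Hypothetical line from a browser-saved page (invented for illustration).
line = ('<input id="p1" src="drspath_files/000001.jpg" '
        'scr="http://example.invalid/img/000001?.jpg" width="600" type="image">')

# With a single capture group, re.findall returns a list of strings.
imgurls = re.findall(pattern_for_duxiu_l, line)

# lnp captures the run of digits that sits between the last "/" and "?.".
local_name = re.match(lnp, imgurls[0]).group(1)

print(imgurls)
print(local_name)
```

Note that the old pattern had two capture groups, so `re.findall` returned a list of tuples; the new pattern has one group and returns plain strings, which is what the list comprehension in the diff concatenates with `self.img_url_prex`.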