Visual Paradigm is an excellent UML diagramming tool. If you want to learn more about it, go straight to the official site; I won't introduce it further here. I recently needed it for some design work, and it has a very complete set of online tutorials that are very well written. The problem is that there are a great many of these tutorials, and if everyone has to read them over the external network, both access speed and learning efficiency suffer. Looking around the site, I noticed that every article has a PDF available for download, and the sample files can be downloaded as well. Heh, that makes things easy: just write a program to grab them all. So I handed the job to a student, who delivered in half a day. Looking over the code, there are a few places that could still be optimized, but overall it is quite good, so I am writing this article to walk through it.
Note: HulkZ has not yet graduated; he is currently in his fourth year of university.
public class VisualParadigmMain {
    public static void main(String[] args) throws Exception {
        Spider spider = new SpiderImpl("UTF-8");
        Watcher watcher = new WatcherImpl();
        watcher.addProcessor(new VisualParadigmMainProcessor());
        QuickNameFilter<HtmlNode> nodeFilter = new QuickNameFilter<HtmlNode>();
        nodeFilter.setNodeName("li");
        nodeFilter.setIncludeAttribute("class", "tutorialLeftMenuItem");
        watcher.setNodeFilter(nodeFilter);
        spider.addWatcher(watcher);
        spider.processUrl("http://www.visual-paradigm.com/tutorials/");
    }
}
The gist: create a spider that reads pages as UTF-8, create a watcher, register VisualParadigmMainProcessor to handle every li tag whose class is tutorialLeftMenuItem, and then process the URL http://www.visual-paradigm.com/tutorials/.
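To make the pattern clearer, here is a hypothetical, minimal reuse of the same API, assuming the same TinySpider imports as the snippets in this article; HeadingPrinter and the h2 filter are made up purely for illustration and are not part of the crawler. You point the filter at a different tag and supply any Processor you like:

// Hypothetical example, not part of the crawler: the same
// spider + watcher + filter + processor pattern, pointed at <h2> tags.
public class HeadingPrinter {
    public static void main(String[] args) throws Exception {
        Spider spider = new SpiderImpl("UTF-8");            // fetch pages as UTF-8
        Watcher watcher = new WatcherImpl();
        watcher.addProcessor(new Processor() {              // called once per matched node
            public void process(String url, HtmlNode node, Map<String, Object> parameters) throws Exception {
                System.out.println(node.getPureText());     // print the heading text
            }
        });
        QuickNameFilter<HtmlNode> nodeFilter = new QuickNameFilter<HtmlNode>();
        nodeFilter.setNodeName("h2");                       // select every <h2> element
        watcher.setNodeFilter(nodeFilter);
        spider.addWatcher(watcher);
        spider.processUrl("http://www.visual-paradigm.com/tutorials/");
    }
}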
Up to this point the flow is quite clear. Next, let's see what VisualParadigmMainProcessor looks like.
public class VisualParadigmMainProcessor implements Processor {
    public void process(String url, HtmlNode node, Map<String, Object> parameters) throws Exception {
        HtmlNode a = node.getSubNode("a");
        File file = new File("E:\\临时\\spider\\" + a.getPureText().trim());
        if (!file.exists()) {
            file.mkdirs();
        }
        VisualParadigmList.process(a.getAttribute("href"));
    }
}
The node passed in here is exactly the li tag with class tutorialLeftMenuItem described above. The processor finds the <a> tag underneath it, creates a directory named after the link text, and then hands the category link to the VisualParadigmList class for processing.
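One small caveat: the directory name is taken straight from the link text, so a category title containing characters that Windows forbids in paths would make mkdirs() fail. A hypothetical helper along these lines (FileNames.sanitize is not part of the student's code) could guard against that:

// Hypothetical helper, not in the original code: strip characters that are
// invalid in Windows file and directory names before using link text as a path.
public final class FileNames {
    private FileNames() {
    }

    // Replace \ / : * ? " < > | with underscores and trim surrounding whitespace.
    public static String sanitize(String name) {
        return name.replaceAll("[\\\\/:*?\"<>|]", "_").trim();
    }
}

The processor would then build its File from FileNames.sanitize(a.getPureText()) instead of the raw link text.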
Each category page is mainly a list of tutorial articles, so the next step, naturally, is to process those articles.
public class VisualParadigmList {
    public static void process(String url) throws Exception {
        Spider spider = new SpiderImpl("UTF-8");
        Watcher watcher = new WatcherImpl();
        watcher.addProcessor(new VisualParadigmListProcessor());
        QuickNameFilter<HtmlNode> nodeFilter = new QuickNameFilter<HtmlNode>();
        nodeFilter.setNodeName("div");
        nodeFilter.setIncludeAttribute("class", "tutorial-link-container");
        watcher.setNodeFilter(nodeFilter);
        spider.addWatcher(watcher);
        spider.processUrl("http://www.visual-paradigm.com" + url);
        System.out.println(System.currentTimeMillis() + "-" + url);
    }
}
The VisualParadigmListProcessor class looks like this:
public class VisualParadigmListProcessor implements Processor {
    public void process(String url, HtmlNode node, Map<String, Object> parameters) throws Exception {
        HtmlNode a = node.getSubNode("a");
        VisualParadigmPage.process(a.getPureText(), a.getAttribute("href"));
    }
}
In other words, it takes the URL from the a tag inside each matched div and passes it to the VisualParadigmPage class.
That is the actual tutorial page. In its top-right corner there is a PDF link, so the remaining work is to grab those PDFs. First, the VisualParadigmPage class:
public class VisualParadigmPage {
    public static void process(String title, String url) throws Exception {
        Spider spider = new SpiderImpl("UTF-8");
        Watcher watcher = new WatcherImpl();
        watcher.addProcessor(new VisualParadigmPageProcessor(title));
        QuickNameFilter<HtmlNode> nodeFilter = new QuickNameFilter<HtmlNode>();
        nodeFilter.setNodeName("html");
        watcher.setNodeFilter(nodeFilter);
        spider.addWatcher(watcher);
        spider.processUrl("http://www.visual-paradigm.com" + url);
        System.out.println(System.currentTimeMillis() + "-" + url);
    }
}
Its processor, VisualParadigmPageProcessor, does the real work (in the submitted version an unused ol filter was created while the pdf filter was reconfigured and reused; the filter for the ol node is used consistently below):

public class VisualParadigmPageProcessor implements Processor {
    static HttpClient httpClient = new HttpClient();
    private final String title;

    public VisualParadigmPageProcessor(String title) {
        this.title = title;
    }

    public void process(String url, HtmlNode node, Map<String, Object> parameters) throws Exception {
        // The page title is used as the PDF file name.
        NameFilter<HtmlNode> titleUrlFilter = new NameFilter<HtmlNode>(node);
        titleUrlFilter.setNodeName("title");
        HtmlNode titleNode = titleUrlFilter.findNode();

        // The PDF download link in the top-right corner of the tutorial page.
        NameFilter<HtmlNode> pdfUrlFilter = new NameFilter<HtmlNode>(node);
        pdfUrlFilter.setNodeName("a");
        pdfUrlFilter.setIncludeAttribute("class", "pdf notranslate");
        HtmlNode pdfNode = pdfUrlFilter.findNode();

        // The list of other attachment links, e.g. sample project files.
        NameFilter<HtmlNode> olFilter = new NameFilter<HtmlNode>(node);
        olFilter.setNodeName("ol");
        olFilter.setIncludeAttribute("class", "contentPoint");
        HtmlNode olNode = olFilter.findNode();

        if (pdfNode != null) {
            String pdfUrl = "http://www.visual-paradigm.com" + pdfNode.getAttribute("href");
            saveUrl(titleNode.getPureText() + ".pdf", pdfUrl);
        }
        if (olNode != null && olNode.getSubNodes("a") != null) {
            for (HtmlNode aNode : olNode.getSubNodes("a")) {
                String vppUrl = "http://www.visual-paradigm.com" + aNode.getAttribute("href");
                saveUrl(aNode.getPureText(), vppUrl);
            }
        }
    }

    private void saveUrl(String name, String urlAddress) throws IOException {
        String fileName = "E:\\临时\\spider\\" + title + "\\" + name;
        GetMethod getMethod = new GetMethod(urlAddress);
        int iGetResultCode = httpClient.executeMethod(getMethod);
        if (iGetResultCode == HttpStatus.SC_OK) {
            InputStream inputStream = getMethod.getResponseBodyAsStream();
            OutputStream outputStream = new FileOutputStream(fileName);
            byte[] buffer = new byte[4096];
            int n = -1;
            while ((n = inputStream.read(buffer)) != -1) {
                if (n > 0) {
                    outputStream.write(buffer, 0, n);
                }
            }
            inputStream.close();
            outputStream.close();
        }
        getMethod.releaseConnection();
    }
}
It looks up the title in the page, then the PDF link, then the links to any other attachments, and saves whatever it finds. With that, the coding work is done.
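As a side note, saveUrl leaves its streams open if the copy fails halfway and assumes the target directory already exists. A hypothetical standalone helper like the one below (Downloads.download is not part of the delivered code) shows one way the same Commons HttpClient copy loop could be hardened:

// Hypothetical download helper, not the delivered code: ensure the target
// directory exists and always close the streams and the connection,
// even if the copy fails partway through.
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;

public class Downloads {
    public static void download(HttpClient httpClient, String urlAddress, File target) throws IOException {
        File parent = target.getParentFile();
        if (parent != null && !parent.exists()) {
            parent.mkdirs();
        }
        GetMethod getMethod = new GetMethod(urlAddress);
        try {
            if (httpClient.executeMethod(getMethod) == HttpStatus.SC_OK) {
                InputStream in = getMethod.getResponseBodyAsStream();
                OutputStream out = new FileOutputStream(target);
                try {
                    byte[] buffer = new byte[4096];
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                    }
                } finally {
                    in.close();
                    out.close();
                }
            }
        } finally {
            getMethod.releaseConnection();
        }
    }
}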
 Directory of D:\BaiduYunDownload\VPTutorials

2014/11/02  20:06    <DIR>          .
2014/11/02  20:06    <DIR>          ..
2014/11/02  20:05    <DIR>          Business Modeling
2014/11/02  20:05    <DIR>          Business Process Modeling
2014/11/02  20:05    <DIR>          Business Rule
2014/11/02  20:05    <DIR>          Code Engineering
2014/11/02  20:05    <DIR>          Customization
2014/11/02  20:05    <DIR>          Data Modeling
2014/11/02  20:05    <DIR>          Database Tools
2014/11/02  20:05    <DIR>          Design Animation
2014/11/02  20:05    <DIR>          Diagramming
2014/11/02  20:05    <DIR>          Enterprise Architecture
2014/11/02  20:05    <DIR>          Glossary
2014/11/02  20:05    <DIR>          Grid
2014/11/02  20:05    <DIR>          IDE Integration
2014/11/02  20:05    <DIR>          Impact Analysis
2014/11/02  20:05    <DIR>          Interoperability
2014/11/02  20:05    <DIR>          Modeling Toolset
2014/11/02  20:05    <DIR>          Object Relational Mapping
2014/11/02  20:05    <DIR>          Plug-in Development
2014/11/02  20:05    <DIR>          Process Simulation
2014/11/02  20:05    <DIR>          Project Referencing
2014/11/02  20:06    <DIR>          Reporting
2014/11/02  20:06    <DIR>          Requirements Capturing
2014/11/02  20:06    <DIR>          SoaML Modeling
2014/11/02  20:06    <DIR>          Team Collaboration
2014/11/02  20:06    <DIR>          UML Modeling
2014/11/02  20:06    <DIR>          Use Case Modeling
Above is the deliverable he handed in, 108 MB in total. The task was completed very nicely.
HulkZ has already pushed this code to me; it now lives in the TinySpider project.
While doing this we also discovered that TinySpider could not handle gzip-compressed HTML documents, so support for processing gzip-encoded HTML content was added as well.
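The actual patch is in TinySpider; purely as an illustration of the idea (GzipAwareFetcher and fetchHtml are made-up names, not TinySpider API), handling gzip with the same Commons HttpClient 3.x used above comes down to checking the Content-Encoding response header and wrapping the body stream in a GZIPInputStream:

// Hypothetical sketch of gzip-aware fetching; not the TinySpider implementation.
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import org.apache.commons.httpclient.Header;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

public class GzipAwareFetcher {
    public static String fetchHtml(HttpClient httpClient, String url, String charset) throws Exception {
        GetMethod get = new GetMethod(url);
        // Tell the server we can accept gzip-compressed responses.
        get.addRequestHeader("Accept-Encoding", "gzip");
        try {
            httpClient.executeMethod(get);
            InputStream in = get.getResponseBodyAsStream();
            Header encoding = get.getResponseHeader("Content-Encoding");
            if (encoding != null && encoding.getValue().toLowerCase().contains("gzip")) {
                // Transparently decompress gzip-encoded HTML before decoding it.
                in = new GZIPInputStream(in);
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toString(charset);
        } finally {
            get.releaseConnection();
        }
    }
}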
For more, see the official site of the Tiny framework, or my blog; I am sure you won't come away empty-handed.
You can also add my QQ to talk directly, or join the Tiny group to interact with other Tiny fans.