HTML Parser – Extract HTML Information with Ease

Posted on Tuesday, September 10, 2019, 5:42:09 PM

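Hello Coders,

This article presents a few practical code snippets for extracting and processing HTML information. The following topics are covered: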

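Load the HTML
Scan the file for assets: images, JavaScript files, CSS files
Change the path of an existing asset
Update an existing element: change the src attribute of an image
Locate an element by its id
Remove an element from the DOM tree
Process an existing component: remove hard-coded text
Save the processed HTML to a file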

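The source code provided here is extracted from an HTML parser used in production to parse flat HTML themes and convert them into PUG, Jinja, and Blade themes and components.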

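What is an HTML Parser

According to Wikipedia, parsing, or syntactic analysis, is the process of analyzing a string of symbols, in either natural language or computer languages, according to the rules of a formal grammar. As applied here, HTML parsing means loading the HTML, extracting and processing the relevant information (the head title, page assets, main sections) and, later, saving the processed file.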
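As a rough illustration of that definition, the sketch below parses a small page and pulls out the head title and the asset paths. The sample markup, the `SAMPLE_HTML` name, and the `assets` dictionary are assumptions made up for this example, not part of the production parser.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used only for illustration
SAMPLE_HTML = """
<html>
  <head>
    <title>Demo page</title>
    <link rel="stylesheet" href="css/app.css">
  </head>
  <body>
    <img src="images/logo.png">
    <script src="js/app.js"></script>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Extract the head title
page_title = soup.title.string

# Collect asset paths by tag type; attr=True keeps only tags
# that actually carry that attribute
assets = {
    "css": [link["href"] for link in soup.find_all("link", href=True)],
    "js": [script["src"] for script in soup.find_all("script", src=True)],
    "img": [img["src"] for img in soup.find_all("img", src=True)],
}

print(page_title)
print(assets)
```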

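The Parser Environment

The code uses the BeautifulSoup library, a well-known parsing library written in Python. To start coding, we need to install a few modules on our system.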

$ pip install ipython         # the console where we execute the code
$ pip install requests        # a library to pull the entire HTML page
$ pip install beautifulsoup4  # the real magic is here (provides the bs4 module)

Load the HTML content

The file is loaded like any other file, and its content is injected into a BeautifulSoup object.

from bs4 import BeautifulSoup as bs

# Load the HTML content
html_file = open('index.html', 'r')
html_content = html_file.read()
html_file.close()  # clean up

# Initialize the BS object
soup = bs(html_content, 'html.parser')

# At this point, we can interact with the HTML
# elements stored in memory using all helpers offered by BS library
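The same loading step can be wrapped in a small helper that closes the file automatically and sets the encoding explicitly. This is a minimal sketch: the `load_soup` name and the UTF-8 default are assumptions for illustration, and the demo file is a throwaway created in a temporary directory.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def load_soup(path, encoding="utf-8"):
    """Read an HTML file and return it as a BeautifulSoup object."""
    # The with-block closes the file even if reading fails
    with open(path, "r", encoding=encoding) as html_file:
        return BeautifulSoup(html_file.read(), "html.parser")


# Demo on a throwaway file (hypothetical content)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "index.html"
    demo_path.write_text(
        "<html><head><title>Hello</title></head><body></body></html>",
        encoding="utf-8",
    )
    soup = load_soup(demo_path)

print(soup.title.string)
```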

Parse the HTML for assets

At this point, the DOM tree is loaded in the BeautifulSoup object. Let's scan the DOM tree for JavaScript files, i.e. the script nodes:

...    ...

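The snippet that locates the JavaScript files takes only a few lines of code. The BS library returns a list of objects, and we can easily mutate each script node: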

for script in soup.body.find_all('script', recursive=False):
    # Print the src attribute
    print(' JS source = ' + script['src'])
    # Print the type attribute
    print(' JS type = ' + script['type'])
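Two details of the loop above are worth a quick sketch: `recursive=False` restricts the search to direct children of the body, and inline scripts have no src attribute, so subscript access would raise a KeyError on them. The sample markup below is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one nested script, one top-level script, one inline script
html = """
<html><body>
  <div><script src="js/nested.js"></script></div>
  <script src="js/jquery.min.js" type="text/javascript"></script>
  <script>console.log("inline");</script>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Direct children of <body> only; .get() returns None for a missing attribute
top_level = [s.get("src") for s in soup.body.find_all("script", recursive=False)]

# Every script in the document that has a src attribute
external = [s["src"] for s in soup.find_all("script", src=True)]

print(top_level)
print(external)
```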

In a similar way, we can select and process the CSS nodes:

...
<link rel="stylesheet" href="css/bootstrap.min.css">
<link rel="stylesheet" href="css/app.css">
...

And the code:

for link in soup.find_all('link'):
    # Print the href attribute
    print(' CSS file = ' + link['href'])
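Not every link node is a stylesheet (favicons use the same tag), so a stricter scan can check the rel attribute as well. The following is a sketch on hypothetical markup; BeautifulSoup returns rel as a list of values, which is why the check uses `in`.

```python
from bs4 import BeautifulSoup

# Hypothetical head section with a favicon and two stylesheets
html = """
<head>
  <link rel="icon" href="favicon.ico">
  <link rel="stylesheet" href="css/bootstrap.min.css">
  <link rel="stylesheet" href="css/app.css">
</head>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only <link> nodes that have an href and whose rel includes "stylesheet"
stylesheets = [
    link["href"]
    for link in soup.find_all("link", href=True)
    if "stylesheet" in link.get("rel", [])
]

print(stylesheets)
```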

Parse the HTML for images

In this snippet, we mutate the image nodes and change their src attribute:

...  ...

for img in soup.body.find_all('img'):
    # Print the path
    print(' IMG src = ' + img['src'])

    img_path = img['src']
    img_file = img_path.split('/')[-1]  # extract the last segment, aka the image file

    # the new path is set
    img['src'] = '/assets/img/' + img_file
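The path rewrite can be made a little more defensive. The sketch below skips images without a src and drops any query string before taking the file name; the `NEW_PREFIX` constant and the sample markup are assumptions for illustration, and discarding the query string is a design choice that may not suit every theme.

```python
import posixpath
from urllib.parse import urlsplit

from bs4 import BeautifulSoup

# Hypothetical markup: plain path, path with a query string, image without src
html = (
    '<body><img src="images/pic01.jpg">'
    '<img src="../media/logo.png?v=2">'
    '<img alt="no source"></body>'
)
soup = BeautifulSoup(html, "html.parser")

NEW_PREFIX = "/assets/img/"  # assumed target directory

# src=True skips <img> nodes that have no src attribute
for img in soup.body.find_all("img", src=True):
    # Take only the path component, then its last segment (the file name)
    file_name = posixpath.basename(urlsplit(img["src"]).path)
    img["src"] = NEW_PREFIX + file_name

new_paths = [img.get("src") for img in soup.find_all("img")]
print(new_paths)
```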

Locate an element by ID

This can be done with a single line of code. Let's assume we have an element (div or span) with the id 1234:

...
<div id="1234" class="handsome"> Some text </div>

And the code:

mydiv = soup.find("div", {"id": "1234"})
print(mydiv)

# delete the element
mydiv.decompose()
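`find()` returns None when nothing matches, so calling `decompose()` directly fails on pages that lack the element. A minimal sketch with a guard, on hypothetical markup; searching by `id=` alone matches any tag type, div or span.

```python
from bs4 import BeautifulSoup

# Hypothetical markup
html = '<body><div id="1234" class="handsome">Some text</div><p>Keep me</p></body>'
soup = BeautifulSoup(html, "html.parser")

# Match on the id only, regardless of the tag name
target = soup.find(id="1234")
if target is not None:
    target.decompose()  # remove the node and its children from the tree

# A lookup that matches nothing yields None rather than raising
missing = soup.find(id="does-not-exist")

remaining = str(soup.body)
print(remaining)
```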

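Remove hard-coded text

This snippet is useful for component extraction and translation to different template engines. Let's assume we have this simple component: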

<div id="1234" class="cool">
   <span>Html Parsing</span>
   <span>the practical guide</span>
</div>

from bs4 import NavigableString

# locate the div
mydiv = soup.find("div", {"id": "1234"})
print(mydiv)  # print before processing

# iterate on div elements (a snapshot, because the loop edits the tree)
for tag in list(mydiv.descendants):
    # NavigableString is the text inside the tag,
    # not the tag itself
    if not isinstance(tag, NavigableString):
        print('Found tag = ' + tag.name + ' -> ' + tag.text)
        # this will print:
        # Found tag = span -> Html Parsing
        # Found tag = span -> the practical guide

        # tag.text is read-only, so the text is replaced through tag.string
        # replace the text for Php
        tag.string = ''
        # or, instead, replace the text for Jinja
        # tag.string = '{{ title }}'

# mydiv is the processed component;
# php_component is its string representation
php_component = mydiv.prettify(formatter="html")
file = open('component.php', 'w+')
file.write(php_component)
file.close()
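When a component holds several text nodes, each one usually needs its own template variable. The sketch below maps the two spans of the sample component to distinct Jinja placeholders and keeps the original text aside; the variable names `title` and `subtitle` are hypothetical.

```python
from bs4 import BeautifulSoup

html = (
    '<div id="1234" class="cool">'
    "<span>Html Parsing</span>"
    "<span>the practical guide</span>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")
component = soup.find("div", {"id": "1234"})

# Hypothetical variable names, one per span, in document order
placeholders = ["title", "subtitle"]

extracted = {}
for name, span in zip(placeholders, component.find_all("span")):
    extracted[name] = span.get_text()   # keep the hard-coded text
    span.string = "{{ " + name + " }}"  # swap in the Jinja placeholder

jinja_component = str(component)
print(jinja_component)
print(extracted)
```

The `extracted` dictionary can then serve as the default context when the template is rendered.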

Save the new HTML

We now have the mutated DOM in memory, in the BeautifulSoup object. To save the content to a new file, we call prettify() and write the result to a new HTML file.

new_dom_content = soup.prettify(formatter="html")
file = open('index_parsed.html', 'w+')
file.write(new_dom_content)
file.close()
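A variation on the save step, sketched under a few assumptions: the output is written as UTF-8 and read back to confirm that non-ASCII text survives. The file lives in a temporary directory only so the example cleans up after itself. Note that prettify() adds whitespace between nodes, so `str(soup)` may be the safer choice when the exact spacing of inline elements matters.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical document containing a non-ASCII character
soup = BeautifulSoup("<html><body><p>Caf\u00e9</p></body></html>", "html.parser")

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "index_parsed.html"

    # Write with an explicit encoding instead of the platform default
    out_path.write_text(soup.prettify(), encoding="utf-8")

    # Read the file back and parse it again as a round-trip check
    reloaded = BeautifulSoup(out_path.read_text(encoding="utf-8"), "html.parser")

reloaded_text = reloaded.p.get_text(strip=True)
print(reloaded_text)
```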