How to find all comments with Beautiful Soup
You can pass a function to find_all() that checks whether each string is a Comment.
For example, given the HTML below:
<body>
<!-- Branding and main navigation -->
<div class="Branding">The Science & Safety Behind Your Favorite Products</div>
<div class="l-branding">
<p>Just a brand</p>
</div>
<!-- test comment here -->
<div class="block_content">
<a href="https://www.google.com">Google</a>
</div>
</body>
Code:
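from bs4 import BeautifulSoup as BS
from bs4 import Comment
# .... (html holds the markup shown above)
soup = BS(html, 'html.parser')
# match only strings that are Comment objects
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for c in comments:
    print(c)
    print("============")
    c.extract()  # remove the comment from the tree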
The output would be:
Branding and main navigation
============
test comment here
============
BTW, I think the reason why find_all('Comment') doesn't work is this (from the Beautiful Soup documentation):
Pass in a value for name and you’ll tell Beautiful Soup to only consider tags with certain names. Text strings will be ignored, as will tags whose names that don’t match.
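To make the difference concrete, here is a minimal sketch contrasting the two calls. The sample markup and the variable names by_name and by_type are made up for illustration; only find_all() and Comment come from the answer above.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample markup, not taken from the original question
html = "<body><!-- first --><p>text</p><!-- second --></body>"
soup = BeautifulSoup(html, "html.parser")

# A plain string argument is treated as a tag name, so this looks for <Comment> tags
by_name = soup.find_all("Comment")

# A function passed as string= is called on every string in the tree
by_type = soup.find_all(string=lambda text: isinstance(text, Comment))

print(by_name)
print([str(c) for c in by_type])
```

The first call finds nothing because no tag is named Comment, while the second returns both comment strings, including their surrounding spaces.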
How to find the comment tag !--...-- with BeautifulSoup?
Pyparsing allows you to search for HTML comments using the built-in htmlComment expression, and to attach parse-time callbacks that validate and extract the various data fields within the comment:
from pyparsing import makeHTMLTags, oneOf, withAttribute, Word, nums, Group, htmlComment
import calendar
# have pyparsing define tag start/end expressions for the
# tags we want to look for inside the comments
span,spanEnd = makeHTMLTags("span")
i,iEnd = makeHTMLTags("i")
# only want spans with class=titlefont
span.addParseAction(withAttribute(**{'class':'titlefont'}))
# define what specifically we are looking for in this comment
weekdayname = oneOf(list(calendar.day_name))
integer = Word(nums)
dateExpr = Group(weekdayname("day") + integer("daynum"))
commentBody = '<!--' + span + i + dateExpr("date") + iEnd
# define a parse action to attach to the standard htmlComment expression,
# to extract only what we want (or raise a ParseException in case
# this is not one of the comments we're looking for)
def grabCommentContents(tokens):
    return commentBody.parseString(tokens[0])
htmlComment.addParseAction(grabCommentContents)
# let's try it
htmlsource = """
want to match this one
<!-- <span class="titlefont"> <i>Wednesday 110518</i>(05:00PM)<br /></span> -->
don't want the next one, wrong span class
<!-- <span class="bodyfont"> <i>Wednesday 110519</i>(05:00PM)<br /></span> -->
not even a span tag!
<!-- some other text with a date in italics <i>Wednesday 110520</i>(05:00PM)<br /></span> -->
another matching comment, on a different day
<!-- <span class="titlefont"> <i>Thursday 110521</i>(05:00PM)<br /></span> -->
"""
for comment in htmlComment.searchString(htmlsource):
    parsedDate = comment.date
    # date info can be accessed like elements in a list
    print(parsedDate[0], parsedDate[1])
    # because we named the expressions within the dateExpr Group
    # we can also get at them by name (this is much more robust, and
    # easier to maintain/update later)
    print(parsedDate.day)
    print(parsedDate.daynum)
    print()
Prints:
Wednesday 110518
Wednesday
110518
Thursday 110521
Thursday
110521
Use BeautifulSoup to extract table data from comments
This gives you a list of all comments, because .find_all returns a list of items:
comment = stats_page.find_all(text=lambda text:isinstance(text, bs4.Comment))
You cannot create a soup object from a list, and comment above is a list:
data = BeautifulSoup(comment,'lxml')
That is why you get that error.
You have to iterate over comment to find the one you are looking for and convert that single comment to a soup object. Then you can extract your data.
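The steps described above can be sketched as follows. The sample markup, the table id "stats", and the rows variable are assumptions made up for this illustration, since the original page is not shown; the sketch also uses html.parser so that it has no extra dependency.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical page: the table we want is hidden inside an HTML comment
html = """
<div id="all_stats">
<!--
<table id="stats"><tr><td>Alice</td><td>10</td></tr><tr><td>Bob</td><td>7</td></tr></table>
-->
</div>
<!-- unrelated note -->
"""

stats_page = BeautifulSoup(html, "html.parser")
comments = stats_page.find_all(string=lambda text: isinstance(text, Comment))

rows = []
for comment in comments:
    # Skip comments that cannot contain the table
    if "<table" not in comment:
        continue
    # Parse one comment (a string), not the whole list
    data = BeautifulSoup(str(comment), "html.parser")
    table = data.find("table", id="stats")
    if table is None:
        continue
    for tr in table.find_all("tr"):
        rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

print(rows)
```

Parsing each comment separately keeps the second soup small and avoids passing a list to the BeautifulSoup constructor.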
Extracting Text Between HTML Comments with BeautifulSoup
You just need to iterate through all of the available comments, check whether each one is one of your required entries, and then display the text of the element that follows it:
from bs4 import BeautifulSoup, Comment
html = """
<html>
<body>
<p>p tag text</p>
<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if comment in ['UNIQUE COMMENT', 'SECOND UNIQUE COMMENT']:
        # next_element is the text node right after the comment
        print(comment.next_element.strip())
This would display the following:
I would like to get this text
I would also like to find this text
Beautiful Soup 4: Remove comment tag and its content
You can use extract() (solution is based on this answer):
PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted.
from bs4 import BeautifulSoup, Comment
data = """<div class="foo">
cat dog sheep goat
<!--
<p>test</p>
-->
</div>"""
soup = BeautifulSoup(data, 'html.parser')
div = soup.find('div', class_='foo')
# calling a tag is shorthand for find_all() on that tag
for element in div(string=lambda text: isinstance(text, Comment)):
    element.extract()
print(soup.prettify())
As a result you get your div without comments:
<div class="foo">
cat dog sheep goat
</div>
How can I strip comment tags from HTML using BeautifulSoup?
I am still trying to figure out why it doesn't find and strip tags like this: <!-- //-->. Those slashes cause certain tags to be overlooked.
This may be a problem with the underlying SGML parser used by BeautifulSoup 3: see http://www.crummy.com/software/BeautifulSoup/documentation.html#Sanitizing%20Bad%20Data%20with%20Regexps. You can work around it with a markupMassage regex, straight from the docs:
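import re, copy
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; the bs4 constructor no longer accepts markupMassage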
myMassage = [(re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1))]
myNewMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
myNewMassage.extend(myMassage)
BeautifulSoup(badString, markupMassage=myNewMassage)
# Foo<!--This comment is malformed.-->Bar<br />Baz
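With bs4, where the constructor does not accept markupMassage, a similar repair can be sketched by fixing the markup with the same regex before parsing. This is only an illustration under assumptions: the bad string is a made-up sample of a malformed comment opener, and the variable names are hypothetical.

```python
import re
from bs4 import BeautifulSoup, Comment

# Hypothetical malformed input: the comment opens with "<!-" instead of "<!--"
bad = "Foo<!-This comment is malformed.-->Bar<br />Baz"

# Same pattern as the massage regex above, applied before parsing
fixed = re.sub(r"<!-([^-])", r"<!--\1", bad)

soup = BeautifulSoup(fixed, "html.parser")
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
comment_texts = [str(c) for c in comments]

# Strip the now-recognized comments from the tree
for c in comments:
    c.extract()

remaining_text = soup.get_text()
print(fixed)
print(comment_texts)
print(remaining_text)
```

The regex only touches openers that have a single dash, so well-formed comments pass through unchanged.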