How to use Beautiful Soup to extract string in <script> tag?

dilbert2010 · Feb 6, 2018

Hi there,

For a little program I want to fetch data about various WordPress plugins: to be concrete, it is about 50 plugins,
each of which has its own URL (see below).

The following data are needed: "Version", "Active installations" and "Tested up to".

This is for a list of WordPress plugins (approx. 50 plugins are of interest):

https://wordpress.org/plugins/wp-job-manager
https://wordpress.org/plugins/ninja-forms
https://wordpress.org/plugins/participants-database and so on and so forth.

These plugins are listed in my favorites, so one approach would be to log in and parse all those favorites pages with BS4.
Otherwise I can loop through a set of URLs to fetch all the necessary pages.
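
A rough sketch of the URL-loop idea, using only the standard library. The names plugin_url, download and fetch_pages are made up for illustration, and the User-Agent string is just a placeholder. The slugs are the three from the list above, and the remaining plugins would have to be added by hand.

```python
from urllib.request import Request, urlopen

BASE_URL = "https://wordpress.org/plugins/"

# Slugs taken from the URLs listed above; extend with the remaining plugins.
SLUGS = ["wp-job-manager", "ninja-forms", "participants-database"]


def plugin_url(slug):
    """Build the plugin page URL for a slug."""
    return BASE_URL + slug


def download(url):
    """Fetch one page and return its HTML as text (performs a network request)."""
    # The User-Agent value is an arbitrary placeholder, not a required value.
    request = Request(url, headers={"User-Agent": "plugin-stats-script"})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_pages(slugs, fetch=download):
    """Return a dict mapping each slug to the HTML of its plugin page."""
    return {slug: fetch(plugin_url(slug)) for slug in slugs}
```

Calling fetch_pages(SLUGS) would download the three pages. The fetch parameter is there so the loop can be tried out with a stand-in function instead of hitting the network.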

I need the data from the following three lines.

See for example:

https://wordpress.org/plugins/wp-job-manager

Version: 1.29.3
Active installations: 100,000+
Tested up to: 4.9.4

We could solve this task with methods other than BeautifulSoup alone, for example with BS plus regular expressions.

Assuming we are able to do this with a regular expression, we first need to locate the right elements in the HTML. In the markup below the values sit in <li> items rather than in a <script> tag. The idea is to define one regular expression that is used both for picking out the matching elements found by BeautifulSoup and for extracting the above mentioned text:

import re

from bs4 import BeautifulSoup

data = """


		<li>Version: <strong>1.29.3</strong></li>
		<li>
			Last updated: <strong><span>6 days</span> ago</strong>			</li>
		<li>Active installations: <strong>100,000+</strong></li>

						<li>
			Requires WordPress Version:<strong>4.3.1</strong>				</li>

					<li>Tested up to: <strong>4.9.4</strong></li>

"""
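# Match only the three labels we need; the value is in the <strong> that follows.
# The ^\s* anchor keeps "Requires WordPress Version:" from matching "Version".
pattern = re.compile(r'^\s*(Version|Active installations|Tested up to):')
soup = BeautifulSoup(data, "html.parser")

results = {}
for li in soup.find_all("li"):
    match = pattern.search(li.get_text())
    if match and li.strong:
        # Key is the label, value is the text of the <strong> element.
        results[match.group(1)] = li.strong.get_text(strip=True)

print(results)

Prints: {'Version': '1.29.3', 'Active installations': '100,000+', 'Tested up to': '4.9.4'}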

Finally, I want to store the text in a database or a spreadsheet, so it would be great if we could get it in CSV format or in an array that can be stored in a DB.
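
For the CSV part, something along these lines might work with the standard csv module. This is only a sketch: rows_to_csv and the "plugin" column are names invented here, and the single row just repeats the wp-job-manager values shown above.

```python
import csv
import io

# Column order for the output; "plugin" is an assumed extra column for the slug.
FIELDS = ["plugin", "Version", "Active installations", "Tested up to"]


def rows_to_csv(rows):
    """Serialize a list of dicts (one per plugin) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# One row, using the sample values from the wp-job-manager page above.
rows = [
    {
        "plugin": "wp-job-manager",
        "Version": "1.29.3",
        "Active installations": "100,000+",
        "Tested up to": "4.9.4",
    }
]

csv_text = rows_to_csv(rows)
print(csv_text)
```

Note that the csv module quotes "100,000+" because the value contains a comma, so it stays a single column. The same list of dicts could also be handed to a database layer instead of being written out as CSV.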

Here we are using a simple regular expression for the text. We could go further and be stricter about it, but I doubt that would be practically necessary for this problem.
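
If stricter matching were wanted, one possibility is to validate each extracted value against its own pattern. The patterns below are assumptions inferred only from the sample values above (1.29.3, 100,000+, 4.9.4). Real pages may show other formats, so treat this as a sketch; is_valid is a made-up helper name.

```python
import re

# Value formats guessed from the sample page only; adjust if other forms appear.
STRICT_PATTERNS = {
    "Version": re.compile(r"\d+(?:\.\d+)*"),
    "Active installations": re.compile(r"\d{1,3}(?:,\d{3})*\+"),
    "Tested up to": re.compile(r"\d+(?:\.\d+)*"),
}


def is_valid(label, value):
    """Return True if value fully matches the strict pattern for label."""
    pattern = STRICT_PATTERNS.get(label)
    return bool(pattern and pattern.fullmatch(value))


print(is_valid("Version", "1.29.3"))
print(is_valid("Version", "6 days"))
```

Using fullmatch rather than search means a value with trailing junk is rejected instead of being partly accepted.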

so i have to refine this a bit...