How to filter tags using classes in Python and BeautifulSoup?


m4rk_Henry_ftw

I'm trying to scrape images from a website using the BeautifulSoup HTML parser.

Every image on this site has two kinds of image tags: one for the thumbnail and one for the larger image, which only appears when the thumbnail is clicked and expanded. The larger tags contain a class="expanded-image" attribute.

I'm trying to parse the HTML and get the "src" attribute of each expanded image, which holds the image's source URL.

When I run the code, nothing happens: the process finishes without downloading any images. However, when I drop the class filter and pass only the tag name as a parameter, it downloads all the thumbnails.

Here is my code:

import webbrowser, requests, os
from bs4 import BeautifulSoup

def getdata(url):
    r = requests.get(url)
    return r.text

htmldata = getdata('https://boards.4chan.org/a/thread/30814')
soup = BeautifulSoup(htmldata, 'html.parser')

list = []

for i in soup.find_all("img",{"class":"expanded-thumb"}):
    list.append(i['src'].replace("//","https://"))

def download(url, pathname):
    if not os.path.isdir(pathname):
        os.makedirs(pathname)

    filename = os.path.join(pathname, url.split("/")[-1])
    response = requests.get(url, stream=True)

    with open(filename, "wb") as f:
        f.write(response.content)

for a in list:
    download(a,"file")

ludwig vespers

Using "list" as a variable name can cause problems, because it shadows Python's built-in list type. Start with this (replace TEST_4CHAN_URL with whatever thread you want) and combine it with the suggestions in the comments above.

import requests
from bs4 import BeautifulSoup

TEST_4CHAN_URL = "https://boards.4chan.org/a/thread/<INSERT_THREAD_ID_HERE>"

def getdata(url):
    r = requests.get(url)
    return r.text

htmldata = getdata(TEST_4CHAN_URL)
soup = BeautifulSoup(htmldata, "html.parser")

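# Collect the link targets of the thumbnail anchors instead of <img> tags
src_list = []

for i in soup.find_all("a", {"class":"fileThumb"}):
    # The code assumes each href starts with "//", so prepend the scheme
    src_list.append(i['href'].replace("//", "https://"))

print(src_list)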
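
If you want to feed these URLs into the download step from the question, the string replace is a bit fragile: it would mangle a URL that already starts with "https://". Below is a minimal sketch of a sturdier alternative using only the standard library. The helper names to_absolute and target_path, and the example.com URLs, are made up for illustration and are not part of the original code.

```python
import os
from urllib.parse import urljoin, urlparse


def to_absolute(src, page_url):
    """Resolve a protocol-relative or relative URL against the page URL.

    URLs that are already absolute are returned unchanged.
    """
    return urljoin(page_url, src)


def target_path(url, folder):
    """Build a local file path from the last segment of the URL path.

    Any query string is ignored, so it does not end up in the filename.
    """
    filename = os.path.basename(urlparse(url).path)
    return os.path.join(folder, filename)


# Hypothetical values, for illustration only
page = "https://boards.example.com/a/thread/1"
url = to_absolute("//i.example.com/a/12345.jpg", page)
print(url)
print(target_path(url, "file"))
```

In the loop above you could then append to_absolute(i['href'], TEST_4CHAN_URL) instead of calling replace, and use target_path(url, pathname) where the question's download function builds its filename.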
