URL
This covers how to load HTML documents from a list of URLs into a document format that we can use downstream.
from langchain_community.document_loaders import UnstructuredURLLoader
urls = [
"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023",
"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023",
]
Pass in ssl_verify=False with headers=headers to get past ssl_verification error.
loader = UnstructuredURLLoader(urls=urls)
data = loader.load()
Selenium URL Loader
This covers how to load HTML documents from a list of URLs using the
SeleniumURLLoader.
Using selenium allows us to load pages that require JavaScript to render.
Setup
To use the SeleniumURLLoader, you will need to install selenium and
unstructured.
from langchain_community.document_loaders import SeleniumURLLoader
urls = [
"https://www.youtube.com/watch?v=dQw4w9WgXcQ",
"https://goo.gl/maps/NDSHwePEyaHMFGwh8",
]
loader = SeleniumURLLoader(urls=urls)
data = loader.load()
Playwright URL Loader
This covers how to load HTML documents from a list of URLs using the
PlaywrightURLLoader.
As in the Selenium case, Playwright allows us to load pages that need JavaScript to render.
Setup
To use the PlaywrightURLLoader, you will need to install playwright
and unstructured. Additionally, you will need to install the
Playwright Chromium browser:
# Install playwright
%pip install --upgrade --quiet "playwright"
%pip install --upgrade --quiet "unstructured"
!playwright install
from langchain_community.document_loaders import PlaywrightURLLoader
urls = [
"https://www.youtube.com/watch?v=dQw4w9WgXcQ",
"https://goo.gl/maps/NDSHwePEyaHMFGwh8",
]
loader = PlaywrightURLLoader(urls=urls, remove_selectors=["header", "footer"])
data = loader.load()