Data Scrapers Reference
This page documents all 13 Python scrapers used to collect crude oil industry chain data. The scrapers fall into two categories: international energy organization report downloaders (7 scrapers) and domestic Chinese market data collectors (6 scrapers).
Category 1: International Energy Organization Report Downloaders
These scrapers download PDF reports from major international energy bodies using HTTP requests. They all follow a common pattern: construct URLs based on date ranges, download PDFs, validate content, and save with standardized filenames.
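The validate-and-save half of this common pattern can be sketched as follows. This is a minimal illustration rather than the scrapers' actual code: the helper names `looks_like_pdf` and `save_pdf` are hypothetical, and the payload is passed in as bytes so the sketch runs without network access.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF"  # a PDF file begins with these bytes


def looks_like_pdf(content: bytes) -> bool:
    """Return True if the payload starts with the PDF magic bytes."""
    return content.startswith(PDF_MAGIC)


def save_pdf(content: bytes, out_dir: Path, filename: str) -> str:
    """Save a payload; return 'skipped', 'invalid', or 'saved'."""
    target = out_dir / filename
    if target.exists():
        # Skip-if-exists: a real run would do this check before the HTTP request.
        return "skipped"
    if not looks_like_pdf(content):
        # For example, an HTML error page returned with status 200.
        return "invalid"
    out_dir.mkdir(parents=True, exist_ok=True)
    target.write_bytes(content)
    return "saved"
```

Returning a status string instead of raising keeps a long batch run going when a single report is missing or malformed.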
1. EIA Annual Energy Outlook (AEO)
- File: `EIA_Annual_Energy_Outlook年度报告.py`
- Source: U.S. Energy Information Administration (EIA)
- Data collected: Annual Energy Outlook narrative reports (PDF), covering long-term energy projections for the United States
- Date range: 1979 to present
- Authentication: None (public data). Uses browser User-Agent spoofing.
- Output format: PDF files saved as `EIA_{YYYY}_Annual_Energy_Outlook.pdf`
- Output directory: `EIA_Annual_Energy_Outlook年度报告/`
- URLs and endpoint logic:
    - Base domain: `https://www.eia.gov/`
    - 2024+: `outlooks/aeo/pdf/{YYYY}/AEO{YYYY}-narrative.pdf`
    - 2022-2023: `outlooks/archive/aeo{YY}/pdf/AEO{YYYY}_Narrative.pdf`
    - 2021: `outlooks/archive/aeo21/pdf/AEO_Narrative_2021.pdf`
    - 2018-2020: `outlooks/archive/aeo{YY}/pdf/AEO{YYYY}.pdf`
    - 2000-2017: `outlooks/archive/aeo{YY}/pdf/0383({YYYY}).pdf`
    - 1996-1998: `outlooks/archive/aeo{YY}/pdf/0383{YY}.pdf`
    - 1980-1995: `outlooks/archive/aeo{YY}/pdf/0383({YY}).pdf`
    - 1979: `outlooks/archive/aeo79/pdf/0173(79)3.pdf`
- Key features: PDF content validation (checks for `%PDF` magic bytes), 0.5s rate limiting, skip-if-exists logic
- Libraries: `requests`
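The year-dependent URL rules listed above could be expressed as a single lookup function. The sketch below is an illustration only: the function names are hypothetical, and it assumes `{YY}` means the zero-padded two-digit year (for example `05` for 2005). Years with no pattern listed (such as 1999) return `None` instead of a guessed URL.

```python
from typing import Optional

BASE = "https://www.eia.gov/"


def aeo_url(year: int) -> Optional[str]:
    """Map a report year to its AEO PDF URL, or None if no pattern is listed."""
    yy = f"{year % 100:02d}"  # assumption: {YY} is the zero-padded two-digit year
    if year >= 2024:
        path = f"outlooks/aeo/pdf/{year}/AEO{year}-narrative.pdf"
    elif year >= 2022:
        path = f"outlooks/archive/aeo{yy}/pdf/AEO{year}_Narrative.pdf"
    elif year == 2021:
        path = "outlooks/archive/aeo21/pdf/AEO_Narrative_2021.pdf"
    elif year >= 2018:
        path = f"outlooks/archive/aeo{yy}/pdf/AEO{year}.pdf"
    elif year >= 2000:
        path = f"outlooks/archive/aeo{yy}/pdf/0383({year}).pdf"
    elif 1996 <= year <= 1998:
        path = f"outlooks/archive/aeo{yy}/pdf/0383{yy}.pdf"
    elif 1980 <= year <= 1995:
        path = f"outlooks/archive/aeo{yy}/pdf/0383({yy}).pdf"
    elif year == 1979:
        path = "outlooks/archive/aeo79/pdf/0173(79)3.pdf"
    else:
        return None  # no URL pattern is documented for this year
    return BASE + path


def aeo_filename(year: int) -> str:
    """Standardized local filename for a given report year."""
    return f"EIA_{year}_Annual_Energy_Outlook.pdf"
```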
2. EIA Short-Term Energy Outlook (STEO)
- File: `EIA_Short_Term_Energy_Outlook月度报告.py`
- Source: U.S. Energy Information Administration (EIA)
- Data collected: Short-Term Energy Outlook reports (PDF), providing monthly/quarterly energy market forecasts
- Date range: Quarterly from 1983 Q1 to 1997 Q1; monthly from February 1997 to present
- Authentication: None (public data). Uses browser User-Agent spoofing.
- Output format: PDF files saved as:
    - Quarterly: `EIA_{Q}Q{YYYY}_Short_Term_Energy_Outlook.pdf`
    - Monthly: `EIA_{YYYYMM}_Short_Term_Energy_Outlook.pdf`
- Output directory: `EIA_Short_Term_Energy_Outlook月度报告/`
- URLs:
    - Base: `https://www.eia.gov/outlooks/steo/archives/`
    - Quarterly: `{Q}Q{YY}.pdf` (e.g., `1Q83.pdf`)
    - Monthly: `{mon}{YY}.pdf` (e.g., `feb97.pdf`, `mar26.pdf`)
- Key features: Two-phase download (quarterly history then monthly), skip-if-exists logic
- Libraries: `requests`
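The two-phase schedule could be generated as shown below. This is a sketch under stated assumptions, not the script's code: `steo_targets` is a hypothetical name, and the month abbreviations are assumed to be three-letter lowercase English, which the examples only confirm for `feb` and `mar`.

```python
from datetime import date
from typing import Iterator, Tuple

BASE = "https://www.eia.gov/outlooks/steo/archives/"
# Assumption: every month uses a three-letter lowercase abbreviation.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def steo_targets(today: date) -> Iterator[Tuple[str, str]]:
    """Yield (url, local_filename) pairs in download order."""
    # Phase 1: quarterly reports, 1983 Q1 through 1997 Q1.
    for year in range(1983, 1998):
        for quarter in range(1, 5):
            if (year, quarter) > (1997, 1):
                break
            yy = f"{year % 100:02d}"
            yield (f"{BASE}{quarter}Q{yy}.pdf",
                   f"EIA_{quarter}Q{year}_Short_Term_Energy_Outlook.pdf")
    # Phase 2: monthly reports, February 1997 through the current month.
    year, month = 1997, 2
    while (year, month) <= (today.year, today.month):
        yy = f"{year % 100:02d}"
        yield (f"{BASE}{MONTHS[month - 1]}{yy}.pdf",
               f"EIA_{year}{month:02d}_Short_Term_Energy_Outlook.pdf")
        month += 1
        if month > 12:
            year, month = year + 1, 1
```

Passing `today` explicitly keeps the generator deterministic; a caller would typically supply `date.today()`.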
3. IEA World Energy Outlook (WEO)
- File: `IEA-World_Energy_Outlook报告.py`
- Source: International Energy Agency (IEA)
- Data collected: World Energy Outlook annual reports (PDF), the IEA flagship publication on global energy trends
- Date range: 1998 to present
- Authentication: None (public data). Uses browser User-Agent spoofing.
- Output format: PDF files saved as `IEA_{YYYY}_World_Energy_Outlook.pdf`
- Output directory: `IEA-World_Energy_Outlook报告/`
- URLs:
    - Report page: `https://www.iea.org/reports/world-energy-outlook-{YYYY}`
    - PDF download: Extracted dynamically from the report page by parsing HTML for links pointing to `iea.blob.core.windows.net` with `.pdf` extension
- Key features: Two-stage download (fetch HTML page, then parse for real PDF link on Azure Blob Storage), streaming download with progress display, 1.5s rate limiting
- Libraries: `requests`, `beautifulsoup4`
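The second stage, picking the real PDF link out of the report page, might look like the sketch below. Note the substitution: the scraper itself uses `beautifulsoup4`, whereas this illustration uses the standard library's `html.parser` so that it runs without extra dependencies. The names `find_pdf_link` and `_LinkCollector` are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Optional
from urllib.parse import urlparse

BLOB_HOST = "iea.blob.core.windows.net"


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_pdf_link(page_html: str) -> Optional[str]:
    """Return the first link hosted on the blob store that ends in .pdf."""
    collector = _LinkCollector()
    collector.feed(page_html)
    for href in collector.hrefs:
        parsed = urlparse(href)
        if parsed.netloc == BLOB_HOST and parsed.path.lower().endswith(".pdf"):
            return href
    return None
```

Matching on the parsed host and path, rather than on a substring of the whole URL, avoids false positives from links that merely mention the blob host in a query string.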
4. IEA Oil Market Report (OMR)
- File: `oil_market_report.py`
- Source: International Energy Agency (IEA)
- Data collected: Oil Market Report monthly publications (PDF), providing detailed oil supply/demand analysis
- Date range: January 2017 to present
- Authentication: None (public data). Uses browser User-Agent spoofing.
- Output format: PDF files saved as `IEA_{YYYYMM}_oil_market_report.pdf`
- Output directory: `IEA-oil_market_report月度报告/`
- URLs:
    - Report page: `https://www.iea.org/reports/oil-market-report-{month}-{year}` (e.g., `oil-market-report-january-2024`)
    - PDF download: Dynamically extracted from HTML, targeting `iea.blob.core.windows.net` links
- Key features: Same two-stage approach as WEO scraper, streaming download, 1.5s rate limiting
- Libraries: `requests`, `beautifulsoup4`
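As a small illustration of the report-page URL and filename conventions above, the following sketch builds both from a year and month. The function names are hypothetical, and the `{month}` placeholder is assumed to be the full lowercase English month name, as in the `january` example.

```python
from typing import Iterator, Tuple

# Assumption: {month} is the full lowercase English month name.
MONTH_NAMES = ["january", "february", "march", "april", "may", "june",
               "july", "august", "september", "october", "november", "december"]


def omr_page_url(year: int, month: int) -> str:
    """Report page URL for a given year and month (1-12)."""
    return f"https://www.iea.org/reports/oil-market-report-{MONTH_NAMES[month - 1]}-{year}"


def omr_filename(year: int, month: int) -> str:
    """Standardized local filename, IEA_{YYYYMM}_oil_market_report.pdf."""
    return f"IEA_{year}{month:02d}_oil_market_report.pdf"


def omr_months(end_year: int, end_month: int) -> Iterator[Tuple[int, int]]:
    """Yield (year, month) from January 2017 through the given end month."""
    year, month = 2017, 1
    while (year, month) <= (end_year, end_month):
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1
```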
5. OPEC Monthly Oil Market Report (MOMR)
- File: `OPEC_-oil_market_report月度报告.py`
- Source: Organization of the Petroleum Exporting Countries (OPEC)
- Data collected: Monthly Oil Market Report (MOMR) PDFs, covering OPEC's analysis of global oil supply, demand, and prices
- Date range: January 2001 to present
- Authentication: None. Uses browser User-Agent spoofing. SSL verification disabled to bypass certificate issues.
- Output format: PDF files saved as `OPEC_{YYYYMM}_oil_market_report.pdf`
- Output directory: `OPEC-oil market月度报告/`
- URLs:
    - Template: `https://www.opec.org/assets/assetdb/momr-{month}-{year}.pdf`
    - Example: `momr-january-2024.pdf`
- Key features: Session-based requests with persistent headers, SSL verification disabled (`verify=False`), PDF magic byte validation, 403 detection with retry guidance
- Libraries: `requests`, `urllib3`
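A hedged sketch of how the session setup and response checks described above might fit together. The function names, the status labels, and the placeholder User-Agent string are all assumptions made for illustration. `requests` and `urllib3` are imported lazily inside `make_session` so the pure helper functions can be used without them installed.

```python
MONTH_NAMES = ["january", "february", "march", "april", "may", "june",
               "july", "august", "september", "october", "november", "december"]


def momr_url(year: int, month: int) -> str:
    """Build the MOMR URL from the documented template."""
    return f"https://www.opec.org/assets/assetdb/momr-{MONTH_NAMES[month - 1]}-{year}.pdf"


def make_session():
    """Create a session with persistent headers and SSL verification disabled."""
    import requests
    import urllib3

    # Silence the warning that every verify=False request would otherwise emit.
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    session = requests.Session()
    session.headers.update({"User-Agent": "Mozilla/5.0"})  # placeholder UA string
    session.verify = False  # applies to every request made through this session
    return session


def classify_response(status_code: int, head: bytes) -> str:
    """Classify a response from its status code and first bytes."""
    if status_code == 403:
        return "blocked"  # caller should print retry guidance and back off
    if status_code == 404:
        return "missing"
    if status_code != 200:
        return "error"
    if not head.startswith(b"%PDF"):
        return "not_pdf"  # status 200 but the body is not a PDF
    return "ok"
```

A caller would do something like `resp = make_session().get(momr_url(2024, 1))` and then pass `resp.status_code` and `resp.content[:4]` to `classify_response`. Disabling certificate verification removes protection against tampered responses, which is one more reason to keep the magic-byte check.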
6. OPEC Annual Statistical Bulletin (ASB)
- File: `OPEC_ABS年度报告.py`
- Source: Organization of the Petroleum Exporting Countries (OPEC)
- Data collected: Annual Statistical Bulletin (ASB) PDFs, containing comprehensive OPEC statistical data on oil reserves, production, exports, and revenues
- Date range: 1999 to present
- Authentication: None. SSL verification disabled.
- Output format: PDF files saved as `OPEC_{YYYY}_ASB_report.pdf`
- Output directory: `OPEC-ASB年度报告/`
- URLs:
    - Template: `https://www.opec.org/assets/assetdb/asb-{YYYY}.pdf`
- Key features: Simple year-based URL construction, SSL bypass, no User-Agent required
- Libraries: `requests`, `urllib3`
7. OPEC World Oil Outlook (WOO)
- File: `OPEC_World_Oil_Outlook年度报告.py`
- Source: Organization of the Petroleum Exporting Countries (OPEC)
- Data collected: World Oil Outlook annual reports (PDF), providing OPEC's long-term outlook on oil supply, demand, and energy transitions
- Date range: 2007 to present
- Authentication: None. Uses browser User-Agent spoofing. SSL verification disabled.
- Output format: PDF files saved as `OPEC_{YYYY}_World_Oil_Outlook_report.pdf`
- Output directory: `OPEC-World_Oil_Outlook年度报告/`
- URLs:
    - Template: `https://www.opec.org/assets/assetdb/woo-{YYYY}.pdf`
- Key features: PDF magic byte validation, 403/404 handling, SSL bypass
- Libraries: `requests`, `urllib3`
Category 2: Domestic Chinese Market Data Collectors
These scrapers use Selenium WebDriver for browser automation. They target Chinese financial institutions and commodity exchanges, where data is rendered dynamically via JavaScript and not available through simple HTTP requests.
8. CCB Futures Crude Oil Daily Report
- File: `建信期货/建信期货-原油日报.py`
- Source: CCB Futures (建信期货) -- a subsidiary of China Construction Bank
- Data collected: Daily crude oil analysis reports (PDF), covering daily market commentary and price analysis
- Date range: 2018 to present
- Authentication: None (public website). Selenium automates a real Chrome browser so that JavaScript-rendered content is available.
- Output format: PDF files saved as `原油日报{YYYYMMDD}.pdf`
- Output directory: `原油日报/` (relative to script location)
- Target URL: `https://www.ccbfutures.com/main/research/research_report/daily_paper/index.shtml`
- Key features:
    - Selenium-based with ChromeDriver (`chromedriver.exe`)
    - Automatic pagination (up to 3000 pages)
    - Link text matching: looks for `<a>` tags containing both "原油" (crude oil) and one of ["日评" (daily review), "日报" (daily report), "早评" (morning review), "晚评" (evening review)]
    - Date extraction from link text (8-digit YYYYMMDD pattern)
    - Deduplication against local files before downloading
    - File download via browser click, with download completion detection (monitors filesystem for new files, waits for `.crdownload` to finish)
    - Automatic file renaming to standardized format
- Libraries: `selenium`
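The link matching, date extraction, and deduplication steps above can be illustrated without a browser, since they operate only on link text and a set of local filenames. The following is a sketch with hypothetical function names; in the real scraper the link text would come from Selenium `<a>` elements.

```python
import re
from typing import Optional, Set

CONTENT_KEYWORD = "原油"                        # crude oil
TYPE_KEYWORDS = ("日评", "日报", "早评", "晚评")  # daily/morning/evening report types
# Exactly eight consecutive digits, not embedded in a longer digit run.
DATE_RE = re.compile(r"(?<!\d)(\d{8})(?!\d)")


def is_crude_daily(link_text: str) -> bool:
    """True if the text names a crude oil report of one of the daily types."""
    return CONTENT_KEYWORD in link_text and any(k in link_text for k in TYPE_KEYWORDS)


def extract_date(link_text: str) -> Optional[str]:
    """Return the first YYYYMMDD-style digit run, or None."""
    match = DATE_RE.search(link_text)
    return match.group(1) if match else None


def target_filename(link_text: str, existing: Set[str]) -> Optional[str]:
    """Return the filename to download to, or None if the link should be skipped."""
    if not is_crude_daily(link_text):
        return None
    report_date = extract_date(link_text)
    if report_date is None:
        return None
    name = f"原油日报{report_date}.pdf"
    # Deduplicate against files already present in the output directory.
    return None if name in existing else name
```

In use, `existing` would be built once per run, for instance from `os.listdir` on the output directory, so each link check is a constant-time set lookup.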
9. CCB Futures Crude Oil Monthly Report
- File: `建信期货/建信期货-原油月报.py`
- Source: CCB Futures (建信期货)
- Data collected: Monthly crude oil strategy/analysis reports (PDF)
- Date range: Historical to present
- Authentication: None. Selenium browser automation.
- Output format: PDF files saved as `原油月报_{YYYYMMDD}.pdf`
- Output directory: `原油月报/` (relative to script location)
- Target URL: `https://www.ccbfutures.com/main/research/research_report/month_paper/index.shtml`
- Key features:
    - Enhanced "microscope debugging" mode that prints raw byte representations of link text to diagnose hidden characters
    - Aggressive date extraction: scans for 6-digit (YYYYMM) patterns first, then 8-digit, then delimiter-separated dates
    - Filters out daily and weekly reports to isolate monthly publications
    - Content keywords: ["原油" (crude oil), "能化" (energy and chemicals), "能源化工" (energy and chemical industry), "化工" (chemicals), "策略" (strategy), "投资" (investment), "大宗" (bulk commodities)]
    - Type keywords: ["月报" (monthly report), "月评" (monthly review), "月度" (monthly), "观点" (views), "策略" (strategy)]
    - Pagination up to 500 pages
- Libraries: `selenium`
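The prioritized date extraction and the byte-level debugging aid might be sketched as follows. This is an illustration under assumptions, not the script's code: the function names are hypothetical, the accepted delimiters (`-`, `/`, `.`) are a guess, and the 6-digit pattern is written to match only runs of exactly six digits so that it does not claim the first six digits of an 8-digit date.

```python
import re
from typing import Optional, Tuple

# Each pattern refuses to match inside a longer run of digits.
SIX_DIGIT = re.compile(r"(?<!\d)(\d{4})(0[1-9]|1[0-2])(?!\d)")
EIGHT_DIGIT = re.compile(r"(?<!\d)(\d{4})(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])(?!\d)")
# Assumption: delimiters are "-", "/" or "."; the day part is optional.
DELIMITED = re.compile(r"(?<!\d)(\d{4})[-/.](\d{1,2})(?:[-/.](\d{1,2}))?(?!\d)")


def extract_date(text: str) -> Optional[Tuple[int, int, Optional[int]]]:
    """Return (year, month, day-or-None), trying patterns in priority order."""
    m = SIX_DIGIT.search(text)
    if m:
        return int(m.group(1)), int(m.group(2)), None
    m = EIGHT_DIGIT.search(text)
    if m:
        return int(m.group(1)), int(m.group(2)), int(m.group(3))
    m = DELIMITED.search(text)
    if m and 1 <= int(m.group(2)) <= 12:
        day = int(m.group(3)) if m.group(3) else None
        return int(m.group(1)), int(m.group(2)), day
    return None


def debug_bytes(text: str) -> str:
    """Show the UTF-8 bytes of a string so hidden characters become visible."""
    return repr(text.encode("utf-8"))
```

For example, a zero-width space inside a link title is invisible when printed normally but appears as `\xe2\x80\x8b` in the output of `debug_bytes`.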
10. CCB Futures Energy-Chemical Weekly Report
- File: `建信期货/建信期货-能化周报.py`
- Source: CCB Futures (建信期货)
- Data collected: Weekly energy and chemical industry reports (PDF), covering crude oil and petrochemical market analysis
- Date range: Historical to present
- Authentication: None. Selenium browser automation.
- Output format: PDF files saved as `能化周报_{YYYYMMDD}.pdf`
- Output directory: `能化周报/` (relative to script location)
- Target URL: `https://www.ccbfutures.com/main/research/research_report/week_paper/index.shtml`
- Key features:
    - Content keywords: ["原油" (crude oil), "能源化工" (energy and chemical industry), "能化" (energy and chemicals)]
    - Type keywords: ["周报" (weekly report), "周评" (weekly review), "周度" (weekly)]
    - Pagination up to 500 pages
    - Same download-and-rename pattern as other CCB scrapers
- Libraries: `selenium`
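The download-and-rename pattern shared by the CCB scrapers relies on watching the download directory rather than on any Selenium API. A minimal sketch, assuming hypothetical helper names and default timings, with an injectable `sleep` so the loop can be exercised without real waiting:

```python
import time
from pathlib import Path
from typing import Callable, Optional, Set


def wait_for_download(download_dir: Path, before: Set[str],
                      timeout: float = 60.0, poll: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> Optional[Path]:
    """Wait until a new, fully written file appears; return its path or None."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        new_names = {p.name for p in download_dir.iterdir()} - before
        # Chrome writes to a .crdownload temp file until the transfer finishes.
        in_progress = [n for n in new_names if n.endswith(".crdownload")]
        finished = [n for n in new_names if not n.endswith(".crdownload")]
        if finished and not in_progress:
            return download_dir / sorted(finished)[0]
        sleep(poll)
    return None


def rename_report(path: Path, new_name: str) -> Path:
    """Rename a downloaded file to its standardized name in the same directory."""
    target = path.with_name(new_name)
    path.rename(target)
    return target
```

The `before` snapshot is taken just prior to clicking the link, so files already in the directory are never mistaken for the new download.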
11. Zhejiang Mercantile Exchange (ZME) -- Auction Market Data
- File: `建信期货/浙江国际大宗商品交易中心-拍卖行情.py`
- Source: Zhejiang Mercantile Exchange (浙江国际大宗商品交易中心, zme.com.cn)
- Data collected: Historical auction transaction data for bulk commodities, structured as tabular data
- Columns: 场次编号 (Session ID), 标段编号 (Lot ID), 竞价日期 (Bidding Date), 品种 (Product), 成交量 (Volume), 成交价 (Transaction Price), 状态 (Status)
- Date range: All available historical data (6 pages)
- Authentication: None. Selenium browser automation.
- Output format: Excel file: `ZME_拍卖行情历史数据_6页.xlsx`
- Target URL: `https://www.zme.com.cn/lssj2/index.htm`
- Key features:
    - Scrapes HTML tables (not PDF downloads)
    - Extracts 7-column table data from `<table>` elements
    - Pagination via numbered page links (CSS class `tcdNumber` within `tcdPageCode1` container)
    - Data cleaning: removes navigation text rows, deduplicates
    - Exports to Excel via pandas
- Libraries: `selenium`, `pandas`
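The cleaning step can be illustrated on plain lists of cell text. This sketch makes assumptions the source does not confirm: navigation rows are identified as rows that do not have exactly seven cells, repeated header rows are dropped, and the function name `clean_rows` is hypothetical. It uses only the standard library; the export to Excel is left to pandas.

```python
from typing import Iterable, List, Sequence, Tuple

COLUMNS = ("场次编号", "标段编号", "竞价日期", "品种", "成交量", "成交价", "状态")


def clean_rows(raw_rows: Iterable[Sequence[str]]) -> List[Tuple[str, ...]]:
    """Drop navigation and header rows, strip whitespace, and deduplicate."""
    seen = set()
    cleaned: List[Tuple[str, ...]] = []
    for row in raw_rows:
        cells = tuple(cell.strip() for cell in row)
        if len(cells) != len(COLUMNS):
            continue  # assumption: pagination/navigation rows have a different width
        if cells == COLUMNS:
            continue  # header row repeated on each page
        if cells in seen:
            continue  # same record scraped twice
        seen.add(cells)
        cleaned.append(cells)
    return cleaned
```

The cleaned rows could then be written out with `pandas.DataFrame(cleaned, columns=COLUMNS).to_excel("ZME_拍卖行情历史数据_6页.xlsx", index=False)`, which requires an Excel writer engine such as `openpyxl` to be installed.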
12. Zhejiang Mercantile Exchange (ZME) -- Tender/Bid Market Data
- File: `建信期货/浙江国际大宗商品交易中心-招标行情.py`
- Source: Zhejiang Mercantile Exchange (浙江国际大宗商品交易中心)
- Data collected: Historical tender/bidding transaction data
- Columns: 场次编号 (Session ID), 竞价日期 (Bidding Date), 品种 (Product), 成交量 (Volume), 成交价 (Transaction Price)
- Date range: All available historical data (~20 pages estimated)
- Authentication: None. Selenium browser automation.
- Output format: Excel file: `ZME_招标行情历史数据.xlsx`
- Target URL: `https://www.zme.com.cn/lssj3/index.htm`
- Key features: Same architecture as auction scraper but with 5-column schema. Auto-stops when pagination ends.
- Libraries: `selenium`, `pandas`
13. Zhejiang Mercantile Exchange (ZME) -- Listed/Spot Market Data
- File: `建信期货/浙江国际大宗商品交易中心-挂牌行情.py`
- Source: Zhejiang Mercantile Exchange (浙江国际大宗商品交易中心)
- Data collected: Historical spot/listed commodity transaction data -- the largest dataset in this collection
- Columns: 挂牌截止日期 (Listing Deadline), 品种 (Product), 挂牌价 (Listed Price), 挂牌量 (Listed Volume), 成交金额 (Transaction Amount)
- Date range: All available historical data (1,437 pages)
- Authentication: None. Selenium browser automation.
- Output format: Excel file: `ZME_历史挂牌数据_全量.xlsx`
- Target URL: `https://www.zme.com.cn/lssj/index.htm`
- Key features:
    - Largest scraping job: 1,437 pages of data
    - Anti-ban strategy: pauses for 5 seconds every 100 pages
    - Same table extraction and pagination logic as other ZME scrapers
- Libraries: `selenium`, `pandas`
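The pagination loop with its periodic pause might be structured as below. This is a sketch only: `scrape_all_pages` and the `fetch_page` callback are hypothetical stand-ins for the Selenium page navigation and table extraction, and `sleep` is injectable so the pause logic can be checked without waiting.

```python
import time
from typing import Callable, List, Optional, Tuple

Row = Tuple[str, ...]


def scrape_all_pages(fetch_page: Callable[[int], Optional[List[Row]]],
                     max_pages: int = 1437,
                     pause_every: int = 100,
                     pause_seconds: float = 5.0,
                     sleep: Callable[[float], None] = time.sleep) -> List[Row]:
    """Collect rows page by page, pausing periodically to reduce ban risk."""
    rows: List[Row] = []
    for page in range(1, max_pages + 1):
        page_rows = fetch_page(page)
        if page_rows is None:
            break  # pagination ended earlier than expected: stop automatically
        rows.extend(page_rows)
        # Pause after every block of pages, but not after the final page.
        if page % pause_every == 0 and page < max_pages:
            sleep(pause_seconds)
    return rows
```

With the documented settings, a full run of 1,437 pages pauses 14 times, adding about 70 seconds to the job.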
Summary Table
| # | Scraper | Source | Frequency | Method | Output |
|---|---|---|---|---|---|
| 1 | EIA AEO | EIA (US) | Annual | HTTP/requests | PDF |
| 2 | EIA STEO | EIA (US) | Monthly/Quarterly | HTTP/requests | PDF |
| 3 | IEA WEO | IEA | Annual | HTTP+HTML parsing | PDF |
| 4 | IEA OMR | IEA | Monthly | HTTP+HTML parsing | PDF |
| 5 | OPEC MOMR | OPEC | Monthly | HTTP/requests | PDF |
| 6 | OPEC ASB | OPEC | Annual | HTTP/requests | PDF |
| 7 | OPEC WOO | OPEC | Annual | HTTP/requests | PDF |
| 8 | CCB Daily | CCB Futures | Daily | Selenium | PDF |
| 9 | CCB Monthly | CCB Futures | Monthly | Selenium | PDF |
| 10 | CCB Weekly | CCB Futures | Weekly | Selenium | PDF |
| 11 | ZME Auction | ZME | Historical | Selenium | Excel |
| 12 | ZME Tender | ZME | Historical | Selenium | Excel |
| 13 | ZME Listed | ZME | Historical | Selenium | Excel |
External URLs and Endpoints
EIA (U.S. Energy Information Administration)
- `https://www.eia.gov/outlooks/aeo/pdf/` -- current AEO reports
- `https://www.eia.gov/outlooks/archive/aeo{YY}/pdf/` -- archived AEO reports
- `https://www.eia.gov/outlooks/steo/archives/` -- STEO archive
IEA (International Energy Agency)
- `https://www.iea.org/reports/world-energy-outlook-{YYYY}` -- WEO report pages
- `https://www.iea.org/reports/oil-market-report-{month}-{year}` -- OMR report pages
- `https://iea.blob.core.windows.net/` -- Azure Blob Storage hosting actual PDF files
OPEC
- `https://www.opec.org/assets/assetdb/momr-{month}-{year}.pdf` -- Monthly Oil Market Reports
- `https://www.opec.org/assets/assetdb/asb-{year}.pdf` -- Annual Statistical Bulletins
- `https://www.opec.org/assets/assetdb/woo-{year}.pdf` -- World Oil Outlook reports
CCB Futures (建信期货)
- `https://www.ccbfutures.com/main/research/research_report/daily_paper/index.shtml` -- daily reports
- `https://www.ccbfutures.com/main/research/research_report/month_paper/index.shtml` -- monthly reports
- `https://www.ccbfutures.com/main/research/research_report/week_paper/index.shtml` -- weekly reports
Zhejiang Mercantile Exchange (浙江国际大宗商品交易中心)
- `https://www.zme.com.cn/lssj/index.htm` -- listed/spot market historical data
- `https://www.zme.com.cn/lssj2/index.htm` -- auction historical data
- `https://www.zme.com.cn/lssj3/index.htm` -- tender/bid historical data