Data Collection Methodology
This page explains the overall data collection infrastructure used to build the crude oil industry chain analysis dataset. The system comprises 13 Python scrapers that systematically gather reports and market data from five distinct sources: one US agency (EIA), two international organizations (IEA, OPEC), and two Chinese domestic sources (CCB Futures, ZME).
Architecture Overview
The data collection pipeline is divided into two distinct technical tiers based on the complexity of the target websites:
Tier 1: HTTP-Based PDF Downloaders (7 scrapers)
These scrapers target international energy organizations (EIA, IEA, OPEC) that publish reports as static PDF files. They use the Python requests library to download files directly via constructed URLs.
Common design pattern:
- URL construction -- Each scraper encodes knowledge of the target site's URL naming conventions, which often change across different year ranges. For example, the EIA AEO scraper contains 8 separate URL templates covering different eras from 1979 to present.
- HTTP GET with browser spoofing -- All scrapers set a Chrome-like `User-Agent` header to avoid being blocked by basic anti-bot measures.
- Content validation -- Downloaded content is verified as genuine PDF by checking for the `%PDF` magic bytes at the start of the file, preventing silent failures where a 200 OK response returns an HTML error page instead of the actual report.
- Idempotent operation -- Every scraper checks whether the target file already exists locally before attempting download, making re-runs safe and efficient.
- Rate limiting -- Polite delays (0.5-1.5 seconds) are inserted between requests to avoid overloading target servers.
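The pattern above can be sketched roughly as follows. This is a minimal illustration rather than the scrapers' actual code: the template table, the `example.org` URLs, the function names, the exact `User-Agent` string, and the injected `fetch` and `sleep` callables are all hypothetical. Only the `%PDF` check, the file-existence check, the Chrome-like header, and the 0.5-1.5 second delay come from the description above.

```python
import random
import time
from pathlib import Path

# Hypothetical era-based URL templates: (first_year, last_year, template).
URL_TEMPLATES = [
    (1979, 1999, "https://example.org/archive/aeo{yy}.pdf"),
    (2000, 9999, "https://example.org/aeo/{year}/report.pdf"),
]

HEADERS = {
    # Chrome-like User-Agent; the exact string is an assumption.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0 Safari/537.36"
}


def build_url(year):
    """Pick the template whose year range covers `year`."""
    for first, last, template in URL_TEMPLATES:
        if first <= year <= last:
            return template.format(year=year, yy=f"{year % 100:02d}")
    raise ValueError(f"no URL template covers {year}")


def is_pdf(content):
    """True if the payload starts with the %PDF magic bytes."""
    return content[:4] == b"%PDF"


def fetch_with_requests(url):
    """Default fetcher; requests is imported lazily so the helpers above load without it."""
    import requests
    response = requests.get(url, headers=HEADERS, timeout=60)
    response.raise_for_status()
    return response.content


def download_pdf(url, dest, fetch=fetch_with_requests, sleep=time.sleep):
    """Download one report; returns 'skipped', 'saved', or 'not_pdf'."""
    dest = Path(dest)
    if dest.exists():                 # idempotent: never re-download
        return "skipped"
    content = fetch(url)
    sleep(random.uniform(0.5, 1.5))   # polite delay after each request
    if not is_pdf(content):           # e.g. a 200 OK HTML error page
        return "not_pdf"
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(content)
    return "saved"
```

Passing `fetch` and `sleep` as parameters is a design choice made for this sketch so the logic can be exercised without network access; a real scraper could call `requests` directly.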
OPEC-specific considerations:
The three OPEC scrapers disable SSL certificate verification (session.verify = False) because the OPEC website's SSL certificates have historically caused validation failures with Python's default certificate bundle. This is a pragmatic workaround, not a security best practice. The urllib3 SSL warnings are explicitly suppressed.
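A minimal sketch of this workaround, assuming a `requests.Session` is used as the text suggests. The function name is hypothetical, and the lazy imports are only there so the snippet loads without the packages installed.

```python
def make_unverified_session():
    """Return a requests.Session that skips TLS certificate checks."""
    import requests
    import urllib3

    # Silence the InsecureRequestWarning urllib3 emits for each unverified request.
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

    session = requests.Session()
    session.verify = False  # workaround only: certificates are no longer checked
    return session
```

If the failing certificate chain can be identified, setting `session.verify` to the path of a CA bundle file instead of `False` keeps verification enabled.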
IEA-specific considerations:
The two IEA scrapers (WEO and OMR) employ a two-stage download process. IEA does not expose direct PDF URLs in a predictable pattern. Instead, each scraper first fetches the HTML report page (e.g., https://www.iea.org/reports/world-energy-outlook-2024), parses it with BeautifulSoup to find download links pointing to iea.blob.core.windows.net (IEA's Azure Blob Storage), and then downloads the actual PDF from that extracted URL. This makes these scrapers more fragile -- any change to IEA's page structure would break the link extraction.
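The link-extraction stage might look roughly like the sketch below. To keep it dependency-free it uses the standard-library `html.parser` in place of BeautifulSoup, so it is not the scrapers' actual code; the class and function names are hypothetical. With BeautifulSoup, the equivalent step would be to iterate over `soup.find_all("a", href=True)` and keep the matching links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BLOB_HOST = "iea.blob.core.windows.net"


class PdfLinkParser(HTMLParser):
    """Collect href values of <a> tags that point at the blob storage host."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and BLOB_HOST in href:
            # urljoin leaves absolute URLs untouched and resolves relative ones.
            self.links.append(urljoin(self.base_url, href))


def find_pdf_links(page_html, base_url):
    """Stage 1: return candidate download URLs found in the report page."""
    parser = PdfLinkParser(base_url)
    parser.feed(page_html)
    return parser.links
```

Stage 2 would then download the first returned URL. An empty list is the signal that the page structure has changed and the extraction needs updating.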
Tier 2: Selenium Browser Automation Scrapers (6 scrapers)
These scrapers target Chinese domestic data sources (CCB Futures and ZME) where content is rendered dynamically via JavaScript and cannot be accessed through simple HTTP requests.
Common design pattern:
- Chrome WebDriver -- All six scrapers use Selenium with a local `chromedriver.exe` to control a real Chrome browser instance.
- DOM-based data extraction -- Content is located by searching the rendered DOM for `<a>` tags (report links) or `<table>` elements (market data).
- Automatic pagination -- Each scraper implements a pagination loop that finds and clicks "next page" buttons using multiple XPath selectors to handle different button label formats (Chinese, English, symbol-based).
- Download via browser -- For PDF reports (CCB Futures), files are downloaded by programmatically clicking links in the browser. The scraper monitors the filesystem for new files appearing in the download directory, waits for `.crdownload` temporary files to complete, and then renames files to a standardized format.
- Deduplication -- Before downloading, the scraper scans the local directory for existing files, extracts dates from filenames, and skips any reports that have already been collected.
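The filesystem side of the download and deduplication steps can be sketched without a browser. This is an illustration under stated assumptions, not the scrapers' code: the function names, the `YYYYMMDD` date pattern in filenames, and the timeout values are hypothetical.

```python
import re
import time
from pathlib import Path

DATE_RE = re.compile(r"(\d{8})")  # assumes filenames embed YYYYMMDD (hypothetical)


def collected_dates(directory):
    """Dates already present in the download directory, parsed from PDF filenames."""
    dates = set()
    for path in Path(directory).glob("*.pdf"):
        match = DATE_RE.search(path.name)
        if match:
            dates.add(match.group(1))
    return dates


def wait_for_new_file(directory, before, timeout=60.0, poll=0.5):
    """Wait for a new, fully written file; return its Path, or None on timeout.

    `before` is the set of filenames present before the link was clicked.
    """
    directory = Path(directory)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        names = {p.name for p in directory.iterdir()}
        in_progress = any(n.endswith(".crdownload") for n in names)
        new = sorted(n for n in names - before if not n.endswith(".crdownload"))
        if new and not in_progress:
            return directory / new[0]
        time.sleep(poll)
    return None
```

A caller would snapshot the directory listing, click the link in the browser, call `wait_for_new_file`, and then rename the returned path to the standardized filename. Reports whose date is already in `collected_dates` are skipped before clicking.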
CCB Futures sub-scrapers (3 scrapers):
These three scrapers (daily, weekly, monthly) share identical architecture but differ in:
- Target URL path (daily_paper, week_paper, month_paper)
- Keyword matching rules for identifying relevant links in the page
- Output file naming prefix
- Maximum page depth (500-3000 pages depending on expected volume)
The monthly report scraper includes a "microscope debugging" mode that prints the raw repr() of link text to diagnose hidden Unicode characters that might interfere with date extraction. Its presence suggests that some link text on the target website contains invisible or unusual characters.
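The sketch below shows why printing `repr()` helps and one possible way to neutralize such characters before matching dates. The sample link text, the `YYYY-MM-DD` date format, and the function names are assumptions made for illustration; the source does not say which characters or date formats the site actually uses.

```python
import re
import unicodedata

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")  # assumed date format, for illustration


def strip_invisible(text):
    """Normalize width variants (NFKC) and drop format characters (category Cf)."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in normalized if unicodedata.category(ch) != "Cf")


def extract_date(text):
    """Return the first date found in the text, or None."""
    match = DATE_RE.search(text)
    return match.group(0) if match else None


# Hypothetical link text containing a zero-width space after the year.
link_text = "Monthly report 2024\u200b-03-15"

print(repr(link_text))                           # repr() exposes the hidden \u200b
print(extract_date(link_text))                   # None
print(extract_date(strip_invisible(link_text)))  # 2024-03-15
```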
ZME sub-scrapers (3 scrapers):
These three scrapers (auction, tender, listed) extract structured tabular data rather than downloading files. They share identical architecture but target different URL paths and table schemas:
- Auction data: 7 columns across 6 pages
- Tender data: 5 columns across ~20 pages
- Listed data: 5 columns across 1,437 pages (the largest dataset)
The listed data scraper implements an anti-ban strategy, pausing for 5 seconds every 100 pages to avoid being rate-limited by the exchange.
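A pagination loop with this kind of periodic pause could be structured as in the sketch below. The `scrape_page` and `go_next` callables are hypothetical stand-ins for the Selenium logic (reading the current table and clicking "next page"), and `sleep` is injectable only so the loop can be exercised without waiting.

```python
import time


def scrape_all_pages(scrape_page, go_next, max_pages,
                     pause_every=100, pause_seconds=5.0, sleep=time.sleep):
    """Collect rows page by page, pausing after every `pause_every` pages.

    scrape_page() returns the rows on the current page.
    go_next() advances one page and returns False when no next page exists.
    """
    rows = []
    for page in range(1, max_pages + 1):
        rows.extend(scrape_page())
        if page == max_pages or not go_next():
            break
        if page % pause_every == 0:
            sleep(pause_seconds)  # anti-ban pause between batches of pages
    return rows
```

For the listed data, a call such as `scrape_all_pages(read_table, click_next, max_pages=1437)` would pause 14 times over a full run, assuming hypothetical `read_table` and `click_next` helpers.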
Data Sources and Coverage
International Sources
| Organization | Reports | Time Span | Frequency | Purpose in Analysis |
|---|---|---|---|---|
| EIA (US) | AEO, STEO | 1979-present | Annual, Monthly | US energy supply/demand forecasts, pricing outlook |
| IEA (International) | WEO, OMR | 1998-present | Annual, Monthly | Global energy market analysis, oil supply/demand balances |
| OPEC | MOMR, ASB, WOO | 1999-present | Monthly, Annual | OPEC production data, cartel policy analysis, long-term outlook |
Domestic Chinese Sources
| Source | Data Types | Purpose in Analysis |
|---|---|---|
| CCB Futures (建信期货) | Daily, weekly, monthly reports | Chinese futures market perspective on crude oil, petrochemical sector commentary |
| ZME (浙江国际大宗商品交易中心) | Auction, tender, and spot trading records | Physical commodity trading prices and volumes in China's spot market |
Output Formats
The collection produces two types of output:
- PDF reports (10 scrapers) -- Unstructured text documents requiring downstream NLP/OCR processing for structured analysis. Filenames follow a consistent `{Source}_{Date}_{ReportType}.pdf` convention.
- Excel spreadsheets (3 scrapers) -- Structured tabular data from ZME, ready for direct quantitative analysis. Contains commodity trading records with dates, volumes, prices, and transaction statuses.
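As an illustration of the PDF naming convention, the helpers below build and parse such filenames. The ISO `YYYY-MM-DD` date format and the function names are assumptions; the source specifies only the `{Source}_{Date}_{ReportType}.pdf` shape.

```python
import re
from datetime import date

FILENAME_RE = re.compile(
    r"^(?P<source>[^_]+)_(?P<date>\d{4}-\d{2}-\d{2})_(?P<report>[^_]+)\.pdf$"
)


def build_filename(source, report_date, report_type):
    """Format a filename following {Source}_{Date}_{ReportType}.pdf."""
    return f"{source}_{report_date.isoformat()}_{report_type}.pdf"


def parse_filename(name):
    """Return (source, date, report_type), or None if the name does not match."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    return match["source"], date.fromisoformat(match["date"]), match["report"]


print(build_filename("OPEC", date(2024, 3, 1), "MOMR"))  # OPEC_2024-03-01_MOMR.pdf
```

A parser like this is what makes the convention useful downstream: files can be grouped by source and ordered by date without opening them.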
Dependencies and Runtime Requirements
Tier 1 (HTTP scrapers)
- Python 3.x
- `requests` -- HTTP client
- `beautifulsoup4` -- HTML parsing (IEA scrapers only)
- No special system requirements
Tier 2 (Selenium scrapers)
- Python 3.x
- `selenium` -- browser automation
- `pandas` + `openpyxl` -- data processing and Excel export (ZME scrapers only)
- Google Chrome browser installed
- `chromedriver.exe` matching the installed Chrome version, placed in the script directory
- Windows OS assumed (uses `os.system("pause")` and Windows-style paths)
Limitations and Fragility
- URL pattern changes -- The HTTP scrapers encode URL patterns that have changed historically and will likely change again. The EIA AEO scraper already has 8 different URL templates for different year ranges.
- IEA page structure -- The IEA scrapers depend on parsing HTML to find Azure Blob Storage links. Any redesign of IEA's website would break these scrapers.
- OPEC SSL issues -- SSL verification is disabled for all OPEC scrapers, which is a security concern if running on untrusted networks.
- Selenium brittleness -- The CCB Futures and ZME scrapers depend on specific DOM structures, CSS class names (`tcdPageCode1`, `tcdNumber`, `span.current`), and page layout. Website redesigns would require scraper updates.
- Platform dependency -- The Selenium scrapers are Windows-specific due to hardcoded paths (`chromedriver.exe`, `D:\anaconda3\`) and `os.system("pause")` calls.
- No error recovery -- None of the scrapers implement checkpoint/resume functionality. If a long scraping job (e.g., ZME listed data at 1,437 pages) fails partway through, the file-existence check provides some protection, but any partially downloaded files may cause issues.
- No scheduling -- The scrapers are designed for manual execution. There is no cron/scheduler integration for automated periodic collection.
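The error-recovery gap could be narrowed with two small helpers, sketched below as one possible approach and not something the scrapers currently do. The function names, the `.part` suffix, and the JSON checkpoint layout are all hypothetical.

```python
import json
import os
from pathlib import Path


def write_atomically(dest, data):
    """Write bytes to a temporary sibling file, then rename it into place.

    An interrupted run leaves only the .part file behind, so a file-existence
    check on `dest` never mistakes a partial download for a complete one.
    """
    dest = Path(dest)
    tmp = dest.with_name(dest.name + ".part")
    tmp.write_bytes(data)
    os.replace(tmp, dest)  # the rename replaces the target in a single step


def load_checkpoint(path):
    """Return the last completed page number, or 0 if no checkpoint exists."""
    try:
        return json.loads(Path(path).read_text(encoding="utf-8"))["last_page"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path, page):
    """Record the last completed page so a later run can resume after it."""
    write_atomically(path, json.dumps({"last_page": page}).encode("utf-8"))
```

A long table scrape would call `save_checkpoint` after each page and start from `load_checkpoint(path) + 1` on the next run.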
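Related Pages
- domestic-crack-spread-analysis -- Data source collection system for domestic crack spreads (three-source cross-validation across 隆众/卓创/金联创; a three-layer data matrix of international benchmarks, domestic spot prices, and official regulatory data)
- industry-chain-variables -- Complete 195-variable industry chain framework
- data-scrapers -- Detailed list of the 13 Python scrapers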