Downloading historical market data is a foundational step for any serious trading strategy backtesting. This process involves retrieving past price and volume information from various exchange APIs and storing it systematically in a database for analysis.
The core method for this operation typically utilizes an ExchangeDB class with a function like insert_klines to write the retrieved candlestick (kline) data into your local storage. This organized data is what empowers you to test your trading ideas against historical performance.
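An `insert_klines`-style writer can be sketched roughly as below. The class name and method name come from the text above, but the table schema, column names, and kline dict shape are illustrative assumptions, not the framework's actual API; SQLite stands in for whatever database backend you use.

```python
import sqlite3

class ExchangeDB:
    """Minimal sketch of a kline store (hypothetical schema)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS klines (
                   code TEXT, frequency TEXT, dt TEXT,
                   open REAL, high REAL, low REAL, close REAL, volume REAL,
                   PRIMARY KEY (code, frequency, dt)
               )"""
        )

    def insert_klines(self, code, frequency, klines):
        # INSERT OR REPLACE makes the daily append job idempotent:
        # re-downloading an overlapping window just overwrites duplicates.
        self.conn.executemany(
            "INSERT OR REPLACE INTO klines VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            [(code, frequency, k["dt"], k["open"], k["high"],
              k["low"], k["close"], k["volume"]) for k in klines],
        )
        self.conn.commit()

db = ExchangeDB()
db.insert_klines("000001", "1d", [
    {"dt": "2019-01-02", "open": 9.4, "high": 9.6, "low": 9.3,
     "close": 9.5, "volume": 1_000_000},
])
```

The composite primary key on (code, frequency, dt) is what lets repeated downloads upsert rather than duplicate rows.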
Downloading China A-Share Historical Data
When working with China's A-Share market, two common data sources are Baostock and Goldminer (掘金). It's important to note that the latest version of Goldminer only supports downloading minute-level data for the most recent 90 days.
- Baostock Data Source: The script for this is commonly found at script/crontab/reboot_sync_a_klines.py.
- Goldminer (掘金) Data Source: The corresponding script is usually script/crontab/reboot_sync_gm_a_klines.py.
Within these scripts, you can define the specific stock symbols you wish to download using a parameter like run_codes. The start date for data retrieval for different timeframes (e.g., 1min, 5min, 1D) can be set with a parameter such as f_start_datetime.
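The parameter names `run_codes` and `f_start_datetime` come from the scripts described above, but the value shapes below (the code format and the per-timeframe dict) are assumptions for illustration, as is the small helper that expands them into download tasks.

```python
from datetime import datetime

# Symbols to download (parameter name from the sync scripts;
# the code format shown here is hypothetical).
run_codes = ["SH.600519", "SZ.000001"]

# Earliest datetime to fetch, per timeframe (illustrative shape).
f_start_datetime = {
    "1m": datetime(2022, 1, 1),
    "5m": datetime(2020, 1, 1),
    "d": datetime(2010, 1, 1),
}

def plan_downloads(codes, starts):
    """Expand the config into (code, frequency, start) download tasks."""
    return [(c, f, s) for c in codes for f, s in starts.items()]

tasks = plan_downloads(run_codes, f_start_datetime)
```

Each task tuple would then drive one fetch-and-insert loop against the data source.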
A key recommendation is to avoid downloading the entire market's history unless absolutely necessary, as the process can be slow and generate extremely large datasets. Focus instead on downloading the historical data only for the specific symbols and timeframes you need for your backtesting research.
It is also highly advisable to use back-adjusted (后复权) data. This accounts for corporate actions like stock splits and dividends, providing a more accurate picture of price movement. You can set up a daily job to append the latest data, which simplifies maintenance as you won’t need to manually manage adjustment factors.
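The daily append job only needs to know where the last download stopped. A minimal sketch of that resume logic, assuming the last stored bar's timestamp can be read from the database (the default full-history start date is invented for illustration):

```python
from datetime import datetime, timedelta

def next_fetch_start(last_bar_dt, frequency_minutes):
    """Start the next download one bar after the last stored bar,
    so the daily job appends new data without re-fetching history."""
    if last_bar_dt is None:            # empty table: do a full history download
        return datetime(2010, 1, 1)    # hypothetical default start date
    return last_bar_dt + timedelta(minutes=frequency_minutes)

# Last stored 5-minute bar is 14:55 -> resume the download at 15:00.
start = next_fetch_start(datetime(2024, 5, 10, 14, 55), 5)
```

Because back-adjusted data is appended rather than recomputed, this is all the daily job has to do.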
Note: The Baostock service provides a minimum timeframe of 5 minutes for stock data and does not include tick-level minute data.
Acquiring Hong Kong Stock Market Data
For Hong Kong stock (HK shares) historical data, Futu's API is a frequently used source.
- Futu Data Source: The download script is typically located at script/crontab/reboot_sync_hk_klines.py.
Be mindful that this API often enforces limits on the number of securities you can query, so use it judiciously based on your specific requirements.
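One simple way to stay within a per-request security limit is to batch the symbol list before querying. The batch size and the HK code format below are illustrative, not Futu's actual quota:

```python
def batch_symbols(symbols, batch_size):
    """Split a symbol list into fixed-size batches so each request
    stays under the API's per-query security limit."""
    return [symbols[i:i + batch_size]
            for i in range(0, len(symbols), batch_size)]

batches = batch_symbols(
    ["HK.00700", "HK.09988", "HK.03690", "HK.01810", "HK.00005"], 2
)
```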
Sourcing Futures Market Historical Data
For futures market data, common sources include TqSdk (Tianqin) and Goldminer (掘金) for futures. Similar to its equity counterpart, the latest Goldminer futures API may restrict minute data downloads to the last 90 days.
- TqSdk (Tianqin) Data Source: The script can be found at script/crontab/reboot_sync_futures_klines.py.
- Goldminer (掘金) Futures Data Source: The script is usually script/crontab/reboot_sync_gm_futures_klines.py.
Accessing historical data via the TqSdk API often requires a professional subscription. They may offer a free 15-day trial period, which can be an excellent opportunity to download the necessary futures historical data to your local database for all your subsequent backtesting needs.
Downloading Cryptocurrency Historical Data
The Binance exchange is a premier source for deep historical data on crypto assets.
- Binance Data Source: The script for downloading crypto data is commonly script/crontab/reboot_sync_currency_klines.py. This typically uses the Binance USDT-margined perpetual futures API to fetch historical kline data for all listed trading pairs.
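Fetching years of klines means paginating by time window, because the endpoint caps the number of candles returned per request (the 1500-bar cap used below is an assumption; check the current API documentation). A sketch of just the windowing arithmetic, with no actual HTTP calls:

```python
def kline_windows(start_ms, end_ms, interval_ms, limit=1500):
    """Yield (window_start, window_end) pairs covering [start_ms, end_ms),
    each small enough to fit in a single klines request."""
    step = interval_ms * limit
    t = start_ms
    while t < end_ms:
        yield t, min(t + step, end_ms)
        t += step

# One day of 1-minute bars (1440 bars) fits in a single request
# under a 1500-bar cap; two days need two requests.
day_ms = 24 * 60 * 60 * 1000
windows = list(kline_windows(0, day_ms, 60_000))
```

The real script would issue one request per window and pass the results to insert_klines.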
A crucial technical point: The ExchangeBinance class might check the database for existing data during its klines method call and will write new data accordingly. If you haven't downloaded historical data previously but have requested crypto data through a web interface, the "last update" timestamp in your database might be recent. This can cause the historical download script to skip downloading older data for that symbol.
The solution is to clear the relevant database tables for those symbols and then execute the historical download script. To keep the dataset current, set up a scheduled task (e.g., a daily cron job) that runs the download script and automatically appends new data.
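Clearing a symbol's stale rows before the full re-download can be sketched as below; the table and column names are illustrative, not the framework's actual schema, and SQLite again stands in for the real backend.

```python
import sqlite3

def reset_symbol_history(conn, code):
    """Delete a symbol's rows so the next run of the historical download
    script starts from scratch instead of trusting a recent 'last update'
    timestamp left behind by web-interface requests."""
    conn.execute("DELETE FROM klines WHERE code = ?", (code,))
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE klines (code TEXT, dt TEXT, close REAL)")
conn.execute("INSERT INTO klines VALUES ('BTCUSDT', '2024-05-10', 60000.0)")
reset_symbol_history(conn, "BTCUSDT")
```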
Utilizing Pre-Packaged Historical Data for Backtesting
For a quicker start, pre-compiled historical data packages are sometimes available through community channels. These zip files can be imported directly into your database for local backtesting, saving the initial time required to fetch data from APIs.
VNPY Historical Data Package
This package contains data essential for backtesting with the VNPY framework, including select stock and futures contract data.
- File Name: vnpy_mysql_data.zip
Usage Instructions:
- Create a MySQL database named vnpy.
- Unzip the downloaded file.
- Import the SQL dump into your database. Note that this process may overwrite any existing data in the tables.
Included Data Examples:
- Stocks (1-minute data from January 1, 2019): Example codes like '000001', '000858'.
- Futures (1-minute data from January 1, 2019): Example contracts like 'ag2206', 'cu2205'.
Cryptocurrency Data Sample
This package includes historical data for major crypto pairs like BTC/USDT, ETH/USDT, EOS/USDT, and ETC/USDT.
- File Name: chanlun_currency_data.zip
Included Timeframes: The data encompasses multiple timeframes, including weekly ('w'), daily ('d'), 4-hour ('4h'), 1-hour ('60m'), 30-minute ('30m'), 15-minute ('15m'), 5-minute ('5m'), and 1-minute ('1m').
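The timeframe labels above can be mapped to a bar length in minutes, which is handy for ordering timeframes in multi-timeframe backtests. The mapping below is an illustrative sketch; since crypto trades continuously, a week here is a full 10080 minutes.

```python
# Minutes per bar for each timeframe label in the package.
FREQ_MINUTES = {
    "w": 7 * 24 * 60,   # crypto trades around the clock: 10080 minutes
    "d": 24 * 60,
    "4h": 240,
    "60m": 60,
    "30m": 30,
    "15m": 15,
    "5m": 5,
    "1m": 1,
}

def coarser_than(a, b):
    """True if timeframe a aggregates more minutes per bar than b."""
    return FREQ_MINUTES[a] > FREQ_MINUTES[b]
```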
Futures Data Sample
This extensive package contains data for a wide array of futures contracts from exchanges like CFFEX, DCE, SHFE, and INE.
- File Name: chanlun_futures_data.zip
The data covers numerous specific contracts and includes a range of timeframes from weekly down to 1-minute charts, with start dates varying by resolution (e.g., 1-minute data often starts from 2018). This allows for deep, multi-timeframe backtesting research.
Frequently Asked Questions
Q: Why is back-adjusted data recommended for stock backtesting?
A: Back-adjusted data modifies historical prices to reflect corporate actions like dividends and stock splits. This provides a continuous and accurate price series, ensuring your backtest results aren't distorted by these events, which is crucial for calculating realistic returns.
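The distortion can be illustrated with a toy split adjustment; the numbers are invented, and real back-adjustment factors also fold in dividends:

```python
def back_adjust(prices, split_ratio, split_index):
    """Divide pre-split prices by the split ratio so the series is
    continuous across the split (simplified: dividends ignored)."""
    return [p / split_ratio if i < split_index else p
            for i, p in enumerate(prices)]

# A 2-for-1 split between bar 1 and bar 2: the raw series jumps
# 20 -> 10, which a backtest would read as a -50% "return"; the
# adjusted series reads 10 -> 10 with no artificial move.
adjusted = back_adjust([19.0, 20.0, 10.0, 10.5], 2.0, 2)
```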
Q: What is a common mistake when starting to download historical data?
A: A frequent issue is not properly managing the 'last updated' timestamp in the database. If the system thinks it has the latest data, it won't download the historical depth you need. The solution is often to clear existing records for a symbol before running a full historical download script.
Q: How often should I update my local historical database?
A: It's best practice to update your database daily using an automated script or cron job. This appends the latest day's worth of data, keeping your local repository current and ready for the most up-to-date backtesting or analysis without manual intervention.
Q: What's the main advantage of using exchange APIs directly versus pre-packaged data?
A: Using APIs gives you complete control over the symbols, timeframes, and time periods you download. Pre-packaged data is convenient for a quick start but may not include the specific assets or the exact date range you require for your unique strategy research.
Q: Are there limitations to free data sources I should be aware of?
A: Yes, absolutely. Many free APIs, including some mentioned, impose limits on data depth (e.g., only 90 days of minute data), rate limits on requests, or restrictions on the number of symbols you can query in a single request. Always check the latest API documentation.
Q: Is it necessary to download 1-minute data for all backtesting?
A: Not at all. The necessary data granularity depends entirely on your strategy's trading timeframe. Strategies based on daily signals only require daily data. Using higher granularity than needed consumes significant storage and can slow backtesting execution without providing any benefit.
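If you do hold fine-grained data, coarser bars can always be derived from it rather than downloaded separately. A minimal sketch of OHLC downsampling (the bar dict shape is an assumption for illustration):

```python
def resample_ohlc(bars, group_size):
    """Aggregate fixed-size groups of finer bars into coarser OHLC bars:
    first open, max high, min low, last close, summed volume."""
    out = []
    for i in range(0, len(bars), group_size):
        chunk = bars[i:i + group_size]
        out.append({
            "open": chunk[0]["open"],
            "high": max(b["high"] for b in chunk),
            "low": min(b["low"] for b in chunk),
            "close": chunk[-1]["close"],
            "volume": sum(b["volume"] for b in chunk),
        })
    return out

minute = [{"open": 1.0, "high": 2.0, "low": 0.5, "close": 1.5, "volume": 10},
          {"open": 1.5, "high": 3.0, "low": 1.0, "close": 2.5, "volume": 20}]
two_min = resample_ohlc(minute, 2)
```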