Datasets

Useful datasets for your work and ours

We have identified several useful datasets related to financial and life sciences applications of data science and machine learning/artificial intelligence.

On this page we have put together a list of datasets that we find particularly useful in our work. This list is by its very nature incomplete and subjective, but we hope that our customers will find it as useful as we do.

Finance

Dataset

BitMEX

BitMEX is a cryptocurrency exchange and derivative trading platform. It is owned and operated by HDR Global Trading Limited, which is registered in the Seychelles and has offices worldwide.

BitMEX offers a fully featured REST API and a powerful streaming WebSocket API. All market and user data is available and is updated in real time.

The BitMEX APIs are open and complete. Every function used by the BitMEX website is exposed via the API, allowing developers full control to build any kind of application on top of BitMEX.
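
As a rough illustration of the REST API mentioned above, the following minimal sketch builds a request for recent public trades using only the Python standard library. The base URL, the /trade endpoint and the symbol, count and reverse parameters reflect our understanding of the public BitMEX API and should be checked against the current BitMEX documentation; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed public REST base URL; verify against the BitMEX API documentation.
BITMEX_REST_BASE = "https://www.bitmex.com/api/v1"


def build_trade_url(symbol: str, count: int = 10) -> str:
    """Build the URL for the most recent public trades of one instrument."""
    query = urllib.parse.urlencode(
        {"symbol": symbol, "count": count, "reverse": "true"}  # newest first
    )
    return f"{BITMEX_REST_BASE}/trade?{query}"


def fetch_recent_trades(symbol: str, count: int = 10) -> list:
    """Download and decode recent trades (hypothetical helper; needs network access)."""
    with urllib.request.urlopen(build_trade_url(symbol, count), timeout=10) as resp:
        return json.load(resp)
```

Calling fetch_recent_trades("XBTUSD", 5) should return a list of trade records, subject to rate limits and network availability. Endpoints that expose user data additionally require API key authentication, which this sketch does not cover.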

Dataset

FINRA TRACE

The Trade Reporting and Compliance Engine (TRACE) is the FINRA-developed vehicle that facilitates the mandatory reporting of over-the-counter secondary market transactions in eligible fixed income securities such as corporate bonds.

All broker-dealers who are FINRA member firms have an obligation to report transactions in corporate bonds to TRACE under an SEC-approved set of rules.

FINRA TRACE data is available on Wharton Research Data Services (WRDS).

Dataset

LOBSTER

LOBSTER is an online limit order book data tool that provides easy-to-use, high-quality limit order book data.

Since 2013, LOBSTER has acted as a data provider for the academic community, giving access to reconstructed limit order book data for the entire universe of NASDAQ-traded stocks.

LOBSTER has also been made available to commercial clients, especially investment banks, hedge funds and asset managers, so that they too can profit from this data set.

Dataset

Quandl

Quandl is the premier source for financial, economic, and alternative datasets, serving investment professionals. Quandl’s platform is used by over 400,000 people, including analysts from the world’s top hedge funds, asset managers and investment banks.

Dataset

TickData

For over 30 years, the world’s largest investment banks, asset managers, proprietary traders and universities have relied upon Tick Data’s historical intraday stock, futures, options and forex data to back-test trading strategies, develop risk and execution models, perform post-trade analysis, and conduct important academic research. Owned by its managers, Tick Data is passionately focused on data quality. With data going as far back as 1974, the Tick Data offering is one of the largest of its kind.

Dataset

Yahoo! Finance

Yahoo! Finance is a media property that is part of Yahoo!’s network. It provides financial news, data and commentary including stock quotes, press releases, financial reports, and original content.

It is possible to download financial data from Yahoo! Finance via Python libraries such as yfinance.
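
The download route mentioned above might look like the following minimal sketch. It assumes the third-party yfinance package is installed and that its Ticker.history method still accepts start, end and interval arguments. The helper names are hypothetical, and yfinance is an unofficial library, so its behaviour may change if Yahoo! Finance changes.

```python
def fetch_daily_closes(symbol: str, start: str, end: str) -> list:
    """Return daily closing prices for `symbol` between two ISO dates.

    Hypothetical helper; needs network access and the yfinance package.
    """
    # Imported lazily so that the pure-Python helper below works without yfinance.
    import yfinance as yf

    history = yf.Ticker(symbol).history(start=start, end=end, interval="1d")
    return history["Close"].tolist()


def simple_returns(closes: list) -> list:
    """Compute simple period-over-period returns from a list of prices."""
    return [(curr - prev) / prev for prev, curr in zip(closes, closes[1:])]
```

For example, simple_returns(fetch_daily_closes("MSFT", "2023-01-01", "2023-02-01")) would give the daily returns for that window, provided the download succeeds.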

Life Sciences

Dataset

AromaDb

AromaDb is a comprehensive electronic library of aroma molecules from medicinal and aromatic plants of Indian as well as foreign origin. It includes detailed information about aroma plants, descriptions, plant varieties, plant accessions, chemotypes, essential oils, oil yields and constituents, chromatograms, major and minor compounds, structural elucidation data, structural data (2D and 3D) of very small volatile molecules (molecular weight < 300) and medium-sized molecules (molecular weight < 500), physico-chemical properties, biological pathway information and cross-references. Aroma compounds are classified by structure as esters, linear terpenes, cyclic terpenes, aromatic, amines, alcohols, aldehydes, ketones, lactones, thiols and miscellaneous compounds.

Dataset

DrugBank

The DrugBank database is a unique bioinformatics and cheminformatics resource that combines detailed drug data with comprehensive drug target information.

The latest release of DrugBank (version 5.1.4, released 2019-07-02) contains 13,445 drug entries including 2,623 approved small molecule drugs, 1,349 approved biologics (proteins, peptides, vaccines, and allergenics), 130 nutraceuticals and over 6,335 experimental (discovery-phase) drugs. Additionally, 5,158 non-redundant protein (i.e. drug target/enzyme/transporter/carrier) sequences are linked to these drug entries. Each entry contains more than 200 data fields with half of the information being devoted to drug/chemical data and the other half devoted to drug target or protein data.

Dataset

FooDB

FooDB is the world’s largest and most comprehensive resource on food constituents, chemistry and biology. It provides information on both macronutrients and micronutrients, including many of the constituents that give foods their flavor, color, taste, texture and aroma. Each chemical entry in the FooDB contains more than 100 separate data fields covering detailed compositional, biochemical and physiological information (obtained from the literature). This includes data on the compound’s nomenclature, its description, information on its structure, chemical class, its physico-chemical data, its food source(s), its color, its aroma, its taste, its physiological effect, presumptive health effects (from published studies), and concentrations in various foods. Users are able to browse or search FooDB by food source, name, descriptors, function or concentrations. Depending on individual preferences users are able to view the content of FooDB from the Food Browse (listing foods by their chemical composition) or the Compound Browse (listing chemicals by their food sources).

Dataset

PubChem

PubChem is an open chemistry database at the National Institutes of Health (NIH). “Open” means that you can put your scientific data in PubChem and that others may use it. Since its launch in 2004, PubChem has become a key chemical information resource for scientists, students, and the general public. Each month its website and programmatic services provide data to several million users worldwide.

PubChem mostly contains small molecules, but also larger molecules such as nucleotides, carbohydrates, lipids, peptides, and chemically modified macromolecules. It collects information on chemical structures, identifiers, chemical and physical properties, biological activities, patents, health, safety, toxicity data, and much more.

Where does the data in PubChem come from? PubChem records are contributed by hundreds of data sources. Examples include: government agencies, chemical vendors, journal publishers, and more.

The amount of data in PubChem is ever-growing; the PubChem Statistics page lists the latest data counts.
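
To give a flavour of PubChem's programmatic services, here is a minimal sketch of a compound property lookup by name, using only the Python standard library. The URL layout and the PropertyTable/Properties response structure follow our understanding of the PUG REST interface and should be verified against the PubChem documentation; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed PUG REST base URL; verify against the PubChem documentation.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_property_url(name: str, properties: list) -> str:
    """Build a property lookup URL for a compound identified by name."""
    quoted_name = urllib.parse.quote(name, safe="")  # escape spaces, slashes, etc.
    wanted = ",".join(properties)
    return f"{PUG_REST_BASE}/compound/name/{quoted_name}/property/{wanted}/JSON"


def parse_properties(payload: dict) -> list:
    """Extract the list of property records from a decoded JSON response."""
    return payload["PropertyTable"]["Properties"]


def fetch_properties(name: str, properties: list) -> list:
    """Download and parse properties (hypothetical helper; needs network access)."""
    url = build_property_url(name, properties)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_properties(json.load(resp))
```

A call such as fetch_properties("aspirin", ["MolecularFormula", "MolecularWeight"]) should return one record per matching compound. PubChem asks users of its programmatic services to respect its request rate limits.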

Miscellaneous

Dataset

Kaggle

Kaggle offers a no-setup, customizable Jupyter Notebooks environment. Access free GPUs and a huge repository of community-published data and code.

Inside Kaggle you will find all the code and data you need to do your data science work. Use over 19,000 public datasets and 200,000 public notebooks to conquer any analysis in no time.

Dataset

UCI Machine Learning Repository

The UCI Machine Learning Repository offers 488 data sets organised by default task (classification, regression, clustering, other), attribute type (categorical, numerical, mixed), data type (multivariate, univariate, sequential, time-series, text, domain-theory, other), area (life sciences, physical sciences, computer science and engineering, social sciences, business, game, other), etc.