This is part two of a series of posts exploring the Data Factory, the premier approach to data management. If you missed it, catch up on Part One: Acquiring the right data here.
In this post, we’ll tackle the stage that comes after finding the right data for your business’s needs: transforming that data so it’s usable and reliable in practice.
Data quality assurance
The truth of the matter is that raw, real-world data can have a multitude of problems—it can be messy, noisy, unstructured and full of gaps, errors, outliers and duplicates.
Before portfolio managers and research heads can decide how to best use it, the data needs to be prepared and, in doing so, made useful.
Let us take the example of our FX Settlements Volume and Flow datasets. They contain FX transaction data from the largest FX settlement service in the world. This service settles around 35-50% of the market across various currencies, and the data is aggregated and delivered hourly and at end-of-day. The data includes total volume as well as volume sliced by participant type and direction of flow.
The fast hourly updates make them incredibly useful to investors but can introduce a high error rate if the appropriate quality assurance processes are not in place.
It is vital to have an established process to make sure that data is coming in clean and quality-checked.
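What does such a quality-assurance process look like in practice? Below is a minimal sketch of automated checks on hourly volume records: duplicates, gaps between updates, missing or negative values, and outliers. The function name, record shape, and thresholds are illustrative assumptions for this example, not the actual production rules.

```python
from datetime import datetime, timedelta

def validate_hourly_volumes(records):
    """Flag common quality issues in hourly volume records.

    `records` is a list of (timestamp, volume) tuples. Thresholds
    here (1-hour gap, 10x-median outlier) are illustrative only.
    """
    issues = []
    seen = set()
    volumes = sorted(v for _, v in records if v is not None)
    median = volumes[len(volumes) // 2] if volumes else 0

    prev_ts = None
    for ts, vol in sorted(records, key=lambda r: r[0]):
        if ts in seen:
            issues.append((ts, "duplicate timestamp"))
        seen.add(ts)
        if prev_ts is not None and ts - prev_ts > timedelta(hours=1):
            issues.append((ts, "gap: missing hourly update"))
        if vol is None or vol < 0:
            issues.append((ts, "missing or negative volume"))
        elif median and vol > 10 * median:
            issues.append((ts, "outlier: volume > 10x median"))
        prev_ts = ts
    return issues
```

Checks like these run before data reaches users, so an anomalous hourly print is flagged rather than silently passed downstream.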
Engineering and optimization
Transforming data doesn’t stop at the quality assurance stage: data also needs to be organized through clever schemas and smart structures.
Our US Job Listings data includes job postings activity for over one million US firms. Its fields include company name, size and sector as well as the position title, category, candidate profile, compensation and location. The data has been cleaned, de-duped and ticker-mapped to tradable securities.
But job listings are difficult to keep track of—sometimes a listing goes up, disappears, and then reappears. In the meantime, it’s difficult to ascertain exactly what is going on and how to interpret that activity. Is the job filled? Is the company hiring multiple people for the same position? Ambiguity like this is an unavoidable challenge with certain types of data.
Our data operations team addressed this with a schema that breaks the data into separate tables for job details, company details, entity details and job posting activity. Users can then join and link these tables to serve their own needs, without the data provider imposing interpretive assumptions.
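To make the idea concrete, here is a toy version of that table split using SQLite: static job and company attributes live in their own tables, while posting activity is an event log the user joins and interprets themselves. The table and column names are hypothetical for this sketch, not the dataset’s actual layout.

```python
import sqlite3

# Toy schema: attributes in reference tables, activity as an event log.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE company_details (company_id INTEGER PRIMARY KEY, name TEXT, ticker TEXT);
CREATE TABLE job_details     (job_id INTEGER PRIMARY KEY, company_id INTEGER, title TEXT);
CREATE TABLE posting_activity (job_id INTEGER, event_date TEXT, status TEXT);
""")
con.executemany("INSERT INTO company_details VALUES (?,?,?)",
                [(1, "Acme Corp", "ACME")])
con.executemany("INSERT INTO job_details VALUES (?,?,?)",
                [(10, 1, "Data Engineer")])
con.executemany("INSERT INTO posting_activity VALUES (?,?,?)",
                [(10, "2024-01-05", "posted"),
                 (10, "2024-02-01", "removed"),
                 (10, "2024-02-20", "reposted")])

# The user joins the tables and applies their own reading of the
# repost event, rather than a vendor-imposed interpretation.
rows = con.execute("""
SELECT c.ticker, j.title, a.event_date, a.status
FROM posting_activity a
JOIN job_details j     ON j.job_id = a.job_id
JOIN company_details c ON c.company_id = j.company_id
ORDER BY a.event_date
""").fetchall()
```

Keeping the activity log separate preserves the full posted/removed/reposted history, so one user can treat a repost as renewed hiring while another treats it as the same vacancy.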
Ensuring clarity and consistency
Symbology is a familiar challenge for anyone working with market data—the process of mapping indicators to tickers or other identifiers can be tedious, detail-oriented work.
It becomes especially complicated when working with alternative data because in alternative datasets, you’re often working with real-world entities such as addresses, products and brands. There is an entire discipline and expertise associated with mapping real-world entities onto tradable securities.
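A tiny illustration of one step in that discipline: normalizing raw entity names before looking them up in a security-master table. The suffix list and master table below are made up for the example; real symbology pipelines rely on far richer reference data and matching logic.

```python
import re

# Common legal suffixes stripped before lookup (illustrative list).
LEGAL_SUFFIXES = {"inc", "corp", "co", "ltd", "llc", "plc"}

def normalize(name):
    """Lowercase, strip punctuation, and drop trailing legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", "", name.lower()).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)

# Hypothetical security-master table for the example.
SECURITY_MASTER = {"apple": "AAPL", "microsoft": "MSFT"}

def map_to_ticker(raw_name):
    """Return the ticker for a raw entity name, or None if unmapped."""
    return SECURITY_MASTER.get(normalize(raw_name))
```

Even this sketch shows why the work is unforgiving: "Apple, Inc." and "APPLE INC" must resolve to the same security, while an unmapped name must fail loudly rather than match the wrong ticker.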
Our Global Supply Chain Relationships (GSCR) dataset maps public companies to their key customers and suppliers, providing a comprehensive view into companies’ business-to-business relationships.
Conventions can vary across markets, and GSCR covers 35,000 global public and private companies. The dataset also quantifies the revenue dependencies of significant relationships by percentage. These supply chain relationships can include subsidiaries, joint ventures, partnerships, customer-client links and more. Therefore, understanding entity relationships and symbology is critical to making the GSCR data useful.
Symbology is highly specialized, detail-oriented work. It requires solid supplemental data and is unforgiving of errors. But when done correctly, as in GSCR, it makes data useful and illuminating.
Up next: applying your data
Your dataset is cleaned, organized and labeled for use. The next step in our assembly line is arguably the most exciting: it’s time to apply your data. From use cases to models to simulations and case studies, it’s time to see what you can get out of your data. Stay tuned for the next installment of the Data Factory: applying your data.