What is the problem with multi-source data?
Whatever field your business operates in, you almost certainly depend on a whole range of different data sources for management, planning and forecasting. For instance, if you manufacture retail products then you’ll want to see your EPOS sales through different retailers; many companies use different codes to represent different time periods; and so on. Some of this data will take the form of free-form text and documents. Much more, though, will be “codified”, ie using a particular code to refer to a specific thing, such as “2016W01” to refer to 4th-10th January 2016.
Unfortunately, external sources will often use a different code from your internal one, making direct comparison impossible: this is, in a nutshell, the problem of multi-source data. The problem isn’t even limited to external sources: internal data can pose the same problem, sometimes as a result of acquiring another company but also due to differences of approach between departments over time.
Options for managing the problem
There are a number of options open to address this problem and allow your business to benefit from multi-source data, with varying degrees of success:
1. Ensure the data supplier uses your codes
If you can convince a data supplier to use your codes, that might that solve the problem. This can work if the data owner is specifically working on your behalf, or if they have already put the work in to managing it. However:
- You then need to continually liaise with them to ensure they have your internal data structures, which you may not want to do for security reasons
- Problems can be hard to spot, and you are at the provider’s mercy to resolve them if they do come up
- Your internal processes need to have reached a certain point to ensure that a single code is even available for them to use – some work might still be required at your end to allow this
2. Take a piecemeal approach
Can you just tackle the problems as they arise? It is not unusual to manage external data by having a local “Excel guru” or other expert create a spreadsheet or script to perform any necessary conversions. This may be fine for very small companies but, as the organisation grows, this approach runs a number of real risks:
- By making the different sources hard to work with, the opportunity to make the most of the data is lost or, at best, limited
- The company becomes uncomfortably dependent on the individual managing the process; if they leave, their (often incomprehensible) solution is left in limbo with nobody able to take it over
3. Tackle it head-on: the Golden Orb approach
We have helped a number of clients to make the most of multi-source data. Before describing our approach, we will start by defining a few terms for convenience:
- Dimension – the structure of the data. Examples of a possible dimension might include product, time, region and so on
- Source – where data comes from. Typically, companies have multiple sources, one or more of which are internal to the organisation
- Master data – the definitive list of values for a dimension. Different internal/external sources may use different codes to describe something, but the sources can be brought together by translating everything to a master code
- Mapping – the relationship between the code from a particular source and the master code
- Attribute – information about data, for example the start date of a time period. These are often useful to help set up mappings
- Hierarchy – many dimensions can be usefully viewed as a hierarchy, for instance the time dimension might contain years, quarters, months, weeks and days
An illustrative walk-through
To illustrate the process, here is a simple example. Consider a fictitious company “Aardvark Productions”, a recording label whose sell sell MP3s and CDs of music from their list of musicians through retailers and online. To manage their business, they want to understand their sales through various channels and how they respond to promotions.
They have an internal list of products – CDs, individual MP3s and certain merchandise. The main data sources they are interested in are sales on Amazon and iTunes, and listening through Spotify. The data is to be used by marketing, to be updated weekly. On this basis, they identify:
- 4 sources (internal, Amazon, Apple and Spotify), 1 dimension (product); time doesn’t require management as all sources can supply dated information
- Feeds for the products on offer for each source, as well as the actual data to be worked with (daily sales). Each source can supply the product description, and the internal and Amazon data will also contain a product EAN
- An operator in marketing support to manage the data weekly
The operator obtains the 4 lists of products, 1 for each source. These products and attributes are loaded into the multi-source data system. Since there is no master data set up yet, the operator adds the internal list of products as master values, storing the EAN and description as an attribute – this is how marketing are used to seeing products.
Next, the operator takes each other source, and works on the mappings. To start with, there won’t be any mappings, so each external code must be mapped to, or associated with, a master product. They work down the list, finding and saving the corresponding master code – which is almost trivial when an EAN is available, but otherwise can be done by matching descriptions.
At this point, every external product from Amazon, Apple and Spotify has been associated with a master product code. Enough information is now available to be able to take a source’s data file (eg daily sales) and, from it, create a new data file with the same numbers but using the master product codes used in the marketing information system. The files can therefore be easily worked on.
Now might also be a good time to set up a product hierarchy: marketing like to view their products by genre. This isn’t absolutely necessary, but can be very useful if aggregated data is required by the marketing system. Possibly a hierarchy is already available for the internal source, in which case this could be easily copied over to the master dimension.
Continuing the process
A week later, some more data is available. A new product has been launched, so there is a new internal code. When the operator loads the internal source, an unmapped product appears, and the operator can save that onto the end of the master list. Since it hasn’t been sold yet, it doesn’t appear for the other sources, though. When those sources are processed, therefore, the system already knows the master product codes for everything sold, and no further work is required.
The following week, sales start to come through. Each source contains sales for a product whose code hasn’t been mapped to a master product. The operator sees the un-mapped code, and is quickly able to locate the master product it relates to and set up the mapping. Again, the data for those sources can then be easily transformed into a form ready to analyse for the weekly meeting.
Making it work for you
In simple terms, that gives an overview of our approach to working with multi-source data. Depending on your particular requirements, there are many ways that the process can be refined to make it easier to operate:
- Many different attributes can be used, making it easy to come up with a short-list of values to pick from for mapping
- Hierarchy management can be included in the process – making it easy to work with existing hierarchies, and also using that information to speed up mapping, where possible
- Where structured descriptions are used, these can also be brought into play
- Much of the process can be automated if required, eg validating data and throwing up exceptions when new codes appear for a source
- Different kinds of data sources can be used – flat file, spreadsheet, REST API, database query and so on
So long as you cover these bases, though, you should always find it simple to make really good use of your internal and external data sources – often in ways that you wouldn’t even have imagined if the data were harder to work with.
We can help!
We have built up a lot of experience of working with multi-source data, gained both providing in-house tools and running an external service. If you think you could be doing a better job of working with multi-source data, do get in touch with us and we would be very happy to help you reach your potential.