MDM

MDM? POLE? SVP?

These are terms I use a lot whenever I hear anybody start talking about federating data sources, especially when the different sources contain customer/user data. As my best friends say, I love a good buzzword, all taken from my previous consulting days - I see you S.A.M. :unamused:

Master Data Management is a term I hear being thrown around everywhere, for pretty much every use case I'm involved in, and I've seen it cover a broad range of topics and challenges. My understanding, and how I interpret nearly everyone's use of the term:

We want to define a group standard for data when sharing it across departments/systems/externally, and/or enforce a standard of quality.

Throughout the mid 00s, MDM was a hot topic on people's lips, with the world going out and designing enormous schemas (I will assume this also increased the demand for A3-0 printers in the workplace), only to then place non-existent data into relational stores, hiring people like me to make sure their tables didn't grow too big and slow... we all did it.

After the first decade of fun, I was introduced by a dear friend & mentor, Sarah Morgan, to modelling data sources using POLE as a base for quality and structure. Every data item, when boiled down, should be definable as one of the following entities:

  • People
  • Objects
  • Locations
  • Events

Each of the entities can be further broken down and coded into subtypes. If we were to look at a real-world example, a person in this day and age is made up of a series of aliases used throughout the real and virtual worlds.
For matching to be performed, to be able to get a full picture of user habits, you will need to somehow link these entities through aliases to confirm and create a master record. Simple. :thinking_face:

Let’s define our problem space to be with our marketing department:

Penny has an exciting vision. She wants to be able to align emails from customers to the social media sentiment her team has been tracking in real-time, while monitoring support and complaints and so on. Like usual, the world is her department's oyster.
But they're in a pickle: their requirements list is growing and they still haven't learnt to code.
Penny falls back on trusty Darren. He's the cool guy in the IT department, a whizz with all sorts and an all-round lovely guy (he even asks whether you want a cuppa or not, and there are 4 floors between them).
So where the fudge does he start in using POLE as the structure of his MDM model?

And with that intro, they want to start by looking at the data sources that Darren is the SME for (they've been asked to gather data from these sources previously):

  • Customer Database for:
    • Mobile app data including location
    • Website page view history
    • Marketing history
  • Social Media Source
  • Email File server

Tell me more about this POLE you speak of, some might be saying...

It's simple: each letter stands for a master entity type, which can also have children through subtypes and hierarchies:

e.g.

-> Object
  -> Vehicle
    -> Car
      -> Estate
      -> Hatchback
      ...
    -> Truck
    etc.
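The hierarchy above can be sketched as plain data. This is a minimal illustration, not a fixed standard - the dictionary shape and the `label_path` helper are my own naming for this post:

```python
# A sketch of POLE entity types with subtype hierarchies, modelled as
# nested dicts. Only the Object branch is fleshed out, per the example.
POLE_HIERARCHY = {
    "Person": {},
    "Object": {
        "Vehicle": {
            "Car": {"Estate": {}, "Hatchback": {}},
            "Truck": {},
        },
    },
    "Location": {},
    "Event": {},
}

def label_path(hierarchy, target, path=()):
    """Return the chain of labels from a root entity down to `target`,
    or None if the subtype isn't in the hierarchy."""
    for label, children in hierarchy.items():
        current = path + (label,)
        if label == target:
            return current
        found = label_path(children, target, current)
        if found:
            return found
    return None
```

So `label_path(POLE_HIERARCHY, "Estate")` walks down to `("Object", "Vehicle", "Car", "Estate")`, which maps nicely onto multiple labels on a single node later on.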

People

I see many people struggle to get to grips with the use of the two main subtypes of entities: Alias and SVP. These don't have to be the only subtypes used - I've seen the idea extended with :Companies and :Systems that are important to the domain... maybe it should really be changed to :Persona or something.
An alias can be any login, email or handle that comes across from your data sources and represents a Person, Customer, Company, User... whatever you want to call them.

For example, if I wanted to build an alias from an email I've discovered, I would use the address as the unique key and attach extra metadata, like the IPs specific to that alias, to the node:

(a:Alias { id: email_address, primary_ip: ip1, known_ips: ['ip1', 'ip2'] } )

When I've analysed and defined the criteria for identifying aliases that we are confident can be grouped together to represent the same person, we can look to build an SVP. I will speak about SVPs in a bit more detail later.
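As a taste of what that grouping can look like, here is a deliberately naive sketch: aliases that share a known IP get unioned into the same candidate group (a stand-in for whatever matching criteria you actually settle on - field names are illustrative):

```python
# Naive alias matching: aliases sharing any known IP are grouped
# together as candidates for one master record (union-find).
from collections import defaultdict

def group_aliases(aliases):
    """aliases: list of dicts like {'id': ..., 'known_ips': [...]}.
    Returns lists of alias ids that share at least one IP."""
    parent = {a["id"]: a["id"] for a in aliases}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    by_ip = defaultdict(list)
    for a in aliases:
        for ip in a["known_ips"]:
            by_ip[ip].append(a["id"])
    for ids in by_ip.values():
        for other in ids[1:]:
            union(ids[0], other)

    groups = defaultdict(set)
    for a in aliases:
        groups[find(a["id"])].add(a["id"])
    return [sorted(g) for g in groups.values()]
```

In reality you'd weigh several signals (names, devices, behaviour) before committing two aliases to the same person, but the shape of the problem is exactly this.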

Objects

Objects are usually the easiest to define when the domain is understood. But the structure of the objects is key, as some are constrained by their specific domains and/or may have strict topologies depending on whether they are process or communication driven.

In any domain there can be a range of objects, from documents and emails; phones and devices; vehicles (we can break this category down even further, like we did in the example above - boats, planes, cars and trains); etc. We're even seeing everything from DNA through to biometrics being persisted in healthcare and security use cases!

Locations

Locations usually end up being a set of tree structures representing different location hierarchies, where the nodes within certain trees may also interconnect at select points, through the rules and enhancements applied by the different stages/services that perform ETL from sources.

We may have customer addresses that we've defined as a particular entity structure, forced on us by our super-outdated input form. But with Penny's newly found data science knowledge, Darren will need to drill down and understand users by particular areas of interest - "Why did we see a peak in sales for product x in territory y after campaign z?"

A postal structure is typically the easiest to mimic from the addresses provided by customers of a particular country: expanding a single :LocationAddress into its individual values such as :LocationCountry or :LocationRoad, or simply, with Neo4j's multiple labels, :Location:Country. This will allow us to navigate and generalise the questions we want to answer without always having to know the specific details of our exploration.

Our postal location structure could end up looking similar to:

-> Location
  -> Country
    -> Counties or States
      -> Post or Zip Codes
        -> Roads
          -> House Numbers
            -> Address
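Expanding a flat customer address into that tree is mostly bookkeeping. A small sketch, assuming an input dict whose keys are my own invention for illustration:

```python
# Expand a flat address record into (label, value) pairs, ordered from
# the most general to the most specific, ready to become :Location:*
# nodes in the postal hierarchy above.
def expand_address(address):
    levels = [
        ("Country", address.get("country")),
        ("County", address.get("county")),
        ("PostCode", address.get("postcode")),
        ("Road", address.get("road")),
        ("HouseNumber", address.get("number")),
    ]
    # Skip levels the source didn't supply rather than storing blanks.
    return [(label, value) for label, value in levels if value]
```

Each pair can then be merged as a node and linked to its parent level, so a missing county in one source doesn't break the tree for everyone else.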

But if we wanted to combine location data from our mobile application, we would probably look to introduce GPS coordinates, :Location:GPS, which could be made up of just a longitude and latitude, with or without an estimate of error/deviation.
The reason I said above that these trees may link is through us enhancing the data: we can calculate the links on ingestion or update of the data, or use external geocoding services to match GPS coordinates to countries, states, even down to the house number, as service jobs. #GodBlessTheInternet
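For a self-hosted flavour of that linking, the crudest possible version is a nearest-neighbour match by great-circle distance. A sketch (a real system would call a proper geocoding service rather than scan a list):

```python
# Link the GPS tree to the postal tree by snapping a coordinate to the
# nearest known address, using the haversine great-circle distance.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))  # 6371 km = mean Earth radius

def nearest_address(lat, lon, addresses):
    """addresses: list of (id, lat, lon). Returns the closest id."""
    return min(addresses, key=lambda a: haversine_km(lat, lon, a[1], a[2]))[0]
```

In production you'd also want a distance threshold, so a coordinate in the middle of the sea doesn't get confidently attached to somebody's house.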

Pit stop: Transition Graph

Hopefully, each system or data source that we look to federate will be state driven, or we can generalise the data transitions into states.

I like to first define with a customer my take on a "Transition Graph" (basically, combining a generalised transition graph with a state diagram - pretty colours on low tech) so I can relay my understanding of the problem, to clarify that we're singing from the same hymn sheet. These are great when we need to refer back to specific transitions in order to describe how entities are affected or manipulated. Plus, it's a piece of brilliant documentation that can be picked up and understood by anyone in the business, especially when you're building connectors to systems with bespoke ETL requirements.

For each state, define the data requirements, all relevant parties & systems involved, and the conditions they must adhere to - each state should be on an individual sheet, and you should try to organise them into a numerical hierarchy.

Example: A user order must move into the state of 4.0-Cancelled, before it can transition to 4.1-Refund or 4.2-Reordered
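That rule turns into data very naturally: each state lists the states it may legally move to, and everything else is rejected. A sketch, where `3.0-Placed` is a made-up predecessor state for illustration:

```python
# The numbered transition graph as data: each state maps to the set of
# states an order may legally move into next.
TRANSITIONS = {
    "3.0-Placed": {"4.0-Cancelled"},
    "4.0-Cancelled": {"4.1-Refund", "4.2-Reordered"},
    "4.1-Refund": set(),      # terminal
    "4.2-Reordered": set(),   # terminal
}

def can_transition(current, target):
    """True only if the transition graph allows current -> target."""
    return target in TRANSITIONS.get(current, set())
```

Connector code can then validate every incoming state change against this one table, instead of scattering the rules across services.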

Breaking a problem domain down like this makes for simple documentation, giving traceability at different levels of granularity for each of the model's individual entities. I find this especially useful when building distributed systems with distributed teams, when the vision is eventually to provide a means for automatic enhancement of the data.
It becomes super important and useful to constantly refer back to when systems must collect and amalgamate data in a particular order of precedence, either for a design or strategic reason (the business can see a quick win) or because we have a predefined condition out of our control - e.g. someone was lazy back in the day, so in order to enhance customer C with metadata m from source S, we must use source system A to be able to join the user through property Z; some API will only provide all the information through querying separate endpoints, etc.

Events

The states we discover and define can be trivially mapped to Event entities for our data model from real-world definitions; our use case would carry the example of the user's request to cancel an order. This can also go on to provide the basis of future endpoints and message payloads which underpin a business process, plus the principles of RESTful design.
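To make that mapping concrete, here is a sketch of the cancel-order transition becoming an Event record that could double as a message payload. All field names here are illustrative, not a prescribed schema:

```python
# Build an Event record from a state transition: the event connects
# the acting alias (Person) to the order (Object) it acts on.
from datetime import datetime, timezone

def make_event(event_type, alias_id, order_id):
    return {
        "type": event_type,                       # e.g. "OrderCancelRequested"
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "actor": {"alias_id": alias_id},          # who drove the event
        "subject": {"order_id": order_id},        # what it acted upon
    }
```

The same dict could be persisted as an :Event node with relationships to the :Alias and the order, or posted to an endpoint - one definition serving both the model and the API.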

Like all the other entities that make up POLE, I deem Events to be the enhancer of context through structure. Events add a notion of importance through time, or the requests that drive the model, by being the entity that connects all others through itself: a *User Sign Up Complete* request could well produce an :Alias entity.
Events help us map how people, objects, locations and even other events relate to each other. They allow us to ask detective-like questions such as: "Who was present in the vehicle stop and search that took place at location l on date x?" Or, in DarrenPenny's marketing use case, we might reach options to progress from a state due to conditions like: Alias 1 is requesting agreement; Alias 1 failed a check (e.g. > minimum salary requirement), so we can either Finish or Request a co-signer/guarantor, in which case :Alias 2, with the added metadata Mother, will enhance our graph model.

Using POLE

In a world where all of us have multiple devices, we've never really enjoyed the idea of one account to rule them all. We users give businesses (and me) a hard time when they want to look at cross-device attribution, perform matching to provide better search and recommendations, or investigate fraudulent activity. Our habits are driven by the device we're using, the platform (designed for Android or iOS?), application type (desktop web, mobile web, mobile app?), content, you name it.
For linking the aliases we hope to discover back to a human, or an entity in the eye of the law, I use the term Single View of Person/a.

SVP

I'm unsure about the origins of the term; I would like to think it is a military term, but I've also seen it used as Single View of Patient, or mixed into Single Customer View, a title the banks would claim.

An SVP should be thought of as the glue that binds our aliases as we look to match and aggregate a consistent and holistic representation of an entity. It's necessary for enterprises to take advantage of such simple or naive methods as a starting block in analysing and understanding customer behaviours on a micro and macro level.
Understanding these patterns allows us to deliver real-time recommendations and the right promotions to the decision maker, all the way through to blocking transactions because of suspicious activity.
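Structurally, the SVP is just the fold over a matched group of aliases. A naive sketch, where "first non-empty value wins" stands in for real survivorship rules and the field names are illustrative:

```python
# Fold a group of matched aliases into one Single View of Person.
# Precedence here is simply "first non-empty wins" - a placeholder for
# proper survivorship/precedence rules.
def build_svp(aliases):
    svp = {"alias_ids": [a["id"] for a in aliases], "known_ips": set()}
    for a in aliases:
        svp["known_ips"].update(a.get("known_ips", []))
        for field in ("name", "primary_ip"):
            if field not in svp and a.get(field):
                svp[field] = a[field]
    svp["known_ips"] = sorted(svp["known_ips"])
    return svp
```

In a graph model you'd more likely keep the aliases as their own nodes and hang them off an :SVP node, so the evidence for each merge stays queryable rather than being flattened away.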

Hopefully this gave a bit more of an explanation of POLE for when you hear someone jabbering on about it. I will get around to creating some more examples of DarrenPenny's quest with Neo4j at some point.