Why is data normalization evil OR Throw away what you’ve learned in school
Philippe Gauvin, Co-Founder of the innovative recruting software as a service matchFWD.com presents the challenges they faced when working with the database, as well as the solution they addopted to increase the speed of their application.
Watch the FULL interview
Links where you can read more, by Philippe Gauvin
- http://stackoverflow.com/questions/47711/how-do-you-determine-how-far-to-normalize-a-database
- http://www.agiledata.org/essays/dataNormalization.html
- http://en.wikipedia.org/wiki/Dimensional_database#Normalized_versus_dimensional_approach_for_storage_of_data
Raw Transcript
For this I have Phil from matchFWD.com. He is the founder and CEO. Phil, thank you for joining us.
Phil: Thank you. Thank you for being here.
Andrei: Phil, can you give our watchers more information about your background, about what your company does, when did you founded it?
Phil: All right. So I’ve been in technology for about 12 years in different roles, mostly developing as a software engineer. I’ve worked in image analysis field, mobile gaming, digital signage and the web, obviously.
We started Match FWD just about a year ago, a little less than a year ago, with a cofounder who has been in the recruitment industry for over 10 years, to build a new way for people to connect, for hiring managers to connect with talents and talents with hiring managers.
We think the best hire comes from referrals, from personal referrals. So we wanted to build a tool around that. So for the hiring manager, it takes the form of a social recruiting campaign manager. So you post a job and we give you all the tools to identify candidates and identify people that could put you in touch with good candidates that would match your company and your skills requirement.
We do the same thing for the candidate, too. So if you already have a job and you’re looking for a new opportunity, you want a new type of career, you can say that privately to a certain group of people and we help them see what’s available in their networks. So we can refer you good jobs or good people to work with.
Andrei: OK, OK. You mentioned that you’re using referrals. Are you taking your data from any part of…? How do you get…?
Phil: Yeah. So the whole platform is built on top of the LinkedIn API. Right now we’re integrating Facebook, too. So we don’t want people to start reentering all their profile information. The signup process is very easy. We basically import all the information we can find around your profile.
Obviously if you’ve never entered any work-related information in Facebook, you can do it on the side. But yeah, it’s a pretty easy signup.
Andrei: I understand, I understand. So if you’re using LinkedIn and soon Facebook, that means you are… I can imagine you are working with a lot of data. LinkedIn has a million if not millions of information on personnel, jobs, companies, et cetera. So how do you tackle with that?
Phil: Yeah. Well, obviously right now we don’t have access to all LinkedIn API. The API terms are pretty strict to what amount of data you’re allowed to tap in. But yes, nevertheless, even just in beta, the amount of data we had to work with is pretty impressive.
As you mentioned earlier as the topic or the title of this interview, we learned that what you learn in school around data normalization is not always the best solution when you’re dealing with a lot of data.
Andrei: OK. Can you give our watchers a better understanding of the number of data that you are talking about? I know that it depends on the actual information but perhaps you can give us a rough number of rows for example, or individuals, how do you have them?
Phil: Yeah. Well, I can tell you that not very long ago, just in test, we’re playing with around 200 to 250 different profiles. But that’s not really the impressive number.
The difficult part is trying to base on the job title and the job description that people entered in LinkedIn, figure out what’s their actual occupation. So to do that, we’ve built a pretty large library of job title aliases, all the different variation, misspelling, and version of a job title that someone can understand.
That occupation table right now is probably topping one million rows. When you have obviously to run it… So if you do one profile, one person, he already had five jobs and each of those five jobs you tried to match it properly with one of the alias that we have in the one million table row, and in your network you have, I don’t know, 800 or 900 people that it takes…
Andrei: It takes time.
Phil: It takes some time.
Andrei: OK. So basically you built a dictionary of possible outcomes and then you search for each job title that one person had. You search into that dictionary. So if I have, for example, three past jobs, then for each one you’re searching the 1.5 million rows. Correct?
Phil: Well, that’s a little bit like that. It’s a bit more complicated because we do different things on the string. We tokenize the string. We stem the string to make sure that misspellings are taken into account. Or that if you’re a developer or development, we can…
Stemming basically is taking just the core of a word. So “developer” and “development” is “develop,” “programmer” or “programming,” “program.” So it’s easier to match string that are not exactly similar but have very similar meaning.
So we do a lot of that. That takes a lot of time. Yeah.
Andrei: OK, I understand. And what was the first solution that you implemented to query these 1.5 million rows of dictionary regarding possible job titles? What did you use?
Phil: For the queries?
Andrei: How did you store, actually even better, this information? You were talking about initially you started with data normalization.
Phil: Yeah. Well, those rows are fairly normalized. But one place that we found out that normalization was not used is mostly in the rest of the profile, the company information, the domain of each company.
At first every little part of the LinkedIn profile was absolutely normalized. When you start a product and you’re not absolutely sure are you going to use the information and what level of granularity you want to be able to query on, because that’s what you’ve been told to do so many times, you just normalize all the data.
The more we work with you realize that there’s tons of stuff that you’re not so much interested on. There’s tons of query you don’t really need to do. That’s when you start denormalizing your database, because it then gets much, much easier to store.
So when we import your 600 or 700 friends, if we don’t have to store a separate company for each, that’s already a big plus. So that’s all we’re doing.
But a lot of trial and error, obviously. It would be fun to say, “We had the perfect data architecture when you start it,” but that’s absolutely not true.
Andrei: Those are big durations for.
Phil: Yeah.
Andrei: OK. Before we go into what is the normalization and giving them example, can you give a number… For example, for these 1.5 million rows of job titles, how much time did it take to work with them?
Phil: Well, the first time we try importing – the first person who worked on it was a bit surprised when we were importing LinkedIn contacts, it was taking… To someone with a fairly big network of 700 or 800 people, it was taking about two minutes. He was like, “Wow. That’s incredible. It’s way too slow.”
Then we showed him a notable site, a big site like Jive and indeed where there’s a LinkedIn API connection and you see, it takes them a lot of time too. We’re not the only one having problems dealing with all the data. But denormalizing it, it turns off a little optimization everywhere.
We took it down to about 12 seconds. So it’s probably one of the fastest LinkedIn connection import I’ve seen in any consumer application I think it’s pretty good.
Andrei: That’s over 90% decreasing the speed.
Phil: Yeah, absolutely. Yeah.
Andrei: And two minutes to 12 seconds, that is huge.
Phil: Yeah. The idea is that you don’t simply do it to be proud of yourself, but the goal is that as soon as somebody registers, we’re able to show him the full product. We don’t have to show him, “Oh, well, we’re still importing your connection, you cannot use all the site, blah, blah, blah.”
When somebody registers you have that little moment where you want the magic to happen. If it doesn’t it gets more difficult to resell your product after that. “Oh, no, come back. We’ve finished importing your connection.” You don’t know, maybe the guy is handling email or he’s back to work and you’ve lost the opportunity to impress him and make him a loyal user.
Andrei: I understand. Can you give an example, how does the normalization work? We know that it took from two minutes to 12 seconds. What does this mean from a technical perspective? How did you do it? For example, you could present A) when it was normalized, the account was like this and when we finish denormalizing it, the information was stored mostly like that. Is it OK with you?
Phil: Yeah. Sure. Absolutely as an example, for each profile you have a list of position or jobs that the person occupies. For each of those positions you have a job title, a company, and a summary. So in the beginning the job title was referring to another table, the big table of job title, because that’s one of our big tables, obviously. And the company was referring to another table, was company.
So you can usually imagine, for one LinkedIn profile, let’s say the person had eight different jobs. First, each job itself is in a different table. So it’s already a row. And for each one of those there’s also two foreign key, one to another job title and one to company.
It seems fantastic because it allows you to do some very deep queries on all the data, but after that you realize that it’s impossible to do it like that if you want to import connection very, very fast.
So we brought back the company in the position itself. So instead of being a foreign key, it’s just a string there. We’ve also imported the job title itself instead of storing it as a key foreign in another job and in another title. It’s a string there. Already that was a big, big, big improvement.
On the other end, we had to deal with LinkedIn has this very interesting thing where a company name is associated to a company ID, obviously. But multiple profile for the same company can have slightly different strings for the company.
So the company ID is the same, but the company name – one might be Barabas Inc., the other is Barbara Incorporated and the Barabas Company Incorporated So we had to deal with that and make sure that we understand that it’s all the same company too.
Andrei: OK. For example, some of the employees might put “Limited” in the name, some others don’t. I understand your point.
OK. So basically instead of having multiple tables and each information spread throughout those tables and when you need them, query them and join them, basically what you did, you pre-prepared everything so that when you need them you simply query them in one table which can be optimized with indexes, et cetera, and you can obtain faster results.
Did I understand correctly?
Phil: Yeah, absolutely.
Andrei: OK. Great. Do you think this approach of putting aside what you learn in school and thinking out of the box, do you think that works only for your particular case or it’s a thing that online entrepreneurs consider it, if not implement it, at least consider it? What is your opinion?
Phil: No, I think it’s applied already in multiple fields. I think that’s the main reason why all the Reddit and MongoDB and NoSQL database are so useful is that a lot of time you realize that all you need is a key and a value. And you don’t need all the complicated join to other tables. And you realize that it’s much faster this way. So no. But yeah, as I say, trial and error.
I think somehow you’ve been told a lot about normalization. Normalization is fantastic if you want to do reporting. For reporting it’s very generous. I’m guessing that some guys who wrote a lot of those books were doing a lot of reporting jobs back then.
Andrei: I understand. Yeah, well you can denormalize for your online usage and for reporting you can start at another database where the marketer or whoever needs those information could work with them in the normalized way.
Phil: Yep.
Andrei: OK. I agree. Can you suggest, or at least can you send me one or two links where you think our watchers could learn a little bit more about this approach of denormalization, sorry.
Phil: Sure.
Andrei: I [audio breaks up] to put it on the transcript.
Phil: Yeah, I don’t have one at hand but I can find quite a few very rapidly and that will be my pleasure to send it over.
Andrei: Sure. Also, if you would give one particular advice to fellow online entrepreneurs, what would that be?
Phil: [laughs] One advice.
Andrei: Well, you said a lot of trial and error.
Phil: Yeah. I think really think – it’s very difficult. I would say think about how you’re going to use the data. I think if you’re building something that it’s very clear how it’s going to be built and how it’s going to be used, it’s pretty easy and you should just sit down and think about it very well before you start your data model.
But when it gets difficult is that when you’re trying different product ideas and different features. At that point you’re not absolutely sure how you’re going to use the data. So the advice of really thinking how you’re going to use the data works only if you know exactly what you’re going to build.
If I had – I don’t know, it’s a pretty common one, but make it work and then make it fast. Don’t try to do both at the same time. It’s very depressing the first time because it’s awfully slow. But profiler – yeah, OK, so my advice would probably be learn to use the profiler.
If you’ve never used a profiler there’s something wrong. It should be the first thing you’re able to start and you should be able to use it very rapidly and proficiently.
Andrei: OK. Just a quick question, for those that don’t know exactly what the profiler is, can you explain in a short sentence what does it do?
Phil: Sure. Well, a profiler is very, very simple. It basically shows you all the methods that are called during the execution of a routine or a program and shows, well, for each method, how many times they’re called, how many total time is spend in the method and how many times, in average, it spent in the method.
So it allows you to decide where, probably, you should plan your time. If there’s one method that is called pretty often and that is fairly slow… Especially things like string transformation. Everything that’s related to string or conversion of type is often an easy place to start to improve things.
Sometimes you take a string, you switch it to an end and then you switch it to another type to realize that at the end you come back to a string, because it wasn’t coded in that order. So sometimes that’s an easy place to… But basically a profiler shows you all the methods that are called, the time you spent in each, and then it’s easy for you to spot which method you should try to prioritize in your optimization.
Andrei: OK. So this works also at programming level and database level, I guess.
Phil: Yeah, both. They’re different types of profiler, but yeah.
Andrei: OK. So, use the profiler.
Phil: Yeah, basically, use the profiler.
Andrei: OK. Do you want to say anything else about MatchFWD, about – not necessarily your plans – your value proposition…?
Phil: Sure. We’re coming up with a full product launch around mid-February. I think we’ve built something very interesting to putting back the people at the center of the employment game. So I think we’re really turning the job board idea upside down and trying to put people in touch with people, which is very, very interesting.
We’ve had great, great, great feedback from our data testers and early users. I just can’t wait to see it in full effect in a couple months.
Andrei: We too. We are looking forward to seeing. Everyone visit matchfwd.com. You will find the link below. Phil, thank you very much for doing this interview. Thank you, everyone, for watching. This is Andrei, have a great day.
About the author
Hi, I'm Andrei and together with the Barandi Solutions team I'm looking to bring you the best resources so you can learn what it takes to build and run a successful Software as a Service business.
- More at

Sign up for our newsletter!







