http://www.pqsw.com

by Tanya Charbury

Non-Technical Overview of Normalization

The purpose of this document is to explain, in non-technical terms, what normalization is, why we use it, how we implement it, and what the benefits are. This way, when you use the database systems we design, you will be more likely to recognize and enjoy the underlying design, and be better able to maximize the benefits to yourself.

What is Normalization?

Normalization is a principle of database design. It pertains to how the data containers are structured. It's a principle to which we adhere, intentionally. Normalization has several aspects, but one of the central principles is that a particular piece of data is only stored once, centrally. If you need that piece of data again, in the system, then you don't duplicate it: you point to it, instead.

A good formal definition comes to us from http://webopedia.lycos.com, which defines it as "the process of organizing data to minimize redundancy. Normalization usually involves dividing a database into two or more tables and defining relationships between the tables. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships."

Cleaner Data, Less Typing

The following is a typical non-normalized example:

Street City State Voice Fax
731 Lakefair Sunnyvale Calif 4087341234 4087341235
634 Lakefair Sunnivale Ca 4087341234 4087341235
834 Lakefair Sunnyvale Ca 4087341234 4087341235
869 Lakefair Sunnyvale Ca 4087341234 4087341235
145 Lakefair Sunnievale Cal. 4087341234 4087341235

In a typical non-normalized database, values like city names and state abbreviations are stored once per address record. As a result, the values typically have to be typed in full, and it is difficult to keep the data clean and consistent. Also, there is a lot of unnecessary typing.

In a typical normalized database, there would be a separate container for city values, and a separate container for state values. A city would be associated with a state. As in the non-normalized database, in the address data entry screen, the user would have the opportunity to type the city name. However, as the user types, the typical normalized system will try to guess ahead, as to the value desired. For example, if the user types "sun" and there is only one city name (Sunnyvale) which starts with those letters, the typical normalized system will display "Sunnyvale, Ca." The user can then accept the value, by simply proceeding to the next field. The user does not need to indicate that the city is in California, because the link between Sunnyvale and California has been pre-established.

So, two benefits of the normalized approach are that the user needs to type less, and has cleaner data as a result.

The typical non-normalized system would be difficult to use as a basis for accurate report-writing. For example, asking for addresses in "Sunnyvale" would yield only some of the records intended to be in Sunnyvale. The other records would be invisible to the report, due to having being misspelled. Similarly, a report trying to list all of the addresses in "Ca" would not actually find all of the address records intended to be in California. It would miss "Calif, Ca and Cal."

Updating the Control Data Containers

The normalized approach has an additional complexity. If the city of Sunnyvale is not on file, the user would type "Sun" and not see any helpful display as a result. In such a situation, the user would:

Data Value Changes

Another benefit of storing data centrally pertains to making changes to the data:

Product Department
Muffler Datsun
Exhaust manifold Datsun
Muffler Toyota
Exhaust manifold Toyota

Let's assume that, in an auto parts business, department "Datsun parts" changed its name to become "Nissan parts." In this example, the non-normalized database would require changing all of the "Datsun" values in the product records to "Nissan" instead. This would be a time-consuming process if there were thousands of "Datsun" product records. Also, if mass update process died halfway through, the database would be internally inconsistent.

In a normalized database, a product record would not contain the value of "Datsun", but would instead point to a record in the "Department" container, which would contain "Datsun." To change "Datsun" to "Nissan", the systems administrator would simply edit the "department" record. This change would affect all of the product records pointing to this "department" record. Suddenly, they would all be "Nissan" products.

Another good example, from the book "Oracle, a Beginner’s Guide" is where a person gets married and changes her last name. In most companies, this requires changing her name in many different places, including security, internal company directory, payroll, personnel, etc.

It would be ideal if a person’s name were stored centrally, so that the change needs to be made in only one place, and the effect of this update would immediately benefit all of the other systems which point to that record.

Data Structure Changes

Another example would be where a database contains a field called "phone number", and another called "fax number." Until cellular phones came along, this would be an adequate data structure. However, with the arrival of cellular phones, this structure would need to be expanded, by adding another field. Typically, restructuring a database in this way requires a lot of programmer work.

If the structure were normalized, where the phone types are not embedded in the type of field, no database changes would be necessary.

There would be a "type of phone" defined and stored in a central container, and every phone record would point to a record therein. The "phone type" container would have two records: "voice" and "fax." Some phone records would point to the voice type record, thereby effectively announcing "I am a voice line" and others would point to a fax type record thereby announcing "I am a fax line."

When cellular phones need to be supported, a "cellular" record would simply be added to the central phone type container. Then, when cellular phone numbers are entered, they would be pointed to the "cellular" record type, thereby effectively announcing "I am a cellular phone line." No program changes would be required to effect this change.

Storage Efficiency

In the non-normalized database, each address record would contain each city name, and would have to be structured to have enough space for the longest possible city name (which might be "Rancho Cucamonga"). Unfortunately, this means that each address record needs to allocate 16 characters of space to the "city" field. If there are 100,000 address records on file, this would require that 16 x 100,000 = 1,600,000 characters to store the city names.

The normalized database would not store the city names in the address records, but instead would store the city names in a central "city" data container. In each address record would be a numerical pointer to a central city record. A number requiring two characters of internal memory space would be adequate for pointing to any one of 65536 different city records, presumably enough for all the cities in the U.S.

So, each address record would contain only a number, requiring 2 characters instead of 16.

As a result, the database would need only 2 x 100,000 = 200,000 characters of space, to support city names in the database. Compared to 1,600,000 characters, this is a huge savings in disk space. As a result, the program would be able to run on less expensive hardware, and be able to retrieve data faster.

Conclusion

There are several benefits to normalization.

A major benefit is in programmer productivity, since tasks are more complex but less time-consuming. Systems can be made and expanded faster and maintained more easily.

Normalized systems also make it easier to retrieve the data and to keep it clean.