
Open Source Database: Virtue or Vice?


 

Copyright

 

This ebook is licensed for your personal enjoyment only. This ebook may not be re-sold or given away to other people. If you would like to share this book with another person, please purchase an additional copy for each recipient. If you’re reading this book and did not purchase it, or it was not purchased for your use only, then please return to your favorite ebook retailer and purchase your own copy. Thank you for respecting the hard work of this author.

 

 

If you have downloaded this eBook, please note that it remains the copyrighted property of the author and may not be redistributed to others for commercial or non-commercial purposes. If you enjoyed this book, please encourage your friends to download their own copy from their favorite authorized retailer. Thank you for your support.

 

Copyright © June 8, 2017, Binayaka Mishra

All rights reserved.

 

DEDICATION

 

I would like to dedicate this manuscript to my stalwart elder brother, Mr. Bibhu Prasad Mishra, for his outstanding courage and service to the Indian Air Force against all odds, including earthquakes, tsunamis and other natural calamities. Though he suffered greatly through those moments and was still commanded to carry on with his duties, his achievements have always made me proud and encouraged me. Needless to say, if I am at all able to write this book today, it is because of the hardships my brother endured, which have always inspired me.

TABLE OF CONTENTS

 

1. Background

2. Delineation

2.1 Relational Database

2.2 Operational Database

2.3 Database Warehouse

2.4 Distributed Database

2.5 End-User Database

2.6 External Database

2.7 Hypermedia Database

2.8 Navigational Database

2.9 In-Memory Database

2.10 Document-Oriented Database

2.11 Real-Time Database

2.12 Analytical Database

3. Open Source DBMS VS Commercial DBMS

3.1 Method of Score Calculation

4. Featurette

4.1 PostgreSQL

4.2 MongoDB

4.3 Cassandra

4.4 Redis

4.5 SQLite

4.6 MySQL

4.7 FireBird

4.8 ElasticSearch

5. Data Modelling

5.1 Conceptual Data Model

5.2 Logical Data Model

5.3 Physical Data Model

6. Glossary

7. Data Dictionary

8. Data Mapping

9. Entity Relationship Diagram

9.1.1. Entity

9.1.2. Entity Type

9.1.3. Entity Set

9.1.4. Entity Category

9.1.5. Composite Entity

9.1.6. Recursive Entity

9.1.7. Entity Keys

9.2 Attributes

9.3 Relationship

9.4 Relationship Set

9.5 Keys

9.6 Cardinalities

9.7 Participation Constraints

10. Normalization

10.1 First Normal Form (1NF)

10.2 Second Normal Form (2NF)

10.3 Third Normal Form (3NF)

10.4 Fourth Normal Form (4NF)

10.5 Fifth Normal Form (5NF)

10.6 Boyce Codd Normal Form (BCNF)

10.7 Functional Dependency

10.8 Armstrong’s Axioms

10.9 Trivial Functional Dependency

11. De-normalization

11.1 Adding Redundant columns

11.2 Adding derived columns

11.3 Collapsing Table

11.4 Snapshots

11.5 VARRAYS

11.6 Materialized Views

11.7 Advantages of De-Normalization

11.8 Disadvantages of De-Normalization

12. Data Independence

13. Creating Logical Data Model

14. Creating Physical Data Model

15. Data Modelling Best Practices

16. Data Modelling Variety

16.1 Flat Data Model

16.2 Entity Relationship Data Model

16.3 Relational Data Model

16.4 Network Data Model

16.5 Hierarchical Data Model

16.6 Object Oriented Data Model

16.7 Record Based Data Model

16.8 Object Relational Data Model

16.9 Semi-Structured Data Model

16.10 Associative Data Model

16.11 Context Data Model

17. Open Source Database Infrastructure for Digital Transformation

18. Consolidation

19. References

20. Citation

 

ACKNOWLEDGMENTS

 

 

This paper is for informational purposes only. THIS DOCUMENT IS PROVIDED “AS IS” WITH NO WARRANTIES WHATSOEVER, INCLUDING ANY WARRANTY OF MERCHANTABILITY, NONINFRINGEMENT, FITNESS FOR ANY PARTICULAR PURPOSE, OR ANY WARRANTY OTHERWISE ARISING OUT OF ANY PROPOSAL, SPECIFICATION, OR SAMPLE. The author of this paper disclaims all liability, including liability for infringement of any property rights, relating to use of this information. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted herein. I would like to personally thank the various informational journals listed in the “References” and “Citation” sections for helping me produce this content and for teaching me new ideas.

1. BACKGROUND

 

For those of you who are new to OSS [Open Source Software], I would advise reading my previous article, mentioned in the reference section of this book, to gain some background on the OSS concept before proceeding with OSD [Open Source Database]. It is not always imperative to know the parent in order to identify its child, but if you would like to understand the hierarchical, inheritance-like relationship between OSS and OSD, it is always advisable to start from scratch.

 

As we all understand, a database is a place to store one’s data. If we go back six to seven decades, people managed their data directly on physical media such as punched cards and magnetic tape, which were not only cumbersome to handle but also carried greater security risks. That is largely what brought us the DBMS [Database Management System] by the late 1950s and early 1960s, when the world started managing data through software rather than depending solely on hardware.

 

In 1962, the Oxford English Dictionary cited a technical report from the System Development Corporation of California, USA as the first use of the term “data-base” in a technical sense. In the mid-1960s, Charles William “Charlie” Bachman III, an American computer scientist and the author of IDS [Integrated Data Store], founded the Database Task Group within CODASYL, the group also responsible for the creation and standardization of COBOL. In 1971, CODASYL delivered its approach to the user database along the lines listed below, and other market leaders followed. IBM, meanwhile, had its own DBMS known as IMS, developed for the Apollo program on System/360; it was always considered a hierarchical database because of its strictly hierarchical data model for locating particular records or groups of records, whereas IDS was always considered a network database.

 

The CODASYL approach, as implemented in IDMS:

1. Use of a primary key (known as a CALC key), typically implemented by hashing (http://interactivepython.org/runestone/static/pythonds/SortSearch/Hashing.html)
2. Navigating relationships (called sets) from one record to another
3. Scanning all the records in a sequential order
4. B-tree methods to provide alternate access paths

Fig 1.1 CODASYL Database Model
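
To make these access paths concrete, here is a toy Python sketch (not actual CODASYL or IDMS code; the records are invented) contrasting a hashed, CALC-style key lookup with a plain sequential scan:

    records = [
        {"id": 101, "name": "Parts"},
        {"id": 205, "name": "Suppliers"},
        {"id": 309, "name": "Orders"},
    ]

    # CALC-style access: hash the key once and jump straight to the record.
    calc_index = {hash(r["id"]): r for r in records}

    def calc_lookup(key):
        # Direct lookup via the hashed key (loosely analogous to a CALC key).
        return calc_index.get(hash(key))

    def sequential_scan(key):
        # Walk every record in stored order until the key matches.
        for r in records:
            if r["id"] == key:
                return r
        return None

    print(calc_lookup(205))       # {'id': 205, 'name': 'Suppliers'}
    print(sequential_scan(309))   # {'id': 309, 'name': 'Orders'}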

 

 

In 1970, Dr. E. F. Codd of IBM introduced the relational model behind RDBMS [Relational Database Management Systems], later formalised in the set of golden rules he prescribed (widely known today as Codd’s 12 rules). The theory proposes that data be stored and searched as relations (tables), where each record is identified by a unique key. Codd’s paper was first picked up in 1973 by two researchers at Berkeley, Eugene Wong and Michael Stonebraker, who started a project called INGRES; the product was widely adopted by 1979. INGRES was similar to IBM’s System R: System R worked with the emerging SQL standard, while INGRES used QUEL. In 1970, the University of Michigan began development of the MICRO Information Management System based on D.L. Childs’ Set-Theoretic Data Model. MICRO was used to manage very large data sets by the US Department of Labor, the U.S. Environmental Protection Agency, and researchers from the University of Alberta, the University of Michigan, and Wayne State University. It ran on IBM mainframe computers using the Michigan Terminal System, and remained in production until 1998.

 

In the 1970s and 1980s, attempts were made to build databases that integrated hardware and software, with the aim of improving performance and speed at a reduced cost. Examples of such database machines include IBM’s System/38, Teradata and the Britton Lee, Inc. database machine. Another approach was ICL’s CAFS accelerator, a disk controller with programmable search capability. Keeping up with these kinds of DBMS proved impossible in the long run because of the rapid progress of general-purpose computers. However, a few products, such as Oracle Exadata and Netezza, still apply these ideas.

 

In the early 1980s, IBM turned its System R research into commercial SQL-based DBMS products, the best known being DB2, which is still a powerful DBMS running successfully worldwide. DB2 grew directly out of the System R prototype of Dr. Codd’s relational proposal and eventually displaced the earlier CODASYL-style DBMS products.

 

In the late 1970s, Larry Ellison, working from IBM’s published papers on System R, produced the first version of the Oracle database, and by 1979 the world had become aware of Oracle DB as an emerging market leader. Oracle DB not only built on the relational concepts behind System R and DB2, but also embraced the entity-relationship modelling ideas presented in 1976, which made it popular with IT market leaders. Later, Michael Stonebraker built on the INGRES approach to produce Postgres, one of the earliest and best-known open source databases.

 

In the 1980s, people came across another DBMS called dBASE, popularly known as a desktop database, created by C. Wayne Ratliff. The major reason for dBASE’s popularity among users was its ability to perform all the data manipulation a user expected with minimal input. The 1990s, along with the rise of object-oriented programming, saw a change in how data in various databases was handled: programmers and designers began to treat the data in their databases as objects.

 

In the early 2000s we saw NoSQL databases and, more broadly, OSD [Open Source Databases] offering almost every capability of the RDBMS products descended from System R, DB2 or dBASE. This period also gave birth to XML databases, enterprise databases and denormalised databases designed for better performance of the underlying data operations for end users and customer-support personnel. Open source databases such as PostgreSQL, MongoDB, CouchDB, SQLite and Firebird all emerged or matured during this period.

 

 

2. DELINEATION

 

 

It would be unfair to judge a general without seeing the structure of his battalion. Likewise, if we really want to understand the inner workings of a DBMS, RDBMS or OSD, let us dissect its base component: data. Data is nothing but raw information about anything or anybody, in whatever format and size is required. A database is an organised collection of data objects that stores that data, whereas the DBMS is the software that runs the database and the applications on top of it to process and analyse the data. Various types of DBMS exist because of the different structures data takes across organisations. A few of them are listed below:

 

2.1 Relational Database

This is the most common of all the different types of databases. Data in a relational database is stored in various data tables. Each table has a key field which is used to connect it to other tables, so all the tables are related to each other through key fields. These databases are extensively used in various industries and will be the ones you are most likely to come across when working in IT. Examples of relational databases are Oracle, Sybase and Microsoft SQL Server, and they are often key parts of the software development process. Hence you should ensure you include any work required on the database as part of your project when creating a project plan and estimating project costs.
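
As a minimal sketch of the idea, assuming nothing more than Python’s built-in sqlite3 module (the tables and data are invented for illustration), two tables related through a shared key field might look like this:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (
            customer_id INTEGER PRIMARY KEY,     -- key field
            name        TEXT NOT NULL
        );
        CREATE TABLE orders (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
            amount      REAL
        );
    """)
    conn.execute("INSERT INTO customer VALUES (1, 'Acme Ltd')")
    conn.execute("INSERT INTO orders VALUES (10, 1, 250.0)")

    # The two tables are related through the shared customer_id key field.
    for row in conn.execute("""
            SELECT c.name, o.order_id, o.amount
            FROM customer c JOIN orders o ON o.customer_id = c.customer_id"""):
        print(row)   # ('Acme Ltd', 10, 250.0)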

 

2.2 Operational Database

In its day-to-day operation, an organisation generates a huge amount of data. Think of things such as inventory management, purchases, transactions and financials. All this data is collected in a database which is known by several names, such as operational/production database, subject-area database (SADB) or transaction database. An operational database is usually hugely important to an organisation as it includes the customer database, personnel database and inventory database, i.e. the details of how much of a product the company has as well as information on the customers who buy it. The data stored in operational databases can be changed and manipulated depending on what the company requires.

 

2.3 Database Warehouse

Organisations are required to keep all relevant data for several years; in the UK it can be as long as six years. This data is also an important source of information for analysing and comparing the current year’s data with that of past years, which makes it easier to determine key trends. All this data from previous years is stored in a database warehouse. Since the stored data has gone through all kinds of screening, editing and integration, it does not need any further editing or alteration. With this kind of database, ensure that the software requirements specification (SRS) is formally approved as part of the project quality plan.

 

2.4 Distributed Database

Many organisations have several office locations, manufacturing plants, regional offices, branch offices and a head office at different geographic locations. Each of these work groups may have their own database which together will form the main database of the company. This is known as a distributed database.

 

2.5 End-User Database

There is a variety of data available at the workstation of all the end users of any organisation. Each workstation is like a small database in itself which includes data in spreadsheets, presentations, word files, note pads and downloaded files. All such small databases form a different type of database called the end-user database.

 

2.6 External Database

There is a sea of information available in the outside world which an organisation may require. This is privately-owned data to which one can have conditional and limited access, usually for a fee. This data is meant for commercial usage. All such databases outside the organisation which are of use and offer limited access are together called the external database.

 

2.7 Hypermedia Database

Most websites have various interconnected multimedia pages which might include text, video clips, audio clips, photographs and graphics. These all need to be stored and “called” from somewhere when the webpage is created. Together they form the hypermedia database. Please note that if you are creating such a database from scratch, be generous when creating the project plan, detailed when defining the business requirements documentation (BRD) and meticulous in your project cost controls. I have seen too many projects where the creation of one of these databases has caused scope creep and an out-of-control budget.

 

2.8 Navigational Database

A navigational database holds items which are referenced from other objects; one has to navigate from one reference to another, or one object to another. Navigation might use modern mechanisms such as XPath. One of its applications is in air-flight management systems.

 

2.9 In-Memory Database

An in-memory database stores data in a computer’s main memory instead of using a disk-based storage system. This makes access faster, though less durable, than disk-based storage. In-memory databases find application in telecommunications network equipment.
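
As a minimal sketch, again using Python’s built-in sqlite3 module purely for illustration, the same engine can run entirely in main memory when opened with the special ":memory:" name, so no disk file is ever created:

    import sqlite3

    mem = sqlite3.connect(":memory:")   # data lives in RAM and vanishes when closed
    mem.execute("CREATE TABLE session (token TEXT PRIMARY KEY, user_name TEXT)")
    mem.execute("INSERT INTO session VALUES ('abc123', 'alice')")
    print(mem.execute("SELECT user_name FROM session WHERE token = 'abc123'").fetchone())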

 

2.10 Document-Oriented Database

A document-oriented database is a different type of database used by applications which are document oriented. The data is stored in the form of self-contained documents (text records) instead of in rows of a data table as usually happens.
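
A toy illustration of the idea in plain Python (the documents and keys are invented): each record is a free-form JSON text document rather than a row in a fixed table:

    import json

    documents = {}   # a tiny document store keyed by document id

    documents["user:1"] = json.dumps(
        {"name": "Alice", "emails": ["a@example.com"], "prefs": {"theme": "dark"}})
    documents["user:2"] = json.dumps(
        {"name": "Bob"})   # documents need not share the same fields

    doc = json.loads(documents["user:1"])
    print(doc["prefs"]["theme"])   # dark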

 

2.11 Real-Time Database

A real-time database handles data which constantly keeps changing. An example of this is a stock market database, where the value of shares changes every minute and needs to be updated in the real-time database. This type of database is also used in medical and scientific analysis, banking, accounting, process control, reservation systems, etc., essentially anything which requires access to fast-moving and constantly changing information. Assume that this will require much more time than a normal relational database when it comes to the software testing life cycle, as these are much more complicated to test efficiently within normal timeframes.

 

2.12 Analytical Database

An analytical database is used to store information from different types of databases such as selected operational databases and external databases. Other names given to analytical databases are information databases, management databases or multi-dimensional databases. The data stored in an analytical database is used by the management for analysis purposes, hence the name. The data in an analytical database cannot be changed or manipulated.

 

Below you can find the various databases and DBMS products across the world which are currently operational, grouped by their features.

 

Fig 2.1 Different Database Names

 

Since the list above is far too large to cover the basic usage of each individual DBMS here, if you would like to go into detail about any database shown in the picture above, please consult the corresponding DBMS wiki.

 

By this point it should be clear that, although any OSD [Open Source Database] fits the definition of one of the DBMS types outlined above, there is one feature that sets OSD apart from the others:

“OSD is licensed software in which the source code is made available to users to enable them to modify it for their own purposes and (within certain restrictions) redistribute original and derived works as they see fit.”

 

- No one has exclusive control over the term “open source”
- Not an enforceable copyrighted term or trademark
- The Open Source Initiative (OSI), http://www.opensource.org, was founded in 1998 and has unofficial power over the core concepts

 

Open source databases are not limited to Linux. LAMP solutions are frequently deployed for database applications and systems that meet certain criteria. The acronym LAMP is commonly used as a shortcut for the most popular open source software stack: Linux, the Apache Web server, the MySQL DBMS, and the PHP/Python/Perl development languages. It is a collection of open source software that can be used to deploy applications at minimal cost, which is the intriguing part for most of its adopters.

3. Open Source DBMS VS Commercial DBMS

 

 

Open source database technology is in an awkward phase of its lifecycle today. Typically, open source DBMS products – MySQL, Firebird, PostgreSQL, and Berkeley DB – are simpler to use than enterprise DBMS products because they do not have all the bells and whistles of enterprise software. Over time, though, features are being added to the open source players. Triggers, stored procedures, integrity constraints, and so on cause the open source DBMS products to become more complex to use. But their simplicity and easy-to-use reputation is what endeared them to smaller organizations and quick-and-dirty applications. As the open source DBMS moves away from its initial sweet spot in terms of how and when it is implemented, growing pains could ensue.

 

Below are a few examples, taken from Forbes.com, showing various comparisons between open source and commercial databases. Though the figures presented below are the most up to date available, I would still recommend that you go through the other websites mentioned in the reference section and produce your own customised report before giving the green signal to any particular open source DBMS.

 

Fig 3.1 Number of Database Systems on World

 

Fig 3.2 Open Source Popularity

The below chart shows the historical trend of the popularity of open source and commercial database management systems.

Fig 3.3 Open Source Popularity Trend

Fig 3.4 Open Source Popularity Trend w.r.t Database Model

Fig 3.5 Open Source Popularity Trend w.r.t Database Model

Fig 3.6 Open Source Database Systems w.r.t Database Model

Fig 3.7 Open Source Database Ranking w.r.t Database Model

 

It has been cited that the rankings and popularity trends above for the various DBMS systems, including OSD software, do not depend on a single scale factor. While doing the market survey, what I found is that the actual score calculation for each DBMS is based on the following set of measures:

 

 

3.1 Method of Score Calculation:

- Number of mentions of the system on websites, measured as the number of results in search engine queries. At the moment, we use Google, Bing and Yandex for this measurement. In order to count only relevant results, we search for the system name together with the term database, e.g. “Oracle” and “database”.
- General interest in the system. For this measurement, we use the frequency of searches in Google Trends.
- Frequency of technical discussions about the system. We use the number of related questions and the number of interested users on the well-known IT-related Q&A sites Stack Overflow and DBA Stack Exchange.
- Number of job offers in which the system is mentioned. We use the number of offers on the leading job search engines Indeed and Simply Hired.
- Number of profiles in professional networks in which the system is mentioned. We use the internationally most popular professional networks LinkedIn and Upwork.
- Relevance in social networks. We count the number of Twitter tweets in which the system is mentioned.

 

Out of the 300 or so DBMS products available today, please find below the current rankings of DBMS, irrespective of open or closed source. The ranking scores have been calculated via the method outlined above.
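
As a hedged sketch of this kind of scoring, the fragment below normalises a few invented popularity signals and combines them with equal weights; the numbers and weights are illustrative only and are not the actual formula used by any ranking site:

    signals = {
        "PostgreSQL": {"search": 9.2e6, "trends": 62, "qa": 120000, "jobs": 8500},
        "MySQL":      {"search": 2.1e7, "trends": 80, "qa": 610000, "jobs": 14000},
    }

    def normalized(metric):
        # Scale one signal so the highest-scoring system gets 1.0.
        top = max(s[metric] for s in signals.values())
        return {name: s[metric] / top for name, s in signals.items()}

    metrics = ["search", "trends", "qa", "jobs"]
    scores = {name: 0.0 for name in signals}
    for m in metrics:
        for name, value in normalized(m).items():
            scores[name] += value / len(metrics)   # equal weight per signal

    for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(name, round(score, 2))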

 

Fig 3.8 Open Source Database Ranking w.r.t Database Model

 

4. FEATURETTE

 

As there are more than 300 DBMS products and around 150 open source databases currently on the market, I would like to highlight those that currently lead the market in the top positions. That is not to say the other OSD products are worthless or that they may not set the trend in the near future, but the OSD products outlined below are the ones that have not only held their own against closed source enterprise databases, but have also thrived over the years to reach where they are today.

 

4.1 PostgreSQL

4.2 MongoDB

4.3 Cassandra

4.4 Redis

4.5 SQLite

4.6 MySQL

4.7 FireBird

4.8 ElasticSearch

5. DATA MODELLING

 

 

Dr. Peter Chen’s original paper on the Entity-Relationship model (ER model) is one of the most cited papers in the computer software field. It was selected as one of the 38 most influential papers in Computer Science according to a survey of 1,000 computer science professors (edited by P. Laplante, West Publishing, 1996), and it was included among the important publications in the “Science Pearls” project of Wikipedia. Based on one particular citation database, Chen’s paper is the 35th most cited article in Computer Science. The original version of the Entity-Relationship model was published by Dr. Peter Chen in 1976, and the unified view of ER model characteristics presented in that first version is still in use by organizations. There are associations between entities, similar to the associations between classes – relationships, inheritance, composition, and aggregation are all applicable concepts in data modelling.

 

 

Fig 5.1: An entity–relationship diagram for an MMORPG using Chen’s notation

 

 

There are basically three styles of data model:

 

5.1 Conceptual data models: These models, sometimes called domain models, are typically used to explore domain concepts with project stakeholders. On Agile teams, high-level conceptual models are often created as part of your initial requirements envisioning efforts, as they are used to explore the high-level static business structures and concepts. On traditional teams, conceptual data models are often created as the precursor to LDMs or as alternatives to LDMs. The common characteristics of conceptual data models are:

 

- Enterprise-wide coverage of the business concepts. Think Customer, Product, Store, Location, Asset.
- Designed and developed primarily for a business audience.
- Contains around 20-50 entities (or concepts) with no, or an extremely limited number of, attributes described. Sometimes architects try to limit it to printing on one page.
- Contains relationships between entities, but may or may not include cardinality and nullability.
- Entities will have definitions.
- Designed and developed to be independent of DBMS, data storage locations or technologies. In fact, it would address digital and non-digital concepts. This means it would model paper records and artifacts as well as database artifacts.

5.2 Logical data models (LDMs): LDMs are used to explore the domain concepts, and their relationships, of your problem domain. This could be done for the scope of a single project or for your entire enterprise. LDMs depict the logical entity types, typically referred to simply as entity types, the data attributes describing those entities, and the relationships between the entities. LDMs are rarely used on Agile projects, although they often are on traditional projects (where they rarely seem to add much value in practice). Common characteristics of a logical data model:

- Typically describes data requirements for a single project or major subject area.
- May be integrated with other logical data models via a repository of shared entities.
- Typically contains 100-1000 entities, although these numbers are highly variable depending on the scope of the data model.
- Contains relationships between entities that address cardinality and nullability (optionality) of the relationships.
- Designed and developed to be independent of DBMS, data storage locations or technologies. In fact, it may address digital and non-digital concepts.
- Data attributes will typically have datatypes with precisions and lengths assigned.
- Data attributes will have nullability (optionality) assigned.
- Entities and attributes will have definitions.
- All kinds of other meta data may be included (retention rules, privacy indicators, volumetrics, data lineage, etc.). In fact, the diagram of a logical data model may show only a tiny percentage of the meta data contained within the model.

 

5.3 Physical data models (PDMs): PDMs are used to design the internal schema of a database, depicting the data tables, the data columns of those tables, and the relationships between the tables. PDMs often prove to be useful on both Agile and traditional projects, and as a result the focus of this article is on physical modelling. Common characteristics of a physical data model:

 

- Typically describes data requirements for a single project or application, sometimes even a portion of an application.
- May be integrated with other physical data models via a repository of shared entities.
- Typically contains 10-1000 tables, although these numbers are highly variable depending on the scope of the data model.
- Contains relationships between tables that address cardinality and nullability (optionality) of the relationships.
- Designed and developed to be dependent on a specific version of a DBMS, data storage location or technology.
- Columns will have datatypes with precisions and lengths assigned.
- Columns will have nullability (optionality) assigned.
- Tables and columns will have definitions.
- Will also include other physical objects such as views, primary key constraints, foreign key constraints, indexes, security roles, stored procedures, XML extensions, file stores, etc.
- The diagram of a physical data model may show only a tiny percentage of the meta data contained within the model.

 

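
As a minimal sketch of how the physical data model characteristics listed above (DBMS-specific datatypes, nullability, keys, indexes, views) end up as DDL, here is an invented example targeting SQLite, chosen purely for convenience:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE category (
            category_id INTEGER PRIMARY KEY,              -- primary key constraint
            name        VARCHAR(50) NOT NULL              -- datatype with length, NOT NULL
        );
        CREATE TABLE product (
            product_id  INTEGER PRIMARY KEY,
            category_id INTEGER NOT NULL
                        REFERENCES category(category_id), -- foreign key constraint
            name        VARCHAR(100) NOT NULL,
            list_price  NUMERIC(10,2)                     -- optional (nullable) column
        );
        CREATE INDEX idx_product_category ON product(category_id);  -- alternate access path
        CREATE VIEW product_catalogue AS                             -- view as a physical object
            SELECT c.name AS category, p.name AS product, p.list_price
            FROM category c JOIN product p ON p.category_id = c.category_id;
    """)
    print(conn.execute("SELECT type, name FROM sqlite_master ORDER BY type").fetchall())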

Data modelling is often the first step in database design and object-oriented programming, as the designers first create a conceptual model of how data items relate to each other. Data modelling involves a progression from conceptual model to logical model to physical schema. Put another way, data modelling is the act of exploring data-oriented structures. Like other modelling artifacts, data models can be used for a variety of purposes, from high-level conceptual models to physical data models. From the point of view of an object-oriented developer, data modelling is conceptually similar to class modelling: with data modelling you identify entity types, whereas with class modelling you identify classes, and data attributes are assigned to entity types just as you would assign attributes and operations to classes. If we consider how to build an effective data model starting from the conceptual design, the four deliverables below are used to do so:

 

- Glossary – for clarifying terminology and simplifying data modelling.
- Data Dictionary – to communicate data requirements in a well-organized way.
- Data Mapping – to resolve data issues for data migration or integration projects.
- Entity Relationship Diagram – to visualize relationships between key concepts.

6. GLOSSARY

 

A Glossary is a deliverable that documents terms that are unique to the business or technical domain. A glossary is used to ensure that all stakeholders (business and technical) understand what is meant by the terminology, acronyms, and phrases used inside an organization. While creating a glossary can take a bit of time and attention, doing so will generate many benefits, such as:

- Facilitate learning about a new business domain, by keeping terminology variations and acronyms straight.
- Clarify your requirements documents, by providing clear definitions of the important terms used inside them.
- Save time in requirements meetings, since stakeholders will be using a common language.
- Encourage more effective communication among stakeholders, by resolving terminology disagreements.
- Pave the way for data modelling and database designs that accurately reflect true business requirements.

 

The key elements of a glossary are:

- Terms – The unique words or short phrases that are part of business conversations. Typically terms are nouns – persons, places, or things.
- Definitions – Provide the exact meaning of a term in an unambiguous way and clarify the boundaries of when the term can appropriately be used in communication.
- Alias – A word, phrase, or acronym that is used interchangeably with the primary term in your glossary.
- Related Terms – References to separate terms in your glossary which are similar to, but not interchangeable with, your primary listed terms.

 

If we agree that the purpose of a glossary is to encourage consistent use of terminology by stakeholders and to clarify requirements, then we’ll quickly realize that the glossary is only the beginning. A glossary is your reference tool and a way to capture terms, definitions, and variations as they come up in your requirements meetings. With persistence and consistent use of glossary-based terminology, your stakeholders will start communicating more effectively, your requirements specifications will be better understood, and your data modelling will be easier. How, though, do you encourage the consistent use of terminology? Here are a few tips for doing so:

 

- Use the terms consistently – absolutely consistently – in your requirements specifications, as even small variations can cause confusion. (Out of the hundreds of requirements documents I’ve reviewed, inconsistent use of terminology is one of the most common mistakes I see.)
- During requirements meetings, clarify unfamiliar or new terms before moving on. Doing so will often save lots of time resolving other requirements-related conflicts between stakeholders, as these often boil down to varying definitions of terms.
- Encourage the use of business terminology in data models by bringing business terminology forward into your ERD (Entity Relationship Diagram) and Data Dictionaries. Often these documents use technical language that is confusing to business stakeholders, and such technical terms can be helpfully included as aliases in your glossary.

7. DATA DICTIONARY

 

 

A Data Dictionary, also called a Data Definition Matrix, provides detailed information about the business data, such as standard definitions of data elements, their meanings, and allowable values. While a conceptual or logical Entity Relationship Diagram will focus on the high-level business concepts, a Data Dictionary will provide more detail about each attribute of a business concept. Essentially, a data dictionary provides a tool that enables you to communicate business stakeholder requirements in such a way that your technical team can more easily design a relational database or data structure to meet those requirements. It helps avoid project mishaps such as requiring information in a field that a business stakeholder can’t reasonably be expected to provide, or expecting the wrong type of information in a field.

 

It is clear that any data dictionary would contain:

- The definitions of all database objects such as tables, views, constraints, indexes, clusters, synonyms, sequences, procedures, functions, packages, triggers, etc.
- Information about how much space is allocated for each object and how much space has been used by it.
- Any default values that a column can have.
- Database user names – schemas.
- Access rights for schemas on each of the objects.
- Last-updated and last-accessed information about each object.
- Any other database information.

 

There are two types of data dictionary: Active Data Dictionary and Passive Data Dictionary

- Active Data Dictionary: Any changes to database object structure via DDL have to be reflected in the data dictionary, and updating the data dictionary tables for those changes is the responsibility of the database in which the data dictionary exists. If the data dictionary is created in the same database, the DBMS software will automatically update it, so there will never be a mismatch between the actual structure and the data dictionary details. Such a data dictionary is called an active data dictionary.
- Passive Data Dictionary: In some databases, the data dictionary is created separately from the current database, as an entirely new database used only to store data dictionary information; sometimes it is stored as XML, Excel files or some other file format. In that case, an effort is required to keep the data dictionary in sync with the database objects, and there is a chance of mismatch between the two. This kind of data dictionary is called a passive data dictionary and has to be handled with utmost care.
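
A minimal sketch of an active data dictionary in practice: SQLite maintains its catalogue in the built-in sqlite_master table, so newly created objects can be queried from the dictionary immediately, with no manual upkeep (the table and index here are invented):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute("CREATE INDEX idx_employee_name ON employee(name)")

    # The dictionary reflects the new objects immediately, with no manual upkeep.
    for obj_type, obj_name, ddl in conn.execute("SELECT type, name, sql FROM sqlite_master"):
        print(obj_type, obj_name)
        print("   ", ddl)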

 

The data elements typically described in a data dictionary are:

Data Element Domain: The context within which the data element exists. For example, information about a participant (the domain) could include data element information about the participant’s address, phone number, title, and e-mail.
Data Element Number: A unique number for the data element, used in technical documents.
Data Element Name: A commonly agreed, unique data element name.
Data Element Field Name(s): Names used for this data element in computer programs and database schemas.
Data Element Definition: Description of the meaning of the data element.
Data Element Unit of Measure: Scientific or other unit of measure that applies to the data value.
Data Element Value: The reported value.
Data Element Precision: The level to which the data element value will be reported (e.g. miles to 2 decimal places).
Data Element Data Type: Data type (character, numeric, etc.), size and, if needed, any special representation that applies to the data element.
Data Element Size: The maximum field length as measured in characters and the number of decimal places that must be maintained in the database.
Data Element Field Constraints: Whether the data element is a required field (Y/N), a conditional field (C), or a null field. Required fields (Y) must be populated. Conditional fields (C) must also be populated when another related field is populated (e.g. if a city name is required, a ZipCode may also be required). “Not null” also describes fields that must contain data. “Null” means the data type is undefined (note: a null value is not the same as a blank or zero value).
Data Element Default Value: A value that is predetermined; it may be fixed or a variable, like the current date and time of day.
Data Element Edit Mask: An example of the actual data layout required (e.g. yyyy/mm/dd).

Data Element Business Rules (could include any of the material below):

Data element coding (allowed values) and intra-element validation details, or reference to other documents: Explanation of coding (code tables, etc.) and validation rules.
Related data elements: List of closely related data element names when the relationship is important.
Security classification of the data element: Organization-specific security classification level or possible restrictions on use.
Database table references: Reference to the tables where the element is used and the role of the element in each table, with an indication when the data element is a primary or secondary key for the table.
Definitions and references needed to understand the meaning of the data element: Short application domain definitions and references to other documents needed to understand the meaning and use of the data element.
Source of the data in the data element: Short description of where the data comes from, including rules used in calculations producing the data element value.
Validity dates for the data element definition: Start and possible end dates for when the data element is or was used. There may be several time periods when the data element has been used.
History references: Date when the data element was defined in its present form, references to superseded data elements, etc.
External references: References to books, other documents, laws, etc.
Version of the data element document: Version number or other indicator. This may include formal version control or configuration management references.
Date of the data element document: Written date of this version of the data element document.
Quality control references: Organization-specific quality control endorsements, dates, etc.

 

Fig 7.1: An Example of a Data Dictionary
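
As a hedged companion to the example above, one data dictionary entry could be captured in code; the field names follow the element descriptions listed earlier, and the values are invented:

    from dataclasses import dataclass

    @dataclass
    class DataElement:
        number: int          # Data Element Number
        name: str            # Data Element Name
        field_names: list    # names used in programs and database schemas
        definition: str      # meaning of the data element
        data_type: str       # e.g. character, numeric, date
        size: str            # maximum field length
        constraints: str     # required (Y), conditional (C) or nullable
        default_value: str = ""   # predetermined value, if any
        edit_mask: str = ""       # example layout, e.g. yyyy/mm/dd

    enrol_date = DataElement(
        number=104,
        name="Participant Enrolment Date",
        field_names=["ENROL_DT"],
        definition="Date on which the participant enrolled in the programme.",
        data_type="date",
        size="10 characters",
        constraints="Required (Y), not null",
        edit_mask="yyyy/mm/dd",
    )
    print(enrol_date.name, "->", enrol_date.edit_mask)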

 

An established data dictionary provides numerous benefits:

- Improved data quality: Labeling information consistently, with agreed-upon definitions for data elements and a common set of properties for each data element, makes systems and data analysis easier and business intelligence more effective because of access to high-quality data in the EDD.
- Easy access to trusted data: Business owners and developers get access to validated data, including approved definitions and properties supporting Federal Student Aid applications and systems, in one central location. As the EDD is available online, the information provided is always up to date and changes are immediately available to all users. The delay caused by distribution of paper releases is eliminated.
- Improved documentation and control: Managing and maintaining all data elements through the EDD ensures consistency and completeness of the data element descriptions.
- Reduced data redundancy: Describing data elements and using a defined set of properties for each data element reduces or eliminates the creation of redundant data elements. The EDD also allows controlling the addition of new data elements, thereby avoiding duplicates.
- Reuse of data: Creating the EDD promotes reuse of data and sharing of information across Federal Student Aid and the community of interest.
- Consistency in data use: Implementing consistent labelling and agreed-upon definitions for data elements across applications, as well as a defined set of data standards such as naming conventions, leads to consistency in data use.
- Easier data analysis: Business owners and users can use the EDD as a vehicle for robust query and report generation.
- Simpler programming: Using a common set of properties for each data element and consistent labeling of data elements ensures that business and programmer analysts can easily identify relevant data to support implementation of business requirements.
- Enforcement of standards: Implementing the EDD with its structure and required data properties establishes an agreed-upon standard that allows for monitoring, controlling, and enforcement of adherence to the standard.
- Better means of estimating the effect of change: The EDD will help to identify the impact of changes made in the dictionary on its relevant applications, and vice versa.

 

The following criteria need to be kept in mind when developing a DD:

- Consistency: Corporate data, repositories, etc. are only successful when they resonate with Federal Student Aid and are consistently accessed and maintained within an organization, especially because that data crosses organizational boundaries. A DD helps to maintain the consistency of corporate data across organizations.
- Clarity: A DD makes data clear and usable for the business user and the developer. It supports efficient and consistent use of the data by both the originators and the various users of the data, regardless of the divisional organization to which they belong. Often, non-standardized data is used because data elements are known within the originating organization without regard to other users outside that organization. The lack of clarity can cause an outside user to misunderstand the meaning, use, or domain of a data element and thus create an erroneous report affecting a management decision.
- Reusability: A DD supports consistent use, which is a key ingredient in the ability of one divisional organization to incorporate work that has already been designed, tested, and approved by the corporation for reuse into its own new development projects. Reinventing the wheel costs money and time. Reusability is enabled by applying standards to produce consistent parts for fitting into future work.
- Completeness: A DD helps analysts know when data is clear, complete, and defined by specifying what completeness means and the steps to develop a complete data structure. Incomplete data properties or descriptions tend to be improperly used and lead to misunderstanding of the data. They can also cost a developer extra time in making multiple phone calls to clarify and complete the information needed to use the data.
- Ease of Use for the Developer: A DD minimizes development time by providing clear and complete definitions/descriptions of the data elements that the programmer must use to implement the application functionality accurately.

 

The DD information can be classified as identification/description, configuration, properties, and association:

- Identification/Description: Contains the data element name and definition. This set of fields applies to all data elements (such as definition).
- Configuration: Contains essential data element configuration management information provided by the data architect’s office. This set of fields applies to all data elements (such as data steward, version, comments, and models).
- Properties: Contains attribute and column information (such as data source, data length, value, security, and privacy requirements).
- Association: Contains details of the attributes/columns across the logical and physical data models associated with the data element.

 

A uniform approach to data dictionary development avoids fragmentation. In an effort to promote and improve international communication between governments, businesses, and scientific communities, ISO and IEC have developed standards for the specification and standardization of data elements. The ISO/IEC 11179 standard consists of:

 

 

- A framework for the generation and standardization of data elements
- A classification of concepts for the identification of domains
- Basic attributes of data elements
- Rules and guidelines for the formulation of data definitions
- Naming and identification principles for data elements
- Registration of data elements

8. DATA MAPPING

 

A Data Mapping Specification is a special type of data dictionary that shows how data from one information system maps to data from another information system. Creating a data mapping specification helps you and your project team avoid numerous potential issues, the kind that tend to surface late in development or during user acceptance testing and throw off project schedules, not to mention irritating your stakeholders. You might be wondering if you need both a data dictionary and a data mapping, or you might want to make these two deliverables part of one simultaneous analysis activity. Data mapping is a special kind of data dictionary, so the two techniques are very closely related. In a certain set of circumstances, they can be done simultaneously as one stream of analysis activity. We’ll talk about when and how that works in a bit, but first let’s clarify the difference between the two deliverables and when to create each of them. In computing and data management, data mapping is the process of creating data element mappings between two distinct data models. Data mapping is used as a first step for a wide variety of data integration tasks, including data transformation or data mediation between a data source and a destination.

 

In order to create a data mapping specification, you need to understand the data dictionaries for each of your data sources. In fact, your data mapping spreadsheet will take key pieces of information about the attributes of both sources and present them side-by-side. This allows you to evaluate how the data will flow from one source to another and what translation rules are required. When an organization is migrating from a pre-existing system to a new Commercial-Off-The-Shelf (COTS) product, planning the data migration is an important aspect of the project to ensure that the right data ends up in the right place in the new system. Expertise from both the vendor-side and client-side is needed as part of the mapping, so typically there is a lot of collaboration. Here are some of the issues you can expect to work through:

- For mapped fields, you’ll be looking at whether they actually mean the same thing in both systems, or whether there is logic in either system that will impact how the data should be migrated over to the COTS product.
- You’ll also want to be sure that all of the client’s important data has a home in the target database, or you may need to customize the COTS product.
- On the flip side, you’ll want to evaluate whether the client’s data source has all the data needed to populate the new system in the format that’s required. Otherwise, a data clean-up project may be required prior to the data migration.
- As you work through these decisions, the client and vendor will need to decide where any clean-up and translation rules get implemented, whether in advance of the migration or as part of the data migration process.
- Finally, you’ll have decisions to make about what data to bring over. It’s not uncommon to archive older data and only bring over recent or active data, as this helps expedite the data migration process.

 

You’ll create a data mapping specification for the following types of projects:

- When source data is migrated to a new system as part of a data migration
- When source data is sent to a target data repository on a regular basis as part of a data or system integration

 

Data modelling can be a complex activity that requires a lot of intellectual activity for a business analyst and a lot of collaboration with business and technical stakeholders. One reason to keep these two deliverables separate is simply to break apart the analysis into interim steps and keep things moving smoothly. If you are working on a significant data migration project with hundreds of fields and you jump right into a data mapping exercise, you are likely to get stuck. And stuck = frustrated. One way to get unstuck is to back-pedal and create a separate data dictionary for each data source or even an ERD (entity relationship diagram) modelling out the whole project. When it comes down to it, as long as your data map helps you discover and resolve all of the data requirements, you’ve done what you need to do as a business analyst. If you are new to data modelling, data dictionaries and data maps will be easier to apply as discrete activities. And since they will not always be done together, you get to add two data modelling techniques to your BA toolbox.

 

That being said, here are some scenarios when it could work to do the exercises together.

- You are updating the data mapping specification for an existing data integration.
- You are creating a new data model by evaluating a pre-existing data model, and so the mapping exercise is a tool for requirements discovery.
- You are working on a data migration or integration that touches a relatively small number of attributes.

Fig 8.1 : Data Mapping Example
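
A hedged sketch of a data mapping specification in code: each source field is paired with its target field and an optional translation rule. All field names and rules below are invented for illustration:

    mapping_spec = [
        # (source field,  target field,           translation rule)
        ("CUST_NM",       "customer.full_name",   lambda v: v.strip().title()),
        ("CUST_STATE",    "customer.state_code",  lambda v: v.upper()[:2]),
        ("ACTIVE_FLAG",   "customer.is_active",   lambda v: v == "Y"),
    ]

    def migrate_row(source_row):
        # Apply every mapping rule to one legacy record and build the target record.
        return {target: rule(source_row[source]) for source, target, rule in mapping_spec}

    legacy = {"CUST_NM": "  acme ltd ", "CUST_STATE": "california", "ACTIVE_FLAG": "Y"}
    print(migrate_row(legacy))
    # {'customer.full_name': 'Acme Ltd', 'customer.state_code': 'CA', 'customer.is_active': True}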

 

9. ENTITY RELATIONSHIP DIAGRAM

 

 

Entity-Relationship (ER) modelling is based on the notion of real-world entities and the relationships among them. While formulating a real-world scenario into a database model, the ER model creates entity sets, relationship sets, general attributes and constraints. The ER data model is based on real-world objects and their relationships; in other words, everything in this world, living or non-living, can form an object in the database world. We identify all the objects required for our database and give them the shape of database objects. Before putting them into the database, it is essential to understand the requirement properly and design them efficiently: this is like the foundation of a building. For this purpose we use ER diagrams, in which we plan the database pictorially. An ER diagram basically breaks a requirement down into entities, attributes and relationships. Let us see them in detail.

 

An Entity Relationship (ER) Diagram is a type of flowchart that illustrates how “entities” such as people, objects or concepts relate to each other within a system. ER Diagrams are most often used to design or debug relational databases in the fields of software engineering, business information systems, education and research. Also known as ERDs or ER Models, they use a defined set of symbols such as rectangles, diamonds, ovals and connecting lines to depict the interconnectedness of entities, relationships and their attributes.

 

Entity Relationship diagrams are used in:

- Database design: ER diagrams are used to model and design relational databases, in terms of logic and business rules (in a logical data model) and in terms of the specific technology to be implemented (in a physical data model). In software engineering, an ER diagram is often an initial step in determining requirements for an information systems project. It’s also later used to model a particular database or databases. A relational database has an equivalent relational table and can potentially be expressed that way as needed.
- Database troubleshooting: ER diagrams are used to analyze existing databases to find and resolve problems in logic or deployment. Drawing the diagram should reveal where it’s going wrong.
- Business information systems: The diagrams are used to design or analyze relational databases used in business processes. Any business process that uses fielded data involving entities, actions and interplay can potentially benefit from a relational database. It can streamline processes, uncover information more easily and improve results.
- Business process re-engineering (BPR): ER diagrams help in analyzing databases used in business process re-engineering and in modelling a new database setup.
- Education: Databases are today’s method of storing relational information for educational purposes and later retrieval, so ER diagrams can be valuable in planning those data structures.
- Research: Since so much research focuses on structured data, ER diagrams can play a key role in setting up useful databases to analyze the data.

 

9.1.1 Entity: An entity can be any real-world object, either animate or inanimate, that is easily identifiable. For example, in a school database, students, teachers, classes, and courses offered can be considered entities. All these entities have some attributes or properties that give them their identity. An entity set is a collection of similar types of entities, and may contain entities whose attributes share similar values. For example, a Students set may contain all the students of a school; likewise, a Teachers set may contain all the teachers of a school from all faculties. Entity sets need not be disjoint.

 

9.1.2 Entity Type: A group of definable things, such as students or athletes, whereas the entity would be the specific student or athlete. Other examples: customers, cars or products.

 

9.1.3 Entity Set: Same as an entity type, but defined at a particular point in time, such as students enrolled in a class on the first day. Other examples: Customers who purchased last month, cars currently registered in Florida. A related term is instance, in which the specific person or car would be an instance of the entity set.

 

9.1.4 Entity Category: Entities are categorized as strong, weak or associative. A strong entity can be defined solely by its own attributes, while a weak entity cannot. An associative entity associates entities (or elements) within an entity set. Entities that have their own attribute as a primary key are called strong entities; for example, STUDENT has STUDENT_ID as its primary key, hence it is a strong entity. A weak entity, by contrast, cannot form a primary key from its own attributes alone; such entities derive their primary key from the combination of one of their attributes and the primary key of the entity they are mapped to.

 

9.1.5 Composite Entity: Entities participating in many-to-many relationships are called composite entities. In this case, apart from the two entities that are part of the relation, we will have one more hidden entity in the relation. We create a new entity for the relation, and create its primary key by combining the primary keys of the other two entities.

 

Fig 9.1.1 : An Example – Composite Entity

Consider the example of multiple students enrolled in multiple courses. In this case, we create the STUDENT and COURSE tables. Then we create one more table for the relation ‘Enrolment’ and name it STUD_COURSE. The primary keys of COURSE and STUDENT are added to it and together form the composite primary key of the new table.
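A minimal DDL sketch of this composite entity, assuming illustrative column names and types (STUDENT_ID and COURSE_ID as integers):

CREATE TABLE STUDENT (STUDENT_ID INT PRIMARY KEY, NAME VARCHAR(100));
CREATE TABLE COURSE  (COURSE_ID  INT PRIMARY KEY, TITLE VARCHAR(100));
-- Composite entity for the 'Enrolment' relation: its primary key is built
-- from the primary keys of the two participating entities.
CREATE TABLE STUD_COURSE (
  STUDENT_ID INT REFERENCES STUDENT(STUDENT_ID),
  COURSE_ID  INT REFERENCES COURSE(COURSE_ID),
  PRIMARY KEY (STUDENT_ID, COURSE_ID)
);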

 

9.1.6 Recursive Entity[+:+] If a relationship exists between an entity and itself, the entity is called a recursive entity. For example, the mapping between manager and employee is recursive: the manager is mapped to the same entity, Employee. The HOD of a department is another example of a recursive entity.

 

Fig 9.1.2 : Recursive Entity Example

 

9.1.7 Entity Keys: Refers to an attribute that uniquely defines an entity in an entity set. Entity keys can be super, candidate or primary. Super key: A set of attributes (one or more) that together define an entity in an entity set. Candidate key: A minimal super key, meaning it has the least possible number of attributes to still be a super key. An entity set may have more than one candidate key. Primary key: A candidate key chosen by the database designer to uniquely identify the entity set. Foreign key: Identifies the relationship between entities.

9.2 Attributes: An attribute can have a single value, multiple values or a range of values. In addition, each attribute can hold a certain type of data, such as only numeric values, only alphabets, a combination of both, dates, or negative or positive values. Depending on the values that an attribute can take, attributes are divided into the following types.

*
p<>{color:#000;}. Simple Attribute

These kinds of attributes have values which cannot be divided further.

*
p<>{color:#000;}. Composite Attribute

This kind of attribute can be divided further into more than one simple attribute. For example, the address of a person can be divided into Door#, street, city, state and PIN, which are simple attributes.

*
p<>{color:#000;}. Derived Attribute

Derived attributes are the ones whose values can be obtained from other attributes of entities in the database. For example, the age of a person can be obtained from the date of birth and the current date. Average salary, annual salary and total marks of a student are a few examples of derived attributes.

 

*
p<>{color:#000;}. Stored Attribute

The attribute which supplies the value used to compute a derived attribute is called a stored attribute. In the example above, age is derived using Date of Birth; hence Date of Birth is a stored attribute.

*
p<>{color:#000;}. Single Valued Attribute

These attributes have only one value. For example, EMPLOYEE_ID, passport#, driving license#, SSN etc. have only a single value for a person.

*
p<>{color:#000;}. Multi-Valued Attribute

These attributes can have more than one value at any point of time. A manager can have more than one employee working for him, and a person can have more than one email address or more than one house; these are examples of multi-valued attributes.

*
p<>{color:#000;}. Simple Single Valued Attribute

This is a combination of the above types: an attribute that has a single value at any point of time and cannot be divided further. For example, EMPLOYEE_ID is single valued and cannot be divided further.

*
p<>{color:#000;}. Simple Multi-Valued Attribute

The phone number of a person is an example of this attribute: it is simple, yet a person can have multiple phone numbers.

*
p<>{color:#000;}. Composite Single Valued Attribute

Date of Birth can be a composite single valued attribute. Any person can have only one DOB and it can be further divided into date, month and year attributes.

*
p<>{color:#000;}. Composite Multi-Valued Attribute

The address of a shop located at two different locations can be considered an example of this attribute.

*
p<>{color:#000;}. Descriptive Attribute

An attribute of a relationship is called a descriptive attribute. For example, an employee works for a department; here ‘works for’ is the relation between the employee and department entities. The relation ‘works for’ can have an attribute DATE_OF_JOIN, which is a descriptive attribute.
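A minimal sketch of how such a descriptive attribute can be stored, assuming illustrative EMPLOYEE and DEPT tables and column names:

CREATE TABLE EMPLOYEE (EMP_ID INT PRIMARY KEY, EMP_NAME VARCHAR(100));
CREATE TABLE DEPT     (DEPT_ID INT PRIMARY KEY, DEPT_NAME VARCHAR(100));
-- The 'works for' relationship carries its own descriptive attribute.
CREATE TABLE WORKS_FOR (
  EMP_ID       INT REFERENCES EMPLOYEE(EMP_ID),
  DEPT_ID      INT REFERENCES DEPT(DEPT_ID),
  DATE_OF_JOIN DATE,               -- descriptive attribute of the relationship itself
  PRIMARY KEY (EMP_ID, DEPT_ID)
);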

 

9.3 Relationship: How entities act upon each other or are associated with each other. Think of relationships as verbs. For example, the named student might register for a course. The two entities would be the student and the course, and the relationship depicted is the act of enrolling, connecting the two entities in that way. Relationships are typically shown as diamonds or labels directly on the connecting lines.

 

9.4 Relationship Set: A set of relationships of similar type is called a relationship set. Like entities, a relationship too can have attributes. These attributes are called descriptive attributes.

 

The number of participating entities in a relationship defines the degree of the relationship.

*
p<>{color:#000;}. Binary = degree 2

*
p<>{color:#000;}. Ternary = degree 3

*
p<>{color:#000;}. n-ary = degree n

 

9.5 Keys: Keys are the attributes of an entity that uniquely identify a record of the entity; for example, STUDENT_ID identifies individual students, as do passport#, license# etc. As we have already seen, there are different types of keys in a database.

*
p<>{color:#000;}. A Super Key is one or more attributes of the entity that together uniquely identify a record in the database.

*
p<>{color:#000;}. A Candidate Key is a minimal super key of the entity. For a person entity, SSN, passport# or license# can each be a candidate key.

*
p<>{color:#000;}. A Primary Key is the candidate key chosen to uniquely identify a record in queries. Though a person can be identified using SSN, passport# or license#, one of them is chosen as the primary key; the rest remain candidate (alternate) keys.

 

Fig 9.5.1 : Primary Key / Candidate Key

 

*
p<>{color:#000;}. A Foreign Key is an attribute of an entity that is the primary key of a related entity. A foreign key helps to establish the mapping between two or more entities.

 

Fig 9.5.2 : Foreign Key Example
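A minimal sketch of a primary key/foreign key pair in SQL, assuming illustrative STUDENT and MARKS tables:

CREATE TABLE STUDENT (
  STUDENT_ID INT PRIMARY KEY,      -- primary key of the parent entity
  NAME       VARCHAR(100)
);
CREATE TABLE MARKS (
  STUDENT_ID INT REFERENCES STUDENT(STUDENT_ID),  -- foreign key mapping MARKS to STUDENT
  SUBJECT    VARCHAR(50),
  MARK       INT
);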

9.6 Cardinalities: Cardinality defines the number of entities in one entity set, which can be associated with the number of entities of other set via relationship set.

*
p<>{color:#000;}. One-to-one − One entity from entity set A can be associated with at most one entity of entity set B and vice versa.

Fig 9.6.1 : One-To-One Relationship

*
p<>{color:#000;}. One-to-many − One entity from entity set A can be associated with more than one entity of entity set B; however, an entity from entity set B can be associated with at most one entity of set A.

 

Fig 9.6.2 : One-To-Many Relationship

*
p<>{color:#000;}. Many-to-one − More than one entity from entity set A can be associated with at most one entity of entity set B; however, an entity from entity set B can be associated with more than one entity from entity set A.

 

Fig 9.6.3 : Many-To-One Relationship

 

[+ Fig 9.6.4 :- Many-To-Many Relationship +]

 

*
p<>{color:#000;}. Many-to-many − One entity from A can be associated with more than one entity from B and vice versa.

9.7 Participation Constraints:

*
p<>{color:#000;}. Total Participation − Each entity is involved in the relationship. Total participation is represented by double lines.

*
p<>{color:#000;}. Partial participation − Not all entities are involved in the relationship. Partial participation is represented by single lines.

[+ Fig 9.7.1 :- Participation Relationship +]

 

 

 

 

Fig 9.7.2 :- Comparing the syntax of common data modeling notations

 

 

[+ Fig 9.7.3 :- Different Data Model Notation +]

 

 

A complete set of Dr. Peter Chen’s data modelling notations and guidelines, with examples, can be found at http://www.conceptdraw.com/examples/what-is-chen-notation

10.NORMALIZATION

 

 

Normalization is a set of rules/guidelines/techniques used while designing a database. These rules help to remove anomalies, distribute the data among different related tables, and query them efficiently and effectively. It removes duplication and incorrect-data issues, helping to produce a well-designed database. Normalization is divided into the following normal forms:

 

10.1 First Normal Form (1NF): A table is said to be in First Normal Form (1NF) if and only if each attribute of the relation is atomic. That is,

*
p<>{color:#000;}. Each row in a table should be identified by primary key (a unique column value or group of unique column values)

*
p<>{color:#000;}. No rows of data should have repeating groups of column values.

 

Let’s consider a STUDENT table with the student’s ID, name, address and the 2 subjects that he has opted for.

 

[+ Fig 10.1.1 :- Student Table +]

In order to have STUDENT in 1NF, we have to remove the multiple SUBJECT columns from the STUDENT table. Instead, we create only one SUBJECT column, and for each STUDENT we enter as many rows as the subjects he has. After making this change, the above table will change as follows:

 

[+ Fig 10.1.2 :- First Normal Form - Student Table +]

 

 

Now STUDENT_ID alone cannot be a primary key, because it does not uniquely determine each record in the table. If we want the records for Joseph and we query by his ID, 100, we get two records; hence STUDENT_ID on its own is no longer a primary key. When we observe the data in the table, all four fields together uniquely determine each record, so all four fields together are considered the primary key. Thus, the above table is in 1NF.
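A minimal sketch of this 1NF version of the table, assuming illustrative column types:

CREATE TABLE STUDENT (
  STUDENT_ID INT,
  NAME       VARCHAR(100),
  ADDRESS    VARCHAR(200),
  SUBJECT    VARCHAR(50),                            -- one row per subject instead of repeating columns
  PRIMARY KEY (STUDENT_ID, NAME, ADDRESS, SUBJECT)   -- all four fields together, as described above
);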

10.2 Second Normal Form (2NF): A relation is said to be in a second normal form if and only if,

*
p<>{color:#000;}. it’s in first normal form

*
p<>{color:#000;}. Every non-key attribute is fully dependent on the whole primary key (no partial dependency)

*
p<>{color:#000;}. Any subset of data that repeats across multiple rows of a table must be removed and placed in a new table, and this new table and the parent table should be related by means of a foreign key (a sketch of such a decomposition follows the figure below).

 

Fig 10.2.1 :- Second Normal Form – Student Table
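A minimal sketch of this 2NF decomposition, assuming the 1NF STUDENT table above and illustrative table names:

-- Student details depend only on STUDENT_ID, so they move to their own table.
CREATE TABLE STUDENT_DETAIL (
  STUDENT_ID INT PRIMARY KEY,
  NAME       VARCHAR(100),
  ADDRESS    VARCHAR(200)
);
-- The repeating subject data moves to a child table linked by a foreign key.
CREATE TABLE STUDENT_SUBJECT (
  STUDENT_ID INT REFERENCES STUDENT_DETAIL(STUDENT_ID),
  SUBJECT    VARCHAR(50),
  PRIMARY KEY (STUDENT_ID, SUBJECT)
);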

 

10.3 Third Normal Form (3NF): For a relation to be in third normal form:

*
p<>{color:#000;}. it should meet all the requirements of both 1NF and 2NF

*
p<>{color:#000;}. If there are any columns that are not directly dependent on the primary key, remove them and put them in a separate table, relating both tables by means of a foreign key; i.e., there should not be any transitive dependency (a sketch follows the figure below).

 

 

[+ Fig 10.3.1 :- Third Normal Form - Student Table +]
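A minimal sketch of removing a transitive dependency; the ZIP-determines-CITY case here is an illustrative assumption, not taken from the figure:

-- CITY depends on ZIP rather than directly on STUDENT_ID (a transitive dependency),
-- so ZIP and CITY move to their own table.
CREATE TABLE ZIP_CITY (
  ZIP  VARCHAR(10) PRIMARY KEY,
  CITY VARCHAR(100)
);
CREATE TABLE STUDENT_3NF (
  STUDENT_ID INT PRIMARY KEY,
  NAME       VARCHAR(100),
  ZIP        VARCHAR(10) REFERENCES ZIP_CITY(ZIP)
);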

 

10.4 Fourth Normal Form (4NF): In the fourth normal form,

*
p<>{color:#000;}. It should meet all the requirements of 3NF

*
p<>{color:#000;}. There should not be any multi-valued dependency, i.e., one attribute of a row should not independently determine multiple values of another attribute in the same table

 

To understand it clearly, consider a table with SUBJECT, the LECTURER who teaches each subject and the recommended BOOKS for each subject. If we observe the data in the table, it satisfies 3NF. But LECTURER and BOOKS are two independent facts here; there is no relationship between the lecturer and the books. In this example, either Alex or Bosco can teach Mathematics, and for the Mathematics subject a student can refer to either ‘Maths Book1’ or ‘Maths Book2’, i.e.:

[* SUBJECT → LECTURER *]

[* SUBJECT → BOOKS *]

[+ Fig 10.4.1 :- Before Fourth Normal Form – Course / Subject Table +]

 

 

This is a multi-valued dependency on SUBJECT. If we select both the lecturer and the books recommended for a subject, every (lecturer, book) combination shows up, which wrongly implies which lecturer recommends which book. To eliminate the dependency, the table is decomposed as shown below; the lecturer names and the recommended books for a subject are then fetched by two independent queries. This removes the multi-valued dependency and the confusion around the data, and the resulting tables are in 4NF.

 

 

[+ Fig 10.4.2 :- After Fourth Normal Form – Course / Subject Table +]

 

10.5 Fifth Normal Form (5NF): A database is said to be in 5NF, if and only if,

*
p<>{color:#000;}. It’s in 4NF

*
p<>{color:#000;}. If we can decompose the table further to eliminate redundancy and anomalies, then when we re-join the decomposed tables by means of candidate keys we should not lose the original data, and no new record set should arise. In simple words, joining two or more decomposed tables should neither lose records nor create new records.

Consider an example of different Subjects taught by different lecturers and the lecturers taking classes for different semesters.

Note: Please consider that Semester 1 has Mathematics, Physics and Chemistry and Semester 2 has only Mathematics in its academic year!!

 

[+ Fig 10.5.1 :- Before Fifth Normal Form – Course / Subject Table +]

 

 

Hence we have to decompose the table in such a way that it satisfies all the rules up to 4NF, and when the decomposed tables are joined by using keys they should yield the correct records. Here we can represent each lecturer’s subject area and their classes in a better way by dividing the above table into three – (SUBJECT, LECTURER), (LECTURER, CLASS) and (SUBJECT, CLASS). Joining the key columns of these tables reproduces the correct data, with nothing lost and no new records – satisfying the 5NF condition.

[+ Fig 10.5.2 :- After Fifth Normal Form – Course / Subject Table +]

 

10.6 Boyce Codd Normal Form (BCNF): This normal form is also referred to as the 3.5 normal form. This normal form:

*
p<>{color:#000;}. Meets all the requirements of 3NF

*
p<>{color:#000;}. Any table is said to be in BCNF if its candidate keys do not have any partial dependency on the other attributes; i.e., in a table with columns (x, y, z), if (x, y) → z and z → x then it is a violation of BCNF. If (x, y) is a composite key and (x, y) → z, then there should not be any reverse dependency, direct or partial.

 

The table below has all the anomalies too. If we delete any student from it, we delete the lecturer’s information too. If we add a new lecturer or student to the database, other related information is also needed. And if we update the subject for any student, his lecturer information also needs to be changed, else it will lead to inconsistency.

 

[+ Fig 10.6.1 :- Before Boyce Codd Normal Form – Lecturer / Student Table +]

 

 

 

Hence we need to decompose the table so that it eliminates such a relationship. In the new tables below there are no inter-dependent composite keys (in fact, there is no composite key in either table). If we need to add/update/delete any lecturer, we can work directly on the STUDENT_ADVISOR table without affecting the STUDENT_MAJOR table. If we need to insert/update/delete any subject for a student, we can do it directly on the STUDENT_MAJOR table without affecting the STUDENT_ADVISOR table. When we have both advisor and major-subject information, we can add/update both tables directly. Hence we have eliminated all the anomalies in the database.

 

[+ Fig 10.6.2 :- After Boyce Codd Normal Form – Lecturer / Student Table +]

 

10.7 Functional Dependency: Functional dependency (FD) is a set of constraints between two attributes in a relation. Functional dependency says that if two tuples have the same values for attributes A1, A2, …, An, then those two tuples must have the same values for attributes B1, B2, …, Bn. Functional dependency is represented by an arrow sign (→), that is, X → Y, where X functionally determines Y. The left-hand side attributes determine the values of the attributes on the right-hand side.
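As a practical check, a query like the following (a sketch, assuming an illustrative STUDENT table with STUDENT_ID and NAME columns) returns rows only where the dependency STUDENT_ID → NAME is violated:

-- If STUDENT_ID functionally determines NAME, no STUDENT_ID may map to two different names.
SELECT STUDENT_ID
FROM STUDENT
GROUP BY STUDENT_ID
HAVING COUNT(DISTINCT NAME) > 1;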

 

10.8 Armstrong’s Axioms: If F is a set of functional dependencies then the closure of F, denoted as F^+^, is the set of all functional dependencies logically implied by F. Armstrong’s Axioms are a set of rules that, when applied repeatedly, generate the closure of a set of functional dependencies.

 

*
p<>{color:#000;}. Reflexive rule − If alpha is a set of attributes and beta is a subset of alpha, then alpha → beta holds.

*
p<>{color:#000;}. Augmentation rule − If a → b holds and y is an attribute set, then ay → by also holds. That is, adding attributes to a dependency does not change the basic dependency.

*
p<>{color:#000;}. Transitivity rule − As with the transitive rule in algebra, if a → b holds and b → c holds, then a → c also holds. In a → b, a is said to functionally determine b.
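For example, if STUDENT_ID → DEPT_ID and DEPT_ID → DEPT_NAME hold (illustrative attributes), the transitivity rule gives STUDENT_ID → DEPT_NAME, and the augmentation rule then gives (STUDENT_ID, COURSE_ID) → (DEPT_NAME, COURSE_ID); repeatedly applying the rules in this way over F yields F^+^.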

 

10.9 Trivial Functional Dependency:

*
p<>{color:#000;}. Trivial − If a functional dependency (FD) X → Y holds, where Y is a subset of X, then it is called a trivial FD. Trivial FDs always hold.

*
p<>{color:#000;}. Non-trivial − If an FD X → Y holds, where Y is not a subset of X, then it is called a non-trivial FD.

*
p<>{color:#000;}. Completely non-trivial − If an FD X → Y holds, where X ∩ Y = Φ, it is said to be a completely non-trivial FD.

11. DE-NORMALIZATION

 

As a thumb rule of database design, we should have normalized data so that there is no redundancy. But when good design and performance conflict, performance takes priority: any user accessing the database is more interested in a quick and correct result than in how the database is designed. Hence, if we introduce redundancy into a table and the performance of the query increases, we can relax 3NF. This process is known as denormalization. In this process, normalized tables are re-joined so as to carry redundancy in the table and increase the performance of the query. Denormalization does not help in all cases; it all depends on the data. Hence this task is performed after the design and after the data is in the tables. In addition, it also depends on the redundant column which we are re-introducing into the table and on how frequently that column is updated. The basic criteria for denormalization would be:

 

*
p<>{color:#000;}. It should reduce the frequency of joins between the tables and hence make the query faster. If two or more tables are joined often to query the data and these joins are costly, we can combine them into one table. After combining, the table should still retain the correct data, and there should not be any unwanted or unnecessary duplicate records. In our example above, after denormalization of STUDENT and ADDRESS, the result should list all students with their correct addresses; it should not lead to wrong addresses for students.

*
p<>{color:#000;}. In most cases, when we join tables, a full table scan is performed to fetch the data. Hence, if the tables are huge, we can think of denormalization.

*
p<>{color:#000;}. The column should not be updated frequently. If the column is updated often, the cost of updates will increase even though the retrieval cost reduces; if it is updated less frequently, the database can bear the cost of the updates, otherwise the database will always be busy. In our case above, address is a less frequently updated field (the frequency of a student changing his house is comparatively low). Also, the column being re-introduced should be small; huge columns are again an overhead to the table and a cost to performance.

*
p<>{color:#000;}. The developer should have very good knowledge of the data when he denormalizes it. He should know the above factors very clearly: frequency of joins/access, updates, column and table sizes etc.

 

Denormalization is not only recombining the columns to have redundant data. Denormalization can be any technique with which performance of the normalized table can be increased.

 

A few de-normalization methods are discussed below:

 

11.1 Adding Redundant Columns: In this method, only the redundant column which is frequently used in the joins is added to the main table; the other table is retained as it is. For example, consider the EMPLOYEE and DEPT tables. Suppose we have to generate a report showing employee details along with the department name. Here we need to join EMPLOYEE with DEPT to get the department name.

 

 

Consider the example shown below:

 

[+ Fig 11.1.1 :- Denormalization Process – Before Addition +]

 

 

To avoid this join, consider the table re-configuration below. What we can do is add the redundant column DEPT_NAME to EMPLOYEE, so that queries avoid the join with DEPT, thus increasing performance.

 

[+ Fig 11.1.2 :- Denormalization Process – After Addition +]
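A minimal sketch of this re-configuration in SQL, assuming illustrative EMPLOYEE and DEPT tables joined on a DEPT_ID column:

-- Add the redundant column to the main table ...
ALTER TABLE EMPLOYEE ADD COLUMN DEPT_NAME VARCHAR(100);
-- ... and populate it from DEPT; reports can then read EMPLOYEE alone.
UPDATE EMPLOYEE e
SET DEPT_NAME = (SELECT d.DEPT_NAME FROM DEPT d WHERE d.DEPT_ID = e.DEPT_ID);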

 

 

11.2 Adding Derived Columns[+:+] Suppose we have a STUDENT table with student details like ID, name, address and course, and another table, MARKS, with the student’s internal marks in different subjects. There is a need to generate a report for each individual student showing his details, total marks and grade. In this case, we have to query the STUDENT table and then join the MARKS table to calculate the total of the marks in the different subjects. Based on the total, we also have to decide the grade in the select query, and then print it on the report.

 

SELECT std.STD_ID, std.NAME, std.ADDRESS, t.TOTAL,
CASE WHEN t.TOTAL >= 80 THEN 'A'
     WHEN t.TOTAL >= 60 AND t.TOTAL < 80 THEN 'B'   -- the lower grade bands and the MARKS column names are illustrative assumptions
     ELSE 'C' END AS GRADE
FROM STUDENT std
JOIN (SELECT STD_ID, SUM(MARK) AS TOTAL FROM MARKS GROUP BY STD_ID) t ON t.STD_ID = std.STD_ID;

 

The above query will run for each student record to calculate the total and grade. Imagine how many students exist and how many times this query will retrieve the data and do the calculation. Instead, what if we have the total and grade stored in the STUDENT table itself? It will reduce the join time and the calculation time. Once all the marks are inserted into the MARKS table, we can calculate the TOTAL and GRADE for each student and get the STUDENT table updated for these columns (we can have a trigger on MARKS that updates the STUDENT table once marks are inserted; a sketch follows below). Now if we have to generate the report, we simply fire a SELECT query on the STUDENT table and print it on the report.

 

 

SELECT std.STD_ID, std.NAME, std.ADDRESS, std.TOTAL, std.GRADE

FROM STUDENT std;

This makes the query simpler and faster.
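A minimal sketch of such a trigger, in PostgreSQL syntax and assuming illustrative MARKS(STD_ID, MARK) and STUDENT(TOTAL, GRADE) columns; other engines use a different trigger syntax:

CREATE OR REPLACE FUNCTION refresh_student_total() RETURNS trigger AS $$
BEGIN
  -- Recompute the derived columns for the student whose mark was just inserted.
  UPDATE STUDENT std
  SET TOTAL = t.TOTAL,
      GRADE = CASE WHEN t.TOTAL >= 80 THEN 'A'
                   WHEN t.TOTAL >= 60 THEN 'B'
                   ELSE 'C' END
  FROM (SELECT STD_ID, SUM(MARK) AS TOTAL
        FROM MARKS WHERE STD_ID = NEW.STD_ID GROUP BY STD_ID) t
  WHERE std.STD_ID = t.STD_ID;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_marks_total
AFTER INSERT ON MARKS
FOR EACH ROW EXECUTE FUNCTION refresh_student_total();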

 

11.3 Collapsing Tables[+:+] We have already discussed this method in the examples above. In this method, frequently joined tables are combined into one table to reduce the joins, which increases the performance of retrieval queries. Folding the columns into one table may introduce redundancy, but this is accepted as long as it does not affect the meaning of other records in the table. For example, after denormalization of STUDENT and ADDRESS, the combined table should have all the students with their correct addresses; it should not lead to wrong addresses for students. In addition to collapsing tables, we can duplicate or even split a table if that increases the performance of the query, although duplicating and splitting are not methods of de-normalization.

 

11.4 Snapshots: This is one of the earliest methods of creating data redundancy. In this method, the database tables are duplicated and stored on various database servers, and they are refreshed at specific time periods to maintain consistency among the database server tables. With this method, users located at different places are able to access the servers nearer to them and hence retrieve the data quickly; they need not access tables located at remote servers. This helps in faster access.

 

 

11.5 VARRAYS[+:+] In this method, tables are created as VARRAY tables, where repeating groups of columns are stored in a single table. The VARRAY method over-rules the 1NF condition: according to 1NF each column value should be atomic, but this method allows a repeating group to be stored against each record. Consider the example of STUDENT and MARKS, where the MARKS table has the marks of 3 subjects for each student. After applying 1NF, the MARKS table has the structure below.

 

[+ Fig 11.5.1 :- VARRAYS Process – After Addition of 1NF +]

 

Here, if we have to see the marks of a particular student, the MARKS table has to be accessed 3 times. But if we use a VARRAY, the table changes as shown below, and in a single traversal we can access all the marks of a student. This reduces the time consumed to retrieve the marks of each student.

 

 

[+ Fig 11.5.2 :- VARRAYS Process +] Post – Implementation
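A minimal Oracle-style sketch of this structure, assuming Oracle’s VARRAY collection type and illustrative names:

-- A collection type holding up to 3 marks per student (over-ruling 1NF on purpose).
CREATE TYPE MARKS_ARR AS VARRAY(3) OF NUMBER;
CREATE TABLE STUDENT_MARKS (
  STD_ID NUMBER PRIMARY KEY,
  MARKS  MARKS_ARR           -- all of a student's marks retrieved in one traversal
);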

 

 

11.6 Materialized Views[+:+] Materialized views are similar to tables in which all the columns and derived values are pre-calculated and stored. Hence if a query matches the query used to define the materialized view, it can be answered from the materialized view. Since this view already holds the joined columns and pre-calculated values, there is no need to calculate the values again, which reduces the time consumed by the query.

 

Consider the same example of calculating total and grade above.

SELECT std.STD_ID, std.NAME, std.ADDRESS, t.TOTAL,
CASE WHEN t.TOTAL >= 80 THEN 'A'
     WHEN t.TOTAL >= 60 AND t.TOTAL < 80 THEN 'B'   -- the lower grade bands and the MARKS column names are illustrative assumptions
     ELSE 'C' END AS GRADE
FROM STUDENT std
JOIN (SELECT STD_ID, SUM(MARK) AS TOTAL FROM MARKS GROUP BY STD_ID) t ON t.STD_ID = std.STD_ID;

 

What if we create a materialized view for the above query? It will benefit a lot. There is no need to update the STUDENT table with total and grade each time we insert marks; once all the marks are inserted, creating a materialized view stores all the data that is required for the report. Hence, when we have to generate the report, we query this materialized view just as we would query the STUDENT table.

The only problem with a materialized view is that, unlike other views, it does not get refreshed automatically when the table data changes. We have to refresh it explicitly to get the correct data in the materialized view.
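A minimal sketch in PostgreSQL syntax (Oracle’s syntax differs slightly), based on the illustrative query above:

CREATE MATERIALIZED VIEW STUDENT_REPORT AS
SELECT std.STD_ID, std.NAME, std.ADDRESS, t.TOTAL,
       CASE WHEN t.TOTAL >= 80 THEN 'A'
            WHEN t.TOTAL >= 60 THEN 'B'
            ELSE 'C' END AS GRADE
FROM STUDENT std
JOIN (SELECT STD_ID, SUM(MARK) AS TOTAL FROM MARKS GROUP BY STD_ID) t ON t.STD_ID = std.STD_ID;

-- Must be refreshed explicitly after the underlying data changes.
REFRESH MATERIALIZED VIEW STUDENT_REPORT;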

 

11.7 Advantages Of De-Normalization:

*
p<>{color:#000;}. Minimizes the table joins

*
p<>{color:#000;}. It reduces the number of foreign keys and indexes. This saves memory and reduces data manipulation time.

*
p<>{color:#000;}. If aggregated columns are used to denormalize, the computations are carried out at data manipulation time rather than at retrieval time. For example, if we use ‘total marks’ as the denormalized column, the total is calculated and updated when the related entries – say the student’s details and marks – are inserted. Hence, when we query the STUDENT table for details and marks, we need not calculate the total, which saves retrieval time.

*
p<>{color:#000;}. It reduces the number of tables in the database. As the number of tables increases, the mappings increase, the joins increase, the memory space increases and so on.

11.8 Disadvantages Of De-Normalization:

*
p<>{color:#000;}. Although it supports faster retrieval, it slows down data manipulation. If the column is frequently updated, the speed of updates suffers.

*
p<>{color:#000;}. If there is any change in the requirements, then we need to analyze the data and tables again to understand the performance impact. Hence de-normalization is specific to the requirement or application in use.

*
p<>{color:#000;}. The complexity of the code and the number of tables depend on the requirement/application; denormalization can increase or decrease the number of tables, and there is a chance that the code will get more complex because of the redundancy in the tables. Hence it needs a thorough analysis of the requirements, queries, data etc.

12. DATA INDEPENDENCE

 

 

A database system normally contains a lot of data in addition to users’ data. For example, it stores data about data, known as metadata, to locate and retrieve data easily. It is rather difficult to modify or update a set of metadata once it is stored in the database, yet as a DBMS expands it needs to change over time to satisfy the requirements of its users. If the entire data were inter-dependent, this would become a tedious and highly complex job. Metadata itself therefore follows a layered architecture, so that when we change data at one layer it does not affect the data at another level; the layers are independent but mapped to each other.

 

Logical data is data about the database, that is, information about how data is managed inside it – for example, a table (relation) stored in the database and all the constraints applied on that relation. Logical data independence is the mechanism that decouples this from the actual data stored on the disk: if we make changes to the table format, it should not change the data residing on the disk.

 

[+ Fig 12.1 :- Data Independence +]

 

All the schemas are logical, and the actual data is stored in bit format on the disk. Physical data independence is the power to change the physical data without impacting the schema or logical data. For example, in case we want to change or upgrade the storage system itself − suppose we want to replace hard-disks with SSD − it should not have any impact on the logical data or schemas.

13. CREATING LOGICAL DATA MODEL

*
p<{color:#000;}. Get Business requirements.

*
p<{color:#000;}. Analyze Business requirements.

*
p<{color:#000;}. Create High Level Conceptual Data Model. Get it approved.

*
p<{color:#000;}. Create a new Logical Data Model. Add the following to the logical data model.

*
p<{color:#000;}. Select target database where data modelling tool creates the scripts for physical schema.

*
p<{color:#000;}. Create standard abbreviation document for naming logical and physical objects according to business/data modelling standards.

*
p<{color:#000;}. Create domain.

*
p<{color:#000;}. Create rule.

*
p<{color:#000;}. Create default.

*
p<{color:#000;}. Create Entity and add definitions.

*
p<{color:#000;}. Create attribute and add definitions.

*
p<{color:#000;}. Assign datatype to attribute. If a domain is already present then the domain should be attached to the attribute.

*
p<{color:#000;}. Add check constraint/rule or default to the columns (wherever it is required).

*
p<{color:#000;}. Create primary or unique keys to attribute.

*
p<{color:#000;}. Create unique index or bitmap index to attribute.

*
p<{color:#000;}. Based on the analysis, create surrogate key columns.

*
p<{color:#000;}. If required, create super types and sub types.

*
p<{color:#000;}. Analyze the relationship between entities and create foreign key relationships (one to many or many to many) between those entities.

*
p<{color:#000;}. Create subject areas and add relevant entities to those subject areas.

*
p<{color:#000;}. Align the objects in the main subject area and other subject areas.

*
p<{color:#000;}. Validate the data model.

*
p<{color:#000;}. Generate reports from the data model.

*
p<{color:#000;}. Take a print out of the data model.

*
p<{color:#000;}. Get it approved.

14. CREATING PHYSICAL DATA MODEL

 

*
p<>{color:#000;}. Create Physical Data Model from the existing logical data model.

*
p<>{color:#000;}. Add database properties to physical data model.

*
p<>{color:#000;}. Generate SQL Scripts from Physical Data Model. Tick or check the necessary parameters in the tool, create scripts and then forward them to the DBA. (The DBA will execute those scripts in the database.)

*
p<>{color:#000;}. Compare database and data model. Make sure everything is okay.

*
p<>{color:#000;}. Create a change log document for differences between the current version and previous version of the data model.

 

[+ Fig 14.1 :- Conceptual Data Model +]

 

[+ Fig 14.2 :- Logical Data Model +]

 

[+ Fig 14.3 :- Physical Data Model +]

 

15. DATA MODELLING BEST PRACTICES

*
p<>{color:#000;}. Make sure that each entity only appears once per diagram.

*
p<>{color:#000;}. Name every entity, relationship, and attribute on your diagram.

*
p<>{color:#000;}. Examine relationships between entities closely. Are they necessary? Are there any relationships missing? Eliminate any redundant relationships. Don’t connect relationships to each other.

*
p<>{color:#000;}. Use colors to highlight important portions of your diagram

*
p<>{color:#000;}. Show the level of detail necessary for your purpose. You might want to draw a conceptual, logical or physical model, depending on the detail needed. (See above for descriptions of those levels.)

*
p<>{color:#000;}. Watch for redundant entities or relationships.

*
p<>{color:#000;}. If you’re troubleshooting a database problem, watch for holes in relationships or missing entities or attributes.

*
p<>{color:#000;}. Make sure all your entities and relationships are labeled.

*
p<>{color:#000;}. You can translate relational tables and ER diagrams back and forth, if that helps you achieve your goal.

*
p<>{color:#000;}. Make sure the ER diagram supports all the data you need to store.

*
p<>{color:#000;}. There may be different valid approaches to an ER diagram. As long as it provides the necessary information for its scope and purpose, it’s good.

*
p<>{color:#000;}. Use well defined and consistent names for tables and columns (e.g. School, StudentCourse, CourseID …).

*
p<>{color:#000;}. Use singular for table names (i.e. use StudentCourse instead of StudentCourses). A table represents a collection of entities, so there is no need for plural names.

*
p<>{color:#000;}. Don’t use spaces for table names. Otherwise you will have to use ‘{‘, ‘[‘, ‘“’ etc. characters to define tables (i.e. for accessing table Student Course you’ll write “Student Course”. StudentCourse is much better).

*
p<>{color:#000;}. Don’t use unnecessary prefixes or suffixes for table names (i.e. use School instead of TblSchool, SchoolTable etc.).

*
p<>{color:#000;}. Keep passwords as encrypted for security. Decrypt them in application when required.

*
p<>{color:#000;}. Use integer id fields for all tables. If id is not required for the time being, it may be required in the future (for association tables, indexing …).

*
p<>{color:#000;}. Choose columns with the integer data type (or its variants) for indexing. varchar column indexing will cause performance problems.

*
p<>{color:#000;}. Use bit fields for boolean values. Using integer or varchar is unnecessarily storage consuming. Also start those column names with “Is”.

*
p<>{color:#000;}. Provide authentication for database access. Don’t give admin role to each user.

*
p<>{color:#000;}. Avoid “select *” queries unless really needed. Use "select [required_columns_list]" for better performance.

*
p<>{color:#000;}. Use an ORM (object relational mapping) framework (i.e. hibernate, iBatis …) if application code is big enough. Performance issues of ORM frameworks can be handled by detailed configuration parameters.

*
p<>{color:#000;}. Partition big and unused/rarely used tables/table parts to different physical storages for better query performance.

*
p<>{color:#000;}. For big, sensitive and mission-critical database systems, use disaster recovery and security services like failover clustering, auto backups, replication etc.

*
p<>{color:#000;}. Use constraints (foreign key, check, not null …) for data integrity. Don’t give whole control to application code (see the sketch after this list).

*
p<>{color:#000;}. Lack of database documentation is evil. Document your database design with ER schemas and instructions. Also write comment lines for your triggers, stored procedures and other scripts.

*
p<>{color:#000;}. Use indexes for frequently used queries on big tables. Analyser tools can be used to determine where indexes will be defined. For queries retrieving a range of rows, clustered indexes are usually better. For point queries, non-clustered indexes are usually better.

*
p<>{color:#000;}. The database server and the web server must be placed on different machines. This provides more security (attackers can’t access data directly), and server CPU and memory performance will be better because of the reduced number of requests and processes.

*
p<>{color:#000;}. Image and blob data columns must not be defined in frequently queried tables because of performance issues. Such data must be placed in separate tables and their pointers used in the queried tables.

*
p<>{color:#000;}. Normalization must be used as required to optimize performance. Under-normalization will cause excessive repetition of data, while over-normalization will cause excessive joins across too many tables; both worsen performance.

*
p<>{color:#000;}. Spend as much time as required on database modelling and design. Otherwise the “saved” design time will cost 10/100/1000 times as much in maintenance and re-design time.
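A brief sketch pulling several of these practices together (singular table name, integer id, an “Is” boolean flag, and declarative constraints); the table and column names are illustrative, and the boolean/bit type depends on the engine:

CREATE TABLE School (
  SchoolId   INT PRIMARY KEY,                   -- integer id field
  SchoolName VARCHAR(100) NOT NULL,             -- not-null constraint for data integrity
  IsActive   BOOLEAN DEFAULT TRUE               -- boolean flag named with "Is"
);
CREATE TABLE StudentCourse (                    -- singular name, no prefix/suffix, no spaces
  StudentCourseId INT PRIMARY KEY,
  SchoolId        INT NOT NULL REFERENCES School(SchoolId),  -- foreign key constraint
  CourseId        INT NOT NULL,
  CHECK (CourseId > 0)                          -- check constraint
);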

16. DATA MODELLING VARIETY

 

Let’s go through the kinds of data models that exist in the real world and are used by organizations for building and running their applications.

 

16.1 Flat Data Model: The flat data model was the first model introduced; in it, all the data is kept in the same plane. Since it dates from the earliest systems, this model is not very rigorous.

 

[+ Fig 16.1 :- Flat Data Model +]

 

 

16.2 Entity Relationship Data Model: The entity relationship model is based on the notion of real-world entities and their relationships. While formulating a real-world scenario into the database model, an entity set is created, and this model depends on two vital things:

 

*
p<>{color:#000;}. Entity and their attributes

*
p<>{color:#000;}. Relationships among entities

 

 

[+ Fig 16.2 :- Entity Relationship Data Model +]

 

 

 

An entity has real-world properties called attributes, and an attribute is defined over a set of values called its domain. For example, in a university the student is an entity, the university is the database, and name, age and sex are attributes. The relationships among entities define the logical associations between them.

 

16.3 Relational Data Model: The relational model is the most popular and most extensively used model. In this model the data is stored in tables, and such a table is called a relation; relations can be normalized, and the values in a normalized relation are atomic values. Each row in a relation contains a unique record and is called a tuple; each column contains values from the same domain and is called an attribute.

 

[+ Fig 16.3 :- Relational Data Model +]

 

16.4 Network Data Model: The network model organizes entities in a graphical representation, and some entities in the graph can be accessed through several paths.

 

[+ Fig 16.4 :- Network Data Model +]

 

 

 

16.5 Hierarchical Data Model: The hierarchical model has parent entities with several child entities, but at the top there is only one entity, called the root. For example, department is the root parent entity and it has several child entities like students, professors and many more.

 

[+ Fig 16.5 :- Hierarchical Data Model +]

 

 

16.5 Object oriented Data Model: The object oriented data model is one of the more developed data models, and it can hold audio, video and graphic files. It consists of data pieces and the methods, which are the DBMS instructions.

 

 

Fig 16.5 :- Object Oriented Data Model

 

 

16.6 Record Base Data Model: The record-based model is used to specify the overall structure of the database; in it there are many record types, and each record type has a fixed number of fields of fixed length.

 

 

16.7 Object Relation Data Model: The object relation model is a very powerful model, but its design is quite complex. This complexity is not a problem, because it gives efficient results and is widespread in large applications. It has a feature which allows it to work with other models, such as the well-known relational model. Further guidance can be gathered via http://www.orm.net/

 

 

 

[+ Fig 16.6 :- Object Relation Data Model +]

 

 

 

 

16.8 Semi Structured Data Model: The semi structured data model is a self-describing data model: the information that is normally associated with a schema is contained within the data itself, and this is called the self-describing property.

 

16.9 Associative Data Model: The associative model has a division property: it divides the real-world things about which data is to be recorded into two sorts, namely entities and associations. Thus this model divides real-world data into entities and associations.

 

16.10 Context Data Model: The context data model is a flexible model because it is a collection of many data models, such as the object oriented data model, the network model and the semi structured model. Different types of work can therefore be done with it, thanks to its versatility.

 

 

 

[+ Fig 16.7 :- Context Data Model +]

 

It therefore supports different types of users, who differ in how they interact with the database. Data models in DBMSs have brought a revolutionary change to industries in the handling of relevant data: they are what help us create and use databases, and as we have seen there are different types of data models, so depending on the kind of structure needed we can select the appropriate data model in the DBMS.

[+ Fig 16.8 :- Entity-Relationship Data Modelling Methods +]

 

[++]

[++]

17. OPEN SOURCE DATABASE INFRASTRUCTURE & DIGITAL TRANSFORMATION

 

For years, database technology converged around a single model – the relational model. The trend was to gain insight by consolidating all your data into one system. Over the last 10 years, the Web has driven new models such as NoSQL, Big Data (HDFS), and Graph databases. These technologies arose in response to new application models as applications for the Web grew in scale and with higher volumes of data and more granular transactions. These transactions are more so events that need to be captured, but they have a less stringent requirement for consistency. These events, whether clicks on a website tracking what people look at or generated by a device, are not the same sort of transaction as a financial transaction, or one where inventory must be committed. Eventual consistency makes sense with event capture, and allows higher throughput. Databases like MongoDB and Cassandra are being used to good effect for this.

 

The data generated in event capture is often consolidated in the Hadoop Distributed File System (HDFS) and then analyzed through Map Reduce or Graph database approaches. This analysis is different from reporting on a relational database. It is less ad-hoc, and more a specific set of applications that are written to discover relationships and trends in the data. Big Data is just that, a mechanism for exploring very large sets of very granular data and discovering insights. Examples are trends in failure based on events sent from an instrumented device such as a turbine. In retail, one might discover that people have a preference for purple, and see a lot of events where people bought things that were purple, but looked at other things where no purple was offered, and did not proceed with a purchase.

 

The new data landscape is composed of event capture data systems; HDFS; and traditional relational ACID transactional databases for capturing and processing orders as well as other financial transactions, managing inventory, and dealing with payroll. This new landscape puts new demands on the operational database system. To bridge across to NoSQL content, many of these systems have added new data types such as JSON so that they can consume unstructured data and match it with structured data; systems that support this – and perhaps key-value/attribute data as well – are often called multi-model systems.
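As an illustrative sketch of the multi-model idea (PostgreSQL’s JSONB type is one example; the table, column and key names here are assumptions):

-- A relational table that also stores semi-structured event data as JSON.
CREATE TABLE CLICK_EVENT (
  EVENT_ID    BIGINT PRIMARY KEY,
  CUSTOMER_ID INT,
  PAYLOAD     JSONB               -- unstructured attributes captured from the web application
);
-- Matching unstructured content with structured data in one query.
SELECT CUSTOMER_ID, PAYLOAD->>'page' AS PAGE
FROM CLICK_EVENT
WHERE PAYLOAD->>'colour' = 'purple';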

 

The leading operational databases now support adaptors to allow including data from NoSQL and HDFS into transactions. This allows the operational database to establish relationships across the complete data landscape, or to query back into HDFS to gain in-context insights that have been generated there. Again, an enterprise must leverage its reach across its network of partnerships, not just its own. The operational database also supports replication to/from both like and unlike databases. One might want to replicate bill-of-materials data from an Oracle ERP system where the data is stored in the Oracle database, or replicate new customers captured from a website back into the ERP system. A viable data integration platform supports all this and more. This allows customers the freedom to leverage new more cost efficient open source-based solutions for newly developed applications that require the robustness and transactional capabilities of a relational database management system.

 

In fact, in an April 2015 Gartner report, the research firm forecast that the use of open source databases would exceed traditional solutions, noting: “By 2018, more than 70 percent of new in-house applications will be developed on an OSDBMS, and 50 percent of existing commercial [relational database management system] instances will have been converted or will be in process.”

 

Open source solutions offer greater flexibility in pricing models as well. In some cases, vendors offering open source databases price on a subscription-based model that eliminates the licensing fees common to large proprietary systems. An important element of a subscription is that it qualifies as an operating expense rather than a more complex capital expenditure. Thus, deploying open source and open source-based databases becomes a simpler process and can cost 80 to 90 percent less than traditional solutions, which allows agencies to refocus these resources on innovation and key organizational drivers. Open source database management systems have also matured to levels that Gartner says are on par with commercial offerings, making management of the system less cost-prohibitive and the necessary database administration skills more widely available. Hence, open source databases have a substantial advantage in total cost of ownership.

Data security has always been, and will continue to remain, a major priority for government agencies, given the sensitive and business-critical nature of the information they collect, and some IT departments may be skeptical of the security capabilities of open source solutions. Gartner’s 2014 Magic Quadrant for Operational Database Management Systems showed that open source database solutions are being used successfully in mission-critical applications in a large number of organizations. In addition, mature open source solutions today implement the same, if not better, security capabilities as traditional infrastructures, including SQL injection prevention, tools for replication and failover, server-side code protections, row-level security and enhanced auditing features, to name a few. Furthermore, as open source technology in general becomes more widely accepted across the public sector – intelligence, civilian and defence agencies across the federal government have adopted open source – database solutions are also growing to meet specific government mandates, regulations and requirements.

Government IT systems need to be flexible and, sometimes, mobile. In addition to being housed in a permanent data center facility, databases supporting applications may be deployed on tanks, Humvees, temporary office spaces, aircraft, ships and underground sites, among other potential locations. As a result, deployment needs to be quick and simple, and the physical equipment light and transportable. Open source solutions, which use lower levels of memory and RAM and take up less server space, provide an ideal option for these scenarios, whereas installation of traditional systems can often be a Herculean effort and take up a huge amount of disk space. Open source is a proven database technology that can handle the most demanding government missions, providing secure, nimble and responsive infrastructure at much lower cost while also fulfilling government mandates, including the October 2009 US Department of Defense memorandum. The trend towards open source gives agencies greater agility to respond to a changing environment, more flexibility to meet mission requirements and a nimble and responsive infrastructure, which is exactly what our government’s systems need. Modern open source database technologies (PostgreSQL, MySQL, Riak, MongoDB, Neo4J, Redis, Cassandra, HBase) are now technically competitive with proprietary technologies (ORACLE, SQLServer, DB2) at a fraction of the cost. BJSS’s portfolio of services for migration modernises applications and enables value-added services like data analytics, while minimising risk and reducing TCO.

Features

*
p<>{color:#000;}. Portfolio-level audit and commercial database replacement feasibility study.

*
p<>{color:#000;}. Business plan creation and TCO benefits assessment.

*
p<>{color:#000;}. Migration plan development.

*
p<>{color:#000;}. Application architecture review/documentation, code review and feasibility analysis.

*
p<>{color:#000;}. Independent technology selection advisory for SQL based and NoSQL alternatives.

*
p<>{color:#000;}. Proof-of-Concept / technical spike / de-risking execution

*
p<>{color:#000;}. Application reconfiguration and redevelopment.

*
p<>{color:#000;}. Data Migration, reconciliation and implementation management.

*
p<>{color:#000;}. Regression testing (functional and non-functional).

Benefits

*
p<>{color:#000;}. Reduced TCO (licensing, operating, support costs) for database solutions.

*
p<>{color:#000;}. Improved support experience due to reduced vendor lock-in.

*
p<>{color:#000;}. ‘Designed for Cloud’ and easy to run on commodity hardware.

*
p<>{color:#000;}. Advanced functionality (eg analytics) without additional complex license fees.

*
p<>{color:#000;}. Reduced (usually zero) license cost for development / test environments.

*
p<>{color:#000;}. Performance, Security, Stability and Efficiency improvements.

*
p<>{color:#000;}. Improved architectural/design documentation and maintainability.

*
p<>{color:#000;}. Low risk implementation due to BJSS’s mission critical systems experience.

*
p<>{color:#000;}. More flexibility in responding to user or system growth.

18. CONSOLIDATION

 

If I were asked what the difference is between virtue and vice – unless I were in a world of fantasy, or carried 500 or 1,000 years back from now by some kind of time machine, or standing before legislative bodies like human rights organisations or NGOs – it would be very difficult to prognosticate. Any hard work is a virtue, and anything which incurs a loss of standing in society is a curse. So if OSD software curbs a few anomalies, yet at the same time makes the different markets fertile, with adequate freedom to enhance the technology and digital sector, then for me it is bliss. Back in 2000, one of the judges of the Miss Universe contest asked the title winner of that year, Ms Lara Dutta, “If ignorance is bliss, why do we seek knowledge?” So if any of you has the answer to this question, then the future of products like OSS & OSD depends solely on your decision.


 

 

REFERENCES

 

Open Source Initiative

*
p<{color:#000;}. https://www.dwheeler.com/oss_fs_why.html#history

*
p<{color:#000;}. http://www.oreilly.com/openbook/opensources/book/index.html

*
p<{color:#000;}. http://info.enterprisedb.com/20150423-gartner-osdbms-report.html?src=2015gosdbms-edb-home-slide

*
p<{color:#000;}. http://dodcio.defense.gov/Open-Source-Software-FAQ/

Open Logistics Management Information Systems

*
p<{color:#000;}. http://openlmis.org/

Open Health Information Exchange

*
p<{color:#000;}. https://ohie.org/

Open Government Partnership

*
p<{color:#000;}. http://www.opengovpartnership.org/

International Aid Transparency Initiative

*
p<{color:#000;}. http://www.aidtransparency.net/

World Bank Open Data

*
p<{color:#000;}. http://data.worldbank.org/

Open Source Disaster Management Software

*
p<{color:#000;}. https://sahanafoundation.org/

Open Source Projects Hub

*
p<{color:#000;}. https://www.sep.com/labs/opensource/

*
p<{color:#000;}. https://medium.mybridge.co/22-amazing-open-source-react-projects-cb8230ec719f

*
p<{color:#000;}. https://www.linuxfoundation.org/open-source-professionals

*
p<{color:#000;}. http://oss-watch.ac.uk/resources/softwareexamples

*
p<{color:#000;}. https://code.nasa.gov/

Open Source Community

*
p<{color:#000;}. https://opensource.google.com/

*
p<{color:#000;}. https://www.blackducksoftware.com/about/open-source-community

*
p<{color:#000;}. https://opensource.com/tags/community-management

Open Source Events

*
p<{color:#000;}. https://opensource.com/resources/conferences-and-events-monthly

*
p<{color:#000;}. http://events.fossasia.org/

*
p<{color:#000;}. http://www.t-dose.org/

*
p<{color:#000;}. https://opensourceday.com/

*
p<{color:#000;}. http://lanyrd.com/topics/open-source/in/europe/

Open Source Videos

*
p<{color:#000;}. https://www.linux.com/news/event/open-source-leadership-summit/2017/2/video-linus-torvalds-how-build-successful-open-source-project

*
p<{color:#000;}. https://getpocket.com/explore/open%20source%20projects?src=related_top

*
p<{color:#000;}. http://www.bbc.co.uk/opensource/

*
p<{color:#000;}. https://thoughtbot.com/upcase/videos/contributing-to-open-source

*
p<{color:#000;}. https://www.blender.org/features/projects/

*
p<{color:#000;}. http://docs.brightcove.com/en/video-cloud/open-source/

Open Source Articles

*
p<{color:#000;}. https://www.elsevier.com/about/open-science/open-access/open-access-journals

*
p<{color:#000;}. http://www.sciencedirect.com/science/journal/20781520

*
p<{color:#000;}. http://www.cignex.com/news

*
p<{color:#000;}. https://www.acquia.com/resources/collateral/defense-innovation-digital-technologies-cloud-platforms-open-source-and-beyond

*
p<{color:#000;}. http://www.dxc.technology/innovation/insights/112737-dxc_technology_and_open_source

*
p<{color:#000;}. http://opensourceforu.com/2012/05/tutorvista-using-open-source-in-digital-education/

*
p<{color:#000;}. http://oregonstate.edu/psal/archie-development-and-implementation-open-source-archaeological-database-system-10

*
p<{color:#000;}. http://poseidon.csd.auth.gr/papers/PUBLISHED/JOURNAL/pdf/Pappas99a.pdf

*
p<{color:#000;}. http://info.enterprisedb.com/2016-Gartner_Newsletter.html

*
p<{color:#000;}. https://www.questia.com/library/journal/1G1-53281811/building-digital-library-applications-with-database

Open Source Tools

*
p<{color:#000;}. http://www.thesimpledollar.com/30-essential-pieces-of-free-and-open-software-for-windows/

*
p<{color:#000;}. http://entrepreneurhandbook.co.uk/open-source-software/

*
p<{color:#000;}. http://www.oracle.com/us/technologies/open-source/overview/index.html

*
p<{color:#000;}. http://database-management.softwareinsider.com/saved_search/Best-Open-Source-Database-Management-System

*
p<{color:#000;}. https://en.wikipedia.org/wiki/Database

*
p<{color:#000;}. http://opensourcewindows.org/

*
p<{color:#000;}. http://www.opensourcedigitalassetmanagement.org/

Data Modelling Blog

*
p<{color:#000;}. http://www.dataversity.net/category/data-topics/modeling/data-modeling-blogs/

*
p<{color:#000;}. http://msoo.pbworks.com/f/Scott+W.+Ambler+-+Agile+Modeling.pdf

*
p<{color:#000;}. http://www.datasciencecentral.com/profiles/blogs/top-6-data-modeling-tools

*
p<{color:#000;}. http://www.databaseanswers.org/data_models/industry_specific.htm

*
p<{color:#000;}. https://www.datavail.com/blog/data-modeling-concepts-every-dba-should-know/

*
p<{color:#000;}. https://www.datascience.com/blog/defining-business-use-cases-for-data-modeling

*
p<{color:#000;}. https://tdwi.org/articles/2011/06/22/agile-data-modeling.aspx

*
p<{color:#000;}. http://blog.rdx.com/who-will-win-the-database-wars-open-source-vs-commercial-database-systems/

Data Modelling Ethics

*
p<{color:#000;}. https://www.ncsu.edu/grad/preparing-future-leaders/docs/Ethical_Issues_in_Data_Management_Handout.pdf

*
p<{color:#000;}. http://www.datamodel.com/index.php/2014/04/21/big-challenges-in-data-modeling-ethics-data-modeling24-april/

*
p<{color:#000;}. http://tdan.com/stepping-up-to-data-ethics/20552

*
p<{color:#000;}. http://mominmalik.com/modeling_ethics.pdf

*
p<{color:#000;}. http://ieeexplore.ieee.org/document/6957277/?reload=true

Data Modelling Disaster Management

*
p<{color:#000;}. http://www.gdmc.nl/publications/2008/Spatiotemporal_Modeling_Disaster_Management.pdf

*
p<{color:#000;}. http://www.nat-hazards-earth-syst-sci.net/15/335/2015/nhess-15-335-2015.pdf

*
p<{color:#000;}. http://www.teriin.org/events/docs/smallpdf/Santo3.pdf

*
p<{color:#000;}. https://ocw.tudelft.nl/wp-content/uploads/3D_data_model_topological_analyses.pdf

 

 

ABOUT THE AUTHOR

 

Mr. Binayaka Mishra is an experienced IT professional in various tools and technologies like Data Warehousing, BigData Analytics, Cloud Computing, Reporting Analytics & Project Management documentation, with 10+ years of experience. He graduated in Computer Science & Engineering from the National Institute Of Science & Technology, Berhampur, Odisha, India in 2002. In the past he has worked in several critical roles with MNCs like Oracle Corporation, Wipro Technology, CapGemini India Pvt Ltd, UBS & AoN Hewitt Associates India Pvt Ltd.

 

 


Open Source Database : Virtue Or Vice?

In 1962, it was cited by Oxford University that the “System Development Corporation” of California, USA, was the first to use the term data-base in a technical sense. In the mid-1960s Charles William “Charlie” Bachman III, an American computer scientist and the author of IDS [Integrated Data Store], founded the Database Task Group within CODASYL, the body also responsible for the creation and standardization of COBOL. In 1971, CODASYL delivered its approach to the user database, which other market leaders were welcome to follow. IBM had its own database known as IMS, developed for the Apollo Program on System/360; it was always considered a hierarchical database due to its strictly hierarchical data modelling for searching any particular data set or group of data sets, whereas IDS was always considered a network database.

  • ISBN: 9781370277575
  • Author: Binayaka Mishra
  • Published: 2017-06-08 19:20:44
  • Words: 16586