Spring 2012 DISC 210 Week 3

= Our Flat File Twitter Dataset = Tweets to_user id     created_at      from_user       text None   159969042784993281      Thu, 19 Jan 2012 12:02:46 +0000 workhome4all    New post: what key do I push to insert symbols in Mircrosoft Excel? http://t.co/xXyIq3pt None   159953955965779968      Thu, 19 Jan 2012 11:02:49 +0000 workhome4all    New post: How do you get rid of the long black squiggle line on Mircrosoft Word 2010? http://t.co/2ulMOwaR dorkuhtron     159907193976668160      Thu, 19 Jan 2012 07:57:00 +0000 TAMMIJOGEE      @dorkuhtron and comp is short for computers lol :) i'm certified in mircrosoft :) lol None   159554252841304064      Wed, 18 Jan 2012 08:34:32 +0000 peter_coccia    i cant believe this worked Mircrosoft Word Magic http://t.co/8KHGSVoG None   159520281889742849      Wed, 18 Jan 2012 06:19:33 +0000 roberthollywood do this immediately. http://t.co/71HxPA8n None   159461572908027905      Wed, 18 Jan 2012 02:26:16 +0000 SPichardox0     RT @obeymermaids: mircrosoft, im pretty sure i know how to spell my own name #rude ...

Friends user friend_id CiiMy_Dimples  90084490 CiiMy_Dimples  335144661 CiiMy_Dimples  362719563 CiiMy_Dimples  337472773 CiiMy_Dimples  160631369 CiiMy_Dimples  293058824 CiiMy_Dimples  166033656 WildBoi_       390996426 WildBoi_       298166966 WildBoi_       308773826 WildBoi_       333759314 WildBoi_       69323836 ...

Followers user friend_id ...

User Information

...

= Data Modeling =

What is a model?
A simplified abstraction of a real world phenomenon that is used to promote informed decision making.

Modeling is the activity of abstracting information about a real-world phenomenon and developing it into a model that contains the essential features of that phenomenon with respect to requirements given to the modelers.

Are we modeling the behavior or the structure of a phenomenon?
If primary focus is on the role of data in representing structure, then we are performing data modeling.

For complex phenomena the volume of data involved will usually require a large, persistent data storage capability, which in turn will necessitate the use of a database management system. For this reason data modeling has become synonymous with database design.

Conceptual Data Modeling
These models capture the user’s perceptions of the data in a software system without any regard for how it might eventually be incorporated in a data storage system.

In this course we shall focus on semantic data models, examining in particular the extended entity relationship model (EER), but as it is represented in the Unified Modeling Language (UML).

Logical or Implementation Models
These models represent the organization of data in a way that will yield a direct implementation in a database management system, but avoid any details of how the data should actually be stored in the data storage system.

Implementation models include, relational, object-oriented model, the object-relational model, and the XML model.

Physical Data Models
These models describe the choices used for the storage of data on secondary storage devices and for accessing the data. in terms of clustering, partitioning, indexing, etc.

Entities
A conceptual data model attempts to represent a phenomenon as a collection of entities (also known as instances or objects). In any formal system there will always be terms that must be accepted as undefinable, and for us and conceptual data models entity will be such a term. Intuitively an entity is simply one of the “things” about an enterprise that one wants to represent. Some entities have tangible counterparts such as a person, a place, or a thing, but others may be conceptual such as a loan or a job assignment.

In conceptual data modeling it is more common to work with collections of similar entities rather than the entities themselves. Such a collection of similar entities is known as an entity set (or class).

Relationships and Roles
A relationship (or association) is an affiliation among two or more entities. For example in a data model for a university we may have the relationship that faculty member “Paul Anderson” (an entity of the FACULTY entity set) is a member of the “Computer Science Department” (an entity from the DEPARTMENT entity set).

As with entities, we usually collect similar relationship into a relationship set and use the relationship set to describe an affiliation between entity sets. Each relationship is then an element (or instance of) a relationship set. Thus, the relationship that faculty member “Paul Anderson” is a member of the “Computer Science Department” given above may be regarded as an instance of the relationship set “MemberOf” between the entity sets FACULTY and DEPARTMENT. It is common practice to use “entity” in place of “entity set” and “entity instance” instead of “entity.” Likewise “relationship” (or association) is used instead of “relationship set” and “instance of relationship/association” in place of “relationship.”  After a while we shall use this convention as well.

It is also common to sometimes further clarify the purpose of an entity set in a relationship by assigning it a role in the relationship. This is especially true in relationships involving an entity set and itself. For example given a relationship IsMarriedTo between the entity set PERSON and itself, we may assign one entity set the role HusbandOf and the other entity set the role WifeOf.

Attributes and Domains
Earlier we described an entity (set) as a collection of “similar” entity instances. What characterizes these similarities is a set of properties that are common to each entity instance, though in data modeling we use the term attribute rather than property In a conceptual model for a university, for example each instance of the FACULTY entity may share the attributes Name, Degree, Degree Area, and Rank with every other instance of FACULTY. What may be different about each entity instance is the value associated with each of these attributes. For example, one instance may have the values Paul Anderson, Ph.D., Computer Science and Engineering, and Assistant Professor for these attributes respectively, while another instance may be one with the attribute values Bill Manaris, Ph.D., Computer Science, and Professor.

The values assignable to attributes are drawn from a pool of candidate values known as the attribute’s domain. Unfortunately in data modeling there can a lot of confusion surrounding the notions of attributes, domains, and the additional concept of a data type. We will use a few examples to illustrate some of the sources of confusion


 * Consider two entities PERSON and DEPARTMENT. In associating attributes with these entities one would most likely associate a “name” attribute with each entity, though what each name attribute is actually called is usually up to the discretion of the modeler -- use “Name” for both, use Name for PERSON and DeptName for DEPARTMENT, use PersonName for PERSON and DeptName for DEPARTMENT, etc.
 * Suppose the data modeler decides to give both PERSON and DEPARTMENT an attribute called Name. How do we designate the domains of each attribute?
 * We could explicitly designate all the elements for each domain. This could work for domain of the Name attribute for DEPARTMENT since this set would probably be small, but would be impractical for the domain of the Name attribute of PERSON.
 * We could classify the structure of the values of each domain by its data type, but not enumerate each value. For example we might describe the domain of the Name attribute of PERSON as the set of strings of characters and we could also use this same description for the domain of the Name attribute of DEPARTMENT.  This would mean that we are willing to allow a person to have “Computer Science Department” for a name, or for a department to be called “George Pothering.”  More significantly, suppose we have an attribute E-Mail Address for PERSON and also describe its domain as the set of strings of alphabetic characters.  With this method of domain description we would be permitting more questionable equivalences.
 * We could specify domains with the names PersonNameDomain, DepartmentNameDomain, and EmailDomain and specify the data type of each as the set of strings of characters. We could then make the following associations:

Name attribute of PERSON --> PersonNameDomain

Name attribute of DEPARTMENT --> DepartmentNameDomain

Email Address attribute of PERSON --> EMailDomain

We conclude our current discussion of attributes by noting that while we have been concentrating on attributes of entities, it is also possible for relationships to have attributes. For example consider a data model for a university with entities STUDENT and COURSE and a relationship EnrolledIn between these entities. If a student can take a course more than once then we might use the term in which a given student enrolled in a course to distinguish between repeated enrollments. An attribute Term however is neither an attribute of STUDENT nor an attribute of COURSE, since it depends on both the student and the course involved. It is thus more properly seen as an attribute of the EnrolledIn relationship.

Identification and Primary Keys
In the physical world it is possible for two objects to exist and yet agree on all of their fundamental properties. Likewise, if an object changes a value of any its properties it does not become a new object. We shall leave it to the philosophers to address the nature of identity as it pertains to the physical world; when it comes to data modeling, however, one must be able to distinguish entity instances. This is usually accomplished via an entity’s attributes by one or more of the following: Each of the identifying attributes or groups of attributes established in steps a, b, or c above is known as a candidate key for the entity. One of these candidate keys is selected to serve as the primary key for the entity.
 * Search for an attribute for which it is known that no two entity instances will have the same value for this attribute.
 * Search also to see if a group of attributes can be used to identify (that is, distinguish) entity instances. This means that two entity instances may agree on some attribute values in this group, but not on all of them.  This group should also be minimal in the sense that if any attribute is dropped from the group then the remaining attributes are not sufficient to distinguish entity instances.
 * If no identifying attribute or group of attributes can be found, associate an artificial identifying attribute with the entity.

Additional Properties of Relationships

 * Degree: The degree of a relationship is the number of entities participating in the relationship. A relationship of degree 2 is known as a binary relationship and a relationship of degree 3 is known as a ternary relationship. Relationships of degree n > 3 are generally referred to as n-ary relationships.  If an entity participates more than once in a relationship the relationship is said to be a recursive relationship.  The most common recursive relationships are binary, meaning that an entity is being associated with itself.


 * Connectivity Constraints: The connectivity constraint of a relationship (or simply the connectivity of the relationship) specifies generally the number of relationship instances that instances of the related entities can participate in.


 * Connectivity Constraints for Binary Relationships: For binary relationships, connectivity constraints are one of three types:
 * 1) One-to-one (1:1):  A relationship R between E1 and E2 is one-to-one if an entity instance e1 &#8712; E1 can be related to at most one instance e2 &#8712; E2 and an entity instance e2 &#8712; E2 can be related to at most one instance e1 &#8712; E1.
 * 2) One-to-many (1:N):  A relationship R between E1 and E2 is one-to-many if an entity instance e1 &#8712; E1 can be related to several instances ei &#8712; E2,but an entity instance e2 &#8712; E2 can be related to at most one instance e1 &#8712; E1. Here the positioning of the entities E1 and E2 in the description of R is important.  We can also say that R is many-to-one between E2 and E1.
 * 3) Many-to-many (N:M):   A relationship R between E1 and E2 is many-to-many if an entity instance e1 &#8712; E1 can be related to several instances ei &#8712; E2,and if an entity instance e2 &#8712; E2 can be related to several instances ek &#8712; E2.


 * Connectivity Constraints for n-ary Relationships: Interpreting connectivity constraints for n-ary relationships is more complicated than in the binary case and is outside of the scope of this course.

Examples of Connectivity Constraints
Example 1: Consider a data model for a health club. Among the entities one might identify for this model is one for PATRON and one for LOCKER. The assignment of lockers to patrons can be represented by a relationship LockerAssignment. If we assume that no two patrons share a locker and that no patron will be assigned more than one locker, then LockerAssignment is a one-to-one relationship between PATRON and LOCKER.

Example 2: Consider a data model for a college that has entities DEPARTMENT and SCHOOL and a relationship BelongsTo between DEPARTMENT and SCHOOL to represent the organization of the college in terms of departments within schools. If we assume that no department can belong to more than one school, then BelongsTo is a one-to-many relationship between SCHOOL and DEPARTMENT, or a many-to-one relationship between DEPARTMENT and SCHOOL.

Example 3: Continuing with our data model for a college, consider the relationship Enrollment between the entities STUDENT and COURSE that represents the enrollment of students in courses. This is a many-to-many relationship since a student can take more than one course and a given course enrolls more than one student.

Example 4: In a data model an airline, one may have entity sets FLIGHT, PLANE, CREWMEMBER, and PASSENGER and a relationship FlightAssignment among the entities FLIGHT, PLANE and CREWMEMBER to represent plane and the individuals that were assigned to a given flight.

Multiplicities for Binary Relationships
The connectivity constraints for a binary relationship can be further refined by assigning multiplicities to the entities participating in the relationship. Multiplicities give an indication of how many entities from one entity set may be related to a given entity from the other entity set.

A common notation for conveying multiplicities is in the form lower-bound..upper-bound. An unlimited upper bound is represent by an asterisk *. The most commonly used lower limits are 0 or 1, while the most common upper limits are 1 and *. In a perplexing short-hand, simply indicating * by an entity set is often used to represent the multiplicity 0..*, while indicating 1 represents the multiplicity 1..1.

Example: Consider a relationship Advisor between entity sets STUDENT and FACULTY. A multiplicity of 0..* (or simply *) on STUDENT means that a given faculty member can advise an indefinite number (or simply “many”) students but does not have to advise any. A multiplicity of 1..1 (or simply 1) on FACULTY however means a given student must have an advisor, and can have only one advisor.

Representing Data Models
We shall simply introduce the most elementary symbols for what is known as the entity-relationship model. The entity-relationship (ER) model, initially proposed by Peter Chen in 1976 but has been modified and extended by others since then. Although it still remains important as a conceptual data model, it is being superseded by the class diagrams feature of the Unified Modeling Language (UML).

Alas, there is no standard set of symbols used in ER modeling. The ones we give here represent what is commonly found in data modeling tools and is close to what is done in UML. Each end of line may be labeled with the name of the role for its associated entity in the relationship. Multiplicities for a binary relationship are placed above or below the relationship line near the associated entity. Attributes for a relationship are placed in a rectangle near the center of the relationship line and this rectangle is connected to the relationship line with a (dashed) line.
 * In our E-R model an entity is represented by a rectangle and the name of the entity is placed inside the rectangle.
 * Entity attributes are placed inside the rectangle in an area directly below the entity name, with a line separating the entity name area from the attribute area. Sometimes the type of an attribute is indicated as well.
 * In the entity-relationship model all entities must have a primary key. The names of all attributes that participate in the primary key are underlined.
 * A binary relationship is represented by a line between the entities involved, with the name of the relationship appearing directly above or below the line.
 * N-ary relationships are represented similar to a binary relationship except a diamond is used to represent the relationship and ,lines emanate from the diamond to the entity participating in the relationship.

Example: We now develop an ER model for a simple data model of a company. This model will only use the concepts we have discussed up to this point. In developing our model we shall use the following approach:
 * 1) Identify potential entities and attributes - Where there’s a narrative describing the enterprise and the requirements for the data model, one can often find potential attributes and entities from the nouns and noun phrases in the narrative.
 * 2) Identify potential relationships - Where there’s a narrative describing the enterprise and the requirements for the data model, one can often find relationships from the verb phrases in the narrative.
 * 3) Determine entities, assign attributes, and define relationships.

The following suggestions may be helpful:
 * Find any synonyms and use one term or expression in place of all the synonyms.
 * Possessive forms and “has a” relationships often indicate where attributes for an entity may be prescribed. If the item possessed (or the object of the “has a”) is another entity, however, then a relationship is probably indicated.
 * Where you have several verb phrases suggesting the same relationship, determine if some of these verb phrases are identifying roles, especially where an entity is being related to itself.
 * Since the ER model requires that primary keys be specified, be alert for qualifiers such as “unique,” “exactly one,” etc. which, when associated with attributes, suggest potential candidate keys. If there are no obvious candidate keys for an entity, an artificial attribute may have to be assigned to serve as a primary key.
 * An attribute A that appears in several entities, E1, ..., En may be reclassified as an entity EA. This would also entail establishing relationships between each of E1, ..., En and the new entity EA.
 * If an entity has E1 has a single attribute A and is related to just one other entity E2, then E1 could be dropped and A made an attribute of E2.

Specify domains, including their data type and where feasible its range of possible values Tables are often used for this.

Example: The COMPANY data model keeps track of a company’s employees, departments, and projects. Suppose that as a result of a requirements analysis the following description of the aspects of the company to be represented in the data model results:
 * 1) The company is organized into departments and each department has a unique name and main office location.  The department has a particular employee who manages the department. We should keep track of the date when that person started managing the department.
 * 2) A department controls a number of projects, each of which has a unique number within the company.  Each project also has a description, employee who serves as its director, and budget.  The company also wants to keep track of all employees who are assigned to each project.
 * 3) We want to keep track of each employee’s name. social security number, address, salary, sex, birthdate, and starting date of employment with the company.  Each employee belongs to a department but may work on projects that are controlled by departments other than the one to which he or she is assigned.  We also want to keep track of the number of hours each employee has worked on each project.

Develop an ER model for this phenomenon.

Transformation of Conceptual Data Models in Relational Data Models
1. Entity sets/ classes: An entity set E with attributes A1, A2, ..., An can be represented as the relation

E(A1, A2,...,An)

Each tuple in the relation represents a instance of the class. When the relation is represented as a table, each attribute will appear as a column of the table and each tuple will be a row of the table. An attribute that is used for identification is underlined (if no such attribute has been designated among the attributes of E one can designate a new attribute for this purpose, such keys known as surrogate keys).

2. One-to-one relationship: Suppose the E1 and E2 are entity sets with have identifying attributes A1 and A2 respectively. If we assume there is a one-to-one relationship R between E1 and E2, the mapping to the relational model is straightforward. One merely appends the identifying attribute(s) from one entity set to the attributes of the other. In our case we could use the relation schemas

E1(A1, other attributes, A2) and E2(A2,other attributes)

or alternatively we could use the relation schemas

E1(A1,other attributes) and E2(A2, other attributes, A1)

3. One-to-many relationship. Suppose O and M are entity sets with have identifying attributes A1 and A2 respectively. If we assume the relationship between O and M is one-to-many then a mapping to the relational model is once again straightforward. Merely append the identifying attribute of the "one" entity set (Here O) to the attributes of the "many" entity set (here M). In our case we would have the relations

O(A1,other attributes) and M(A2, other attributes, A1)

4. Many-to-many relationship). Suppose M1 and M2 are entity sets with have identifying attributes A1 and A2 respectively.  If we assume the relationship between M1 and M2 is many-to-many, then to map this type of relationship to the relational model we create three relations M1, M2 and R. with the following structures

M1(A1, other attributes), M2(A2,other attributes), and R(A1,A2)

Note, if R had any attributes (for example X1, X2) these would appear as (non-key) attributes in the relation R: R(A1,A2, X1,X2)

Software that we'll be using
http://dev.mysql.com/downloads/workbench/

Tutorials and Additional Videos
http://dev.mysql.com/doc/workbench/en/wb-getting-started-tutorial-creating-a-model.html

http://www.youtube.com/watch?v=mVfQtp8Ve9I

http://rapidapplicationdevelopment.blogspot.com/2007/06/entity-relationship-diagram-example.html