Hajautettu tietokantamma koostuu kahdesta relaatiosta pisteissä S 1 ja S 2 Our distributed database constists of two relations at sites S 1 and S 2 - PDF

Description
(1) T Distributed Databases Kertausmateriaalia, ei kuitenkaan kata kaikkia aiheita laskuharjoituksista (15) Puoliliitos: Semijoins1 Hajautettu tietokantamma koostuu kahdesta relaatiosta

Please download to get full document.

View again

of 16
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
Information
Category:

Funny & Jokes

Publish on:

Views: 16 | Pages: 16

Extension: PDF | Download: 0

Share
Transcript
(1) T Distributed Databases Kertausmateriaalia, ei kuitenkaan kata kaikkia aiheita laskuharjoituksista (15) Puoliliitos: Semijoins1 Hajautettu tietokantamma koostuu kahdesta relaatiosta pisteissä S 1 ja S 2 Our distributed database constists of two relations at sites S 1 and S 2 At site S 1 : STUDENTS (StID, FirstName, LastName, Address, TutorDeptID, TutorID) With tuples. At site S 2 : TUTORS (TutorID, TutoringStartDate, IsFluentInEnglish, IsFluentInSwedish, ManagerID) With 100 tuples. Attribuuttien koot tavuina ovat: ID-päätteiset 5 tavua, kaikki muut attribuutit 25 tavua. The size of attributes StID, TutorDeptID, TutorID and ManagerId is 5 bytes, all other attributes are 25 bytes. Relaatiossa STUDENTS on riviä ja relaatiossa TUTORS on 100 riviä. Jokaisella opiskelija on yksi tutor. Each student has a single tutor. Halutaan suorittaa pisteessä S 3 seuraava kysely, jolla tulostetaan jokaisen tutorin nimi sekä tutorin aloituspvm: We need to run at site S 3 the following query, which prints out the name of each tutor along with his/her starting date: Select TU.TutorID, ST.FirstName, ST.LastName, TU.TutoringStartDate, From TUTORS TU, STUDENTS ST where TU.TutorID = ST.TutorID *Mikä on kyselyn hinta kun kysely suoritetaan pisteessä S 1 ilman puoliliitosta? *What is the cost of the query when carried out at site S 1 without semijoin? Vastaus/Answer: Lähetetään pisteeseen S 1 pienempi taulu TUTOR (suhteessa liitosattribuuttiin): 100 * (2*5 + 3*25) tavua= tavua. Send to site S 1 the smaller of the two tables (with respect to the joining attribute): (2) T Distributed Databases Kertausmateriaalia, ei kuitenkaan kata kaikkia aiheita laskuharjoituksista (15) *Mikä on kyselyn hinta kun kysely suoritetaan puoliliitoksella? (semi-join) Tulokset voidaan jättää pisteeseen S 1. *What is the cost of the query when it is performed using a semi-join? The answer set can be left at site S 1. (a) Lähetetään TutorID arvot pisteeseen S 2 : Send the values for TutorID to site S 2 : Q =Select TutorID From TUTORS = 500 tavua (b) Suoritetaan puoliliitos: Perform the semi-join Q = Select Q.TutorID, ST.FirstName, ST.LastName From Q, STUDENTS ST Where Q.TutorID = ST.TutorID tästä 100 riviä/rows * 55 tavua = tavua/bytes (c) Lähetetään Q pisteeseen S 1 (5 500 tavua) jossa suoritetaan lopullinen kysely: Send Q to site S 1 to perform the final query: Q = Select Q.TutorID, Q.FirstName, Q.LastName, TU.TutoringStartDate From TUTORS TU, Q Where Q.TutorID = TU.TutorID yhteensä siirtoon tarvitaan siis vain 6 ktavua. So a total of 6 Kbytes is needed for the transfer. *Selitä miten alla oleva yhtälö liittyy puoliliitokseen. Kumpi kahdesta tehtävän tauluista vastaa taulua R? (3) T Distributed Databases Kertausmateriaalia, ei kuitenkaan kata kaikkia aiheita laskuharjoituksista (15) Explain how the following equation is related to a semi-join. Which of the two relations in this problem s Sql-query corresponds to relation R? R S = R (S R) Vastaus: Answer Annetussa kaavassa S on suurempi taulu, ja kahden taulun välinen liitos lasketaan niin, että suurempaa taulua S redusoidaan ensin puoliliitoksella (S R), jolloin S-taulusta mukaan tulevat kaikki tarvittavat attribuutit mutta R-taulusta vain liitosattribuutti. (Puoliliitoksen määritelmähän on juuri se, että se ei ole symmetrinen vaan ainoastaan operaattorin vasemmalla olevasta taulusta S tulevat kaikki taulun attribuutit mukaan). The relation S in the equation is the larger of the two relations and the join is computed so that the larger relation S is reduced with a semi-join (S R), so that all required attributes are included from relation S but only the joining attribute is included from relation R. Näin saatu välitulos voidaan sitten liittää normaalisti pienempään tauluun R. The sub-result thus obtained can be normally joined to relation R. Silmukka on tarpeen juuri siksi, että pienemmästä R-taulusta puoliliitoksessa ei ole käytössä muita attribuutteja kuin itse liitosattribuutti, joten muut R-taulun attribuutit, joita tarvitaan lopullisessa liitoksessa joudutaan 'hakemaan' erikseen pisteestä, jossa sijaitsee R-taulu. A loop is necessary because from the smaller relation R only the joining attributes get included so the remaining attributes needed in the final query have to be retrieved separate from the site containing relation R. (4) T Distributed Databases Kertausmateriaalia, ei kuitenkaan kata kaikkia aiheita laskuharjoituksista (15) Distributed Transactions, Global Deadlocks *What is the purpose of the atomic commit protocol? The atomic commit protocol ensures global atomicity *Is the following schedule globally serializable (=globaalisesti sarjallistuva)? Local Schedule S 1 : R 1 [x] C 1 W 3 [x] W 3 [y] C 3 R 2 [y] C 2 Local Schedule S 2 : W 4 [z] R 1 [z] C 1 R 2 [w] C 2 W 4 [w] C 4 No, there will be a cycle in the global serialization graph: T1 T3 (since R 1 [x] W 3 [x] in S 1 ) T3 T2 (since W 3 [y] R 2 [y] C 2 in S 1 ) T2 T4 (since R 2 [w] W 4 [w] in S 2 ) T4 T1 (since W 4 [z] R 1 [z] in S 2 ) T1 T3 T2 T4 T1 *What is the purpose of the probe message?(tiedusteluviesti) The Probe message is sent to each process that is holding the needed resources. When received by a process, if the recipient is executing and is not blocked, it ignores the message. Otherwise, if the process is blocked, it updates the probe with its own id and passes it along to other processes which are blocked. If the probe message returns to its original sender, we know there is a cycle and thus a deadlock. (5) T Distributed Databases Kertausmateriaalia, ei kuitenkaan kata kaikkia aiheita laskuharjoituksista (15) *Contrast 0PC with 1PC? In 0PC there is no need for global atomicity: each sub-transactions either commits or aborts as soon as possible and just informs the Coordinator about it. In 1PC, the goal is to achieve global atomicity, but it cannot be 100% guaranteed : some Participants that receive the Commit message from the Coordinator may not be able any longer to commit: they need to keep their locks for a much longer period than in 2PC since there is no separate voting and decision phase. (6) T Distributed Databases Kertausmateriaalia, ei kuitenkaan kata kaikkia aiheita laskuharjoituksista (15) Kaksivaiheinen sitouttamisprotokolla 2PC/Presumed Abort, 2PC-Presumed Abort variaatio *Describe the 2PC without mentioning the logs? In 2PC, we have a voting and a decision phase (äänestysvaihe ja päätösvaihe). In the first phase, the Coordinator sends a prepare message to each of the participants and waits for a response from the participants. In the Decision phase, the Coordinator sends either a commit message to all participants or an Abort messages to those participants who were ready to commit their subtransaction. The Coordinator must decide to abort the transaction if even just one participant voted to abort its sub-transaction. The Coordinator waits for an acknowledgment from each Participant only if the decision was to Commit, this way it can clear the transaction from its transaction table. *For what reasons can a Participant P 1 fail in 2PC? The Participant P 1 can fail for different reasons such as a deadlock, failure in network, an integrity violation (inserting a tuple with non-unique values into primary key) or a general system crash. When it recovers, the participant has the right to abort unilaterally (siis muilta kysymättä) only if it is sure that it has not yet voted to commit the transaction. The participant uses its forced (levylle pakotettu) Prepared log record (valmiuskirjaus) to check if it has voted (T 1, ready). If the Participant finds T 1, L, P then it must wait for the decision from the Coordinator. If the Participant does not find such an entry in its log, either the Participant has not voted yet, or it has voted to abort its sub-transaction. In either case, the Participant can safely abort its sub-transaction. *When is the Timeout Protocol (Aikakatkaisukäytäntö) (= Termination Protocol) activated? It is invoked whenever the Coordinator or Participant fails to receive an expected message and times out. *What does the Coordinator do if there is a Timeout after a decision has been made? If the decision was to abort, the Coordinator does not do anything. (If a Participant asks it for the decision, the Coordinator will have already cleared the transaction from its transaction table and so it finds nothing. It therefore presumes that the transaction has aborted.) If the decision was to Commit, the Coordinator will find that transaction in its transaction table and then send the decision again to all Participants that have not yet acknowledged with done (T 1 ). (7) T Distributed Databases Kertausmateriaalia, ei kuitenkaan kata kaikkia aiheita laskuharjoituksista (15) * When using the two-phase Commitment Protocol, what should the Coordinator do if it gets no response to its message PREPARE(T) from a certain Participant P? Perhaps there was a net failure, or the Participant crashed. The Coordinator keeps sending PREPARE for a while. * What should the Coordinator do if it still can t get response? It should abort, and send the ABORT(T,L) message to any participant which sent vote(t x, ready). * What would happen if the Coordinator should crash after it has sent the (T, abort) message? If the Coordinator crashes, when it recovers, it might not find anything in its log. If it does find something in the log, it will be the information that it aborted the transaction. If the Coordinator does not find anything in its log, then it will have crashed before it forced the log to disk. Because we are using the 2PC with Presumed Abort, the Coordinator knows that the transaction *cannot have* committed. So in any case, when it recovers, it could send to all Participants the msg (T, abort) or wait until the Participants inquire about the final decision before sending (T,abort). * What would happen if i) The Coordinator ii) The Participant should crash after the Coordinator has sent the Commit(T) message? If the Coordinator crashes, when it recovers, it must find a log saying the decision was to commit, so it just resends commit(t) to all the participants (or wait for a participant to enquire about the decision and then resend commit(t)). If a Participant crashes after voting (T,ready), when it recovers, it should contact the Coordinator to find out the status, because in its log it will only find the information that the Participant had voted to commit. (8) T Distributed Databases Kertausmateriaalia, ei kuitenkaan kata kaikkia aiheita laskuharjoituksista (15) Toisintaminen REPLICATION Eager = Innokas toisintaminen Mark the following as True or False. 1. If there are no updates to the data item, there can be no consistency problems: 2. As a general rule, it s good to try to keep a replica geographically close to its clients. 3. With Eager (synchronous) replication, reading the local copy might not give the most up-to-date value. 4. Eager (synchronous) replication is often slow since all copies have to be updated in a write operation. 5. With Eager (synchronous) replication if a node/site fails, the update operation cannot generally be completed. 6. With Lazy (asynchronous) replication, reading the local copy gives the most up-to-date value. 7. With Lazy (asynchronous) Primary Copy replication, the primary (master) copy must be kept consistent at all times and so it easily becomes a bottleneck. 8. With Lazy (asynchronous) replication, performance is generally good since no distributed transactions are needed. 9. With Update Everywhere, there is no master node so all nodes in the system are considered equal. 10. Update Everywhere can be used together with 2PC to make sure that all replica sites have committed the updates. 11. ROWA (Read-One-Write-All) is an example of eager replication. 12. With quorum-based protocols, a read-operation requires reading any single site in the Read Quorum. Only the following are False (the rest are True) (3) With Eager (synchronous) replication, reading the local copy might not give the most up-to-date Value. False as all the copies are kept up-to-date in Eager replication. (6) With Lazy (asynchronous) replication, reading the local copy gives the most up-to-date value. False. Changes propagate to other sites *eventually* outside the transaction bound. (12) With quorum-based protocols, a read-operation requires reading any single site in the Read Quorum. False A read operation requires reading a certain number R of sites, known as the Read quorum, that is, all the sites in the Read quorum must be read so that through version numbering, the most-up-to-date can be selected. (9) T Distributed Databases Kertausmateriaalia, ei kuitenkaan kata kaikkia aiheita laskuharjoituksista (15) * If n=3 and p=2, how many Read Quorums are possible in the Quorum Consensus protocol (Sovittu päätösvaltakäytäntö)? Show graphically how many site can fail if q=p. The quorum of sites that can participate in a read quorum R with p =2 is {S 1, S 2 };{S 1, S 3 };{S 2, S 3 }. so a total of three Read Quorums are possible. That is, when committing a transaction, any one of three quorums will work. Both the read quorum (R) and the write quorum (W) are selecting the same sites in which case we have one extra site, S 3 which could fail, so robustness here =1. Discuss the advantage and disadvantage of ROWA. ROWA (Read-One Write-All) will translate a read operation to a read on any replica site (usually the closest site) and each write will be mapped to a write operation on all replica sites. (+) A Read-only transaction can be executed at the local site without any extra communication. (+) Very good for a situation where sites only need to read data. (-) If even just one of the replica site fails, a transaction that needs to update cannot complete. (10) T Distributed Databases Kertausmateriaalia, ei kuitenkaan kata kaikkia aiheita laskuharjoituksista (15) * Caching and replication are related in the sense that both techniques generate redundant copies of data items in order to improve system scalability, availability, and performance. What are the basic differences between caching and replication (in terms of deciding what data are copied and how consistency of the copies is maintained)? Caching is the replication and storage of data at the client or a nearby proxy following a request for data made to a remote server by the client. The copies are created and managed by the client or proxy (based perhaps on the access patterns of the client), and are not controlled in any way by the server that owns and provides the data. Consistency of caches can be carried out in several ways. Timers can be associated with cached copies, for example. When the timer expires and the data is needed, the providing server is again contacted the and the cache updated. In addition, spatially clustered proxies and/or clients can synchronize with each other to propagate updates (to further minimize client/server interaction). Replication is controlled and carried out proactively by a group of servers, each of which maintains copies of the same data. While the replication scheme can be affected by the localities and access patterns of requesting clients, the servers themselves govern and perform the replication, and furthermore, storage is always at a server, not at a proxy or client. Consistency among servers which maintain replicated data can be done through the use of server-based, client-based, epidemic, or other protocols, all of which involve synchronization between servers. (11) T Distributed Databases Kertausmateriaalia, ei kuitenkaan kata kaikkia aiheita laskuharjoituksista (15) Parallel DBs :Parallel Processing of Relational Operations *What is partitioning and what are its advantages? Partitioning is the process of splitting large relations (tables) into smaller ones which mean that DBMS does not need to retrieve as much data at any one time. There are two ways to partition a relation: horizontal and vertical. Advantages that partitioning brings are the following: (i) increase performance of the workload (reduce cost of accessing and processing data) ii) it allows parallel processing of data by locating tuples where they are most frequently accessed. *What is horizontal partitioning? Horizontal partitioning involves fragmenting a relation according to its tuples (rows) so that each partition is a fragment of the original relation, that is, it has the same data structure. The tuples of a relation are divided among many disks such that each tuple resides on a specific disk. The exact disk is determined according to the partitioning technique or Partition function: round-robin, hash partitioning, range partitioning. *What is vertical partitioning? Vertical partitioning involves splitting the attributes (columns) of a relation, placing them into two or more relations linked by the relation's primary key. * What is Fragmentation? A relation may be divided into a number of sub-relations, which are distributed across the disks. Our discussion of parallel algorithms assumes shared-nothing architecture (i.e. no shared memory or disk n processors, P 1,..., P n-1, and n disks D1,..., Dn-1, where disk D i is associated with processor P i If a processor has many disks, assume that it simulates a single disk Di. Shared-nothing architectures can be efficiently simulated on shared memory and shared-disk systems. Algorithms for shared-nothing systems can thus be run on systems using shared memory and shared-disk. (12) T Distributed Databases Kertausmateriaalia, ei kuitenkaan kata kaikkia aiheita laskuharjoituksista (15) *Define Partition skew /? the workload is not of equal size on each processor. *What are two objectives for Parallel databases? (1) They should be faster than a standard, serial system. (2) Ideally, the scaleup and speedup should be linear or nearly so. *Fill in the following table Suitability for Round-robin partitioning Sequential Scan Point Query Range Query Hash partitioning Range partitioning Suitability for Round-robin partitioning Hash partitioning Range partitioning Sequential Scan Point Query (CustomerID = 5077) Range Query (CustomerID 5077) Very good (all disks have almost equal nr of tuples) Not suitable Not suitable (tuples in random different disks) Good Good (assuming hash key) Not suitable (no clustering) Good Good (assuming attribute in query is range partitioned )need to access only 1 disk) Good (assuming query range is the one which is range partitioned ) *What could be done about attribute value skew? Nothing, because attribute value skew is intrinsic ( sisäänrakennettu ) or already inherent in the data. For instance, the data may contain more instances of the name Smith than the name Montgomery Contrast this with partition skew (Ositusvinouma). *Define intra-operator parallelism? Intra-operator is also known as partitioned parallelism, it is a case of intraquery parallelism. (The other case of intraquery parallelism is interoperation parallelism which in turn can be broken down into independent parallelism and pipeline parallelism). The nodes/processors are working in parallel to process different fragments of the data for the *same* operator. An example would be parallel sorting or parallel join. (13) T Distributed Databases Kertausmateriaalia, ei kuitenkaan kata kaikkia aiheita laskuharjoituksista (15) *A simple application of partitioning is parallel join or partitioned join. How does it work? The two input relations are partitioned across the processors, and then at each processor, we compute the join locally and finally combine these joins together to get the final result. This works for for equi-joins and natural joins. The partitioning can be done using either range-partitioning on both R and S or hash function on both R and S, this means that processor P i or site S i gets partitions R i and S i *When would you use symmetric fragment-and-replicate-jo
Related Search
Similar documents
View more...
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks