---[ Phrack Magazine Volume 8, Issue 53 July 8, 1998, article 05 of 15 -------------------------[ Introduction and Overview of Internet Routing --------[ krnl----[ Routing Overview: The process of routing can be quickly summarized as a node finding the path to every possible destination. Routing is present in everything from layer 1 (the physical layer) on up. The routing that most people are familiar with, however, occurs at layer 3 (the network layer) and as such, we will only reference layer 3 (and more specifically) Internet Protocol (IP) routing in this document. Protocols for exchange of routing information connect multiple routers around the world to provide them with a common view of the network through their heterogeneous, though generally consistent routing tables. Routing tables store all information necessary for the router to reach every destination on the network irrespective of size (i.e. the network could be j.random LAN with one ip router and two hosts off of an ethernet port or it could be the Internet proper). ----[ Routing Protocols: There are a wide variety of routing protocols used to contribute to the routing tables across a network. Protocols such as BGP, OSPF, RIP and ISIS help to convey a correct and coherent picture of the network to all network switches (routers). ----[ Routing Goals: You can imagine that if each router has to store information that would allow it to reach every destination on the network, there is the possibility for it to amass a large routing table. Large routing tables are difficult (and sometimes impossible) for routers to process because of physical constraints (cpu, memory or a combination). Therefore, we would like to minimize the routing table space without sacrificing the ability to reach every destination on the network. For example, if the router is connected to the Internet via one DS1 link to another router, the router could store routing table information for each destination on the Internet or it could just default non-local information out that serial link. What defaulting means is that if the router does not have a specific entry in its table for the destination that the packet is trying to find, it sends it out the default link. The router towards which a router sends defaulted packets is sometimes called the 'gateway of last resort'. This simple trick allows many routing tables to save a number of entries on the 30th order of magnitude. Routing information should not be exchanged between routers in a spurious fashion. Frequent churn in the routing table puts unnecessary stresses on the scare memory and cpu of any given router. Information propagation should not interfere with the forwarding operations of the router. Though this means that you should not send routing updates every nanosecond, it does not mean that routing information should only be exchanged and updated weekly. One of the important goals of routing is that it provide the host with a table which accurately reflects the current status of the network. The most important aspect of a router's operation is sending packets from input to correct output. Misrouting packets could cause a loss of data. Routing table inconsistencies could also cause routing loops whereby a packet is passed between two adjacent interfaces ad infinitum. It is desirous for routers to have quick convergence. Convergence can be informally defined as a metric which gauges the speed at which routers arrive at a consistent view of the network. It would be ideal to have infinitesimal convergence times because that would ensure that each router on the network can accurately reflect the current topology even after a drastic change (link failure). When the network is changing, each router must propagate data which will aid other routers in converging to the correct picture of the network status. Problems with quick convergence are found in the routing updates. If a link is flapping (changing line status from up to down) rapidly, it can generate numerous installation and withdrawal requests. Therefore, that one link can end up consuming the resources of every router on the network because the other routers are forced to install and withdraw the route in rapid succession. While convergence is an important goal of routing protocols, it is not a panacea to network woes. ----[ Distance Vector Routing Distance vector routing protocols distribute a list of tuples to all of the router's neighbors. These tuples assign a cost to reach every other node of the network. It is important to note that this routing information is only distributed to routers which are assigned as neighbors to the originating router. These neighbors are often physical, but can be logical in the case of eBGP multihop. That cost is the sum of the link costs for the router to reach a destination. Routers periodically send their neighbors distance vector updates; the neighbor then compares the received distance vector to its current distance vector. If the received values are lower, the router sends output to the destination in the distance vector over the link that it received the vector over. The count to infinity problem is a problem with many distance vector implementations. We will assume that all links have a unit cost and that each hop corresponds to a unit. For example, if router X is connected to router Y and router Y is connected to router Z, we can demonstrate this problem (see fig 1). Y knows a 1 hop path to Z and X knows a 2 hop path to Z. Assume that link YZ goes down and the cost of that route is increased to infinity (fig 2). Now, Y knows an infinite cost route to Z because it knows the link is down so it propagates this distance vector to X. Suppose X has sent an update to Y which advertises a 2 hop distance vector. Now, Y will think that it can get to Z through X, so it sends X an update that says it can get to Z in three hops (fig 3). Note that X has no idea that the distance vector being advertised to it was originated from X. This is a serious flaw in distance vectors. In their unmodified form, they do not contain the full path information that the route has traversed. As illustrated above, the router alternates states of advertising a path to Z and advertising infinity to Z. They keep this exchange up forever or until they have reached some internally defined infinity count (say 15 as in the case of RIP). Count to Infinity problem: X--------------------Y--------------------Z Y:1 X:1 X:2 Z:2 Z:1 Y:1 [ fig 1 ] All links are up, below each node we note the destination and hopcount from each respective node. X--------------------Y--------* *---------Z Y:1 <------------- Z:infinity Z:2 -------------> X:1 [ fig 2 ] The link Y - Z breaks. Node X advertises Z:2 to node Y. X--------------------Y--------* *---------Z Z:infinity(frm Y) -> X:1 Y:1 <------------- Z:3 [ fig 3 ] Node Y sends its Z distance vector to X _before_ it recieves node X's infinity. Once node Y receives node X's infinity, it sets its distance to infinity. A path vector is an easy way to defeat the count-to-infinity problem. Basically, each distance vector also includes the router path that it traversed (fig 4). The router rejects an update from its neighbor if the path included in the update includes the router receiving the update (fig 5). The Border Gateway Protocol (which is used to exchange routing information between Autonomous Systems on the Internet) incorporates the path vector to stop the count-to-infinity problem. Obviously, you have to incorporate more information in the routing table if you want to include the AS path information that the route has traversed. The designers of BGP decided that it was optimal to sacrifice storage space and processing power for the robustness that the path vector affords the routing protocol. Path Vector Solution: X--------------------Y--------------------Z Y:1 (Y) X:1 (X) X:2 (YX) Z:2 (YZ) Z:1 (Z) Y:1 (Y) [ fig 4 ] All links are up, below each node we note the destination, hopcount and path vector from each respective node. X--------------------Y--------* *---------Z Y:1 (Y) X:1 (X) Z:2 (Y Z) Z:infinity [ fig 5 ] The link Y - Z breaks. Node Y knows to ignore Xs advertisement of Z because Y is the path vector. The avoids the count-to-infinity problem. Another way to counter this problem is the split horizon. Basically, this means that a router shouldn't advertise a path to a neighbor if that neighbor is the next hop to the destination. This solves the problem presented in the example above because the path to Z from X through Y would not have been advertised to Y because Y is the neighbor _and_ the next hop to the destination (Z). A variation called split horizon with poisonous reverse has router X advertise an infinite cost to get to destination Z. Under a split horizon, router X would not advertise that it could get to router Z. ----[ Link State Routing A router using a link state routing protocol distributes the distance to its neighbors to every other router on the network. This allows each router on the network to make a routing table without knowing the full cost to the destination from any one source. The problems of loops are avoided because each router contains the full topology of the network. Basically, the router makes a 3 tuple containing the source router (itself) the neighbor and the cost to its neighbor. Therefore, if router A is connected to Router B over a link of cost 3 and router A is connected to router C over link cost 5, then router A would advertise the Link State Packets (LSPs) and to all routers on this network. Each router on the network would evaluate all of the LSPs that it receives and calculate a shortest path to every destination on the network. Obviously, the LSP is an integral part of the convergence process. If someone could inject false LSPs into the network, it could result in misrouted information (a packet taking a longer path than it should) or even in the blackholing of a router on the network. This is not necessary a malicious attack of a network, however. Router C could advertise a link to its neighbor D with the 3 tuple and then withdraw the announcement when the link goes down. Unfortunately, if the LSP advertising the link having an infinite cost arrives before the LSP advertising the cost of that link being 6, the routing table will not reflect the topology of the network and will be in that state until another LSP comes to correct the problem. To combat this, a sequence number is introduced into the LSP. Therefore, all of the routers on the network would initialize their sequence number to some starting value and then start advertising their LSPs. This solves the above problem in that the LSP advertising the link of infinite cost would have a higher sequence number than the LSP advertising the link as having cost 6. Some problems encountered when using sequences numbers are finite sequence space, sequence initialization, and aging. It is in the best interest of a robust link state protocol needs to protect its LSPs as well as choose a sequence space which is sufficiently large to accommodate updates. The sequence space that the LSPs can use is set to some finite value. Therefore, when the sequence numbers reach the top of the space, they must wrap around towards the smallest sequence number. This presents a problem because when a router compares link state updates, the greater sequence number takes preference. To combat this problem, you can define a maximum age of the LSP. Therefore, if you have not received an update in X ticks, you discard the current LSP information and wait for a further update. It must be noted that this invalidates the path information to a destination. For example, if router Y advertises a cost to its neighbor router Z where router Y is connected by one link to a meshed network, when the link between the mesh and router Y breaks, the other routers in the mesh have preserved link state information that will allow them to find a path towards Z. If they receive no updates in MAX_AGE, then they will assume that the link to Y is unreachable. This will allow each router to converge its table and allow it to advertise an infinite LSP for Y and Z. Sequence initialization is also an important facet of this problem. Say router Y crashed and is rebooting while the network is recalculating paths to it. When it starts its link state protocol back up, it must somehow indicate that it needs to reinitialize its sequence number to the last number it gave all of the other routers to allow for coherence. Therefore, it can announce paths with a sequence number in a special "initialization set". This initialization set will tell the other routers that this router needs the sequence where it left off. This is the "lollipop sequence" idiom. The sequence space really resembles a lollipop in that the normal sequence number keep churning around the finite sequence space while reinitialization takes place in a short linear sequence space (comparable to the stick :). Great pains are taken to ensure the integrity of LSPs. In fact, this entire routing algorithm depends on the LSP being digested in a coherent method to guarantee that each router has the correct view of the network topology. The question still remains how the root node router computes the distance to each destination. Because of the general nature of a link state protocol, you have various nodes of the network advertising the distance to get to their neighbors to every other node on the network. Thus each node has a collection of neighbor distances from various routers on the network. The routing table is basically 'grown' outward from the root node to all of the network extremities. This will be explained in a slightly rigorous fashion in the next section. ----[ Dijkstra's Algorithm This algorithm is a simple and elegant way to determine network topology. Basically, there are two distinct sets of destinations on the network. Destinations in set K are known routes for which a shortest path has been computed. Destinations in set U are routers for which the best path to that router is not currently known. In this set, paths are being considered as candidates to be the best path to that destination. To start off, add the current node p into the set K. Then add all of its neighbors into the set U with path/cost associations. If there is another path to one of the neighbors in the U set, then choose the path which costs the least. When the neighbors N* are added to U make sure that they indicate the cost through p as well as p's ID . Once this has been done for the set U, then pick the neighbor n to p which has the smallest cost to reach p. This is assuming that the neighbor has not already been installed in K. This algorithm stops when set U is equivalent to the empty set. When set U is null, it is implied that all destinations are in set K and have the shortest cost from the root node P on which this algorithm is running. Note, that each step evaluates adds ONE neighbor into K. That neighbor is the router with the smallest cost to reach p. ----[ Distance Vector vs. Link State We are left with these protocols like BGP which uses path vector and OSPF which uses link state. Why do they occupy such orthogonal space? When a link state protocol is working correctly, it guarantees that there will be no routing loops in the network. The link state protocol also guarantees fast convergence when there is a change in the topology of the network because the link state is distributed on a flat routing space. Since link state protocols contain these inherent advantages, why do protocols like BGP chose to employ the path vector approach? Taking a cross-section of routing protocols that are employed on the internet, one finds that the majority of large providers use OSPF to resolve routing information on their internal network and BGP to talk to other distinct networks (or autonomous systems) at their borders of administration. What suits BGP as an external protocol and OSPF for an internal routing protocol? One issue, which will be discussed in the next section, is hierarchy. BGP provides a mechanism for a routing hierarchy which enables it to greatly reduce the space of its table. OSPF, which is a link state protocol, provides a flat routing table whereby any internal router knows the full hop by hop path to any destination within the autonomous system. Furthermore, distance vector protocols understand that different areas can have different views of the network where link state protocols require that each node independently compute a consistent view of the network. This saves the DV protocol the overhead of maintaining a correct LSP database. BGP also has another 'advantage' in that it is layered on top of the Transmission Control Protocol (TCP). Therefore, in the 'best-effort' service of IP networks, BGP has assurance (to the level that TCP can guarantee) that routing information will be propagated. Whereas, you can (or should) be able to govern the status of your internal network, the nebulous exterior past your border routers confers no delivery guarantee on your routing information. Each type of routing algorithm is suited for its function. Link State protocols provide the quick convergence that is essential to an internal network while distance vector protocols provide external reachability. Hierarchy is not something that is inherent in distance vector protocols, but the implementation of a hierarchy has made BGP a widely used exterior gateway protocol. ----[ Routing Hierarchy Routing hierarchy is an oft fought debate that borders on religion. There are constantly questions about how hierarchy should be implemented (if at all) in the free form state of the current global network. Hierarchy imposes a tree of authority with the overall authority at the top of the tree and branching down to regional authorities, local authorities ad infinitum. Hierarchy simplifies routing because if a destination is not locally routable (or under your section of the tree). You can iterate up towards the top tree to try and deliver that information. As you move towards the top, the routing information contained in the routers becomes less and less specific until you reach the root node which is the least specific. It does, however, know how to route information to every possible destination on the network. It may help you to envision the hierarchy of the telephone network (built under one collective). If a call cannot be placed within a central office, it is handed to either another central office in the area code or a wide area link. The wide area link understands how to route to each area code in a full national mesh whilst the local 5ess switch only knows routing information for more specific prefixes. As the phone number becomes less specific (from right to left), the routing decision moves further up the strict hierarchy. This similar to how the domain name system (DNS) works on the internet (fig 6). You provide local records for domains that you host. When your nameserver receives a query for a record, it either returns the fact that it has authority for that record or points toward the root nameserver. The root nameserver knows the delegations of .com, .net, .org et al. and then points towards the site responsible for that top level domain. That site then points towards the site that has authority for the specific second level domain. Domain names take the form of most specific -> least specific; i.e. microsoft.com is more specific than just .com. Likewise gates.house.microsoft.com is more specific than microsoft.com. DNS Hierarchy: ___ . ___ / | \ .com. .org. .edu. / | \ microsoft.com. eff.org. isi.edu. / | \ billy.microsoft.com. x0r.eff.org. rs.isi.edu. [ fig 6 ] Each level in the hierarchy is responsible for levels of greater specificity. Root authority is controlled by the Internet Assigned Numbers Authority (IANA). It provides the top of the hierarchy in a "centrally" managed database (in fact, there are multiple root servers distributed across the county which maintain a consistent database). This is the closest example of strict hierarchy that can be found on the internet. With IP addresses, specificity increases in the opposite direction. IP addresses (version 4) are 32-bits. The rightmost bit signifies the greatest amount of specificity and the leftmost, the least. IP routing authority information is not maintained in a centralized database. Routing information is exchanged between autonomous systems via the BGP protocol. Routes take preference in order of most specific -> least specific. In this way, there is some type of hierarchy in the system (even though it is more loose than the DNS example). Generally, larger providers control larger parts of the total IPv4 space ((2^32) - 3 addresses). The converse is also true. Classless Inter-Domain Routing (CIDR) also helped to decrease the size of routing tables and increase the appearance of hierarchy. Now, instead of Sprint announcing routes to 130.4.0.0 through 130.20.0.0 (on the classical B space boundary) it could announce 130.4.0.0/12 which encompasses that entire 16 class B range. The classful ranges, subnetworking and the like are discussed in my "introduction to IP" page and are therefore not included in this document. ----[ Routing Hierarchy and Aggregation BBN divides their 8/8 network into two subnetworks and advertises reachability to the aggregate to save table space. Once inside an AS, routing obeys a fairly strict hierarchy. Router A is responsible for the entire 131.103/16. It divides it into two /17. Likewise, Router D in AS1 is responsible for 8/8 and chooses to divide it into 8.0/9 and 8.128/9 and divides responsibility for these networks to Routers E and F respectively (fig 7). Routers B, C, E, and F can further choose to subdivide their networks in a hierarchical fashion. Because of the binary nature of subnetting, networks can only be divided in half. Routing Hierarchy and Aggregation: BGP 131.169.0.0/16 <--------------------> 8.0.0.0/8 A (AS1239) D (AS1) / \ / \ B / \ C E / \ F / \ / \ 131.169.0.0/17 131.169.128.0/17 8.0/9 8.128/9 [ fig 7 ] In the internet, there is no strict routing hierarchy. There are simply large networks which peer via BGP to distribute aggregated routing information. The national backbone is populated by few nodes (when compared to the end nodes). Most national providers are one or two router hops away from every large network. Through aggregation in networks below, national providers provide fully (or at least we hope) aggregated routing information. In a strict hierarchy, only one router on any given hierarchy level can advertise reachability to a specific portion of the network. In the current state of the Internet, multiple routers can advertise reachability information. For example, Sprint announces 131.169.0.0/16 out to Digex, MCI, and BBN. Though this breaks some of the benefits of a strict hierarchy, it confers other benefits. This scheme allows for distributed control of routing information instead of depending on the node above. Also, nodes on the same level are often interconnected to aid in the dissemination of routing information. ----[ Aggregation As discussed slightly before, aggregation allowed the internet to reduce the size of its external reachability tables. Before, the granularity of route announcements allowed for only /8, /16, and /24 (octet boundaries). Now, with CIDR you could use variable length subnet masks. The only requirement was that they fall on one of the 32-bit boundaries of the IP address. Classless routing not only allows us to minimize routing table space, it also allows us to divide up large chunks of unused space into manageable pieces. Much of the Class A space is terribly under-utilized. With this scheme one can more efficiently allocate IP addresses to providers/netizens. The American Registry of Internet Numbers (ARIN) controls the allocation of IP addresses within the United States. Aggregation helps alleviate the problems of not being in a strict hierarchical structure. It allows the least amount of route table specificity at each level (assuming the routers on that level choose to fully aggregate their announcements.) The less specific aggregates do not necessarily indicate the position of a router in the hierarchy. For example, a university may announce a /8 and be 3 hops away from the national backbone. A problem with aggregates can be found when we examine candidate route specificity. If ISP A only has address space from within the allocated block to their parent P, then aggregation could cause problems if ISP A wanted to multihome to parent Q. The problem comes in that ISP A is obligated to advertise reachability only to their space. This would constitute them announcing their address space to Parent Q. Assume that parent P aggregates ISP A's space into a /16 for the sake of saving route announcements. Now, ISP A would seem to have better reachability only through parent Q because of the specificity of the route announcement (remember that more specific routes take precedence over less specific routes). This would nullify the benefits of multihoming in an attempt to distribute load over the two lines. In this case, ISP A would ask parent P to announce a more specific destination which has a length matching the length of the aggregate assigned to ISP A. Therefore, to the world above parent P and parent Q, the path to ISP A looks equally appealing. ----[ Exterior/Interior It is important to look at how routing information is disseminated throughout the network. It has already been discussed that we use separate routing protocols (with their respective benefits/costs) to talk to the internal and external world. However, these protocols cannot take orthogonal views on routing information. In fact, the interplay between interior and exterior routing protocols is what allows data to be effectively relayed to a destination external to the network as well as to distribute external routing information to all nodes on the internal network. There are a few ways to ensure that each router has a consistent view of the network. One is to distribute the external protocol into the internal protocol. Thereby, the internal protocol instantly has routing information injected in it for the best path to every external destination. Note that the router which talks eBGP (or comparable protocol) only redistributes the route that it installs in its routing table and not the other candidate routes which may have been advertised to it. Another approach is to inject the interior protocol into the exterior protocol. Of course, this necessitates filtering at the entrance point to the exterior protocol to prevent the announcement of all internal specifics. You can accomplish internal routing dissemination inside an Interior Gateway Protocol mesh. Because of the specifics of protocols like BGP, externally learned routing information will only be propagated one logical hop within the network. Therefore, every router that must know this external reachability information, must be fully meshed with the eBGP speaking routers. Also, if other routers are injecting information into the Exterior Gateway Protocol, the router should be logically fully meshed with them. ----[ Multicast Routing Overview What we have been talking about above is unicast routing. In unicast routing, you assume that each packet has a single destination address. Assuming infinite resources being available, unicast is a feasible solution for every situation. However, there are situations when it would be advantageous to send a packet to multiple destinations from a single source (point to multipoint) or from multiple sources to multiple recipients (multipoint to multipoint). The point of multicast is to provide a multicast group to which hosts can subscribe and get the specific multicast feed. The multicast group is a single IP address in class D space. Therefore, the senders and receivers are only associated by a given multicast group address. Thus, the senders move their data towards the multicast group address and the receivers specify that they want to receive information from a given group address. In fact, the sender is not required to know any information about the hosts that are receiving the feed. ----[ Multicast vs. Unicast If one was sending packets from a single source to a set of destinations, it is important to investigate how multicast and unicast handle the distribution. Assume that router A is sending data to routers B, D and E. A is at the top of the hierarchy, B and C are at the second level of the hierarchy, and D and E are below router B. With multiple unicast (fig 8), router A makes 3 copies of the packet and sends them down link AB. Router B then sends one packet to a host off of its ethernet and forwards the last two packets to routers D and E whereupon those routers send the packets to the their respective hosts in the multicast group. Therefore, this transmission takes up 3 packets per second on link AB and 1 pps on links B->Host(b), router D and router E. In a multicast routing implementation, assuming the same topology, we will have less packets. The difference is that router A sends _one_ packet over link AB. Router B then triplicates the packet and sends it to Host(b), router D and router E (fig 9). One way for triplicating the packet is to simultaneously close crossbars on a fully crossed switch fabric, thus sending data from one input to three outputs simultaneously. As long as there is route redundancy, multicast is very efficient because it minimizes redundant packets traveling to the same next-hop. Simply, as long as there is route redundancy for the distributed session (for example, an audio feed) you will see an advantage with multicast over unicast. Multicast Overview Example: Multiple Unicast: A A sends 3 packets to B. / \ / \ 3 / \ C B B sends 1 packet to each to D and E. / \ 1 / \ 1 / \ D E D and E send 1 packet to their respective hosts. [ fig 8 ] Multicast: A A sends 1 packet to B / \ / \ 1 / \ C B B duplicates the packet for its host; / \ therefore, there is 1 packet (at most) on 1 / \ 1 each link. / \ D E [ fig 9 ] This is a multicast topology rooted at node A. There is also a shortest path from A to every destination in the multicast group. This is called the shortest path multicast tree rooted in A. Data would like to shortest path on the network layer. One problem with multicast sessions is that recipients join and leave during a multicast session. This requires pruning of the multicast "tree" so that packets do not clutter a link on which there is no one requesting data from a given multicast group. To detect if there are hosts on a particular broadcast LAN that are interested in a multicast group, the router sends out Internet Group Management Protocol (IGMP) messages. Each packet has a random reply time from which the host will express interest. This is to prevent all hosts on a broadcast LAN from responding at the same time to an IGMP query. Once one host desires to receive data destined for a particular multicast groups, all other hosts which desire to be in the multicast group can discard their replies because the router knows to multicast all incoming packets destined for that group. The host then configures its adapter to answer for the MAC address corresponding to that group. Multicast must also be functional outside of the broadcast LAN. A simple solution to the problem is to give each router for which multicast is enabled the multicast packet. This is called flooding. Basically, it functions by forwarding the packet to every interface other than the one that the packet arrived on. The inherent flaws in this approach is that there is packet duplication as well as packets being sent to routers which have no hosts subscribed to the multicast group. To clarify the duplication statement, if Router A is physically meshed with routers B, C, and D and linked to its upstream via serial, when router A receives the multicast packet, it floods it out the interfaces to routers B, C, and D. These routers then flood the packet out the interface other than the one they received the packet on (namely the interface that connects them to A). This results in each of these routers receiving two copies of the packet (other than the one they received from A) in this exchange. A solution to this problem can be found in a technique called Reverse Path Forwarding (RPF). RPF specifies that the router forwards a packet with a source address of X only if the interface which the router receives the packet on corresponds to the shortest path that router has to source X (fig 10). Therefore, in the above example, each of the meshed routers _still_ receives 2 duplicate packets in the second step, but they refuse to forward them because only the packet received from the interface directly connected to A will be forwarded. As noted, RPF does not completely solve the problem of packet duplication. To solve this, we must introduce pruning. The idea is simplistic: inform neighbors that you do not wish to receive multicast packets from source X to multicast group Y. You can also specify prunes to a particular group. If a router tells its neighbors that it did not desire to receive packets for group Y and then has a host which desires to receive group Y, it sends a graft message to its neighbors specifying that it be added into the multicast tree. As a unicast aside, RPF can also be used to eliminate source address spoofing in that the router will only forward packets from source Y if it is receiving it on the interface which is the shortest path to source Y. Thus, if the router receives packets from an external link which say their saddr == saddr(y), the router will not forward them because its shortest path to Y is not from the external link. RPF Example: | <-- Point of ingress. | A-----------C |\ /| | \_______/ | | / \ | |/ \| B-----------D [ fig 10 ] ABCD are physically meshed. When A distributes a packet to BCD, there is no problem. Now, in the next step, B, C,and D forward the packet to each of their respective neighbors (for B it would be C and D and ! A because it received the packet from A). This results in C and D receiving 2 packets in this entire exchange. Note that C and D now do _not_ forward the packet they have received from A through B because that is not their shortest path to A. Their shortest path is their direct link. ----[ The Multicast Backbone (MBONE) It would be myopic to assume that every router on the internet supports multicast. Thus, when a router needed to find the shortest path to a destination (for forwarding of a multicast packet) it could look in the unicast routing table. Unfortunately (or fortunately depending on your perspective) most routers on the Internet do not support multicast or do not have it enabled by default. Therefore, until most routers support multicast, it has been "layered" over IP and tunneled from multicast router to multicast router (more specifically, the multicast header and data is encapsulated in a unicast IP header). The tunnel (which bridges the gap of unicast only routers between multicast routers) informs each end that some packets will contain a multicast group in their payload. This allows data to be routed by using unicast forwarding tables while at the same time preserving the integrity of the multicast group id. Because these multicast routers are not necessarily connected physically (though they are tunneled logically), they must be connected by a multicast routing protocol. This is necessary because the closest path via multicast may not correspond to the shortest path over unicast only routers. Distance Vector Multicast Routing Protocol (DVMRP) is an initial foray into this realm. DVMRP distributes "unicast" routes to facilitate the construction of shortest path trees. DVMRP uses the flood and prune method discussed above to discover /maintain multicast trees. There is also a link state foray into this arena. Multicast Open Shortest Path First (MOSPF) takes the LSP approach and calculates shortest absolute path. One host off of a multicast router can request to be in a multicast group. That router then distributes an LSP over the network. Of course, MOSPF (being a link state protocol) runs into salability problems. It is computationally expensive for a router to compute reachability information for each end node router. Core based trees (CBT) are an attempt to alleviate the problems that DVMRP and MOSPF experience. The concept is that multicast routers send join requests to core routers of arbitrary designation. When a router learns of a host which wishes to join a specific multicast group, it forwards a packet to the core multicast router. Every router along the way forwards the packet towards the core router and marks the interface on which the packet arrives so that it knows where to forward the multicast packets from the core. This solves the problem of having to communicate topology among all of the endpoints. The choice of a core multicast router is a non-trivial because all multicast traffic for multicast routers branching off of it _must_ pass through the core router. ----[ Protocol Independent Multicast Protocol independent multicast (PIM). Pim is a balance between flood and prune and CBT. When there are many multicast routers on a given network, it is more efficient to use the flood-and-prune method. This network environment is called "dense". On the contrary, sparse mode defines networks where there are few multicast routers. In sparse mode, it is more efficient to use CBT because the core router is not weighted in an environment when it 'polices' few multicast routers. When most of network is comprised of multicast routers, it is not prudent to require all sessions to be coordinated (and routed through) a core. Sparse mode PIM has been adapted from CBT to allow data to reach its destination via the core or through the shortest path tree. Currently, the operator must define whether groups are sparse or dense instead of leaving it up to the protocol. cisco systems' implementation of pim also supports a middle ground called 'sparse-dense' mode. ----[ Border Gateway Protocol There has been some mention of the Border Gateway Protocol (BGP) in this document. BGP was groomed as the successor to the Exterior Gateway Protocol (EGP). BGP is mainly an exterior routing protocol. It is used to communicate with systems outside of the operator's control and both distribute and receive network layer reachability information (NRLI) from the neighboring routers. BGP must be a robust protocol which has the capability of quick convergence while at the same time, not being influenced by frequent shifts in topology. When you use BGP to receive routing information, you are depending on the other networks to distribute correct information to your network. A BGP speaking router communicates with its peers via TCP. TCP over IP is a mechanism for guaranteeing the transmission of data over a best effort service at the IP layer. The choice of TCP as the distribution mechanism for BGP information is a point of contention. While TCP provides inherent checksums, acknowledgments, retransmissions and duplicate suppression mechanisms for received packets, it does not guarantee packet order or packet path. This can lead to headaches for the router receiving this information. BGP peers communicate with a variety of message formats. BGP speakers use the OPEN message to establish a peering relationship with other speakers. BGP speakers use the UPDATE message to transfer routing information between peers. Update information includes all routes and their associated attributes. KEEPALIVE messages assure that BGP peers are active. NOTIFICATION messages inform peers of error conditions. ----[ BGP path selection It is important that we understand the messages that constitute the Border Gateway Protocol, but we are still left with the question of how BGP works on the internet. One important area of clarification is in the BGP path selection algorithm. This algorithm is how BGP decides which route to prefer and attempt to install in the routing table. This algorithm is employed when there are multiple paths to a destination. As a failsafe, the first selection looks at the next hop and determines if it is accessible. If the next hop is not accessible, it is important not to consider that route as a candidate path to a destination because all data sent to its next_hop will be blackholed. The second selection mechanism is the weight of the route. Weight is a proprietary implementation to cisco Systems routers and is analogous to local preference. If two routes have different weights, the route with the largest weight is selected. Notice that the selection mechanism is basically a logical if->then chain. If candidate paths differ at a level of the selection algorithm, then the favorable path is selected and the algorithm ceases trying to decide which path to prefer. The next level is the local_pref attribute. This is a well known mandatory BGP attribute. Much like weight, the path with the highest local_pref is preferred. After local preference, the router selects the path that it originated. If the router didn't originate the paths, then the path with the shortest AS_PATH length should be selected. AS path length gauges the number of autonomous systems that this routing information has traveled through. The purpose of this selection relies in the assumption that the less ASNs the route has passed through, the closer the destination. If all of the above attributes are identical, then prefer path origin in this fashion IGP > EGP > Incomplete. If the path origins are the same, prefer the path with the lowest value MULTI_EXIT_DESCRIMINATOR (MED). MEDs are commonly used to distinguish between multiple exit points to the same peer AS. If none of these attributes are dissimilar, then prefer the path through the closest IGP neighbor. If that fails, the tiebreaker is preferring the path with the lowest IP address specified in the BGP router-id section discussed above. This selection algorithm allows effective establishment of routing policy. If I wanted to prefer routes from a certain AS over routes to another AS, I could establish a route-map at both entry points of external routing information and assign a higher LOCAL_PREF to the routes from the AS I want to favor. Unfortunately, this does not provide much granularity. This means that all traffic will be routed to the favorable AS and does not allow us to balance the load over our multiple connections. If you allow path selection to progress to the AS path length decision level, then you will get decent (though not 50-50) load balancing to destinations. This of course is assuming that you have providers with comparable customer routes and connectivity. If you have a DS3 to MCI and a DS3 to the local BFE provider, nearly all traffic will move out the MCI pipe if the BGP decision is allowed to progress down to the AS path length category. At earlier selections, you can change the preference of routes by using AS path access lists to select routes based on as path regular expressions. For example, if you wanted to select all routes that traversed UUnet and send them out your BFE provider, you could use a route map to match an AS path access list which contained _701_ and set the local_pref to 100 (or some value higher than the UUwho traversed paths from MCI). This will force all traffic destined for UUwho to exit your AS over your BFE DS3. While this affords you some granularity in load balancing, it is often not optimal. Basically, you are forcing traffic out a path that it would not normally select. If that BFE provider has many hops before it can reach UUnet, you are forcing the traffic you send out that link to traverse all of those hops and be subject to (possibly) more link congestion, latency, etc. Routing policy is something that requires the tweaking of many knobs. Much of the tweaking I have described above pertains to cisco Systems routers. It is important to understand that you must think through routing policy before you implement it. You must evaluate what load balancing you want, what traffic symmetry you want, and what general quality of service your traffic will receive because of your decisions. For information more specific than this, read the BGP RFC or current BGPv4 internet draft [1]. ----[ Open Shortest Path First v2 (OSPFv2) We are not going into great detail about OSPF. It is a link state routing algorithm. As noted above, link state algorithms route on flat space (no hierarchy). OSPF is an improvement over the vanilla LS protocol because it provides areas of maintenance for hierarchy purposes. Areas distribute full information internally by running a separate OSPF process with its area ID. Each router has an identical link state database with other routers within its area, but not with external routers. Each area operates in an autonomous state and transfers inter-area information at junction routers called area border routers. These routers are in two or more areas and help distribute information between these areas. The router has separate link state databases for each area to which it is connected. OSPFv2's main advantage is that it supports Variable Length Subnet Masks (VLSM). This means that a router can advertise reachability with more granularity than a scheme which advertised host reachability. Therefore, if the router can distribute packets to all hosts from 206.4.4.1 -> 206.4.5.254 it advertises reachability to 206.4.4.0/23 instead of each classful network separately or each host separately. This obviously saves immensely on link state database size and processing power required. For information more specific than this, read the current OSPFv2 RFC or internet draft [2]. ----[ References [1] Rehkter, Y., Li, T., " A Border Gateway Protocol 4 (BGP-4)", draft-ietf-idr-bgp4-07.txt, February 1998. [2] Moy, J., "OSPF Version 2", draft-ietf-ospf-vers2-02.txt, January 1998. ----[ EOF