Physical network evolution within data centers:
Many of us are truly excited about the resurgence of networking as cool and happening, given the popularity of SDN. However, irrespective of the ability to programmatically define network topology or network behavior, to my mind the datacenter network architecture and platforms have gone through a profound change in the last 5-7 years.
Most of this commentary draws from my experience architecting very large-scale network infrastructure for Microsoft Online Services and UUNet.
Even in the middle of the last decade, the standard network topologies were the multi-tiered architectures shown in typical Cisco books. The access, aggregation, and core layers were typically built from large L2/L3 chassis devices that were not only complex but also resulted in severely oversubscribed topologies, due to the extremely high cost of each port. In most deployments the boxes were deployed in pairs, and an additional layer of complexity was imposed just to ensure that there was no outage in case of a platform or component failure. Many of these layers were frequently burdened with long packet filters (ACLs) to institute network segmentation, which has been one of the primary ways to implement security policies.
In a typical enterprise application environment, the scale required by an individual application is small, and as a result the web servers, application servers, and databases/storage all share a simple network that covers only a few racks. In such cases, having heavily oversubscribed upper layers of the hierarchy might not be as impactful.
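To put numbers on the oversubscription point, here is a minimal sketch of the arithmetic for a multi-tier design; all port counts and speeds are hypothetical illustrations, not drawn from any specific deployment:

    # Illustrative oversubscription arithmetic for a multi-tier design.
    # All port counts and speeds below are hypothetical examples.

    def oversubscription(down_gbps: float, up_gbps: float) -> float:
        """Ratio of traffic a tier can receive to what it can forward upward."""
        return down_gbps / up_gbps

    # A 48-port 1G access switch with two 10G uplinks: 48:20 ~= 2.4:1.
    access = oversubscription(48 * 1, 2 * 10)

    # An aggregation chassis taking 20 such switches but only 4 x 10G
    # toward the core: 400:40 = 10:1 at this tier alone.
    aggregation = oversubscription(20 * 2 * 10, 4 * 10)

    # End-to-end oversubscription compounds multiplicatively across tiers.
    print(f"access {access:.1f}:1, aggregation {aggregation:.1f}:1, "
          f"end-to-end {access * aggregation:.1f}:1")

With expensive chassis ports at every tier, the ratios compound, which is how a server could end up with only a small fraction of its line rate available across the fabric.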
However, for large web-scale environments the application needs had been changing drastically in the background, and the paradigm of overcomplicated network layers with limited, expensive bandwidth was being severely challenged. The traditional north-south bandwidth demand was overtaken by enormous east-west bandwidth demand from emerging cloud and web services. The only way for cloud services to succeed has been a horizontal scaling model, which results in heavy traffic between servers. Enterprises are starting to experience similar needs with emerging applications like big data.
Fortunately, merchant-silicon-based platforms had already started to gain a foothold, and this has been an excellent opportunity to put them to the test. Today, people have built large-scale network fabrics within data centers that allow near 1:1 oversubscription end to end, anywhere within a datacenter. A few aspects have to be tightly controlled to execute on such efforts:
1. The choice of topology is extremely important. For example, if the topology is well thought out, connecting 100K servers with each other over some number of 64x10G 1RU commodity routers is not that difficult (see the sizing sketch after this list). Of course, variations of the Clos architecture and innovative approaches come in handy here.
2. Management and operations aspects have to be thought through. This is one of the most crucial but least understood parts of the whole equation, as managing and troubleshooting large Clos networks can be non-trivial.
3. The right set of control plane protocols has to be chosen, keeping in mind present and future scaling needs.
4. Tons of ports, optics, and cables are needed here, so a good understanding of the cost structure of the platforms is beneficial.
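As a concrete illustration of the first point, here is a minimal sizing sketch using the classic folded-Clos (fat-tree) result: a non-blocking fabric built from k-port switches with n switching tiers supports 2 * (k/2)^n hosts at full bisection bandwidth. The 64-port figure comes from the 64x10G example above; everything else is illustrative:

    # A minimal Clos sizing sketch for the 100K-server example above.
    # Classic non-blocking folded-Clos (fat-tree) result: a fabric of
    # k-port switches with n tiers supports 2 * (k/2)**n hosts at
    # full bisection bandwidth (1:1 oversubscription).

    def clos_hosts(k: int, tiers: int) -> int:
        """Max hosts in a non-blocking folded Clos of k-port switches."""
        return 2 * (k // 2) ** tiers

    k = 64  # 64x10G 1RU commodity routers, as in the example above
    print(clos_hosts(k, tiers=2))  # 2,048  -- a single leaf-spine cluster
    print(clos_hosts(k, tiers=3))  # 65,536 -- a 3-tier fat-tree
    # Reaching ~100K servers therefore needs either a deeper fabric or
    # modest edge oversubscription on top of the 3-tier design.
    print(clos_hosts(k, tiers=4))  # 2,097,152 -- far beyond 100K

The striking part is that each of those thousands of fabric elements can be the same cheap 1RU box; the scale comes entirely from the wiring pattern, not from bigger chassis.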
Inside datacenters, large-capacity symmetrical networks covering the whole datacenter or an individual cluster can easily be built using routers and optics at commodity price points. In such networks, ECMP-based flow hashing works well if one assumes a good mix of flow sizes from the applications. There have been concerns about elephant flows getting hashed onto the same link and creating hot spots, but those concerns are reduced if the server interface speed and the infrastructure interface speed are 4x to 10x apart.
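To illustrate, here is a minimal, self-contained sketch of ECMP-style flow hashing; it is a generic model, not any particular switch's hash function. Each flow's 5-tuple deterministically selects one of the equal-cost uplinks, so packets of a flow stay on one path while a large population of flows spreads roughly evenly:

    # A minimal sketch of ECMP-style flow hashing (illustrative, not any
    # vendor's actual hash): a flow's 5-tuple picks one of N equal-cost
    # uplinks, keeping each flow on one path while flows spread out.

    import hashlib
    import random

    def ecmp_pick(flow: tuple, num_paths: int) -> int:
        """Deterministically map a 5-tuple to one equal-cost link."""
        digest = hashlib.sha256(repr(flow).encode()).digest()
        return int.from_bytes(digest[:8], "big") % num_paths

    random.seed(42)
    paths = [0] * 8  # 8 equal-cost uplinks
    for _ in range(10_000):  # 10,000 random flows
        flow = ("10.0.%d.%d" % (random.randrange(256), random.randrange(256)),
                "10.1.%d.%d" % (random.randrange(256), random.randrange(256)),
                random.randrange(1024, 65536), 443, "tcp")
        paths[ecmp_pick(flow, len(paths))] += 1
    print(paths)  # with many small flows, counts come out roughly even

With a healthy mix of flow sizes, the law of large numbers keeps per-link load even; a single elephant flow, by contrast, pins all of its bandwidth to one link, which is exactly why the 4x-10x headroom between server and fabric interface speeds matters.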
There is published work on control plane solutions for commodity platforms in a large real-world deployment. To build large-scale networks within datacenters, one need not introduce complicated logic into the control or data plane to try to drive out every last bit of link utilization. The fact is that bandwidth costs have been falling drastically, and topologies can easily be constructed so that bandwidth is abundant. Adding more complexity can prove detrimental to operational stability and reliability.
Implications of datacenter evolution:
The datacenters built by large-scale cloud service providers have evolved in profound ways. The network is the fundamental enabler of cloud services and plays a pivotal role in the following aspects:
1. Enabling compute and storage resources to interact with each other in an unconstrained way, based on application needs. Of course, in many cases the compute and storage have some affinity, but they are spread around globally, which is why many serious players are building high-bandwidth global backbones.
2. Connecting the cloud to the Internet for web-based access. Many cloud providers connect to the Internet at a multitude of egress points, with commercial arrangements with settlement-free peers and with local and global transit providers.
3. Segmenting users of cloud resources from one another by creating virtual network spaces and allowing appropriate controls between virtual networks (see the sketch after this list).
4. Connecting the virtual network space to the appropriate enterprise network domains, or to other virtual networks/domains, as part of a federated model.
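As a sketch of the third and fourth points, consider a toy model of tenant segmentation; the class and method names are hypothetical, not any specific product's API. Each tenant is isolated in its own virtual network, and cross-network reachability exists only where an explicit policy is installed, which is also the hook a federated model would use:

    # A toy model of tenant segmentation in a virtual network overlay
    # (hypothetical names, not any specific product's API): each tenant
    # gets an isolated virtual network ID, and traffic between virtual
    # networks is allowed only where an explicit policy entry exists.

    from dataclasses import dataclass, field

    @dataclass
    class VirtualNetworkFabric:
        vni_by_tenant: dict = field(default_factory=dict)  # tenant -> VNI
        allowed_pairs: set = field(default_factory=set)    # (src, dst) VNIs
        _next_vni: int = 5000

        def create_network(self, tenant: str) -> int:
            """Allocate an isolated virtual network (VNI) for a tenant."""
            vni = self.vni_by_tenant.setdefault(tenant, self._next_vni)
            if vni == self._next_vni:
                self._next_vni += 1
            return vni

        def connect(self, src: str, dst: str) -> None:
            """Install a one-way policy allowing src to reach dst."""
            self.allowed_pairs.add(
                (self.vni_by_tenant[src], self.vni_by_tenant[dst]))

        def permitted(self, src: str, dst: str) -> bool:
            """Default deny: cross-network traffic needs explicit policy."""
            s, d = self.vni_by_tenant[src], self.vni_by_tenant[dst]
            return s == d or (s, d) in self.allowed_pairs

    fabric = VirtualNetworkFabric()
    fabric.create_network("tenant-a")
    fabric.create_network("tenant-b")
    print(fabric.permitted("tenant-a", "tenant-b"))  # False: isolated
    fabric.connect("tenant-a", "tenant-b")           # federate the two
    print(fabric.permitted("tenant-a", "tenant-b"))  # True after policy

The design choice worth noting is default deny: isolation is the baseline, and connectivity between virtual networks or to enterprise domains is an explicit, auditable act.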
The first two areas are clearly going to be addressed by the physical networking domain. Going forward, the industry is going to see enormous innovation as speeds and feeds evolve from 40G to 100G to 400G and beyond. Aspects like port density, interface mix, power draw, optical component integration, and optical interface support are going to define the coming generation of datacenter switches. The switches will be optimized for higher non-blocking forwarding capacity rather than for proportionately higher control plane scaling capacity. Analytics, diagnostics, and smart traffic engineering will be redefined as a result of innovations in the physical networking space.
Vendors making the right platform choices for their customers are going to give those customers a significant edge over their competitors. Getting locked in with a single vendor would significantly reduce a customer's chances to leverage the innovation prowess of the industry.
The areas covered by the third and fourth points will move the virtual network infrastructure more naturally into the server space. Servers, with their sizeable memory and flexible compute resources, are well positioned to absorb the network edge and provide the required scale. Industry standards are critical for cloud entities to interoperate with each other. The global cloud infrastructure will carry enormous amounts of business data that will be open for consumption in a streamlined way. A virtual network infrastructure with well-orchestrated segmentation and access logic can prove to be an enabler of such a marketplace.
Conclusion:
The datacenter space running cloud infrastructure has already gone through significant shifts in the last several years, even before the arrival of SDN approaches. The shift happened due to solid business needs, driven by emerging applications whose traffic demands had morphed away from traditional requirements.
In the coming years, the datacenter space is going to see some very interesting progress in both the physical and the virtual networking worlds. For the physical networking piece, the top-of-rack switch could become the boundary of the network, while for the virtual networking piece the servers will become the new network edge.
The physical network evolution will be encouraged by the adoption of multi-vendor approaches by cloud service providers. That will push vendors to innovate aggressively and to keep reducing cost and/or increasing capacity as they move units of data around the datacenter. The network virtualization area, on the other hand, will need well-vetted standards and scale-out approaches to become an enabler of cloud evolution.
Customized and non-standard virtualization solutions have definitely done a good job of bringing awareness to network virtualization in the cloud space. However, enterprise cloud infrastructures spanning multiple public and private clouds will become the norm going forward. It is hard to see that world without standardization in the virtual networking space, leveraging the well-known distributed systems principles that we have learned through the evolution of the Internet.
By Parantap Lahiri – VP Solution Engineering, Contrail Systems