loopback0 – Douglas Gourlay's Blog Data Centers, Virtualization, and Cloud Computing


16 Jul 2009

Things I Would Like to Change Part 1/N

Why someone would do this to a poor Ethernet cable is beyond me...

Those of you who know me know I am a bit opinionated. I don't profess to be right every time, but I will pretty much always have an opinion and will always argue it to the best of my ability.

I was sitting down today and kept coming up with a bunch of things I wanted to change: in the networking world, the data center space, virtualization platforms, nebulous clouds, or even tax structures, legislation, and other things reasonably adjacent to the tech sector.

So in that vein I thought I would put this series together: Things I Would Like to Change. Feel free to suggest your own things that you'd like to change. I really don't mind, and I may make a post out of them too!

For the first thing I will err on the side of something techie:

EtherChannel. I would love to change the EtherChannel hashing function and do something far more intelligent, automated, and better performing. Most switches today use a simple hash based on L2, L3, or L3 plus L4 port info to determine which link to send a given traffic flow down. Once the hash has picked a link, that flow stays on it unless a link fails, in which case the traffic is remapped.
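As a rough illustration of that static behavior, here is a minimal Python sketch. It assumes a 3-bit hash (8 buckets) and uses CRC32 purely as a stand-in for whatever hash a given switch implements; the function name `pick_link` and the interface names are made up for the example.

```python
import zlib

def pick_link(src_ip, dst_ip, src_port, dst_port, links, hash_bits=3):
    """Map a flow to one link of a bundle using a fixed hash of its headers.

    CRC32 is only a stand-in here; real hardware uses its own hash function.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    bucket = zlib.crc32(key) & ((1 << hash_bits) - 1)  # keep the low hash_bits bits
    return links[bucket % len(links)]

links = ["Te1/1", "Te1/2", "Te1/3", "Te1/4"]  # hypothetical interface names
flow = ("10.0.0.5", "10.0.1.9", 49152, 443)   # hypothetical L3 + L4 flow

# Every packet of the flow lands on the same link, however large the flow is.
first = pick_link(*flow, links)
assert all(pick_link(*flow, links) == first for _ in range(1000))

# Only a link failure changes the mapping: the flow is rehashed over the survivors.
survivors = [link for link in links if link != first]
print(first, "->", pick_link(*flow, survivors))
```

Nothing in this mapping looks at how much traffic the flow is actually sending, which is the root of the complaint.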

Why is this not good enough? It's actually okay for some traffic. But when host interconnect speeds and uplink speeds are identical we start running into problems: a single host can generate a flow that consumes an entire uplink, and then you deal with contention and buffering and all sorts of fun stuff. Today, we are seeing a convergence of host speeds and uplink speeds at 10Gb, so this problem will rear its ugly head again.
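To put a rough number on that contention risk, here is a small worked example in Python. It assumes the hash spreads flows uniformly and independently across links, which real traffic will not do exactly, and the function name `p_shared_link` is invented for the illustration.

```python
from fractions import Fraction

def p_shared_link(num_links, num_flows):
    """Probability that at least two line-rate flows hash to the same link.

    Assumes each flow is placed uniformly and independently (an idealization).
    """
    if num_flows > num_links:
        return Fraction(1)  # pigeonhole: some link must carry two flows
    p_all_distinct = Fraction(1)
    for already_placed in range(num_flows):
        p_all_distinct *= Fraction(num_links - already_placed, num_links)
    return 1 - p_all_distinct

# Four uplinks, each the same speed as the hosts feeding them.
for flows in (2, 3, 4):
    print(flows, "line-rate flows:", p_shared_link(4, flows))
```

Under those assumptions, two line-rate flows on a four-link bundle share a link one time in four, and with four such flows a collision is close to certain (29/32).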

What would be better? Well, several options:

1) Wider hash. A wider hash, say 32-bit rather than 3-bit, means I get finer granularity of traffic apportionment when the number of links is not a power of two. It also means that traffic gets re-apportioned much more fairly after a link failure.

2) Wider hash with counters and dynamic bucket re-mapping. I came up with this idea about five years ago. The short version is that if any 'bucket' is carrying traffic at around the link speed, you move the other traffic on that link to non-congested links. This lets large flows go through unhindered without congesting multiple smaller flows. It may cause some flapping if timers are tuned too tightly.

3) Out-of-Order bits. Create an OOO bit that can be set with an ACL. Then, for traffic that is not impacted by out-of-order delivery, you set the OOO bit with an ACL match and spray-and-pray that traffic across multiple links in an EtherChannel. This would work for video flows that are protected by FEC, DNS lookups, and some of the more elegant bulk-file movement protocols. This would not work for market data feeds where receipt is order-dependent and packet order is generally not encoded into the payload.

4) Out-of-Order with timing. Basically you run a TDR/OTDR test and determine the latency of the physical media for each link in an EtherChannel bundle. Then you maintain a small re-assembly buffer on the receive side, sized from the maximum delta in latency between the fastest and slowest links between two given nodes. While more complex, this allows packet striping, ensuring almost perfect efficiency in link utilization. If the distance of the physical media is too great and the latency spread too wide, we would be able to identify up front that the links selected were incapable of this type of advanced operation.
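The first two options can be sketched in a few lines of Python. This is only an illustration under stated assumptions: buckets are mapped to links with a plain modulo, per-bucket byte rates are assumed to be available from counters, and the names `apportion`, `rebalance`, and `hot_fraction` are invented here rather than taken from any product.

```python
def apportion(hash_bits, num_links):
    """Number of hash buckets each link receives under modulo mapping."""
    base, extra = divmod(1 << hash_bits, num_links)
    return [base + (1 if link < extra else 0) for link in range(num_links)]

# Option 1: three links. A 3-bit hash splits 3/3/2; a 32-bit hash is near even.
print(apportion(3, 3))
print(apportion(32, 3))

def rebalance(bucket_to_link, bucket_bps, links, link_bps, hot_fraction=0.9):
    """Option 2: move the other buckets off any link hosting a near-line-rate bucket."""
    load = {link: 0 for link in links}
    for bucket, link in bucket_to_link.items():
        load[link] += bucket_bps.get(bucket, 0)
    remapped = dict(bucket_to_link)
    threshold = hot_fraction * link_bps
    for hot, rate in bucket_bps.items():
        if rate < threshold:
            continue  # not a link-filling bucket
        hot_link = remapped[hot]
        others = [link for link in links if link != hot_link]
        if not others:
            continue  # nowhere to move traffic in a single-link bundle
        for bucket, link in list(remapped.items()):
            if bucket == hot or link != hot_link:
                continue
            if bucket_bps.get(bucket, 0) >= threshold:
                continue  # another hot bucket; leave it where it is
            target = min(others, key=load.get)  # least-loaded other link
            moved = bucket_bps.get(bucket, 0)
            load[hot_link] -= moved
            load[target] += moved
            remapped[bucket] = target
    return remapped

links = [0, 1, 2]
bucket_to_link = {bucket: bucket % 3 for bucket in range(8)}
# Assumed counter readings in bits per second; bucket 0 holds a 9.5 Gb/s flow.
bucket_bps = {0: 9_500_000_000, 3: 400_000_000, 6: 300_000_000,
              1: 1_000_000_000, 2: 2_000_000_000}
remapped = rebalance(bucket_to_link, bucket_bps, links, link_bps=10_000_000_000)
print(remapped)
```

In this run buckets 3 and 6 leave link 0, so the large flow in bucket 0 has the link to itself. The sketch has no hold-down timer, so it does nothing about the flapping mentioned in option 2, and it does not address reordering at the moment a bucket moves.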

How else would you address this EtherChannel load balancing issue?

btw, in my first week as a product manager in the switching group, back in 2001, this topic came up via a large customer at an executive briefing. I had it again at my last executive briefing at Cisco, in 2009.

dg

Author: Douglas Gourlay

high-tech executive with interests in networking, virtualization, cloud computing, and IT/Tech government policy. VP of Marketing at Arista Networks - this blog reflects Doug's personal views and opinions and not necessarily those of Arista Networks.
Comments (8) Trackbacks (0)
  1. How come every time I asked for number 2 you said no way?

    • because with the chipsets we were dealing with on the c6k it was not possible. Also, given the transistor density at the time, a counter per bucket and per link would have added up to a very large number of counters. Some folks also worried, although it was never validated, that at the point of re-allocation of a bucket you could cause out-of-order frame delivery. I felt that this could be easily overcome with a PAUSE frame or minor hold-down.

      dg

  2. Did I ask for this years ago? :-)

    D.

    • yes you did. I can’t remember whether you were running the network at Broadcast.com then or Yahoo! but it was in one of our first meetings together in 2000/2001. :) and there is an opportunity for someone to still fix it….

  3. An interesting method may be to use dispersion routing. Basically, you have a number of routes between point A and point B. Now map flows between A and B according to some hash, or round robin. If a link congests, select a couple of flows and remap them to underutilized links.

    It’s imperfect, but it breaks some polarization issues (InfiniBand suffers the same fate due to not having a priori notification of high volume flow polarization). You can also throw ECN into the mix as well, i.e. set TCP ECN, move flow hash and resume – if it’s a high volume flow latency isn’t likely an issue unless you drop frames ;^)

    If flow counters (Netflow) could be maintained and interrogated, that may be a cool way of detecting high volume flows and dealing with those selectively.

    BSG

  4. Thanks for bringing up the hashing issue. I’ve watched various switches and routers stumble on this for at least 10 years, and cowrote RFC 4814 as a starting point in describing the problem.

    When vendors respond with hand-waving about “not real world” conditions of using lots of pseudorandom addresses, my stock response is that vendors don’t get to decide what addresses their customers will use.

    As to your load balancing question: I hope more vendors will take at least your options 1 and 2 into consideration, and on option 1 will use more L2-L4 and possibly L7 criteria in making hashing decisions.

    dn

  5. Oh, and one other thing: It’s pronounced “link ag-gre-ga-shun.”

    No more of this proprietary lock-in nonsense…

    cheers

    • hahah, how true Dave. I got so used to using the term EtherChannel over the years I must concede the point.

      I think I may have to write a post on proprietary versus pre-standard tonight – always a flame-ridden topic :)
