Nexus 7700 Part V: TCAM Woes and Solutions
by Eric Stewart on Jan.21, 2015, under Networking, Technology
The two 7710s going into our data center (yeah, not moving as fast on that as I would like) threw me another curve today.
So we have quite a few VLANs originating from our data center that don’t pass through the firewall, so they often carry their own (sometimes lengthy) Access Control Lists (ACLs). Without going into horrible detail, ACLs are just rules that dictate what traffic is and is not allowed to enter or exit the VLAN.
I’ll try not to get into too much detail about hardware design on a Cisco device, but one thing we need to touch on is TCAM (ternary content-addressable memory): dedicated hardware memory that holds ACL rules so the switch can consult them while routing without slowing forwarding down. The catch is that TCAM is limited in size. Not being the principal architect of our network, I didn’t know that TCAM size limits might be something I would have to deal with.
In addition, the F3 series blades we’re using have four TCAM banks (two TCAMs with two banks each, identified as T0B0, T0B1, T1B0, and T1B1), and by default each bank is designated for specific purposes. The command:
show hardware access-list output vlan feature-combo <feature>
will show you which bank a given “feature” will use. As an example:
svc-7700# show hardware access-list output vlan feature-combo VACL
Feature combination supported: Yes
______________________________________________________________________________
Feature            Rslt Type     T0B0    T0B1    T1B0    T1B1
______________________________________________________________________________
VACL               Acl                           X
______________________________________________________________________________
shows you that the VLAN ACLs will go into TCAM 1, Bank 0.
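If you end up checking this for a lot of features, a small parser saves squinting at columns. This is a minimal sketch, assuming the table layout shown above (a header row naming all four banks and an “X” under the bank in use); `banks_for_feature` is a hypothetical helper of mine, not anything NX-OS provides, and it maps each “X” to the nearest bank label so a few characters of misalignment don’t matter.

```python
BANKS = ("T0B0", "T0B1", "T1B0", "T1B1")


def banks_for_feature(output: str, feature: str) -> list[str]:
    """Return the bank labels marked with 'X' for `feature` (hypothetical helper)."""
    lines = output.splitlines()
    # The header row is the one that names all four banks.
    header = next(line for line in lines if all(b in line for b in BANKS))
    # Centre column of each bank label in the header row.
    centres = {b: header.index(b) + len(b) / 2 for b in BANKS}
    first_bank_col = min(header.index(b) for b in BANKS)
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != feature:
            continue
        hits = []
        for pos, ch in enumerate(line):
            # Only look at marks in the bank columns, not in the feature name.
            if ch == "X" and pos >= first_bank_col - 2:
                # Assign the mark to the bank whose label centre is closest.
                bank = min(BANKS, key=lambda b: abs(centres[b] - pos))
                if bank not in hits:
                    hits.append(bank)
        return hits
    return []  # feature not listed in the output


if __name__ == "__main__":
    # Sample rebuilt from the output above; ljust() keeps the columns aligned.
    header = ("Feature".ljust(19) + "Rslt Type".ljust(14) + "T0B0".ljust(8)
              + "T0B1".ljust(8) + "T1B0".ljust(8) + "T1B1")
    row = "VACL".ljust(19) + "Acl".ljust(30) + "X"
    sample = "\n".join(["Feature combination supported: Yes", header, row])
    print(banks_for_feature(sample, "VACL"))  # ['T1B0']
```

Feed it the raw text of the command output; for the VACL example it reports T1B0, matching what I read off the table by eye.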
Converting the VLANs that currently exist on the 6500s to the 7700s meant, for me, pasting the 6500 VLAN interface entries into a text file and editing them (including a nice little “shutdown” at the beginning of each definition, so that the 7700s stay Layer 2 only for the time being). This isn’t a fast process: there’s a lot of Catalyst-to-Nexus translation to do, plus some adjustments to the ACLs that accompany the VLAN interfaces. Having already pulled the ACL rulesets in earlier, I got a few interfaces/ACL rulesets through the process and decided to paste the configuration for those VLANs, in shutdown mode, into the 7700s.
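The “add a shutdown to every VLAN interface” part of that editing is easy to script. Here’s a rough sketch, assuming the saved 6500 config is plain text with one `interface VlanN` line per SVI; `add_shutdown` is a hypothetical helper, and it does none of the actual Catalyst-to-Nexus translation, which still has to be done by hand.

```python
import re

# Matches "interface Vlan90", "interface vlan 90", etc.
VLAN_IF = re.compile(r"^interface\s+vlan\s*\d+\s*$", re.IGNORECASE)


def add_shutdown(config: str) -> str:
    """Insert a 'shutdown' line directly under every 'interface VlanN' line."""
    out = []
    for line in config.splitlines():
        out.append(line)
        if VLAN_IF.match(line.strip()):
            # Placed first so the SVI stays down while the rest is pasted in.
            out.append("  shutdown")
    return "\n".join(out)


if __name__ == "__main__":
    cfg = "interface Vlan90\n description [REDACTED]\n ip access-group ACLXX-Out out"
    print(add_shutdown(cfg))
```

Putting the shutdown first means the interface never comes up half-configured while the remaining lines are pasted in.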
Eight VLAN interfaces into the process, I got:
dc-7700(config-if-hsrp)# interface Vlan90
dc-7700(config-if)# shutdown
dc-7700(config-if)# description [REDACTED]
...
dc-7700(config-if)# ip access-group ACLXX-Out out
ERROR: Module 1, 2, 10 returned status: Tcam will be over used, please enable bank chaining and/or turn off atomic update.If bank-chaining is enabled on other modules and this is a new linecard insertion,please enable bank-chaining prior to reloading this module.
...
Now, that error is a little annoying, because it makes the situation sound more dire than it is:
dc-7700# show hardware access-list resource entries module 10
INSTANCE 0x0
-------------
Tcam 0 Bank 0: 10 valid entries 4076 free entries
Tcam 0 Bank 1: 9 valid entries 4077 free entries
Tcam 1 Bank 0: 2030 valid entries 2056 free entries
Tcam 1 Bank 1: 308 valid entries 3778 free entries

Index Protocol Encoding ref_count
------------------------------------
6 protocol cam entries are in use
7 mac protocol cam entries are in use

ACL Hardware Resource Utilization (Mod 10)
--------------------------------------------
                  Used    Free    Percent Utilization
-----------------------------------------------------
Tcam 0, Bank 0    20      4076    0.49
Tcam 0, Bank 1    19      4077    0.46
Tcam 1, Bank 0    2040    2056    49.80
Tcam 1, Bank 1    318     3778    7.76
...
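The “Percent Utilization” column is just Used / (Used + Free); every bank here totals 4096 entries, so T1B0 is 2040 / 4096 = 49.80%. A minimal sketch of pulling that summary table out of the command output and recomputing it, assuming the row format shown above (`parse_utilization` is a hypothetical helper, not an NX-OS feature):

```python
import re

# "Tcam 1, Bank 0  2040  2056  49.80" -> tcam, bank, used, free, percent.
# The comma after the TCAM number skips the "Tcam 1 Bank 0: ..." lines.
ROW = re.compile(r"Tcam\s+(\d+),\s*Bank\s+(\d+)\s+(\d+)\s+(\d+)\s+([\d.]+)")


def parse_utilization(output: str) -> dict[str, dict]:
    """Map 'T<tcam>B<bank>' to used/free counts and reported vs computed %."""
    banks = {}
    for m in ROW.finditer(output):
        tcam, bank, used, free, pct = m.groups()
        used, free = int(used), int(free)
        banks[f"T{tcam}B{bank}"] = {
            "used": used,
            "free": free,
            "reported_pct": float(pct),
            # Utilization is used entries over the bank's total size.
            "computed_pct": round(100 * used / (used + free), 2),
        }
    return banks


if __name__ == "__main__":
    sample = """Tcam 1 Bank 0: 2030 valid entries 2056 free entries
Tcam 0, Bank 0    20      4076    0.49
Tcam 1, Bank 0    2040    2056    49.80"""
    for name, info in parse_utilization(sample).items():
        print(name, info["used"], info["free"], info["computed_pct"])
```

The recomputed percentages line up with what the switch reports, which is a handy sanity check before reasoning about headroom.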
I did a little Googling, and it would appear that the error occurs because the new ACL would push the “Percent Utilization” for T1B0 past 50%. Atomic update is the reason we can’t go beyond 50%: during an ACL update it holds both the old copy and the new copy of the ACL in the TCAM at the same time, to prevent any service interruption (the old copy is removed once the new one is complete and active). We can turn atomic update off, but then there might just be a service impact any time someone asks us to adjust an ACL (which, for some VLANs, can be frequent). With it off, there is an option to make the default action during ACL updates “permit” rather than the default “deny”, but I don’t much like that.
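To put numbers on it: T1B0 holds 4096 entries, and 50% of that is 2048, so with 2040 already used there’s room for only 8 more entries under the 50% rule. Here’s a sketch of that arithmetic, assuming the simplified “stay at or under 50%” reading above; it is my model of the behavior, not Cisco’s actual admission logic, and the function names and the size of the incoming ACL are hypothetical.

```python
def headroom(used: int, free: int, limit_pct: float = 50.0) -> int:
    """Entries that can still be added before the bank passes limit_pct."""
    total = used + free
    return int(total * limit_pct / 100) - used


def fits_with_atomic_update(used: int, free: int, new_entries: int,
                            limit_pct: float = 50.0) -> bool:
    """Simplified model: reject any change that pushes the bank past limit_pct."""
    return new_entries <= headroom(used, free, limit_pct)


if __name__ == "__main__":
    # Tcam 1, Bank 0 from the module 10 output above.
    print(headroom(2040, 2056))                     # 8
    print(fits_with_atomic_update(2040, 2056, 9))   # False
    # With atomic update off, the model allows filling the whole bank.
    print(fits_with_atomic_update(2040, 2056, 9, limit_pct=100.0))  # True
```

Any ACL longer than a handful of entries trips the limit, which matches the error showing up on the eighth interface.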
Bank chaining may just be the way to go. It lets us use all banks for TCAM entries, at the cost of some functionality: certain combinations of ACL/TCAM usage can’t be done, because chaining gives up the isolation between banks that the default mode provides.
Switching to bank chaining requires running the command for every (non-supervisor) module, and existing TCAM entries are not rebalanced. The command in question is:
hardware access-list resource pooling mod <module#>
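Since it has to go on every I/O module, it’s easy to miss one. A tiny sketch that builds the lines to paste in, assuming you already know which slots hold I/O modules; the module numbers below come from the error message (1, 2, 10), whether that’s every I/O module in the chassis is an assumption, and `pooling_commands` and its `supervisor_slots` filter are hypothetical.

```python
from collections.abc import Iterable


def pooling_commands(modules: Iterable[int],
                     supervisor_slots: Iterable[int] = ()) -> list[str]:
    """One bank-chaining command per I/O module, skipping supervisor slots."""
    skip = set(supervisor_slots)
    return [f"hardware access-list resource pooling mod {m}"
            for m in sorted(set(modules)) if m not in skip]


if __name__ == "__main__":
    # Modules named in the "Tcam will be over used" error above.
    for cmd in pooling_commands([1, 2, 10]):
        print(cmd)
```

Remember the existing entries won’t move on their own after this, so the ACLs still need to be removed and reapplied.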
So I deleted the eight interfaces (to remove their ACLs from the TCAM tables), re-added them with the ACLs, and then looked at the resource entries:
dc-7700# show hardware access-list resource entries module 10
INSTANCE 0x0
-------------
Tcam 0 Bank 0: 1024 valid entries 3062 free entries
Tcam 0 Bank 1: 9 valid entries 4077 free entries
Tcam 1 Bank 0: 1026 valid entries 3060 free entries
Tcam 1 Bank 1: 308 valid entries 3778 free entries

Index Protocol Encoding ref_count
------------------------------------
6 protocol cam entries are in use
7 mac protocol cam entries are in use

ACL Hardware Resource Utilization (Mod 10)
--------------------------------------------
                  Used    Free    Percent Utilization
-----------------------------------------------------
Tcam 0, Bank 0    1034    3062    25.24
Tcam 0, Bank 1    19      4077    0.46
Tcam 1, Bank 0    1036    3060    25.29
Tcam 1, Bank 1    318     3778    7.76
...
A little more balanced, and consistent with the documentation’s note that bank chaining first balances entries across Bank 0 of both TCAMs before using all four banks.
So I haven’t had a chance to talk to $BOSSBIGBRAIN yet to see what they say, but this is what I’m going with so far.
Don’t be surprised if this article is edited later.
EDIT: And from what I understand we have a TAC case open for the issue, so there may be more to this than I know.