This is Alan R.'s Typepad Profile.
Alan R.
Long-time geek and founder of Assimilation and Linux-HA projects with interests in managing computers, particularly monitoring, discovery and availability.
Recent Activity
Something I should have made a note of - it automatically configures _monitoring_, not _alerting_. It's much harder to figure out how to alert for a particular service than it is to figure out how to monitor it. Monitoring is apolitical and mostly independent of company, but alerting is not. Who to contact, when to contact them, what's the priority of this service? All good alerting questions. But you don't need to know the answers to these questions to monitor it.
I'm planning on taking a good bit of time off to make the first Assimilation release available. I'd describe its current state as a well-established proof-of-concept. Hopefully in the time I have before the end of the year, I can make it into a release worth having others try out. Feel free to play with it as is.
Alan R. added a favorite at New DivaBlog
Oct 20, 2012
Doing this right is definitely a pain in the you-know-what. It will require:
(0) Renumbering all the IP addresses of your switches as noted above, leaving room for all your servers AND their virtual IPs as well as fixed IPs
(1) Enabling LLDP or CDP in your network
(2) Writing a piece of software to intercept/read the incoming LLDP or CDP packets on all the interfaces you expect CDP or LLDP on
(3) Integrating this CDP/LLDP reader into the bootup process to assign IP addresses as the packets arrive, and terminate when all the requested interfaces are assigned
(4) Doing a LOT of testing to make this all work.
Obviously, you'd want to start with a small test network to use in creating and testing this software. Regarding Disaster Recovery - that's a complicated issue - not particularly related to cloud computing. In most cloud computing scenarios, it is unlikely that you'd have _any_ influence over IP address assignments.
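To make step (3) a bit more concrete, here is a minimal sketch of the "assign as the packets arrive, stop when every requested interface is assigned" loop. Everything named here is an assumption for illustration: `assign_addresses`, the `(interface, switch_id, port_id)` tuples standing in for decoded LLDP/CDP packets, and the `lookup_ip` callback standing in for whatever numbering scheme you settle on. It does not read real packets or configure real interfaces.

```python
def assign_addresses(announcements, wanted, lookup_ip):
    """Map each wanted interface to an IP as switch announcements arrive.

    announcements: iterable of (interface, switch_id, port_id) tuples,
                   a hypothetical stand-in for decoded LLDP/CDP packets.
    wanted:        the interfaces we expect to hear LLDP/CDP on.
    lookup_ip:     hypothetical callback turning (switch_id, port_id)
                   into an IP address under your numbering scheme.
    """
    pending = set(wanted)
    assigned = {}
    for interface, switch_id, port_id in announcements:
        if interface not in pending:
            continue  # unrequested interface, or one already assigned
        assigned[interface] = lookup_ip(switch_id, port_id)
        pending.discard(interface)
        if not pending:
            break  # every requested interface is assigned: terminate
    return assigned
```

In a real boot script the announcements would come from a packet reader and the results would be handed to whatever configures the interfaces; the sketch only shows the control flow.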
Thanks! I fixed it. Not sure how that snuck in...
My code in the Assimilation project monitoring system knows how to decode LLDP and CDP packets, is lightweight, and doesn't enable promiscuous mode on Linux. What you really want is to listen on all your interfaces at once, and enable each as the packet(s) come in. LLDP has this annoying property that each TLV's length field is split across two bytes: the length is 9 bits long, so it takes one bit from the byte that holds the 7-bit TLV type.
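For anyone writing their own reader, here is a minimal sketch of walking the TLVs in an LLDP payload. It is not the Assimilation code, and `parse_lldp_tlvs` is a name made up for this example. It assumes the Ethernet header has already been stripped, and it relies only on the standard TLV header layout: a 16-bit big-endian word holding a 7-bit type followed by a 9-bit length.

```python
import struct


def parse_lldp_tlvs(payload: bytes):
    """Yield (tlv_type, value) pairs from an LLDP payload.

    The payload is assumed to start at the first TLV (no Ethernet header).
    """
    offset = 0
    while offset + 2 <= len(payload):
        (header,) = struct.unpack_from("!H", payload, offset)
        tlv_type = header >> 9      # top 7 bits of the 16-bit header
        length = header & 0x01FF    # low 9 bits, spanning both bytes
        offset += 2
        if tlv_type == 0:           # type 0 marks the end of the LLDPDU
            return
        value = payload[offset:offset + length]
        if len(value) < length:
            raise ValueError("truncated TLV")
        yield tlv_type, value
        offset += length
```

The shift and mask are the whole trick: treating the first byte as the type and the second as the length works until a TLV is longer than 255 bytes, and then it quietly gives wrong answers.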
This is good information to know. I don't currently expect my problem domain to involve creating large numbers of nodes of the same type in a short period of time on a routine basis. That would imply that lots of new hardware showed up all at once - which is likely to occur only during initial installation. But I much appreciate your expert opinion, and will keep your advice in mind as we go forward. Thanks Much!
Hi Luannem, Thanks for your comments! Sorry I was so slow to respond to them :-(. If you hang in there, you'll get more posts like this. I'm currently making about one a week - and I have at least 3 more weeks (after today) of posts where I know what to say. Because of your interests, maybe you should join the Assimilation mailing list:
Thanks Peter! I have more articles like this lined up. I'll post one of them tonight - tomorrow morning your time ;-). I really wasn't sure how to solve this problem - but when I came up with the idea for "variable relationship names" - it seemed to be a good compromise. I rarely want to know all the memberships, and often want to know about specific memberships - so this seemed like a reasonable approach for making things easy and getting maximum advantage from Neo4j.
Thinking about this design - it seems to me it's more like DHCP than any other internet protocol that I can think of, since it's centrally managed and has other things in common. For example... When we boot up, we send a multicast/broadcast packet asking for someone to tell us who we are and how we should be configured (analogous to getting DNS entries and so on from DHCP). Like DHCP clients, our machines "renew their leases" periodically - except we do it a *lot* more often. Instead of measuring lease renewal times in minutes, hours or days, we measure them in seconds (or potentially even in fractions of seconds). To compensate for this, we distribute our "dhcp server analog" throughout the network.
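The "lease renewal" analogy can be sketched in a few lines. This is only an illustration of the idea, not how the project implements it: `LeaseTable`, its method names, and the explicit `now` argument (used in place of a real clock so the example is easy to follow) are all assumptions.

```python
class LeaseTable:
    """Track short-lived leases; a node that stops renewing is presumed dead."""

    def __init__(self, lease_seconds: float):
        self.lease_seconds = lease_seconds
        self._expires = {}  # node name -> time at which its lease runs out

    def renew(self, node: str, now: float) -> None:
        # Each renewal pushes the node's expiry forward by one lease period.
        self._expires[node] = now + self.lease_seconds

    def expired(self, now: float) -> list:
        # Nodes whose lease has run out without a renewal, in sorted order.
        return sorted(n for n, t in self._expires.items() if t <= now)
```

With a 3 second lease, a node that renews at time 0 and then goes quiet shows up in `expired(4.0)`, while one that renewed again at time 2 does not. Spreading this bookkeeping across many machines rather than one central server is what keeps renewals that frequent from swamping the network.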
That's an interesting (and thoughtful) comment. This doesn't look very much like Linux-HA (or Pacemaker) - and it's a long way from a complete solution. An OSPF network with 10K routers is a very big OSPF network (not 10K hosts, or 10K switches - they don't participate in OSPF). In what specific way do you think it should look more like OSPF? Here are my off-the-cuff thoughts on this question... How is this problem like OSPF? It is trying to manage liveness, and it is trying to be local network topology aware and network efficient. How is this problem different from OSPF? It's not trying to solve the "let's help independent fiefdoms work together" problem. At this level, all machines are "owned" by the same owner. It is not trying to provide anything more than liveness (it's trying to solve a simpler problem). There is no distributed control (at this level).
It's worth noting that Pacemaker (a child project of Linux-HA - formerly called the Linux-HA CRM) does implement this convenient type of health monitoring - that applies to every resource on the machine.
Alan R. is now following The Typepad Team
Mar 15, 2010
I spent the first 20 years of my career working for Bell Labs on exactly that kind of highly redundant system. They've been largely abandoned because they are too expensive, and to get the benefit from them they need special software. Ditto for the Tandem systems - abandoned as too expensive. Everything fails. EVERYTHING. You just have to wait long enough. Eventually the sun will burn out. The only question is what you're going to do when it fails... Quite frankly, I think all HA cluster software (as it's been traditionally understood) is doomed. Virtualization makes redundancy and failover simple, and eventually it will make it easy - probably mainly through cloud computing.