Thursday, 17 April 2014

Technical Debt, a case study : tags

 

This is going to be more a discussion / wordy thing; no real code here. This is just to wander through some points, and maybe you’ll find them interesting or thought provoking. You have been warned.

At Stack Exchange, we have a fair understanding of technical debt. Like real debt, technical debt is not by necessity a bad thing – it can allow you to choose an acceptable (but not ideal) solution today, which means you can ship today, but you know that at some point you are going to have to revisit it. Like all loans, technical debt carries interest.

Today I’m talking about tags; specifically, the tags on Stack Exchange questions. These things:

[image: tags displayed beneath a Stack Exchange question]

I’m going to run through a lot of back-story, but if you’re lazy, just pretend you read all of it and skip to the last paragraph. Or, heck, go look at cat videos. I won’t tell.

Step 0 : the original problem

Way way back (long before I joined the company), the then-much-smaller team needed a way to store and query tags – in particular a wide range of “{a} and {b} and {c}”, “{d} or {e}”, “{f} but not {y}” etc queries for the “show me the questions in {some tags}” pages, or the search page (which allows a lot of tag-based filtering). Your classic database approach here might be to have a Posts table, a Tags table, and a PostTags table, and do a lot of work on PostTags, but as I understand it this simply didn’t want to perform well. Equally, we access and display questions a lot. No, a real lot. Huge amounts. The one thing that we want to happen really really efficiently is “fetch a question”.

Having that data spread over 3 tables requires complex joins or multiple queries (because you could be fetching 500 questions, which could each have 1-5 tags), and then processing the more complicated data. Because of this, we store the tags as a single character-data field in the database – all the tags in one string. This makes it possible to get the post including all the tags in a single row query, and just worry about deciphering it at the UI. Neat. But it doesn’t really allow for efficient query.

Aside: at this point I should also note that we additionally store the data in a PostTags table (although I can’t remember whether this was the case at the time) – the inline vs normalized data is kept in sync and used for different purposes; the things you do for performance, eh?

Step 1 : the evil/awesome hack

So, we’ve got inline tag data that is simple to display, but is virtually impossible to query. Regular indexing doesn’t really work well at finding matches in the middle of character data. Enter (trumpets) SQL Server Full-Text Search. This is built into SQL Server (which we were already using), and allows all kinds of complex matching to be done using CONTAINS, FREETEXT, CONTAINSTABLE and FREETEXTTABLE. But there were some problems: stop words and non-word characters (think “c#”, “c++”, etc). For the tags that weren’t stop words and didn’t involve symbols, it worked great. So how to convince Full Text Search to work with these others? Answer: cheat, lie and fake it. At the time, we only allowed ASCII alpha-numerics and a few reserved special characters (+, -, ., #), so it was possible to hijack some non-English characters to replace these 4, and the problem of stop words could be solved by wrapping each tag in a pair of other characters. It looked like gibberish, but we were asking Full Text Search for exact matches only, so frankly it didn’t matter. A set of tags like “.net c#” thus became “éûnetà écñà”. Obviously! This worked; it performed well (remember, Stack Overflow was still very young here), and allowed the team to ship. Shipping is a pretty important feature! The team knew it was a fudge, but all was well (enough) in the world…
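
To make the trick concrete, here is a minimal sketch of the kind of substitution being described – the helper name is mine, and the character mapping is only what can be inferred from the examples above (the stand-in for “-” isn’t shown in this post):

using System.Collections.Generic;
using System.Text;

static class LegacyTagFormat
{
    static string EncodeForFullTextSearch(IEnumerable<string> tags)
    {
        // wrap each tag in a pair of reserved characters (defeats stop-word handling)
        // and swap the symbols that Full Text Search chokes on for reserved stand-ins
        var sb = new StringBuilder();
        foreach (var tag in tags)
        {
            if (sb.Length != 0) sb.Append(' ');
            sb.Append('é');
            foreach (var c in tag)
            {
                switch (c)
                {
                    case '.': sb.Append('û'); break;
                    case '#': sb.Append('ñ'); break;
                    case '+': sb.Append('ç'); break;
                    default: sb.Append(c); break;
                }
            }
            sb.Append('à');
        }
        return sb.ToString(); // ".net c#" => "éûnetà écñà"
    }
}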

Step 2 : the creaks begin

Stack Overflow was a hit. Over time, the question count grew steadily larger, as did the number of page requests. Eventually it became apparent that our flagrant abuse of Full Text Search was becoming a performance bottleneck. To be fair, Full Text Search wasn’t really intended to be used in this way, and we were using it at an alarming rate (even after caching). But by now the team had a viable product, more developers, and enough information to judge that more effort was both necessary and worthwhile. At different points, then, our search efforts were re-written to use Lucene.Net and then Elasticsearch, and the “list by tag” work grew a bespoke entity that we call the “tag engine” (which is actually what this previous post is about) – essentially an out-of-process index service. Nowhere near as fully-featured as (say) Lucene.Net, but it only needs to do one thing.

Step 3 : Stack Exchange has sites about language

In time, we had requests for sites that were still primarily in English, but were about languages: French, Japanese, Russian, etc. A lot of their tags were in English, but there was a need to remove our “ASCII alpha-numerics” restriction. Fortunately, since we had Elasticsearch and the tag-engine, this mainly meant changing the few remaining Full Text Search usages (things that were used rarely, and hadn’t been worth fixing) to use alternatives. The format in the database, however, remained. And since we didn’t want to introduce even more bizarre rules, we kept the 6 reserved characters. Sorry, French.StackExchange – no “é” or “à” in your tags. In reality this was less of a problem than it sounds, as French.StackExchange elected to use accent-free tags – while sites like Russian.StackExchange could use the tags they wanted. And the world keeps turning.

Step 4 : Stack Exchange goes fully multi-lingual (well, ok, bilingual)

Enter Portuguese: with pt.stackoverflow.com, we have our first site that is completely in another language. No weasel room left now: they want tags like this:

[image: Portuguese tags containing accented characters such as “ç”]

And guess what: ç was one of our reserved tokens (representing +, so “c++” was stored as “écççà”). We couldn’t even whistle a happy tune and hope nobody would notice the gaps: they found it on day 1 of closed beta. Maybe day 2. We finally had a reason to remove this legacy from the past. But – and I cannot emphasize this enough – these changes were scary. So scary that we didn’t want to do an “update all the things” until we’d had a chance to “bed it in” on pt.StackOverflow, and fix any glitches that we’d missed. As it happened, all the code was making use of a single “decipher the tags” method, so it was sensible and pragmatic to simply make that understand both the old format and whatever we came up with now. After some thought and consideration, we settled on a pipe (bar) delimited natural representation, with leading/trailing pipes, so “.net c#” becomes simply “|.net|c#|”. This has virtues:

  • very simple to parse
  • we can tell immediately from the first character which format it is
  • bulk update and removal of tags can be done with a simple replace (including the pipes, to avoid replacing mid-tag matches)
  • and unlike the old format, you don’t need to worry about special-casing the first/last tag when doing a bulk operation (trailing spaces, etc)

Sure, there’s still a reserved |, but we can live with that. We could have used a non-printing character, but that would be very awkward in things like JSON – lots of risk of subtle bugs.
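
To illustrate the first two points, here is a minimal sketch of the “decipher the tags” idea – the method name is mine, not the real one, and the legacy character mapping is only what the examples above reveal:

using System;
using System.Linq;

static class TagFormats
{
    static string[] ParseTags(string raw)
    {
        if (string.IsNullOrEmpty(raw)) return new string[0];

        if (raw[0] == '|')
        {
            // new format: "|.net|c#|"
            return raw.Split(new[] { '|' }, StringSplitOptions.RemoveEmptyEntries);
        }

        // old format: "éûnetà écñà" – split on spaces, strip the wrappers, un-substitute
        return raw.Split(' ')
                  .Select(t => t.Trim('é', 'à')
                                .Replace('û', '.').Replace('ñ', '#').Replace('ç', '+'))
                  .ToArray();
    }
}

And the bulk-operation point follows naturally: renaming or removing a tag in the new format is a simple string replace of “|old-tag|” with “|new-tag|” (or just “|”), with no mid-tag matches and no first/last special cases.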

Once this had been working on pt.StackOverflow for a while, we flipped a switch and all new write operations switched to the new format; all “language” sites could have free access to the 6 previously reserved characters. Hoorah!

Step 5 : the backfill

When we did that, we only affected new writes. There was still data using the old format. A lot of data (not just the questions themselves: all tag edits, for example, were stored in this way). But while it remained, our “bulk remove a tag” code had to cope with two formats, which is an unnecessary maintenance overhead. So finally this week we absorbed the pain to do a batched migration of all the old data to the new format. Fairly routine, if a little scary.
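
Purely for illustration, the shape of such a batched backfill might look something like the sketch below – the table and column names, the batch size, and the use of Dapper are all assumptions on my part, not the real Stack Exchange code:

using System;
using System.Data.SqlClient;
using System.Linq;
using Dapper;

static class TagBackfill
{
    static void BackfillTags(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            int lastId = 0;
            while (true)
            {
                // walk the table in primary-key order, a modest batch at a time
                var batch = conn.Query(
                    "SELECT TOP (1000) Id, Tags FROM Posts WHERE Id > @lastId ORDER BY Id",
                    new { lastId }).ToList();
                if (batch.Count == 0) break;

                foreach (dynamic row in batch)
                {
                    string tags = row.Tags;
                    if (tags != null && tags.Length != 0 && tags[0] != '|')
                    {
                        // still in the legacy format: rewrite as "|tag1|tag2|"
                        conn.Execute("UPDATE Posts SET Tags = @tags WHERE Id = @id",
                            new { id = (int)row.Id, tags = ConvertLegacyTags(tags) });
                    }
                    lastId = (int)row.Id;
                }
            }
        }
    }

    static string ConvertLegacyTags(string legacy)
    {
        // same mapping as the earlier sketch: "éûnetà écñà" => "|.net|c#|"
        var cleaned = legacy.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                            .Select(t => t.Trim('é', 'à')
                                          .Replace('û', '.').Replace('ñ', '#').Replace('ç', '+'));
        return "|" + string.Join("|", cleaned) + "|";
    }
}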

And happy dance; the hack is no more!

So why are you telling me all this? What is the point?

Our job as engineers is not always to write the best possible thing that we can, and that can solve every problem ever, and is beautiful code that makes us want to adopt it and take it on picnics. An important part of our job is to:

  • understand what the code actually and genuinely needs to be able to do today and tomorrow
  • think of a way of delivering that with the available resources
  • think a little about what might be needed in the future, acknowledging that this is only a best guess
  • make sure that it is not impossible (or prohibitively expensive) to change things later

At the time the original hack was put in, it was absolutely the right choice. A few lines of code, a few clicks in SQL Server, job done. Shipped. Working. It would not have been sensible to invest time getting an external indexing service working just for this. But there was always the possibility to change the implementation, and the “unscramble the mess” code was in one place. Sure: there was debt, but the level of pain was kept to a minimum, and has finally been paid back in full.

    Friday, 28 March 2014

    Windows Redis-64

    As you know, I’m a huge fan of Redis. This week, a team at the Microsoft Open Tech group released a new drop of redis-64 for Windows (note: all links there are old – github is fresher), with binaries available on both NuGet and Chocolatey. It is even marked by them as “production ready”. The current drop is 2.8.4 in Redis terms (the last public drop was 2.6.12).

    So congratulations, thanks and kudos to the team involved: an important milestone.

    At Stack Exchange we use CentOS to host Redis in production, so I’m simply not in a position to comment on how well it works under production load, but for Windows developers using Redis it is really handy to have convenient access to a Redis server without having to spin up a VM.

    One really important thing to watch

    From reading the "Redis Release Notes.docx" and "redis.windows.conf" file in the packages/Redis-64.2.8.4 folder, the approach they used to solve the fork problem was to introduce a shared, memory-mapped file, and it defaults to the size of your physical memory (and that is before it has forked, when it can grow due to copy-on-write). So whether you are using it as a production server or a dev tool, you might want to pay particular attention to the “maxheap” setting in your .conf file. For my local dev work I usually spin up 5 servers, and I currently have 24GB physical memory in my machine – so by default, when running 5 completely empty databases, it instantly chewed up 120GB of hard disk space (you get it back when the server exits). If you don’t have enough disk space: the server won’t start. You'll probably see:

    QForkMasterInit: system error caught. error code=0x000005af, message=VirtualAllocEx failed.

    Fortunately it is simple to configure a memory bound:

    maxheap 4gb # or whatever

    Please note – this really isn’t a criticism; I understand the “why” – they need the shared memory-mapped file for the pseudo-fork, and they need to allocate the “maxheap” amount from the outset because IIRC you can’t grow such a map very flexibly. My intent is merely to flag it in flashing colours that if you’re playing with Redis-64, you want to think about your memory vs disk.

    Now go, play, have some Redis fun.

    Tuesday, 18 March 2014

    So I went and wrote another Redis client…

    …aka: Introducing StackExchange.Redis (nuget | github)

    The observant out there are probably thinking “Wut?” about now; after all, didn’t I write this one? Yes, yes I did. This one too, although that was just a mechanism to illustrate the C# dynamic API. So you would be perfectly justified in thinking that I had finally lost the plot and meandered into crazy-town. Actually, you’d have been reasonably justified in thinking that already. So let me take a step back and explain….

    Why? Just… why?

    Firstly, BookSleeve has served us well: it has got the job done with ever increasing load through the network. If it is possible to salute code as a trusted friend, then BookSleeve: I salute you. But: some things had to change. There were a number of problems and design decisions that were conspiring against me, including:

    • Robustness: BookSleeve acts as a wrapper around a single connection to a single endpoint; while it is up: great! But occasionally sockets die, and it was incredibly hard to do anything useful, either to investigate or to recover. In our internal code we had a whole second library that existed mainly to hide this wrinkle (and for things like emergency slave fallback), but even then it wasn’t perfect and we were getting increasing problems with network issues causing downstream impact.
    • Ability to identify performance issues: BookSleeve has only minimal instrumentation – it wasn’t a first class feature, and it showed. Again; fine when everything works great, but if you hit any load issues, there was virtually nothing you could do to resolve them.
    • Single-server: in common with “Robustness” above, the single-endpoint nature of BookSleeve means that it isn’t in a great position if you want to talk to multiple nodes; this could be to fully exploit the available master and slave nodes, or, thinking ahead, could be in consideration of “redis cluster” (currently in beta) – and simply wrapping multiple RedisConnection instances doesn’t play nicely in terms of efficient network API usage.
    • The socket query API: again, partly tied up with the above multi-server concerns, but also with things like the thread pool and IO pool (the issues described there apply equally to the async network read API, not just the thread-pool)

    And those are just the immediate problems.

    There were some other longstanding glitches in the API that deserved some love, but didn’t justify huge work by themselves – things like the fact that only string keys were supported (some folks make use of binary keys), and things like constantly having to specify the database number (rather than abstracting over that) were troublesome. The necessity of involving the TPL when you didn’t actually want true “async” (forcing you to use sync-over-async, an ugly anti-pattern) was another irritation, as was the fact that the names didn’t follow the correct async convention (although I can’t honestly remember whether the current conventions existed at the time).

    Looking ahead, “redis cluster” as mentioned above introduced a range of concerns; it probably would have been possible to wrap multiple connections in a super connection, but the existing implementations didn’t really make that feasible without significant work. Also, when looking at “redis cluster”, it is critically important that you know at every point whether a particular value represents a key versus a value – keys impact how commands are targeted for sharding (and impact whether a multi-key operation is even possible); values do not – and the distinction between them had been lost, which would basically need an operation-by-operation review of the entire codebase.

    In short, to address any of:

    • our immediate internal needs
    • the community needs that weren’t internal priorities
    • any future “redis cluster” needs
    • providing a generally usable platform going forward

    was going to require significant work, and would have by necessity involved significant breaking API changes. If I had reworked the existing code, not only would it have shattered the old API, but it would have meant compromise both for users of the old code and users of the new. And that simply wasn’t acceptable. So: I drew a line, and went for a clean break.

    So what does this mean for users?

    Firstly, if you are using BookSleeve, feel free to continue using it; I won’t delete the files or anything silly like that – but: my main development focus going forward is going to be in StackExchange.Redis, not BookSleeve. The API is basically similar – the work to migrate between them is not huge, but first – why would you? How about:

    • Full multi-server connection abstraction including automatic reconnect and fallback (so read operations continue on a slave if the master is unavailable), and automatic pub/sub resubscription
    • Ability to express preferences to target work at slaves, or demand a certain operation happens on a slave, etc – trivially
    • Full support for “redis cluster”
    • Completely reworked network layer, designed to avoid stalls due to worker starvation while efficiently scaling to multiple connections, while also reducing overheads and moving steps like text encode/decode to the caller/consumer rather than the network layer
    • Full support for both binary and string keys, while reducing (not increasing) the methods necessary (you no longer need to tell it which you want in advance)
    • Observes TPL guidance: no more sync-over-async (there is a full sync API that does not involve the TPL), and the TPL/async methods are now named appropriately
    • Instrumentation as a design feature
    • And lots more…
    • … heck, when I get a moment I might also throw our 2-tier cache (local in-memory cache with a shared redis central cache, including pub/sub-based cache expiry notification etc) down into the client library too

    Is it ready?

    Let’s consider it a “late beta”; on the Q&A sites we have now replaced all of our BookSleeve code with StackExchange.Redis code, which meant that hopefully we’ve already stubbed our toes on all the big bugs. I don’t plan on any breaking changes to the API (and will try to avoid it). Lua script support is not yet implemented (edit: it is now), and “redis cluster” isn’t yet released and thus support for this is still a work in progress, but basically: it works, and works fine.

    A full introduction and example basic usage is shown on the project site; please do take a look, and let me know if I’ve moved too much cheese!
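
    By way of a taster, here is a minimal usage sketch – the connection string and key names are mine, not lifted from the project documentation:

    using System;
    using StackExchange.Redis;

    class Demo
    {
        static void Main()
        {
            // the multiplexer is designed to be created once and shared
            var muxer = ConnectionMultiplexer.Connect("localhost");
            var db = muxer.GetDatabase();

            db.StringSet("greeting", "hello world");   // plain synchronous API – no TPL required
            string value = db.StringGet("greeting");   // RedisValue converts implicitly
            Console.WriteLine(value);

            var sub = muxer.GetSubscriber();            // pub/sub over the same multiplexer
            sub.Subscribe("messages", (channel, message) => Console.WriteLine((string)message));
            sub.Publish("messages", "ping");
        }
    }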

    Thursday, 13 March 2014

    Beware jamming the thread pool

     

    I’ve been struggling over the last few days to diagnose and eradicate a fun little bug in the cache tier of Stack Overflow / Stack Exchange; for context, we use redis extensively as a shared cache (and for some other things – including the realtime updates via web-sockets, which rely heavily on redis pub/sub) – and have this week deployed a new implementation of our redis communications stack. Brain-dead bugs aside (and I really did manage a howler, for which I apologise: sorry), we got it in and stable: it would be working just fine, processing a few million messages without issue, and then out of the blue… WHAM! 10 thousand timeouts in a second, and when you immediately go to look, everything is happy again, merrily churning through load as though you had imagined things. Local load testing failed to reproduce this issue.

    As always, I got bitten by the rule: interesting problems only happen at scale.

    ONE DOES NOT SIMPLY ... DO ANYTHING AT STACKEXCHANGE SCALE

    By which, I don’t mean to imply that we’re the biggest site out there, or that we’re doing anything especially clever (quite the opposite in my case, ahem), but we take great pride in the fact that we run a very busy site on very small numbers of servers. We accidentally found out on Tuesday (again, my bad) that we can still run the entire Stack Exchange Network on two web-servers. Not quite as well as normal, but it worked. edit - proof (courtesy of @Nick_Craver):

    The Stack Exchange Network running on two servers

    But unless you have a very expensive test lab and dedicated load-test team, it is really quite hard to simulate realistic load for the network.

    Enough rambling; what went wrong?

    I got hit by the thread-pool. You see, like BookSleeve (which I will be talking about probably next blog), the new client is an async-enabled multiplexer. As outbound messages for an endpoint come in, we dispatch them to a queue (you need to serialize work on a socket, and know which requests marry to which responses), and if we don’t already have a writer for that queue, we request one. The problem here was: I requested it from the thread-pool. Now, the thread-pool is very very clever – it has lots of smarts in there to handle automatic growth and shrinking, etc. It works perfectly well under sane conditions. But, I asked too much of it: asp.net will already be chewing through workers for requests, and we don’t currently use much async code – our requests are processed pretty much synchronously. This means that during a storm (and only then) we were essentially getting into a deadlock condition:

    • asp.net requests were using lots of workers
    • which were blocked waiting on a response from redis
    • which hadn’t been sent the request yet, because no thread-pool thread could be allocated to write the queue

    Fortunately, this is self curing; eventually either:

    • the blocking requests will timeout (we always specify timeouts, and pretty short ones), allowing the requests to complete and making more writers available
    • the thread-pool will grow and allocate a writer (although to be honest, when competing with constant inbound asp.net requests, it is dubious whether the writer would get there fast enough)

    That is why, when you look at the system a second after the trouble, it is all healthy again – the timeouts have happened, releasing enough workers back to the pool to service the queue.

    Sigh; all fun.

    The moral of the story

    See, I do have morals! Don’t use the thread-pool for time-critical operations if there’s a good chance you’ll already be stressing the thread-pool. The fix was pretty simple once we understood the problem: we now retain a dedicated writer thread per multiplexer (note: not per socket). We do reserve the right to seek help from the thread-pool if there is a backlog, but frankly that is pretty rare – the dedicated writer is usually more than able to keep up (in local testing, I’ve seen the multiplexer servicing around 400k ops/s – far above what most people need).
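
    For the curious, the general shape of that fix – not the real StackExchange.Redis internals, just a sketch of the “dedicated writer per multiplexer” idea:

    using System;
    using System.Collections.Concurrent;
    using System.Threading;

    class DedicatedWriter
    {
        private readonly BlockingCollection<Action> _outbound = new BlockingCollection<Action>();

        public DedicatedWriter()
        {
            // one writer thread per multiplexer (not per socket), owned outright
            // rather than borrowed from the thread-pool
            var thread = new Thread(WriteLoop) { IsBackground = true, Name = "RedisWriter" };
            thread.Start();
        }

        public void Enqueue(Action writeOperation)
        {
            _outbound.Add(writeOperation);
        }

        private void WriteLoop()
        {
            // drains the outbound queue independently of the thread-pool, so an
            // asp.net worker storm can’t starve the writes that would unblock it
            foreach (var op in _outbound.GetConsumingEnumerable())
            {
                op();
            }
        }
    }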

    Next time: announcing a new redis client for .NET! (also: reasons why, and what about BookSleeve?)

    Saturday, 8 March 2014

    Cleaner code: using partial methods for debugging utilities

    Often in a complex code base, there are additional bits of work you need to do to help with debugging. The easy but hard-to-maintain way to do this is with #if:

    obj.Wow();
    #if DEBUG
    // only want the overhead of checking this when debugging...
    Very("abc");
    #endif
    Such();

    This gets the job done, but can lead to a whole heap of problems:



    • general ugliness – especially when you have lots of different #if pieces all over the file
    • hard to track what uses each method (many code tools won’t look inside inactive conditional compilation blocks)
    • problems with automated tools, such as using directive removal or code organization tools, which get very confused

    We can do better!


    Partial classes were added to C# a long time ago – with one of their main aims being to help with extension points for generated code. You can then split a logical class into multiple files, as long as you use the partial modifier in each file. The contents of the files get merged by the compiler. So what about partial methods? These are like regular methods, but with the odd property that they completely evaporate if you aren’t doing anything useful; as in – the compiler completely ignores all mention of them. You declare a partial method like so:


    partial void OnSomethingAwesome(int someArg);


    (noting that they follow the same rules as [Conditional(…)] methods: you cannot specify a return value or access modifier, and the parameters cannot be declared out (although they can be ref) – again, this is because the compiler might be removing all trace of them, and we can’t do anything that would upset “definite assignment”). Then elsewhere in the code you can use that method like normal:


    Write("tralala");
    OnSomethingAwesome(123);
    return true;

    That looks convincing right? Except: until we write the body of OnSomethingAwesome, it does not exist. At all. For example, in another file somewhere we could add:


    partial class TheSameClass // and the same namespace!
    {
        partial void OnSomethingAwesome(int number)
        {
            Trace(number);
        }
    }

    And only now does our code do anything useful. It is also important to note that, just like [Conditional(…)] methods, the evaluation of the arguments and target is also removed, so you need to be a little careful… there is a huge difference between:


    OnSomethingAwesome(SomethingCritical());

    and


    var value = SomethingCritical();
    OnSomethingAwesome(value);

    In reality, this is rarely an actual issue.


    Applying this to debugging operations


    Hopefully, it should be obvious that we can use this to move all of our debugging operations out of the main code file – perhaps into a TheSameClass.debugging.cs file (or whatever you want) – which can then legitimately have a single #if conditional region. So to take our earlier example:


    obj.Wow();
    OnVery("abc");
    Such();

    with (elsewhere):


    partial void OnVery(string much);

    How about interfaces?


    EDIT: It turns out I was completely wrong here; partial interface works fine - my mistake. I'm leaving this here as a mark of my public shame, but: ignore me. You can use interfaces much like the above.

    There is no such thing as a partial interface, but what you can do is declare a separate debug-only interface, and then extend the type in a partial class:


    partial class TheSameClass : IMagicDebugger
    {
        void IMagicDebugger.So() { /* ... */ }
    }
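
    And, following the edit above, a minimal sketch of what the partial interface route can look like – the interface members here are hypothetical, not from the original post:

    // IMagicDebugger.cs
    public partial interface IMagicDebugger
    {
        void So();
    }

    // IMagicDebugger.debugging.cs
    #if DEBUG
    public partial interface IMagicDebugger
    {
        void Much();   // debug-only member; implement it in the debug-only partial class
    }
    #endif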

    Real world example


    Here’s something from some real code, where during debugging and testing I need to keep additional counters that are not needed in the production code:


    #if DEBUG
    partial class ResultBox
    {
        internal static long allocations;

        public static long GetAllocationCount()
        {
            return Interlocked.Read(ref allocations);
        }

        static partial void OnAllocated()
        {
            Interlocked.Increment(ref allocations);
        }
    }
    #endif

    The intent should be pretty obvious from the context, but note that here everything to do with this debug-only feature is now neatly packaged together.


    Enjoy.

    Monday, 3 March 2014

    Be strict with me, dear CLI

    There is an old adage in code:

    Be liberal in what you accept, and conservative in what you send

    And right now, this adage can [obscenity]. The week before last, my main gripe was with broken browser implementations (not behind proxy servers) that clearly didn’t read RFC 6455 (aka WebSocket) – or at least, they read it enough to know to send a Sec-WebSocket-Key, but not enough to understand client-to-server masking, or even simply to add a Connection or Upgrade header. But that’s not what I’m moaning about today; last week, my gripe was .NET. Specifically, IL emit.

    (wibbly-wobbly screen-fade going to backstory)

    A good number of the tools I work on involve metaprogramming, some of them to quite a scary degree. Which means lots of keeping track of the state of the stack as you bounce around the IL you are writing. And to be fair, yes, it is my responsibility to make sure that the IL I emit is meaningful… I just wish that the CLI was more consistent in terms of what it will allow.

    You see, there’s an interesting rule in the CLI specification:

    In particular, if that single-pass analysis arrives at an instruction, call it location X, that immediately follows an unconditional branch, and where X is not the target of an earlier branch instruction, then the state of the evaluation stack at X, clearly, cannot be derived from existing information. In this case, the CLI demands that the evaluation stack at X be empty.

    The scenario this describes is actually very common – for example in a while loop (which is also the core of foreach), the most common way of setting those out is:

    • (preceding code)
    • unconditional-branch: {condition test}
    • label: {the actual code}
    • (the actual code)
    • label: {condition test}
    • (condition test)
    • branch-if-true: {the actual code}
    • (subsequent code)

    The important point is that “{the actual code}” meets this special case; it is precisely the X mentioned in the specification: it immediately follows an unconditional branch, and is not the target of an earlier branch instruction (it is, however, the target of a later branch instruction). This means that to be valid, the stack (relative to that method) must be empty.

    This would actually be easy enough to smoke-test… just set up some simple IL that forces this condition, press the button, and wait for the CLI to complain about it. Here’s the C# for that, on pastie.org. The only problem is that it runs without complaining. Well, maybe DynamicMethod is a special case… we’ll try a full assembly instead. And again: it works without the slightest complaint. To get it to notice, we need to write it to disk (assembly.Save("Broken.dll");) and then run peverify Broken.dll, which finally gives us the complaint we wanted:

    [IL]: Error: [Broken.dll : BrokenType::BrokenMethod][offset 0x00000002] Stack height at all points must be determinable in a single forward scan of IL.
    1 Error(s) Verifying Broken.dll
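
    Since pastie links tend to rot, here is a rough reconstruction of that kind of smoke-test – not the author’s exact snippet; the opcodes and values are my own. A value is deliberately left on the evaluation stack across the unconditional branch, so the instruction at BODY breaks the “must be empty” rule, yet the runtime executes it quite happily:

    using System;
    using System.Reflection.Emit;

    class Program
    {
        static void Main()
        {
            var dm = new DynamicMethod("BrokenMethod", typeof(int), Type.EmptyTypes);
            var il = dm.GetILGenerator();
            il.DeclareLocal(typeof(int));          // loc.0 : loop flag
            var body = il.DefineLabel();
            var test = il.DefineLabel();

            il.Emit(OpCodes.Ldc_I4_5);             // leave a 5 on the evaluation stack...
            il.Emit(OpCodes.Ldc_I4_1);
            il.Emit(OpCodes.Stloc_0);              // flag = 1
            il.Emit(OpCodes.Br, test);             // ...across an unconditional branch
            il.MarkLabel(body);                    // “X”: follows Br, only targeted by a later branch
            il.Emit(OpCodes.Ldc_I4_0);
            il.Emit(OpCodes.Stloc_0);              // flag = 0
            il.MarkLabel(test);
            il.Emit(OpCodes.Ldloc_0);
            il.Emit(OpCodes.Brtrue, body);         // loop exactly once
            il.Emit(OpCodes.Ret);                  // returns the 5 still sitting on the stack

            var func = (Func<int>)dm.CreateDelegate(typeof(Func<int>));
            Console.WriteLine(func());             // prints 5 – no complaint at runtime
        }
    }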

    You might think I’m being fussy… I mean, if the CLI runs it anyway then what is the problem? A fair question, but the truth is more complicated. When the CLI is loading an assembly from disk it is often more strict. This depends on a lot of things, including the framework version, the framework platform, and the trust settings.

    Oh for a simple “strict mode” for running in-memory dynamic assemblies: that would make it so much easier for us long-suffering metaprogrammers. And I’m not alone here: a quick google search shows this issue has bitten mono, ikvm, mono-cecil, and many others.

    A silver lining…

    The good news is that the excellent Sigil utility (by my colleague Kevin Montrose) now has support for this, via the strictBranchVerification flag. It also makes the IL emit a lot easier to grok – here’s the same example via Sigil. Sadly, I can’t use it in my particular scenario, unless I create a custom Sigil build that uses IKVM for cross-platform emit, but for most people this should help a lot.

    Friday, 28 February 2014

    Playing with redis-cluster for non-linux-gods, step by step

     

    Redis-cluster is now available in beta, and I do love me some redis. There’s a great tutorial available for working with redis cluster, but I found it a bit fiddly to get it working: I’m normally a Windows person – we use CentOS in production, but we have an excellent sysadmin team who know how to make linux work, which doesn’t help me much when playing with stuff as a developer. Hence I thought I’d write up the step-by-step instructions to get it working if you’re not a linux god.

    As an aside for Windows developers

    For a lot of local development purposes, the “Microsoft Open Technologies, Inc” port of redis is available on nuget, which makes it really easy to run. I usually just:

    PM> Install-Package Redis-64

    to install it into a Visual Studio solution (via the Package Manager Console), and then create two solution-level files: redis-cli.cmd and redis-server.cmd, with (respectively)

    @packages\Redis-64.2.6.12.1\tools\redis-cli.exe
    @packages\Redis-64.2.6.12.1\tools\redis-server.exe

    This makes starting a redis server or console just a "right-click, Run" away. However! At the time of writing the nuget binary is based on 2.6.12; we need 3.0 for cluster (technically 2.9.50, but close enough).

    Installing your server OS

    For the purposes of this tutorial, I’m starting from a vanilla Ubuntu Server 13.10, installed into Hyper-V by mounting the iso; for VM configuration, I’m using 4GB memory, 2 CPUs, 127GB disk, and access to the external network (we’ll need files). For the OS install, I’ve simply used all the default options (except obviously my location and user details, and I enabled automatic security updates), and let it do its thing. This can take a little while (quite a bit longer than Ubuntu Desktop, but it seems a preferable choice). A long wait, a reboot and a login later, you should see your server OS:

    [image: the Ubuntu Server console after the first login]

    Working on the server

    You can work directly in the VM console if you like, but I find it awkward because copy and paste doesn’t work, so I’m going to use putty on my regular desktop to talk to the server. First we need to install an ssh server (for more complete configuration options and recommendations, see help.ubuntu.com):

    $ sudo apt-get install openssh-server

    We will also need our server’s ip address:

    $ ip addr

    which should provide a 192.* address – for me 192.168.0.15.

    Now we can switch to putty and login (by entering the ip address as the host):

    [image: the putty configuration screen, with the server’s ip address entered as the host]

    Note that the first time you log in you will get a security warning about the certificate – this is expected and is because openssh has been installed with a newly created certificate that we don’t yet trust. We know that this is our server, so we can just say yes, and proceed to enter our username and password – so now we’re logged in via a putty terminal:

    [image: logged in to the server via a putty terminal]

    Tip: right-clicking pastes whatever is in the clipboard!

    Getting a build environment

    Ubuntu Server doesn’t start with a lot installed; we’ll need a few tools…

    $ sudo apt-get install build-essential openssl libreadline6 libreadline6-dev curl git-core zlib1g zlib1g-dev libssl-dev libyaml-dev libsqlite3-dev sqlite3 libxml2-dev libxslt-dev autoconf libc6-dev ncurses-dev automake libtool bison subversion pkg-config

    (I bet you’re glad of that working copy/paste about now, eh?)

    Installing Redis

    The package manager versions of redis are universally ancient, so the recommended option is direct install. The overall instructions are on the download page, but we need to replace the stable link with the beta link (which is the Download link on the 3.0.0 Beta line, currently). So:

    $ wget https://github.com/antirez/redis/archive/3.0.0-beta1.tar.gz
    $ tar xzf 3.0.0-beta1.tar.gz
    $ cd redis-3.0.0-beta1
    $ sudo make install

    At this point we now have a usable redis-server, yay! You can test this trivially by starting a default server:

    $ redis-server

    [image: redis-server running with the default configuration]

    and then in another console session (or from your Windows host if you have redis-cli available – the version of redis-cli on nuget is fine for this):

    $ redis-cli -h 192.168.0.15 PING

    to which the server will reply, rather unoriginally, PONG. You can use ctrl+c back in the other session to stop the redis-server. Note that you don’t usually run server-based redis-server sessions directly like that, but it is a handy smoke test.

    Getting redis-trib.rb Working

    The most convenient way to configure a redis cluster is with the redis-trib.rb tool. Ubuntu Server doesn’t ship with the necessary Ruby tools installed, so we’ll need to install them. The easiest way to configure Ruby is with rvm. Now, you might think “apt-get install rvm”, but DO NOT DO THIS. Here’s the joy I snagged from a previous Ubuntu install (hence the slightly different appearance):

    [image: the errors produced by an apt-get based rvm install]

    Which takes us to this very helpful stackoverflow.com answer. Fortunately, we can avoid that pain by getting it right first time:

    $ sudo curl -L https://get.rvm.io | bash -s stable --ruby --autolibs=enable --auto-dotfiles
    $ gem install redis

    Now redis-trib.rb should work:

    $ cd src
    $ ./redis-trib.rb

    [image: redis-trib.rb usage output]

    Now you should be fully setup and configured, ready to follow the tutorial. Note that you probably want to use the non-loopback ip when configuring the cluster, so that external clients don't get confused by MOVED replies:

    $ ./redis-trib.rb create --replicas 1 192.168.0.15:7000 192.168.0.15:7001 192.168.0.15:7002 192.168.0.15:7003 192.168.0.15:7004 192.168.0.15:7005

    Transferring Configuration Files

    If, like me, you don’t usually use terminal text editors, you may find it easier to transfer configuration files (discussed more in the tutorial; note I’ve added “daemonize yes”, which makes it run in the background rather than blocking the console) from your host, for example:

    [image: the per-node configuration files on the Windows host]

    The putty package includes the pscp utility for this; first you need to know your full directory on the linux server:

    $ pwd

    And then you can recursively copy the files from your host setup to the server:

    pscp -r d:\cluster-test marc@192.168.0.15:/home/marc/redis-3.0.0-beta1

    which should show the multiple files being copied:

    [image: pscp copying the configuration files to the server]

    And now, continuing to follow the tutorial, we should be able to start the 6 nodes successfully:

    [image: the 6 redis nodes starting successfully]

    Outro

    Hopefully you should have everything here that you need to get a working install so you can follow the tutorial. The same steps should also be helpful to anyone wanting to configure a vagrant setup. Enjoy.