Friday 6 November 2015

Velocity Conference - Notes & Takeaways, pt.3

Finally got a chance to tidy up the final part of my notes from this year's Velocity conference!

The Definition Of Normal: An Intro and guide to anomaly detection (Alois Reitbauer)

As anomaly detection has a nice role to play in spotting issues early - ideally before any really bad things happen - I was really excited about this talk. Here is my high-level overview: Anomalies are defined as events or observations that don’t conform to an expected pattern. As such, the anomaly detection workflow is:
  1. Use actual data to define / calculate what is ‘normal’, i.e. define your ‘normal model.’
  2. The ‘normal model’ is continuously updated with new data.
  3. Hypotheses are derived from the ‘normal model.’
  4. Events are checked against your hypotheses, applying a likeliness judgement.
  5. How the event performs against this likeliness judgement translates into whether it is an anomaly or not.
But how should we best go about setting the baselines which define our normal model? One thing to bear in mind is that some of them (such as mean / average or median) don’t learn very well. Reitbauer recommended using exponential smoothing instead, since it is both easy to calculate and learns very well.
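
To make the workflow above a bit more concrete, here is a minimal sketch of an exponentially smoothed baseline in Python. It is my own illustration rather than anything taken from the slides: the class name, the smoothing factor `alpha`, the `threshold` multiplier and the `warmup` period are all assumed values that you would need to tune for your own data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SmoothedBaseline:
    """Exponentially smoothed 'normal model' for a single metric (illustrative only)."""

    alpha: float = 0.3      # smoothing factor: higher means faster learning (assumed value)
    threshold: float = 3.0  # allowed error, in multiples of the smoothed deviation (assumed value)
    warmup: int = 5         # observations to learn from before flagging anything (assumed value)
    level: Optional[float] = None  # smoothed estimate of the metric
    deviation: float = 0.0         # smoothed absolute error
    seen: int = 0                  # number of observations folded into the model

    def observe(self, value: float) -> bool:
        """Check a value against the model, then update the model with it.

        Returns True if the value is judged to be an anomaly.
        """
        if self.level is None:
            # The first observation simply seeds the model.
            self.level = value
            self.seen = 1
            return False

        error = abs(value - self.level)
        is_anomaly = self.seen >= self.warmup and error > self.threshold * self.deviation

        # Exponential smoothing: new estimate = alpha * new data + (1 - alpha) * old estimate.
        self.level = self.alpha * value + (1 - self.alpha) * self.level
        self.deviation = self.alpha * error + (1 - self.alpha) * self.deviation
        self.seen += 1
        return is_anomaly


baseline = SmoothedBaseline()
readings = [100, 102, 98, 101, 99, 100, 250, 101]
flags = [baseline.observe(r) for r in readings]
print(flags)
```

Note that in this sketch an anomalous value is still folded into the model, which widens the tolerated deviation for a while afterwards; whether that is desirable depends on the metric.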

The full slides are here and include more detailed recommendations, as well as presenting a number of different options for doing anomaly detection, including all the juicy maths if you fancy them!


A Real-Life Account Of Moving 100% To A Public Cloud (Julien Simon and Antoine Guy)

Their company Viadeo moved to AWS for reasons commonly cited by other companies making a similar move, and like many other companies facing the challenge they moved over gradually. Rather than recounting the full steps and stages of their transition, I wanted to highlight some of the key lessons they learned in the process, which really resonated with my experience of our current move to the cloud at 7digital, and the challenges it presents for us:
  1. Outline your key objectives! - You will need all the focus and direction you can get your hands on during a potentially sprawling transition like this.
  2. Plan and build with a temporary hybrid run in mind - be able to roll back etc. 
  3. Ahead of the move, have a thorough report of your infrastructure - estimate equivalent cost in cloud; evaluate each component for replacement (PaaS, SaaS or leave as is?); identify pain points (tech debt; relevance of moving legacy apps).
  4. Define a high-level migration plan - once again, for focus and direction.
  5. Tech is only half the work - identify all stakeholders and their goals; involve Legal / Finance early, especially if you might have to battle early terminations of legacy infrastructure contracts; and ensure you work on awareness and knowledge transfer across teams at all key stages of the transition.
The slides for this talk have been made available, if you're interested in the full account.


Further resources

Following the close of the conference, O'Reilly have put together a really neat collection of relevant free eBooks, covering a range of subjects around web operations, distributed systems, performance optimisation, resilience and scalability.

All the short keynote talks of the conference can be watched here; unfortunately, it doesn’t currently look like the full-length talks will be made available to the public. Slides of all the sessions (where speakers have chosen to share them) have been collated here.

Tuesday 3 November 2015

Velocity Conference - Notes & Takeaways, pt.2

Here is part two of my notes from this year's Velocity conference!

Blame, Language, Learning: Tips for Learning from Incidents (Lindsay Holmwood)

Good and helpful talk on maximising learning and minimising blame when dealing with incidents. Lindsay has also made an article version of the talk available. 

TL;DR: The language we use and the views we hold when talking about failure shape the outcome of that discussion, and what we learn for the future. Note in particular that both “Why...?” and “How...?” questions tend to limit the scope of our inquiry into incidents. Instead, “What...?” questions are a much better device for building empathy, and also help to focus the analysis on foresight, rather than its less constructive counterpart hindsight, which more easily falls prey to various cognitive biases and to blameful thinking.

Another point stressed was to always assume local rationality: “people make what they consider to be the best decision given the information available to them at the time.” - there isn't really a just culture that doesn't revolve around this premise.


Alert Overload: Adopting a Microservices Architecture without being Overwhelmed with Noise (Sarah Wells) 

No huge surprises, but a good summary on how to set up useful alerts - some key points discussed were:

Focus on business functionality! - Look at your architecture and decide which parts or relationships are crucial to your core functionalities, and decide what it is that you care about for each - speed? Errors? Throughput?

Focus on End-to-End! - Ideally you only want an alert where you actually need to take action.

Make alerts useful, and build them with support in mind! - Ensure readability when setting up alerting (e.g. use spaces rather than camel casing), and if possible make your alerts include links to more information or useful look-ups, and provide clear and helpful messages. And importantly, if you get to a point where most people filter out most of the email alerts they are getting, you should probably fix your alert system!
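
As a small illustration of the readability point, here is a sketch of how an alert message could be built in Python. This is my own example, not something shown in the talk: the function names, the check name and the runbook URL are all hypothetical.

```python
import re


def humanise(check_name: str) -> str:
    """Turn a camel-cased check name into space-separated words."""
    # Insert a space wherever a lowercase letter or digit is followed by an uppercase letter.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", check_name)


def format_alert(check_name: str, detail: str, runbook_url: str) -> str:
    """Build an alert message with a readable title and a link to more information."""
    return f"{humanise(check_name)}: {detail}\nMore information: {runbook_url}"


# Hypothetical check name and URL, purely for illustration.
print(format_alert(
    "PaymentServiceLatencyHigh",
    "95th percentile response time has been above 2s for 5 minutes",
    "https://wiki.example.invalid/runbooks/payment-latency",
))
```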

Finally, have radiators everywhere (things like dashing.io are great for dashboards), make sure you would know it if your alert system went down (!), and accept that alerts need continuous cultivation - they are never “finished”. As part of this, it is key to treat setting up an alert as part of fixing any new major issue that you weren’t previously able to detect.


WebPageTest using real mobile apps (Steve Souders)

The open source performance testing tool WebPageTest.org now offers a few “Real Mobile Networks” test locations - only a handful for the time being, but if they extend this it could be a really interesting tool for testing client web apps from different locations! To use the new service, go to webpagetest.org > enter web page URL > select one of the “Real Mobile Networks” options.

The full talk was less than 7min (!), so if you are interested in some more detail, context and caveats, you can watch it here at minimal time investment.

Sunday 1 November 2015

Velocity Conference - Notes & Takeaways, pt.1

Lots of insightful and different things going on at Velocity conference in Amsterdam this year! I've written up some of the key takeaways from the sessions I joined.

Docker Tutorial (John Willis)

If you haven’t worked much with Docker yet, the slides for this tutorial might be useful - they are a general walk-through with explanations of concepts and products, some hints at best practices, and practical exercises for consolidation. Be aware it’s pretty long (at Velocity the session took 3 hours, and that was with him actually skipping all the exercises), but it really does cover a lot!


Using Docker Safely (Adrian Mouat)

Mouat discussed the different attack vectors of containers, as well as a good few practical steps and strategies for applying common security paradigms (defence-in-depth and least privilege) to Docker and containers generally. A book chapter version of the talk is available from O'Reilly, which is handy!


Tracking Vulnerabilities in your Node.js Dependencies (Guy Podjarny and Assaf Hefetz)

Very neat demo of a security project (snyk.io - or if you prefer: npmjs.com/package/snyk) that finds and fixes (!) known security vulnerabilities in your Node.js dependencies. Watch the actual demo yourself if you're curious; it’s only 13 minutes long!


Managing Secrets At Scale (Alex Schoof)

Really valuable talk, and well worth reviewing. Some key considerations: 
  • Secrets are everywhere, whether we think of them or not. 
  • As an industry, we don’t currently tend to manage secrets very well - even when bearing in mind that security is always about trade-offs. 
  • Secret management should be considered tier 0 / core infrastructure, i.e. should be highly available, have monitoring, alerting and access control.

In light of this, Schoof proposed the following core principles of modern secret management:
  1. The set of actors who can do something should be as small as possible.
  2. Secrets need to expire, so set up efficient, easy ways to do secret rotation (this shouldn't require a deploy). NB: This also implies that secrets shouldn't be in version control.
  3. Make secret management user friendly: It should be easier to handle secrets in secure ways than insecure ways.
  4. As the security of a system is only as strong as its weakest access link, make sure you know what your weakest links are, and address them.
  5. Secrets must be highly available, as they will stop the basic functioning of apps if they aren't.
The talk went on to discuss the various aspects of building a secret management system, which you can follow via the slides; it was quite interesting. Existing services discussed and recommended in the talk were Vault, Keywhiz and CredStash, but all of these solutions are still pretty new, so with any of them there’ll probably still be quite a bit of tweaking required to get a management system in place that works well for your company.


Seeing the Invisible: Discovering Operations Expertise (John Allspaw) 

Etsy CTO John Allspaw revealed what he gets up to in his free time: he pursued an MA in “Human Factors and Systems Safety” at Lund University, Sweden (as you do). His own research for this MA explores human factors in web engineering, with respect both to understanding catastrophic failures and to understanding the human factors involved in not having catastrophic failures when things could potentially go wrong literally all the time. Human Factors & Ergonomics (HFE) research has a long history in areas like aviation, surgery and mining, but is still relatively sparse for our industry.

The talk itself (20 mins) was more of a primer with not a lot of hard and fast content - for some of the latter have a look at Allspaw’s MA thesis or, for a shorter version, his contribution in the forthcoming book “Human Factors and Ergonomics in Practice.”


PS: I also managed to partake in book signings by both Kelsey Hightower and John Allspaw! Meaning I finally got my own copy of the incredible "Web Operations". RESULT.

Friday 6 February 2015

Graph Databases

As part of the Technical Academy (the scheme at 7digital through which I have started training as a software developer), I'm working on a personal project aiming to create a proof-of-concept API which can better support the metadata schema of classical music, since this is significantly more complex than that of pop music.

It's still early days, but as the project is built on top of a neo4j graph database, I ended up reading about graph DBs in general a fair bit, and decided to put together a brief introduction for our internal knowledge sharing this week. Here it goes!