I am providing a summary of my experience at DevOpsDays Indianapolis. For more context, please visit their website: https://devopsdays.org/events/2019-indianapolis. The conference spanned three days, the first of which was a half day of training. For the training, I had to choose among three options: Kubernetes Training by IBM, Cloud Native Continuous Delivery with GoCD by Thoughtworks, and Navigating Your DevOps Journey by John Esser of Veracity Solutions. Because I only had a vague idea of what Kubernetes was, I chose the first one. The next two days followed a conference format: 5 half-hour talks, 5 Ignite talks, and 3 half-hour Open Space sessions. I had 3-5 attendee-generated options for each Open Space session. Unfortunately, I was not able to attend the third session on either day.
Once the training began, JJ Asghar, our instructor from IBM, emphasized that the purpose of this conference is to build the DevOps community. So our first task was to introduce ourselves to those around us. Before the lab work started, JJ went over some history and general information. If you are already familiar with Kubernetes, a lot of what I have to say in the following paragraphs is going to be redundant. Otherwise, I hope to provide a good summary of the training.
Introduction and Definition
The first thing I took note of is that IBM has decided to move their repositories to GitLab following Microsoft's acquisition of GitHub. JJ mentioned this while talking about cloning our lab repository in our IBM Cloud terminals. Anyway, on to Kubernetes for real this time. Kubernetes is an API to share compute resources. JJ stated this definition of Kubernetes multiple times. Kubernetes uses containers instead of VMs. When asked to run a container, Kubernetes runs it on an appropriate machine from the cluster. Containers operate on a shared kernel, greatly reducing startup time.
In Kubernetes, the most primitive element is a Pod. A Pod is a specification of a compute environment that contains one or more containers. This specification will include things like OS version, requested memory, specific hardware requirements, and so on. Pods run on Nodes. A Node is a compute resource: a VM, a laptop, a desktop, a rack server, generally anything capable of running the Node software. One or more Nodes make up a Kubernetes cluster.
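As a concrete (and entirely hypothetical) sketch of what such a specification looks like in yaml, here is a minimal Pod with a single container; the names and image are made up, and the interesting fields live under spec:

```yaml
# pod.yaml - a minimal Pod specification (hypothetical names and image)
apiVersion: v1
kind: Pod
metadata:
  name: web-app
  labels:
    app: web-app
spec:
  containers:
    - name: web-app
      image: registry.example.com/web-app:1.0.3  # pin an explicit version, never :latest
      ports:
        - containerPort: 8080
```

Applying this with 'kubectl apply -f pod.yaml' asks the cluster to schedule the Pod onto a suitable Node.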
Configuration and Deployment
Pods can have one or more deployment replicas configured. This ensures that n copies of the Pod are running, where n is the replica count; the default is 1. Kubernetes automatically restarts Pods to meet their replica count if they go down in the course of operation.
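As a hedged sketch (names and image are made up), configuring three replicas through a Deployment looks roughly like this:

```yaml
# deployment.yaml - ask Kubernetes to keep three copies of the Pod running
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3                  # Kubernetes restarts or reschedules Pods to hold this count
  selector:
    matchLabels:
      app: web-app
  template:                    # the Pod specification to replicate
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: registry.example.com/web-app:1.0.3
```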
Kubernetes has built-in DNS. However, multiple containers can communicate through localhost ports if they are in the same Pod. This is the main reason you would want to put more than one container in a Pod; otherwise, the best practice is one container per Pod. To communicate outside of the Pod, you need to expose it explicitly in the yaml. Most of the configuration parameters you will modify are under the 'spec' heading in the yaml; feel free to disregard the other sections.
You can group a collection of Pods in a Namespace. This makes it easy to duplicate your production environment for development, test, or QA. Ingress is the specification for routing requests to Pods under a single domain. If you have worked with Apache configuration files, this will look similar.
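A rough sketch of an Ingress, with made-up hostnames and service names, and using the current networking API rather than whatever version the training used:

```yaml
# ingress.yaml - route two paths of one domain to two different services (hypothetical)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-app
spec:
  rules:
    - host: shop.example.com
      http:
        paths:
          - path: /                  # storefront traffic
            pathType: Prefix
            backend:
              service:
                name: storefront
                port:
                  number: 80
          - path: /api               # API traffic on the same domain
            pathType: Prefix
            backend:
              service:
                name: api
                port:
                  number: 8080
```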
Things not to do with Kubernetes:
- Do not use NodePorts. NodePorts are a debug construct. Unfortunately, some big clients used them for production software, which is why they are still in the spec. Use a LoadBalancer Service instead (see the sketch after this list).
- Never run your database in Kubernetes. I believe JJ had us repeat this out loud multiple times.
- Do not use any wildcards or non-specific specifications in a Dockerfile. Use a multi-stage Docker build to reduce copy-paste. Be very specific about your environment configuration, including the versions of any dependencies.
- Never use :latest in your Docker images, even in development.
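On the first point, the usual alternative to a NodePort is a LoadBalancer Service. A hedged sketch, with hypothetical names and ports:

```yaml
# service.yaml - expose the web-app Pods through a cloud load balancer instead of a NodePort
apiVersion: v1
kind: Service
metadata:
  name: web-app
spec:
  type: LoadBalancer
  selector:
    app: web-app          # route to Pods carrying this label
  ports:
    - port: 80            # port exposed by the load balancer
      targetPort: 8080    # containerPort inside the Pod
```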
Things to do with Kubernetes:
- ‘kubectl get all -o wide’ checks the overall state of the cluster. Use this often.
- Every non-toy application should have a livenessProbe and a readinessProbe configured (see the container spec sketch after this list).
- ‘kubectl describe pod/x’ is a great way to see why Pod x is not working.
- The best practice for Kubernetes containers is to log to standard out. Logs to standard out will be visible in the console via kubectl commands. Note: logs to standard error will not be visible.
- Use Helm, the package manager for Kubernetes.
- In your Pod configuration, set requests and limits on the amount of RAM/CPU your application needs. These go in the container's resources section. By default there are no limits; however, Nodes that are nearing resource capacity will evict unlimited Pods first. Ideally the request and the limit are the same.
- Be very careful about spacing in your yaml files. You will almost certainly have an error at some point caused by a missing space in a yaml file.
- Use Kubernetes ConfigMaps instead of setting environment variables in Docker.
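Pulling several of those items together, here is a hedged sketch of a Pod spec with probes, resource requests/limits, and a ConfigMap-backed environment variable; all names, paths, and numbers are invented for illustration:

```yaml
# pod.yaml - probes, resource requests/limits, and a ConfigMap-backed env var (hypothetical)
apiVersion: v1
kind: Pod
metadata:
  name: web-app
spec:
  containers:
    - name: web-app
      image: registry.example.com/web-app:1.0.3
      livenessProbe:            # restart the container when this check keeps failing
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 15
      readinessProbe:           # only route traffic to the Pod once this check passes
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
      resources:
        requests:               # what the scheduler reserves for the container
          memory: "256Mi"
          cpu: "250m"
        limits:                 # ideally identical to the requests
          memory: "256Mi"
          cpu: "250m"
      env:
        - name: DATABASE_HOST
          valueFrom:
            configMapKeyRef:    # value comes from a ConfigMap, not the Docker image
              name: web-app-config
              key: database-host
```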
Kubernetes Pods have a life cycle: PostStart, PreStop, SIGTERM, SIGKILL. Your application can hook into these life cycle events for setup and teardown. However, JJ does not recommend doing this; the best practice is to use specialized init containers to perform setup operations.
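A hedged sketch of that pattern, with a made-up init container that waits for a dependency before the main container starts:

```yaml
# Pod with an init container that must finish before the app container starts (hypothetical)
apiVersion: v1
kind: Pod
metadata:
  name: web-app
spec:
  initContainers:
    - name: wait-for-db
      image: busybox:1.31
      command: ["sh", "-c", "until nc -z db.example.com 5432; do sleep 2; done"]
  containers:
    - name: web-app
      image: registry.example.com/web-app:1.0.3
```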
A short description of useful Kubernetes features:
- Kubernetes has a Scheduler and a Descheduler. Pod and Node configurations specify the scheduling priorities.
- Kubernetes Jobs run Pods that have limited lifetimes (see the sketch after this list).
- A DaemonSet runs a copy of a specific Pod on every Node.
- A sidecar is a proxy container used to reroute a Pod's requests.
- Kubernetes has Secrets built in; however, they are effectively useless. I didn't get the full story, but the gist I got is that they are so easy to circumvent that they don't add any effective security.
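For the Jobs item above, a hedged sketch of a one-off Job that runs a Pod to completion and then stops; the image and command are hypothetical:

```yaml
# job.yaml - run a Pod to completion instead of keeping it alive
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration
spec:
  backoffLimit: 3              # retry a failed Pod at most three times
  template:
    spec:
      restartPolicy: Never     # Jobs require Never or OnFailure
      containers:
        - name: migrate
          image: registry.example.com/migrations:1.0.3
          command: ["./run-migrations.sh"]
```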
Lab
During the presentation we did lab work in our IBM Cloud terminals. We started by cloning a workshop repository. It had an application with an invalid Pod configuration file. The first step was deploying the Pod and observing the errors. Then we fixed the yaml file and re-deployed, verifying that the web-app was running. Finally, we injected invalid code into the web-app, and used debug commands to find the error in the logs. During the lab, I timed the Pod restart at ~36 seconds. I had been led to believe it would be sub-second. However, it is possible we were using some budget hardware for the lab cluster.
I liked the training; it was probably my favorite part of the conference. I enjoyed college, so my opinion is likely riding on nostalgia. However, I definitely know a lot more about Kubernetes now. For me personally, doing some hands-on work helps internalize the knowledge.
Responsible Service Ownership by Brian Weber
Brian opened by describing the design standard for doors to open inwards or outwards based on what is most likely to save lives during a fire. If there is a crowd, the doors open outwards to prevent bottleneck pressure from trapping the crowd. In homes, doors open inwards so that firefighters can more easily bash through them. The purpose of this story is to illustrate the importance of standards in mitigating and preventing disasters.
He then moved on to Twitter's Technical Design Documents. These lay out, at the beginning of a project, how to run and maintain the product. This includes specifying the requirements for the microservice team responsible for maintaining it, the number of on-call personnel, and so on. In an ecosystem containing numerous microservices, it is important to develop them to a consistent standard. To that end, all Twitter microservices use the same development stack: Finagle.
Finally, each new service requires a Launch Plan. This states who the initial customers are, the expected launch date, potential issues, and mitigations. In addition to following the Launch Plan, a new service has to pass the Production Readiness Review (PRR) before launching. The PRR is essentially a giant checklist. To bypass most of the checklist, build the service using common components. In addition, you have to specify exactly who is on call and who owns the service, and show that the service has reached a certain level of testing.
After launching the service, the Service Maturity Model (SMM) comes into play. This specifies the process for keeping standards moving forward. It also specifies how to apply new standards to existing services. For example, consider an OS patch that fixes a security exploit. The SMM ensures all existing services use that patch. To keep up with new technologies, periodically review and update both the SMM and the PRR.
I found it interesting to get a look into how these larger, specialized, companies work. In my line of work it is actually pretty rare to use even the same programming language on the next project, much less the same stack. Regardless, it is important to keep up with best practices in whatever environment you find yourself in. I also believe it is critical to understand why a ‘best practice’ is the best practice. When another developer asks why you have done things in a certain way, your reply should ideally not contain the words ‘best practice’ or ‘industry standard’.
Developing for Deterministic Deliveries by Mykel Avis
People who make your life unpredictable are assholes. This was the opening statement, and the core of Mykel’s presentation. Actual determinism is not the goal he is advocating for; rather, reasonable predictability. Try to think about the development process as an assembly line. You are delivering components with known inputs and outputs, labelled appropriately and individually tracked. Store any output you produce as an immutable artifact. Use these artifacts to do your work and focus on continuous improvement. Use Value Stream Maps to determine what parts to cut or replace.
How to avoid being an asshole:
- Promise to do the smallest amount possible. This increases the chances of it getting done.
- Use an inoffensive architecture and limit complexity.
- Do not be clever. Clever is dangerous because it is hard to be consistently productive with it.
- Apply process to your development procedure.
- Apply even more testing.
- Avoid having a dumb process.
- Do not write Word documents. Use Markdown and keep it in source control.
- Do not use a Turing Complete language to do your builds. The build should not be such a source of complexity.
- Use tags to mark important places in your code’s history.
- Run your deployment process using a managed service.
- Use semantic versioning.
My takeaway from this is to be deliberate when injecting nondeterminism into a process. Mykel specifically called out artisanal software development as something to avoid: instead of making X, you should be making a factory that produces X. Artisanal software development is a fairly accurate description of my job. As with Brian's talk, I can see the logic if you are working for a large, specialized company. Regardless, most of what he said is good advice even for software artisans.
Enabling Digital Transformation Through DevOps by Peter Varhol
Peter opened by stating that this was Gerie Owen's presentation, but she was unable to attend the conference. Digital Transformations and DevOps are based on a similar mindset of continuous improvement and automation. In a Digital Transformation, you should ignore the existing structure and do what makes sense, with a focus on the customer. Continuous experimentation with new processes is critical to learning to do better.
Continuous testing is important to prevent testing from becoming a bottleneck. The DevOps workflow should have continuous testing baked into it along with continuous monitoring in test and production. Test cases should be as simple as possible and the automated tests should run in parallel across multiple machines. Integrate multiple layers of testing into the application life cycle – this includes manual testing. Continuous monitoring and testing on production is critical to quickly find and resolve issues. Everyone on the project is responsible for software quality.
To be honest, I was almost nodding off during this presentation. The first paragraph of my summary covers the first 3-5 minutes of the presentation; the rest was about testing. I don't disagree with the importance of testing, but it felt like the presentation jack-knifed into it. I kept trying to figure out how it was supporting the argument that DevOps enables Digital Transformation.
Security at DevOps Speed by Paul Meharg
Paul introduced Software Supply Chain Automation as a way to get security with DevOps. Our stakeholders continuously pressure us to deliver faster. DevOps and open source components can relieve this pressure. Open source components make up 85% of modern application code. Unfortunately, not all open source components are secure, or even good at what they do. 10% of Java components and 51% of JavaScript components have known vulnerabilities. 99% of vulnerabilities exploited by criminals are known to security and IT professionals at the time of the incident.
Developers are essentially in the procurement department. Yet they are not given the time to validate the OSS components they select. Deming had a way of tackling this problem on the automotive side:
- Source parts from fewer and better suppliers.
- Use only the highest quality parts.
- Never pass known defects downstream.
- Continuously track the location of every part.
To implement this advice in software, create a Software Bill of Materials. For each application at your company, track every OSS component. If component A has a vulnerability, you’ll know exactly which projects have issues. Once you have that list of components, order them by highest impact to the company and validate them.
When reflecting on this presentation, it seemed like there should be some market space for a third-party OSS component validation company. And indeed, there is a company doing that: Validos. They seem to be on life support, though: the latest list of validated packages was from 2014, and the latest list of members was from 2017. Perhaps there will be a resurgence. It would certainly be nice if there were some sort of seal of approval you could put on your GitHub repository certifying the software as secure, vetted by an accredited third party.
Data Driven DevOps by Baruch Sadogursky
DevOps is the common culture and goals between Ops, Dev, and QA. The value of DevOps can be judged by Flow Efficiency. Flow Efficiency can help communicate with other departments like Marketing or Sales. However, like velocity, it is not very actionable: you can tell whether it is low or high, but there is no call to action. Instead of combining metrics from the disparate components, use the specialized metrics of those components and observe how they interact with each other. For example, inefficient code (Dev) can affect cost per customer (Ops) and regression test execution time (QA). Collecting a variety of specific metrics can show progress on continuous improvement, as long as the metrics themselves do not become goals. The metrics can also drive continuous improvement. Baruch had three examples of data-driven improvements.
The first problem: QA were not getting enough test environments from Ops. The data showed that the testers were not checking back in their environments. By implementing an auto-checkin after a day of no usage they were able to reduce the number of test environments while still ensuring that enough environments were made available to testers.
Next, there was a problem with the test suite being unstable, causing QA to mark builds as failures without reason. They introduced a test-stability defect type and, by prioritizing defects of that type, got the test suite stable within a month.
Finally, the developers wanted to use a new garbage collector, but needed to make the business case for it. By monitoring operations data, they could show that the new collector reduced latency and resource usage thus bringing down costs.
After the examples, Baruch finished with some metrics pitfalls. It can be difficult to select the correct thing to measure, and collecting too many metrics can be overwhelming. The data collected needs to be reliable; unreliable data is worse than no data. A common vocabulary is necessary to make any sense of the data collected. The metrics collected need to be tweaked over time, especially as issues are resolved.
Each of these stories has at least one parallel in my work experience, so hats off to Baruch for choosing relatable examples. Statistical data is certainly important to making good business decisions. I would add that qualitative data from the customers is just as important, if not more so. Sitting down with people and seeing how they use your software gives you so much information that is 'too obvious' to write down. The same goes for departmental cooperation: problems tend to have to pass a certain threshold of irritation before being reported outside a peer group.
Ignites
- Ignite Karaoke and DevOps by Tiffany Longworth
Ignite Karaoke takes the ignite concept and removes prior knowledge of what is on the slides from the equation. You have 15 seconds per slide to read and talk about it. This gets you used to making things up on the fly and going with it. It also creates a sense of terror. These are also common experiences as a system administrator.
This was a bit of a miss for me. I think 30 minutes is borderline too short for a presentation, so my expectation is that I will get very little out of a 5 minute one.
- Intelligent Deployment Pipelines by Martez Reed
Shifting a process from Continuous Delivery to Continuous Deployment can be tricky. You can turn log messages into actionable insight into whether or not it is safe to deploy to production. ChaosToolkit can insert random effects into your staging environment. These analytics can be collected and processed with machine learning to arrive at a deployment decision.
Personally, I'm not sure this is an appropriate use of machine learning. The issue with a machine learning predicate is that it is very difficult to understand the why behind the result. For example, the machine learning algorithm says: do not deploy. Okay, so why is that? Maybe it is a human-relatable reason; maybe it has learned that it is bad when the number of bytes in the log file is evenly divisible by 13 (this is slightly hyperbolic; in practice it would be more like a combination of different parameters with varying weights passing a threshold). I'd much prefer an automated deployment checklist.
- Everything I Really Need to Know about DevOps I learned from The Golden Girls by Allie Richards
I am not familiar with The Golden Girls and 95% of this presentation went straight over my head. However, I did observe people who clearly did understand the references having a good time. I was able to pull two things out of the presentation: admit when you do not know something, and have a style guide for your infrastructure code.
- Everything you need to know about Kubernetes in 5 mins by JJ Asghar
This is covered in much more detail in the training session above, so I will just reiterate the things JJ strongly emphasized in both formats. Kubernetes is an API to share compute resources. Do not use NodePorts. Do not run your database in Kubernetes.
- Importance of Feedback by Chad Brown
Feedback is critical for us to improve things we are not aware we need to improve. In order to get good feedback, people need to feel safe to give it. Providing feedback is difficult: you should be thoughtful of what you say, how you say it, and when you say it. Be very specific and concise. Do not judge. Aim to provide information that can be used to improve.
The correct response when receiving feedback is ‘Thank You.’ When you get feedback it is important to ask questions if there is anything you do not understand. Talk to your peers and coworkers about the feedback that you get. This lets them know that their feedback is important to you and that you are prepared to take action on it.
This was the best of the Ignite sessions across both days. Chad stayed on point, everything he said built towards the topic, and the content was relevant to me.
Open Space #1: Use Cases for Serverless
I chose to attend this session because I have not yet had the opportunity to work on a serverless project. I'm also interested in knowing when I should and should not recommend it to clients. I learned that serverless architectures are more difficult to engineer, so a decision to go serverless needs significant buy-in from the business. In addition, I picked up a few rules of thumb for when to go serverless:
- Consider serverless when you do not know whether your feature will be used 5 times per day or 5 million times per second.
- Consider serverless when the majority of your application needs can be covered by AWS API Gateway. Or, in other words, if your server is mostly CRUD operations, consider using API Gateway and serverless for the other bits.
- Avoid serverless when your service needs to interact with a relational database.
- Avoid serverless when you are sensitive to single call latency. If it is not acceptable for any call to wait for the cold start, serverless is not the right solution.
Open Space #2: Infrastructure-as-Code in practice (CI/CD for infra)
None of the topics for this time slot spoke to me, so I chose to attend this one by process of elimination. We opened discussing how infrastructure as code is not yet feasible for routers or switches, but it is looking like it will be in a few years. We also discussed unit testing. It is not typically valuable for infrastructure code. The best value appears to be happy path integration tests. Even something as simple as load up the application and click one button.
Double, Double, Toil and Trouble! Top SRE Practices for DevOps by John Esser
The concept of Site Reliability Engineering (SRE) began at Google. It started by constraining staffing to drive scalable architectures: if you cannot add staff linearly with the scale of the system, then the staff needs to become more efficient to scale up. High quality is essential for non-linear scaling. Software engineers are hired to become SREs to ensure SREs have a software engineering mindset. This is important, as the goal of an SRE is to eliminate toil through automation. The ideal operational load for an SRE is 50%; the other 50% of their time should be spent automating their workload. To get these high degrees of automation, software development experience is critical.
By ensuring high degrees of automation, the SRE team can handle larger systems without scaling their staff linearly. In larger systems, it is good practice to have a core SRE team that maintains standards and drives continuous improvement. Other SREs should embed themselves into agile development groups. This is important to build a dialog between development and operations. The software should not only be developed to a high standard, but maintained to a high standard.
The quality of the software is very important when trying to maintain it in a cost effective manner. To that end, the SRE team must have the ability to hand the application back to the dev team if the quality is lacking. However, it is important to perform blameless post-incident reviews. Once people are comfortable with no blame being assigned, they will take personal responsibility for errors. It is a good practice to have an error budget in your SLA. This gives you the flexibility to temporarily lower quality or reliability to accomplish business objectives.
I believe automation is a necessary component of continuous improvement. Sometimes I like to look at it as leveraging laziness to productive ends. You don’t have to be an SRE to get value out of automation. Anytime you have a tedious, boring, or onerous task you should consider automating it. Sometimes it won’t make sense, but getting into the habit of deliberately considering it every time will result in more automation and less tedium.
3 Levels of Improving, Continuously by Louda Peña
Always be in a state of evaluation. Your users, process, and culture combine to produce a product. But, remember that you are not building a product for everyone. Use your personas to drive features. Prototype as rapidly as possible using wireframes, paper prototypes, or whatever gets the job done. It is critical to empathize with the users, otherwise you will build the wrong product. When doing user testing, measure everything: reactions, time to perform tasks, emotional state, etc… On that note, if you are making a product for developers, getting no emotional reaction is typically a good thing.
When performing the above measurements and evaluations, avoid making your measurements a target. They will be gamed. If you have a user problem, then you have a problem. Find the root cause of the problem and fix it rather than addressing the most obvious symptom. If the solution involves an organizational change, remember that you need to convince the people in the organization to make that change. Organizations cannot be changed without going through the leaders in that organization.
The first thing I thought as I was reviewing my notes is that I have no idea what the 3 levels are. So I kind of see this presentation as a grab bag of ideas. I am a big fan of paper prototyping and direct involvement with the people who are using your product. I have seen the wrong thing get built many times as a result of ‘insulating’ the customer from the developers or vice versa.
Production as an Experiment Lab by Ramin Keene
Testing in production used to be a faux pas, but it is becoming mainstream. However, this does not mean release without testing. At a certain level of scale, it is not feasible to completely clone production to run as a test environment. A better way to look at testing in production is extending the software development life cycle beyond release.
As more companies move to microservices and serverless, continuous delivery becomes a lot less scary. It can be rebranded as progressive delivery, as you know exactly which portion of the software is delivered. These continuous releases can be used to test potential improvements to the software.
You can do an A/B test in production, where you deliver different versions of the software to different users. Then analytics can show which version of the software results in superior outcomes. A prime time to run these types of experiments is when you are over-delivering on your SLA; this means you have an error budget to run the experiments. Quality is not something that makes the business money; it is a dial that can be turned up and down in service of creating a useful product.
Experiments are a bit different than tests. Tests confirm something you already know or suspect, but you run an experiment to learn something new. When running a production experiment, do not pull a statistician in afterwards to analyze the results. It is better to bring them in early, so that they tell you what to do right rather than what you did wrong. Be very careful about running multiple experiments at the same time. The smaller and more specific the experiment, the better.
I can definitely see the value in doing these types of experiments to drive incremental improvements. However, I think this should be a small part of your improvement strategy. Direct customer involvement should be where you spend the majority of your attention. Competitor analysis is another great way to drive improvements. I’d also put code refactoring, additional automated tests, and documentation above production experiments.
Serverless DevOps: What do we do when the server goes away? by Tom McLaughlin
The core problem Tom has identified here is that in a serverless environment, the development team can bypass the ops team. More than that, when you go serverless, your goal should be to break any reliance on the ops team. The problem started when people began making the jump to delivery teams. These teams cut across departments with everyone focused on a single goal. Well, everyone except ops; the products were still handed over the wall to ops.
In companies that are going serverless, the ops team needs to disband and ops people need to join delivery teams. Operations folks can be involved in the product lifecycle at every step. In the design phase, ops can present off-the-shelf solutions to business objectives and explain the pros and cons of those solutions. Ops also has an active role in the development process, particularly with infrastructure as code, but they can also manage non-unique aspects of the product. Consider your coworkers as if they were users. Once the product is deployed, a key ops responsibility is to determine which service is at fault when there is a customer complaint.
That said, the ops mindset is going to need to change. The ability to code is becoming a job requirement. Unlike previous technology changes, serverless is more likely to be adopted by large corporations than small startups.
One of the things that surprised me about Tom’s presentation is when he stated that working in Sprints is a completely alien concept to people outside of software development. There is kind of an interesting relationship between this presentation and John Esser’s SRE presentation. From one side you have a pull for software developers to move more into an operations role and from the other you have operations people being pulled into development. It does seem like the operations role in supporting a product is becoming closer to a specialized software development role following the SRE model.
Why Are You Using That Hammer? by Ian Alford
Ian's opening statement was that not having any problems is not a realistic expectation. Scope large problems appropriately to avoid analysis paralysis. You need to continuously find the minimum problem that can be solved right now to keep moving towards the goal. You will need to make decisions to handle most problems you encounter. Those decisions and the reasons behind them need to be documented; put that documentation in GitHub and version it.
Bring ops and development together to decide which tools will be used and the deployment strategy for new tools. Decisions need to be enforced. Look for gaps in your current solution and document them along with the workarounds. Once you know where your gaps are, you can both navigate around them and iterate on your solution to remove them.
Make sure you track your decision to ensure it remains relevant. When there is a change in the environment, it is doubly important to re-evaluate your decisions. There are times when it is expedient to circumvent a decision in order to get something done. Evaluate what kind of message that sends, and consider whether it is worth it.
Plan to re-evaluate your decisions periodically. This should involve listening to a representative sample of all people affected by it. Once you have that feedback, re-evaluate and iterate on the decision. Publicly admit your own mistakes. This will let others know that it is safe for them to do so as well.
I see how it is often better to pick a direction and go rather than get stuck trying to find the correct path. To continue with that analogy, objects in motion tend to stay in motion, so you do need to be willing to re-evaluate and make changes. Otherwise, you could very easily choose the wrong direction and drive your company off a cliff.
Ignites
- How to Succeed in Ops Without Really Trying by Allie Richards
This presentation was all about poking fun at DevOps personality tropes. I found it amusing and, looking around, most people enjoyed it.
- Using Terraform Safely by Martez Reed
Terraform is infrastructure as code. It is written in Go and describes resources in a declarative fashion. Terraform modules are high-level abstractions of common resource configurations. When selecting a common Terraform module, there is a tradeoff between speed and security. To get things right from a security perspective, it is important to understand both the technology and the problem space. Finally, do not put sensitive data in Terraform.
This presentation hit some of the same beats as Paul Meharg's security presentation on Day 2. I was not familiar with Terraform, so I did learn some basic information about it in this presentation.
- Yea, But.. by Adam Shake
People resist change because keeping the status quo is less scary. The common excuses are that the change is too big, too small, there is no time, or it is too expensive. If you want to break through this barrier, you need to build relationships with the people who will be affected by the change. Find small wins and low-hanging fruit. Trumpet those small wins in future conversations to promote larger changes.
This sort of advice has some parallels in the sports, politics, and military fields. In any sort of leadership endeavor, a series of small, but not insignificant, successes will build morale.
- Imposter Syndrome Ain’t Just a River in Egypt by Jesse Butler
Imposter syndrome often results in working very hard and never feeling like you are succeeding. It can take a long time to recognize success as success instead of staving off failure. It is important to accept that if you have a job, you deserve that job. Just do it. Strive for positivity and ask questions when you do not understand something. Help out others with impostor syndrome by sharing your story.
I think getting to know your coworkers outside of work is another way to help curb impostor syndrome. This helps you see your coworkers more as people instead of whatever it says on their business cards.
- Slow is smooth, smooth is fast by Ryan Feiock
Work slowly and thoughtfully to avoid the mistakes you make when trying to get things done quickly. Before making any technical commitment, the technical people need to have a chance to weigh in on it. Part of that means providing some time for research. Make sure you understand the technology before starting product development, and bake the best practices for that technology into your process from the start. If you don't take the time to do it right at the beginning, you will rarely get the chance to go back and change it later.
When building out a large infrastructure project, focus on the smallest piece. Make sure you have done everything correctly and document all the steps you take to ensure that. Take extra time to go over it and make sure you are not missing anything. Repeat this for the next smallest thing, making sure you have checked all of the boxes each time. Do not move on from one piece until you are completely finished with the current one.
In theory, this is great. In practice, what I have typically seen is that you seldom have any idea what the finished product is going to be at the beginning of a project. I have found it much more effective to get usable software into your client's hands as fast as possible. This is the quickest way to find out that what they asked for has only a mild association with what they need. As you perform these iterations, core components will become visible. These are where you should spend additional time hardening, testing, and refactoring.
- DevOps preppers: What the zombie apocalypse can teach us about incident management by Tiffany Longworth
When preparing for global disasters, you will find that your preparations for local and personal disasters are also covered. It is important to practice your disaster recovery processes. You need to build faith beforehand that these processes work. If restoring from backups is part of your disaster recovery plan, practice restoring from backups regularly. When the disaster happens, you want to be doing something you know will work. The most important element in disaster management is to have a single person in charge of managing it. The lines of communication must also be clear.
Two parts of this presentation resonated with me: establishing the person in charge of handling a disaster and practicing disaster recovery efforts. I believe there is a cost-benefit calculation involved in deciding what level of disaster you should prepare for. For some projects, it will make financial sense to have a fallback system in case S3 goes down. For other projects, it doesn't make sense to make any disaster preparations.
Open Space #1: Working Remotely
It is important to maintain the normal work routine, especially if you have a family. Your family members need to know that you are at work and that they should only interrupt you if absolutely necessary. To that end, keep a separate work space in your home. Working from home is not for everyone; try it a few days a week before committing to it full time.
When one person on a team is working remotely, that team is now a remote team. Daily standup needs to use video conferencing. Decisions need to be made online so that the remote workers are included. All decisions need to be documented, with team notifications of changes. The remote workers need to make an effort to continually communicate status with the rest of the team. It is also critical to have in-person contact with your peers; if you can't physically interact with your teammates, attend meetups with others in your field.
I chose to attend this open space because I have worked in remote teams before, but I don't typically work from home. Going forward, I need to make a deliberate effort to make sure any relevant conversations at the office make it to everyone on the team.
Open Space #2: Knowledge Dissemination
I found this session abstractly interesting. I had expected to be talking about how to transfer detailed knowledge of a system either for onboarding or handing off. But it ended up being all about taking some minor bit of general knowledge and spreading it through the entire organization. The example we used most often was a change to the dress code. We decided that multiple communication mediums are required to reach the most people. However, these communications should be spread out over time or the message risks being ignored. Email alone is definitely not sufficient.
Conclusion
DevOpsDays Indianapolis was well worth the price of admission. Obviously, you will get more out of it if you are working in a company that actually has a separate dev and ops team. However even if you just occasionally do projects for companies of this size, it is still worth attending to gain a greater understanding of that culture. The Kubernetes training was my favorite part of the conference. The Ignites were my least favorite part of the conference. I wish I had been able to attend all of the Open Space sessions. Though I also would have preferred the Open Space sessions to be smaller, 3-4 people per topic. I know that isn’t very practical, but a 30 minute ‘discussion’ with 20 people mostly resulted in 3-5 people doing the talking.