A day in the life of a Technical Fellow

In my two most recent blog posts, I talked about how to write a Long-Term Technical Vision and a Golden Path. These are future-looking, high-level artifacts, so the question I keep hearing is: do I need to give up coding to grow in my career and become a Technical Fellow? In this post I will explain what it’s like to be a Technical Fellow and how to strike a good balance between breadth and depth. Let’s set the specific title aside for a moment, since different companies use other names, such as Distinguished Engineer or Senior Principal Engineer. What really matters is the scope of the role, and how to cope with that scope without becoming someone who is too detached from the details and can only provide generic feedback and guidance.

Eventbrite has roughly 40 engineering teams, and in theory my scope covers all of them. In practice, though, it’s unrealistic to be involved with that many teams and still have enough context to make meaningful contributions to each one. Two things are critical to making this work: knowing how to prioritize my time, and being able to delegate. But how did I learn this?

Earlier in my career, I was the tech lead for a small team with two other engineers. Over time the product we had built became successful, and we grew into three feature teams, with me as the uber tech lead across them. At first I tried to stay as embedded in each team as I had been when I belonged to just one: attending their standups, taking part in technical design reviews, coding, etc. Soon enough I realized that this approach would not scale, and I sought feedback on how to manage the situation. One piece of advice proved critical in my career: “in order to grow, you need to find or grow other people to do what you’re doing now, so you can then become dispensable and start focusing on something else”. That “something else” could be a larger scope or simply another area to work on; the key is that we should aspire to grow others until they can do a job similar to the one we’re doing today, making ourselves dispensable in our current role. It’s counterintuitive that our goal should be to reach a point where we’re almost irrelevant, and it took me time to properly understand, but it’s really key for career growth.

After growing tech leads in the three teams I was overseeing, I could start focusing on the larger picture. However, I didn’t want to become too detached from the lower-level details, so I opted for a rotation: each quarter I would join one of the three teams as a part-time IC, taking on coding tasks, designs, code reviews, and on-call shifts. I say part-time because I still had to invest time in breadth activities and long-term thinking. I structured my schedule so that my mornings were mostly IC work and my afternoons were filled with leading the overall organization and being a force multiplier. This dual approach allowed me to focus on the bigger picture while staying attached to the actual problems teams were facing, with enough context to be useful when providing feedback and guidance.

Time has passed, and at Eventbrite I now follow a similar model with a much larger set of teams. Since the rotational approach won’t work at this scale (at one team per quarter, a full rotation would take me 10+ years), we implemented a model where Principal Engineers and above (including Technical Fellows) have different engagement levels with each team, divided into the three categories listed below:

  • Sponsors are part of a team and spend ~2 days/week working with it, which includes attending the standup, participating in system designs, coding, and being on call. We expect Principal+ engineers to sponsor at most two areas at any given point in time.
  • Guides spend ~2 hours/week on a given project. They are aware of the team’s mission and roadmap, protect the long-term architecture, provide the long-term direction of the product, and may be active in the code base.
  • Participants are available to a team for any questions it has or to help disambiguate areas of concern; they are active in meetings but may not be deep in the code base. Participants spend a few hours a month on the project/team.

With the above in mind, I am sponsoring two teams right now, and that is expected to rotate based on which teams need my involvement the most. As of today this means I’m most involved with the Ordering and Event Infrastructure teams: coding, working on technical designs, mentoring others on the teams, and so on.

So what does a day in my life look like?

As I mentioned before, I structure my day so that mornings are for IC work and afternoons are for breadth work. Right now my main area of focus as an IC is getting our new Ordering Pipeline implemented, and that’s where I spend most of my coding cycles. This is a brand-new service written in Kotlin, exposed over gRPC, and built on AWS technologies such as DynamoDB and Lambda. It’s particularly critical not only because Ordering is at the core of Eventbrite, but also because it’s paving the way for the new generation of services we’re starting to build: it is the first one to use the technologies and processes outlined in the 3-Year Technical Vision and the Golden Path. As such, many of the services that follow will use what Ordering is building today as their reference architecture, and we’re also finding a few unanticipated gaps that we have to close before other teams hit them. I was also on call for this team a couple of weeks ago.
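To make that stack concrete, here is a minimal sketch of what a Kotlin service exposed over gRPC with DynamoDB persistence can look like. To be clear, this is not the actual Ordering Pipeline code: the service name, proto, and table are hypothetical, invented purely for illustration.

```kotlin
// Minimal sketch of a Kotlin gRPC service backed by DynamoDB. All names here
// (OrderingPipeline, CreateOrder, the "orders" table) are hypothetical; the
// real Ordering Pipeline is not public. Assumes request/response classes and
// the coroutine base class were generated by grpc-kotlin from a proto like:
//
//   service OrderingPipeline {
//     rpc CreateOrder(CreateOrderRequest) returns (CreateOrderResponse);
//   }

import io.grpc.ServerBuilder
import software.amazon.awssdk.services.dynamodb.DynamoDbClient
import software.amazon.awssdk.services.dynamodb.model.AttributeValue
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest
import java.util.UUID

class OrderingPipelineService(
    private val dynamo: DynamoDbClient,
) : OrderingPipelineGrpcKt.OrderingPipelineCoroutineImplBase() {

    override suspend fun createOrder(request: CreateOrderRequest): CreateOrderResponse {
        val orderId = UUID.randomUUID().toString()
        // Persist the new order as a single DynamoDB item.
        dynamo.putItem(
            PutItemRequest.builder()
                .tableName("orders") // hypothetical table
                .item(
                    mapOf(
                        "order_id" to AttributeValue.fromS(orderId),
                        "event_id" to AttributeValue.fromS(request.eventId),
                    )
                )
                .build()
        )
        return CreateOrderResponse.newBuilder().setOrderId(orderId).build()
    }
}

fun main() {
    // Region and credentials are resolved from the environment.
    val server = ServerBuilder.forPort(50051)
        .addService(OrderingPipelineService(DynamoDbClient.create()))
        .build()
        .start()
    server.awaitTermination()
}
```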

In contrast, my afternoons are typically filled with breadth work: 1:1s, syncs with other people in Argentina or the US, company tech talks, design reviews, and more. For example, I was recently heavily involved in coming up with a new engineering career guide for the company (which we’ll blog about at some point), and I attend leadership syncs with our CTO and CPO about the current state of our Foundations and the challenges ahead of us.

As time passes, my focus will move away from Ordering to other areas where I can contribute in depth, and by then I expect to have grown the team to a point where they don’t miss me and can keep moving forward without my help. Breadth work is here to stay and can look very different each week, depending on what the company needs the most at that particular moment.

Writing our Golden Path

In my last blog post I explained how we defined our 3-year technical vision for the company. One of the key pillars of this vision is shifting from a model where we used the same tool for every job (mostly a combination of Python + Django + MySQL) to using the right tool(s) for each job. Given that this would be a new way of working for our organization, we wanted guidelines for teams to follow, so that our services and applications wouldn’t end up with completely different tech stacks depending on the team developing them, which would harm the maintainability of our overall architecture. This is why we decided to write a Golden Path document that guides teams toward the best set of technologies for each potential scenario and recommends tools for common, repeatable use cases like logging, security, etc.

The Golden Path is a document that lays out the technologies approved for use at Eventbrite when building software. It has been built collaboratively by the entire development organization and is in continuous evolution as teams find better solutions to the problems in front of them. Any technology choice that is not on this list requires explicit approval from the Architecture Review Committee (ARC), our engineering governance body, before it is implemented.

One principle behind our Golden Path is recommending the “right tool for the job,” which most often means opting for industry-standard technologies (enabling us to focus our limited innovation tokens on technological advancements unique to live experiences). Teams are encouraged to evaluate alternatives that are not in this document when working on their system designs, or to challenge currently deprecated ones, and to propose these edits to ARC if they find them superior or better suited to their use case than the currently approved options. This is how we keep the Golden Path a living document that improves over time and adapts to new industry trends.

We divide technologies into the following life cycle phases:

  • Emerging. New technologies that are very likely to become recommended but are not production-ready yet.
  • Recommended. The default choice as of today.
  • Allowed. Technologies we permit, although the recommended option should be used whenever possible.
  • Deprecated. Discouraged for new development, but may be maintained for currently-existing systems.
  • Rejected. Technologies that have been rejected in previous evaluations, whether or not we’ve used them in the past.
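The Golden Path is a document rather than code, but to make the taxonomy concrete, here is a hypothetical sketch of how an entry and its life cycle phase could be modeled, together with the ARC approval rule described above; none of these types exist at Eventbrite.

```kotlin
// Hypothetical model of Golden Path entries and their life cycle phases.
// Purely illustrative; the real Golden Path is a document, not code.

enum class LifeCyclePhase { EMERGING, RECOMMENDED, ALLOWED, DEPRECATED, REJECTED }

data class GoldenPathEntry(
    val technology: String,
    val section: String, // e.g. "Relational Databases"
    val phase: LifeCyclePhase,
)

// For new development, only RECOMMENDED and ALLOWED technologies can be
// adopted without going through ARC; anything else (or anything missing
// from the list entirely) needs explicit ARC approval first.
fun needsArcApproval(entry: GoldenPathEntry?): Boolean =
    entry == null ||
        (entry.phase != LifeCyclePhase.RECOMMENDED &&
         entry.phase != LifeCyclePhase.ALLOWED)

fun main() {
    val aurora = GoldenPathEntry("AWS Aurora (MySQL)", "Relational Databases", LifeCyclePhase.RECOMMENDED)
    val unmanagedMySql = GoldenPathEntry("Unmanaged MySQL", "Relational Databases", LifeCyclePhase.DEPRECATED)

    println(needsArcApproval(aurora))         // false
    println(needsArcApproval(unmanagedMySql)) // true
    println(needsArcApproval(null))           // true: not on the list at all
}
```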

Our Golden Path contains several sections such as programming languages (for microservices, data science, frontend), source package managers, web frameworks, databases and caching, among others. The guidance for how to apply the Golden Path when working on a technical design is as follows:

  • Every section in the document should have a matrix that outlines the best path forward for the use cases we’ve faced in the past, or a description that clearly specifies this. If our use case is on that list, we should choose the technology the matrix outlines as the best fit.
  • Even if we choose a technology that has already been evaluated in the past, we still need to gather data for our specific scenario along key dimensions such as cost, latency, etc., to ensure that it will work in our case.
  • If a section doesn’t have a matrix yet, or our use case is not included, we will conduct a technology evaluation and contribute the results back to the matrix. The guidelines for this are:
    • We should consider at least two options and do a full bake-off before picking a winner, choosing based on the dimensions that matter for our scenario (features, use-case fit, ease of use, cost, latency, consistency, etc.); a sketch of what such a scoring matrix might look like follows this list.

    • We are not limited to AWS technologies. For each decision we make, we should evaluate both the AWS offering and any leading non-AWS contender (e.g. DynamoDB and Cassandra), including compatibility and integration with the other tools in the stack. We will not favor AWS by default and will only use it as a tie-breaker when both offerings are equivalent.


    • Technologies that are deprecated shouldn’t be re-evaluated unless there’s a strong belief that the scenario being designed differs from the reasons why that technology was deprecated (e.g. we shouldn’t be looking into unmanaged solutions, since those are deprecated). These exceptions need to be approved by ARC.
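Here is the bake-off sketch referenced above: a hypothetical weighted scoring matrix. The dimensions, weights, candidates, and scores are all invented; a real evaluation would plug in data measured for the scenario at hand.

```kotlin
// Hypothetical bake-off scoring matrix. Dimensions, weights, and scores are
// invented for illustration, not real evaluation data.

data class Candidate(val name: String, val scores: Map<String, Int>) // 1..5 per dimension

fun weightedScore(candidate: Candidate, weights: Map<String, Double>): Double =
    weights.entries.sumOf { (dimension, weight) ->
        weight * (candidate.scores[dimension] ?: 0)
    }

fun main() {
    // Weight the dimensions that matter for *this* scenario.
    val weights = mapOf("useCaseFit" to 0.4, "cost" to 0.3, "latency" to 0.3)

    val candidates = listOf(
        Candidate("DynamoDB", mapOf("useCaseFit" to 4, "cost" to 3, "latency" to 5)),
        Candidate("Cassandra", mapOf("useCaseFit" to 4, "cost" to 4, "latency" to 4)),
    )

    // Highest weighted score first; the results feed the section's matrix.
    candidates
        .sortedByDescending { weightedScore(it, weights) }
        .forEach { println("${it.name}: ${weightedScore(it, weights)}") }
}
```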


Our Golden Path was published in early 2021, a few weeks after we finalized our 3-year tech vision, and every technical design or proposal that has emerged since then follows this new standard. We envision that a few years from now we will be able to remove these guardrails, since teams will have enough internal examples to pick the best tool for the job without the risk of significantly diverging choices for similar use cases.

Here are a few examples of sections extracted from the Golden Path document:

Native Libraries and Wrappers

  • Native Libraries (recommended). We should favor using the native libraries of the tools that we use (e.g. AWS SDKs, feature flags, metrics, etc). Each team consuming those SDKs is responsible for upgrading to newer versions when needed.
  • Wrappers (deprecated). We do not want to use wrappers unless they provide a clear additional benefit over native libraries (such as extended capabilities or simpler use). We also don’t believe the argument that using native libraries locks us in to a specific technology, as the downside of building and consuming our own wrappers is a bigger problem: wrappers tie us to specific underlying library versions, require migration effort as new native library versions are released, and always expose only a subset of the functionality those libraries provide.
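As a small illustration of this guidance, the sketch below emits a metric by calling the AWS SDK for Java v2 directly from Kotlin, with no in-house wrapper in between; the namespace and metric name are made up.

```kotlin
// Using a native library (the AWS SDK for Java v2) directly from Kotlin.
// The namespace and metric name below are hypothetical.

import software.amazon.awssdk.services.cloudwatch.CloudWatchClient
import software.amazon.awssdk.services.cloudwatch.model.MetricDatum
import software.amazon.awssdk.services.cloudwatch.model.PutMetricDataRequest

fun main() {
    // Region and credentials are resolved from the environment.
    val cloudWatch = CloudWatchClient.create()

    cloudWatch.putMetricData(
        PutMetricDataRequest.builder()
            .namespace("Hypothetical/Ordering")
            .metricData(
                MetricDatum.builder()
                    .metricName("OrdersCreated")
                    .value(1.0)
                    .build()
            )
            .build()
    )
    // The SDK's full surface (dimensions, pagination, etc.) stays available
    // to every consumer; a wrapper would expose only a subset of it and tie
    // us to the wrapper's own release cadence.
}
```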

Microservice Programming Languages

  • Kotlin (recommended). This is the recommended language on the JVM. It has several benefits over Python, such as true multi-threading (illustrated in the sketch after this list), better performance, and static typing, among others. We should use this language whenever we need to build scalable or performance-sensitive services.
  • Python (recommended). We support it given our extensive in-house knowledge and current stack. We should be careful when using it for services expected to handle significant load, since its runtime is effectively single-threaded and interpreted languages are typically slower than compiled ones.
  • Node.js (emerging). We have experience with Node.js for frontend development but not microservices, although we’re evaluating it.
  • Go (emerging). We built the integration service in this language. We believe that Go has potential and we should do a feature evaluation at some point.
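To illustrate the concurrency difference called out for Kotlin above, here is a minimal sketch that overlaps two simulated I/O calls using coroutines; the fetch functions are placeholders rather than real service calls.

```kotlin
// Two (simulated) I/O calls running concurrently with Kotlin coroutines.
// fetchOrder/fetchEvent are placeholders standing in for real service calls.

import kotlinx.coroutines.async
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking
import kotlin.system.measureTimeMillis

suspend fun fetchOrder(id: String): String { delay(100); return "order-$id" }
suspend fun fetchEvent(id: String): String { delay(100); return "event-$id" }

fun main() = runBlocking {
    val elapsed = measureTimeMillis {
        coroutineScope {
            val order = async { fetchOrder("42") } // starts immediately...
            val event = async { fetchEvent("7") }  // ...and so does this one
            println("${order.await()} / ${event.await()}")
        }
    }
    // Both 100ms delays overlap, so this prints roughly 100, not 200.
    println("took ~${elapsed}ms")
}
```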

Service-to-service Communication

This is the communication that happens when one service calls another directly; it can be either synchronous or asynchronous (both styles are sketched below).

  • gRPC (recommended). This is the only recommended RPC protocol.
  • PySOA / Legacy SOA (deprecated). We support the services built on these protocols that are currently in production, but we don’t allow any new services to use them.
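As a sketch of what a service-to-service gRPC call can look like from Kotlin, both synchronous and asynchronous, consider the snippet below. It assumes stubs generated from the same hypothetical OrderingPipeline proto sketched earlier; the address is a placeholder.

```kotlin
// Calling another service over gRPC, synchronously and asynchronously.
// Stubs are assumed to be generated from the hypothetical OrderingPipeline
// proto sketched earlier; host and port are placeholders.

import io.grpc.ManagedChannelBuilder
import kotlinx.coroutines.runBlocking

fun main() {
    val channel = ManagedChannelBuilder
        .forAddress("ordering.internal", 50051) // placeholder address
        .usePlaintext() // a real deployment would configure TLS here
        .build()

    val request = CreateOrderRequest.newBuilder().setEventId("e-9").build()

    // Synchronous: grpc-java's blocking stub.
    val blockingStub = OrderingPipelineGrpc.newBlockingStub(channel)
    println(blockingStub.createOrder(request).orderId)

    // Asynchronous: grpc-kotlin's coroutine stub (suspends instead of blocking).
    val coroutineStub = OrderingPipelineGrpcKt.OrderingPipelineCoroutineStub(channel)
    runBlocking { println(coroutineStub.createOrder(request).orderId) }

    channel.shutdown()
}
```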

Relational Databases

Useful when there are multiple entities in the data model that are strongly related.

  • AWS Aurora (recommended). We recommend AWS Aurora, a managed database compatible with MySQL and PostgreSQL; however, we support only the MySQL flavor (a connection sketch follows this list).
  • AWS RDS (rejected). We don’t allow RDS: although it offers very similar functionality, it is less scalable than Aurora.
  • MySQL (deprecated). We maintain the unmanaged MySQL databases that we currently have, but don’t allow any new functionality to be implemented on them.
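Because Aurora is wire-compatible with MySQL, existing drivers and tooling carry over unchanged. The sketch below connects from Kotlin with the stock MySQL JDBC driver; the endpoint, schema, and credentials are placeholders.

```kotlin
// Connecting to Aurora MySQL with the standard MySQL JDBC driver (requires
// mysql-connector-j on the classpath). Because Aurora is wire-compatible with
// MySQL, no Aurora-specific driver is needed. Endpoint, schema, and
// credentials below are placeholders.

import java.sql.DriverManager

fun main() {
    val url = "jdbc:mysql://my-cluster.cluster-abc123.us-east-1.rds.amazonaws.com:3306/orders"
    DriverManager.getConnection(url, "app_user", System.getenv("DB_PASSWORD")).use { conn ->
        conn.prepareStatement("SELECT 1").use { stmt ->
            stmt.executeQuery().use { rs ->
                rs.next()
                println("connected: ${rs.getInt(1)}") // prints "connected: 1"
            }
        }
    }
}
```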

Writing our 3-year technical vision

I joined Eventbrite as their first Technical Fellow, the most senior engineering individual-contributor role in the company. One of my initial goals was to come up with an overarching technical vision for the whole company, aligned with our 3-year business strategy, that would move us away from a monolithic architecture and a central SRE team toward a distributed system where ownership shifts to each team. In our most recent post, Vivek Sagi described the list of problems that we identified and our future-looking goals, which to recap are:

  • Deliver reliable, high-quality, cost-effective software solutions to our creators and consumers that allow the business to grow revenue 5x by 2023.
  • Enable autonomous dev teams that own their code and architecture. Provide these teams the platform, tooling, and access required to own end-to-end production support for their services.
  • Improve dev team accountability to deliver against high-level OKRs while giving them autonomy to decide on the path to get there.
  • Drive automation and reduce toil. All feature dev teams should be able to apply 60% of their capacity to delivering new business value by 2023. This balance is an estimate based on the best-performing mature product teams we have seen in our past experience.
  • Establish an operational excellence bar. Deliver 99.99% uptime across all customer facing services.

To accomplish these goals, I started working with other engineers and product leaders to understand the history of our technical architecture and the challenges we were facing, including developer productivity issues, site reliability problems, and scalability limitations. From these goals, we derived a set of requirements for our 3-year technical vision:

  1. Features. As our product offering evolves to deliver high-quality self-service experiences for Super Creators and Consumers, we must ensure that our technology stack enables teams to efficiently create, optimize, and maintain the net-new functionality we will need to provide. For example, Super Creators require multi-event creation/editing, organization-level reporting, and multi-event cart support, all of which will require significant architectural changes relative to our current offering. In addition, a new bundle of marketing tools will enhance creators’ ability to acquire new audiences and grow existing ones, especially by leveraging automation and machine learning to simplify the experience while increasing the impact. We also seek to improve consumers’ ability to discover and attend events, and to maintain trust in our platform.
  2. Leveraging Data. We have the opportunity to power new data differentiated products based on data from over a decade of past events and round out our focused product offering with key 3rd party integrations (e.g. Mailchimp, Zoom).
  3. Performance. User perception of our product’s performance is paramount: a slow product is a poor product. In addition, better page performance leads to better SEO rankings. We decided to leverage Lighthouse’s performance score, an industry-standard web performance metric, and we endeavor to achieve a green score (90 to 100) across our customer-facing features. We must also enforce low latency in our internal infrastructure and API response times, and set reduction goals year over year.
  4. Scale. We will support two types of scaling improvements. The first is growth: as the business grows to 5x the current load, our systems need to support that load. The second is spikiness in our traffic due to large event sales, where today we use a Waiting Room to throttle calls to our services and DB. We will design systems that can autoscale up and down around such events, avoiding manual overprovisioning of our infrastructure.
  5. Quality. The defect rate of our product offering can make or break the experience for our users. In the past year, we reduced the number of critical open bugs from 311 to 175 and the number of bugs that missed our fix SLA from 200 to 110. We should aggressively lean into this trend and continue to reduce both by 50% YoY. We will improve our ability to deliver along that trend by increasing our test coverage, reducing our code complexity, building better tooling, and increasing our level of automation.
  6. Self-Service. We will improve self-service both externally and internally. Externally, we will aim for a 50% YoY reduction in customer support contact rate relative to total ticket sales, while ensuring that help center page views don’t disproportionately grow; the point is to deliver product experiences with sufficient in-line guidance to result in successful outcomes. Internally, we will ensure that data is accessible to teams and that each data source and service has clear documentation and runbooks, as well as contracts and use cases. We will define these in “How We Work” guidelines that every team will follow.
  7. Development Process. Finally, we must streamline our internal development processes and progress along the DevOps “Big 4” metrics to these levels: deployment frequency: Elite (daily for web and backend services, and up to weekly for native apps); lead time for changes: Elite (less than one hour); mean time to restore service: Elite (less than one hour); and change failure rate: Elite (0-15%).

Applying these principles to the problems outlined in the previous post, we arrived at the following solutions:

  1. Our monolith became a bottleneck to our developer velocity and overall site reliability and scalability. We need to decouple our monolith into smaller microservices that can evolve and scale independently. This is a trend that many other companies have followed as they grew, and based on our professional experience prior to Eventbrite, we know it works.
  2. Our initial, partial attempt to move to a Service-Oriented Architecture (SOA) compounded the problem. In that attempt, we lacked a clear vision of what moving to SOA meant and how to accomplish it, and we moved business logic out but not the data. This time around, we’ve prioritized this architecture transition at a company level, focusing first on the core business logic and segregating and migrating the underlying data with every service.
  3. Our performance became suboptimal, leading to poor utilization of our hardware resources. We planned to fix this in two ways: by moving to managed services, letting cloud providers take on this responsibility, and by choosing technologies that autoscale properly with our traffic patterns, which are spiky by nature due to large onsale events.
  4. Our SDLC process was ad hoc and lacked sufficient controls in a few places. We’ve defined and set ownership boundaries between services and logical components. We’ve also enacted an Architecture Review Committee to review designs and ensure we are building extensible services that don’t become monolithic themselves.
  5. Given all the intricate moving parts involved in releasing the monolith, we trusted our Site Reliability Engineers (SREs) to be the only ones who could coordinate all that infrastructure. We are transitioning to DevOps, where each team owns the end-to-end lifecycle of its services. As with an earlier point, we’ve implemented this successfully at other companies in the past and we know it works.
  6. We lacked automation in how we test, deploy, monitor, and roll back our code. Our vision document has sections specifically addressing deployments, testing, and operations, indicating that we should aspire to full automation and minimize (and, if possible, remove) any manual intervention.
  7. Our core “eb” database is not only monolithic but also mutable, which has made capturing historical changes challenging. We see this as an architectural issue: our data boundaries were never established, and many different services read from and wrote to the same tables. We also used the same database technology for all of our use cases, which has proven inefficient.
  8. We also built homegrown tools, such as our own RPC protocol, PySOA. We are no longer investing our time in areas that are not business critical and where we can’t build competitive differentiation. For everything else, where we need a commoditized solution, we evaluate buying instead of building whenever possible. This allows us to focus on providing customer value.

As we can see, we’re moving ownership away from a centralized SRE team and a monolithic architecture, empowering teams to build and own their systems. But moving from a situation where the technology set for building features is very limited to one that is much more open has its risks as well, and we didn’t want to end up with a technology spectrum so wide that it would be difficult to maintain. This is why we wrote our Golden Path, a living document that details the technologies teams are allowed to use in production for their services, covering areas such as RPC protocols, storage layers, and programming languages. We say it’s a living document because teams are still encouraged to evaluate other technologies when designing their systems, and, if they prove to be the right choice, we’ll update our Golden Path to reflect that. We’ll write another post with more details about this Golden Path.

From an architecture perspective, we also depicted a high-level view of how we’d design our end system, starting from the client-facing applications and APIs:


And then describing the set of components that we would have in our internal network:

Our 3-year technical vision was a collaborative effort in which the entire engineering team was involved. We reviewed the proposal multiple times with different stakeholders, including all engineers, data scientists, product managers, and other roles in the company. We received hundreds of comments that enriched the proposal and made it better. We hosted several Q&As to ensure that all aspects of the vision were clear and there were no outstanding items to be resolved. We also presented it to our CEO and the board of directors. We needed the entire company to become owners of this vision, and leaders in achieving it. After our 3-year technical vision was finalized, our engineering organization drove a few subsequent long-term proposals, such as:

  • Operational Model. We describe the infrastructure and networking we’ll need to support our shift from centrally-owned infrastructure to a distributed mindset where each team owns the end-to-end lifecycle of its services.
  • Data. We describe our future internal and external reporting capabilities and how they will work with a service-oriented architecture where each service has its own storage layer, rather than being limited to a centralized MySQL DB. It also covers how to build a centralized data lake that our data scientists can rely on to build their ML models.
  • Frontend. We propose how to unify our frontend stack and extract our server side rendering from our monolith to Backend-for-Frontends for each application.
  • Mobile. We are rethinking our integration with our core services and how to share logic between the different applications that we have today.

Apart from this, the roadmaps of all of our teams have been adapted to align with our vision and now include areas of focus such as moving out of the monolith into their own services, having their own storage layers, and moving from in-house technology to industry standards. This is also reflected in all recent technical proposals, which start by clarifying that what’s outlined in the proposal aligns with our 3-year technical vision and the Golden Path.

But writing and sharing that document was just the seed of the vision. We’re now making tangible progress toward it, for example:

  • We have deprecated our in-house RPC protocol, PySOA, in favor of gRPC (tenet: we will choose conforming over creating/reforming). We did an initial evaluation comparing PySOA with gRPC and built a proof of concept to understand which would better suit our use cases. We decided to move to gRPC because it allows us to focus on our business needs instead of maintaining our own RPC protocol, and because it is technically superior: it supports HTTP/2 (while PySOA relies on Redis as its transport), has a smaller payload size thanks to protocol buffers and binary serialization, supports multiple programming languages instead of just Python, and has TLS/SSL support, among other advantages. We have also started writing new services using this new protocol.
  • We are enabling self-service AWS account provisioning and defining our networking and security layers so that teams can own their service’s infrastructure (tenet: teams will have end-to-end ownership of their systems and services).
  • We are migrating our unmanaged MySQL database to AWS Aurora (tenet: we favor cloud managed services or serverless for commoditized systems and components).
  • We have worked on several long-term designs for some of our key components such as Ordering and Event Management instead of focusing on shorter-term and incremental improvements (tenet: we will favor long-term maintainability and scale over short-term deliveries for strategic solutions). We have also started their implementation and expect our initial deliveries later this year.
  • We are writing new designs that break the previous limitation/guidance to use only Python/Django and MySQL, and that consider databases such as DynamoDB or QLDB, languages such as Kotlin or Go, and messaging such as SNS or Kinesis, as a few examples (tenet: we will standardize on a few stacks but also empower teams to choose the right tool for the job).
  • We have recently launched an Operational Readiness Review process to analyze the reliability of our current codebase as well as new designs; we are also overhauling our Security Review process, moving to full CI/CD, Dockerizing our monolith, raising the bar in our testing and quality processes, and pursuing several other initiatives (tenet: we will strive for continuous improvement and will ask “why not?” instead of “why?”).

These are just a few examples that show how a clearly outlined long-term technical direction can have a significant impact on an organization’s architecture and processes. We will detail many more of these examples, and their actual impact, in upcoming posts.

We are excited about this new, long-term technical vision that will provide the right guidance to our teams, indicate how the different pieces of our system should fit together, and help our everyday decision-making. And what’s even more exciting is that the whole company participated in its definition and has embraced it with energy and passion.