Writing our 3-year technical vision

I joined Eventbrite as their first Technical Fellow, the most senior engineering individual contributor role in the company. One of my initial goals was to come up with an overarching technical vision for the whole company, aligned with our 3-year business strategy, that would move us away from a monolithic architecture and a central SRE team toward a distributed system where ownership shifts to each team. In our most recent post, Vivek Sagi described the problems that we identified and our future-looking goals, which to recap are:

  • Deliver reliable, high quality, cost effective software solutions to our creators and consumers that allow the business to grow revenue 5x by 2023.
  • Enable autonomous dev teams that own their code and architecture. Provide these teams the platform, tooling, and access required to own end-to-end production support for their services.
  • Improve dev team accountability to deliver against high level OKRs while giving them autonomy to decide on the path to get there.
  • Drive automation and reduce toil. All feature dev teams should be able to apply 60% of their capacity to deliver new business value by 2023. This balance is an estimate based on the best-performing mature product teams that we have seen in our past experience.
  • Establish an operational excellence bar. Deliver 99.99% uptime across all customer facing services.

To accomplish these goals, I started working with other engineers and product leaders to understand the history of our technical architecture and the challenges that we were facing, including developer productivity issues, site reliability problems, and scalability limitations. From these goals, we derived a set of requirements for our 3-year technical vision:

  1. Features. As our product offering evolves to deliver high quality self-service experiences for Super Creators and Consumers, we must ensure that our technology stack enables teams to efficiently create, optimize, and maintain the net new functionality we will need to provide. For example, Super Creators require multi-event creation/editing, organization-level reporting, and multi-event cart support – all of which will require significant architectural changes relative to our current offering. In addition, a new bundle of marketing tools will enhance creators' ability to acquire new audiences and grow existing ones, especially by leveraging automation and machine learning to simplify the experience while increasing the impact. We also seek to improve our offering for consumers to discover and attend events, and to maintain trust in our platform.
  2. Leveraging Data. We have the opportunity to power new data differentiated products based on data from over a decade of past events and round out our focused product offering with key 3rd party integrations (e.g. Mailchimp, Zoom).
  3. Performance. User perception of our product's performance is paramount: a slow product is a poor product. In addition, better page performance leads to better SEO rankings. We decided to leverage Lighthouse's performance score, an industry-standard web performance metric, and we endeavor to achieve a green score (90 to 100) across our customer-facing features. We must also enforce low latency in our internal infrastructure and API response times, and set reduction goals year over year.
  4. Scale. We will pursue two types of scaling improvements. First, we will scale our systems to handle 5x the current load as the business grows. Second, we will handle the spikiness in our traffic caused by large event on-sales, where today we use a Waiting Room to throttle calls to our services and database. We will design systems that can automatically scale up and down around such events, so we no longer have to overprovision our infrastructure manually.
  5. Quality. The defect rate of our product offering can either make or break the experience for our users. In the past year, we have reduced the number of open critical bugs from 311 down to 175 and also reduced the number of bugs that missed our fix SLA from 200 to 110. We should aggressively lean into this trend and continue to reduce both by 50% YoY. We will improve our ability to deliver along that trend by increasing our test coverage, reducing our code complexity, building better tooling, and increasing our level of automation.
  6. Self-Service. We will improve self-service both externally and internally. Externally, we will aim for a 50% YoY reduction in customer support contact rate relative to total ticket sales, while ensuring that help center page views don't disproportionately grow – the point being that we deliver product experiences with sufficient in-line guidance to result in successful experiences. Internally, we will ensure that data is accessible to teams and that each data source and service has clear documentation, runbooks, contracts, and use cases. We will define these in "How We Work" guidelines that every team will follow.
  7. Development Process. Finally, we must streamline our internal development processes and progress along the DevOps Big 4 to these levels: Deployment frequency: Elite (Daily for web and backend services and up to weekly for native apps), Lead time for changes: Elite (Less than one hour), Mean time to restore service: Elite (Less than one hour), and Change failure rate: Elite (0-15%).

Applying these principles to the problems outlined in the previous post, we arrived at the following solutions:

  1. Our monolith became a bottleneck to our developer velocity and overall site reliability and scalability. We need to decouple our monolith into smaller microservices that can evolve and scale independently. This is a trend that many other companies have followed as they grow, and based on our professional experience prior to Eventbrite, we know it works.
  2. Our initial partial attempt to move to a Services Oriented Architecture (SOA) compounded the problem. In that prior attempt, we lacked a clear vision of what moving to SOA meant and how to accomplish it: we moved business logic out of the monolith but not the data. This time around, we've prioritized this architecture transition at a company level, focusing first on the core business logic and segregating and migrating the underlying data with every service.
  3. Our performance became suboptimal, leading to poor utilization of our hardware resources. We planned to fix this in two ways: by moving to managed services, letting cloud providers deal with this responsibility, and by choosing technologies that autoscale properly based on our traffic patterns, which are spiky by nature due to large on-sale events.
  4. Our SDLC process was ad hoc and lacked sufficient controls in a few places. We’ve defined and set ownership boundaries between services and logical components. We’ve also enacted Architecture Review Committees to review designs to ensure we are building extensible services that don’t become monolithic themselves.
  5. Given all the intricate moving parts involved in releasing the monolith, we had trusted our Site Reliability Engineers (SREs) to be the only ones who could coordinate all that infrastructure. We are transitioning to DevOps, where each team owns the end-to-end lifecycle of its services. As with an earlier point, we've implemented this successfully at other companies in the past and we know it works.
  6. We lack automation in how we test, deploy, monitor and roll back our code. Our vision document has sections specifically addressing deployments, testing and operations, indicating that we should aspire to full automation and minimize (and remove, if possible) any manual intervention.
  7. Our core “eb” database is not only monolithic but also mutable, and capturing historical changes has been challenging. We see this as an architectural issue where our data boundaries were never established and we had many different services reading from the same tables and writing to them. We also used the same database technology for all of our use cases which has proven to be inefficient.
  8. We also built homegrown tools such as our own RPC protocol, PySOA. We are no longer investing our time in areas that are not business critical and where we can't build competitive differentiation. For everything else, where a commoditized solution exists, we will evaluate buying instead of building whenever possible. This allows us to focus on providing customer value.

As we can see, we're trying to move ownership away from a centralized SRE team and monolithic architecture and empower teams to build and own their systems. But moving from a situation where the technology set for building features is very limited to one that is much more open has its risks as well, and we didn't want to end up with a technology spectrum so wide that it would be difficult to maintain. This is why we wrote our Golden Path, a living document that details the technologies teams are allowed to use in production for their services and covers areas such as RPC protocols, storage layers, and programming languages. We say it's a living document because teams are still encouraged to evaluate other technologies when designing their systems, and, if they prove to be the right choice, we'll update our Golden Path to reflect them. We'll write another post with more details about the Golden Path.

From an architecture perspective, we also depicted a high-level view of how we'd design our end system, starting from the client-facing applications and APIs, and then describing the set of components that we would have in our internal network.

Our 3-year technical vision was a collaborative effort in which the entire engineering team was involved. We reviewed the proposal multiple times with different stakeholders, including all engineers, data scientists, product managers and other roles in the company. We received hundreds of comments that enriched the proposal and made it better. We hosted several Q&As to ensure that all aspects of the vision were clear and there were no outstanding items to be resolved. We also presented it to our CEO and the board of directors. We needed the entire company to become owners of this vision, and leaders in achieving it. After our 3-year technical vision was finalized, a few subsequent long-term proposals were driven by our engineering organization, such as:

  • Operational Model. We describe the infrastructure and networking that we’ll have to support our shift from centrally-owned infrastructure to a distributed mindset where each team owns the end-to-end lifecycle of their services.
  • Data. We describe our future internal and external reporting capabilities, and how these will work with a service-oriented architecture where each service has its own storage layer rather than being limited to a centralized MySQL DB. It also covers how to build a centralized data lake that our data scientists can rely on for their ML models.
  • Frontend. We propose how to unify our frontend stack and extract our server side rendering from our monolith to Backend-for-Frontends for each application.
  • Mobile. We are rethinking our integration with our core services and how to share logic between the different applications that we have today.

Apart from this, the roadmaps from all of our teams have been adapted to align with our vision and now include areas of focus such as moving away from the monolith into their own service, having their own storage layer, or moving from in-house technology to industry standards. This is also reflected in all recent technical proposals that have been written, all of which start by clarifying that what’s outlined in the proposal is in alignment with our 3-year technical vision and the Golden Path.

But writing a document and sharing that proposal was just the seed of the vision. We’re now making tangible progress to get there, such as:

  • We have deprecated our in-house RPC protocol PySOA in favor of gRPC (tenet: we will choose conforming over creating/reforming). We did an initial evaluation comparing PySOA with gRPC and built a proof-of-concept to understand which would better suit our use cases. We decided to move to gRPC because it allows us to focus on our business needs instead of maintaining our own RPC protocol. gRPC supports HTTP/2 (while PySOA relies on Redis), has a smaller payload size thanks to protocol buffers and binary serialization, supports multiple programming languages instead of just Python, and has TLS/SSL support, among other advantages. We have also started writing new services using this new protocol.
  • We are enabling self-service AWS account provisioning and defining our networking and security layers so that teams can own their service’s infrastructure (tenet: teams will have end-to-end ownership of their systems and services).
  • We are migrating our unmanaged MySQL database to AWS Aurora (tenet: we favor cloud managed services or serverless for commoditized systems and components).
  • We have worked on several long-term designs for some of our key components such as Ordering and Event Management instead of focusing on shorter-term and incremental improvements (tenet: we will favor long-term maintainability and scale over short-term deliveries for strategic solutions). We have also started their implementation and expect our initial deliveries later this year.
  • We are writing new designs that break the previous limitation/guidance to only use Python/Django and MySQL and consider databases such as DynamoDB or QLDB, Kotlin or Go, and SNS or Kinesis, as a few examples (tenet: we will standardize on a few stacks but also empower teams to choose the right tool for the job).
  • We have recently launched an Operational Readiness Review process to analyze the reliability of our current codebase as well as new designs. We are also overhauling our Security Review process, moving to full CI/CD, Dockerizing our monolith, raising the bar in our testing and quality processes, and driving several other initiatives (tenet: we will strive for continuous improvement and will ask why not instead of why?).

These are just a few examples that show how a clearly outlined long-term technical direction can have a significant impact on an organization's architecture and processes. We will detail many more examples of the actual impact in upcoming posts.

We are excited about this new, long-term technical vision that will provide the right guidance to our teams, indicate how the different pieces in our system should fit together, and help our every-day decision-making. And what's even more exciting is that the whole company participated in its definition and has embraced it with energy and passion.

How we created our 3-year technical vision

Eventbrite is a global company. We have employees in many countries around the globe, and we believe in the future of remote work. This is no different in Engineering. With offices in San Francisco, Nashville, Mendoza and Madrid, and many of our team members working remotely, our global team has a rich and diverse culture. We have grown organically and also by acquiring several companies such as Ticketfly, ToneDen, Ticketea, and Eventioz.

Our tech stack grew rapidly as we scaled, and we ended up with a Python/Django monolith, called Core, backed by a centralized MySQL database. Starting with a code base that becomes monolithic over time is a common occurrence in startups that scale, since they need to deliver fast and provide business value. In the long run, however, this approach impacts the speed at which the company can operate and innovate because of the number of hard dependencies within the code, which is now fractionally owned by many different teams.

When I came onboard, approximately nine months ago, there were multiple initiatives underway to reduce our technical debt. However, we did not have a unifying 3-year technical vision that would act as a guiding principle, our north star, to keep us on the right path and enable us to deliver against our business strategy.

Having that vision for the future is paramount to the success of our team, our company, and the event creators and attendees we build for.

This is the first of many posts from our team on how we built our 3-year technical vision and are executing against it to increase our code quality and development velocity while reducing infrastructure costs.

Our intent in sharing this is twofold. For anyone considering a similar exercise, hopefully some or all of this resonates and will help you on your journey. We also acknowledge there are better ways to accomplish this and hope to learn these methods from you through your comments and feedback below.

The first step in creating a technical vision is to have a shared understanding of the problems with your current architecture. The following were ours.

Problems

  • Our monolith became a bottleneck to our developer velocity and overall site reliability and scalability. A monolithic architecture is a software pattern where all the codebase and infrastructure are tightly coupled, live in the same artifact, and have the same development and deployment lifecycle. This contrasts with distributed architectures where each component or service has its own set of artifacts and lifecycle. A monolithic architecture leads to teams having unnecessary dependencies and being forced to move at the speed of the monolith. They are often blocked by other changes or the release schedule of the monolith.
  • Our initial partial attempt to move to a Service Oriented Architecture (SOA) compounded the problem. We retained a single data store for most services, and writes continued to happen from the monolith. We also had issues where multiple data stores were being updated based on a single action, and multiple code paths performing CRUD operations on the same data, leading to consistency issues. In effect, we created a complex distributed monolith because of the depth of dependencies, circular calls, data coupling, and Eventbrite's specific architecture. This also increased our blast radius exposure, meaning that a failure in a given part of the system could potentially affect many others, increasing the severity of the issue and the customer impact.
  • Our performance became suboptimal leading to a poor utilization of our hardware resources. There are two main reasons for this:
    • Relying on a relational MySQL database allows us to scale vertically but not horizontally, impacting the overall performance and scalability of our architecture. It is also not the ideal solution for many scenarios, such as reporting or data science model development. We are also affected by inconsistent data model and query design, which requires a lot of human effort to overcome.
    • Our Python code is inherently single-threaded and therefore unable to use most of the capacity that our hosts have, leading to overprovisioning in some cases and requiring mitigation features such as waiting rooms to handle very spiky traffic patterns when we have large events on sale. We do not autoscale the core monolith given all its complexities. There's a lack of clear code, service, or data ownership, and we have some orphaned services. Side effects between service interactions are also a common problem.
  • Our SDLC process was ad hoc and lacked sufficient controls in a few places. Software Engineers (SWEs) make code contributions to different repositories without a consistent review or approval process.
  • Given all the intricate moving parts to release the monolith, we trust our Site Reliability Engineers (SREs) to be the only ones who can coordinate all that infrastructure. That has led to SREs being the only team with production access, which inadvertently leads to them being perceived as a bottleneck even as they try to do what we asked them to do. In addition, our architecture tends to be limited to the tools that SREs use, instead of being able to choose the best technology.
  • We lack automation in how we test, deploy, monitor and roll back our code, placing an undue burden on some of our engineering teams who need to spend their time on these mundane tasks instead of delivering value.
  • The core “eb” database is not only monolithic but also mutable, and capturing historical changes has been challenging. We’ve introduced new ledger style datastores to capture history that have led to consistency challenges.
  • We also built homegrown tools such as our own RPC protocol, PySOA, and an open source library, because of some unique needs in our business and because there were no off-the-shelf tools that did the job well a decade ago. As the industry has evolved, we now have off-the-shelf solutions that offer similar or better functionality. Maintaining undifferentiated homegrown systems, which consume time from our engineering team and make it difficult to integrate with other industry standards, is no longer prudent. We need to start looking at buy-vs-build options for all of our non-differentiated technical needs.

The next step was to describe our end goals. After we executed on our technical vision, what would we want to accomplish?

Goals

  • Deliver reliable, high quality, cost effective software solutions to our creators and consumers that allow the business to grow revenue 5x by 2023.
  • Enable autonomous dev teams that own their code and architecture. Provide these teams the platform, tooling, and access required to own end-to-end production support for their services.
  • Improve dev team accountability to deliver against high level OKRs while giving them autonomy to decide on the path to get there.
  • Drive automation and reduce toil. All feature dev teams should be able to apply 60% of their capacity to deliver new business value by 2023. This balance is an estimate based on the best-performing mature product teams that we have seen in our past experience.
  • Establish an operational excellence bar. Deliver 99.99% uptime across all customer facing services.

Defining aspirational goals was great but then we needed a set of tenets that would guide our decisions going forward. Here was the set of tenets we came up with.

Tenets

  • We will choose conforming over creating/reforming. We will research industry standards and will favor using them instead of building our own standard(s). By default we choose to not build wrappers on top of such standards. We will only build our own custom standard when it gives us a competitive advantage.
  • Teams will have end-to-end ownership of their systems and services. This includes the business proposal, system design, implementation, testing, documentation, deployment to production, monitoring, maintenance, and responding to on-call for any production issues. We believe this ownership leads to more productive teams and higher quality code. We will minimize the delegation of any of these activities to other teams or roles, although we expect some to retain a central responsibility, such as SREs being responsible for the overall health of our site and apps.
  • We favor cloud managed services or serverless for commoditized systems and components. By default we choose not to maintain and scale our own hosts, databases or other infrastructure, and rely on cloud computing. We will focus on creating business value over non-value added commoditized tasks.
  • We will favor long-term maintainability and scale over short-term deliveries for strategic solutions. We will have a long-term vision for strategic solutions that are a core component of our 3-year strategy. We value having a maintainable and scalable solution, and accept having short-term and iterative steps but always with the long-term in mind.
  • We will standardize on a few stacks but also empower teams to choose the right tool for the job. We will choose a preferred logging, metrics, alerting, front end development and issue management stack. However, teams are encouraged to pick appropriate programming languages, data stores, and frameworks for their use cases after a thorough analysis. We will not allow an unreasonable growth in the number of technologies that we use, though.
  • We will strive for continuous improvement and will ask "why not?" instead of "why?" We will be bold in our choices and will constantly seek new and better ways of solving problems. We will be rigorous in our analysis and will take risks in pursuit of excellence.

This is where our 3-year technical vision started to take shape. Daniel Micol, our Technical Fellow, will shed more light on that process in our next blog post.

MySQL High Availability at Eventbrite

Situation

Eventbrite has been using MySQL since its inception as a company in 2006. MySQL has served the company well as an OLTP database and we’ve leveraged the strong features of MySQL such as native replication and fast reads, as well as dealt with some of its pain points such as impactful table ALTERs. The production environment relies heavily on MySQL replication. It includes a single primary instance for writes and a number of secondary replicas to distribute the reads.  

Fast forward to 2019. Eventbrite is still powered by MySQL, but the version in production (MySQL 5.1) is woefully old and unsupported. Our MySQL production environment still leans heavily on native MySQL replication. We have a single primary for writes, numerous secondary replicas for reads, and an active-passive setup for failover in case the primary has issues. Our ability to fail over is complicated and risky, which has resulted in extended outages as we've fixed the root cause of an outage on the existing primary rather than failing over to a new primary.

If the primary database is not available, then our creators are not creating events and our consumers are not buying tickets for those events. The failover from active to passive primary is available as a last resort but requires us to rebuild a number of downstream replicas. Early in 2019, we had several issues with the primary MySQL 5.1 database, and due to our reluctance to fail over we incurred extended outages while we fixed the source of the problems.

The Database Reliability Engineering team in 2019 was tasked first and foremost with upgrading to MySQL 5.7 as well as implementing high availability and a number of other improvements to our production MySQL datastores. The goal was to implement an automatic failover strategy on MySQL 5.7 where an outage to our primary production MySQL environment would be measured in seconds rather than minutes or even hours. Below is a series of solutions/improvements that we’ve implemented since mid-year 2019 that have made a huge positive impact on our MySQL production environment. 

Solutions

MySQL 5.7 upgrade

Our first major hurdle was to get current with our version of MySQL. In July 2019 we completed the MySQL 5.1 to MySQL 5.7 (v5.7.19-17-log Percona Server to be precise) upgrade across all MySQL instances. Due to the nature of the upgrade and the large gap between 5.1 and 5.7, we incurred downtime to make it happen. The maintenance window lasted ~30 minutes and it went like clockwork. The DBRE team completed ~15 failover practice runs against Stage in the days leading up to the cut-over, which is one of the reasons the cutover was so smooth. The cut-over required 50+ engineers, product managers, QA, and managers in a hangout to support it, with another 50+ engineers assuming on-call responsibilities through the weekend. It was not just a DBRE team effort but a full Engineering team effort!

Not only had MySQL 5.1 reached end of life more than five years earlier, but our MySQL 5.1 instances on EC2/AWS had limited storage and we were scheduled to run out of space at the end of July. Our backs were up against the wall and we had to deliver!

As part of the cut-over to MySQL 5.7, we also took the opportunity to bake in a number of improvements. We converted all primary key columns from INT to BIGINT to prevent hitting the maximum value. We had a recent production incident related to hitting the max value on an INT auto-increment primary key column. When this happens in production, it's an ugly situation where all new inserts result in a primary key constraint error. If you've experienced this pain yourself then you know what I'm talking about. If not, take my word for it. It's painful!
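To give a rough idea of the change (the table and column names here are illustrative, not our actual schema), each conversion looked something like the following, run through an online schema-change tool rather than as a blocking native ALTER:

-- Illustrative only: widen an auto-increment primary key from INT to BIGINT
ALTER TABLE orders
  MODIFY COLUMN id BIGINT NOT NULL AUTO_INCREMENT;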

In parallel with the MySQL 5.7 upgrade we also upgraded Django to 1.6 due to a behavioral change in MySQL 5.7 related to how transactions/commits were handled for SELECT statements. This behavior change was resulting in errors with older versions of Django running on MySQL 5.7.

Improved MySQL ALTERs

In December 2019, the Eventbrite DBRE team successfully implemented a table ALTER via gh-ost on one of our larger MySQL tables. The duration of the ALTER was 50 hours and it completed with no application impact. So what's the big deal?

The big deal is that we could now ALTER tables in our production environment with little to no impact on our application, and this included some of our larger tables that were ~500GB in size.

Here is a little background. The ALTER TABLE statement in MySQL is very expensive. There is a global write lock on the table for the duration of the ALTER statement, which leads to a concurrency nightmare. The duration of an ALTER is directly related to the size of the table, so the larger the table, the larger the impact. For OLTP environments where lock waits need to be as minimal as possible, the native MySQL ALTER command is not a viable option. As a result, online schema-change tools have been developed that emulate the MySQL ALTER TABLE functionality using creative ways to circumvent the locking.

Eventbrite had traditionally used pt-online-schema-change (pt-osc) to ALTER MySQL tables in production. pt-osc uses MySQL triggers to move data from the original table to the "duplicate" table, which is a very expensive operation and can cause replication lag. In fact, it directly resulted in several outages in H1 2019 due to replication lag or breakage.

GitHub introduced a new online schema migration tool for MySQL (gh-ost) that uses a binary log stream to capture table changes and asynchronously applies them onto a "duplicate" table. gh-ost provides control over the migration process and allows for features such as pausing, suspending and throttling the migration. In addition, it offers many operational perks that make it safer and more trustworthy to use. It is:

  • Triggerless
  • Pausable
  • Lightweight
  • Controllable
  • Testable

Orchestrator

Next on the list was implementing improvements to MySQL high availability and automatic failover using Orchestrator. In February of 2020 we implemented a new HAProxy layer in front of all DB clusters and we released Orchestrator to production!

Orchestrator is a MySQL high availability and replication management tool. It will detect a failure, promote a new primary, and then reassign the name/VIP. Here are some of the nice features of Orchestrator:

  • Discovery – Orchestrator actively crawls through your topologies and maps them. It reads basic MySQL info such as replication status and configuration. 
  • Refactoring – Orchestrator understands replication rules. It knows about binlog file:position and GTID. Moving replicas around is safe: orchestrator will reject an illegal refactoring attempt.
  • Recovery – Based on information gained from the topology itself, Orchestrator recognizes a variety of failure scenarios. The recovery process utilizes the Orchestrator’s understanding of the topology and its ability to perform refactoring. 

Orchestrator can successfully detect the primary failure and promote a new primary. The goal was to implement Orchestrator with HAProxy first and then eventually move to Orchestrator with ProxySQL.

Manual failover tests

In March of 2020 the DBRE team completed several manual/controlled failovers using Orchestrator and HAProxy. Eventbrite experienced some AWS hardware issues on the MySQL primary, and completing manual failovers was the first big test. Orchestrator passed with flying colors.

Automatic failover

In May of 2020 we enabled automatic failover for our production MySQL data stores. This is a big step forward in addressing the single-point-of-failure situation with our primary MySQL instance. The DBRE team also completed several rounds of testing in QA/Stage for ProxySQL in preparation for the move from HAProxy to ProxySQL.

Show time

In July 2020, Eventbrite experienced a hardware failure on the primary MySQL instance that resulted in an automatic failover. The new and improved automatic failover process via Orchestrator kicked in and we failed over to the new MySQL primary in ~20 seconds. The impact to the business was remarkably low!

ProxySQL

In August of 2020 we made the jump to ProxySQL across our production MySQL environments. ProxySQL is a proxy specially designed for MySQL. It allows the Eventbrite DBRE team to control database traffic and the SQL queries that are issued against the databases. Some nice features include:

  • Query caching
  • Query re-routing – to separate reads from writes (see the sketch below)
  • Connection pool and automatic retry of queries
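As a rough sketch of how that read/write split is expressed (the hostgroup numbers and rule patterns below are hypothetical, not our production configuration), ProxySQL routing is configured through its SQL-like admin interface by inserting rows into the mysql_query_rules table:

-- Hypothetical example: hostgroup 10 = primary (writes), hostgroup 20 = read replicas
INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES (1, 1, '^SELECT.*FOR UPDATE', 10, 1);

INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES (2, 1, '^SELECT', 20, 1);

-- Apply the rules and persist them
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;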

Also during this time period we began our AWS Aurora evaluation as we started our "Managed Databases" journey. Personally, I prefer the term "Database-as-a-Service (DBaaS)". Amazon Aurora is a high-availability database that consists of compute nodes replicated across multiple availability zones to gain increased read scalability and failover protection. It's compatible with MySQL, which is a big reason why we picked it.

Schema Migration Tool

In September of 2020, we began testing a new self-service database migration tool to provide interactive schema changes (it invokes gh-ost behind the scenes). It supports "ALTER TABLE…", "CREATE TABLE…", and "DROP TABLE…" DDL statements.

It includes a UI where you can see the status of migrations and run them with the click of a button.

Any developer can file and run migrations, and a DBRE is only required to approve the DDL (this is all configurable, though). The original source for the tool is the open-source shift project. We've not pushed it to production yet, but we do have a running version in our non-prod environment.

Database-as-a-Service (DBaaS) and next steps

We’re barreling down the “Database-as-a-Service (DBaaS)” path with Amazon Aurora. 

DBaaS flips the DBRE role a bit by eliminating mundane operational tasks and allowing the DBRE team to align our work more closely with the business to derive value from the data assets that we manage. We can then focus on lending our DBA skills to application teams and end users—helping deliver new features, functionality, and proactive tuning value to the core business. 

Stay tuned for updates via future blog posts on our migration to AWS Aurora! We're very interested in the scaled read operations and increased availability that an Aurora DB cluster provides. Buckle your seatbelt and get ready for a wild ride 🙂 I'll provide an update on "Database-as-a-Service (DBaaS)" at Eventbrite in my next blog post!

All comments are welcome! You can message me at ed@eventbrite.com. Special thanks to Steven Fast for co-authoring this blog post.

Building a Protest Map: A Behind the Scenes Look!

Sometimes, a project that you’re most proud of or passionate about isn’t one that involves architecture diagrams or deep technical discussions. Sometimes, the most significant project is one that allows peers to come together to build something that can have a positive impact. When we can all collaborate for a cause we’re passionate about, and get quick buy-in from our colleagues and executives, we can create a feature we’re proud of. 

Such is the case with Eventbrite’s Protests map. If you’ve visited the Eventbrite homepage recently, you’ve probably seen our map highlighting protests and marches to support racial equality, hosted on Eventbrite.

The map features events across the U.S., and the data source for the events shown is a Curated Collection.

The birth of a map

On May 25, 2020, news of George Floyd’s murder in Minneapolis spread rapidly across the country, igniting protests and rallies seeking justice. On June 9, a small team consisting of a product manager, engineer, designer, and content editor came together to build a map for these events on Eventbrite.com.

The plan was to create a map on Eventbrite that led users to protests across the United States. This map would aggregate Black Lives Matter protests and allow our users to interact with them based on their location. While some projects exemplify the technical ingenuity of a team, this one relied on quick thinking and action, teamwork, and a driving force of passion. There are always a number of technical solutions to test out, but sometimes it's simpler to leverage people.

Company buy-in and teamwork

The fact that this map lives on our homepage means that everyone, including event organizers and consumers, would see this high-touch feature. That meant we needed quick buy-in from the whole company, especially impacted department leads, to launch this feature while protests were still occurring nationwide. The project was greenlit in a matter of days.

The trust involved in getting this map out the door was important to our success. Legal, PR, and marketing all came together to bring this project to life; it was a true cross-functional effort.

Collections and discovery

When we were in the brainstorming phase, we faced the question of how to populate the map with events that were specifically protests in support of Black Lives Matter. Our first instinct was to construct a query to identify events that would fit into the theme.

For a simple example, we could return events where the event’s organizer had selected the “Rally” format when they created the event. Then further filter those results by events where the organizer had entered a description containing the text “BLM” or “George Floyd.”

We quickly realized this wasn’t going to produce good results. Variations on this approach would either be over-specific or too broad, including far too many irrelevant events or excluding relevant ones. Plus, it would have been vulnerable to trolling and require us to constantly adjust the query as people learned how to configure their events to show up on the map.

It was probably naive to have even tried that approach, but the obvious alternative had its downsides too. That alternative was manual, human-driven curation through the use of an existing feature on our platform called "Collections."

Collections allow Eventbrite users to create collections of events. Any user on Eventbrite can create a collection simply by clicking the "heart" icon on an event card.

Our marketing team uses this feature to highlight particularly interesting events that adhere to certain themes: online yoga classes, drive-in concerts, etc. Sometimes you’ll see it in our product under the name of “Editor’s Picks.”

But as mentioned above, this approach has downsides. It takes significant manual work to identify all the events that we want to be part of the collection. And event inventory has a unique challenge: events end. At some point in time, a planned event will take place and the content is no longer relevant to show consumers. The Collections feature depends on the same Elasticsearch index we use for search, and due to various constraints we actively remove events from the index after they end. 

Since we remove ended events, we run the risk of showing a sparse or empty map if we don’t maintain the collection. This meant that someone would have to constantly keep an eye on this collection and keep it updated with new events as people create them on our platform.

Luckily, we had help.

The initial list of protests came together through help from the rest of the company, specifically our company’s global diversity group, collaborating on Slack and Google Sheets.

After we launched, Eventbrite co-founder and CEO Julia Hartz was so inspired by the work that she offered extra funding for the marketing team to continue identifying relevant events for the map. That's some of the most positive internal feedback you can get!

The nuts-and-bolts

While this was a straightforward project from a technical implementation standpoint, for the most part leveraging an existing feature rather than building something entirely new, it’s worth highlighting the work we’ve done in Eventbrite Engineering over the past year to invest in our architecture and infrastructure. Like many companies, Eventbrite is undergoing the difficult process of moving away from a monolithic architecture to a microservices architecture (ours has a domain-driven spin). We call the monolith “Core.”

Although we’re nowhere close to finished, we’re already starting to reap the benefits of the hard work invested in moving features out of Core, piece by piece. A little over a year ago, it wouldn’t have been feasible to ship a project of this nature so quickly because deploying any part of the site, no matter how small, required deploying all of Core. Today, we can ship changes to the homepage within a matter of minutes. 

Just as importantly, we have reliable alerting to notify us of problems that arise after shipping a change. As a matter of fact, shortly after we launched this project we were alerted of an unusual spike in errors on the homepage. Within 5 minutes of receiving this notification, we rolled back the release containing the map. After another 10 minutes, we re-released the map with the fix.

Bringing people together

Our company’s mission is simple: Bring the world together through live experiences. Protests themselves bring together like-minded individuals to fight towards a common goal. Working on a feature that exemplifies our mission as a company was a truly inspiring opportunity. This lit a fire in our team that translated to the company as a whole.

Authors: Eloise Porter, Jaylum Chen, Alex Rice, and Allen Jilo

How are you building/maintaining team cohesion?

Let’s face it! 2020 has been a year of huge change! One of the biggest changes is with how we work. Like many other companies, Eventbrite has transitioned to a fully remote work environment with engineers and managers located across the globe. We have development centers in San Francisco (USA), Nashville (USA), Mendoza (Argentina), and Madrid (Spain) with teams spread across all four locations. The challenge for managers and individual contributors alike is to build/maintain cohesion for remote teams! 

This is for sure one of the most common topics in management these days. How do you keep cohesion when your team has transitioned to fully remote? A group of Eventbrite managers decided to try a few strategies and to document the process. 

At Eventbrite, our mission is to bring the world together. In a pre-pandemic world that meant live events; these days it's more about connecting with people in whatever ways we can.

Building a team is more art than science: nurturing the culture, the bonds, and, most importantly, the trust between members takes time and dedication. Doing all of this is difficult even when everyone can be in the same place, make jokes, laugh, and work side by side. In this new COVID-19 world, we lost many of these tools. We decided it was time for some new tricks!

Daily standup meetings

Right from the beginning of the COVID-19 pandemic in March, one of our Engineering Managers (Nacho) suggested that we try to sync every morning. Before COVID-19, we wouldn't do this regularly. We gave it a try and it worked! Since then we talk every day, we are very active on Slack, and it's not always about work. It has helped us both professionally and emotionally as we share problems and frustrations. We make the daily standups a top priority, as these virtual meetings are a perfect chance to build rapport with our teammates.

It was super hard to get everyone on board, but eventually we were all there, always, every morning. It's great to think that I have my team when I need them, available and willing to help.

Weekly Board Games

Starting in April, we set up a board game meeting for 4pm every Tuesday. We play Settlers of Catan on colonist.io while chatting over video conference. It has become a weekly ritual for one of our teams in Argentina, with between 3 and 5 people playing. It lets us spend some time together, have fun and socialize. 

  • Catan:
https://colonist.io/
  • Basta! – an online version of the game we all know as Tutti frutti, Lápiz quieto, ¡Mercadito!, or Dulce de Membrillo. Up to 8 players:
https://bastaonline.net/

Coffee breaks

There's always time for a coffee break to talk about anything, work-related or not. We created a Slack shortcut (!cafebreak) to easily share the meeting details so anyone can quickly jump in. These short, spontaneous breaks provide a distraction and let us relax from the daily hustle.

Keeping the Culture alive

Eventbrite's company culture makes it a unique place to work! The challenge is to continue to foster this culture while functioning as a fully remote team. It can be easy for a team that is accustomed to spending 9 hours a day, 5 days a week together to forget what makes them unique when everyone is scattered, without interaction besides Slack or Hangouts calls. Knowing what makes your team unique is a key thing to understand.

On our team, it is our love for problem solving. So every Friday afternoon, we stopped work early to spend a couple of hours on HackerRank, solving different exercises and sharing our solutions. It was an incredible way of having fun and learning from each other by seeing how each of us approached the solutions. It helped us remember who we were as a team.

Engineering Development Academy

Eventbrite implemented EDA (Engineering Development Academy) in Argentina. This initiative is a new way to find and incorporate engineering talent in our company. The EDA group of professionals is trained in technical (Python, Django, Testing, CI, CD, Javascript, React, CSS) and non-technical subjects (agile methodologies, Scrum, English) in a 3-month program.

Every day at 9:00 AM, we start with a breakfast chat about things that are and aren't working, movies, news, and so on. For the first sessions we looked for topics to share, because the whole team was new, which got everyone speaking. We also hold 30-minute afternoon checkpoints at 2:00 PM for both technical and non-technical questions, to help make the training as positive as possible.

Lightning Talks

Emilio (Principal Engineer) organizes Lightning Talks on a weekly basis. These are open talks about anything that is important to an engineer and that they are interested in sharing with others, such as what we have learned during COVID-19.

Spontaneous 1:1’s

There are often times when discussions are better held in a 1:1 or group meeting than in Slack or via a Google Doc or email. We encourage our team members to create a Google Hangout or Zoom session at any time if it helps resolve issues more quickly.

Quick text messages (via Slack) are great for clarifying simple matters. Often a more detailed discussion is required, and this is when a Google Hangout is preferred. It's important to recognize the distinction, as too many calls can burn engineers out, but you can also waste lots of time exchanging messages when a five-minute call could provide answers to multiple questions.

Virtual Meeting tips

It can be challenging to connect with teammates during virtual/online meetings. Simple things such as keeping your camera on and using verbal and non-verbal cues (such as head nods or a thumbs-up) are great ways to make a connection during virtual meetings. Also, go out of your way to recognize teammates who have gone above and beyond. While working remotely, it is easy to miss the great work that others are doing.

Many virtual meetings start 5 minutes past the scheduled time. Use this time to show personal interest in the other participants. Ask them about their weekend and break the ice while you're waiting for everyone to join. It's a great way to learn more about co-workers who you may not necessarily know very well.


This article’s co-authors are Henry Lyne, Gabriel Flores, Ed Presz, Emiliano André and Juan Pablo Marsano. Reviewed by Rainu Ittycheriah.

Teaching new Presto performance tricks to the Old-School DBA

I’ve spent much of my career working with relational databases such as Oracle and MySQL, and SQL performance has always been an area of focus for me. I’ve spent countless hours reviewing EXPLAIN plans, rewriting subqueries, adding new indexes, and chasing down table-scans. I’ve been trained to make performance improvements such as:  only choose columns in a SELECT that are absolutely necessary, stay away from LIKE clauses, review the cardinality of columns before adding indexes, and always JOIN on indexed columns.

It’s been an instinctual part of my life as a Database Administrator who supports OLTP databases that have sold in excess of 20K tickets per minute to your favorite events. I remember a specific situation where a missing index caused our production databases to get flooded with table-scans that brought a world-wide on-sale to an immediate halt. I had a lot of explaining to do that day as the missing index made it to QA and Stage but not Production!

In recent years, I've transitioned to Data Engineering and begun supporting Big Data environments. Specifically, I support Eventbrite's Data Warehouse, which leverages Presto and Apache Hive using the Presto/Hive connector. The data files can be of different formats, but we're using HDFS and S3. The Hive metadata describes how data stored in HDFS/S3 maps to schemas, tables, and columns to be queried via SQL. We persist this metadata in Amazon Aurora and access it through the Presto/Hive connector via the Hive Metastore Service (HMS).

The stakes have changed and so have the skill-sets required. I’ve needed to retrain myself in how to write optimal SQL for Presto. Some of the best practices for Presto are the same as relational databases and others are brand new to me. This blog post summarizes some of the similarities and some of the differences with writing efficient SQL on MySQL vs Presto/Hive. Along the way I’ve had to learn new terms such as “federated queries”, “broadcast joins”, “reshuffling”, “join reordering”, and “predicate pushdown”.

Let’s start with the basics:

What is MySQL? The world’s most popular open source database. The MySQL software delivers a fast, multi-threaded, multi-user, and robust SQL (Structured Query Language) database server. MySQL is intended for mission-critical, heavy-load production database usage.

What is Presto? Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. Presto doesn't use the MapReduce framework for its execution. Instead, Presto directly accesses the data through a specialized distributed query engine that is very similar to those found in commercial parallel relational databases.

Presto uses ANSI SQL syntax/semantics to build its queries. The advantage of this is that analysts with experience with relational databases will find it very easy and straightforward to write Presto queries! That said, the best practices for developing efficient SQL via Presto/Hive are different from those used to query standard RDBMS databases.

Let’s transition to Presto performance tuning tips and how they compare to standard best practices with MySQL.


1. Only specify the columns you need

Restricting columns for SELECTs can improve your query performance significantly. Specify only needed columns instead of using a wildcard (*). This applies to Presto as well as MySQL! 
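For example, using the fact_ticket_purchase table described later in this post:

-- Scans every column stored for the table
SELECT * FROM fact_ticket_purchase;

-- Reads only the two columns the query actually needs
SELECT ticket_id, create_date FROM fact_ticket_purchase;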


2. Consider the cardinality within GROUP BY

When using GROUP BY, order the columns by the highest cardinality (that is, most number of unique values) to the lowest.

The GROUP BY operator distributes rows based on the order of the columns to the worker nodes, which hold the GROUP BY values in memory. As rows are being ingested, the GROUP BY columns are looked up in memory and the values are compared. If the GROUP BY columns match, the values are then aggregated together.
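A small sketch with hypothetical columns, where user_id has far more unique values than country:

-- Highest cardinality first, lowest cardinality last
SELECT user_id, country, COUNT(*) AS purchases
FROM fact_orders
GROUP BY user_id, country;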


3. Use LIMIT with ORDER BY

The ORDER BY clause returns the results of a query in sort order. To process the sort, Presto must send all rows of data to a single worker and then sort them. This sort can be a very memory-intensive operation for large datasets and will put strain on the Presto workers. The end result will be long execution times and/or memory errors.

If you are using the ORDER BY clause to look at the top N values, then use a LIMIT clause to reduce the cost of the sort significantly by pushing the sorting/limiting to individual workers, rather than the sorting being done by a single worker. 
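For example (hypothetical table and columns), asking only for the top 10 lets each worker prune its rows before the final sort:

SELECT event_id, COUNT(*) AS tickets_sold
FROM fact_orders
GROUP BY event_id
ORDER BY tickets_sold DESC
LIMIT 10;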

I highly recommend you use the LIMIT clause not just for SQL with ORDER BY but in any situation when you’re validating new SQL. This is good practice for MySQL as well as Presto!


4. Using approximate aggregate functions

When exploring large datasets, an approximation (with a standard deviation of 2.3%) is often more than good enough! Presto has approximate aggregation functions that give you significant performance improvements. Using the approx_distinct(x) function on large data sets vs COUNT(DISTINCT x) will result in performance gains.

When an exact number may not be required (for instance, if you are looking for a rough estimate of the number of New Year's events in the Greater New York area), consider using approx_distinct(). This function minimizes the memory usage by counting unique hashes of values instead of entire strings. The drawback is that there is a small standard deviation.
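For example (hypothetical table), the two variants look like this:

-- Exact, memory-hungry distinct count
SELECT COUNT(DISTINCT user_id) FROM page_views;

-- Approximate count, much cheaper on large data sets
SELECT approx_distinct(user_id) FROM page_views;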


5. Aggregating a series of LIKE clauses in one single regexp_like clause

The LIKE operation is well known to be slow, especially when not anchored to the left (i.e. the search text is surrounded by '%' on both sides) or when used with a series of OR conditions. So it is no surprise that Presto's query optimizer is unable to improve queries that contain many LIKE clauses.

We've found improved LIKE performance on Presto by substituting the LIKE/OR combination with a single REGEXP_LIKE clause, which is Presto native. Not only is it easier to read, but it's also more performant. Based on some quick performance tests, we see roughly 30% faster run-times with REGEXP_LIKE vs a comparable LIKE/OR combination.

For example:

SELECT ...
FROM zoo
WHERE method LIKE '%monkey%' OR
      method LIKE '%hippo%' OR
      method LIKE '%tiger%' OR
      method LIKE '%elephant%'

can be optimized by replacing the four LIKE clauses with a single REGEXP_LIKE clause:

SELECT ...
FROM zoo
WHERE REGEXP_LIKE(method, 'monkey|hippo|tiger|elephant')


6. Specifying large tables first in join clause

When joining tables, specify the largest table first in the join. The default join algorithm in Presto is the broadcast join, which partitions the left-hand table of a join and sends (broadcasts) a copy of the entire right-hand table to all of the worker nodes that have the partitions. If the right-hand table is "small", it can be replicated to all the join workers, which saves CPU and network costs. This type of join is most efficient when the right-hand table is small enough to fit within one node.

If you receive an ‘Exceeded max memory’ error, then the right-hand side table is too large. Presto does not perform automatic join-reordering, so make sure your largest table is the first table in your sequence of joins. 

This was an interesting performance tip for me. As we know, SQL is a declarative language, and the ordering of tables used in joins in MySQL, for example, is *NOT* particularly important: the MySQL optimizer will re-order them to choose the most efficient path. With Presto, the join order matters. You've been warned! Presto does not perform automatic join-reordering unless you are using the Cost-Based Optimizer.
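As a sketch (fact_orders is a hypothetical large fact table; dim_event is the small dimension table defined later in this post), the large table goes on the left so the small one can be broadcast:

-- Large table first (left side), small table second (right side, broadcast)
SELECT d.dim_event_id, COUNT(*) AS orders
FROM fact_orders f
JOIN dim_event d ON d.dim_event_id = f.dim_event_id
GROUP BY d.dim_event_id;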


7. Turning on the distributed hash join

If you’re battling with memory errors then try a distributed hash join. This algorithm partitions both the left and right tables using the hash values of the join keys. So the distributed join works even if the right-hand side table is large, but the performance might be slower because the join increases the number of network data transfers. 

At Eventbrite we have the distributed_join session property set to true (you can verify this with SHOW SESSION). It can also be enabled per session, as shown below.
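A minimal sketch (assuming a Presto version that still exposes the distributed_join session property; newer releases use join_distribution_type instead):

-- Inspect current session properties, including distributed_join
SHOW SESSION

-- Enable the distributed (partitioned) hash join for this session
SET SESSION distributed_join = true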

 

8. Partition your data

Partitioning divides your table into parts and keeps the related data together based on column values such as date or country.  You define partitions at table creation, and they help reduce the amount of data scanned per query, thereby improving performance. 

Here are some hints on partitioning:

  • Columns that are used as WHERE filters are good candidates for partitioning.
  • Partitioning has a cost. As the number of partitions in your table increases, the overhead of retrieving and processing the partition metadata grows, and your files get smaller. Use caution when partitioning, and make sure you don’t partition too finely.
  • If your data is heavily skewed to one partition value, and most queries use that value, then the overhead may wipe out the initial benefit.

A key partition column at Eventbrite is transaction date (trx_date).

CREATE TABLE IF NOT EXISTS fact_ticket_purchase
(
    ticket_id STRING,
....
    create_date STRING,
    update_date STRING
)
PARTITIONED BY (trx_date STRING)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY')
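For example (a minimal sketch against the table above; the date range is illustrative), filtering on the partition column lets the engine skip every partition outside the requested range:

-- Only the January 2020 partitions (trx_date) are scanned
SELECT COUNT(*)
FROM fact_ticket_purchase
WHERE trx_date BETWEEN '2020-01-01' AND '2020-01-31'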

 

9. Optimize columnar data store generation

Apache Parquet and Apache ORC are popular columnar data stores. They provide features that store data efficiently by using column-wise compression based on data type, special encoding, and predicate pushdown. At Eventbrite, we define Hive tables as PARQUET using compression equal to SNAPPY:

CREATE TABLE IF NOT EXISTS dim_event
(
    dim_event_id STRING,
....
    create_date STRING,
    update_date STRING
)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY')

Apache Parquet is an open-source, column-oriented data storage format. Snappy is designed for speed and will not overload your CPU cores. The downside of course is that it does not compress as well as gzip or bzip2.

 

10. Presto’s Cost-Based Optimizer/Join Reordering 

We’re not currently using Presto’s Cost-Based Optimizer (CBO)! Eventbrite data engineering released Presto 330 in March 2020, but we haven’t tested CBO yet.

The CBO inherently requires that table statistics be up to date, which we currently calculate for only a small subset of tables! With the CBO, Presto can intelligently decide the best join sequence based on the statistics stored in the Hive Metastore.

As mentioned above, the order in which joins are executed in a query can have a big impact on performance. If we collect table statistics, the CBO can automatically pick the join order with the lowest computed cost. This is governed by the join_reordering_strategy (=AUTOMATIC) session property, and I’m really excited to see this feature in action.
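A rough sketch of what turning this on might look like (assuming a Presto release where the CBO and the ANALYZE statement are available; the fully-qualified table name is illustrative):

-- Collect table and column statistics so the CBO has costs to work with
ANALYZE hive.default.fact_ticket_purchase

-- Let the optimizer reorder joins and pick the join distribution type
SET SESSION join_reordering_strategy = 'AUTOMATIC'
SET SESSION join_distribution_type = 'AUTOMATIC'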

Another interesting join optimization is dynamic filtering. It relies on the statistics estimates of the CBO to correctly convert the join distribution type to a “broadcast” join. By using dynamic filtering via run-time predicate pushdown, we can squeeze out more performance gains for highly selective inner joins.  We look forward to using this feature in the near future!

 

11. Using WITH Clause

The WITH clause is used to define an inline view within a single query.  It allows for flattening nested subqueries. I find it hugely helpful for simplifying SQL and making it more readable and easier to support.
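For example (a minimal sketch; the join key and date filter are illustrative), the inline view is defined once and then referenced by name instead of repeating a nested subquery:

-- Define the inline view once, then reference it like a table
WITH recent_purchases AS (
    SELECT ticket_id, event_id, trx_date
    FROM fact_ticket_purchase
    WHERE trx_date >= '2020-01-01'
)
SELECT d.dim_event_id, COUNT(*) AS tickets_sold
FROM recent_purchases p
JOIN dim_event d
  ON p.event_id = d.dim_event_id
GROUP BY d.dim_event_id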

 

12. Use Presto Web Interface

Presto provides a web interface for monitoring queries (https://prestodb.io/docs/current/admin/web-interface.html). 

The main page has a list of queries along with information like the unique query ID, query text, query state, percentage completed, username, and the source from which the query originated. If the Presto cluster is having any performance-related issues, this web interface is a good place to go to identify and capture slow-running SQL!

 

13. Explain plan with Presto/Hive (Sample)

EXPLAIN is an invaluable tool for showing the logical or distributed execution plan of a statement and for validating SQL statements.

Logical Plan with Presto

explain select SUBSTRING(last_modified,1,4), count(*) from hive.df_machine_learning.event_text where lower(name) like '%wilbraham%' or (REGEXP_LIKE(lower(name), '.*wilbraham.*')) group by 1 order by 1;

 

14. Explain plan with MySQL (Sample)

In this particular case, you can see that the primary key is used on the ‘ejp_events’ table and a non-primary (secondary) key on the ‘ejp_orders’ table. This query is going to be fast!
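As a hedged sketch of the kind of statement behind that plan (the column names and join condition are hypothetical):

-- MySQL uses the primary key on ejp_events and a secondary key on ejp_orders
EXPLAIN
SELECT o.id, e.name
FROM ejp_events e
JOIN ejp_orders o
  ON o.event_id = e.id
WHERE e.id = 12345;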

 

Conclusion

Presto is the “SQL-on-Anything” solution that powers Eventbrite’s data warehouse. It’s been very rewarding for me as the “Old School DBA” to learn new SQL tricks related to a distributed query engine such as Presto. In most cases, my SQL training on MySQL/Oracle has served me well, but there are some interesting differences, which I’ve attempted to call out above. Thanks for reading and making it to the end. I appreciate it!

We look forward to giving Presto’s Cost-Based Optimizer a test drive and kicking the tires on new features such as dynamic filtering & partition pruning!

All comments are welcome, or you can message me at ed@eventbrite.com. You can learn more about Eventbrite’s use of Presto by checking out my previous post, Boosting Big Data workloads with Presto Auto Scaling.

Special thanks to Eventbrite’s Data Foundry team (Jeremy Bakker,  Alex Meyer, Jasper Groot, Rainu Ittycheriah, Gray Pickney, and Beck Cronin-Dixon) for the world-class Presto support, and Steven Fast for reviewing this blog post. Eventbrite’s Data Teams rock!

Create Meaningful (and Fun!) Remote Community


What makes community meaningful? How can you create a meaningful culture and community in a time when we are all separated while working remotely?

Today I want to talk about community, and how important it is to be a proactive agent in fostering a meaningful (and fun!) remote work culture.  Just a smidge about me: before I was a software engineer at Eventbrite, I was on the ground figuring out how to create and grow community in various capacities – as a resident advisor in college, an event planner at Stanford, a coding teacher in various classrooms, and as a participant in worldwide dance and performing communities.  Community is my thing.

I have seen a lot of articles floating around with recommendations on how to create a remote work culture. I want to point out that I’m very specifically using the phrase “remote community” as opposed to “remote culture” because to me, community is something present and tangible and interconnected, whereas culture can be this very nebulous thing.  Culture arises out of community – community is the foundation.

I am also writing this article from the perspective of a mostly remote work community, but these ideas can be taken and used in any type of community.  This article is not engineering-specific, but it can easily be used to benefit any engineering community.

What makes community meaningful?

Community, to me, is about caring for people.  A lot of times, community starts by happenstance – you and others happen to share an interest or work at the same job.  But, in my opinion, ‘happenstance’ community graduates to being a real community when people work together to create a place that everyone cares about.  That to me is where the meaning is derived – from people caring enough to want to make their community a place where people feel supported and feel joy.

How you can create meaningful community

I find that creating meaningful community is as simple as stepping up. Being proactive goes a long, long way – whether on an individual level, like reaching out to someone you know who is having a hard time, or with a broader group in mind, like creating activities that bring people together.

I want to touch mostly on the latter – how to engage a broader group, with either active engagement or passive engagement.  One more thing to consider – how to make things fun! For me, it’s pretty easy – I think about what I enjoy and what I want to see in a community, and then I make it happen.  You can too!

Active Engagement

Active engagement involves events or activities that require people to be present and available at the time that it is happening.  Here are some ideas:

  1. Remote happy hour.  Schedule one once a week with your department or with cross sections of the company you normally don’t interact with.  Use this as an opportunity to talk to new people you wouldn’t normally cross paths with. Pro tip – if you think more than 10 people are going to show up, figure out some way to create ‘rooms’ – either separate Hangouts that people can jump in and out of, or utilize Zoom’s room feature.
  2. Remote team lunch.  Suggestion – why not every day or twice a week?
  3. 2pm 2 minute planks.  Get stronger every day!  We do this via video chat as well to cheer each other on.

Passive Engagement

Passive engagement involves activities that people can circle back around to when they have time.  Ideas include:

  1. Daily Photo Share.  Post a theme for the photo of that day (Baby photo of you! Throwback photo of your parents! Photo of you doing an activity you love!) and see what you end up learning.  I have added a list of ideas that I compiled with my co-workers at the bottom of this article.
  2. Dear Diary chaining.  Remember when email chaining was a thing?  Well, my coworker came up with this idea – write a ‘Dear Diary’ entry about your day, post with a picture, and then tag a new coworker for the next day! It’s been interesting to get a more in-depth look into people’s mentality and how they are trying to stay optimistic.

Out of all the ideas above, the Daily Photo Share has so far had the most impact and been the most uplifting – I tend to post it around 12pm, and then spend the next hour scrolling and commenting on everyone’s amazing pictures of themselves, their families, and the things they love.  It’s an amazing way to see everyone share what they care about.

What’s your time commitment?

It’s up to you to decide your community time commitment.  It doesn’t have to be everything that’s listed above. I recommend the thought model of 2 minutes, 2 hours, or 2 months – what can you contribute? Can you contribute 2 minutes of your time, or 2 hours, or maybe even 2 months?  The 2pm 2-minute plank is a great example of a 2-minute commitment, as is commenting on someone else’s daily photo share. Two hours could be something like deciding to throw and promote a remote happy hour.

I think that now, more than ever, it is really important to take time out of our lives to reach out, to connect, to engage – to build community, and to receive from our communities.

I want to hear from you.  What ideas do you have for active or passive engagement in your community?  What makes community meaningful to you, and how can we surface that more in remote communities?

(Some) Daily Photo Share Theme Ideas:
  • Throwback photo of your parent(s)
  • Hobby/activity you love doing
  • Last meal you took a photo of
  • Your family (immediate or otherwise)
  • Your workspace at home
  • Silly photo of you
  • An item you cherish
  • Vacation you recently took
  • Photo you are proud of (either proud that you took, or of something you are proud of)
  • Photo with a story (and tell the story!)
  • Something that most people don’t know about you
  • Favorite costume photo of you (Halloween or otherwise)
  • Photo of you where you like your vibe
  • Share a photo of a coworker that you appreciate, and why you appreciate them
  • Photo of a new hobby or life hack you have acquired in the last few weeks
  • Something/someone you miss (and give back story!)
  • This could have been an ad for…
  • Everybody should try this once
  • “I didn’t want to be in this photo“

Leveraging AWS “spot” instances to drive down costs

I had the pleasure of participating in a Big Data Cost Optimization event in San Francisco (October 17) at the AWS Loft location. I was a panelist in a round-table discussion focusing on the use of AWS spot instances. It was a terrific opportunity to share how Eventbrite’s Data Engineering team is smartly using spot instances for Big Data & EMR to drive down costs. Thanks to Jasper Groot from the Data Engineering team for helping to co-author this blog post!

At Eventbrite, we’re using spot instances to leverage AWS auto scaling and I’ve published a blog entry on this topic (Big Data workloads with Presto Auto Scaling). Two key focus areas for the data engineering team are ephemeral computing (something that lasts for a very short time)  and idempotence (repeatable behaviour with the same outcome).  Making a commitment in these areas has allowed us to leverage spot instances to effectively manage costs.

Spot instances are a cost-effective choice if you can be flexible about when your applications run and if your applications can be interrupted. They are well-suited for data analysis, batch jobs, background processing, and non mission-critical tasks. Below are some of the panel questions and my answers, and I’d love to share what I learned based on the conversation.

Discover “Pro D3.js”, a new book to improve your JavaScript data visualizations

D3.js is the de facto standard JavaScript library for building interactive data visualizations on the Web. When beginning a new D3.js project, developers usually start by taking one of the community’s many examples and using it to jumpstart new work. This is a convenient way to get something working fast so that they can iterate over it. However, by working this way, they soon struggle to maintain, extend, or modify their charts. D3.js examples are made to demonstrate implementations and techniques, not to ship them to production.

Discover “Pro D3.js”, the new book that walks you through the creation of maintainable, modular, and testable charts. You will also learn how to package and document your visualizations, producing open-source libraries. Read on to learn more about the book and its contents.

A Story of a React Re-Rendering Bug

As front-end developers, we often find ourselves getting into perplexing bugs when the page we build involves a lot of user interactions. When we find a bug, no matter how tricky it is, it means something is wrong in the code. There is no magic, and the code itself does not lie.

This blog will take you on a short journey about how I fixed a particularly annoying bug that existed in one of our products.