- Overview
- About me
- Philosophy
- Governance structure
- Frontend
- Backend
- Data
- Caching
- Tracing
- Protocols and communication patterns
- Access control
- Testing
- DevOps
- QA
- Theories of computing
- Data science and machine learning
A checklist of tech considerations when designing fullstack architecture with SRE in mind.
Since the list is heavily biased towards my tech background, here's a summary of it.
- Frontend: Typescript, React, Vue, Angular, Jekyll/Hugo/Wordpress
- Backend: Go, Java (SE 5 - 14), Node.js, Scala/Akka, Python 3/Flask/gunicorn, Haskell, C, Ruby
- Scripting: Bash, Python 3, Perl
- Cloud: AWS 90%, GCP 10%
- Virtualisation: VirtualBox/VMware, Docker, Kubernetes
- CI/CD: Circle CI, GitHub Actions, Concourse, Airflow, Jenkins
- DB/Big data: PostgreSQL/MySQL/Aurora/Oracle, Cassandra/DynamoDB, MongoDB/Firebase, BigQuery/Snowflake/Parquet (storage format)/Spark+Streaming, Prometheus, Neo4j
- Middleware: Kafka/Kinesis/SQS/RabbitMQ/PubSub/ZeroMQ, Redis, Fluentd, Apigee/Kong/Tyk/KrakenD, Auth0, GraphQL, LaunchDarkly
- Serverless: AWS Lambda, Cloud Function, Step Functions
- Convention over configuration
- Consistency over creativity (i.e., least surprise)
- Single source of truth (SSOT)
- Batteries included
- Full configurability, with a default setting that works 90% of the time.
- Configuration as code
- Infrastructure as code
- Docs as code
- Automate yourself out of a job
- Driven by empathy, not ego (fancy features/algorithms never beat a good user experience)
- Centralisation isn't evil, chaos is
- Simplicity is the ultimate sophistication
- Exceptions are not exceptional, they're part of the system and part of the story
- Make choices based on problem, not on hype or bias
Governance is concerned with
- Ownership of components (e.g., products, services, pieces of infrastructure, code)
- Communication patterns across teams/components
- Technical guidance and conflict resolution
The goal is to build a lean and efficient engineering team to support business aspirations (unless engineering is the business, the priority should be business).
Law of engineering: the engineering team should not scale linearly with business growth.
- Language: Typescript + ES6. Full stop.
- Frameworks:
- React >= 16.8 with functional components
- Get started with create-react-app
- hooks and context, and when they get unwieldy, redux
- Jest > Mocha
- TestCafe > Cypress > Selenium
- Shared common library. bit.dev has good UI and doubles as a package repo. lerna may be hard to use.
- yarn > npm
- webpack for bundling
- GraphQL for data querying
- axios for HTTP requests
- Single sign-on (SSO)
- Progressive web apps (PWA)
- Deep linking: a page can be addressed and shared by a link
- ReactNative
- Flutter
- Native language (Java or Swift)
One BFF (backend for frontend) per frontend app.
Usages:
- Resource intensive tasks
- Complex query/aggregation
- Protecting FE from unstable API changes
- OAuth 2 authorization code flow
- Special protocol (e.g., websocket)
- Follow the 12 factor app
- SOLID
- Coordination with leader election
The philosophy is to abstract common functionality into a shared layer for better governance and easier upgrades.
- Middleware
- Context: common fields across all request flows (e.g., trace ID, caller ID)
- Common data types (e.g., timestamp, currency, coordinates, country code, error codes)
- Authentication
- Configuration
- Cross service communication
- Protocol abstraction: the underlying protocol used should be hidden from user. This allows easier protocol updates (e.g., HTTP to gRPC)
- Service discovery: services should be addressable by a name.
- Handlers base class (e.g., message handler, request handler)
- DAO base class
- Cache library
- Logging
- Metrics
- Configuration files should live in the same repository as code.
- Secrets should go into a secret manager. Deployment infrastructure should inject the secrets at run time.
- Allow secret overriding using environment variables (for running locally).
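The override rule above can be sketched as follows. This is a minimal sketch, not a real secret-manager API; the function name, the `injected` dict, and the `DB_PASSWORD` key are all made up for illustration.

```python
import os

def load_secret(name: str, injected: dict) -> str:
    """Return a secret, letting an environment variable override the value
    injected by the deployment infrastructure (useful when running locally)."""
    return os.environ.get(name, injected.get(name, ""))

# Deployment would inject this at run time; it is hard-coded for the sketch.
injected_secrets = {"DB_PASSWORD": "from-secret-manager"}

os.environ.pop("DB_PASSWORD", None)  # make sure no stale override is set
print(load_secret("DB_PASSWORD", injected_secrets))  # prints "from-secret-manager"

os.environ["DB_PASSWORD"] = "local-override"
print(load_secret("DB_PASSWORD", injected_secrets))  # prints "local-override"
```

The lookup order (environment first, injected value second) is what makes local development painless without weakening the production path.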
Definition: a feature flag is a toggle that a program uses to decide its behaviour. This is useful when rolling out new features gradually.
A feature flag system has these concepts:
- Flag value (aka. toggle value): the value that a program gets for a feature flag
- Rule: a rule is associated with a flag, and it maps parameters to a flag value. E.g., a rule can map all users with age < 18 to flag value under-18.
- Feature management service: service that stores flags and rules. This service has admin APIs or UI to configure the flags and rules. E.g., LaunchDarkly.
Make an effort to keep the number of feature flags low.
Make a plan to remove unnecessary feature flags. A feature flag is no longer needed if 100% of the traffic is using the new feature.
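The rule concept above can be sketched as first-match evaluation. This is a toy model, not LaunchDarkly's API; the `Rule` class and the age rule mirror the under-18 example and are otherwise hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rule:
    # Matches request parameters and maps them to a flag value.
    predicate: Callable[[dict], bool]
    flag_value: str

def evaluate(rules: List[Rule], params: dict, default: str) -> str:
    """Return the flag value of the first matching rule, else the default."""
    for rule in rules:
        if rule.predicate(params):
            return rule.flag_value
    return default

# Mirrors the age < 18 example above.
age_rules = [Rule(lambda p: p.get("age", 0) < 18, "under-18")]
print(evaluate(age_rules, {"age": 15}, "default"))  # prints "under-18"
print(evaluate(age_rules, {"age": 30}, "default"))  # prints "default"
```

Removing a retired flag then amounts to deleting its rules and inlining the default.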
- Structured logging: variables get their own fields, log messages are static strings
- What to log
- Timestamp
- Trace ID
- Caller ID
- Service
- Environment
- When to log
- Service starts. Log configurations.
- Service crashes
- Assertions fail (a code path that should never happen)
- Errors are handled
- Log at least one message per happy path
- When not to log:
- Error propagation without handling
- Normal code flow
- Duplicates
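A minimal sketch of structured logging with the standard library: the message stays a static string, and the variables become their own JSON fields. The field values here are made up; the field names follow the "what to log" list above.

```python
import json
import logging
import sys

def structured(message: str, **fields) -> str:
    """Render a static message plus variables as their own JSON fields."""
    return json.dumps({"message": message, **fields})

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
log = logging.getLogger("orders")

log.info(structured("order created",
                    timestamp="2024-01-01T00:00:00Z",
                    trace_id="abc-123",
                    caller_id="svc-frontend",
                    service="orders",
                    environment="staging"))
```

Because the message is static, log aggregators can group occurrences by it while still filtering on the individual fields.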
- On demand (cost saving)
- No infrastructure
- No single point of failure
- High scalability
- Not suitable for long running tasks (due to timeout)
- Not suitable for resource intensive tasks
- Not suitable for programs with local persistence (e.g., memory cache)
- Reactive (i.e., requires external triggers to run), not suitable for pro-active tasks (e.g., periodic notification, heartbeating)
- Complex workflows with multiple functions need careful orchestration (e.g., Step Functions)
- Logs need different shipping mechanism (because you don't control the VM and cannot install log aggregation daemons)
- The owner should be the writer
- There should be only one writer to a data set (the moderator)
- The owner should provide libraries for reading the data. The libraries should hide the low level details of how data is retrieved (e.g., directly from DB, or via the owner).
- Data should have created and updated timestamps
- If multiple data versions can co-exist, several strategies:
- New table: good for isolation, bad for management (especially if tables are created by CD pipeline)
- A version column: good for management, bad for indexing and possible hot partition.
- If old version needs to be migrated to new version, consider a tool like AWS Data Pipeline.
- SQL
  - Pros
    - Join query is easy
    - Transaction is easy
  - Cons
    - Schema change is hard
    - Usually poor scalability
- NoSQL
  - Pros
    - Schema change is easy
    - Easy to scale
  - Cons
    - Can only query on indexes
    - No joining, making application code complex
    - Limited transaction support
    - Bad index design can result in hot partitions
SQL DBs usually scale computing and storage together, which can be wasteful. An exception is AWS Aurora, which scales them independently.
NoSQL uses sharding to achieve high scalability.
See concurrency for discussion on transaction and data consistency.
- For microservices, DB should be treated more like working memory than long term source of truth (which should be your data warehouse instead).
- Prefer NoSQL over SQL.
- Avoid ORMs (e.g., Hibernate, SQLAlchemy). They make your code bloated, less clear, and fragile.
- Always define a DAO layer in the application to expose an interface customised to the business logic. This reduces coupling of business logic to the DB, and improves testability.
- Implement data (un)marshalling in the DAO.
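A minimal sketch of such a DAO, assuming a hypothetical `User` entity; the in-memory dict stands in for a real DB driver and is not a real library API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class User:
    user_id: str
    name: str

class UserDAO:
    """Interface customised to the business logic; hides how data is stored."""

    def __init__(self):
        self._rows = {}  # stand-in for a real DB connection

    def save(self, user: User) -> None:
        # Marshal the domain object into a storage row.
        self._rows[user.user_id] = {"user_id": user.user_id, "name": user.name}

    def find(self, user_id: str) -> Optional[User]:
        # Unmarshal the storage row back into a domain object.
        row = self._rows.get(user_id)
        return User(**row) if row else None
```

Business logic depends only on the `UserDAO` interface, so tests can use this in-memory version while production wires in PostgreSQL or DynamoDB.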
When complex queries (multi-field filters, conditions, pagination, sorting, etc.) are needed, it's best to keep the indexes in a search engine.
This also makes it possible to use a simple DB (e.g., NoSQL).
Examples:
- Elasticsearch: index search
- Solr: text search
- Algolia: text search
The data is usually structured, with repeated and nested fields (e.g., JSON, YAML).
The DW therefore needs to handle them correctly. Columnar storage following Google Dremel whitepaper is ideal (e.g., Parquet format).
- Transactional data and event sourcing: model data changes as events, and store the events in the DW. Use cases:
- user activity analysis
- trend detection
- usage tracking
- Snapshot data: point-in-time data. Use cases:
- account balance
- inventory stock level
Snapshot data may be collected in several ways:
- exported from service DB
- constructed by playing back transactional data over last snapshot
- Have a data pipeline architecture as part of infrastructure.
- Define schema with version for all data types, with validation rules
- Validate incoming data before storing
- Common schema fields:
- Timestamp
- Trace ID
- Caller ID
- Service
- Environment
- Dedupe ID
- Is it test data? (without this, test and real data are mingled and it's painful to separate them later)
- Don't serve data from DW directly. Instead, use a pipeline to ETL the data into a service, then serve it using APIs.
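The versioned schema with validation can be sketched as a dataclass carrying the common fields listed above. The class name and the validation rule are deliberately minimal assumptions, not a real schema registry.

```python
from dataclasses import dataclass, asdict

# Hypothetical versioned event schema with the common fields listed above.
@dataclass
class EventV1:
    schema_version: int
    timestamp: str
    trace_id: str
    caller_id: str
    service: str
    environment: str
    dedupe_id: str
    is_test: bool

def validate(event: EventV1) -> bool:
    """Minimal rule: reject empty or missing required fields before the
    event reaches storage (booleans are allowed to be False)."""
    return all(v not in (None, "")
               for v in asdict(event).values()
               if not isinstance(v, bool))

good = EventV1(1, "2024-01-01T00:00:00Z", "abc-123", "svc-frontend",
               "orders", "prod", "dedupe-1", is_test=False)
print(validate(good))  # prints "True"
```

Bumping `schema_version` (e.g., adding `EventV2`) lets old and new producers coexist while the pipeline validates each against its own version.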
- Cache eviction: prevents out-of-memory issues. There are a number of strategies, with LRU being the most popular.
- Cache invalidation
- time-to-live (TTL)
- Event driven invalidation
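Both ideas can be combined in one sketch: LRU eviction to bound memory, plus TTL-based invalidation. This is illustrative only; in practice you would reach for `functools.lru_cache`, Redis, or similar.

```python
import time
from collections import OrderedDict

class LRUTTLCache:
    """LRU eviction bounds memory; TTL invalidates stale entries."""

    def __init__(self, capacity: int, ttl_seconds: float):
        self.capacity, self.ttl = capacity, ttl_seconds
        self._data = OrderedDict()  # key -> (value, expiry)

    def put(self, key, value):
        self._data[key] = (value, time.monotonic() + self.ttl)
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expiry = item
        if time.monotonic() > expiry:      # TTL invalidation
            del self._data[key]
            return None
        self._data.move_to_end(key)        # mark as recently used
        return value
```

Event-driven invalidation would add a `delete(key)` called from a change-event handler; TTL then acts as a safety net when an event is missed.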
- Choices
- OpenTracing
- Zipkin (written in Scala)
- Jaeger (written in Go)
- AWS X-Ray
- What tags to include with trace
- environment
- service
- API/endpoint invoked
- Request-response
- Long polling
- Websocket
- Task queue (aka. message queue)
- Broadcast
- Protocol buffers (abbr. protobuf). Usages:
- Remote procedure call (RPC)
- Inter-process communication (IPC)
- Message schema definition
- OpenAPI (aka. Swagger)
- Thrift, less popular
- Avro, less popular
SDKs for different languages can be generated from API definitions.
- Implement API gateway to handle:
- API routing
- Protocol translation (e.g., REST to Protobuf)
- Authentication
- Logging/metrics
- Usage auditing
- Limit the usage of polymorphic payload (if payload is different in structure, better to make it a different API)
- Error response is part of the design, not an afterthought.
- Standardise and regulate the use of error codes. Adhere to HTTP status code definitions.
- Treat HTTP 5xx status as system failures that require intervention (i.e., don't use them lightly).
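One way to standardise error responses is a single envelope builder that every handler goes through. The field names (`code`, `message`, `trace_id`) are illustrative assumptions, not a prescribed standard.

```python
import json

def error_response(status: int, code: str, message: str, trace_id: str):
    """Build a standardised error envelope.
    4xx means caller error; 5xx is reserved for genuine system failures."""
    assert 400 <= status < 600, "error responses use 4xx/5xx only"
    body = {"error": {"code": code, "message": message, "trace_id": trace_id}}
    return status, json.dumps(body)

status, body = error_response(404, "ORDER_NOT_FOUND",
                              "order does not exist", "abc-123")
print(status, body)
```

Funnelling errors through one builder makes the error-code vocabulary easy to regulate, and carrying the trace ID in the body lets callers quote it when they escalate.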
This ensures that services and APIs are addressable in the infrastructure (by a unique and stable name).
- Overlay networks
- Address: an abstract concept of where data should be sent, e.g., IP address
- Routers: interpret the address and send traffic to the correct endpoint
- DNS server: specialised service that resolves service name to address
This is relevant for SaaS systems, where multiple users/customers/partners share the same application and infrastructure but not data.
Data segregation is the most common technique used.
Infrastructure segregation does not scale well.
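Row-level data segregation can be sketched as scoping every query by a tenant ID; all tenants share one table, but no query can cross the boundary. Table and field names here are made up for illustration.

```python
# Shared table with a tenant_id column keeping tenants' data apart.
orders = [
    {"tenant_id": "acme", "order_id": 1},
    {"tenant_id": "acme", "order_id": 2},
    {"tenant_id": "globex", "order_id": 3},
]

def find_orders(rows, tenant_id: str):
    """Equivalent to `... WHERE tenant_id = ?` in SQL: every read is
    scoped to a single tenant, so tenants never see each other's rows."""
    return [r for r in rows if r["tenant_id"] == tenant_id]

print(find_orders(orders, "acme"))  # only acme's two orders
```

In practice the tenant ID comes from the authenticated context, never from user input, so a bug in a handler cannot widen the scope.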
https://martinfowler.com/articles/practical-test-pyramid.html
- Test the right thing, at the right level (API level > unit level > whole-system level). E.g., the DAO (presumably having more complexity) deserves more testing than HTTP handlers.
- Aim for quality, not coverage. E.g., 90% DAO test coverage with mocked DB isn't better than 70% with real DB.
- Higher level tests should be more general, lower level tests should be more specific (e.g., cover edge cases)
- You can't cover everything in test, but you can make sure you know how to fix it when it breaks (e.g., with good monitoring/logging)
- Use BDD style (i.e., structure test as a scenario) but not BDD itself (i.e., don't do scenario-to-code translation)
- Make test data identifiable. Test data should never interfere with real data.
- Don't be afraid to run tests in production. This requires building the application with testability in mind.
- Scenario description
- Number of users
- Scenario of each user
- Ramp up/cool down period
- Metrics
- Response time
- Throughput
- Error rate
- Monitoring: make sure the load generator isn't stressed out, by monitoring its CPU/memory/network
- Scaling: run multiple instances, and aggregate the logs.
- Tools
- Gatling: open source, written in Scala, good report UI, own DSL
- JMeter: open source, written in Java, hard to configure
- Locust: open source, written in Python
- NeoLoad: commercial
- BlazeMeter: commercial
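The metrics above can be illustrated with a toy standard-library load loop. This is not a substitute for the tools listed (which handle ramp-up, distributed load generation, and reporting); `fake_request` is a made-up stand-in for an HTTP call.

```python
import random
import statistics
import time

def fake_request() -> bool:
    """Stand-in for an HTTP call; sleeps briefly and fails ~5% of the time."""
    time.sleep(random.uniform(0.001, 0.003))
    return random.random() > 0.05

def run_load(n_requests: int) -> dict:
    """Fire requests sequentially and compute the metrics listed above."""
    latencies, errors = [], 0
    start = time.monotonic()
    for _ in range(n_requests):
        t0 = time.monotonic()
        ok = fake_request()
        latencies.append(time.monotonic() - t0)
        errors += 0 if ok else 1
    elapsed = time.monotonic() - start
    return {
        "p50_response_time": statistics.median(latencies),
        "throughput_rps": n_requests / elapsed,
        "error_rate": errors / n_requests,
    }

print(run_load(100))
```

Even this toy version shows why monitoring the load generator matters: a stressed generator inflates the measured response time and deflates the throughput.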
- Limit scope of variables
- Consistent naming
- Declare constants at top
- Always use a linter and integrate it into CI
- Encourage the use of IDEs
- Reproducible builds: Use a package manager that can lock dependency versions
- Aim for level 5 of CI maturity
- Configuration as code
- Infrastructure as code
- No ad-hoc fixes
- Immutable infrastructure (no hot fixes)
- Developers should self-service
- Pager is the last line of escalation. Use it sparingly.
Components with shared ownership should be considered a piece of infrastructure, and managed in a single place (instead of distributed across repositories/codebases).
Examples:
- Service API definitions (service provider != API owner)
- Message schema definitions
- Documentation
- Data in data warehouse
CLIs and scripts should be the preferred way of automation.
They should be well documented, versioned and published for easy installation. Example: goreleaser
- Circle CI
- Travis CI
- GitHub Actions
- Concourse CI: usually used as a deployment pipeline rather than a code builder
- Jenkins
- Bamboo: commercial
- Deploy from master (never branches)
- Canary release
- Blue/green deployment
- Acceptance test on the full platform
- Healthcheck endpoints for long running services
- Tracing
- Service dependency graph based on traffic and healthchecks. This makes service upgrade/decommission safer
- Service metrics and dashboard
QA > writing tests
QA is part of DevOps, not a separate team
- Provide tooling/library/framework/process to make low level testing self-serviced by developers (unit tests, component tests, load tests).
- Develop and own end user and high level tests, from an organisation or company perspective.
- Test automation, reducing manual intervention.
- Standardise test methodology across teams.
- Reduce noise from fragile tests, false positives, and long-known bugs, to prevent distraction and increase sensitivity to true positives across the company.
- Big O notation for both time and space complexity
- Cyclomatic complexity: how many logical paths a function has
- Models of concurrency
- Thread
- Event loop + Asynchronous IO
- Message passing and CSP (including actor systems)
- Common errors
- Distributed systems
- CAP theorem
- Consensus algorithms: Paxos, Raft
- Eventual consistency
- Gossip protocol
- BASE
- Warning: not suitable for systems requiring ACID, e.g., bank account transfers.
- Leader election
- Sharding
- LSM (log-structured merge) tree: used by most NoSQL DBs to ensure no data loss
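The message-passing/CSP model above can be sketched with the standard library: workers communicate only through channels (queues), never shared mutable state. A minimal sketch, not a full actor system.

```python
import queue
import threading

def worker(inbox: queue.Queue, outbox: queue.Queue):
    """CSP-style process: receive on one channel, send on another.
    No state is shared with the caller."""
    while True:
        item = inbox.get()
        if item is None:  # sentinel value: shut down
            break
        outbox.put(item * item)

inbox, outbox = queue.Queue(), queue.Queue()
t = threading.Thread(target=worker, args=(inbox, outbox))
t.start()

for n in range(5):
    inbox.put(n)
inbox.put(None)  # tell the worker to stop
t.join()

results = sorted(outbox.get() for _ in range(5))
print(results)  # prints "[0, 1, 4, 9, 16]"
```

Because all coordination happens through the queues, there is nothing to lock, which is the property that makes CSP and actor systems easier to reason about than raw threads.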
TBC