Server / Maintenance tasks, Operations

See also the list on Basecamp, which is not yet integrated here.

The "Future Refinements" section of the DCE report is also not yet fully integrated.

Short-term needs

Best Practice

Ongoing

Sysadmin

Walk through performing an actual backup recovery. Document the steps and how we determine whether data has been lost.
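One possible way to determine whether data has been lost after a restore is to compare checksums of the original files against the restored copy. The sketch below is only an illustration under stated assumptions: the function names and directory layout are hypothetical, and it assumes the source directory has not changed since the backup was taken (in practice we would more likely compare against a checksum manifest recorded at backup time).

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_tree(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in root.rglob("*")
        if path.is_file()
    }


def compare_restore(source_root, restored_root):
    """Report files missing from, changed in, or extra in the restored copy."""
    source = checksum_tree(source_root)
    restored = checksum_tree(restored_root)
    return {
        "missing": sorted(set(source) - set(restored)),
        "changed": sorted(
            name for name in set(source) & set(restored)
            if source[name] != restored[name]
        ),
        "extra": sorted(set(restored) - set(source)),
    }
```

An empty "missing" and "changed" list would suggest that nothing was lost for the files covered. It says nothing about database contents, which would need their own check.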

Note: these tasks result in related ongoing maintenance.

Set up service monitoring
Set up log analysis
Perform risk assessments and business impact analyses (BIA); keep these up to date
Help design and implement redundancies (e.g. failover server) for needs identified by BIA. Execute redundancies as needed

OS-level updates and upgrades
Security patches, including monitoring for newly announced vulnerabilities
Backup script maintenance
AWS expertise
Own and maintain deployment scripts
Help coordinate and perform large-scale upgrades (e.g. those that require spinning up new boxes and doing switch-overs of drives or DNS entries)
Track storage use over time and coordinate projections of future use
Create and manage SSL certs
Manage user (server) accounts
Firewall configuration
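For the storage item in the list above (storage use over time and projections), a minimal sketch of one way to project when a volume fills up is a straight-line fit over periodic usage samples. The function name, the sample format, and the assumption of roughly linear growth are all illustrative, not something we have decided on.

```python
def project_days_until_full(samples, capacity_gb):
    """Estimate days remaining until capacity is reached.

    samples: list of (day_number, used_gb) pairs, e.g. one per week.
    Returns days after the last sample, or None if usage is not growing.
    Assumes roughly linear growth (ordinary least-squares fit).
    """
    count = len(samples)
    if count < 2:
        raise ValueError("need at least two samples")

    mean_day = sum(day for day, _ in samples) / count
    mean_used = sum(used for _, used in samples) / count
    spread = sum((day - mean_day) ** 2 for day, _ in samples)
    if spread == 0:
        raise ValueError("samples must cover more than one day")

    # Slope is growth in GB per day; intercept is the fitted usage at day 0.
    slope = sum(
        (day - mean_day) * (used - mean_used) for day, used in samples
    ) / spread
    intercept = mean_used - slope * mean_day

    if slope <= 0:
        return None  # flat or shrinking usage: no projected fill date

    full_day = (capacity_gb - intercept) / slope
    last_day = max(day for day, _ in samples)
    return max(0.0, full_day - last_day)
```

For example, samples of 100, 120, and 140 GB taken on days 0, 10, and 20 against a 200 GB volume give a projection of 30 days after the last sample.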

Grey area: responsibility shared, unclear, or variable

Database administration / tuning
Integrate Hydra user accounts with CHF LDAP server
Monitor and benchmark the JVM; adjust heap size and garbage collection settings as needed

Ops

Set up CI server or service
Set up security filters for incoming / outgoing code
Modify the new Ansible project to work with Vagrant to create a development environment

Configure differences between staging, prod, and test environments in Ansible and Capistrano

Conversation topics for sysadmin friends:

Scope of duties
Current projects
Do you do any "ops"-y stuff? Would you want to?
Project backlog
Routine duties
Do you also do coding?
How many boxes do you manage?
Have you experienced or simulated data loss & recovery?
Do you have/do the things in our "best practice" column?
How do you define sysadmin vs. developer responsibilities?
AWS: should volumes be deleted on instance termination, for attached as well as root volumes? My instinct is to turn this off and delete volumes manually (or schedule a job to delete all "available" volumes older than X days). But I wonder whether there is a scenario in which it makes sense to keep it on.
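The scheduled-cleanup idea in the last question could look roughly like the sketch below. It only selects candidates and deletes nothing. The function name is hypothetical; the dictionary keys (`VolumeId`, `State`, `CreateTime`) follow the shape of the EC2 DescribeVolumes response. One caveat: `CreateTime` is when the volume was created, not when it was detached, so "older than X days" here means age since creation.

```python
from datetime import datetime, timedelta, timezone


def stale_available_volumes(volumes, max_age_days, now=None):
    """Return IDs of unattached ("available") volumes older than max_age_days.

    volumes: list of dicts shaped like entries in a DescribeVolumes response,
    with timezone-aware datetimes in "CreateTime".
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [
        volume["VolumeId"]
        for volume in volumes
        if volume["State"] == "available" and volume["CreateTime"] < cutoff
    ]
```

Assuming boto3 is used, the input list would come from something like `ec2.describe_volumes()["Volumes"]`, and a reviewed list of IDs could then be passed one at a time to `ec2.delete_volume(VolumeId=...)`. Logging the candidates for a while before enabling deletion would be a cautious first step.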