Skip to content

databricks/scala-style-guide

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

142 Commits
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

Databricks Scala Guide

At Databricks, our engineers work on some of the most actively developed Scala codebases in the world, including our own internal repo called "universe" as well as the various open source projects we contribute to, e.g. Apache Spark and Delta Lake. This guide draws from our experience coaching and working with our engineering teams as well as the broader open source community.

Code is written once by its author, but read and modified multiple times by lots of other engineers. As most bugs actually come from future modification of the code, we need to optimize our codebase for long-term, global readability and maintainability. The best way to achieve this is to write simple code.

Scala is an incredibly powerful language that is capable of many paradigms. We have found that the following guidelines work well for us on projects with high velocity. Depending on the needs of your team, your mileage might vary.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

  1. Document History

  2. Syntactic Style

  3. Scala Language Features

  4. Concurrency

  5. Performance

  6. Java Interoperability

  7. Testing

  8. Miscellaneous

We mostly follow Java's and Scala's standard naming conventions.

  • Classes, traits, objects should follow Java class convention, i.e. PascalCase style.

    class ClusterManager
    
    trait Expression
  • Packages should follow Java package naming conventions, i.e. all-lowercase ASCII letters.

    package com.databricks.resourcemanager
  • Methods/functions should be named in camelCase style.

  • Constants should be all uppercase letters and be put in a companion object.

    object Configuration {
      val DEFAULT_PORT = 10000
    }
  • An enumeration class or object which extends the Enumeration class shall follow the convention for classes and objects, i.e. its name should be in PascalCase style. Enumeration values shall be in the upper case with words separated by the underscore character _. For example:

      private object ParseState extends Enumeration {
      type ParseState = Value
    
      val PREFIX,
          TRIM_BEFORE_SIGN,
          SIGN,
          TRIM_BEFORE_VALUE,
          VALUE,
          VALUE_FRACTIONAL_PART,
          TRIM_BEFORE_UNIT,
          UNIT_BEGIN,
          UNIT_SUFFIX,
          UNIT_END = Value
    }
  • Annotations should also follow Java convention, i.e. PascalCase. Note that this differs from Scala's official guide.

    final class MyAnnotation extends StaticAnnotation
  • Variables should be named in camelCase style, and should have self-evident names.

    val serverPort = 1000
    val clientPort = 2000
  • It is OK to use one-character variable names in small, localized scope. For example, "i" is commonly used as the loop index for a small loop body (e.g. 10 lines of code). However, do NOT use "l" (as in Larry) as the identifier, because it is difficult to differentiate "l" from "1", "|", and "I".

  • Limit lines to 100 characters.
  • The only exceptions are import statements and URLs (although even for those, try to keep them under 100 chars).

"If an element consists of more than 30 subelements, it is highly probable that there is a serious problem" - Refactoring in Large Software Projects.

In general:

  • A method should contain less than 30 lines of code.
  • A class should contain less than 30 methods.
  • Put one space before and after operators, including the assignment operator.

    def add(int1: Int, int2: Int): Int = int1 + int2
  • Put one space after commas.

    Seq("a", "b", "c") // do this
    
    Seq("a","b","c") // don't omit spaces after commas
  • Put one space after colons.

    // do this
    def getConf(key: String, defaultValue: String): String = {
      // some code
    }
    
    // don't put spaces before colons
    def calculateHeaderPortionInBytes(count: Int) : Int = {
      // some code
    }
    
    // don't omit spaces after colons
    def multiply(int1:Int, int2:Int): Int = int1 * int2
  • Use 2-space indentation in general.

    if (true) {
      println("Wow!")
    }
  • For method declarations, use 4 space indentation for their parameters and put each in each line when the parameters don't fit in two lines. Return types can be either on the same line as the last parameter, or start a new line with 2 space indent.

    def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]](
        path: String,
        fClass: Class[F],
        kClass: Class[K],
        vClass: Class[V],
        conf: Configuration = hadoopConfiguration): RDD[(K, V)] = {
      // method body
    }
    
    def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]](
        path: String,
        fClass: Class[F],
        kClass: Class[K],
        vClass: Class[V],
        conf: Configuration = hadoopConfiguration)
      : RDD[(K, V)] = {
      // method body
    }
  • For classes whose header doesn't fit in two lines, use 4 space indentation for its parameters, put each in each line, put the extends on the next line with 2 space indent, and add a blank line after class header.

    class Foo(
        val param1: String,  // 4 space indent for parameters
        val param2: String,
        val param3: Array[Byte])
      extends FooInterface  // 2 space indent here
      with Logging {
    
      def firstMethod(): Unit = { ... }  // blank line above
    }
  • For method and class constructor invocations, use 2 space indentation for its parameters and put each in each line when the parameters don't fit in two lines.

    foo(
      someVeryLongFieldName,  // 2 space indent here
      andAnotherVeryLongFieldName,
      "this is a string",
      3.1415)
    
    new Bar(
      someVeryLongFieldName,  // 2 space indent here
      andAnotherVeryLongFieldName,
      "this is a string",
      3.1415)
  • Do NOT use vertical alignment. They draw attention to the wrong parts of the code and make the aligned code harder to change in the future.

    // Don't align vertically
    val plus     = "+"
    val minus    = "-"
    val multiply = "*"
    
    // Do the following
    val plus = "+"
    val minus = "-"
    val multiply = "*"
  • A single blank line appears:
    • Between consecutive members (or initializers) of a class: fields, constructors, methods, nested classes, static initializers, instance initializers.
      • Exception: A blank line between two consecutive fields (having no other code between them) is optional. Such blank lines are used as needed to create logical groupings of fields.
    • Within method bodies, as needed to create logical groupings of statements.
    • Optionally before the first member or after the last member of the class (neither encouraged nor discouraged).
  • Use one or two blank line(s) to separate class or object definitions.
  • Excessive number of blank lines is discouraged.
  • Methods should be declared with parentheses, unless they are accessors that have no side-effect (state mutation, I/O operations are considered side-effects).
    class Job {
      // Wrong: killJob changes state. Should have ().
      def killJob: Unit
    
      // Correct:
      def killJob(): Unit
    }
  • Callsite should follow method declaration, i.e. if a method is declared with parentheses, call with parentheses. Note that this is not just syntactic. It can affect correctness when apply is defined in the return object.
    class Foo {
      def apply(args: String*): Int
    }
    
    class Bar {
      def foo: Foo
    }
    
    new Bar().foo  // This returns a Foo
    new Bar().foo()  // This returns an Int!

Put curly braces even around one-line conditional or loop statements. The only exception is if you are using if/else as an one-line ternary operator that is also side-effect free.

// Correct:
if (true) {
  println("Wow!")
}

// Correct:
if (true) statement1 else statement2

// Correct:
try {
  foo()
} catch {
  ...
}

// Wrong:
if (true)
  println("Wow!")

// Wrong:
try foo() catch {
  ...
}

Suffix long literal values with uppercase L. It is often hard to differentiate lowercase l from 1.

val longValue = 5432L  // Do this

val longValue = 5432l  // Do NOT do this

Use Java docs style instead of Scala docs style.

/** This is a correct one-liner, short description. */

/**
 * This is correct multi-line JavaDoc comment. And
 * this is my second line, and if I keep typing, this would be
 * my third line.
 */

/** In Spark, we don't use the ScalaDoc style so this
  * is not correct.
  */

If a class is long and has many methods, group them logically into different sections, and use comment headers to organize them.

class DataFrame {

  ///////////////////////////////////////////////////////////////////////////
  // DataFrame operations
  ///////////////////////////////////////////////////////////////////////////

  ...

  ///////////////////////////////////////////////////////////////////////////
  // RDD operations
  ///////////////////////////////////////////////////////////////////////////

  ...
}

Of course, the situation in which a class grows this long is strongly discouraged, and is generally reserved only for building certain public APIs.

  • Avoid using wildcard imports, unless you are importing more than 6 entities, or implicit methods. Wildcard imports make the code less robust to external changes.

  • Always import packages using absolute paths (e.g. scala.util.Random) instead of relative ones (e.g. util.Random).

  • In addition, sort imports in the following order:

    • java.* and javax.*
    • scala.*
    • Third-party libraries (org.*, com.*, etc)
    • Project classes (com.databricks.* or org.apache.spark if you are working on Spark)
  • Within each group, imports should be sorted in alphabetic ordering.

  • You can use IntelliJ's import organizer to handle this automatically, using the following config:

    java
    javax
    _______ blank line _______
    scala
    _______ blank line _______
    all other imports
    _______ blank line _______
    com.databricks  // or org.apache.spark if you are working on Spark
    
  • For method whose entire body is a pattern match expression, put the match on the same line as the method declaration if possible to reduce one level of indentation.

    def test(msg: Message): Unit = msg match {
      case ...
    }
  • When calling a function with a closure (or partial function), if there is only one case, put the case on the same line as the function invocation.

    list.zipWithIndex.map { case (elem, i) =>
      // ...
    }

    If there are multiple cases, indent and wrap them.

    list.map {
      case a: Foo =>  ...
      case b: Bar =>  ...
    }
  • If the only goal is to match on the type of the object, do NOT expand fully all the arguments, as it makes refactoring more difficult and the code more error prone.

    case class Pokemon(name: String, weight: Int, hp: Int, attack: Int, defense: Int)
    case class Human(name: String, hp: Int)
    
    // Do NOT do the following, because
    // 1. When a new field is added to Pokemon, we need to change this pattern matching as well
    // 2. It is easy to mismatch the arguments, especially for the ones that have the same data types
    targets.foreach {
      case target @ Pokemon(_, _, hp, _, defense) =>
        val loss = sys.min(0, myAttack - defense)
        target.copy(hp = hp - loss)
      case target @ Human(_, hp) =>
        target.copy(hp = hp - myAttack)
    }
    
    // Do this:
    targets.foreach {
      case target: Pokemon =>
        val loss = sys.min(0, myAttack - target.defense)
        target.copy(hp = target.hp - loss)
      case target: Human =>
        target.copy(hp = target.hp - myAttack)
    }

Avoid infix notation for methods that aren't symbolic methods (i.e. operator overloading).

// Correct
list.map(func)
string.contains("foo")

// Wrong
list map (func)
string contains "foo"

// But overloaded operators should be invoked in infix style
arrayBuffer += elem

Avoid excessive parentheses and curly braces for anonymous methods.

// Correct
list.map { item =>
  ...
}

// Correct
list.map(item => ...)

// Wrong
list.map(item => {
  ...
})

// Wrong
list.map { item => {
  ...
}}

// Wrong
list.map({ item => ... })

Case classes are regular classes but extended by the compiler to automatically support:

  • Public getters for constructor parameters
  • Copy constructor
  • Pattern matching on constructor parameters
  • Automatic toString/hash/equals implementation

Constructor parameters should NOT be mutable for case classes. Instead, use copy constructor. Having mutable case classes can be error prone, e.g. hash maps might place the object in the wrong bucket using the old hash code.

// This is OK
case class Person(name: String, age: Int)

// This is NOT OK
case class Person(name: String, var age: Int)

// To change values, use the copy constructor to create a new instance
val p1 = Person("Peter", 15)
val p2 = p1.copy(age = 16)

Avoid defining apply methods on classes. These methods tend to make the code less readable, especially for people less familiar with Scala. It is also harder for IDEs (or grep) to trace. In the worst case, it can also affect correctness of the code in surprising ways, as demonstrated in Parentheses.

It is acceptable to define apply methods on companion objects as factory methods. In these cases, the apply method should return the companion class type.

object TreeNode {
  // This is OK
  def apply(name: String): TreeNode = ...

  // This is bad because it does not return a TreeNode
  def apply(name: String): String = ...
}

Always add override modifier for methods, both for overriding concrete methods and implementing abstract methods. The Scala compiler does not require override for implementing abstract methods. However, we should always add override to make the override obvious, and to avoid accidental non-overrides due to non-matching signatures.

trait Parent {
  def hello(data: Map[String, String]): Unit = {
    print(data)
  }
}

class Child extends Parent {
  import scala.collection.Map

  // The following method does NOT override Parent.hello,
  // because the two Maps have different types.
  // If we added "override" modifier, the compiler would've caught it.
  def hello(data: Map[String, String]): Unit = {
    print("This is supposed to override the parent method, but it is actually not!")
  }
}

Destructuring bind (sometimes called tuple extraction) is a convenient way to assign two variables in one expression.

val (a, b) = (1, 2)

However, do NOT use them in constructors, especially when a and b need to be marked transient. The Scala compiler generates an extra Tuple2 field that will not be transient for the above example.

class MyClass {
  // This will NOT work because the compiler generates a non-transient Tuple2
  // that points to both a and b.
  @transient private val (a, b) = someFuncThatReturnsTuple2()
}

Avoid using call by name. Use () => T explicitly.

Background: Scala allows method parameters to be defined by-name, e.g. the following would work:

def print(value: => Int): Unit = {
  println(value)
  println(value + 1)
}

var a = 0
def inc(): Int = {
  a += 1
  a
}

print(inc())

in the above code, inc() is passed into print as a closure and is executed (twice) in the print method, rather than being passed in as a value 1. The main problem with call-by-name is that the caller cannot differentiate between call-by-name and call-by-value, and thus cannot know for sure whether the expression will be executed or not (or maybe worse, multiple times). This is especially dangerous for expressions that have side-effect.

Avoid using multiple parameter lists. They complicate operator overloading, and can confuse programmers less familiar with Scala. For example:

// Avoid this!
case class Person(name: String, age: Int)(secret: String)

One notable exception is the use of a 2nd parameter list for implicits when defining low-level libraries. That said, implicits should be avoided!

Do NOT use symbolic method names, unless you are defining them for natural arithmetic operations (e.g. +, -, *, /). Under no other circumstances should they be used. Symbolic method names make it very hard to understand the intent of the methods. Consider the following two examples:

// symbolic method names are hard to understand
channel ! msg
stream1 >>= stream2

// self-evident what is going on
channel.send(msg)
stream1.join(stream2)

Scala type inference, especially left-side type inference and closure inference, can make code more concise. That said, there are a few cases where explicit typing should be used:

  • Public methods should be explicitly typed, otherwise the compiler's inferred type can often surprise you.
  • Implicit methods should be explicitly typed, otherwise it can crash the Scala compiler with incremental compilation.
  • Variables or closures with non-obvious types should be explicitly typed. A good litmus test is that explicit typ