Databricks Scala Guide

At Databricks, our engineers work on some of the most actively developed Scala codebases in the world, including our own internal repo called "universe" as well as the various open source projects we contribute to, e.g. Apache Spark and Delta Lake. This guide draws from our experience coaching and working with our engineering teams as well as the broader open source community.

Code is written once by its author, but read and modified multiple times by lots of other engineers. As most bugs actually come from future modification of the code, we need to optimize our codebase for long-term, global readability and maintainability. The best way to achieve this is to write simple code.

Scala is an incredibly powerful language that supports many paradigms. We have found that the following guidelines work well for us on projects with high velocity. Depending on the needs of your team, your mileage might vary.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Document History

2015-03-16: Initial version.
2015-05-25: Added override Modifier section.
2015-08-23: Downgraded the severity of some rules from "do NOT" to "avoid".
2015-11-17: Updated apply Method section: apply method in companion object should return the companion class.
2015-11-17: This guide has been translated into Chinese. The Chinese translation is contributed by community member Hawstein. We do not guarantee that it will always be kept up-to-date.
2015-12-14: This guide has been translated into Korean. The Korean translation is contributed by Hyukjin Kwon and reviewed by Yun Park, Kevin (Sangwoo) Kim, Hyunje Jo and Woochel Choi. We do not guarantee that it will always be kept up-to-date.
2016-06-15: Added Anonymous Methods section.
2016-06-21: Added Variable Naming Convention section.
2016-12-24: Added Case Classes and Immutability section.
2017-02-23: Added Testing section.
2017-04-18: Added Prefer existing well-tested methods over reinventing the wheel section.
2019-12-18: Added Symbol Literals section.
2022-08-05: Updated Monadic Chaining section: do not monadic-chain with an if-else block.

Syntactic Style

Naming Convention

We mostly follow Java's and Scala's standard naming conventions.

Classes, traits, and objects should follow Java class convention, i.e. PascalCase style.
```
class ClusterManager

trait Expression
```
Packages should follow Java package naming conventions, i.e. all-lowercase ASCII letters.
```
package com.databricks.resourcemanager
```
Methods/functions should be named in camelCase style.
Constants should be all uppercase letters and be put in a companion object.
```
object Configuration {
  val DEFAULT_PORT = 10000
}
```

An enumeration class or object which extends the Enumeration class shall follow the convention for classes and objects, i.e. its name should be in PascalCase style. Enumeration values shall be in upper case, with words separated by the underscore character _. For example:

private object ParseState extends Enumeration {
  type ParseState = Value

  val PREFIX,
      TRIM_BEFORE_SIGN,
      SIGN,
      TRIM_BEFORE_VALUE,
      VALUE,
      VALUE_FRACTIONAL_PART,
      TRIM_BEFORE_UNIT,
      UNIT_BEGIN,
      UNIT_SUFFIX,
      UNIT_END = Value
}

Annotations should also follow Java convention, i.e. PascalCase. Note that this differs from Scala's official guide.
```
final class MyAnnotation extends StaticAnnotation
```

Variable Naming Convention

Variables should be named in camelCase style, and should have self-evident names.
```
val serverPort = 1000
val clientPort = 2000
```
It is OK to use one-character variable names in small, localized scope. For example, "i" is commonly used as the loop index for a small loop body (e.g. 10 lines of code). However, do NOT use "l" (as in Larry) as the identifier, because it is difficult to differentiate "l" from "1", "|", and "I".

Line Length

Limit lines to 100 characters.
The only exceptions are import statements and URLs (although even for those, try to keep them under 100 chars).

Rule of 30

"If an element consists of more than 30 subelements, it is highly probable that there is a serious problem" - Refactoring in Large Software Projects.

In general:

- A method should contain fewer than 30 lines of code.
- A class should contain fewer than 30 methods.

Spacing and Indentation

Put one space before and after operators, including the assignment operator.
```
def add(int1: Int, int2: Int): Int = int1 + int2
```

Put one space after commas.

Seq("a", "b", "c") // do this

Seq("a","b","c") // don't omit spaces after commas

Put one space after colons.

// do this
def getConf(key: String, defaultValue: String): String = {
  // some code
}

// don't put spaces before colons
def calculateHeaderPortionInBytes(count: Int) : Int = {
  // some code
}

// don't omit spaces after colons
def multiply(int1:Int, int2:Int): Int = int1 * int2

Use 2-space indentation in general.
```
if (true) {
  println("Wow!")
}
```

For method declarations, use 4 space indentation for the parameters and put each on its own line when the parameters don't fit in two lines. Return types can be either on the same line as the last parameter, or start a new line with 2 space indent.

def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]](
    path: String,
    fClass: Class[F],
    kClass: Class[K],
    vClass: Class[V],
    conf: Configuration = hadoopConfiguration): RDD[(K, V)] = {
  // method body
}

def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]](
    path: String,
    fClass: Class[F],
    kClass: Class[K],
    vClass: Class[V],
    conf: Configuration = hadoopConfiguration)
  : RDD[(K, V)] = {
  // method body
}

For classes whose header doesn't fit in two lines, use 4 space indentation for the parameters, put each on its own line, put the extends clause on the next line with 2 space indent, and add a blank line after the class header.

class Foo(
    val param1: String,  // 4 space indent for parameters
    val param2: String,
    val param3: Array[Byte])
  extends FooInterface  // 2 space indent here
  with Logging {

  def firstMethod(): Unit = { ... }  // blank line above
}

For method and class constructor invocations, use 2 space indentation for the parameters and put each on its own line when the parameters don't fit in two lines.

foo(
  someVeryLongFieldName,  // 2 space indent here
  andAnotherVeryLongFieldName,
  "this is a string",
  3.1415)

new Bar(
  someVeryLongFieldName,  // 2 space indent here
  andAnotherVeryLongFieldName,
  "this is a string",
  3.1415)

Do NOT use vertical alignment. It draws attention to the wrong parts of the code and makes the aligned code harder to change in the future.

// Don't align vertically
val plus     = "+"
val minus    = "-"
val multiply = "*"

// Do the following
val plus = "+"
val minus = "-"
val multiply = "*"

Blank Lines (Vertical Whitespace)

A single blank line appears:
- Between consecutive members (or initializers) of a class: fields, constructors, methods, nested classes, static initializers, instance initializers.
  - Exception: A blank line between two consecutive fields (having no other code between them) is optional. Such blank lines are used as needed to create logical groupings of fields.
- Within method bodies, as needed to create logical groupings of statements.
- Optionally before the first member or after the last member of the class (neither encouraged nor discouraged).
Use one or two blank line(s) to separate class or object definitions.
Excessive number of blank lines is discouraged.

Parentheses

Methods should be declared with parentheses, unless they are accessors that have no side effects (state mutation and I/O operations are considered side effects).
```
class Job {
  // Wrong: killJob changes state. Should have ().
  def killJob: Unit

  // Correct:
  def killJob(): Unit
}
```
The call site should follow the method declaration, i.e. if a method is declared with parentheses, call it with parentheses. Note that this is not just syntactic. It can affect correctness when apply is defined on the returned object.
```
class Foo {
  def apply(args: String*): Int
}

class Bar {
  def foo: Foo
}

new Bar().foo  // This returns a Foo
new Bar().foo()  // This returns an Int!
```

Curly Braces

Put curly braces even around one-line conditional or loop statements. The only exception is if you are using if/else as a one-line ternary operator that is also side-effect free.

// Correct:
if (true) {
  println("Wow!")
}

// Correct:
if (true) statement1 else statement2

// Correct:
try {
  foo()
} catch {
  ...
}

// Wrong:
if (true)
  println("Wow!")

// Wrong:
try foo() catch {
  ...
}

Long Literals

Suffix long literal values with uppercase L. It is often hard to differentiate lowercase l from 1.

val longValue = 5432L  // Do this

val longValue = 5432l  // Do NOT do this

Documentation Style

Use JavaDoc style instead of ScalaDoc style.

/** This is a correct one-liner, short description. */

/**
 * This is correct multi-line JavaDoc comment. And
 * this is my second line, and if I keep typing, this would be
 * my third line.
 */

/** In Spark, we don't use the ScalaDoc style so this
  * is not correct.
  */

Ordering within a Class

If a class is long and has many methods, group them logically into different sections, and use comment headers to organize them.

class DataFrame {

  ///////////////////////////////////////////////////////////////////////////
  // DataFrame operations
  ///////////////////////////////////////////////////////////////////////////

  ...

  ///////////////////////////////////////////////////////////////////////////
  // RDD operations
  ///////////////////////////////////////////////////////////////////////////

  ...
}

Of course, letting a class grow this long is strongly discouraged, and is generally reserved only for building certain public APIs.

Imports

Avoid using wildcard imports, unless you are importing more than 6 entities, or implicit methods. Wildcard imports make the code less robust to external changes.
Always import packages using absolute paths (e.g. scala.util.Random) instead of relative ones (e.g. util.Random).
In addition, sort imports in the following order:
- java.* and javax.*
- scala.*
- Third-party libraries (org.*, com.*, etc)
- Project classes (com.databricks.* or org.apache.spark if you are working on Spark)
Within each group, imports should be sorted in alphabetic ordering.

You can use IntelliJ's import organizer to handle this automatically, using the following config:

java
javax
_______ blank line _______
scala
_______ blank line _______
all other imports
_______ blank line _______
com.databricks  // or org.apache.spark if you are working on Spark
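
As a rough illustration of the ordering above, an import block might look like the following sketch. The third-party and project imports are commented out because their names (`org.example.thirdparty.SomeClient`, `com.databricks.example.ClusterManager`) are hypothetical placeholders, and `ImportOrderExample` exists only so that every real import is used.

```scala
import java.util.concurrent.TimeUnit
import javax.crypto.Cipher

import scala.collection.mutable.ArrayBuffer
import scala.util.Random

// Third-party group (hypothetical dependency, commented out so the sketch compiles alone):
// import org.example.thirdparty.SomeClient

// Project group (hypothetical package):
// import com.databricks.example.ClusterManager

object ImportOrderExample {
  // Touches each import once so that none of them is unused.
  def describe(): String = {
    val values = ArrayBuffer(TimeUnit.SECONDS.toMillis(1))
    values += new Random(42L).nextInt(1).toLong  // nextInt(1) is always 0
    s"${values.mkString(",")} ${Cipher.ENCRYPT_MODE}"
  }
}
```

Note that `scala.util.Random` is imported by its absolute path, and that each group is sorted alphabetically and separated from the next by a blank line.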

Pattern Matching

For a method whose entire body is a pattern match expression, put the match on the same line as the method declaration if possible to reduce one level of indentation.
```
def test(msg: Message): Unit = msg match {
  case ...
}
```
When calling a function with a closure (or partial function), if there is only one case, put the case on the same line as the function invocation.
```
list.zipWithIndex.map { case (elem, i) =>
  // ...
}
```
If there are multiple cases, indent and wrap them.
```
list.map {
  case a: Foo =>  ...
  case b: Bar =>  ...
}
```

If the only goal is to match on the type of the object, do NOT fully expand all the arguments, as it makes refactoring more difficult and the code more error prone.

case class Pokemon(name: String, weight: Int, hp: Int, attack: Int, defense: Int)
case class Human(name: String, hp: Int)

// Do NOT do the following, because
// 1. When a new field is added to Pokemon, we need to change this pattern matching as well
// 2. It is easy to mismatch the arguments, especially for the ones that have the same data types
targets.foreach {
  case target @ Pokemon(_, _, hp, _, defense) =>
    val loss = math.min(0, myAttack - defense)
    target.copy(hp = hp - loss)
  case target @ Human(_, hp) =>
    target.copy(hp = hp - myAttack)
}

// Do this:
targets.foreach {
  case target: Pokemon =>
    val loss = math.min(0, myAttack - target.defense)
    target.copy(hp = target.hp - loss)
  case target: Human =>
    target.copy(hp = target.hp - myAttack)
}

Infix Methods

Avoid infix notation for methods that aren't symbolic methods (i.e. operator overloading).

// Correct
list.map(func)
string.contains("foo")

// Wrong
list map (func)
string contains "foo"

// But overloaded operators should be invoked in infix style
arrayBuffer += elem

Anonymous Methods

Avoid excessive parentheses and curly braces for anonymous methods.

// Correct
list.map { item =>
  ...
}

// Correct
list.map(item => ...)

// Wrong
list.map(item => {
  ...
})

// Wrong
list.map { item => {
  ...
}}

// Wrong
list.map({ item => ... })

Scala Language Features

Case Classes and Immutability

Case classes are regular classes but extended by the compiler to automatically support:

- Public getters for constructor parameters
- Copy constructor
- Pattern matching on constructor parameters
- Automatic toString/hash/equals implementation

Constructor parameters should NOT be mutable for case classes. Instead, use the copy constructor. Having mutable case classes can be error prone, e.g. hash maps might place the object in the wrong bucket using the old hash code.

// This is OK
case class Person(name: String, age: Int)

// This is NOT OK
case class Person(name: String, var age: Int)

// To change values, use the copy constructor to create a new instance
val p1 = Person("Peter", 15)
val p2 = p1.copy(age = 16)
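
The hash bucket problem mentioned above can be sketched as follows. `MutablePerson` and `MutableCaseClassPitfall` are hypothetical names introduced only for this illustration, and the sketch assumes the compiler-generated `hashCode`, which is computed from the constructor parameters.

```scala
import scala.collection.mutable

// Hypothetical class for illustration only; the rule above says NOT to declare a var here.
case class MutablePerson(name: String, var age: Int)

object MutableCaseClassPitfall {
  // Returns the hash codes of the same instance before and after mutation.
  def hashCodesAroundMutation(): (Int, Int) = {
    val peter = MutablePerson("Peter", 15)
    val before = peter.hashCode
    peter.age = 16
    (before, peter.hashCode)
  }

  // Stores an instance in a hash set, mutates it, and then looks it up again.
  def foundAfterMutation(): Boolean = {
    val peter = MutablePerson("Peter", 15)
    val people = mutable.HashSet(peter)
    peter.age = 16
    people.contains(peter)
  }
}
```

Because the generated hashCode depends on age, the two values returned by `hashCodesAroundMutation` differ, and `foundAfterMutation` will typically return false even though the instance was never removed from the set.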

apply Method

Avoid defining apply methods on classes. These methods tend to make the code less readable, especially for people less familiar with Scala. It is also harder for IDEs (or grep) to trace. In the worst case, it can also affect correctness of the code in surprising ways, as demonstrated in Parentheses.

It is acceptable to define apply methods on companion objects as factory methods. In these cases, the apply method should return the companion class type.

object TreeNode {
  // This is OK
  def apply(name: String): TreeNode = ...

  // This is bad because it does not return a TreeNode
  def apply(name: String): String = ...
}
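
A minimal runnable sketch of the acceptable factory form follows. The `children` field and the private constructor are assumptions added for illustration; they are not part of the snippet above.

```scala
// Hypothetical TreeNode; the private constructor makes apply the only way to build one.
class TreeNode private (val name: String, val children: Seq[TreeNode])

object TreeNode {
  // OK: the factory returns the companion class type.
  def apply(name: String): TreeNode = new TreeNode(name, Seq.empty)
}
```

With this in place, `TreeNode("root")` reads like a constructor call and always yields a TreeNode, which is what a reader would expect.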

override Modifier

Always add override modifier for methods, both for overriding concrete methods and implementing abstract methods. The Scala compiler does not require override for implementing abstract methods. However, we should always add override to make the override obvious, and to avoid accidental non-overrides due to non-matching signatures.

trait Parent {
  def hello(data: Map[String, String]): Unit = {
    print(data)
  }
}

class Child extends Parent {
  import scala.collection.Map

  // The following method does NOT override Parent.hello,
  // because the two Maps have different types.
  // If we added "override" modifier, the compiler would've caught it.
  def hello(data: Map[String, String]): Unit = {
    print("This is supposed to override the parent method, but it is actually not!")
  }
}
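
For contrast, here is a sketch of the intended pattern with the modifier in place. `Greeter` and `CustomGreeter` are hypothetical names, and the methods return a String rather than printing so that the result is easy to check.

```scala
trait Greeter {
  def greet(data: Map[String, String]): String = s"parent: ${data.size}"
}

class CustomGreeter extends Greeter {
  // "override" makes the intent explicit. If this signature did not match the parent's
  // (for example, by using scala.collection.Map), compilation would fail instead of
  // silently adding an overload.
  override def greet(data: Map[String, String]): String = s"child: ${data.size}"
}
```

Calling `greet` through a reference of type Greeter now reaches the child implementation, which is the behavior the author of the subclass wanted.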

Destructuring Binds

Destructuring bind (sometimes called tuple extraction) is a convenient way to assign two variables in one expression.

val (a, b) = (1, 2)

However, do NOT use them in constructors, especially when a and b need to be marked transient. The Scala compiler generates an extra Tuple2 field that will not be transient for the above example.

class MyClass {
  // This will NOT work because the compiler generates a non-transient Tuple2
  // that points to both a and b.
  @transient private val (a, b) = someFuncThatReturnsTuple2()
}

Call by Name

Avoid using call by name. Use () => T explicitly.

Background: Scala allows method parameters to be defined by-name, e.g. the following would work:

def print(value: => Int): Unit = {
  println(value)
  println(value + 1)
}

var a = 0
def inc(): Int = {
  a += 1
  a
}

print(inc())

In the above code, inc() is passed into print as a closure and is executed (twice) in the print method, rather than being passed in as the value 1. The main problem with call-by-name is that the caller cannot differentiate between call-by-name and call-by-value, and thus cannot know for sure whether the expression will be executed or not (or maybe worse, multiple times). This is especially dangerous for expressions that have side effects.
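
The recommended alternative, an explicit `() => T` parameter, might look like the sketch below. `ExplicitThunkExample`, `evaluateTwice`, and `currentCount` are hypothetical names, and the method returns its two results instead of printing them so that the behavior is easy to check.

```scala
object ExplicitThunkExample {
  private var counter = 0

  def inc(): Int = {
    counter += 1
    counter
  }

  // The parameter type () => Int tells the caller that the argument is a function
  // that may be invoked zero, one, or many times.
  def evaluateTwice(value: () => Int): (Int, Int) = {
    val first = value()
    val second = value() + 1
    (first, second)
  }

  def currentCount: Int = counter
}
```

The call site becomes `ExplicitThunkExample.evaluateTwice(() => ExplicitThunkExample.inc())`. The explicit lambda makes it visible to the caller that inc() is deferred and may run more than once.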

Multiple Parameter Lists

Avoid using multiple parameter lists. They complicate operator overloading, and can confuse programmers less familiar with Scala. For example:

// Avoid this!
case class Person(name: String, age: Int)(secret: String)

One notable exception is the use of a 2nd parameter list for implicits when defining low-level libraries. That said, implicits should be avoided!
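
One concrete way a second parameter list can surprise readers is sketched below. This is an illustration added here, not necessarily the reason the rule was written. `Account` mirrors the shape of the Person example above under a hypothetical name, with `val` added to `secret` so that it can be read back.

```scala
// Same shape as the Person example above, renamed to avoid clashing with it.
case class Account(name: String, age: Int)(val secret: String)

object SecondParameterListPitfall {
  val first: Account = Account("Peter", 15)("hunter2")
  val second: Account = Account("Peter", 15)("something else")

  // equals, hashCode, and toString are generated from the first parameter list only.
  def looksEqual: Boolean = first == second
  def rendered: String = first.toString
}
```

Here `looksEqual` is true even though the two secrets differ, and the secret does not appear in `rendered`.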

Symbolic Methods (Operator Overloading)

Do NOT use symbolic method names, unless you are defining them for natural arithmetic operations (e.g. +, -, *, /). Under no other circumstances should they be used. Symbolic method names make it very hard to understand the intent of the methods. Consider the following two examples:

// symbolic method names are hard to understand
channel ! msg
stream1 >>= stream2

// self-evident what is going on
channel.send(msg)
stream1.join(stream2)
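
As a sketch of the arithmetic exception, a small value type might define + and - symbolically and give everything else a descriptive name. `Vec2` is a hypothetical type used only for illustration.

```scala
// Hypothetical 2D vector type, used only for illustration.
case class Vec2(x: Double, y: Double) {
  // Natural arithmetic operations are the accepted use of symbolic names.
  def +(other: Vec2): Vec2 = Vec2(x + other.x, y + other.y)
  def -(other: Vec2): Vec2 = Vec2(x - other.x, y - other.y)

  // Anything that is not plain arithmetic gets a descriptive name instead.
  def dot(other: Vec2): Double = x * other.x + y * other.y
}
```

Callers can then write `a + b` for vector addition, while `a.dot(b)` stays self-explanatory to readers who have never seen the type before.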

Type Inference

Scala type inference, especially left-side type inference and closure inference, can make code more concise. That said, there are a few cases where explicit typing should be used:

- Public methods should be explicitly typed, otherwise the compiler's inferred type can often surprise you.
- Implicit methods should be explicitly typed, otherwise it can crash the Scala compiler with incremental compilation.
- Variables or closures with non-obvious types should be explicitly typed. A good litmus test is that explicit typ
