Sym File Maintenance in KDB+

The sym file is one of the most critical components of a KDB+ database. It provides significant improvements in both memory efficiency and query performance. However, to preserve these benefits, the sym file must be properly maintained. Excessively large sym files can negatively impact database load times and overall system performance, making it important to keep them as compact as reasonably possible.
This blog examines the considerations involved in database design to minimise sym file growth. It also explores practical techniques for reducing the size of an existing sym file when it becomes excessively large.
This post builds on the concepts introduced in my previous blog on general KDB+ database maintenance. If you are unfamiliar with those fundamentals, I recommend reading that post first, as many of the practices discussed there form the foundation for effective sym file maintenance. Additionally, the code described throughout this blog is available within the dbm.q module.
Warning
Operations on sym files can fail for many reasons. To avoid data loss, always back up your database before performing any of the operations described in this blog.
Symbols and Enumeration
In KDB+, the symbol type is an interned string. This means only one copy of each distinct string value is stored in memory, and all references point to that single instance.
On disk, this concept is implemented through enumeration. Enumerating a list of symbols converts each symbol into an integer index referencing a unique list of symbols known as the domain. This domain is stored in the sym file.
Symbol columns in splayed or partitioned tables must be enumerated. Instead of storing the symbol values directly, these columns store integer indexes into the domain. When the database is mapped into memory, these indexes are resolved back to their corresponding symbol values transparently.
Enumeration provides several important benefits:
- Reduced memory and disk usage — only one copy of each unique symbol is stored, while references use fixed-width integer indexes.
- Improved query performance — integer comparisons are significantly faster than string comparisons.
- Efficient memory mapping — fixed-width representations improve cache efficiency and access speed.
However, enumeration also introduces operational complexity:
- Tables must be properly enumerated before being written to disk.
- The sym file becomes a critical dependency; corruption or inconsistency can lead to severe data integrity issues.
- The sym file grows over time as new unique symbols are added. This growth, often referred to as sym file bloat, can significantly increase database load times.
Why Sym File Maintenance Matters
The sym file (or files) is loaded into memory when a database is opened. Its size therefore directly impacts startup time and memory usage. In systems with high symbol cardinality, such as trade identifiers or user-generated strings, the sym file can grow rapidly.
Uncontrolled growth can lead to:
- Slow database startup times
- Increased memory consumption
- Reduced overall system efficiency
- Longer recovery times during failover or restart
Proper design and ongoing maintenance are essential to prevent these issues.
Design Considerations
When designing a database, the choice of datatype for columns is an important consideration. A common decision is whether to use the symbol or string type for values containing alphanumeric characters. In many cases, it may not be immediately obvious which type is most appropriate.
In practice, the following guidelines can help inform that choice:
| Symbol | String |
|---|---|
Values contain only alphanumeric, underscore, or dot characters (0-9A-Za-z_.) | Values contain spaces or special characters (which can make certain operations more cumbersome) |
| High repetition of values | Low repetition of values |
| Column frequently used in comparison operations | Column rarely used in comparison operations |
Note
Forward slashes can be used in file symbols without any cumbersome handling. However, in such cases the use of a symbol is usually deliberate, as the value is intended to represent a file path.
You may also encounter identifier columns that are highly unique but frequently used in comparison operations. In such cases, consider using the GUID type, which provides efficient comparisons without introducing large numbers of entries into a symbol domain.
Creating a Decision Function
We can create a helper function that analyses a list of values (provided as strings) and suggests whether they should be stored as strings or symbols.
First, given a string x, we check whether it contains any characters that are unsuitable for symbols:
hasBadSymChar:{0<count x except .Q.nA,.Q.a,"_."}
Example:
q)hasBadSymChar each ("abc";"def";"g.h";"i_j";"k@l")
00001b
If any value in the list contains such characters, the column should remain a string:
any hasBadSymChar each vals
Measuring Uniqueness
Next, we compute the uniqueness ratio (also called relative cardinality):
\[ \text{Uniqueness Ratio} = \frac{\text{Number of Unique Elements}}{\text{Total Number of Elements}} \]
In q:
(count distinct vals)%count vals
A ratio of 1 indicates that all values are unique, while a ratio approaching 0 indicates a high level of repetition.
The interpretation of this ratio depends on the dataset, but one possible guideline might be:
| Ratio | Cardinality |
|---|---|
| > 0.9 | Very high |
| 0.5 - 0.9 | High |
| 0.1 - 0.5 | Medium |
| 0.01 - 0.1 | Low |
| < 0.01 | Very low |
However, the ratio alone does not tell the full story. Consider the following examples:
| Total Values | Distinct Values | Uniqueness Ratio |
|---|---|---|
| 10 | 5 | 0.5 |
| 1,000,000 | 500,000 | 0.5 |
Both datasets have the same ratio, yet the second introduces 500,000 unique symbols into the domain. Depending on the application, this may still be considered very high cardinality.
For this reason, we should also consider absolute cardinality (the number of distinct values).
Combining Relative and Absolute Cardinality
We can define both relative and absolute thresholds. If both thresholds are exceeded, the values should remain strings; otherwise, they can be stored as symbols.
(relThreshold<uniqueCount%count vals) and
absThreshold<uniqueCount:count distinct vals
The final helper function returns the values either as strings or symbols depending on these checks:
decideType:{[absThreshold;relThreshold;vals]
$[
any hasBadSymChar each vals; vals;
(relThreshold<uniqueCount%count vals) and
absThreshold<uniqueCount:count distinct vals; vals;
`$vals
]
};
Example:
decideType[10000;0.5;] ("abc";"def";"ghi";..)
Here, the absolute cardinality threshold is set to 10,000 and the relative cardinality threshold to 0.5. This means that values will be stored as symbols only if:
- the uniqueness ratio is below
0.5, and - the number of distinct values is below
10,000.
Otherwise, the values remain strings.
When choosing the absolute threshold, consider how many additional symbols your current sym file can reasonably accommodate. The relative threshold can help anticipate future growth patterns in the data.
This method provides a simple heuristic for datatype selection. In practice, the final decision may also depend on query patterns, expected data growth, and domain-specific considerations. Nevertheless, applying these checks provides a useful baseline when designing a database schema.
Converting Between String and Symbol Types
In the previous section we discussed how to decide whether a column should use the string or symbol type. However, in many cases the schema of an existing database has already been established and such decisions have already been made. Over time, requirements may change and you may decide that a column would be better represented using the other type.
In this section we look at how to convert a column from symbol to string and from string to symbol.
From Symbol to String
A column can usually be cast from one type to another using castCol provided by dbm.q. However, converting a column from symbol to string requires a slightly different approach.
Instead of casting directly, we use fnCol to apply the string function to the column values.
When operating directly on the column file on disk, a symbol column does not contain the symbol values themselves. Instead, it contains the enumeration indexes into the symbol domain. Therefore, before converting to strings we must first map those indexes back to their corresponding symbol values.
The lambda passed to fnCol therefore looks like this:
{[db;vals] $[20h=type vals; string (get db,key vals) vals; vals]}[db;]
Here vals is the list of column values.
If vals is an enumeration type (since KDB+ 3.6, enumerations always have type 20h):
- The domain file is loaded using
get db,key vals - The enumeration indexes are mapped to their symbol values
- The resulting symbols are converted to strings using
string
If the values are not enumerated symbols, they are returned unchanged.
Avoiding Repeated Domain Loads
fnCol may apply the operation to many column files in a partitioned database. If we load the domain file inside the lambda, it would be reloaded for every partition, which can be inefficient if the domain file is large.
To avoid this, we determine the domain once and reuse it.
However, identifying the domain file is not completely straightforward because a table may exist in many partitions, each containing a copy of the column file. To determine the domain name we inspect one of these column files.
A common approach is to use the most recent partition, which can generally be assumed to be in a consistent state.
We first create a helper function to obtain the table directory from the most recent partition.
// Get the table directory from the most recent partition
mostRecentTdir:{[db;tname] last asc allTablePaths[db;tname]}
allTablePaths (provided in dbm.q) returns every path to the table across all partitions.
Sorting these paths and taking the last value gives the most recent partition.
This also works for splayed tables, since allTablePaths will return a single path.
Determining the Domain Name
Once we have the table path we can inspect the column file to determine the domain used by the enumeration.
// Get the name of the domain used by the given column
colDomainName:{[db;tname;cname]
$[20h=type vals:get mostRecentTdir[db;tname],cname; key vals; `]
}
If the column is an enumeration type, the domain name is obtained using key. Otherwise, the function returns a null symbol.
Symbol to String Conversion
We can now implement the function that converts a symbol column to a string column.
// Convert a column of symbol type to string type
symToStrCol:{[db;tname;cname]
if[not null domainName:colDomainName[db;tname;cname];
fnCol[db;tname;cname;string get[db,domainName]@]
];
}
Since the column type is already verified in colDomainName, the lambda passed to fnCol does not need to check the type again.
Note
After this conversion, many values in the sym file may no longer be referenced by any column. These redundant symbols remain in the domain and can contribute to unnecessary sym file growth. We will discuss how to remove these unused values in a later section.
From String to Symbol
Converting from string to symbol is more involved.
The process requires:
- Casting the column values to symbols
- Adding any new symbols to the domain
- Converting the symbols to their corresponding domain indexes
- Writing the enumerated values back to disk
Because we must enumerate the symbols against a specific domain, the function requires an additional argument dname, which is the name of the domain.
Loading the Domain
We begin by loading the domain file into memory.
dfile:.Q.dd[db;dname]
domain:get dfile
Iterating Over Column Files
In a partitioned database, the column will exist in many partition directories. Each column file may contain different symbols, so the domain may need to be updated as we process each partition.
For this reason the operation cannot safely be parallelised, since concurrent updates to the domain could lead to inconsistent enumeration values.
Instead, we iterate over each column path and update the domain incrementally.
Loading and Casting Column Values
For each column path we first verify that the column exists.
has1Col[tdir;cname]
If the column file exists we load the values and cast them to symbols.
vals:`$get cfile:.Q.dd[tdir;cname]
This cast works for string columns, but there is an important edge case when the column is of type char.
If a char list such as
"abc"
is cast directly using `$, the result will be a single symbol
`abc
instead of three individual symbols.
To ensure that each element is converted separately we use an explicit each-right:
vals:`$/:get cfile:.Q.dd[tdir;cname]
This works correctly for both char and string inputs.
Updating the Domain
Any new symbols encountered in the column must be added to the domain.
domain:domain union vals
Converting Symbols to Domain Indexes
We convert the symbol values into indexes of the domain using the find operator ?.
domain?vals
Since the column may contain many repeated values, we optimise this operation using .Q.fu, which applies a function only to the unique values and then expands the result back to the original shape.
.Q.fu[domain?;syms]
Example:
// Without optimisation
q)3*1 2 1 2 1 1
3 6 3 6 3 3
// Only two multiplications are performed (3*1 and 3*2), and the results are reused
q).Q.fu[3*;1 2 1 2 1 1]
3 6 3 6 3 3
Writing the Enumerated Values
The resulting indexes are assigned the domain using !:
dname!.Q.fu[domain?;vals]
The updated column values are then written back to disk.
.[cfile;();:;dname!.Q.fu[domain?;vals]]
Helper Function
This logic can be encapsulated in a helper function that converts a single column file.
// Convert a column from string type to symbol type
strToSym1Col:{[domain;tdir;cname;dname]
if[has1Col[tdir;cname];
domain:domain union vals:`$/:get cfile:.Q.dd[tdir;cname];
.[cfile;();:;dname!.Q.fu[domain?;vals]]
];
domain
}
The function returns the (possibly updated) domain so it can be reused when processing additional column files.
Complete Conversion Function
We can now construct the full function that processes every column path in the table.
// Convert a column from string type to symbol type
strToSymCol:{[db;tname;cname;dname]
dfile:.Q.dd[db;dname];
domain:strToSym1Col[;;cname;dname]/[get dfile;allTablePaths[db;tname]];
dfile set domain;
}
After all column files have been processed, the final domain is written back to disk.
Cleaning a Sym File
Databases evolve over time. Tables and columns are added, schemas change, and older structures are eventually removed. When symbol columns are dropped, the symbols that were referenced exclusively by those columns may no longer be used anywhere in the database. However, their entries remain in the sym file.
This results in a sym file containing many unused values. While functionally harmless, these unused entries waste space and can significantly increase database load times.
The solution is to clean (or compact) the sym file by identifying and removing unused symbols. The first step is to determine which symbols are still referenced.
Identifying Used & Unused Symbols
To identify unused symbols within a database, we must first determine which symbols are actually in use. We can then filter the domain to exclude those unused entries.
Used Symbols in a Single Splayed Table
Assume we have the path to a splayed table stored in the variable tdir. First, we retrieve the list of column names:
getColNames tdir
getColNames is a helper function defined in dbm.q that takes a splayed table directory path and returns the list of column names. For example:
q)getColNames `:splayDB/trade
`time`sym`ex`size`price
We are only interested in columns of enumeration type (20h).
To check a column’s type, we must load it:
get tdir,col
We can then test whether the column is an enumeration:
20h=type get tdir,col
Example:
q)20h=type get `:splayDB/trade,`price
0b
q)20h=type get `:splayDB/trade,`ex
1b
If a column is enumerated, we can determine its domain name using key:
// ex column is enumerated against the domain called sym
q)key get `:splayDB/trade,`ex
`sym
Multiple Domains Caveat
Although uncommon, it is possible for different columns in the same table to be enumerated against different domains.
For example, in the database splayDBMulti, the trade table has two enumeration columns — sym and ex — each using a different domain:
q)tdir:`:splayDBMulti/trade
q)key get tdir,`sym
`sym1
q)key get tdir,`ex
`sym2
The corresponding domain files are stored at the database root:
q)key `:splayDBMulti
`s#`sym1`sym2`trade
To account for this possibility, we return a dictionary mapping domain name to used indexes, rather than a flat list of symbols.
Using splayDBMulti as an example:
q)tdir:`:splayDBMulti/trade
q)map:([])
Extracting usage from the sym column:
q)show enum:get tdir,`sym
`sym1!0 1 2 3 4
q)map[key enum]:value enum
q)map
sym1| 0 1 2 3 4
Now for the ex column:
q)show enum:get tdir,`ex
`sym2!0 1 0 2 1
q)map[key enum]:value enum
q)map
sym1| 0 1 2 3 4
sym2| 0 1 0 2 1
Avoiding Overwrites
If another column is also enumerated against sym2, naïvely assigning would overwrite existing entries:
q)enum:`sym2!2 3 0 2 1
q)map[key enum]:value enum
q)map
sym1| 0 1 2 3 4
sym2| 2 3 0 2 1
Instead, we must append:
q)map[key enum],:value enum
q)map
sym1| 0 1 2 3 4
sym2| 0 1 0 2 1 2 3 0 2 1
Since we only care about distinct indexes, we refine this using distinct:
q)map[key enum]:distinct map[key enum],value enum
q)map
sym1| 0 1 2 3 4
sym2| 0 1 2 3
Even better, we can use union, which is equivalent to distinct + append (,):
q)map[key enum]:map[key enum] union value enum
q)map
sym1| 0 1 2 3 4
sym2| 0 1 2 3
Completing the Function
We now iterate over all columns using over (/), building the domain map incrementally:
// Build a mapping of domain name to indexes of used symbols in the given splayed table
domainUsed1:{[tdir]
{[tdir;map;col]
if[20h=type enum:get tdir,col;
map[key enum]:map[key enum] union value enum
];
map
}[tdir]/[([]);getColNames tdir]
}
Examples:
q)domainUsed1 `:splayDB/trade
sym| 0 1 2 3 4 5 6 7
q)domainUsed1 `:splayDBMulti/trade
sym1| 0 1 2 3 4
sym2| 0 1 2
Applying to Multiple Partitions
As described in a previous blog, we can wrap this to operate across partitions:
// Build a mapping of domain name to indexes of used symbols in the given database table
domainUsed:{[db;tname] (union'/) domainUsed1 peach allTablePaths[db;tname]}
domainUsed1 is applied in parallel (peach) across all table directories. The result is a list of dictionaries, which we combine using union' and /.
Example of union':
// union is applied on a key by key basis using '
q)(`sym1`sym2!(0 1 2;3 4 5)) union' `sym1`sym3!(2 6 1;7 8 9)
sym1| 0 1 2 6
sym2| 3 4 5
sym3| 7 8 9
Unused Symbols
Once we know which indexes are used, finding unused indexes is straightforward.
We generate all possible indexes using til:
til count get db,domainName
Then remove the used indexes:
(til count get db,domainName) except domainUsed[db;tname] domainName
To apply this to all domains:
// Build a mapping of domain name to indexes of un-used symbols in the given database table
domainUnused:{[db;tname] except'[;used] (til count get db,) each key used:domainUsed[db;tname]}
Note
domainUnusedonly includes domains that have at least one used symbol. If a domain does not appear in the result, it is completely unused by the table.
Identifying unused symbols is not strictly required to rebuild domains, but it is useful for assessing whether compaction is necessary.
Complete Domain Usage
To compute usage across the entire database, we apply domainUsed to every table in our database (using listTabs provided by dbm.q) and combine using (union'/):
// Build a mapping of domain name to indexes of used symbols in the given database
domainUsage:{[db] (union'/) domainUsed[db;] peach listTabs db}
Persisting the New Domain(s)
After identifying the used indexes, we rebuild the domains in two stages.
1) Resolving Domain Indexes
We convert index mappings into actual symbol values:
// Convert a domain mapping from index values to symbol values
resolveDomainMap:{[db;dm] ((get db,) each key dm)@'dm}
Example:
q)resolveDomainMap[db;] domainUsed[db;tname]
sym| IBM AMZN GOOGL META SPOT L O SI
2) Persisting to Disk
Domains can be written using set:
q).Q.dd[`:temp;`sym] set `IBM`AMZN`GOOGL`META`SPOT`L`O`SI
We generalise this:
// Save each domain as a file containing the symbol values
persistDomainMap:{[dir;dm] (.Q.dd[dir;] each key dm) set' dm}
Important
The
dirparameter should not initially be the database root. Always write rebuilt sym files to a temporary location first. Only after successful re-enumeration should they replace the originals.
Rebuilding Domains
We combine all steps into a single convenient function:
// Recreate all domains of the given database with only the used symbols
rebuildDomains:{[db;dir] persistDomainMap[dir;] resolveDomainMap[db;] domainUsage db}
Calling rebuildDomains writes cleaned domain files to dir but does not modify the database. This makes it a safe first step when performing a sym file clean-up.
Re-enumeration
Re-enumerating a database against a new domain requires performing the following steps for each symbol column:
- Load the column values.
- Resolve the existing indexes against the current domain.
- Convert the resulting symbols into indexes of the new domain.
- Write the updated column values back to disk.
Because the new domain has already been created, the re-enumeration step only modifies individual column files. This means the operation can be performed in parallel, since no shared state is modified.
Warning
The steps in this section perform in-place updates to the database. If interrupted or executed incorrectly, data corruption or loss may occur. This process must be performed during a maintenance window with the database offline. Until re-enumeration and domain replacement are fully complete, the database should be considered inconsistent and must not be accessed.
Re-enumerating a Single Column
A column only needs to be re-enumerated if it is an enumeration type and therefore has an associated domain. We can obtain the current domain name using colDomainName, which we defined earlier:
currDomainName:colDomainName[db;tname;cname]
If currDomainName is null, the column is not an enumeration type and we can exit early. Otherwise, we load the current domain:
currDomain:get db,currDomainName
We must also provide the path to the new domain file (newDomainFile), which may not yet reside within the database directory. We load this file and extract its basename to determine the new domain name:
newDomain:get newDomainFile
newDomainName:last ` vs newDomainFile
At this point we have everything required to re-enumerate the column against the new domain.
{[currDomain;newDomain;newDomainName;vals]
newDomainName!.Q.fu[newDomain?;currDomain vals]
}[currDomain;newDomain;newDomainName;]
This lambda (projection) re-enumerates the column values (vals) in the same way we performed enumeration in the previous section. The steps are:
- Resolve the existing indexes using
currDomain - Locate the new indexes in
newDomain - Apply
.Q.futo avoid repeated lookups - Assign the new domain name using
!
We can apply this operation to every column file using fnCol:
fnCol[db;tname;cname;] {[currDomain;newDomain;newDomainName;vals]
newDomainName!.Q.fu[newDomain?;currDomain vals]
}[currDomain;newDomain;newDomainName;]
Putting everything together gives the complete function:
// Re-enumerate a column against a new domain across all partitions
reenumerateCol:{[db;tname;cname;newDomainFile]
if[not null currDomainName:colDomainName[db;tname;cname];
currDomain:get db,currDomainName;
newDomain:get newDomainFile;
newDomainName:last ` vs newDomainFile;
fnCol[db;tname;cname;] {[currDomain;newDomain;newDomainName;vals]
newDomainName!.Q.fu[newDomain?;currDomain vals]
}[currDomain;newDomain;newDomainName;]
];
}
This function allows us to re-enumerate a single column. While uncommon, it can be useful when columns within the same table are enumerated against different domains or when re-enumeration needs to be performed incrementally.
Re-enumerating a Table
In most databases, the same domain is used for all symbol columns within a table.
We can therefore define reenumerateTab to re-enumerate all enumeration-type columns (returned by listEnumCols from dbm.q) across all partitions of a table:
// Re-enumerate all enumeration columns in a table
reenumerateTab:{[db;tname;newDomainFile]
reenumerateCol[db;tname;;newDomainFile] peach listEnumCols[db;tname];
}
For databases that use a single global domain (the most common case), we can extend this to all tables:
// Re-enumerate every table in the database
reenumerateAll:{[db;newDomainFile]
reenumerateTab[db;;newDomainFile] peach listTabs db;
};
Replacing the Old Sym Files
After re-enumeration, the updated column files reference the new domain file(s). However, these files are not yet located in the database root directory.
The final step is therefore to:
- Remove or archive the existing sym file(s).
- Move the rebuilt sym file(s) into the database root.
This can be done at the system level. For example, on Linux:
mv /path/to/database/oldSymFile /path/to/archive/oldSymFile
mv /path/to/newSymFile /path/to/database/newSymFile
Once the new sym file is in place, the database can be restarted and will load the rebuilt domain.
Renaming a Domain
Renaming a domain is not a common operation, but it can be useful in situations such as changes to naming conventions or schema refactoring.
Renaming a domain requires two main steps:
- Create a copy of the existing domain file using the new name.
- Re-enumerate any columns currently enumerated against the old domain so that they reference the new domain.
The first step is simply a filesystem copy. The dbm.q library already provides an internal helper function copy for this purpose.
copy[`:db/currSym;`:db/newSym]
The second step is slightly more involved.
Re-enumerating the entire database cannot be done using reenumerateAll, because a database may contain multiple domains. Calling reenumerateAll would re-enumerate every enumeration column against the new domain, regardless of which domain it currently uses.
Instead, we need stricter versions of the re-enumeration functions that only operate on columns enumerated against a specific domain.
We will therefore introduce reenumerateColFrom, reenumerateTabFrom, and reenumerateAllFrom. These functions behave similarly to the previously defined re-enumeration functions, but they only operate on columns whose current domain matches the specified domain.
reenumerateColFrom
Compared with reenumerateCol, this function has two differences:
- It takes an additional parameter
currDomainName. - It checks that the column is currently enumerated against this domain before performing the re-enumeration.
// Re-enumerate a column from a specific domain to a new domain
reenumerateColFrom:{[db;tname;cname;currDomainName;newDomainFile]
if[currDomainName=colDomainName[db;tname;cname];
// same as reenumerateCol ..
];
}
Since reenumerateCol and reenumerateColFrom share most of their logic, we can refactor the common functionality into a helper function.
Shared Re-enumeration Helper
reenumerateCol0:{[db;tname;cname;currDomainName;newDomainFile]
currDomain:get db,currDomainName;
newDomain:get newDomainFile;
newDomainName:fs.basename newDomainFile;
fnCol[db;tname;cname;] {[currDomain;newDomain;newDomainName;vals]
newDomainName!.Q.fu[newDomain?;currDomain vals]
}[currDomain;newDomain;newDomainName;];
}
This helper performs the actual re-enumeration and is used by both variants of the function.
The original reenumerateCol can now be simplified:
reenumerateCol:{[db;tname;cname;newDomainFile]
if[not null currDomainName:colDomainName[db;tname;cname];
reenumerateCol0[db;tname;cname;currDomainName;newDomainFile]
];
}
And the stricter version becomes:
reenumerateCol:{[db;tname;cname;currDomainName;newDomainFile]
if[currDomainName=colDomainName[db;tname;cname];
reenumerateCol0[db;tname;cname;currDomainName;newDomainFile]
];
}
Re-enumerating Tables and Databases
Because reenumerateTab and reenumerateAll are implemented in terms of reenumerateCol, we can define reenumerateTabFrom and reenumerateAllFrom in the same way using reenumerateColFrom.
// Re-enumerate all enumeration columns in a table from a given domain
reenumerateTabFrom:{[db;tname;currDomainName;newDomainFile]
reenumerateColFrom[db;tname;;currDomainName;newDomainFile] peach listEnumCols[db;tname];
};
// Re-enumerate every table in the database from a given domain
reenumerateAllFrom:{[db;currDomainName;newDomainFile]
reenumerateTabFrom[db;;currDomainName;newDomainFile] peach listTabs db;
};
We can now re-enumerate only the appropriate columns across the entire database:
reenumerateAllFrom[db;currName;newFile]
Final Renaming Function
Combining these steps, we arrive at the final renaming function:
// Rename a (symbol) domain
renameDomain:{[db;currName;newName]
currFile:.Q.dd[db;currName];
newFile:.Q.dd[db;newName];
copy[currFile;newFile];
reenumerateAllFrom[db;currName;newFile];
}
This function deliberately does not delete the original domain file. Keeping the old file temporarily allows for recovery if anything goes wrong during the process.
Once the database has been validated, the old domain file can be safely archived or removed.
Conclusion
The sym file is a fundamental component of a KDB+ database. By enabling efficient symbol enumeration, it provides significant improvements in both storage efficiency and query performance. However, these benefits come with operational considerations. Without proper management, sym files can grow unnecessarily large, increasing startup times and memory usage while potentially introducing avoidable complexity into the system.
In this post we explored several techniques for managing sym files more effectively. These included identifying when symbol enumeration is appropriate, converting columns away from enumeration when it no longer provides benefit, renaming domains safely, and cleaning unused values from a sym file. Together, these techniques provide a practical toolkit for maintaining healthy and efficient symbol domains over the lifetime of a database.
As with many aspects of KDB+ database maintenance, prevention is often easier than correction. Thoughtful schema design, careful consideration of symbol cardinality, and periodic maintenance can prevent excessive sym file growth before it becomes a problem. When issues do arise, the techniques described here can help restore order while minimising risk to the underlying data.
Ultimately, a well-maintained sym file contributes directly to faster startup times, lower memory usage, and more predictable system behaviour — all of which are essential for reliable production KDB+ systems.