Based on a Debian 11 "Bullseye" environment.

bsfilter

A Bayesian filter is a must for spam filtering. It can learn how to classify spam emails.

There are well-known filters such as SpamAssassin and bogofilter, but they can't be used for Japanese (and probably Chinese) because of tokenizer issues.
"bsfilter" is lesser known and old, but it was developed in Japan and works well with Japanese spam emails.

Install

The only requirement is bsfilter itself. It uses a bigram algorithm for tokenization.

# apt install bsfilter

Configure

Create /etc/bsfilter.conf, which is shared by all virtual users.

pipe
insert-flag
insert-probability
auto-update
db gdbm
  • "db gdbm" guards against overly long tokens that are occasionally generated by accident. Such tokens can break the sdbm database.

Integration with Dovecot Sieve

Dovecot Sieve plugins

Plugins are required to call external programs from Sieve scripts.
In this case, bsfilter works as an external filter that adds headers.
bsfilter is always called before the per-user Sieve script, so the spam probability headers are already present when that script runs.

In /etc/dovecot/conf.d/90-sieve.conf, add the filter extension and enable the extprograms plugin. The filter runs only in the global script, so the extension is enabled for global scripts only. (Individual users can't use it.)

plugin {
  * snip *

  # Location Sieve of scripts that need to be executed before the user's
  # personal script. If a 'file' location path points to a directory, all the 
  # Sieve scripts contained therein (with the proper `.sieve' extension) are
  # executed. The order of execution within that directory is determined by the
  # file names, using a normal 8bit per-character comparison.
  #
  # Multiple script locations can be specified by appending an increasing number
  # to the setting name. The Sieve scripts found from these locations are added
  # to the script execution sequence in the specified order. Reading the
  # numbered sieve_before settings stops at the first missing setting, so no
  # numbers may be skipped.
  sieve_before = /var/lib/dovecot/sieve.d/  # Uncomment this line
  #sieve_before2 = ldap:/etc/sieve-ldap.conf;name=ldap-domain
  #sieve_before3 = (etc...)

  * snip *

  # Which Sieve language extensions are ONLY available in global scripts. This
  # can be used to restrict the use of certain Sieve extensions to administrator
  # control, for instance when these extensions can cause security concerns.
  # This setting has higher precedence than the `sieve_extensions' setting
  # (above), meaning that the extensions enabled with this setting are never
  # available to the user's personal script no matter what is specified for the
  # `sieve_extensions' setting. The syntax of this setting is similar to the
  # `sieve_extensions' setting, with the difference that extensions are
  # enabled or disabled for exclusive use in global scripts. Currently, no
  # extensions are marked as such by default.
  sieve_global_extensions = +vnd.dovecot.filter

  # The Pigeonhole Sieve interpreter can have plugins of its own. Using this
  # setting, the used plugins can be specified. Check the Dovecot wiki
  # (wiki2.dovecot.org) or the pigeonhole website
  # (http://pigeonhole.dovecot.org) for available plugins.
  # The sieve_extprograms plugin is included in this release.
  sieve_plugins = sieve_extprograms

  * snip *
}

In /etc/dovecot/conf.d/90-sieve-extprograms.conf, add location information about filters.

plugin {

  # The directory where the program sockets are located for the
  # vnd.dovecot.pipe, vnd.dovecot.filter and vnd.dovecot.execute extension
  # respectively. The name of each unix socket contained in that directory
  # directly maps to a program-name referenced from the Sieve script.
  #sieve_pipe_socket_dir = sieve-pipe
  #sieve_filter_socket_dir = sieve-filter
  #sieve_execute_socket_dir = sieve-execute

  # The directory where the scripts are located for direct execution by the
  # vnd.dovecot.pipe, vnd.dovecot.filter and vnd.dovecot.execute extension
  # respectively. The name of each script contained in that directory
  # directly maps to a program-name referenced from the Sieve script.
  #sieve_pipe_bin_dir = /usr/lib/dovecot/sieve-pipe
  sieve_filter_bin_dir = /usr/lib/dovecot/sieve-filter  # Uncomment this line
  #sieve_execute_bin_dir = /usr/lib/dovecot/sieve-execute
}

Reload Dovecot to apply the new configuration.

# systemctl reload dovecot

Shell script to execute bsfilter

Create a shell script that runs bsfilter with the required parameters at /usr/lib/dovecot/sieve-filter/10-bsfilter.sh.
The directory doesn't exist by default, so create it together with the first script.

bsfilter --config-file /etc/bsfilter.conf

Add execute permission to this script.

# chmod +x /usr/lib/dovecot/sieve-filter/10-bsfilter.sh
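
The directory creation, the script itself, and the permission change can be combined into one step. The following is only a minimal sketch: the function name install_bsfilter_wrapper and its optional directory argument are made up for illustration, and the "#!/bin/sh" line and "exec" inside the generated script are conventional additions assumed here, not part of the one-line script above.

```shell
#!/bin/sh
# Hypothetical helper: create the sieve-filter directory if it is missing,
# write the bsfilter wrapper script into it, and mark it executable.
# The optional first argument overrides the target directory (useful for a dry run).
install_bsfilter_wrapper() {
    dir="${1:-/usr/lib/dovecot/sieve-filter}"
    mkdir -p "$dir" || return 1
    # The quoted 'EOF' keeps the heredoc literal (no variable expansion).
    cat > "$dir/10-bsfilter.sh" <<'EOF'
#!/bin/sh
exec bsfilter --config-file /etc/bsfilter.conf
EOF
    chmod +x "$dir/10-bsfilter.sh"
}
```

Run as root, calling install_bsfilter_wrapper without arguments writes to the real location; passing a scratch directory first lets you inspect the result before touching /usr/lib/dovecot.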

Sieve script to execute bsfilter

As specified for "sieve_before", Dovecot Sieve needs a Sieve script in the directory /var/lib/dovecot/sieve.d/. Save the following as /var/lib/dovecot/sieve.d/10-bsfilter.sieve.

require ["fileinto", "mailbox", "vnd.dovecot.filter"];

filter "10-bsfilter.sh";

if header :contains ["X-Spam-Flag"] "Yes" {
  fileinto :create "Junk";
}

After bsfilter has filtered the message, the script checks the X-Spam-Flag header and files spam into the Junk folder.
":create" comes from the "mailbox" extension and creates the folder if it doesn't exist.

Global scripts have to be pre-compiled by hand.

# sievec /var/lib/dovecot/sieve.d/10-bsfilter.sieve
  • Dovecot normally compiles Sieve scripts automatically, but in this directory that fails with a permission error.
  • For per-user scripts, ManageSieve handles the compilation automatically.

Test

Send an email to any user account, and the delivered message should contain the newly added spam probability headers.

X-Spam-Flag: No
X-Spam-Probability: 0.000000

At this point all emails are classified as clean, because bsfilter hasn't learned any spam yet.
If you run into trouble, set "mail_debug" to yes in /etc/dovecot/conf.d/10-logging.conf to see exactly what is going on.
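
To check a delivered message quickly, its spam headers can be filtered out on the command line. This is a small sketch, assuming the message has been saved from the MUA as a plain text file; the function name spam_headers is hypothetical.

```shell
# Print only the X-Spam-* headers of a message read from stdin.
# Reading stops at the first empty line, i.e. at the end of the header
# section, so lookalike lines in the body are ignored.
spam_headers() {
    awk '/^\r?$/ { exit } /^X-Spam-/ { sub(/\r$/, ""); print }'
}
```

For example, "spam_headers < saved-message.eml" (the file name is a placeholder) prints the X-Spam-Flag and X-Spam-Probability lines if bsfilter added them.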


Spam/Clean Learning

I use the sdbox format (Dovecot's own) for mailboxes, but bsfilter cannot read this format directly.
Some extra steps are therefore required when learning spam/clean emails.

Prerequisites

  • Use the same Junk folder structure for all users (to keep the scripting simple)

Steps

  1. Pick up the emails to learn from one user. (Skip if there is nothing to learn.)
  2. Fetch those emails and export them to text files one by one. (Without the first line, which only shows 'text:')
  3. Let bsfilter learn exported emails.
  4. Update bsfilter probability DB.
  5. Do this process for each user.
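
Step 2 can be sketched as follows. This is an illustration under assumptions, not the original script: the helper names are hypothetical, and the message is addressed by the mailbox GUID and UID pair that "doveadm search" prints for each match.

```shell
# Drop the leading "text:" label line that `doveadm fetch ... text` prints
# before the message itself; everything else passes through unchanged.
strip_fetch_label() {
    sed '1{/^text:$/d;}'
}

# Hypothetical helper: export a single message to a plain text file.
# $1 = user, $2 = mailbox GUID, $3 = UID, $4 = output file
export_message() {
    user="$1"; guid="$2"; uid="$3"; out="$4"
    doveadm fetch -u "$user" text mailbox-guid "$guid" uid "$uid" \
        | strip_fetch_label > "$out"
}
```

Only a first line that is exactly "text:" is removed, so a message that has already been cleaned is not damaged by running it through the filter again.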

Prepare mail folders (directories)

In my case, I made a "Junk" folder at the same level as INBOX (top-level), with "not_clean" and "not_spam" folders under it.

Junk: Mails with "X-Spam-Flag: Yes"
 + not_clean: Put spam mails with "X-Spam-Flag: No" (Put spam emails that went through the filter)
 + not_spam:  Put clean mails with "X-Spam-Flag: Yes" (Put false positive emails)

Make these folders with your MUA.
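
If many accounts need the same folders, doveadm can create them instead of the MUA. A minimal sketch, assuming "/" is the hierarchy separator of the user's namespace; the function name is hypothetical. The command reports an error for folders that already exist.

```shell
# Hypothetical helper: create the learning folders for one user.
# Assumes "/" as hierarchy separator; -s also subscribes the user to them.
create_learning_folders() {
    user="$1"
    doveadm mailbox create -s -u "$user" Junk/not_clean Junk/not_spam
}
```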

Dovecot stats permission

Because the vmail user can't access the Dovecot stats socket, you will see the following error.

doveadm(vmail): Error: net_connect_unix(/var/run/dovecot/stats-writer) failed: Permission denied

This error doesn't seem to stop the main process, so one option is simply to ignore it. Alternatively, add vmail to the dovecot group to eliminate the error.

# adduser vmail dovecot

SSL config

The learning script runs as the user vmail, but vmail will fail to read the SSL certificate configured in /etc/dovecot/conf.d/10-ssl.conf.
Tweak the SSL configuration so that SSL is enabled only when dovecot (and doveadm) is executed by root.
(This howto was written here.)

  1. Copy 10-ssl.conf to 10-ssl.conf.ext
  2. Set 10-ssl.conf.ext permission to 600 (root only)
  3. Set SSL to "no" in 10-ssl.conf (for non-root users)
  4. Include and override 10-ssl.conf.ext if readable (executed by root)

# cd /etc/dovecot/conf.d
# cp 10-ssl.conf 10-ssl.conf.ext
# chmod 600 10-ssl.conf.ext

Change 10-ssl.conf as shown below.

# SSL/TLS support: yes, no, required. <doc/wiki/SSL.txt>
ssl = no

!include_try 10-ssl.conf.ext

It doesn't affect the normal dovecot process, but try restarting it to make sure.

# systemctl restart dovecot

Prepare learning script

Store this script in /var/lib/dovecot/scripts/bsfilter_learn.sh.
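
The original script is not included in this text. The following is a rough sketch that follows the steps listed under "Steps", not the author's script. Everything specific in it is an assumption: the function names, "/" as hierarchy separator, the bsfilter options --add-spam, --add-clean and --update, the ability of "doveadm user '*'" to list all users in your setup, and in particular the choice to move learned mails to Junk and INBOX afterwards, which the steps above do not specify.

```shell
#!/bin/sh
# Sketch of a learning script (assumptions are listed in the text above).
CONF=/etc/bsfilter.conf

# Learn every message in one folder, then move the messages out of it.
# $1 = user, $2 = source folder, $3 = bsfilter learning option, $4 = destination
learn_folder() {
    user="$1"; src="$2"; opt="$3"; dest="$4"
    tmpdir=$(mktemp -d) || return 1
    count=0
    # `doveadm search` prints one "mailbox-guid uid" pair per matching message.
    doveadm search -u "$user" mailbox "$src" all > "$tmpdir/list"
    while read -r guid uid; do
        [ -n "$uid" ] || continue
        count=$((count + 1))
        # Export one message per file, without the leading "text:" line.
        doveadm fetch -u "$user" text mailbox-guid "$guid" uid "$uid" \
            < /dev/null | sed '1{/^text:$/d;}' > "$tmpdir/$count.eml"
        # Feed the exported mail to bsfilter on stdin.
        bsfilter --config-file "$CONF" "$opt" < "$tmpdir/$count.eml"
    done < "$tmpdir/list"
    if [ "$count" -gt 0 ]; then
        # Recalculate the probability DB only when something was learned.
        bsfilter --config-file "$CONF" --update < /dev/null
        doveadm move -u "$user" "$dest" mailbox "$src" all
    fi
    rm -rf "$tmpdir"
}

# Repeat the process for every user known to Dovecot.
learn_all_users() {
    doveadm user '*' | while read -r user; do
        learn_folder "$user" Junk/not_clean --add-spam Junk < /dev/null
        learn_folder "$user" Junk/not_spam --add-clean INBOX < /dev/null
    done
}
```

The block only defines functions. To use it as the cron script, end the file with a call to learn_all_users, and try learn_folder on a single test account first.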

Cronjob

Add a cron job as /etc/cron.d/bsfilter_learn.

SHELL=/bin/bash
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin

# Example of job definition:
# .---------------- minute (0 - 59)
# |  .------------- hour (0 - 23)
# |  |  .---------- day of month (1 - 31)
# |  |  |  .------- month (1 - 12) OR jan,feb,mar,apr ...
# |  |  |  |  .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# |  |  |  |  |
# *  *  *  *  * user-name command to be executed
 */30 * *  *  * vmail     bash /var/lib/dovecot/scripts/bsfilter_learn.sh > /dev/null

bsfilter writes the content of learned mails to STDOUT, and cron delivers the STDOUT of cron jobs by email. As a result, you would receive every learned spam and clean mail whenever any user triggers learning. This is just annoying, so STDOUT is discarded to /dev/null.
STDERR is not redirected, so if the learning process has any errors, they are still reported by mail to vmail@example.jp.


Update History

2021-09-20

  • Update to Bullseye version
  • Rewrite the bsfilter_learn script to make more use of doveadm

2021-09-26

  • Change bsfilter_learn.sh from pasted text to embedded gist