Do not try to do it all with grok. I would split off the common leading section with dissect, then pull out the ModSecurity message using grok, then parse the remaining bracketed key/value pairs using a kv filter. Something like:
dissect { mapping => { "message" => "[%{ts}] [:%{level}] [pid %{pid}] [client %{clientA}] [client %{clientB}] %{[@metadata][restOfLine]}" } }
grok { match => { "[@metadata][restOfLine]" => [ "ModSecurity: (?<theMessage>[^\[]+ )(?<[@metadata][theRest]>\[.*)" ] } }
kv { source => "[@metadata][theRest]" field_split => "\]\[" value_split => " " }
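If you want to try those three filters without touching your real input, a throwaway pipeline along these lines should do. This is only a sketch: the log line fed to the generator input is a made-up sample shaped to fit the dissect mapping, not a line from your logs, and the rubydebug codec is told to print [@metadata] so you can see the intermediate fields.

```
input {
  generator {
    count => 1
    # Made-up sample line; replace it with a line from your own error log.
    lines => [ '[Mon Jan 01 00:00:00.000000 2024] [:error] [pid 1234] [client 192.0.2.1] [client 192.0.2.1] ModSecurity: Warning. Pattern match [file "/etc/modsecurity/example.conf"] [line "10"] [id "100000"]' ]
  }
}

filter {
  # Fixed-position prefix: cheap to split with dissect.
  dissect { mapping => { "message" => "[%{ts}] [:%{level}] [pid %{pid}] [client %{clientA}] [client %{clientB}] %{[@metadata][restOfLine]}" } }
  # Free-text message up to the first "[", the rest is kept in @metadata.
  grok { match => { "[@metadata][restOfLine]" => [ "ModSecurity: (?<theMessage>[^\[]+ )(?<[@metadata][theRest]>\[.*)" ] } }
  # Bracketed key "value" pairs.
  kv { source => "[@metadata][theRest]" field_split => "\]\[" value_split => " " }
}

output {
  # metadata => true makes the [@metadata] fields visible while debugging.
  stdout { codec => rubydebug { metadata => true } }
}
```

Once the fields look right you can drop the generator and stdout sections and keep just the filters.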
grok is one of the most powerful (and popular) filters for parsing events. That's exactly why you should at least consider the rest of the filters to see if something more specific (and therefore cheaper) can do the job.
If you need tag to always be an array, so that a lone string becomes an array with a single member, then I would use a ruby filter for that.
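A minimal sketch of that ruby filter, assuming the field is a top-level field literally called tag (adjust the name if the kv filter puts it somewhere else):

```
filter {
  ruby {
    code => '
      t = event.get("tag")
      # Wrap a lone string in an array; arrays and missing fields are left alone.
      if t.is_a?(String)
        event.set("tag", [ t ])
      end
    '
  }
}
```

Events where tag is already an array, or where it does not exist at all, pass through unchanged.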