Package org.apache.lucene.util.fst
Class FST<T>
java.lang.Object
org.apache.lucene.util.fst.FST<T>
- All Implemented Interfaces:
Accountable
Represents an finite state machine (FST), using a compact byte[] format.
The format is similar to what's used by Morfologik (https://github.com/morfologik/morfologik-stemming).
See the package documentation
for some simple examples.
-
Nested Class Summary
Nested ClassesModifier and TypeClassDescriptionstatic final class
Represents a single arc.static class
Reads bytes stored in an FST.static enum
Specifies allowed range of each int input label for this FST. -
Field Summary
FieldsModifier and TypeFieldDescriptionstatic final byte
Value of the arc flags to declare a node with fixed length arcs designed for binary search.(package private) static final byte
Value of the arc flags to declare a node with fixed length arcs and bit table designed for direct addressing.private static final long
private static final int
static final int
This flag is set if the arc has an output.private static final int
(package private) static final int
private static final int
(package private) static final int
(package private) final BytesStore
ABytesStore
, used during building, or during reading when the FST is very large (more than 1 GB).private static final int
private static final float
Maximum oversizing factor allowed for direct addressing compared to binary search when expansion credits allow the oversizing.(package private) T
static final int
If arc has this label then that arc is final/acceptedprivate static final String
private static final long
(package private) static final int
(package private) static final int
(package private) static final int
private final FSTStore
(package private) final FST.INPUT_TYPE
private static final long
private long
private final int
private static final int
private static final int
private static final int
Fields inherited from interface org.apache.lucene.util.Accountable
NULL_ACCOUNTABLE
-
Constructor Summary
ConstructorsConstructorDescriptionLoad a previously saved FST.Load a previously saved FST; maxBlockBits allows you to control the size of the byte[] pages used to hold the FST bytes.FST
(FST.INPUT_TYPE inputType, Outputs<T> outputs, int bytesPageBits) -
Method Summary
Modifier and TypeMethodDescription(package private) long
addNode
(FSTCompiler<T> fstCompiler, FSTCompiler.UnCompiledNode<T> nodeIn) findTargetArc
(int labelToMatch, FST.Arc<T> follow, FST.Arc<T> arc, FST.BytesReader in) Finds an arc leaving the incoming arc, replacing the arc in place.(package private) void
finish
(long newStartNode) private static boolean
flag
(int flags, int bit) Returns aFST.BytesReader
for this FST, positioned at position 0.getFirstArc
(FST.Arc<T> arc) Fills virtual 'start' arc, ie, an empty incoming arc to the FST's start nodeprivate static int
getNumPresenceBytes
(int labelRange) Gets the number of bytes required to flag the presence of each arc in the given label range, one bit per arc.(package private) boolean
isExpandedTarget
(FST.Arc<T> follow, FST.BytesReader in) Returns whetherarc
's target points to a node in expanded format (fixed length arcs).long
Return the memory usage of this object in bytes.static <T> FST<T>
Reads an automaton from a file.readArc
(FST.Arc<T> arc, FST.BytesReader in) Reads an arc.readArcByDirectAddressing
(FST.Arc<T> arc, FST.BytesReader in, int rangeIndex) Reads a present direct addressing node arc, with the provided index in the label range.readArcByDirectAddressing
(FST.Arc<T> arc, FST.BytesReader in, int rangeIndex, int presenceIndex) Reads a present direct addressing node arc, with the provided index in the label range and its corresponding presence index (which is the count of presence bits before it).readArcByIndex
(FST.Arc<T> arc, FST.BytesReader in, int idx) (package private) static <T> FST.Arc<T>
readEndArc
(FST.Arc<T> follow, FST.Arc<T> arc) readFirstRealTargetArc
(long nodeAddress, FST.Arc<T> arc, FST.BytesReader in) readFirstTargetArc
(FST.Arc<T> follow, FST.Arc<T> arc, FST.BytesReader in) Follow thefollow
arc and read the first arc of its target; this changes the providedarc
(2nd arg) in-place and returns it.int
Reads one BYTE1/2/4 label from the providedDataInput
.readLastArcByDirectAddressing
(FST.Arc<T> arc, FST.BytesReader in) Reads the last arc of a direct addressing node.readLastTargetArc
(FST.Arc<T> follow, FST.Arc<T> arc, FST.BytesReader in) Follows thefollow
arc and reads the last arc of its target; this changes the providedarc
(2nd arg) in-place and returns it.readNextArc
(FST.Arc<T> arc, FST.BytesReader in) In-place read; returns the arc.(package private) int
readNextArcLabel
(FST.Arc<T> arc, FST.BytesReader in) Peeks at next arc's label; does not alter arc.readNextRealArc
(FST.Arc<T> arc, FST.BytesReader in) Never returns null, but you should never call this if arc.isLast() is true.private void
readPresenceBytes
(FST.Arc<T> arc, FST.BytesReader in) Reads the presence bits of a direct-addressing node.private long
void
Writes an automaton to a file.void
save
(DataOutput metaOut, DataOutput out) private void
(package private) void
setEmptyOutput
(T v) private boolean
shouldExpandNodeWithDirectAddressing
(FSTCompiler<T> fstCompiler, FSTCompiler.UnCompiledNode<T> nodeIn, int numBytesPerArc, int maxBytesPerArcWithoutLabel, int labelRange) Returns whether the given node should be expanded with direct addressing instead of binary search.private boolean
shouldExpandNodeWithFixedLengthArcs
(FSTCompiler<T> fstCompiler, FSTCompiler.UnCompiledNode<T> node) Returns whether the given node should be expanded with fixed length arcs.static <T> boolean
targetHasArcs
(FST.Arc<T> arc) returns true if the node at this address has any outgoing arcstoString()
private void
writeLabel
(DataOutput out, int v) private void
writeNodeForBinarySearch
(FSTCompiler<T> fstCompiler, FSTCompiler.UnCompiledNode<T> nodeIn, long startAddress, int maxBytesPerArc) private void
writeNodeForDirectAddressing
(FSTCompiler<T> fstCompiler, FSTCompiler.UnCompiledNode<T> nodeIn, long startAddress, int maxBytesPerArcWithoutLabel, int labelRange) private void
writePresenceBits
(FSTCompiler<T> fstCompiler, FSTCompiler.UnCompiledNode<T> nodeIn, long dest, int numPresenceBytes) Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface org.apache.lucene.util.Accountable
getChildResources
-
Field Details
-
BASE_RAM_BYTES_USED
private static final long BASE_RAM_BYTES_USED -
BIT_FINAL_ARC
private static final int BIT_FINAL_ARC- See Also:
-
BIT_LAST_ARC
static final int BIT_LAST_ARC- See Also:
-
BIT_TARGET_NEXT
static final int BIT_TARGET_NEXT- See Also:
-
BIT_STOP_NODE
private static final int BIT_STOP_NODE- See Also:
-
BIT_ARC_HAS_OUTPUT
public static final int BIT_ARC_HAS_OUTPUTThis flag is set if the arc has an output.- See Also:
-
BIT_ARC_HAS_FINAL_OUTPUT
private static final int BIT_ARC_HAS_FINAL_OUTPUT- See Also:
-
ARCS_FOR_BINARY_SEARCH
public static final byte ARCS_FOR_BINARY_SEARCHValue of the arc flags to declare a node with fixed length arcs designed for binary search.- See Also:
-
ARCS_FOR_DIRECT_ADDRESSING
static final byte ARCS_FOR_DIRECT_ADDRESSINGValue of the arc flags to declare a node with fixed length arcs and bit table designed for direct addressing.- See Also:
-
FIXED_LENGTH_ARC_SHALLOW_DEPTH
static final int FIXED_LENGTH_ARC_SHALLOW_DEPTH -
FIXED_LENGTH_ARC_SHALLOW_NUM_ARCS
static final int FIXED_LENGTH_ARC_SHALLOW_NUM_ARCS -
FIXED_LENGTH_ARC_DEEP_NUM_ARCS
static final int FIXED_LENGTH_ARC_DEEP_NUM_ARCS -
DIRECT_ADDRESSING_MAX_OVERSIZE_WITH_CREDIT_FACTOR
private static final float DIRECT_ADDRESSING_MAX_OVERSIZE_WITH_CREDIT_FACTORMaximum oversizing factor allowed for direct addressing compared to binary search when expansion credits allow the oversizing. This factor prevents expansions that are obviously too costly even if there are sufficient credits. -
FILE_FORMAT_NAME
- See Also:
-
VERSION_START
private static final int VERSION_START- See Also:
-
VERSION_LITTLE_ENDIAN
private static final int VERSION_LITTLE_ENDIAN- See Also:
-
VERSION_CURRENT
private static final int VERSION_CURRENT- See Also:
-
FINAL_END_NODE
private static final long FINAL_END_NODE- See Also:
-
NON_FINAL_END_NODE
private static final long NON_FINAL_END_NODE- See Also:
-
END_LABEL
public static final int END_LABELIf arc has this label then that arc is final/accepted- See Also:
-
inputType
-
emptyOutput
T emptyOutput -
bytes
ABytesStore
, used during building, or during reading when the FST is very large (more than 1 GB). If the FST is less than 1 GB then bytesArray is set instead. -
fstStore
-
startNode
private long startNode -
outputs
-
version
private final int version -
DEFAULT_MAX_BLOCK_BITS
private static final int DEFAULT_MAX_BLOCK_BITS
-
-
Constructor Details
-
FST
FST(FST.INPUT_TYPE inputType, Outputs<T> outputs, int bytesPageBits) -
FST
Load a previously saved FST.- Throws:
IOException
-
FST
public FST(DataInput metaIn, DataInput in, Outputs<T> outputs, FSTStore fstStore) throws IOException Load a previously saved FST; maxBlockBits allows you to control the size of the byte[] pages used to hold the FST bytes.- Throws:
IOException
-
-
Method Details
-
flag
private static boolean flag(int flags, int bit) -
ramBytesUsed
public long ramBytesUsed()Description copied from interface:Accountable
Return the memory usage of this object in bytes. Negative values are illegal.- Specified by:
ramBytesUsed
in interfaceAccountable
-
toString
-
finish
- Throws:
IOException
-
getEmptyOutput
-
setEmptyOutput
-
save
- Throws:
IOException
-
save
Writes an automaton to a file.- Throws:
IOException
-
read
Reads an automaton from a file.- Throws:
IOException
-
writeLabel
- Throws:
IOException
-
readLabel
Reads one BYTE1/2/4 label from the providedDataInput
.- Throws:
IOException
-
targetHasArcs
returns true if the node at this address has any outgoing arcs -
addNode
- Throws:
IOException
-
shouldExpandNodeWithFixedLengthArcs
private boolean shouldExpandNodeWithFixedLengthArcs(FSTCompiler<T> fstCompiler, FSTCompiler.UnCompiledNode<T> node) Returns whether the given node should be expanded with fixed length arcs. Nodes will be expanded depending on their depth (distance from the root node) and their number of arcs.Nodes with fixed length arcs use more space, because they encode all arcs with a fixed number of bytes, but they allow either binary search or direct addressing on the arcs (instead of linear scan) on lookup by arc label.
-
shouldExpandNodeWithDirectAddressing
private boolean shouldExpandNodeWithDirectAddressing(FSTCompiler<T> fstCompiler, FSTCompiler.UnCompiledNode<T> nodeIn, int numBytesPerArc, int maxBytesPerArcWithoutLabel, int labelRange) Returns whether the given node should be expanded with direct addressing instead of binary search.Prefer direct addressing for performance if it does not oversize binary search byte size too much, so that the arcs can be directly addressed by label.
-
writeNodeForBinarySearch
private void writeNodeForBinarySearch(FSTCompiler<T> fstCompiler, FSTCompiler.UnCompiledNode<T> nodeIn, long startAddress, int maxBytesPerArc) -
writeNodeForDirectAddressing
private void writeNodeForDirectAddressing(FSTCompiler<T> fstCompiler, FSTCompiler.UnCompiledNode<T> nodeIn, long startAddress, int maxBytesPerArcWithoutLabel, int labelRange) -
writePresenceBits
private void writePresenceBits(FSTCompiler<T> fstCompiler, FSTCompiler.UnCompiledNode<T> nodeIn, long dest, int numPresenceBytes) -
getNumPresenceBytes
private static int getNumPresenceBytes(int labelRange) Gets the number of bytes required to flag the presence of each arc in the given label range, one bit per arc. -
readPresenceBytes
Reads the presence bits of a direct-addressing node. Actually we don't read them here, we just keep the pointer to the bit-table start and we skip them.- Throws:
IOException
-
getFirstArc
Fills virtual 'start' arc, ie, an empty incoming arc to the FST's start node -
readLastTargetArc
FST.Arc<T> readLastTargetArc(FST.Arc<T> follow, FST.Arc<T> arc, FST.BytesReader in) throws IOException Follows thefollow
arc and reads the last arc of its target; this changes the providedarc
(2nd arg) in-place and returns it.- Returns:
- Returns the second argument (
arc
). - Throws:
IOException
-
readUnpackedNodeTarget
- Throws:
IOException
-
readFirstTargetArc
public FST.Arc<T> readFirstTargetArc(FST.Arc<T> follow, FST.Arc<T> arc, FST.BytesReader in) throws IOException Follow thefollow
arc and read the first arc of its target; this changes the providedarc
(2nd arg) in-place and returns it.- Returns:
- Returns the second argument (
arc
). - Throws:
IOException
-
readFirstRealTargetArc
public FST.Arc<T> readFirstRealTargetArc(long nodeAddress, FST.Arc<T> arc, FST.BytesReader in) throws IOException - Throws:
IOException
-
isExpandedTarget
Returns whetherarc
's target points to a node in expanded format (fixed length arcs).- Throws:
IOException
-
readNextArc
In-place read; returns the arc.- Throws:
IOException
-
readNextArcLabel
Peeks at next arc's label; does not alter arc. Do not call this if arc.isLast()!- Throws:
IOException
-
readArcByIndex
- Throws:
IOException
-
readArcByDirectAddressing
public FST.Arc<T> readArcByDirectAddressing(FST.Arc<T> arc, FST.BytesReader in, int rangeIndex) throws IOException Reads a present direct addressing node arc, with the provided index in the label range.- Parameters:
rangeIndex
- The index of the arc in the label range. It must be present. The real arc offset is computed based on the presence bits of the direct addressing node.- Throws:
IOException
-
readArcByDirectAddressing
private FST.Arc<T> readArcByDirectAddressing(FST.Arc<T> arc, FST.BytesReader in, int rangeIndex, int presenceIndex) throws IOException Reads a present direct addressing node arc, with the provided index in the label range and its corresponding presence index (which is the count of presence bits before it).- Throws:
IOException
-
readLastArcByDirectAddressing
public FST.Arc<T> readLastArcByDirectAddressing(FST.Arc<T> arc, FST.BytesReader in) throws IOException Reads the last arc of a direct addressing node. This method is equivalent to callreadArcByDirectAddressing(Arc, BytesReader, int)
withrangeIndex
equal toarc.numArcs() - 1
, but it is faster.- Throws:
IOException
-
readNextRealArc
Never returns null, but you should never call this if arc.isLast() is true.- Throws:
IOException
-
readArc
Reads an arc.
Precondition: The arc flags byte has already been read and set; the given BytesReader is positioned just after the arc flags byte.- Throws:
IOException
-
readEndArc
-
findTargetArc
public FST.Arc<T> findTargetArc(int labelToMatch, FST.Arc<T> follow, FST.Arc<T> arc, FST.BytesReader in) throws IOException Finds an arc leaving the incoming arc, replacing the arc in place. This returns null if the arc was not found, else the incoming arc.- Throws:
IOException
-
seekToNextNode
- Throws:
IOException
-
getBytesReader
Returns aFST.BytesReader
for this FST, positioned at position 0.
-