[Windows Mobile .NET CF] Chinese-English Dictionary – Day9 | laneser - 點部落

2009-08-14

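I have written several network-related applications; with a network connection, the phone becomes a powerful front end.

Next, though, I want to write some Windows Mobile applications that work without a network, for example a Chinese-English dictionary.

A dictionary program is not hard to write; the real hurdle is the dictionary data file…

Some free dictionary files can be found online, for example pydict.

After studying its dictionary files, I found a few characteristics:
1. English entries are stored in separate files, one per letter from a to z
2. The entries are already sorted (a-z)
3. The encoding is Big5
4. Fields are separated by '=', in the order english=chinese=soundmark (phonetic transcription)

Based on the pydict source code, I wrote a bare-bones class that represents one dictionary entry.
It also converts the phonetic symbols and part-of-speech markers into readable text. The code: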

/// <summary>
/// Represents one Chinese-English dictionary entry
/// </summary>
public class dictdata
{
    /// <summary>
    /// File position of this entry (internal use)
    /// </summary>
    public long filepos { get; set; }

    /// <summary>
    /// english
    /// </summary>
    public string eng { get; set; }

    /// <summary>
    /// chinese
    /// </summary>
    public string chinese { get; set; }

    /// <summary>
    /// soundmark
    /// </summary>
    public string soundmark { get; set; }


    /// <summary>
    /// Part-of-speech table, indexed by control-character code (0-31)
    /// </summary>
    private static string[] prop = new string[] { 
        " ", " ", " ", "<<形容詞>>", "<<副詞>>", "art. ", 
        "<<連接詞>>", "int.  ", "<<名詞>>", " ", " ", "num. ", 
        "prep. ", " ", "pron.  ", "<<動詞>>", "<<助動詞>>", 
        "<<非及物動詞>>", "<<及物動詞>>", "vbl. ", " ", "st. ", 
        "pr. ", "<<過去分詞>>", "<<複數>>", "ing. ", " ", "<<形容詞>>", 
        "<<副詞>>", "pla. ", "pn. ", " " };

    /// <summary>
    /// Chinese definition with part-of-speech markers expanded for display
    /// </summary>
    public string displaychinese
    {
        get
        {
            // Replace each control character (code < prop.Length) with its part-of-speech label
            StringBuilder result = new StringBuilder();
            for (int i = 0; i < chinese.Length; i++)
            {
                int idx = Convert.ToInt32(chinese[i]);
                if (idx < prop.Length)
                {
                    if (result.Length != 0)
                    {
                        result.Append("\r\n");
                    }
                    result.Append(prop[idx]);
                }
                else
                    result.Append(chinese[i]);
            }
            return result.ToString();
        }
    }

    /// <summary>
    /// Phonetic symbol table
    /// </summary>
    private static Dictionary<int, string> soundmarktable = new Dictionary<int, string>() {
    {0x01,"I"},
    {0x02,"E"},
    {0x03,"ae"},
    {0x04,"a"},
    {0x06,"c"},
    {0x0b,"8"},
    {0x0e, "U"},
    {0x0f, "^"},
    {0x10, "2"},
    {0x11, "2*"},
    {0x13, "2~"},
    {0x17, "l."},
    {0x19, "n_"},
    {0x1c, "&"},
    {0x1d, "S"},
    {0x1e, "3"},
    };

    public string displaysoundmark
    {
        get
        {
            // Map each character through the phonetic symbol table
            StringBuilder result = new StringBuilder();
            for (int i = 0; i < soundmark.Length; i++)
            {
                int idx = Convert.ToInt32(soundmark[i]);
                if ((i == 0) && (idx == 0x65))
                    continue;
                string founddisplay;
                if (soundmarktable.TryGetValue(idx, out founddisplay) == true)
                    result.Append(founddisplay);
                else
                    result.Append(soundmark[i]);
            }
            return result.ToString();
        }
    }

    public string displaysoundmarkandchinese
    {
        get
        {
            return "音標:" + displaysoundmark + "\r\n" + displaychinese;
        }
    }


    private static char[] splitchar = new char[] { '=' };

    public dictdata(string data, long pos)
    {
        filepos = pos;
        if (string.IsNullOrEmpty(data) == false)
        {
            string[] parts = data.Split(splitchar);
            eng = (parts.Length >= 1) ? parts[0] : string.Empty;
            chinese = (parts.Length >= 2) ? parts[1] : string.Empty;
            soundmark = (parts.Length >= 3) ? parts[2] : string.Empty;
        }
        else
        {
            eng = string.Empty;
            chinese = string.Empty;
            soundmark = string.Empty;
        }
    }

    public override string ToString()
    {
        return eng;
    }
}
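
To make the marker convention concrete: judging from displaychinese above, any character in the chinese field with a code below 32 is treated as a part-of-speech marker, and every marker after the first starts a new line. Here is a minimal standalone sketch of that expansion. The class MarkerSketch, its Label helper, and the two-entry table are hypothetical and only excerpt codes 8 and 15 from the table above; the sample input is made up, not taken from the real pydict data.

```csharp
using System;
using System.Text;

public static class MarkerSketch
{
    // Hypothetical excerpt of the part-of-speech table:
    // code 8 = noun, code 15 = verb, anything else = blank.
    private static string Label(int code)
    {
        switch (code)
        {
            case 8: return "<<名詞>>";
            case 15: return "<<動詞>>";
            default: return " ";
        }
    }

    // Expands control characters (code < 32) into labels; every label
    // except the first one is preceded by a line break.
    public static string Expand(string raw)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in raw)
        {
            if (c < 32)
            {
                if (sb.Length != 0)
                    sb.Append("\r\n");
                sb.Append(Label(c));
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

For example, MarkerSketch.Expand("\u0008蘋果\u000F吃") yields "<<名詞>>蘋果", a line break, and then "<<動詞>>吃".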

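I plan to show two ways of querying such a dictionary file:
one searches the file directly, the other loads everything into memory and searches there.
Loading everything into memory is of course much simpler, because the search can use the .NET Framework collection classes directly.
But as you probably know, it is simpler only because the .NET Framework has already done much of the work for us.

Saving the sweetest part for last (like eating sugarcane from the tip down), I will start with searching the file directly.

The data is already sorted by English word, so we should exploit that ordering.
For sorted data, binary search is both fast and easy to write.

The .NET Framework does provide BinarySearch, but a file is addressed in bytes
while the data is organized in lines, so the built-in BinarySearch cannot be used here.
We have to write our own, and while computing positions we must also map byte offsets to whole lines.
In other words, this is a slightly modified binary search ^_^
(Search is more or less my day job, so this had better be written well or I will embarrass myself!)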
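
Before the actual code, here is a simplified sketch of the idea: bisect on byte offsets, then snap each midpoint back to the start of the line that contains it. This is not the implementation used in this post. It is an iterative lower-bound variant, and the names LineBinarySearchSketch, LineStart, ReadLineAt and FindLine are made up for illustration. It assumes the keys are sorted in ordinal order and that lines end with the byte 0x0A.

```csharp
using System;
using System.IO;
using System.Text;

public static class LineBinarySearchSketch
{
    // Returns the offset of the first byte of the line containing 'pos'.
    private static long LineStart(Stream s, long pos)
    {
        while (pos > 0)
        {
            s.Position = pos - 1;
            if (s.ReadByte() == '\n')
                break;              // the byte before 'pos' ends the previous line
            pos--;
        }
        return pos;
    }

    // Reads one line starting at 'start'; 'next' receives the offset of the following line.
    private static string ReadLineAt(Stream s, long start, Encoding enc, out long next)
    {
        s.Position = start;
        MemoryStream buf = new MemoryStream();
        int b;
        while ((b = s.ReadByte()) != -1 && b != '\n')
            buf.WriteByte((byte)b);
        next = s.Position;
        byte[] bytes = buf.ToArray();
        return enc.GetString(bytes, 0, bytes.Length).TrimEnd('\r');
    }

    // Returns the offset of the first line whose key (the text before '=')
    // is >= target, or s.Length if every key is smaller.
    public static long FindLine(Stream s, string target, Encoding enc)
    {
        long lo = 0, hi = s.Length;   // lo and hi always sit on line boundaries
        while (lo < hi)
        {
            long start = LineStart(s, lo + (hi - lo) / 2);
            long next;
            string key = ReadLineAt(s, start, enc, out next).Split('=')[0];
            if (string.CompareOrdinal(key, target) < 0)
                lo = next;            // the answer lies after this line
            else
                hi = start;           // this line or an earlier one
        }
        return lo;
    }
}
```

For an ASCII stream holding "apple=A\nbanana=B\ncherry=C\n", FindLine returns 8 for "banana" (an exact match) and 17 for "blueberry" (the start of the cherry line, the nearest larger entry).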

The main BinarySearch function:

/// <summary>
/// Uses binary search to find the entry for the English word index; if it is not found, returns the closest entry.
/// </summary>
/// <param name="r"></param>
/// <param name="index"></param>
/// <param name="zonebegin"></param>
/// <param name="zoneend"></param>
/// <param name="encode"></param>
/// <returns></returns>
public dictdata SeekToLine(Stream r, string index, long zonebegin, long zoneend, Encoding encode)
{
    // Take the middle entry with ReadPreLine by default.
    r.Position = (zonebegin + zoneend) / 2;
    dictdata middledata = ReadPreLine(r, encode);

    // Found the exact English word
    if (string.Compare(middledata.eng, index, true) == 0)
        return middledata;

    // ReadPreLine did not hit the word; the middle entry may have to come from ReadNextLine...
    if (middledata.filepos == zonebegin)
    {
        r.Position = (zonebegin + zoneend) / 2;
        middledata = ReadNextLine(r, encode);

        // Found the exact English word
        if (string.Compare(middledata.eng, index, true) == 0)
            return middledata;

        // No exact match; this is the closest entry
        if (middledata.filepos == zoneend)
            return middledata;
    }

    string middleindexlow = middledata.eng.ToLower();
    int cmp = index.CompareTo(middleindexlow);
    if (cmp < 0)
    {
        // Search the left half
        return SeekToLine(r, index, zonebegin, middledata.filepos, encode);
    }
    else
    {
        // Search the right half
        return SeekToLine(r, index, middledata.filepos, zoneend, encode);
    }
}

Of course, this needs helper functions that map a byte position to whole lines of data:

/// <summary>
/// Searches backward until endtag is found; the returned position points to the byte after endtag
/// </summary>
/// <param name="r"></param>
/// <param name="endtag"></param>
/// <returns></returns>
private long seekback(Stream r, int endtag)
{
    long currentpos = r.Position;
    while (currentpos > 0)
    {
        currentpos--;
        r.Position = currentpos;
        if (r.ReadByte() == endtag)
            return currentpos+1;            
    }
    r.Position = 0;
    return 0;
}

/// <summary>
/// Searches forward until begintag is found; the returned position points to the byte after begintag
/// </summary>
/// <param name="r"></param>
/// <param name="begintag"></param>
/// <returns></returns>
private long seeknext(Stream r, int begintag)
{
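    long currentpos = r.Position;
    long finalpos = r.Length;
    while (currentpos < finalpos)
    {
        // Check the byte at the current position first, so that a begintag
        // sitting exactly at the starting position is not skipped.
        r.Position = currentpos;
        currentpos++;
        if (r.ReadByte() == begintag)
            return currentpos;
    }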
    r.Position = finalpos;
    return finalpos;
}

/// <summary>
/// Reads the line at or before the current position of stream r (scans back to the line start)
/// </summary>
/// <param name="r"></param>
/// <param name="encode"></param>
/// <returns></returns>
public dictdata ReadPreLine(Stream r, Encoding encode)
{
    long pos = seekback(r, 0x0a);
    StreamReader endReader = new StreamReader(r, encode);
    try
    {
        return new dictdata(endReader.ReadLine(), pos);
    }
    finally
    {
        endReader.DiscardBufferedData();
    }
}

/// <summary>
/// Reads the line after the current position of stream r
/// </summary>
/// <param name="r"></param>
/// <param name="encode"></param>
/// <returns></returns>
public dictdata ReadNextLine(Stream r, Encoding encode)
{
    long pos = seeknext(r, 0x0a);
    StreamReader sr = new StreamReader(r, encode);
    try
    {
        return new dictdata(sr.ReadLine(), pos);
    }
    finally
    {
        sr.DiscardBufferedData();
    }
}

With that, we can write the English-to-Chinese lookup:

/// <summary>
/// Finds the entries closest to index, returning at most maxcount entries
/// </summary>
/// <param name="r"></param>
/// <param name="index"></param>
/// <param name="encode"></param>
/// <param name="maxcount"></param>
/// <returns></returns>
private List<dictdata> SeekData(Stream r, string index, Encoding encode, int maxcount)
{           
    dictdata firstdata = SeekToLine(r, index, 0, r.Length, encode);
    return ReadData(r, firstdata.filepos, encode, maxcount);
}

/// <summary>
/// Reads entries starting at beginpos
/// </summary>
/// <param name="r"></param>
/// <param name="beginpos"></param>
/// <param name="encode"></param>
/// <param name="maxcount"></param>
/// <returns></returns>
private List<dictdata> ReadData(Stream r, long beginpos, Encoding encode, int maxcount)
{
    List<dictdata> result = new List<dictdata>();
    r.Position = beginpos;
    StreamReader sr = new StreamReader(r, encode);
    try
    {
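        // Stop at end of file instead of appending empty entries.
        string linedata;
        while ((maxcount-- > 0) && ((linedata = sr.ReadLine()) != null))
            result.Add(new dictdata(linedata, 0));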
    }
    finally
    {
        sr.DiscardBufferedData();
    }
    return result;
}

/// <summary>
/// Looks up an English word, returning at most maxcount dictionary entries
/// </summary>
/// <param name="english"></param>
/// <param name="maxcount"></param>
/// <returns></returns>
public List<dictdata> EnglishToChinese(string english, int maxcount)
{
    if (english.Length == 0)
        return new List<dictdata>();

    string filename = Path.Combine(libpath, english[0] + ".lib");
    if ((filename != lastopenfile) || (lastopenfs == null))
    {
        if (lastopenfs != null)
        {
            lastopenfs.Dispose();
            lastopenfs = null;
        }
        if (File.Exists(filename))
        {
            lastopenfile = filename;
            lastopenfs = File.OpenRead(filename);
        }
    }

    if (lastopenfs != null)
    {
        // When the English term is a single character we can take a shortcut:
        // the first line of the file is usually that very entry.
        if (english.Length == 1)
        {
            lastopenfs.Position = 0;
            StreamReader sr = new StreamReader(lastopenfs, encode);
            try
            {
                dictdata firstitem = new dictdata(sr.ReadLine(), 0);
                if (firstitem.eng == english.ToLower())
                {
                    // Yes, the first line is the one we want
                    return ReadData(lastopenfs, 0, encode, maxcount);
                }
            }
            finally
            {
                sr.DiscardBufferedData();
            }
        }

        return SeekData(lastopenfs, english.ToLower(), encode, maxcount);
    }
    else
        return new List<dictdata>();
}

We can of course support Chinese-to-English lookup too. It is simple and brute-force:

public List<dictdata> ChineseToEnglish(string chinese)
{
    var result = new List<dictdata>();
    List<string> libfiles = new List<string>(Directory.GetFiles(libpath, "*.lib"));
    libfiles.Sort();
    foreach (string libfile in libfiles)
    {
        using (FileStream fs = File.OpenRead(libfile))
        using (StreamReader sr = new StreamReader(fs, encode))
        {
            string linedata;
            while ((linedata = sr.ReadLine()) != null)
            {
                var dictitem = new dictdata(linedata, 0);
                if (dictitem.chinese.IndexOf(chinese) >= 0)
                {
                    // found.
                    result.Add(dictitem);
                }
            }
        }
    }
    return result;
}

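If there is enough memory (loading all the dictionary files takes roughly 10 MB),
searching directly in memory makes the whole program much simpler…
So we can design a common interface that lets callers switch easily between in-memory search
and file search.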

/// <summary>
/// Dictionary interface
/// </summary>
public interface IDict : IDisposable
{
    List<dictdata> EnglishToChinese(string english, int maxcount);
    List<dictdata> ChineseToEnglish(string chinese);
}

The file-search implementation then looks like this:

/// <summary>
/// Uses the pydict dictionary files without loading them into memory; searches directly in the files
/// </summary>
public class dict : IDict
{
    /// <summary>
    /// dict lib path
    /// </summary>
    private string libpath;

    /// <summary>
    /// Name of the most recently opened file (kept as a speed-up)
    /// </summary>
    private string lastopenfile;

    /// <summary>
    /// Most recently opened FileStream (kept as a speed-up)
    /// </summary>
    private FileStream lastopenfs;

    /// <summary>
    /// Encoding of the dictionary files
    /// </summary>
    private static Encoding encode = Encoding.GetEncoding("Big5");

    public dict()
    {
        libpath =
            Path.Combine(
            System.IO.Path.GetDirectoryName(System.Reflection.Assembly.GetExecutingAssembly().GetName().CodeBase),
            "lib");
    }

    // Omitted... the body consists of the methods shown above...

    #region IDisposable members
    public void Dispose()
    {
        if (lastopenfs != null)
        {
            lastopenfs.Dispose();
            lastopenfile = null;
            lastopenfs = null;

        }
    }
    #endregion

}

And here is the complete in-memory dictionary (much simpler than searching the files directly, isn't it?):

/// <summary>
/// Loads the pydict dictionary files into memory and searches in memory
/// </summary>
public class memdict : IDict
{
    /// <summary>
    /// dict lib path
    /// </summary>
    private string libpath;

    /// <summary>
    /// All dictionary entries, in sorted order
    /// </summary>
    private List<dictdata> allsorteddict;

    private static Encoding encode = Encoding.GetEncoding("Big5");

    public memdict()
    {
        libpath =
            Path.Combine(
            System.IO.Path.GetDirectoryName(System.Reflection.Assembly.GetExecutingAssembly().GetName().CodeBase),
            "lib");

        List<string> libfiles = new List<string>(Directory.GetFiles(libpath, "*.lib"));
        libfiles.Sort();
        allsorteddict = new List<dictdata>();
        foreach (string libfile in libfiles)
        {
            using (FileStream fs = File.OpenRead(libfile))
            using (StreamReader sr = new StreamReader(fs, encode))
            {
                string linedata;
                while ((linedata = sr.ReadLine()) != null)
                {
                    allsorteddict.Add(new dictdata(linedata, 0));
                }
            }
        }
    }

    /// <summary>
    /// Looks up an English word, returning at most maxcount dictionary entries
    /// </summary>
    /// <param name="english"></param>
    /// <param name="maxcount"></param>
    /// <returns></returns>
    public List<dictdata> EnglishToChinese(string english, int maxcount)
    {
        if (english.Length == 0)
            return new List<dictdata>();

        int idx = allsorteddict.BinarySearch(new dictdata(english.ToLower() + "=", 0),
            new ComparisonComparer<dictdata>((x,y) => x.eng.CompareTo(y.eng)));

        bool isfound = (idx >= 0);

        if (isfound == false)
            idx = ~idx;

        var result = new List<dictdata>();
        for (int i = idx; (i < allsorteddict.Count) && (i < (idx + maxcount)); i++)
        {
            result.Add(allsorteddict[i]);
        }
        return result;
    }

    public List<dictdata> ChineseToEnglish(string chinese)
    {
        var result = new List<dictdata>();
        foreach (var dictitem in allsorteddict)
        {
            if (dictitem.chinese.IndexOf(chinese) >= 0)
            {
                // found.
                result.Add(dictitem);
            }
        }
        return result;
    }

    #region IDisposable members

    public void Dispose()
    {
    }

    #endregion
}

Oh, and it needs a small helper that turns a Comparison delegate into an object implementing IComparer:

/// <summary>
/// Wraps a Comparison delegate in an object that implements IComparer
/// </summary>
/// <typeparam name="T"></typeparam>
public sealed class ComparisonComparer<T> : IComparer<T>
{
    private readonly Comparison<T> comparison;

    public ComparisonComparer(Comparison<T> comparison)
    {
        this.comparison = comparison;
    }

    public int Compare(T x, T y)
    {
        return comparison(x, y);
    }
}
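
The in-memory lookup above relies on one detail of List<T>.BinarySearch: when the item is not found, it returns the bitwise complement of the index of the next larger element (or of Count if there is none), so ~idx gives the position of the closest following entry. Here is a minimal standalone sketch of that behavior on plain strings. The class NearestMatchSketch and its Lookup method are illustrative names, and it uses StringComparer.Ordinal in place of the comparer used in memdict.

```csharp
using System;
using System.Collections.Generic;

public static class NearestMatchSketch
{
    // Returns up to maxCount entries starting at the exact match, or at the
    // first entry greater than 'word' when there is no exact match.
    public static List<string> Lookup(List<string> sorted, string word, int maxCount)
    {
        int idx = sorted.BinarySearch(word, StringComparer.Ordinal);
        if (idx < 0)
            idx = ~idx;   // complement = index of the next larger element (or Count)

        List<string> result = new List<string>();
        for (int i = idx; i < sorted.Count && i < idx + maxCount; i++)
            result.Add(sorted[i]);
        return result;
    }
}
```

With the sorted list apple, banana, cherry, looking up "b" with maxCount 2 returns banana and cherry, while looking up "zzz" returns an empty list.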

As usual, now that the functional code is done,
it is time to drag and drop the UI.

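As you can see, the design is very simple: type text into the TextBox at the top.
For English-to-Chinese, binary search is fast (whether on file or in memory),
so we can search as you type, which is hard to do over a network.
After switching to Chinese-to-English, the lookup is brute force and takes a while, so it cannot search as you type;
it relies on the search button on the right.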

The instant English-to-Chinese search lives in the TextChanged handler of the top TextBox (Chinese-to-English returns early there):

private void textBox1_TextChanged(object sender, EventArgs e)
{
    // Chinese-to-English cannot be done as an instant search
    if (IsEnglishToChinese == false)
        return;

    int maxcount = 20;
    var dicitems = dic.EnglishToChinese(textBox1.Text, maxcount);
    updatelist(dicitems,
        (dicitems.Count > 0) ?
        (String.Compare(dicitems[0].eng, textBox1.Text, true) == 0) : false);
}

private void updatelist(List<dictdata> dictdatas, bool shoulddisplayfirst)
{
    listBox1.BeginUpdate();
    try
    {
        listBox1.Items.Clear();

        foreach (var ditem in dictdatas)
            listBox1.Items.Add(ditem);

        if ((dictdatas.Count > 0) && (shoulddisplayfirst == true))
        {
            listBox1.SelectedIndex = 0;
            textBox2.Text = dictdatas[0].displaysoundmarkandchinese;
        }
        else
            textBox2.Text = string.Empty;
    }
    finally
    {
        listBox1.EndUpdate();
    }
}

Then, when the user selects an entry in the candidate English list on the left, we show the Chinese content on the right:

private void listBox1_SelectedIndexChanged(object sender, EventArgs e)
{
    dictdata dict = listBox1.SelectedItem as dictdata;
    if (dict != null)
        textBox2.Text = dict.displaysoundmarkandchinese;
    else
        textBox2.Text = string.Empty;
}

The Chinese lookup runs when the Search button is pressed, and the UI also needs to show a wait state:

private void button1_Click(object sender, EventArgs e)
{
    // English-to-Chinese is already searched instantly; no need to search again
    if (IsEnglishToChinese == true)
        return;

    Cursor.Current = Cursors.WaitCursor;
    try
    {
        var result = dic.ChineseToEnglish(textBox1.Text);
        updatelist(result, true);
    }
    finally
    {
        Cursor.Current = Cursors.Default;
    }
}

Finally, the code that switches between Chinese-to-English and English-to-Chinese:

private void menuItem4_Click(object sender, EventArgs e)
{
    updateEnglishToChineseStatus(false);
}

private void updateEnglishToChineseStatus(bool isengtochinese)
{
    IsEnglishToChinese = isengtochinese;
    menuItem4.Checked = !IsEnglishToChinese;
    menuItem5.Checked = IsEnglishToChinese;
    this.Text = IsEnglishToChinese ? "英翻中" : "中翻英";
}

private void menuItem5_Click(object sender, EventArgs e)
{
    updateEnglishToChineseStatus(true);
}

Because both classes implement one common interface, switching to in-memory search is simple:

private void menuItem3_Click(object sender, EventArgs e)
{
    if (dic is memdict)
        return; // already loaded into memory
    dic.Dispose();

    Cursor.Current = Cursors.WaitCursor;
    try
    {
        dic = new memdict();
    }
    finally
    {
        Cursor.Current = Cursors.Default;
    }
}

A sample screenshot of it in use:

…

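Yes, the plan is to keep working toward complete samples~~~

The source archive could not be uploaded with all the dictionary files included,
so I kept only the dictionary file for a; anyone interested can add the rest themselves: wm6dict.zip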
