Joonas' Note

Content-based File Format Detection (File Extension Prediction)

2022. 5. 18. 23:29 joonas

    Dataset

    https://www.kaggle.com/datasets/joonasyoon/file-format-detection

     

    Programming Laungages and File Format Detection

    can you know what file format is? and written in which language?


    Code

    https://www.kaggle.com/code/joonasyoon/ml-content-based-file-format-detection

     

    [ML] 💾 Content-based File Format Detection 📃



    Context

    I built everything myself, from the dataset to the ML models, and checked how well it works.

    I did not originally plan to go as far as writing ML models. A file extension is just part of the file name, so the project started from a simple question: with the extension removed, can you predict which language a file is written in from its contents alone?

    Example files in D, C, and Go

    GitHub hosts an enormous amount of public code, written in many programming languages and in many coding styles. So I collected some of it and built a dataset.

    I gathered more than 80,000 text files from more than 30 repositories. Because the collection was random, Dart, Rust, C#, and Go ended up with the most files. The remaining languages still had a reasonable amount of data given their file sizes, so training did not look difficult.

    Even so, I decided to train and predict only on languages with more than 500 files. They are listed below.

    • C
    • C#
    • C++
    • Dart
    • Diff
    • Elixir
    • GAS
    • GLSL
    • Go
    • JSON
    • Java
    • Javascript
    • Julia
    • Kotlin
    • Markdown
    • PHP
    • Ruby
    • Rust
    • SQL
    • Text
    • YAML
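
The 500-file cutoff can be sketched roughly as follows. This is only an illustration, not the notebook's actual code: the `(path, label)` pair layout, the `filter_rare_labels` helper, and the toy data (including the Lua and Zig labels) are assumptions made for the example.

```python
from collections import Counter

MIN_FILES = 500  # cutoff from the post: keep formats with more than 500 files


def filter_rare_labels(samples, min_files=MIN_FILES):
    """Keep only samples whose label occurs more than `min_files` times.

    `samples` is a list of (path, label) pairs. This layout is an assumption
    made for the sketch, not the dataset's real schema.
    """
    counts = Counter(label for _, label in samples)
    kept_labels = {label for label, n in counts.items() if n > min_files}
    return [(path, label) for path, label in samples if label in kept_labels]


# Hypothetical toy data: 600 Go, 501 Rust, 500 Lua, and 3 Zig files.
toy = (
    [(f"go/{i}.txt", "Go") for i in range(600)]
    + [(f"rust/{i}.txt", "Rust") for i in range(501)]
    + [(f"lua/{i}.txt", "Lua") for i in range(500)]
    + [(f"zig/{i}.txt", "Zig") for i in range(3)]
)
filtered = filter_rare_labels(toy)
kept = sorted({label for _, label in filtered})
print(kept, len(filtered))
```

In this toy run, a label with exactly 500 files is dropped because the cutoff is strict ("more than 500").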

    Since JSON, Text, and YAML are included, it seems more accurate to call these data formats rather than languages.

    One worry is whether the model can reliably pick out Text, which can be truly arbitrary content.

    I was not aiming for high accuracy, so I simply vectorized the files with CountVectorizer and trained on all of them. The results were surprisingly good: even with very little training, accuracy easily exceeds 80%.

    Reading all 80,000 files (about 1 GB) produces vectors with roughly 4 million dimensions.
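
A minimal sketch of the vectorize-then-train idea, assuming scikit-learn is installed. The six-snippet toy corpus, the pipeline step names, and the choice of `LinearSVC` with default settings are assumptions for illustration; the linked notebook may be set up differently.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; the real dataset has more than 80,000 files.
train_texts = [
    'package main\nimport "fmt"\nfunc main() { fmt.Println("hi") }',
    "func add(a int, b int) int { return a + b }",
    '#include <stdio.h>\nint main(void) { printf("hi"); return 0; }',
    "#include <stdlib.h>\nvoid *p = malloc(sizeof(int));",
    'fn main() { println!("hi"); }',
    "pub fn add(a: i32, b: i32) -> i32 { a + b }\nlet mut x = 1;",
]
train_labels = ["Go", "Go", "C", "C", "Rust", "Rust"]

model = Pipeline([
    # Default settings: lowercased word tokens of two or more characters.
    ("vectorize", CountVectorizer()),
    ("classify", LinearSVC()),
])
model.fit(train_texts, train_labels)

# The vocabulary size is the dimensionality of each count vector.
vocabulary = model.named_steps["vectorize"].vocabulary_
print("vocabulary size:", len(vocabulary))

# Predict the format of an unseen snippet from its contents alone.
snippet = 'package main\nfunc greet() { fmt.Println("hello") }'
predicted = model.predict([snippet])[0]
print("predicted format:", predicted)
```

On the full dataset the vocabulary grows with every new identifier in every file, which is how the vectors reach millions of dimensions. The count matrix stays sparse, so linear models remain practical.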

    Conclusion

    LinearSVC:
      elapsed time: 0:06:04.338887
      accuracy: 94.61%
      roc_auc: 0.9877388985308935
    LogisticRegression:
      elapsed time: 1:06:56.671633
      accuracy: 97.02%
      roc_auc: 0.9882273614011244
    RidgeClassifier:
      elapsed time: 0:04:10.478925
      accuracy: 55.39%
      roc_auc: None
    random_forest:
      elapsed time: 0:05:06.379483
      accuracy: 93.68%
      roc_auc: 0.9825555088479243
    k_neighbors:
      elapsed time: 0:00:33.741252
      accuracy: 87.67%
      roc_auc: 0.9706357200070198
    SGD:
      elapsed time: 0:01:24.359626
      accuracy: 88.76%
      roc_auc: None

    Overall the scores are quite high. Picking a few samples by hand, feeding the code to the model, and checking the predicted class also gives good results.

    CountVectorizer's tokenizer drops special symbols and whitespace (and stop words, when a stop-word list is configured), yet the remaining words alone apparently carry enough information to tell the languages apart. My guess is that the reserved words of each language are what make them distinguishable.
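
To see what survives tokenization, the sketch below applies the same regular expression that CountVectorizer uses as its default `token_pattern`, after lowercasing as CountVectorizer does by default. The `tokenize` helper and the two snippets are made up for illustration; only the pattern itself comes from scikit-learn.

```python
import re

# Same regular expression as CountVectorizer's default `token_pattern`.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(source):
    """Lowercase the text and keep runs of two or more word characters."""
    return TOKEN_PATTERN.findall(source.lower())


go_tokens = tokenize('func main() { defer fmt.Println("x") }')
c_tokens = tokenize("#include <stdio.h>\nint main(void) { return 0; }")
print(go_tokens)
print(c_tokens)
```

Braces, quotes, `#`, and single-character tokens disappear, but keywords such as `func`, `defer`, `include`, and `void` remain, which fits the guess that reserved words carry most of the signal.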

    The full results are available in the Kaggle notebook linked at the top of this post.

    You might wonder where this is actually useful. VSCode uses this kind of detection, and Slack's snippets appear to use it as well.
