海野さん卒論の履歴(No.7)

履歴一覧
差分を表示
現在との差分を表示
ソースを表示
海野さん卒論へ行く。
- 1 (2025-07-01 (火) 12:38:40)
- 2 (2025-07-01 (火) 12:58:10)
- 3 (2025-07-08 (火) 13:35:49)
- 4 (2025-07-15 (火) 13:46:56)
- 5 (2025-07-17 (木) 18:42:10)
- 6 (2025-07-22 (火) 14:13:24)
- 7 (2025-07-29 (火) 13:50:27)
- 8 (2025-12-12 (金) 18:06:29)
- 9 (2025-12-12 (金) 23:27:13)

背景

富山県のホタルイカの身投げは、産卵行動や環境要因と密接に関連する現象であり、その量の分析と予測は、生態学的理解の深化や気候変動の指標、さらに地域観光や漁業資源管理における重要な意義を持っている。

研究の流れ

1 過去のホタルイカ身投げデータと、その日の天気、気温、月齢、風などのデータを収集

2 身投げが起きやすい条件の分析、学習

3 ホタルイカの身投げ量の予測モデルの作成

4 身投げ量の予報を提供するWebサイトの作成

データの収集・統合

過去の身投げ量データの取得

1. ホタルイカ掲示板というサイト（https://rara.jp/hotaruika-toyama/）の口コミからスクレイピングで口コミ内容を取得

2. その口コミ内容から、ホタルイカの身投げ量を5段階に分類し、0〜4の整数で数値化

3. その値の日毎の平均をその日の身投げ量とし、身投げ量、日付をjsonファイルに保存

2015年〜2025年まで10年分、それぞれ2〜5月の4ヶ月分の約1300日分のデータを収集した。過去の身投げ量以外の以下に記載したデータを収集するときは、この日付の日のデータを収集する。

過去の天気、風、気温などのデータの取得

気象庁の公式サイト（https://www.data.jma.go.jp/risk/obsdl/#）に、過去の天気、風、気温、降水量などのデータがあり、指定した場所、期間を指定すると、過去の天気などを CSV 形式でダウンロードできる.それをjsonデータに変換する。

潮位データの取得

API経由で潮位、潮の種類などが取得でき、その取得したデータをjsonファイルに保存する。（https://tide736.net/#google_vignette）

月齢データの取得

月齢は、西暦と日付から計算できる.日付ごとの月齢をjsonファイルにまとめる。

データの統合

これらのjsonファイルを日付をキーにして一つのjsonファイルにまとめる。

分析・学習

分析方法

目的変数（予測したい値）：身投げ量（avg_amount）

説明変数（目的変数に影響を与える値）：下記特徴量

データ分割：1220 日分のデータのうち85％を学習用、残りの15％をテスト用として使用

特徴量（どのような要因で分析するか）

時間特徴量

年、月、日、曜日、年の第何週

月齢（月齢 sin, cos 変換含む）

月齢は 0～29.53 日で周期的に変化するため、sin と cos で周期性を表現。

月齢 sin は「上弦の月」で最大（1）、「下弦の月」で最小（-1）。

月齢 cos は「新月」で最大（1）、満月で最小（-1）。

気温（平均 / 最大 / 最小 / 標準偏差）

時間帯別（10-13 時、14-17 時、18-21 時、22-0 時、1-4 時）の平均気温も算出。

降水量（合計 / 有無）

時間帯別の降水量合計も算出。降水の有無をバイナリ変数で表現。

風速・風向（平均 / 最大 / 最小 / 標準偏差）

潮汐データ

潮型（若潮、長潮、小潮、中潮、大潮）を数値変換。

夜間（21 時～5 時）の満潮・干潮高（平均 / 最大 / 最小）。

ラグ特徴量

気象・潮汐変数の 1 日前、2 日前の値。

目的変数（avg_amount）の 1 日前、2 日前、3 日間平均。

コード

   import json
   import pandas as pd
   import numpy as np
   from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
   from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
   from sklearn.preprocessing import LabelEncoder, MinMaxScaler
   import matplotlib.pyplot as plt
   import seaborn as sns
   import lightgbm as lgb
   import holidays

   # イベント日リスト（YYYY-MM-DD形式で実際のイベント日を記入）
   EVENT_DATES = [
       "2023-03-25", "2023-04-10", "2023-04-20"
   ]

   jp_holidays = holidays.country_holidays("JP")

   DIRECTION_MAP = {
       '北': 0, '北北東': 1, '北東': 2, '東北東': 3, '東': 4, '東南東': 5, '南東': 6, '南南東': 7,
       '南': 8, '南南西': 9, '南西': 10, '西南西': 11, '西': 12, '西北西': 13, '北西': 14, '北北西': 15
   }
   TIDE_TYPE_MAP = {'若潮': 0, '長潮': 1, '小潮': 2, '中潮': 3, '大潮': 4}

   def direction_to_radian(direction):
       idx = DIRECTION_MAP.get(direction)
       if idx is None:
           return np.nan
       return 2 * np.pi * idx / 16

   def mean_wind_direction(directions):
       radians = np.array([direction_to_radian(d) for d in directions if d in DIRECTION_MAP])
       if len(radians) == 0:
           return np.nan
       sin_sum = np.sin(radians).mean()
       cos_sum = np.cos(radians).mean()
       mean_rad = np.arctan2(sin_sum, cos_sum)
       if mean_rad < 0:
           mean_rad += 2 * np.pi
       return int(np.round(mean_rad * 16 / (2 * np.pi))) % 16

   def safe_float(x):
       try:
           return float(x)
       except Exception:
           return np.nan

   def is_wind_night_time(time_str):
       try:
           hour = int(time_str.split(' ')[1].split(':')[0])
           return (20 <= hour <= 23) or (0 <= hour <= 4)
       except Exception:
           return False

   def is_temp_precip_night_time(time_str):
       try:
           hour = int(time_str.split(' ')[1].split(':')[0])
           return (10 <= hour <= 23) or (0 <= hour <= 4)
       except Exception:
           return False

   def is_night_tide_time(time_str):
       try:
           hour = int(time_str.split(':')[0])
           return (21 <= hour <= 23) or (0 <= hour <= 5)
       except Exception:
           return False

   def get_time_period(hour):
       if 10 <= hour <= 13:
           return '10_13'
       elif 14 <= hour <= 17:
           return '14_17'
       elif 18 <= hour <= 21:
           return '18_21'
       elif hour in [22, 23, 0]:
           return '22_0'
       elif 1 <= hour <= 4:
           return '1_4'
       else:
           return None

   all_rows = []
   # JSONファイルのパスを適切に設定してください
   with open('avg_amount_moon_temperature_wind_precipitation_tide_2.json', encoding='utf-8') as f:
       data = json.load(f)
   for entry in data:
       date = entry['date']
       avg_amount = safe_float(entry['avg_amount'])
       moon_age = safe_float(entry['moon_age'])
       try:
           date_dt = pd.to_datetime(date, format='%Y/%m/%d')
           year = date_dt.year
           month = date_dt.month
           day = date_dt.day
           weekday = date_dt.weekday()
           date_str_ymd = date_dt.strftime('%Y-%m-%d')
           week_of_year = date_dt.isocalendar().week
           week_of_month = (date_dt.day - 1) // 7 + 1
           day_of_year = date_dt.dayofyear
       except Exception:
           year = np.nan
           month = np.nan
           day = np.nan
           weekday = np.nan
           date_str_ymd = None
           week_of_year = np.nan
           week_of_month = np.nan
           day_of_year = np.nan

       is_weekend = 1 if weekday in [5, 6] else 0
       is_holiday = 1 if date_str_ymd in jp_holidays else 0
       is_event = 1 if date_str_ymd in EVENT_DATES else 0

       temp_periods = {'10_13': [], '14_17': [], '18_21': [], '22_0': [], '1_4': []}
       prec_periods = {'10_13': [], '14_17': [], '18_21': [], '22_0': [], '1_4': []}

       temps = []
       for d in entry.get('temperature_data', []):
           temp = safe_float(d.get('average_temperature(℃)'))
           time_str = d.get('time', '')
           if temp is not None and temp != '' and time_str:
               try:
                   hour = int(time_str.split(' ')[1].split(':')[0])
               except Exception:
                   continue
               if is_temp_precip_night_time(time_str):
                   temps.append(temp)
               period = get_time_period(hour)
               if period:
                   temp_periods[period].append(temp)
       temperature_mean = np.nanmean(temps) if temps else np.nan
       temperature_max = np.nanmax(temps) if temps else np.nan
       temperature_min = np.nanmin(temps) if temps else np.nan
       temperature_std = np.nanstd(temps) if temps else np.nan

       precs = []
       for d in entry.get('precipitation_data', []):
           prec = safe_float(d.get('precipitation_totals(mm)'))
           time_str = d.get('time', '')
           if prec is not None and prec != '' and time_str:
               try:
                   hour = int(time_str.split(' ')[1].split(':')[0])
               except Exception:
                   continue
               if is_temp_precip_night_time(time_str):
                   precs.append(prec)
               period = get_time_period(hour)
               if period:
                   prec_periods[period].append(prec)
       precipitation_sum = np.nansum(precs) if precs else 0
       precipitation_binary = 1 if precipitation_sum > 0 else 0

       temp_period_features = {}
       prec_period_features = {}
       for p in temp_periods:
           temp_period_features[f'temperature_mean_{p}'] = np.nanmean(temp_periods[p]) if temp_periods[p] else np.nan
       for p in prec_periods:
           prec_period_features[f'precipitation_sum_{p}'] = np.nansum(prec_periods[p]) if prec_periods[p] else 0

       wind_speeds = [safe_float(d.get('average_wind_speed_m_s')) for d in entry.get('wind_data', [])
                   if d.get('average_wind_speed_m_s') not in [None, '']
                   and is_wind_night_time(d.get('time',''))]
       wind_speed_mean = np.nanmean(wind_speeds) if wind_speeds else np.nan
       wind_speed_max = np.nanmax(wind_speeds) if wind_speeds else np.nan
       wind_speed_min = np.nanmin(wind_speeds) if wind_speeds else np.nan
       wind_speed_std = np.nanstd(wind_speeds) if wind_speeds else np.nan

       wind_dirs = [d.get('wind_direction') for d in entry.get('wind_data', [])
                   if d.get('wind_direction', '') != '' and is_wind_night_time(d.get('time',''))]
       wind_direction_mean = mean_wind_direction(wind_dirs) if wind_dirs else np.nan

       tide_type_str = entry.get('tide_data_night', {}).get('tide_type', None)
       tide_type_num = TIDE_TYPE_MAP.get(tide_type_str, np.nan)
       high_tides = entry.get('tide_data_night', {}).get('high_tide', [])
       night_high_tides = [
           safe_float(ht.get('tide_height_cm')) for ht in high_tides
           if ht.get('time') and is_night_tide_time(ht['time'])
       ]
       high_tide_height_mean = np.nanmean(night_high_tides) if night_high_tides else np.nan
       high_tide_height_max = np.nanmax(night_high_tides) if night_high_tides else np.nan
       high_tide_height_min = np.nanmin(night_high_tides) if night_high_tides else np.nan
       low_tides = entry.get('tide_data_night', {}).get('low_tide', [])
       night_low_tides = [
           safe_float(lt.get('tide_height_cm')) for lt in low_tides
           if lt.get('time') and is_night_tide_time(lt['time'])
       ]
       low_tide_height_mean = np.nanmean(night_low_tides) if night_low_tides else np.nan
       low_tide_height_max = np.nanmax(night_low_tides) if night_low_tides else np.nan
       low_tide_height_min = np.nanmin(night_low_tides) if night_low_tides else np.nan

       all_rows.append({
           'date': date,
           'year': year,
           'month': month,
           'day': day,
           'avg_amount': avg_amount,
           'moon_age': moon_age,
           'weekday': weekday,
           'week_of_year': week_of_year,
           'week_of_month': week_of_month,
           'day_of_year': day_of_year,
           'is_weekend': is_weekend,
           'is_holiday': is_holiday,
           'is_event': is_event,
           'temperature_mean': temperature_mean,
           'temperature_max': temperature_max,
           'temperature_min': temperature_min,
           'temperature_std': temperature_std,
           'precipitation_sum': precipitation_sum,
           'precipitation_binary': precipitation_binary,
           'wind_speed_mean': wind_speed_mean,
           'wind_speed_max': wind_speed_max,
           'wind_speed_min': wind_speed_min,
           'wind_speed_std': wind_speed_std,
           'wind_direction_mean': wind_direction_mean,
           'tide_type_num': tide_type_num,
           'high_tide_height_mean': high_tide_height_mean,
           'high_tide_height_max': high_tide_height_max,
           'high_tide_height_min': high_tide_height_min,
           'low_tide_height_mean': low_tide_height_mean,
           'low_tide_height_max': low_tide_height_max,
           'low_tide_height_min': low_tide_height_min,
           **temp_period_features,
           **prec_period_features
       })

   df = pd.DataFrame(all_rows)
   df['date'] = pd.to_datetime(df['date'], format='%Y/%m/%d')
   df = df[df['date'].dt.month >= 2]

   df = df.sort_values('date').reset_index(drop=True)

   df['moon_age_sin'] = np.sin(2 * np.pi * df['moon_age'] / 29.53)
   df['moon_age_cos'] = np.cos(2 * np.pi * df['moon_age'] / 29.53)

   df['moon_age'] = df['moon_age'].ffill()
   df['year'] = df['year'].fillna(df['year'].mode()[0])
   df['month'] = df['month'].fillna(df['month'].mode()[0])
   df['day'] = df['day'].fillna(df['day'].mode()[0])
   df['week_of_year'] = df['week_of_year'].fillna(df['week_of_year'].mode()[0])
   df['week_of_month'] = df['week_of_month'].fillna(df['week_of_month'].mode()[0])
   df['day_of_year'] = df['day_of_year'].fillna(df['day_of_year'].mode()[0])

   for c in ['temperature_mean','temperature_max','temperature_min','temperature_std']:
       df[c] = df[c].ffill()
   df['precipitation_sum'] = df['precipitation_sum'].fillna(0)
   df['precipitation_binary'] = df['precipitation_binary'].fillna(0)
   for c in ['wind_speed_mean','wind_speed_max','wind_speed_min','wind_speed_std']:
       df[c] = df[c].ffill()
   df['wind_direction_mean'] = df['wind_direction_mean'].ffill()
   df['tide_type_num'] = df['tide_type_num'].fillna(df['tide_type_num'].mode()[0])
   for c in [
       'high_tide_height_mean','high_tide_height_max','high_tide_height_min',
       'low_tide_height_mean','low_tide_height_max','low_tide_height_min'
   ]:
       df[c] = df[c].fillna(0)
   for c in ['weekday','is_weekend','is_holiday','is_event']:
       df[c] = df[c].fillna(0)
   for p in ['10_13', '14_17', '18_21', '22_0', '1_4']:
       df[f'temperature_mean_{p}'] = df[f'temperature_mean_{p}'].ffill()
       df[f'precipitation_sum_{p}'] = df[f'precipitation_sum_{p}'].fillna(0)

   le = LabelEncoder()
   df['wind_direction_encoded'] = le.fit_transform(df['wind_direction_mean'])

   for win in [3]:
       df[f'temperature_mean_{win}d_mean'] = df['temperature_mean'].rolling(window=win, min_periods=1).mean().shift(1)
       df[f'precipitation_sum_{win}d_sum'] = df['precipitation_sum'].rolling(window=win, min_periods=1).sum().shift(1)

   df['avg_amount_lag1'] = df['avg_amount'].shift(1)

   ### 変更点 ### 潮関連のデータをラグ特徴量のリストから除外
   lag_cols = [
       'moon_age', 'moon_age_sin', 'moon_age_cos', 'temperature_mean', 'temperature_max', 'temperature_min',
       'temperature_std', 'precipitation_sum', 'precipitation_binary', 'wind_speed_mean', 'wind_speed_max', 'wind_speed_min',
       'wind_speed_std', 'wind_direction_encoded',
       'day_of_year',
   ]
   for col in lag_cols:
       df[f'{col}_lag1'] = df[col].shift(1)
       df[f'{col}_lag2'] = df[col].shift(2)

   df = df.dropna().reset_index(drop=True)

   features = [
       'year', 'month',
       'week_of_month',
       'day_of_year',
       'moon_age', 'moon_age_sin', 'moon_age_cos',
       'weekday', 'is_weekend', 'is_holiday', 'is_event',
       'temperature_mean', 'temperature_max', 'temperature_min', 'temperature_std',
       'precipitation_sum', 'precipitation_binary',
       'wind_speed_mean', 'wind_speed_max', 'wind_speed_min', 'wind_speed_std',
       'wind_direction_encoded',
       'tide_type_num',
       'high_tide_height_mean','high_tide_height_max','high_tide_height_min',
       'low_tide_height_mean','low_tide_height_max','low_tide_height_min',
       'avg_amount_lag1',
       'temperature_mean_3d_mean',
       'precipitation_sum_3d_sum'
   ]
   features += [f'temperature_mean_{p}' for p in ['10_13', '14_17', '18_21', '22_0', '1_4']]
   features += [f'precipitation_sum_{p}' for p in ['10_13', '14_17', '18_21', '22_0', '1_4']]
   ### 変更点 ### lag_colsに対応して特徴量リストを更新
   features += [f"{col}_lag1" for col in lag_cols]
   features += [f"{col}_lag2" for col in lag_cols]

   X = df[features]
   y = df['avg_amount'].values.reshape(-1, 1)

   scaler_X = MinMaxScaler()
   scaler_y = MinMaxScaler()
   X_scaled = scaler_X.fit_transform(X)
   y_scaled = scaler_y.fit_transform(y)

   tscv = TimeSeriesSplit(n_splits=5)
   param_grid = {
       'num_leaves': [15, 31, 63],
       'learning_rate': [0.01, 0.05, 0.1],
       'n_estimators': [100, 300, 500],
       'random_state': [42]
   }
   lgb_model = lgb.LGBMRegressor()
   gs = GridSearchCV(lgb_model, param_grid, scoring='r2', cv=tscv, n_jobs=-1, verbose=1)
   gs.fit(X_scaled, y_scaled.ravel())
   best_model = gs.best_estimator_
   best_params = gs.best_params_
   print("Best Params:", best_params)

   feature_importance = pd.DataFrame({
       'feature': features,
       'importance': best_model.feature_importances_
   }).sort_values('importance', ascending=False)

   print(f"\n重要度0の特徴量: {feature_importance[feature_importance['importance'] == 0]['feature'].tolist()}\n")

   ### 変更点 ### 重要度が10以上の特徴量を選択
   selected_features = feature_importance[feature_importance['importance'] >= 10]['feature'].tolist()
   print(f"選択された特徴量の数: {len(selected_features)}")
   print(f"選択された特徴量: {selected_features}\n")

   # 選択された特徴量に目的変数そのものが含まれていないか確認
   if 'avg_amount' in selected_features:
       print("警告：選択された特徴量に 'avg_amount' が含まれています。除外します。")
       selected_features.remove('avg_amount')

   X_sel = df[selected_features]
   scaler_X_sel = MinMaxScaler()
   X_sel_scaled = scaler_X_sel.fit_transform(X_sel)

   test_size = int(len(df) * 0.15)
   X_train, X_test = X_sel_scaled[:-test_size], X_sel_scaled[-test_size:]
   y_train, y_test = y_scaled[:-test_size], y_scaled[-test_size:]

   best_model.fit(X_train, y_train.ravel())
   y_pred = best_model.predict(X_test)

   mse = mean_squared_error(y_test, y_pred)
   mae = mean_absolute_error(y_test, y_pred)
   r2 = r2_score(y_test, y_pred)
   ### 変更点 ### 評価指標のタイトルを更新
   print("\n=== モデル評価（正規化後, 重要度10以上の特徴量のみ） ===")
   print("平均二乗誤差 (MSE):", mse)
   print("平均絶対誤差 (MAE):", mae)
   print("決定係数 (R²):", r2)

   feature_importance_sel = pd.DataFrame({
       'feature': selected_features,
       'importance': best_model.feature_importances_
   }).sort_values('importance', ascending=False)
   ### 変更点 ### 特徴量重要度のタイトルを更新
   print("\n特徴量の重要度（重要度10以上の特徴量のみ）:\n", feature_importance_sel)

   plt.figure(figsize=(10, 12))
   sns.barplot(x='importance', y='feature', data=feature_importance_sel)
   ### 変更点 ### グラフのタイトルを更新
   plt.title('Feature Importance (LightGBM, Importance >= 10)')
   plt.tight_layout()
   plt.show()

   plt.figure(figsize=(8, 6))
   plt.plot(y_test, label='True (normalized)')
   plt.plot(y_pred, label='Pred (normalized)')
   plt.legend()
   ### 変更点 ### グラフのタイトルを更新
   plt.title('Prediction vs True (Normalized, Importance >= 10)')
   plt.show()

   y_test_inv = scaler_y.inverse_transform(y_test.reshape(-1, 1))
   y_pred_inv = scaler_y.inverse_transform(y_pred.reshape(-1, 1))
   plt.figure(figsize=(8, 6))
   plt.plot(y_test_inv, label='True (original scale)')
   plt.plot(y_pred_inv, label='Pred (original scale)')
   plt.legend()
   ### 変更点 ### グラフのタイトルを更新
   plt.title('Prediction vs True (Original Scale, Importance >= 10)')
   plt.show()

   from joblib import dump

   # すでに学習済みのモデル・スケーラー・特徴量リストがある前提
   dump(best_model, 'best_model.joblib')
   dump(scaler_X_sel, 'scaler_X_sel.joblib')
   dump(scaler_y, 'scaler_y.joblib')

   import pickle
   with open('selected_features.pkl', 'wb') as f:
       pickle.dump(selected_features, f)

分析結果

・平均二乗誤差 (MSE): 0.0179

・平均絶対誤差 (MAE): 0.0990

・決定係数 (R²): 0.4191

上位特徴量

Webサイトの作成

Webサイトのリポジトリ：https://github.com/yuchi1128/hotaruika-bakuwaki-forecast

使用方法（Dockerがインストールされている前提です）

ターミナルで

  # コードをローカルに持ってくる
  git clone https://github.com/yuchi1128/hotaruika-bakuwaki-forecast.git
  # Dockerコンテナの起動（自動でフロントエンド、バックエンド、データベースのサ ーバーが起動されます）
  docker-compose up -d --build

機能

・ホタルイカの身投げ量を 1 週間後まで予測して表示

・その日の天気や月齢、潮の満ち引きなどの情報を表示

・ホタルイカ掬いでの注意点を掲載

・口コミ機能（good・bad、返信）

使用技術

・フロントエンド：TypeScript（Next.js）

・バックエンド：Go

・データベース：PostgreSQL

・インフラ：Docker

・ホスティング：GCP（Cloud Run）

構成

予測APIの作成

予測APIのリポジトリ：https://github.com/yuchi1128/hotaruika-prediction-api

機能

・日付を指定しないで GET リクエストを送るとその日から 1 週間のホタルイカの身投げ量を予測して json 形式で返す

・日付を指定して GET リクエストを送るとその日のみのホタルイカの身投げ量を予測してjson で返す

使用技術

・Python（FastAPI）

・Docker

・GCP（Cloud Run）

APIの構造

海野さん卒論 の履歴(No.7)

目次

背景

研究の流れ

データの収集・統合

過去の身投げ量データの取得

過去の天気、風、気温などのデータの取得

潮位データの取得

月齢データの取得

データの統合

分析・学習

分析方法

特徴量（どのような要因で分析するか）

時間特徴量

月齢（月齢 sin, cos 変換含む）

気温（平均 / 最大 / 最小 / 標準偏差）

降水量（合計 / 有無）

風速・風向（平均 / 最大 / 最小 / 標準偏差）

潮汐データ

ラグ特徴量

コード

分析結果

上位特徴量

Webサイトの作成

使用方法（Dockerがインストールされている前提です）

機能

使用技術

構成

予測APIの作成

機能

使用技術

APIの構造